step-2-intersect-genes

Step 2: Intersect genes with PHR intervals

Metadata

Statusdone
Assignedagent-15
Agent identity3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created2026-03-31T21:03:36.016867686+00:00
Started2026-03-31T21:08:30.389375549+00:00
Completed2026-03-31T21:11:50.498248176+00:00
Tagsimpl, eval-scheduled
Eval score0.95
└ blocking impact0.95
└ completeness0.98
└ coordination overhead0.93
└ correctness0.95
└ downstream usability0.96
└ efficiency0.90
└ intent fidelity0.88
└ style adherence0.92

Description

Goal

Get the list of genes that fall within actual PHR intervals on CHM13.

Approach

Use bedtools intersect with the CHM13 gene annotation (RefSeq Liftoff or HPRC Liftoff) and the chm13.phrs.bed from Step 1.

zcat chm13v2.0_RefSeq_Liftoff_v5.2.gff3.gz \
  | awk '$3 == "gene"' \
  | bedtools intersect -a - -b chm13.phrs.bed -wa \
  > phrs.genes.gff3

Then extract gene names/IDs:

grep -oP 'Name=\K[^;]+' phrs.genes.gff3 > phrs.gene_names.txt
# Also extract Entrez/NCBI gene IDs if available for clusterProfiler

Context

  • Check the research task output for which annotation file to use
  • If the RefSeq Liftoff is insufficient, the research task should have identified the CHM13 file in Andrea's annotations at /moosefs/guarracino/HPRCv2/PHR_III/hprc_annotations/
  • Compare gene count to Angela's 327 unique genes — ours should be a subset since we're using tighter intervals

Validation

  • phrs.genes.gff3 exists with reasonable gene count
  • phrs.gene_names.txt has one gene per line, no empty lines
  • Log gene count, biotype breakdown if possible (protein-coding vs lncRNA vs pseudogene)
  • Briefly compare count to Angela's 327 unique genes

Depends on

Required by

Log