map-every-protein

Map every protein-coding gene copy to every arm it appears on

Metadata

Status: done
Assigned: agent-67
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-01T14:39:38.778122963+00:00
Started: 2026-04-01T14:40:03.148807675+00:00
Completed: 2026-04-01T14:43:50.110141987+00:00
Tags: analysis, critical, eval-scheduled
Eval score: 0.80
└ blocking impact: 0.80
└ completeness: 0.80
└ coordination overhead: 0.75
└ correctness: 0.85
└ downstream usability: 0.75
└ efficiency: 0.75
└ intent fidelity: 0.83
└ style adherence: 0.80

Description

Goal

Create a COMPLETE, EXHAUSTIVE map of where each of the 23 protein-coding genes appears across ALL 29 non-acrocentric PHR intervals. We need to answer: is each gene on ONE arm, or does it appear on MULTIPLE arms as copies?

Context

The user is confused because previous reports mention genes on 1-2 arms, but these are subtelomeric PHRs — regions that SHARE sequence across chromosomes. If a gene is in shared sequence, it should appear on MULTIPLE arms. The previous analysis may have deduplicated gene names, losing the multi-arm information.

The 23 protein-coding genes are: DUX4, FRG2, FRG2B, GTPBP6, IL9R, IQSEC3, LOC105375112, LOC112268260, LOC124905300, OR4F17, OR4F29, OR4F3, OR4F5, PLCXD1, PPP2R3B, SCGB1C1, SHOX, SPRY3, TUBB8, TUBB8B, VAMP7, WASHC1, ZNF595

Approach

Critical: Do NOT deduplicate

We need EVERY copy of each gene. The GFF3 may have gene names like WASHC1, WASHC1_1, WASHC1_2 etc — or it may have the same gene name appearing at multiple genomic locations. We need ALL of them.
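A plain substring grep can over-match here: FRG2 is a prefix of FRG2B, TUBB8 of TUBB8B, and IL9R of IL9RP1. The sketch below is one way to assign a copy name to its family while keeping every copy. It assumes that copies, where they are suffixed at all, use the `_<number>` form mentioned above; the helper name `gene_family` is hypothetical.

```python
import re

# The 23 protein-coding gene names listed in the Context section.
FAMILIES = {
    "DUX4", "FRG2", "FRG2B", "GTPBP6", "IL9R", "IQSEC3", "LOC105375112",
    "LOC112268260", "LOC124905300", "OR4F17", "OR4F29", "OR4F3", "OR4F5",
    "PLCXD1", "PPP2R3B", "SCGB1C1", "SHOX", "SPRY3", "TUBB8", "TUBB8B",
    "VAMP7", "WASHC1", "ZNF595",
}


def gene_family(name, families=FAMILIES):
    """Return the family for a copy name such as WASHC1_2, or None.

    Assumption: copies are suffixed as _1, _2, ... (not confirmed for this GFF3).
    """
    if name in families:
        return name
    base = re.sub(r"_\d+$", "", name)  # strip one trailing _<digits> suffix
    return base if base in families else None
```

Matching on the whole name (after removing the suffix) rather than on a substring keeps FRG2 and FRG2B apart, and leaves pseudogene names such as IL9RP1 unassigned.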

  1. From phrs.no_acro.genes.gff3 (the full intersection output, before dedup): For EACH of the 23 gene names, grep for all lines matching that gene name (including suffixed copies like _1, _2 etc). Extract: gene name, chromosome, start, end, strand.

  2. Also check the FULL phrs.genes.gff3 (including acrocentrics) to see if any copies are on acrocentric arms too.

  3. Cross-reference with chm13.phrs.no_acro.bed to get the PHR interval and arm (p or q) for each copy.

  4. Build a COMPLETE table with columns:

    • Gene family name (e.g. WASHC1)
    • Copy name (e.g. WASHC1_2)
    • Chromosome
    • Arm (p/q)
    • Start-End
    • PHR interval
    • Sharing pattern (column 4 from BED)

  5. Create a SUMMARY table showing for each gene:

    • How many copies total across PHR intervals
    • Which arms they appear on (list ALL)
    • Which Leiden communities

  6. Do the same for the non-coding genes, covering the key ncRNA families:

    • LOC101928xxx/LOC101929xxx (the 8 snRNP lncRNAs — how many actual copies across how many arms?)
    • MIR8078 (we know 36 copies on chr4q+chr10q — but are there copies on other arms too?)
    • IL9RP1/3/4 (already mapped — 3 arms, 3 communities)

Output

  • all_gene_copies_by_arm.csv — every single gene copy with location
  • gene_copy_summary.csv — gene name | total copies | arms | communities
  • Log BOTH tables in full

Key question to answer

For each gene: is it on 1 arm, 2-3 arms, or widely dispersed across many arms? This tells us whether it's a unique subtelomeric gene or part of the inter-chromosomal shared content.
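The three-way answer could be derived from the per-copy table along these lines. This is a sketch only: the cut-off of four or more arms for "widely dispersed" is an assumption about what "many arms" means, and the input is taken to be hypothetical `(family, arm_label)` pairs with one pair per copy.

```python
from collections import defaultdict


def classify(n_arms):
    """Bucket an arm count; the >= 4 threshold is an assumed reading of 'many'."""
    if n_arms <= 1:
        return "single arm"
    if n_arms <= 3:
        return "2-3 arms"
    return "widely dispersed"


def summarise(copies):
    """Summarise (family, arm_label) pairs, one pair per gene copy."""
    counts = defaultdict(int)
    arms = defaultdict(set)
    for family, arm in copies:
        counts[family] += 1      # every copy counts, even on a repeated arm
        arms[family].add(arm)    # arms are collected as a distinct set
    return {
        family: {
            "total_copies": counts[family],
            "arms": sorted(arms[family]),
            "category": classify(len(arms[family])),
        }
        for family in sorted(counts)
    }
```

Keeping the copy count and the distinct-arm set separate matters for families such as MIR8078, where many copies may sit on only a few arms.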

Validation

  • Every copy of every gene is listed (not just unique names)
  • The total copy count per gene matches what's in the GFF3
  • Arms and communities are correctly assigned
  • We can definitively answer: 'WASHC1 appears on N arms in communities X, Y, Z'
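The copy-count check could look like the sketch below, which compares how often each name occurs in the GFF3 with how often it occurs in the output table. The function name and the idea of passing two plain lists of names are illustrative assumptions.

```python
from collections import Counter


def count_mismatches(gff3_names, table_names):
    """Return {name: (count_in_gff3, count_in_table)} for names that disagree."""
    expected = Counter(gff3_names)
    observed = Counter(table_names)
    return {
        name: (expected[name], observed[name])
        for name in sorted(expected.keys() | observed.keys())
        if expected[name] != observed[name]
    }
```

An empty result means the totals agree for every name. A non-empty result points at names that were lost, for example through deduplication, or double-counted.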

Depends on

Required by

Log