write-paper-ready

Write paper-ready gene catalog and enrichment summary

Metadata

Status: done
Assigned: agent-64
Agent identity: ead7f53029b7d01980e12f8beb6ad13f6907750479eb2951dd75eb63951922b8
Created: 2026-04-01T13:48:52.771067669+00:00
Started: 2026-04-01T13:49:28.673658538+00:00
Completed: 2026-04-01T13:52:27.653384051+00:00
Tags: paper, report, eval-scheduled
Eval score: 0.85
└ blocking impact: 0.90
└ completeness: 0.88
└ coordination overhead: 0.87
└ correctness: 0.82
└ downstream usability: 0.87
└ efficiency: 0.85
└ intent fidelity: 0.86
└ style adherence: 0.90

Description

Goal

Write a comprehensive, paper-ready markdown document that catalogs the protein-coding genes in non-acrocentric PHRs, reports GO enrichment results, and provides biological interpretation with disease associations and community mappings.

Context

Key data files to read:

  • phrs.no_acro.coding_gene_names.txt — the 23 protein-coding gene names
  • phr_coding_only_GO_BP_enrichment.csv — BP enrichment results
  • phr_coding_only_GO_MF_enrichment.csv — MF enrichment results
  • phr_no_acro_GO_BP_enrichment.csv — full gene set BP results (for comparison)
  • phr_no_acro_GO_MF_enrichment.csv — full gene set MF results (for comparison)
  • enriched_genes_detailed_map.csv — gene-to-chromosome-arm-community mapping
  • phrs.no_acro.genes.gff3 — all genes in PHR intervals (for counts/biotype breakdown)
  • chm13.phrs.no_acro.bed — the 29 PHR intervals
  • subtelomeric_analysis_report.md — for Andrea's section 9 community context and population enrichment data

What we know:

  • 220 genes total in non-acrocentric PHRs (29 intervals, 18 arms)
  • Biotype breakdown: ~204 pseudogenes, 108 lncRNAs, 51 miRNAs, 27 protein-coding, 21 transcribed pseudogenes
  • 23 protein-coding genes after dedup
  • Full gene set GO enrichment was dominated by lncRNAs/pseudogenes inheriting annotations
  • Protein-coding-only enrichment found 7 BP + 9 MF terms (p = 0.03-0.04), mostly olfactory + GPCR + cytoskeleton
  • Key protein-coding genes: DUX4, SHOX, IL9R, TUBB8/TUBB8B, OR4F family, WASHC1, PPP2R3B, GTPBP6, PLCXD1, SPRY3, VAMP7, ZNF595, FRG2/FRG2B, SCGB1C1
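
The biotype breakdown above can be recomputed from phrs.no_acro.genes.gff3 rather than copied by hand. The following is a minimal sketch, not the pipeline actually used: the attribute keys in `BIOTYPE_KEYS` and the feature types in `GENE_FEATURES` are assumptions, since the task does not say which annotation source produced the GFF3.

```python
from collections import Counter

# Assumption: the biotype is stored under one of these attribute keys.
# Which key applies depends on the annotation source (not stated in the task).
BIOTYPE_KEYS = ("gene_biotype", "biotype", "gene_type")
# Assumption: gene-level records use one of these feature types (column 3).
GENE_FEATURES = {"gene", "pseudogene"}


def parse_attributes(column9):
    """Turn 'ID=x;Name=y' into {'ID': 'x', 'Name': 'y'}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def count_biotypes(gff3_lines):
    """Count gene-level features per biotype; sub-features are skipped."""
    counts = Counter()
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # header, directive, or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] not in GENE_FEATURES:
            continue
        attrs = parse_attributes(fields[8])
        biotype = next((attrs[k] for k in BIOTYPE_KEYS if k in attrs), "unknown")
        counts[biotype] += 1
    return counts
```

Calling `count_biotypes(open("phrs.no_acro.genes.gff3"))` would return a `Counter` keyed by biotype. Comparing `sum(counts.values())` against the stated gene total is a quick consistency check before the numbers go into a table.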

The story (from our analysis):

  1. Angela's 1Mb GSEA found dramatic enrichments (146-fold OR, z=18.0) but the wide window captured neighborhoods, not PHRs
  2. PHR-only analysis (245 genes) found snRNP/splicing, OR, miRNA signals — but these were driven by ncRNA/pseudogene annotation artifacts
  3. Excluding acrocentrics barely changed results — signals are genome-wide
  4. Protein-coding-only enrichment (23 genes) reveals modest but real olfactory and GPCR enrichment
  5. The gene list itself is more informative than the statistics: DUX4, SHOX, IL9R are disease-associated subtelomeric landmarks

Document structure

Write phr_gene_enrichment_report.md with the following sections:

1. Summary / Abstract (2-3 sentences)

What we did, what we found, key takeaway.

2. PHR Gene Content Overview

  • Total gene count by biotype (table)
  • Comparison: 37 full PHR intervals vs 29 non-acrocentric
  • Median PHR size (~105kb) vs Angela's 1Mb window
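
The median PHR size quoted above can be derived directly from chm13.phrs.no_acro.bed. This is a small sketch under the standard BED convention (0-based start, end-exclusive, so length is `end - start`); the function names are illustrative and not taken from the project.

```python
import statistics


def bed_interval_lengths(lines):
    """Return the length of every interval in an iterable of BED lines."""
    lengths = []
    for line in lines:
        line = line.strip()
        # Skip blanks, comments, and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        start, end = int(fields[1]), int(fields[2])
        lengths.append(end - start)  # BED is 0-based, half-open
    return lengths


def median_bed_length(lines):
    """Median interval length in base pairs."""
    return statistics.median(bed_interval_lengths(lines))
```

With the real file this would be called as `median_bed_length(open("chm13.phrs.no_acro.bed"))`, and `len(bed_interval_lengths(...))` doubles as a check on the expected interval count.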

3. GO Enrichment Results

  • Full gene set (all 220 genes): table of top terms, note that signal is driven by ncRNA/pseudogenes
  • Protein-coding only (23 genes): table of significant terms
  • Acrocentric exclusion comparison: one paragraph noting signals are genome-wide
  • Interpretation: the GO enrichment is modest; the gene list tells the real story
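
To keep the enrichment tables tied to the CSV files, the rows can be filtered and rendered programmatically. The sketch below assumes column names `term` and `p_value` and a cutoff of 0.05; all three are hypothetical, because the task does not describe the CSV schema or the significance threshold, and should be adjusted to the actual headers.

```python
import csv


def significant_terms(lines, term_col="term", p_col="p_value", alpha=0.05):
    """Return (term, p) pairs with p <= alpha, sorted by ascending p-value.

    `lines` is any iterable of CSV lines, e.g. an open file handle.
    Column names and alpha are assumptions, not the project's schema.
    """
    rows = []
    for row in csv.DictReader(lines):
        p = float(row[p_col])
        if p <= alpha:
            rows.append((row[term_col], p))
    return sorted(rows, key=lambda pair: pair[1])


def to_markdown(rows):
    """Render (term, p) pairs as a two-column markdown table."""
    out = ["| GO term | p-value |", "| --- | --- |"]
    out += [f"| {term} | {p:.3g} |" for term, p in rows]
    return "\n".join(out)
```

For example, `to_markdown(significant_terms(open("phr_coding_only_GO_BP_enrichment.csv")))` would produce a table body that can be pasted into the report, provided the column names match.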

4. Protein-Coding Gene Catalog

Master table with columns: Gene | Chromosome | Arm | Community | Function | Disease Associations | Notes

For each gene, provide:

  • Full gene name
  • What it does (2-3 sentences of actual biology)
  • Known disease associations with OMIM numbers if relevant
  • Which Leiden community it belongs to
  • Whether it was newly resolved by T2T / CHM13

Group the table by functional category:

  • Disease-associated (DUX4, SHOX, IL9R)
  • PAR genes (GTPBP6, PPP2R3B, PLCXD1, SPRY3, VAMP7)
  • Olfactory receptors (OR4F family)
  • Cytoskeletal (TUBB8, TUBB8B)
  • Other (WASHC1, ZNF595, FRG2, FRG2B, SCGB1C1, IQSEC3, LOCs)
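
The grouping above can be expressed as a small lookup so that every gene in the catalog lands in exactly one category. This is one possible sketch: the category names and members come from the list above, while the prefix matching for the OR4F family and LOC genes, and the `Uncategorized` fallback, are assumptions added for illustration.

```python
# Explicit members, taken from the functional categories listed in the task.
CATEGORIES = {
    "Disease-associated": {"DUX4", "SHOX", "IL9R"},
    "PAR genes": {"GTPBP6", "PPP2R3B", "PLCXD1", "SPRY3", "VAMP7"},
    "Cytoskeletal": {"TUBB8", "TUBB8B"},
    "Other": {"WASHC1", "ZNF595", "FRG2", "FRG2B", "SCGB1C1", "IQSEC3"},
}
# Assumption: family members and LOC genes are recognized by name prefix.
PREFIXES = {"OR4F": "Olfactory receptors", "LOC": "Other"}


def categorize(gene):
    """Return the functional category for a gene symbol."""
    for category, members in CATEGORIES.items():
        if gene in members:
            return category
    for prefix, category in PREFIXES.items():
        if gene.startswith(prefix):
            return category
    return "Uncategorized"  # flags symbols the task did not anticipate


def group_genes(genes):
    """Group gene symbols by category, preserving input order."""
    groups = {}
    for gene in genes:
        groups.setdefault(categorize(gene), []).append(gene)
    return groups
```

Running `group_genes` over the names in phrs.no_acro.coding_gene_names.txt and checking for an `Uncategorized` key is a cheap way to spot a gene that the outline missed.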

5. Non-coding RNA Landscape

Brief section on the ncRNA content:

  • MIR8078 tandem array (36 copies, C1, D4Z4 context)
  • 8 LOC lncRNAs with snRNP annotations
  • IL9R pseudogene dispersal pattern

6. Comparison to Angela's 1Mb GSEA

What changed, what disappeared, what sharpened. Key point: the 1Mb GSEA captured the subtelomeric neighborhood; PHR-only analysis captures the inter-chromosomally shared content specifically.

7. Comparison to Andrea's Report Section 9

Reconciliation with the 374-gene, 15-community analysis. Which of our 23 protein-coding genes appear in Andrea's community gene lists?

8. Implications for the Paper

3-5 bullet points on what to say in the manuscript.

Style

  • Scientific but accessible
  • Include actual numbers, gene names, p-values
  • Tables should be proper markdown tables
  • Use the data from the files — don't make up numbers
  • When discussing genes, be specific about what they do biologically
  • Be honest about limitations (small query set, modest p-values)

Validation

  • All 23 protein-coding genes appear in the catalog with functions and disease associations
  • GO enrichment tables include actual p-values from the CSV files
  • Community assignments match the detailed mapping data
  • Angela and Andrea comparisons reference actual data from their results
  • The document reads as a coherent narrative, not a data dump
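
The first validation item can be checked mechanically. The sketch below only tests that each gene symbol occurs somewhere in the report text as a whole token (so `FRG2` is not satisfied by `FRG2B`); it does not verify functions, disease associations, p-values, or community assignments, which still need manual review. The function name is illustrative.

```python
import re


def missing_genes(gene_names, report_text):
    """Return gene symbols that never appear as a whole token in the report."""
    missing = []
    for gene in gene_names:
        # Reject matches embedded in a longer alphanumeric symbol.
        pattern = r"(?<![A-Za-z0-9])" + re.escape(gene) + r"(?![A-Za-z0-9])"
        if not re.search(pattern, report_text):
            missing.append(gene)
    return missing
```

A typical call would read the symbols from phrs.no_acro.coding_gene_names.txt (one per line, an assumption about that file's layout) and the text from phr_gene_enrichment_report.md; an empty return value means every gene is at least mentioned.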

Depends on

Required by

Log