integration-testing-copy

Integration testing: copy-number-aware ORA with PHR data

Metadata

Status: done
Assigned: agent-387
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-01T19:05:14.394619506+00:00
Started: 2026-04-01T19:05:37.614943087+00:00
Completed: 2026-04-01T19:13:23.602509663+00:00
Tags: analysis, integration, eval-scheduled
Eval score: 0.90
└ blocking impact: 0.93
└ completeness: 0.95
└ coordination overhead: 0.92
└ correctness: 0.92
└ downstream usability: 0.85
└ efficiency: 0.88
└ intent fidelity: 0.94
└ style adherence: 0.89

Description

Goal

Test the copy-number-aware enrichment methodology with the actual PHR dataset. This is the practical integration test.

Context

The research phase produced a methodology for copy-number-weighted ORA using R's phyper(). Key files from previously completed tasks are in the repo. The approach is:

  • Count gene COPIES (not unique names) in PHR intervals and genome-wide
  • Use phyper() with copy-weighted parameters
  • Compare results to the deduplicated g:Profiler ORA we already ran
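To make the distinction between copies and unique names concrete, here is a minimal Python sketch. The gene names and the `phr_copies` list are hypothetical placeholders, not values from the PHR dataset; in practice the list would presumably be derived from the GFF3 annotation.

```python
from collections import Counter

# Hypothetical gene copies found in PHR intervals: one entry per copy.
# These names are placeholders, not values from the PHR dataset.
phr_copies = ["GENE_A", "GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C"]

copy_counts = Counter(phr_copies)      # copies per gene name
n_unique = len(copy_counts)            # what a deduplicated ORA counts
n_copies = sum(copy_counts.values())   # what a copy-weighted ORA counts

print(f"unique names: {n_unique}, total copies: {n_copies}")
```

A multi-copy family contributes once to the deduplicated count but once per copy to the copy-weighted count, which is why the two analyses can rank GO terms differently.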

Input files

  • gene_copy_summary.csv — copy counts per gene family in PHRs (23 protein-coding + ncRNA families, 1,189 total copies)
  • all_gene_copies_by_arm.csv — every gene copy with genomic location
  • phrs.no_acro.genes.gff3 — all gene copies in non-acrocentric PHR intervals
  • chm13v2.0_RefSeq_Liftoff_v5.2.gff3.gz — full genome annotation (for building copy-aware background)
  • phr_coding_only_GO_BP_enrichment.csv and phr_coding_only_GO_MF_enrichment.csv — previous deduplicated results for comparison

Approach

  1. Build genome-wide copy count background: For each gene family in PHRs, count the total copies genome-wide (not just in PHRs). These counts are the denominator for the enrichment test.

  2. For each GO term: Count how many gene COPIES in PHRs are annotated to that term versus how many copies are annotated to it genome-wide.

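  3. Run phyper() with copy-weighted parameters. For the enrichment p-value P(X ≥ q), the call is phyper(q - 1, m, n, k, lower.tail = FALSE), because lower.tail = FALSE returns P(X > q):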

    • q = copies of GO-term genes drawn into PHRs
    • m = total copies of GO-term genes genome-wide
    • n = total gene copies genome-wide NOT in this GO term
    • k = total gene copies in PHRs

  4. Compare to deduplicated ORA: Which terms get stronger? Which get weaker? Do new terms appear?

  5. Also try a permutation approach: Shuffle PHR intervals (bedtools shuffle), count gene copies in the random intervals, repeat 1,000 times, and compare the observed count to the resulting null distribution.
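The two tests in steps 3 and 5 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the task's actual implementation: the function names are hypothetical, the counts are made up, and the +1 correction in the permutation p-value is one common convention that the task description does not specify. The shuffling itself (bedtools shuffle) is not reproduced; the sketch starts from a list of counts already obtained from shuffled intervals.

```python
from math import comb

def hypergeom_upper_tail(q, m, n, k):
    """P(X >= q) when k copies are drawn from m term copies and n others.

    Intended to match R's phyper(q - 1, m, n, k, lower.tail = FALSE).
    """
    lo = max(q, 0, k - n)   # smallest feasible value that is >= q
    hi = min(m, k)          # largest feasible value
    tail = sum(comb(m, x) * comb(n, k - x) for x in range(lo, hi + 1))
    return tail / comb(m + n, k)

def empirical_p(observed, null_counts):
    """One-sided permutation p-value with a +1 correction (an assumption)."""
    exceed = sum(1 for c in null_counts if c >= observed)
    return (exceed + 1) / (len(null_counts) + 1)

# Hypothetical counts for illustration only (not from the PHR dataset).
q, m, n, k = 5, 10, 90, 20
p_hyper = hypergeom_upper_tail(q, m, n, k)

# Stand-in for per-shuffle copy counts; a real run would have 1,000 entries.
null_counts = [1, 2, 3, 5, 6]
p_perm = empirical_p(5, null_counts)

print(p_hyper, p_perm)
```

Exact integer arithmetic via math.comb avoids floating-point underflow for very small tails, at the cost of speed; for genome-scale counts a log-space implementation may be preferable.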

Output

  • phr_copy_weighted_enrichment.csv — copy-aware enrichment results
  • Comparison table: deduplicated ORA vs copy-weighted ORA
  • Clear statement on whether copy-awareness changes the picture
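One possible shape for the comparison table is sketched below. It assumes each analysis has been reduced to a mapping from GO term to p-value; the term labels, the p-values, and the `compare_results` helper are all hypothetical, and the column layout of the actual enrichment CSVs is not assumed here.

```python
def compare_results(dedup, weighted):
    """Label each GO term by how its p-value differs between analyses."""
    rows = []
    for term in sorted(set(dedup) | set(weighted)):
        p_old = dedup.get(term)      # deduplicated ORA p-value, if any
        p_new = weighted.get(term)   # copy-weighted ORA p-value, if any
        if p_old is None:
            change = "new"
        elif p_new is None:
            change = "lost"
        elif p_new < p_old:
            change = "stronger"
        elif p_new > p_old:
            change = "weaker"
        else:
            change = "unchanged"
        rows.append((term, p_old, p_new, change))
    return rows

# Hypothetical p-values for illustration only.
dedup = {"term_1": 0.04, "term_2": 0.01, "term_3": 0.2}
weighted = {"term_1": 0.001, "term_2": 0.03, "term_4": 0.02}

table = compare_results(dedup, weighted)
for term, p_old, p_new, change in table:
    print(term, p_old, p_new, change)
```

Comparing raw p-values is the simplest choice; if both analyses report multiple-testing-adjusted values, those would likely be the better column to compare.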

Validation

  • Background copy counts are computed for all gene families
  • phyper() results are reported with p-values
  • Comparison to previous results is documented

Depends on

Required by

Log