research-copy-number

Research: copy-number-aware enrichment methods

Metadata

Status: done
Assigned: agent-70
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-04-01T14:47:09.369933827+00:00
Started: 2026-04-01T14:47:58.138442817+00:00
Completed: 2026-04-01T14:49:47.919383106+00:00
Tags: research, methods, eval-scheduled
Eval score: 0.38
└ blocking impact: 0.35
└ completeness: 0.20
└ coordination overhead: 0.30
└ correctness: 0.25
└ downstream usability: 0.15
└ efficiency: 0.35
└ intent fidelity: 0.76
└ style adherence: 0.55

Description

Goal

Identify and evaluate statistical methods for gene enrichment analysis that properly account for multi-copy gene families — both in the query set (PHR genes) and the background (whole genome annotation).

The problem

Standard ORA tools (clusterProfiler, g:Profiler) deduplicate gene symbols: WASHC1 present on 16 chromosome arms counts as 1 gene. But the genome-wide annotation also contains multi-copy genes. If the background is deduplicated too, we may be comparing apples to apples (both sides lose copy information) — or we may be biasing results, because PHR genes are systematically higher-copy than genome-average genes.

We need methods where the TEST STATISTIC accounts for copy number.

Approaches to investigate

1. Region-based enrichment (GREAT / rGREAT)

GREAT (Genomic Regions Enrichment of Annotations Tool) takes GENOMIC INTERVALS as input (not gene lists). It assigns regulatory domains to genes and tests whether your intervals are enriched near genes of particular functions. This naturally handles multi-copy genes, because each interval is counted independently.

  • Can we use the 29 PHR BED intervals as input?
  • What's the appropriate background? All subtelomeric regions? Whole genome?
  • Is GREAT available as an R package (rGREAT) or only as a web tool?
  • Does it work with CHM13 coordinates or only GRCh38?

2. Weighted gene enrichment

Instead of a binary gene list, weight each gene by its copy number in PHRs:

  • WASHC1 gets weight 16, DUX4 gets weight 18, SHOX gets weight 2
  • Use a method like GSEA with pre-ranked lists (rank = copy number)
  • Or use a weighted hypergeometric / weighted Fisher test
  • The background also needs weighting: count genome-wide copies of each gene family
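As a sanity check on the pre-ranked idea, here is a minimal sketch of the classic GSEA running-sum enrichment score with copy number as the ranking metric. The gene names, copy counts, and the example gene set are illustrative placeholders, not the real PHR data; a production run would use a tool like fgsea or GSEApy instead.

```python
# Sketch of a pre-ranked GSEA-style enrichment score where the ranking
# metric is copy number. Genes and counts are illustrative placeholders.

def enrichment_score(ranked, gene_set, p=1.0):
    """Classic GSEA running-sum ES over (gene, weight) pairs.
    Hits increment by |weight|^p (normalized); misses decrement uniformly.
    Returns the maximum deviation of the running sum from zero."""
    ranked = sorted(ranked, key=lambda gw: gw[1], reverse=True)
    in_set = [g in gene_set for g, _ in ranked]
    n_r = sum(abs(w) ** p for (g, w), hit in zip(ranked, in_set) if hit)
    n_miss = len(ranked) - sum(in_set)
    es, running = 0.0, 0.0
    for (g, w), hit in zip(ranked, in_set):
        if hit:
            running += abs(w) ** p / n_r
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(es):
            es = running
    return es

# Toy ranked list: (gene symbol, PHR copy number) — placeholder values.
ranked = [("DUX4", 18), ("WASHC1", 16), ("FRG1", 9),
          ("TUBB8", 4), ("SHOX", 2), ("GENE_X", 1), ("GENE_Y", 1)]
print(enrichment_score(ranked, {"DUX4", "WASHC1", "FRG1"}))
```

If high-copy genes cluster in a GO term, the running sum peaks early and the ES approaches 1; a term concentrated in low-copy genes drives it negative.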

3. Permutation-based approach

  • Randomly sample N genomic intervals of the same sizes as PHRs from subtelomeric regions
  • Count gene copies in each random sample
  • Repeat 10,000x to build null distribution
  • Compare observed copy counts per GO term to the null
  • This is the most statistically rigorous but computationally heaviest
  • Could use bedtools shuffle + intersect in a loop
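The loop above can be prototyped in pure Python before committing to a bedtools pipeline. The sketch below uses entirely synthetic coordinates (the workspace, gene positions, and interval sizes are placeholders, not the real 29 PHRs) and a simplified uniform placement in lieu of `bedtools shuffle -incl`:

```python
# Sketch of the permutation null: re-place intervals of the observed sizes
# at random within a subtelomeric workspace, count gene copies hit, and
# compare to the observed count. All coordinates here are synthetic.
import random

random.seed(0)

# Workspace: allowed regions (stand-in for subtelomeric windows).
workspace = [(0, 500_000), (1_000_000, 1_600_000)]
# Annotated copies of one gene family: (start, end) per copy (synthetic).
gene_copies = [(10_000, 15_000), (120_000, 125_000), (1_050_000, 1_055_000)]
# Sizes of the observed intervals (3 stand-ins for the 29 PHRs).
interval_sizes = [40_000, 25_000, 60_000]

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def shuffle_once():
    """Place each interval uniformly at random inside a workspace block
    large enough to hold it (simplified `bedtools shuffle -incl`)."""
    placed = []
    for size in interval_sizes:
        blocks = [(s, e) for s, e in workspace if e - s >= size]
        s, e = random.choice(blocks)
        start = random.randint(s, e - size)
        placed.append((start, start + size))
    return placed

def copy_hits(intervals):
    """Number of gene copies overlapped by at least one interval."""
    return sum(1 for g in gene_copies
               if any(overlaps(iv, g) for iv in intervals))

observed = 3  # suppose all 3 synthetic copies fall in the real intervals
null = [copy_hits(shuffle_once()) for _ in range(10_000)]
# One-sided empirical p-value with the standard +1 correction.
p = (1 + sum(n >= observed for n in null)) / (len(null) + 1)
print(f"p = {p:.4f}")
```

The real run would swap the synthetic inputs for the PHR BED file and the CHM13 annotation, and do the shuffling with `bedtools shuffle -incl subtelomeres.bed` piped into `bedtools intersect`.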

4. Copy-number-weighted ORA

  • Build a custom background where each gene's 'count' reflects its genome-wide copy number
  • PHR query: instead of 23 unique genes, submit 1,189 gene instances
  • Background: count all gene copies genome-wide (not just unique symbols)
  • Run a modified Fisher/hypergeometric on copy counts rather than unique genes
  • This might be implementable with a custom R script using phyper()
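The phyper() idea ports directly to a stdlib Python sketch: treat every gene COPY as a ball in the urn, so the universe is copy counts rather than unique symbols. The counts below (total copies genome-wide, copies per GO term) are illustrative placeholders; only the 1,189 query size comes from the task.

```python
# Sketch of a copy-number-weighted hypergeometric test. Equivalent to
# R's phyper(k - 1, K, N - K, n, lower.tail = FALSE).
from math import comb

def hypergeom_sf(k, K, N, n):
    """P(X >= k) when drawing n copies from N total gene copies,
    K of which are annotated to the GO term of interest."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

N = 45_000  # total gene copies genome-wide (placeholder)
n = 1_189   # gene copies in the PHR query (from the task description)
K = 300     # genome-wide copies annotated to the GO term (placeholder)
k = 40      # PHR copies annotated to the GO term (placeholder)
print(f"P(X >= {k}) = {hypergeom_sf(k, K, N, n):.3e}")
```

Note the caveat this makes explicit: copies of the same gene are not independent draws, so treating each copy as a separate ball inflates significance. The permutation approach (3) avoids that assumption and would be the cross-check for any hit this test reports.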

5. GAT (Genomic Association Tester)

  • Takes intervals and annotations, uses permutation to test enrichment
  • Specifically designed for repeat-rich genomic regions
  • May handle the multi-copy issue naturally

Questions to answer

  1. Which of these methods is most appropriate for our specific situation (29 intervals, ~1,189 gene copies, subtelomeric context)?
  2. Which can be run on the head node with available tools (R, Python, bedtools)?
  3. Which gives us the most defensible statistics for a paper?
  4. For region-based methods: can they work with CHM13 coordinates?
  5. Is there precedent in the subtelomeric/repeat biology literature for any of these approaches?

Output

  • A ranked recommendation of 2-3 methods to implement
  • For each: what tool/package to use, what the input format is, what the expected runtime is
  • Any papers that used similar approaches for repetitive regions
  • Clear assessment of which approach(es) we should actually run

Validation

  • At least 3 approaches are evaluated with pros/cons
  • Tool availability on the head node is checked
  • A clear recommendation is provided with justification

Depends on

Required by

Log