research-copy-number

Research: copy-number-aware enrichment methods

Metadata

Status: done
Assigned: agent-70
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-04-01T14:47:09.369933827+00:00
Started: 2026-04-01T14:47:58.138442817+00:00
Completed: 2026-04-01T14:49:47.919383106+00:00
Tags: research, methods, eval-scheduled
Eval score: 0.38
└ blocking impact: 0.35
└ completeness: 0.20
└ coordination overhead: 0.30
└ correctness: 0.25
└ downstream usability: 0.15
└ efficiency: 0.35
└ intent fidelity: 0.76
└ style adherence: 0.55

Description

Goal

Identify and evaluate statistical methods for gene enrichment analysis that properly account for multi-copy gene families — both in the query set (PHR genes) and the background (whole genome annotation).

The problem

Standard ORA tools (clusterProfiler, g:Profiler) deduplicate gene symbols: WASHC1 present on 16 chromosome arms counts as 1 gene. But the genome-wide annotation also contains multi-copy genes. If the background is deduplicated too, we may be comparing apples to apples (both sides lose copy information) — or we may be biasing results, because PHR genes are systematically higher-copy than genome-average genes.

We need methods where the TEST STATISTIC accounts for copy number.

Approaches to investigate

1. Region-based enrichment (GREAT / rGREAT)

GREAT (Genomic Regions Enrichment of Annotations Tool) takes GENOMIC INTERVALS as input (not gene lists). It assigns regulatory domains to genes and tests whether your intervals are enriched near genes of particular functions. This naturally handles multi-copy genes, because each interval is counted independently.

  • Can we use the 29 PHR BED intervals as input?
  • What's the appropriate background? All subtelomeric regions? Whole genome?
  • Is GREAT available as an R package (rGREAT) or only as a web tool?
  • Does it work with CHM13 coordinates or only GRCh38?

2. Weighted gene enrichment

Instead of a binary gene list, weight each gene by its copy number in PHRs:

  • WASHC1 gets weight 16, DUX4 gets weight 18, SHOX gets weight 2
  • Use a method like GSEA with pre-ranked lists (rank = copy number)
  • Or use a weighted hypergeometric / weighted Fisher test
  • The background also needs weighting: count genome-wide copies of each gene family
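As a sanity check on the pre-ranked idea, here is a minimal sketch of the classic GSEA running-sum enrichment score with copy number as the ranking metric. The gene names, copy counts, and the example gene set are illustrative placeholders, not the real PHR data; a production run would use a tool like fgsea or GSEApy instead.

```python
# Sketch of a pre-ranked GSEA-style enrichment score where the ranking
# metric is copy number. Genes and counts are illustrative placeholders.

def enrichment_score(ranked, gene_set, p=1.0):
    """Classic GSEA running-sum ES over (gene, weight) pairs.
    Hits increment by |weight|^p (normalized); misses decrement uniformly.
    Returns the maximum deviation of the running sum from zero."""
    ranked = sorted(ranked, key=lambda gw: gw[1], reverse=True)
    in_set = [g in gene_set for g, _ in ranked]
    n_r = sum(abs(w) ** p for (g, w), hit in zip(ranked, in_set) if hit)
    n_miss = len(ranked) - sum(in_set)
    es, running = 0.0, 0.0
    for (g, w), hit in zip(ranked, in_set):
        if hit:
            running += abs(w) ** p / n_r
        else:
            running -= 1.0 / n_miss
        if abs(running) > abs(es):
            es = running
    return es

# Toy ranked list: (gene symbol, PHR copy number) — placeholder values.
ranked = [("DUX4", 18), ("WASHC1", 16), ("FRG1", 9),
          ("TUBB8", 4), ("SHOX", 2), ("GENE_X", 1), ("GENE_Y", 1)]
print(enrichment_score(ranked, {"DUX4", "WASHC1", "FRG1"}))
```

If high-copy genes cluster in a GO term, the running sum peaks early and the ES approaches 1; a term concentrated in low-copy genes drives it negative.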

3. Permutation-based approach

  • Randomly sample N genomic intervals of the same sizes as PHRs from subtelomeric regions
  • Count gene copies in each random sample
  • Repeat 10,000x to build null distribution
  • Compare observed copy counts per GO term to the null
  • This is the most statistically rigorous but computationally heaviest
  • Could use bedtools shuffle + intersect in a loop
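The loop above can be prototyped in pure Python before committing to a bedtools pipeline. The sketch below uses entirely synthetic coordinates (the workspace, gene positions, and interval sizes are placeholders, not the real 29 PHRs) and a simplified uniform placement in lieu of `bedtools shuffle -incl`:

```python
# Sketch of the permutation null: re-place intervals of the observed sizes
# at random within a subtelomeric workspace, count gene copies hit, and
# compare to the observed count. All coordinates here are synthetic.
import random

random.seed(0)

# Workspace: allowed regions (stand-in for subtelomeric windows).
workspace = [(0, 500_000), (1_000_000, 1_600_000)]
# Annotated copies of one gene family: (start, end) per copy (synthetic).
gene_copies = [(10_000, 15_000), (120_000, 125_000), (1_050_000, 1_055_000)]
# Sizes of the observed intervals (3 stand-ins for the 29 PHRs).
interval_sizes = [40_000, 25_000, 60_000]

def overlaps(a, b):
    return a[0] < b[1] and b[0] < a[1]

def shuffle_once():
    """Place each interval uniformly at random inside a workspace block
    large enough to hold it (simplified `bedtools shuffle -incl`)."""
    placed = []
    for size in interval_sizes:
        blocks = [(s, e) for s, e in workspace if e - s >= size]
        s, e = random.choice(blocks)
        start = random.randint(s, e - size)
        placed.append((start, start + size))
    return placed

def copy_hits(intervals):
    """Number of gene copies overlapped by at least one interval."""
    return sum(1 for g in gene_copies
               if any(overlaps(iv, g) for iv in intervals))

observed = 3  # suppose all 3 synthetic copies fall in the real intervals
null = [copy_hits(shuffle_once()) for _ in range(10_000)]
# One-sided empirical p-value with the standard +1 correction.
p = (1 + sum(n >= observed for n in null)) / (len(null) + 1)
print(f"p = {p:.4f}")
```

The real run would swap the synthetic inputs for the PHR BED file and the CHM13 annotation, and do the shuffling with `bedtools shuffle -incl subtelomeres.bed` piped into `bedtools intersect`.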

4. Copy-number-weighted ORA

  • Build a custom background where each gene's 'count' reflects its genome-wide copy number
  • PHR query: instead of 23 unique genes, submit 1,189 gene instances
  • Background: count all gene copies genome-wide (not just unique symbols)
  • Run a modified Fisher/hypergeometric on copy counts rather than unique genes
  • This might be implementable with a custom R script using phyper()
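The phyper() idea ports directly to a stdlib Python sketch: treat every gene COPY as a ball in the urn, so the universe is copy counts rather than unique symbols. The counts below (total copies genome-wide, copies per GO term) are illustrative placeholders; only the 1,189 query size comes from the task.

```python
# Sketch of a copy-number-weighted hypergeometric test. Equivalent to
# R's phyper(k - 1, K, N - K, n, lower.tail = FALSE).
from math import comb

def hypergeom_sf(k, K, N, n):
    """P(X >= k) when drawing n copies from N total gene copies,
    K of which are annotated to the GO term of interest."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

N = 45_000  # total gene copies genome-wide (placeholder)
n = 1_189   # gene copies in the PHR query (from the task description)
K = 300     # genome-wide copies annotated to the GO term (placeholder)
k = 40      # PHR copies annotated to the GO term (placeholder)
print(f"P(X >= {k}) = {hypergeom_sf(k, K, N, n):.3e}")
```

Note the caveat this makes explicit: copies of the same gene are not independent draws, so treating each copy as a separate ball inflates significance. The permutation approach (3) avoids that assumption and would be the cross-check for any hit this test reports.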

5. GAT (Genomic Association Tester)

  • Takes intervals and annotations, uses permutation to test enrichment
  • Specifically designed for repeat-rich genomic regions
  • May handle the multi-copy issue naturally

Questions to answer

  1. Which of these methods is most appropriate for our specific situation (29 intervals, ~1,189 gene copies, subtelomeric context)?
  2. Which can be run on the head node with available tools (R, Python, bedtools)?
  3. Which gives us the most defensible statistics for a paper?
  4. For region-based methods: can they work with CHM13 coordinates?
  5. Is there precedent in the subtelomeric/repeat biology literature for any of these approaches?

Output

  • A ranked recommendation of 2-3 methods to implement
  • For each: what tool/package to use, what the input format is, what the expected runtime is
  • Any papers that used similar approaches for repetitive regions
  • Clear assessment of which approach(es) we should actually run

Validation

  • At least 3 approaches are evaluated with pros/cons
  • Tool availability on the head node is checked
  • A clear recommendation is provided with justification

Depends on

Required by

Log