survey-existing-cnv

Metadata

Status	done
Assigned	`agent-25`
Agent identity	`3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3`
Created	2026-04-28T21:51:16.984524013+00:00
Started	2026-04-28T21:51:56.861414194+00:00
Completed	2026-04-28T21:59:19.300111165+00:00
Tags	`eval-scheduled`
Eval score	0.87
└ blocking impact	0.95
└ completeness	0.95
└ constraint fidelity	0.40
└ coordination overhead	0.80
└ correctness	0.90
└ downstream usability	0.90
└ efficiency	0.85
└ intent fidelity	0.83
└ style adherence	0.95

Description

Characterize the existing copy-number-aware gene enrichment analysis code at ~/phrs. This is Erik Garrison's working bioinformatician implementation. It is the "Arm A" reference for a planned three-arm methodology comparison (solo bioinformatician, single-session agentic CLI, WorkGraph + human + agentic). It is also the methods substrate for a planned manuscript.

Nothing is yet known to the orchestrator: directory layout, language, dependencies, test data, license, public/private state, which algorithms are used, or how the code relates to baseline methods (naive hypergeometric, Reimand "pick one per cluster" collapse, SAGO cyclic-permutation correction).
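
For orientation, the first two baselines named above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the method in ~/phrs and not a faithful reimplementation of any published tool: the function names, the `family_of` mapping, and the rule that a family counts as annotated when any member is annotated are all hypothetical. The SAGO correction is not sketched because this task gives no detail beyond its name.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n genes from N, of which K carry the term."""
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible draws contribute nothing to the sum.
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total


def naive_enrichment(hits, annotated, universe):
    """Naive hypergeometric test: every gene copy counts as an independent draw."""
    universe = set(universe)
    hits = set(hits) & universe
    annotated = set(annotated) & universe
    k = len(hits & annotated)
    return hypergeom_sf(k, len(universe), len(annotated), len(hits))


def collapse_families(genes, family_of):
    """Map genes to family labels; genes without a family stand for themselves.

    Hypothetical stand-in for a "pick one per cluster" collapse.
    """
    return {family_of.get(g, g) for g in genes}


def collapsed_enrichment(hits, annotated, universe, family_of):
    """Same test after collapsing each duplicate family to a single unit."""
    return naive_enrichment(
        collapse_families(hits, family_of),
        collapse_families(annotated, family_of),
        collapse_families(universe, family_of),
    )


if __name__ == "__main__":
    # Toy universe: one five-copy family (F1..F5) plus fifteen singletons.
    family_of = {f"F{i}": "F" for i in range(1, 6)}
    universe = set(family_of) | {f"G{i}" for i in range(1, 16)}
    annotated = set(family_of) | {"G1"}   # all copies share the term
    hits = set(family_of) | {"G2"}        # all copies land in the hit list
    print("naive    ", naive_enrichment(hits, annotated, universe))
    print("collapsed", collapsed_enrichment(hits, annotated, universe, family_of))
```

In this toy case the naive p-value is about 0.002 and the collapsed one about 0.24, which is the duplicate-family inflation the survey is asked to place the existing code against.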

Context

Substrate is T2T-complete CHM13, NOT haplotype-resolved HPRC pangenomes. The contribution is on the duplicate-family axis (copy counts visible because reference is T2T-complete), not on the haplotype-distribution axis. Do not overstate.
Motivation came from Erik's observations of subtelomere gene-family clustering in HPRC, but the existing code is reference-based.
The analysis is meant to be generic, not subtelomere-specific.
The eventual demo replicates this method via agentic arms; an agentic-extension target (move to HPRC haplotypes) is under consideration.
The eventual paper is a methods paper, target venues Bioinformatics / NAR Genomics & Bioinformatics / Genome Biology.

Scope

Walk ~/phrs end to end. Look at:

Directory layout, README, top-level docs
Language(s), build system, dependencies
The actual enrichment algorithm: input format, statistic, what correction it applies for duplicated gene families, what gene/annotation databases it uses
Test data: is there a small example with known-good output? What format? How big?
Reproducibility: can someone clone and run it on a laptop? What hardware does it require?
License (or absence of one)
Public/private state: is this a private working directory, a private git repo, a public repo, etc. Check git remotes if a repo.
Recent commit activity and how settled the code is
Any existing comparison to baseline methods. If not, what would need to be added.
Anything that surprises you (good or bad).

Output

PHRS_SURVEY.md at root of ~/google_ai_competition/, with these sections:

1. One-paragraph summary for someone who has never seen the code.
2. What is in ~/phrs: directory tree, file roles, language, deps.
3. The method: algorithm description in 200 to 400 words. What inputs, what statistic, how it handles duplicated families, what assumptions about the substrate (CHM13). Be precise about whether it also handles spatial clustering or only duplicate counts.
4. Test data and reproducibility: what is bundled, what is needed to reproduce, hardware floor, runtime estimate.
5. State of the code: license, repo state, public/private, recent activity, technical debt.
6. Position vs prior art: concrete comparison points to (a) naive hypergeometric, (b) Reimand "pick one rep per cluster", (c) SAGO 2024. Where does this fit on the duplicate-family vs spatial axis. If the comparison is not actually implemented in the code, say so plainly.
7. Readiness for Arm A: what is needed to make this the public reference arm of the demo. Specifically: is there a small input + expected-output pair we can hand to arms B and C? Is the README sufficient for a third party to reproduce in a day?
8. Readiness for the agentic-extension question: if arms B and C were asked to extend this toward HPRC-haplotype-aware enrichment, what does the existing code give them as a starting point and what does it lack?
9. Surprises: anything unexpected.

Constraints

Read code, do not modify it.
Run small commands to characterize (ls, wc -l, git log, cargo metadata or pip show, etc., as appropriate). Do not run large analyses.
If a test case can be run in under 5 minutes on a laptop, run it once and capture the output. Otherwise, describe what would be involved.
No em dashes in the output.
Be honest about what is missing or unclear. The downstream tasks need accuracy more than enthusiasm.
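
The read-only characterization pass described above could be scripted roughly as follows. This is a sketch, assuming `git` and `ls` are on the PATH and that the target is a git repository; the helper names (`characterization_commands`, `run_readonly`, `license_status`) and the choice of commands are illustrative, not part of any existing tooling.

```python
import subprocess
from pathlib import Path


def characterization_commands(repo):
    """Build small, read-only commands for a first look at a repository."""
    repo = str(Path(repo).expanduser())
    return [
        ["ls", "-la", repo],
        ["git", "-C", repo, "remote", "-v"],
        ["git", "-C", repo, "log", "--oneline", "-n", "20"],
    ]


def run_readonly(commands, timeout_s=60):
    """Run each command with a timeout and collect stdout (or the error text)."""
    results = {}
    for cmd in commands:
        key = " ".join(cmd)
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
            results[key] = proc.stdout if proc.returncode == 0 else proc.stderr
        except (OSError, subprocess.TimeoutExpired) as exc:
            results[key] = f"failed: {exc}"
    return results


def license_status(repo):
    """Return the first license-like file name at the top level, or say none exists."""
    root = Path(repo).expanduser()
    names = sorted(
        p.name
        for p in root.iterdir()
        if p.is_file() and p.name.upper().startswith(("LICENSE", "LICENCE", "COPYING"))
    )
    return names[0] if names else "no LICENSE file"


if __name__ == "__main__":
    target = "~/phrs"  # path taken from this task; adjust as needed
    for command, output in run_readonly(characterization_commands(target)).items():
        print(f"$ {command}\n{output}")
```

Keeping command construction separate from execution makes it easy to review exactly what will be run before anything touches the directory.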

Validation

PHRS_SURVEY.md exists at repo root
All nine sections present
Section 3 (method) is precise enough that a reader can place the contribution on the duplicate-family vs spatial axis
Section 6 says explicitly which baseline comparisons are or are not in the code
Section 7 answers the test-input/expected-output question with yes-or-no
License status named explicitly (specific license, or "no LICENSE file")
Output is honest about gaps and unclarities
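
The mechanical items in this checklist can be checked with a short script. The sketch below rests on assumptions that this task does not specify: that PHRS_SURVEY.md uses markdown headings beginning with the section number (for example `## 3. The method`), and that any mention of the word "license" is an acceptable proxy for the license item. The function name `check_survey` is hypothetical. The judgment items (precision of Section 3, honesty about gaps) still need a human or evaluator read.

```python
import re

EM_DASH = "\u2014"  # written as an escape so this script itself stays em-dash free


def check_survey(text, expected_sections=9):
    """Run the mechanical validation checks on the survey text."""
    # Assumed heading convention: "#", "##", ... followed by "N." or "N)".
    numbers = [
        int(m.group(1))
        for m in re.finditer(r"^#{1,6}\s+(\d+)[.)]", text, flags=re.MULTILINE)
    ]
    return {
        "all_sections_present": numbers == list(range(1, expected_sections + 1)),
        "no_em_dashes": EM_DASH not in text,
        # Heuristic only: confirms the topic is mentioned, not that it is correct.
        "license_mentioned": re.search(r"licen[sc]e", text, flags=re.IGNORECASE) is not None,
    }


if __name__ == "__main__":
    from pathlib import Path

    survey = Path("PHRS_SURVEY.md")
    if survey.exists():
        for name, passed in check_survey(survey.read_text(encoding="utf-8")).items():
            print(f"{'PASS' if passed else 'FAIL'}  {name}")
    else:
        print("FAIL  PHRS_SURVEY.md does not exist at repo root")
```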

Depends on

done .assign-survey-existing-cnv

Required by

done .flip-survey-existing-cnv

Log

2026-04-28T21:51:16.983670493+00:00 Task paused
2026-04-28T21:51:21.604522397+00:00 Task published
2026-04-28T21:51:56.861418572+00:00 Spawned by coordinator --executor claude --model opus
2026-04-28T21:52:06.524984141+00:00 Starting survey of ~/phrs directory
2026-04-28T21:58:43.545464481+00:00 Wrote PHRS_SURVEY.md at repo root. 9 sections, no em dashes. License=no LICENSE file. Tested core function runs in ~1s on laptop.
2026-04-28T21:59:13.002016380+00:00 Committed b045cde, pushed to origin/wg/agent-25/survey-existing-cnv. Validated all 9 sections present, no em dashes, license stated explicitly, test fixture answer is no.
2026-04-28T21:59:19.300120633+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-04-28T22:02:05.485026883+00:00 PendingEval → Done (evaluator passed; downstream unblocks)