survey-existing-cnv

Survey: existing CNV-aware enrichment code in ~/phrs

Metadata

Status: done
Assigned: agent-25
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-04-28T21:51:16.984524013+00:00
Started: 2026-04-28T21:51:56.861414194+00:00
Completed: 2026-04-28T21:59:19.300111165+00:00
Tags: eval-scheduled
Eval score: 0.87
└ blocking impact: 0.95
└ completeness: 0.95
└ constraint fidelity: 0.40
└ coordination overhead: 0.80
└ correctness: 0.90
└ downstream usability: 0.90
└ efficiency: 0.85
└ intent fidelity: 0.83
└ style adherence: 0.95

Description

Characterize the existing copy-number-aware gene enrichment analysis code at ~/phrs. This is Erik Garrison's implementation, written as a working bioinformatician. It is the "Arm A" reference for a planned three-arm methodology comparison (solo bioinformatician, single-session agentic CLI, WorkGraph + human + agentic). It is also the methods substrate for a planned manuscript.

The orchestrator does not yet know any of the following: directory layout, language, dependencies, test data, license, public/private state, which algorithms are used, or how the code relates to baseline methods (naive hypergeometric, Reimand "pick one per cluster" collapse, SAGO cyclic-permutation correction).

Context

  • The substrate is T2T-complete CHM13, NOT haplotype-resolved HPRC pangenomes. The contribution is on the duplicate-family axis (copy counts are visible because the reference is T2T-complete), not on the haplotype-distribution axis. Do not overstate this.
  • Motivation came from Erik's observations of subtelomere gene-family clustering in HPRC, but the existing code is reference-based.
  • The analysis is meant to be generic, not subtelomere-specific.
  • The eventual demo replicates this method via agentic arms; an agentic-extension target (move to HPRC haplotypes) is under consideration.
  • The eventual paper is a methods paper; target venues are Bioinformatics, NAR Genomics & Bioinformatics, and Genome Biology.

Scope

Walk ~/phrs end to end. Look at:

  1. Directory layout, README, top-level docs
  2. Language(s), build system, dependencies
  3. The actual enrichment algorithm: input format, statistic, what correction it applies for duplicated gene families, what gene/annotation databases it uses
  4. Test data: is there a small example with known-good output? What format? How big?
  5. Reproducibility: can someone clone and run it on a laptop? What hardware does it require?
  6. License (or absence of one)
  7. Public/private state: is this a private working directory, a private git repo, or a public repo? If it is a git repo, check its remotes.
  8. Recent commit activity and how settled the code is
  9. Any existing comparison to baseline methods. If not, what would need to be added.
  10. Anything that surprises you (good or bad).

Output

Write PHRS_SURVEY.md at the root of ~/google_ai_competition/, with these sections:

  1. One-paragraph summary for someone who has never seen the code.
  2. What is in ~/phrs: directory tree, file roles, language, deps.
  3. The method: algorithm description in 200 to 400 words. What inputs, what statistic, how it handles duplicated families, what assumptions it makes about the substrate (CHM13). Be precise about whether it also handles spatial clustering or only duplicate counts.
  4. Test data and reproducibility: what is bundled, what is needed to reproduce, hardware floor, runtime estimate.
  5. State of the code: license, repo state, public/private, recent activity, technical debt.
  6. Position vs prior art: concrete comparison points to (a) naive hypergeometric, (b) Reimand "pick one rep per cluster", (c) SAGO 2024. Where does this fit on the duplicate-family vs spatial axis? If the comparison is not actually implemented in the code, say so plainly.
  7. Readiness for Arm A: what is needed to make this the public reference arm of the demo. Specifically: is there a small input + expected-output pair we can hand to arms B and C? Is the README sufficient for a third party to reproduce in a day?
  8. Readiness for the agentic-extension question: if arms B and C were asked to extend this toward HPRC-haplotype-aware enrichment, what does the existing code give them as a starting point and what does it lack?
  9. Surprises: anything unexpected.
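
For orientation on comparison points (a) and (b) in section 6, here is a minimal sketch of a naive hypergeometric enrichment test and of one simple reading of the "pick one per cluster" collapse. It is not taken from ~/phrs and is not claimed to match the Reimand procedure in detail. The function names, the toy gene IDs, and the `family_of` mapping are all hypothetical.

```python
from math import comb


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation."""
    upper = min(K, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def enrichment_p(hits, annotated, background):
    """Naive over-representation p-value; every gene copy counts separately."""
    hits = hits & background
    annotated = annotated & background
    return hypergeom_sf(len(hits & annotated), len(background),
                        len(annotated), len(hits))


def collapse(genes, family_of):
    """Assumed collapse: replace each gene by its family ID, so a duplicated
    family contributes one unit. Genes with no family stand for themselves."""
    return {family_of.get(g, g) for g in genes}


# Toy data (hypothetical): 20 genes, one 5-copy family, all copies annotated.
background = {f"g{i}" for i in range(20)}
family_of = {f"g{i}": "famF" for i in range(5)}
annotated = {"g0", "g1", "g2", "g3", "g4", "g5"}
hits = {"g0", "g1", "g2", "g3", "g10"}

p_naive = enrichment_p(hits, annotated, background)
p_collapsed = enrichment_p(collapse(hits, family_of),
                           collapse(annotated, family_of),
                           collapse(background, family_of))
print(f"naive:     {p_naive:.4f}")      # about 0.0139
print(f"collapsed: {p_collapsed:.4f}")  # about 0.2417
```

In this toy case four of the five hits are copies of one family, so the naive test reports a small p-value that largely disappears once the family is counted once. Whether the code in ~/phrs applies a correction of this kind is exactly what section 3 and section 6 need to establish.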

Constraints

  • Read code, do not modify it.
  • Run small commands to characterize the code (ls, wc -l, git log, cargo metadata, pip show, etc., as appropriate). Do not run large analyses.
  • If a test case can be run in under 5 minutes on a laptop, run it once and capture the output. Otherwise, describe what would be involved.
  • No em dashes in the output.
  • Be honest about what is missing or unclear. The downstream tasks need accuracy more than enthusiasm.
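
One way to keep the characterization within these constraints is sketched below: a short list of git queries that only read repository state, a runner with a hard timeout that matches the 5-minute limit, and a line counter that only reads files. The command list, the helper names, and the file suffixes are assumptions, since the language of ~/phrs is not yet known.

```python
import subprocess
from pathlib import Path

# Git queries that only read repository state (assumed sufficient for items 7 and 8).
READ_ONLY_COMMANDS = [
    ["git", "remote", "-v"],
    ["git", "log", "--oneline", "-n", "20"],
]


def run_bounded(cmd, cwd, timeout_s=300):
    """Run one command with a hard timeout and capture its output."""
    try:
        done = subprocess.run(cmd, cwd=cwd, capture_output=True,
                              text=True, timeout=timeout_s)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Missing binary, missing directory, or timeout: report, do not retry.
        return {"cmd": cmd, "ok": False, "output": type(exc).__name__}
    return {"cmd": cmd, "ok": done.returncode == 0,
            "output": done.stdout + done.stderr}


def line_counts(root, suffixes=(".py", ".rs", ".R", ".sh")):
    """Count lines per source file, like `wc -l`; the suffix list is a guess."""
    root = Path(root)
    counts = {}
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in suffixes and ".git" not in path.parts:
            with path.open("rb") as fh:
                counts[str(path.relative_to(root))] = sum(1 for _ in fh)
    return counts

# Example use (not executed here):
#   repo = Path.home() / "phrs"
#   reports = [run_bounded(cmd, cwd=repo) for cmd in READ_ONLY_COMMANDS]
#   sizes = line_counts(repo)
```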

Validation

  • PHRS_SURVEY.md exists at the root of ~/google_ai_competition/
  • All nine sections present
  • Section 3 (method) is precise enough that a reader can place the contribution on the duplicate-family vs spatial axis
  • Section 6 says explicitly which baseline comparisons are or are not in the code
  • Section 7 answers the test-input/expected-output question with yes-or-no
  • License status named explicitly (specific license, or "no LICENSE file")
  • Output is honest about gaps and unclarities
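
The mechanical items in this checklist (file exists, nine sections present, no em dashes) can be checked automatically. The sketch below assumes each section is a Markdown heading that begins with its number, for example "## 3. The method"; that heading convention and the function names are assumptions, not something this task prescribes. The remaining items (precision of section 3, honesty about gaps) still need a human or reviewer read.

```python
import re
from pathlib import Path

EXPECTED_SECTIONS = 9
EM_DASH = "\u2014"  # written as an escape so this file itself contains none


def check_survey(text):
    """Return a list of problems found in the survey text (empty if none)."""
    problems = []
    # Assumed convention: numbered Markdown headings such as "## 3. The method".
    found = {int(m.group(1))
             for m in re.finditer(r"^#{1,6}\s*(\d+)[.)]?\s", text, flags=re.MULTILINE)}
    missing = sorted(set(range(1, EXPECTED_SECTIONS + 1)) - found)
    if missing:
        problems.append(f"missing sections: {missing}")
    if EM_DASH in text:
        problems.append("contains em dash")
    return problems


def check_file(path):
    """Check existence first, then the content rules."""
    path = Path(path)
    if not path.is_file():
        return [f"file not found: {path}"]
    return check_survey(path.read_text(encoding="utf-8"))
```

A call such as `check_file(Path.home() / "google_ai_competition" / "PHRS_SURVEY.md")` returns an empty list when the mechanical checks pass.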

Depends on

Required by

Log