pedigree-untangle-multimap-tracts — octopus01:/moosefs/erikg/phrs

Metadata

Status	done
Assigned	`agent-2553`
Agent identity	`289ccc9f03fc7c121a5ab8d685ffd018371bcdac67ceab1d50b03e7347d29155`
Created	2026-06-18T20:11:14.727622047+00:00
Started	2026-06-18T20:15:29.643716235+00:00
Completed	2026-06-18T20:27:28.510781043+00:00
Tags	`pedigree`, `untangle`, `recombination`, `eval-scheduled`
Tokens	2077509 in / 22374 out
Eval score	0.77
└ blocking impact	0.90
└ completeness	0.66
└ constraint fidelity	0.40
└ coordination overhead	0.78
└ correctness	0.77
└ downstream usability	0.79
└ efficiency	0.88
└ intent fidelity	0.84
└ style adherence	0.82

Description

Objective: improve the WashU pedigree tract-calling analysis so it recovers interpretable recombination-tract candidates from odgi untangle alignments without defaulting to an arbitrary nth.best=1 projection. The key question is whether consecutive m1000 or lower-threshold untangle runs can be merged into biologically meaningful tracts when multimapping and equivalent donors are represented explicitly.

Scientific framing:
- The pedigree remains a supportive compatibility analysis, not a new headline result. Preserve candidate language.
- The aim is to measure tract lengths more honestly, especially through repeats and equivalent haplotypes, and to show when untangle is genuinely inconclusive rather than pretending a first-best donor is unique.
- WFMASH 1 kb segment length is not a hard tract-length lower bound. Treat it as part of graph/seed construction, not as proof that alignments must occur in exact 1 kb increments.

Required inputs and starting points:
- Existing WashU untangle BEDs under /moosefs/guarracino/HPRCv2/PHR_III/pedigrees/washu/untangle/, especially PAN027_vs_PAN010.e50000.m1000.bed.gz, PAN027_vs_PAN011.e50000.m1000.bed.gz, and PAN028_vs_PAN027.e50000.m1000.bed.gz.
- Existing patch table: /moosefs/guarracino/HPRCv2/PHR_III/pedigrees/washu/untangle/recombination/patches.tsv.
- Existing code and reports: scripts/pedigree/patch_tract_lengths.py, scripts/pedigree/run_patch_tract_lower_merge.sh, scripts/pedigree/patch_tract_length_summary.tsv, scripts/pedigree/patch_tract_lower_merge_summary.tsv, paper_prep/_brainstorming/pedigree_patch_tract_lengths.md, and the pedigree Methods in submission/paper.tex.
- sweepga is available at /home/erikg/.cargo/bin/sweepga. Inspect sweepga --help. Its --num-mappings option may be useful for retaining n:m-best mappings if a PAF or FASTA-derived path is practical. If sweepga does not fit odgi untangle BEDs cleanly, document why and implement an equivalent interval-sweep merger instead.

Implementation requirements:
1. Add a reproducible script, preferably scripts/pedigree/untangle_multimap_tracts.py, that runs from the repo root on moosefs.
2. Support parameters for top-N mappings, score delta or tie epsilon, minimum segment score, maximum bridge gap, and bridge mode. Do not hard-code nth.best=1 as the only interpretation.
3. Build equivalence classes per child/query interval: collect all donor/reference hits that are tied or near-tied to the best hit under the chosen threshold. Keep exact donor haplotype, chromosome arm, and any available community annotation.
4. Merge adjacent/consecutive runs when donor equivalence classes are compatible. At minimum distinguish these resolvability classes: unique donor haplotype, unique donor arm with multiple haplotypes, same-community ambiguous donors, cross-community ambiguous donors, and unresolved/no-call.
5. Explore merging through repeats/ambiguous segments rather than breaking every tract at a multimapping interval. Bridge only when flanking evidence remains compatible, and record the bridge length and reason. Provide sensitivity across at least two gap/bridge settings.
6. Compare against current first-best behavior: existing high-confidence m1000 patch table and the lower-merge m0/n1 run-level summary. Report how many tracts merge, split, or become ambiguous under the multimap-aware method.
7. Quantify tract-length distributions and the primate literature ranges already discussed: 22-95 bp, 318-688 bp, and 159-1376 bp. Include counts, proportions, medians, IQRs, and max/min under each parameter setting.
8. Produce a small visual artifact that makes the case visible for representative Fig. 5/WashU regions: unique best segments, equivalent alternatives, bridged ambiguous/repeat intervals, and final tract calls. Use PDF/PNG or TSV plus a plotting script in paper_prep/_brainstorming/pedigree_multimap_tracts/.
9. Produce a concise Markdown report explaining what was tried, whether sweepga was used or rejected, the recommended default parameters, and which claims are supported versus inconclusive.
10. Only edit the manuscript if the result is robust and useful. If editing, keep it light: one Methods sentence or one cautious Results sentence. Do not promote the pedigree analysis to a headline result, do not add defensive caveats, and keep candidate/compatible-with wording.
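
As a rough illustration of requirements 3 to 5, the sketch below builds a tie-aware equivalence class per query interval, assigns a resolvability class, and merges intervals whose donor sets stay compatible, bridging short unresolved stretches. It is a minimal sketch under stated assumptions, not the committed untangle_multimap_tracts.py: the `Hit` record, the function names, the default thresholds, and the donor, arm, and community labels in the demo are all hypothetical, and reading the untangle BED columns is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One donor hit for a query interval (hypothetical record layout)."""
    donor: str       # donor haplotype label (made-up naming)
    arm: str         # chromosome arm of the donor hit
    community: str   # community annotation, if available
    score: float     # untangle-style similarity score


def equivalence_class(hits, tie_eps=0.01, min_score=0.5, top_n=5):
    """Keep the top-N hits that are tied or near-tied with the best hit."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:top_n]
    if not ranked or ranked[0].score < min_score:
        return []  # nothing passes the minimum segment score: no-call
    best = ranked[0].score
    return [h for h in ranked if best - h.score <= tie_eps]


def resolvability(eq):
    """Map an equivalence class to one of the five resolvability classes."""
    if not eq:
        return "unresolved"
    if len({(h.donor, h.arm) for h in eq}) == 1:
        return "unique_haplotype"
    if len({h.arm for h in eq}) == 1:
        return "unique_arm_multi_haplotype"
    if len({h.community for h in eq}) == 1:
        return "same_community_ambiguous"
    return "cross_community_ambiguous"


def merge_tracts(segments, max_bridge):
    """Merge sorted (start, end, eq) segments while donor sets intersect.

    Unresolved segments are skipped, so they count toward the gap that a
    compatible right flank must bridge (at most max_bridge bp).
    """
    tracts, cur = [], None
    for start, end, eq in segments:
        donors = frozenset((h.donor, h.arm) for h in eq)
        if not donors:
            continue
        if cur is not None:
            gap = max(start - cur["end"], 0)
            shared = cur["donors"] & donors
            if shared and gap <= max_bridge:
                cur["end"] = end
                cur["donors"] = shared        # narrow to donors still compatible
                cur["bridged_bp"] += gap      # record the bridge length
                continue
            tracts.append(cur)
        cur = {"start": start, "end": end, "donors": donors, "bridged_bp": 0}
    if cur is not None:
        tracts.append(cur)
    return tracts


# Made-up demo intervals: (query start, query end, candidate hits).
intervals = [
    (0, 1000, [Hit("PAN010#1", "chr13p", "c1", 0.99), Hit("PAN010#2", "chr13p", "c1", 0.90)]),
    (1000, 2000, [Hit("PAN010#1", "chr13p", "c1", 0.98), Hit("PAN010#2", "chr13p", "c1", 0.98)]),
    (2000, 2500, [Hit("PAN010#1", "chr13p", "c1", 0.20)]),   # below min_score
    (2500, 3500, [Hit("PAN010#1", "chr13p", "c1", 0.97)]),
    (3500, 4500, [Hit("PAN011#2", "chr21p", "c2", 0.99)]),
]
segments = [(s, e, equivalence_class(h)) for s, e, h in intervals]

for max_bridge in (0, 1000):  # two bridge settings for sensitivity
    tracts = merge_tracts(segments, max_bridge)
    print(max_bridge, [(t["start"], t["end"], t["bridged_bp"]) for t in tracts])
```

Intersecting donor sets means an ambiguous interval does not break a tract as long as at least one donor is shared with its flanks, while a tract still ends when the compatible set becomes empty or the gap exceeds the bridge limit.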

Expected outputs:
- scripts/pedigree/untangle_multimap_tracts.py or a clearly named equivalent.
- A summary TSV in scripts/pedigree/ with parameter settings and tract length statistics.
- A tract-level TSV with resolvability class and donor equivalence metadata.
- paper_prep/_brainstorming/pedigree_multimap_tracts.md.
- Representative visualization files under paper_prep/_brainstorming/pedigree_multimap_tracts/.
- If paper.tex is touched, rebuild submission/paper.pdf and confirm grep -c undefined submission/paper.log is 0.
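
For the summary TSV, one row per parameter setting could be computed roughly as follows. This is a hedged sketch using only the Python standard library: the function name, the column names, and the example lengths are hypothetical, and the three literature ranges overlap, so each is counted independently rather than as exclusive bins.

```python
import statistics

# Primate literature ranges named in the task description (inclusive, in bp).
LITERATURE_RANGES_BP = {
    "22-95": (22, 95),
    "318-688": (318, 688),
    "159-1376": (159, 1376),
}


def summarize_lengths(lengths):
    """Return count, min/max, median, quartiles, and per-range counts and proportions."""
    lengths = sorted(lengths)
    n = len(lengths)
    if n == 0:
        return {"n": 0}
    if n >= 2:
        # Inclusive quartiles treat the data as the full population of tracts.
        q1, _, q3 = statistics.quantiles(lengths, n=4, method="inclusive")
    else:
        q1 = q3 = lengths[0]
    row = {
        "n": n,
        "min": lengths[0],
        "max": lengths[-1],
        "median": statistics.median(lengths),
        "q1": q1,
        "q3": q3,
    }
    for label, (lo, hi) in LITERATURE_RANGES_BP.items():
        count = sum(lo <= x <= hi for x in lengths)
        row[f"n_{label}"] = count
        row[f"frac_{label}"] = count / n
    return row


# Made-up tract lengths in bp, for illustration only.
example = summarize_lengths([50, 400, 500, 1000, 5000])
print(example)
```

Calling the function once per top-N, tie epsilon, and bridge setting and writing the rows with `csv.DictWriter` (tab delimiter) would give the per-setting table described above.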

Acceptance criteria:
- The analysis no longer silently treats first-best untangle as uniquely true when multiple equivalent donors exist.
- Consecutive m1000 and lower-threshold runs are explicitly tested for mergeability.
- Multimapping is represented as evidence/resolvability, not discarded noise.
- The report makes clear whether the result strengthens conversion-vs-crossover tract-length interpretation or remains inconclusive.

Depends on

Required by

Log

2026-06-18T20:13:45.083937819+00:00 Spawn failed (attempt 1/5): Invalid --timeout value. exec_mode=full, executor=codex
2026-06-18T20:14:11.857408685+00:00 Spawn failed (attempt 2/5): Invalid --timeout value. exec_mode=full, executor=codex
2026-06-18T20:14:37.331391486+00:00 Spawn failed (attempt 1/5): Invalid --timeout value. exec_mode=full, executor=codex
2026-06-18T20:15:03.428137481+00:00 Spawned by coordinator --executor codex --model gpt-5.5
2026-06-18T20:15:10.031797776+00:00 Task reset for retry from in-progress (attempt #1) — killed agent agent-2552 (PID 4605) — reason: clear invalid 1d timeout from task metadata and retry spawn
2026-06-18T20:15:29.643721155+00:00 Spawned by coordinator --executor codex --model gpt-5.5
2026-06-18T20:15:40.064977968+00:00 Starting review/implementation pass; checking prior WIP and existing pedigree artifacts
2026-06-18T20:16:48.683090823+00:00 Implementing multimap-aware untangle tract caller with configurable tie/top-N and bridge settings
2026-06-18T20:25:35.771178774+00:00 Validated: ran untangle_multimap_tracts.py on m1000 WashU BEDs; wrote tract TSV, summary TSV, Markdown report, and representative SVG/TSV
2026-06-18T20:26:01.036362309+00:00 Validated: python3 -m py_compile passed for the new tract caller and plot script
2026-06-18T20:26:55.696722660+00:00 Committed: 0b1b7ba — pushed to remote
2026-06-18T20:27:28.510788798+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-06-18T20:34:53.798255990+00:00 PendingEval → Done (evaluator passed; downstream unblocks)