pedigree-untangle-multimap-tracts

Pedigree untangle multimap-aware tract caller

Metadata

Status: done
Assigned: agent-2553
Agent identity: 289ccc9f03fc7c121a5ab8d685ffd018371bcdac67ceab1d50b03e7347d29155
Created: 2026-06-18T20:11:14.727622047+00:00
Started: 2026-06-18T20:15:29.643716235+00:00
Completed: 2026-06-18T20:27:28.510781043+00:00
Tags: pedigree, untangle, recombination, eval-scheduled
Tokens: 2077509 in / 22374 out
Eval score: 0.77
└ blocking impact: 0.90
└ completeness: 0.66
└ constraint fidelity: 0.40
└ coordination overhead: 0.78
└ correctness: 0.77
└ downstream usability: 0.79
└ efficiency: 0.88
└ intent fidelity: 0.84
└ style adherence: 0.82

Description

Objective: improve the WashU pedigree tract-calling analysis so it recovers interpretable recombination-tract candidates from odgi untangle alignments without defaulting to an arbitrary nth.best=1 projection. The key question is whether consecutive m1000 or lower-threshold untangle runs can be merged into biologically meaningful tracts when multimapping and equivalent donors are represented explicitly.

Scientific framing:

  • The pedigree remains a supportive compatibility analysis, not a new headline result. Preserve candidate language.
  • The aim is to measure tract lengths more honestly, especially through repeats and equivalent haplotypes, and to show when untangle is genuinely inconclusive rather than pretending a first-best donor is unique.
  • The wfmash 1 kb segment length is not a hard lower bound on tract length. Treat it as part of graph/seed construction, not as proof that alignments must occur in exact 1 kb increments.

Required inputs and starting points:

  • Existing WashU untangle BEDs under /moosefs/guarracino/HPRCv2/PHR_III/pedigrees/washu/untangle/, especially PAN027_vs_PAN010.e50000.m1000.bed.gz, PAN027_vs_PAN011.e50000.m1000.bed.gz, and PAN028_vs_PAN027.e50000.m1000.bed.gz.
  • Existing patch table: /moosefs/guarracino/HPRCv2/PHR_III/pedigrees/washu/untangle/recombination/patches.tsv.
  • Existing code and reports: scripts/pedigree/patch_tract_lengths.py, scripts/pedigree/run_patch_tract_lower_merge.sh, scripts/pedigree/patch_tract_length_summary.tsv, scripts/pedigree/patch_tract_lower_merge_summary.tsv, paper_prep/_brainstorming/pedigree_patch_tract_lengths.md, and the pedigree Methods in submission/paper.tex.
  • sweepga is available at /home/erikg/.cargo/bin/sweepga. Inspect sweepga --help. Its --num-mappings option may be useful for retaining n:m-best mappings if a PAF or FASTA-derived path is practical. If sweepga does not fit odgi untangle BEDs cleanly, document why and implement an equivalent interval-sweep merger instead.
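
The m1000 BEDs listed above could be loaded along these lines. This is a minimal sketch, not the project's loader. It assumes a tab-separated, ten-column layout (query.name, query.start, query.end, ref.name, ref.start, ref.end, score, inv, self.cov, nth.best), which should be checked against the header of the actual files. `Hit`, `parse_untangle`, `group_by_interval`, and `load_bed_gz` are hypothetical names introduced for illustration.

```python
import gzip
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    query: str
    q_start: int
    q_end: int
    donor: str
    d_start: int
    d_end: int
    score: float
    nth_best: int


def parse_untangle(lines: Iterable[str]) -> List[Hit]:
    """Parse untangle BED text. The column order is an assumption (see above)."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header
        f = line.rstrip("\n").split("\t")
        hits.append(
            Hit(f[0], int(f[1]), int(f[2]), f[3], int(f[4]), int(f[5]),
                float(f[6]), int(f[9]))
        )
    return hits


def group_by_interval(hits: Iterable[Hit]) -> Dict[Tuple[str, int, int], List[Hit]]:
    """Collect every hit (all nth.best ranks) for each child/query interval."""
    groups: Dict[Tuple[str, int, int], List[Hit]] = defaultdict(list)
    for h in hits:
        groups[(h.query, h.q_start, h.q_end)].append(h)
    return dict(groups)


def load_bed_gz(path: str) -> List[Hit]:
    """Read a gzipped untangle BED from disk."""
    with gzip.open(path, "rt") as fh:
        return parse_untangle(fh)
```

Grouping by query interval before any filtering keeps the lower-ranked hits available, so a later step can decide which of them are tied with the best hit instead of discarding them at load time.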

Implementation requirements:

  1. Add a reproducible script, preferably scripts/pedigree/untangle_multimap_tracts.py, that runs from the repo root on moosefs.
  2. Support parameters for top-N mappings, score delta or tie epsilon, minimum segment score, maximum bridge gap, and bridge mode. Do not hard-code nth.best=1 as the only interpretation.
  3. Build equivalence classes per child/query interval: collect all donor/reference hits that are tied or near-tied to the best hit under the chosen threshold. Keep exact donor haplotype, chromosome arm, and any available community annotation.
  4. Merge adjacent/consecutive runs when donor equivalence classes are compatible. At minimum distinguish these resolvability classes: unique donor haplotype, unique donor arm with multiple haplotypes, same-community ambiguous donors, cross-community ambiguous donors, and unresolved/no-call.
  5. Explore merging through repeats/ambiguous segments rather than breaking every tract at a multimapping interval. Bridge only when flanking evidence remains compatible, and record the bridge length and reason. Provide sensitivity across at least two gap/bridge settings.
  6. Compare against the current first-best behavior: the existing high-confidence m1000 patch table and the lower-merge m0/n1 run-level summary. Report how many tracts merge, split, or become ambiguous under the multimap-aware method.
  7. Quantify tract-length distributions and the share of tracts falling in the primate literature ranges already discussed: 22-95 bp, 318-688 bp, and 159-1376 bp. Include counts, proportions, medians, IQRs, and minimum/maximum lengths under each parameter setting.
  8. Produce a small visual artifact that makes the case visible for representative Fig. 5/WashU regions: unique best segments, equivalent alternatives, bridged ambiguous/repeat intervals, and final tract calls. Use PDF/PNG or TSV plus a plotting script in paper_prep/_brainstorming/pedigree_multimap_tracts/.
  9. Produce a concise Markdown report explaining what was tried, whether sweepga was used or rejected, the recommended default parameters, and which claims are supported versus inconclusive.
  10. Only edit the manuscript if the result is robust and useful. If editing, keep it light: one Methods sentence or one cautious Results sentence. Do not promote the pedigree analysis to a headline result, do not add defensive caveats, and keep candidate/compatible-with wording.
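
Requirements 3 to 5 can be sketched as follows. This is an illustrative outline under stated assumptions, not the delivered script: a donor is modelled as a hypothetical (haplotype, arm, community) tuple, two segments count as compatible when their donor sets intersect, and only no-call segments are counted as bridged bases. All names (`Segment`, `Tract`, `equivalence_class`, `resolvability`, `merge_tracts`) and the class labels are invented for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple

# (haplotype, arm, community); the three labels are hypothetical.
Donor = Tuple[str, str, str]


@dataclass
class Segment:
    start: int
    end: int
    donors: FrozenSet[Donor]  # tied or near-tied donors; empty means no-call


@dataclass
class Tract:
    start: int
    end: int
    donors: FrozenSet[Donor]
    bridged_bp: int = 0  # no-call bases absorbed while merging


def equivalence_class(hits: Sequence[Tuple[Donor, float]],
                      tie_eps: float, min_score: float) -> FrozenSet[Donor]:
    """Keep every donor scoring within tie_eps of the best passing hit."""
    kept = [(d, s) for d, s in hits if s >= min_score]
    if not kept:
        return frozenset()
    best = max(s for _, s in kept)
    return frozenset(d for d, s in kept if best - s <= tie_eps)


def resolvability(donors: FrozenSet[Donor]) -> str:
    """Map a donor set to one of the five classes named in requirement 4."""
    if not donors:
        return "unresolved"
    if len(donors) == 1:
        return "unique_haplotype"
    if len({arm for _, arm, _ in donors}) == 1:
        return "unique_arm_multi_haplotype"
    if len({com for _, _, com in donors}) == 1:
        return "same_community_ambiguous"
    return "cross_community_ambiguous"


def merge_tracts(segments: Sequence[Segment],
                 max_gap: int, max_bridge: int) -> List[Tract]:
    """Merge consecutive segments whose donor sets intersect.

    No-call segments are bridged only if the next called segment is
    compatible and the no-call span is at most max_bridge; no-call
    segments that are not bridged are dropped from the output.
    """
    tracts: List[Tract] = []
    pending = 0  # no-call bp seen since the end of the current tract
    for seg in sorted(segments, key=lambda s: s.start):
        if not seg.donors:
            pending += seg.end - seg.start
            continue
        if tracts:
            cur = tracts[-1]
            shared = cur.donors & seg.donors
            uncovered = seg.start - cur.end - pending
            if shared and uncovered <= max_gap and pending <= max_bridge:
                cur.end = seg.end
                cur.donors = shared  # narrow to donors supported throughout
                cur.bridged_bp += pending
                pending = 0
                continue
        tracts.append(Tract(seg.start, seg.end, seg.donors))
        pending = 0
    return tracts
```

Intersecting donor sets lets a tract pass through a multimapping interval as long as the flanking donor remains among the tied candidates, and the tract's final donor set then feeds `resolvability`. Running `merge_tracts` with two different `max_bridge` values gives the kind of sensitivity comparison that requirement 5 asks for.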

Expected outputs:

  • scripts/pedigree/untangle_multimap_tracts.py or a clearly named equivalent.
  • A summary TSV in scripts/pedigree/ with parameter settings and tract length statistics.
  • A tract-level TSV with resolvability class and donor equivalence metadata.
  • paper_prep/_brainstorming/pedigree_multimap_tracts.md.
  • Representative visualization files under paper_prep/_brainstorming/pedigree_multimap_tracts/.
  • If paper.tex is touched, rebuild submission/paper.pdf and confirm grep -c undefined submission/paper.log is 0.
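
One row of the summary TSV could be computed along these lines. This is a hedged sketch: the three ranges come from the task text, but treating both bounds as inclusive, the choice of the "inclusive" quartile method, and the names `LITERATURE_RANGES` and `summarize_lengths` are assumptions made for illustration.

```python
import statistics
from typing import Dict, List, Sequence, Tuple

# Literature ranges in bp from the task; inclusive bounds are an assumption.
LITERATURE_RANGES: List[Tuple[int, int]] = [(22, 95), (318, 688), (159, 1376)]


def summarize_lengths(lengths: Sequence[int]) -> Dict[str, float]:
    """Summarize tract lengths (bp) for one parameter setting."""
    if not lengths:
        return {"n": 0}
    data = sorted(lengths)
    n = len(data)
    if n >= 2:
        q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    else:
        q1 = q3 = float(data[0])  # quartiles are undefined for one value
    row: Dict[str, float] = {
        "n": n,
        "min": data[0],
        "max": data[-1],
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }
    # The ranges overlap, so one tract may be counted in more than one.
    for lo, hi in LITERATURE_RANGES:
        count = sum(lo <= x <= hi for x in data)
        row[f"n_{lo}_{hi}bp"] = count
        row[f"prop_{lo}_{hi}bp"] = count / n
    return row
```

Calling `summarize_lengths` once per parameter setting and writing each returned dict as a TSV row, alongside the settings used, would cover the counts, proportions, medians, IQRs, and extremes requested above.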

Acceptance criteria:

  • The analysis no longer silently treats first-best untangle as uniquely true when multiple equivalent donors exist.
  • Consecutive m1000 and lower-threshold runs are explicitly tested for mergeability.
  • Multimapping is represented as evidence/resolvability, not discarded noise.
  • The report makes clear whether the result strengthens conversion-vs-crossover tract-length interpretation or remains inconclusive.

Depends on

Required by

Log