pedigree-patch-tract

Pedigree patch tract-length analysis

Metadata

Status: done ‖ paused
Agent identity: 46f6237a65ec4f1002c4d3fb201dc8633638d0947c276be7008c227e1051ba5e
Created: 2026-06-18T17:07:03.675494464+00:00
Started: 2026-06-18T17:08:30.733222488+00:00
Completed: 2026-06-18T18:04:38.853198752+00:00
Tags: manuscript, pedigree, analysis, tract-length, eval-scheduled
Eval score: 0.84
└ blocking impact: 0.95
└ completeness: 0.82
└ constraint fidelity: 0.55
└ coordination overhead: 0.91
└ correctness: 0.80
└ downstream usability: 0.93
└ efficiency: 0.74
└ intent fidelity: 0.74
└ style adherence: 0.88

Description

Run the missing analysis behind the conversion-vs-crossover claim. The manuscript currently hedges the wrong thing: recurrent ectopic recombination is the established mechanism maintaining high-identity non-allelic subtelomeric copies; the open question is the direction of the sequence-proximity feedback and the conversion-vs-crossover split.

Analysis objective:

  1. Identify the WashU three-generation T2T pedigree patch table(s) used in submission/paper.tex and scripts/methods. Use the actual assembly-derived candidate patch lengths, not a proxy if the direct field exists.
  2. Summarize candidate inter-chromosomal patch length distributions overall and by available patch class/pattern/community status/transmission, preserving the manuscript's existing candidate language.
  3. Compare observed patch lengths to the primate tract-length ranges already cited or available in the repo: short conversion tracts (~22-95 bp) and longer crossover-associated tracts (~318-688 bp). Do not invent event-level validation; report compatibility/proportions relative to those ranges.
  4. Produce a small reproducible script and TSV/MD result artifact under scripts/pedigree/ or paper_prep/_brainstorming/ with clear input paths and exact counts.
  5. If the result is interpretable, make a light-touch manuscript update: state plainly that recurrent ectopic exchange/recombination maintains shared sequence; reserve “open” language for sequence-vs-proximity direction and conversion-vs-crossover split; add one concise sentence reporting the tract-length comparison if supported. Do not inflate pedigree to a headline result and do not add new defensive caveats.
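
The range comparison in objectives 2 and 3 can be sketched as below. This is a minimal illustration, not the committed script: the function name `summarize_lengths` and the `RANGES` dictionary are hypothetical, the range bounds are the approximate values quoted above treated as inclusive, and the median stands in for the fuller quantile report.

```python
from statistics import median

# Tract-length ranges quoted in the task description (bp).
# Treating both bounds as inclusive is an assumption of this sketch.
RANGES = {
    "conversion_like": (22, 95),
    "crossover_associated": (318, 688),
}


def summarize_lengths(lengths, ranges=RANGES):
    """Return N, min, median, max, and per-range counts and fractions."""
    values = sorted(int(v) for v in lengths)
    n = len(values)
    if n == 0:
        raise ValueError("no patch lengths supplied")
    summary = {
        "n": n,
        "min": values[0],
        "median": median(values),
        "max": values[-1],
    }
    for name, (lo, hi) in ranges.items():
        # Count lengths falling inside the inclusive range [lo, hi].
        count = sum(lo <= v <= hi for v in values)
        summary[name + "_count"] = count
        summary[name + "_fraction"] = count / n
    return summary
```

Reporting the minimum alongside the counts matters here: a table whose minimum length sits above a range cannot show overlap with it, so a zero count in that case reflects input resolution rather than biology.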

Validation

  • Input pedigree patch table paths are identified and recorded.
  • Script runs from repo root without manual steps in the moosefs environment.
  • Output reports N, length field used, quantiles, and proportions in/near conversion-like and crossover-associated tract ranges.
  • Manuscript wording no longer implies that recombination-driven maintenance of shared sequence is open.
  • Open/hedged wording is limited to sequence-proximity directionality and conversion-vs-crossover mechanism split.
  • submission builds with `guix shell texlive -- make`, and the final paper.log has zero “undefined” hits.

Depends on

Required by

Messages 8 messages (replied)

  1. #1 user 2026-06-18T17:09:40.212649494+00:00 read
    Important update from main thread after inspecting source tables: the current WashU all_pedigrees_patches.tsv and washu/untangle/recombination/patches.tsv are left-truncated around 1000 bp (filenames show e50000.m1000; observed min patch_size 1001). That means simply reporting zero 22-95 bp or 318-688 bp patches from the existing HQ table is not sufficient. Please explicitly determine whether a lower-threshold/raw untangle or segment-level analysis can be run from the available graph/targets to measure sub-kb tracts. If feasible within scope, run/reuse odgi untangle or parse lower-level output with parameters that can resolve 22-95 bp and 318-688 bp candidate intervals, then compare. If not feasible or not valid because graph/untangle resolution imposes a threshold, report the resolution limit plainly and do not claim biological absence of conversion-length tracts. The manuscript update should not cite a zero-overlap result unless the input resolution can actually observe those ranges.
  2. #2 pedigree-patch-tract 2026-06-18T17:12:02.719900725+00:00 read
    Acknowledged — I will check for lower-threshold/raw untangle or segment-level outputs and whether odgi untangle can be rerun from available graph inputs. I will revise the manuscript/artifact so the 1 kb resolution limit is explicit and no biological absence of 22-95 bp or 318-688 bp tracts is claimed unless those ranges are observable.
  3. #3 user 2026-06-18T17:20:05.292521571+00:00 read
    Additional user note for interpreting the rerun/resolution audit: the wfmash 1 kb segment length is not a definitive lower bound on event or alignment tract length. It is a seed/graph-construction scale, not evidence that alignments or untangle intervals can only occur in exact 1 kb increments. Alignments have many lengths, so applying the same documented untangling process at lower merge thresholds is feasible and interpretable as an analysis of what the graph/untangle process resolves. Make sure any scripts used for this analysis are documented and integrated in the repo so they can be referenced in Methods and the paper.
  4. #4 pedigree-patch-tract 2026-06-18T17:20:33.143763516+00:00 read
    Acknowledged — I will not treat wfmash -s 1000 as a definitive lower bound. I will add a documented lower-merge untangle/segment summarization path to the repo and run a feasible lower-threshold analysis from the existing graph/targets, clearly separating the m1000 manuscript patch table from lower-merge graph/untangle-resolved interchromosomal intervals.
  5. #5 user 2026-06-18T17:34:17.375159944+00:00 read
    Important analysis refinement: do not report raw `-m 0` BED rows as tract-length candidates. They are graph mapping fragments and are too granular. Summarize lower-merge results after grouping consecutive best-hit (`nth.best == 1`) query intervals with the same donor chr/arm/haplotype into run-level intervals, then filter interchromosomal runs and report score >= 0.8. A sanity check on the completed `-m 0 -n 1` BEDs gives run-grouped interchromosomal intervals with score >= 0.8: N=37,401 overall, median 17 bp, 11,600 in 22-95 bp, 927 in 318-688 bp, and 3,445 in 159-1376 bp. Keep this separate from the current m1000 high-quality patch table (N=538, median 1,517 bp). The manuscript should not say the graph/untangle table cannot observe those ranges; it should say lower-merge run-level intervals include both ranges, so tract lengths are compatible with both short conversion-like and longer CO-associated processes, while event-level classification remains unresolved.
  6. #6 pedigree-patch-tract 2026-06-18T17:37:23.434283385+00:00 read
    Acknowledged — I grouped consecutive nth.best==1 intervals by donor chr/arm/haplotype and kept the lower-merge run-level analysis separate from the m1000 table. My current integrated parser reports N=37,400 with the same range counts (11,600; 927; 3,445), so I will reconcile the one-run discrepancy against the completed BEDs and update the TSV/manuscript if the sanity-check N=37,401 reflects the intended grouping.
  7. #7 user 2026-06-18T17:37:53.245367218+00:00 read
    No need to regenerate BEDs just to reconcile the one-run discrepancy. The difference is the high-quality upper-size filter: the one-off sanity check counted score >= 0.8 run-level intervals without the <=100 kb cap (N=37,401, max 143,431 bp); the integrated summary appears to apply <=100 kb and therefore reports N=37,400, max 55,098 bp. Either is defensible if labeled, but avoid spending time rerunning only for that. Prefer recording both denominators or state explicitly that the committed lower-merge summary uses score >= 0.8 and size <= 100 kb for consistency with the m1000 table; the range counts are unchanged for 22-95, 318-688, and 159-1376.
  8. #8 user 2026-06-18T17:48:50.454586239+00:00 read
    Additional clarification from Erik: wfmash 1 kb segment length is not a hard lower bound on alignment or tract lengths; it is a seed/graph-construction scale. The lower-merge odgi untangle/run-grouping process is feasible and interpretable when documented. Do not frame sub-kb or short-range intervals as impossible from first principles; keep the scripts/results integrated so Methods can cite exactly what was run.
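
The run-grouping step described in messages #5 and #7 can be sketched as follows. This is an illustration under stated assumptions, not the integrated parser. The `Interval` record and its field names are hypothetical stand-ins for the untangle BED columns. "Consecutive" is taken to mean adjacent in start order on the same query path. The run score is assumed to be a length-weighted mean of member scores, and the run size is assumed to be the span from first start to last end. The thread specifies neither of those last two choices.

```python
from itertools import groupby
from typing import NamedTuple


class Interval(NamedTuple):
    # Hypothetical record layout; real untangle BED columns may differ.
    query: str       # query path name
    query_chr: str   # chromosome of the query path
    start: int       # half-open interval start on the query
    end: int         # half-open interval end on the query
    donor_chr: str
    donor_arm: str
    donor_hap: str
    score: float
    nth_best: int


def group_runs(intervals, min_score=0.8, max_size=100_000):
    """Merge consecutive best-hit intervals sharing a donor into runs,
    then keep interchromosomal runs passing the score and size filters."""
    best = sorted(
        (iv for iv in intervals if iv.nth_best == 1),
        key=lambda iv: (iv.query, iv.start),
    )

    def donor_key(iv):
        return (iv.query, iv.query_chr, iv.donor_chr, iv.donor_arm, iv.donor_hap)

    runs = []
    # groupby only merges neighbours, so a donor change starts a new run.
    for key, members in groupby(best, key=donor_key):
        members = list(members)
        covered = sum(m.end - m.start for m in members)
        if covered <= 0:
            continue
        start = members[0].start
        end = max(m.end for m in members)
        # Assumption: run score is the length-weighted mean of member scores.
        score = sum(m.score * (m.end - m.start) for m in members) / covered
        _, query_chr, donor_chr, _, _ = key
        size = end - start
        if donor_chr != query_chr and score >= min_score and size <= max_size:
            runs.append({"query": key[0], "start": start, "end": end,
                         "size": size, "donor_chr": donor_chr, "score": score})
    return runs
```

Passing `max_size=float("inf")` drops the upper-size cap, which makes it possible to record both denominators discussed in message #7 from the same input.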

Log