fig5-whole-genome-joint-parent-sweepga

Whole-genome joint-parent sweepGA for Fig5 pedigree events

Metadata

Status: done
Assigned: agent-2620
Agent identity: 46f6237a65ec4f1002c4d3fb201dc8633638d0947c276be7008c227e1051ba5e
Created: 2026-06-20T16:29:48.460461075+00:00
Started: 2026-06-20T18:48:02.224823145+00:00
Completed: 2026-06-21T00:08:18.742553050+00:00
Tags: pedigree, fig5, sweepga, slurm, correction, whole-genome, chopped-paf, rustybam, pipeline-correction, whole-genome-alignment, no-window-fallback, eval-scheduled, devshm-scratch
Tokens: 65120110 in / 96355 out
Eval score: 0.89
└ blocking impact: 0.88
└ completeness: 0.94
└ constraint fidelity: 0.55
└ coordination overhead: 0.80
└ correctness: 0.93
└ downstream usability: 0.92
└ efficiency: 0.72
└ intent fidelity: 0.82
└ style adherence: 0.88

Description

Input:

  • Handoff: paper_prep/_brainstorming/PEDIGREE_SWEEPGA_HANDOFF_2026-06-20.md.
  • Prior diagnostics for comparison only: paper_prep/_brainstorming/pedigree_direct_sweepga_concordance/, paper_prep/_brainstorming/pedigree_direct_sweepga_joint_parent/, paper_prep/_brainstorming/fig5_synteny_recombination_schematic/.
  • Full WashU pedigree assemblies: /moosefs/pangenomes/washu_pedigree/PAN010.fa.gz, /moosefs/pangenomes/washu_pedigree/PAN011.fa.gz, /moosefs/pangenomes/washu_pedigree/PAN027.fa.gz, /moosefs/pangenomes/washu_pedigree/PAN028.fa.gz, plus their .fai indexes.

Task: Run the corrected direct sweepGA experiment for the WashU Fig5 pedigree events from full whole-genome assembly FASTAs. Whole-genome alignment is mandatory. Do not satisfy this task with 500 kb telomeric-window FASTAs, per-chromosome-only extracts, or arm/window substitutes. Reduced runs may be recorded only as failed/debug controls.

Critical pipeline correction: Run full whole-genome alignments first, then chop/bound the resulting PAF intervals before joint filtering. The similarity/path metric can be wrong when adjacent alignment segments are merged into overly long intervals, so chopping must happen before any filter is applied. Preserve both the raw unchopped whole-genome PAFs and the chopped whole-genome-derived PAFs. Run the primary 1:1, 1:many, 2:many, and 4:many filters on the chopped PAFs.
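The chop/bound step can be illustrated with a minimal deterministic chopper. This is only a sketch of the fallback case, not rustybam and not the chopper the task actually used: the function name `chop_paf_record` and the `max_len` value are hypothetical, and target coordinates are assigned by linear interpolation, which ignores indels inside the alignment. Because a CIGAR would no longer match the pieces, optional tags are dropped and only the 12 mandatory PAF columns are emitted.

```python
def chop_paf_record(line, max_len=10_000):
    """Split one PAF record into pieces of at most max_len query bases.

    Assumption: target coordinates are interpolated linearly from the
    query offsets, so the result is approximate wherever the alignment
    contains indels. Optional tags (including cg:Z:) are dropped.
    """
    f = line.rstrip("\n").split("\t")
    qs, qe, ts, te = int(f[2]), int(f[3]), int(f[7]), int(f[8])
    matches, block = int(f[9]), int(f[10])
    qspan, tspan = qe - qs, te - ts
    pieces = []
    for a in range(0, qspan, max_len):
        b = min(a + max_len, qspan)
        # Proportional offsets into the target span.
        ta, tb = a * tspan // qspan, b * tspan // qspan
        if f[4] == "+":
            cts, cte = ts + ta, ts + tb
        else:
            # On the reverse strand the query start pairs with the target end.
            cts, cte = te - tb, te - ta
        out = f[:12]
        out[2], out[3] = str(qs + a), str(qs + b)
        out[7], out[8] = str(cts), str(cte)
        # Scale match count and block length by the piece's share of the query.
        out[9] = str(matches * (b - a) // qspan)
        out[10] = str(max(1, block * (b - a) // qspan))
        pieces.append("\t".join(out))
    return pieces
```

The pieces tile the original query interval without gaps or overlaps, so rerunning the chopper on the same raw PAF with the same `max_len` yields identical output, which is what makes the step recordable in a chop manifest.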

Required full whole-genome comparisons:

  • PAN027 paternal hap2 query vs PAN011 both parental haplotypes as one combined target.
  • PAN027 maternal hap1 query vs PAN010 both parental haplotypes as one combined target.
  • PAN028 maternal hap1 query vs PAN027 both parental haplotypes as one combined target.
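As a rough illustration, the three comparisons above can be written as data and resolved against .fai sequence names. This is a sketch under stated assumptions: it assumes names follow a `<SAMPLE>#<hap>#<chr>` pattern, that the haplotype field is the literal `1` or `2`, and that hap1 is maternal and hap2 paternal as the list implies. The actual labels have to be read from the .fai indexes; `COMPARISONS`, `fai_names`, and `resolve` are hypothetical names.

```python
# Hypothetical encoding of the three required comparisons.
# Assumption: haplotype labels are "1" and "2"; check the real .fai files.
COMPARISONS = [
    {"query": ("PAN027", "2"), "target": "PAN011"},  # paternal hap2 vs both PAN011 haps
    {"query": ("PAN027", "1"), "target": "PAN010"},  # maternal hap1 vs both PAN010 haps
    {"query": ("PAN028", "1"), "target": "PAN027"},  # maternal hap1 vs both PAN027 haps
]


def fai_names(fai_text):
    """Return sequence names (first column) from the text of a .fai index."""
    return [ln.split("\t")[0] for ln in fai_text.splitlines() if ln.strip()]


def resolve(comparison, query_fai_text, target_fai_text):
    """Pick query names (one haplotype) and target names (all haplotypes)."""
    sample, hap = comparison["query"]
    query = [n for n in fai_names(query_fai_text)
             if n.split("#")[:2] == [sample, hap]]
    target = [n for n in fai_names(target_fai_text)
              if n.split("#")[0] == comparison["target"]]
    return query, target
```

Keeping every target haplotype in one list is what makes the downstream filtering joint: both parental haplotypes compete for the same query interval in a single PAF.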

Implementation requirements:

  • Inspect .fai indexes to determine exact sequence-name patterns; record naming decisions in the README and input manifest.
  • Build one whole-genome query FASTA for each transmitting child haplotype and one combined whole-genome target FASTA containing both parental haplotypes.
  • Submit extraction, whole-genome alignment, PAF chopping, and filtering through Slurm only.
  • Give sweepGA/FastGA /dev/shm-backed scratch explicitly for temporary graph/database files. Use $SLURM_TMPDIR or /tmp only for input staging, manifests, chopping, and non-sweepGA temporary files. Add cleanup traps for /dev/shm job scratch.
  • Preserve raw whole-genome many:many PAFs as first-class artifacts.
  • Produce chopped PAFs as first-class artifacts, with summaries/chop_manifest.tsv recording tool/command, parameters, input raw PAF, output chopped PAF, and rationale. Prefer rustybam if available; otherwise use a deterministic PAF chopper.
  • Apply filters jointly across each combined parental target on chopped PAFs: 1:1, 1:many, 2:many, 4:many, plus chopped raw/many:many.
  • Treat strict 1:1 as diagnostic only; chopped raw many:many and chopped 4:many are the likely evidence layers.
  • Do not generate a final Fig5 schematic in this task. Do not overwrite existing Fig5 schematic directories.
  • Commit and push scripts, manifests, logs/summaries, and outputs with WG provenance.
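The scratch requirement above can be sketched as a small context manager that creates a per-job directory and removes it on exit. This is a hedged illustration, not the task's actual scripts: `job_scratch` is a hypothetical helper, and how the resulting path is handed to sweepGA/FastGA is deliberately not shown because that interface is not specified here.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def job_scratch(base="/dev/shm", job_id=None):
    """Create a per-job scratch directory under base and remove it on exit.

    Assumption: SLURM_JOB_ID is set inside the job; a placeholder is used
    otherwise so the sketch also runs outside Slurm.
    """
    job_id = job_id or os.environ.get("SLURM_JOB_ID", "nojob")
    path = tempfile.mkdtemp(prefix=f"sweepga_{job_id}_", dir=base)
    try:
        yield path
    finally:
        # Runs on normal exit and on Python exceptions.
        shutil.rmtree(path, ignore_errors=True)
```

Cleanup here covers normal completion and Python exceptions only. A job terminated by a signal would still need a signal handler or a shell-level trap in the batch script, which is why the requirement asks for traps explicitly.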

Output:

  • paper_prep/_brainstorming/pedigree_whole_genome_sweepga_joint_parent/README.md.
  • Rerunnable preparation/submission/alignment/chopping/filtering scripts under paper_prep/_brainstorming/pedigree_whole_genome_sweepga_joint_parent/.
  • summaries/input_manifest.tsv, summaries/slurm_jobs.tsv, summaries/chop_manifest.tsv, summaries/filter_manifest.tsv, and per-filter summaries.
  • raw_paf/*.paf.gz, chopped_paf/*.paf.gz, and filtered_paf/*.paf.gz for all three required comparisons.

Acceptance:

  • The input manifest proves full whole-genome assemblies were used for all three required comparisons.
  • Slurm logs/manifests prove sweepGA/FastGA used /dev/shm scratch.
  • Raw whole-genome many:many PAFs, chopped PAFs, and chopped-input joint filtered PAFs exist for all three comparisons and pass gzip integrity checks.
  • Existing Fig5 schematic directories are unchanged.
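The gzip integrity criterion can be checked by decompressing each artifact to the end of the stream. The sketch below uses only the Python standard library; `gzip_ok` is a hypothetical helper, and the task's own check may well have used a command-line tool instead.

```python
import gzip
import zlib


def gzip_ok(path, chunk_size=1 << 20):
    """Return True if the file decompresses cleanly to end of stream."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard; errors surface as the stream is consumed.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        return False
    return True
```

A check like this would be run over `raw_paf/*.paf.gz`, `chopped_paf/*.paf.gz`, and `filtered_paf/*.paf.gz` for all three comparisons. A zero-byte file decompresses to nothing without raising, so a separate non-empty check is still needed.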

Depends on

Required by

Messages 2 messages (2 unread)

  1. #1 codex-coordinator 2026-06-20T18:52:58.516393773+00:00 read
    The retry got past PAN011, but Slurm prep job 1704285 now fails on canonical `/moosefs/pangenomes/washu_pedigree/PAN027.fa.gz` with a BGZF/gzip Input/output error. This means the canonical pangenomes FASTAs are not reliable enough for the whole-genome run, not that the analysis should fall back to windows.
    
    Please switch strategy: stage/rebuild validated full whole-genome bgzip+faidx copies for ALL required WashU samples (PAN010, PAN011, PAN027, PAN028) into a writable recovery/staging area, using the same public WashU v1.1 source pattern used by `prereq-restore-readable` if needed:
    `https://public.gi.ucsc.edu/~mcechova/pedigree/assemblies/v1.1/assembly.v1.1.<SAMPLE>.diploid.fa`
    Convert headers to `<SAMPLE>#<hap>#<chr>` consistently, bgzip, faidx, and validate on a Slurm compute node. Then update `config/comparisons.tsv` to use those recovered full-genome paths for all four samples and rerun prep/alignment. Keep the no-window constraint, use `/dev/shm` for sweepGA/FastGA scratch, and keep chopped PAF before filtering.
  2. #2 fig5-whole-genome-joint-parent-sweepga 2026-06-20T18:55:14.452081154+00:00 read
    Acknowledged — PAN027 failed the same way as PAN011 on Slurm. I started recovery work; I will switch to rebuilt/staged full-genome bgzip+faidx copies for all four samples from the public WashU v1.1 URLs, use consistent <SAMPLE>#<hap>#<chr> headers, validate on Slurm, update comparisons.tsv to recovered paths for all samples, and rerun the Slurm prep/alignment/chop/filter pipeline with /dev/shm sweepGA scratch.

Log