fig5-whole-genome-sweepga-updated-bin

Rerun Fig5 whole-genome sweepGA with updated Cargo binary

Metadata

Status: done
Assigned: agent-2639
Agent identity: 289ccc9f03fc7c121a5ab8d685ffd018371bcdac67ceab1d50b03e7347d29155
Created: 2026-06-21T09:56:51.160876022+00:00
Started: 2026-06-21T10:00:13.486969489+00:00
Completed: 2026-06-21T12:58:14.027118780+00:00
Tags: pedigree, fig5, sweepga, updated-binary, whole-genome-alignment, raw-paf, chopped-paf, chr3-homology, devshm-scratch, eval-scheduled
Eval score: 0.83
└ blocking impact: 0.66
└ completeness: 0.92
└ constraint fidelity: 0.85
└ coordination overhead: 0.56
└ correctness: 0.93
└ downstream usability: 0.89
└ efficiency: 0.58
└ style adherence: 0.88

Description

Input context:

  • Prior corrected whole-genome sweepGA package: paper_prep/_brainstorming/pedigree_whole_genome_sweepga_joint_parent/.
  • Prior sweepGA evidence review: paper_prep/_brainstorming/fig5_whole_genome_sweepga_evidence_review/REPORT.md.
  • Prior QA closeout: paper_prep/_brainstorming/fig5_whole_genome_sweepga_closeout/QA_REPORT.md.
  • Recovered full assemblies: /moosefs/erikg/phrs/recovery/fig5-whole-genome-joint-parent-sweepga/.

User update: Use the newly updated sweepGA binary at the same Cargo path:

  • Required binary: /home/erikg/.cargo/bin/sweepga
  • Expected realpath at task creation: /export/local/home/erikg/.cargo/bin/sweepga
  • Expected version at task creation: sweepga 0.1.1
  • Expected sha256 at task creation: a0d7ac0c3312080d67de96d85cdcad9ce0c5a7e523897109b7f598c186ab85a6

The previous final sweepGA run also used /home/erikg/.cargo/bin/sweepga and reported sweepga 0.1.1, but the old logs did not capture a sha256; this is a fresh rerun with byte-level provenance.

Task: Rerun the corrected Fig5 whole-genome sweepGA/FastGA workflow using the updated /home/erikg/.cargo/bin/sweepga binary. This must repeat the whole-genome joint-parent workflow, not the older telomeric/window-only tests.

Comparisons:

  • PAN027pat_vs_PAN011_joint
  • PAN027mat_vs_PAN010_joint
  • PAN028mat_vs_PAN027_joint

Use the same full whole-genome haplotype/joint-parent definitions as pedigree_whole_genome_sweepga_joint_parent/config/comparisons.tsv.

Required workflow:

  • Run through Slurm, parallelizable across comparisons where safe.
  • Explicitly set/use SWEEPGA=/home/erikg/.cargo/bin/sweepga.
  • Record which, explicit path, realpath, --version, sha256sum, and relevant --help/command provenance for the binary actually used.
  • Raw alignment stage must be full whole-genome: sweepga --fastga --num-mappings many:many --scaffold-jump 0 --temp-dir /dev/shm/... --output-file ... QUERY.fa TARGET.fa.
  • SweepGA/FastGA scratch must be under /dev/shm; do not use $SLURM_TMPDIR for sweepGA/FastGA scratch.
  • Keep raw many:many PAFs first-class.
  • Chop raw PAFs before filtering, using the same chunking convention as the corrected package (PAF_CHOP_LENGTH=500000, overlap 0), unless there is a documented reason to match the previous config in some other way.
  • Run the same chopped filter matrix as the corrected package, including many:many, 1:1, 1:many, 2:many, and 4:many with scaffold jump 0 where applicable. many:many chopped can be a copy but must be recorded.
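
The chopping step above, together with the split rule spelled out in the messages further down (query-axis fragments no longer than the chop length, overlap 0, linearly interpolated target coordinates and match/block counts, core 12 columns plus short provenance tags), can be sketched as follows. This is a minimal illustration, not the pafchop-rs implementation. The reverse-strand handling, the floor rounding, and the tag names `ci`, `oq`, and `cl` are assumptions made for the example.

```rust
/// The 12 mandatory PAF columns; inherited optional tags are dropped.
#[derive(Debug, Clone, PartialEq)]
pub struct PafCore {
    pub qname: String,
    pub qlen: u64,
    pub qstart: u64,
    pub qend: u64,
    pub strand: char,
    pub tname: String,
    pub tlen: u64,
    pub tstart: u64,
    pub tend: u64,
    pub matches: u64,
    pub block_len: u64,
    pub mapq: u8,
}

/// Floor of total * offset / span, computed in u128 to avoid overflow.
fn interp(total: u64, offset: u64, span: u64) -> u64 {
    ((total as u128 * offset as u128) / span as u128) as u64
}

/// Split one record into query-axis fragments of at most `chop_len` bp, overlap 0.
/// Cumulative interpolation makes fragments tile the target span exactly and
/// makes per-fragment matches and block lengths sum to the original values.
pub fn chop(rec: &PafCore, chop_len: u64) -> Vec<PafCore> {
    let qspan = rec.qend.saturating_sub(rec.qstart);
    if chop_len == 0 || qspan <= chop_len {
        return vec![rec.clone()];
    }
    let tspan = rec.tend.saturating_sub(rec.tstart);
    let mut out = Vec::new();
    let mut off = 0u64;
    while off < qspan {
        let next = (off + chop_len).min(qspan);
        let t0 = interp(tspan, off, qspan);
        let t1 = interp(tspan, next, qspan);
        // Assumption: on '-' alignments the query start pairs with the target end.
        let (tstart, tend) = if rec.strand == '-' {
            (rec.tend - t1, rec.tend - t0)
        } else {
            (rec.tstart + t0, rec.tstart + t1)
        };
        out.push(PafCore {
            qstart: rec.qstart + off,
            qend: rec.qstart + next,
            tstart,
            tend,
            matches: interp(rec.matches, next, qspan) - interp(rec.matches, off, qspan),
            block_len: interp(rec.block_len, next, qspan) - interp(rec.block_len, off, qspan),
            ..rec.clone()
        });
        off = next;
    }
    out
}

/// Emit core columns plus hypothetical provenance tags: chunk index (ci),
/// original query interval (oq), and chop length (cl).
pub fn to_paf_line(frag: &PafCore, orig: &PafCore, index: usize, chop_len: u64) -> String {
    format!(
        "{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\t{}\tci:i:{}\toq:Z:{}-{}\tcl:i:{}",
        frag.qname, frag.qlen, frag.qstart, frag.qend, frag.strand,
        frag.tname, frag.tlen, frag.tstart, frag.tend,
        frag.matches, frag.block_len, frag.mapq,
        index, orig.qstart, orig.qend, chop_len
    )
}
```

Under these assumptions, a 25,000 bp forward alignment chopped at 10,000 bp yields fragments of 10,000, 10,000, and 5,000 query bp. The chop length is a parameter, so the same sketch covers any value chosen for the run.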

Output package: Create paper_prep/_brainstorming/pedigree_whole_genome_sweepga_updated_bin/ with README, config, scripts, logs, summaries, and ignored raw/chopped/filtered PAF paths/checksums. Required summaries:

  • summaries/sweepga_binary.tsv
  • summaries/slurm_jobs.tsv
  • summaries/chop_manifest.tsv
  • summaries/filter_manifest.tsv
  • summaries/paf_file_summary.tsv
  • summaries/candidate_window_support.tsv covering raw/chopped/filtered support for the Fig5 PAN027/PAN028 chr9 candidate windows.

Acceptance:

  • All three full whole-genome raw many:many -j 0 sweepGA/FastGA jobs complete with updated binary provenance, or failures are diagnosed with logs and exact next commands.
  • Chopped/filter pipeline completes or any failure is clearly isolated; raw evidence is still reported separately.
  • The package states whether updated sweepGA emits chr3-target rows for the PAN027 and PAN028 candidate windows in raw PAFs and after chopped filtering.
  • Exact command logs prove /home/erikg/.cargo/bin/sweepga was used and /dev/shm scratch was used.
  • No submission/ files are modified and no Fig5 schematic is created.
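
The binary-path and /dev/shm scratch requirements can be enforced before a job is submitted by building the raw-alignment argument list in one place. The sketch below uses only the flags quoted in the required workflow; the function name `raw_alignment_args` is hypothetical, and the check is purely lexical (it does not resolve symlinks or `..` components).

```rust
use std::path::Path;

/// Binary path required by the task description.
pub const SWEEPGA: &str = "/home/erikg/.cargo/bin/sweepga";

/// Build the argument vector for the raw many:many whole-genome run,
/// refusing any scratch directory that is not under /dev/shm.
pub fn raw_alignment_args(
    temp_dir: &str,
    output: &str,
    query: &str,
    target: &str,
) -> Result<Vec<String>, String> {
    // Component-wise prefix test: "/dev/shm/x" passes, "/dev/shmoo/x" does not.
    if !Path::new(temp_dir).starts_with("/dev/shm") {
        return Err(format!("scratch must be under /dev/shm, got {}", temp_dir));
    }
    let args = [
        SWEEPGA,
        "--fastga",
        "--num-mappings",
        "many:many",
        "--scaffold-jump",
        "0",
        "--temp-dir",
        temp_dir,
        "--output-file",
        output,
        query,
        target,
    ];
    Ok(args.iter().map(|s| s.to_string()).collect())
}
```

The returned vector can be written verbatim to the command log and then passed to `std::process::Command`, so the logged command and the executed command cannot drift apart.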

Depends on

Required by

Messages 10 messages (10 unread)

  1. #1 codex 2026-06-21T12:34:14.593117244+00:00 read
    URGENT correction from chat/user: stop the Python PAF chopping path. I cancelled Slurm job 1704331 (`wgsg_chop`) because it was running `scripts/chop_paf.py` for ~2h. User explicitly expected rustybam or another compiled/parallel chopper, not Python.
    
    Do not use `scripts/chop_paf.py` outputs as final evidence. Delete or quarantine partial `chopped_paf/*.paf.gz` generated by the Python job and regenerate all chopped PAFs with a non-Python implementation.
    
    I found RustyBAM source at `/moosefs/erikg/src/rustybam`, but no `rustybam` binary is currently on PATH and its existing CLI appears to have PAF utilities (`break-paf`, `liftover`, etc.) rather than the exact required operation. Use RustyBAM only if you can prove an exact equivalent to the previous query-axis split rule. Otherwise, compile a small Rust chopper inside this package that exactly reproduces the intended semantics:
    - split each raw PAF row into <= `PAF_CHOP_LENGTH=500000` query-axis fragments with overlap 0
    - linearly interpolate target coordinates and match/alignment counts as before
    - preserve PAF optional tags and append chop provenance tags
    - write manifests with raw/chopped record counts and bp counts
    
    Run this in parallel through Slurm, preferably one job per comparison or an array, with adequate CPUs and multi-threaded compression (`pigz -p` or bgzip threads). The current slow path was single-process Python + gzip and is not acceptable.
    
    After regenerating chopped PAFs, run the same chopped filter matrix and candidate-window support summaries. README/manifests must explicitly document the compiled/parallel chopper, RustyBAM availability/source check, Slurm job IDs, and that Python chopping was cancelled and not used for final evidence.
  2. #2 fig5-whole-genome-sweepga-updated-bin 2026-06-21T12:35:13.307938171+00:00 read
    Acknowledged — I will quarantine/delete the Python-chopped PAFs, add a compiled Rust chopper reproducing the query-axis split semantics, rerun chopping per comparison through Slurm with threaded compression, then run the filter matrix and candidate summaries from regenerated outputs only.
  3. #3 codex 2026-06-21T12:35:34.542839204+00:00 read
    SECOND URGENT correction from chat/user: 500 kb chunks are rejected as far too large. Final chopped/filter evidence must use `PAF_CHOP_LENGTH=10000` or smaller. Use 10 kb as the primary whole-genome chopped rerun unless there is a hard technical impossibility, in which case stop and report that explicitly rather than substituting 500 kb/50 kb.
    
    Implementation requirements:
    - Do not use Python for chopping.
    - Do not use the old 500 kb partial outputs.
    - Use a compiled/parallel chopper. Rust is available (`cargo 1.94.1`, `rustc 1.94.1`). RustyBAM source exists at `/moosefs/erikg/src/rustybam`; use it only if it can do exact <=10 kb query-axis splitting. Otherwise compile a focused Rust tool in this package.
    - To avoid exploding file size, do NOT blindly duplicate huge inherited optional tags (e.g. long CIGAR/cs/cg tags) into every 10 kb fragment unless sweepGA demonstrably requires them. For sweepGA filtering, emit valid PAF core 12 columns plus concise chop provenance tags (`cg:Z:chopped` or equivalent short provenance, chunk index, original q interval, chop length). Document this choice.
    - Split raw PAF rows into <=10,000 bp query-axis fragments with overlap 0. Interpolate target coordinates and matches/block length consistently with the prior rule.
    - Run per-comparison Slurm jobs or an array, with multithreaded compression (`pigz`/`bgzip -@`). Capture job IDs, binary sha/version, command logs, and counts.
    - Then run the chopped filter matrix on the 10 kb chopped PAFs. If the full matrix is too expensive, run/report raw 10 kb chopped many:many and `4:many` first, but do not silently fall back to larger chunks.
    - The README and final review must state plainly: previous Python 500 kb chop was cancelled and not used; final evidence uses 10 kb-or-smaller compiled chopping.
  4. #4 codex 2026-06-21T12:39:41.301614362+00:00 read
    Use the new Rust PAF chopper created in the main repo:
    
    - Crate: `paper_prep/_brainstorming/pafchop-rs/`
    - Binary after build: `paper_prep/_brainstorming/pafchop-rs/target/release/pafchop`
    - Current release sha256 from build: `2cf2acbd183e61e07e5fdee1ff9093bf354f790c2f68c6df6b9e5e6b33ae09a9`
    - Tests passed: `cargo test --manifest-path paper_prep/_brainstorming/pafchop-rs/Cargo.toml` (4/4)
    - Build passed: `cargo build --release --manifest-path paper_prep/_brainstorming/pafchop-rs/Cargo.toml`
    - Smoke test passed with configurable `--length 5`; default/primary for this task must be `--length 10000`.
    
    Wrapper scripts:
    - `paper_prep/_brainstorming/pafchop-rs/scripts/chop_one.sh RAW.paf.gz OUT.paf.gz SUMMARY.tsv COMPARISON_ID [CHOP_LENGTH=10000] [THREADS=8]`
    - `paper_prep/_brainstorming/pafchop-rs/scripts/chop_package_parallel.sh PACKAGE_DIR [CHOP_LENGTH=10000]`
    
    Use `PAFCHOP_JOBS=3 PAFCHOP_THREADS_PER_JOB=<cpus>` to run the three comparison chops in parallel. The executable streams core PAF and drops inherited huge optional tags by default, appending concise chop provenance tags. This is intentional for 10 kb chunks so the output does not explode by duplicating long tags.
    
    Please rebuild/test in your worktree context if needed, run 10 kb primary (`10000`, overlap 0), regenerate all chopped outputs and manifests under a 10 kb-specific directory/name, then rerun the filter matrix. Do not use Python chopping or 500 kb outputs as final evidence.
  5. #5 codex 2026-06-21T12:40:33.611696472+00:00 read
    STOP: I cancelled compiled chop jobs 1704332-1704334 because they still used PAF_CHOP_LENGTH=500000. User explicitly rejected 500 kb. Do not resubmit with default 500000. Set PAF_CHOP_LENGTH=10000 explicitly in submit/run/filter/validate paths, use 10kb-specific output names/manifests, and verify summaries say 10000 before submitting. Prefer the main repo tool paper_prep/_brainstorming/pafchop-rs with --length 10000; if using your local tool, it must be equivalent and invoked with 10000.
  6. #6 fig5-whole-genome-sweepga-updated-bin 2026-06-21T12:41:07.611115099+00:00 read
    Acknowledged message #4 — I will use paper_prep/_brainstorming/pafchop-rs, rebuild/test it in this worktree, run the 10 kb primary chop with PAFCHOP_JOBS=3 and threaded compression, and regenerate manifests/filters from 10 kb outputs only.
  7. #7 fig5-whole-genome-sweepga-updated-bin 2026-06-21T12:41:07.810837105+00:00 read
    Acknowledged message #5 — I will not resubmit 500 kb chopping. I am setting PAF_CHOP_LENGTH=10000 explicitly for chop, filter, validation, filenames, and README/summaries before submitting.
  8. #8 codex 2026-06-21T12:41:35.491955142+00:00 read
    I directly submitted the valid Rust 10 kb PAF chop jobs against your package to avoid any further 500 kb drift. These are now the authoritative chop jobs to use:
    
    - 1704335 PAN027pat_vs_PAN011_joint
    - 1704336 PAN027mat_vs_PAN010_joint
    - 1704337 PAN028mat_vs_PAN027_joint
    
    Manifest: `.wg-worktrees/agent-2639/paper_prep/_brainstorming/pedigree_whole_genome_sweepga_updated_bin/summaries/pafchop10k_slurm_jobs.tsv`
    
    Outputs are under:
    `chopped_paf_l10000_o0/*.chopped_l10000_o0.paf.gz`
    
    Summaries are under:
    `summaries/pafchop_l10000_o0/*.summary.tsv`
    
    Binary:
    `/moosefs/erikg/phrs/paper_prep/_brainstorming/pafchop-rs/target/release/pafchop`
    
    Do not submit or use any 500 kb chop jobs. Continue from these 10 kb outputs when they finish; adapt filter paths to `chopped_paf_l10000_o0` and filter labels/manifests to `l10000_o0`.
  9. #9 codex 2026-06-21T12:43:53.284046196+00:00 read
    The 10 kb Rust chops are complete and valid. The first-attempt jobs 1704335-1704337 failed immediately due to `/bin/sh` pipefail; ignore them. The valid jobs 1704338-1704340 completed successfully.
    
    Valid outputs:
    - `chopped_paf_l10000_o0/*.chopped_l10000_o0.paf.gz`
    - compatibility symlinks also exist in `chopped_paf/*.chopped_l10000_o0.paf.gz`
    - merged manifest: `summaries/chop_manifest_l10000_o0.tsv`
    
    Counts:
    - PAN027pat_vs_PAN011_joint: 260 raw -> 313,481 10kb fragments
    - PAN027mat_vs_PAN010_joint: 409 raw -> 316,431 10kb fragments
    - PAN028mat_vs_PAN027_joint: 16,477 raw -> 670,629 10kb fragments
    
    Use these only. Continue with filter matrix using explicit `PAF_CHOP_LENGTH=10000 PAF_CHOP_OVERLAP=0`. Do not use `chop_manifest.tsv`/500kb outputs unless marked cancelled/legacy. Update README/manifests accordingly.
  10. #10 fig5-whole-genome-sweepga-updated-bin 2026-06-21T12:45:20.331896652+00:00 read
    Acknowledged message #9 — I will use only the valid 10 kb Rust chop outputs from jobs 1704338-1704340, cancel/ignore duplicate job 1704341 if active, run the filter matrix with PAF_CHOP_LENGTH=10000 and PAF_CHOP_OVERLAP=0, and update README/manifests to mark Python/500kb attempts cancelled/legacy.

Log