fig5-raw-manymany-impg-similarity-2kb-sharded — octopus01:/moosefs/erikg/phrs

Metadata

Status	done
Assigned	`agent-2837`
Agent identity	`3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3`
Created	2026-06-27T11:17:22.132544854+00:00
Started	2026-06-27T11:19:15.350170134+00:00
Completed	2026-06-27T11:48:10.519787483+00:00
Tags	`fig5`, `impg`, `slurm`, `raw-manymany`, `sharded`, `eval-scheduled`
Eval score	0.76
└ blocking impact	0.78
└ completeness	0.57
└ constraint fidelity	0.55
└ coordination overhead	0.82
└ correctness	0.74
└ downstream usability	0.84
└ efficiency	0.86
└ intent fidelity	0.87
└ style adherence	0.91

Description

Correct replacement execution for Fig5 IMPG similarity at 2 kb resolution.

Goal:
- Run IMPG similarity over full-genome 2 kb target windows for the existing raw unfiltered many:many alignments.
- Do not run WFMASH, SweepGA, FastGA, minimap2, seqwish, odgi, or any new alignment/graph construction. This task consumes already-generated PAFs and FASTAs only.

Inputs / evidence layers:
- WFMASH updated-bin raw many:many unfiltered PAFs from `paper_prep/_brainstorming/pedigree_whole_genome_wfmash_p95_updated_bin/summaries/query_grid_filter_manifest.tsv`, using only `raw_paf`.
- SweepGA/FastGA f32 raw many:many unfiltered PAFs from `paper_prep/_brainstorming/pedigree_whole_genome_sweepga_fastga_frequency32/summaries/query_grid_chop_filter_manifest.tsv`, using only `raw_paf`.
- Query/target FASTA paths from `paper_prep/_brainstorming/pedigree_whole_genome_wfmash_p95_updated_bin/summaries/input_manifest.tsv`.
- Previous failed 10 kb task `fig5-raw-manymany-impg-similarity-fullbed` may be used only for scripts, manifests, WFMASH command validation, and existing BGZF-normalized SweepGA PAF copies if validated with bgzip. Do not treat its partial SweepGA TSVs as valid evidence.
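A quick structural pre-check can catch a plain-gzip or truncated PAF copy before a shard is submitted. The sketch below is a minimal illustration, not the task's actual validator and not a substitute for `bgzip -t`: it only confirms that the file starts with a BGZF block header and ends with the standard 28-byte empty EOF block. The function name `looks_like_bgzf` is hypothetical, and the check assumes the `BC` subfield is the first gzip extra subfield, as in files written by bgzip.

```python
import os

# Standard 28-byte empty BGZF block that terminates a well-formed BGZF file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff0600" "42430200" "1b000300" "00000000" "00000000"
)


def looks_like_bgzf(path):
    """Return True if the file has a BGZF first-block header and EOF marker.

    Lightweight structural check only; it does not decompress or verify CRCs.
    """
    if os.path.getsize(path) < len(BGZF_EOF):
        return False
    with open(path, "rb") as fh:
        head = fh.read(16)
        fh.seek(-len(BGZF_EOF), os.SEEK_END)
        tail = fh.read(len(BGZF_EOF))
    has_header = (
        head[:4] == b"\x1f\x8b\x08\x04"  # gzip magic, deflate, FEXTRA flag set
        and head[12:14] == b"BC"         # BGZF extra-subfield identifier
    )
    return has_header and tail == BGZF_EOF
```

A copy that fails this check should be re-normalized from the existing raw PAF; a copy that passes should still go through `bgzip -t` before it is recorded as validated.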

Required execution shape:
- Build exact full-genome target BEDs with 2,000 bp windows from each target FASTA `.fai`. Do not expand windows to fixed display widths; the last window on each contig may be shorter than 2,000 bp.
- Use `impg similarity --alignment-files EXISTING_RAW_OR_BGZF_PAF --target-bed SHARD_2KB.bed --sequence-files QUERY.fa TARGET.fa --gfa-engine poa --no-merge --num-mappings many:many --scaffold-jump 0 --threads ${SLURM_CPUS_PER_TASK}`.
- Because 10 kb monolithic SweepGA timed out at 24h/48 CPUs, shard the 2 kb BEDs and submit Slurm arrays/jobs over shards. Every Slurm job must pass exactly `${SLURM_CPUS_PER_TASK}` to IMPG. Choose shard size/concurrency pragmatically so work runs in parallel across the cluster without launching one monolithic full-BED job.
- For SweepGA raw PAFs, IMPG requires BGZF; reuse validated BGZF copies from the previous task if present, otherwise bgzip-normalize the existing raw PAFs only. This is not a new alignment.
- Record exact raw source PAF path, IMPG alignment PAF path, query FASTA, target FASTA, BED shard, command, Slurm job ID, node/partition, `SLURM_CPUS_PER_TASK`, IMPG version/path, and output path for every shard.
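The execution shape above can be sketched as three small steps: tile each contig from the `.fai` into 2,000 bp windows, group the windows into shards, and render the per-shard command with the thread count left as the literal `${SLURM_CPUS_PER_TASK}`. This is a minimal sketch, not the committed generator: the function names, the shard size, and the example file names are hypothetical, and the IMPG flags are copied from the command given above rather than checked against IMPG itself.

```python
import shlex


def windows_from_fai(fai_lines, window=2000):
    """Yield (contig, start, end) BED windows; the last one per contig may be short."""
    for line in fai_lines:
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        contig, length = fields[0], int(fields[1])
        for start in range(0, length, window):
            yield contig, start, min(start + window, length)


def shard(records, shard_size):
    """Group records into consecutive lists of at most shard_size items."""
    batch = []
    for rec in records:
        batch.append(rec)
        if len(batch) == shard_size:
            yield batch
            batch = []
    if batch:
        yield batch


def impg_command(paf, bed, query_fa, target_fa):
    """Render the per-shard command; the shell expands the thread count at run time."""
    paths = [shlex.quote(p) for p in (paf, bed, query_fa, target_fa)]
    return (
        "impg similarity --alignment-files {} --target-bed {} "
        "--sequence-files {} {} --gfa-engine poa --no-merge "
        "--num-mappings many:many --scaffold-jump 0 "
        "--threads ${{SLURM_CPUS_PER_TASK}}"
    ).format(*paths)


# Hypothetical two-contig index: name, length, offset, line bases, line width.
fai = ["chrA\t4500\t6\t60\t61\n", "chrB\t2000\t4590\t60\t61\n"]
shards = list(shard(windows_from_fai(fai), shard_size=3))
command = impg_command("raw.paf.gz", "shard_0001.bed", "query.fa", "target.fa")
```

Here `shards` holds two groups, one with the three `chrA` windows (the last ending at 4500) and one with the single `chrB` window. Keeping `${SLURM_CPUS_PER_TASK}` unexpanded in the recorded command means the manifest shows what was submitted, while the resolved value is recorded separately per shard.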

Required comparisons:
- PAN027mat_vs_PAN010_joint
- PAN027pat_vs_PAN011_joint
- PAN028mat_vs_PAN027_joint

Required methods:
- wfmash_p95_updated_bin
- sweepga_fastga_frequency32

Deliverables:
- One finalized compressed 2 kb IMPG similarity TSV per method x comparison, assembled from completed shards with header/format handled correctly.
- Shard manifest and Slurm manifest with success/failure state for all shards.
- Summary tables: per-window target similarity/support, top/all interchromosomal targets, chr9q->chr3q windows, PAR controls, acrocentric controls, and full-genome target-pattern tracks.
- Concise report explaining that this is raw unfiltered PAF-backed IMPG similarity over 2 kb target windows, with no new alignments.
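For the first deliverable, assembling shards mostly comes down to writing the header once and refusing to merge shards whose headers disagree. The sketch below assumes each shard TSV begins with exactly one header line that is identical across shards; that layout, the function name `assemble_shards`, and the column names in the usage example are assumptions for illustration, not the task's finalizer.

```python
import gzip


def assemble_shards(shard_paths, out_path):
    """Concatenate shard TSVs into one gzip file, keeping a single header.

    Returns the number of data rows written. Raises ValueError on an empty
    shard or a header that differs from the first shard's header.
    """
    header = None
    rows = 0
    with gzip.open(out_path, "wt") as out:
        for path in shard_paths:
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt") as fh:
                first = fh.readline()
                if not first:
                    raise ValueError(f"empty shard: {path}")
                if header is None:
                    header = first
                    out.write(header)
                elif first != header:
                    raise ValueError(f"header mismatch in {path}")
                for line in fh:
                    out.write(line)
                    rows += 1
    return rows
```

Passing the shard paths in manifest order keeps the assembled rows in genome order. Because a raised error leaves a partial output file behind, one option is to write to a temporary name and rename only after the function returns.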

Depends on

done .assign-fig5-raw-manymany-impg-similarity-2kb-sharded

Required by

Log

2026-06-27T11:18:14.042133189+00:00 Lightweight assignment: agent=random-agent-46f6237a (46f6237a), exec_mode=full, context_scope=task, reason=This is a Slurm-backed data pipeline orchestration task with sharded execution, manifest tracking, and careful reuse of existing PAF/FASTA inputs, which fits a Pipeline Operator better than a generic programmer or evaluator.
2026-06-27T11:19:15.350174072+00:00 Spawned by coordinator --executor codex --model gpt-5.5
2026-06-27T11:19:24.272847770+00:00 Starting task: inspect messages, worktree state, and existing Fig5 IMPG pipeline assets
2026-06-27T11:39:24.953573318+00:00 Generated and submitted 2kb sharded IMPG arrays: 906 shard tasks, 6 arrays, 48 CPUs/task, max 6 concurrent tasks per array
2026-06-27T11:41:36.761692585+00:00 Added follow-up finalize task for assembling outputs after arrays 1706840-1706845 complete
2026-06-27T11:43:15.225262941+00:00 Committed sharded 2kb IMPG pipeline and submitted Slurm manifests: ca2167b
2026-06-27T11:45:49.291173991+00:00 Pushed commit ca2167b; artifacts recorded for report and manifests
2026-06-27T11:46:52.417605133+00:00 Final committed and pushed hash: a773a59. Slurm arrays remain active; finalize-fig5-raw follow-up owns assembly after completion.
2026-06-27T11:47:34.283906450+00:00 Validated: py_compile passed for generator/finalizer; manifests generated for 906 shards; previous SweepGA BGZF copies validated with bgzip -t; arrays 1706840-1706845 submitted with 48 CPUs/task and literal SLURM_CPUS_PER_TASK command recording
2026-06-27T11:48:10.519796691+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-06-27T11:57:49.005820666+00:00 PendingEval → Done (evaluator passed; downstream unblocks)