validate-pafchop-rs-paf-semantics — octopus01:/moosefs/erikg/phrs

Metadata

Status	done
Assigned	`agent-2663`
Agent identity	`f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e`
Created	2026-06-22T16:04:04.549701654+00:00
Started	2026-06-22T16:05:50.191802063+00:00
Completed	2026-06-22T16:20:35.395060901+00:00
Tags	`pafchop`, `rust`, `paf`, `validation`, `sweepga`, `eval-scheduled`
Eval score	0.89
└ blocking impact	0.94
└ completeness	0.88
└ constraint fidelity	0.70
└ coordination overhead	0.92
└ correctness	0.90
└ downstream usability	0.92
└ efficiency	0.86
└ intent fidelity	0.79
└ style adherence	0.88

Description

Problem:
The current PAF chopper must not be trusted until validated. Chopping PAF rows is only valid for downstream sweepGA filtering if all alignment-derived fields are correctly recomputed per chunk.

Task:
- Audit `paper_prep/_brainstorming/pafchop-rs` implementation.
- Determine whether the source f16 PAFs contain enough per-base alignment information (`cg:Z`, `cs:Z`, or equivalent) to exactly split alignments. If they do not, document that exact per-chunk identity cannot be recovered from PAF alone and mark existing chopped identity-sensitive outputs as not valid for identity filtering.
- Implement or repair chunking so each output row recomputes, at minimum: query start/end, target start/end, residue matches (PAF col 10), alignment block length (PAF col 11), identity-relevant optional tags (`NM:i`, `dv:f`, `de:f` / gap-compressed divergence where present and computable), and clipped `cg:Z` CIGAR/`cs:Z` strings where present.
- Reverse-strand target coordinate semantics must be tested. Chunks crossing matches, mismatches, insertions, deletions, and chunk boundaries inside operations must be tested.
- Do not silently copy stale alignment-derived tags. Either recompute them exactly or drop them with an explicit validation note explaining why downstream sweepGA will not use them.
- Add golden and property-style Rust tests. Run `cargo test` and a release build.
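
The exact-splitting requirement in the task list can be illustrated with a minimal sketch. It is not the code in `pafchop-rs`; the names `parse_cigar`, `split_by_target`, `Chunk`, and `abs_coords` are hypothetical. It walks a `cg:Z`-style CIGAR, cuts every `size` target bases (including inside an operation), and recomputes col 10 and col 11 per chunk. Two assumptions are made: `M` operations cannot yield an exact match count, so their bases are tallied separately as `ambiguous`; and the CIGAR follows the minimap2-style convention in which the target is always walked forward and, on the `-` strand, the query is consumed from its end coordinate downward. Other producers may differ, which is why reverse-strand semantics need their own tests.

```rust
/// One CIGAR operation: (length, op byte), op in M = X I D.
type Op = (u64, u8);

/// Parse a cg:Z-style CIGAR; None on malformed input or unsupported ops.
fn parse_cigar(s: &str) -> Option<Vec<Op>> {
    let (mut ops, mut n, mut have_digit) = (Vec::new(), 0u64, false);
    for b in s.bytes() {
        if b.is_ascii_digit() {
            n = n.checked_mul(10)?.checked_add(u64::from(b - b'0'))?;
            have_digit = true;
        } else {
            if !have_digit || n == 0 || !b"M=XID".contains(&b) {
                return None;
            }
            ops.push((n, b));
            n = 0;
            have_digit = false;
        }
    }
    if have_digit { None } else { Some(ops) }
}

/// Per-chunk values recomputed from the CIGAR walk (hypothetical layout).
#[derive(Debug, Default, PartialEq)]
struct Chunk {
    t_off: u64,     // target bases consumed before this chunk
    q_off: u64,     // query bases consumed before this chunk
    t_len: u64,     // target span of the chunk
    q_len: u64,     // query span of the chunk
    matches: u64,   // col 10, counted from '=' only
    ambiguous: u64, // bases under 'M': match/mismatch not recoverable
    block_len: u64, // col 11: matches + mismatches + inserted + deleted bases
    cigar: String,  // clipped CIGAR for this chunk
}

/// Split every `size` target bases. Insertions consume no target, so they
/// stay attached to the chunk that precedes them.
fn split_by_target(ops: &[Op], size: u64) -> Vec<Chunk> {
    assert!(size > 0, "chunk size must be positive");
    let mut out = Vec::new();
    let mut cur = Chunk::default();
    let (mut t_pos, mut q_pos) = (0u64, 0u64);
    for &(len, op) in ops {
        let mut left = len;
        while left > 0 {
            let consumes_t = op != b'I';
            if consumes_t && cur.t_len == size {
                // Current chunk is full: emit it and start a new one here.
                out.push(std::mem::take(&mut cur));
                cur.t_off = t_pos;
                cur.q_off = q_pos;
            }
            let take = if consumes_t { left.min(size - cur.t_len) } else { left };
            match op {
                b'=' => cur.matches += take,
                b'M' => cur.ambiguous += take,
                _ => {}
            }
            if op != b'I' { cur.t_len += take; t_pos += take; }
            if op != b'D' { cur.q_len += take; q_pos += take; }
            cur.block_len += take;
            cur.cigar.push_str(&format!("{}{}", take, op as char));
            left -= take;
        }
    }
    if cur.block_len > 0 {
        out.push(cur);
    }
    out
}

/// Map chunk offsets to (qstart, qend, tstart, tend).
/// ASSUMPTION: minimap2-style convention (target forward, query walked
/// backward from `qe` on the '-' strand).
fn abs_coords(c: &Chunk, strand: char, qs: u64, qe: u64, ts: u64) -> (u64, u64, u64, u64) {
    let (t0, t1) = (ts + c.t_off, ts + c.t_off + c.t_len);
    let (q0, q1) = if strand == '-' {
        (qe - c.q_off - c.q_len, qe - c.q_off)
    } else {
        (qs + c.q_off, qs + c.q_off + c.q_len)
    };
    (q0, q1, t0, t1)
}
```

For example, `split_by_target(&parse_cigar("5=1X2I3=2D4=").unwrap(), 6)` yields three chunks with clipped CIGARs `5=1X2I`, `3=2D1=`, and `3=`; the second boundary falls inside the final `4=` operation. A chunk with `ambiguous > 0` should have its col 10 treated as not exactly recoverable rather than guessed.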

Acceptance:
- `cargo test` passes and includes tests for M/=/X/I/D operations, reverse strand, chunks ending inside CIGAR ops, and recomputed col10/col11 identity.
- `PAF_SEMANTICS_VALIDATION.md` states exactly which PAF columns/tags are recomputed, copied, dropped, or impossible from PAF alone.
- Existing f16 chopped outputs are classified as valid or invalid for identity-sensitive sweepGA filtering based on the audit; no ambiguous result.
- Commit and push with WG provenance.
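
As a hedged illustration of the "recomputed col10/col11 identity" criterion, the sketch below derives identity-relevant values from the operations of a single chunk. The `Stats` type and its methods are hypothetical, not taken from the repository. It assumes an `=`/`X`/`I`/`D` CIGAR, takes `NM:i` as mismatches plus gap bases, and assumes the gap-compressed divergence (`de:f`) follows the minimap2 definition, in which each gap counts as one event regardless of length. `dv:f` is left out because minimap2 documents it as an approximate estimate, not a value derived from the base-level alignment.

```rust
/// Identity-relevant counts for one chunk (hypothetical helper type).
#[derive(Debug, Default, PartialEq)]
struct Stats {
    matches: u64,    // bases under '='
    mismatches: u64, // bases under 'X'
    gap_bases: u64,  // bases under 'I' or 'D'
    gap_opens: u64,  // number of 'I'/'D' operations
}

impl Stats {
    /// Accumulate from (length, op) pairs. Returns None for any op other
    /// than = X I D, because 'M' hides the match/mismatch split.
    fn from_ops(ops: &[(u64, char)]) -> Option<Stats> {
        let mut s = Stats::default();
        for &(len, op) in ops {
            match op {
                '=' => s.matches += len,
                'X' => s.mismatches += len,
                'I' | 'D' => {
                    s.gap_bases += len;
                    s.gap_opens += 1;
                }
                _ => return None,
            }
        }
        Some(s)
    }

    /// PAF col 11: matches + mismatches + inserted and deleted bases.
    fn block_len(&self) -> u64 {
        self.matches + self.mismatches + self.gap_bases
    }

    /// NM:i as edit distance: mismatches plus gap bases.
    fn nm(&self) -> u64 {
        self.block_len() - self.matches
    }

    /// col10 / col11 identity; None for an empty chunk.
    fn identity(&self) -> Option<f64> {
        let b = self.block_len();
        if b == 0 { None } else { Some(self.matches as f64 / b as f64) }
    }

    /// ASSUMED minimap2-style de:f: 1 - matches / (matches + mismatches + gap opens).
    fn gap_compressed_divergence(&self) -> Option<f64> {
        let events = self.matches + self.mismatches + self.gap_opens;
        if events == 0 {
            None
        } else {
            Some(1.0 - self.matches as f64 / events as f64)
        }
    }
}
```

One consequence worth testing: when a chunk boundary falls inside an insertion or deletion, each piece counts as its own gap open under this definition, so per-chunk gap-compressed values need not combine into the value of the unsplit row, whereas matches and block lengths always sum exactly.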

Depends on

Required by

Log

2026-06-22T16:05:01.890923713+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=graph, reason=Careful Programmer is the best fit for a correctness-critical Rust implementation and test repair task that spans validation, semantics, and output classification across dependent components.
2026-06-22T16:05:50.191805700+00:00 Spawned by coordinator --executor codex --model gpt-5.5
2026-06-22T16:06:08.305452779+00:00 Starting audit of pafchop-rs implementation and f16 PAF inputs
2026-06-22T16:07:01.969576509+00:00 Audit found current implementation interpolates target coordinates and identity fields; now checking f16 PAF tag availability from prior artifact paths
2026-06-22T16:12:53.483680690+00:00 Implemented strict CIGAR-based chunking, tests, and validation doc; rerunning cargo validation
2026-06-22T16:16:58.675721241+00:00 Validated: cargo test passed (11 tests), release build passed, sampled f16 raw rows accepted by exact CIGAR path; old f16 chopped outputs classified invalid for identity-sensitive filtering
2026-06-22T16:19:04.493811181+00:00 Committed: 64aec0a — pushing to remote
2026-06-22T16:19:57.999982948+00:00 Committed: 64aec0a — pushed to remote
2026-06-22T16:20:35.395071602+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-06-22T16:28:05.344388839+00:00 PendingEval → Done (evaluator passed; downstream unblocks)