validate-pafchop-rs-paf-semantics

Validate and repair pafchop-rs PAF semantics

Metadata

Status: done
Assigned: agent-2663
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-06-22T16:04:04.549701654+00:00
Started: 2026-06-22T16:05:50.191802063+00:00
Completed: 2026-06-22T16:20:35.395060901+00:00
Tags: pafchop, rust, paf, validation, sweepga, eval-scheduled
Eval score: 0.89
└ blocking impact: 0.94
└ completeness: 0.88
└ constraint fidelity: 0.70
└ coordination overhead: 0.92
└ correctness: 0.90
└ downstream usability: 0.92
└ efficiency: 0.86
└ intent fidelity: 0.79
└ style adherence: 0.88

Description

Problem: The current PAF chopper must not be trusted until validated. Chopping PAF rows is only valid for downstream sweepGA filtering if all alignment-derived fields are correctly recomputed per chunk.

Task:

  • Audit paper_prep/_brainstorming/pafchop-rs implementation.
  • Determine whether the source f16 PAFs contain enough per-base alignment information (cg:Z, cs:Z, or equivalent) to exactly split alignments. If they do not, document that exact per-chunk identity cannot be recovered from PAF alone and mark existing chopped identity-sensitive outputs as not valid for identity filtering.
  • Implement or repair chunking so each output row recomputes, at minimum: query start/end, target start/end, residue matches (PAF col 10), alignment block length (PAF col 11), identity-relevant optional tags (NM:i, dv:f, de:f / gap-compressed divergence where present and computable), and clipped cg:Z CIGAR/cs:Z strings where present.
  • Reverse-strand target coordinate semantics must be tested. Chunks crossing matches, mismatches, insertions, deletions, and chunk boundaries inside operations must be tested.
  • Do not silently copy stale alignment-derived tags. Either recompute them exactly or drop them with an explicit validation note explaining why downstream sweepGA will not use them.
  • Add golden and property-style Rust tests. Run cargo test and a release build.
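
The recomputation described in the task list can be sketched as follows. This is a minimal illustration, not the pafchop-rs implementation: the names `Op`, `ChunkStats`, `parse_cigar`, and `slice_by_target` are hypothetical. It assumes an extended cg:Z CIGAR (`=`, `X`, `I`, `D`), chunk boundaries expressed as target offsets relative to the alignment start, and one possible policy for insertions (an insertion belongs to the chunk whose target window contains the position where it occurs). A plain `M` operation is rejected because it does not distinguish matches from mismatches, so col 10 cannot be recomputed exactly from it without cs:Z or the sequences.

```rust
/// Extended CIGAR operations; plain `M` is deliberately not represented.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Op { Eq, Mismatch, Ins, Del }

/// Values recomputed for one chunk (hypothetical struct for this sketch).
#[derive(Debug, Default, PartialEq, Eq)]
struct ChunkStats {
    q_len: u64,     // query bases consumed by the chunk
    t_len: u64,     // target bases consumed by the chunk
    matches: u64,   // PAF col 10: residue matches
    block_len: u64, // PAF col 11: matches + mismatches + inserted + deleted bases
    cigar: String,  // clipped cg:Z value
}

/// Parse an extended CIGAR; returns None for `M`, unknown ops, or malformed input.
fn parse_cigar(s: &str) -> Option<Vec<(u64, Op)>> {
    let mut ops = Vec::new();
    let mut n: u64 = 0;
    let mut have_digit = false;
    for c in s.chars() {
        if let Some(d) = c.to_digit(10) {
            n = n.checked_mul(10)?.checked_add(u64::from(d))?;
            have_digit = true;
            continue;
        }
        let op = match c {
            '=' => Op::Eq,
            'X' => Op::Mismatch,
            'I' => Op::Ins,
            'D' => Op::Del,
            _ => return None, // includes 'M': matches are not recoverable
        };
        if !have_digit || n == 0 {
            return None;
        }
        ops.push((n, op));
        n = 0;
        have_digit = false;
    }
    if have_digit { None } else { Some(ops) }
}

/// Recompute stats for the target-offset window [start, end).
/// Operations that straddle a boundary are split; nothing is copied stale.
fn slice_by_target(ops: &[(u64, Op)], start: u64, end: u64) -> ChunkStats {
    let mut out = ChunkStats::default();
    let mut t_pos: u64 = 0; // target offset at the start of the current op
    for &(len, op) in ops {
        let take = if op == Op::Ins {
            // Assumed policy: an insertion sits at a single target position.
            if start <= t_pos && t_pos < end { len } else { 0 }
        } else {
            let lo = t_pos.max(start);
            let hi = (t_pos + len).min(end);
            t_pos += len; // =, X and D all consume target bases
            hi.saturating_sub(lo)
        };
        if take == 0 {
            continue;
        }
        let (ch, q, t, m) = match op {
            Op::Eq => ('=', take, take, take),
            Op::Mismatch => ('X', take, take, 0),
            Op::Ins => ('I', take, 0, 0),
            Op::Del => ('D', 0, take, 0),
        };
        out.q_len += q;
        out.t_len += t;
        out.matches += m;
        out.block_len += take;
        out.cigar.push_str(&format!("{}{}", take, ch));
    }
    out
}
```

For example, slicing `5=1X2I4=3D6=` at target offset 8 yields `5=1X2I2=` (col 10 = 7, col 11 = 10) and `2=3D6=` (col 10 = 8, col 11 = 11); the two chunks sum back to the whole alignment, which is the kind of invariant a property-style test can check. A real chopper would additionally need to decide how to handle chunks that begin or end with a gap operation, and insertions located exactly at the alignment end.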

Acceptance:

  • cargo test passes and includes tests for M/=/X/I/D operations, reverse strand, chunks ending inside CIGAR ops, and recomputed col10/col11 identity.
  • PAF_SEMANTICS_VALIDATION.md states exactly which PAF columns/tags are recomputed, copied, dropped, or impossible from PAF alone.
  • Existing f16 chopped outputs are classified as valid or invalid for identity-sensitive sweepGA filtering based on the audit; no ambiguous result.
  • Commit and push with WG provenance.
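
The reverse-strand acceptance test could be built around a coordinate mapping like the one below. This is a hedged sketch with hypothetical names (`Strand`, `Span`, `chunk_coords`). It assumes the minimap2 convention: PAF coordinates are 0-based, half-open, and reported on the original strand of each sequence, and the CIGAR is walked along the target's forward strand, so on `-` alignments it is the query interval that is mirrored. If the PAF producer instead walks the query forward, the mirroring applies to the target, and the golden tests should pin down whichever convention the f16 PAFs actually use.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strand { Forward, Reverse }

/// Half-open interval [start, end).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Span { start: u64, end: u64 }

/// Map a chunk's alignment-order offsets to PAF coordinates.
/// `q_off` and `t_off` are offsets relative to the alignment start,
/// counted in the order the CIGAR is walked.
/// Returns (query span, target span), or None if an offset is out of range.
fn chunk_coords(
    strand: Strand,
    query: Span,
    target: Span,
    q_off: Span,
    t_off: Span,
) -> Option<(Span, Span)> {
    let q_len = query.end.checked_sub(query.start)?;
    let t_len = target.end.checked_sub(target.start)?;
    if q_off.start > q_off.end
        || q_off.end > q_len
        || t_off.start > t_off.end
        || t_off.end > t_len
    {
        return None;
    }
    // Target is always walked forward under the assumed convention.
    let t = Span {
        start: target.start + t_off.start,
        end: target.start + t_off.end,
    };
    let q = match strand {
        Strand::Forward => Span {
            start: query.start + q_off.start,
            end: query.start + q_off.end,
        },
        // Reverse strand: the first bases consumed are at the query END.
        Strand::Reverse => Span {
            start: query.end - q_off.end,
            end: query.end - q_off.start,
        },
    };
    Some((q, t))
}
```

With a query span of 100..118 and a target span of 1000..1019 on the `-` strand, the first chunk (query offsets 0..10, target offsets 0..8) maps to query 108..118 and target 1000..1008, and the second chunk (query offsets 10..18, target offsets 8..19) maps to query 100..108 and target 1008..1019. The chunk spans tile the original interval without gaps or overlap, which is a useful invariant for a property-style test.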

Depends on

Required by

Log