deprecate-pending-validation — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-185`
Agent identity	`f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e`
Created	2026-04-27T00:20:07.059463902+00:00
Started	2026-04-27T00:20:46.044369231+00:00
Completed	2026-04-27T01:06:27.343189556+00:00
Tags	`eval-scheduled`
Eval score	0.84
└ blocking impact	0.87
└ completeness	0.73
└ constraint fidelity	0.40
└ coordination overhead	0.88
└ correctness	0.88
└ downstream usability	0.85
└ efficiency	0.85
└ intent fidelity	0.86
└ style adherence	0.87

## Description

User insight: the pending-validation status is a holdover from the deprecated `--verify` / `--validation=llm` era. Today its only effect is to stall tasks indefinitely while they wait for `wg approve` / `wg reject`, a synchronous human gate that nobody runs. Better model: dependent tasks unblock when the parent's `.evaluate-X` task passes (score >= `eval_gate_threshold`). Agency eval IS the verification.

The machinery already exists in config:
- `eval_gate_threshold = 0.7`
- `auto_rescue_on_eval_fail = true`
- `auto_evaluate = true`

What's missing: making `.evaluate-X` a HARD prerequisite for downstream tasks, and removing pending-validation from the routine state machine.
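
As a rough illustration of the gate described above, the following is a minimal sketch, not the actual workgraph code. `EvalConfig`, `EvalState`, and `eval_gate_open` are hypothetical names introduced here; only the three config keys and their values come from the task description.

```rust
/// Hypothetical mirror of the three config keys listed above.
#[derive(Debug, Clone, PartialEq)]
pub struct EvalConfig {
    pub eval_gate_threshold: f64,
    pub auto_rescue_on_eval_fail: bool,
    pub auto_evaluate: bool,
}

impl Default for EvalConfig {
    fn default() -> Self {
        // Values taken from the config snippet in the description.
        Self {
            eval_gate_threshold: 0.7,
            auto_rescue_on_eval_fail: true,
            auto_evaluate: true,
        }
    }
}

/// State of the parent's `.evaluate-X` task as seen by a dependent (assumed shape).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum EvalState {
    /// The eval has not produced a score yet.
    Pending,
    /// The eval finished with this score.
    Scored(f64),
}

/// A dependent may become ready only once the parent's eval has scored
/// at or above the threshold; a pending eval keeps the gate closed.
pub fn eval_gate_open(cfg: &EvalConfig, eval: EvalState) -> bool {
    match eval {
        EvalState::Pending => false,
        EvalState::Scored(score) => score >= cfg.eval_gate_threshold,
    }
}
```

With the default config, `eval_gate_open(&cfg, EvalState::Scored(0.8))` is true and `EvalState::Scored(0.5)` or `EvalState::Pending` keeps the dependent blocked. How the gate should behave when `auto_evaluate` is false is not stated in the description, so the sketch leaves that field unused.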

### Spec

1. **Status state machine**:
   - Drop `PendingValidation` from the routine task lifecycle. Tasks go: `open → in-progress → done | failed | abandoned`.
   - If retained at all, it's only for very rare cases (e.g. cross-org review in a public visibility task that explicitly opts in via `--validation=human-review`). Never the default for any task.

2. **Dependency unblock model**:
   - Today: Task A done → Task B (`--after A`) becomes ready as soon as A is Done.
   - New: Task A done → `.evaluate-A` scaffolded → eval runs → if score >= eval_gate_threshold, Task B becomes ready. If score < threshold, eval-fail handler fires:
     - If `auto_rescue_on_eval_fail = true` (default): re-spawn A with the eval feedback as additional context; Task B stays blocked.
     - If `auto_rescue_on_eval_fail = false`: A transitions to Failed with the eval reason; Task B stays blocked until manually unblocked.

3. **Display + UX**:
   - `wg ready` and `wg viz` show eval-gating: 'Task B blocked on .evaluate-A pending'.
   - `wg show A` shows whether A's eval has run, current score vs threshold, downstream blockers.
   - Eval failure surfaces in chat / TUI immediately with the reason.

4. **Migration for existing PendingValidation tasks**:
   - On dispatcher boot, scan for tasks in PendingValidation. For each: log a one-time migration message and transition the task to Done (assume the agent's claim was accepted; if the user had wanted to reject it, they would have).
   - Clearly document the migration in the upgrade notes.

5. **Drop `wg approve` / `wg reject` as routine commands**:
   - Keep them as overrides (`wg approve <task>` to bypass eval gate; `wg reject <task>` to force re-spawn) for emergency human intervention.
   - Mark them as 'expert mode' in `--help`; not surfaced in quickstart.

6. **Cascade-failure guardrails**:
   - If eval consistently fails (e.g. 3 consecutive auto-rescues without passing), task transitions to Failed instead of looping forever.
   - Make the cap configurable via the existing `max_verify_failures` setting (rename it to `max_eval_rescues` for clarity).
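
The eval-fail handler from item 2 and the rescue cap from item 6 can be read together as one decision. The sketch below is one possible shape under stated assumptions: `GateAction`, `GatePolicy`, and `on_eval_scored` are hypothetical names, and `max_eval_rescues` is the rename proposed in the spec rather than a confirmed existing field.

```rust
/// What the dispatcher does with parent task A once `.evaluate-A` has scored.
#[derive(Debug, Clone, PartialEq)]
pub enum GateAction {
    /// Score met the threshold: dependents of A become ready.
    UnblockDependents,
    /// Score missed the threshold: re-spawn A with the eval feedback;
    /// dependents stay blocked.
    Rescue { feedback: String },
    /// A transitions to Failed with a reason; dependents stay blocked.
    Fail { reason: String },
}

/// Hypothetical bundle of the settings the decision needs.
pub struct GatePolicy {
    pub eval_gate_threshold: f64,
    pub auto_rescue_on_eval_fail: bool,
    /// Assumed name, per the rename suggested in the spec.
    pub max_eval_rescues: u32,
}

/// Decide the next step after an eval score arrives.
/// `rescues_so_far` counts consecutive auto-rescues that did not pass.
pub fn on_eval_scored(
    policy: &GatePolicy,
    score: f64,
    feedback: &str,
    rescues_so_far: u32,
) -> GateAction {
    let threshold = policy.eval_gate_threshold;
    if score >= threshold {
        return GateAction::UnblockDependents;
    }
    if !policy.auto_rescue_on_eval_fail {
        return GateAction::Fail {
            reason: format!("eval score {score:.2} below threshold {threshold:.2}: {feedback}"),
        };
    }
    if rescues_so_far >= policy.max_eval_rescues {
        // Guardrail: stop looping once the rescue budget is spent.
        return GateAction::Fail {
            reason: format!("eval still failing after {rescues_so_far} rescues: {feedback}"),
        };
    }
    GateAction::Rescue {
        feedback: feedback.to_string(),
    }
}
```

With `max_eval_rescues = 3`, a task that keeps scoring below the threshold is rescued three times and then fails on the next evaluation, which matches the "3 consecutive auto-rescues" example in item 6.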

### Why this matters now

Concrete showstopper: thin-wrapper-impl has been sitting in PendingValidation for 6 hours. Any downstream tasks that depend on it (thin-wrapper-smoke / thin-wrapper-docs / etc.) are blocked. The eval (`.evaluate-thin-wrapper-impl`) probably ran or will run; if it passed, downstream should already be unblocked. PendingValidation is just adding a synchronous human checkpoint that nobody is running.

User's point verbatim: 'a mode of behavior that should be deprecated... maybe dependent tasks should actually depend on evaluation passing them?'

### Out of scope

- Re-implementing the agency eval system (already works; just plug it into the gate model)
- The smoke-gate-is task (separate concern: agent's own self-check before claiming done)

## Validation

- [ ] Failing tests first:
  - test_dependent_task_unblocks_when_eval_passes — Task A done + .evaluate-A scored 0.8 (above threshold) → Task B becomes ready
  - test_dependent_task_stays_blocked_when_eval_fails — Task A done + .evaluate-A scored 0.5 → Task B stays blocked AND A re-spawned with feedback
  - test_no_routine_pending_validation_state — `wg add 'foo'; wg done foo` ends up in Done, never PendingValidation
  - test_legacy_pending_validation_migrated_on_boot — boot finds an existing PendingValidation task, transitions to Done with migration log
  - test_max_eval_rescues_caps_loops — task that consistently fails eval transitions to Failed after N retries
- [ ] Implementation makes tests pass
- [ ] cargo build + cargo test pass with no regressions
- [ ] Manual smoke (in a scratch dir):
  - Add task A and task B (`--after A`); publish both
  - A runs, claims done; B should NOT be ready until .evaluate-A passes
  - If the eval passes, B becomes ready; if it fails, A re-spawns
  - PendingValidation never appears in `wg list` for either
- [ ] Approve thin-wrapper-impl (or reject if smoke shows broken) to unblock its downstream NOW, separate from this task
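
For the boot-migration item in the checklist above, a test such as `test_legacy_pending_validation_migrated_on_boot` might exercise something like the following. This is a sketch only: `Status`, `Task`, and `migrate_pending_validation` are hypothetical stand-ins, and the real task model and log format in the codebase may differ.

```rust
/// Assumed task statuses; `PendingValidation` is kept only so legacy tasks can be read.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Open,
    InProgress,
    Done,
    Failed,
    Abandoned,
    PendingValidation,
}

/// Minimal hypothetical task record.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Task {
    pub id: String,
    pub status: Status,
}

/// Move every legacy PendingValidation task to Done and return one
/// log line per migrated task. A second run finds nothing to migrate,
/// so each task is logged only once.
pub fn migrate_pending_validation(tasks: &mut [Task]) -> Vec<String> {
    let mut log = Vec::new();
    for task in tasks.iter_mut() {
        if task.status == Status::PendingValidation {
            task.status = Status::Done;
            log.push(format!(
                "migrated {} from PendingValidation to Done",
                task.id
            ));
        }
    }
    log
}
```

Calling the function on a task list that contains one PendingValidation task returns a single log line and leaves that task in Done; tasks in any other status are untouched.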

Depends on

done .assign-deprecate-pending-validation

Required by

(none)

Log

2026-04-27T00:20:07.056805126+00:00 Task paused
2026-04-27T00:20:21.319661974+00:00 Task published
2026-04-27T00:20:45.820022781+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=task, reason=Careful Programmer is the only role suited to implementing core state machine changes; Careful tradeoff matches the regression-risk profile of modifying task lifecycle and dependency resolution.
2026-04-27T00:20:46.044374671+00:00 Spawned by coordinator --executor claude --model opus
2026-04-27T00:20:58.806647034+00:00 Starting work — exploring existing PendingValidation code paths
2026-04-27T00:22:34.066197236+00:00 Mapping codebase: PendingValidation enters via verify_mode=separate, validation=llm, validation=external paths in done.rs. Routine done already lands in Done. Need to focus on (1) ready_tasks gate on .evaluate-X completion, (2) boot migration of stale PendingValidation, (3) cascade-failure guardrails.
2026-04-27T01:04:42.821156742+00:00 Manual smoke complete: (1) eval-pending blocks dependents, (2) eval-pass unblocks them, (3) legacy PendingValidation migrates to Done on next tick. All 11 new integration tests pass; ran broad sweep — no regressions caused by my changes (failures observed are all pre-existing wg init --executor and smoke_context.rs ResumeConfig issues).
2026-04-27T01:06:02.764727031+00:00 Committed: fffdfc3bf — pushed to remote
2026-04-27T01:06:27.343195928+00:00 Task marked as done