Metadata
| Status | done |
|---|---|
| Assigned | agent-185 |
| Agent identity | f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e |
| Created | 2026-04-27T00:20:07.059463902+00:00 |
| Started | 2026-04-27T00:20:46.044369231+00:00 |
| Completed | 2026-04-27T01:06:27.343189556+00:00 |
| Tags | eval-scheduled |
| Eval score | 0.84 |
| └ blocking impact | 0.87 |
| └ completeness | 0.73 |
| └ constraint fidelity | 0.40 |
| └ coordination overhead | 0.88 |
| └ correctness | 0.88 |
| └ downstream usability | 0.85 |
| └ efficiency | 0.85 |
| └ intent fidelity | 0.86 |
| └ style adherence | 0.87 |
Description
User insight: pending-validation status is a holdover from the deprecated --verify / --validation=llm era. It now exists to stall tasks indefinitely waiting for wg approve / wg reject — a synchronous human gate that nobody runs. Better model: dependent tasks unblock when the parent's .evaluate-X task passes (score >= eval_gate_threshold). Agency eval IS the verification.
The machinery already exists in config:
eval_gate_threshold = 0.7
auto_rescue_on_eval_fail = true
auto_evaluate = true
What's missing: making .evaluate-X a HARD prerequisite for downstream tasks, and removing pending-validation from the routine state machine.
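One way to read "hard prerequisite" is as a readiness predicate: a dependent is ready only when each parent is Done and that parent's eval has run and met the threshold. The sketch below is a minimal illustration under that reading. The `Status`, `Task`, `dependency_passed`, and `is_ready` names are hypothetical stand-ins, not the actual wg types or functions.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified status enum (not the real wg type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Open,
    InProgress,
    Done,
    Failed,
    Abandoned,
}

/// Hypothetical task record holding only the fields this sketch needs.
#[derive(Debug, Clone)]
pub struct Task {
    pub status: Status,
    /// Ids of tasks this one depends on (`--after`).
    pub after: Vec<String>,
    /// Score from `.evaluate-<id>`; `None` until the eval has run.
    pub eval_score: Option<f64>,
}

/// A dependency passes the gate only when it is Done AND its eval has
/// run with a score at or above the threshold.
pub fn dependency_passed(dep: &Task, eval_gate_threshold: f64) -> bool {
    dep.status == Status::Done
        && dep.eval_score.map_or(false, |score| score >= eval_gate_threshold)
}

/// A task is ready when it is Open and every dependency passed the gate.
/// An unknown task id or unknown dependency id is treated as not ready.
pub fn is_ready(id: &str, tasks: &HashMap<String, Task>, eval_gate_threshold: f64) -> bool {
    let Some(task) = tasks.get(id) else {
        return false;
    };
    task.status == Status::Open
        && task.after.iter().all(|dep_id| {
            tasks
                .get(dep_id)
                .map_or(false, |dep| dependency_passed(dep, eval_gate_threshold))
        })
}
```

Under this sketch, a parent that is Done but has no eval score yet keeps its dependents blocked, which matches the "blocked on .evaluate-A pending" state described in the Spec.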
Spec
- Status state machine:
  - Drop PendingValidation from the routine task lifecycle. Tasks go: open → in-progress → done | failed | abandoned.
  - If retained at all, it's only for very rare cases (e.g. cross-org review in a public visibility task that explicitly opts in via --validation=human-review). Never the default for any task.
- Dependency unblock model:
  - Today: Task A done → Task B (--after A) becomes ready as soon as A is Done.
  - New: Task A done → .evaluate-A scaffolded → eval runs → if score >= eval_gate_threshold, Task B becomes ready. If score < threshold, eval-fail handler fires:
    - If auto_rescue_on_eval_fail = true (default): re-spawn A with the eval feedback as additional context; Task B stays blocked.
    - If auto_rescue_on_eval_fail = false: A transitions to Failed with eval reason; Task B blocked until manually unblocked.
- Display + UX:
  - wg ready and wg viz show eval-gating: 'Task B blocked on .evaluate-A pending'.
  - wg show A shows whether A's eval has run, current score vs threshold, downstream blockers.
  - Eval failure surfaces in chat / TUI immediately with the reason.
- Migration for existing PendingValidation tasks:
  - On dispatcher boot, scan for tasks in PendingValidation. For each: log a one-time migration message, transition to Done (assume the agent's claim was accepted; if user wanted to reject it they would have).
  - Clearly document the migration in the upgrade notes.
- Drop wg approve / wg reject as routine commands:
  - Keep them as overrides (wg approve <task> to bypass eval gate; wg reject <task> to force re-spawn) for emergency human intervention.
  - Mark them as 'expert mode' in --help; not surfaced in quickstart.
- Cascade-failure guardrails:
  - If eval consistently fails (e.g. 3 consecutive auto-rescues without passing), task transitions to Failed instead of looping forever.
  - Configurable via existing max_verify_failures (rename to max_eval_rescues for clarity).
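The eval-fail handler and the rescue cap from the Spec can be pictured as one pure decision function. This is a sketch under stated assumptions, not the committed implementation: `EvalConfig`, `EvalOutcome`, and `on_eval_result` are hypothetical names, and the cap of 3 is taken from the spec's "e.g. 3" example rather than from a confirmed default.

```rust
/// Hypothetical config mirroring the keys named in the spec.
#[derive(Debug, Clone)]
pub struct EvalConfig {
    pub eval_gate_threshold: f64,
    pub auto_rescue_on_eval_fail: bool,
    pub max_eval_rescues: u32,
}

impl Default for EvalConfig {
    fn default() -> Self {
        // 0.7 and `true` come from the task description; 3 is only the
        // spec's example value, assumed here for illustration.
        Self {
            eval_gate_threshold: 0.7,
            auto_rescue_on_eval_fail: true,
            max_eval_rescues: 3,
        }
    }
}

/// What the dispatcher should do after an eval finishes.
#[derive(Debug, Clone, PartialEq)]
pub enum EvalOutcome {
    /// Score met the threshold: dependents may become ready.
    Unblock,
    /// Re-spawn the task with eval feedback; dependents stay blocked.
    Rescue { attempt: u32 },
    /// Task transitions to Failed; dependents stay blocked.
    Fail { reason: String },
}

/// Decide the next step from the eval score and the number of
/// consecutive rescues already spent on this task.
pub fn on_eval_result(score: f64, rescues_so_far: u32, cfg: &EvalConfig) -> EvalOutcome {
    if score >= cfg.eval_gate_threshold {
        return EvalOutcome::Unblock;
    }
    if !cfg.auto_rescue_on_eval_fail {
        return EvalOutcome::Fail {
            reason: format!(
                "eval score {:.2} below threshold {:.2}",
                score, cfg.eval_gate_threshold
            ),
        };
    }
    if rescues_so_far >= cfg.max_eval_rescues {
        return EvalOutcome::Fail {
            reason: format!("eval still failing after {} rescues", rescues_so_far),
        };
    }
    EvalOutcome::Rescue {
        attempt: rescues_so_far + 1,
    }
}
```

Keeping the decision free of I/O makes the cases in test_dependent_task_stays_blocked_when_eval_fails and test_max_eval_rescues_caps_loops straightforward to check in isolation. The caller would own the side effects (re-spawning, status transitions, logging).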
Why this matters now
Showstopper concrete example: thin-wrapper-impl is sitting in PendingValidation (6 hours stale). Downstream tasks that depend on it (any thin-wrapper-smoke / thin-wrapper-docs / etc.) are blocked. The eval (.evaluate-thin-wrapper-impl) probably ran or will run; if it passed, downstream should already be unblocked. PendingValidation is just adding a synchronous human checkpoint that nobody is running.
User's point verbatim: 'a mode of behavior that should be deprecated... maybe dependent tasks should actually depend on evaluation passing them?'
Out of scope
- Re-implementing the agency eval system (already works; just plug it into the gate model)
- The smoke-gate-is task (separate concern: agent's own self-check before claiming done)
Validation
- Failing tests first:
  - test_dependent_task_unblocks_when_eval_passes: Task A done + .evaluate-A scored 0.8 (above threshold) → Task B becomes ready
  - test_dependent_task_stays_blocked_when_eval_fails: Task A done + .evaluate-A scored 0.5 → Task B stays blocked AND A re-spawned with feedback
  - test_no_routine_pending_validation_state: wg add 'foo'; wg done foo ends up in Done, never PendingValidation
  - test_legacy_pending_validation_migrated_on_boot: boot finds an existing PendingValidation task, transitions to Done with migration log
  - test_max_eval_rescues_caps_loops: task that consistently fails eval transitions to Failed after N retries
- Implementation makes tests pass
- cargo build + cargo test pass with no regressions
- Manual smoke (in a scratch dir):
  - Add task A and task B (--after A); publish both
  - A runs, claims done; B should NOT be ready until .evaluate-A passes
  - If eval passes, B becomes ready; if fails, A re-spawns
  - PendingValidation never appears in wg list for either
- Approve thin-wrapper-impl (or reject if smoke shows broken) to unblock its downstream NOW, separate from this task
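The boot-time migration that test_legacy_pending_validation_migrated_on_boot exercises could look roughly like the following. This is a minimal sketch: the `Status` and `Task` types, the `migrate_pending_validation` function, and the log message wording are all hypothetical, and the real dispatcher presumably loads and persists tasks through its own storage layer.

```rust
/// Hypothetical status enum that still carries the legacy variant.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Open,
    InProgress,
    PendingValidation,
    Done,
    Failed,
    Abandoned,
}

/// Hypothetical task record holding only the fields this sketch needs.
#[derive(Debug, Clone)]
pub struct Task {
    pub id: String,
    pub status: Status,
    pub log: Vec<String>,
}

/// Move every PendingValidation task to Done and append one migration
/// log line to it. Returns the ids that were migrated, in input order.
pub fn migrate_pending_validation(tasks: &mut [Task]) -> Vec<String> {
    let mut migrated = Vec::new();
    for task in tasks
        .iter_mut()
        .filter(|t| t.status == Status::PendingValidation)
    {
        task.status = Status::Done;
        // Assumed message text; the actual wording is not in the source.
        task.log
            .push("migrated: PendingValidation -> Done on dispatcher boot".to_string());
        migrated.push(task.id.clone());
    }
    migrated
}
```

Because migrated tasks no longer match the filter, a second call is a no-op, which gives the "one-time migration message" behavior without a separate marker.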
Depends on
Required by
- (none)
Log
- 2026-04-27T00:20:07.056805126+00:00 Task paused
- 2026-04-27T00:20:21.319661974+00:00 Task published
- 2026-04-27T00:20:45.820022781+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=task, reason=Careful Programmer is the only role suited to implementing core state machine changes; Careful tradeoff matches the regression-risk profile of modifying task lifecycle and dependency resolution.
- 2026-04-27T00:20:46.044374671+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-27T00:20:58.806647034+00:00 Starting work — exploring existing PendingValidation code paths
- 2026-04-27T00:22:34.066197236+00:00 Mapping codebase: PendingValidation enters via verify_mode=separate, validation=llm, validation=external paths in done.rs. Routine done already lands in Done. Need to focus on (1) ready_tasks gate on .evaluate-X completion, (2) boot migration of stale PendingValidation, (3) cascade-failure guardrails.
- 2026-04-27T01:04:42.821156742+00:00 Manual smoke complete: (1) eval-pending blocks dependents, (2) eval-pass unblocks them, (3) legacy PendingValidation migrates to Done on next tick. All 11 new integration tests pass; ran broad sweep — no regressions caused by my changes (failures observed are all pre-existing wg init --executor and smoke_context.rs ResumeConfig issues).
- 2026-04-27T01:06:02.764727031+00:00 Committed: fffdfc3bf — pushed to remote
- 2026-04-27T01:06:27.343195928+00:00 Task marked as done