Metadata
| Status | done |
|---|---|
| Assigned | agent-685 |
| Agent identity | f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e |
| Created | 2026-04-27T13:44:47.924051900+00:00 |
| Started | 2026-04-27T13:45:18.007361244+00:00 |
| Completed | 2026-04-27T14:42:54.880143898+00:00 |
| Tags | eval-scheduled |
| Eval score | 0.75 |
| └ blocking impact | 0.85 |
| └ completeness | 0.72 |
| └ constraint fidelity | 0.85 |
| └ coordination overhead | 0.80 |
| └ correctness | 0.80 |
| └ downstream usability | 0.75 |
| └ efficiency | 0.85 |
| └ intent fidelity | 0.76 |
| └ style adherence | 0.85 |
Description
deprecate-pending-validation only shipped half the spec. Auto-migration of old PendingValidation → Done landed, but the intermediate eval-gated state and dep-unblock gating did NOT.
User's verbatim mental model:
'soft done state where the agent is like, yo, I think I'm done. But then that converts... light green, like between yellow and green. And then when the evaluation actually can run and verify completion, if its grade goes above .7, then it should flip to real green and things behind it should be possible. But if it fails, it should go to red and go into this retry triage type thing.'
Required state machine
open → in-progress → pending-eval ─┬─ eval pass → done → downstream unblocks
                                   └─ eval fail → failed → triage / auto-rescue
Status enum additions:
- `PendingEval` (light green / chartreuse): agent called `wg done`; eval has been scaffolded but not yet scored.
- Existing `Done` (green): eval scored ≥ `eval_gate_threshold` (0.7 default).
- Existing `Failed` (red): eval scored < threshold; auto-rescue handler may re-spawn or escalate.
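
A minimal sketch of how the enum additions above might look. The variant names follow the spec, but the helper methods (`is_terminal`, `unblocks_dependents`, `ansi_color`) and the specific xterm 256-color codes are illustrative assumptions, not the actual contents of `src/graph.rs`.

```rust
/// Sketch of the task status enum including the proposed PendingEval variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Open,
    InProgress,
    /// Agent called `wg done`; eval is scaffolded but not yet scored.
    PendingEval,
    Done,
    Failed,
}

impl Status {
    /// In this sketch only Done and Failed end a task's lifecycle.
    /// PendingEval is deliberately non-terminal.
    pub fn is_terminal(self) -> bool {
        matches!(self, Status::Done | Status::Failed)
    }

    /// Only a real Done releases downstream tasks.
    pub fn unblocks_dependents(self) -> bool {
        self == Status::Done
    }

    /// Hypothetical xterm 256-color codes; the real renderer may use others.
    pub fn ansi_color(self) -> u8 {
        match self {
            Status::Open => 250,        // grey
            Status::InProgress => 226,  // yellow
            Status::PendingEval => 118, // chartreuse, between yellow and green
            Status::Done => 46,         // green
            Status::Failed => 196,      // red
        }
    }
}
```

Keeping `unblocks_dependents` separate from `is_terminal` makes the "soft done" intent explicit: a task can look nearly finished without releasing anything behind it.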
Behavior changes
- `wg done` from agent transitions to `PendingEval`, NOT `Done`. This is invisible to the agent (it still calls the same command), but the dispatcher's state machine now waits.
- `.evaluate-X` task runs as today (already auto-scheduled by auto_evaluate).
- On eval pass: dispatcher transitions task PendingEval → Done. Downstream tasks (`--after this`) become ready ONLY at this point.
- On eval fail (below `eval_gate_threshold`):
  - If `auto_rescue_on_eval_fail` (default true): task transitions PendingEval → InProgress (or back to Open with assignment cleared) and respawns with eval feedback as additional context. Counter `max_eval_rescues` (default 3) caps loops.
  - If `auto_rescue_on_eval_fail` false: task → Failed; downstream stays blocked until manual intervention.
- Visual (TUI / wg viz): PendingEval renders in a distinct color between yellow (in-progress) and green (done): chartreuse / light-green per user spec.
- Manual override: `wg approve <task>` (already exists) bypasses eval gate, transitions PendingEval → Done. `wg reject <task>` transitions PendingEval → Failed (or Open for retry, depending on flag).
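
The pass/fail branching described above can be sketched as a pure function. The config field names and defaults come from this spec; the `EvalGateConfig` struct, the `resolve_pending_eval` function, and the `rescues_so_far` counter are hypothetical names chosen for illustration, and the rescue branch picks InProgress rather than Open arbitrarily.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    InProgress,
    PendingEval,
    Done,
    Failed,
}

/// Hypothetical container for the three config keys named in the spec.
pub struct EvalGateConfig {
    pub eval_gate_threshold: f64,
    pub auto_rescue_on_eval_fail: bool,
    pub max_eval_rescues: u32,
}

impl Default for EvalGateConfig {
    fn default() -> Self {
        // Defaults as stated in the spec: 0.7, true, 3.
        Self {
            eval_gate_threshold: 0.7,
            auto_rescue_on_eval_fail: true,
            max_eval_rescues: 3,
        }
    }
}

/// Decide where a PendingEval task goes once its eval has been scored.
/// `rescues_so_far` counts earlier auto-rescues of the same task.
pub fn resolve_pending_eval(score: f64, rescues_so_far: u32, cfg: &EvalGateConfig) -> Status {
    if score >= cfg.eval_gate_threshold {
        // Eval pass: real Done, downstream may unblock.
        return Status::Done;
    }
    if cfg.auto_rescue_on_eval_fail && rescues_so_far < cfg.max_eval_rescues {
        // Eval fail with rescue budget left: respawn with eval feedback.
        Status::InProgress
    } else {
        // Rescue disabled or budget exhausted: stay blocked until a human steps in.
        Status::Failed
    }
}
```

With the defaults, a score of exactly 0.7 passes, and a task that has already been rescued three times goes to Failed on its next failing eval.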
Migration
- Existing `Done` tasks stay Done: the gate only applies to NEW `wg done` calls after this lands.
- Update `deprecate-pending-validation`'s migration logic: legacy PendingValidation now goes through the new gate (transitions to PendingEval, eval runs, then resolves). Only fully-resolved-already tasks go straight to Done.
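
One possible reading of the migration rule, as a sketch. The spec does not define "fully-resolved-already"; the code below assumes it means an eval has already passed for the task, represented by a hypothetical `eval_already_passed` flag. The function name `migrate_legacy` is likewise an assumption.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    PendingValidation,
    PendingEval,
    Done,
}

/// Map a status loaded from an existing graph onto the new state machine.
/// ASSUMPTION: "fully resolved" means the task's eval has already passed.
pub fn migrate_legacy(status: Status, eval_already_passed: bool) -> Status {
    match status {
        // Already resolved: nothing left to gate on.
        Status::PendingValidation if eval_already_passed => Status::Done,
        // Otherwise route the legacy state through the new eval gate.
        Status::PendingValidation => Status::PendingEval,
        // Existing Done (and anything else) is left untouched.
        other => other,
    }
}
```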
Files likely to touch
- `src/graph.rs`: Status enum, add PendingEval variant; dependency-unblock check in query.rs/coordinator.rs
- `src/commands/done.rs`: change wg done's terminal status from Done → PendingEval
- `src/commands/service/coordinator.rs`: add Phase that watches PendingEval tasks for eval completion and resolves them; auto-rescue handler
- `src/commands/viz/mod.rs`, `src/tui/viz_viewer/render.rs`: color rendering
- `src/commands/show.rs`, `src/commands/list.rs`: status display
- `src/agency/eval/` (or wherever eval-pass detection lives): emit a signal that flips PendingEval → Done
What stays
- `wg approve` / `wg reject` for human override
- Auto-rescue config (auto_rescue_on_eval_fail, max_eval_rescues, eval_gate_threshold)
- The .evaluate-X scaffolding mechanism
- The PendingValidation → Done migration on boot (legacy cleanup; PendingValidation as a status will eventually be deprecated entirely once no graphs reference it)
Validation
- Failing tests first:
  - test_wg_done_transitions_to_pending_eval: wg done from agent → status PendingEval (NOT Done)
  - test_dep_unblocks_after_eval_pass: Task B (--after A) becomes ready ONLY after .evaluate-A scores ≥ threshold
  - test_dep_stays_blocked_on_eval_fail: Task B stays blocked AND A re-spawned
  - test_max_eval_rescues_caps_to_failed: task that fails eval N times transitions to Failed
  - test_pending_eval_renders_in_distinct_color: viz/show output uses chartreuse/light-green code (not yellow, not green)
  - test_legacy_done_tasks_unchanged: existing Done tasks before this lands stay Done
- Implementation makes tests pass
- cargo build + cargo test pass with no regressions
- Manual smoke (HARD GATE):
  - Add task A and task B (`--after A`); publish both
  - A runs, agent calls wg done → wg show A reports Status: PendingEval (color: light green)
  - B is NOT yet ready (`wg ready` does not list it)
  - Eval scores A high → A becomes Done → B becomes ready and dispatches
  - Repeat with low-scoring A: A fails or auto-rescues; B stays blocked
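
The dependency tests and the smoke check above all hinge on one readiness rule: a task is ready only when every task it runs after is in real Done. A minimal sketch of that rule follows. The `Task` struct, its `after` field, and `is_ready` are hypothetical stand-ins, not the project's actual graph types, and treating a missing dependency as blocking is an assumption.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Open,
    InProgress,
    PendingEval,
    Done,
    Failed,
}

/// Hypothetical, simplified task record.
pub struct Task {
    pub status: Status,
    /// IDs of tasks this one must run after (the `--after` edges).
    pub after: Vec<String>,
}

/// A task is ready when it is Open and every dependency is Done.
/// PendingEval and Failed dependencies keep it blocked; so does an unknown ID.
pub fn is_ready(id: &str, graph: &HashMap<String, Task>) -> bool {
    match graph.get(id) {
        Some(task) => {
            task.status == Status::Open
                && task.after.iter().all(|dep| {
                    graph
                        .get(dep)
                        .map_or(false, |d| d.status == Status::Done)
                })
        }
        None => false,
    }
}
```

With A in PendingEval and B declared after A, `is_ready("b", &graph)` is false; it becomes true only once A's status is set to Done, which mirrors test_dep_unblocks_after_eval_pass.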
Depends on
Required by
- (none)
Log
- 2026-04-27T13:44:47.900607722+00:00 Task paused
- 2026-04-27T13:44:57.111820784+00:00 Task published
- 2026-04-27T13:45:17.099539085+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=graph, reason=Careful Programmer best matches this core state-machine implementation requiring comprehensive TDD, careful dependency-unblock logic, and extensive test coverage across multiple subsystems.
- 2026-04-27T13:45:18.007367696+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-27T13:45:36.169891060+00:00 Starting: exploring codebase to understand current state machine — Status enum, wg done flow, evaluation pipeline
- 2026-04-27T13:48:38.954438623+00:00 Implementation plan: 1) Add PendingEval variant to Status (non-terminal). 2) wg done transitions to PendingEval if eval is scheduled, else Done. 3) Eval pass → PendingEval→Done. 4) Eval fail uses existing auto_rescue path. 5) Color rendering. Writing failing tests first.
- 2026-04-27T14:40:37.206807919+00:00 Implementation complete. Validated: 11 new + 11 existing pending-validation tests pass. 2911 unit + integration tests pass. Manual smoke: wg done -> PendingEval, system tasks (.flip/.evaluate) bypass and run, dispatcher tick promotes PendingEval -> Done after eval pass, downstream task-b unblocks. wg approve and wg fail accept PendingEval.
- 2026-04-27T14:42:42.096144952+00:00 Committed: fdc36e0a1 — pushed to remote
- 2026-04-27T14:42:54.880149439+00:00 Task marked as done