add-pendingeval-state

Add PendingEval state + make .evaluate-X the dependency-unblock gate

Metadata

Statusdone
Assignedagent-685
Agent identityf51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created2026-04-27T13:44:47.924051900+00:00
Started2026-04-27T13:45:18.007361244+00:00
Completed2026-04-27T14:42:54.880143898+00:00
Tagseval-scheduled
Eval score0.75
└ blocking impact0.85
└ completeness0.72
└ constraint fidelity0.85
└ coordination overhead0.80
└ correctness0.80
└ downstream usability0.75
└ efficiency0.85
└ intent fidelity0.76
└ style adherence0.85

Description

Description

deprecate-pending-validation only shipped half the spec. Auto-migration of old PendingValidation → Done landed, but the intermediate eval-gated state and dep-unblock gating did NOT.

User's verbatim mental model:

'soft done state where the agent is like, yo, I think I'm done. But then that converts... light green, like between yellow and green. And then when the evaluation actually can run and verify completion, if its grade goes above .7, then it should flip to real green and things behind it should be possible. But if it fails, it should go to red and go into this retry triage type thing.'

Required state machine

open → in-progress → pending-eval ─┬─ eval pass → done   → downstream unblocks
                                   └─ eval fail → failed → triage / auto-rescue

Status enum additions:

  • PendingEval (light green / chartreuse): agent called wg done; eval has been scaffolded but not yet scored.
  • Existing Done (green): eval scored ≥ eval_gate_threshold (0.7 default).
  • Existing Failed (red): eval scored < threshold; auto-rescue handler may re-spawn or escalate.

Behavior changes

  1. wg done from agent transitions to PendingEval, NOT Done. This is invisible to the agent — it still calls the same command — but the dispatcher's state machine now waits.
  2. .evaluate-X task runs as today (already auto-scheduled by auto_evaluate).
  3. On eval pass: dispatcher transitions task PendingEval → Done. Downstream tasks (--after this) become ready ONLY at this point.
  4. On eval fail (below eval_gate_threshold):
    • If auto_rescue_on_eval_fail (default true): task transitions PendingEval → InProgress (or back to Open with assignment cleared) and respawn with eval feedback as additional context. Counter max_eval_rescues (default 3) caps loops.
    • If auto_rescue_on_eval_fail false: task → Failed; downstream stays blocked until manual intervention.
  5. Visual (TUI / wg viz): PendingEval renders in a distinct color between yellow (in-progress) and green (done) — chartreuse / light-green per user spec.
  6. Manual override: wg approve <task> (already exists) bypasses eval gate, transitions PendingEval → Done. wg reject <task> transitions PendingEval → Failed (or Open for retry, depending on flag).

Migration

  • Existing Done tasks stay Done — the gate only applies to NEW wg done calls after this lands.
  • Update deprecate-pending-validation's migration logic: legacy PendingValidation now goes through the new gate (transitions to PendingEval, eval runs, then resolves). Only fully-resolved-already tasks go straight to Done.

Files likely to touch

  • src/graph.rs — Status enum: add PendingEval variant; dependency-unblock check in query.rs/coordinator.rs
  • src/commands/done.rs — change wg done's terminal status from Done → PendingEval
  • src/commands/service/coordinator.rs — add Phase that watches PendingEval tasks for eval completion and resolves them; auto-rescue handler
  • src/commands/viz/mod.rs, src/tui/viz_viewer/render.rs — color rendering
  • src/commands/show.rs, src/commands/list.rs — status display
  • src/agency/eval/ (or wherever eval-pass detection lives) — emit a signal that flips PendingEval → Done

What stays

  • wg approve / wg reject for human override
  • Auto-rescue config (auto_rescue_on_eval_fail, max_eval_rescues, eval_gate_threshold)
  • The .evaluate-X scaffolding mechanism
  • The PendingValidation → Done migration on boot (legacy cleanup; PendingValidation as a status will eventually be deprecated entirely once no graphs reference it)

Validation

  • Failing tests first:
    • test_wg_done_transitions_to_pending_eval — wg done from agent → status PendingEval (NOT Done)
    • test_dep_unblocks_after_eval_pass — Task B (--after A) becomes ready ONLY after .evaluate-A scores ≥ threshold
    • test_dep_stays_blocked_on_eval_fail — Task B stays blocked AND A re-spawned
    • test_max_eval_rescues_caps_to_failed — task that fails eval N times transitions to Failed
    • test_pending_eval_renders_in_distinct_color — viz/show output uses chartreuse/light-green code (not yellow, not green)
    • test_legacy_done_tasks_unchanged — existing Done tasks before this lands stay Done
  • Implementation makes tests pass
  • cargo build + cargo test pass with no regressions
  • Manual smoke (HARD GATE):
    • Add task A and task B (--after A); publish both
    • A runs, agent calls wg done → wg show A reports Status: PendingEval (color: light green)
    • B is NOT yet ready (wg ready does not list it)
    • Eval scores A high → A becomes Done → B becomes ready and dispatches
    • Repeat with low-scoring A: A fails or auto-rescues; B stays blocked

Depends on

Required by

Messages 2 messages (2 unread)

  1. #1user2026-04-27T13:56:01.254263031+00:00read
    Architectural clarification from user: when eval gate fails and auto_rescue fires, the rescue MUST reuse the SAME agent identity AND the SAME worktree. NOT a fresh agent in a fresh worktree.
    
    User quote: 'the failed gate, it should result in a retry, but again without destruction of the particular agent. Like we should regenerate that agent so that it has the same work tree and so on. it's just another iteration, right?'
    
    So:
    - PendingEval → eval fail → status transitions to InProgress (or similar resumable state) AGAIN with same task.agent (same role+tradeoff identity hash) AND same worktree dir
    - The eval feedback (low-scoring dimensions, evaluator notes) is appended to the agent's prompt context as 'prior iteration eval feedback'
    - The agent picks up where it left off, addresses the gap, calls wg done again, eval re-runs
    - Counter `max_eval_rescues` (default 3) caps the loop; on cap, transitions to Failed (now triage)
    
    This converges with worktree-retention-don (don't reap until eval+merge): both tasks together produce the proper resumable iteration loop. NEITHER touches agent identity, NEITHER reaps the worktree, until the eval-pass + merge actually completes.
    
    Same iteration semantic as a chat session reattaching: pick up the prior state, continue from there.
  2. #2add-pendingeval-state2026-04-27T14:41:02.134892529+00:00read
    Acknowledged. The architecture clarification is significant and changes the eval-fail path: instead of fresh-agent+fresh-worktree rescue, it needs same-identity in-place iteration with eval feedback in context, capped by max_eval_rescues. My current implementation uses the EXISTING rescue path (which spawns a fresh worker via super::rescue::run). The new direction requires retiring that for the eval-gate path and wiring a 'reset-and-resume' transition. Plan: land the PendingEval state machine + colors + dep gating + dispatcher resolution as committed (the eval-PASS path is correct and complete), then open a follow-up task 'in-place eval-fail iteration' that (a) on eval fail transitions PendingEval -> Open with assigned/agent retained, (b) increments task.rescue_count, (c) appends evaluator notes to next-spawn previous-attempt context, (d) preserves worktree (depends on worktree-retention-don). This keeps THIS PR focused and surgical while pinning the iteration-loop work as its own task.

Log