design-failed-pending

Design: failed-pending-eval state — let evaluator rescue tasks where agent exited without wg done

Metadata

Statusdone
Assignedagent-1143
Agent identity3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Modelclaude:opus
Created2026-04-29T17:11:40.793519862+00:00
Started2026-04-29T17:17:45.242633372+00:00
Completed2026-04-29T17:22:40.678184189+00:00
Tagspriority-high,design,agency,state-machine, eval-scheduled
Eval score0.76
└ blocking impact0.95
└ completeness0.85
└ constraint fidelity0.85
└ coordination overhead0.85
└ correctness0.90
└ downstream usability0.90
└ efficiency0.90
└ intent fidelity0.79
└ style adherence0.95

Description

Description

User report 2026-04-29: in ~/autohaiku, a codex worker did its work but exited without calling wg done, so the task was marked failed. The evaluator then ran, looked at the output, and judged it acceptable. But the task remained in failed state — the eval verdict couldn't rescue it. This blocked the haiku loop's cycle progression.

User quote: 'the agent didn't say wg done so it failed then the evaluation said no this is good. what if we had a kind of orange state to show when the failure was pending eval and then it could go red or green if it was ok? like a yellower-red than we have to match the yellow-green for the other direction. but also architecturally. it should change the state.'

Goal

Decide the state machine semantics for 'agent exited without wg done but evaluator approves the work' and the visual treatment for the in-flight pending state.

Forks to resolve

Fork 1: Should failed-without-wg-done auto-trigger eval?

  • A) Yes, always — failed becomes a non-terminal state that goes through pending-eval exactly like done does. Eval verdict transitions to done (rescued) or stays failed (confirmed).
  • B) Only when the failure mode is 'agent exited without wg done' — distinguish from other failure types (cargo build fail, signal kill, OOM). Crash-class failures stay terminal; missing-wg-done failures get the rescue path.
  • C) Opt-in per task: --allow-eval-rescue flag. Default off (current behavior); on for cycle/loop tasks where the eval is the real success criterion.

Option B is probably right — distinguishes 'agent flaked but produced output' from 'task crashed before producing anything'. Option A is simpler but rescues things that shouldn't be rescued (e.g., a cargo build failure with a passing test eval — the build is still broken).

Fork 2: Eval threshold for rescue

  • Same threshold as confirm-done eval (current behavior for done tasks)?
  • Higher threshold (more conservative — must be CLEARLY good, not borderline)?
  • Or 'borderline = stay pending for human review' (introduces new manual-review state)?

Fork 3: Visual color treatment

User's proposal: orange (yellow-red) for failed-pending-eval, mirroring the yellow-green we have for done-pending-eval. After verdict:

  • Eval passes → green (Done)
  • Eval fails → red (Failed)

Color symmetry:

done-pending-eval (yellow-green)  → done (green)       OR ← review needed for failed below
failed-pending-eval (yellow-red)   → done (green-rescued)  OR  failed (red)

Confirm: should the rescued-to-done state visually distinguish from regular done? (Slight tint? Same color but with a 'rescued' badge?)

Fork 4: Cycle interaction

For cycle tasks: if iteration N failed then rescued-to-done, does iteration N+1 dispatch normally? Probably yes (same as if iteration N had cleanly done'd). Confirm and document.

Fork 5: Existing PendingEval state

add-pendingeval-state (recent commit) introduced PendingEval. Does this fix re-use that state with a flag (PendingEval { from: Done } vs PendingEval { from: Failed }) or add a second state (PendingEvalAfterFailure)?

Deliverable

Design doc posted via wg log resolving all five forks with rationale. Hand off to implementation with:

  • State diagram (boxes + arrows) for the new transitions
  • Schema changes if any (graph.jsonl status enum extensions)
  • Color values matching the user's yellow-red proposal — actual RGB triples consistent with the existing TUI palette
  • Smoke scenarios needed to gate the implementation

Validation

  • All five forks resolved with rationale
  • State diagram in task log
  • RGB values for new color states (consistent with existing palette in src/tui/viz_viewer/state.rs)
  • Repro scenario: an agent exits non-zero without wg done, output exists, evaluator runs and approves — task transitions to done
  • Counter-repro: an agent exits non-zero without wg done, output is bad, evaluator rejects — task stays failed
  • No source modifications — design only

Depends on

Required by

Log