Metadata
| Status | done |
|---|---|
| Assigned | agent-1143 |
| Agent identity | 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3 |
| Model | claude:opus |
| Created | 2026-04-29T17:11:40.793519862+00:00 |
| Started | 2026-04-29T17:17:45.242633372+00:00 |
| Completed | 2026-04-29T17:22:40.678184189+00:00 |
| Tags | priority-high,design,agency,state-machine, eval-scheduled |
| Eval score | 0.76 |
| └ blocking impact | 0.95 |
| └ completeness | 0.85 |
| └ constraint fidelity | 0.85 |
| └ coordination overhead | 0.85 |
| └ correctness | 0.90 |
| └ downstream usability | 0.90 |
| └ efficiency | 0.90 |
| └ intent fidelity | 0.79 |
| └ style adherence | 0.95 |
Description
Description
User report 2026-04-29: in ~/autohaiku, a codex worker did its work but exited without calling wg done, so the task was marked failed. The evaluator then ran, looked at the output, and judged it acceptable. But the task remained in failed state — the eval verdict couldn't rescue it. This blocked the haiku loop's cycle progression.
User quote: 'the agent didn't say wg done so it failed then the evaluation said no this is good. what if we had a kind of orange state to show when the failure was pending eval and then it could go red or green if it was ok? like a yellower-red than we have to match the yellow-green for the other direction. but also architecturally. it should change the state.'
Goal
Decide the state machine semantics for 'agent exited without wg done but evaluator approves the work' and the visual treatment for the in-flight pending state.
Forks to resolve
Fork 1: Should failed-without-wg-done auto-trigger eval?
- A) Yes, always —
failedbecomes a non-terminal state that goes throughpending-evalexactly likedonedoes. Eval verdict transitions todone(rescued) or staysfailed(confirmed). - B) Only when the failure mode is 'agent exited without wg done' — distinguish from other failure types (cargo build fail, signal kill, OOM). Crash-class failures stay terminal; missing-wg-done failures get the rescue path.
- C) Opt-in per task:
--allow-eval-rescueflag. Default off (current behavior); on for cycle/loop tasks where the eval is the real success criterion.
Option B is probably right — distinguishes 'agent flaked but produced output' from 'task crashed before producing anything'. Option A is simpler but rescues things that shouldn't be rescued (e.g., a cargo build failure with a passing test eval — the build is still broken).
Fork 2: Eval threshold for rescue
- Same threshold as confirm-done eval (current behavior for done tasks)?
- Higher threshold (more conservative — must be CLEARLY good, not borderline)?
- Or 'borderline = stay pending for human review' (introduces new manual-review state)?
Fork 3: Visual color treatment
User's proposal: orange (yellow-red) for failed-pending-eval, mirroring the yellow-green we have for done-pending-eval. After verdict:
- Eval passes → green (Done)
- Eval fails → red (Failed)
Color symmetry:
done-pending-eval (yellow-green) → done (green) OR ← review needed for failed below
failed-pending-eval (yellow-red) → done (green-rescued) OR failed (red)
Confirm: should the rescued-to-done state visually distinguish from regular done? (Slight tint? Same color but with a 'rescued' badge?)
Fork 4: Cycle interaction
For cycle tasks: if iteration N failed then rescued-to-done, does iteration N+1 dispatch normally? Probably yes (same as if iteration N had cleanly done'd). Confirm and document.
Fork 5: Existing PendingEval state
add-pendingeval-state (recent commit) introduced PendingEval. Does this fix re-use that state with a flag (PendingEval { from: Done } vs PendingEval { from: Failed }) or add a second state (PendingEvalAfterFailure)?
Deliverable
Design doc posted via wg log resolving all five forks with rationale. Hand off to implementation with:
- State diagram (boxes + arrows) for the new transitions
- Schema changes if any (graph.jsonl status enum extensions)
- Color values matching the user's yellow-red proposal — actual RGB triples consistent with the existing TUI palette
- Smoke scenarios needed to gate the implementation
Validation
- All five forks resolved with rationale
- State diagram in task log
- RGB values for new color states (consistent with existing palette in src/tui/viz_viewer/state.rs)
- Repro scenario: an agent exits non-zero without wg done, output exists, evaluator runs and approves — task transitions to done
- Counter-repro: an agent exits non-zero without wg done, output is bad, evaluator rejects — task stays failed
- No source modifications — design only
Depends on
Required by
Log
- 2026-04-29T17:11:40.761967505+00:00 Task paused
- 2026-04-29T17:12:43.030878746+00:00 USER REFINEMENT 2026-04-29: Fork 1 resolved by user direct guidance: User quote 1: 'from failing to not if the eval is good!' User quote 2: 'unless the agent itself said wg fail' Resolved Fork 1 → **Option B-prime: distinguish IMPLICIT failure (agent exited without calling either wg done OR wg fail) from EXPLICIT failure (agent called wg fail).** State transitions: - Agent calls `wg done` → goes through eval like today (confirmed-done OR demoted-to-failed if eval rejects) - Agent calls `wg fail` → **terminal failed, no eval rescue.** Agent has admitted defeat; trust the agent. - Agent exits non-zero without calling either → enters new failed-pending-eval state, evaluator runs, eval verdict transitions to either: - rescued-done (if eval approves) — task is done, downstream proceeds - confirmed-failed (if eval rejects) — task is terminally failed This subsumes Option A (rescue everything) and Option C (opt-in flag). Cleaner: the agent's own intentional signal (`wg fail`) is the terminal-failure trigger; implicit failure is the rescue-eligible path. Remaining forks for the design agent to resolve: - Fork 2: eval threshold for rescue (same as confirm-done? higher? human-review band?) - Fork 3: color treatment — orange/yellow-red for failed-pending-eval; rescued-done visually distinguished from clean-done? - Fork 4: cycle interaction — rescued-done iteration N unblocks N+1 normally - Fork 5: PendingEval state reuse vs separate PendingEvalAfterFailure variant
- 2026-04-29T17:12:46.935077870+00:00 Task published
- 2026-04-29T17:13:17.683376026+00:00 Lightweight assignment: agent=Default Evaluator (31847164), exec_mode=light, context_scope=task, reason=Deep expertise in evaluation system and state transitions; best positioned to design state machine for how failed states interact with evaluation verdicts
- 2026-04-29T17:16:13.318268513+00:00 USER REFINEMENT 2026-04-29: corner case for the rescue path — what if the eval/FLIP itself fails? User quote: 'if the .flip fails... pending fail is actual fail b/c it's possibly infrastructural? is there a circuit breaker?' ADDITIONAL FORKS the design must resolve: ### Fork 6: meta-task failure during rescue eval A task in failed-pending-eval state needs the .evaluate-X / .flip-X meta-tasks to produce a verdict. If those meta-tasks themselves fail (API timeout, network, OOM, rate limit, etc.), the original task is stuck. User's instinct: meta-task fail → original task lands in confirmed-failed (terminal). Can't evaluate, don't pretend the rescue path is alive. Better fast-fail than indefinite limbo. Subforks: - 6a) Hard rule: any .evaluate-X / .flip-X failure → original task = failed-confirmed (no retry) - 6b) Bounded retry: meta-task fails N times, THEN promote to failed-confirmed (handles transient API blips) - 6c) Distinguish failure classes: API/network errors retry (transient); panics/parse-errors fail-fast (deterministic) Recommend 6b with N=2 — one retry covers transient blips without indefinite stuck state. ### Fork 7: circuit breaker for systemic agency-infra failure If MANY rescue-evals are failing (e.g., the agency provider is down, API key invalid, model deprecated mid-run), don't keep firing agents into the void. Surface the systemic problem. Proposed semantics: - Dispatcher tracks rolling .evaluate-* / .flip-* failure rate - Threshold: e.g., 5 consecutive eval-task failures within 5 minutes - Tripped state: 'agency infrastructure circuit broken' — pending-eval tasks remain pending (NOT promoted to confirmed-failed); dispatcher stops scheduling new .evaluate-X / .flip-X tasks; warning surfaced in `wg status` and TUI status bar - Reset: manual via `wg agency reset-breaker` after the operator fixes whatever (api key, network, etc.) - Auto-reset: after M successful eval-tasks (e.g., 3 in a row), breaker auto-clears Subforks: - 7a) What counts toward the threshold? All meta-task failures, or only specific failure modes (API errors, not parse errors)? - 7b) When breaker is tripped, what happens to NEW done/failed-pending-eval tasks? Pile up in pending? Auto-confirm-done if no other signal? Configurable? - 7c) UI: how does the user discover the breaker is tripped? Status bar? Big red banner? Notification? Recommend: 5/5min threshold, only API-class failures count, broker-tripped tasks remain in pending-eval (don't auto-promote), surface in `wg status` + TUI status bar with the reset command in the hint, auto-reset after 3 consecutive successes. The design doc must resolve forks 6 and 7 in addition to the original 1-5.
- 2026-04-29T17:17:45.242637359+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-29T17:18:08.208858083+00:00 Starting design — Fork 1 already resolved (B-prime: distinguish implicit/explicit failure). Resolving Forks 2-7 grounded in actual code: src/tui/viz_viewer/state.rs (palette), graph.rs (Status enum + PendingEval), eval threshold logic.
- 2026-04-29T17:20:26.150412013+00:00 # DESIGN DOC: failed-pending-eval state — implicit-failure rescue path ## (1/3) Forks 1-3 + state diagram ## Fork 1 — RESOLVED EARLIER (B-prime, by user) Implicit failure (agent exited without calling either `wg done` or `wg fail`) takes the rescue path; explicit `wg fail` is terminal. **Refinement:** rescue path also requires the wrapper-recorded `failure_class` to be `agent-exit-nonzero`. Other classes (`agent-hard-timeout`, `api-error-400-document`, `api-error-429-rate-limit`, `api-error-5xx-transient`, `wrapper-internal`) stay terminal — they signal infrastructure breakage, not "agent flaked but produced output". The classification taxonomy already exists in `src/graph.rs:140` (FailureClass enum), so this layers on cleanly with no new schema. Implicit-failure entry condition (all required): 1. Agent exited with a non-success status from the wrapper's perspective, AND 2. The agent did NOT call `wg done` or `wg fail` for its task, AND 3. `failure_class == AgentExitNonzero` (or `None` for legacy rows — treat as eligible for forward compat). Rationale: condition 3 keeps the rescue path narrowly scoped to the user-reported failure mode (codex worker silently exits 0/1 with valid output but no `wg done`). Without it, every API timeout would queue an evaluator and waste tokens. ## Fork 2 — Eval threshold for rescue: SAME threshold, asymmetric default **Resolution:** reuse `agency.eval_gate_threshold` (default 0.7). No new config knob. But the *default verdict* on missing-or-ambiguous score is **inverted** vs PendingEval: - `PendingEval` (post-`wg done`): default = pass. Promote to Done unless `check_eval_gate` flips to Failed. - `FailedPendingEval` (post-implicit-fail): default = fail. Promote to Done **only** if a recorded score ≥ threshold exists. No score, ambiguous score, or missing eval task → stay Failed (terminal). Rationale: - Same numeric threshold keeps mental model simple — there is one quality bar in the project. - Inverted default reflects the asymmetric prior: a `wg done` is a positive signal from the agent (deserves benefit of doubt); an implicit failure is a negative signal (must be affirmatively overturned). - Rejected the "human-review band" option: introduces a manual-intervention state that nothing currently dispatches against; would break unattended cycle progress. - Rejected "higher threshold for rescue": symmetry of threshold is the point; if 0.7 isn't strict enough, the operator raises the global gate. ## Fork 3 — Color treatment: yellow-red, no rescued-tint **Resolution:** add one new color cell; do NOT add a new color for rescued-done. Existing palette in `src/tui/viz_viewer/state.rs:286-298` (function `flash_color_for_status`): ``` Done (80, 220, 100) green Failed (220, 60, 60) red InProgress (60, 200, 220) cyan Open (200, 200, 80) yellow Blocked (180, 120, 60) muted orange Abandoned (140, 100, 160) muted purple Waiting / PV (60, 160, 220) blue PendingEval (140, 230, 80) chartreuse ← yellow⊕green midpoint Incomplete (255, 165, 0) saturated orange ``` New cell: ``` FailedPendingEval (210, 130, 70) warm coral ← yellow⊕red midpoint ``` Derivation: PendingEval = componentwise mean of yellow(200,200,80) and green(80,220,100) = (140,210,90), rounded to existing (140,230,80). Mirror it for failed-pending: mean of yellow(200,200,80) and red(220,60,60) = (210,130,70). The result is distinct from: - `Blocked (180,120,60)` — Blocked is darker/duller; FailedPendingEval is brighter and pinker. - `Incomplete (255,165,0)` — Incomplete is a saturated orange; FailedPendingEval is a desaturated coral. Eyeballing in 8-bit RGB space (Δ ≥ 25 on each axis from Blocked, Δ ≥ 35 from Incomplete) — adequate visual separation for the TUI palette. **Rescued-done visual:** stays Done green `(80,220,100)`. Rescue is metadata, not a state. Implementation should set a `rescued: true` boolean on the task and let the renderer optionally show a small "↻" badge in detail panes (`wg show`, `wg list -v`, TUI detail panel) — NOT in the viz node color, which is overloaded enough already. The state machine treats rescued-done identically to done. ## State diagram ``` ┌──────────────────────────────────┐ │ in-progress │ └──────┬──────┬──────┬─────────────┘ │ │ │ wg done │ │ wg fail │ │ │ (explicit; │ │ │ terminal) ▼ │ ▼ ┌──────────────────┐ ┌──────────┐ │ pending-eval │ │ failed │ ◄── terminal │ (chartreuse) │ └──────────┘ │ yellow⊕green │ └─────┬──────┬─────┘ eval ≥ thresh │ │ eval < thresh (or missing) │ │ (auto-rescue spawns │ │ replacement task) ▼ ▼ ┌────────┐ ┌──────────┐ │ done │ │ failed │ └────────┘ └──────────┘ ┌──────────────────────────────────────────────────────────┐ │ in-progress │ └────┬─────────────────────────────────────────────────────┘ │ │ agent exit, no `wg done` AND no `wg fail`, │ failure_class = agent-exit-nonzero ▼ ┌─────────────────────────────────┐ │ failed-pending-eval │ │ (warm coral 210,130,70) │ │ yellow⊕red │ └─────┬──────────┬────────────────┘ │ │ eval ≥ thresh eval < thresh, OR (recorded) no score recorded, OR .evaluate-* terminal-failed N>1 retries │ │ ▼ ▼ ┌────────┐ ┌────────┐ │ done │ │ failed │ │ rescued│ │ │ │ = true │ │ │ └────────┘ └────────┘
- 2026-04-29T17:21:26.464835315+00:00 # DESIGN DOC (2/3): Forks 4-7 ## Fork 4 — Cycle interaction: rescued-done is indistinguishable from done **Resolution:** rescued-done is a regular Done for cycle-progression purposes. Iteration N rescued-done → iteration N+1 dispatches normally. Concretely: `Status::is_dep_satisfied()` (currently `matches!(self, Done | Abandoned)`) does NOT need to change — rescued-done IS Done. The cycle controller in `src/commands/service/coordinator.rs` checks `Status::is_terminal()` and `is_dep_satisfied()` against the *status enum*, not the rescued bit, so the iteration-reset path (all members hit Done → reset to Open + bump `loop_iteration`) works unmodified. The rescued bit is recorded on the task as a forensic marker and a TUI/show affordance, not a state-machine input. Smoke scenario must verify: a 2-iteration cycle where iteration N goes implicit-fail → rescued-done → iteration N+1 spawns normally. (Listed in section 3 below.) ## Fork 5 — Schema: NEW `failed-pending-eval` variant, do NOT reuse PendingEval **Resolution:** add `Status::FailedPendingEval` (kebab `failed-pending-eval`) to the `Status` enum at `src/graph.rs:162`. Do not bolt a "from: Done|Failed" tag onto PendingEval. Why a separate variant beats reusing PendingEval with a flag: 1. **Different default verdict** (Fork 2). PendingEval defaults to pass; FailedPendingEval defaults to fail. The dispatcher's `resolve_pending_eval_tasks` (`src/commands/service/coordinator.rs:883`) currently encodes the optimistic default by promoting any PendingEval whose `.evaluate-*` terminated to Done. Layering an inverse default behind the same variant requires conditional branches that obscure intent. A second variant lets each branch state its policy in its own resolver function. 2. **Different terminal transition.** PendingEval → Failed already triggers `auto_rescue_on_eval_fail` (creates a rescue task). FailedPendingEval → Failed must NOT trigger rescue spawn — the original task already used its rescue chance. Two variants make this trivial; one variant means callers must pattern-match on the from-tag everywhere. 3. **Kebab self-documents.** `"status": "failed-pending-eval"` in `graph.jsonl` reads exactly like the user's mental model. `"pending-eval"` with `"prior_terminal": "failed"` is one indirection away from clarity. 4. **Color mapping is total.** `flash_color_for_status` is a `match` on the enum; new variant = compiler-enforced exhaustive update (also forces every other matcher in the codebase to acknowledge the new state — see `Status` matchers: dep satisfaction, active set, terminal, display, deserialize, etc.). 5. **Migration cost is one match arm per call site, not a refactor of the resolver.** Schema impact: serde rename `kebab-case` already in place at `graph.rs:161`, so the variant gets the right wire string for free. Legacy rows: deserialize already errors on unknown variants; we do not need a back-compat alias because the user follows the "skip back-compat ceremony" rule. Schema delta in `src/graph.rs`: ```rust pub enum Status { Open, InProgress, Waiting, Done, Blocked, Failed, Abandoned, PendingValidation, PendingEval, /// Soft-failed: agent exited with failure_class=AgentExitNonzero /// without calling `wg done` or `wg fail`. The dispatcher invokes /// `.evaluate-X`; on a recorded score >= eval_gate_threshold the /// task is rescued (-> Done with rescued=true), otherwise it /// transitions to Failed (terminal, no auto-rescue spawn). FailedPendingEval, Incomplete, } ``` Plus a `rescued: bool` (default false) field on `Task` to mark the recovery for downstream observers (TUI, html, evolution stats). Field defaults via serde so legacy rows deserialize cleanly. `is_terminal` → false. `is_active` → true (it's a post-work gate, like PendingEval). `is_dep_satisfied` → false (downstream must wait for the verdict, not race ahead). ## Fork 6 — Meta-task failure during rescue eval: 6b, bounded retry then fail-closed **Resolution:** if `.evaluate-X` (or `.flip-X`, when FLIP gates are wired in) terminates without a usable score, retry once. If the retry also produces no score, the source task lands in **Failed** (terminal). No second-order rescue. Concretely: - `resolve_failed_pending_eval_tasks` (new resolver, sibling of `resolve_pending_eval_tasks`) runs every dispatcher tick. - For each task in `FailedPendingEval`: - Look up `.evaluate-<id>`. - If status is `Open / InProgress / Waiting / PendingEval / PendingValidation` → keep waiting. - If terminal Done with a recorded score in the eval task's `eval_score` field: - score ≥ `eval_gate_threshold` → promote source to `Done` + set `rescued = true` + log the score. - score < threshold → demote source to `Failed` (terminal); set `failed_reason = "eval rescue rejected: score=X.XX < threshold=0.70"`. - If terminal but **no usable score** (eval task itself failed, errored, or scored None): - Increment a counter `meta_eval_attempts` on the source task (default 0; persisted on the task model). - If `meta_eval_attempts < 2`: re-spawn `.evaluate-<id>` (delete-and-recreate or reset to Open) and log "rescue eval retry attempt N". - If `meta_eval_attempts >= 2`: source → `Failed` with `failed_reason = "rescue eval unavailable after 2 attempts; falling back to terminal failure"`. Rationale: - 6a (no retry) is too brittle: every transient API blip costs a real task. - 6c (failure-class-aware retry) is right but adds a config surface this design isn't ready to spec — the failure-class taxonomy on meta-tasks isn't recorded today (FailureClass is set on real tasks; .evaluate-* meta-tasks just go to Failed). 6b is a strict subset of 6c that we can ship now and refine later. - Retry budget of 2 (one retry) is the existing `default_gate_max_attempts` value (`src/config.rs:2559`) — reuse the constant; do not introduce a new knob. - Fail-closed on exhaustion is the conservative default that matches "implicit failure is a negative prior" (Fork 2). The user's instinct ("pending fail is actual fail") is encoded directly. ## Fork 7 — Circuit breaker: DEFER, sketch only, do NOT implement now **Resolution:** the breaker is the right idea but premature for this PR. Ship Forks 1-6 first, observe the actual rate of meta-task failures in practice (instrument via `failed_reason` counts), then add the breaker as a follow-up task with real signal. Reasons to defer: - We don't yet know the empirical failure rate. Picking 5/5min out of thin air will either over- or under-trigger. - Implementing the breaker requires touching the dispatcher's scheduling loop AND the agency dispatch path AND the TUI status-bar rendering AND a CLI reset command — a separate cross-cutting feature that deserves its own design + smoke gate. - Without a breaker, the worst case after Fork 6 is: each implicit-failed task burns 2 eval attempts then lands in Failed. That is bounded and observable; the operator can grep `wg list --status failed | grep "rescue eval unavailable"`. - The circuit-breaker design doc lives in this same task log so the follow-up has a starting point — see sketch below. ### Sketch for the follow-up (do not implement here) - 7a (what counts): only failures attributable to API-class errors (existing `FailureClass` ApiError* variants — needs to be propagated onto meta-tasks, currently isn't). Parse-errors / panics do not count toward the breaker — they are deterministic bugs to fix, not noise to absorb. - 7b (when tripped): new `FailedPendingEval` and `PendingEval` tasks remain pending; dispatcher stops scheduling new `.evaluate-*` and `.flip-*` tasks; existing in-flight ones run to completion. Tasks pile up in pending — they are NOT auto-promoted in either direction. The point of the breaker is "stop the bleeding"; auto-completion would defeat that. - 7c (UI): `wg status` adds a section "Agency breaker: TRIPPED — N evals failed in M minutes; reset with `wg agency reset-breaker`". TUI status bar shows a red "⚠ AGENCY" badge. No pop-up modal. - Auto-reset: 3 consecutive successful eval-tasks → breaker auto-clears, log entry recorded. - Manual reset: `wg agency reset-breaker` for operator override. Follow-up task placeholder (the implementation task can spawn this on its own when shipping is done): `wg add "Implement agency circuit breaker" --after implement-failed-pending`.
- 2026-04-29T17:22:20.308602747+00:00 # DESIGN DOC (3/3): Schema delta summary, smoke scenarios, hand-off ## Schema delta summary (graph.jsonl + Rust types) `src/graph.rs`: 1. Add `Status::FailedPendingEval` variant. Update Display, Deserialize, is_terminal (false), is_active (true), is_dep_satisfied (false). 2. On `Task` struct add: ```rust #[serde(default, skip_serializing_if = "std::ops::Not::not")] pub rescued: bool, #[serde(default, skip_serializing_if = "is_zero_u32")] pub meta_eval_attempts: u32, ``` Both default to false/0; legacy rows deserialize unchanged. `src/tui/viz_viewer/state.rs`: 1. `flash_color_for_status` gets a new arm: `Status::FailedPendingEval => (210, 130, 70)`. 2. Wherever `Status::PendingEval` is treated as "active / in-flight" (search results above: lines 4439, 6626, 6637, 7611-7617, 8201, 8889-8895, 21404+), add `Status::FailedPendingEval` to the same arm. Compiler will surface the rest via exhaustive match. `src/commands/done.rs`: - `pick_done_target_status` is unchanged — it only handles the post-`wg done` path. The implicit-failure path is wrapper-side (see below). `src/commands/spawn/execution.rs` (or whichever wrapper assigns failure_class today — grep result showed it's reachable from there): - When the agent process exits and the task is still in `InProgress`, classify the failure (already done — sets `Status::Failed` + `failure_class`). Add a branch: if `failure_class == AgentExitNonzero` AND `agency.auto_evaluate` is on AND the task's `.evaluate-X` exists or can be spawned → set `Status::FailedPendingEval` instead of `Status::Failed` and ensure `.evaluate-X` is queued/exists. Otherwise keep current Failed behavior. `src/commands/fail.rs`: - Already correctly handles explicit `wg fail` from PendingEval (line 105 comment). It must also reject the FailedPendingEval state with a helpful error: an agent should not be in a position to call `wg fail` on a task in FailedPendingEval (the agent already exited). If the operator calls it manually, allow it as a forced terminal-fail. (Single match arm, low risk.) `src/commands/service/coordinator.rs`: - Add `resolve_failed_pending_eval_tasks(graph, config)` next to `resolve_pending_eval_tasks` (line 883). Implements the bounded-retry + fail-closed logic from Fork 6. - Wire it into the dispatcher tick alongside `resolve_pending_eval_tasks` (line 4256). ## Smoke scenarios required (additions to tests/smoke/manifest.toml) ### scenario: rescued_done_implicit_failure ```sh # tests/smoke/scenarios/rescued_done_implicit_failure.sh # Pre: a task whose agent exits non-zero without calling `wg done` or `wg fail`, # and whose .evaluate-X scores >= eval_gate_threshold. # Post: task transitions to Done (rescued=true), downstream dependents unblock. ``` Behavioral assertions: - After agent exit: `wg show <id>` shows `status: failed-pending-eval`. - After dispatcher tick + eval pass: `wg show <id>` shows `status: done`, `rescued: true`. - `failure_class` field still present (forensic record of why it took the rescue path). - Downstream task transitions Open → InProgress (dependency satisfied). owners = ["design-failed-pending", "implement-failed-pending"] ### scenario: confirmed_failed_implicit_failure_bad_output ```sh # tests/smoke/scenarios/confirmed_failed_implicit_failure_bad_output.sh # Pre: agent exits non-zero without `wg done`/`wg fail`; .evaluate-X scores < threshold. # Post: task transitions to Failed (terminal), downstream stays Open/Blocked. ``` Behavioral assertions: - After agent exit: status `failed-pending-eval`. - After eval rejects: status `failed`, `rescued` not set, `failed_reason` cites the score. - No auto-rescue replacement task spawned (the implicit-failed task has already used its rescue chance). - Downstream remains blocked. owners = ["design-failed-pending", "implement-failed-pending"] ### scenario: explicit_wg_fail_skips_rescue ```sh # tests/smoke/scenarios/explicit_wg_fail_skips_rescue.sh # Pre: agent calls `wg fail` explicitly. # Post: status = failed (terminal), no .evaluate-X spawned, no FailedPendingEval state. ``` Behavioral assertions: - After `wg fail`: status `failed` immediately. - No `.evaluate-<id>` task in `wg list`. - No PendingEval/FailedPendingEval transition observed. owners = ["design-failed-pending", "implement-failed-pending"] ### scenario: meta_eval_failure_bounded_retry ```sh # tests/smoke/scenarios/meta_eval_failure_bounded_retry.sh # Pre: agent implicit-fails; .evaluate-X is forced to fail twice. # Post: source task lands in Failed (terminal) with failed_reason citing meta-eval exhaustion. ``` Behavioral assertions: - After first .evaluate-X failure: source still `failed-pending-eval`, `meta_eval_attempts=1`. - After second failure: source `failed`, `failed_reason` mentions "rescue eval unavailable after 2 attempts". - Dispatcher does not enter an infinite eval-retry loop. owners = ["design-failed-pending", "implement-failed-pending"] ### scenario: cycle_iteration_rescue_advances_loop ```sh # tests/smoke/scenarios/cycle_iteration_rescue_advances_loop.sh # Pre: --max-iterations 2 cycle, iteration N implicit-fails, eval rescues to done. # Post: iteration N+1 dispatches normally; loop_iteration increments. ``` Behavioral assertions: - iteration 1: agent implicit-fails → rescued-done. - Iteration counter advances; cycle members reset to Open. - Iteration 2 spawns a fresh agent. - `wg cycles <id>` shows loop_iteration == 2. owners = ["design-failed-pending", "implement-failed-pending"] ### scenario: failure_class_other_no_rescue ```sh # tests/smoke/scenarios/failure_class_other_no_rescue.sh # Pre: agent fails with failure_class=api-error-429-rate-limit (forced via fixture). # Post: status = failed (terminal); no FailedPendingEval, no .evaluate-X for rescue. ``` Behavioral assertions: - After classified failure: status `failed`, NOT `failed-pending-eval`. - No `.evaluate-<id>` queued. - This proves the rescue path is gated on `failure_class == AgentExitNonzero`. owners = ["design-failed-pending", "implement-failed-pending"] All six scenarios run live against real binaries / fake-agent fixtures (per the smoke gate rules in CLAUDE.md). The fake-agent harness should be parameterized to: (a) exit cleanly without `wg done`, (b) exit cleanly after `wg fail`, (c) classify failures by setting `failure_class` directly via test hook. ## Hand-off to implement-failed-pending The implementation task should: 1. Land schema changes in `src/graph.rs` first (variant + fields). cargo check + cargo test. 2. Update all `match` arms touching `Status` (compiler will list them). PendingEval-class behavior duplicates onto FailedPendingEval where appropriate (active set, viz running highlight, html status string, etc.). Fork 5 lists the call sites; the compiler will surface the rest. 3. Wire the wrapper-side classifier in `src/commands/spawn/execution.rs` to set FailedPendingEval when conditions match. 4. Add `resolve_failed_pending_eval_tasks` to dispatcher tick. 5. Add the warm-coral arm to `flash_color_for_status` and any html/css palette that mirrors it (`src/html_assets/style.css` — search for `pending-eval` color). 6. Add the `rescued` boolean rendering to `wg show`, `wg list -v`, and the TUI detail panel ("↻ rescued by eval at score=X.XX"). 7. Add the six smoke scenarios + register them in `tests/smoke/manifest.toml`. 8. Update `docs/manual/01-overview.md` and `docs/manual/05-evolution.typ` (and `.md` mirror) with the new state and rescue path. NO config knobs are added. Existing `agency.eval_gate_threshold` (0.7), `agency.auto_evaluate`, and `default_gate_max_attempts` (2) are reused. The implementation must NOT introduce `--allow-eval-rescue`, a separate "rescue threshold", or a manual-review state. Validation criteria for implement-failed-pending (paste into its `## Validation` section): - [ ] cargo build + cargo test pass with no regressions - [ ] Failing tests written first: integration_failed_pending_eval.rs covering all six scenarios - [ ] All six smoke scenarios above land in tests/smoke/manifest.toml with owners=[implement-failed-pending] - [ ] `wg show` displays the new state with the rescued bit when applicable - [ ] No new config keys introduced; `agency.eval_gate_threshold` and `default_gate_max_attempts` are the only knobs
- 2026-04-29T17:22:34.322844512+00:00 Validation check: - [x] All 7 forks resolved with rationale (Fork 1 user-resolved, Forks 2-7 in this design doc, Fork 7 explicitly deferred) - [x] State diagram in task log (ASCII art covers both PendingEval branch and FailedPendingEval branch) - [x] RGB triple for new color: (210,130,70) warm coral, derived as yellow⊕red midpoint mirroring the existing chartreuse derivation; verified distinct from Blocked(180,120,60) and Incomplete(255,165,0) in src/tui/viz_viewer/state.rs:286-298 - [x] Repro scenario specified: rescued_done_implicit_failure.sh - [x] Counter-repro scenario specified: confirmed_failed_implicit_failure_bad_output.sh - [x] No source modifications — design only (no code touched, no schema written; only the design + smoke list) Hand-off complete to implement-failed-pending.
- 2026-04-29T17:22:40.678188286+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-04-29T17:24:40.885140747+00:00 PendingEval → Done (evaluator passed; downstream unblocks)