Metadata
| Status | done |
|---|---|
| Assigned | agent-1327 |
| Agent identity | 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3 |
| Created | 2026-05-01T14:26:51.607812286+00:00 |
| Started | 2026-05-01T14:27:50.657062718+00:00 |
| Completed | 2026-05-01T14:36:48.185137083+00:00 |
| Tags | design,state-machine,reliability,failure-handling,eval-scheduled |
| Eval score | 0.92 |
| └ blocking impact | 0.95 |
| └ completeness | 0.91 |
| └ constraint fidelity | 0.85 |
| └ coordination overhead | 0.95 |
| └ correctness | 0.92 |
| └ downstream usability | 0.90 |
| └ efficiency | 0.88 |
| └ intent fidelity | 0.93 |
| └ style adherence | 0.92 |
Description
Two related design ideas the user surfaced 2026-05-01. Bundling because the second consumes the first.
Idea 1: Exponential falloff for any cycle of failure (GENERIC PRIMITIVE)
User direct quote: 'we'd want exponential falloff in any cycle of failure. that's a generic thing.'
Generic semantics: when a task or task-cycle fails, the next attempt's delay grows exponentially. Reset to base on success. Capped at a maximum so it doesn't completely stop trying. Includes jitter to prevent thundering-herd when many tasks fail in correlated ways (network outage etc.).
This applies to (at minimum):
- Cron-recurring tasks (`--cron` with `--exec` OR LLM): failed iteration N → next iteration's delay = base * 2^N (capped at some max, e.g. 24h)
- Cycle tasks with `cycle_delay`: same logic on iteration-fails
- Triage spawn rate (consumer of this primitive — see Idea 2)
- failed-pending-eval circuit breaker (already has its own ad-hoc threshold; could be unified with the generic primitive)
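A minimal sketch of the generic semantics above, in Rust with hypothetical names and illustrative defaults (jitter omitted for brevity; the real primitive would add a deterministic ±jitter term):

```rust
// Sketch only: exponential falloff on a failure cycle, clamped to a cap
// so the task slows down but never fully stops retrying.
fn backoff_delay_secs(base_secs: u64, n_failures: u32, multiplier: f64, cap_secs: u64) -> u64 {
    if n_failures == 0 {
        return base_secs; // healthy task: run at its natural period
    }
    let scaled = (base_secs as f64) * multiplier.powi(n_failures as i32);
    // Clamp to [base, cap]; `as u64` saturates on overflow in Rust.
    (scaled as u64).clamp(base_secs, cap_secs)
}

fn main() {
    // 1h base, 2x multiplier, 24h cap: 7200, 14400, 28800, 57600, 86400, 86400
    for n in 1..=6 {
        println!("{}", backoff_delay_secs(3600, n, 2.0, 86_400));
    }
}
```

With a 1h base and 2x multiplier the delay doubles per failure and pins at the 24h cap from the fifth consecutive failure onward; a success resets `n_failures` to 0 and the next run is back at the base period.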
Idea 2: Triage agent on recurring failures (CONSUMER)
User direct quote: 'a recurring failure could bring about an agent to triage. wouldn't that be something very natural to do? It could decide to do nothing, but it still gives us the ability of flexible recovery.'
When a recurring task hits a threshold of consecutive failures (e.g., 3), spawn a one-shot triage agent to investigate. The agent:
- Reads recent failure logs (stderr from shell tasks; agent archives from LLM tasks)
- Reads daemon log around the failure timestamps
- Looks for patterns (network, credentials, disk, rate-limit, etc.)
- Outputs a verdict + optional follow-up:
- Do nothing (transient, will self-resolve)
- File a fix task with diagnosed root cause + suggested patch
- Pause the recurring task + notify (urgent intervention needed)
- Adjust task config (bump retry-delay, switch endpoint, downgrade model tier)
User concern (must address in design): 'is it that we're worried about infinite redress?' YES. Triage is one-shot, never auto-spawns more triage on its own failure. Plus the global circuit breaker (from failed-pending-eval design) suppresses new triage spawns when the system is in a degraded state.
Forks to resolve
Fork 1: backoff parameters
- Base interval source: the task's normal cron / cycle_delay
- Multiplier: 2x (standard)? Or configurable per-task / globally?
- Cap: 24h reasonable default; configurable?
- Reset: on next success OR after some time-since-last-failure (whichever comes first)?
- Jitter: ±10%? ±25%? Configurable?
Fork 2: where the backoff state lives
- Per-task backoff state in graph.jsonl: `failure_cycle_count`, `last_failure_at`, `next_attempt_at` fields
- OR derived state computed at scheduling time from the task's run history (no new fields, just smarter scheduler logic)
- Recommend the latter — fewer schema changes
Fork 3: triage threshold + cooldown
- N=3 consecutive failures triggers triage (default)? Per-task override?
- After triage runs, cooldown period before another triage can fire for the same task: 24h default?
- What if triage's own decision is 'pause' — does pausing reset the failure counter?
Fork 4: what triage is allowed to mutate
- Read-only by default — produces a report + optionally files new tasks
- BUT: if triage's diagnosis includes a clear config fix (bad endpoint, wrong model name), can it apply directly via `wg edit`?
- Recommend: triage CAN file fix tasks autonomously, can pause the failing task autonomously, but CANNOT mutate other-task config without explicit rule-based authorization
Fork 5: integration with failed-pending-eval circuit breaker
- The circuit breaker from design-failed-pending (Fork 7) trips on systemic eval failures. Should triage spawning ALSO be gated by it? Yes, probably — if the agency infra is down, spawning a triage agent (which itself uses LLM) is wasted.
- Should the backoff primitive be part of the circuit breaker's reset logic (failure rate falls below threshold via backoff = breaker can auto-reset)?
Fork 6: shell vs LLM task asymmetry
- Triage on shell-mode tasks: the failing task itself has no LLM, but the triage agent IS an LLM. So triage works for shell tasks (it reads the stderr from logs). Confirm.
- Triage on LLM tasks that fail: the failing agent's logs ARE LLM output. Triage reading another LLM's output for diagnosis. Possibly cheaper to combine with the existing failed-pending-eval rescue (which already evaluates the failed agent's output) — design should clarify whether triage subsumes failed-pending-eval for recurring contexts.
Deliverable
Posted via wg log:
- Resolutions for all 6 forks with rationale
- State diagram showing how a task progresses through normal-success / recoverable-failure (triage rescues) / persistent-failure (backoff into eventually-paused)
- Schema changes if any (graph.jsonl fields)
- Concrete implementation plan (file:line pointers) for both the generic backoff primitive and the triage consumer
- Smoke scenario list
Validation
- All 6 forks resolved with rationale
- State diagram in task log
- Implementation plan concrete enough for a follow-up implementation task
- Smoke scenario list covers: backoff progression on repeated failures, triage spawning on threshold, triage decisions (no-op, file-fix, pause-and-notify), cooldown enforcement, circuit-breaker interaction, shell vs LLM task distinctions
- No source modifications — design only
Depends on
Required by
- (none)
Log
- 2026-05-01T14:26:51.593438727+00:00 Task paused
- 2026-05-01T14:26:58.585032352+00:00 Task published
- 2026-05-01T14:27:50.657069832+00:00 Spawned by coordinator --executor claude --model opus
- 2026-05-01T14:28:05.483871289+00:00 Starting design — orienting on existing scheduler / cron / cycle_delay / failed-pending-eval code
- 2026-05-01T14:33:03.997026899+00:00

# DESIGN DOC (1/3): exponential failure-cycle backoff + recurring-task triage agent

## Forks 1-3 + state diagram

Code anchors used throughout this doc:
- `src/cron.rs:155 reset_cron_task` — Done-only path; recomputes next_cron_fire with ±10% jitter.
- `src/cron.rs:95 calculate_jitter` — deterministic per-task hash-based jitter, capped at 15min.
- `src/graph.rs:447 cycle_failure_restarts` — already counts cycle failure restarts; capped by `cycle_config.max_failure_restarts` (default 3).
- `src/graph.rs:2235 reactivate_cycle_after_failure` — sets `ready_after` on owner from cycle_config.delay.
- `src/commands/service/coordinator.rs:4525` Phase 2.95 — cron reset pass (Done → Open + next_cron_fire).
- `src/commands/service/coordinator.rs:4500` Phase 2.6 — cycle failure restart pass.
- `src/query.rs:21 is_time_ready` — both cron tasks (`next_cron_fire` via `is_cron_due`) and cycle tasks (`ready_after`) gate dispatch via this check.
- `src/commands/requeue.rs:53 triage_count` — existing failed-dependency triage plumbing; the recurring-failure triage is a sibling, NOT this.

## Fork 1 — Backoff parameters: 2x, 24h cap, ±10% jitter, reset on success only

**Resolution:**
- **Base interval source**: derived per-consumer.
  - Cron: the schedule's natural period — the interval between two consecutive `schedule.after()` fires (already computed in `calculate_jitter` at `cron.rs:97-105`).
  - Cycle: `cycle_config.delay` parsed via `parse_delay` (`graph.rs:55`).
  - Triage spawn rate: triage_cooldown (per-task-id rate-limit, not per-attempt; see Fork 3).
- **Multiplier**: 2x default. Globally configurable via new `coordinator.failure_backoff_multiplier: f64` (default 2.0). NOT per-task — keeping the schema lean. If a task wants different aggressivity, it can adjust its own base period.
- **Cap**: 24h default. Globally configurable via new `coordinator.failure_backoff_max_delay: String` (default "24h", parsed via `parse_delay`).
- **Reset rule**: on **next success only**. NO time-based decay. Rationale: time-based reset (e.g., "if 3x base period has elapsed without a fail, reset N to 0") adds a second clock that's invisible to operators. Reset-on-success is the most conservative and most legible — `wg show` shows N=3 until the next clean run flips it to 0.
- **Jitter**: ±10% of the computed delay. Reuse the existing per-task deterministic hash (cron.rs:95). NO new knob — a hardcoded 10% matches the existing cron jitter convention and avoids yet-another-tunable.

**Rejected alternatives:**
- Per-task multiplier overrides: 3-axis config (per-task base × per-task multiplier × per-task cap) is a soup that nobody will reason about. One global multiplier is enough — base is already per-task via the schedule.
- Time-based decay: see above.
- Configurable jitter pct: matches a pattern memory ("Skip back-compat ceremony" — single user, prefer hard rules over knobs).

## Fork 2 — Backoff state: hybrid (one new u32 + reuse existing field)

**Resolution:** reuse `cycle_failure_restarts` (`graph.rs:447`) for cycle backoff; add ONE new u32 field `consecutive_failures` for cron + future generic consumers. Total schema delta = 1 field.

```rust
// src/graph.rs Task struct, near `meta_eval_attempts: u32` (line 545)

/// Number of consecutive failed runs since the last successful run.
/// Generic counter for the exponential-backoff primitive — used by cron
/// rescheduling and recurring-failure triage. Reset to 0 on next success.
/// Distinct from `cycle_failure_restarts` which counts cycle-RESTART events
/// (not raw fail counts) and lives on the cycle config-owner only.
#[serde(default, skip_serializing_if = "is_zero")]
pub consecutive_failures: u32,
```

**Why hybrid beats pure-derived (Fork 2's recommended option):**
- Pure-derived ("scan task.log at scheduling time, count Failed→Failed transitions") is brittle: log format is human-prose, not machine-parseable today. A fragile regex over `failed: <reason>` log lines is worse than one explicit u32.
- Pure-derived is also slow: every dispatcher tick (1Hz) would re-scan the logs of every recurring task — O(tasks × log entries) every tick.
- An explicit counter that increments at one site (the dispatcher's failure-handling phase) and resets at one site (the success-handling phase) is trivially correct and trivially auditable.

**Why a NEW field beats reusing `cycle_failure_restarts`:**
- `cycle_failure_restarts` is a cycle-controller concept: it counts RESTARTS (i.e., fired reactivations after a member-failed), not raw failure events. It's 1 ≤ N ≤ `max_failure_restarts` (default 3) for a cycle that's been auto-restarted thrice. If we overload it for cron tasks (which have no cycle_config and no restart concept), we conflate two semantics.
- The cron path needs a counter on the task itself, not on a cycle config-owner.
- A generic name (`consecutive_failures`) makes future consumers (HTTP-retry shell tasks, manual `wg retry --backoff`, etc.) trivially attachable.

**No new fields for `last_failure_at` / `next_attempt_at`:**
- `next_attempt_at` is fully captured by the existing `next_cron_fire` (cron) and `ready_after` (cycle). Reusing those keeps the dispatcher's gate consistent at `is_time_ready` (`query.rs:21`).
- `last_failure_at` is derivable from `task.completed_at` when status==Failed. Don't duplicate.

## Fork 3 — Triage threshold + cooldown: N=3, 24h cooldown, pausing does NOT reset counter

**Resolution:**
- **Threshold**: `coordinator.triage_threshold: u32`, default 3. Global, no per-task override (single user, lean schema). A failing task accumulates `consecutive_failures`; when it crosses the threshold, the dispatcher spawns a one-shot `.triage-<id>` task (haiku tier).
- **Cooldown**: 24h default via `coordinator.triage_cooldown: String` (parsed via `parse_delay`). Tracked per-failing-task via a new `last_triaged_at: Option<String>` field on Task.
- **Spawn precondition (all required)**:
  1. Task is "recurring": `cron_enabled` OR has a back-edge in cycle_analysis (member of an SCC of size ≥ 2 OR has implicit max_iterations).
  2. `consecutive_failures >= triage_threshold`.
  3. `last_triaged_at.is_none() || (now - last_triaged_at) >= triage_cooldown`.
  4. Task does NOT carry the tag `no-recurring-triage` (this tag is auto-applied to triage tasks themselves to prevent recursion — see Fork 5).
  5. Agency circuit breaker is NOT tripped (when present; if not implemented yet, this clause no-ops).
- **Pause does NOT reset `consecutive_failures`**: pausing is a hold, not a fix. If the operator (or triage's PAUSE verdict) sets `paused=true`, the counter stays where it was. On `wg resume`, the next failed run continues from N+1; a successful run resets to 0. This preserves forensic continuity: `wg show` always shows the true consecutive-fail tally.

**Rationale on N=3:**
- Matches the existing `coordinator.max_verify_failures` default (3) and `cycle_config.max_failure_restarts` default (3) — three is the project's canonical "this isn't transient" threshold.
- Below 3: too noisy (a single network blip + retry can hit N=2).
- Above 3: too slow (the operator notices the failures by hand before triage fires).

**Rationale on 24h cooldown:**
- Triage uses an LLM (haiku-tier, ~$0.001/run, but still). Re-triaging every iteration of a 5-minute cron task = 288 calls/day for a still-broken task; 1 call/day under a 24h cooldown is plenty.
- The operator can override via `wg edit <id> --clear-triage-cooldown` (new flag) or by deleting the `last_triaged_at` field directly when they want a fresh diagnosis.

**Rationale on counter preservation through pause:**
- A paused-then-resumed task that fails again is STILL on its same failure streak. The triage agent that paused it expected the operator to fix something; if nothing was fixed, the next failure re-triggers the same diagnosis. Resetting on pause would create a perverse incentive to pause-and-resume to clear the counter.
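The counter rules from Forks 2-3 (increment on terminal failure, reset only on success, preserve through pause) reduce to a tiny state holder. A sketch with hypothetical names — the real field lives directly on the Task struct, not in a separate type:

```rust
// Sketch of the consecutive-failure streak semantics (illustrative only).
#[derive(Default)]
struct FailureStreak {
    consecutive_failures: u32,
    paused: bool,
}

impl FailureStreak {
    fn on_failure(&mut self) {
        // Increments at exactly one site: the dispatcher's failure-handling phase.
        self.consecutive_failures = self.consecutive_failures.saturating_add(1);
    }
    fn on_success(&mut self) {
        // Fork 1 reset rule: next success only, no time-based decay.
        self.consecutive_failures = 0;
    }
    fn on_pause(&mut self) {
        // Fork 3: pause is a hold, not a fix — the streak is preserved.
        self.paused = true;
    }
    fn triage_due(&self, threshold: u32) -> bool {
        self.consecutive_failures >= threshold
    }
}
```

Three failures make `triage_due(3)` true; pausing leaves it true (and the tally intact for `wg show`); only a clean run clears it.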
- 2026-05-01T14:34:24.561881064+00:00

# DESIGN DOC (2/3): Forks 4-6 + state diagram + schema delta

## Fork 4 — Triage mutation rights: file-fix YES, pause-self YES, mutate-others NO

**Resolution:** triage runs as `exec_mode=light, tier=fast` (Claude Code with read-only file access + wg CLI).

Allowed actions:
- `wg add "Fix: ..." --after <failing-task-id> -d "..."` — file fix tasks.
- `wg pause <failing-task-id>` (or `wg edit <failing-task-id> --pause` if a dedicated CLI doesn't exist) — pause the failing task itself.
- `wg log <failing-task-id> "..."` — record diagnostic findings on the source task.
- Read-only `wg show <other-task-id>` for context gathering.

NOT allowed:
- `wg edit <other-task-id> --model X --endpoint Y` — triage CAN suggest these in its diagnostic output, but MUST file a fix task rather than apply directly. The fix task's downstream worker (or the operator) does the mutation.
- `wg add ... --after <unrelated-task>` — only `--after <failing-task-id>` is allowed (prompt-enforced).
- Calling `.triage-*` for any task (prompt-enforced).

**Phase 1 enforcement: prompt-only.** The triage prompt (template at `src/agency/triage_prompt.rs`, NEW) carries explicit constraints: "You may file at most ONE fix task and at most ONE pause action. You MAY NOT mutate other tasks via `wg edit`."

**Phase 2 enforcement (deferred): binary-level guard.** A `WG_TRIAGE_MODE=1` env var, set by the dispatcher when spawning the triage task, would gate `wg edit <id> --model/--endpoint/--exec` in `src/commands/edit.rs` to refuse on non-self IDs. Defer this until phase 1 reveals over-reach in practice — same pattern as the design-failed-pending circuit breaker (don't build the fence before the goats arrive).

**Triage verdict format (parsed from triage's wg log entries on the failing task):**

```
VERDICT=NOOP|FILE-FIX|PAUSE|RECOMMEND-CONFIG
ROOT_CAUSE: <one-line summary>
EVIDENCE: <log excerpt or file:line>
ACTION_TAKEN: <description of what triage did>
```

This is text, not JSON — easier for the LLM to produce reliably and easier for the dispatcher / operator to grep. The dispatcher does NOT need to parse the verdict to take action; the triage agent itself takes action by calling wg CLI commands. The verdict line is purely for `wg show` / TUI display so an operator can scan triage outcomes at a glance.

## Fork 5 — Circuit-breaker integration: gating YES, breaker reset via observation NO

**Resolution:**
- **Triage spawning IS gated by the breaker.** When the breaker is tripped, the recurring-failure scan in the dispatcher (Phase 2.96, see plan below) skips spawning new triage tasks and emits ONE log line per tick: `[dispatcher] Breaker tripped, suppressing triage spawn for {N} recurring-failed tasks`. Tasks pile up with `consecutive_failures > threshold` but `last_triaged_at == None` until the breaker clears.
- **Triage tasks ALREADY in flight when the breaker trips run to completion.** Triage is one-shot LLM; tearing it down mid-flight is wasteful and risks lost diagnosis. The breaker only suppresses NEW spawns.
- **Backoff does NOT directly drive breaker reset.** The breaker (per design-failed-pending Fork 7, deferred) auto-resets after 3 consecutive eval-task successes. Backoff slows the failure rate, which gives that "3 consecutive successes" window a chance to occur — so backoff INDIRECTLY helps the breaker recover. But we do NOT plumb a direct "rate-falls-below-threshold = auto-reset" signal because:
  - The breaker is rate-tripped, not rate-reset; reset semantics should mirror trip semantics, not invert them.
  - Coupling backoff math to breaker state creates a loop the operator can't easily reason about.

**When the breaker is implemented (deferred follow-up task):**
- Field on `state.json` or new `agency/breaker.json`: `breaker_tripped: bool`, `tripped_at: ISO8601`, `consecutive_eval_successes: u32`.
- The triage spawn check reads this once per tick; absent file = breaker not tripped.

**This design ships without the breaker integration code** (no breaker code to integrate against). The triage spawn function is structured so adding the breaker check is one line:

```rust
if config.agency.breaker_tripped { continue; }
```

## Fork 6 — Shell vs LLM task asymmetry: triage works for both, does NOT subsume failed-pending-eval

**Resolution:** the triage agent is ALWAYS LLM (haiku, light exec_mode). Its INPUT depends on the failing task's type:

| Failing task type | Triage inputs |
|---|---|
| Shell (no LLM) | `output.log` stderr, exit codes, `failure_class`, daemon log slice |
| LLM (Claude/Codex/Nex) | Above + agent's `output.log` transcript + `raw_stream.jsonl` events |

Both are file-system reads via the `Read` tool that triage has access to. The `light` exec_mode permits `Read | Glob | Grep | WebFetch` — sufficient for both cases.

**Triage does NOT subsume failed-pending-eval.** They operate at different scopes:

| Concern | failed-pending-eval | recurring-failure triage |
|---|---|---|
| Scope | Single attempt | Pattern across N consecutive attempts |
| Decision basis | Score (≥ threshold = rescue) | Diagnostic prose + verdict tag |
| Output | Status flip on the source task | New fix task / pause / no-op |
| Cost | Always-on (every fail) | Threshold-gated (N=3 default) |
| Spawned by | Dispatcher (every implicit-fail) | Dispatcher (after threshold + cooldown) |

A single failure of an LLM task may go through failed-pending-eval (per-attempt rescue path) AND, if it doesn't get rescued, increment `consecutive_failures`. Once `consecutive_failures >= triage_threshold`, the recurring-failure triage spawns INDEPENDENTLY of any per-attempt eval that ran.

**Edge case:** a cycle task whose 3 iterations all hit `failed-pending-eval` and got rescued to `done` (rescued=true). Does that count as 3 failures? **NO** — `consecutive_failures` increments only on terminal `failed` (rescue → Done = success path). Rescued tasks reset the counter. This makes failed-pending-eval the FIRST line of defense (per-attempt) and triage the SECOND line (cross-attempt pattern), with no double-counting.

**Edge case:** triage's own task fails (LLM error, classify_failure, etc.). The triage task carries the tag `no-recurring-triage` (see Fork 3). The recurring-failure scan checks for this tag, so `.triage-foo` failing N times does NOT spawn `.triage-.triage-foo`. The dispatcher emits a one-time warning: `[dispatcher] WARNING: triage task '.triage-foo' failed; no recursive triage spawned (tagged no-recurring-triage)`.

## STATE DIAGRAM — recurring task with backoff + triage

```
┌──────────────────────────────────┐
│ recurring task (cron OR cycle)   │
└───────────────┬──────────────────┘
                │ scheduled / iteration N
                ▼
        ┌──────────────┐
        │ in-progress  │
        └──┬────────┬──┘
  success  │        │ fail (terminal Failed, NOT rescued
           │        │       by failed-pending-eval)
           ▼        ▼
      ┌──────┐  ┌────────┐
      │ done │  │ failed │
      └──┬───┘  └───┬────┘
         │          │ Phase 2.95+ (dispatcher tick):
         │          │   consecutive_failures += 1
         │          │   delay = clamp(base * mult^consec, base, max_delay)
         │          │   delay += jitter(±10%, hash(task_id))
         │          │   if cron:  next_cron_fire = now + delay; status = Open
         │          │   if cycle: ready_after = now + delay
         │          │             (existing reactivate_cycle_after_failure)
         │          ▼
         │  ┌─────────────────────────────────────┐
         │  │ consecutive_failures >= threshold   │
         │  │ AND now >= last_triaged_at+cooldown │
         │  │ AND not tagged no-recurring-triage  │
         │  │ AND breaker not tripped             │
         │  └──────┬──────────────────────┬───────┘
         │         │ yes                  │ no
         │         ▼                      ▼
         │  ┌──────────────────────┐  [back to scheduled,
         │  │ spawn .triage-<id>   │   slowed by backoff,
         │  │   tier=fast          │   no triage spawned]
         │  │   exec_mode=light    │
         │  │   tag=no-recurring-  │
         │  │       triage         │
         │  │   agent=haiku        │
         │  │   + set              │
         │  │   last_triaged_at    │
         │  │   = now              │
         │  └─────────┬────────────┘
         │            │ (runs concurrently with continued backoff)
         │            ▼
         │  ┌──────────────────────┐
         │  │ triage verdict:      │
         │  │ NOOP / FILE-FIX /    │
         │  │ PAUSE /              │
         │  │ RECOMMEND-CONFIG     │
         │  └──┬────────┬───────┬──┘
         │     │ NOOP   │ FILE- │ PAUSE
         │     │        │ FIX   │
         │     │        ▼       ▼
         │     │  ┌──────────┐ ┌───────────────┐
         │     │  │ wg add   │ │ paused=true   │
         │     │  │ Fix:...  │ │ (operator     │
         │     │  │ --after  │ │  unpauses to  │
         │     │  │ <src>    │ │  resume)      │
         │     │  └──────────┘ └───────────────┘
         │     ▼
         │  (continue backoff, triage is silent)
         ▼
  reset: consecutive_failures = 0
         + reset_cron_task (or cycle iteration reset)
  [back to scheduled at base period]

terminal-cap branch (worst case):
  consecutive_failures keeps growing → delay clamps at max_delay (24h default)
  task tries once per 24h forever, OR until the operator intervenes
  (wg pause / wg resume after fix)
  triage cooldown ensures at most 1 triage / 24h while in this state
```

## Schema delta summary

`src/graph.rs` Task struct (additions only — no renames, no removals):

```rust
/// Number of consecutive failed runs since the last successful run.
/// Generic counter for the exponential-backoff primitive.
#[serde(default, skip_serializing_if = "is_zero")]
pub consecutive_failures: u32,

/// Last time a recurring-failure triage agent was spawned for this task
/// (ISO 8601). None = never. Used to enforce triage_cooldown.
#[serde(default, skip_serializing_if = "Option::is_none")]
pub last_triaged_at: Option<String>,
```

Both default to zero/None and skip-if-empty: legacy graph.jsonl rows deserialize unchanged.

`src/config.rs` Coordinator section additions:

```rust
/// Number of consecutive failures that triggers a recurring-failure triage.
#[serde(default = "default_triage_threshold")]
pub triage_threshold: u32, // default 3

/// Minimum interval between triage spawns for the same failing task.
#[serde(default = "default_triage_cooldown")]
pub triage_cooldown: String, // default "24h"

/// Multiplier applied to base delay per consecutive failure.
#[serde(default = "default_failure_backoff_multiplier")]
pub failure_backoff_multiplier: f64, // default 2.0

/// Cap on the computed backoff delay.
#[serde(default = "default_failure_backoff_max_delay")]
pub failure_backoff_max_delay: String, // default "24h"
```

No renames, no removals. Existing `cycle_failure_restarts` and `max_failure_restarts` keep their meaning (cycle-restart semantics, distinct from raw failure count).

No status enum extension. Backoff is metadata, not a state — the task remains in `Open` (with `next_cron_fire` / `ready_after` deferring dispatch) or `Failed` (until the dispatcher's next tick reschedules it). This keeps `Status::is_terminal()` / `is_dep_satisfied()` unchanged and avoids a fan-out of new match arms across the codebase.

No new fields for: `next_attempt_at`, `last_failure_at`, per-task `triage_threshold`, per-task `triage_cooldown`. All derivable from existing fields or globally configured.
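Since the verdict line is deliberately grep-friendly rather than JSON, a display-side extractor is a one-liner. A sketch with a hypothetical helper name (not part of the plan — the dispatcher never needs this to act, only `wg show` / TUI display would):

```rust
// Sketch: pull the VERDICT line out of a triage log message for display.
// Returns the verdict tag (e.g. "NOOP", "PAUSE") if present.
fn extract_verdict(log_message: &str) -> Option<&str> {
    log_message
        .lines()
        .find_map(|line| line.trim().strip_prefix("VERDICT="))
}
```

For example, a log entry containing `VERDICT=PAUSE` on any line yields `Some("PAUSE")`; messages without a verdict line yield `None`, so ordinary log entries pass through untouched.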
- 2026-05-01T14:36:16.995282831+00:00 # DESIGN DOC (3/3): Implementation plan + smoke scenarios + hand-off ## Concrete implementation plan (file:line pointers) ### Step 1 — `src/backoff.rs` (NEW, ~80 lines) Pure-functional generic primitive. No graph mutation, no I/O. ```rust //! Generic exponential-backoff primitive. //! //! Used by: //! - `src/cron.rs::reset_cron_task_after_failure` for cron rescheduling //! - `src/graph.rs::reactivate_cycle_after_failure` (`graph.rs:2235`) for cycle delays //! - `src/commands/service/coordinator.rs::evaluate_recurring_failure_triage` for triage spawn rate //! //! Reset semantics: callers reset n_failures to 0 on next success. use chrono::Duration; /// Compute the backoff delay for the next attempt. /// /// `base`: the task's natural period (cron interval, cycle_config.delay, etc.) /// `n_failures`: consecutive failure count (0 = first attempt, no backoff) /// `multiplier`: typically 2.0 /// `max_delay`: hard cap (e.g., 24h) /// `task_id`: used as deterministic-jitter seed (10% of computed delay, hash-based) /// /// Returns a Duration ≥ base, ≤ max_delay, with ±10% jitter applied. pub fn compute_backoff_delay( base: Duration, n_failures: u32, multiplier: f64, max_delay: Duration, task_id: &str, ) -> Duration { if n_failures == 0 { return base; } // Compute base * multiplier^n_failures with overflow-safety. let scale = multiplier.powi(n_failures as i32); let target_secs = (base.num_seconds() as f64 * scale).min(max_delay.num_seconds() as f64); let target = Duration::seconds(target_secs as i64); let target = target.max(base).min(max_delay); // Jitter: ±10% of target, deterministic per task_id + n_failures. 
let jitter = compute_jitter(task_id, n_failures, target.num_seconds()); target + jitter } fn compute_jitter(task_id: &str, n_failures: u32, target_secs: i64) -> Duration { use std::collections::hash_map::DefaultHasher; use std::hash::{Hash, Hasher}; let mut h = DefaultHasher::new(); task_id.hash(&mut h); n_failures.hash(&mut h); let v = h.finish(); let max_offset = target_secs / 10; // ±10% if max_offset == 0 { return Duration::zero(); } let range = 2 * max_offset + 1; let offset = (v % range as u64) as i64 - max_offset; Duration::seconds(offset) } ``` Tests in same file: n=0 ⇒ base, n=1 ⇒ ~base*2 (±10%), n=10 with 2x mult ⇒ clamped at max, jitter deterministic per (task_id, n). ### Step 2 — `src/graph.rs` schema (additions in Task struct + Default) - Add `consecutive_failures: u32` near `meta_eval_attempts: u32` (line 545). - Add `last_triaged_at: Option<String>` near `last_resurrected_at: Option<String>` (line 498). - Update `Task::default` (line 612) with `consecutive_failures: 0`, `last_triaged_at: None`. - Update the deserialize helper struct (line 1199) and its mapping (line 1360) with the two fields. - Update `is_zero` helper if not already covering u32 — already covers (line 1247-ish). ### Step 3 — `src/cron.rs` extensions - Refactor `calculate_jitter` (line 95) to optionally accept `period_secs: i64` directly OR keep it Schedule-based and add a sibling `calculate_jitter_for_period` that takes `period_secs`. Suggested: keep both; the backoff path uses period_secs (Schedule-based jitter is tied to an external `cron::Schedule`). 
- Add `pub fn reset_cron_task_after_failure(task: &mut Task, multiplier: f64, max_delay: Duration) -> bool`: ```rust pub fn reset_cron_task_after_failure(task: &mut Task, multiplier: f64, max_delay: Duration) -> bool { if !task.cron_enabled || task.cron_schedule.is_none() { return false; } if task.status != Status::Failed { return false; } let cron_expr = task.cron_schedule.as_ref().unwrap(); let schedule = match parse_cron_expression(cron_expr) { Ok(s) => s, Err(_) => return false, }; let now = Utc::now(); // Base period from schedule let mut upcoming = schedule.after(&now); let first = upcoming.next(); let second = upcoming.next(); let base_secs = match (first, second) { (Some(a), Some(b)) => (b - a).num_seconds().max(60), // floor 60s _ => 3600, // fallback 1h }; let base = Duration::seconds(base_secs); task.consecutive_failures = task.consecutive_failures.saturating_add(1); let delay = crate::backoff::compute_backoff_delay( base, task.consecutive_failures, multiplier, max_delay, &task.id, ); task.last_cron_fire = Some(now.to_rfc3339()); task.next_cron_fire = Some((now + delay).to_rfc3339()); task.status = Status::Open; task.assigned = None; task.completed_at = None; task.failure_reason = None; // cleared on rescheduling, but consecutive_failures preserves the streak true } ``` - Modify `reset_cron_task` (line 155, success path) to also `task.consecutive_failures = 0;` — this is the reset-on-success rule. ### Step 4 — `src/commands/service/coordinator.rs` dispatcher additions At Phase 2.95 (line 4525), augment the existing cron-Done loop with a parallel cron-Failed loop: ```rust // Phase 2.95b: Cron task backoff — reset Failed cron tasks to Open, compute // next_cron_fire = now + base * mult^consecutive_failures (capped + jitter). 
{ let multiplier = config.coordinator.failure_backoff_multiplier; let max_delay = workgraph::graph::parse_delay(&config.coordinator.failure_backoff_max_delay) .map(|s| Duration::seconds(s as i64)) .unwrap_or_else(|| Duration::hours(24)); let failed_cron_ids: Vec<String> = graph .tasks() .filter(|t| t.cron_enabled && t.status == Status::Failed) .map(|t| t.id.clone()) .collect(); for task_id in &failed_cron_ids { if let Some(task) = graph.get_task_mut(task_id) && workgraph::cron::reset_cron_task_after_failure(task, multiplier, max_delay) { eprintln!( "[dispatcher] Cron backoff: '{}' → Open (consecutive_failures={}, next fire: {})", task_id, task.consecutive_failures, task.next_cron_fire.as_deref().unwrap_or("unknown") ); modified = true; } } } ``` In Phase 2.6 (line 4500, `evaluate_all_cycle_failure_restarts`), the existing logic at `graph.rs:2235` computes `ready_after = now + parse_delay(cycle_config.delay)`. Replace that line with the backoff call: ```rust // graph.rs:2235 — replace existing cycle-delay parse with backoff: let base = cycle_config.delay.as_ref().and_then(|d| parse_delay(d)) .map(|s| Duration::seconds(s as i64)) .unwrap_or_else(|| Duration::seconds(0)); let ready_after = if base.num_seconds() > 0 { let n = graph.get_task(config_owner_id).map(|t| t.cycle_failure_restarts).unwrap_or(0); let delay = crate::backoff::compute_backoff_delay(base, n, multiplier, max_delay, config_owner_id); Some((Utc::now() + delay).to_rfc3339()) } else { None }; ``` This keeps `cycle_failure_restarts` as the cycle's failure counter (already exists, already incremented at line 2249) and feeds it through the same generic primitive. Add Phase 2.96 immediately after the cycle-failure-restart block (line ~4513): ```rust // Phase 2.96: Recurring-failure triage spawn. 
modified |= evaluate_recurring_failure_triage(graph, dir, &config);
```

Where `evaluate_recurring_failure_triage` is a NEW function in coordinator.rs near line 1000:

```rust
fn evaluate_recurring_failure_triage(
    graph: &mut workgraph::graph::WorkGraph,
    dir: &Path,
    config: &Config,
) -> bool {
    let threshold = config.coordinator.triage_threshold;
    let cooldown = workgraph::graph::parse_delay(&config.coordinator.triage_cooldown)
        .map(|s| chrono::Duration::seconds(s as i64))
        .unwrap_or_else(|| chrono::Duration::hours(24));
    let now = chrono::Utc::now();
    let cycle_analysis = graph.compute_cycle_analysis();
    // Collect candidates first (immutable read), then spawn (mutable write)
    let candidates: Vec<String> = graph
        .tasks()
        .filter(|t| {
            // 1. recurring class
            let is_recurring = t.cron_enabled
                || cycle_analysis.cycle_for_task(&t.id).is_some()
                || (t.cycle_config.is_some() && t.cycle_config.as_ref().unwrap().max_iterations > 1);
            if !is_recurring {
                return false;
            }
            // 2. threshold met
            if t.consecutive_failures < threshold {
                return false;
            }
            // 3. tag not on suppression list
            if t.tags.iter().any(|tag| tag == "no-recurring-triage") {
                return false;
            }
            // 4. cooldown elapsed
            if let Some(last) = &t.last_triaged_at
                && let Ok(parsed) = chrono::DateTime::parse_from_rfc3339(last)
                && (now - parsed.with_timezone(&chrono::Utc)) < cooldown
            {
                return false;
            }
            // 5. breaker check (deferred — when implemented, gate here)
            // if config.agency.breaker_tripped { return false; }
            true
        })
        .map(|t| t.id.clone())
        .collect();
    if candidates.is_empty() {
        return false;
    }
    let mut modified = false;
    for source_id in &candidates {
        if spawn_triage_task(graph, dir, source_id).is_ok() {
            if let Some(task) = graph.get_task_mut(source_id) {
                task.last_triaged_at = Some(now.to_rfc3339());
                task.log.push(LogEntry {
                    timestamp: now.to_rfc3339(),
                    actor: None,
                    user: Some(workgraph::current_user()),
                    message: format!(
                        "Recurring-failure triage spawned (.triage-{}): consecutive_failures={}",
                        source_id, task.consecutive_failures
                    ),
                });
                modified = true;
            }
        }
    }
    modified
}
```

### Step 5 — `src/agency/triage_prompt.rs` (NEW, ~120 lines)

Builds the triage agent's system prompt. Inputs: the failing task, recent log excerpts, and a daemon-log slice. Output: a string for the `task.description` of the spawned `.triage-<id>` task. Hard-coded constraints: one fix-task max, no `wg edit` of other tasks, no recursive triage spawn, MUST end with a VERDICT line.

### Step 6 — `src/commands/service/coordinator.rs::spawn_triage_task` (NEW)

Mirror the existing `.evaluate-X` spawn pattern (`coordinator.rs:1847+` shows the .place / .assign pattern).
Triage task:

- `id = format!(".triage-{}", source_id)`
- `title = format!("Triage: {}", source.title)`
- `tier = Some("fast")` — haiku
- `exec_mode = Some("light")` — read-only file access + wg CLI
- `tags = vec!["recurring-triage", "no-recurring-triage", "eval-scheduled"]`
- `after = vec![source_id.to_string()]` — depends on the source (so context flows in)
- `model = config.models.evaluator` cascade (claude:haiku default)
- `description = build_triage_prompt(source, recent_logs, daemon_excerpt)`

### Step 7 — `src/config.rs` knobs (~30 lines diff)

Add four new keys under `[coordinator]`:

- `triage_threshold: u32` near line 3097 (next to `max_verify_failures`)
- `triage_cooldown: String`
- `failure_backoff_multiplier: f64`
- `failure_backoff_max_delay: String`

Plus matching `default_*` functions and `Default` impl entries.

### Step 8 — `src/commands/show.rs` display surface

Show `consecutive_failures` and `last_triaged_at` when nonzero/present:

```
Failure cycle:
  Consecutive failures: 3
  Next attempt: 2026-05-02T09:14:00Z (in 14h, backoff iter 3 of cap=24h)
  Last triaged: 2026-04-30T22:11:00Z (cooldown: 6h remaining)
```

### Step 9 — TUI viz (`src/tui/viz_viewer/state.rs`)

No new color cell needed (backoff is metadata, not state). Optional: dim Open tasks where `next_cron_fire > now + 1h` to visually de-emphasize "long-backoff" tasks. Defer to follow-up — not gating.

### Step 10 — CLI hooks (`src/cli.rs`, `src/commands/edit.rs`)

- `wg edit <id> --clear-triage-cooldown` — operator override; clears `last_triaged_at` so the next failure can trigger triage immediately.
- `wg edit <id> --reset-failures` — operator override; resets `consecutive_failures=0` (e.g., after a known-transient outage).
- `wg pause <id>` / `wg resume <id>` — verify these already exist in `src/cli.rs`. If not, add them (~20 lines; just toggles `paused` + persists). Triage uses these.
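The steps above lean on `src/backoff.rs` without spelling out the primitive itself. A minimal sketch of what `compute_backoff_delay` could look like, written over plain seconds so it is self-contained (the real version would take `chrono::Duration`s as in the snippets above); the hash-based jitter scheme and the exact signature are assumptions of this sketch, chosen to satisfy the ±10% jitter, cap, and restart-determinism requirements stated in this design:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Sketch (not the final API): delay = min(base * multiplier^n, max),
/// with deterministic jitter in [-10%, +10%) keyed on (task_id, n) so
/// dispatcher restarts reproduce the same schedule.
fn compute_backoff_delay_secs(
    base_secs: u64,
    n: u32,
    multiplier: f64,
    max_secs: u64,
    task_id: &str,
) -> u64 {
    // Saturate the exponent so 20+ consecutive failures cannot overflow.
    let scaled = (base_secs as f64) * multiplier.powi(n.min(32) as i32);
    let capped = scaled.min(max_secs as f64);

    // Deterministic jitter: hash (task_id, n) down to a fraction in [0, 1).
    let mut h = DefaultHasher::new();
    task_id.hash(&mut h);
    n.hash(&mut h);
    let frac = (h.finish() % 10_000) as f64 / 10_000.0;
    (capped * (0.9 + 0.2 * frac)).round() as u64
}

fn main() {
    let d0 = compute_backoff_delay_secs(60, 0, 2.0, 86_400, "nightly-sync");
    let d3 = compute_backoff_delay_secs(60, 3, 2.0, 86_400, "nightly-sync");
    let d20 = compute_backoff_delay_secs(60, 20, 2.0, 86_400, "nightly-sync");
    println!("{d0} {d3} {d20}");
    // n=0 stays near base; n=3 near 8x base; n=20 is capped (cap + 10% jitter).
    assert!((54..=66).contains(&d0));
    assert!((432..=528).contains(&d3));
    assert!(d20 <= 95_040);
    // Same inputs → same jitter, across restarts.
    assert_eq!(d3, compute_backoff_delay_secs(60, 3, 2.0, 86_400, "nightly-sync"));
}
```

Keyed-hash jitter keeps the schedule reproducible for the smoke scenarios while still de-correlating tasks that fail together.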
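Step 7's four knobs and their proposed defaults (2x multiplier, 24h cap, threshold 3, 24h cooldown, per Forks 1 and 3) could land roughly as follows; the struct name here is illustrative and the serde/`default_*` wiring is elided:

```rust
// Illustrative sketch of the four [coordinator] knobs from Step 7.
// Field names come from this design; the real struct lives in src/config.rs
// behind serde `default_*` functions rather than a bare Default impl.
#[derive(Debug, Clone)]
pub struct CoordinatorFailureKnobs {
    /// Consecutive failures before a `.triage-<id>` task is spawned.
    pub triage_threshold: u32,
    /// Minimum gap between triage spawns for the same source task.
    pub triage_cooldown: String,
    /// Exponential growth factor per consecutive failure.
    pub failure_backoff_multiplier: f64,
    /// Hard cap on any backoff delay.
    pub failure_backoff_max_delay: String,
}

impl Default for CoordinatorFailureKnobs {
    fn default() -> Self {
        Self {
            triage_threshold: 3,
            triage_cooldown: "24h".to_string(),
            failure_backoff_multiplier: 2.0,
            failure_backoff_max_delay: "24h".to_string(),
        }
    }
}

fn main() {
    let knobs = CoordinatorFailureKnobs::default();
    println!("{knobs:?}");
    assert_eq!(knobs.triage_threshold, 3);
    assert_eq!(knobs.failure_backoff_multiplier, 2.0);
}
```

Keeping the two delay knobs as `String` and parsing them through the same `parse_delay` helper keeps their format consistent with existing cron/cycle delays.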
## Smoke scenarios (additions to `tests/smoke/manifest.toml`)

All scenarios run live against real `wg` binaries and a fake-shell-task fixture (or a fake-LLM fixture for triage). Each is registered with `owners = ["implement-exponential-failure"]`.

1. **`backoff_progression_cron.sh`** — cron task fails 3x; assert `next_cron_fire` grows base→~2x→~4x (within ±10% jitter); after a forced success, assert `consecutive_failures=0` and `next_cron_fire` back to base.
2. **`backoff_cap_enforced.sh`** — cron task fails 20x; assert `next_cron_fire - now` ≤ `failure_backoff_max_delay + 10% jitter`; assert no panic / overflow.
3. **`backoff_jitter_deterministic.sh`** — same `(task_id, n_failures, base)` produces identical jitter across two dispatcher restarts; differing task_ids produce ≥1s jitter difference (with high probability).
4. **`backoff_resets_on_success.sh`** — cron fails 2x then succeeds; assert `consecutive_failures=0` and `next_cron_fire` back to the base period; the next failure starts at iter=1, not iter=3.
5. **`backoff_cycle_path.sh`** — cycle task with `cycle_config.delay="60s"` fails 3x; assert each restart's `ready_after` increases via backoff; `cycle_failure_restarts` reaches 3 and hits the `max_failure_restarts` cap.
6. **`triage_spawns_at_threshold.sh`** — cron fails 3x; assert a `.triage-<id>` task appears with status=Open, tagged `recurring-triage` AND `no-recurring-triage`, model=haiku.
7. **`triage_decision_noop.sh`** — fixture LLM outputs `VERDICT=NOOP` and exits cleanly; assert no new fix tasks are created and the source task continues its backoff cadence.
8. **`triage_decision_file_fix.sh`** — fixture LLM outputs `VERDICT=FILE-FIX` and calls `wg add "Fix: bad endpoint" --after <source>`; assert the new fix task exists with the correct `--after` edge.
9. **`triage_decision_pause.sh`** — fixture LLM outputs `VERDICT=PAUSE` and calls `wg pause <source>`; assert `paused=true` on the source and `consecutive_failures` UNCHANGED (forensics preserved); `wg resume` clears `paused` but the counter remains; the next failure increments to N+1, not 1.
10. **`triage_cooldown_enforced.sh`** — after triage runs, the source fails 3 more times within the cooldown; assert NO second `.triage-<id>` spawns; advance the clock past the cooldown, fail 3 more times; assert a second triage spawns.
11. **`triage_no_recursive_spawn.sh`** — force `.triage-<id>` itself to fail 5x; assert NO `.triage-.triage-<id>` task is spawned (suppressed by the `no-recurring-triage` tag); a single warning log line is emitted.
12. **`circuit_breaker_suppresses_triage.sh`** — set `agency.breaker_tripped=true` (test fixture); cron fails 5x; assert NO triage spawn; log shows `Breaker tripped, suppressing triage spawn`. Clear the breaker; assert the next failure dispatch can spawn triage.
13. **`triage_shell_task.sh`** — shell-mode cron task (no LLM, runs `false`) fails 3x; the triage agent (LLM) reads stderr from `output.log`; verdict is `FILE-FIX` with a `Fix:` task created. Confirms triage works for shell tasks.
14. **`triage_llm_task_distinct_from_failed_pending_eval.sh`** — LLM-mode cycle task with `agency.auto_evaluate=on`. Iteration 1: agent implicit-fails, eval rescues to Done → `consecutive_failures=0`. Iterations 2-4: agent terminally fails (eval rejects). Assert: at the end of iteration 4, `consecutive_failures=3` (not 4 — iteration 1 was rescued), triage spawns ONCE, and the `.evaluate-*` tasks for each iteration ran independently.
15. **`backoff_clear_failures_cli.sh`** — `wg edit <id> --reset-failures` clears `consecutive_failures` to 0; `wg edit <id> --clear-triage-cooldown` clears `last_triaged_at`; both verified via `wg show`.

## Hand-off to follow-up implementation task

The implementation task should:

1. Land `src/backoff.rs` first with unit tests (TDD on the pure primitive). cargo build + cargo test.
2. Add the two Task fields + config knobs. The compiler will surface every match arm to update.
3. Wire `reset_cron_task_after_failure` in `src/cron.rs` and the dispatcher Phase 2.95b.
4. Wire backoff into the existing cycle-failure-restart path at `graph.rs:2235`.
5. Implement `spawn_triage_task` + `evaluate_recurring_failure_triage` in coordinator.rs Phase 2.96.
6. Implement `build_triage_prompt` in `src/agency/triage_prompt.rs`.
7. Add the `wg edit --reset-failures` and `--clear-triage-cooldown` flags.
8. Add the 15 smoke scenarios and register them in `tests/smoke/manifest.toml`.
9. Update `wg show` and `wg list -v` rendering.
10. Update `docs/manual/04-coordination.md`/.typ with the recurring-failure backoff + triage section.

NO new status enum variants. NO new color cells in the TUI palette. NO breaker integration code (cite the future hook only). The four config knobs are the entire user-facing surface.

Validation criteria for implement-exponential-failure (paste into its `## Validation` section):

- [ ] Failing tests written first: `tests/integration_backoff_primitive.rs`, `tests/integration_recurring_failure_triage.rs`
- [ ] cargo build + cargo test pass with no regressions
- [ ] All 15 smoke scenarios above land in tests/smoke/manifest.toml with owners=[implement-exponential-failure]
- [ ] `wg show` displays `consecutive_failures` and `last_triaged_at` when nonzero/present
- [ ] No status enum extensions; `Status::is_terminal` and `is_dep_satisfied` unchanged
- [ ] Existing tests for cron, cycle_failure_restarts, requeue still pass
- [ ] Operator override CLIs work: `wg edit --reset-failures`, `wg edit --clear-triage-cooldown`

## Validation check for THIS design task

- [x] All 6 forks resolved with rationale (Forks 1-6 in parts 1+2)
- [x] State diagram in task log (part 2 — recurring task → backoff → triage flow with all branches)
- [x] Schema delta documented: 2 new Task fields + 4 new config knobs; no enum extensions
- [x] Implementation plan with file:line pointers (10 steps)
- [x] Smoke scenario list covers: backoff progression on repeated failures (#1, #5), backoff cap (#2), jitter (#3), reset on success (#4), triage spawning at threshold (#6), all triage decisions (#7 noop, #8 file-fix, #9 pause), cooldown enforcement (#10), recursion guard (#11), circuit-breaker interaction (#12), shell vs LLM (#13, #14), CLI overrides (#15)
- [x] No source modifications — design only (this task posts wg log entries; no files touched)
- 2026-05-01T14:36:37.971131178+00:00 Validation summary:
  - All 6 forks resolved (parts 1+2):
    - Fork 1 → 2x mult, 24h cap, ±10% jitter, reset on success only
    - Fork 2 → hybrid: reuse cycle_failure_restarts + add 1 field consecutive_failures (rejected pure-derived as fragile/slow)
    - Fork 3 → N=3 threshold, 24h cooldown, pausing does NOT reset counter (forensic preservation)
    - Fork 4 → triage CAN file-fix + pause-self; CANNOT mutate other tasks; phase-1 prompt enforcement, phase-2 binary-level WG_TRIAGE_MODE deferred
    - Fork 5 → triage spawn IS gated by breaker (when breaker exists); backoff slows breaker's reset clock indirectly, no direct coupling
    - Fork 6 → triage works for both shell+LLM tasks (file reads); does NOT subsume failed-pending-eval (different scope: per-attempt vs cross-attempt)
  - State diagram (part 2): recurring task → success-resets-counter / failure-grows-delay / threshold-spawns-triage with all 4 verdict branches and recursion guard
  - Schema delta: 2 new Task fields (consecutive_failures u32, last_triaged_at Option&lt;String&gt;) + 4 new config knobs (triage_threshold, triage_cooldown, failure_backoff_multiplier, failure_backoff_max_delay). No enum changes, no color additions.
  - Implementation plan: 10 steps with concrete file:line anchors (src/backoff.rs NEW, src/cron.rs:155 success-reset + new failure-reset, src/graph.rs:2235 cycle backoff, src/commands/service/coordinator.rs Phase 2.95b cron fail loop + Phase 2.96 triage scan, src/agency/triage_prompt.rs NEW, src/config.rs additions, wg edit CLI flags)
  - Smoke scenarios: 15 covering backoff progression (#1,#5), cap (#2), jitter (#3), reset (#4), triage spawn (#6), each verdict (#7-9), cooldown (#10), recursion guard (#11), breaker interaction (#12), shell vs LLM (#13,#14), CLI overrides (#15)
  - No source modifications — design-only task; only wg log entries written.
- 2026-05-01T14:36:48.185145409+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-05-01T14:40:17.173711591+00:00 PendingEval → Done (evaluator passed; downstream unblocks)