design-exponential-failure

Design: exponential failure-cycle backoff (generic primitive) + recurring-task triage agent (consumer)

Metadata

Status: done
Assigned: agent-1327
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-05-01T14:26:51.607812286+00:00
Started: 2026-05-01T14:27:50.657062718+00:00
Completed: 2026-05-01T14:36:48.185137083+00:00
Tags: design, state-machine, reliability, failure-handling, eval-scheduled
Eval score: 0.92
└ blocking impact: 0.95
└ completeness: 0.91
└ constraint fidelity: 0.85
└ coordination overhead: 0.95
└ correctness: 0.92
└ downstream usability: 0.90
└ efficiency: 0.88
└ intent fidelity: 0.93
└ style adherence: 0.92

Description

Two related design ideas the user surfaced on 2026-05-01, bundled because the second consumes the first.

Idea 1: Exponential falloff for any cycle of failure (GENERIC PRIMITIVE)

User direct quote: 'we'd want exponential falloff in any cycle of failure. that's a generic thing.'

Generic semantics: when a task or task-cycle fails, the next attempt's delay grows exponentially; it resets to the base interval on success and is capped at a maximum so the task never stops retrying entirely. Jitter is included to prevent a thundering herd when many tasks fail in correlated ways (network outage etc.). A minimal sketch follows the list below.

This applies to (at minimum):

  • Cron-recurring tasks (--cron with --exec OR LLM): failed iteration N → next iteration's delay = base * 2^N (capped at some max, e.g. 24h)
  • Cycle tasks with cycle_delay: same logic on iteration failures
  • Triage spawn rate (consumer of this primitive — see Idea 2)
  • failed-pending-eval circuit breaker (already has its own ad-hoc threshold; could be unified with the generic primitive)
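
A minimal sketch of the backoff primitive, in Python for illustration; BackoffPolicy and next_delay are hypothetical names, not an existing wg API:

  import random

  class BackoffPolicy:
      def __init__(self, base_s: float, multiplier: float = 2.0,
                   cap_s: float = 24 * 3600, jitter: float = 0.10):
          self.base_s = base_s          # the task's normal cron / cycle_delay
          self.multiplier = multiplier  # Fork 1: 2x standard?
          self.cap_s = cap_s            # Fork 1: 24h default cap?
          self.jitter = jitter          # Fork 1: +/-10%?

      def next_delay(self, consecutive_failures: int) -> float:
          # N consecutive failures -> base * multiplier^N, capped; N == 0
          # (last run succeeded) falls back to the base interval.
          capped = min(self.base_s * self.multiplier ** consecutive_failures,
                       self.cap_s)
          # Jitter spreads correlated retries (network outage etc.) so they
          # don't thunder back in lockstep.
          return capped * random.uniform(1 - self.jitter, 1 + self.jitter)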

Idea 2: Triage agent on recurring failures (CONSUMER)

User direct quote: 'a recurring failure could bring about an agent to triage. wouldn't that be something very natural to do? It could decide to do nothing, but it still gives us the ability of flexible recovery.'

When a recurring task hits a threshold of consecutive failures (e.g., 3), spawn a one-shot triage agent to investigate. The agent:

  • Reads recent failure logs (stderr from shell tasks; agent archives from LLM tasks)
  • Reads daemon log around the failure timestamps
  • Looks for patterns (network, credentials, disk, rate-limit, etc.)
  • Outputs a verdict + optional follow-up (shape sketched after this list):
    • Do nothing (transient, will self-resolve)
    • File a fix task with diagnosed root cause + suggested patch
    • Pause the recurring task + notify (urgent intervention needed)
    • Adjust task config (bump retry-delay, switch endpoint, downgrade model tier)
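
One illustrative shape for the verdict, with values mirroring the decision list above; none of these are existing wg types:

  from dataclasses import dataclass
  from enum import Enum
  from typing import Optional

  class TriageAction(Enum):
      NO_OP = "no_op"                    # transient, will self-resolve
      FILE_FIX_TASK = "file_fix_task"    # root cause + suggested patch
      PAUSE_AND_NOTIFY = "pause_notify"  # urgent intervention needed
      ADJUST_CONFIG = "adjust_config"    # retry-delay, endpoint, model tier

  @dataclass
  class TriageVerdict:
      task_id: str
      action: TriageAction
      diagnosis: str                       # human-readable root-cause summary
      config_patch: Optional[dict] = None  # ADJUST_CONFIG only; gated by Fork 4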

User concern (must address in design): 'is it that we're worried about infinite redress?' YES. Triage is one-shot and never auto-spawns more triage on its own failure. Plus, the global circuit breaker (from the failed-pending-eval design) suppresses new triage spawns when the system is in a degraded state.
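
Sketched as a spawn guard; is_triage_task, consecutive_failures, and triage_threshold are hypothetical fields:

  def may_spawn_triage(task, breaker_open: bool) -> bool:
      # One-shot: a failing triage task never begets more triage.
      if task.is_triage_task:
          return False
      # The global circuit breaker (failed-pending-eval design) suppresses
      # new triage spawns while the system is degraded.
      if breaker_open:
          return False
      return task.consecutive_failures >= task.triage_threshold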

Forks to resolve

Fork 1: backoff parameters

  • Base interval source: the task's normal cron / cycle_delay
  • Multiplier: 2x (standard)? Or configurable per-task / globally?
  • Cap: 24h reasonable default; configurable?
  • Reset: on next success OR after some time-since-last-failure window, whichever comes first? (one reading sketched after this list)
  • Jitter: ±10%? ±25%? Configurable?
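
One possible reading of the reset rule (streak broken by a success or by a quiet period, whichever comes first); all names are illustrative:

  def effective_failures(consecutive_failures: int,
                         seconds_since_last_failure: float,
                         reset_window_s: float = 24 * 3600) -> int:
      # A success already resets the streak to zero upstream; additionally
      # forgive a streak that has been quiet for reset_window_s.
      if seconds_since_last_failure >= reset_window_s:
          return 0
      return consecutive_failures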

Fork 2: where the backoff state lives

  • Per-task backoff state in graph.jsonl: failure_cycle_count, last_failure_at, next_attempt_at fields
  • OR derived state computed at scheduling time from the task's run history (no new fields, just smarter scheduler logic)
  • Recommend the latter: fewer schema changes (derivation sketched after this list)
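
Sketch of the derived-state approach: compute the streak from run history at scheduling time, no new graph.jsonl fields. run_history is a hypothetical oldest-first list of run records, each with an ok flag:

  def consecutive_failures(run_history) -> int:
      n = 0
      for run in reversed(run_history):  # walk back from the most recent run
          if run.ok:
              break                      # a success resets the streak
          n += 1
      return n

  # The scheduler then combines this with the Idea 1 sketch:
  #   delay = policy.next_delay(consecutive_failures(history))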

Fork 3: triage threshold + cooldown

  • N=3 consecutive failures triggers triage (default)? Per-task override?
  • After triage runs, a cooldown period before another triage can fire for the same task: 24h default? (gate sketched after this list)
  • What if triage's own decision is 'pause' — does pausing reset the failure counter?
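
The threshold-plus-cooldown gate with the proposed defaults (N=3, 24h cooldown); names and epoch-second timestamps are assumptions:

  from typing import Optional

  TRIAGE_THRESHOLD = 3           # consecutive failures before triage fires
  TRIAGE_COOLDOWN_S = 24 * 3600  # minimum gap between triage runs per task

  def triage_due(failures: int, now_s: float,
                 last_triage_at_s: Optional[float]) -> bool:
      if failures < TRIAGE_THRESHOLD:
          return False
      # Cooldown: once triage has fired for a task, hold off even if the
      # task keeps failing.
      if (last_triage_at_s is not None
              and now_s - last_triage_at_s < TRIAGE_COOLDOWN_S):
          return False
      return True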

Fork 4: what triage is allowed to mutate

  • Read-only by default — produces a report + optionally files new tasks
  • BUT: if triage's diagnosis includes a clear config fix (bad endpoint, wrong model name), can it apply directly via wg edit?
  • Recommend: triage CAN file fix tasks autonomously and CAN pause the failing task autonomously, but CANNOT mutate other-task config without explicit rule-based authorization (default-deny sketch after this list)
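
A default-deny sketch of that rule: autonomous actions form an explicit allowlist, and anything else needs a rule grant. All names are hypothetical:

  AUTONOMOUS_OK = {"file_fix_task", "pause_failing_task"}

  def authorize(action: str, rule_grants: set[str]) -> bool:
      if action in AUTONOMOUS_OK:
          return True
      # Mutating another task's config (i.e. wg edit) is denied unless an
      # explicit rule grants it.
      return action in rule_grants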

Fork 5: integration with failed-pending-eval circuit breaker

  • The circuit breaker from design-failed-pending (Fork 7) trips on systemic eval failures. Should triage spawning ALSO be gated by it? Yes, probably: if the agency infra is down, spawning a triage agent (which itself uses an LLM) is wasted.
  • Should the backoff primitive be part of the circuit breaker's reset logic (failure rate falls below threshold via backoff = breaker can auto-reset)?

Fork 6: shell vs LLM task asymmetry

  • Triage on shell-mode tasks: the failing task itself has no LLM, but the triage agent IS an LLM, so triage works for shell tasks (it reads the stderr from logs). Confirm.
  • Triage on LLM tasks that fail: the failing agent's logs ARE LLM output, so triage reads another LLM's output for diagnosis. Possibly cheaper to combine with the existing failed-pending-eval rescue (which already evaluates the failed agent's output); the design should clarify whether triage subsumes failed-pending-eval for recurring contexts.

Deliverable

Posted via wg log:

  • Resolutions for all 6 forks with rationale
  • State diagram showing how a task progresses through normal-success / recoverable-failure (triage rescues) / persistent-failure (backoff into eventually-paused)
  • Schema changes if any (graph.jsonl fields)
  • Concrete implementation plan (file:line pointers) for both the generic backoff primitive and the triage consumer
  • Smoke scenario list

Validation

  • All 6 forks resolved with rationale
  • State diagram in task log
  • Implementation plan concrete enough for a follow-up implementation task
  • Smoke scenario list covers: backoff progression on repeated failures, triage spawning on threshold, triage decisions (no-op, file-fix, pause-and-notify), cooldown enforcement, circuit-breaker interaction, shell vs LLM task distinctions
  • No source modifications — design only

Depends on

Required by

Log