audit-recovery-outage

Audit: recovery + outage workflows in agent-visible docs

Metadata

Status: done
Assigned: agent-824
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-04-27T19:14:51.280326986+00:00
Started: 2026-04-27T19:15:51.799036995+00:00
Completed: 2026-04-27T19:22:35.818475859+00:00
Tags: eval-scheduled

Description

Today's outage exposed that the recovery process is not documented anywhere worker agents can see. That process covers three things: batch retry after credit exhaustion or mass failure, the openrouter→claude:opus migration, and handling a stale coordinator-state model_override. The chat agent (Claude with project memory) worked it out, but a worker agent on any model or executor would have been stuck.

Audit scope

Read every text surface a worker agent sees on a fresh task:

  • wg quickstart output (start-of-session orientation)
  • Agent guide (whatever wg ships as the worker prompt prelude)
  • CLAUDE.md at repo root
  • wg <command> --help for: recover, endpoints, service, config, agents
  • Any AGENT.md / AGENTS.md / docs/ files referenced by the bootstrap path
  • .workgraph/ README or template files if any

For each, ask: could a worker agent that hit a credit-exhaustion / mass-failure / wrong-model-routing situation recover without external help?
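
A minimal sketch of how the CLI and root-file surfaces above could be gathered in one pass. It uses only the commands and file names listed in the audit scope. The function names are hypothetical, and the agent guide, docs/ files, and .workgraph/ templates are left out because their locations are not stated here.

```python
import subprocess
from pathlib import Path

# Subcommands whose --help output is in the audit scope.
HELP_SUBCOMMANDS = ["recover", "endpoints", "service", "config", "agents"]

# Root-level files named in the scope; only those that exist are read.
CANDIDATE_FILES = ["CLAUDE.md", "AGENT.md", "AGENTS.md"]


def surface_commands():
    """Return the argv lists whose output makes up the CLI text surfaces."""
    commands = [["wg", "quickstart"]]
    commands += [["wg", sub, "--help"] for sub in HELP_SUBCOMMANDS]
    return commands


def capture(argv):
    """Run one command and return its output, or a note if it cannot run."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"<could not run {' '.join(argv)}: {exc}>"
    return done.stdout + done.stderr


def collect_surfaces(repo_root="."):
    """Map a surface label to its text for every CLI and file surface found."""
    surfaces = {" ".join(argv): capture(argv) for argv in surface_commands()}
    root = Path(repo_root)
    for name in CANDIDATE_FILES:
        path = root / name
        if path.is_file():
            surfaces[name] = path.read_text(encoding="utf-8", errors="replace")
    return surfaces
```

Calling `collect_surfaces()` from the repo root yields a dict keyed by surface label, which can then be searched for each gap term by hand or by script.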

Specific gaps to look for

  1. wg recover — is it mentioned anywhere outside its own --help? Quickstart? Agent guide?
  2. --keep-agency, --set-model, --set-endpoint, --filter patterns — are example invocations documented?
  3. Model precedence chain (per-task > coordinator-state model_override > local config > global) — is this written down anywhere? Today proved it's load-bearing.
  4. Stale coordinator-state model_override: the trap that bit us today. Is its existence and location (.wg/service/coordinator-state-N.json) documented? Is the recovery procedure (manual edit, or any CLI that clears it) documented?
  5. Endpoint cleanup: wg endpoints remove exists, but is its role in recovery flagged? is_default = true in global vs. is_default = false in local: is the merge behavior documented, or do agents have to discover it the hard way?
  6. .wg/ vs .workgraph/ directory split — which is canonical now? Today's session showed both exist with stale duplicates of service state. Is the migration documented?
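
To make gaps 3 and 4 concrete, here is a small sketch of the precedence chain as stated in gap 3, plus a scan for coordinator-state files that still carry an override. This illustrates the order written above and is not wg's actual resolver. The function names are hypothetical, and the scan assumes each coordinator-state-N.json is a JSON object with a top-level "model_override" key, which this task does not confirm.

```python
import json
from pathlib import Path


def resolve_model(per_task=None, coordinator_override=None,
                  local_config=None, global_config=None):
    """Return (source, model) for the first level that sets a model.

    Order follows gap 3: per-task, then coordinator-state model_override,
    then local config, then global config.
    """
    chain = [
        ("per-task", per_task),
        ("coordinator-state model_override", coordinator_override),
        ("local config", local_config),
        ("global config", global_config),
    ]
    for source, model in chain:
        if model:  # None or empty string means "not set at this level"
            return source, model
    return None, None


def find_overrides(service_dir=".wg/service"):
    """List (path, model_override) for coordinator-state files that set one.

    ASSUMPTION: the override is a top-level "model_override" key in each
    coordinator-state-N.json. The real schema may differ.
    """
    found = []
    for path in sorted(Path(service_dir).glob("coordinator-state-*.json")):
        try:
            state = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # unreadable or malformed file: skip rather than guess
        if isinstance(state, dict) and state.get("model_override"):
            found.append((str(path), state["model_override"]))
    return found
```

With this order, a leftover override beats a corrected local config: `resolve_model(coordinator_override="old-model", local_config="claude:opus")` returns the override ("old-model" is a placeholder), which is why editing config alone would not change routing.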

Deliverable

Produce docs/audit-recovery-docs-2026-04-27.md with:

  • Table: each text surface × each gap (✓ documented / ✗ missing / ~ partial)
  • Prioritized list of fixes (which surfaces need updating, with proposed wording for the highest-leverage gap)
  • One concrete recommendation: should recovery live in quickstart, agent guide, a new RECOVERY.md, or all three?
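
One possible way to produce the gap table is sketched below. The surface and gap labels are shortened from the lists in this task, the Markdown layout is an assumption (chosen because the deliverable is a .md file), and the "?" mark for unassessed cells is an addition so that anything not yet reviewed stays visible.

```python
SURFACES = [
    "wg quickstart",
    "Agent guide",
    "CLAUDE.md",
    "wg recover --help",
    "wg endpoints --help",
    "wg service --help",
    "wg config --help",
    "wg agents --help",
    "AGENT.md / AGENTS.md / docs/",
    ".workgraph/ README or templates",
]

GAPS = [
    "wg recover mentioned",
    "flag examples",
    "model precedence",
    "stale model_override",
    "endpoint cleanup",
    ".wg/ vs .workgraph/",
]

# Marks from the deliverable spec; "?" flags a cell nobody has assessed yet.
MARKS = {"documented": "✓", "missing": "✗", "partial": "~"}


def render_table(findings):
    """Render findings {(surface, gap): status} as a Markdown table."""
    header = "| Surface | " + " | ".join(GAPS) + " |"
    divider = "|" + " --- |" * (len(GAPS) + 1)
    rows = [header, divider]
    for surface in SURFACES:
        cells = [MARKS.get(findings.get((surface, gap)), "?") for gap in GAPS]
        rows.append("| " + surface + " | " + " | ".join(cells) + " |")
    return "\n".join(rows)
```

Passing an empty dict prints a table of "?" cells, which doubles as a checklist: the audit is complete only when no "?" remains.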

Validation

  • Every text surface in 'Audit scope' has been read and explicitly assessed
  • Gap table covers all 6 items in 'Specific gaps to look for'
  • Deliverable file exists at docs/audit-recovery-docs-2026-04-27.md
  • Top recommendation is actionable (i.e., a follow-up implementation task could be filed from it without re-doing research)

Depends on

Required by

Log