Metadata
| Status | done |
|---|---|
| Assigned | agent-824 |
| Agent identity | 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3 |
| Created | 2026-04-27T19:14:51.280326986+00:00 |
| Started | 2026-04-27T19:15:51.799036995+00:00 |
| Completed | 2026-04-27T19:22:35.818475859+00:00 |
| Tags | eval-scheduled |
Description
Today's outage exposed that the recovery process — credit-exhaustion / mass-failure batch retry, openrouter→claude:opus migration, stale coordinator-state model_override — is not documented in places worker agents can see. The chat agent (Claude with project memory) figured it out, but workers of any model/executor would be stuck.
Audit scope
Read every text surface a worker agent sees on a fresh task:
- `wg quickstart` output (start-of-session orientation)
- Agent guide (whatever `wg` ships as the worker prompt prelude)
- `CLAUDE.md` at repo root
- `wg <command> --help` for: recover, endpoints, service, config, agents
- Any AGENT.md / AGENTS.md / docs/ files referenced by the bootstrap path
- `.workgraph/README` or template files if any
For each, ask: could a worker agent that hit a credit-exhaustion / mass-failure / wrong-model-routing situation recover without external help?
Specific gaps to look for
- `wg recover`: is it mentioned anywhere outside its own `--help`? Quickstart? Agent guide?
- `--keep-agency`, `--set-model`, `--set-endpoint`, `--filter` patterns: are example invocations documented?
- Model precedence chain (per-task > coordinator-state model_override > local config > global): is this written down anywhere? Today proved it's load-bearing.
- Stale coordinator-state model_override: the trap that bit us today. Is the existence + location (`.wg/service/coordinator-state-N.json`) documented? Is the recovery procedure (manual edit or any CLI to clear it)?
- Endpoint cleanup: `wg endpoints remove` exists, but is its role in recovery flagged? `is_default = true` in global vs `is_default = false` in local: does the merge behavior get documented or do agents have to discover it the hard way?
- `.wg/` vs `.workgraph/` directory split: which is canonical now? Today's session showed both exist with stale duplicates of service state. Is the migration documented?
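To make the precedence gap concrete, here is a minimal sketch of how a chain like the one quoted above behaves when a stale override is present. It is not `wg`'s implementation: the function `resolve_model`, the source labels, and the placeholder model name `openrouter:some-model` are hypothetical. Only the ordering (per-task > coordinator-state model_override > local config > global) is taken from the list above.

```python
from typing import Optional

# Highest priority first, following the chain quoted in the gaps list.
# The labels are illustrative names, not wg config keys.
PRECEDENCE = (
    "per_task",
    "coordinator_state_override",
    "local_config",
    "global_config",
)


def resolve_model(sources: dict[str, Optional[str]]) -> tuple[Optional[str], Optional[str]]:
    """Return (model, winning_source), or (None, None) if no source sets a model."""
    for name in PRECEDENCE:
        value = sources.get(name)
        if value:  # unset or empty entries fall through to the next source
            return value, name
    return None, None


# Hypothetical situation: local and global config were already migrated,
# but a leftover override in coordinator state still wins.
stale = {
    "per_task": None,
    "coordinator_state_override": "openrouter:some-model",  # placeholder value
    "local_config": "claude:opus",
    "global_config": "claude:opus",
}
print(resolve_model(stale))

# Clearing the override lets the local config take effect.
cleared = {**stale, "coordinator_state_override": None}
print(resolve_model(cleared))
```

Under these assumptions, editing local or global config has no visible effect until the override is cleared, which is why the location of that override needs to be documented where workers can find it.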
Deliverable
Produce docs/audit-recovery-docs-2026-04-27.md with:
- Table: each text surface × each gap (✓ documented / ✗ missing / ~ partial)
- Prioritized list of fixes (which surfaces need updating, with proposed wording for the highest-leverage gap)
- One concrete recommendation: should recovery live in quickstart, agent guide, a new RECOVERY.md, or all three?
Validation
- Every text surface in 'Audit scope' has been read and explicitly assessed
- Gap table covers all 6 items in 'Specific gaps to look for'
- Deliverable file exists at `docs/audit-recovery-docs-2026-04-27.md`
- Top recommendation is actionable (i.e., a follow-up implementation task could be filed from it without re-doing research)
Depends on
Required by
- (none)
Log
- 2026-04-27T19:14:51.246678094+00:00 Task paused
- 2026-04-27T19:14:53.907665690+00:00 Task published
- 2026-04-27T19:15:49.763324744+00:00 Lightweight assignment: agent=Default Evaluator (31847164), exec_mode=full, context_scope=task, reason=Evaluator role is ideal for systematic audit tasks; highest proven score (0.66) with 115 completed evaluations, and task is explicitly tagged eval-scheduled.
- 2026-04-27T19:15:51.799043197+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-27T19:16:00.507970173+00:00 Starting audit. Inspecting prior worktree state, then enumerating each text surface in the audit scope.
- 2026-04-27T19:19:48.195620635+00:00 Evidence collection complete. Surfaces audited: wg quickstart, AGENT-GUIDE.md, CLAUDE.md, ~/.claude/skills/wg/SKILL.md, AGENT-LIFECYCLE.md, AGENT-SERVICE.md, wg recover/endpoints/service/config/agents --help. No AGENT.md, AGENTS.md, RECOVERY.md, or .workgraph/README found. .wg/ does not exist in this worktree (only .workgraph/). Writing deliverable.
- 2026-04-27T19:22:04.636767569+00:00 Validated against task criteria: (1) all 14 surfaces in audit scope explicitly assessed (incl. note that .wg/, AGENT.md, AGENTS.md, RECOVERY.md, .workgraph/README do not exist); (2) gap matrix covers all 6 specific gaps G1-G6 with row per surface; (3) deliverable at docs/audit-recovery-docs-2026-04-27.md exists (211 lines); (4) top recommendation is concrete — produced a ready-to-file follow-up task with file scope, validation checklist, and one human-decision flag for the .wg/ vs .workgraph/ canonicality question.
- 2026-04-27T19:22:28.490699644+00:00 Committed: 963ad78d7 — pushed to remote at wg/agent-824/audit-recovery-outage
- 2026-04-27T19:22:35.818479596+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-04-27T19:24:49.406133658+00:00 PendingEval → Done (evaluator passed; downstream unblocks)