Metadata
| Status | done |
|---|---|
| Assigned | agent-851 |
| Agent identity | f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e |
| Created | 2026-04-27T20:59:53.085395732+00:00 |
| Started | 2026-04-27T21:02:09.321497763+00:00 |
| Completed | 2026-04-27T21:39:33.965758920+00:00 |
| Tags | eval-scheduled |
| Eval score | 0.77 |
| └ blocking impact | 0.75 |
| └ completeness | 0.80 |
| └ coordination overhead | 0.80 |
| └ correctness | 0.75 |
| └ downstream usability | 0.80 |
| └ efficiency | 0.75 |
| └ intent fidelity | 0.78 |
| └ style adherence | 0.70 |
Description
Today exposed missing primitives for recovering hung worker agents. The current 'soft retry' pattern requires:
- `wg agents --alive` to find the PID
- `kill <pid>` from a separate shell
- Wait for reaper grace (~30s)
- Trust the dispatcher will respawn
This works (verified today) but is undocumented and hostile. Users — and worker agents themselves, when they're triaging stuck workflows — need a single primitive.
Note: this is distinct from `wg recover` (mass-failure batch tool) and from agency iterations (`auto_rescue_on_eval_fail`, which is for quality-failed tasks, not stuck ones). Hung-task recovery is its own missing axis.
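The manual soft-retry sequence above can be pictured as a small state model. The sketch below is a toy illustration only: `Task`, `reap`, and `dispatch` are hypothetical names, and how the real reaper detects a dead agent is still an open question in this task.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class Task:
    id: str
    status: str = "open"             # "open" or "in-progress" in this toy model
    agent_pid: Optional[int] = None  # PID of the agent working the task, if any


def reap(task: Task, alive_pids: Set[int]) -> None:
    """Reaper step (assumed behaviour): an in-progress task whose agent PID
    is no longer alive goes back to open."""
    if task.status == "in-progress" and task.agent_pid not in alive_pids:
        task.status = "open"
        task.agent_pid = None


def dispatch(task: Task, alive_pids: Set[int], new_pid: int) -> None:
    """Dispatcher step (assumed behaviour): an open task gets a fresh agent."""
    if task.status == "open":
        alive_pids.add(new_pid)
        task.status = "in-progress"
        task.agent_pid = new_pid


# Walk through the manual pattern with made-up PIDs.
alive = {100}
task = Task("t1", status="in-progress", agent_pid=100)
alive.discard(100)          # kill <pid> from a separate shell
reap(task, alive)           # reaper notices once the grace period has passed
dispatch(task, alive, 101)  # dispatcher respawns a new agent
```

The model makes the hidden dependency visible: nothing happens to the task until both the reaper and the dispatcher have run, which is why the manual pattern involves waiting and trusting.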
Desired primitives
1. wg retry <task-id>
- If task is in-progress with an alive agent: kill the agent, let the reaper transition, dispatcher respawns
- If task is failed: re-queue it (equivalent to `wg recover --keep-agency --filter id=<task-id>`)
- Increments attempt counter regardless
- Does NOT abandon `.evaluate-*`/`.flip-*` companions
- `--reason` flag for log entry
- Idempotent: running twice with no time between is safe
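The branching described for `wg retry` can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: `Task` and `retry` are hypothetical names, the status strings are made up, and the "already open means no-op" rule is one possible reading of the idempotency requirement. It follows the plan recorded in the log (kill agent, reset to Open, increment `retry_count`) rather than waiting for the reaper.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Task:
    id: str
    status: str                      # assumed values: "in-progress", "failed", "open", "done"
    agent_pid: Optional[int] = None  # PID of the assigned agent, if any
    retry_count: int = 0
    log: List[str] = field(default_factory=list)


def retry(task: Task, kill: Callable[[int], None], reason: str = "") -> bool:
    """Re-queue a hung or failed task. Returns True if the task was re-queued."""
    if task.status == "in-progress":
        # Hung case: terminate the agent before resetting the task.
        if task.agent_pid is not None:
            kill(task.agent_pid)
            task.agent_pid = None
    elif task.status != "failed":
        # Assumption: open/done tasks are left untouched, so an immediate
        # second call is a harmless no-op.
        return False
    task.status = "open"
    task.retry_count += 1            # incremented in both the hung and failed cases
    task.log.append(f"retry: {reason}" if reason else "retry")
    return True


# Usage with a stand-in kill function that only records the PID.
killed: List[int] = []
hung = Task("t1", status="in-progress", agent_pid=4242)
retry(hung, killed.append, reason="agent at 0% CPU")
```

Passing `kill` in as a callable keeps the decision logic testable without touching real processes.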
2. wg agents kill <agent-id>
- Lower-level building block. Sends SIGTERM (with --force for SIGKILL) to the named agent process
- No-op if agent already dead
- Used internally by `wg retry` for the in-progress case
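The signal semantics listed for `wg agents kill` might look like the sketch below. It assumes a POSIX system and that the agent is addressed by its PID; `kill_agent` is a hypothetical helper for illustration, not the command's actual code.

```python
import os
import signal


def kill_agent(pid: int, force: bool = False) -> bool:
    """Send SIGTERM (or SIGKILL when force=True) to the process `pid`.

    Returns True if a signal was delivered, False if no such process
    exists, in which case the call is a no-op.
    """
    sig = signal.SIGKILL if force else signal.SIGTERM
    try:
        os.kill(pid, sig)
    except ProcessLookupError:
        # Agent already dead: nothing to do.
        return False
    return True
```

SIGTERM gives the agent a chance to shut down cleanly, while SIGKILL cannot be caught or ignored, which is why it sits behind an explicit flag.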
Investigation needed
- How does the reaper currently transition a dead-PID task? (Heartbeat timeout? Process-watcher? Both?)
- What's the minimum touch to graph state to make 'retry now' work — does the dispatcher need to be poked, or does kill-then-wait already cover it?
- Are there race conditions if user runs `wg retry` while the reaper is mid-transition?
Validation
- `wg retry <task-id>` works on an in-progress task with hung agent (verified by killing CPU=0% agent and seeing respawn)
- `wg retry <task-id>` works on a failed task (re-queues without churning agency followups)
- `wg agents kill <agent-id>` cleanly terminates a worker
- Both commands documented in `wg quickstart` recovery section + agent guide (closes one of the gaps from `audit-recovery-outage`)
- Smoke scenario added: kill an agent → verify task respawns → verify attempt counter incremented
- cargo build + cargo test pass with no regressions
Depends on
Required by
- (none)
Log
- 2026-04-27T20:59:53.074714022+00:00 Task paused
- 2026-04-27T21:00:17.748395215+00:00 Task published
- 2026-04-27T21:02:09.321502211+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-27T21:02:24.729940108+00:00 Starting investigation: looking at existing reaper, recover, agents commands
- 2026-04-27T21:05:26.619364003+00:00 Investigation complete. Plan: extend wg retry to handle in-progress (kill agent, reset Open, increment retry_count, --reason flag). Add wg agents kill subcommand (no-op if dead). Update docs + smoke.
- 2026-04-27T21:38:51.545156374+00:00 Implementation complete: wg retry handles in-progress + --reason; wg agents kill subcommand; smoke test; live verified all 4 cases (kill live, kill dead=noop, kill missing=noop, --force). Pre-existing test_global_config_path failure unrelated.
- 2026-04-27T21:39:16.351946573+00:00 Committed: 6581fd5da — pushed to remote
- 2026-04-27T21:39:33.965762747+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-04-27T21:41:39.757503454+00:00 PendingEval → Done (evaluator passed; downstream unblocks)