add-wg-retry — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-851`
Agent identity	`f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e`
Created	2026-04-27T20:59:53.085395732+00:00
Started	2026-04-27T21:02:09.321497763+00:00
Completed	2026-04-27T21:39:33.965758920+00:00
Tags	`eval-scheduled`
Eval score	0.77
└ blocking impact	0.75
└ completeness	0.80
└ coordination overhead	0.80
└ correctness	0.75
└ downstream usability	0.80
└ efficiency	0.75
└ intent fidelity	0.78
└ style adherence	0.70

Description

Today exposed missing primitives for recovering hung worker agents. The current 'soft retry' pattern requires:

wg agents --alive to find the PID
kill <pid> from a separate shell
Wait for reaper grace (~30s)
Trust the dispatcher will respawn

This works (verified today) but is undocumented and hostile. Users — and worker agents themselves, when they're triaging stuck workflows — need a single primitive.

Note: this is distinct from wg recover (mass-failure batch tool) and distinct from agency iterations (auto_rescue_on_eval_fail, which is for quality-failed tasks not stuck ones). Hung-task recovery is its own missing axis.

Desired primitives

1. wg retry <task-id>

If task is in-progress with an alive agent: kill the agent, let the reaper transition, dispatcher respawns
If task is failed: re-queue it (equivalent to wg recover --keep-agency --filter id=<task-id>)
Increments attempt counter regardless
Does NOT abandon .evaluate-* / .flip-* companions
--reason flag for log entry
Idempotent: running twice with no time between is safe

2. wg agents kill <agent-id>

Lower-level building block. Sends SIGTERM (with --force for SIGKILL) to the named agent process
No-op if agent already dead
Used internally by wg retry for the in-progress case

Investigation needed

How does the reaper currently transition a dead-PID task? (Heartbeat timeout? Process-watcher? Both?)
What's the minimum touch to graph state to make 'retry now' work — does the dispatcher need to be poked, or does kill-then-wait already cover it?
Are there race conditions if user runs wg retry while the reaper is mid-transition?

Validation

wg retry <task-id> works on an in-progress task with hung agent (verified by killing CPU=0% agent and seeing respawn)
wg retry <task-id> works on a failed task (re-queues without churning agency followups)
wg agents kill <agent-id> cleanly terminates a worker
Both commands documented in wg quickstart recovery section + agent guide (closes one of the gaps from audit-recovery-outage)
Smoke scenario added: kill an agent → verify task respawns → verify attempt counter incremented
cargo build + cargo test pass with no regressions

## Description

Today exposed missing primitives for recovering hung worker agents. The current 'soft retry' pattern requires:
1. `wg agents --alive` to find the PID
2. `kill <pid>` from a separate shell
3. Wait for reaper grace (~30s)
4. Trust the dispatcher will respawn

This works (verified today) but is undocumented and hostile. Users — and worker agents themselves, when they're triaging stuck workflows — need a single primitive.

Note: this is **distinct from** `wg recover` (mass-failure batch tool) and **distinct from** agency iterations (auto_rescue_on_eval_fail, which is for quality-failed tasks not stuck ones). Hung-task recovery is its own missing axis.

## Desired primitives

**1. `wg retry <task-id>`**
- If task is in-progress with an alive agent: kill the agent, let the reaper transition, dispatcher respawns
- If task is failed: re-queue it (equivalent to `wg recover --keep-agency --filter id=<task-id>`)
- Increments attempt counter regardless
- Does NOT abandon `.evaluate-*` / `.flip-*` companions
- `--reason` flag for log entry
- Idempotent: running twice with no time between is safe

**2. `wg agents kill <agent-id>`**
- Lower-level building block. Sends SIGTERM (with --force for SIGKILL) to the named agent process
- No-op if agent already dead
- Used internally by `wg retry` for the in-progress case

## Investigation needed

- How does the reaper currently transition a dead-PID task? (Heartbeat timeout? Process-watcher? Both?)
- What's the minimum touch to graph state to make 'retry now' work — does the dispatcher need to be poked, or does kill-then-wait already cover it?
- Are there race conditions if user runs `wg retry` while the reaper is mid-transition?

## Validation

- [ ] `wg retry <task-id>` works on an in-progress task with hung agent (verified by killing CPU=0% agent and seeing respawn)
- [ ] `wg retry <task-id>` works on a failed task (re-queues without churning agency followups)
- [ ] `wg agents kill <agent-id>` cleanly terminates a worker
- [ ] Both commands documented in `wg quickstart` recovery section + agent guide (closes one of the gaps from `audit-recovery-outage`)
- [ ] Smoke scenario added: kill an agent → verify task respawns → verify attempt counter incremented
- [ ] cargo build + cargo test pass with no regressions

Depends on

done .assign-add-wg-retry

Required by

(none)

Log

2026-04-27T20:59:53.074714022+00:00 Task paused
2026-04-27T21:00:17.748395215+00:00 Task published
2026-04-27T21:02:09.321502211+00:00 Spawned by coordinator --executor claude --model opus
2026-04-27T21:02:24.729940108+00:00 Starting investigation: looking at existing reaper, recover, agents commands
2026-04-27T21:05:26.619364003+00:00 Investigation complete. Plan: extend wg retry to handle in-progress (kill agent, reset Open, increment retry_count, --reason flag). Add wg agents kill subcommand (no-op if dead). Update docs + smoke.
2026-04-27T21:38:51.545156374+00:00 Implementation complete: wg retry handles in-progress + --reason; wg agents kill subcommand; smoke test; live verified all 4 cases (kill live, kill dead=noop, kill missing=noop, --force). Pre-existing test_global_config_path failure unrelated.
2026-04-27T21:39:16.351946573+00:00 Committed: 6581fd5da — pushed to remote
2026-04-27T21:39:33.965762747+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-04-27T21:41:39.757503454+00:00 PendingEval → Done (evaluator passed; downstream unblocks)