add-wg-retry

Add wg retry / wg agents kill — clean primitives for hung-task recovery

Metadata

Statusdone
Assignedagent-851
Agent identityf51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created2026-04-27T20:59:53.085395732+00:00
Started2026-04-27T21:02:09.321497763+00:00
Completed2026-04-27T21:39:33.965758920+00:00
Tagseval-scheduled
Eval score0.77
└ blocking impact0.75
└ completeness0.80
└ coordination overhead0.80
└ correctness0.75
└ downstream usability0.80
└ efficiency0.75
└ intent fidelity0.78
└ style adherence0.70

Description

Description

Today exposed missing primitives for recovering hung worker agents. The current 'soft retry' pattern requires:

  1. wg agents --alive to find the PID
  2. kill <pid> from a separate shell
  3. Wait for reaper grace (~30s)
  4. Trust the dispatcher will respawn

This works (verified today) but is undocumented and hostile. Users — and worker agents themselves, when they're triaging stuck workflows — need a single primitive.

Note: this is distinct from wg recover (mass-failure batch tool) and distinct from agency iterations (auto_rescue_on_eval_fail, which is for quality-failed tasks not stuck ones). Hung-task recovery is its own missing axis.

Desired primitives

1. wg retry <task-id>

  • If task is in-progress with an alive agent: kill the agent, let the reaper transition, dispatcher respawns
  • If task is failed: re-queue it (equivalent to wg recover --keep-agency --filter id=<task-id>)
  • Increments attempt counter regardless
  • Does NOT abandon .evaluate-* / .flip-* companions
  • --reason flag for log entry
  • Idempotent: running twice with no time between is safe

2. wg agents kill <agent-id>

  • Lower-level building block. Sends SIGTERM (with --force for SIGKILL) to the named agent process
  • No-op if agent already dead
  • Used internally by wg retry for the in-progress case

Investigation needed

  • How does the reaper currently transition a dead-PID task? (Heartbeat timeout? Process-watcher? Both?)
  • What's the minimum touch to graph state to make 'retry now' work — does the dispatcher need to be poked, or does kill-then-wait already cover it?
  • Are there race conditions if user runs wg retry while the reaper is mid-transition?

Validation

  • wg retry <task-id> works on an in-progress task with hung agent (verified by killing CPU=0% agent and seeing respawn)
  • wg retry <task-id> works on a failed task (re-queues without churning agency followups)
  • wg agents kill <agent-id> cleanly terminates a worker
  • Both commands documented in wg quickstart recovery section + agent guide (closes one of the gaps from audit-recovery-outage)
  • Smoke scenario added: kill an agent → verify task respawns → verify attempt counter incremented
  • cargo build + cargo test pass with no regressions

Depends on

Required by

Log