design-claim-lifecycle

Design: claim lifecycle for wg reset / wg retry / dispatcher heartbeat

Metadata

Status: done
Assigned: agent-979
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Model: claude:opus
Created: 2026-04-28T22:23:26.155276213+00:00
Started: 2026-04-28T22:32:48.975627211+00:00
Completed: 2026-04-28T22:42:26.255573735+00:00
Tags: bug, design, claims, eval-scheduled
Eval score: 0.94
└ blocking impact: 0.95
└ completeness: 0.92
└ constraint fidelity: 0.70
└ coordination overhead: 0.97
└ correctness: 0.95
└ downstream usability: 0.94
└ efficiency: 0.90
└ intent fidelity: 0.90
└ style adherence: 0.93

Description

Pick the implementation approach for two related bugs:

  • bug-reset-leaves-stale-claims.md — wg reset doesn't clear claimed_by / assigned_agent, so dispatcher silently skips reset tasks (sees them as still claimed by dead agents)
  • bug-retry-doesnt-clear-stale-downstream-claims.md — wg retry <upstream> reopens upstream but doesn't touch downstream tasks; downstream stays claimed by long-dead agents and never spawns

Full details at: /home/erik/workgraph/bug-reset-leaves-stale-claims.md and /home/erik/workgraph/bug-retry-doesnt-clear-stale-downstream-claims.md

Options to evaluate

  • Eager (A+C combined): wg reset and wg retry both unconditionally clear claimed_by on the target task AND on transitive downstream tasks. Simple, no liveness check needed.
  • Lazy (B): Dispatcher heartbeat — at tick time, validate each claimed task's agent is still alive. If not, unclaim and re-queue. Catches stale claims from any source (kill -9, panic, crash), not just user-initiated reset/retry.
  • Both: Eager on reset/retry (fast user feedback) + heartbeat as safety net (catches all other paths).
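To make the trade-off between the options concrete, here is a minimal Rust sketch of both mechanisms. It is an illustration under stated assumptions, not the workgraph implementation: the `Task` struct, the `downstream` edge list, the `Graph` alias, and both function names are hypothetical. Only the `claimed_by` and `assigned_agent` field names come from the bug reports. Clearing `assigned_agent` along with `claimed_by` is also an assumption, and the liveness check is passed in as a closure because the real check (PID probe, agent registry lookup, or similar) is not specified here.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Hypothetical in-memory task record. Only `claimed_by` and
/// `assigned_agent` are named in the bug reports; the rest is assumed.
#[derive(Debug, Clone, Default)]
pub struct Task {
    pub claimed_by: Option<String>,
    pub assigned_agent: Option<String>,
    /// IDs of tasks that depend on this one (assumed edge representation).
    pub downstream: Vec<String>,
}

/// Hypothetical graph: task ID -> task.
pub type Graph = HashMap<String, Task>;

/// Eager path: clear the claim on `root` and on every transitive
/// downstream task, with no liveness check.
/// Returns the sorted IDs of tasks whose claim was actually cleared.
pub fn clear_claims_transitive(graph: &mut Graph, root: &str) -> Vec<String> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut queue: VecDeque<String> = VecDeque::new();
    let mut cleared = Vec::new();

    queue.push_back(root.to_string());
    while let Some(id) = queue.pop_front() {
        // Skip tasks already visited (diamond-shaped dependencies, cycles).
        if !seen.insert(id.clone()) {
            continue;
        }
        if let Some(task) = graph.get_mut(&id) {
            if task.claimed_by.is_some() || task.assigned_agent.is_some() {
                task.claimed_by = None;
                task.assigned_agent = None;
                cleared.push(id.clone());
            }
            queue.extend(task.downstream.iter().cloned());
        }
    }
    cleared.sort();
    cleared
}

/// Lazy path: one heartbeat sweep. Unclaims every task whose claiming
/// agent fails the caller-supplied liveness check.
/// Returns the sorted IDs of tasks that were unclaimed.
pub fn sweep_stale_claims<F>(graph: &mut Graph, is_alive: F) -> Vec<String>
where
    F: Fn(&str) -> bool,
{
    let mut unclaimed = Vec::new();
    for (id, task) in graph.iter_mut() {
        let stale = match task.claimed_by.as_deref() {
            Some(agent) => !is_alive(agent),
            None => false,
        };
        if stale {
            task.claimed_by = None;
            task.assigned_agent = None;
            unclaimed.push(id.clone());
        }
    }
    unclaimed.sort();
    unclaimed
}
```

In this sketch the eager function only reaches tasks downstream of the one the user named, while the sweep visits every claimed task regardless of how the claim went stale. That difference is what the "Both" option relies on: the eager clear gives immediate feedback on reset/retry, and the sweep picks up claims left by crashes that no user command touches.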

Goal

Decide which approach to implement. Write a 1-page design doc with:

  • Chosen approach + rationale (why not the others)
  • Field/column changes needed in graph.jsonl
  • Code locations that need changes (reset.rs, retry.rs, dispatcher poll loop) — list paths only, do NOT modify
  • Backward compat concerns (existing graphs with stale claims)
  • Concrete repro test list — what live-smoke scenarios this needs to ship with

Also fix the misleading hint in wg service status ("check agent configuration") to mention stale claims as a possibility.

Validation

  • Design doc written and posted to task log via wg log
  • Approach chosen with explicit reasoning vs. alternatives
  • File paths and function names identified for the implementation task to follow
  • At least 2 smoke scenarios specified (reset path + retry-with-downstream path)

Depends on

Required by

Log