re-implement-fix — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-1863`
Agent identity	`3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3`
Created	2026-05-03T04:33:06.494762897+00:00
Started	2026-05-03T04:44:41.254887422+00:00
Completed	2026-05-03T05:00:56.881295210+00:00
Tags	`eval-scheduled`
Eval score	0.91
└ blocking impact	0.90
└ completeness	0.92
└ constraint fidelity	0.55
└ coordination overhead	0.92
└ correctness	0.92
└ downstream usability	0.90
└ efficiency	0.88
└ intent fidelity	0.89
└ style adherence	0.90

Description

fix-supervisor-restart-backoff was marked Done by the evaluator at 0.04 (constraint_fidelity=0.70, intent_fidelity=0.01) with the explicit finding "no implementation artifacts found in assigned worktree; branch has 0 commits ahead of main and no diffs in scoped files". The intended backoff for session-lock contention never landed.

Symptom (surfaced by integrate-nex-chat-end-to-end): the existing rate limit at coordinator_agent.rs:645-659 pushes to restart_timestamps on EVERY spawn, clean exits included (line 892). As a result, 3 normal TUI handoff cycles (write sentinel → cooperative release → respawn) trigger a 10-minute pause. The intended fix: detect "exit status=1 within ~1s of spawn AND live session-lock holder present", back off ≥10s, increment a SEPARATE counter, and exit the supervisor only after N consecutive contentions.
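
The detection rule described above can be sketched as a pure function. This is a minimal illustration, not the code in coordinator_agent.rs: the function name classify_child_exit appears in the task log, but the signature, the ChildExit enum, and the exact 1-second threshold are assumptions made for this example.

```rust
use std::time::Duration;

/// Hypothetical exit categories. The variant names echo the task log;
/// the enum itself is an assumption for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ChildExit {
    /// Exit status 0, e.g. a cooperative TUI handoff.
    Clean,
    /// Exit status 1 shortly after spawn while another process holds the session lock.
    SessionLockContention,
    /// Any other abnormal exit (including termination by signal).
    Crash,
}

/// Assumed cutoff for "within ~1s of spawn".
const FAST_EXIT: Duration = Duration::from_secs(1);

/// Classify a child exit from its exit code, how long it ran, and whether a
/// live session-lock holder was observed (all three inputs are assumed to be
/// gathered by the caller).
fn classify_child_exit(
    exit_code: Option<i32>,
    uptime: Duration,
    live_lock_holder: bool,
) -> ChildExit {
    match exit_code {
        Some(0) => ChildExit::Clean,
        Some(1) if uptime <= FAST_EXIT && live_lock_holder => ChildExit::SessionLockContention,
        _ => ChildExit::Crash,
    }
}
```

Keeping the classifier free of I/O makes it straightforward to unit-test: for example, `classify_child_exit(Some(1), Duration::from_millis(300), true)` yields `ChildExit::SessionLockContention`, while the same exit with no live lock holder is treated as a crash.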

File scope

src/commands/service/coordinator_agent.rs (restart loop around lines 630-895)
tests/ (unit test that simulates exit-status-1 + recent live lock holder)

Validation

Failing test written first: simulates exit-status-1 + live-session-lock holder → assert ≥10s sleep recorded, NOT 600s rate-limit
Implementation distinguishes session-lock contention exits from clean exits
Clean exits do NOT push restart_timestamps (or use a separate counter)
Three TUI handoff cycles in <1 minute do not trip the rate-limit
cargo build + cargo test pass with no regressions
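
The "clean exits do not push restart_timestamps" and "three handoff cycles do not trip the rate-limit" criteria can be illustrated with a small sliding-window limiter. This is a sketch under stated assumptions: the RestartLimiter type, the threshold of 3 restarts, and the 600-second window are hypothetical; only the 10-minute pause and the restart_timestamps and rate_limit_wait names come from the task text.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

const MAX_RESTARTS: usize = 3; // assumed threshold
const WINDOW: Duration = Duration::from_secs(600); // assumed window length
const PAUSE: Duration = Duration::from_secs(600); // the 10-minute pause from the description

/// Hypothetical holder for the restart-rate state.
struct RestartLimiter {
    restart_timestamps: VecDeque<Instant>,
}

impl RestartLimiter {
    fn new() -> Self {
        Self { restart_timestamps: VecDeque::new() }
    }

    /// Record an exit. Only crashes are counted; clean exits leave the window untouched.
    fn record_exit(&mut self, is_crash: bool, now: Instant) {
        if is_crash {
            self.restart_timestamps.push_back(now);
        }
    }

    /// Drop timestamps older than the window, then report the pause to apply, if any.
    fn rate_limit_wait(&mut self, now: Instant) -> Option<Duration> {
        while let Some(&oldest) = self.restart_timestamps.front() {
            if now.duration_since(oldest) > WINDOW {
                self.restart_timestamps.pop_front();
            } else {
                break;
            }
        }
        if self.restart_timestamps.len() >= MAX_RESTARTS {
            Some(PAUSE)
        } else {
            None
        }
    }
}
```

Passing `now` in explicitly, rather than calling `Instant::now()` inside the methods, lets a test simulate three handoffs in under a minute without sleeping.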

Depends on

Required by

(none)

Log

2026-05-03T04:44:41.254890938+00:00 Spawned by coordinator --executor claude --model opus
2026-05-03T04:44:59.274296824+00:00 Starting: investigating restart loop at coordinator_agent.rs:630-895
2026-05-03T04:56:53.921488204+00:00 Implemented: classify_child_exit + lock_contention_backoff + rate_limit_wait helpers in src/commands/service/coordinator_agent.rs (lines 58-156). Restart-rate timestamps now only push on Crash (genuine non-zero exit, no live lock holder). SessionLockContention exits go through a separate counter with linear backoff (10s base, 60s ceiling), capped at 6 consecutive contentions before exiting supervisor. Clean exits reset both counters.
2026-05-03T05:00:44.400914742+00:00 Committed: 2cee3e559 — pushed to remote. Validated: classify_child_exit + lock_contention_backoff + rate_limit_wait pure-fn helpers + 7 new regression tests in coordinator_agent.rs::tests. All 25 coordinator_agent unit tests pass; cargo test --bin wg passes (3372/3372); cargo test --lib passes (2159/2159). 2 pre-existing failures in integration_chat_rename (CLAUDE.md/coordinator.rs terminology guards) confirmed pre-existing and unrelated.
2026-05-03T05:00:56.881301822+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-05-03T05:03:52.844710918+00:00 PendingEval → Done (evaluator passed; downstream unblocks)
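
The backoff policy recorded in the 04:56:53 log entry (linear, 10s base, 60s ceiling, at most 6 consecutive contentions) might look roughly like the following. The function name lock_contention_backoff is taken from the log; the signature and the choice to give up on the seventh consecutive contention, rather than the sixth, are assumptions for this sketch.

```rust
use std::time::Duration;

const BASE: Duration = Duration::from_secs(10); // 10s base, per the log
const CEILING: Duration = Duration::from_secs(60); // 60s ceiling, per the log
const MAX_CONSECUTIVE: u32 = 6; // cap on consecutive contentions, per the log

/// Return how long to wait after the `consecutive`-th contention in a row
/// (1-based), or `None` once the cap is exceeded and the supervisor should exit.
/// Assumption: contentions 1 through 6 are retried and the 7th gives up.
fn lock_contention_backoff(consecutive: u32) -> Option<Duration> {
    if consecutive > MAX_CONSECUTIVE {
        return None;
    }
    // Linear growth: 10s, 20s, 30s, ... clamped to the ceiling.
    Some((BASE * consecutive).min(CEILING))
}
```

Under these assumptions the waits run 10s, 20s, 30s, 40s, 50s, 60s before the supervisor exits, and a clean exit in between would reset the counter to zero.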