Metadata
| Status | done |
|---|---|
| Assigned | agent-1863 |
| Agent identity | 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3 |
| Created | 2026-05-03T04:33:06.494762897+00:00 |
| Started | 2026-05-03T04:44:41.254887422+00:00 |
| Completed | 2026-05-03T05:00:56.881295210+00:00 |
| Tags | eval-scheduled |
| Eval score | 0.91 |
| └ blocking impact | 0.90 |
| └ completeness | 0.92 |
| └ constraint fidelity | 0.55 |
| └ coordination overhead | 0.92 |
| └ correctness | 0.92 |
| └ downstream usability | 0.90 |
| └ efficiency | 0.88 |
| └ intent fidelity | 0.89 |
| └ style adherence | 0.90 |
Description
fix-supervisor-restart-backoff was marked Done by the evaluator at 0.04 (constraint_fidelity=0.70, intent_fidelity=0.01) with the explicit finding "no implementation artifacts found in assigned worktree; branch has 0 commits ahead of main and no diffs in scoped files". The intended backoff for session-lock contention was never landed.
Symptom (surfaced by integrate-nex-chat-end-to-end): the existing rate-limit at coordinator_agent.rs:645-659 pushes to restart_timestamps on every spawn, clean exits included (line 892), so three normal TUI handoff cycles (write sentinel → cooperative release → respawn) trigger a 10-minute pause. The intended fix was to treat "exit status=1 within ~1s of spawn AND live session-lock holder present" as session-lock contention: back off ≥10s, increment a separate counter, and exit the supervisor only after N consecutive contentions.
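The detection rule described above can be sketched as a pure function. This is a minimal illustration, not the code in coordinator_agent.rs: the `ChildExit` enum, the function signature, and the exact 1-second threshold are assumptions made for the example.

```rust
use std::time::Duration;

/// Hypothetical outcome categories for a supervised child exit.
#[derive(Debug, PartialEq, Eq)]
enum ChildExit {
    /// Exit status 0, e.g. a cooperative TUI handoff.
    Clean,
    /// Exit status 1 shortly after spawn while another process holds the session lock.
    SessionLockContention,
    /// Any other non-zero or signal-terminated exit.
    Crash,
}

/// Assumed cutoff for "within ~1s of spawn".
const FAST_EXIT: Duration = Duration::from_secs(1);

/// Classify an exit from its status code (None = killed by signal),
/// how long the child ran, and whether a live lock holder was observed.
fn classify_child_exit(code: Option<i32>, ran_for: Duration, live_lock_holder: bool) -> ChildExit {
    match code {
        Some(0) => ChildExit::Clean,
        Some(1) if ran_for <= FAST_EXIT && live_lock_holder => ChildExit::SessionLockContention,
        _ => ChildExit::Crash,
    }
}
```

Keeping the classification free of I/O means the "exit-status-1 + live lock holder" case can be unit-tested without spawning a real child.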
File scope
- src/commands/service/coordinator_agent.rs (restart loop around lines 630-895)
- tests/ (unit test that simulates exit-status-1 + recent live lock holder)
Validation
- Failing test written first: simulates exit-status-1 + live-session-lock holder → assert ≥10s sleep recorded, NOT 600s rate-limit
- Implementation distinguishes session-lock contention exits from clean exits
- Clean exits do NOT push restart_timestamps (or use a separate counter)
- Three TUI handoff cycles in <1 minute do not trip the rate-limit
- cargo build + cargo test pass with no regressions
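The "clean exits do NOT push restart_timestamps" criterion could be checked against something like the sketch below. The limit of 3 restarts and the 600s pause come from the description; the 600s look-back window, the `record_exit` helper, and both signatures are assumptions for illustration only.

```rust
use std::time::{Duration, Instant};

/// Restarts tolerated inside the window before pausing (from the description: 3).
const MAX_RESTARTS: usize = 3;
/// Assumed look-back window; the real value is not stated in this task.
const WINDOW: Duration = Duration::from_secs(600);
/// The 10-minute pause mentioned in the description.
const PAUSE: Duration = Duration::from_secs(600);

/// Hypothetical helper: only genuine crashes are recorded for rate-limiting.
fn record_exit(crash_timestamps: &mut Vec<Instant>, is_crash: bool, now: Instant) {
    if is_crash {
        crash_timestamps.push(now);
    }
}

/// Returns Some(pause) when enough recent crashes have accumulated, else None.
fn rate_limit_wait(crash_timestamps: &[Instant], now: Instant) -> Option<Duration> {
    let recent = crash_timestamps
        .iter()
        .filter(|t| now.saturating_duration_since(**t) <= WINDOW)
        .count();
    if recent >= MAX_RESTARTS {
        Some(PAUSE)
    } else {
        None
    }
}
```

Under these assumptions, three clean handoffs leave the timestamp list empty, so `rate_limit_wait` returns `None`, while three crashes inside the window return the 600s pause.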
Depends on
Required by
- (none)
Log
- 2026-05-03T04:44:41.254890938+00:00 Spawned by coordinator --executor claude --model opus
- 2026-05-03T04:44:59.274296824+00:00 Starting: investigating restart loop at coordinator_agent.rs:630-895
- 2026-05-03T04:56:53.921488204+00:00 Implemented: classify_child_exit + lock_contention_backoff + rate_limit_wait helpers in src/commands/service/coordinator_agent.rs (lines 58-156). Restart-rate timestamps now only push on Crash (genuine non-zero exit, no live lock holder). SessionLockContention exits go through a separate counter with linear backoff (10s base, 60s ceiling), capped at 6 consecutive contentions before exiting supervisor. Clean exits reset both counters.
- 2026-05-03T05:00:44.400914742+00:00 Committed: 2cee3e559 — pushed to remote. Validated: classify_child_exit + lock_contention_backoff + rate_limit_wait pure-fn helpers + 7 new regression tests in coordinator_agent.rs::tests. All 25 coordinator_agent unit tests pass; cargo test --bin wg passes (3372/3372); cargo test --lib passes (2159/2159). 2 pre-existing failures in integration_chat_rename (CLAUDE.md/coordinator.rs terminology guards) confirmed pre-existing and unrelated.
- 2026-05-03T05:00:56.881301822+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-05-03T05:03:52.844710918+00:00 PendingEval → Done (evaluator passed; downstream unblocks)
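For readers of the log, the linear backoff it describes (10s base, 60s ceiling, 6 consecutive contentions) might look roughly like the sketch below. The signature is an assumption, as is the reading that the supervisor exits on the contention after the sixth; the committed helper in 2cee3e559 may differ.

```rust
use std::time::Duration;

const BASE_SECS: u64 = 10; // base backoff from the log entry
const CEILING_SECS: u64 = 60; // ceiling from the log entry
const MAX_CONSECUTIVE: u32 = 6; // cap from the log entry

/// Sleep duration for the n-th consecutive contention (1-based),
/// or None once the cap is exceeded and the supervisor should exit.
/// Assumption: contentions 1..=6 back off, the 7th gives up.
fn lock_contention_backoff(consecutive: u32) -> Option<Duration> {
    if consecutive > MAX_CONSECUTIVE {
        return None;
    }
    // Linear growth: 10s, 20s, 30s, ... clamped to the ceiling.
    let secs = BASE_SECS
        .saturating_mul(u64::from(consecutive.max(1)))
        .min(CEILING_SECS);
    Some(Duration::from_secs(secs))
}
```

With these constants the ceiling is reached exactly at the sixth contention, so the schedule is 10s, 20s, 30s, 40s, 50s, 60s before the supervisor stops retrying.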