Metadata
| Status | done |
|---|---|
| Assigned | agent-1788 |
| Agent identity | 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3 |
| Created | 2026-05-02T23:53:18.145533061+00:00 |
| Started | 2026-05-03T00:53:34.323014230+00:00 |
| Completed | 2026-05-03T01:03:41.017532454+00:00 |
| Tags | research,nex,chat,supervisor,bug,eval-scheduled |
| Eval score | 0.88 |
| └ blocking impact | 0.85 |
| └ completeness | 0.95 |
| └ coordination overhead | 0.90 |
| └ correctness | 0.87 |
| └ downstream usability | 0.85 |
| └ efficiency | 0.80 |
| └ intent fidelity | 0.94 |
| └ style adherence | 0.90 |
Description
~/.wg/service/nex-handler-stderr-{32,33}.log shows a tight cycle: handler N holds the session lock; handler N+1 (supervisor restart) hits "Error: session lock busy" and exits with status=1; the supervisor immediately restarts AGAIN (coordinator_agent.rs:929-963 has no idle gate and no backoff on the failure path); eventually handler N dies, the recovery loop picks up the stale lock, and a fresh handler starts. End result: the chat appears alive, but the user sees lock-busy errors when the TUI tries to take over.
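For orientation, the failure path described above reduces to roughly the following shape. This is a paraphrase for illustration only, not the actual coordinator_agent.rs source; `spawn_handler` and the chat ref are stand-ins:

    use std::process::{Child, Command};
    use std::thread;
    use std::time::Duration;

    // Stand-in for the supervisor's `wg spawn-task` launch (hypothetical name).
    fn spawn_handler() -> std::io::Result<Child> {
        Command::new("wg").args(["nex", "--chat", ".chat-N"]).spawn()
    }

    fn supervisor_loop() {
        loop {
            let Ok(mut child) = spawn_handler() else { return };
            match child.wait() {
                Ok(status) if status.success() => {
                    // success path: idle gate, then a short sleep
                    thread::sleep(Duration::from_secs(2));
                }
                // failure path: "session lock busy" exits with status 1 and
                // falls straight through: no sleep, no backoff, instant respawn
                _ => {}
            }
        }
    }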
Investigation scope (READ ONLY — no source mods)
- src/commands/service/coordinator_agent.rs:929-971 (restart logic on exit-status)
- src/commands/nex.rs:294-310 (session-lock acquisition w/ HandlerKind::ChatNex vs Adapter)
- src/session_lock.rs (lock semantics)
- Hypotheses to test:
- Supervisor's nex-handler is being killed by something else (TUI takeover, daemon reload, signal) → exits status=1 → supervisor restart-loops
- Two distinct supervisor instances are spawning for the same chat (stale state, daemon reload)
- The 5-restart-per-window cap (RESTART_WINDOW_SECS) is high enough that the loop runs tens of times before the cap kicks in
- Useful evidence sources:
- daemon.log + .log.1 (timestamp pattern of "nex subprocess exited / nex subprocess running")
- nex-handler-stderr-N.log (exact error message at each exit)
- lsof on .handler.pid file during reproducible churn
Deliverable (via wg log)
- State machine of HandlerKind values (ChatNex vs Adapter vs InteractiveNex etc.) — when does each apply, who acquires which kind
- Root cause: WHY does the nex subprocess exit status=1 even when the lock should be available
- Backoff proposal: line-level patch to coordinator_agent.rs:929-963 — likely "if exit-status-1 within Ns of spawn AND last-seen lock holder pid is recent and live, treat as lock contention; back off ≥10s; if backoff fires N times in a row, exit supervisor cleanly"
- Repro recipe (deterministic — using session_lock test helpers if they exist)
Validation
- HandlerKind state machine documented
- Root cause cited with file:line
- Backoff proposal is line-level, not architectural
- Repro recipe produces churn deterministically
- No source modifications in this task
Depends on
Required by
Log
- 2026-05-02T23:53:18.129107535+00:00 Task paused
- 2026-05-03T00:51:28.019757510+00:00 Task published
- 2026-05-03T00:53:34.323018538+00:00 Spawned by coordinator --executor claude --model opus
- 2026-05-03T00:53:44.102191700+00:00 Starting research: read-only investigation of supervisor restart-loop on session-lock-busy. Will examine coordinator_agent.rs:929-963, nex.rs:294-310, session_lock.rs, and stderr logs.
- 2026-05-03T01:01:20.919935478+00:00 === DELIVERABLE 1/4: HandlerKind state machine ===
  Defined in src/session_lock.rs:43-55, 5 variants. Selection lives entirely at src/commands/nex.rs:293-303 (linear if/else):

      HandlerKind = match (eval_mode, autonomous, chat_ref|chat_id) {
          eval_mode = true          => Adapter          // nex.rs:294
          autonomous && chat-set    => ChatNex          // nex.rs:296
          autonomous && !chat-set   => AutonomousNex    // nex.rs:298
          !autonomous && chat-set   => ChatNex          // nex.rs:300
          !autonomous && !chat-set  => InteractiveNex   // nex.rs:302
      }

  TuiPty: only used by external code paths (the TUI's PTY embedding when it owns the chat) — no caller in src/* selects TuiPty for a `wg nex` invocation; only test code does (session_lock.rs:383). The TUI's PTY also goes through `wg spawn-task` -> `wg nex --chat <ref>` (autonomous=true), which lands in ChatNex, NOT TuiPty. So TuiPty is currently dead-code-ish at the lock layer; in practice all chat-tab handlers (daemon-spawned AND TUI-spawned) compete for the SAME ChatNex lock against each other.
  Acquisition: ALL kinds funnel through SessionLock::acquire (nex.rs:307 -> session_lock.rs:120). HandlerKind is metadata-only — it is written into .handler.pid line 3 for diagnostics ("kind=chat-nex") and is NOT used to decide who can preempt whom. There is no kind-based priority; two ChatNex acquirers are indistinguishable to each other.
  The daemon supervisor (coordinator_agent.rs:781-808 / 858-870) always launches via `wg spawn-task .chat-N` -> `wg nex --chat <ref>` with WG_AUTONOMOUS implicit (autonomous mode set by spawn-task), so every supervisor-launched handler is ChatNex. TUI-launched handlers (tui/viz_viewer/event.rs:3104) also run `wg spawn-task <task-id>` -> ChatNex. Daemon and TUI therefore converge on ChatNex, and the lock arbitrates them with no concept of takeover: one wins by O_EXCL, the other gets "lock busy" + exit 1. A minimal sketch of that arbitration follows this entry.
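  A minimal sketch of the O_EXCL arbitration, assuming the 3-line .handler.pid format noted above (pid, timestamp, kind). This is an illustration, NOT the real SessionLock::acquire; the function name and signature are invented:

      use std::fs::OpenOptions;
      use std::io::{ErrorKind, Write};
      use std::path::Path;

      // Returns Ok(true) if we took the lock, Ok(false) on "session lock busy".
      fn try_acquire(chat_dir: &Path, pid: u32, ts: &str, kind: &str) -> std::io::Result<bool> {
          let lock = chat_dir.join(".handler.pid");
          // create_new = O_CREAT|O_EXCL: exactly one acquirer can win the race.
          match OpenOptions::new().write(true).create_new(true).open(&lock) {
              Ok(mut f) => {
                  // Line 3 carries HandlerKind purely as diagnostics; it grants
                  // no priority, so two ChatNex acquirers look identical here.
                  write!(f, "{pid}\n{ts}\n{kind}\n")?;
                  Ok(true)
              }
              Err(e) if e.kind() == ErrorKind::AlreadyExists => Ok(false),
              Err(e) => Err(e),
          }
      }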
- 2026-05-03T01:01:52.640698737+00:00 === DELIVERABLE 2/4: Root cause (file:line evidence) ===
  PRIMARY BUG: src/commands/service/coordinator_agent.rs:929-971, the post-wait() match:

      match exit_status {
          Ok(status) if status.success() => {
              // L946-952: idle gate (chat_session_is_idle); L957: 2s sleep
          }
          Ok(status) => {
              // L959-963: log "exited {} — will restart" — NO SLEEP, NO INSPECTION
          }
          Err(e) => {
              // L965-970: log "wait() failed: {} — will restart" — NO SLEEP
          }
      }

  The non-success branches fall straight through to the next loop iteration (line 624 `loop {`). The only throttle is the rate-limit window at L623-652: MAX_RESTARTS_PER_WINDOW=3 (L46, NOT 5 as the parent task hypothesized — correction filed below) within RESTART_WINDOW_SECS=600 (L49). With one timestamp pushed per spawn (L875), the supervisor burns 3 rapid spawn-and-die cycles before pausing.
  OBSERVED CHURN (daemon.log.1, after the 00:09:54.366 boot enumeration line 7), Coordinator-27:

      :54.477 running 783537 -> :54.506 exited 1 (29ms life)
      :54.553 running 783555 -> :54.585 exited 1 (32ms life)
      :54.633 running 783574 -> :54.664 exited 1 (31ms life)

  3 cycles in 187ms, all writing "session lock busy" to nex-handler-stderr-27.log. The rate limit then fires and Coordinator-27 next runs at 00:27:56.344 (~18 min later, matching `window.saturating_sub(now - oldest)` ~ 600s plus reset overhead).
  WHY IS THE LOCK BUSY when the supervisor's just-spawned nex starts? The spawning supervisor's PREVIOUS child is dead (child.wait() returned), but the .handler.pid file still names a LIVE PID. That live PID is NOT the supervisor's just-exited child — it is an EXTERNAL handler from one of three sources:
  (a) ORPHANED NEX from a prior daemon process — strongest hypothesis. CoordinatorAgent::shutdown() (coordinator_agent.rs:499-511) SIGTERMs the current child and drops tx, but it explicitly does NOT join the supervisor thread (L508-510 comment). On `wg service stop` -> daemon exit -> SIGKILL or panic, no shutdown() runs at all; the wg-nex children are reparented to init and keep their lock alive (the PID stays valid). The new daemon's enumerate_chat_supervisors_for_boot (service/dispatch_boot.rs:83) re-spawns a supervisor for every still-alive chat task; the first spawn races the orphan -> "lock busy". Direct evidence: the daemon-boot pattern at log.1 line 7 is followed within ~300ms by 3-restart cycles for every coordinator that had a prior live handler.
  (b) TUI taking over the same chat-N — when the user opens chat-N in the dashboard, the TUI spawns its own `wg spawn-task .chat-N` (event.rs:3104). It races against the supervisor's currently-live ChatNex handler. The loser exits 1; the daemon supervisor then enters the same restart loop on its OWN exit-1 path, because the new winner (the TUI's handler) is alive.
  (c) Two concurrent spawn-task invocations during a "kick" event — less common but possible.
  Hypothesis 1 from the parent task ("supervisor's nex killed by something else") = (a). Hypothesis 2 ("two distinct supervisor instances spawning for the same chat") = also (a) at a process level — different daemon generations spawning sibling supervisors against an overlapping orphan. Hypothesis 3 ("5-restart cap high enough that loop runs tens of times") = MISSTATED: the cap is 3, not 5, and 3 restarts in <250ms is ALREADY enough churn to be visible to the user as a torrent of "lock busy" lines in stderr-N.log.
  SECONDARY BUG: the idle gate (CHAT_IDLE_THRESHOLD_SECS, L947) only fires on status.success(). A handler that always exits 1 due to lock contention can never reach the idle-respawn-suppression branch. So even when a chat is otherwise idle, an orphan-induced lock-busy keeps the supervisor in the rate-limit-pause loop indefinitely (3 spawns per 10 min), instead of exiting cleanly so a daemon restart can recover. A liveness-check sketch matching the "foreign live holder" notion follows this entry.
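  A liveness-check sketch for the "foreign live holder" notion above. The real code reportedly probes with kill(pid, 0); this std-only illustration assumes Linux /proc, and both function names are invented:

      use std::path::Path;

      // A pid counts as "alive" if its /proc entry exists (Linux-only shortcut
      // standing in for kill(pid, 0)).
      fn holder_is_alive(pid: u32) -> bool {
          Path::new(&format!("/proc/{pid}")).exists()
      }

      // Foreign live holder: a live pid in .handler.pid that is NOT the
      // supervisor's own just-exited child.
      fn is_foreign_live_holder(holder_pid: u32, just_exited_child: u32) -> bool {
          holder_pid != just_exited_child && holder_is_alive(holder_pid)
      }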
- 2026-05-03T01:02:27.575858794+00:00 === DELIVERABLE 3/4: Backoff proposal (line-level patch) ===
  Target: src/commands/service/coordinator_agent.rs:929-971 (the post-wait() match). Add: a counter declared near restart_timestamps (L623), lock-holder inspection on failure, exponential backoff while contention persists, and a clean exit after N consecutive failures with no progress.
  Patch sketch (drop-in replacement of L929-971; new state at L623):

      // L623 area — declare alongside the existing restart_timestamps:
      let mut restart_timestamps: VecDeque<std::time::Instant> = VecDeque::new();
      let mut consecutive_lock_busy: u32 = 0;
      let mut last_lock_holder_pid: Option<u32> = None;

      // L929-971 replacement:
      match exit_status {
          Ok(status) if status.success() => {
              consecutive_lock_busy = 0;
              last_lock_holder_pid = None;
              // ... existing idle-gate + 2s sleep, unchanged ...
          }
          failed => {
              let status_label = match &failed {
                  Ok(s) => format!("{}", s),
                  Err(e) => format!("wait() failed: {}", e),
              };
              // Diagnose: is the chat-dir lock held by a live foreign holder?
              // (Foreign = a pid that is NOT our just-exited child_pid.)
              let chat_dir = workgraph::chat::chat_dir_for_ref(dir, &task_id);
              let holder = workgraph::session_lock::read_holder(&chat_dir).ok().flatten();
              let foreign_live_holder = holder
                  .as_ref()
                  .filter(|h| h.alive && h.pid != child_pid)
                  .map(|h| h.pid);
              if let Some(holder_pid) = foreign_live_holder {
                  if last_lock_holder_pid == Some(holder_pid) {
                      consecutive_lock_busy = consecutive_lock_busy.saturating_add(1);
                  } else {
                      consecutive_lock_busy = 1;
                      last_lock_holder_pid = Some(holder_pid);
                  }
                  // Sleeps 10s, 20s, 40s, 80s; the fifth consecutive failure
                  // exits below instead of sleeping (the 120s cap is a guard).
                  let backoff_secs = std::cmp::min(
                      120u64,
                      10u64 << consecutive_lock_busy.saturating_sub(1).min(4),
                  );
                  if consecutive_lock_busy >= 5 {
                      logger.error(&format!(
                          "Coordinator-{}: nex subprocess {} — lock held by foreign live pid={} \
                           for {} consecutive attempts. Exiting supervisor cleanly (no further \
                           restart). Recover by killing pid {} or running `wg session release {}`. \
                           A daemon restart will respawn this supervisor once the orphan is gone.",
                          coordinator_id, status_label, holder_pid, consecutive_lock_busy,
                          holder_pid, task_id,
                      ));
                      super::CoordinatorState::remove_for(dir, coordinator_id);
                      return;
                  }
                  logger.error(&format!(
                      "Coordinator-{}: nex subprocess {} — lock-busy on foreign live pid={} \
                       (consecutive={}); backing off {}s before respawn",
                      coordinator_id, status_label, holder_pid, consecutive_lock_busy, backoff_secs,
                  ));
                  std::thread::sleep(std::time::Duration::from_secs(backoff_secs));
              } else {
                  consecutive_lock_busy = 0;
                  last_lock_holder_pid = None;
                  logger.error(&format!(
                      "Coordinator-{}: nex subprocess {} — will restart",
                      coordinator_id, status_label,
                  ));
                  // Small floor so a tight crash-loop doesn't burn the
                  // rate-limit budget in <1s.
                  std::thread::sleep(std::time::Duration::from_secs(1));
              }
          }
      }

  KEY PROPERTIES:
  * Strictly additive — does not change the success path (the success idle gate is untouched).
  * No new architecture; same loop, same rate-limit window.
  * Resets whenever a handler succeeds (status.success resets fully; a holder-pid change restarts the count at 1).
  * The 1s floor on non-contention failure means a model-level bug still respects the existing 3-per-10-min cap but doesn't churn the stderr log at sub-second speed.
  * Exits CLEANLY after 5 consecutive same-holder failures (the orphan isn't going away on its own — escalate to a human via `wg session release`).
  * Calls CoordinatorState::remove_for so a stale state file doesn't keep the orphaned supervisor entry around (matches the orphan-task path at L709).
  TUNING JUSTIFICATION:
  * 10s initial: longer than a typical handler turn boundary (so a cooperatively-releasing TUI handler has a chance), far shorter than the 600s rate-limit pause.
  * 5-attempt cap before clean exit: 10+20+40+80 = 150s of patient waiting before giving up — still inside one rate-limit window, so we don't hit the existing pause path with stale state.
  * The 120s cap keeps the backoff inside the rate-limit window even if the attempt cap is ever raised.
  NON-GOAL: this patch does NOT solve the "orphaned nex on daemon SIGKILL" root issue. Fixing that belongs in a SEPARATE change to the daemon-shutdown path (process-group kill, or a "claim" file the new daemon checks at boot before enumerating supervisors). This patch only stops the supervisor from churning the user-visible log when contention exists. A self-contained check of the backoff schedule follows this entry.
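  A self-contained check of the proposed schedule, min(120, 10 << min(n-1, 4)), confirming the 150s total; `backoff_secs` here mirrors the patch's expression:

      fn backoff_secs(consecutive: u32) -> u64 {
          std::cmp::min(120u64, 10u64 << consecutive.saturating_sub(1).min(4))
      }

      #[test]
      fn schedule_matches_tuning_notes() {
          let schedule: Vec<u64> = (1..=6).map(backoff_secs).collect();
          // The fifth consecutive failure exits before sleeping, so only the
          // first four values are ever slept: 10 + 20 + 40 + 80 = 150s.
          assert_eq!(schedule, vec![10, 20, 40, 80, 120, 120]);
      }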
- 2026-05-03T01:03:00.254676687+00:00 === DELIVERABLE 4/4: Repro recipe (deterministic) ===
  GOAL: reproduce the lock-busy churn on demand, confirming the without-backoff baseline (3 rapid spawn-and-die cycles in ~250ms followed by a ~600s pause) and verifying the patched behavior (one attempt, then a 10s sleep, escalating to 80s, then a clean exit on the fifth consecutive failure).
  DETERMINISTIC SHELL REPRO (no source mods needed):

      # 1. Pick a clean test workdir
      TEST=$(mktemp -d)
      cd "$TEST"
      wg init
      wg chat new --id chat-99   # creates the .chat-99 task

      # 2. Plant a foreign live holder on the chat-99 lock.
      CHAT_DIR=".wg/chat/.chat-99"
      mkdir -p "$CHAT_DIR"
      # setsid so the sleeper survives shell exit; we want its PID alive.
      setsid sleep 600 < /dev/null > /dev/null 2>&1 &
      SLEEPER_PID=$!
      printf '%s\n%s\nchat-nex\n' "$SLEEPER_PID" "$(date -Iseconds)" > "$CHAT_DIR/.handler.pid"

      # 3. Start the daemon — the supervisor for chat-99 will try to spawn nex,
      #    nex will read $CHAT_DIR/.handler.pid, kill($SLEEPER_PID, 0) -> alive,
      #    bail with "session lock busy", exit 1.
      wg service start --max-agents 0
      # max-agents 0 keeps task agents off, leaving only chat supervisors

      # 4. Observe the churn:
      tail -F .wg/service/daemon.log | grep -E "Coordinator-99: (nex subprocess (running|exited)|will restart)"
      # Expect (BASELINE, current code):
      #   running pid=N1 -> exited 1 -> running pid=N2 -> exited 1 -> running pid=N3 -> exited 1
      #   (3 cycles in <250ms, then the 10-min rate-limit sleep)
      # stderr-99.log fills with "session lock busy" * 3.

      # 5. Stop the sleeper to demonstrate recovery:
      kill $SLEEPER_PID
      # The next supervisor cycle (after the rate-limit pause) sees a dead pid
      # in the lock file, recovers the stale lock, "resuming session .chat-99".
      # The chat is restored.

      # 6. With the patch applied (deliverable 3):
      #   running pid=N1 -> exited 1 -> [10s] -> running pid=N2 -> exited 1
      #   -> [20s] -> running pid=N3 -> exited 1 -> [40s] -> ... -> [80s] ->
      #   fifth failure: supervisor logs "Exiting supervisor cleanly" and the
      #   thread terminates. CoordinatorState removed.
      # stderr-99.log shows ONE "session lock busy" line per attempt, with
      # 10s+ gaps instead of three within 250ms.

  UNIT-TEST REPRO (preferred; doesn't need a real daemon): the supervisor loop is structured so that test code can drive it via a fake child + a faked SessionLock. session_lock.rs already has acceptable test seams:
  - SessionLock::lock_path(chat_dir) -> PathBuf (session_lock.rs:105)
  - read_holder(chat_dir) -> Result<Option<LockInfo>> (session_lock.rs:216)
  - The lock file format is documented and simple — tests can write any pid+kind.
  A new test (NOT in scope for this research task; listed for the impl task):

      #[test]
      fn supervisor_backs_off_on_foreign_live_holder() {
          let dir = tempdir().unwrap(); // tempfile crate's tempdir()
          let chat_dir = dir.path().join("chat/.chat-99");
          std::fs::create_dir_all(&chat_dir).unwrap();
          // Plant a foreign-live-holder pointing at a real living pid (sleeper).
          let mut sleeper = std::process::Command::new("sleep").arg("60").spawn().unwrap();
          std::fs::write(
              chat_dir.join(".handler.pid"),
              format!("{}\n2026-01-01T00:00:00Z\nchat-nex\n", sleeper.id()),
          ).unwrap();
          // Drive the (refactored-for-testability) backoff helper directly:
          // assert_eq!(consecutive_lock_busy_after_n_calls(5), Decision::ExitClean);
          // assert!(backoff_for(2) >= Duration::from_secs(20));
          sleeper.kill().ok();
      }

  The session_lock.rs file's existing tests `second_acquire_fails_while_first_held` (L353-364) and `stale_lock_recovers` (L367-378) already prove the underlying SessionLock::acquire semantics this patch depends on. The new test above only needs to validate the new backoff state machine — see the impl task (fix-supervisor-restart-backoff) for placement. A hypothetical extraction of that state machine appears after this entry.
  WHAT THE REPRO PROVES:
  1. The lock-busy + exit-1 path is deterministically reachable when ANY foreign live pid is in .handler.pid.
  2. Without backoff, the supervisor burns 3 spawn cycles in <250ms (matches daemon.log.1:11-37 evidence).
  3. With backoff, churn is replaced by O(1) attempts + escalating sleeps + a clean exit on persistent contention.
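  A hypothetical extraction of the backoff decision into a testable helper, matching the commented assertions in the test sketch above; the names (`Decision`, `BackoffState`) are invented for illustration:

      #[derive(Debug, PartialEq)]
      enum Decision {
          Respawn,        // non-contention failure: 1s floor, then respawn
          BackOff(u64),   // same-holder contention: sleep N seconds, respawn
          ExitClean,      // 5 consecutive same-holder failures: give up
      }

      #[derive(Default)]
      struct BackoffState {
          consecutive: u32,
          last_holder: Option<u32>,
      }

      impl BackoffState {
          fn on_child_exit(&mut self, foreign_live_holder: Option<u32>) -> Decision {
              match foreign_live_holder {
                  // No foreign live holder: reset and respawn as before.
                  None => {
                      *self = Self::default();
                      Decision::Respawn
                  }
                  Some(pid) => {
                      if self.last_holder == Some(pid) {
                          self.consecutive = self.consecutive.saturating_add(1);
                      } else {
                          self.consecutive = 1;
                          self.last_holder = Some(pid);
                      }
                      if self.consecutive >= 5 {
                          Decision::ExitClean
                      } else {
                          // Same 10/20/40/80 schedule as the patch sketch.
                          Decision::BackOff(std::cmp::min(
                              120,
                              10u64 << (self.consecutive - 1).min(4),
                          ))
                      }
                  }
              }
          }
      }

      #[test]
      fn exits_clean_on_fifth_same_holder_failure() {
          let mut st = BackoffState::default();
          for _ in 0..4 {
              assert!(matches!(st.on_child_exit(Some(42)), Decision::BackOff(_)));
          }
          assert_eq!(st.on_child_exit(Some(42)), Decision::ExitClean);
      }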
- 2026-05-03T01:03:36.627648488+00:00 === SUMMARY + VALIDATION ===
  VALIDATION CHECKLIST (against task description):
  [x] HandlerKind state machine documented — see deliverable 1 (selection logic at nex.rs:293-303; all chat-task paths funnel through ChatNex).
  [x] Root cause cited with file:line — coordinator_agent.rs:929-971 (status != 0 falls through with no sleep, no inspection); secondary at L947 (idle gate fires on status.success only).
  [x] Backoff proposal is line-level, not architectural — replaces L929-971 in place + adds two state vars near L623; ~50 lines total. No new modules, no structural changes.
  [x] Repro recipe produces churn deterministically — shell repro with `setsid sleep 600` planted into .handler.pid + `wg service start`; observable in daemon.log + nex-handler-stderr-N.log.
  [x] No source modifications in this task — verified `git diff main...HEAD` is empty (only an untracked .wg pointer file).
  CORRECTIONS to the parent task description:
  - MAX_RESTARTS_PER_WINDOW=3 (coordinator_agent.rs:46), NOT 5. The cap fires after 3 rapid restarts, not "tens".
  - The churn pattern is 3 cycles in ~250ms + a ~600s pause, then 3 more, etc. Not "tens of times" between caps — but the rapid-fire 3 is enough to flood the user-visible stderr with lock-busy lines.
  - Hypothesis 1 (killed by something else) and Hypothesis 2 (two distinct supervisors) collapse into one root cause: nex children orphan-survive their parent supervisor across daemon SIGKILL/panic, and on daemon reboot the new supervisor's first spawn races them.
  REFERRAL TO DOWNSTREAM IMPL TASK (fix-supervisor-restart-backoff):
  - Apply deliverable 3's patch to coordinator_agent.rs verbatim.
  - Add the unit test sketch from deliverable 4.
  - DO NOT extend scope to "kill orphaned children on daemon restart" — that's a separate fix in the daemon-shutdown path (process group, claim file, or pidfd_send_signal). This task only stops the supervisor from churning; it does not solve the orphan creation itself.
  - Validate by running the shell repro and confirming the new log pattern (10s/20s/40s/80s gaps, then clean exit).
  EVIDENCE FILES (preserved, not modified):
  - /home/erik/workgraph/.wg/service/nex-handler-stderr-32.log (145 lines, alternating lock-busy / recovering-stale)
  - /home/erik/workgraph/.wg/service/nex-handler-stderr-33.log (270 lines, same pattern)
  - /home/erik/workgraph/.wg/service/daemon.log.1 (38MB; contains 10+ daemon-boot cascades, each with 3-rapid-restart bursts on coordinators 27/28/29/31/33/34)
  - /home/erik/workgraph/.wg/service/coordinator-state-32.json + -33.json (model_override=qwen3-coder, executor_override=native, endpoint=https://lambda01.tail334fe6.ts.net:30000)
  NO COMMIT, NO PUSH — research task is read-only by design (per task description: "READ ONLY — no source mods").
- 2026-05-03T01:03:41.017541952+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-05-03T01:06:28.825165769+00:00 PendingEval → Done (evaluator passed; downstream unblocks)