research-supervisor-lock-churn

Research: nex supervisor restart-loop on session-lock-busy exits — characterize churn + propose backoff

Metadata

Status: done
Assigned: agent-1788
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-05-02T23:53:18.145533061+00:00
Started: 2026-05-03T00:53:34.323014230+00:00
Completed: 2026-05-03T01:03:41.017532454+00:00
Tags: research, nex, chat, supervisor, bug, eval-scheduled
Eval score: 0.88
└ blocking impact: 0.85
└ completeness: 0.95
└ coordination overhead: 0.90
└ correctness: 0.87
└ downstream usability: 0.85
└ efficiency: 0.80
└ intent fidelity: 0.94
└ style adherence: 0.90

Description

~/.wg/service/nex-handler-stderr-{32,33}.log shows a tight cycle: handler N takes the session lock; handler N+1 (spawned by the supervisor restart) hits "Error: session lock busy" and exits with status=1; the supervisor immediately restarts it again (coordinator_agent.rs:929-963 has no idle gate and no backoff on the failure path). Eventually handler N dies, the recovery loop picks up the stale lock, and a fresh handler starts. Net effect: the chat appears alive, but the user sees lock-busy errors whenever the TUI tries to take over.
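The cycle above can be sketched as a minimal simulation. Everything here is hypothetical (the names `try_acquire_lock`, `held_for`, and the timing are invented for illustration, not taken from the codebase); it only shows why a busy lock plus an immediate-restart supervisor produces a burst of failed spawns:

```rust
// Hypothetical simulation of the no-backoff restart path described above.
// The lock is "held" by another handler for `held_for`; every failed
// acquisition stands in for one handler exiting status=1.

use std::time::{Duration, Instant};

/// Pretend the session lock is held by another handler until `held_for` elapses.
fn try_acquire_lock(start: Instant, held_for: Duration) -> Result<(), &'static str> {
    if start.elapsed() < held_for {
        Err("session lock busy") // handler exits status=1
    } else {
        Ok(())
    }
}

fn main() {
    let start = Instant::now();
    let held_for = Duration::from_millis(50);
    let mut restarts = 0u32;

    // Supervisor loop with no idle gate: every failed spawn restarts immediately.
    loop {
        match try_acquire_lock(start, held_for) {
            Ok(()) => break,
            Err(_) => restarts += 1, // immediate restart, no sleep
        }
    }
    println!("failed restarts before the lock freed: {restarts}");
}
```

Even with a lock held for only tens of milliseconds, the immediate-restart loop burns through a large number of spawn attempts, which matches the tight timestamp pattern in the stderr logs.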

Investigation scope (READ ONLY — no source mods)

  • src/commands/service/coordinator_agent.rs:929-971 (restart logic on exit-status)
  • src/commands/nex.rs:294-310 (session-lock acquisition w/ HandlerKind::ChatNex vs Adapter)
  • src/session_lock.rs (lock semantics)
  • Hypotheses to test:
    1. Supervisor's nex-handler is being killed by something else (TUI takeover, daemon reload, signal) → exits status=1 → supervisor restart-loops
    2. Two distinct supervisor instances are spawning for the same chat (stale state, daemon reload)
    3. The 5-restart-per-window cap (RESTART_WINDOW_SECS) is high enough that the loop runs tens of times before the cap kicks in
  • Useful evidence sources:
    • daemon.log + .log.1 (timestamp pattern of "nex subprocess exited / nex subprocess running")
    • nex-handler-stderr-N.log (exact error message at each exit)
    • lsof on .handler.pid file during reproducible churn
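Hypothesis 3 can be made concrete with a sketch of a windowed restart cap. The cap of 5 and the name RESTART_WINDOW_SECS come from the task text; the window length, struct, and method names are assumptions for illustration only:

```rust
// Hypothetical sketch of a restarts-per-window cap (hypothesis 3).
// If each lock-busy exit takes well under a second, all allowed restarts
// in a window are consumed almost instantly, so the loop visibly churns
// before the cap kicks in.

use std::time::{Duration, Instant};

const RESTART_WINDOW_SECS: u64 = 60; // assumed window length
const MAX_RESTARTS_PER_WINDOW: u32 = 5;

struct RestartTracker {
    window_start: Instant,
    restarts_in_window: u32,
}

impl RestartTracker {
    fn new() -> Self {
        Self { window_start: Instant::now(), restarts_in_window: 0 }
    }

    /// Returns true if another restart is allowed right now.
    fn allow_restart(&mut self) -> bool {
        // Reset the counter once the window has elapsed.
        if self.window_start.elapsed() >= Duration::from_secs(RESTART_WINDOW_SECS) {
            self.window_start = Instant::now();
            self.restarts_in_window = 0;
        }
        if self.restarts_in_window < MAX_RESTARTS_PER_WINDOW {
            self.restarts_in_window += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut tracker = RestartTracker::new();
    // Ten back-to-back lock-busy exits inside one window: only 5 restart.
    let allowed = (0..10).filter(|_| tracker.allow_restart()).count();
    println!("restarts allowed in one window: {allowed}"); // 5
}
```

Note the cap limits how many restarts happen per window, but does nothing to space them out, which is exactly why a backoff is needed on the lock-busy path.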

Deliverable (via wg log)

  • State machine of HandlerKind values (ChatNex vs Adapter vs InteractiveNex etc.) — when does each apply, who acquires which kind
  • Root cause: WHY does the nex subprocess exit status=1 even when the lock should be available
  • Backoff proposal: line-level patch to coordinator_agent.rs:929-963 — likely "if exit-status-1 within Ns of spawn AND last-seen lock holder pid is recent and live, treat as lock contention; back off ≥10s; if backoff fires N times in a row, exit supervisor cleanly"
  • Repro recipe (deterministic — using session_lock test helpers if they exist)
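The backoff proposal in the third bullet can be sketched as a decision function. The thresholds (2s fast-exit window, 10s backoff, 3 consecutive backoffs) and all names here are assumptions, not the actual coordinator_agent.rs code; the deliverable should replace them with the line-level patch:

```rust
// Hypothetical sketch of the proposed backoff decision: treat a fast
// status=1 exit with a live lock holder as lock contention, back off,
// and exit the supervisor cleanly after repeated contention.

use std::time::{Duration, Instant};

const FAST_EXIT_SECS: u64 = 2; // assumed: exit this soon after spawn is suspicious
const BACKOFF: Duration = Duration::from_secs(10);
const MAX_CONSECUTIVE_BACKOFFS: u32 = 3; // assumed N

#[derive(Debug)]
enum RestartDecision {
    RestartNow,
    BackOff(Duration),
    ExitSupervisor,
}

fn decide(
    exit_status: i32,
    spawned_at: Instant,
    lock_holder_alive: bool,
    consecutive_backoffs: u32,
) -> RestartDecision {
    let fast_exit = spawned_at.elapsed() < Duration::from_secs(FAST_EXIT_SECS);
    if exit_status == 1 && fast_exit && lock_holder_alive {
        // Looks like lock contention, not a crash.
        if consecutive_backoffs >= MAX_CONSECUTIVE_BACKOFFS {
            // Another handler durably owns this chat's lock: stop fighting it.
            RestartDecision::ExitSupervisor
        } else {
            RestartDecision::BackOff(BACKOFF)
        }
    } else {
        RestartDecision::RestartNow
    }
}

fn main() {
    let spawned = Instant::now();
    match decide(1, spawned, true, 0) {
        RestartDecision::BackOff(d) => println!("back off for {d:?}"),
        RestartDecision::RestartNow => println!("restart immediately"),
        RestartDecision::ExitSupervisor => println!("exit cleanly"),
    }
}
```

A slow exit, a non-1 status, or a dead lock holder all fall through to the existing immediate-restart path, so normal crash recovery is unchanged.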

Validation

  • HandlerKind state machine documented
  • Root cause cited with file:line
  • Backoff proposal is line-level, not architectural
  • Repro recipe produces churn deterministically
  • No source modifications in this task

Depends on

Required by

Log