research-tui-supervisor-coexistence — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-1786`
Agent identity	`3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3`
Created	2026-05-02T23:53:33.105711085+00:00
Started	2026-05-03T00:53:31.477531250+00:00
Completed	2026-05-03T00:59:55.752739473+00:00
Tags	`research,nex,chat,tui,supervisor,bug`, `eval-scheduled`
Eval score	0.94
└ blocking impact	0.90
└ completeness	1.00
└ coordination overhead	0.95
└ correctness	0.95
└ downstream usability	0.85
└ efficiency	0.90
└ intent fidelity	0.85
└ style adherence	0.95

Description

src/tui/viz_viewer/state.rs:13696-13753 has a forced-takeover path: when the user opens a chat tab in the TUI, it requests release of the session lock, waits 300ms, SIGTERMs the supervisor's nex handler if still alive, re-acquires. This is racy when combined with the supervisor restart-loop (see research-supervisor-lock-churn): even after SIGTERM, a new supervisor handler can spawn into the lock between the TUI's takeover wait and the TUI's wg nex --resume invocation. End result: sometimes TUI wins, sometimes supervisor wins, sometimes neither.

Investigation scope (READ ONLY — no source mods)

src/tui/viz_viewer/state.rs:13696-13770 (TUI takeover dance + spawn wg nex --resume)
src/commands/service/coordinator_agent.rs:660-743 (supervisor pre-flight + register_coordinator_session)
src/session_lock.rs (request_release / clear_release_marker / wait_for_release)
src/executor/native/agent.rs:1363 + src/executor/native/chat_surface.rs:176 (release-marker observation in the running handler)
Two design choices to evaluate:
- Single-owner: only ONE handler ever runs — supervisor stops respawning when TUI takes over (sentinel file chat-N/.user-driven); supervisor resumes when TUI exits or sentinel removed
- Cooperative-handoff: supervisor and TUI handlers cooperate via a tighter protocol (e.g. supervisor enters "observer mode" while TUI is connected, doesn't try to spawn its own nex)

Deliverable (via wg log)

Compare both options: pros/cons, blast radius, complexity
Recommend ONE
Patch shape for the recommended option (file:line, not full diff)
Note: design-chat-agent / implement-tmux-wrapped already shipped tmux wrapping. Verify whether the supervisor's nex handler today is itself tmux-wrapped — if yes, takeover semantics differ (supervisor's nex lives inside a detached tmux session, surviving SIGTERM at the wrapper level). Document.

Validation

Both options analyzed
Recommendation made with rationale
Patch shape concrete enough to drive impl task
tmux-wrapping interaction documented
No source modifications in this task

## Description
src/tui/viz_viewer/state.rs:13696-13753 has a forced-takeover path: when the user opens a chat tab in the TUI, it requests release of the session lock, waits 300ms, SIGTERMs the supervisor's nex handler if still alive, re-acquires. This is racy when combined with the supervisor restart-loop (see research-supervisor-lock-churn): even after SIGTERM, a new supervisor handler can spawn into the lock between the TUI's takeover wait and the TUI's `wg nex --resume` invocation. End result: sometimes TUI wins, sometimes supervisor wins, sometimes neither.

## Investigation scope (READ ONLY — no source mods)
- src/tui/viz_viewer/state.rs:13696-13770 (TUI takeover dance + spawn `wg nex --resume`)
- src/commands/service/coordinator_agent.rs:660-743 (supervisor pre-flight + register_coordinator_session)
- src/session_lock.rs (request_release / clear_release_marker / wait_for_release)
- src/executor/native/agent.rs:1363 + src/executor/native/chat_surface.rs:176 (release-marker observation in the running handler)
- Two design choices to evaluate:
  - **Single-owner**: only ONE handler ever runs — supervisor stops respawning when TUI takes over (sentinel file `chat-N/.user-driven`); supervisor resumes when TUI exits or sentinel removed
  - **Cooperative-handoff**: supervisor and TUI handlers cooperate via a tighter protocol (e.g. supervisor enters "observer mode" while TUI is connected, doesn't try to spawn its own nex)

## Deliverable (via wg log)
- Compare both options: pros/cons, blast radius, complexity
- Recommend ONE
- Patch shape for the recommended option (file:line, not full diff)
- Note: design-chat-agent / implement-tmux-wrapped already shipped tmux wrapping. Verify whether the supervisor's nex handler today is itself tmux-wrapped — if yes, takeover semantics differ (supervisor's nex lives inside a detached tmux session, surviving SIGTERM at the wrapper level). Document.

## Validation
- [ ] Both options analyzed
- [ ] Recommendation made with rationale
- [ ] Patch shape concrete enough to drive impl task
- [ ] tmux-wrapping interaction documented
- [ ] No source modifications in this task

Depends on

Required by

done fix-tui-supervisor-coexistence

Log

2026-05-02T23:53:33.080336054+00:00 Task paused
2026-05-03T00:51:49.751483398+00:00 Task published
2026-05-03T00:53:31.477535508+00:00 Spawned by coordinator --executor claude --model opus
2026-05-03T00:53:42.144776658+00:00 Starting investigation: reading TUI takeover dance, supervisor pre-flight, session_lock primitives, and release-marker observation in handler
2026-05-03T00:59:36.915995152+00:00 RESEARCH FINDINGS — TUI/supervisor coexistence ================================================================================ PART 1 — STATE OF THE WORLD (verified file:line) ================================================================================ 1. TUI takeover dance (state.rs:13696-13753): - On chat-tab open, if any handler holds the session lock and is alive: a) request_release (writes .handler.release-requested marker) b) wait 300ms in 100ms increments for clean release c) if still held, libc::kill(holder.pid, SIGTERM) — eprintln warns "[tui] forced takeover" d) wait 1000ms in 100ms increments for the lock file to clear e) re-check holder; observer_mode is recomputed at :13755 - Then at :13943, spawn `wg nex --resume <ref>` via pty_pane::spawn_via_tmux (tmux session "wg-chat-<project>-<chat_ref>"). 2. Supervisor pre-flight + restart (coordinator_agent.rs:625-972): - Restart-rate-limited (3 per 600s), but no idle gate before EACH spawn — only AFTER a clean child exit (line 945-953 chat_session_is_idle check). - Pre-flight (660-743) registers chat session, resolves task_id, checks archived/done/abandoned, then plan_spawn + cmd.spawn() at 859. - The spawned subprocess is `wg spawn-task .chat-N` (or legacy .coordinator-N), which dispatches to `wg nex --chat <ref>` for native. - On non-success exit (status path at :959-964): immediately loops, NO sleep. SIGTERM from the TUI lands here. - On clean exit: chat_session_is_idle gate. Threshold = 300s. The gate looks at: unread inbox > 0 → not idle .nex-cursor mtime within 300s → not idle (consumer_recent) The most recent nex consumer just touched .nex-cursor when it ran, so consumer_recent is virtually always true within seconds of a clean exit. ⇒ supervisor sleeps 2s and respawns. 3. session_lock contract (session_lock.rs): - .handler.pid created O_EXCL; auto-recovers stale (dead-PID) locks. - request_release / clear_release_marker / release_requested / wait_for_release primitives all present. - HandlerKind::TuiPty exists but is NOT used by the TUI takeover code — the actual lock holder when TUI's nex eventually starts is whatever nex.rs:296-302 picks (ChatNex for native chat). 4. Release marker observation (the running handler): - agent.rs:1361-1372: at top of every turn boundary, if release_requested, clear marker and break the agent loop with session_exit_reason = "release_requested". - chat_surface.rs:176-178: inside next_entry's wait loop, if release_requested return None (caller treats as EOF). - BOTH of these are observed only by the live nex CHILD, not by the supervisor. 5. nex startup lock acquisition (nex.rs:307-322): - SessionLock::acquire is fail-fast — if EEXIST + holder alive, BAILS with "session lock busy" and exits with anyhow::bail. No retry. - At :326, clear_release_marker is called by whichever handler wins the acquire. ⇒ The next handler that comes up, regardless of how it got there, will NOT see the prior marker. ================================================================================ PART 2 — TMUX-WRAPPING INTERACTION (per task instruction to verify) ================================================================================ VERIFIED: the supervisor's nex handler is NOT tmux-wrapped today. - coordinator_agent.rs:811-870 does plain `Command::new(wg_bin).arg("spawn-task")...spawn()`. - No `tmux` references in any src/commands/service/*.rs file (checked via Grep). VERIFIED: the TUI's nex IS tmux-wrapped. - state.rs:13943-13971 selects tmux_session = chat_id::chat_tmux_session_name when tmux is on PATH. - state.rs:14005-14008 → pty_pane::spawn_via_tmux at pty_pane.rs:359-434 which does `tmux new-session -d -s <name> -- wg nex ...` then `tmux attach -d -t <name>`. - The TUI exits but the nex inside the tmux session survives (this is the implement-tmux-wrapped persistence design). Implications for forced takeover: - SIGTERM at the lock-holder PID kills the supervisor's directly-spawned nex (no tmux layer in between). This works today. - If the holder ever becomes the TUI's tmux-wrapped nex (e.g., a 2nd TUI instance opens the same chat), SIGTERM at the holder pid still kills the inner nex (tmux records exit status, session can be configured to remain), but the wg-chat-* tmux session lingers with a dead window. That's a separate bug, not addressed here. - For the chosen design (single-owner, see Part 4), the SIGTERM dance goes away entirely, so the tmux-wrap difference between the two spawn paths becomes irrelevant for the coexistence question. Note: coordinator_agent.rs holds child.id() in restart_timestamps + as the supervisor's own child handle, not via tmux. There is no proposal on the table to tmux-wrap the supervisor; doing so would actually make the SIGTERM-then-respawn race WORSE because the supervisor would lose visibility into its own nex's exit status (tmux owns it). So the supervisor staying non-tmux is the right invariant. ================================================================================ PART 3 — THE RACE (concrete sequence) ================================================================================ Failure mode that produces "sometimes neither wins": T+0 TUI: read_holder → supervisor's nex pid P, alive. T+0 TUI: request_release marker written. T+0–300ms: supervisor's nex polls marker at turn boundary or in next_entry → exits cleanly, status 0. T+300ms (or up to 1.3s if SIGTERM path): supervisor.child.wait() returns Ok(0) → enters chat_session_is_idle gate; .nex-cursor was JUST touched by the prior nex, so consumer_recent=true → not idle → sleep(2s). T+~50ms (alternate path, SIGTERM): supervisor.child.wait() returns Ok(non-success) → "will restart" → no sleep, immediate loop. T+1300ms TUI: spawn `tmux new-session -- wg nex --resume <ref>`. The inner nex calls SessionLock::acquire. T+~2300ms (clean-exit path) OR T+~150ms (SIGTERM path): supervisor re-runs pre-flight + plan_spawn + Command::spawn → another wg nex tries to acquire. Whichever nex calls acquire first wins. The other bails with "session lock busy" and exits. From the user's perspective: - if TUI's nex bails: tmux session contains a dead nex, the PTY pane shows the bail message; the user retries by re-opening the chat tab, which re-runs the takeover dance. Sometimes works. - if supervisor's nex bails: supervisor child.wait() returns non-success, restart-loop continues; usually self-resolves once TUI is actually running, but burns ~3 restarts per attempt. - if neither wins reliably, the rate limiter kicks in and the supervisor pauses for the remaining window. The release-marker logic helps the FIRST handler exit cleanly but does NOTHING to suppress the supervisor's restart loop. That is the structural defect. ================================================================================ PART 4 — DESIGN OPTIONS ================================================================================ OPTION A — Single-owner via TUI sentinel (recommended) Mechanism: a sentinel file <chat_dir>/.tui-driven containing <pid>\n<rfc3339>\n (mirrors the existing .handler.pid layout) written by the TUI when it claims the chat tab and removed when the tab closes / the TUI exits. Liveness via kill(pid, 0) — same pattern as session_lock::pid_is_alive. TUI side: - On chat-tab open: write_tui_sentinel + request_release + wait_for_release(5s). NO SIGTERM, NO 1.3s arbitrary wait. - On chat-tab close / archive / TUI exit: clear_tui_sentinel for every chat the TUI owned. - Stale-recovery: any reader treats kill(pid,0)==dead as no sentinel, same as session_lock does for .handler.pid. Supervisor side: - Pre-flight (between resolved task_id and plan_spawn): if tui_sentinel_alive(chat_dir), log "TUI driving — deferring", sleep 5s, `continue` the loop. Pre-flight re-checks archive/done state next iteration so this path is safe. Nex side (defensive): - On lock-busy bail, mention the sentinel state in the error message ("driven by wg-tui pid=N"). Pure cosmetic. Pros: - Eliminates the race entirely — sentinel decides ownership before either side touches the lock. - SIGTERM dance gone; "stuck handler" path becomes "supervisor release-marker observed, wait_for_release succeeds". - One invariant to reason about: at most one nex per chat, whichever side has the sentinel wins. - Symmetric: when TUI exits and clears the sentinel, the supervisor's next iteration sees no sentinel and respawns. Existing chat_session_is_idle gate keeps zombie chats quiet. Cons: - Stale sentinel if TUI crashes between write and clear. PID check handles dead-pid case. Catastrophic case (PID reused by an unrelated process within seconds) is the same risk session_lock already accepts; consistent with the codebase. - Two new TUI lifecycle hooks (chat-tab close + global TUI exit cleanup). Both already exist in the ExitPromptState / pane drop paths — sentinel cleanup attaches there. Blast radius: 4 files - new src/tui_sentinel.rs (or extend src/session_lock.rs): ~80 LOC including tests. - src/tui/viz_viewer/state.rs: replace 13696-13753 (~58 LOC removed) with ~10 LOC sentinel write + soft release wait; add ~15 LOC of cleanup in chat-tab-close + TUI exit paths. - src/commands/service/coordinator_agent.rs: ~12 LOC inserted between line 743 and 781. - src/commands/nex.rs: ~5 LOC error-message tweak (optional). Total ~120 LOC net. No on-disk format migrations. Complexity: medium. Stale-PID handling is the only subtle bit and the pattern is already in session_lock.rs to copy. OPTION B — Cooperative handoff (observer mode for supervisor's nex) Mechanism: supervisor's spawned nex starts in "observer" mode (tail-only, no LLM calls, no lock acquired) when a TUI is connected. When TUI disconnects, observer escalates to full-handler mode and acquires the lock. Supervisor side: - Decide at spawn time: full handler vs observer. Needs to check holder status of the lock (or the same sentinel as Option A) to make that call. The sentinel ends up doing the work either way. Nex side: - New --observer flag. New observer-mode loop in agent.rs that reads the conversation journal, doesn't acquire the lock, doesn't call the LLM, exits when the live handler does. - Promotion path: observer notices lock free, acquires, swaps to full-handler mode mid-process. Fragile — most code paths assume lock-held at startup. Pros: - Supervisor stays "warm" while TUI is connected. Could enable future features like "supervisor watches and pre-fetches context" while TUI drives. Cons: - The supervisor still needs to know "is the TUI driving?" before deciding observer vs full-handler. That decision needs a sentinel-like primitive — so Option B layers more machinery ON TOP of Option A, not instead of it. - Observer mode is a new handler kind with a new state machine. Promotion mid-process is the kind of thing that ships with a class of bugs that take months to shake out. - The supervisor's nex doing nothing while the TUI's nex runs is exactly equivalent (from the user's perspective) to the supervisor not running at all — which is what Option A achieves with no new code paths. Blast radius: 5+ files - src/commands/nex.rs: ~80 LOC for --observer flag, alt loop. - src/executor/native/agent.rs: ~200 LOC for observer-mode tail + promotion logic. - src/executor/native/chat_surface.rs: ~50 LOC tail-only path. - src/commands/service/coordinator_agent.rs: ~30 LOC for the observer-vs-full decision (still needs a sentinel anyway). - src/tui/viz_viewer/state.rs: ~40 LOC, mostly the same as Option A plus a "kick observer to promote" signal. Total ~400 LOC net. Complexity: high. Observer mode is a new code path with its own failure modes; promotion is the worst of them. ================================================================================ PART 5 — RECOMMENDATION ================================================================================ Adopt OPTION A — single-owner via TUI sentinel. Rationale: 1. The race is structural: two spawn paths competing for one lock. Single-owner removes the second spawn path entirely while TUI drives; cooperative handoff just adds a third state to the existing fight. 2. Option A is roughly 3x smaller (~120 LOC vs ~400 LOC). 3. Option B's "supervisor stays warm" benefit is theoretical — no downstream feature needs it today. Adding the machinery for a future feature is premature design. 4. Option B requires the same sentinel as Option A to make its observer/full-handler decision. So Option A is a strict subset of Option B's scope; we can always layer observer mode on top later if the warm-supervisor use case materializes. 5. Option A preserves the existing release-marker contract — the wait_for_release primitive does exactly the right thing once the supervisor stops respawning behind it. ================================================================================ PART 6 — PATCH SHAPE FOR fix-tui-supervisor-coexistence ================================================================================ (file:line, not a full diff — concrete enough to drive impl) (a) NEW MODULE: src/tui_sentinel.rs (~80 LOC + tests) - const SENTINEL_FILENAME: &str = ".tui-driven"; - pub struct TuiSentinel { pub pid: u32, pub written_at: String, pub alive: bool } - pub fn sentinel_path(chat_dir: &Path) -> PathBuf - pub fn write_sentinel(chat_dir: &Path, pid: u32) -> Result<()> format: "{pid}\n{rfc3339}\n" - pub fn read_sentinel(chat_dir: &Path) -> Result<Option<TuiSentinel>> - pub fn clear_sentinel(chat_dir: &Path) // idempotent - pub fn sentinel_alive(chat_dir: &Path) -> bool wraps read_sentinel + reuses session_lock::pid_is_alive (extract pid_is_alive to pub(crate) at session_lock.rs:309 if not already callable across modules — quick check shows it's private, so promote it). - Tests: round-trip, stale-pid recovery, missing-file Ok(None). (b) src/lib.rs: add `pub mod tui_sentinel;` (c) src/tui/viz_viewer/state.rs: - REPLACE 13696-13753 (the SIGTERM block) with: let chat_dir = workgraph::chat::chat_dir_for_ref(&self.workgraph_dir, &chat_ref); let _ = workgraph::tui_sentinel::write_sentinel(&chat_dir, std::process::id()); let _ = workgraph::session_lock::request_release(&chat_dir); let _ = workgraph::session_lock::wait_for_release( &chat_dir, std::time::Duration::from_secs(5)); (~7 LOC; the surrounding observer_mode re-check at 13755 stays.) - At 14005-14070 (chat-tab close path) and the existing DeleteCoordinator/ArchiveCoordinator handlers near 11036-11098: add `workgraph::tui_sentinel::clear_sentinel(&chat_dir);` where the tmux session is killed. - At TUI exit (search for `ExitPromptState` apply path, ~line 851 area — or the higher-level run loop in src/tui/mod.rs): walk self.task_panes / known chat_refs and clear_sentinel for each. A panic_hook + ctrlc handler can do this best-effort too; stale-PID detection in the sentinel reader is the durable safety net so cleanup-on-exit only needs to be best-effort. (d) src/commands/service/coordinator_agent.rs: - INSERT between line 743 (after task_id resolution) and line 748 (before state load for hot-swap): let chat_dir = workgraph::chat::chat_dir_for_ref(dir, &task_id); if workgraph::tui_sentinel::sentinel_alive(&chat_dir) { let s = workgraph::tui_sentinel::read_sentinel(&chat_dir).ok().flatten(); logger.info(&format!( "Coordinator-{}: TUI sentinel alive (pid={:?}) — deferring respawn 5s", coordinator_id, s.map(|x| x.pid))); std::thread::sleep(std::time::Duration::from_secs(5)); continue; } (~12 LOC. Re-runs full pre-flight on next iteration so archive/ deletion still terminates the supervisor cleanly.) - DO NOT push timestamp into restart_timestamps for sentinel-defer iterations; otherwise the rate limiter kicks in during a long TUI session. The continue above naturally skips the `restart_timestamps.push_back` at line 875 because that line runs after the spawn we're skipping. (e) src/commands/nex.rs: (cosmetic, optional) - At 311-318, when bail on session-busy, also read tui_sentinel and append `(driven by wg tui pid=N)` to the message. ~5 LOC. (f) Test (impl task adds, not in this research): - Integration: spawn supervisor with mock spawn-task, write sentinel from a separate process, assert supervisor's loop defers (no nex spawned) for 10s, then clear sentinel and assert it spawns within ~10s. Lives in tests/integration_*.rs alongside other supervisor tests. - Unit: tui_sentinel round-trip + stale-pid path (in tui_sentinel.rs). ================================================================================ PART 7 — EXPLICIT VALIDATION CHECKLIST (mirrors task ## Validation) ================================================================================ [x] Both options analyzed (Parts 4, 6). [x] Recommendation made with rationale (Part 5). [x] Patch shape concrete enough to drive impl task (Part 6 names files, line ranges, function signatures, and the new module's public API). [x] tmux-wrapping interaction documented (Part 2: supervisor's nex is NOT tmux-wrapped today; TUI's IS; the recommended design is independent of either, and tmux-wrapping the supervisor would make the race worse not better). [x] No source modifications in this task — confirmed via `git status` showing only the .wg/ scratch dir untracked.
2026-05-03T00:59:45.450192657+00:00 Validated: both options analyzed; OPTION A (single-owner sentinel) recommended; patch shape names new tui_sentinel module + concrete file:line edits in state.rs / coordinator_agent.rs / nex.rs; tmux-wrapping interaction documented (supervisor's nex NOT wrapped, TUI's IS); zero source modifications (only .wg/ untracked). Ready for fix-tui-supervisor-coexistence.
2026-05-03T00:59:55.752746547+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-05-03T01:03:03.758959060+00:00 PendingEval → Done (evaluator passed; downstream unblocks)