Metadata
| Status | done |
|---|---|
| Assigned | agent-1270 |
| Agent identity | 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3 |
| Model | claude:opus |
| Created | 2026-04-30T16:18:02.531132428+00:00 |
| Started | 2026-04-30T16:23:18.938375483+00:00 |
| Completed | 2026-04-30T16:32:19.414641701+00:00 |
| Tags | priority-high,research,bug,chat,nex,eval-scheduled |
| Eval score | 0.93 |
| └ blocking impact | 0.92 |
| └ completeness | 0.93 |
| └ constraint fidelity | 0.85 |
| └ coordination overhead | 0.95 |
| └ correctness | 0.95 |
| └ downstream usability | 0.92 |
| └ efficiency | 0.90 |
| └ intent fidelity | 0.83 |
| └ style adherence | 0.95 |
Description
User attempted to launch a wg nex chat on .chat-32 with model='qwen3-coder', endpoint='https://lambda01.tail334fe6.ts.net:30000'. Result: process died immediately, NO error messages, NO stderr captured. The TUI fell back to 'chat with the coordinator' (because the chat agent was gone). User had no way to debug.
Daemon log evidence (2026-04-30 16:15:50Z):
[INFO] IPC CreateChat: name=None, model=Some("qwen3-coder"), executor=Some("native"), endpoint=Some("https://lambda01.tail334fe6.ts.net:30000")
[INFO] Graph file changed (fs watcher), scheduling dispatcher tick in 2000ms (settling delay)
[INFO] Settling delay elapsed, running coordinator tick now
[dispatcher] Reconciliation: recovered 1 orphaned task(s)
Two seconds between CreateChat and reconciliation orphan-recovery. No spawn-error log, no stderr capture, no exit code recorded. Silent death.
wg log .chat-32 --agent returns 'No agent archives' — the agent never wrote anything.
Hypotheses to test
- nex handler crashed before spawning anything: maybe the model spec 'qwen3-coder' (without the nex: prefix) didn't route correctly, OR endpoint parsing failed, OR something in the nex initialization panicked
- nex spawned but exited immediately: maybe the endpoint was unreachable / TLS handshake failed / model name not accepted by server, and the handler exited 1 before any output was buffered
- Stderr capture is broken for nex spawns specifically: claude/codex have explicit stderr-log files (e.g., claude-handler-stderr.log); nex may not, so its stderr just goes to /dev/null
The third hypothesis is implied by the daemon log: claude handler explicitly logs stderr="/home/erik/workgraph/.wg/service/claude-handler-stderr.log" on spawn. NO equivalent line for any nex spawn. So nex's stderr is unrouted.
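The difference between routed and unrouted stderr can be sketched as follows. This is a minimal illustration, not the workgraph implementation: `stderr_log_path`, `open_stderr_log`, and `run_with_stderr_log` are hypothetical names, and the `<handler>-handler-stderr.log` naming is only modelled on the claude handler's file mentioned above. The two points it shows are that the log file is opened in create+append mode and that the path is announced before the child is spawned.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};
use std::process::{Command, ExitStatus, Stdio};

/// Hypothetical helper: where a handler's persistent stderr log would live.
fn stderr_log_path(service_dir: &Path, handler: &str) -> PathBuf {
    service_dir.join(format!("{}-handler-stderr.log", handler))
}

/// Open the log in create+append mode so output from earlier runs is preserved.
fn open_stderr_log(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Run `cmd` with its stderr routed to `log_path` and return its exit status.
fn run_with_stderr_log(mut cmd: Command, log_path: &Path) -> io::Result<ExitStatus> {
    let log = open_stderr_log(log_path)?;
    // Breadcrumb emitted BEFORE the spawn, so even an exec failure
    // leaves a pointer to the file worth checking.
    eprintln!(
        "spawning {:?} stderr={}",
        cmd.get_program(),
        log_path.display()
    );
    cmd.stdin(Stdio::null()).stderr(Stdio::from(log));
    cmd.status()
}
```

With this shape, a child that exits immediately still leaves its last words in a file whose location was logged in advance, which is the property the nex spawn appears to lack.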
Investigation steps
- Reproduce the spawn from CLI: `wg nex --model qwen3-coder --endpoint https://lambda01.tail334fe6.ts.net:30000` directly. See what error fires (if any).
- Reproduce via IPC the same way the TUI did. Inspect spawn code path for nex (likely src/dispatch/handler_for_model.rs + a nex-specific handler module).
- Compare claude handler's stderr capture vs nex handler's. Identify the gap.
- Test: model='qwen3-coder' vs model='nex:qwen3-coder' — does the prefix matter for the IPC path?
Deliverable
A wg log entry with:
- Root cause for the silent death (file:line of where the spawn fails or where stderr is discarded)
- Concrete fix proposal for stderr capture parity (what claude has but nex doesn't)
- Concrete fix proposal for the underlying spawn failure if it's a real bug, not just a missing-error-surface issue
No source modifications — diagnose only.
Validation
- Reproduction documented with exact command + observed silent failure
- Root cause identified with file:line citation
- Both fixes (stderr capture + underlying spawn issue if any) specified concretely enough that a follow-up implementation task can execute
- No source / doc modifications
Depends on
Required by
Log
- 2026-04-30T16:18:02.519471476+00:00 Task paused
- 2026-04-30T16:18:47.013567614+00:00 Task published
- 2026-04-30T16:21:01.285515895+00:00 USER REPRODUCTION 2026-04-30: CLI spawn works, IPC spawn fails silently. Same model + endpoint.
  Working (CLI):
    $ wg nex -m qwen3-coder -e https://lambda01.tail334fe6.ts.net:30000
    wg nex — interactive session with qwen3-coder
    > hi
    Hello! How can I help you today?
  Failing (IPC, .chat-32):
    IPC CreateChat: model=Some("qwen3-coder"), executor=Some("native"), endpoint=Some("https://lambda01.tail334fe6.ts.net:30000")
    → 2 seconds later → orphan-recovery, no stderr, agent archives empty
  CRITICAL DIAGNOSTIC: this rules out:
    - nex handler bugs (works fine on CLI)
    - endpoint reachability (CLI connects fine)
    - TLS / cert validation (CLI handshake succeeds)
    - model name acceptance by server (CLI gets responses)
    - API key / auth (no key needed for this endpoint, but CLI confirms it auths correctly anyway)
  The bug MUST be in the divergence between:
    - CLI: `wg nex -m X -e Y` invocation path (src/commands/nex.rs or similar; the user-facing subcommand)
    - IPC: chat-spawn path triggered by CreateChat IPC (likely src/ipc/ + src/dispatch/ + chat-spawn site in src/tui/viz_viewer/state.rs)
  INVESTIGATION FOCUS: diff the two spawn paths. What does the IPC path do differently? Hypotheses worth checking specifically:
    1. **Tmux wrapping difference**: implement-tmux-wrapped (commit ce6ca245a) wrapped chat agents in tmux. CLI `wg nex` doesn't go through tmux. If tmux's wrap-spawn is corrupting argv / env / stderr-redirection for nex specifically, that'd be the bug. Test: try chat-spawn but with tmux disabled for that one chat.
    2. **Different argv construction**: CLI builds `wg nex --model X --endpoint Y` directly. IPC may construct args via a different builder that drops/mangles the endpoint flag, or adds a flag nex doesn't recognize, causing immediate exit.
    3. **Stderr is unrouted in IPC path but routed in CLI**: this part is already in the original task description.
    4. **Working directory / env differences**: CLI inherits user's shell env; IPC spawn may strip critical env vars that the connect needs (e.g., SSL_CERT_FILE for system CA store, or CARGO_HOME, etc.).
  Compare the two paths line by line. The fix is whatever they should have in common but currently don't.
- 2026-04-30T16:23:18.938379540+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-30T16:23:33.401954750+00:00 Starting investigation: nex chat spawn dies silently. Will reproduce + diff against claude handler stderr capture.
- 2026-04-30T16:25:40.415379499+00:00 Found code path: CreateChat IPC writes graph + per-coord state. Daemon never auto-spawns supervisor for newly-created chats — boot enumeration runs ONCE; runtime path requires either (a) daemon restart, or (b) UrgentWake from UserChat IPC. After CreateChat, fs-watcher fires reconciliation tick which sees .chat-N InProgress + assigned=None and 'recovers' it (sets to Open) — see sweep.rs:393. Reconciliation only excludes coordinator-loop and compact-loop tags, NOT chat-loop.
- 2026-04-30T16:30:17.081611917+00:00 REPRODUCTION: Looked at .chat-32 in current graph. wg show .chat-32 shows status=Open with log line 'Reconciliation: task recovered from orphaned state (was InProgress, agent: (none)) [reconcile]' at 16:15:52.905Z, about 2.6s after the IPC CreateChat at 16:15:50.348Z. Daemon log for coordinator-32 has zero entries (no spawn attempt, no error). Supervisor never ran.
- 2026-04-30T16:30:30.600680095+00:00 ROOT CAUSE 1 (silent death — primary): Reconciliation orphan-recovery treats newly-created chat tasks as orphaned. src/commands/service/ipc.rs:1519-1521 creates the chat task with Status::InProgress + tag=chat-loop + assigned=None. src/commands/sweep.rs:388-397 (reconcile_orphaned_tasks) checks 'task.status == InProgress && assigned == None' and excludes ONLY tags 'coordinator-loop' and 'compact-loop' — chat-loop is NOT in the list. So 2s after CreateChat, the fs-watcher fires reconcile_orphaned_tasks which 'recovers' .chat-32 from InProgress → Open (sweep.rs:412-419). Meanwhile the daemon's supervisor-spawn happens ONLY at boot (service/mod.rs:2188 enumerate_chat_supervisors_for_boot, ONE-shot at startup) or via UrgentWake on UserChat IPC (service/mod.rs:2553-2611 lazy-spawn). CreateChat IPC (ipc.rs:566-583) just writes graph + state and returns OK; it does NOT signal the daemon to spawn a supervisor. So chat-32 sits with no agent until a UserChat arrives — but the user sees the tab as 'live' in the TUI and reasonably expects a process to be running.
- 2026-04-30T16:30:47.658929470+00:00 ROOT CAUSE 2 (endpoint silently dropped — secondary, latent): CreateChat IPC stores the user's endpoint URL on the chat task at ipc.rs:1535 (task.endpoint = Some(...)) AND in CoordinatorState.endpoint_override at ipc.rs:1581. Confirmed: coordinator-state-32.json has endpoint_override='https://lambda01.tail334fe6.ts.net:30000'. BUT src/dispatch/plan.rs:214-228 (plan_spawn endpoint resolution for native executor) ONLY checks config.llm_endpoints.find_default() — task.endpoint is NEVER read. Verified via dry-run: WG_EXECUTOR_TYPE=native WG_MODEL=qwen3-coder wg spawn-task --dry-run .chat-32 emits 'wg nex --chat .chat-32 -m qwen3-coder' with NO -e flag. The supervisor (coordinator_agent.rs:810-820) sets env vars WG_EXECUTOR_TYPE and WG_MODEL but no WG_ENDPOINT. spawn_task.rs:209-216 reads only those two env vars. Result: even if the supervisor DID spawn, nex would invoke create_provider_ext with endpoint=None, fall through provider.rs:289-316 heuristics ('qwen3-coder' is bare → oai-compat default), then llm_endpoints.find_default() — talking to whatever default endpoint config has, not lambda01. Likely a 401/404/connection-refused that nex would log to stderr and exit.
- 2026-04-30T16:31:01.302173530+00:00 ROOT CAUSE 3 (stderr capture parity gap): claude_handler.rs:399-411 redirects spawn.stderr to a dedicated persistent file ~/.wg/service/claude-handler-stderr.log AND emits an explicit 'claude-handler: spawning ... stderr=...' line at claude_handler.rs:413-419 BEFORE child.spawn(). The nex path (coordinator_agent.rs:821-823) pipes stderr into reader threads (lines 867-878) that prefix each line with '[coordinator-N stderr]' and route to daemon.log via logger_err.info. So nex stderr IS captured WHEN the supervisor runs — but: (a) it lives in daemon.log inline, not in a discoverable per-handler file; (b) reader threads run AFTER child.spawn() succeeds, so spawn-failures (e.g. exec error, missing binary) emit only the supervisor's own error line at coordinator_agent.rs:833-836, not the child's stderr; (c) most importantly, in the ACTUAL incident the supervisor never spawned at all — so there was nothing to capture. The user's perception of 'no stderr capture' is correct: there was zero process and zero stderr.
- 2026-04-30T16:31:08.714085270+00:00 FIX PROPOSAL A (primary — blocks the symptom): src/commands/sweep.rs:393-396 — extend the orphan-exclusion tag list. Change 't == "coordinator-loop" || t == "compact-loop"' to also exclude 'chat-loop' (workgraph::chat_id::CHAT_LOOP_TAG). Better: use workgraph::chat_id::is_chat_loop_tag(t) which already covers both 'chat-loop' (new) and 'coordinator-loop' (legacy) — see chat_id.rs and dispatch_boot.rs:42 which already uses this helper for the equivalent boot enumeration. Without this fix, every newly-created chat task gets reset to Open within ~2s of creation, severing the InProgress invariant the supervisor's pre-flight relies on.
- 2026-04-30T16:31:20.284422773+00:00 FIX PROPOSAL B (eager supervisor spawn; closes the user-visible gap): src/commands/service/ipc.rs:566-583 (IPC CreateChat handler) currently writes the graph and returns. Add a side effect to enqueue the new chat_id into pending_coordinator_ids and signal urgent_wake (the same channel UserChat IPC uses at service/mod.rs:2553-2611). Two concrete options: (1) Reuse the ipc-handler-to-main-loop side-effect channel, the same pattern the existing delete_coordinator_ids and interrupt_coordinator_ids vectors use (ipc.rs:606-608, 641). Add 'pending_coordinator_ids.push(next_id)' on successful create, then in service/mod.rs main loop check pending_coordinator_ids unconditionally (not just on urgent_wake) and lazy-spawn for any IDs without an active CoordinatorAgent. (2) Simpler: have CreateChat set urgent_wake=true so the existing lazy-spawn block at mod.rs:2556-2611 fires. Either way the supervisor for .chat-N starts within ~100ms of creation, not 'whenever the user happens to send the first message'. Even with B, the user can still hit the spawn-failure case (B-then-failure). With B alone (no A), reconciliation will still flip status to Open while the supervisor is starting up; the supervisor's pre-flight tolerates Open (coordinator_agent.rs:718-732 only excludes Done/Abandoned/archived), so the supervisor still starts, but B+A is the durable combo.
- 2026-04-30T16:31:40.538815473+00:00 FIX PROPOSAL C (endpoint plumbing — the underlying bug for nex): src/dispatch/plan.rs:214-228 must consult task.endpoint BEFORE falling back to config.llm_endpoints.find_default(). Roughly: if executor.needs_endpoint() { if let Some(ep_str) = task.endpoint.as_deref() { /* build EndpointConfig from URL or look up by name */ } else if let Some(ep) = config.llm_endpoints.find_default() {...} }. The EndpointConfig the inline-URL path needs already exists conceptually in provider.rs:208-230 (the '-e http(s)://...' shortcut) — plan.rs needs to either return an EndpointConfig with name=URL (so spawn_task.rs:233 endpoint = plan.endpoint.as_ref().map(|e| e.name.clone()) yields the URL itself, which dispatch_native passes via -e at spawn_task.rs:359-361) OR add a new variant that carries the raw URL. Concretely the smallest patch: change the resolver at plan.rs:214-228 to read task.endpoint first and synthesize an EndpointConfig { name: ep_str.clone(), url: Some(ep_str.clone()), ... } when the value is an http(s):// URL, falling back to find_by_name(ep_str) for named endpoints, then to find_default().
- 2026-04-30T16:31:50.815816584+00:00 FIX PROPOSAL D (stderr-capture parity for nex — observability): src/commands/service/coordinator_agent.rs:821-823 — add a persistent stderr file equivalent to claude_handler.rs:399-411. Concretely: open ~/.wg/service/nex-handler-stderr-<chat_id>.log with create+append, use Stdio::from(file) for cmd.stderr() instead of Stdio::piped(), and drop the stderr-reader thread (lines 867-878). Or keep BOTH (tee via a small forwarder) so daemon.log inline preview AND the persistent file both exist. Also add an explicit 'Coordinator-N: spawning via wg spawn-task ... (executor=..., model=..., endpoint=..., stderr_log=<path>)' log line BEFORE cmd.spawn() at coordinator_agent.rs:830 — the current line at 825-828 omits endpoint and the stderr-file path, which is the missing breadcrumb claude_handler emits at line 413-419. Together these mean that even on a spawn-time crash, the user has both a clear log line ('here's where the file is') and an existing stderr file to read.
- 2026-04-30T16:32:02.662746054+00:00 SUMMARY for fix-chat-tab consumer: The downstream task fix-chat-tab asks for 'chat tab should surface spawn errors when the chat agent dies, not fall back to coordinator'. The relevant signal it must check: when the user opens a chat tab for chat_id N, the TUI should query whether (1) a CoordinatorAgent for N is registered AND alive (currently checked via state.cached_coordinator_id_set + IPC StatChat or similar), AND (2) on a spawn failure show the daemon's spawn error rather than silently routing to coordinator-0. With FIX A applied, .chat-N stays InProgress so the TUI's 'is this chat alive' check works correctly. With FIX B applied, the supervisor exists by the time the user types. With FIX C/D applied, when nex DOES fail (bad endpoint, network, model rejected) the failure is visible via persistent log file + supervisor.error log line. Net: A+B make the chat reliably exist; C makes it actually run against the right endpoint; D makes any future failure debuggable. Order of urgency: A (one-line fix, blocks the symptom), B (signals existing path on create), C (endpoint plumbing — most invasive but unblocks all nex chats), D (observability hardening, smaller). No source modifications in this task — pure diagnosis.
- 2026-04-30T16:32:12.940464060+00:00 Validation: each item in '## Validation' addressed. (1) Reproduction documented: wg show .chat-32 confirms the orphan-recovery, daemon.log shows zero coordinator-32 supervisor lines. (2) Root cause cited with file:line: sweep.rs:393-396 (primary), plan.rs:214-228 (endpoint), claude_handler.rs:399-411 vs coordinator_agent.rs:821-823 (stderr parity), service/mod.rs:2188 + 2553-2611 (no eager supervisor spawn). (3) Fix proposals A/B/C/D each cite file:line and the smallest concrete patch. (4) No source/doc modifications — only wg log entries on this task. Ready for fix-chat-tab + a separate fix task chain to consume.
- 2026-04-30T16:32:19.414644937+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-04-30T16:34:53.568925690+00:00 PendingEval → Done (evaluator passed; downstream unblocks)
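
Sketch for FIX PROPOSAL A. The orphan check described in the log can be modelled as a predicate over status, assignment, and tags. The types below are hypothetical stand-ins for the real graph types, and `is_loop_tag` is an illustrative helper, not the existing `is_chat_loop_tag`. The only point is that adding 'chat-loop' to the exclusion list keeps a freshly created chat task InProgress.

```rust
/// Hypothetical stand-in for the real task status type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Open,
    InProgress,
}

/// Hypothetical stand-in for a graph task.
struct Task {
    status: Status,
    assigned: Option<String>,
    tags: Vec<String>,
}

/// Tags for long-lived loop tasks that legitimately sit InProgress
/// with no agent assigned. 'chat-loop' is the proposed addition.
fn is_loop_tag(tag: &str) -> bool {
    matches!(tag, "coordinator-loop" | "compact-loop" | "chat-loop")
}

/// A task counts as orphaned only if it is InProgress, unassigned,
/// and not tagged as a loop task.
fn is_orphaned(task: &Task) -> bool {
    task.status == Status::InProgress
        && task.assigned.is_none()
        && !task.tags.iter().any(|t| is_loop_tag(t))
}

/// Reset every orphan to Open and report how many were recovered.
fn reconcile_orphaned_tasks(tasks: &mut [Task]) -> usize {
    let mut recovered = 0;
    for task in tasks.iter_mut().filter(|t| is_orphaned(t)) {
        task.status = Status::Open;
        recovered += 1;
    }
    recovered
}
```

Removing "chat-loop" from `is_loop_tag` reproduces the incident in this model: the unassigned chat task is flipped to Open on the first reconciliation pass.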
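
Sketch for FIX PROPOSAL B. One way to picture the eager spawn is a side-effect struct that the IPC handler fills and the daemon main loop drains. Everything here is hypothetical scaffolding: `IpcSideEffects`, `Daemon`, `handle_create_chat`, and `drain_pending` are illustrative names, and the real supervisor spawn is reduced to inserting an id into a set. The field names `pending_coordinator_ids` and `urgent_wake` follow the log entry.

```rust
use std::collections::BTreeSet;

/// Hypothetical side effects an IPC handler hands back to the main loop.
#[derive(Default)]
struct IpcSideEffects {
    pending_coordinator_ids: Vec<u32>,
    urgent_wake: bool,
}

/// Hypothetical daemon state: chat ids that already have a live supervisor.
#[derive(Default)]
struct Daemon {
    active_supervisors: BTreeSet<u32>,
}

/// CreateChat handler: after persisting the task, request an eager spawn
/// instead of waiting for the first UserChat message.
fn handle_create_chat(next_id: u32, effects: &mut IpcSideEffects) {
    effects.pending_coordinator_ids.push(next_id);
    effects.urgent_wake = true;
}

impl Daemon {
    /// One main-loop pass: start a supervisor for each pending id that
    /// lacks one, and return the ids started on this pass.
    fn drain_pending(&mut self, effects: &mut IpcSideEffects) -> Vec<u32> {
        effects.urgent_wake = false;
        let mut spawned = Vec::new();
        for id in effects.pending_coordinator_ids.drain(..) {
            // `insert` returns false when the id is already present,
            // so a duplicate request never spawns a second supervisor.
            if self.active_supervisors.insert(id) {
                spawned.push(id);
            }
        }
        spawned
    }
}
```

Draining unconditionally on every pass, rather than only when `urgent_wake` is set, corresponds to option (1) in the proposal; the idempotent insert is an added assumption to keep repeated requests harmless.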
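
Sketch for FIX PROPOSAL C. The resolution order proposed in the log (task endpoint as inline URL, then named endpoint, then configured default) can be written as a single function. This `EndpointConfig` is a simplified hypothetical with only the fields the sketch needs, and `resolve_endpoint` / `find_default` are illustrative, not the signatures in plan.rs. Falling back to the default when a named endpoint is not found is an assumption; returning an error there may be the safer choice.

```rust
/// Simplified, hypothetical endpoint record.
#[derive(Debug, Clone, PartialEq, Eq)]
struct EndpointConfig {
    name: String,
    url: Option<String>,
    is_default: bool,
}

/// Return the configured default endpoint, if any.
fn find_default(configured: &[EndpointConfig]) -> Option<EndpointConfig> {
    configured.iter().find(|c| c.is_default).cloned()
}

/// Resolve the endpoint for a spawn, consulting the task's own endpoint first.
fn resolve_endpoint(
    task_endpoint: Option<&str>,
    configured: &[EndpointConfig],
) -> Option<EndpointConfig> {
    match task_endpoint {
        // Inline URL: synthesize a config whose name is the URL itself,
        // so a caller that forwards `name` via -e passes the URL through.
        Some(ep) if ep.starts_with("http://") || ep.starts_with("https://") => {
            Some(EndpointConfig {
                name: ep.to_string(),
                url: Some(ep.to_string()),
                is_default: false,
            })
        }
        // Named endpoint: look it up, then fall back to the default (assumption).
        Some(ep) => configured
            .iter()
            .find(|c| c.name == ep)
            .cloned()
            .or_else(|| find_default(configured)),
        // No task-level endpoint: current behaviour, default only.
        None => find_default(configured),
    }
}
```

In this model, the incident's input (`task_endpoint = Some("https://lambda01.tail334fe6.ts.net:30000")`) resolves to the lambda01 URL rather than to whatever default the config carries.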