fix-nex-chat

Fix: nex chat-spawn — apply 4 fixes from diagnose-wg-nex (sweep exclusion, eager supervisor, endpoint plumbing, stderr parity)

Metadata

Status: done
Assigned: agent-1317
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-05-01T12:57:18.981997924+00:00
Started: 2026-05-01T12:57:53.984403365+00:00
Completed: 2026-05-01T13:40:32.351044133+00:00
Tags: priority-critical, fix, chat, nex, bug, eval-scheduled
Eval score: 0.75
└ blocking impact: 0.80
└ completeness: 0.78
└ constraint fidelity: 0.85
└ coordination overhead: 0.76
└ correctness: 0.72
└ downstream usability: 0.72
└ efficiency: 0.76
└ intent fidelity: 0.61
└ style adherence: 0.76

Description

The diagnose task (diagnose-wg-nex, agent-1270) identified four stacked root causes for why nex chat agents die silently. Read its log via wg show diagnose-wg-nex for the full forensic trace.

The 4 fixes (in order of urgency — apply ALL)

Fix A — sweep.rs:393-396 (one-line; blocks the symptom)

Orphan-recovery sweep excludes 'coordinator-loop' and 'compact-loop' but NOT 'chat-loop'. So every newly-created chat task gets reset to Open within ~2s of creation, breaking the InProgress invariant the supervisor's pre-flight relies on.

Patch: change the exclusion check to use workgraph::chat_id::is_chat_loop_tag(t). The helper already exists and covers both the new 'chat-loop' tag and the legacy 'coordinator-loop' tag. dispatch_boot.rs:42 already uses this helper for the equivalent boot enumeration, so the sweep should be consistent with it.
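A minimal sketch of the intended exclusion check, under stated assumptions: the body of is_chat_loop_tag below is a stand-in (the real helper lives in workgraph::chat_id and its implementation is not shown here), and is_excluded_from_sweep is a hypothetical name for the check at sweep.rs:393-396. The sketch keeps 'compact-loop' excluded separately, since the helper is only described as covering the two chat tags.

```rust
/// Stand-in for workgraph::chat_id::is_chat_loop_tag (assumed behavior:
/// matches the new 'chat-loop' tag and the legacy 'coordinator-loop' tag).
pub fn is_chat_loop_tag(tag: &str) -> bool {
    matches!(tag, "chat-loop" | "coordinator-loop")
}

/// Hypothetical name for the sweep's exclusion check: returns true when the
/// orphan-recovery sweep must leave a task with these tags alone.
pub fn is_excluded_from_sweep(tags: &[String]) -> bool {
    tags.iter()
        .any(|t| is_chat_loop_tag(t) || t.as_str() == "compact-loop")
}
```

With this shape, a task tagged only 'chat-loop' is skipped by the sweep, while an ordinary task with no loop tag is still eligible for orphan recovery.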

Fix B — eager supervisor spawn on CreateChat (closes the user-visible gap)

src/commands/service/ipc.rs:566-583 — the IPC CreateChat handler writes the graph and returns. Currently the supervisor for the new chat doesn't spawn until UserChat IPC fires (i.e., when the user actually sends a first message). Between create and first message, no supervisor exists.

Patch: on successful create, enqueue the new chat_id into pending_coordinator_ids AND signal urgent_wake. This follows the same pattern as delete_coordinator_ids and interrupt_coordinator_ids (ipc.rs:606-608, 641). Or, more simply, set urgent_wake=true so the existing lazy-spawn block at service/mod.rs:2556-2611 fires.
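The enqueue-and-wake pattern can be sketched roughly as follows. This is an illustration only: DaemonState, on_create_chat, the u32 id type, and the String error type are all hypothetical; only the field names pending_coordinator_ids and urgent_wake come from the task description.

```rust
use std::collections::VecDeque;

/// Hypothetical slice of the daemon's shared state. The real struct and its
/// field types are not shown in this task.
#[derive(Default)]
pub struct DaemonState {
    pub pending_coordinator_ids: VecDeque<u32>,
    pub urgent_wake: bool,
}

/// Hypothetical tail of the CreateChat handler: `write_result` stands for the
/// outcome of the graph write. Only a successful create enqueues and wakes.
pub fn on_create_chat(
    state: &mut DaemonState,
    write_result: Result<u32, String>,
) -> Result<u32, String> {
    let chat_id = write_result?; // failed create: nothing is enqueued
    state.pending_coordinator_ids.push_back(chat_id);
    state.urgent_wake = true; // let the main loop spawn the supervisor now
    Ok(chat_id)
}
```

The point of the early return is that a failed graph write leaves both the queue and the wake flag untouched, so the supervisor is never spawned for a chat that does not exist.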

Fix C — endpoint plumbing in plan.rs:214-228 (the underlying nex bug)

The task.endpoint field (set from CreateChat IPC's endpoint=Some(url) at ipc.rs:1535 + CoordinatorState.endpoint_override at ipc.rs:1581) is NEVER READ by plan_spawn. Only config.llm_endpoints.find_default() is consulted. So even when the supervisor spawns, nex talks to the default endpoint instead of the user's specified URL.

Verified via dry-run: WG_EXECUTOR_TYPE=native WG_MODEL=qwen3-coder wg spawn-task --dry-run .chat-32 emits wg nex --chat .chat-32 -m qwen3-coder with NO -e flag.

Patch: in plan.rs:214-228, read task.endpoint FIRST. If present, synthesize an EndpointConfig with the URL (matches the pattern provider.rs:208-230 already uses for the '-e http://...' CLI shortcut). Then fall back to find_by_name(ep_str) for named endpoints, then find_default().
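One possible reading of that resolution order, as a sketch. The types below are simplified stand-ins: the real EndpointConfig and config.llm_endpoints have fields not shown in this task, resolve_endpoint is a hypothetical name, and treating any value that starts with http:// or https:// as a raw URL is an assumption modeled on the '-e http://...' shortcut mentioned above.

```rust
/// Simplified stand-in for the real EndpointConfig (fields are assumptions).
#[derive(Clone, Debug, PartialEq)]
pub struct EndpointConfig {
    pub name: String,
    pub url: String,
    pub is_default: bool,
}

/// Simplified stand-in for config.llm_endpoints.
pub struct LlmEndpoints {
    pub endpoints: Vec<EndpointConfig>,
}

impl LlmEndpoints {
    pub fn find_by_name(&self, name: &str) -> Option<&EndpointConfig> {
        self.endpoints.iter().find(|e| e.name == name)
    }

    pub fn find_default(&self) -> Option<&EndpointConfig> {
        self.endpoints.iter().find(|e| e.is_default)
    }
}

/// Hypothetical helper: task.endpoint first, then named lookup, then default.
pub fn resolve_endpoint(
    task_endpoint: Option<&str>,
    cfg: &LlmEndpoints,
) -> Option<EndpointConfig> {
    match task_endpoint {
        // A raw URL on the task: synthesize a config instead of looking it up.
        Some(ep) if ep.starts_with("http://") || ep.starts_with("https://") => {
            Some(EndpointConfig {
                name: ep.to_string(),
                url: ep.to_string(),
                is_default: false,
            })
        }
        // A named endpoint: look it up, falling back to the default if unknown.
        Some(ep) => cfg.find_by_name(ep).or_else(|| cfg.find_default()).cloned(),
        // Nothing on the task: the behavior before the fix.
        None => cfg.find_default().cloned(),
    }
}
```

Whether an unknown endpoint name should silently fall back to the default or surface an error is a design choice this sketch does not settle; the fallback here simply mirrors the order given in the patch description.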

Fix D — stderr-capture parity for nex (observability hardening)

src/commands/service/coordinator_agent.rs:821-823 — nex stderr is piped into reader threads that route to daemon.log AFTER child.spawn() succeeds. Spawn-time failures emit only the supervisor's own error line, not the child's stderr. claude_handler.rs:399-411 has a dedicated persistent file (~/.wg/service/claude-handler-stderr.log); nex doesn't.

Patch: open ~/.wg/service/nex-handler-stderr-<chat_id>.log with create+append, use Stdio::from(file) for cmd.stderr() instead of Stdio::piped(). Add an explicit 'Coordinator-N: spawning via wg spawn-task ... (executor=..., model=..., endpoint=..., stderr_log=
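The file-backed stderr part of this patch could look roughly like the sketch below. It uses only standard-library calls (OpenOptions with create+append, and Stdio::from(File)); the function names and the service_dir parameter are hypothetical, and expanding ~ to the user's home directory is left to the caller.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};

/// Hypothetical helper: per-chat stderr log path under the service directory
/// (for example ~/.wg/service, already expanded by the caller).
pub fn stderr_log_path(service_dir: &Path, chat_id: &str) -> PathBuf {
    service_dir.join(format!("nex-handler-stderr-{chat_id}.log"))
}

/// Opens the log with create+append so earlier failures are preserved.
pub fn open_stderr_log(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Points the child's stderr at the file instead of a pipe. The handle is
/// attached before spawn, so anything the child writes to stderr lands in
/// the file without relying on reader threads started after spawn.
pub fn attach_stderr(cmd: &mut Command, path: &Path) -> io::Result<()> {
    let file = open_stderr_log(path)?;
    cmd.stderr(Stdio::from(file));
    Ok(())
}
```

Because the file is opened in append mode, repeated spawn attempts for the same chat accumulate in one log rather than overwriting each other.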

Why all four together

  • A alone: the chat task stays InProgress but no supervisor exists until the first message (B fixes that).
  • B alone: supervisor starts but reconciliation still flips status to Open mid-startup; nex talks to wrong endpoint (A fixes that, C fixes the endpoint bug).
  • C alone: endpoint plumbed correctly but the task still gets reset to Open and the supervisor still doesn't fire eagerly.
  • D alone: better observability but the underlying bug (A+B+C) still causes silent death.

A+B+C+D is the durable fix. D is the safety net — even when something else breaks in the future, you'll see why.

Validation

  • Failing tests written first
  • Unit test: chat-loop tag is excluded from orphan sweep (Fix A)
  • Unit test: CreateChat IPC enqueues pending_coordinator_ids and urgent_wake fires (Fix B)
  • Unit test: plan.rs reads task.endpoint when set (Fix C); dry-run spawn-task .chat-N emits the -e URL flag
  • Live smoke (the user's reported scenario): in TUI, create a new chat with executor=native, model=qwen3-coder, endpoint=https://lambda01.tail334fe6.ts.net:30000. ASSERT:
      - chat task stays InProgress (not flipped to Open by sweep)
      - CoordinatorAgent for the new chat starts within ~100ms of CreateChat
      - The spawned 'wg nex' invocation includes '-e https://lambda01.tail334fe6.ts.net:30000'
      - On a working endpoint: chat actually responds to messages
      - On a broken endpoint: stderr file at ~/.wg/service/nex-handler-stderr-<chat_id>.log contains the actual error
  • No regression: claude / codex chats still work as before
  • cargo build + cargo test pass (3294+ tests)
  • Permanent smoke scenario added covering the chat-spawn happy path AND a bad-endpoint failure path with stderr-file inspection
  • cargo install --path . was run before claiming done

Process note

This is a multi-fix task. Apply all four. The diagnose task did the design work; the implementer just executes against the file:line spec. If any one fix turns out to be wrong or incomplete, file a follow-up for that fix rather than punting all four.

Depends on

Required by

Log