Metadata
| Status | done |
|---|---|
| Assigned | agent-1317 |
| Agent identity | f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e |
| Created | 2026-05-01T12:57:18.981997924+00:00 |
| Started | 2026-05-01T12:57:53.984403365+00:00 |
| Completed | 2026-05-01T13:40:32.351044133+00:00 |
| Tags | priority-critical,fix,chat,nex,bug,eval-scheduled |
| Eval score | 0.75 |
| └ blocking impact | 0.80 |
| └ completeness | 0.78 |
| └ constraint fidelity | 0.85 |
| └ coordination overhead | 0.76 |
| └ correctness | 0.72 |
| └ downstream usability | 0.72 |
| └ efficiency | 0.76 |
| └ intent fidelity | 0.61 |
| └ style adherence | 0.76 |
Description
Diagnose (diagnose-wg-nex, agent-1270) identified four stacked root causes for why nex chat agents die silently. Read its log via wg show diagnose-wg-nex for the full forensic trace.
The 4 fixes (in order of urgency — apply ALL)
Fix A — sweep.rs:393-396 (one-line; blocks the symptom)
Orphan-recovery sweep excludes 'coordinator-loop' and 'compact-loop' but NOT 'chat-loop'. So every newly-created chat task gets reset to Open within ~2s of creation, breaking the InProgress invariant the supervisor's pre-flight relies on.
Patch: change the exclusion check to use workgraph::chat_id::is_chat_loop_tag(t) (helper already exists; covers both 'chat-loop' new tag and 'coordinator-loop' legacy). dispatch_boot.rs:42 already uses this helper for the equivalent boot enumeration — be consistent.
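A minimal sketch of the exclusion check Fix A describes. The local `is_chat_loop_tag` below is a hypothetical stand-in for `workgraph::chat_id::is_chat_loop_tag` (the real signature is not shown in this task), and `is_sweep_exempt` is an illustrative name, not the code at sweep.rs:393-396.

```rust
/// Hypothetical stand-in for workgraph::chat_id::is_chat_loop_tag.
/// Assumption: it matches the new 'chat-loop' tag and the legacy
/// 'coordinator-loop' tag, as the task text states.
fn is_chat_loop_tag(tag: &str) -> bool {
    matches!(tag, "chat-loop" | "coordinator-loop")
}

/// Returns true when the orphan-recovery sweep should leave a task alone.
/// 'compact-loop' stays excluded, as it was before the patch.
fn is_sweep_exempt(tags: &[&str]) -> bool {
    tags.iter()
        .any(|t| is_chat_loop_tag(t) || *t == "compact-loop")
}
```

Routing both the sweep and the boot enumeration through one helper means a future tag rename only has to be made in one place.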
Fix B — eager supervisor spawn on CreateChat (closes the user-visible gap)
src/commands/service/ipc.rs:566-583 — the IPC CreateChat handler writes the graph and returns. Currently the supervisor for the new chat doesn't spawn until UserChat IPC fires (i.e., when the user actually sends a first message). Between create and first message, no supervisor exists.
Patch: enqueue the new chat_id into pending_coordinator_ids on successful create AND signal urgent_wake. This is the same pattern already used for delete_coordinator_ids and interrupt_coordinator_ids (ipc.rs:606-608, 641). Or, simpler: set only urgent_wake=true so the existing lazy-spawn block at service/mod.rs:2556-2611 fires.
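The enqueue-and-wake pattern from Fix B could look roughly like the sketch below. The field names `pending_coordinator_ids` and `urgent_wake` come from the task text; the `DaemonShared` struct, the `u32` id type, and both method names are assumptions made for illustration only.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Hypothetical slice of daemon state shared between the IPC handler
/// and the service loop. The types are assumptions.
#[derive(Default)]
struct DaemonShared {
    pending_coordinator_ids: Mutex<Vec<u32>>,
    urgent_wake: AtomicBool,
}

impl DaemonShared {
    /// Called by the CreateChat handler after the graph write succeeds.
    fn on_chat_created(&self, chat_id: u32) {
        {
            let mut pending = self.pending_coordinator_ids.lock().unwrap();
            // Avoid queueing the same chat twice.
            if !pending.contains(&chat_id) {
                pending.push(chat_id);
            }
        }
        // Wake the service loop right away instead of waiting for the next tick.
        self.urgent_wake.store(true, Ordering::SeqCst);
    }

    /// Called by the service loop: returns the chats that need a supervisor
    /// if a wake was requested, and clears the wake flag.
    fn drain_if_woken(&self) -> Vec<u32> {
        if self.urgent_wake.swap(false, Ordering::SeqCst) {
            let mut pending = self.pending_coordinator_ids.lock().unwrap();
            std::mem::take(&mut *pending)
        } else {
            Vec::new()
        }
    }
}
```

Pushing the id before setting the flag ensures the service loop never observes a wake with an empty queue for this chat.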
Fix C — endpoint plumbing in plan.rs:214-228 (the underlying nex bug)
The task.endpoint field (set from CreateChat IPC's endpoint=Some(url) at ipc.rs:1535 + CoordinatorState.endpoint_override at ipc.rs:1581) is NEVER READ by plan_spawn. Only config.llm_endpoints.find_default() is consulted. So even when the supervisor spawns, nex talks to the default endpoint instead of the user's specified URL.
Verified via dry-run: WG_EXECUTOR_TYPE=native WG_MODEL=qwen3-coder wg spawn-task --dry-run .chat-32 emits wg nex --chat .chat-32 -m qwen3-coder with NO -e flag.
Patch: in plan.rs:214-228, read task.endpoint FIRST. If present, synthesize an EndpointConfig with the URL (matches the pattern provider.rs:208-230 already uses for the '-e http://...' CLI shortcut). Then fall back to find_by_name(ep_str) for named endpoints, then find_default().
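The lookup order in Fix C (inline URL, then named endpoint, then default) can be sketched as follows. The `EndpointConfig` fields, the `"inline"` name, and the `resolve_endpoint` function are hypothetical; the real types in plan.rs and provider.rs are not shown in this task and may differ.

```rust
/// Hypothetical, reduced endpoint record. The real EndpointConfig
/// likely carries more fields.
#[derive(Debug, Clone, PartialEq)]
struct EndpointConfig {
    name: String,
    url: String,
    is_default: bool,
}

/// Resolve the endpoint for a spawn, consulting task.endpoint before the default.
fn resolve_endpoint(
    task_endpoint: Option<&str>,
    configured: &[EndpointConfig],
) -> Option<EndpointConfig> {
    if let Some(ep) = task_endpoint {
        // 1. Inline URL: synthesize a config, mirroring the '-e http://...' shortcut.
        if ep.starts_with("http://") || ep.starts_with("https://") {
            return Some(EndpointConfig {
                name: "inline".to_string(),
                url: ep.to_string(),
                is_default: false,
            });
        }
        // 2. Named endpoint from the config file.
        if let Some(found) = configured.iter().find(|c| c.name == ep) {
            return Some(found.clone());
        }
    }
    // 3. Fall back to the configured default, the only source consulted before the fix.
    configured.iter().find(|c| c.is_default).cloned()
}
```

In this sketch an unknown endpoint name falls through to the default; whether the real implementation should instead fail loudly in that case is a design decision the task text leaves open.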
Fix D — stderr-capture parity for nex (observability hardening)
src/commands/service/coordinator_agent.rs:821-823 — nex stderr is piped into reader threads that route to daemon.log AFTER child.spawn() succeeds. Spawn-time failures emit only the supervisor's own error line, not the child's stderr. claude_handler.rs:399-411 has a dedicated persistent file (~/.wg/service/claude-handler-stderr.log); nex doesn't.
Patch: open ~/.wg/service/nex-handler-stderr-<chat_id>.log with create+append, use Stdio::from(file) for cmd.stderr() instead of Stdio::piped(). Add an explicit 'Coordinator-N: spawning via wg spawn-task ... (executor=..., model=..., endpoint=..., stderr_log=
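A rough sketch of the stderr redirection in Fix D, using only standard-library calls. The function names and the `&str` chat id are assumptions; the service directory is passed in rather than hard-coded so the sketch does not guess how `~/.wg/service` is resolved.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};
use std::process::{Command, Stdio};

/// Build the per-chat stderr log path under the service directory.
fn stderr_log_path(service_dir: &Path, chat_id: &str) -> PathBuf {
    service_dir.join(format!("nex-handler-stderr-{chat_id}.log"))
}

/// Open the log with create+append so earlier failures are preserved
/// across respawns.
fn open_stderr_log(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Point the child's stderr at the file instead of a pipe. The child writes
/// to the file directly, so its output survives even if no reader thread
/// is ever started.
fn attach_stderr_log(cmd: &mut Command, log: File) {
    cmd.stderr(Stdio::from(log));
}
```

Because the file descriptor is handed to the child before `spawn()`, anything the child prints during early startup lands in the log without depending on the supervisor's reader threads.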
Why all four together
- A alone: the chat task stays InProgress, but no supervisor exists until the user sends a first message (B fixes that).
- B alone: supervisor starts but reconciliation still flips status to Open mid-startup; nex talks to wrong endpoint (A fixes that, C fixes the endpoint bug).
- C alone: endpoint plumbed correctly but the task still gets reset to Open and the supervisor still doesn't fire eagerly.
- D alone: better observability but the underlying bug (A+B+C) still causes silent death.
A+B+C+D is the durable fix. D is the safety net — even when something else breaks in the future, you'll see why.
Validation
- Failing tests written first
- Unit test: chat-loop tag is excluded from orphan sweep (Fix A)
- Unit test: CreateChat IPC enqueues pending_coordinator_ids and urgent_wake fires (Fix B)
- Unit test: plan.rs reads task.endpoint when set (Fix C); dry-run spawn-task .chat-N emits the -e URL flag
- Live smoke (the user's reported scenario): in TUI, create a new chat with executor=native, model=qwen3-coder, endpoint=https://lambda01.tail334fe6.ts.net:30000. ASSERT:
  - chat task stays InProgress (not flipped to Open by sweep)
  - CoordinatorAgent for the new chat starts within ~100ms of CreateChat
  - The spawned 'wg nex' invocation includes '-e https://lambda01.tail334fe6.ts.net:30000'
  - On a working endpoint: chat actually responds to messages
  - On a broken endpoint: stderr file at ~/.wg/service/nex-handler-stderr-<chat_id>.log contains the actual error
- No regression: claude / codex chats still work as before
- cargo build + cargo test pass (3294+ tests)
- Permanent smoke scenario added covering the chat-spawn happy path AND a bad-endpoint failure path with stderr-file inspection
- cargo install --path . was run before claiming done
Process note
This is a multi-fix task. Apply all four. The diagnose did the design work — the implementer just executes against the file:line spec. If any one fix turns out to be wrong/incomplete, file a follow-up rather than punting all four.
Depends on
Required by
- (none)
Log
- 2026-05-01T12:57:18.968900532+00:00 Task paused
- 2026-05-01T12:57:23.678220435+00:00 Task published
- 2026-05-01T12:57:51.658969081+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=task, reason=Correctness-critical multi-file bug fix (4 coordinated patches with precise specs) requires Careful Programmer role; high score (0.79) and 466 completed tasks demonstrate relevant experience.
- 2026-05-01T12:57:53.984407182+00:00 Spawned by coordinator --executor claude --model opus
- 2026-05-01T12:58:02.923471713+00:00 Starting work — reading diagnose-wg-nex forensic trace and the 4 file:line targets
- 2026-05-01T13:10:19.831987879+00:00 Fixes A/B/C/D applied. Unit tests passing for A/B/C. Running full test suite next.
- 2026-05-01T13:37:52.804238107+00:00 All four fixes (A/B/C/D) applied + tested. Live smoke confirms: (1) wg sweep --dry-run does NOT flag chat-loop tasks (Fix A); (2) wg spawn-task --dry-run .chat-0 emits 'wg nex --chat .chat-0 -m nex:qwen3-coder -e https://lambda01.tail334fe6.ts.net:30000' (Fix C); (3) provenance log line records 'task.endpoint (inline URL: ...)' as source. New permanent smoke scenario chat_native_endpoint_full_pipeline added with fix-nex-chat in owners list.
- 2026-05-01T13:40:21.597420040+00:00 Committed: 5092cf22c — pushed to remote
- 2026-05-01T13:40:32.351056536+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-05-01T13:43:11.964876868+00:00 PendingEval → Done (evaluator passed; downstream unblocks)