diagnose-nex-chat

Diagnose: nex chat IPC-spawn fails silently while same-args CLI succeeds — capture the actual divergence

Metadata

Status: done
Assigned: agent-2073
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-05-03T21:26:00.966231126+00:00
Started: 2026-05-03T21:27:06.329595665+00:00
Completed: 2026-05-03T21:40:38.134108078+00:00
Tags: priority-critical, research, bug, nex, chat, eval-scheduled
Eval score: 0.78
└ blocking impact: 0.78
└ completeness: 0.75
└ constraint fidelity: 0.85
└ coordination overhead: 0.72
└ correctness: 0.85
└ downstream usability: 0.72
└ efficiency: 0.78
└ intent fidelity: 0.63
└ style adherence: 0.88

Description

Although integrate-nex-chat-end-to-end (commit 73041f533) was supposed to fix nex chat in the TUI, the user STILL hits a silent failure when launching nex chat from the new-chat dialog. The exact same args via the direct CLI work perfectly.

User direct demonstration 2026-05-03:

$ wg nex -m qwen3-coder-30b -e https://lambda01.tail334fe6.ts.net:30000
wg nex — interactive session with qwen3-coder-30b
> hi
Hello! How can I help you today?

vs IPC-spawned .chat-35 (same model, same endpoint):

  • Daemon log: 'Coordinator-35: nex subprocess running (pid 3200587)'
  • Per-chat stderr file (.wg/service/nex-handler-stderr-35.log):
    [spawn_task] .chat-35: SpawnPlan executor=native (from agency.effective_executor), model=qwen3-coder-30b, endpoint=https://lambda01.tail334fe6.ts.net:30000
    wg nex — interactive session with qwen3-coder-30b
    [end of file]
    
  • User typed 'hi sup' — never got a reply.

CRITICAL CONSTRAINT — diagnose with EVIDENCE only

The chat agent (me) made an unfounded claim earlier ('the model name probably doesn't exist on the endpoint') without actually checking. The user correctly called this out: 'the name is a dummy variable. you have no proof of what you're saying btw.'

This task MUST capture the actual divergence empirically. NO speculation.

Investigation steps

1. Capture the exact CLI invocation

  • strace -f -e execve -o /tmp/cli-execve.log wg nex -m qwen3-coder-30b -e https://lambda01.tail334fe6.ts.net:30000
  • Drive a 'hi' message manually
  • Capture the full execve chain: argv, env vars, working dir, file descriptor inheritance
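
Two caveats on the capture above. Without -v, strace abbreviates the envp argument of execve to a count of variables, and without a larger -s it truncates long strings, so capturing env vars this way needs something like -v -s 4096. An execve trace also does not show the working directory or which descriptors the child ends up holding. On Linux those can be read from /proc while the process is alive. A minimal sketch, assuming a Linux /proc filesystem; snapshot_proc and the output directory names are hypothetical, not part of wg:

```shell
# snapshot_proc PID OUTDIR
# Writes argv, environment, cwd and open fds of a live process to OUTDIR.
# Assumes Linux /proc and that we run as the same user as PID (or root).
snapshot_proc() {
    pid=$1
    out=$2
    mkdir -p "$out" || return 1
    # cmdline and environ are NUL-separated; one entry per line is easier to diff.
    # (A value that itself contains a newline will span several lines.)
    tr '\0' '\n' < "/proc/$pid/cmdline" > "$out/argv" || return 1
    tr '\0' '\n' < "/proc/$pid/environ" | sort > "$out/env" || return 1
    readlink "/proc/$pid/cwd" > "$out/cwd" || return 1
    # One line per descriptor: "<fd number> <symlink target>", e.g. "0 /dev/pts/3".
    for fd in "/proc/$pid/fd"/*; do
        printf '%s %s\n' "${fd##*/}" "$(readlink "$fd")"
    done | sort -n > "$out/fds"
}
```

For the CLI run this could be called as snapshot_proc <pid of the running wg nex> /tmp/cli-snap. For the IPC-spawned child, the pid is the one the daemon logs (3200587 in the excerpt above). Note that /proc/<pid>/environ reflects the environment the process was started with, which is what matters for this comparison.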

2. Capture the exact IPC-spawn invocation

  • Trigger an IPC-spawn (open TUI, create nex chat with same model+endpoint)
  • strace -f -e execve -o /tmp/ipc-execve.log -p $(pgrep -f 'wg service') BEFORE creating the chat (so it captures the spawn)
  • OR: instrument coordinator_agent.rs:830ish (where the spawn happens) to emit the full Command::new() invocation to a log
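
One practical hazard with the attach command above: pgrep -f matches against the full command line, so it can return zero or several pids, and an unquoted $(...) would then hand strace something other than a single pid. A small guard makes that failure explicit. This is a sketch; only_pid is a hypothetical helper name:

```shell
# only_pid: read whitespace-separated pids on stdin and print the pid
# only if exactly one was given; otherwise complain on stderr and fail.
only_pid() {
    pids=$(cat)
    # Word-split into the function's own positional parameters to count them.
    set -- $pids
    if [ "$#" -ne 1 ]; then
        echo "expected exactly one pid, got $#: $pids" >&2
        return 1
    fi
    printf '%s\n' "$1"
}

# Usage (not executed here):
#   pid=$(pgrep -f 'wg service' | only_pid) &&
#     strace -f -v -s 4096 -e trace=execve -o /tmp/ipc-execve.log -p "$pid"
```

If the guard fails, narrowing the pgrep pattern or taking the daemon pid from its own log is safer than guessing which match is the right one.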

3. Diff the two captures

  • argv differences (is there a --chat or --resume flag IPC adds that CLI doesn't?)
  • env var differences
  • cwd
  • tmux wrapping vs direct
  • stdio redirection (PTY allocation, controlling-tty status)
  • File descriptor inheritance
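
For the env var comparison, ordering is noise, so it helps to sort both sides before diffing. A minimal sketch, assuming each input file holds one NAME=value per line (for example produced with tr '\0' '\n' < /proc/<pid>/environ); env_diff is a hypothetical helper name:

```shell
# env_diff CLI_ENV IPC_ENV
# Prints "< line" for entries only in the CLI capture and "> line" for
# entries only in the IPC capture. A variable whose value changed shows
# up once on each side.
env_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # comm needs both inputs sorted in the same collation order.
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b" | sed 's/^/< /'
    LC_ALL=C comm -13 "$a" "$b" | sed 's/^/> /'
    rm -f "$a" "$b"
}
```

For argv the order is significant, so a plain diff on the unsorted one-argument-per-line files is the better tool there.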

4. Test specific hypotheses with EVIDENCE

For each hypothesis, capture proof BEFORE asserting it:

  • 'tmux wrapping interferes' → diff tmux-wrapped vs direct invocation in a controlled test, confirm with byte-level capture
  • 'stdin not connected to a TTY' → check isatty(0) on the spawned process; if false, that's a strong candidate (the nex CLI may need a TTY for the agentic loop), but confirm it with a controlled test before asserting it
  • '--resume flag passed when no resume exists' → check argv for --resume; remove if present and observe
  • 'env strip removes something' → diff env vars line-by-line
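
isatty(0) is evaluated inside the process, so from the outside the closest observable is what fd 0 of the spawned pid points at. On Linux, readlink /proc/<pid>/fd/0 gives that target. The sketch below classifies such a target string; it is a heuristic based on the path, not a substitute for the actual isatty call, and fd_kind is a hypothetical helper name:

```shell
# fd_kind TARGET
# Classifies a /proc/<pid>/fd/<n> symlink target by its path.
# Heuristic only: "tty" means the path looks like a terminal device.
fd_kind() {
    case $1 in
        /dev/pts/*|/dev/tty*|/dev/console) echo tty ;;
        /dev/null)                         echo null ;;
        pipe:*)                            echo pipe ;;
        socket:*)                          echo socket ;;
        *)                                 echo other ;;
    esac
}

# Usage (not executed here), comparing the CLI pid and the IPC-spawned pid:
#   fd_kind "$(readlink /proc/<pid>/fd/0)"
```

If the CLI process reports tty and the IPC-spawned one reports pipe or null, that is a recorded difference, not yet a root cause. The follow-up test is to run the CLI with stdin redirected the same way and observe whether it also goes silent.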

5. Read agent-1848's resolve_handler fix

Verify it's actually deployed in the user's binary (compare the mtime of the cargo-installed binary against the timestamp of commit 73041f533). If not, that's the answer: the user just needs to rebuild. If yes, the bug is somewhere else and we need a fresh diagnosis.
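
The timestamp comparison only proves one direction: a binary older than the commit cannot contain the fix, while a newer binary may still have been built from a tree without it. A sketch that keeps that asymmetry explicit; binary_vs_commit is a hypothetical helper, and the usage lines assume GNU stat and a checkout of the repository:

```shell
# binary_vs_commit BIN_MTIME COMMIT_TIME   (both Unix epoch seconds)
# Prints "stale" if the binary predates the commit, otherwise
# "inconclusive": a newer mtime does not prove the commit was built in.
binary_vs_commit() {
    if [ "$1" -lt "$2" ]; then
        echo stale
    else
        echo inconclusive
    fi
}

# Usage (not executed here):
#   bin_mtime=$(stat -c %Y "$(command -v wg)")         # GNU stat, mtime
#   commit_time=$(git log -1 --format=%ct 73041f533)   # committer date
#   binary_vs_commit "$bin_mtime" "$commit_time"
```

An "inconclusive" result needs a second piece of evidence, such as rebuilding from a tree known to contain the commit and re-running the reproduction.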

Deliverable

wg log entry with:

  • /tmp/cli-execve.log and /tmp/ipc-execve.log contents (or relevant excerpts)
  • LITERAL diff between the two invocations
  • Identified root cause with file:line citation in the IPC-spawn path
  • Concrete fix proposal
  • 'Working hypothesis' is fine but every claim must cite evidence

Validation

  • Both invocations captured (CLI and IPC-spawn)
  • Literal diff documented
  • Hypothesis tested with evidence (not speculation)
  • Root cause cited with file:line
  • Fix proposal concrete enough for a follow-up implementation task
  • No source / doc modifications — diagnose only

Process note

This is the THIRD diagnose-then-fix cycle on nex-chat-IPC-spawn (after diagnose-wg-nex/fix-nex-chat and design-nex-chat/integrate-nex-chat-end-to-end). The pattern of 'diagnose finds something, fix lands, user still hits failure' suggests each diagnosis has been too narrow. This time: capture the FULL divergence between working CLI and broken IPC, not just one hypothesis.

Depends on

Required by

Log