design-nex-chat

Design: nex chat in TUI, end-to-end — autopoietic fan-out research + impl + simulated-human smoke

Metadata

Status: done
Assigned: agent-1760
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-05-02T23:44:12.456231152+00:00
Started: 2026-05-02T23:45:37.484007288+00:00
Completed: 2026-05-02T23:59:53.167405314+00:00
Tags: priority-critical, design, autopoietic, nex, chat, tui, smoke, eval-scheduled
Eval score: 0.85
  └ blocking impact: 0.94
  └ completeness: 0.90
  └ constraint fidelity: 0.85
  └ coordination overhead: 0.92
  └ correctness: 0.88
  └ downstream usability: 0.93
  └ efficiency: 0.80
  └ intent fidelity: 0.89
  └ style adherence: 0.85

Description

fix-nex-chat (commit bd118e503) shipped 4 targeted fixes from diagnose-wg-nex, but nex chat in the TUI STILL crashes every time. The user has been unable to get a working nex chat session despite the endpoint working perfectly via direct CLI (wg nex -m qwen3-coder -e https://lambda01.tail334fe6.ts.net:30000 confirmed working repeatedly).

Targeted patches haven't held. Time for a comprehensive fan-out: research the FULL scope of changes needed, then implement, then verify with a simulated-human integration test that the canonical user flow (open TUI → create nex chat → message → reply → resume) works end-to-end.

User direct guidance 2026-05-02

'A system needs a simulated human using the TUI. ... It needs to click. It needs to see. It needs to read text on the screen. It's gonna be annoying. It's slow, but it's got to do it. It's got to do a proper smoke test that fires up the NEX. ... Run an autopoietic task. It's going to fan out. It's going to expand across the space of operations we have to do. And we do research into the full scope of this sort of change that has to happen based on initial design document. That's the fan out. ... Each one of those leads to an implementation component. There should be an integration implementation component and then an actual implementation of it. That could be a single thread if it's easiest or it could be many if there are many things to edit. And if it's many, there should be another fan in after that.'

This task is AUTOPOIETIC

The deliverable is BOTH:

  1. A design doc (posted via wg log)
  2. A subgraph of follow-up tasks (filed via wg add --paused --no-place) that the design itself determines

The subgraph shape:

  • Fan-out research: 3-7 parallel research tasks, one per area of the system that needs change
  • Implementation: per-research-area impl tasks (a single thread if the scope is small, parallel if there are many things to edit)
  • Integration impl: single task that ties everything together
  • Fan-in synthesis: final task that runs the simulated-human smoke and confirms the full flow works

After this task completes (with the subgraph filed paused), the chat agent (next turn) calls wg publish <root> --wcc to release the whole subgraph at once — single command, whole subgraph runs.
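As a hedged sketch of the filing shape (task names and tags here are hypothetical; the flags are the ones this task already specifies: wg add --paused --no-place --tag --after, plus the --model pin from the user modulation, and wg publish --wcc), run as a dry-run:

```shell
# Dry-run sketch of filing the fan-out/fan-in subgraph.
# WG is set to echo so nothing is actually filed; swap for the real binary.
WG="echo wg"

# Fan-out research (parallel: no --after, so all release together)
$WG add 'R1: re-diagnose post fix-nex-chat crash' --paused --no-place --tag research,nex
$WG add 'R2: config/startup/caching/resumption audit' --paused --no-place --tag research,nex
$WG add 'R3: simulated-human smoke architecture' --paused --no-place --tag research,tui

# Per-area implementation, each gated on its research task
# (pinned to codex:gpt-5.5 per the user modulation in message #1)
$WG add 'I1: fix residual crash' --paused --no-place --model codex:gpt-5.5 --tag impl,nex --after R1

# Integration impl ties the impl strands together
$WG add 'INT: integrate fixes' --paused --no-place --model codex:gpt-5.5 --tag impl,nex --after I1

# Fan-in synthesis: the simulated-human smoke runs last
$WG add 'SYN: simulated-human smoke run' --paused --no-place --tag smoke,tui --after INT

# Next turn, the chat agent releases the whole subgraph with one command:
$WG publish R1 --wcc
```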

Investigation areas the design must cover

Don't treat these as exhaustive — the design picks the right areas based on what it finds. But these are starting points:

Why nex still crashes post fix-nex-chat

  • Re-run diagnose-wg-nex's repro: open TUI, create nex chat, observe crash
  • Identify what fix-nex-chat actually changed vs what it missed
  • Check ALL FOUR fix-nex-chat targets (sweep exclusion, eager supervisor, endpoint plumbing, stderr parity) — which actually shipped correctly? Are there REMAINING bugs?
  • Inspect daemon log + per-chat stderr file (now that fix-nex-chat added it, supposedly) for the actual error

Configuration / startup / caching / resumption

User's specific list: 'configuration, startup of the chat, and caching and resumption of the chat themselves. All those things are messed up.'

  • How does a nex chat read its config (model, endpoint, API key) from the chat task's metadata?
  • What's the cache state for a freshly-spawned chat (chat history file, vendor session log, etc.)?
  • What's the resume semantics when the user reconnects to an existing chat?

Simulated-human smoke test architecture

This is the META infrastructure that will make EVERY future TUI bug catchable:

  • Approach options to evaluate: expect / pexpect (line-based), pyte (terminal emulator parsing), asciinema (recording-based), tmux + script + grep (cheap), dedicated Rust harness with crossterm-replay
  • Must support: programmatic keystroke injection, programmatic click injection (mouse events), screen capture, text assertions, timing/wait semantics
  • MUST be runnable in CI (not just locally)
  • Smoke scenario shape: 'open wg tui in a tmpdir; press + to open new-chat dialog; type model+endpoint; press Enter; assert chat tab appears; type 'hello'; assert response text appears within N seconds; close TUI; assert chat persisted'
  • This is a meaningful piece of work — but it's the only way to make sure the fix actually works AND stays working
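As one hedged sketch of the cheapest direction (stdlib-only, no pexpect dependency; the TUI's actual keybindings and screen text are not assumed here), a minimal pty-based harness can already provide keystroke injection, text assertions, and timing/wait semantics:

```python
import os
import pty
import select
import time

class TuiSmoke:
    """Minimal pty harness: spawn a program on a pseudo-terminal,
    inject keystrokes, and wait for expected text in its output."""

    def __init__(self, argv):
        self.pid, self.fd = pty.fork()
        if self.pid == 0:            # child: become the program under test
            os.execvp(argv[0], argv)
        self.buf = b""               # parent: accumulated screen output

    def send(self, text):
        """Inject keystrokes (include '\\n' for Enter)."""
        os.write(self.fd, text.encode())

    def expect(self, needle, timeout=5.0):
        """Poll until `needle` appears in the output, or time out."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if needle.encode() in self.buf:
                return True
            ready, _, _ = select.select([self.fd], [], [], 0.1)
            if ready:
                try:
                    self.buf += os.read(self.fd, 4096)
                except OSError:      # child exited and the pty closed
                    break
        return needle.encode() in self.buf

    def close(self):
        try:
            os.close(self.fd)
        except OSError:
            pass
        os.waitpid(self.pid, 0)
```

A real smoke run would spawn the wg TUI in a tmpdir and map the scenario steps above onto send/expect calls. Click injection is the gap this sketch leaves open: it would require writing xterm mouse escape sequences to the pty, which is where pyte or a dedicated harness may earn its keep.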

Composition with related in-flight work

  • revert-redo-fix landed (last_interaction_at primitive + sort)
  • fix-tui-perf-2 in flight (6 perf fixes, narrow scope)
  • fix-agents-md in flight (chat agent contract clarity)
  • redesign-new-chat (the simpler new-chat dialog)
  • design-chat-agent + implement-tmux-wrapped (chat persistence via tmux)

The nex-end-to-end fix must compose with all of these, not regress them.

Deliverable

1. Design doc via wg log

  • Section: re-diagnosis of post fix-nex-chat residual bugs (with file:line evidence from daemon log inspection + worktree review)
  • Section: simulated-human smoke architecture choice + rationale
  • Section: subgraph shape (specific task list with names + dependencies + rationale for parallel-vs-serial)
  • Section: composition concerns with in-flight work

2. Filed subgraph (paused, not yet running)

For each task in the planned subgraph: wg add --paused --no-place --tag <appropriate> -d "<full description>" --after <upstream> so dependencies are wired correctly.

The chat agent (next turn) will call wg publish <subgraph-root> --wcc once it is available, or wg publish <each> per task in the meantime, to release the subgraph.

3. NO source modifications in this task

This is design + subgraph creation. Implementation happens in the FILED subgraph, not here.

Validation

  • Re-diagnosis posted with evidence (specific failure mode of current nex chat in TUI, beyond what fix-nex-chat addressed)
  • Simulated-human smoke architecture chosen with rationale
  • Subgraph filed: all tasks present, paused, dependencies wired, ready for batch-publish
  • Subgraph composition with in-flight work documented (no merge collisions, no scope overlaps)
  • Design doc uses wg log (this task's log) for the deliverable
  • No source / doc modifications outside filing tasks via wg add
  • Task ends with explicit instruction to chat agent: 'subgraph published in --paused; release with wg publish <root> --wcc (or per-task)'

Process note

Per the user's repeated framing this should feel autopoietic — workgraph building workgraph. The design IS the bootstrap that creates the work. Each filed sub-task is a precise piece of investigation or implementation; together they constitute the actual fix and the smoke test that verifies it.

This pattern (design that auto-files subgraph) is worth extracting as a reusable wg func once it lands successfully. Same shape applies to other 'targeted patches haven't held, need to go broader' situations.

Depends on

Required by

Messages: 3 (3 unread)

  1. #1 · user · 2026-05-02T23:53:33.054032698+00:00 · read
    USER MODULATION 2026-05-02: when you file the implementation sub-tasks in the design's subgraph, pin them to **codex:gpt-5.5** as the worker model.
    
    Specifically:
    - Research / design / audit tasks in the subgraph: leave model unspecified (quality-pass picks per agency stats — likely opus or sonnet)
    - IMPLEMENTATION tasks (the I-tasks that write code): pass `--model codex:gpt-5.5` on the `wg add` invocation
    - Integration impl: same — `--model codex:gpt-5.5`
    - Fan-in synthesis: leave unspecified (research-flavored)
    
    Concrete syntax for filing:
      wg add 'Implement X' --paused --no-place --model codex:gpt-5.5 --tag fix,nex,chat ...
    
    Reasoning: the user wants codex:gpt-5.5 to handle the implementation work for this batch — partly to stress-test codex on substantial work, partly because they're standardizing toward codex for worker-tier tasks. The quality-pass pattern's data-driven recommendation will catch up over time as codex accumulates eval history; this is the manual override for now.
    
    Also: if your design recommends a single-thread vs many-impls split, that's fine — same codex:gpt-5.5 pin applies whichever shape you pick.
    
    This modulation does NOT change anything else about the design. Just the model assignment on impl tasks.
  2. #2 · user · 2026-05-02T23:54:06.208426650+00:00 · read
    USER ADDITIONAL MODULATION 2026-05-02:
    
    CONFIRMING + EXTENDING the codex:gpt-5.5 pin:
    
    (1) 'For the implementation agent specifically' — yes, ONLY impl tasks. Research stays on opus ('Opus is really good for research').
    
    (2) NEW capability the design should consider: **cross-model peer review on implementation work**.
    
    User direct quote: 'we could do things in parallel too, like have agents check each other's work and be different models. That kind of stuff would be very helpful.'
    
    Possible structures:
    
    ### A. Sequential cross-model review (cheap, one extra call per impl)
    - impl-X runs on codex:gpt-5.5
    - review-X runs on claude:opus AFTER impl-X — reads the diff, the test results, the eval/FLIP output. Independent verdict ('this looks correct' OR 'concerns: ...').
    - Disagreement between codex impl + claude review = surface to human OR trigger additional iteration
    
    ### B. Parallel cross-model implementation (expensive, redundant work, strong signal)
    - impl-X-codex runs on codex:gpt-5.5
    - impl-X-claude runs on claude:opus IN PARALLEL on the same task
    - Compare outputs (diff, behavior, test results). If they agree, ship one. If they diverge, the divergence itself is signal — surface for review.
    
    ### C. Hybrid
    - Impl on codex:gpt-5.5 (single thread, no waste)
    - After impl + eval, opus reviews. If opus disagrees with eval verdict, flag for human.
    
    Recommend **C** for this batch. Cheaper than B, stronger signal than A alone (review checks the eval, not just the impl).
    
    For the design's subgraph: ADD a 'peer-review' task type after each impl that's pinned to claude:opus. The peer-reviewer reads:
    - The impl's diff (git diff main..HEAD on the impl agent's worktree branch)
    - The impl's smoke test results
    - The eval/FLIP scores
    - And produces a verdict: 'concur' OR 'concerns + cite file:line'.
    
    If 'concerns', the chat agent (next user prompt cycle) decides whether to act on them.
    
    This is a NEW task type the design introduces. Whether to ship it as a hardcoded pattern or extract it as a function later (similar to the doc-sync function pattern) can be decided after the design lands successfully.
    
    Net: design should produce a subgraph where each implementation task has a cross-model peer-review task wired downstream of it. Both pinned to specific models per the modulation above.
  3. #3 · design-nex-chat · 2026-05-02T23:56:36.546107510+00:00 · read
    Acknowledged both modulations. Plan: (1) recreate impl tasks (I1-I4 + INT) pinned to codex:gpt-5.5; (2) add 5 peer-review tasks on claude:opus, one per impl + one for INT, each --after its target — they read diff + smoke results + eval/FLIP and emit concur/concerns. Wiring: each peer-review depends on its impl; integrate gates on impls AND peer-reviews so the SYN smoke runs only after both verdicts agree. Pattern C per your guidance. Filing now.

Log