design-nex-chat

Design: nex chat in TUI, end-to-end — autopoietic fan-out research + impl + simulated-human smoke

Metadata

Status: done
Assigned: agent-1760
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-05-02T23:44:12.456231152+00:00
Started: 2026-05-02T23:45:37.484007288+00:00
Completed: 2026-05-02T23:59:53.167405314+00:00
Tags: priority-critical, design, autopoietic, nex, chat, tui, smoke, eval-scheduled
Eval score: 0.85
  └ blocking impact: 0.94
  └ completeness: 0.90
  └ constraint fidelity: 0.85
  └ coordination overhead: 0.92
  └ correctness: 0.88
  └ downstream usability: 0.93
  └ efficiency: 0.80
  └ intent fidelity: 0.89
  └ style adherence: 0.85

Description

fix-nex-chat (commit bd118e503) shipped 4 targeted fixes from diagnose-wg-nex, but nex chat in the TUI STILL crashes every time. The user has been unable to get a working nex chat session despite the endpoint working perfectly via direct CLI (wg nex -m qwen3-coder -e https://lambda01.tail334fe6.ts.net:30000 confirmed working repeatedly).

Targeted patches haven't held. Time for a comprehensive fan-out: research the FULL scope of changes needed, then implement, then verify with a simulated-human integration test that the canonical user flow (open TUI → create nex chat → message → reply → resume) works end-to-end.

User direct guidance 2026-05-02

'A system needs a simulated human using the TUI. ... It needs to click. It needs to see. It needs to read text on the screen. It's gonna be annoying. It's slow, but it's got to do it. It's got to do a proper smoke test that fires up the NEX. ... Run an autopoietic task. It's going to fan out. It's going to expand across the space of operations we have to do. And we do research into the full scope of this sort of change that has to happen based on initial design document. That's the fan out. ... Each one of those leads to an implementation component. There should be an integration implementation component and then an actual implementation of it. That could be a single thread if it's easiest or it could be many if there are many things to edit. And if it's many, there should be another fan in after that.'

This task is AUTOPOIETIC

The deliverable is BOTH:

  1. A design doc (posted via wg log)
  2. A subgraph of follow-up tasks (filed via wg add --paused --no-place) that the design itself determines

The subgraph shape:

  • Fan-out research: 3-7 parallel research tasks, one per area of the system that needs change
  • Implementation: per-research-area impl tasks (a single thread if the scope is small, parallel if there are many things to edit)
  • Integration impl: single task that ties everything together
  • Fan-in synthesis: final task that runs the simulated-human smoke and confirms the full flow works

After this task completes (with the subgraph filed paused), the chat agent (next turn) calls wg publish <root> --wcc to release the whole subgraph at once — single command, whole subgraph runs.
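As a hedged sketch of the filing shape (task names and tags here are hypothetical; the flags are the ones this task already specifies: wg add --paused --no-place --tag --after, plus the --model pin from the user modulation, and wg publish --wcc), run as a dry-run:

```shell
# Dry-run sketch of filing the fan-out/fan-in subgraph.
# WG is set to echo so nothing is actually filed; swap for the real binary.
WG="echo wg"

# Fan-out research (parallel: no --after, so all release together)
$WG add 'R1: re-diagnose post fix-nex-chat crash' --paused --no-place --tag research,nex
$WG add 'R2: config/startup/caching/resumption audit' --paused --no-place --tag research,nex
$WG add 'R3: simulated-human smoke architecture' --paused --no-place --tag research,tui

# Per-area implementation, each gated on its research task
# (pinned to codex:gpt-5.5 per the user modulation in message #1)
$WG add 'I1: fix residual crash' --paused --no-place --model codex:gpt-5.5 --tag impl,nex --after R1

# Integration impl ties the impl strands together
$WG add 'INT: integrate fixes' --paused --no-place --model codex:gpt-5.5 --tag impl,nex --after I1

# Fan-in synthesis: the simulated-human smoke runs last
$WG add 'SYN: simulated-human smoke run' --paused --no-place --tag smoke,tui --after INT

# Next turn, the chat agent releases the whole subgraph with one command:
$WG publish R1 --wcc
```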

Investigation areas the design must cover

Don't treat these as exhaustive — the design picks the right areas based on what it finds. But these are starting points:

Why nex still crashes post fix-nex-chat

  • Re-run diagnose-wg-nex's repro: open TUI, create nex chat, observe crash
  • Identify what fix-nex-chat actually changed vs what it missed
  • Check ALL FOUR fix-nex-chat targets (sweep exclusion, eager supervisor, endpoint plumbing, stderr parity) — which actually shipped correctly? Are there REMAINING bugs?
  • Inspect daemon log + per-chat stderr file (now that fix-nex-chat added it, supposedly) for the actual error

Configuration / startup / caching / resumption

User's specific list: 'configuration, startup of the chat, and caching and resumption of the chat themselves. All those things are messed up.'

  • How does a nex chat read its config (model, endpoint, API key) from the chat task's metadata?
  • What's the cache state for a freshly-spawned chat (chat history file, vendor session log, etc.)?
  • What's the resume semantics when the user reconnects to an existing chat?

Simulated-human smoke test architecture

This is the META infrastructure that will make EVERY future TUI bug catchable:

  • Approach options to evaluate: expect / pexpect (line-based), pyte (terminal emulator parsing), asciinema (recording-based), tmux + script + grep (cheap), dedicated Rust harness with crossterm-replay
  • Must support: programmatic keystroke injection, programmatic click injection (mouse events), screen capture, text assertions, timing/wait semantics
  • MUST be runnable in CI (not just locally)
  • Smoke scenario shape: 'open wg tui in a tmpdir; press + to open new-chat dialog; type model+endpoint; press Enter; assert chat tab appears; type 'hello'; assert response text appears within N seconds; close TUI; assert chat persisted'
  • This is a meaningful piece of work — but it's the only way to make sure the fix actually works AND stays working
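As one hedged sketch of the cheapest direction (stdlib-only, no pexpect dependency; the TUI's actual keybindings and screen text are not assumed here), a minimal pty-based harness can already provide keystroke injection, text assertions, and timing/wait semantics:

```python
import os
import pty
import select
import time

class TuiSmoke:
    """Minimal pty harness: spawn a program on a pseudo-terminal,
    inject keystrokes, and wait for expected text in its output."""

    def __init__(self, argv):
        self.pid, self.fd = pty.fork()
        if self.pid == 0:            # child: become the program under test
            os.execvp(argv[0], argv)
        self.buf = b""               # parent: accumulated screen output

    def send(self, text):
        """Inject keystrokes (include '\\n' for Enter)."""
        os.write(self.fd, text.encode())

    def expect(self, needle, timeout=5.0):
        """Poll until `needle` appears in the output, or time out."""
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if needle.encode() in self.buf:
                return True
            ready, _, _ = select.select([self.fd], [], [], 0.1)
            if ready:
                try:
                    self.buf += os.read(self.fd, 4096)
                except OSError:      # child exited and the pty closed
                    break
        return needle.encode() in self.buf

    def close(self):
        try:
            os.close(self.fd)
        except OSError:
            pass
        os.waitpid(self.pid, 0)
```

A real smoke run would spawn the wg TUI in a tmpdir and map the scenario steps above onto send/expect calls. Click injection is the gap this sketch leaves open: it would require writing xterm mouse escape sequences to the pty, which is where pyte or a dedicated harness may earn its keep.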

Composition with related in-flight work

  • revert-redo-fix landed (last_interaction_at primitive + sort)
  • fix-tui-perf-2 in flight (6 perf fixes, narrow scope)
  • fix-agents-md in flight (chat agent contract clarity)
  • redesign-new-chat (the simpler new-chat dialog)
  • design-chat-agent + implement-tmux-wrapped (chat persistence via tmux)

The nex-end-to-end fix must compose with all of these, not regress them.

Deliverable

1. Design doc via wg log

  • Section: re-diagnosis of post fix-nex-chat residual bugs (with file:line evidence from daemon log inspection + worktree review)
  • Section: simulated-human smoke architecture choice + rationale
  • Section: subgraph shape (specific task list with names + dependencies + rationale for parallel-vs-serial)
  • Section: composition concerns with in-flight work

2. Filed subgraph (paused, not yet running)

For each task in the planned subgraph: wg add --paused --no-place --tag <appropriate> -d "<full description>" --after <upstream> so dependencies are wired correctly.

The chat agent (next turn) will call wg publish <subgraph-root> --wcc once it is available, or wg publish <each> per task in the meantime, to release the subgraph.

3. NO source modifications in this task

This is design + subgraph creation. Implementation happens in the FILED subgraph, not here.

Validation

  • Re-diagnosis posted with evidence (specific failure mode of current nex chat in TUI, beyond what fix-nex-chat addressed)
  • Simulated-human smoke architecture chosen with rationale
  • Subgraph filed: all tasks present, paused, dependencies wired, ready for batch-publish
  • Subgraph composition with in-flight work documented (no merge collisions, no scope overlaps)
  • Design doc uses wg log (this task's log) for the deliverable
  • No source / doc modifications outside filing tasks via wg add
  • Task ends with explicit instruction to chat agent: 'subgraph published in --paused; release with wg publish <root> --wcc (or per-task)'

Process note

Per the user's repeated framing this should feel autopoietic — workgraph building workgraph. The design IS the bootstrap that creates the work. Each filed sub-task is a precise piece of investigation or implementation; together they constitute the actual fix and the smoke test that verifies it.

This pattern (design that auto-files subgraph) is worth extracting as a reusable wg func once it lands successfully. Same shape applies to other 'targeted patches haven't held, need to go broader' situations.

Depends on

Required by

Messages: 3 (3 unread)

  1. #1 · user · 2026-05-02T23:53:33.054032698+00:00 · read
    USER MODULATION 2026-05-02: when you file the implementation sub-tasks in the design's subgraph, pin them to **codex:gpt-5.5** as the worker model.
    
    Specifically:
    - Research / design / audit tasks in the subgraph: leave model unspecified (quality-pass picks per agency stats — likely opus or sonnet)
    - IMPLEMENTATION tasks (the I-tasks that write code): pass `--model codex:gpt-5.5` on the `wg add` invocation
    - Integration impl: same — `--model codex:gpt-5.5`
    - Fan-in synthesis: leave unspecified (research-flavored)
    
    Concrete syntax for filing:
      wg add 'Implement X' --paused --no-place --model codex:gpt-5.5 --tag fix,nex,chat ...
    
    Reasoning: the user wants codex:gpt-5.5 to handle the implementation work for this batch — partly to stress-test codex on substantial work, partly because they're standardizing toward codex for worker-tier tasks. The quality-pass pattern's data-driven recommendation will catch up over time as codex accumulates eval history; this is the manual override for now.
    
    Also: if your design recommends a single-thread vs many-impls split, that's fine — same codex:gpt-5.5 pin applies whichever shape you pick.
    
    This modulation does NOT change anything else about the design. Just the model assignment on impl tasks.
  2. #2 · user · 2026-05-02T23:54:06.208426650+00:00 · read
    USER ADDITIONAL MODULATION 2026-05-02:
    
    CONFIRMING + EXTENDING the codex:gpt-5.5 pin:
    
    (1) 'For the implementation agent specifically' — yes, ONLY impl tasks. Research stays on opus ('Opus is really good for research').
    
    (2) NEW capability the design should consider: **cross-model peer review on implementation work**.
    
    User direct quote: 'we could do things in parallel too, like have agents check each other's work and be different models. That kind of stuff would be very helpful.'
    
    Possible structures:
    
    ### A. Sequential cross-model review (cheap, one extra call per impl)
    - impl-X runs on codex:gpt-5.5
    - review-X runs on claude:opus AFTER impl-X — reads the diff, the test results, the eval/FLIP output. Independent verdict ('this looks correct' OR 'concerns: ...').
    - Disagreement between codex impl + claude review = surface to human OR trigger additional iteration
    
    ### B. Parallel cross-model implementation (expensive, redundant work, strong signal)
    - impl-X-codex runs on codex:gpt-5.5
    - impl-X-claude runs on claude:opus IN PARALLEL on the same task
    - Compare outputs (diff, behavior, test results). If they agree, ship one. If they diverge, the divergence itself is signal — surface for review.
    
    ### C. Hybrid
    - Impl on codex:gpt-5.5 (single thread, no waste)
    - After impl + eval, opus reviews. If opus disagrees with eval verdict, flag for human.
    
    Recommend **C** for this batch. Cheaper than B, stronger signal than A alone (review checks the eval, not just the impl).
    
    For the design's subgraph: ADD a 'peer-review' task type after each impl that's pinned to claude:opus. The peer-reviewer reads:
    - The impl's diff (git diff main..HEAD on the impl agent's worktree branch)
    - The impl's smoke test results
    - The eval/FLIP scores
    - And produces a verdict: 'concur' OR 'concerns + cite file:line'.
    
    If 'concerns', the chat agent (next user prompt cycle) decides whether to act on them.
    
    This is a NEW task type the design introduces. Whether to ship it as a hardcoded pattern or extract it as a function later (similar to the doc-sync function pattern) can be decided after the design lands successfully.
    
    Net: design should produce a subgraph where each implementation task has a cross-model peer-review task wired downstream of it. Both pinned to specific models per the modulation above.
  3. #3 · design-nex-chat · 2026-05-02T23:56:36.546107510+00:00 · read
    Acknowledged both modulations. Plan: (1) recreate impl tasks (I1-I4 + INT) pinned to codex:gpt-5.5; (2) add 5 peer-review tasks on claude:opus, one per impl + one for INT, each --after its target — they read diff + smoke results + eval/FLIP and emit concur/concerns. Wiring: each peer-review depends on its impl; integrate gates on impls AND peer-reviews so the SYN smoke runs only after both verdicts agree. Pattern C per your guidance. Filing now.

Log