Metadata
| Status | done |
|---|---|
| Assigned | agent-2501 |
| Agent identity | 02e879681e52e0a384106169be043416c4d946e850ab26b2269c57681b52a6e7 |
| Model | claude:opus |
| Created | 2026-05-04T22:56:02.723124408+00:00 |
| Started | 2026-05-04T22:57:23.222397508+00:00 |
| Completed | 2026-05-04T23:35:29.299895399+00:00 |
| Tags | priority-high,fix,perf,tui,async,eval-scheduled |
| Eval score | 0.71 |
| └ blocking impact | 0.78 |
| └ completeness | 0.68 |
| └ constraint fidelity | 0.55 |
| └ coordination overhead | 0.75 |
| └ correctness | 0.72 |
| └ downstream usability | 0.72 |
| └ efficiency | 0.72 |
| └ intent fidelity | 0.84 |
| └ style adherence | 0.70 |
Description
User is running workgraph on a system where the filesystem hosting .wg/ has high latency (likely NFS / sshfs / another networked mount). fix-tui-perf-2 added in-process caching + throttling, but cache MISSES still trigger slow disk reads that block the main loop. Net: the TUI freezes on cache misses.
User report 2026-05-04: 'I'm on a system that has extremely high latency to the file system where workgraph is being hosted and it is causing the TUI to basically get stuck. ... we need to make some kind of calls asynchronous so they don't block the TUI for at least communication with the coordinating agent.'
Required architectural property
No file I/O on the TUI's main thread, ever. All disk reads + writes happen on background threads / async tasks. Main thread polls via channel for results. Chat input + render proceed regardless of disk latency.
Existing work that almost achieved this (fix-tui-perf-2):
- Caching reduced repeated reads
- Throttling reduced refresh frequency
- Render-debouncing (Fix 4)
- Per-agent tail thread (Fix 6)
- Chat-PTY-render decoupled from graph-render (Fix 5)
What's MISSING for high-latency case:
- Cache MISSES still go to main thread
- Initial reads at startup still go to main thread
- Stat() calls for fs watcher are still on main thread
- Any user-triggered refresh (manual reload, scroll-to-task, etc.) goes to main thread
Spec — make the TUI truly latency-resilient
1. Audit every fs syscall on the main thread
- grep for fs::read / fs::metadata / fs::File::open / etc. in the TUI render path
- For each, classify: 'always cached' (good), 'cache-miss-possible' (problem), 'always synchronous' (bad)
2. Move 'cache-miss-possible' and 'always synchronous' off the main thread
Pattern:
- Main thread checks cache
- If cache hit: render with cached value
- If cache miss: render with last-known-stale value + dispatch background read
- Background read posts result via channel
- Next render frame picks up the fresh value
This is stale-while-revalidate: render with possibly-stale data immediately and refresh in the background. That tradeoff is acceptable for a TUI where the user is reading dense info (a few hundred milliseconds of staleness is invisible).
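A minimal sketch of this pattern, assuming a plain `std::sync::mpsc` channel drained once per frame; `GraphCache`, `FsResult`, and the helper names are illustrative, not workgraph's actual types:

```rust
// Sketch: main thread renders from cache; misses dispatch a background
// read and render stale data meanwhile. `GraphCache` / `FsResult` are
// illustrative names, not workgraph's actual types.
use std::collections::{HashMap, HashSet};
use std::path::{Path, PathBuf};
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

enum FsResult {
    Graph { path: PathBuf, contents: String },
}

struct GraphCache {
    entries: HashMap<PathBuf, String>, // last-known values, possibly stale
    pending: HashSet<PathBuf>,         // reads already in flight
    tx: Sender<FsResult>,
    rx: Receiver<FsResult>,
}

impl GraphCache {
    fn new() -> Self {
        let (tx, rx) = channel();
        GraphCache { entries: HashMap::new(), pending: HashSet::new(), tx, rx }
    }

    /// Main thread, once per frame: never blocks on disk. A miss returns
    /// None (caller renders the last-known/placeholder view) and kicks off
    /// a background read whose result arrives later via the channel.
    fn get(&mut self, path: &Path) -> Option<&String> {
        if !self.entries.contains_key(path) && self.pending.insert(path.to_path_buf()) {
            let tx = self.tx.clone();
            let path = path.to_path_buf();
            thread::spawn(move || {
                if let Ok(contents) = std::fs::read_to_string(&path) {
                    let _ = tx.send(FsResult::Graph { path, contents });
                }
            });
        }
        self.entries.get(path)
    }

    /// Main thread, before each render: drain completed reads so the next
    /// frame picks up fresh values.
    fn drain(&mut self) {
        while let Ok(FsResult::Graph { path, contents }) = self.rx.try_recv() {
            self.pending.remove(&path);
            self.entries.insert(path, contents);
        }
    }
}
```

The key property: `get` never does more I/O than a hash lookup, so a 500ms disk stall only delays how soon fresh data appears, never the frame itself.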
3. Chat input MUST be unblockable
Specifically: typing in a chat tab routes ONLY to the inner PTY's stdin and NEVER waits on graph state, agent metadata, or anything else that could touch disk. This was Fix 5 in fix-tui-perf-2; verify it actually shipped clean, and if it retains any disk dependency, make it cache-only with no fallback to disk on the main thread.
4. Add a 'disk-slow' detector
If a background read takes >500ms, surface a one-line indicator in the status bar: '⚠ disk slow (read took 1.2s)'. User awareness without blocking. Optional but useful for diagnosis.
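A sketch of how the background worker could time its reads and flag slow ones; `timed_read` is a hypothetical helper, and the 500ms threshold comes from this item:

```rust
use std::path::Path;
use std::time::{Duration, Instant};

const SLOW_READ: Duration = Duration::from_millis(500);

/// Runs on the background worker: do the read, and report the elapsed
/// time when it crosses the threshold so the main thread can show a
/// status-bar warning without ever blocking on the read itself.
fn timed_read(path: &Path) -> (std::io::Result<String>, Option<Duration>) {
    let started = Instant::now();
    let result = std::fs::read_to_string(path);
    let elapsed = started.elapsed();
    (result, (elapsed > SLOW_READ).then_some(elapsed))
}
```

The main thread would render the indicator from the reported duration (e.g. `format!("⚠ disk slow (read took {:.1}s)", d.as_secs_f64())`) and clear it once a later read comes back under the threshold.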
5. Stat caching for fs watcher
The graph-watch / output.log watchers do stat() on each event. On high-latency FS, even stat() can be slow. Cache stat results aggressively; invalidate via the fs notify event, not via re-stat'ing.
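A sketch of that invalidation discipline, assuming the watcher hands us the changed path (e.g. from a `notify`-style event); `StatCache` is an illustrative name:

```rust
// Sketch: stat results cached per path; the notify event itself is the
// invalidation signal, so the hot path never re-stats over the slow FS.
use std::collections::HashMap;
use std::fs::Metadata;
use std::path::{Path, PathBuf};

struct StatCache {
    entries: HashMap<PathBuf, Metadata>,
}

impl StatCache {
    fn new() -> Self {
        StatCache { entries: HashMap::new() }
    }

    /// Serve from cache when possible; stat only on a cold path, and even
    /// that should run on the background worker on a high-latency FS.
    fn metadata(&mut self, path: &Path) -> std::io::Result<&Metadata> {
        if !self.entries.contains_key(path) {
            let meta = std::fs::metadata(path)?;
            self.entries.insert(path.to_path_buf(), meta);
        }
        Ok(self.entries.get(path).expect("inserted above"))
    }

    /// Called from the watcher's event handler: the event already tells us
    /// the path changed, so drop the cached entry instead of re-stat'ing.
    fn invalidate(&mut self, changed: &Path) {
        self.entries.remove(changed);
    }
}
```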
Validation
- Failing test: simulate a 500ms-latency filesystem (use the `slowfs` library or an LD_PRELOAD delay shim). TUI startup completes; chat input is responsive (keystrokes echo within 50ms p99 even under load).
- Live test on the user's actual high-latency setup: the TUI doesn't freeze on any operation; chat input remains responsive.
- Disk-slow indicator (spec item 4) fires when reads exceed 500ms; clears when latency normalizes.
- No regression of fix-tui-perf-2's caching / throttling work.
- Permanent benchmark added: 'tui_responsive_under_500ms_latency' in tests/smoke/scenarios/, asserts chat input p99 < 100ms with simulated slow FS.
- cargo build + cargo test pass
- cargo install --path . was run before claiming done
Immediate user mitigations (until this lands)
- Move `.wg/` to a local filesystem. Symlink or actually relocate. If you need cross-machine collaboration, sync via git (commit `.wg/graph.jsonl` etc.) rather than mounting the workgraph dir over the network.
- Reduce dispatcher poll frequency in config: `[dispatcher].poll_interval = 30` (default 5s; bumping to 30s dramatically cuts the fs-watcher event rate).
- Disable agent stream tailing if not needed: probably not a config knob today; the per-agent tail thread (Fix 6 from fix-tui-perf-2) should help.
Coordinate
- fix-tui-perf-2 (done) — partial step in this direction
- design-chat-agent / implement-tmux-wrapped (done) — chat persistence; orthogonal but composes
- This task generalizes 'TUI never blocks on slow I/O' as a systemic property
Depends on
Required by
Log
- 2026-05-04T22:56:02.701327694+00:00 Task paused
- 2026-05-04T22:56:51.141188031+00:00 Task published
- 2026-05-04T22:57:19.438088713+00:00 Lightweight assignment: agent=Careful Programmer (02e87968), exec_mode=full, context_scope=graph, reason=Careful Programmer is the only implementation agent available; high score (0.81) and Careful tradeoff suit this correctness-critical refactoring of TUI's async I/O architecture.
- 2026-05-04T22:57:21.270675209+00:00 USER ADDITIONAL CONTEXT 2026-05-04: needs MEASUREMENT / repro mechanism + analysis path. User direct quote: 'We need to measure this somehow. ... we need to simulate something getting messed up with the way the file system can be read. ... I'm worried we're not going to be able to solve the issue. Hopefully we can figure it out by analysis. ... it's a network file system. It's OK for writing stuff, but just kind of slow. ... we should be looking for places where the TUI itself is getting locked in and waiting on file information.' EXPANDED TASK: this is now Diagnose + Fix + Measurement. The agent picking this up should:
  ## Phase 1: Build a local repro harness
  Concrete options for simulating a slow filesystem locally:
  ### Option A: charybdefs (recommended)
  ScyllaDB's FUSE-based fault-injection filesystem. Lets you inject latency / errors per syscall on a specific path.
  - https://github.com/scylladb/charybdefs
  - Mount: `charybdefs /tmp/slow-mount -o original=/tmp/real-data`
  - Inject 500ms delay: `charybdefs-cli --probability 100 --delay-us 500000 read`
  ### Option B: FUSE loopback with sleep injection
  Write a tiny loopback FUSE FS in Rust (or use an existing one) that adds `thread::sleep(Duration::from_millis(N))` before each read. Mount it on top of `.wg/`. Slower than charybdefs but trivial to write.
  ### Option C: LD_PRELOAD shim
  A library that intercepts open/read/stat and adds latency. Doesn't require FUSE or root. Simpler than mounting but only affects libc-based fs calls.
  ### Option D: Docker with rate-limited bind mount
  `docker run --device-read-bps /dev/loop0:1mb ...` limits I/O bandwidth on a specific loopback device. Less precise than a per-syscall delay but realistic enough for this bug class.
  Recommend A (charybdefs) for precision; B for portability; C as fallback.
  ## Phase 2: Use the harness to pinpoint blocking I/O
  With the slow-fs harness in place:
  - Run `wg tui` on a project mounted via the slow FS
  - Capture: which UI interactions freeze? for how long? on which syscall?
  - Use `strace -p $(pgrep wg-tui) -e trace=open,read,stat,fstat -tt` to log fs calls with timestamps
  - Identify: what is the BLOCKING syscall on the main thread that exceeds N ms?
  This produces a concrete list of 'TUI is locked here on disk' sites with file:line.
  ## Phase 3: Fix the identified blocking sites
  Apply the patterns from the original task description:
  - Cache-with-stale-fallback for reads
  - Background thread for unblockable I/O
  - Chat input MUST never wait on disk
  ## Phase 4: Permanent regression test
  The harness becomes a permanent smoke scenario (latency gate sketched after this log entry):
  - `tests/smoke/scenarios/tui_responsive_under_slow_fs.sh`
  - Sets up charybdefs (or a fallback harness) with a 500ms read delay
  - Spawns wg tui against that mount
  - Drives N keystrokes, asserts p99 echo latency <200ms
  - Skips cleanly (exit 77) if charybdefs / FUSE is not available
  Without measurement infrastructure, the bug recurs. With it, this regression class is permanently gated.
  ## Analysis path (parallel with measurement)
  Even before the harness is set up, the agent can identify likely-blocking sites by reading the TUI source:
  - src/tui/viz_viewer/state.rs: refresh / load_viz_from_graph paths
  - src/messages.rs: list_messages / parse_token_usage_live paths
  - src/commands/viz/mod.rs: message_stats / agency_token_usage
  - src/tui/pty_pane.rs: PTY handling (any fs reads here?)
  - src/tui/viz_viewer/event.rs: main event loop fs interactions
  For each, ask: is this on the main thread? Is the result cached? Is the cache TTL appropriate? Does a cache miss go to the main thread or to the background? The diagnose-tui-scales task already identified some of these; this task EXTENDS that work to specifically cover the high-latency case. USER PRIORITY: the user is currently blocked by this on their live system. Local repro is important for the long-term fix; analysis-based mitigation is wanted in the meantime. Both phases are worth pursuing in parallel.
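A minimal sketch of the Phase 4 latency gate referenced above. The 200ms p99 budget comes from the phase description; `send_keystroke_and_wait_for_echo` is a stand-in closure for whatever the harness actually exposes, not an existing workgraph API:

```rust
// Hypothetical shape of the smoke scenario's latency gate: drive N
// keystrokes, collect echo latencies, assert the p99 stays under budget.
use std::time::Duration;

fn p99(mut samples: Vec<Duration>) -> Duration {
    assert!(!samples.is_empty());
    samples.sort();
    // index of the 99th-percentile sample (1-based rank, ceiling), 0-based
    let idx = (samples.len() * 99).div_ceil(100) - 1;
    samples[idx]
}

fn assert_responsive(mut send_keystroke_and_wait_for_echo: impl FnMut() -> Duration) {
    let samples: Vec<Duration> =
        (0..200).map(|_| send_keystroke_and_wait_for_echo()).collect();
    let worst = p99(samples);
    assert!(
        worst < Duration::from_millis(200),
        "p99 echo latency {worst:?} over the 200ms budget"
    );
}
```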
- 2026-05-04T22:57:23.222402166+00:00 Spawned by coordinator --executor claude --model opus
- 2026-05-04T22:57:37.899143200+00:00 Starting investigation: auditing fs syscalls in TUI main thread
- 2026-05-04T23:32:02.943150271+00:00 Implemented AsyncFs background worker. All TUI main-thread fs::metadata, load_graph, read_streaming, and bump_chat_interaction calls now route through a background thread. Added disk-slow indicator to status bar. Added permanent smoke scenario tui_responsive_under_500ms_latency that verifies main-thread API stays under 50ms p99 under 500ms simulated FS latency.
- 2026-05-04T23:34:54.187585493+00:00 Committed: cc04e5fa3 — pushed to remote
- 2026-05-04T23:35:12.146863017+00:00 Validated: cargo build + cargo test --bin wg pass (3400 passed, 0 failed, 1 ignored). Validated: smoke scenario tui_responsive_under_500ms_latency passes. Validated: cargo install --path . was run.
- 2026-05-04T23:35:29.299899357+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-05-04T23:36:53.643019021+00:00 PendingEval → Done (evaluator passed; downstream unblocks)