fix-tui-must

Fix: TUI must remain responsive on high-latency filesystems — all main-thread I/O backgrounded, chat input never blocks on disk

Metadata

Status: done
Assigned: agent-2501
Agent identity: 02e879681e52e0a384106169be043416c4d946e850ab26b2269c57681b52a6e7
Model: claude:opus
Created: 2026-05-04T22:56:02.723124408+00:00
Started: 2026-05-04T22:57:23.222397508+00:00
Completed: 2026-05-04T23:35:29.299895399+00:00
Tags: priority-high, fix, perf, tui, async, eval-scheduled
Eval score: 0.71
└ blocking impact: 0.78
└ completeness: 0.68
└ constraint fidelity: 0.55
└ coordination overhead: 0.75
└ correctness: 0.72
└ downstream usability: 0.72
└ efficiency: 0.72
└ intent fidelity: 0.84
└ style adherence: 0.70

Description

User is running workgraph on a system where the filesystem hosting .wg/ has high latency (likely NFS / sshfs / networked). fix-tui-perf-2 added in-process caching + throttling, but cache MISSES still hit slow disk reads and block the main loop. Net: TUI freezes on cache misses.

User report 2026-05-04: 'I'm on a system that has extremely high latency to the file system where workgraph is being hosted and it is causing the TUI to basically get stuck. ... we need to make some kind of calls asynchronous so they don't block the TUI for at least communication with the coordinating agent.'

Required architectural property

No file I/O on the TUI's main thread, ever. All disk reads + writes happen on background threads / async tasks. Main thread polls via channel for results. Chat input + render proceed regardless of disk latency.

Existing work that almost achieved this (fix-tui-perf-2):

  • Caching reduced repeated reads
  • Throttling reduced refresh frequency
  • Render-debouncing (Fix 4)
  • Per-agent tail thread (Fix 6)
  • Chat-PTY-render decoupled from graph-render (Fix 5)

What's MISSING for high-latency case:

  • Cache MISSES still go to main thread
  • Initial reads at startup still go to main thread
  • Stat() calls for fs watcher are still on main thread
  • Any user-triggered refresh (manual reload, scroll-to-task, etc.) goes to main thread

Spec — make the TUI truly latency-resilient

1. Audit every fs syscall on the main thread

  • grep for fs::read / fs::metadata / File::open / etc. in the TUI render path
  • For each, classify: 'always cached' (good), 'cache-miss-possible' (problem), 'always synchronous' (bad)

2. Move 'cache-miss-possible' and 'always synchronous' off the main thread

Pattern:

  • Main thread checks cache
  • If cache hit: render with cached value
  • If cache miss: render with last-known-stale value + dispatch background read
  • Background read posts result via channel
  • Next render frame picks up the fresh value

This is 'optimistic concurrency': render with possibly-stale data immediately, refresh in background. Acceptable for a TUI where the user is reading dense info (a few-hundred-ms staleness is invisible).
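A minimal Rust sketch of this pattern (all names are hypothetical — workgraph's actual cache type will differ): the main thread drains a channel of completed reads each frame, serves hits from memory, and dispatches misses to a background thread without ever touching disk itself.

```rust
use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical cache: main thread only sees in-memory state plus a channel.
struct Cache {
    entries: HashMap<String, String>,
    tx: mpsc::Sender<(String, String)>,
    rx: mpsc::Receiver<(String, String)>,
}

impl Cache {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel();
        Cache { entries: HashMap::new(), tx, rx }
    }

    // Called once per render frame on the main thread: never blocks on disk.
    fn get_or_fetch(&mut self, path: &str) -> Option<&String> {
        // Drain any completed background reads first.
        while let Ok((p, data)) = self.rx.try_recv() {
            self.entries.insert(p, data);
        }
        if !self.entries.contains_key(path) {
            let tx = self.tx.clone();
            let p = path.to_string();
            thread::spawn(move || {
                // Simulated slow read; real code would use std::fs::read_to_string.
                thread::sleep(Duration::from_millis(50));
                let _ = tx.send((p.clone(), format!("contents of {p}")));
            });
        }
        self.entries.get(path) // None on the first frame => render stale/placeholder
    }
}

fn main() {
    let mut cache = Cache::new();
    assert!(cache.get_or_fetch(".wg/graph.jsonl").is_none()); // miss: non-blocking
    thread::sleep(Duration::from_millis(200)); // let the background read finish
    assert!(cache.get_or_fetch(".wg/graph.jsonl").is_some()); // next frame: hit
    println!("ok");
}
```

The key property is that `get_or_fetch` returns immediately on a miss; the frame renders with whatever the cache last held.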

3. Chat input MUST be unblockable

Specifically: typing in a chat tab routes ONLY to the inner PTY's stdin. It NEVER waits on graph state, agent metadata, or anything else that could touch disk. This was Fix 5 in fix-tui-perf-2 — verify it actually shipped clean, and if any disk dependency remains, enforce 'cache only, no fallback to disk on the main thread'.
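A hedged sketch of the invariant (the handler name and PTY writer are hypothetical stand-ins for workgraph's real types): the chat key handler's only side effect is an in-memory write to the PTY handle, so disk latency cannot stall a keystroke.

```rust
use std::io::Write;

// Hypothetical chat-tab key handler: bytes go straight to the inner PTY's
// stdin. No graph lookups, no metadata reads, no cache fallback to disk.
fn handle_chat_key(pty_stdin: &mut impl Write, key: u8) -> std::io::Result<()> {
    pty_stdin.write_all(&[key])
}

fn main() -> std::io::Result<()> {
    let mut fake_pty: Vec<u8> = Vec::new(); // stand-in for the real PTY writer
    for b in b"hello" {
        handle_chat_key(&mut fake_pty, *b)?;
    }
    assert_eq!(fake_pty, b"hello".to_vec());
    println!("ok");
    Ok(())
}
```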

4. Add a 'disk-slow' detector

If a background read takes >500ms, surface a one-line indicator in the status bar: '⚠ disk slow (read took 1.2s)'. User awareness without blocking. Optional but useful for diagnosis.
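One way the detector could look, assuming background reads report their elapsed time alongside their result (threshold and message format taken from this spec; the function name is invented):

```rust
use std::time::Duration;

// Threshold from the spec: reads slower than this surface a status-bar warning.
const SLOW_READ: Duration = Duration::from_millis(500);

// Returns the status-bar line for the most recent background read, if slow.
fn status_line(last_read: Duration) -> Option<String> {
    if last_read > SLOW_READ {
        Some(format!("⚠ disk slow (read took {:.1}s)", last_read.as_secs_f64()))
    } else {
        None // indicator clears when latency normalizes
    }
}

fn main() {
    // Fast read: no warning.
    assert!(status_line(Duration::from_millis(40)).is_none());
    // Slow read: warning matches the spec's example text.
    assert_eq!(
        status_line(Duration::from_millis(1200)).as_deref(),
        Some("⚠ disk slow (read took 1.2s)")
    );
    println!("ok");
}
```

Since timing happens on the background thread, the detector itself adds no main-thread work beyond rendering one cached string.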

5. Stat caching for fs watcher

The graph-watch / output.log watchers do stat() on each event. On high-latency FS, even stat() can be slow. Cache stat results aggressively; invalidate via the fs notify event, not via re-stat'ing.
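A sketch of that stat cache under stated assumptions (the `StatInfo` shape and the event plumbing are invented for illustration; the real watcher presumably uses the notify crate's events):

```rust
use std::collections::HashMap;

// Hypothetical cached stat result; real code would derive this from fs::Metadata.
#[derive(Clone, PartialEq, Debug)]
struct StatInfo { size: u64, mtime_secs: u64 }

struct StatCache { entries: HashMap<String, StatInfo> }

impl StatCache {
    fn new() -> Self { StatCache { entries: HashMap::new() } }

    // Main thread: cached answer only. A miss returns None and the caller
    // schedules a background stat rather than calling fs::metadata inline.
    fn get(&self, path: &str) -> Option<&StatInfo> { self.entries.get(path) }

    // Background thread posts fresh stat results here.
    fn insert(&mut self, path: &str, info: StatInfo) {
        self.entries.insert(path.to_string(), info);
    }

    // fs-notify event handler: invalidate the entry; never re-stat inline.
    fn on_notify_event(&mut self, path: &str) { self.entries.remove(path); }
}

fn main() {
    let mut cache = StatCache::new();
    cache.insert("output.log", StatInfo { size: 42, mtime_secs: 1 });
    assert!(cache.get("output.log").is_some()); // served from memory
    cache.on_notify_event("output.log");        // event invalidates, no stat()
    assert!(cache.get("output.log").is_none()); // next access refreshes async
    println!("ok");
}
```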

Validation

  • Failing test: simulate a 500ms-latency filesystem (use the slowfs library or LD_PRELOAD a delay shim). TUI startup completes; chat input is responsive (keystrokes echo within 50ms p99 even under load).
  • Live test on user's actual high-latency setup: TUI doesn't freeze on any operation; chat input remains responsive.
  • Disk-slow indicator (Fix 4) fires when reads exceed 500ms; clears when latency normalizes.
  • No regression of fix-tui-perf-2's caching / throttling work.
  • Permanent benchmark added: 'tui_responsive_under_500ms_latency' in tests/smoke/scenarios/, asserts chat input p99 < 100ms with simulated slow FS.
  • cargo build + cargo test pass
  • cargo install --path . was run before claiming done
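The core responsiveness assertion can be prototyped without a real slow filesystem. A sketch with all timings illustrative: a background "read" sleeps to simulate latency while the main loop polls non-blockingly, and the test asserts no single tick ever approaches the read latency.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Simulates one slow read while the main loop keeps ticking; returns the
// worst per-tick duration, which must stay far below the read latency.
fn worst_tick_under_latency(read_latency_ms: u64) -> Duration {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(read_latency_ms)); // simulated slow FS
        let _ = tx.send("graph data".to_string());
    });

    let mut worst = Duration::ZERO;
    loop {
        let t = Instant::now();
        let done = rx.try_recv().is_ok(); // non-blocking poll, as the spec requires
        // ... a real frame would render with cached/stale data here ...
        thread::sleep(Duration::from_millis(10)); // frame pacing
        worst = worst.max(t.elapsed());
        if done { break; }
    }
    worst
}

fn main() {
    let worst = worst_tick_under_latency(500);
    assert!(worst < Duration::from_millis(100), "a tick blocked on the slow read");
    println!("worst tick: {worst:?}");
}
```

The permanent smoke-test benchmark would wrap the same assertion around the real TUI event loop instead of this toy loop.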

Immediate user mitigations (until this lands)

  1. Move .wg/ to local filesystem. Symlink or actually relocate. If you need cross-machine collaboration, sync via git (commit .wg/graph.jsonl etc.) rather than mounting the workgraph dir over the network.
  2. Reduce dispatcher poll frequency in config: [dispatcher].poll_interval = 30 (default 5s; bumping to 30s dramatically cuts fs-watcher event rate).
  3. Disable agent stream tailing if not needed: probably not a config knob today; per-agent tail thread (Fix 6 from fix-tui-perf-2) should help.

Coordinate

  • fix-tui-perf-2 (done) — partial step in this direction
  • design-chat-agent / implement-tmux-wrapped (done) — chat persistence; orthogonal but composes
  • This task generalizes 'TUI never blocks on slow I/O' as a systemic property

Depends on

Required by

Log