fix-tui-perf

Fix: TUI perf — apply 6 prioritized fixes from diagnose-tui-scales (caching + decoupling + throttling)

Metadata

Status: failed
Assigned: agent-1398
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-05-01T18:38:30.148363534+00:00
Started: 2026-05-01T20:01:34.986086738+00:00
Tags: priority-high, fix, perf, tui, eval-scheduled
Tokens: 2653677 in / 9778 out
Failure reason: rescue eval unavailable after 2 attempts; falling back to terminal failure

Description

diagnose-tui-scales (agent-1347) identified the cause and spec'd the 6 prioritized fixes below. Read its log via wg show diagnose-tui-scales for the full forensic analysis + benchmark scenarios.

Root cause (already proven)

Not quadratic — O(N) per refresh × high refresh rate. With 8 active agents each appending to output.log every ~100ms, the recursive fs watcher fans in 80+ events/sec, each triggering a full pipeline pass that re-reads the same files multiple times, re-walks archive dirs, etc. The single-threaded main loop means PTY keystroke echo waits behind graph render.

The 6 fixes (apply in priority order)

Fix 1 (HIGHEST IMPACT) — message_stats + coordinator_message_status caching + fold-into-one-pass

  • Files: src/messages.rs:207-256, :667-713, src/commands/viz/mod.rs:736-756, src/tui/viz_viewer/state.rs:7092-7138
  • Cache per (task_id, mtime) in VizApp
  • The fs watcher already knows the changed paths — surface that path list (currently discarded at start_fs_watcher, state.rs:7107) and invalidate selectively
  • The two functions each call list_messages() — fold them into ONE pass that reads the file once (sketched below)
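
A minimal sketch of the folded pass plus selective invalidation, assuming one message file per task. MessagePass, the JSON-ish line format, and invalidate() are illustrative stand-ins, not the real schema in src/messages.rs:

```rust
use std::collections::{HashMap, HashSet};
use std::path::{Path, PathBuf};

// Hypothetical combined result; both stats come from a single read.
struct MessagePass {
    total: usize,
    coordinator_unread: usize,
}

// The "fold": one pass over the file replaces the two list_messages() calls
// made by message_stats and coordinator_message_status.
fn message_pass(file: &Path) -> std::io::Result<MessagePass> {
    let text = std::fs::read_to_string(file)?;
    let mut pass = MessagePass { total: 0, coordinator_unread: 0 };
    for line in text.lines() {
        pass.total += 1;
        // Placeholder predicate; the real format is whatever list_messages()
        // parses today.
        if line.contains("\"to\":\"coordinator\"") && line.contains("\"read\":false") {
            pass.coordinator_unread += 1;
        }
    }
    Ok(pass)
}

// Selective invalidation: the watcher's changed-path list (currently
// discarded in start_fs_watcher) says exactly which entries to drop.
fn invalidate(cache: &mut HashMap<PathBuf, MessagePass>, changed: &HashSet<PathBuf>) {
    cache.retain(|path, _| !changed.contains(path));
}
```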

Fix 2 — live_token_usage + agency_token_usage caching

  • Files: src/commands/viz/mod.rs:646-733, src/graph.rs:914 (parse_token_usage_live)
  • Cache per (agent_id, output_log_mtime) and per (task_id, lifecycle_member_mtime)
  • Currently re-walks log/agents//* on every refresh; a generic cache shape is sketched below
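
A sketch of the cache shape, assuming it lives in VizApp and is keyed by file path with the mtime stored for invalidation. MtimeCache is a hypothetical helper; parse_token_usage_live is the existing entry point named above, its signature assumed here:

```rust
use std::collections::HashMap;
use std::path::{Path, PathBuf};
use std::time::SystemTime;

// Hypothetical mtime-keyed cache, reusable for live_token_usage
// (per output.log) and agency_token_usage (per lifecycle member file).
struct MtimeCache<V> {
    entries: HashMap<PathBuf, (SystemTime, V)>,
}

impl<V> MtimeCache<V> {
    fn new() -> Self {
        Self { entries: HashMap::new() }
    }

    // Recompute only when the file's mtime differs from the cached one.
    fn get_or_compute(&mut self, file: &Path, compute: impl FnOnce() -> V) -> &V {
        let mtime = std::fs::metadata(file)
            .and_then(|m| m.modified())
            .unwrap_or(SystemTime::UNIX_EPOCH);
        let stale = self.entries.get(file).map_or(true, |(t, _)| *t != mtime);
        if stale {
            self.entries.insert(file.to_path_buf(), (mtime, compute()));
        }
        &self.entries[file].1
    }
}
```

The call site would then look something like usage_cache.get_or_compute(&output_log, || parse_token_usage_live(&output_log)), so the walk only happens when the log actually grew.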

Fix 3 — eliminate second graph load in apply_sort_mode

  • Files: src/tui/viz_viewer/state.rs:7815-7835
  • Pass the already-loaded WorkGraph in (or precompute and cache the status_map)
  • Two graph loads per refresh are pure waste (see the sketch below)
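
A sketch of the shape of the change; WorkGraph, Row, and Viewer are placeholder stand-ins for the real types:

```rust
use std::collections::HashMap;

// Placeholder stand-ins for the real types in graph.rs / state.rs.
struct WorkGraph { status_by_task: HashMap<String, u8> }
struct Row { task_id: String }
struct Viewer {
    rows: Vec<Row>,
    // Cleared whenever the graph is reloaded, so it is computed at most
    // once per refresh instead of forcing a second graph load.
    status_map: Option<HashMap<String, u8>>,
}

impl Viewer {
    // After the fix: the caller threads through the WorkGraph it already
    // loaded this refresh instead of loading it again.
    fn apply_sort_mode(&mut self, graph: &WorkGraph) {
        let status_map = self
            .status_map
            .get_or_insert_with(|| graph.status_by_task.clone());
        self.rows
            .sort_by_key(|row| status_map.get(&row.task_id).copied().unwrap_or(u8::MAX));
    }
}
```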

Fix 4 — throttle viz regen

  • Files: src/tui/viz_viewer/state.rs:7143-7250
  • Add 'last_full_refresh_at' guard; cap at ~200ms (5fps) regardless of fs event rate
  • Current behavior: the full pipeline runs on every wakeup (guard sketched below)
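
A minimal sketch of the guard, assuming a last_full_refresh_at field on the viewer state as named above:

```rust
use std::time::{Duration, Instant};

// ~200ms floor between full passes (~5 fps) regardless of fs event rate.
const MIN_REFRESH_INTERVAL: Duration = Duration::from_millis(200);

struct Viewer { last_full_refresh_at: Instant }

impl Viewer {
    fn maybe_refresh(&mut self) {
        // Coalesce bursts of fs events into at most one full pass per interval.
        if self.last_full_refresh_at.elapsed() < MIN_REFRESH_INTERVAL {
            return;
        }
        self.last_full_refresh_at = Instant::now();
        // ... full regen pipeline runs here ...
    }
}
```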

Fix 5 (THE INPUT-LATENCY KILLER) — decouple chat PTY render from graph render

Two options; prefer (b), but (a) is acceptable as a v1:

(a) Cheap: in chat_pty_mode, when the redraw is triggered by chat_pty_has_new_bytes() (PTY echo) but NOT by a graph state change, skip load_viz_from_graph + apply_sort_mode + load_stats_from_graph and render the cached lines verbatim. Keystrokes then echo at PTY speed.
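
A sketch of (a); only chat_pty_has_new_bytes() comes from the task text, while graph_dirty, render_cached, and full_refresh are assumed names:

```rust
// Minimal stand-in viewer; all fields and methods are illustrative.
struct Viewer {
    chat_pty_mode: bool,
    graph_dirty: bool,
}

impl Viewer {
    fn chat_pty_has_new_bytes(&self) -> bool { true } // stub
    fn render_cached(&self) { /* repaint cached graph lines verbatim */ }
    fn full_refresh(&mut self) {
        /* load_viz_from_graph + apply_sort_mode + load_stats_from_graph */
    }

    fn on_wakeup(&mut self) {
        // PTY echo only: skip the heavy pipeline so keystrokes echo at PTY speed.
        if self.chat_pty_mode && self.chat_pty_has_new_bytes() && !self.graph_dirty {
            self.render_cached();
            return;
        }
        self.full_refresh();
    }
}
```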

(b) Better: run maybe_refresh's heavy work on a background thread that posts a snapshot via a channel; the main loop only reads the latest snapshot. The fs watcher already runs in its own thread, so this is a natural extension.
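
A sketch of (b) using std::sync::mpsc; Snapshot and compute_snapshot stand in for the output of maybe_refresh's heavy work:

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Stand-in for the fully computed viz state (lines, stats, sorted rows...).
struct Snapshot;

fn compute_snapshot() -> Snapshot {
    // The heavy pipeline (graph load, sort, stats) runs here, off the main thread.
    Snapshot
}

fn spawn_refresher() -> mpsc::Receiver<Snapshot> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || loop {
        if tx.send(compute_snapshot()).is_err() {
            break; // main loop dropped the receiver; shut down
        }
        thread::sleep(Duration::from_millis(200)); // pairs with Fix 4's cap
    });
    rx
}

// Main loop side: drain the channel and keep only the newest snapshot,
// so rendering (and PTY echo) never blocks on the pipeline.
fn latest_snapshot(rx: &mpsc::Receiver<Snapshot>, current: &mut Option<Snapshot>) {
    while let Ok(snap) = rx.try_recv() {
        *current = Some(snap);
    }
}
```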

Fix 6 — per-agent tail thread for stream parsing

  • Files: src/tui/viz_viewer/state.rs:10857-end of update_agent_streams
  • Move agent stream parsing off the main thread (per-agent tail thread sketched below)
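
A sketch of one tail thread, assuming line-oriented output logs; the channel payload and the 50ms poll interval are illustrative choices:

```rust
use std::fs::File;
use std::io::{BufRead, BufReader, Seek, SeekFrom};
use std::path::PathBuf;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical per-agent tail: follows output.log from its current end and
// sends parsed lines to the main loop, keeping parsing off the render path.
fn spawn_tail(agent_id: String, log: PathBuf, tx: mpsc::Sender<(String, String)>) {
    thread::spawn(move || {
        let Ok(mut file) = File::open(&log) else { return };
        let _ = file.seek(SeekFrom::End(0)); // start at the tail
        let mut reader = BufReader::new(file);
        let mut line = String::new();
        loop {
            line.clear();
            match reader.read_line(&mut line) {
                Ok(0) => thread::sleep(Duration::from_millis(50)), // no new bytes yet
                Ok(_) => {
                    // Stream parsing would happen here, off the main thread.
                    if tx.send((agent_id.clone(), line.clone())).is_err() {
                        return; // main loop dropped the receiver
                    }
                }
                Err(_) => return,
            }
        }
    });
}
```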

Validation

The diagnose spec'd dedicated benchmark scenarios (A-E) covering the fixes:

  • A. tui_idle_fps — wg tui at idle, 1000 tasks, no agents. Measure render fps.

  • B. tui_loaded_cpu — same with fixture simulating 8 agents appending to output.log every 100ms for 30s; ASSERT wg tui CPU < 40%.

  • C. tui_chat_input_latency — wg tui in chat_pty_mode against 1000-task graph + 8 simulated writers; drive 50 keystrokes via tmux send-keys; ASSERT p99 echo delay < 50ms.

  • D. cargo bench bench_generate_viz_output — N ∈ {100, 500, 1000, 2000}; ASSERT near-linear scaling and < 50ms at N=1000 (a criterion-style sketch follows this checklist)

  • E. cargo bench bench_message_stats_pair — the fold-to-one-pass variant; ASSERT it runs in < 50% of the two-call baseline.

  • Failing tests/benchmarks written first per the diagnose's spec

  • Each of the 6 fixes applied

  • All 5 benchmark scenarios PASS the asserts

  • Live smoke against this project (~250 tasks, 8 agents busy): chat input latency feels snappy; CPU stays well under 100%; viewport doesn't lag

  • No regression of revert-redo-fix's last_interaction_at primitive (when it lands first)

  • cargo build + cargo test pass

  • Permanent smoke scenarios A-E added to manifest with this task id in owners

  • cargo install --path . was run before claiming done
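
For scenario D, a minimal benchmark sketch assuming the criterion crate; make_graph and generate_viz_output are hypothetical stand-ins for the real fixture builder and viz entry point:

```rust
use criterion::{criterion_group, criterion_main, BenchmarkId, Criterion};

// Hypothetical fixture: builds an in-memory graph with n tasks.
fn make_graph(n: usize) -> Vec<String> {
    (0..n).map(|i| format!("task-{i}")).collect()
}

// Stand-in for the real viz pipeline entry point.
fn generate_viz_output(graph: &[String]) -> usize {
    graph.iter().map(|t| t.len()).sum()
}

fn bench_generate_viz_output(c: &mut Criterion) {
    let mut group = c.benchmark_group("generate_viz_output");
    for n in [100usize, 500, 1000, 2000] {
        let graph = make_graph(n);
        group.bench_with_input(BenchmarkId::from_parameter(n), &graph, |b, g| {
            b.iter(|| generate_viz_output(g));
        });
    }
    group.finish();
}

criterion_group!(benches, bench_generate_viz_output);
criterion_main!(benches);
```

Criterion reports per-N timings; the near-linear-scaling and < 50ms asserts would be checked against those numbers.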

Why depends on revert-redo-fix

Both touch src/tui/viz_viewer/state.rs heavily (apply_sort_mode, maybe_refresh, scroll/sort logic), so serializing the two tasks avoids a merge fight. revert-redo-fix's last_interaction_at primitive may also offer cleaner integration points for the caching keys (e.g., (task_id, last_interaction_at) as a cache key invalidates naturally on interaction).

Process note

This is a substantial multi-fix task. Apply all 6 in priority order. The diagnose did the design work; the implementer executes against the file:line spec. If any one fix turns out wrong or incomplete, file a follow-up rather than abandoning all six.

Depends on

Required by

Log