Metadata
| Status | done |
|---|---|
| Assigned | agent-1077 |
| Agent identity | 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3 |
| Created | 2026-04-29T11:44:38.010908088+00:00 |
| Started | 2026-04-29T11:45:00.581584584+00:00 |
| Completed | 2026-04-29T11:50:41.293687689+00:00 |
| Tags | bug,tui,perf,eval-scheduled |
| Eval score | 0.81 |
| └ blocking impact | 0.80 |
| └ completeness | 0.82 |
| └ coordination overhead | 0.85 |
| └ correctness | 0.85 |
| └ downstream usability | 0.90 |
| └ efficiency | 0.85 |
| └ intent fidelity | 0.71 |
| └ style adherence | 0.85 |
Description
User report: wg tui is using 55% CPU sustained.
Verified 2026-04-29 morning: `ps -o pid,pcpu,etime,args` showed:
PID      %CPU  ELAPSED  COMMAND
2535288  55.5    15:11  wg tui
This is alongside an active scrollback-duplication symptom the user is reporting on initial chat-tab render (filed as fix-pty-scrollback). Hypothesis: both share a root cause, a render path that runs far more often than it should; it burns CPU and re-emits buffer content faster than the diff reconciler can handle, producing visible duplication.
Goal
Profile wg tui under typical idle conditions (graph open, one chat tab visible, no user input) and identify what's running every frame. Output should be a list of hotspots with suggested fixes.
Specific things to check
- Is the render loop event-driven (waits for input/event) or polling-driven (busy-loop with sleep_ms)? See the sketch after this list.
- Are there any tight loops without proper backoff in: PTY reader, chat event stream parser, graph watcher consumer, scrollback re-wrap?
- Is the layout/wrap recomputed every frame even when content hasn't changed?
- Any uncached file reads on every tick (chat history file, graph.jsonl, registry)?
- Is the daemon socket being polled by the TUI separately from the fs-watcher?
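The first check is the one most likely to explain sustained idle CPU. A minimal sketch of the two loop shapes, using a plain `std::sync::mpsc` channel as a stand-in for whatever event source `wg tui` actually uses (the event type, timings, and function names are illustrative, not taken from the code):

```rust
use std::sync::mpsc::{self, Receiver};
use std::time::Duration;

enum UiEvent {
    Input,
    GraphChanged,
}

// Polling-driven: wakes every 50ms regardless of activity and redraws
// unconditionally, so the full layout/draw cost is paid ~20x per second
// even when nothing changed.
#[allow(dead_code)] // shown for contrast; not called from main
fn run_polling(rx: &Receiver<UiEvent>, mut redraw: impl FnMut()) {
    loop {
        while rx.try_recv().is_ok() { /* drain whatever arrived */ }
        redraw(); // every iteration, dirty or not
        std::thread::sleep(Duration::from_millis(50));
    }
}

// Event-driven: blocks until an event arrives (with a coarse timeout so
// animations can still tick) and only redraws when something changed.
fn run_event_driven(rx: &Receiver<UiEvent>, mut redraw: impl FnMut()) {
    loop {
        let mut dirty = false;
        match rx.recv_timeout(Duration::from_millis(250)) {
            Ok(_event) => dirty = true,
            Err(mpsc::RecvTimeoutError::Timeout) => {} // idle: nothing to do
            Err(mpsc::RecvTimeoutError::Disconnected) => break,
        }
        while rx.try_recv().is_ok() {
            dirty = true; // coalesce bursts into a single redraw
        }
        if dirty {
            redraw();
        }
    }
}

fn main() {
    let (tx, rx) = mpsc::channel::<UiEvent>();
    tx.send(UiEvent::Input).unwrap();
    tx.send(UiEvent::GraphChanged).unwrap();
    drop(tx); // disconnecting the sender lets the event-driven loop exit
    run_event_driven(&rx, || println!("redraw"));
}
```

The point of comparison: the polling loop pays the full redraw cost roughly 20 times per second no matter what, while the event-driven loop sleeps in `recv_timeout` when idle and collapses a burst of queued events into one redraw.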
Tools
- `perf record -p $(pgrep -f 'wg tui') -F 99 -- sleep 30 && perf report` for a sample profile
- Or `samply record` if perf is unavailable
- `strace -c -p <pid>` for a syscall breakdown; a high syscall rate during idle means a busy-loop somewhere
Validation
- Profile captured (perf or samply) and saved as a flame-graph or top-functions list in the task log
- Hotspot identified — function + estimated % of CPU spent
- Hypothesis confirmed or refuted: is this the same root cause as fix-pty-scrollback's symptom? If yes, propose merging the fixes; if no, document why they're separate
- Recommended fix(es) noted with file/function pointers — the actual fix lands in a follow-up task (fix-tui-cpu or rolled into fix-pty-scrollback)
- No source modifications — diagnose only
Depends on
Required by
Log
- 2026-04-29T11:44:37.995919154+00:00 Task paused
- 2026-04-29T11:44:38.032396530+00:00 Task published
- 2026-04-29T11:45:00.520235906+00:00 Lightweight assignment: agent=Default Evaluator (31847164), exec_mode=light, context_scope=task, reason=Task is pure diagnosis/profiling (no code changes permitted); Evaluator role is optimized for assessment and finding hotspots; strong track record (0.83 score, 285 tasks) on diagnostic work.
- 2026-04-29T11:45:00.581588431+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-29T11:45:13.196894913+00:00 Starting diagnosis: looking for current wg tui PID, then will profile via perf/samply + strace
- 2026-04-29T11:50:03.235026218+00:00 ROOT CAUSE: render::draw_chat_tab (render.rs:3015) calls active_tab_ids_and_labels() AND list_user_board_entries() on every frame. Both helpers fully reload+JSON-parse the 2.19MB / 683-task graph.jsonl via workgraph::parser::load_graph() (state.rs:13362 and :13470). With chat tab open, adaptive poll runs at 50-200ms during animations -> 10-40 graph reloads/sec. Perf profile (20s, 99Hz, 1349 samples) attributes ~55% of CPU to serde_json deserialize + matching malloc/free/BTreeMap churn driven by these reloads. The PTY/vt100 subsystem is only ~0.4% of samples.
- 2026-04-29T11:50:03.261757555+00:00 HYPOTHESIS RE: fix-pty-scrollback - SEPARATE bugs sharing a render-rate amplifier. CPU profile shows zero time in vt100/scrollback paths, so duplication isn't directly caused by the JSON-reload bug. Both are made worse by the same over-redraw though. Recommend: keep tasks separate; if duplication disappears after CPU fix, close fix-pty-scrollback as duplicate, else investigate vt100 grid/scrollback independently.
- 2026-04-29T11:50:03.278403203+00:00 RECOMMENDED FIX (in priority order): (1) Cache tab-bar inputs in VizApp - populate cached_chat_tab_entries / cached_user_board_entries inside maybe_refresh() at state.rs:7035 where graph is already loaded once per tick. Render reads cache. Touches: render.rs:3054-3055, state.rs:13360,:13399,:13468. (2) Refactor those helpers + live_chat_count to take &Graph. (3) Audit and de-dupe the 24 other load_graph callsites in src/tui/viz_viewer/state.rs.
- 2026-04-29T11:50:03.295841077+00:00 Artifacts: .wg/output/diagnose-wg-tui/{findings.md, perf-top.txt, perf-callgraph.txt, wgtui.perf.data}. No source modifications made. Validation criteria from task description all addressed in findings.md.
- 2026-04-29T11:50:37.392946454+00:00 Filed follow-up: fix-wg-tui (depends on this task) with full implementation plan and validation criteria. Linked to findings.md artifact.
- 2026-04-29T11:50:37.432370506+00:00 Validated: all 5 task validation criteria addressed in findings.md - (1) profile captured (perf, 1349 samples, saved as artifacts), (2) hotspot identified (load_graph called from render::draw_chat_tab via active_tab_ids_and_labels + list_user_board_entries; ~55% CPU), (3) hypothesis - separate from fix-pty-scrollback but share render-rate amplifier, documented why, (4) recommended fix noted with file:line pointers + filed as fix-wg-tui, (5) no source modifications.
- 2026-04-29T11:50:41.293690945+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-04-29T11:53:47.008551334+00:00 PendingEval → Done (evaluator passed; downstream unblocks)
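For reference, a minimal sketch of the caching shape recommended in the 11:50:03 log entry. It is not the actual wg code: only the names `VizApp`, `maybe_refresh`, `cached_chat_tab_entries`, `cached_user_board_entries`, `load_graph`, and `graph.jsonl` come from the findings; every type, signature, and the way tab labels are derived is assumed for illustration.

```rust
struct Graph {
    tasks: Vec<String>, // stand-in for the parsed contents of graph.jsonl
}

// Stand-in for workgraph::parser::load_graph(): in the real code this reads
// and JSON-parses ~2.2 MB, so it must not run on every frame.
fn load_graph(_path: &str) -> Graph {
    Graph { tasks: vec!["diagnose-wg-tui".to_string()] }
}

struct VizApp {
    graph_path: String,
    cached_chat_tab_entries: Vec<(String, String)>, // (id, label)
    cached_user_board_entries: Vec<String>,
}

impl VizApp {
    // Runs once per poll tick (better still: only when graph.jsonl's mtime
    // changes), so the expensive parse happens at most once per tick.
    fn maybe_refresh(&mut self) {
        let graph = load_graph(&self.graph_path);
        self.cached_chat_tab_entries = graph
            .tasks
            .iter()
            .map(|id| (id.clone(), id.clone()))
            .collect();
        self.cached_user_board_entries = graph.tasks;
    }

    // Render path becomes pure in-memory reads: no file I/O, no serde_json.
    fn draw_chat_tab(&self) -> String {
        let tabs: Vec<String> = self
            .cached_chat_tab_entries
            .iter()
            .map(|(id, label)| format!("[{id}] {label}"))
            .collect();
        format!(
            "{} ({} board entries)",
            tabs.join(" | "),
            self.cached_user_board_entries.len()
        )
    }
}

fn main() {
    let mut app = VizApp {
        graph_path: ".wg/graph.jsonl".to_string(),
        cached_chat_tab_entries: Vec::new(),
        cached_user_board_entries: Vec::new(),
    };
    app.maybe_refresh();                 // once per tick
    println!("{}", app.draw_chat_tab()); // many times per tick, now cheap
}
```

Once the cache owns the single per-tick load, refactoring `active_tab_ids_and_labels` and `list_user_board_entries` to take `&Graph` (item 2 of the recommended fix) follows naturally, and the remaining `load_graph` callsites can be audited against the same pattern.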