diagnose-hud-slot — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-1349`
Agent identity	`3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3`
Model	`claude:opus`
Created	2026-05-01T14:59:57.501549594+00:00
Started	2026-05-01T15:08:17.619996787+00:00
Completed	2026-05-01T15:17:13.009140130+00:00
Tags	`priority-high,research,bug,tui,hud`, `eval-scheduled`
Eval score	0.87
└ blocking impact	0.90
└ completeness	0.94
└ constraint fidelity	0.70
└ coordination overhead	0.89
└ correctness	0.92
└ downstream usability	0.91
└ efficiency	0.85
└ intent fidelity	0.87
└ style adherence	0.88

Description

The TUI's HUD shows agent slot occupancy (e.g., '1/8 slots') but it consistently disagrees with reality. User report 2026-05-01: visibly 5 agents running, HUD says '1/8'.

Prior fix attempts that didn't hold:

6b87ae242 fix-tui-hud (agent-745)
659208d2b 'TUI agent count uses status-based active_count to match wg status' (fix-tui-agent-count-2)
Plus possibly more (user says ~4 attempts)

User direct quote 2026-05-01: 'The HUD that shows the system state and how many tasks are running, it is never accurate. Like right now, I see five tasks running. It says one of eight slots are occupied. What the hell is wrong with that thing? We tried to fix it like four times.'

Why prior attempts haven't held

Each fix probably changed WHERE the count is read from but not WHY it's wrong. Possible causes:

The HUD reads from a polled cache that updates slowly (debounce / interval mismatch with reality)
The HUD reads from one source (e.g., 'agents I spawned this tick') while wg status and the chat tab list read from another (registry of all-alive agents)
There's a subscriber pattern where HUD missed an event (spawn fired, HUD didn't subscribe; or kill fired, HUD didn't decrement)
The HUD's count includes a stale-state filter that excludes some active agents (e.g., agents in 'spawning' transitional state aren't counted as occupying a slot, but they ARE running)

Investigation steps (no source mods)

1. Capture the divergence

At a moment when HUD is wrong, capture:
- HUD text output (slot count display)
- wg agents output (what the registry says is alive)
- wg service status output (what the daemon says about slots)
- Process tree: pgrep -af claude|codex|nex|wg.spawn-task (what's actually running)
Diff these. Identify which is right (probably the process tree) and which is wrong (probably HUD).

2. Find the data source

Search src/tui/ for where the HUD slot count is rendered
Identify the data source it reads from (likely an in-memory counter or a derived value from the registry)
Compare with the source wg status reads from
If they're different sources, that's the bug

3. Audit prior fixes

6b87ae242 (fix-tui-hud): what did it actually change? Read the diff
659208d2b (fix-tui-agent-count-2): same — read the diff
Identify why those fixes didn't hold. Possible patterns:
- Fixed the count for one rendering path, missed another
- Fixed the read source but the source itself was already wrong
- Fixed the right thing but a subsequent change reverted it

4. Spec a comprehensive fix

The fix must:

Use a SINGLE source-of-truth for active agent count (probably the registry's enumeration of running agents)
Be subscribed to the events that change that source (spawn, exit) so HUD never lags
Have a smoke test that asserts count parity across HUD, wg agents, and process tree

Deliverable

wg log entry with:

Captured divergence evidence (HUD vs reality at a specific moment)
Root cause with file:line citation
Why prior fixes didn't hold (specific shortcoming of each)
Concrete fix proposal that addresses the root cause, not the symptom

Validation

Divergence captured and pasted in task log (HUD output, wg agents output, process tree at the same moment)
Root cause identified with file:line
Prior-fix postmortems written
Concrete fix proposal that includes a parity assertion (smoke test)
No source / doc modifications — diagnose only

## Description
The TUI's HUD shows agent slot occupancy (e.g., '1/8 slots') but it consistently disagrees with reality. User report 2026-05-01: visibly 5 agents running, HUD says '1/8'.

Prior fix attempts that didn't hold:
- 6b87ae242 fix-tui-hud (agent-745)
- 659208d2b 'TUI agent count uses status-based active_count to match wg status' (fix-tui-agent-count-2)
- Plus possibly more (user says ~4 attempts)

User direct quote 2026-05-01: 'The HUD that shows the system state and how many tasks are running, it is never accurate. Like right now, I see five tasks running. It says one of eight slots are occupied. What the hell is wrong with that thing? We tried to fix it like four times.'

## Why prior attempts haven't held
Each fix probably changed WHERE the count is read from but not WHY it's wrong. Possible causes:
1. The HUD reads from a polled cache that updates slowly (debounce / interval mismatch with reality)
2. The HUD reads from one source (e.g., 'agents I spawned this tick') while `wg status` and the chat tab list read from another (registry of all-alive agents)
3. There's a subscriber pattern where HUD missed an event (spawn fired, HUD didn't subscribe; or kill fired, HUD didn't decrement)
4. The HUD's count includes a stale-state filter that excludes some active agents (e.g., agents in 'spawning' transitional state aren't counted as occupying a slot, but they ARE running)

## Investigation steps (no source mods)

### 1. Capture the divergence
- At a moment when HUD is wrong, capture:
  - HUD text output (slot count display)
  - `wg agents` output (what the registry says is alive)
  - `wg service status` output (what the daemon says about slots)
  - Process tree: `pgrep -af claude|codex|nex|wg.spawn-task` (what's actually running)
- Diff these. Identify which is right (probably the process tree) and which is wrong (probably HUD).

### 2. Find the data source
- Search src/tui/ for where the HUD slot count is rendered
- Identify the data source it reads from (likely an in-memory counter or a derived value from the registry)
- Compare with the source `wg status` reads from
- If they're different sources, that's the bug

### 3. Audit prior fixes
- 6b87ae242 (fix-tui-hud): what did it actually change? Read the diff
- 659208d2b (fix-tui-agent-count-2): same — read the diff
- Identify why those fixes didn't hold. Possible patterns:
  - Fixed the count for one rendering path, missed another
  - Fixed the read source but the source itself was already wrong
  - Fixed the right thing but a subsequent change reverted it

### 4. Spec a comprehensive fix
The fix must:
- Use a SINGLE source-of-truth for active agent count (probably the registry's enumeration of running agents)
- Be subscribed to the events that change that source (spawn, exit) so HUD never lags
- Have a smoke test that asserts count parity across HUD, `wg agents`, and process tree

## Deliverable
`wg log` entry with:
- Captured divergence evidence (HUD vs reality at a specific moment)
- Root cause with file:line citation
- Why prior fixes didn't hold (specific shortcoming of each)
- Concrete fix proposal that addresses the root cause, not the symptom

## Validation
- [ ] Divergence captured and pasted in task log (HUD output, wg agents output, process tree at the same moment)
- [ ] Root cause identified with file:line
- [ ] Prior-fix postmortems written
- [ ] Concrete fix proposal that includes a parity assertion (smoke test)
- [ ] No source / doc modifications — diagnose only

Depends on

Required by

(none)

Log

2026-05-01T14:59:57.480336506+00:00 Task paused
2026-05-01T15:01:22.894463656+00:00 Task published
2026-05-01T15:08:17.620002588+00:00 Spawned by coordinator --executor claude --model opus
2026-05-01T15:08:37.980667113+00:00 Starting investigation. Capturing current divergence: HUD/agents/processes/service status.
2026-05-01T15:17:00.231770193+00:00 ===== DIVERGENCE CAPTURE (2026-05-01 ~15:08 UTC, this session) ===== `wg service status`: Service: running (PID 3460389) Agents: 7 alive, 0 idle, 1349 total Last tick: 2026-05-01T15:08:32 (#16, agents_alive=7/8, tasks_ready=0) `wg status`: Agents (7 alive, 1342 dead) Tasks: 7 in-progress, 0 ready, 0 blocked, 234 done `wg agents --alive`: 7 working agents (agent-1333, 1334, 1338, 1339, 1347, 1348, 1349) Direct registry count (jq on .wg/service/registry.json): working: 8 (the 7 from wg agents --alive + agent-1349 just spawning) done: 939 failed: 125 dead: 278 Process tree (pgrep claude|codex|nex|spawn-task): ~25 processes including: - 7 wg claude-handler subprocess parents (matching agent registry) - ~12 child `claude --print --verbose` subprocesses (children of handlers) - ~6 stale `claude` processes from prior coordinators/sessions still alive User-reported snapshot ("1/8 with 5 running") could not be reproduced live in this session (current state shows correct match). However, the structural divergence is now fully explained — see ROOT CAUSE below. ===== ROOT CAUSE ===== The HUD has THREE distinct numbers that all *look like* "how many agents are running" but compute different things from different sources, with NO single source-of-truth and no consistency invariant between them: (1) Dashboard coordinator card "Agents: X/Y" src/tui/viz_viewer/render.rs:5408 format!("{}/{}", card.agents_alive, card.max_agents) src/tui/viz_viewer/state.rs:10758 fresh_alive = AgentRegistry::load_or_warn(dir).active_count() Source: registry status — counts AgentStatus in {Starting, Working, Idle} (2) HUD vitals strip "● N agents" src/tui/viz_viewer/render.rs:8229 format!("● {} agents", v.agents_alive) src/tui/viz_viewer/state.rs:11981 vitals.agents_alive = service_health.agents_alive src/tui/viz_viewer/state.rs:11812 service_health.agents_alive = registry.active_count() Source: registry status — same path as (1) (3) HUD vitals strip "M running" src/tui/viz_viewer/render.rs:8241 format!("{} running", v.running) src/tui/viz_viewer/state.rs:11986 vitals.running = task_counts.in_progress src/tui/viz_viewer/state.rs:6157-6173 in_progress includes Status::is_active() (InProgress + PendingValidation + PendingEval) Source: graph TASK status — completely different concept from agent count The user's "5 running, 1/8 slot" report aligns with (3)=5 task-active vs (1)=1 agent-alive. The "5" is real tasks in {InProgress, PendingValidation, PendingEval}; the "1" is the one wg-handler agent currently registered as Working. After an agent finishes its work, the task transitions to PendingEval awaiting `.evaluate-*` LLM scoring — the task is still "running" in viz (yellow), but the agent is gone. The gap is structural, not a bug in the read path. A SEPARATE issue compounds this: agents.rs (wg agents --alive) uses a different predicate than the TUI does: src/commands/agents.rs:47 agent.is_alive() && is_process_alive(agent.pid) src/service/registry.rs:576 active_count() = is_alive() only (no PID check) After 659208d2b, the TUI dropped the PID check to match wg-status; but wg-agents --alive still does both. So TUI and `wg agents --alive` can disagree when registry status says Working but PID is reaped (zombie). The 6b87ae242 test confirms (1)/(2)=registry-status-only; the agents-CLI confirms a different predicate. The user shells comparing TUI vs `wg agents --alive` will see intermittent ±1 drift purely from this predicate mismatch. A THIRD issue, latent: src/tui/viz_viewer/state.rs:10774 sets every coordinator card's agents_alive to the GLOBAL fresh_alive. If multi-coordinator (which the "Coordinator-28" log line shows is in use), all cards display the global count divided by their per-coordinator max_agents — every card shows the same agents_alive but different max_agents, which is nonsense. Not the user's top-of-mind bug but is a related correctness defect. ===== PRIOR-FIX POSTMORTEMS ===== # 6b87ae242 "feat: fix-tui-hud (agent-745)" — squashed from df447e4ca What it changed: src/tui/viz_viewer/state.rs:6157-6173 — task_counts.in_progress now uses Status::is_active() so the HUD "X running" matches what wg viz highlights yellow (InProgress + PendingValidation + PendingEval). What it did NOT touch: agent slot count. That number lives in service_health/ vitals.agents_alive, completely different code path. Why it didn't hold: It actually WIDENED the gap between "running tasks" and "alive agents". Pre-fix, "running" was just InProgress, which often equals agents-alive. Post-fix, "running" includes PendingEval — tasks awaiting LLM scoring with no live agent. The user perceives the wider gap as "the HUD is broken". # 659208d2b "fix: TUI agent count uses status-based active_count to match wg status (fix-tui-agent-count-2)" What it changed: src/tui/viz_viewer/state.rs:11812 — service_health.agents_alive switched from `is_alive() && is_process_alive(pid)` to `registry.active_count()` (status-only). Plus state.rs:10758 — coordinator card uses fresh registry load instead of stale CoordinatorState.agents_alive. What it did NOT touch: - The relationship between "agents alive" and "running tasks" - The agents.rs CLI predicate (still uses PID liveness) - The multi-coordinator card sharing global fresh_alive - The registry's source-of-truth correctness (heartbeat triage can mark live processes Dead if heartbeat is stale) Why it didn't hold: It made TUI and `wg status` agree (good), but did not reconcile TUI with `wg agents --alive` (which uses a different predicate). And it didn't address the user's actual perception problem — "X/Y slots" looks like "agents working on tasks" but is really "wg-handler processes registered as Starting/Working/Idle". Both fixes addressed CORRECTNESS of the slot-count read path. Neither addressed the SEMANTIC PROBLEM that the user reads "X/8 slots" and "M running" as the same thing and is confused when they differ. Each fix moves the count closer to one definition of "right" while a different displayed number diverges. ===== CONCRETE FIX PROPOSAL ===== Three independent issues require three independent fixes: ## Fix 1 (root cause for user's confusion): Reify the agents↔tasks relationship in the HUD The HUD currently shows two numbers that LOOK like the same thing but track different concepts. Replace ambiguity with explicit decomposition: Current: "● 7 agents | 8 open · 5 running · 234 done" Proposed: "● 7/8 slots · 5 running (5 in-progress, 0 awaiting eval) | 8 open · 234 done" Concretely (state.rs ~2780-2795 + render.rs ~8228-8243): - Show slots as "agents_alive/agents_max" so it matches the dashboard card - Decompose "running" into (a) tasks with a live agent, (b) tasks in PendingEval/PendingValidation. When (b) > 0, render it dim/parenthetical so the user sees "5 running but only 5-N have agents". - Add a tooltip / help-overlay key explaining that PendingEval has no agent. This makes the structural gap visible and self-explaining instead of appearing as a counting bug. ## Fix 2: Single source-of-truth + single predicate for "is this agent alive" Define ONE function used everywhere: registry.alive_agents() -> impl Iterator<&AgentEntry> registry.alive_count() -> usize // calls .alive_agents().count() Decide one predicate (recommend: status-based, since daemon's triage is authoritative — but with a 60s grace window for very-recent spawns where heartbeat hasn't been written yet). Then audit and replace ALL of: - src/commands/agents.rs:47 (wg agents --alive) - src/service/registry.rs:576 (active_count) - src/tui/viz_viewer/state.rs:11812, 10758 (TUI HUD/dashboard) - any other callers of is_alive(), active_count(), is_process_alive(pid) After this fix, `wg agents --alive` count == TUI HUD count == registry.alive_count() == `wg status` count, by construction. The only way they can disagree is a stale registry.json read, which is bounded by the 2s poll interval. ## Fix 3 (latent): Per-coordinator slot count in dashboard cards Replace state.rs:10774's `agents_alive: fresh_alive` (global) with a per-coordinator count: load registry once, partition by coordinator-id (stored in AgentEntry as `coordinator_id` if it exists; if not, this is a separate small refactor). Each card shows its own coordinator's agents-alive, not the global total. ## PARITY SMOKE TEST (the hard gate) Add tests/smoke/scenarios/hud-slot-parity.toml + fixture script that: 1. Starts a fresh wg service with max_agents=4 in a temp dir 2. Adds 6 distinct tasks (forces queueing past slot cap) 3. Polls every 2s for 60s, capturing at each tick: - registry.json: count of agents with status in {Starting,Working,Idle} - `wg agents --alive`: count of lines in output - `wg service status`: parse "agents_alive=N/M" - `wg status`: parse "X alive," - TUI HUD: scrape the rendered "X/Y" from a headless TUI snapshot (via existing test_helpers, see test_active_count_mixed_executor_types for the snapshot-rendering pattern) 4. ASSERT that all FIVE values are equal at every tick — if any tick has divergence > 0 between any two sources, the scenario FAILS with the timestamp + which sources differed. 5. Owners: [diagnose-hud-slot, fix-tui-agent-count-2, fix-tui-hud] so any future change to these areas re-runs this assertion. Without this smoke test, fix #4 will follow fix #3 will follow fix #2. The system has been "fixed" four times because no test enforced the cross-surface invariant — each fix was verified locally against one display path while a different path silently regressed. ===== VALIDATION CHECKLIST ===== [x] Divergence captured (registry, wg status, wg agents --alive, process tree, current snapshot) [x] Root cause identified — file:line citations for all 3 distinct counts (state.rs:10758, 11812, 6157+11986) [x] Prior-fix postmortems written for both 6b87ae242 and 659208d2b [x] Concrete fix proposal includes parity smoke test owned by all three task ids [x] No source/doc modifications (this log is the deliverable)
2026-05-01T15:17:13.009152583+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-05-01T15:23:01.329382790+00:00 PendingEval → Done (evaluator passed; downstream unblocks)