coordinator-id-ghost — Workgraph live mirror

Metadata

Status	abandoned ‖ paused
Created	2026-04-26T15:15:41.448453417+00:00

## Description

Daemon boot calls `CoordinatorState::load_all(dir)`. When no `service/coordinator-state-N.json` files exist (fresh install or after `rm -rf .wg`), the legacy fallback at `src/commands/service/mod.rs:574-578` synthesizes a `(0, default)` entry. The daemon then spawns a Coordinator-0 supervisor for that entry, which formats the task id `.coordinator-0` and tries to run `wg spawn-task .coordinator-0`. No such task exists in the graph: the actual coordinator created via the TUI is `.coordinator-1` (or higher; `find_next_fresh_coordinator_id` skipped 0 because a chat dir for coordinator-0 existed once).

Symptom in daemon log:

```
[INFO] Coordinator-0: spawning via `wg spawn-task .coordinator-0` (executor=claude, model=None)
[ERROR] Coordinator-0: failed to spawn ... (os error 2)
[ERROR] Coordinator-0: 3 restarts in last 10 minutes, pausing for 584s
```

Net effect: every fresh `wg init` produces a ghost coordinator that burns the restart budget, and the user-created coordinator never actually gets a working supervisor (because the supervisor is bound to the ghost id, not the real task).

### Fix

1. **Don't synthesize a phantom coordinator from absence of state files.** The legacy fallback should only fire if there's evidence a coordinator-0 ever existed (e.g. `.wg/chat/coordinator-0/` dir, or a `.coordinator`/`.coordinator-0` task in the graph). No state file + no chat dir + no graph task → no coordinator. Empty list is the correct return.

2. **Tie supervisor lifecycle to graph state, not state files.** The daemon should derive 'which coordinators need supervisors' from `tasks().filter(coordinator-loop tag, status != Abandoned, !archived)`. State files are *overrides*, not the source of truth for existence.

3. **Add a defensive check in `subprocess_coordinator_loop`.** Before spawning, verify the task id exists in the graph. If not, log a clear error (`Coordinator-N orphaned: task .coordinator-N not in graph; supervisor exiting`) and exit the loop instead of restart-looping.
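
The graph-driven enumeration in item 2 could be sketched roughly as follows. This is a minimal illustration, not the project's code: the `Task` and `Status` types, the helper names, and the exact tag string `coordinator-loop` are assumptions made for the example, and the real graph API will differ.

```rust
/// Hypothetical, simplified task model; the real graph types will differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    Open,
    Done,
    Abandoned,
}

/// Hypothetical task record carrying only the fields the filter needs.
#[derive(Debug, Clone)]
pub struct Task {
    pub id: String,
    pub tags: Vec<String>,
    pub status: Status,
    pub archived: bool,
}

/// Extract N from a task id of the form ".coordinator-N".
pub fn coordinator_id(task_id: &str) -> Option<u32> {
    task_id.strip_prefix(".coordinator-")?.parse().ok()
}

/// Ids of coordinators that should get a supervisor, derived from the graph
/// alone: tagged as a coordinator loop, not abandoned, not archived.
pub fn coordinators_needing_supervisors(tasks: &[Task]) -> Vec<u32> {
    let mut ids: Vec<u32> = tasks
        .iter()
        .filter(|t| t.tags.iter().any(|tag| tag == "coordinator-loop"))
        .filter(|t| t.status != Status::Abandoned && !t.archived)
        .filter_map(|t| coordinator_id(&t.id))
        .collect();
    ids.sort_unstable();
    ids.dedup();
    ids
}
```

With this shape, an empty graph yields an empty list, so a fresh install would start no supervisors at all. State files would then only adjust settings for ids that the graph already reports.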

### Files to touch

- `src/commands/service/mod.rs` — fix `CoordinatorState::load_all` to not synthesize coordinator 0; or better, deprecate `load_all` in favor of a graph-driven enumeration in the daemon boot path.
- `src/commands/service/coordinator_agent.rs` — add the orphaned-task guard before spawn.
- Daemon boot logic (wherever `load_all` is consumed at boot) — switch to graph-driven coordinator enumeration.
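
One possible shape for the orphaned-task guard in `coordinator_agent.rs` is sketched below. The `PreSpawn` enum and `pre_spawn_check` function are hypothetical names introduced for illustration, and the set of graph task ids is assumed to be available to the supervisor in some form.

```rust
use std::collections::HashSet;

/// Hypothetical outcome of the check that runs before each spawn attempt.
#[derive(Debug, PartialEq, Eq)]
pub enum PreSpawn {
    /// The task exists in the graph; spawning may proceed.
    Proceed { task_id: String },
    /// The task is missing; the supervisor should log the message and exit.
    Orphaned { message: String },
}

/// Decide whether the supervisor for `coordinator_id` may spawn its task.
pub fn pre_spawn_check(coordinator_id: u32, graph_task_ids: &HashSet<String>) -> PreSpawn {
    let task_id = format!(".coordinator-{}", coordinator_id);
    if graph_task_ids.contains(&task_id) {
        PreSpawn::Proceed { task_id }
    } else {
        PreSpawn::Orphaned {
            message: format!(
                "Coordinator-{} orphaned: task {} not in graph; supervisor exiting",
                coordinator_id, task_id
            ),
        }
    }
}
```

In the supervisor loop, the `Orphaned` case would be logged at error level and followed by a return from the loop, so it never counts against the restart budget.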

## Validation

- [ ] Failing tests first:
  - `test_load_all_returns_empty_when_no_state_and_no_legacy`: ensures a fresh install doesn't synthesize Coordinator-0
  - `test_supervisor_exits_when_task_missing`: guard in `subprocess_coordinator_loop`
  - `test_daemon_boot_enumerates_coordinators_from_graph`: boot path picks up `.coordinator-N` tasks via tag scan, not state files
- [ ] Implementation makes tests pass
- [ ] `cargo build` and `cargo test` pass with no regressions
- [ ] Manual smoke (in scratch dir):
  - `rm -rf .wg && wg init -x claude && wg service start`
  - tail daemon.log: NO `Coordinator-0: spawning` lines, NO `failed to spawn` restart loop
  - Open `wg tui`, create coordinator named 'test'
  - Daemon log shows `Coordinator-1: subprocess running (pid X, executor=claude)` (matching the actual task `.coordinator-1`)
  - Send a chat message in TUI; coordinator responds
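
As a starting point for the first failing test, the sketch below uses a hypothetical stand-in, `load_all_ids`, that returns only coordinator ids. It assumes `dir` is the `.wg` directory and that the state files are named `service/coordinator-state-N.json` as described above. Only the chat-dir evidence is checked here; the graph-task evidence is left out to keep the example self-contained. The `scratch_dir` helper is likewise an assumption for the example.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for `CoordinatorState::load_all`, returning ids only.
pub fn load_all_ids(dir: &Path) -> io::Result<Vec<u32>> {
    let mut ids = Vec::new();
    let service = dir.join("service");
    if service.is_dir() {
        for entry in fs::read_dir(&service)? {
            let file_name = entry?.file_name();
            let file_name = file_name.to_string_lossy();
            // Accept only names of the form coordinator-state-N.json.
            let id = file_name
                .strip_prefix("coordinator-state-")
                .and_then(|rest| rest.strip_suffix(".json"))
                .and_then(|n| n.parse::<u32>().ok());
            if let Some(id) = id {
                ids.push(id);
            }
        }
    }
    // Legacy fallback: fire only when a coordinator-0 chat dir shows that
    // coordinator 0 existed at some point. No evidence means no coordinator.
    if ids.is_empty() && dir.join("chat").join("coordinator-0").is_dir() {
        ids.push(0);
    }
    ids.sort_unstable();
    Ok(ids)
}

/// Create an empty scratch directory under the system temp dir.
pub fn scratch_dir(label: &str) -> io::Result<PathBuf> {
    let dir = std::env::temp_dir().join(format!("wg-sketch-{}-{}", label, std::process::id()));
    if dir.exists() {
        fs::remove_dir_all(&dir)?;
    }
    fs::create_dir_all(&dir)?;
    Ok(dir)
}

#[test]
fn test_load_all_returns_empty_when_no_state_and_no_legacy() {
    let dir = scratch_dir("fresh").unwrap();
    assert!(load_all_ids(&dir).unwrap().is_empty());
    fs::remove_dir_all(&dir).unwrap();
}
```

Against the current behavior described above, the equivalent test on the real `load_all` would be expected to fail until the fallback is gated on evidence.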

Depends on

(none)

Required by

(none)

Log

2026-04-26T15:15:41.448277233+00:00 Task paused
2026-04-26T16:02:08.135038958+00:00 Task abandoned