coordinator-id-ghost

Coordinator id ghost: legacy fallback in service/mod.rs:574-578 creates phantom Coordinator-0 when no per-id state files exist

Metadata

Status: abandoned ‖ paused
Created: 2026-04-26T15:15:41.448453417+00:00

Description

Daemon boot calls CoordinatorState::load_all(dir). When no service/coordinator-state-N.json files exist (fresh install, or after rm -rf .wg), the legacy fallback at src/commands/service/mod.rs:574-578 synthesizes a (0, default) entry. The daemon then spawns a Coordinator-0 supervisor for that entry. The supervisor formats the task id .coordinator-0 and tries to run wg spawn-task .coordinator-0, but no such task exists in the graph: the coordinator actually created via the TUI is .coordinator-1 or higher, because find_next_fresh_coordinator_id skipped 0 (a chat dir for coordinator-0 existed once).

Symptom in daemon log:

[INFO] Coordinator-0: spawning via `wg spawn-task .coordinator-0` (executor=claude, model=None)
[ERROR] Coordinator-0: failed to spawn ... (os error 2)
[ERROR] Coordinator-0: 3 restarts in last 10 minutes, pausing for 584s

Net effect: every fresh wg init produces a ghost coordinator that burns the restart budget, and the user-created coordinator never actually gets a working supervisor (because the supervisor is bound to the ghost id, not the real task).
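
The mismatch described above can be modelled in a few lines. This is a minimal sketch, not the code at mod.rs:574-578: CoordinatorState is reduced to a placeholder, and load_all_legacy and ghost_ids are hypothetical names that take in-memory inputs instead of reading .wg.

```rust
use std::collections::BTreeSet;

/// Placeholder for the real per-coordinator state (its fields are not shown in this task).
#[derive(Debug, Default, Clone, PartialEq)]
struct CoordinatorState;

/// Hypothetical model of the legacy fallback: with no per-id state files,
/// a default entry for id 0 is synthesized.
fn load_all_legacy(state_file_ids: &[u32]) -> Vec<(u32, CoordinatorState)> {
    if state_file_ids.is_empty() {
        return vec![(0, CoordinatorState::default())];
    }
    state_file_ids
        .iter()
        .map(|&id| (id, CoordinatorState::default()))
        .collect()
}

/// Task id format used by the supervisor, as described in this task.
fn coordinator_task_id(id: u32) -> String {
    format!(".coordinator-{}", id)
}

/// Ids that would get a supervisor although their task is missing from the graph.
fn ghost_ids(loaded: &[(u32, CoordinatorState)], graph: &BTreeSet<String>) -> Vec<u32> {
    loaded
        .iter()
        .map(|(id, _)| *id)
        .filter(|id| !graph.contains(&coordinator_task_id(*id)))
        .collect()
}
```

With no state files and a graph that holds only .coordinator-1, ghost_ids(&load_all_legacy(&[]), &graph) reports id 0: a supervisor exists for a task that does not, while the real task has no supervisor at all.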

Fix

  1. Don't synthesize a phantom coordinator from the absence of state files. The legacy fallback should only fire if there's evidence a coordinator-0 ever existed (e.g. .wg/chat/coordinator-0/ dir, or a .coordinator/.coordinator-0 task in the graph). No state file + no chat dir + no graph task → no coordinator. Empty list is the correct return.

  2. Tie supervisor lifecycle to graph state, not state files. The daemon should derive 'which coordinators need supervisors' from tasks().filter(coordinator-loop tag, status != Abandoned, !archived). State files are overrides, not the source of truth for existence.

  3. Defensive check in subprocess_coordinator_loop: before spawning, verify the task id exists in the graph. If not, log a clear error ('Coordinator-N orphaned: task .coordinator-N not in graph; supervisor exiting') and exit the loop instead of restart-looping.
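
Items 1 and 3 above could be sketched roughly as follows. Everything here is illustrative: Evidence, coordinator_ids_to_load, SpawnDecision and check_before_spawn are hypothetical names, the inputs are passed in rather than read from disk, and '.coordinator/.coordinator-0' is read as 'either task id', which is an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical summary of what the daemon found at boot.
struct Evidence<'a> {
    state_file_ids: &'a [u32],
    /// True if a directory such as .wg/chat/coordinator-0/ exists.
    legacy_chat_dir_exists: bool,
    graph_task_ids: &'a BTreeSet<String>,
}

fn coordinator_task_id(id: u32) -> String {
    format!(".coordinator-{}", id)
}

/// Item 1: the legacy id 0 is returned only when something shows it once existed.
fn coordinator_ids_to_load(ev: &Evidence<'_>) -> Vec<u32> {
    if !ev.state_file_ids.is_empty() {
        return ev.state_file_ids.to_vec();
    }
    // Assumption: either `.coordinator` or `.coordinator-0` counts as a legacy task.
    let legacy_task_in_graph = ev.graph_task_ids.contains(".coordinator")
        || ev.graph_task_ids.contains(".coordinator-0");
    if ev.legacy_chat_dir_exists || legacy_task_in_graph {
        vec![0]
    } else {
        Vec::new() // no state file, no chat dir, no graph task: no coordinator
    }
}

#[derive(Debug, PartialEq)]
enum SpawnDecision {
    /// Task exists; carries the task id to pass to the spawn command.
    Spawn(String),
    /// Task is missing; carries the message to log before the loop exits.
    Orphaned(String),
}

/// Item 3: check the graph before spawning instead of restart-looping.
fn check_before_spawn(id: u32, graph_task_ids: &BTreeSet<String>) -> SpawnDecision {
    let task_id = coordinator_task_id(id);
    if graph_task_ids.contains(&task_id) {
        SpawnDecision::Spawn(task_id)
    } else {
        SpawnDecision::Orphaned(format!(
            "Coordinator-{} orphaned: task {} not in graph; supervisor exiting",
            id, task_id
        ))
    }
}
```

Returning a decision value rather than logging inside the guard keeps it testable without a running daemon; the supervisor loop would log the Orphaned message and return.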

Files to touch

  • src/commands/service/mod.rs — fix CoordinatorState::load_all to not synthesize coordinator 0; or better, deprecate load_all in favor of a graph-driven enumeration in the daemon boot path.
  • src/commands/service/coordinator_agent.rs — add the orphaned-task guard before spawn.
  • Daemon boot logic (wherever load_all is consumed at boot) — switch to graph-driven coordinator enumeration.
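
A graph-driven enumeration for the boot path might look like the sketch below. The Task and Status types are simplified stand-ins for the real graph types, and the tag name "coordinator-loop" and the .coordinator-N id format are taken from this task's text; the function names are hypothetical.

```rust
/// Simplified stand-in for the graph's task status; only Abandoned matters here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Open,
    Done,
    Abandoned,
}

/// Simplified stand-in for a graph task.
struct Task {
    id: String,
    tags: Vec<String>,
    status: Status,
    archived: bool,
}

const COORDINATOR_TAG: &str = "coordinator-loop";

/// Extracts N from a task id of the form `.coordinator-N`.
fn parse_coordinator_id(task_id: &str) -> Option<u32> {
    task_id.strip_prefix(".coordinator-")?.parse().ok()
}

/// Coordinators that need a supervisor: tagged, not abandoned, not archived.
fn coordinators_needing_supervisors(tasks: &[Task]) -> Vec<u32> {
    let mut ids: Vec<u32> = tasks
        .iter()
        .filter(|t| t.tags.iter().any(|tag| tag == COORDINATOR_TAG))
        .filter(|t| t.status != Status::Abandoned && !t.archived)
        .filter_map(|t| parse_coordinator_id(&t.id))
        .collect();
    ids.sort_unstable();
    ids.dedup();
    ids
}
```

An empty graph yields an empty list, so a fresh install starts no supervisors; state files would then only adjust the behaviour of ids that this scan already returned.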

Validation

  • Failing tests first:
    • test_load_all_returns_empty_when_no_state_and_no_legacy — ensures fresh install doesn't synthesize Coordinator-0
    • test_supervisor_exits_when_task_missing — guard in subprocess_coordinator_loop
    • test_daemon_boot_enumerates_coordinators_from_graph — boot path picks up .coordinator-N tasks via tag scan, not state files
  • Implementation makes tests pass
  • cargo build + cargo test pass with no regressions
  • Manual smoke (in scratch dir):
    • rm -rf .wg && wg init -x claude && wg service start
    • tail daemon.log: NO 'Coordinator-0: spawning' lines, NO 'failed to spawn' restart loop
    • Open wg tui, create coordinator named 'test'
    • Daemon log shows 'Coordinator-1: subprocess running (pid X, executor=claude)' (matching the actual task .coordinator-1)
    • Send a chat message in TUI; coordinator responds
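
The log checks in the manual smoke test could also be automated. The sketch below scans log text for the two patterns that must be absent and for the expected 'subprocess running' line; the function names are hypothetical, and the substrings are copied from the log lines quoted in this task.

```rust
/// Substrings that must not appear in daemon.log after a fresh init.
const FORBIDDEN: [&str; 2] = ["Coordinator-0: spawning", "failed to spawn"];

/// Returns every log line that contains a forbidden substring.
fn offending_lines(log: &str) -> Vec<&str> {
    log.lines()
        .filter(|line| FORBIDDEN.iter().any(|pattern| line.contains(*pattern)))
        .collect()
}

/// True if the log reports a running subprocess for the given coordinator id.
fn has_running_coordinator(log: &str, id: u32) -> bool {
    let needle = format!("Coordinator-{}: subprocess running", id);
    log.lines().any(|line| line.contains(&needle))
}
```

A check would read daemon.log with std::fs::read_to_string, then require that offending_lines is empty and that has_running_coordinator(&log, 1) holds once the coordinator has been created in the TUI.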

Depends on

Required by

Log