dispatcher-auto-respawns

Dispatcher auto-respawns chat agents even with no TUI active; need bulk purge that preserves graph

Metadata

Status: done
Assigned: agent-819
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Created: 2026-04-27T17:32:52.630148860+00:00
Started: 2026-04-27T18:40:56.371485079+00:00
Completed: 2026-04-27T19:04:40.895417058+00:00
Tags: eval-scheduled
Tokens: 7987344 in / 29730 out
Eval score: 0.02
└ blocking impact: 0.00
└ completeness: 0.00
└ constraint fidelity: 0.25
└ coordination overhead: 0.10
└ correctness: 0.00
└ downstream usability: 0.00
└ efficiency: 0.05
└ intent fidelity: 0.00
└ style adherence: 0.00

Description

The dispatcher's coordinator-supervisor loop auto-respawns chat agents (.chat-N / legacy .coordinator-N) on a back-off schedule whether or not anything is consuming them. Today (2026-04-27) we saw 4 supervisors crash-looping (Coordinator-0/1/3/4) with no TUI connected:

  • Coordinator-1 was even spawning against the deprecated .coordinator-1 task id (never migrated via wg migrate chat-rename) — every spawn fails immediately, supervisor retries forever.
  • Coordinator-3's spawn failed because the session lock was held by an already-running handler; the supervisor kept retrying until the back-off reached 599s.
  • Coordinator-4 (.chat-4) was a chat the user had explicitly wanted to retire (see task tui-cannot-retire) but the supervisor kept it alive.

User quote: 'why is it respawning??? lol. there is no active tui.' And: 'i'm about to go to a place where they should be purged but the graph left intact.'

Two problems to solve

1. Don't respawn chat agents when nothing is connected

The supervisor should only keep a chat handler alive when (a) the TUI is connected to it OR (b) there's queued user input waiting in the chat's inbox. With neither, the chat should idle out, not respawn forever.

This may need a 'last-consumer' timestamp on each chat (updated on TUI attach, IPC ping, or inbox write) and a supervisor rule: 'if last_consumer_at is more than N seconds old AND there are no pending inbox messages, do not respawn'.
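
A minimal sketch of that rule in Rust, to make the intended decision concrete. The names ChatLiveness, should_respawn and idle_after are hypothetical and do not come from the existing supervisor code; the real fields would live wherever the supervisor tracks per-chat state.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-chat liveness snapshot (names are illustrative only).
pub struct ChatLiveness {
    /// True while a TUI is attached to this chat.
    pub tui_connected: bool,
    /// Number of user messages waiting in the chat's inbox.
    pub pending_inbox: usize,
    /// Last TUI attach, IPC ping, or inbox write; None if nothing ever consumed the chat.
    pub last_consumer_at: Option<Instant>,
}

/// Returns true only when the supervisor should keep the handler alive.
pub fn should_respawn(chat: &ChatLiveness, now: Instant, idle_after: Duration) -> bool {
    // (a) a TUI is connected, or (b) queued user input is waiting.
    if chat.tui_connected || chat.pending_inbox > 0 {
        return true;
    }
    // Otherwise respawn only while the last consumer is still recent.
    match chat.last_consumer_at {
        Some(seen) => now.saturating_duration_since(seen) <= idle_after,
        // Never consumed and nothing queued: let the chat idle out.
        None => false,
    }
}
```

With this shape, a chat that has no consumer and an empty inbox is never respawned, which matches the idle-respawn test listed under Validation.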

2. Bulk-purge command that preserves graph

User needs a way to clean up coordinator/chat agents in one shot. Two acceptable shapes (user explicitly listed both):

A. wg service purge-chats — archives every coordinator supervisor (so no respawn), kills all live chat handler processes, but leaves the chat task nodes + their history (chat/<ref>/*.jsonl) on disk and in the graph. Reversible — user can run wg service create-chat to restart fresh, or restore from history later.

B. wg service archive-all-chats — same effect (perma-archived), maybe phrased as 'mark every chat as Done so the supervisor never touches them again.'

Recommend A as the named command, with B's semantics ('perma-archived') as the implementation. purge-chats is the user-facing verb; under the hood it archives every coordinator, kills handlers, and writes a sticky 'purged_at' marker so a daemon restart doesn't resurrect them.
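
A rough sketch of the purge primitive described above, assuming an in-memory view of each coordinator's state. CoordinatorState, PurgeOutcome, purge and purge_all are hypothetical names, and the field layout is an assumption rather than the current coordinator-state-N.json schema. The sketch only flips flags and reports which handler pids the caller should kill; it does not touch processes, task nodes, or history files.

```rust
/// Hypothetical in-memory form of one coordinator's state (assumed layout).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CoordinatorState {
    pub task_id: String,
    pub archived: bool,
    /// Sticky marker (unix seconds); once set, the coordinator stays purged.
    pub purged_at: Option<u64>,
    /// Pid of the live chat handler, if any.
    pub handler_pid: Option<u32>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PurgeOutcome {
    /// Newly purged; the caller should kill this pid if present.
    Purged { kill_pid: Option<u32> },
    /// Already purged earlier: a no-op, not an error.
    AlreadyPurged,
}

/// Archives one coordinator and writes the sticky marker. Idempotent.
pub fn purge(state: &mut CoordinatorState, now_unix: u64) -> PurgeOutcome {
    if state.purged_at.is_some() {
        return PurgeOutcome::AlreadyPurged;
    }
    state.archived = true;
    state.purged_at = Some(now_unix);
    // Hand the pid back to the caller; this function kills nothing itself.
    PurgeOutcome::Purged { kill_pid: state.handler_pid.take() }
}

/// Purges every coordinator and returns the handler pids left to kill.
pub fn purge_all(states: &mut [CoordinatorState], now_unix: u64) -> Vec<u32> {
    let mut pids = Vec::new();
    for state in states.iter_mut() {
        if let PurgeOutcome::Purged { kill_pid: Some(pid) } = purge(state, now_unix) {
            pids.push(pid);
        }
    }
    pids
}
```

Running purge_all a second time returns an empty list and leaves every purged_at value unchanged, which is the idempotency the command needs.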

Additional requirements

  • Idempotent — re-running purge when already-purged is a no-op, not an error
  • Restart-survival — once archived/purged, a dispatcher restart MUST NOT re-spawn the coordinator (today's symptom suggests the supervisor reads coordinator-state-N.json and respawns). Either delete those files on archive, or have the supervisor honor an archived/purged flag in the file.
  • Migrate stale ids — Coordinator-1 spawning .coordinator-1 is a separate latent bug; wg migrate chat-rename should run automatically (or the supervisor should refuse to spawn against a non-existent task and self-archive).
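
The restart-survival and stale-id requirements can be read as a single decision the supervisor makes before any spawn. A minimal sketch under that assumption follows; StartupAction and startup_action are hypothetical names, and the archived/purged inputs stand in for whatever flag ends up in the state file.

```rust
use std::collections::HashSet;

/// What the supervisor does with one coordinator-state entry on startup.
#[derive(Debug, PartialEq, Eq)]
pub enum StartupAction {
    /// Archived or purged: never respawn, even after a dispatcher restart.
    Skip,
    /// The referenced task id is not in the graph: archive instead of crash-looping.
    SelfArchive,
    /// Healthy entry: a spawn is allowed.
    Spawn,
}

/// Decides the startup action from the sticky flags and the set of known task ids.
pub fn startup_action(
    archived: bool,
    purged: bool,
    task_id: &str,
    graph_task_ids: &HashSet<String>,
) -> StartupAction {
    // Sticky flags win over everything else.
    if archived || purged {
        return StartupAction::Skip;
    }
    // Refuse to spawn against a task that does not exist (e.g. a stale id).
    if !graph_task_ids.contains(task_id) {
        return StartupAction::SelfArchive;
    }
    StartupAction::Spawn
}
```

Checking the flags first means a purged coordinator is skipped even if its task id is also stale, so it is not re-archived on every restart.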

Files likely to touch

  • src/commands/service/ — supervisor loop, archive-chat command, new purge-chats command
  • src/commands/service/coordinator.rs (or whatever lives in the supervisor file path post-rename) — respawn logic, idle-detection
  • .wg/service/coordinator-state-N.json schema — add 'purged' / 'archived' sticky flag, or remove file entirely on archive
  • src/commands/migrate/ — auto-run chat-rename if stale .coordinator-N ids are encountered
  • TUI — the existing tui-cannot-retire task can hook into the same purge primitive

Validation

  • Failing test first: spawn 3 chat coordinators, run wg service purge-chats, restart the dispatcher, assert (a) zero coordinator handler processes alive, (b) zero supervisor respawn attempts in the log over 30s, (c) chat tasks still exist in the graph with status Done/archived, (d) chat history files still exist
  • Failing test for idle-respawn-rule: spawn a chat with no consumer, no inbox messages → assert supervisor does NOT respawn within N seconds
  • Failing test for stale-id self-archive: create a coordinator-state file referencing a non-existent task id → supervisor archives itself instead of crash-looping
  • Implementation makes all tests pass
  • cargo build + cargo test pass with no regressions
  • Manual smoke: with no TUI connected, wg service purge-chats then wg service restart → no Coordinator-N respawn errors in daemon.log; wg list still shows .chat-N tasks (archived); chat history dirs intact

Depends on

Required by

Log