coordinator-inotify-graph

Coordinator: inotify graph watch + safety timer (replace pure polling)

Metadata

Status: done
Assigned: agent-95
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-26T14:32:27.951357335+00:00
Started: 2026-04-26T19:43:09.182334474+00:00
Completed: 2026-04-26T20:41:44.544438113+00:00
Tags: eval-scheduled
Eval score: 0.04
└ blocking impact: 0.00
└ completeness: 0.00
└ coordination overhead: 0.20
└ correctness: 0.05
└ downstream usability: 0.00
└ efficiency: 0.00
└ intent fidelity: 0.09
└ style adherence: 0.05

Description

The coordinator currently polls the graph on a fixed interval (default 60s, also 30s and 10s in different places). Replace the pure-polling design with event-driven graph watching + a slower safety timer:

  • Primary trigger: filesystem watch on .wg/graph.jsonl (and any other coordinator-relevant files: .wg/service/*, .wg/agency/* if the coordinator reacts to those). New events wake the coordinator immediately.
  • Safety timer: fires every 30s (configurable) for work that isn't graph-change-driven — cycle_delay scheduling, agent heartbeat / timeout reaping, model registry refresh, compaction trigger checks, anything else time-based.
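Both triggers can share one loop: block on the watcher channel with the safety-timer interval as the timeout. A minimal std-only sketch of that shape (the real implementation would feed `notify` events into the channel and likely use an async `select!`; `Wake` and `run_loop` are hypothetical names):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Events that wake the coordinator loop.
#[derive(Debug)]
enum Wake {
    GraphChanged,
    Shutdown,
}

/// Block on the watcher channel with the safety-timer interval as the
/// timeout: an event means the graph changed; a timeout means it is
/// time for the periodic, non-graph-driven work.
fn run_loop(rx: Receiver<Wake>, safety_interval: Duration) -> (u32, u32) {
    let (mut graph_wakes, mut timer_ticks) = (0u32, 0u32);
    loop {
        match rx.recv_timeout(safety_interval) {
            // Re-read graph.jsonl, reschedule tasks, etc.
            Ok(Wake::GraphChanged) => graph_wakes += 1,
            Ok(Wake::Shutdown) => break,
            // Safety tick: heartbeat reaping, cycle_delay, compaction checks.
            Err(RecvTimeoutError::Timeout) => timer_ticks += 1,
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    (graph_wakes, timer_ticks)
}
```

The timeout-as-timer trick means no separate timer thread is needed; the safety interval simply bounds how long the loop can sleep between events.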

Requirements

  1. File watcher: use the notify crate (or notify-debouncer-mini for built-in debounce). Watch .wg/graph.jsonl and emit a 'graph changed' event.
  2. Debounce: a single wg add (or any wg command) can cause multiple writes within milliseconds. Coalesce events with a short debounce window (50–200ms) so the coordinator wakes once per logical change, not once per fsync.
  3. Self-write filtering: when the coordinator itself writes the graph (e.g. updating task status), don't wake itself. Either ignore writes that happen between 'I'm about to write' and 'I'm done writing', or rely on the debounce + idempotent loop body.
  4. Fallback when watcher unavailable: inotify isn't available everywhere (some NFS mounts, WSL1, certain remote/sandbox filesystems). Detect at startup, log one clear warning, fall back to a short poll (e.g. 5s).
  5. Config consolidation: the current config has three intervals — [coordinator] interval, [coordinator] poll_interval, [agent] interval. Audit what each one governs, document in the config schema, and where two can collapse into one (now that polling is the safety timer, not the primary trigger), collapse them. Default safety timer = 30s. Don't break existing configs — keep accepting the old keys with deprecation warnings.
  6. TUI responsiveness: when a user adds a task in one terminal, the TUI in another terminal should reflect it within a second. Verify this in manual smoke.
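notify-debouncer-mini provides requirement 2 out of the box; if hand-rolled instead, the coalescing semantics look roughly like this sketch (`debounced_recv` is a hypothetical helper):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Coalesce a burst of raw filesystem events into one logical wake:
/// after the first event arrives, keep draining the channel until it
/// has been quiet for `window` (e.g. 50-200ms), then report once.
fn debounced_recv<T>(rx: &Receiver<T>, window: Duration) -> Option<()> {
    rx.recv().ok()?; // block for the first raw event
    loop {
        match rx.recv_timeout(window) {
            Ok(_) => continue, // still inside the burst; keep absorbing
            Err(RecvTimeoutError::Timeout) => return Some(()), // quiet: one logical change
            Err(RecvTimeoutError::Disconnected) => return Some(()),
        }
    }
}
```

This also gives requirement 3 a cheap backstop: the coordinator's own write burst collapses into one wake, and an idempotent loop body makes that spurious wake harmless.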

Non-goals

  • Don't remove polling entirely — it stays as the fallback path.
  • Don't change the agent-spawning logic, only what triggers it.
  • Don't try to watch every file in .wg/ — start with graph.jsonl and add others only if a clear coordinator-relevant event is missed.

Files likely to touch (best guess from grep, implementer should verify)

  • src/service/coordinator.rs (or wherever the main loop lives) — replace the sleep(poll_interval) with a select! on (watcher_event, safety_tick, shutdown_signal).
  • src/config.rs — schema changes for consolidating intervals + adding watcher-related options (debounce_ms, fallback_poll_interval).
  • Cargo.toml — add notify (or notify-debouncer-mini).
  • Tests in tests/ for the new behavior.
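For the src/config.rs change, legacy-key resolution might look like this sketch (the precedence order is an assumption the implementer should confirm; `resolve_safety_timer` is a hypothetical helper):

```rust
use std::time::Duration;

/// Resolve the safety-timer interval from the new key plus the two
/// legacy keys. Assumed precedence: new `safety_timer`, then legacy
/// `[coordinator] poll_interval`, then legacy `[coordinator] interval`,
/// then the 30s default. Returns the duration plus any deprecation
/// warnings the caller should log.
fn resolve_safety_timer(
    safety_timer: Option<u64>,         // new key, seconds
    legacy_poll_interval: Option<u64>, // deprecated
    legacy_interval: Option<u64>,      // deprecated
) -> (Duration, Vec<String>) {
    let mut warnings = Vec::new();
    if legacy_poll_interval.is_some() {
        warnings.push("[coordinator] poll_interval is deprecated; use safety_timer".into());
    }
    if legacy_interval.is_some() {
        warnings.push("[coordinator] interval is deprecated; use safety_timer".into());
    }
    let secs = safety_timer
        .or(legacy_poll_interval)
        .or(legacy_interval)
        .unwrap_or(30);
    (Duration::from_secs(secs), warnings)
}
```

Keeping the legacy keys as plain fallbacks (rather than hard errors) satisfies requirement 5's "don't break existing configs" constraint.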

Edge cases to handle

  • Watcher process crashes mid-run → restart it once, then fall back to polling.
  • Repo on NFS / Docker volume / network filesystem → fallback path must work cleanly.
  • Multiple coordinators in different worktrees — don't wake on each other's graph writes (different .wg dirs, so naturally isolated, but verify).
  • Graph file missing at startup → wait for it to appear (don't crash); useful for wg init race.
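The watcher-unavailable and crash-restart cases both reduce to a try-then-fallback decision; a sketch with the watcher init injected as a closure so the failure path is unit-testable (`WakeSource` and `select_wake_source` are hypothetical names):

```rust
use std::time::Duration;

/// How the coordinator obtains its wake-up trigger.
#[derive(Debug, PartialEq)]
enum WakeSource {
    Watcher,           // inotify-style events on .wg/graph.jsonl
    Polling(Duration), // short fallback poll when the watcher can't start
}

/// Try to start the watcher; on failure (NFS, WSL1, sandboxed fs),
/// return the fallback poll mode plus the single warning to log.
/// `init` stands in for the real notify setup.
fn select_wake_source(
    init: impl Fn() -> Result<(), String>,
    fallback_poll: Duration,
) -> (WakeSource, Option<String>) {
    match init() {
        Ok(()) => (WakeSource::Watcher, None),
        Err(e) => (
            WakeSource::Polling(fallback_poll),
            Some(format!("file watcher unavailable ({e}); falling back to polling")),
        ),
    }
}
```

The mid-run crash case can reuse the same function: on watcher death, call it once more, and if the retry also fails, take the `Polling` branch for the rest of the run.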

Validation

  • Failing tests written first:
    • test_coordinator_wakes_on_graph_write (write to graph.jsonl while coordinator idle → coordinator processes within 200ms, well before safety timer)
    • test_coordinator_debounces_burst_writes (10 writes in 50ms → coordinator wakes ≤ 2 times, not 10)
    • test_coordinator_safety_timer_fires_with_no_graph_changes (no writes for 30s → safety timer triggers a loop iteration)
    • test_coordinator_falls_back_when_watcher_init_fails (inject failure → service still works, logs warning, polls at fallback interval)
    • test_config_legacy_poll_interval_accepted_with_deprecation_warning
  • Implementation makes all tests pass
  • cargo build + cargo test pass with no regressions
  • Manual smoke:
    • Start service. In another terminal, wg add 'foo'. Within 1s, the new task is visible in wg list AND the coordinator log shows a wake event.
    • Run wg tui in one pane, wg add 'bar' in another — TUI updates within a second.
    • Test on a tmpfs / NFS-mounted .wg dir if available (or simulate watcher failure) — confirm fallback poll engages.

Depends on

Required by

Log