coordinator-inotify-graph

Coordinator: inotify graph watch + safety timer (replace pure polling)

Metadata

Status: done
Assigned: agent-95
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-26T14:32:27.951357335+00:00
Started: 2026-04-26T19:43:09.182334474+00:00
Completed: 2026-04-26T20:41:44.544438113+00:00
Tags: eval-scheduled
Eval score: 0.04
└ blocking impact: 0.00
└ completeness: 0.00
└ coordination overhead: 0.20
└ correctness: 0.05
└ downstream usability: 0.00
└ efficiency: 0.00
└ intent fidelity: 0.09
└ style adherence: 0.05

Description

The coordinator currently polls the graph on a fixed interval (default 60s, also 30s and 10s in different places). Replace the pure-polling design with event-driven graph watching + a slower safety timer:

  • Primary trigger: filesystem watch on .wg/graph.jsonl (and any other coordinator-relevant files: .wg/service/*, .wg/agency/* if the coordinator reacts to those). New events wake the coordinator immediately.
  • Safety timer: fires every 30s (configurable) for work that isn't graph-change-driven — cycle_delay scheduling, agent heartbeat / timeout reaping, model registry refresh, compaction trigger checks, anything else time-based.
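Both triggers can share one loop: block on the watcher channel with the safety-timer interval as the timeout. A minimal std-only sketch of that shape (the real implementation would feed `notify` events into the channel and likely use an async `select!`; `Wake` and `run_loop` are hypothetical names):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Events that wake the coordinator loop.
#[derive(Debug)]
enum Wake {
    GraphChanged,
    Shutdown,
}

/// Block on the watcher channel with the safety-timer interval as the
/// timeout: an event means the graph changed; a timeout means it is
/// time for the periodic, non-graph-driven work.
fn run_loop(rx: Receiver<Wake>, safety_interval: Duration) -> (u32, u32) {
    let (mut graph_wakes, mut timer_ticks) = (0u32, 0u32);
    loop {
        match rx.recv_timeout(safety_interval) {
            // Re-read graph.jsonl, reschedule tasks, etc.
            Ok(Wake::GraphChanged) => graph_wakes += 1,
            Ok(Wake::Shutdown) => break,
            // Safety tick: heartbeat reaping, cycle_delay, compaction checks.
            Err(RecvTimeoutError::Timeout) => timer_ticks += 1,
            Err(RecvTimeoutError::Disconnected) => break,
        }
    }
    (graph_wakes, timer_ticks)
}
```

The timeout-as-timer trick means no separate timer thread is needed; the safety interval simply bounds how long the loop can sleep between events.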

Requirements

  1. File watcher: use the notify crate (or notify-debouncer-mini for built-in debounce). Watch .wg/graph.jsonl and emit a 'graph changed' event.
  2. Debounce: a single wg add (or any wg command) can cause multiple writes within milliseconds. Coalesce events with a short debounce window (50–200ms) so the coordinator wakes once per logical change, not once per fsync.
  3. Self-write filtering: when the coordinator itself writes the graph (e.g. updating task status), don't wake itself. Either ignore writes that happen between 'I'm about to write' and 'I'm done writing', or rely on the debounce + idempotent loop body.
  4. Fallback when watcher unavailable: inotify isn't available everywhere (some NFS mounts, WSL1, certain remote/sandbox filesystems). Detect at startup, log one clear warning, fall back to a short poll (e.g. 5s).
  5. Config consolidation: the current config has three intervals — [coordinator] interval, [coordinator] poll_interval, [agent] interval. Audit what each one governs, document in the config schema, and where two can collapse into one (now that polling is the safety timer, not the primary trigger), collapse them. Default safety timer = 30s. Don't break existing configs — keep accepting the old keys with deprecation warnings.
  6. TUI responsiveness: when a user adds a task in one terminal, the TUI in another terminal should reflect it within a second. Verify this in manual smoke.
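notify-debouncer-mini provides requirement 2 out of the box; if hand-rolled instead, the coalescing semantics look roughly like this sketch (`debounced_recv` is a hypothetical helper):

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Coalesce a burst of raw filesystem events into one logical wake:
/// after the first event arrives, keep draining the channel until it
/// has been quiet for `window` (e.g. 50-200ms), then report once.
fn debounced_recv<T>(rx: &Receiver<T>, window: Duration) -> Option<()> {
    rx.recv().ok()?; // block for the first raw event
    loop {
        match rx.recv_timeout(window) {
            Ok(_) => continue, // still inside the burst; keep absorbing
            Err(RecvTimeoutError::Timeout) => return Some(()), // quiet: one logical change
            Err(RecvTimeoutError::Disconnected) => return Some(()),
        }
    }
}
```

This also gives requirement 3 a cheap backstop: the coordinator's own write burst collapses into one wake, and an idempotent loop body makes that spurious wake harmless.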

Non-goals

  • Don't remove polling entirely — it stays as the fallback path.
  • Don't change the agent-spawning logic, only what triggers it.
  • Don't try to watch every file in .wg/ — start with graph.jsonl and add others only if a clear coordinator-relevant event is missed.

Files likely to touch (best guess from grep, implementer should verify)

  • src/service/coordinator.rs (or wherever the main loop lives) — replace the sleep(poll_interval) with a select! on (watcher_event, safety_tick, shutdown_signal).
  • src/config.rs — schema changes for consolidating intervals + adding watcher-related options (debounce_ms, fallback_poll_interval).
  • Cargo.toml — add notify (or notify-debouncer-mini).
  • Tests in tests/ for the new behavior.
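For the src/config.rs change, legacy-key resolution might look like this sketch (the precedence order is an assumption the implementer should confirm; `resolve_safety_timer` is a hypothetical helper):

```rust
use std::time::Duration;

/// Resolve the safety-timer interval from the new key plus the two
/// legacy keys. Assumed precedence: new `safety_timer`, then legacy
/// `[coordinator] poll_interval`, then legacy `[coordinator] interval`,
/// then the 30s default. Returns the duration plus any deprecation
/// warnings the caller should log.
fn resolve_safety_timer(
    safety_timer: Option<u64>,         // new key, seconds
    legacy_poll_interval: Option<u64>, // deprecated
    legacy_interval: Option<u64>,      // deprecated
) -> (Duration, Vec<String>) {
    let mut warnings = Vec::new();
    if legacy_poll_interval.is_some() {
        warnings.push("[coordinator] poll_interval is deprecated; use safety_timer".into());
    }
    if legacy_interval.is_some() {
        warnings.push("[coordinator] interval is deprecated; use safety_timer".into());
    }
    let secs = safety_timer
        .or(legacy_poll_interval)
        .or(legacy_interval)
        .unwrap_or(30);
    (Duration::from_secs(secs), warnings)
}
```

Keeping the legacy keys as plain fallbacks (rather than hard errors) satisfies requirement 5's "don't break existing configs" constraint.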

Edge cases to handle

  • Watcher process crashes mid-run → restart it once, then fall back to polling.
  • Repo on NFS / Docker volume / network filesystem → fallback path must work cleanly.
  • Multiple coordinators in different worktrees — don't wake on each other's graph writes (different .wg dirs, so naturally isolated, but verify).
  • Graph file missing at startup → wait for it to appear (don't crash); useful for wg init race.
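The watcher-unavailable and crash-restart cases both reduce to a try-then-fallback decision; a sketch with the watcher init injected as a closure so the failure path is unit-testable (`WakeSource` and `select_wake_source` are hypothetical names):

```rust
use std::time::Duration;

/// How the coordinator obtains its wake-up trigger.
#[derive(Debug, PartialEq)]
enum WakeSource {
    Watcher,           // inotify-style events on .wg/graph.jsonl
    Polling(Duration), // short fallback poll when the watcher can't start
}

/// Try to start the watcher; on failure (NFS, WSL1, sandboxed fs),
/// return the fallback poll mode plus the single warning to log.
/// `init` stands in for the real notify setup.
fn select_wake_source(
    init: impl Fn() -> Result<(), String>,
    fallback_poll: Duration,
) -> (WakeSource, Option<String>) {
    match init() {
        Ok(()) => (WakeSource::Watcher, None),
        Err(e) => (
            WakeSource::Polling(fallback_poll),
            Some(format!("file watcher unavailable ({e}); falling back to polling")),
        ),
    }
}
```

The mid-run crash case can reuse the same function: on watcher death, call it once more, and if the retry also fails, take the `Polling` branch for the rest of the run.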

Validation

  • Failing tests written first:
    • test_coordinator_wakes_on_graph_write (write to graph.jsonl while coordinator idle → coordinator processes within 200ms, well before safety timer)
    • test_coordinator_debounces_burst_writes (10 writes in 50ms → coordinator wakes ≤ 2 times, not 10)
    • test_coordinator_safety_timer_fires_with_no_graph_changes (no writes for 30s → safety timer triggers a loop iteration)
    • test_coordinator_falls_back_when_watcher_init_fails (inject failure → service still works, logs warning, polls at fallback interval)
    • test_config_legacy_poll_interval_accepted_with_deprecation_warning
  • Implementation makes all tests pass
  • cargo build + cargo test pass with no regressions
  • Manual smoke:
    • Start service. In another terminal, wg add 'foo'. Within 1s, the new task is visible in wg list AND the coordinator log shows a wake event.
    • Run wg tui in one pane, wg add 'bar' in another — TUI updates within a second.
    • Test on a tmpfs / NFS-mounted .wg dir if available (or simulate watcher failure) — confirm fallback poll engages.

Depends on

Required by

Log