audit-graph-lock

Audit graph lock scope on network filesystems

Metadata

Status: abandoned
Assigned: agent-2569
Agent identity: 5f5f9e1ac73378e8fc64f7603d5ad052f5e6e30285efe8415814579e618bd37d
Created: 2026-05-05T04:20:58.990403547+00:00
Started: 2026-05-05T04:22:07.948987466+00:00
Tags: research, locking, netfs, eval-scheduled

Description

Context

User reports stuck-waiting on the graph lock when running workgraph on MooseFS. MooseFS supports POSIX locks (flock/fcntl) but routes them through the master server: every acquire/release is a network RPC, and locks are serialized at the master. Hold times that are invisible on local ext4 become catastrophic here, and contention compounds quadratically with agent count.

Suspicion: we hold the flock for more than just graph.jsonl write operations — possibly for reads, for entire command execution, or across subprocess spawns / LLM calls.
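
To make the suspicion concrete, here is the hazardous shape next to the narrow one, as a minimal sketch. It assumes the fs2 crate for flock; the function names, paths, and git call are illustrative, not the actual workgraph code. The audit's job is to classify each acquisition as one shape or the other.

```rust
use std::fs::OpenOptions;
use std::process::Command;

use fs2::FileExt; // cross-platform flock wrapper trait

// HAZARD: exclusive flock held across a subprocess spawn. On MooseFS every
// other agent queues at the master for the child's entire runtime.
fn claim_and_run_hazard(task: &str) -> std::io::Result<()> {
    let lock = OpenOptions::new()
        .create(true)
        .write(true)
        .open(".workgraph/graph.lock")?;
    lock.lock_exclusive()?; // one master RPC to acquire
    // ... mutate graph.jsonl to mark `task` claimed ...
    let _ = Command::new("git").args(["worktree", "add", task]).status()?; // slow I/O under lock
    lock.unlock()?; // another master RPC
    Ok(())
}

// NARROW: lock only around the graph mutation; release before slow work.
fn claim_and_run_narrow(task: &str) -> std::io::Result<()> {
    {
        let lock = OpenOptions::new()
            .create(true)
            .write(true)
            .open(".workgraph/graph.lock")?;
        lock.lock_exclusive()?;
        // ... mutate graph.jsonl to mark `task` claimed ...
        lock.unlock()?; // flock is also released when `lock` is closed on drop
    }
    let _ = Command::new("git").args(["worktree", "add", task]).status()?; // outside the lock
    Ok(())
}
```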

MooseFS-specific properties to keep in mind

  • flock/fcntl work but are master-mediated → every lock op is a network round-trip
  • File rename is atomic (it is a pure metadata op at the master), but the master serializes all metadata ops
  • fsync is honored but slow; mmap locks are sketchy
  • A lock held for 100 ms locally can cost 100 ms + RTT × N here, with N contending agents each adding a master round-trip
  • The classic netfs fix — "copy graph.jsonl to /tmp, edit, atomically rename back" — works fine on MooseFS as long as /tmp is local and the edited file is staged back on MooseFS before the rename (sketched after this list)
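
A minimal sketch of that pattern. One detail matters: rename(2) cannot cross filesystems (it fails with EXDEV), so the edit can happen on local /tmp, but the file that gets renamed into place must be staged on MooseFS next to the destination. Paths and the edit callback are illustrative:

```rust
use std::fs;
use std::io::Write;

// Copy-edit-rename: hold no lock while editing; the final rename is the
// atomic commit point.
fn rewrite_graph(edit: impl Fn(&str) -> String) -> std::io::Result<()> {
    let src = ".workgraph/graph.jsonl"; // on MooseFS
    let local = "/tmp/graph.jsonl.work"; // local disk: fast edits
    fs::copy(src, local)?;
    let edited = edit(&fs::read_to_string(local)?);

    let staged = ".workgraph/graph.jsonl.tmp"; // same MooseFS dir as src
    let mut f = fs::File::create(staged)?;
    f.write_all(edited.as_bytes())?;
    f.sync_all()?; // fsync: slow on MooseFS but safe
    fs::rename(staged, src)?; // atomic replace
    fs::remove_file(local).ok();
    Ok(())
}
```

A short exclusive flock around the copy and the rename (but not the edit) is still needed to prevent lost updates between concurrent writers; the point is that the lock no longer spans the slow part.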

What to investigate

Audit every flock / lock acquisition in the codebase. For each, report:

  1. Where — file path + function + line range
  2. What it guards — the resource (graph.jsonl? agency dir? something else?)
  3. Scope — how long is the lock held? (single write? whole command? across an LLM call?)
  4. Read or write? — do we lock for pure read operations that could use a snapshot/copy instead? (see the snapshot sketch after this list)
  5. MooseFS hazard — does the critical section include slow I/O (subprocess spawns, network calls, file copies, git operations, LLM calls) that would be catastrophic to hold a master-mediated lock across?
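
For item 4, the snapshot alternative can be fully lock-free, provided every writer publishes graph.jsonl via atomic rename: a single whole-file read then sees either the old or the new version, never a torn mix. A sketch under that assumption, parsing with serde_json for illustration:

```rust
use std::fs;

// Lock-free read: no master lock RPC at all for pure reads. This is safe
// only under the rename discipline; if writers append in place, an
// unlocked reader can observe a torn trailing line.
fn load_graph_snapshot() -> std::io::Result<Vec<serde_json::Value>> {
    let raw = fs::read_to_string(".workgraph/graph.jsonl")?;
    let mut nodes = Vec::new();
    for line in raw.lines().filter(|l| !l.trim().is_empty()) {
        nodes.push(serde_json::from_str(line)?);
    }
    Ok(nodes)
}
```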

Pay special attention to:

  • src/graph.rs and any graph load/save paths
  • Service daemon (src/commands/service/) — coordinator and dispatcher loops
  • Agent claim/unclaim paths
  • Agency reads (.workgraph/agency/ — does it share the lock?)
  • TUI read paths (wg list, wg show, wg status) — should be lock-free or read-only
  • The TUI poll loop in particular — if it acquires the lock on every refresh, that alone is a thundering herd on MooseFS (a lock-free shape is sketched after this list)
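
One lock-free poll shape, under the same atomic-rename assumption: stat the file's mtime each tick (one cheap metadata RPC to the master) and re-read only when it changes. The interval and paths are illustrative:

```rust
use std::fs;
use std::time::{Duration, SystemTime};

// Poll by stat'ing mtime instead of flocking every refresh: a stat is a
// plain metadata round-trip, not an entry in the master's lock queue.
fn tui_poll_loop() -> std::io::Result<()> {
    let mut last = SystemTime::UNIX_EPOCH;
    loop {
        let mtime = fs::metadata(".workgraph/graph.jsonl")?.modified()?;
        if mtime > last {
            last = mtime;
            // ... re-read the snapshot (see sketch above) and redraw ...
        }
        std::thread::sleep(Duration::from_millis(500));
    }
}
```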

Also check:

  • Single global lock vs finer-grained locks?
  • Read/write distinction (shared vs exclusive flock)? (see the sketch after this list)
  • Do we ever hold the lock across an LLM call or git worktree creation? (instant fail on netfs)
  • Is the lock file on MooseFS or could it live on local disk while data lives on MooseFS?
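
On the read/write distinction: flock natively supports shared vs exclusive modes, so read paths that genuinely must lock can at least stop serializing against each other. A sketch using the fs2 crate; names and paths are illustrative, and the write path is simplified (a real one should stage and rename as sketched earlier):

```rust
use std::fs::File;

use fs2::FileExt;

// Readers take a shared lock: many can hold it at once; only a writer's
// exclusive lock excludes them. Still one master RPC per acquire on
// MooseFS, but readers no longer queue behind each other.
fn read_locked() -> std::io::Result<String> {
    let f = File::open(".workgraph/graph.lock")?;
    f.lock_shared()?;
    let data = std::fs::read_to_string(".workgraph/graph.jsonl")?;
    f.unlock()?;
    Ok(data)
}

fn write_locked(contents: &str) -> std::io::Result<()> {
    let f = File::open(".workgraph/graph.lock")?;
    f.lock_exclusive()?;
    std::fs::write(".workgraph/graph.jsonl", contents)?; // simplified; prefer temp + rename
    f.unlock()?;
    Ok(())
}
```

On the lock-file location question: a lock file on local disk can only coordinate processes on the same host, so it is an option only if all agents share one machine; for multi-host setups the lock must stay on MooseFS or move to a different coordination mechanism entirely.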

Deliverable

A markdown report at audit-graph-lock-scope.md in the repo root with:

  • Table of every lock acquisition (file:line, scope, what it guards, hold duration estimate)
  • List of "smells" — places where the lock is held longer than needed
  • Specific MooseFS-attributable hazards (locks held across slow I/O)
  • Concrete recommendations (e.g., "line X-Y holds lock across subprocess spawn — extract to copy-then-release pattern")
  • Top 3 fixes ranked by MooseFS impact (lowest-effort, highest-relief first)
  • For each top-3 fix: rough patch sketch (which functions to refactor, what the new shape looks like) — no actual code changes

Validation

  • Report file exists at audit-graph-lock-scope.md
  • Every flock / Mutex / RwLock guarding graph or agency state is enumerated with file:line
  • Each entry has scope description and hold-duration estimate
  • Smells section explicitly flags any lock held across: subprocess spawn, LLM call, git op, sleep, or network I/O
  • MooseFS-specific reasoning is present (master-mediated lock RPC cost is named)
  • Top 3 ranked fixes include rough patch sketches, not just "investigate further"
  • Report is research-only; no source files modified

Depends on

Required by

Log