Metadata
| Field | Value |
|---|---|
| Status | abandoned |
| Assigned | agent-2569 |
| Agent identity | 5f5f9e1ac73378e8fc64f7603d5ad052f5e6e30285efe8415814579e618bd37d |
| Created | 2026-05-05T04:20:58.990403547+00:00 |
| Started | 2026-05-05T04:22:07.948987466+00:00 |
| Tags | research, locking, netfs, eval-scheduled |
Description
Context
User reports stuck-waiting on the graph lock when running workgraph on MooseFS. MooseFS supports POSIX locks (flock/fcntl) but routes them through the master server — every acquire/release is a network RPC, and locks are serialized at the master. Hold-times that are invisible on local ext4 become catastrophic here. Contention compounds quadratically with agent count.
Suspicion: we hold the flock for more than just graph.jsonl write operations — possibly for reads, for entire command execution, or across subprocess spawns / LLM calls.
MooseFS-specific properties to keep in mind
- flock/fcntl work but are master-mediated → every lock op is a network round-trip
- File rename is atomic on a single chunkserver but the master serializes metadata ops
- fsync is honored but slow; mmap-based locking is sketchy
- A lock held for 100 ms locally can be 100 ms + RTT × N here
- The classic netfs fix — "copy graph.jsonl to /tmp, edit, atomically rename back" — works fine on MooseFS as long as /tmp is local
What to investigate
Audit every flock / lock acquisition in the codebase. For each, report:
- Where — file path + function + line range
- What it guards — the resource (graph.jsonl? agency dir? something else?)
- Scope — how long is the lock held? (single write? whole command? across an LLM call?)
- Read or write? — do we lock for pure read operations that could use a snapshot/copy instead?
- MooseFS hazard — does the critical section include slow I/O (subprocess spawns, network calls, file copies, git operations, LLM calls) that would be catastrophic to hold a master-mediated lock across?
Pay special attention to:
- `src/graph.rs` and any graph load/save paths
- Service daemon (`src/commands/service/`) — coordinator and dispatcher loops
- Agent claim/unclaim paths
- Agency reads (`.workgraph/agency/` — does it share the lock?)
- TUI read paths (`wg list`, `wg show`, `wg status`) — should be lock-free or read-only
- The TUI poll loop in particular — if it acquires the lock every refresh, that alone is a thundering herd on MooseFS
Also check:
- Single global lock vs finer-grained locks?
- Read/write distinction (shared vs exclusive flock)?
- Do we ever hold the lock across an LLM call or git worktree creation? (instant fail on netfs)
- Is the lock file on MooseFS or could it live on local disk while data lives on MooseFS?
Deliverable
A markdown report at audit-graph-lock-scope.md in the repo root with:
- Table of every lock acquisition (file:line, scope, what it guards, hold duration estimate)
- List of "smells" — places where lock is held longer than needed
- Specific MooseFS-attributable hazards (locks held across slow I/O)
- Concrete recommendations (e.g., "line X-Y holds lock across subprocess spawn — extract to copy-then-release pattern")
- Top 3 fixes ranked by MooseFS impact (lowest-effort, highest-relief first)
- For each top-3 fix: rough patch sketch (which functions to refactor, what the new shape looks like) — no actual code changes
Validation
- Report file exists at audit-graph-lock-scope.md
- Every flock / Mutex / RwLock guarding graph or agency state is enumerated with file:line
- Each entry has scope description and hold-duration estimate
- Smells section explicitly flags any lock held across: subprocess spawn, LLM call, git op, sleep, or network I/O
- MooseFS-specific reasoning is present (master-mediated lock RPC cost is named)
- Top 3 ranked fixes include rough patch sketches, not just "investigate further"
- Report is research-only; no source files modified
Depends on
Required by
Log
- 2026-05-05T04:20:58.965025742+00:00 Task paused
- 2026-05-05T04:21:01.942197077+00:00 Task published
- 2026-05-05T04:22:07.655893267+00:00 Lightweight assignment: agent=Default Evaluator (5f5f9e1a), exec_mode=light, context_scope=task, reason=Evaluator (0.89 score, 680 tasks) excels at systematic analysis and detailed enumeration; light mode for read-only codebase exploration across locking patterns; deliverable is audit report matching evaluation strengths.
- 2026-05-05T04:22:07.948991754+00:00 Spawned by coordinator --executor claude --model opus
- 2026-05-05T04:22:17.325682548+00:00 Starting audit of graph lock scope. Will enumerate every flock/Mutex/RwLock guarding graph or agency state.
- 2026-05-05T04:23:00.234163254+00:00 Task abandoned