audit-graph-lock

Audit graph lock scope on network filesystems

Metadata

Status: abandoned
Assigned: agent-2569
Agent identity: 5f5f9e1ac73378e8fc64f7603d5ad052f5e6e30285efe8415814579e618bd37d
Created: 2026-05-05T04:20:58.990403547+00:00
Started: 2026-05-05T04:22:07.948987466+00:00
Tags: research, locking, netfs, eval-scheduled

Description

Context

User reports stuck-waiting on the graph lock when running workgraph on MooseFS. MooseFS supports POSIX locks (flock/fcntl) but routes them through the master server: every acquire/release is a network RPC, and locks are serialized at the master. Hold times that are invisible on local ext4 become catastrophic here, and contention compounds quadratically with agent count.

Suspicion: we hold the flock for more than just graph.jsonl write operations — possibly for reads, for entire command execution, or across subprocess spawns / LLM calls.
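
To make the suspicion concrete, here is the hazardous shape next to the narrow one, as a minimal sketch. It assumes the fs2 crate for flock; the function names, paths, and git call are illustrative, not the actual workgraph code. The audit's job is to classify each acquisition as one shape or the other.

```rust
use std::fs::OpenOptions;
use std::process::Command;

use fs2::FileExt; // cross-platform flock wrapper trait

// HAZARD: exclusive flock held across a subprocess spawn. On MooseFS every
// other agent queues at the master for the child's entire runtime.
fn claim_and_run_hazard(task: &str) -> std::io::Result<()> {
    let lock = OpenOptions::new()
        .create(true)
        .write(true)
        .open(".workgraph/graph.lock")?;
    lock.lock_exclusive()?; // one master RPC to acquire
    // ... mutate graph.jsonl to mark `task` claimed ...
    let _ = Command::new("git").args(["worktree", "add", task]).status()?; // slow I/O under lock
    lock.unlock()?; // another master RPC
    Ok(())
}

// NARROW: lock only around the graph mutation; release before slow work.
fn claim_and_run_narrow(task: &str) -> std::io::Result<()> {
    {
        let lock = OpenOptions::new()
            .create(true)
            .write(true)
            .open(".workgraph/graph.lock")?;
        lock.lock_exclusive()?;
        // ... mutate graph.jsonl to mark `task` claimed ...
        lock.unlock()?; // flock is also released when `lock` is closed on drop
    }
    let _ = Command::new("git").args(["worktree", "add", task]).status()?; // outside the lock
    Ok(())
}
```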

MooseFS-specific properties to keep in mind

  • flock/fcntl work but are master-mediated → every lock op is a network round-trip
  • File rename is atomic (it is a pure metadata op at the master), but the master serializes all metadata ops
  • fsync is honored but slow; mmap locks are sketchy
  • A lock held for 100 ms locally can cost 100 ms + RTT × N here, with N contending agents each adding a master round-trip
  • The classic netfs fix — "copy graph.jsonl to /tmp, edit, atomically rename back" — works fine on MooseFS as long as /tmp is local and the edited file is staged back on MooseFS before the rename (sketched after this list)
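
A minimal sketch of that pattern. One detail matters: rename(2) cannot cross filesystems (it fails with EXDEV), so the edit can happen on local /tmp, but the file that gets renamed into place must be staged on MooseFS next to the destination. Paths and the edit callback are illustrative:

```rust
use std::fs;
use std::io::Write;

// Copy-edit-rename: hold no lock while editing; the final rename is the
// atomic commit point.
fn rewrite_graph(edit: impl Fn(&str) -> String) -> std::io::Result<()> {
    let src = ".workgraph/graph.jsonl"; // on MooseFS
    let local = "/tmp/graph.jsonl.work"; // local disk: fast edits
    fs::copy(src, local)?;
    let edited = edit(&fs::read_to_string(local)?);

    let staged = ".workgraph/graph.jsonl.tmp"; // same MooseFS dir as src
    let mut f = fs::File::create(staged)?;
    f.write_all(edited.as_bytes())?;
    f.sync_all()?; // fsync: slow on MooseFS but safe
    fs::rename(staged, src)?; // atomic replace
    fs::remove_file(local).ok();
    Ok(())
}
```

A short exclusive flock around the copy and the rename (but not the edit) is still needed to prevent lost updates between concurrent writers; the point is that the lock no longer spans the slow part.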

What to investigate

Audit every flock / lock acquisition in the codebase. For each, report:

  1. Where — file path + function + line range
  2. What it guards — the resource (graph.jsonl? agency dir? something else?)
  3. Scope — how long is the lock held? (single write? whole command? across an LLM call?)
  4. Read or write? — do we lock for pure read operations that could use a snapshot/copy instead? (see the snapshot sketch after this list)
  5. MooseFS hazard — does the critical section include slow I/O (subprocess spawns, network calls, file copies, git operations, LLM calls) that would be catastrophic to hold a master-mediated lock across?
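
For item 4, the snapshot alternative can be fully lock-free, provided every writer publishes graph.jsonl via atomic rename: a single whole-file read then sees either the old or the new version, never a torn mix. A sketch under that assumption, parsing with serde_json for illustration:

```rust
use std::fs;

// Lock-free read: no master lock RPC at all for pure reads. This is safe
// only under the rename discipline; if writers append in place, an
// unlocked reader can observe a torn trailing line.
fn load_graph_snapshot() -> std::io::Result<Vec<serde_json::Value>> {
    let raw = fs::read_to_string(".workgraph/graph.jsonl")?;
    let mut nodes = Vec::new();
    for line in raw.lines().filter(|l| !l.trim().is_empty()) {
        nodes.push(serde_json::from_str(line)?);
    }
    Ok(nodes)
}
```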

Pay special attention to:

  • src/graph.rs and any graph load/save paths
  • Service daemon (src/commands/service/) — coordinator and dispatcher loops
  • Agent claim/unclaim paths
  • Agency reads (.workgraph/agency/ — does it share the lock?)
  • TUI read paths (wg list, wg show, wg status) — should be lock-free or read-only
  • The TUI poll loop in particular — if it acquires the lock on every refresh, that alone is a thundering herd on MooseFS (a lock-free shape is sketched after this list)
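
One lock-free poll shape, under the same atomic-rename assumption: stat the file's mtime each tick (one cheap metadata RPC to the master) and re-read only when it changes. The interval and paths are illustrative:

```rust
use std::fs;
use std::time::{Duration, SystemTime};

// Poll by stat'ing mtime instead of flocking every refresh: a stat is a
// plain metadata round-trip, not an entry in the master's lock queue.
fn tui_poll_loop() -> std::io::Result<()> {
    let mut last = SystemTime::UNIX_EPOCH;
    loop {
        let mtime = fs::metadata(".workgraph/graph.jsonl")?.modified()?;
        if mtime > last {
            last = mtime;
            // ... re-read the snapshot (see sketch above) and redraw ...
        }
        std::thread::sleep(Duration::from_millis(500));
    }
}
```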

Also check:

  • Single global lock vs finer-grained locks?
  • Read/write distinction (shared vs exclusive flock)? (see the sketch after this list)
  • Do we ever hold the lock across an LLM call or git worktree creation? (instant fail on netfs)
  • Is the lock file on MooseFS or could it live on local disk while data lives on MooseFS?
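
On the read/write distinction: flock natively supports shared vs exclusive modes, so read paths that genuinely must lock can at least stop serializing against each other. A sketch using the fs2 crate; names and paths are illustrative, and the write path is simplified (a real one should stage and rename as sketched earlier):

```rust
use std::fs::File;

use fs2::FileExt;

// Readers take a shared lock: many can hold it at once; only a writer's
// exclusive lock excludes them. Still one master RPC per acquire on
// MooseFS, but readers no longer queue behind each other.
fn read_locked() -> std::io::Result<String> {
    let f = File::open(".workgraph/graph.lock")?;
    f.lock_shared()?;
    let data = std::fs::read_to_string(".workgraph/graph.jsonl")?;
    f.unlock()?;
    Ok(data)
}

fn write_locked(contents: &str) -> std::io::Result<()> {
    let f = File::open(".workgraph/graph.lock")?;
    f.lock_exclusive()?;
    std::fs::write(".workgraph/graph.jsonl", contents)?; // simplified; prefer temp + rename
    f.unlock()?;
    Ok(())
}
```

On the lock-file location question: a lock file on local disk can only coordinate processes on the same host, so it is an option only if all agents share one machine; for multi-host setups the lock must stay on MooseFS or move to a different coordination mechanism entirely.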

Deliverable

A markdown report at audit-graph-lock-scope.md in the repo root with:

  • Table of every lock acquisition (file:line, scope, what it guards, hold duration estimate)
  • List of "smells" — places where the lock is held longer than needed
  • Specific MooseFS-attributable hazards (locks held across slow I/O)
  • Concrete recommendations (e.g., "line X-Y holds lock across subprocess spawn — extract to copy-then-release pattern")
  • Top 3 fixes ranked by MooseFS impact (lowest-effort, highest-relief first)
  • For each top-3 fix: rough patch sketch (which functions to refactor, what the new shape looks like) — no actual code changes

Validation

  • Report file exists at audit-graph-lock-scope.md
  • Every flock / Mutex / RwLock guarding graph or agency state is enumerated with file:line
  • Each entry has scope description and hold-duration estimate
  • Smells section explicitly flags any lock held across: subprocess spawn, LLM call, git op, sleep, or network I/O
  • MooseFS-specific reasoning is present (master-mediated lock RPC cost is named)
  • Top 3 ranked fixes include rough patch sketches, not just "investigate further"
  • Report is research-only; no source files modified

Depends on

Required by

Log