retry-graph-lock

Retry graph.lock acquisition on transient EIO (MooseFS)

Metadata

Status: done
Assigned: agent-2570
Agent identity: 02e879681e52e0a384106169be043416c4d946e850ab26b2269c57681b52a6e7
Created: 2026-05-05T04:23:01.830486268+00:00
Started: 2026-05-05T04:23:47.300147852+00:00
Completed: 2026-05-05T04:46:42.391470632+00:00
Tags: bug, locking, moosefs, eval-scheduled
Eval score: 0.78
└ blocking impact: 0.80
└ completeness: 0.75
└ constraint fidelity: 0.70
└ coordination overhead: 0.80
└ correctness: 0.75
└ downstream usability: 0.75
└ efficiency: 0.85
└ intent fidelity: 0.73
└ style adherence: 0.85

Description

Context

Real-world bug, observed on MooseFS-backed checkouts. The graph lock (.wg/graph.lock) is held only briefly during writes (wg add, publish, resume, done, log, artifact) — scope is fine. The problem is that MooseFS occasionally returns EIO (os error 5) instead of the expected EWOULDBLOCK when flock contends, and wg surfaces this as a hard failure instead of retrying.

Symptom: spurious task-add / log / done failures when multiple agents are active. Errors clear after a few seconds. Diagnosis was already done; this task is the fix only.

What to do

  1. Find the lock-acquisition wrapper. There should be one call site (or a small number of them) that flocks .wg/graph.lock. Search for graph.lock, flock, fs2::FileExt, try_lock / lock_exclusive, etc. The fix belongs in the wrapper, not at every caller.

  2. Add bounded retry-with-backoff for transient errors. Catch EIO (io::Error where raw_os_error() == Some(5)) and EWOULDBLOCK. Retry with exponential backoff + small jitter; a sketch follows this list. Suggested defaults:

    • max wall-clock budget: ~5s (configurable via env / config if easy)
    • initial delay: 25ms, factor 2, jittered
    • log at debug level on each retry, warn on giving up
    • on final failure: surface the original error with the retry count attached (lock acquisition failed after N retries over Xms: <inner>)
  3. Don't retry on actual hard errors. EACCES, ENOENT (lock dir vanished), ENOSPC, etc. should propagate immediately. Only EIO + EWOULDBLOCK + EINTR are retried.

  4. Make it testable. Extract the retry policy as a small struct/function so a unit test can inject a fake acquire closure that returns EIO N times and then succeeds, and verify backoff + give-up behavior (see the test sketches after this list). Don't try to mock MooseFS itself.
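
A minimal sketch of what steps 2 and 3 could look like, using only the standard library. RetryPolicy, is_transient, and jitter are hypothetical names; the real wrapper in src/graph.rs (or a new src/lock.rs) may differ, and a real implementation would wire the debug/warn lines into whatever logger wg already uses.

    use std::io;
    use std::time::{Duration, Instant};

    /// Bounded exponential backoff with jitter for transient lock errors.
    pub struct RetryPolicy {
        pub budget: Duration,        // total wall-clock budget (~5s default)
        pub initial_delay: Duration, // first sleep (25ms default)
        pub factor: u32,             // backoff multiplier (2 default)
    }

    impl Default for RetryPolicy {
        fn default() -> Self {
            RetryPolicy {
                budget: Duration::from_secs(5),
                initial_delay: Duration::from_millis(25),
                factor: 2,
            }
        }
    }

    /// Errors worth retrying: transient EIO (the MooseFS quirk), ordinary
    /// flock contention, and interrupted syscalls. Everything else is a
    /// hard error and propagates immediately.
    fn is_transient(err: &io::Error) -> bool {
        err.raw_os_error() == Some(5) // EIO
            || matches!(
                err.kind(),
                io::ErrorKind::WouldBlock | io::ErrorKind::Interrupted
            )
    }

    /// Cheap jitter without a rand dependency: up to ~half the base delay,
    /// seeded from the subsecond part of the system clock.
    fn jitter(base: Duration) -> Duration {
        let nanos = std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .map(|d| d.subsec_nanos() as u64)
            .unwrap_or(0);
        Duration::from_millis(nanos % ((base.as_millis() as u64 / 2).max(1)))
    }

    impl RetryPolicy {
        /// Run `acquire` until it succeeds, a hard error occurs, or the
        /// wall-clock budget is exhausted.
        pub fn run<T>(&self, mut acquire: impl FnMut() -> io::Result<T>) -> io::Result<T> {
            let start = Instant::now();
            let mut delay = self.initial_delay;
            let mut retries: u32 = 0;
            loop {
                match acquire() {
                    Ok(v) => return Ok(v),
                    // Hard errors (EACCES, ENOENT, ENOSPC, ...) propagate as-is.
                    Err(e) if !is_transient(&e) => return Err(e),
                    Err(e) => {
                        if start.elapsed() + delay > self.budget {
                            // warn-level log would go here
                            return Err(io::Error::new(
                                e.kind(),
                                format!(
                                    "lock acquisition failed after {retries} retries over {}ms: {e}",
                                    start.elapsed().as_millis()
                                ),
                            ));
                        }
                        // debug-level log would go here
                        retries += 1;
                        std::thread::sleep(delay + jitter(delay));
                        delay *= self.factor;
                    }
                }
            }
        }
    }

A call site then stays a one-liner, e.g. RetryPolicy::default().run(|| try_flock_exclusive(&lock_path)), where try_flock_exclusive stands in for whatever the existing flock call is named.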
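
And a sketch of the unit tests from step 4, against the hypothetical RetryPolicy above: a fake acquire closure fails with EIO N times and then succeeds, plus give-up and hard-error cases. A tiny budget keeps the give-up test fast.

    #[cfg(test)]
    mod tests {
        use super::*;
        use std::cell::Cell;

        fn eio() -> io::Error {
            io::Error::from_raw_os_error(5)
        }

        #[test]
        fn succeeds_after_transient_eio() {
            let calls = Cell::new(0);
            let result = RetryPolicy::default().run(|| {
                calls.set(calls.get() + 1);
                if calls.get() <= 3 { Err(eio()) } else { Ok(()) }
            });
            assert!(result.is_ok());
            assert_eq!(calls.get(), 4); // three EIO failures, then success
        }

        #[test]
        fn gives_up_when_budget_exhausted() {
            let policy = RetryPolicy {
                budget: Duration::from_millis(50),
                initial_delay: Duration::from_millis(10),
                factor: 2,
            };
            let err = policy.run(|| -> io::Result<()> { Err(eio()) }).unwrap_err();
            assert!(err.to_string().contains("lock acquisition failed"));
        }

        #[test]
        fn hard_errors_propagate_immediately() {
            let calls = Cell::new(0);
            let result = RetryPolicy::default().run(|| -> io::Result<()> {
                calls.set(calls.get() + 1);
                Err(io::Error::from_raw_os_error(13)) // EACCES
            });
            assert!(result.is_err());
            assert_eq!(calls.get(), 1); // no retries on a hard error
        }
    }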

Files likely involved

  • src/graph.rs (or wherever load_with_lock / save_with_lock lives)
  • Possibly a new src/lock.rs if the retry logic deserves its own module
  • A unit test file for the retry policy

Out of scope

  • Changing lock scope or read paths (already correct per diagnosis)
  • Switching lock backends
  • MooseFS detection / FS-type-specific tuning (a generic transient-EIO retry is enough)

Validation

  • Lock acquisition wrapper identified and modified in one place; no scattered retry loops at call sites
  • EIO (errno 5) is retried with bounded exponential backoff + jitter
  • EWOULDBLOCK and EINTR are also retried; other errors propagate immediately
  • Retry budget is bounded (default ~5s) and on exhaustion the error message includes retry count + elapsed time
  • Unit test exists that injects N EIO failures then success, and asserts the wrapper succeeded; another test asserts give-up after budget
  • cargo build clean, cargo test clean
  • cargo install --path . afterwards, so the fix is live in the global wg
  • Manual smoke: run wg add / wg log rapidly while another agent holds the lock; observe no spurious failures (run from any FS — the retry path is exercised by EWOULDBLOCK alone)

Depends on

Required by

Log