Metadata
| Status | done |
|---|---|
| Assigned | agent-2570 |
| Agent identity | 02e879681e52e0a384106169be043416c4d946e850ab26b2269c57681b52a6e7 |
| Created | 2026-05-05T04:23:01.830486268+00:00 |
| Started | 2026-05-05T04:23:47.300147852+00:00 |
| Completed | 2026-05-05T04:46:42.391470632+00:00 |
| Tags | bug,locking,moosefs, eval-scheduled |
| Eval score | 0.78 |
| └ blocking impact | 0.80 |
| └ completeness | 0.75 |
| └ constraint fidelity | 0.70 |
| └ coordination overhead | 0.80 |
| └ correctness | 0.75 |
| └ downstream usability | 0.75 |
| └ efficiency | 0.85 |
| └ intent fidelity | 0.73 |
| └ style adherence | 0.85 |
Description
Context
Real-world bug, observed on MooseFS-backed checkouts. The graph lock (.wg/graph.lock) is held only briefly during writes (wg add, publish, resume, done, log, artifact) — scope is fine. The problem is that MooseFS occasionally returns EIO (os error 5) instead of the expected EWOULDBLOCK when flock contends, and wg surfaces this as a hard failure instead of retrying.
Symptom: spurious task-add / log / done failures when multiple agents are active. Errors clear after a few seconds. Diagnostic was already done; this task is the fix only.
What to do
- Find the lock-acquisition wrapper. There should be one (or a small number of) call sites that flock `.wg/graph.lock`. Search for `graph.lock`, `flock`, `fs2::FileExt`, `try_lock` / `lock_exclusive`, etc. The fix belongs in the wrapper, not at every caller.
- Add bounded retry-with-backoff for transient errors. Catch EIO (`io::Error` where `raw_os_error() == Some(5)`) and EWOULDBLOCK. Retry with exponential backoff + small jitter. Suggested defaults:
  - max wall-clock budget: ~5s (configurable via env / config if easy)
  - initial delay: 25ms, factor 2, jittered
  - log at debug level on each retry, warn on giving up
  - on final failure: surface the original error with retry count attached (`acquired lock failed after N retries over Xms: <inner>`)
- Don't retry on actual hard errors. EACCES, ENOENT (lock dir vanished), ENOSPC etc. should propagate immediately. Only EIO + EWOULDBLOCK + EINTR are retried.
- Make it testable. Extract the retry policy as a small struct/function so a unit test can inject a fake `acquire` closure that returns EIO N times then succeeds, and verify backoff + give-up behavior. Don't try to mock MooseFS itself.
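The steps above could be sketched roughly as follows. This is a minimal illustration, not the committed implementation: the names `RetryPolicy`, `is_transient`, and `retry_acquire`, the injected `sleep` and `jitter` closures, and the 25% jitter cap are all assumptions made for the example. For testability, elapsed time is approximated here as the sum of the sleeps rather than measured with a real clock.

```rust
use std::io;
use std::time::Duration;

/// Hypothetical policy type; field names are illustrative only.
#[derive(Clone, Copy, Debug)]
pub struct RetryPolicy {
    pub budget: Duration,
    pub initial_delay: Duration,
    pub factor: u32,
}

impl Default for RetryPolicy {
    fn default() -> Self {
        // Suggested defaults from the task: ~5s budget, 25ms initial delay, factor 2.
        Self {
            budget: Duration::from_secs(5),
            initial_delay: Duration::from_millis(25),
            factor: 2,
        }
    }
}

/// Transient errors: EIO (errno 5), EWOULDBLOCK, EINTR. Everything else is hard.
pub fn is_transient(err: &io::Error) -> bool {
    err.raw_os_error() == Some(5)
        || matches!(
            err.kind(),
            io::ErrorKind::WouldBlock | io::ErrorKind::Interrupted
        )
}

/// Calls `acquire` until it succeeds, hits a hard error, or the budget runs out.
/// `jitter` is assumed to return a value in [0, 1); it adds up to 25% to each delay.
pub fn retry_acquire<T>(
    policy: &RetryPolicy,
    mut acquire: impl FnMut() -> io::Result<T>,
    mut sleep: impl FnMut(Duration),
    mut jitter: impl FnMut() -> f64,
) -> io::Result<T> {
    let mut delay = policy.initial_delay;
    let mut waited = Duration::ZERO;
    let mut retries = 0u32;
    loop {
        let err = match acquire() {
            Ok(value) => return Ok(value),
            Err(e) => e,
        };
        if !is_transient(&err) {
            return Err(err); // hard errors propagate immediately
        }
        let pause = delay.mul_f64(1.0 + 0.25 * jitter());
        if waited + pause > policy.budget {
            // Keeps the error kind and embeds the inner error in the message;
            // note that the raw OS code is not preserved by this wrapping.
            return Err(io::Error::new(
                err.kind(),
                format!(
                    "acquired lock failed after {retries} retries over {}ms: {err}",
                    waited.as_millis()
                ),
            ));
        }
        sleep(pause);
        waited += pause;
        retries += 1;
        delay = delay.saturating_mul(policy.factor);
    }
}

fn main() {
    // Fake `acquire` that returns EIO three times, then succeeds.
    let mut failures_left = 3;
    let mut sleeps = Vec::new();
    let result = retry_acquire(
        &RetryPolicy::default(),
        || {
            if failures_left > 0 {
                failures_left -= 1;
                Err(io::Error::from_raw_os_error(5))
            } else {
                Ok("locked")
            }
        },
        |d| sleeps.push(d), // record instead of sleeping
        || 0.0,             // no jitter, for a deterministic demo
    );
    println!("{result:?} after {} retries", sleeps.len());
}
```

In a real wrapper the `sleep` argument would presumably be `std::thread::sleep` and `jitter` a random source. Injecting both keeps the unit tests fast and deterministic.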
Files likely involved
- `src/graph.rs` (or wherever `load_with_lock` / `save_with_lock` lives)
- Possibly a new `src/lock.rs` if the retry logic deserves its own module
- A unit test file for the retry policy
Out of scope
- Changing lock scope or read paths (already correct per diagnosis)
- Switching lock backends
- MooseFS detection / FS-type-specific tuning (a generic transient-EIO retry is enough)
Validation
- Lock acquisition wrapper identified and modified in one place; no scattered retry loops at call sites
- EIO (errno 5) is retried with bounded exponential backoff + jitter
- EWOULDBLOCK and EINTR are also retried; other errors propagate immediately
- Retry budget is bounded (default ~5s) and on exhaustion the error message includes retry count + elapsed time
- Unit test exists that injects N EIO failures then success, and asserts the wrapper succeeded; another test asserts give-up after budget
- `cargo build` clean, `cargo test` clean
- `cargo install --path .` after, so the fix is live in the global `wg`
- Manual smoke: run `wg add` / `wg log` rapidly while another agent holds the lock; observe no spurious failures (run from any FS; the retry path is exercised by EWOULDBLOCK alone)
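As a rough check on the bounded-budget criterion, the un-jittered schedule implied by the suggested defaults (25ms initial delay, factor 2, ~5s budget) can be worked out directly. The helper below is hypothetical and ignores jitter and the time spent inside each acquire attempt, so the real retry count may differ slightly.

```rust
/// Hypothetical helper: un-jittered delays (in ms) that fit within `budget_ms`.
fn backoff_schedule(initial_ms: u64, factor: u64, budget_ms: u64) -> Vec<u64> {
    let mut delays = Vec::new();
    let mut delay = initial_ms;
    let mut total = 0u64;
    // Stop as soon as the next sleep would push the total past the budget.
    while delay > 0 && total.saturating_add(delay) <= budget_ms {
        delays.push(delay);
        total += delay;
        delay = delay.saturating_mul(factor);
    }
    delays
}

fn main() {
    let delays = backoff_schedule(25, 2, 5_000);
    let total: u64 = delays.iter().sum();
    // prints "[25, 50, 100, 200, 400, 800, 1600] -> 7 retries, 3175ms slept"
    println!("{delays:?} -> {} retries, {total}ms slept", delays.len());
}
```

Under these assumptions the wrapper gives up after about seven retries, because an eighth sleep of 3200ms would exceed the 5s budget.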
Depends on
Required by
Log
- 2026-05-05T04:23:01.810430830+00:00 Task paused
- 2026-05-05T04:23:19.331814411+00:00 Task published
- 2026-05-05T04:23:43.581829577+00:00 Lightweight assignment: agent=Careful Programmer (02e87968), exec_mode=full, context_scope=task, reason=Correctness-critical locking bug with meticulous error handling and comprehensive validation requirements; Careful Programmer's tradeoff and 0.81 score on 638 prior tasks is the best fit.
- 2026-05-05T04:23:47.300152120+00:00 Spawned by coordinator --executor claude --model opus
- 2026-05-05T04:23:57.092550217+00:00 Starting investigation: locating graph.lock acquisition wrapper
- 2026-05-05T04:45:06.680743625+00:00 Implemented bounded retry-with-backoff for transient lock errors in src/lock.rs (RetryPolicy + retry_acquire) wired into FileLock::flock_impl and FileLock::try_acquire_shared in src/parser.rs. EWOULDBLOCK is retried for blocking acquisitions but excluded for non-blocking shared reads (preserves load_graph contract). 9 new unit tests cover EIO/EWOULDBLOCK/EINTR retry, budget exhaustion, EACCES/ENOENT/ENOSPC propagation, and jitter bound. Full library suite 2181 passing. cargo install --path . succeeded; 50 concurrent 'wg add' under contention all succeed.
- 2026-05-05T04:46:29.238707127+00:00 Committed: 8cdc1c0e9 — pushed to remote
- 2026-05-05T04:46:42.391474831+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-05-05T04:47:42.020636365+00:00 PendingEval → Done (evaluator passed; downstream unblocks)