retry-graph-lock

Retry graph.lock acquisition on transient EIO (MooseFS)

Metadata

Status: done
Assigned: agent-2570
Agent identity: 02e879681e52e0a384106169be043416c4d946e850ab26b2269c57681b52a6e7
Created: 2026-05-05T04:23:01.830486268+00:00
Started: 2026-05-05T04:23:47.300147852+00:00
Completed: 2026-05-05T04:46:42.391470632+00:00
Tags: bug, locking, moosefs, eval-scheduled
Eval score: 0.78
└ blocking impact: 0.80
└ completeness: 0.75
└ constraint fidelity: 0.70
└ coordination overhead: 0.80
└ correctness: 0.75
└ downstream usability: 0.75
└ efficiency: 0.85
└ intent fidelity: 0.73
└ style adherence: 0.85

Description

Context

Real-world bug, observed on MooseFS-backed checkouts. The graph lock (.wg/graph.lock) is held only briefly during writes (wg add, publish, resume, done, log, artifact) — scope is fine. The problem is that MooseFS occasionally returns EIO (os error 5) instead of the expected EWOULDBLOCK when flock contends, and wg surfaces this as a hard failure instead of retrying.

Symptom: spurious task-add / log / done failures when multiple agents are active. Errors clear after a few seconds. Diagnosis was already done; this task is the fix only.

What to do

  1. Find the lock-acquisition wrapper. There should be one call site (or a small number of them) that flocks .wg/graph.lock. Search for graph.lock, flock, fs2::FileExt, try_lock / lock_exclusive, etc. The fix belongs in the wrapper, not at every caller.

  2. Add bounded retry-with-backoff for transient errors. Catch EIO (io::Error where raw_os_error() == Some(5)) and EWOULDBLOCK. Retry with exponential backoff + small jitter; a sketch follows this list. Suggested defaults:

    • max wall-clock budget: ~5s (configurable via env / config if easy)
    • initial delay: 25ms, factor 2, jittered
    • log at debug level on each retry, warn on giving up
    • on final failure: surface the original error with the retry count attached (lock acquisition failed after N retries over Xms: <inner>)
  3. Don't retry on actual hard errors. EACCES, ENOENT (lock dir vanished), ENOSPC, etc. should propagate immediately. Only EIO + EWOULDBLOCK + EINTR are retried.

  4. Make it testable. Extract the retry policy as a small struct/function so a unit test can inject a fake acquire closure that returns EIO N times and then succeeds, and verify backoff + give-up behavior (see the test sketches after this list). Don't try to mock MooseFS itself.
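
A minimal sketch of what steps 2 and 3 could look like, using only the standard library. RetryPolicy, is_transient, and jitter are hypothetical names; the real wrapper in src/graph.rs (or a new src/lock.rs) may differ, and a real implementation would wire the debug/warn lines into whatever logger wg already uses.

    use std::io;
    use std::time::{Duration, Instant};

    /// Bounded exponential backoff with jitter for transient lock errors.
    pub struct RetryPolicy {
        pub budget: Duration,        // total wall-clock budget (~5s default)
        pub initial_delay: Duration, // first sleep (25ms default)
        pub factor: u32,             // backoff multiplier (2 default)
    }

    impl Default for RetryPolicy {
        fn default() -> Self {
            RetryPolicy {
                budget: Duration::from_secs(5),
                initial_delay: Duration::from_millis(25),
                factor: 2,
            }
        }
    }

    /// Errors worth retrying: transient EIO (the MooseFS quirk), ordinary
    /// flock contention, and interrupted syscalls. Everything else is a
    /// hard error and propagates immediately.
    fn is_transient(err: &io::Error) -> bool {
        err.raw_os_error() == Some(5) // EIO
            || matches!(
                err.kind(),
                io::ErrorKind::WouldBlock | io::ErrorKind::Interrupted
            )
    }

    /// Cheap jitter without a rand dependency: up to ~half the base delay,
    /// seeded from the subsecond part of the system clock.
    fn jitter(base: Duration) -> Duration {
        let nanos = std::time::SystemTime::now()
            .duration_since(std::time::UNIX_EPOCH)
            .map(|d| d.subsec_nanos() as u64)
            .unwrap_or(0);
        Duration::from_millis(nanos % ((base.as_millis() as u64 / 2).max(1)))
    }

    impl RetryPolicy {
        /// Run `acquire` until it succeeds, a hard error occurs, or the
        /// wall-clock budget is exhausted.
        pub fn run<T>(&self, mut acquire: impl FnMut() -> io::Result<T>) -> io::Result<T> {
            let start = Instant::now();
            let mut delay = self.initial_delay;
            let mut retries: u32 = 0;
            loop {
                match acquire() {
                    Ok(v) => return Ok(v),
                    // Hard errors (EACCES, ENOENT, ENOSPC, ...) propagate as-is.
                    Err(e) if !is_transient(&e) => return Err(e),
                    Err(e) => {
                        if start.elapsed() + delay > self.budget {
                            // warn-level log would go here
                            return Err(io::Error::new(
                                e.kind(),
                                format!(
                                    "lock acquisition failed after {retries} retries over {}ms: {e}",
                                    start.elapsed().as_millis()
                                ),
                            ));
                        }
                        // debug-level log would go here
                        retries += 1;
                        std::thread::sleep(delay + jitter(delay));
                        delay *= self.factor;
                    }
                }
            }
        }
    }

A call site then stays a one-liner, e.g. RetryPolicy::default().run(|| try_flock_exclusive(&lock_path)), where try_flock_exclusive stands in for whatever the existing flock call is named.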
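
And a sketch of the unit tests from step 4, against the hypothetical RetryPolicy above: a fake acquire closure fails with EIO N times and then succeeds, plus give-up and hard-error cases. A tiny budget keeps the give-up test fast.

    #[cfg(test)]
    mod tests {
        use super::*;
        use std::cell::Cell;

        fn eio() -> io::Error {
            io::Error::from_raw_os_error(5)
        }

        #[test]
        fn succeeds_after_transient_eio() {
            let calls = Cell::new(0);
            let result = RetryPolicy::default().run(|| {
                calls.set(calls.get() + 1);
                if calls.get() <= 3 { Err(eio()) } else { Ok(()) }
            });
            assert!(result.is_ok());
            assert_eq!(calls.get(), 4); // three EIO failures, then success
        }

        #[test]
        fn gives_up_when_budget_exhausted() {
            let policy = RetryPolicy {
                budget: Duration::from_millis(50),
                initial_delay: Duration::from_millis(10),
                factor: 2,
            };
            let err = policy.run(|| -> io::Result<()> { Err(eio()) }).unwrap_err();
            assert!(err.to_string().contains("lock acquisition failed"));
        }

        #[test]
        fn hard_errors_propagate_immediately() {
            let calls = Cell::new(0);
            let result = RetryPolicy::default().run(|| -> io::Result<()> {
                calls.set(calls.get() + 1);
                Err(io::Error::from_raw_os_error(13)) // EACCES
            });
            assert!(result.is_err());
            assert_eq!(calls.get(), 1); // no retries on a hard error
        }
    }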

Files likely involved

  • src/graph.rs (or wherever load_with_lock / save_with_lock lives)
  • Possibly a new src/lock.rs if the retry logic deserves its own module
  • A unit test file for the retry policy

Out of scope

  • Changing lock scope or read paths (already correct per diagnosis)
  • Switching lock backends
  • MooseFS detection / FS-type-specific tuning (a generic transient-EIO retry is enough)

Validation

  • Lock acquisition wrapper identified and modified in one place; no scattered retry loops at call sites
  • EIO (errno 5) is retried with bounded exponential backoff + jitter
  • EWOULDBLOCK and EINTR are also retried; other errors propagate immediately
  • Retry budget is bounded (default ~5s) and on exhaustion the error message includes retry count + elapsed time
  • Unit test exists that injects N EIO failures then success, and asserts the wrapper succeeded; another test asserts give-up after budget
  • cargo build clean, cargo test clean
  • cargo install --path . afterwards, so the fix is live in the global wg
  • Manual smoke: run wg add / wg log rapidly while another agent holds the lock; observe no spurious failures (run from any FS — the retry path is exercised by EWOULDBLOCK alone)

Depends on

Required by

Log