in-place-eval — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-693`
Agent identity	`3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3`
Created	2026-04-27T14:41:25.524611646+00:00
Started	2026-04-27T14:45:18.343871621+00:00
Completed	2026-04-27T15:06:19.621685207+00:00
Tags	`eval-gate`, `eval-scheduled`

Description

User architectural clarification (2026-04-27): when an eval gate FAILS, the rescue path must reuse the SAME agent identity AND the SAME worktree, NOT spawn a fresh worker.

User's verbatim quote:

"the failed gate, it should result in a retry, but again without destruction of the particular agent. Like we should regenerate that agent so that it has the same work tree and so on. it's just another iteration, right?"

Required behavior

PendingEval → eval fail (score < threshold) ─┬─ rescue_count < max → Open (same task.agent, same worktree, eval feedback in context)
                                              └─ rescue_count >= max → Failed (triage)

This has the same iteration semantics as reattaching to a chat session: pick up the prior state and continue from there.
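The transition diagram above can be sketched as a small state function. This is only an illustration under stated assumptions: the `Task` struct, the `eval_notes` field, and the `apply_eval_result` function are hypothetical stand-ins, not the actual types or signatures in `src/commands/evaluate.rs`. The point it shows is that a failing eval mutates the existing task (same `agent`, same `assigned`) rather than creating a new one.

```rust
/// Hypothetical, trimmed-down stand-in for the real status enum.
#[derive(Debug, Clone, PartialEq)]
pub enum Status {
    Open,
    PendingEval,
    Done,
    Failed,
}

/// Hypothetical stand-in for the real task record.
#[derive(Debug, Clone)]
pub struct Task {
    pub status: Status,
    pub agent: Option<String>,    // agent identity hash, must survive a rescue
    pub assigned: Option<String>, // e.g. "agent-693", must survive a rescue
    pub rescue_count: u32,
    pub eval_notes: Vec<String>,  // assumed storage for evaluator feedback
}

/// Apply an eval score to a task that is waiting in PendingEval.
/// Tasks in any other state are left untouched.
pub fn apply_eval_result(
    task: &mut Task,
    score: f64,
    threshold: f64,
    max_eval_rescues: u32,
    notes: &str,
) {
    if task.status != Status::PendingEval {
        return;
    }
    if score >= threshold {
        // Eval pass: PendingEval -> Done.
        task.status = Status::Done;
        return;
    }
    if task.rescue_count < max_eval_rescues {
        // Eval fail below the cap: reopen in place. `agent` and `assigned`
        // are deliberately not touched, so the same identity resumes.
        task.rescue_count += 1;
        task.eval_notes.push(notes.to_string());
        task.status = Status::Open;
    } else {
        // Eval fail at the cap: hand off to triage.
        task.status = Status::Failed;
    }
}
```

With `max_eval_rescues` set to 1, a first failing score reopens the task with `rescue_count` at 1, and a second failing score moves it to `Failed`.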

Wired with

add-pendingeval-state (this task's parent): Adds the PendingEval state, dispatcher resolution on pass, and dep gating. The eval-FAIL path currently uses the existing fresh-agent rescue, which is wrong per the clarification.
worktree-retention-don (already merged): Don't reap the worktree until eval and merge have actually completed. Together these produce the proper resumable iteration loop.
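The retention rule from worktree-retention-don can be read as a filter over tasks. The sketch below is an assumption-laden illustration: `TaskView`, its `merged` flag, and `reapable_worktrees` are hypothetical names, and the real coordinator logic may track merge completion differently. It treats a worktree as reapable only once the task is Done and merged, so anything still in PendingEval or reopened after an eval failure keeps its directory.

```rust
use std::path::PathBuf;

/// Hypothetical, trimmed-down stand-in for the real status enum.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Status {
    Open,
    InProgress,
    PendingEval,
    Done,
    Failed,
}

/// Hypothetical view of the fields a GC pass would need.
pub struct TaskView {
    pub status: Status,
    pub merged: bool,              // assumed flag: merge has completed
    pub worktree: Option<PathBuf>, // None if the task never had a worktree
}

/// Return the worktree paths that are safe to reap.
/// Only tasks that are Done AND merged qualify; every other state,
/// including PendingEval and a reopened Open task, retains its worktree.
pub fn reapable_worktrees(tasks: &[TaskView]) -> Vec<PathBuf> {
    tasks
        .iter()
        .filter(|t| t.status == Status::Done && t.merged)
        .filter_map(|t| t.worktree.clone())
        .collect()
}
```

Whether Failed tasks should also release their worktree is not stated in this task, so the sketch conservatively keeps them.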

Files likely to touch

src/commands/evaluate.rs — check_eval_gate: on score < threshold, instead of run_eval_reject + rescue::run, transition source PendingEval → Open keeping task.agent / task.assigned, increment task.rescue_count, append eval notes to next-attempt context.
src/commands/spawn/context.rs — pick up evaluator notes from prior iteration when spawning (similar to how retry_count > 0 already injects previous-attempt context).
src/commands/service/coordinator.rs — if needed, suppress worktree GC for tasks in the eval-rescue loop.
src/config.rs — max_eval_rescues cap (already exists as alias for max_verify_failures).
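For the `src/commands/spawn/context.rs` item, one possible shape for the feedback injection is sketched here. The `PriorAttempt` struct, the `previous_attempt_context` function, and the output wording are all hypothetical; only the idea (evaluator notes from the prior iteration end up in the next spawn's previous-attempt context) comes from this task.

```rust
/// Hypothetical inputs for building the next spawn's context.
pub struct PriorAttempt<'a> {
    pub rescue_count: u32,
    pub evaluator_notes: &'a [String],
}

/// Build the previous-attempt section of the spawn context.
/// Returns None on a first attempt or when there is no feedback to pass on.
pub fn previous_attempt_context(prior: &PriorAttempt<'_>) -> Option<String> {
    if prior.rescue_count == 0 || prior.evaluator_notes.is_empty() {
        return None;
    }
    let mut out = format!(
        "Evaluator feedback from {} earlier iteration(s):\n",
        prior.rescue_count
    );
    // Number the notes so the resumed agent can address them one by one.
    for (i, note) in prior.evaluator_notes.iter().enumerate() {
        out.push_str(&format!("{}. {}\n", i + 1, note));
    }
    Some(out)
}
```

Returning `Option<String>` keeps the first-attempt path free of an empty feedback section.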

What stays from add-pendingeval-state

Status::PendingEval variant
pick_done_target_status (wg done → PendingEval when eval scheduled)
resolve_pending_eval_tasks (eval pass → Done)
Color rendering (chartreuse)
approve / reject / fail accept PendingEval

Validation

Failing test first: test_eval_fail_retries_in_place_with_same_agent — task A in PendingEval, eval scores below threshold, after the fail-handler runs A is Open with task.agent UNCHANGED and task.rescue_count incremented (no new task created)
Failing test: test_eval_fail_at_cap_transitions_to_failed — same setup, rescue_count == max_eval_rescues, A goes to Failed (no further iteration spawn)
Failing test: test_eval_feedback_in_next_spawn_context — after rescue, the next spawn's previous_attempt_context contains the evaluator notes
Failing test: test_worktree_preserved_across_eval_iteration — worktree dir for A still exists after eval-fail rescue (not reaped)
cargo build + cargo test pass with no regressions
Manual smoke: low-scoring task A → wg show A reports Status: in-progress (or open with assigned set), same task.agent hash, same worktree path; rescue_count: 1

Depends on

done .assign-in-place-eval

Required by

(none)

Log

2026-04-27T14:45:18.343874757+00:00 Spawned by coordinator --executor claude --model opus
2026-04-27T14:45:33.621674261+00:00 Starting: explore current check_eval_gate, spawn context, and rescue flow to plan in-place eval-fail iteration
2026-04-27T15:00:00.827441773+00:00 Tests pass: 4 new + 15 existing evaluate tests green. Preexisting failures (provenance_full_lifecycle, integration_resume compile error) are unrelated and present without my changes.
2026-04-27T15:05:31.147109733+00:00 Committed: 0912ffa66 — pushed to remote
2026-04-27T15:06:19.621697960+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-04-27T15:06:52.887897083+00:00 PendingEval → Done (evaluator passed; downstream unblocks)