Metadata
| Status | done |
|---|---|
| Assigned | agent-852 |
| Agent identity | f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e |
| Created | 2026-04-27T19:33:57.252025282+00:00 |
| Started | 2026-04-27T21:18:19.184046117+00:00 |
| Completed | 2026-04-27T21:39:28.481355117+00:00 |
| Tags | eval-scheduled |
| Tokens | 13174632 in / 45689 out |
| Eval score | 0.24 |
| └ blocking impact | 0.20 |
| └ completeness | 0.15 |
| └ constraint fidelity | 0.70 |
| └ coordination overhead | 0.30 |
| └ correctness | 0.20 |
| └ downstream usability | 0.35 |
| └ efficiency | 0.30 |
| └ style adherence | 0.35 |
Description
Description
Right now .evaluate-*, .flip-*, and .assign-* tasks dispatch via a hardcoded executor=eval path — distinct from the claude and native executors used for everything else. Today's outage exposed problems with this:
- The eval path silently failed when its model resolved to openrouter (no key) — no retry, no fallback, error barely surfaced in daemon log
- The behavior of
executor=evalis opaque even to people working on the system — what does it do for auth, compaction, retry, token logging? - It's a special case: the rest of the system runs through
claude(claude CLI). Eval should not be different.
Desired end state
- Agency tasks (
.evaluate-*,.flip-*,.assign-*) dispatch viaexecutor=claude(claude CLI,claude -pstyle for one-shot scoring) - Model stays
claude:haiku(kept cheap on purpose — DO NOT bump to opus) - The
evalexecutor type is removed, OR retained only as an alias that maps toclaudewith the evaluator model - All eval/flip/assign tasks inherit the same retry, compaction, logging, and error-surfacing behavior as worker agents
Investigation needed first
- What does
executor=evalactually do today? Where in the source is the eval handler defined? What's its auth path, retry policy, compaction behavior, output format? - What inputs does it consume (task transcript, rubric, agent metadata)? What's the prompt shape it sends? What output format does the agency framework expect back?
- Can claude CLI in print mode (
claude -p) produce that exact output format with the same prompt? Any structured-output requirements (JSON schema for scoring) that need careful handling? - Are there callers anywhere that assume
executor=evalspecifically (vs. just 'whatever runs agency tasks')?
Implementation
After investigation lands a design note, the implementation task should:
- Route
.evaluate-*/.flip-*/.assign-*task spawning through the claude executor - Preserve the existing prompt + output contract (no behavior change for FLIP scores, eval gates, etc.)
- Update CLAUDE.md / agent guide / quickstart to reflect: 'agency tasks run on the claude CLI like everything else'
- Decide: keep
executor=evalas deprecation alias, or remove entirely
Validation
- Investigation note written: design.md or similar, covers the 4 questions in 'Investigation needed first'
-
After implementation: a freshly published task's
.flip-*showsExecutor: claude+Model: claude:haiku(verify in TUI orwg show) - FLIP scores + eval gates produce same shape of output as before (no downstream agency-framework breakage)
- cargo build + cargo test pass with no regressions
- Smoke scenarios for .evaluate-/.flip-/.assign-* still pass (or are added if missing)
Depends on
Required by
- (none)
Log
- 2026-04-27T19:33:57.221414628+00:00 Task paused
- 2026-04-27T19:33:59.829288143+00:00 Task published
- 2026-04-27T19:35:57.575345564+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-27T19:36:07.930193675+00:00 Starting investigation of executor=eval path
- 2026-04-27T19:41:21.967272715+00:00 Wrote design.md (docs/design-agency-tasks-on-claude.md). Now implementing.
- 2026-04-27T20:05:22.093709628+00:00 Task marked as failed: Agent exited with code 1
- 2026-04-27T20:35:50.318872664+00:00 Reset by `wg recover` — reason: API errors during dispatch — second outage of session
- 2026-04-27T20:36:11.279615041+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-27T20:39:13.852098937+00:00 Resuming from prior attempt: design doc + 3 modified rs files. Reviewing what's there before continuing.
- 2026-04-27T21:18:12.254410415+00:00 Task unclaimed: agent 'agent-845' (PID 2579236) process exited
- 2026-04-27T21:18:19.184053571+00:00 Spawned by coordinator --executor claude --model opus
- 2026-04-27T21:34:55.725612592+00:00 Hit stash collision via shared .git/refs/stash across worktrees — my prior WIP got swapped with another agent's WIP during stash/pop. Other agent's files (config_cmd.rs, viz_viewer/*) now in my working tree; will leave them untouched. Re-applying my llm/registry/coordinator/CLAUDE.md changes from memory of the diff I read earlier in this session.
- 2026-04-27T21:38:39.817662532+00:00 Validated: cargo build + service::llm/registry/coordinator tests pass; live smoke (agency_inline_spawn_registers_executor_claude) confirms .evaluate-* registers executor=claude in both registry.json and metadata.json
- 2026-04-27T21:39:10.086806079+00:00 Committed: 437c7582c — pushed to origin/wg/agent-832/migrate-agency-tasks
- 2026-04-27T21:39:28.481359696+00:00 Task marked as done