migrate-agency-tasks

Migrate agency tasks (eval/flip/assign) from eval executor to claude CLI

Metadata

Status: done
Assigned: agent-852
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-27T19:33:57.252025282+00:00
Started: 2026-04-27T21:18:19.184046117+00:00
Completed: 2026-04-27T21:39:28.481355117+00:00
Tags: eval-scheduled
Tokens: 13174632 in / 45689 out
Eval score: 0.24
└ blocking impact0.20
└ completeness0.15
└ constraint fidelity0.70
└ coordination overhead0.30
└ correctness0.20
└ downstream usability0.35
└ efficiency0.30
└ style adherence0.35

Description

Right now .evaluate-*, .flip-*, and .assign-* tasks dispatch via a hardcoded executor=eval path — distinct from the claude and native executors used for everything else. Today's outage exposed problems with this:

  • The eval path failed silently when its model resolved to openrouter (no key configured): no retry, no fallback, and the error barely surfaced in the daemon log
  • The behavior of executor=eval is opaque even to people working on the system: what does it do for auth, compaction, retry, and token logging?
  • It's a special case: the rest of the system runs through the claude executor (claude CLI). Eval should not be different.

Desired end state

  • Agency tasks (.evaluate-*, .flip-*, .assign-*) dispatch via executor=claude (claude CLI, claude -p style for one-shot scoring)
  • Model stays claude:haiku (kept cheap on purpose — DO NOT bump to opus)
  • The eval executor type is removed, OR retained only as an alias that maps to claude with the evaluator model
  • All eval/flip/assign tasks inherit the same retry, compaction, logging, and error-surfacing behavior as worker agents
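The "alias" option in the end state above can be sketched roughly as follows. This is a minimal illustration, not the codebase's real types: `Executor`, `resolve_executor`, and the model-override convention are all assumed names.

```rust
// Sketch of the alias option: executor=eval survives only as a deprecated
// name that resolves to the claude executor with the cheap evaluator model.
#[derive(Debug, PartialEq)]
enum Executor {
    Claude,
    Native,
}

/// Map a raw executor string to (executor, model override).
/// `None` means "keep the task's own configured model".
fn resolve_executor(raw: &str) -> (Executor, Option<&'static str>) {
    match raw {
        // Deprecated alias: agency tasks stay on claude:haiku on purpose.
        "eval" => (Executor::Claude, Some("claude:haiku")),
        "native" => (Executor::Native, None),
        // Everything else already runs through the claude CLI.
        _ => (Executor::Claude, None),
    }
}
```

Collapsing the alias at resolution time means no caller downstream ever sees executor=eval, which makes the later "remove entirely" decision a one-line change.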

Investigation needed first

  1. What does executor=eval actually do today? Where in the source is the eval handler defined? What's its auth path, retry policy, compaction behavior, output format?
  2. What inputs does it consume (task transcript, rubric, agent metadata)? What's the prompt shape it sends? What output format does the agency framework expect back?
  3. Can claude CLI in print mode (claude -p) produce that exact output format with the same prompt? Any structured-output requirements (JSON schema for scoring) that need careful handling?
  4. Are there callers anywhere that assume executor=eval specifically (vs. just 'whatever runs agency tasks')?
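For question 3, whatever the real output contract turns out to be, the migration will need a shape check on what the claude CLI prints. The sketch below assumes a flat JSON object of rubric name to score; the key names are illustrative, and the actual contract is exactly what this investigation must pin down.

```rust
// Hedged sketch: pull a numeric score out of an assumed flat JSON payload
// like {"correctness": 0.2, "efficiency": 0.3} without a JSON library.
// A real implementation would use proper JSON parsing; this only shows the
// kind of contract check the migration needs.
fn extract_score(json: &str, key: &str) -> Option<f64> {
    let needle = format!("\"{}\"", key);
    let start = json.find(&needle)? + needle.len();
    let rest = json[start..].trim_start().strip_prefix(':')?;
    let num: String = rest
        .trim_start()
        .chars()
        .take_while(|c| c.is_ascii_digit() || *c == '.' || *c == '-')
        .collect();
    num.parse().ok()
}
```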

Implementation

After the investigation lands a design note, the implementation task should:

  • Route .evaluate-* / .flip-* / .assign-* task spawning through the claude executor
  • Preserve the existing prompt + output contract (no behavior change for FLIP scores, eval gates, etc.)
  • Update CLAUDE.md / agent guide / quickstart to reflect: 'agency tasks run on the claude CLI like everything else'
  • Decide: keep executor=eval as deprecation alias, or remove entirely
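The routing change in the first bullet amounts to a prefix check at spawn time. The prefixes come from this task's description; the function name and call site are assumptions about the eventual implementation.

```rust
// Sketch: decide whether a task name is an agency task that should be
// spawned through the claude executor. Prefixes are taken from the task
// description (.evaluate-*, .flip-*, .assign-*).
fn is_agency_task(name: &str) -> bool {
    [".evaluate-", ".flip-", ".assign-"]
        .iter()
        .any(|prefix| name.starts_with(prefix))
}
```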

Validation

  • Investigation note written: design.md or similar, covers the 4 questions in 'Investigation needed first'
  • After implementation: a freshly published task's .flip-* shows Executor: claude + Model: claude:haiku (verify in TUI or wg show)
  • FLIP scores + eval gates produce same shape of output as before (no downstream agency-framework breakage)
  • cargo build + cargo test pass with no regressions
  • Smoke scenarios for .evaluate-*, .flip-*, and .assign-* still pass (or are added if missing)

Depends on

Required by

Log