fix-evaluator-role — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-2286`
Agent identity	`f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e`
Model	`codex:gpt-5.5`
Created	2026-05-04T15:13:17.247944293+00:00
Started	2026-05-04T15:14:56.873673232+00:00
Completed	2026-05-04T17:50:23.628879956+00:00
Tags	`priority-high,fix,bug,native,evaluator,openrouter`, `eval-scheduled`
Eval score	0.91
└ blocking impact	0.90
└ completeness	0.94
└ constraint fidelity	0.85
└ coordination overhead	0.86
└ correctness	0.95
└ downstream usability	0.90
└ efficiency	0.82
└ intent fidelity	0.73
└ style adherence	0.91

Description

Evaluator role assigned to native executor with model='anthropic/claude-haiku-4-5'. The native executor tried OpenRouter (no API key configured), failed, fell back to claude CLI, which ALSO failed.

Live error from .evaluate-weather-monitor-memphis on 2026-05-04 10:10:

wg evaluate exited with code 1
Note: task has no assigned agent — evaluating without role/tradeoff context [lightweight-llm]
native openrouter call failed for role=evaluator model=anthropic/claude-haiku-4-5: 
  Failed to create OpenAI client for lightweight call: 
  No API key found for provider 'openrouter'.
— falling back to claude CLI
Error: Evaluation LLM call failed
Caused by: Claude CLI call failed (exit Some(1))

User report 2026-05-04: 'got this bug with native executor.'

Root cause hypothesis

Two stacked problems:

Routing: the model spec anthropic/claude-haiku-4-5 (provider/model slash-format) routes to openrouter by default. With no openrouter API key it fails. But claude:haiku is available via the claude CLI directly — we should be using that path, not openrouter.
Fallback: the claude-CLI fallback ALSO fails (exit 1, no useful error). The fallback may be passing the openrouter-style model name (anthropic/claude-haiku-4-5) to claude CLI, which doesn't recognize that format. claude CLI expects opus / sonnet / haiku / claude:opus etc.

Required fix

For evaluator role on native executor: prefer claude CLI directly (no openrouter detour) when the model is a claude variant
OR translate anthropic/claude-haiku-4-5 → claude:haiku for the claude-CLI fallback path
OR if we genuinely want openrouter to handle this, surface the no-key error EARLIER with a clear instruction (not a confusing two-step fallback)

Recommend: native executor's eval path should route claude-flavored model specs directly to the claude CLI, bypassing openrouter entirely. The openrouter path is for non-claude models (qwen, gpt, etc.). Don't go through openrouter just to land at claude.

Validation

Failing test: trigger an evaluator with native executor + claude-flavored model, observe the routing
Fix: claude-flavored models route directly to claude CLI (no openrouter hop)
Live smoke: .evaluate-X task on a task with no openrouter key configured succeeds
Audit: are there OTHER role-on-native paths (assigner, flip) that have the same routing bug? Check.
cargo build + cargo test pass
cargo install --path . was run before claiming done

## Description
Evaluator role assigned to native executor with model='anthropic/claude-haiku-4-5'. The native executor tried OpenRouter (no API key configured), failed, fell back to claude CLI, which ALSO failed.

Live error from .evaluate-weather-monitor-memphis on 2026-05-04 10:10:
```
wg evaluate exited with code 1
Note: task has no assigned agent — evaluating without role/tradeoff context [lightweight-llm]
native openrouter call failed for role=evaluator model=anthropic/claude-haiku-4-5: 
  Failed to create OpenAI client for lightweight call: 
  No API key found for provider 'openrouter'.
— falling back to claude CLI
Error: Evaluation LLM call failed
Caused by: Claude CLI call failed (exit Some(1))
```

User report 2026-05-04: 'got this bug with native executor.'

## Root cause hypothesis
Two stacked problems:

1. **Routing**: the model spec `anthropic/claude-haiku-4-5` (provider/model slash-format) routes to openrouter by default. With no openrouter API key it fails. But claude:haiku is available via the claude CLI directly — we should be using that path, not openrouter.

2. **Fallback**: the claude-CLI fallback ALSO fails (exit 1, no useful error). The fallback may be passing the openrouter-style model name (`anthropic/claude-haiku-4-5`) to claude CLI, which doesn't recognize that format. claude CLI expects `opus` / `sonnet` / `haiku` / `claude:opus` etc.

## Required fix
- For evaluator role on native executor: prefer claude CLI directly (no openrouter detour) when the model is a claude variant
- OR translate `anthropic/claude-haiku-4-5` → `claude:haiku` for the claude-CLI fallback path
- OR if we genuinely want openrouter to handle this, surface the no-key error EARLIER with a clear instruction (not a confusing two-step fallback)

Recommend: native executor's eval path should route claude-flavored model specs directly to the claude CLI, bypassing openrouter entirely. The openrouter path is for non-claude models (qwen, gpt, etc.). Don't go through openrouter just to land at claude.

## Validation
- [ ] Failing test: trigger an evaluator with native executor + claude-flavored model, observe the routing
- [ ] Fix: claude-flavored models route directly to claude CLI (no openrouter hop)
- [ ] Live smoke: `.evaluate-X` task on a task with no openrouter key configured succeeds
- [ ] Audit: are there OTHER role-on-native paths (assigner, flip) that have the same routing bug? Check.
- [ ] cargo build + cargo test pass
- [ ] cargo install --path . was run before claiming done

Depends on

done .assign-fix-evaluator-role

Required by

(none)

Log

2026-05-04T15:13:17.010147035+00:00 Task paused
2026-05-04T15:14:11.713473250+00:00 Task published
2026-05-04T15:14:45.251397189+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=graph, reason=Careful Programmer best suited for correctness-critical native executor routing bug; graph context needed to audit evaluator, assigner, and FLIP paths for same issue
2026-05-04T15:14:56.873679914+00:00 Spawned by coordinator --executor codex --model gpt-5.5
2026-05-04T15:15:11.482177984+00:00 Starting implementation: inspecting evaluator routing and native lightweight LLM paths
2026-05-04T15:21:23.714834157+00:00 Implemented Claude slash-form normalization for agency one-shot dispatch; focused agency_dispatch tests pass
2026-05-04T15:33:19.743041444+00:00 Fixed native resume hard-truncation regression uncovered by full cargo test; focused unit and integration compaction tests now pass
2026-05-04T16:16:02.147453796+00:00 Resuming WIP; investigating remaining native coordinator test cleanup issue before full validation
2026-05-04T16:16:51.445371583+00:00 Patched native coordinator test helpers to scrub parent WG worktree env so spawned test tasks cannot reap this agent worktree
2026-05-04T16:20:55.489704805+00:00 Validated targeted routing tests and smoke scenario: agency_dispatch tests pass; evaluator slash-form smoke bypasses OpenRouter and calls claude --model haiku
2026-05-04T16:23:50.018579186+00:00 Fixed stale integration_provider expectations for canonical oai-compat provider naming, premium verification default, and current registry IDs; integration_provider now passes
2026-05-04T16:26:50.050587097+00:00 Fixed integration_service status assertion for current Dispatcher status output; integration_service now passes
2026-05-04T16:30:02.679063721+00:00 Fixed integration_streaming canonical provider-name expectation; integration_streaming now passes
2026-05-04T16:32:37.754032826+00:00 Hardened spawn wrapper cleanup guard so inherited parent WG_WORKTREE_PATH cannot mark or reap this agent worktree
2026-05-04T16:35:14.259506105+00:00 Validation target was deleted by inherited cleanup env; rerunning build/tests with isolated CARGO_TARGET_DIR and cleanup guard fix
2026-05-04T16:37:55.388213649+00:00 Focused validation passed: cargo test test_wrapper_ with isolated CARGO_TARGET_DIR
2026-05-04T17:08:28.972806650+00:00 Full cargo test exposed cwd-sandbox assumptions in test_edit_file_edge_cases; updating that fixture to create temp files under target/ inside the repo
2026-05-04T17:14:13.057601762+00:00 Updated test_prompt_logging_debug to use ## Validation descriptions instead of removed wg add --verify
2026-05-04T17:20:40.728699560+00:00 Updated verify-timeout tests to use current ## Validation flow instead of removed --verify flag
2026-05-04T17:35:16.310513924+00:00 Validated: cargo build --quiet passed with isolated CARGO_TARGET_DIR
2026-05-04T17:35:41.193080583+00:00 Validated: cargo test --quiet passed with isolated CARGO_TARGET_DIR
2026-05-04T17:36:10.072677927+00:00 Validated: evaluator Claude slash smoke scenario passed without OpenRouter key
2026-05-04T17:36:36.038452509+00:00 Validated: cargo install --path . completed
2026-05-04T17:42:35.078649254+00:00 Committed: 6229e3ed7 — pushed to remote
2026-05-04T17:50:08.086319860+00:00 Validated after merge: agency_dispatch tests passed, evaluator smoke passed, git diff --check passed, cargo install --path . completed
2026-05-04T17:50:23.628895155+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-05-04T17:51:15.332157339+00:00 PendingEval → Done (evaluator passed; downstream unblocks)