wg-evaluate-record — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-165`
Agent identity	`f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e`
Created	2026-04-26T23:08:20.170337944+00:00
Started	2026-04-26T23:20:40.120439761+00:00
Completed	2026-04-26T23:43:37.832615774+00:00
Tags	`eval-scheduled`
Eval score	0.93
└ blocking impact	0.90
└ completeness	0.98
└ coordination overhead	0.92
└ correctness	0.95
└ downstream usability	0.88
└ efficiency	0.92
└ intent fidelity	0.87
└ style adherence	0.90

Description

## Description

FLIP evaluation in autohaiku runs and computes a score, then fails to record it:

```
=== FLIP Evaluation Complete ===
Task:       Autohaiku Assembly Line ...
FLIP Score: 0.72
...
error: unexpected argument '.flip-autohaiku-assembly-line-generate-hourly-haikus-fro' found

Usage: wg evaluate record [OPTIONS] --task <TASK> --score <SCORE> --source <SOURCE>
```

The eval script passes the task id positionally, but the CLI now requires the `--task <TASK>` flag. This is a schema mismatch, likely from a recent CLI refactor that didn't update the FLIP-eval invocation site.

### Fix

1. Find the FLIP eval invocation site (likely in `src/agency/flip.rs` or `src/commands/evaluate/`) that runs `wg evaluate record <task-id>` positionally.
2. Update it to use the current CLI: `wg evaluate record --task <task-id> --score <score> --source <source>`.
3. Add a unit test asserting that the FLIP eval invocation matches the current CLI argument schema.
4. Audit any other `wg evaluate record` callers for the same drift.
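
As a rough illustration of step 2, the sketch below builds the argument list for `wg evaluate record` in the flag form shown in the usage string. The helper name `build_record_args`, the example task id, and the `flip` source value are hypothetical placeholders, not names taken from the workgraph codebase.

```rust
/// Hypothetical helper (not from the workgraph source): builds the argument
/// vector for `wg evaluate record` using flags instead of positionals.
fn build_record_args(task_id: &str, score: f64, source: &str) -> Vec<String> {
    vec![
        "evaluate".to_string(),
        "record".to_string(),
        // Every value is introduced by its flag, matching the usage string:
        // wg evaluate record [OPTIONS] --task <TASK> --score <SCORE> --source <SOURCE>
        "--task".to_string(),
        task_id.to_string(),
        "--score".to_string(),
        score.to_string(),
        "--source".to_string(),
        source.to_string(),
    ]
}
```

Passing the result to `std::process::Command::new("wg").args(...)` keeps the flag names in one place, so a future CLI change only needs to touch this helper and its test.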

### Why this matters

Every FLIP eval in every project currently fails to record. The eval runs (wastes tokens) and the result vanishes. Agency learning is silently broken anywhere FLIP fires.

## Validation

- [ ] Failing test first: `test_flip_eval_record_invocation_uses_flag_args` asserts that the FLIP path uses `--task`, not a positional argument
- [ ] Implementation makes the test pass
- [ ] `cargo build` and `cargo test` pass with no regressions
- [ ] Manual: trigger FLIP on any task in any project; assert the eval is recorded to `.wg/agency/evaluations/` with the correct score
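
One possible shape for the failing-first test is sketched below. It assumes the eval script is available as a plain string and flags any `wg evaluate record` line whose first argument is not a flag. The function `positional_record_calls`, the example task id, and the `flip` source value are hypothetical. The test reuses the checklist's name for illustration only and is not the committed test.

```rust
/// Hypothetical checker: returns the script lines that call
/// `wg evaluate record` with a positional first argument.
fn positional_record_calls(script: &str) -> Vec<&str> {
    const CMD: &str = "wg evaluate record";
    script
        .lines()
        .filter(|line| match line.find(CMD) {
            Some(idx) => {
                // Inspect the first token after the subcommand.
                let rest = &line[idx + CMD.len()..];
                match rest.split_whitespace().next() {
                    // A token that does not start with '-' is positional.
                    Some(token) => !token.starts_with('-'),
                    // No arguments at all: nothing positional to report.
                    None => false,
                }
            }
            None => false,
        })
        .collect()
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_flip_eval_record_invocation_uses_flag_args() {
        // Placeholder script text; a real test would use the generated script.
        let script = "echo done\n\
                      wg evaluate record --task .flip-example-task --score 0.72 --source flip\n";
        assert!(positional_record_calls(script).is_empty());
    }
}
```

The same checker could serve the audit step by running it over any other script text that invokes `wg evaluate record`.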

Depends on

done .assign-wg-evaluate-record

Required by

(none)

Log

2026-04-26T23:08:20.166278537+00:00 Task paused
2026-04-26T23:08:20.324649643+00:00 Task published
2026-04-26T23:13:00.167914630+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=task, reason=Careful Programmer matches role fit (code implementation), tradeoff fit (correctness-critical CLI schema fix), and has strong task experience (52 prior tasks).
2026-04-26T23:20:40.120446563+00:00 Spawned by coordinator --executor claude --model opus
2026-04-26T23:20:52.550888015+00:00 Starting investigation: locating FLIP eval invocation sites
2026-04-26T23:22:04.533870884+00:00 Found bug: src/commands/service/coordinator.rs lines 2787 and 2794 invoke 'wg evaluate record <task-id> <score>' positionally, but CLI requires --task/--score flags. No other call sites in repo.
2026-04-26T23:41:55.824566984+00:00 Fix landed: refactored inline-eval script generation in coordinator.rs into testable build_inline_eval_script(), updated 'wg evaluate record' invocations to use --task/--score flags. Added unit tests (test_flip_eval_record_invocation_uses_flag_args + test_inline_eval_script_without_special_agent_skips_record). Added smoke scenario evaluate_record_flag_args.sh + manifest entry. cargo build OK. cargo test --bin wg: 2843 passed, 1 pre-existing failure (provenance_full_lifecycle_all_ops_recorded — confirmed failing on parent commit too, unrelated). Manual: cli flag form writes to .wg/agency/evaluations/; positional form rejected with exact bug-report error string.
2026-04-26T23:43:19.504248421+00:00 Committed: c5b51bf62 — pushed to remote
2026-04-26T23:43:24.812938989+00:00 Pushed: c5b51bf62 to origin/wg/agent-165/wg-evaluate-record
2026-04-26T23:43:37.832632024+00:00 Task marked as done