wg-evaluate-record

wg evaluate record CLI schema mismatch breaks FLIP recording (autohaiku)

Metadata

Status: done
Assigned: agent-165
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-26T23:08:20.170337944+00:00
Started: 2026-04-26T23:20:40.120439761+00:00
Completed: 2026-04-26T23:43:37.832615774+00:00
Tags: eval-scheduled
Eval score: 0.93
└ blocking impact: 0.90
└ completeness: 0.98
└ coordination overhead: 0.92
└ correctness: 0.95
└ downstream usability: 0.88
└ efficiency: 0.92
└ intent fidelity: 0.87
└ style adherence: 0.90

Description

The FLIP evaluation in autohaiku runs to completion and computes a score, then fails to record it:

=== FLIP Evaluation Complete ===
Task:       Autohaiku Assembly Line ...
FLIP Score: 0.72
...
error: unexpected argument '.flip-autohaiku-assembly-line-generate-hourly-haikus-fro' found

Usage: wg evaluate record [OPTIONS] --task <TASK> --score <SCORE> --source <SOURCE>

The eval script passes the task id positionally, but the CLI now requires the --task <TASK> flag. This is a schema mismatch, likely from a recent CLI refactor that didn't update the FLIP-eval invocation site.

Fix

  1. Find the FLIP eval invocation site (likely in src/agency/flip.rs or src/commands/evaluate/) that runs wg evaluate record <task-id> positionally.
  2. Update to use the current CLI: wg evaluate record --task <task-id> --score <score> --source <source>.
  3. Add a unit test asserting the FLIP eval invocation matches the current CLI argument schema.
  4. Audit any OTHER wg evaluate record callers for the same drift.
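
As a rough illustration of step 2, the sketch below builds the argument list in flag form, using only the flags shown in the usage line above. The helper names flip_record_args and flip_record_command are hypothetical, as is the "flip" source value used in the usage note; the actual invocation site and its types may look quite different.

```rust
use std::process::Command;

/// Hypothetical helper (not taken from the codebase): builds the argument
/// list for `wg evaluate record` in flag form, so the task id is never
/// passed positionally.
fn flip_record_args(task_id: &str, score: f64, source: &str) -> Vec<String> {
    vec![
        "evaluate".to_string(),
        "record".to_string(),
        "--task".to_string(),
        task_id.to_string(),
        "--score".to_string(),
        score.to_string(),
        "--source".to_string(),
        source.to_string(),
    ]
}

/// Hypothetical helper: wraps the arguments in a `Command` without running it,
/// which keeps the argument construction testable in isolation.
fn flip_record_command(task_id: &str, score: f64, source: &str) -> Command {
    let mut cmd = Command::new("wg");
    cmd.args(flip_record_args(task_id, score, source));
    cmd
}
```

Calling flip_record_command(task_id, 0.72, "flip") would yield a command equivalent to wg evaluate record --task <task-id> --score 0.72 --source flip. Keeping argument construction in one function also gives the audit in step 4 a single place to compare other callers against.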

Why this matters

Every FLIP eval in every project currently fails to record. The eval runs (wastes tokens) and the result vanishes. Agency learning is silently broken anywhere FLIP fires.

Validation

  • Failing test first: test_flip_eval_record_invocation_uses_flag_args asserts that the FLIP path passes --task rather than a positional task id
  • Implementation makes test pass
  • cargo build + cargo test pass with no regressions
  • Manual: trigger FLIP on any task in any project; assert eval is recorded to .wg/agency/evaluations/ with correct score
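
One possible shape for the failing-first test is sketched below. The checker task_passed_by_flag is a hypothetical helper, and the literal argument vectors stand in for whatever the real FLIP invocation site produces; an actual test would feed it the arguments built by that code path.

```rust
/// Hypothetical checker: true only if `task_id` appears exactly once in
/// `args` and is immediately preceded by `--task`.
fn task_passed_by_flag(args: &[String], task_id: &str) -> bool {
    let positions: Vec<usize> = args
        .iter()
        .enumerate()
        .filter(|(_, a)| a.as_str() == task_id)
        .map(|(i, _)| i)
        .collect();
    positions.len() == 1 && positions[0] > 0 && args[positions[0] - 1] == "--task"
}

#[cfg(test)]
mod tests {
    use super::*;

    fn to_args(parts: &[&str]) -> Vec<String> {
        parts.iter().map(|p| p.to_string()).collect()
    }

    #[test]
    fn test_flip_eval_record_invocation_uses_flag_args() {
        // Old, broken form: task id passed positionally.
        let old = to_args(&["evaluate", "record", ".flip-example"]);
        assert!(!task_passed_by_flag(&old, ".flip-example"));

        // Current form: task id passed via --task.
        let new = to_args(&[
            "evaluate", "record", "--task", ".flip-example",
            "--score", "0.72", "--source", "flip",
        ]);
        assert!(task_passed_by_flag(&new, ".flip-example"));
    }
}
```

The check is deliberately strict about a stray positional copy of the task id, since that is the exact failure the error message above reports.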

Depends on

Required by

Log