.verify-bug-report-verify

Verify (FLIP 0.54): Bug report: Verify commands frequently misconfigured as descriptive text instead of executable commands

Metadata

Status: done
Assigned: agent-106
Agent identity: 3184716484e6f0ea08bb13539daf07686ee79d440505f1fdf2de0357707034c3
Model: claude-opus-4-6
Created: 2026-04-01T18:39:09.615104991+00:00
Started: 2026-04-01T18:39:13.470547734+00:00
Completed: 2026-04-01T18:42:15.322079155+00:00
Tags: verification, agency, eval-scheduled
Eval score: 0.32
└ blocking impact: 0.50
└ completeness: 0.55
└ coordination overhead: 0.30
└ correctness: 0.25
└ downstream usability: 0.45
└ efficiency: 0.20
└ intent fidelity: 0.51
└ style adherence: 0.15

Description

FLIP Verification & Repair

FLIP score 0.54 is below threshold 0.70 — independently verify and, if needed, fix this task's work.

Your Authority

You are a senior engineer reviewing a junior's PR. You have full authority to:

  • Edit source files, run builds, run tests, and commit fixes
  • Correct mistakes, resolve test failures, and improve the implementation
  • Only reject (fail) the source task if the approach is fundamentally wrong

Fix first, fail last. If the work is close but has issues, repair it yourself.

Original Task

ID: bug-report-verify
Title: Bug report: Verify commands frequently misconfigured as descriptive text instead of executable commands
Description:

Bug Report: Pathological --verify Misconfiguration

Summary

Agents (and possibly the eval scaffold) frequently set --verify commands to descriptive text rather than actual executable shell commands. This causes tasks to get stuck in a spawn-die loop where:

  1. Agent completes all actual work successfully
  2. Agent tries to mark task done via wg done
  3. wg done runs the verify command, which fails because it's not a valid shell command
  4. Agent dies, coordinator respawns a new agent
  5. New agent sees work is already done, tries to mark done, same failure
  6. Loop repeats indefinitely
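The loop above has no exit condition. As a rough illustration of what a bounded version could look like, the sketch below retries a verify string a fixed number of times and then gives up. The helper name verify_with_cap and its two arguments are hypothetical; this is not how wg or its coordinator is implemented.

```shell
#!/bin/sh
# verify_with_cap is a hypothetical illustration, not wg's coordinator logic.
# It runs the verify string ($1) at most $2 times, then stops instead of
# retrying forever.
verify_with_cap() {
    cmd=$1
    max=$2
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        # Hand the string to the shell unchanged; only the exit status matters.
        if sh -c "$cmd" >/dev/null 2>&1; then
            echo "verified on attempt $attempt"
            return 0
        fi
        attempt=$((attempt + 1))
    done
    echo "giving up after $max attempts"
    return 1
}
```

With a cap in place, a verify string that can never succeed produces one clear failure after a known number of attempts rather than an open-ended series of respawns.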

Concrete Example

Task fix-convert-docx had this verify command:

typst compile passes for all converted files; all DocX files have Typst equivalents

This is a description of what to check, not an executable command. The shell tried to run:

  • typst compile passes for all converted files → failed (unexpected argument 'all')
  • all DocX files have Typst equivalents → failed (all: not found)

Three agents (agent-94, agent-96, agent-98) spawned and died on this task, each having completed or confirmed the work was done, but unable to mark it complete.
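The failure can be reproduced without wg, assuming (as the error messages above suggest) that the verify string is handed to a shell unchanged. In the sketch below, run_verify is a hypothetical helper that passes a string to sh -c and prints the exit status. It only shows how a shell treats descriptive text versus a real command.

```shell
#!/bin/sh
# run_verify is a hypothetical helper, not part of wg: it passes a verify
# string to the shell as-is and prints the resulting exit status.
run_verify() {
    sh -c "$1" >/dev/null 2>&1
    echo "$?"
}

# Second half of the descriptive text: the shell looks for a program named
# "all". POSIX shells report status 127 when a command is not found.
run_verify 'all DocX files have Typst equivalents'

# An executable check, by contrast, exits 0 on success and non-zero on failure.
run_verify 'test -d .'
```

The first half of the string fails differently: typst is a real program, so the shell starts it, and typst itself rejects the extra words as unexpected arguments.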

Root Cause Analysis

The --verify field is meant to hold a shell command that exits 0 on success. But agents (or the task creation tooling) are treating it as a human-readable description of acceptance criteria. This suggests:

  1. Ambiguous field semantics — The --verify field name doesn't clearly communicate that it must be an executable command. Agents may interpret it as "describe how to verify" rather than "provide a command that verifies."

  2. No validation at write time — When --verify is set (via wg add --verify or wg edit --verify), there's no syntax check or warning that the string doesn't look like a valid command.

  3. No graceful failure mode — When verify fails, the task stays in a claimable state, so the coordinator keeps spawning agents that keep failing. There's no ci
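Point 2 above can be illustrated with a simple write-time pre-check. The sketch below is a heuristic only, and verify_looks_executable is a hypothetical name, not an existing wg feature: it splits the string on `;` and checks that the first word of each segment resolves to a command. It does not handle `&&`, `||`, pipes, or quoting, and it cannot tell whether the arguments make sense.

```shell
#!/bin/sh
# verify_looks_executable is a hypothetical pre-check, not an existing wg
# feature. The body runs in a subshell ( ... ) so that `exit` inside the
# loop ends only the check, never the caller's shell.
verify_looks_executable() (
    printf '%s\n' "$1" | tr ';' '\n' | while read -r first _rest; do
        # Skip empty segments such as a trailing ';'.
        [ -z "$first" ] && continue
        # `command -v` succeeds for programs on PATH, builtins, and functions.
        if ! command -v "$first" >/dev/null 2>&1; then
            echo "not a command: $first" >&2
            exit 1
        fi
    done
)
```

A caller could use it as `verify_looks_executable "$verify" || echo "verify string rejected"`. For the example string above, the segment starting with "all" is rejected. The segment starting with "typst" passes on a machine where typst is installed, so the check is best treated as a warning, not as proof that a verify string is valid.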

Artifacts:

  • wg-verify-validator.sh
  • wg-audit-verify.sh
  • wg-safe-add.sh
  • workgraph-verify-fixes.md

FLIP Evaluation Results

Dimension scores:

  • hallucination_rate: 0.35
  • requirement_coverage: 0.52
  • semantic_match: 0.60
  • specificity_match: 0.40

Evaluator reasoning: The actual task is a bug report and root cause analysis of a specific workgraph problem (--verify commands being set to descriptive text instead of executable commands), with suggested fixes ranging from short-term to long-term. The inferred task reframes this as a solution implementation request, proposing to build validator tools, auditing tools, and wrappers. While both address the same domain (verify command misconfiguration), they represent different intents: the actual is analytical/diagnostic, the inferred is prescriptive/implementational. The inferred version captures some of the suggested fixes (validation, auditing) but generalizes them into tool implementations, hallucinating specific deliverables (validator tool, safe wrapper, documentation) that weren't explicit in the original bug report. The inferred version also loses the concrete example, root cause analysis, and impact narrative that grounded the actual task.

FLIP metadata: {"comparison_model":"claude-haiku-4-5-20251001","inference_model":"claude-sonnet-4-20250514","inferred_prompt":"There is a critical issue with workgraph's --verify command that is causing agents to die due to malformed verify statements. Agents are receiving verify commands that contain descriptive text instead of executable verification code (example: 'typst compile passes for all converted files; all DocX files have Typst equivalents'). This is a system-level bug that needs immediate attention. Create a comprehensive solution that includes: (1) A verify command validator tool that can detect when verify statements contain descriptive text patterns rather than executable code, (2) A task auditing tool to scan existing tasks for problematic verify commands, (3) A safe wrapper for 'wg add' that validates verify commands before task creation to prevent future issues, (4) Complete documentation with examples and integration instructions. The solution must address validation, circuit breaker recommendations, error surfacing, field documentation, agent training guidance, and auto-suggestion features. Prioritize reliability and correctness - this is affecting the entire workgraph system's stability."}

Verification Steps

Independently check whether the work was actually completed. Do NOT trust the original agent's claims.

  1. Check git log --oneline -10 for recent commits related to this task
  2. Check git diff to see if meaningful changes were made
  3. Run cargo build && cargo test to verify nothing is broken
  4. Verify any artifacts mentioned in the task description exist
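The four steps above can be collected into two small shell functions. This is a sketch under assumptions: the function names are hypothetical, the commands are meant to be run from the repository root, and the artifact paths are taken to be relative to the current directory, which the task does not actually state.

```shell
#!/bin/sh
# verify_repo covers steps 1-3. It stops at the first failing command.
# --no-pager keeps git from waiting on an interactive pager.
verify_repo() {
    git --no-pager log --oneline -10 &&
    git --no-pager diff &&
    cargo build &&
    cargo test
}

# missing_artifacts covers step 4. It prints each listed path that does not
# exist and returns non-zero if any are missing.
missing_artifacts() {
    status=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return "$status"
}
```

For this task the second function would be called as `missing_artifacts wg-verify-validator.sh wg-audit-verify.sh wg-safe-add.sh workgraph-verify-fixes.md`. Note that `git diff` shows only uncommitted changes, so committed work appears in the log output, not in the diff.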

Repair & Verdict

  • If everything looks good: log verification passed and mark this task done.
  • If problems found: fix them directly — edit code, resolve test failures, correct logic errors, then run the verification again. Commit your fixes with a descriptive message. Once fixed, mark this task done.
  • Only as a last resort, if the approach is fundamentally wrong and cannot be salvaged: run wg fail 'bug-report-verify' --reason "FLIP verification failed: <reason>" then mark this task done.

Remember: your job is to make the work pass, not to find reasons to reject it.

Depends on

Required by

Log