smoke-gate-is

Smoke gate is real: agents cannot claim done while a defined smoke scenario fails

Metadata

Status: done
Assigned: agent-102
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-26T20:00:02.907277292+00:00
Started: 2026-04-26T20:01:44.444708556+00:00
Completed: 2026-04-26T21:01:42.324052337+00:00
Tags: eval-scheduled
Eval score: 0.90
└ blocking impact: 0.90
└ completeness: 0.95
└ coordination overhead: 0.92
└ correctness: 0.92
└ downstream usability: 0.90
└ efficiency: 0.88
└ intent fidelity: 0.88
└ style adherence: 0.92

Description

Multiple recent 'done' tasks regressed live behavior:

  • wg-nex-native (still broken; user repro'd same fault after 'fix')
  • rename-dispatcher-daemon (resurrected Bug A orphan ghost)
  • wire-priority-field (priority schema mid-migration; integer in graph.jsonl breaks readers)
  • model-is-not (the 'done' itself was sound, but it exposed downstream routing breakage that a smoke scenario would have caught)

Pattern: agent claims done, smoke is theater, regression ships, user hits it. The smoke-test-gap task is supposed to fix this but evidently the smoke still doesn't have teeth.

Spec

The smoke must be a HARD GATE on wg done, not just a recommendation. If a defined smoke scenario fails, wg done rejects with the specific scenario that broke.

  1. Define a structured smoke manifest (e.g. tests/smoke/manifest.toml) listing scenarios with: name, repro script, assertion, owner-task (which task this scenario protects).
  2. wg done <task> runs all scenarios where the task is in the owner-task list, OR all scenarios if --full-smoke. If any fail, wg done exits non-zero and names the scenarios that failed.
  3. Agents (per CLAUDE.md and the agent prompt) MUST run the smoke before wg done. Add this as a hard requirement in the agent prompt template.
  4. Smoke runs against LIVE binaries and LIVE endpoints (no stubs). If an endpoint is unreachable, the scenario emits a loud SKIP (never a silent one).
  5. The smoke manifest is grow-only: every regression in this session adds a scenario. wg-nex-native bug, Bug A ghost, model routing — each gets a permanent scenario.
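
The gate described in the steps above can be sketched roughly as follows. This is a minimal Rust sketch under stated assumptions, not wg's actual implementation: `Scenario`, `Outcome`, `select`, `smoke_gate`, and `demo_manifest` are hypothetical names, the scenario bodies return canned outcomes instead of driving live binaries, the owner-task pairings are guesses for illustration, and treating SKIP as loud but non-blocking is an assumption the spec does not settle.

```rust
/// Result of running one smoke scenario.
#[derive(Debug, Clone, PartialEq)]
enum Outcome {
    Pass,
    Fail(String),
    Skip(String),
}

/// One manifest entry (hypothetical shape; the real manifest schema may differ).
struct Scenario {
    name: &'static str,
    owner_tasks: &'static [&'static str],
    run: fn() -> Outcome,
}

/// Scenarios that protect `task`, or every scenario when `full_smoke` is set.
fn select<'a>(manifest: &'a [Scenario], task: &str, full_smoke: bool) -> Vec<&'a Scenario> {
    manifest
        .iter()
        .filter(|s| full_smoke || s.owner_tasks.iter().any(|t| *t == task))
        .collect()
}

/// Run the selected scenarios. Err carries one "name: reason" entry per failure.
fn smoke_gate(manifest: &[Scenario], task: &str, full_smoke: bool) -> Result<(), Vec<String>> {
    let mut failures = Vec::new();
    for s in select(manifest, task, full_smoke) {
        match (s.run)() {
            Outcome::Pass => println!("PASS {}", s.name),
            // SKIP is printed loudly; assumed here not to block the gate.
            Outcome::Skip(why) => eprintln!("SKIP {} ({})", s.name, why),
            Outcome::Fail(why) => failures.push(format!("{}: {}", s.name, why)),
        }
    }
    if failures.is_empty() { Ok(()) } else { Err(failures) }
}

/// Canned stand-ins for three of the required scenarios (outcomes are made up).
fn demo_manifest() -> Vec<Scenario> {
    vec![
        Scenario {
            name: "priority_int_and_string_deserialize",
            owner_tasks: &["wire-priority-field"],
            run: || Outcome::Fail("integer priority did not deserialize".to_string()),
        },
        Scenario {
            name: "dispatcher_boot_no_orphan_supervisor",
            owner_tasks: &["rename-dispatcher-daemon"],
            run: || Outcome::Pass,
        },
        Scenario {
            name: "nex_two_message_against_lambda01",
            owner_tasks: &["wg-nex-native"],
            run: || Outcome::Skip("endpoint unreachable".to_string()),
        },
    ]
}

fn main() {
    let manifest = demo_manifest();
    match smoke_gate(&manifest, "wire-priority-field", false) {
        Ok(()) => println!("smoke gate passed; done may proceed"),
        Err(failed) => {
            for f in &failed {
                eprintln!("FAIL {f}");
            }
            // A real CLI would exit non-zero here, e.g. std::process::exit(1).
            eprintln!("done rejected: {} scenario(s) failed", failed.len());
        }
    }
}
```

Returning the failing scenario names in the error value, rather than a bare boolean, is what lets the rejection message say which scenario broke.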

Required initial scenarios

  • nex_two_message_against_lambda01 — repro the exact 'wg init -x nex -m qwen3-coder -e https://lambda01...; wg tui; send 'hi'' user smoke
  • dispatcher_boot_no_orphan_supervisor — Bug A regression check (covered by bug-a-regression-test task; reuse)
  • claude_executor_with_global_openrouter_default — local=claude + global openrouter is_default → spawn metadata shows executor=claude, no native-exec
  • priority_int_and_string_deserialize — graph.jsonl with both 'priority':10 and 'priority':'high' reads cleanly
  • chat_create_via_ipc_works — open chat, send 'hi', get response within 30s
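
As an illustration of what priority_int_and_string_deserialize has to assert, here is a minimal sketch of a reader that accepts both forms. It uses only the standard library and works on the raw JSON value token of the field; the `Priority` enum and `parse_priority` are hypothetical names, and the real graph.jsonl reader presumably goes through a JSON library instead.

```rust
/// A priority that may arrive as an integer (10) or a string ("high").
#[derive(Debug, PartialEq)]
enum Priority {
    Level(i64),
    Named(String),
}

/// Parse the raw JSON value of a `priority` field: a bare integer or a quoted string.
fn parse_priority(raw: &str) -> Result<Priority, String> {
    let raw = raw.trim();
    if let Ok(n) = raw.parse::<i64>() {
        return Ok(Priority::Level(n));
    }
    // The length guard keeps the slice below in bounds for a lone quote character.
    if raw.len() >= 2 && raw.starts_with('"') && raw.ends_with('"') {
        return Ok(Priority::Named(raw[1..raw.len() - 1].to_string()));
    }
    Err(format!("unsupported priority value: {raw}"))
}

fn main() {
    // Both forms named in the scenario must read cleanly.
    for raw in ["10", "\"high\""] {
        println!("{raw} -> {:?}", parse_priority(raw));
    }
}
```

The scenario would then assert the parsed values positively (for example `Level(10)` and `Named("high")`), not merely that no error was raised.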

Out of scope

  • Replacing the existing wave-1 smoke (extend it, don't replace)
  • CI integration (this task is the local-gate; CI plumbing is follow-up)

Validation

  • Failing tests first: test_done_blocks_when_smoke_scenario_fails, test_done_succeeds_when_all_owned_scenarios_pass
  • Implementation: smoke manifest + wg done integration + agent prompt update
  • cargo build + cargo test pass with no regressions
  • Manual: try wg done <task> in scratch dir while a known scenario is broken; assert wg done refuses with a specific failure message
  • Manual: all 5 initial scenarios above run live against real endpoints, with clear SKIP on unreachable endpoints

Depends on

Required by

Messages 2 messages (2 unread)

  1. #1 user 2026-04-26T20:21:43.523511556+00:00 read
    Principle reinforcement from user (verbatim): 'you should be able to interpret if the right things are coming out. Even if you're not a human being looking at a terminal window, it should be trivial. it should be like, I'm going to take a look.' Bake this in as the EXPLICIT design principle for smoke scenarios: every scenario must produce a programmatically-assertable text/data stream — never 'human looks at terminal and judges.' Each scenario states the expected output (literal text, JSON shape, file content, log line) and the smoke runner asserts that output appeared. NO eyeball gates. The Log view bug is a recent example: fix merged, file data exists, behavior broken, agent never asserted 'after fix, opening Log view emits these N lines of expected text.' If smoke had that scenario it would have caught it.
  2. #2 smoke-gate-is 2026-04-26T20:50:15.096488979+00:00 read
    Acknowledged. Will (a) bake the 'no eyeball gates — every scenario asserts expected output programmatically' principle into tests/smoke/README.md as an explicit constitutive rule for the manifest, and (b) audit the 5 scenarios I just wrote: every one must positively assert expected output (specific role/text/JSON shape), not just 'absence of error markers'. Specifically tightening nex_two_message_against_lambda01 and chat_create_via_ipc_works to assert role=coordinator (or equivalent positive marker) in the outbox, and adding an 'Expected output' line per scenario in the README.
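
The 'no eyeball gates' principle from the messages above can be sketched as a small assertion helper. This is a hedged sketch only: `missing_expected` and `assert_scenario` are hypothetical names, and the sample outbox line in `main` is invented for illustration rather than taken from real wg output.

```rust
/// Expected snippets that never appeared in the captured scenario output.
fn missing_expected<'a>(output: &str, expected: &[&'a str]) -> Vec<&'a str> {
    expected
        .iter()
        .copied()
        .filter(|needle| !output.contains(*needle))
        .collect()
}

/// Positive assertion: every declared snippet must appear in the output.
/// A scenario that declares no expected output is rejected outright,
/// because "no error markers seen" is not evidence of correct behavior.
fn assert_scenario(output: &str, expected: &[&str]) -> Result<(), String> {
    if expected.is_empty() {
        return Err("scenario declares no expected output".to_string());
    }
    let missing = missing_expected(output, expected);
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!("missing expected output: {}", missing.join(", ")))
    }
}

fn main() {
    // Invented sample of what an outbox line might look like.
    let outbox = r#"{"role":"coordinator","text":"hello"}"#;
    let expected = [r#""role":"coordinator""#];
    match assert_scenario(outbox, &expected) {
        Ok(()) => println!("PASS"),
        Err(why) => println!("FAIL {why}"),
    }
}
```

Rejecting an empty expectation list is the design choice that matters here: it makes a scenario without an 'Expected output' line fail, so it cannot pass by default.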

Log