smoke-gate-is — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-102`
Agent identity	`f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e`
Created	2026-04-26T20:00:02.907277292+00:00
Started	2026-04-26T20:01:44.444708556+00:00
Completed	2026-04-26T21:01:42.324052337+00:00
Tags	`eval-scheduled`
Eval score	0.90
└ blocking impact	0.90
└ completeness	0.95
└ coordination overhead	0.92
└ correctness	0.92
└ downstream usability	0.90
└ efficiency	0.88
└ intent fidelity	0.88
└ style adherence	0.92

## Description

Multiple recent 'done' tasks regressed live behavior:
- wg-nex-native (still broken; user repro'd same fault after 'fix')
- rename-dispatcher-daemon (resurrected Bug A orphan ghost)
- wire-priority-field (priority schema mid-migration; integer in graph.jsonl breaks readers)
- model-is-not: the 'done' itself was good, but it exposed downstream routing breakage that a smoke test would have caught

Pattern: agent claims done, smoke is theater, regression ships, user hits it. The smoke-test-gap task is supposed to fix this but evidently the smoke still doesn't have teeth.

### Spec

The smoke must be a HARD GATE on `wg done`, not just a recommendation. If a defined smoke scenario fails, `wg done` rejects with the specific scenario that broke.

1. Define a structured smoke manifest (e.g. tests/smoke/manifest.toml) listing scenarios with: name, repro script, assertion, owner-task (which task this scenario protects).
2. `wg done <task>` runs all scenarios where the task is in the owner-task list, OR all scenarios if `--full-smoke` is passed. If any fail, `wg done` exits non-zero and names the scenarios that failed.
3. Agents (per CLAUDE.md and the agent prompt) MUST run the smoke before `wg done`. Add this as a hard requirement in the agent prompt template.
4. Smoke runs against LIVE binaries and LIVE endpoints (no stubs). If an endpoint is unreachable, the scenario emits a loud SKIP (never a silent one).
5. The smoke manifest is grow-only: every regression in this session adds a scenario. wg-nex-native bug, Bug A ghost, model routing — each gets a permanent scenario.
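
As a rough illustration of items 2 and 4, the selection and gating logic might look like the sketch below. The `Scenario`, `Outcome`, `select`, and `gate` names are hypothetical and are not taken from `src/smoke.rs`; treating a SKIP as reported but non-blocking is also an assumption, since the spec only requires that it be loud.

```rust
/// One manifest entry. Field names mirror the spec (name, repro script,
/// assertion, owner-task); the struct itself is a hypothetical layout.
#[derive(Debug, Clone)]
pub struct Scenario {
    pub name: String,
    pub script: String,
    pub expected_output: String,
    pub owner_tasks: Vec<String>,
}

/// Result of running a single scenario.
#[derive(Debug, Clone, PartialEq)]
pub enum Outcome {
    Pass,
    Fail(String),
    Skip(String),
}

/// Scenarios to run for `task`: those listing it as an owner,
/// or every scenario when `full_smoke` is set.
pub fn select<'a>(manifest: &'a [Scenario], task: &str, full_smoke: bool) -> Vec<&'a Scenario> {
    manifest
        .iter()
        .filter(|s| full_smoke || s.owner_tasks.iter().any(|t| t == task))
        .collect()
}

/// Run the selected scenarios through `run`.
/// Ok carries the names of skipped scenarios; Err carries one message per failure.
pub fn gate<F>(selected: &[&Scenario], mut run: F) -> Result<Vec<String>, Vec<String>>
where
    F: FnMut(&Scenario) -> Outcome,
{
    let mut skipped = Vec::new();
    let mut failed = Vec::new();
    for &s in selected {
        match run(s) {
            Outcome::Pass => {}
            Outcome::Skip(why) => {
                // Loud skip: always written to stderr, never swallowed.
                eprintln!("SKIP {}: {}", s.name, why);
                skipped.push(s.name.clone());
            }
            Outcome::Fail(why) => failed.push(format!("{}: {}", s.name, why)),
        }
    }
    if failed.is_empty() {
        Ok(skipped)
    } else {
        Err(failed)
    }
}
```

A caller would map `Err(failed)` to a non-zero exit and print each entry, so the rejection names the specific scenario that broke.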

### Required initial scenarios

- `nex_two_message_against_lambda01`: repro the exact user smoke: `wg init -x nex -m qwen3-coder -e https://lambda01...`, then `wg tui`, then send 'hi'
- `dispatcher_boot_no_orphan_supervisor`: Bug A regression check (covered by bug-a-regression-test task; reuse)
- `claude_executor_with_global_openrouter_default`: local=claude + global openrouter is_default → spawn metadata shows executor=claude, no native-exec
- `priority_int_and_string_deserialize`: graph.jsonl with both `'priority':10` and `'priority':'high'` reads cleanly
- `chat_create_via_ipc_works`: open chat, send 'hi', get response within 30s
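
To make the `priority_int_and_string_deserialize` scenario concrete, here is a minimal std-only sketch of a reader that accepts both shapes. The `Priority` type and `parse_priority` function are hypothetical; the real reader presumably goes through a JSON library, and this sketch only looks at an already-extracted value token such as `10` or `"high"`.

```rust
/// A priority as it may appear in graph.jsonl mid-migration:
/// a bare integer or a quoted name. Type and variant names are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub enum Priority {
    Numeric(i64),
    Named(String),
}

/// Parse the raw value token of a `priority` field.
/// Anything that is neither an integer nor a simple quoted string is an error.
pub fn parse_priority(token: &str) -> Result<Priority, String> {
    let t = token.trim();
    if let Ok(n) = t.parse::<i64>() {
        return Ok(Priority::Numeric(n));
    }
    if t.len() >= 2 && t.starts_with('"') && t.ends_with('"') {
        let name = &t[1..t.len() - 1];
        // Reject empty names and anything needing escape handling.
        if !name.is_empty() && !name.contains('"') && !name.contains('\\') {
            return Ok(Priority::Named(name.to_string()));
        }
    }
    Err(format!("unsupported priority value: {}", t))
}
```

A scenario built on this idea would feed both forms through the reader and assert the parsed values, rather than only checking that no error was printed.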

### Out of scope

- Replacing the existing wave-1 smoke (extend it, don't replace)
- CI integration (this task is the local-gate; CI plumbing is follow-up)

## Validation

- [ ] Failing tests first: test_done_blocks_when_smoke_scenario_fails, test_done_succeeds_when_all_owned_scenarios_pass
- [ ] Implementation: smoke manifest + wg done integration + agent prompt update
- [ ] cargo build + cargo test pass with no regressions
- [ ] Manual: try `wg done <task>` in scratch dir while a known scenario is broken; assert wg done refuses with a specific failure message
- [ ] Manual: all 5 initial scenarios above run live against real endpoints, with clear SKIP on unreachable endpoints

Depends on

done .assign-smoke-gate-is

Required by

(none)

Messages (2 messages, 2 unread)

#1 user 2026-04-26T20:21:43.523511556+00:00 read

Principle reinforcement from user (verbatim): 'you should be able to interpret if the right things are coming out. Even if you're not a human being looking at a terminal window, it should be trivial. it should be like, I'm going to take a look.' Bake this in as the EXPLICIT design principle for smoke scenarios: every scenario must produce a programmatically-assertable text/data stream — never 'human looks at terminal and judges.' Each scenario states the expected output (literal text, JSON shape, file content, log line) and the smoke runner asserts that output appeared. NO eyeball gates. The Log view bug is a recent example: fix merged, file data exists, behavior broken, agent never asserted 'after fix, opening Log view emits these N lines of expected text.' If smoke had that scenario it would have caught it.

#2 smoke-gate-is 2026-04-26T20:50:15.096488979+00:00 read

Acknowledged. Will (a) bake the 'no eyeball gates — every scenario asserts expected output programmatically' principle into tests/smoke/README.md as an explicit constitutive rule for the manifest, and (b) audit the 5 scenarios I just wrote: every one must positively assert expected output (specific role/text/JSON shape), not just 'absence of error markers'. Specifically tightening nex_two_message_against_lambda01 and chat_create_via_ipc_works to assert role=coordinator (or equivalent positive marker) in the outbox, and adding an 'Expected output' line per scenario in the README.
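
The "no eyeball gates" principle from this exchange can be sketched as a positive assertion over captured output. The function names below are hypothetical, and matching on literal substrings is an assumption; a real runner might also check JSON shape or file content.

```rust
/// Return every expected fragment that does not appear in `actual`.
pub fn missing_fragments<'a>(actual: &str, expected: &[&'a str]) -> Vec<&'a str> {
    expected
        .iter()
        .copied()
        .filter(|frag| !actual.contains(*frag))
        .collect()
}

/// Positive assertion: pass only when every expected fragment is present.
/// An empty expectation list is rejected, so "absence of error markers"
/// can never count as a pass on its own.
pub fn assert_output(actual: &str, expected: &[&str]) -> Result<(), String> {
    if expected.is_empty() {
        return Err("scenario declares no expected output".to_string());
    }
    let missing = missing_fragments(actual, expected);
    if missing.is_empty() {
        Ok(())
    } else {
        Err(format!("missing expected output: {}", missing.join(", ")))
    }
}
```

For example, a chat scenario could capture the outbox text and call `assert_output(outbox, &["role=coordinator"])`, failing with the missing marker named in the message.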

Log

2026-04-26T20:00:02.901543890+00:00 Task paused
2026-04-26T20:00:46.009667983+00:00 Task published
2026-04-26T20:01:42.527056154+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=graph, reason=Careful Programmer best fits correctness-critical infrastructure (smoke gate blocks task completion); Careful tradeoff suits high-risk validation work with live endpoint testing; 41-task track record demonstrates reliable implementation execution.
2026-04-26T20:01:44.444715489+00:00 Spawned by coordinator --executor claude --model opus
2026-04-26T20:01:57.376659621+00:00 Starting: planning approach — manifest + wg done integration + tests
2026-04-26T20:49:59.030502498+00:00 Implemented smoke gate: src/smoke.rs (manifest loader + scenario runner), wired into wg done as a hard gate before status transitions. Added --full-smoke (run all scenarios) and --skip-smoke (refused for agents unless WG_SMOKE_AGENT_OVERRIDE=1).
2026-04-26T20:49:59.039200680+00:00 Added 5 initial scenarios under tests/smoke/scenarios/ + manifest. Live-verified against the installed wg binary: 4/5 PASS (including nex_two_message against lambda01), 1 loud-SKIP for missing LLM creds. End-to-end: gate refused done with broken scenario, allowed done with --skip-smoke (human) and loud warn, blocked --skip-smoke for agent.
2026-04-26T20:49:59.052860111+00:00 Tests: 6 integration tests in tests/integration_smoke_gate.rs + 8 unit tests in src/smoke.rs all pass. cargo test full suite: 1 pre-existing failure (provenance_full_lifecycle_all_ops_recorded), confirmed pre-existing on clean checkout. No regressions caused by this change.
2026-04-26T20:49:59.064616167+00:00 Updated executor.rs prompt template (REQUIRED_WORKFLOW_SECTION) and CLAUDE.md so future agents see the smoke gate as part of the wg done contract.
2026-04-26T21:01:34.397106179+00:00 Committed: 9d1e60295 — pushed to origin wg/agent-102/smoke-gate-is
2026-04-26T21:01:42.324068737+00:00 Task marked as done
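
The `--skip-smoke` policy recorded in the log (honored for humans with a loud warning, refused for agents unless `WG_SMOKE_AGENT_OVERRIDE=1`) could be expressed roughly as follows. The `Caller` type and `skip_smoke_allowed` function are hypothetical, and how the tool actually distinguishes an agent from a human is not shown in the log.

```rust
/// Who invoked `wg done`. The detection mechanism is an assumption
/// and is left to the caller of this function.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Caller {
    Human,
    Agent,
}

/// Decide whether `--skip-smoke` is honored.
/// `agent_override` is the value of WG_SMOKE_AGENT_OVERRIDE, if it is set.
pub fn skip_smoke_allowed(caller: Caller, agent_override: Option<&str>) -> bool {
    match caller {
        // Humans may skip; the caller is expected to print a loud warning.
        Caller::Human => true,
        // Agents may skip only with the explicit override value "1".
        Caller::Agent => agent_override == Some("1"),
    }
}
```

In use, the override would be read with `std::env::var("WG_SMOKE_AGENT_OVERRIDE").ok()` and passed in via `.as_deref()`, which keeps the decision itself free of environment access and easy to test.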