verify-document-shell — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-1318`
Agent identity	`f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e`
Created	2026-05-01T13:15:16.229746646+00:00
Started	2026-05-01T13:15:43.061766306+00:00
Completed	2026-05-01T13:34:59.752402214+00:00
Tags	`fix,bug,docs,state-machine`, `eval-scheduled`
Eval score	0.93
└ blocking impact	0.95
└ completeness	0.95
└ constraint fidelity	0.70
└ coordination overhead	0.90
└ correctness	0.95
└ downstream usability	0.90
└ efficiency	0.85
└ intent fidelity	0.69
└ style adherence	0.95

Description

Two related concerns:

Behavioral: when a --exec-mode shell task fails (exit != 0) without explicit wg done / wg fail, does it enter failed-pending-eval and get stuck (because no LLM evaluator can score a shell task), OR does the state machine correctly skip the rescue path for shell tasks?
Documentation: wg quickstart documents --exec-mode shell and failed-pending-eval, but doesn't describe how they interact. The failed-pending-eval description says 'agent exited non-zero with auto_evaluate=true, awaiting evaluator verdict' — which implies eval ALWAYS runs, but that can't be true for shell tasks.

User question 2026-05-01: 'is this well documented in quickstart? is failure managed correctly?'

Investigation steps

Behavioral

Create a --exec-mode shell task that runs a command exiting non-zero (e.g., --exec 'exit 1')
Observe the resulting status. Possibilities:
- Direct terminal failed (skips pending-eval entirely — correct): the implementer saw shell tasks should bypass the rescue path
- Stuck in pending-eval (bug): waiting for an evaluator that never fires
- Some other state: investigate what
If the behavior is correct, find where in the state-transition code the shell-mode bypass happens (cite file:line). If missing, file as a fix.

Documentation

Update wg quickstart to explicitly say:
- Shell tasks (--exec-mode shell or --exec ...) do NOT get LLM evaluation; failure semantics are 'exit code 0 = done, non-zero = failed (terminal, no rescue)'
- failed-pending-eval applies ONLY to LLM agent tasks (full / light / bare exec modes)
- Recurring shell tasks (cron with --exec) follow the same rule: each invocation succeeds-or-fails terminally, no rescue attempt

Validation

Behavior verified — paste the outcome of wg add --exec 'exit 1' end-to-end test in the task log (state trajectory: open → in-progress → ???)
If state machine is correct: cite file:line of the shell-mode bypass in the state-transition code
If state machine is buggy (shell task stuck in pending-eval): file a follow-up fix task with concrete patch
Quickstart updated to clarify the interaction between shell-mode and failed-pending-eval
If applicable: CLAUDE.md, agent-guide bundle, or COMMANDS.md updated with the same clarification (per the doc-sync function pattern)
cargo build + cargo test pass
cargo install --path . was run before claiming done

Process note

This is a small task that doubles as a behavioral assertion + docs update. If the behavior is fine, the docs change is small. If the behavior has a bug, file the fix separately and let this task land the docs update against current behavior — they shouldn't block each other.

## Description
Two related concerns:

1. **Behavioral**: when a `--exec-mode shell` task fails (exit != 0) without explicit `wg done` / `wg fail`, does it enter failed-pending-eval and get stuck (because no LLM evaluator can score a shell task), OR does the state machine correctly skip the rescue path for shell tasks?

2. **Documentation**: `wg quickstart` documents `--exec-mode shell` and failed-pending-eval, but doesn't describe how they interact. The failed-pending-eval description says 'agent exited non-zero with auto_evaluate=true, awaiting evaluator verdict' — which implies eval ALWAYS runs, but that can't be true for shell tasks.

User question 2026-05-01: 'is this well documented in quickstart? is failure managed correctly?'

## Investigation steps

### Behavioral
1. Create a `--exec-mode shell` task that runs a command exiting non-zero (e.g., `--exec 'exit 1'`)
2. Observe the resulting status. Possibilities:
   - **Direct terminal failed** (skips pending-eval entirely — correct): the implementer saw shell tasks should bypass the rescue path
   - **Stuck in pending-eval** (bug): waiting for an evaluator that never fires
   - **Some other state**: investigate what
3. If the behavior is correct, find where in the state-transition code the shell-mode bypass happens (cite file:line). If missing, file as a fix.

### Documentation
- Update `wg quickstart` to explicitly say:
  - Shell tasks (`--exec-mode shell` or `--exec ...`) do NOT get LLM evaluation; failure semantics are 'exit code 0 = done, non-zero = failed (terminal, no rescue)'
  - failed-pending-eval applies ONLY to LLM agent tasks (full / light / bare exec modes)
  - Recurring shell tasks (cron with --exec) follow the same rule: each invocation succeeds-or-fails terminally, no rescue attempt

## Validation
- [ ] Behavior verified — paste the outcome of `wg add --exec 'exit 1'` end-to-end test in the task log (state trajectory: open → in-progress → ???)
- [ ] If state machine is correct: cite file:line of the shell-mode bypass in the state-transition code
- [ ] If state machine is buggy (shell task stuck in pending-eval): file a follow-up fix task with concrete patch
- [ ] Quickstart updated to clarify the interaction between shell-mode and failed-pending-eval
- [ ] If applicable: CLAUDE.md, agent-guide bundle, or COMMANDS.md updated with the same clarification (per the doc-sync function pattern)
- [ ] cargo build + cargo test pass
- [ ] cargo install --path . was run before claiming done

## Process note
This is a small task that doubles as a behavioral assertion + docs update. If the behavior is fine, the docs change is small. If the behavior has a bug, file the fix separately and let this task land the docs update against current behavior — they shouldn't block each other.

Depends on

done .assign-verify-document-shell

Required by

done fix-shell-pending-eval

Log

2026-05-01T13:15:16.207795027+00:00 Task paused
2026-05-01T13:15:20.638463508+00:00 Task published
2026-05-01T13:15:40.674680814+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=full, reason=Careful Programmer is well-suited for state-machine verification (correctness-critical) and documentation updates; high score (0.79) and extensive task history (466) demonstrate strong performance on similar verification/refactoring work.
2026-05-01T13:15:43.061771115+00:00 Spawned by coordinator --executor claude --model opus
2026-05-01T13:15:57.513689568+00:00 Starting investigation: shell-mode (--exec) failure handling vs failed-pending-eval state
2026-05-01T13:17:40.804083030+00:00 Found fail.rs:69-71 — FailedPendingEval routing happens when class=AgentExitNonzero AND auto_evaluate=true. No exec-mode bypass exists in the routing decision.
2026-05-01T13:21:10.670053059+00:00 BUG CONFIRMED: shell task with --class agent-exit-nonzero routes to failed-pending-eval. Trajectory: open → in-progress → failed-pending-eval. No .evaluate-X exists for shell tasks (correctly suppressed by eval_scaffold.rs:174-179) so the rescue never resolves. Bug location: src/commands/fail.rs:69-71 — the routing condition checks class+auto_evaluate but NOT is_shell_task.
2026-05-01T13:33:51.155767448+00:00 Validated: cargo build + cargo test (failed_pending_eval + pending_eval_state suites pass; 1 unrelated pre-existing failure in config::tests::test_global_config_path on .workgraph→.wg path migration). cargo install --path . completed. wg quickstart now shows updated docs in agent-guide.md, COMMANDS.md, and quickstart.rs.
2026-05-01T13:34:49.008835829+00:00 Committed: b1acf64d2 — pushed to origin
2026-05-01T13:34:59.752414257+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-05-01T13:38:02.846230745+00:00 PendingEval → Done (evaluator passed; downstream unblocks)