Metadata
| Status | done |
|---|---|
| Assigned | agent-1487 |
| Agent identity | f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e |
| Created | 2026-05-02T02:31:00.718635052+00:00 |
| Started | 2026-05-02T02:31:45.179229704+00:00 |
| Completed | 2026-05-02T02:52:42.139061195+00:00 |
| Tags | priority-high, fix, docs, agents, prompting, eval-scheduled |
| Eval score | 0.76 |
| └ blocking impact | 0.80 |
| └ completeness | 0.70 |
| └ constraint fidelity | 0.10 |
| └ coordination overhead | 0.85 |
| └ correctness | 0.75 |
| └ downstream usability | 0.75 |
| └ efficiency | 0.75 |
| └ intent fidelity | 0.76 |
| └ style adherence | 0.80 |
Description
Codex chat agents have been observed doing implementation work themselves (writing code, making changes) instead of dispatching it to worker tasks via `wg add`. The chat agent contract is supposed to be 'thin task-creator, not implementer'. Claude chat agents follow this contract; codex chat agents do not.
User report 2026-05-01: 'we have the codex .chat- agents always doing work themselves rather than making wg tasks. is there a prompting gap with codex/claude? like AGENTS.md or the .chat prompting isn't as clearly saying hey, don't just do the work unless the user asks you to.'
Confirmed root cause
After reorg-separate-universal (Apr 29) split CLAUDE.md into layer-2 only (project-specific) + bundled wg agent-guide (universal role contract), the same surgery was NOT applied to AGENTS.md.
Current state:
- CLAUDE.md (5145 bytes): layer-2 only, says 'run `wg agent-guide` for the universal contract'
- AGENTS.md (7687 bytes): has the OLDER mixed content with the role contract INLINE, never updated post-reorg
Net effect:
- claude agents → read CLAUDE.md → run `wg agent-guide` → see the canonical, possibly more directive role contract
- codex agents → read AGENTS.md → see older inline role contract → less consistent enforcement
A behavioral asymmetry likely compounds this: codex's 'be helpful, do the work' baseline appears to outweigh its instruction-following, especially when the role contract it reads is softer or older than the bundled version.
Spec
Fix 1: bring AGENTS.md into parity with CLAUDE.md
- AGENTS.md becomes layer-2 only (workgraph-as-a-project content)
- Strip the inline universal role contract
- Add the same 'run `wg agent-guide` for the universal contract' pointer that CLAUDE.md has
- Both files point at the SAME bundled source of truth, so there is no drift
Fix 2: strengthen wg agent-guide's chat-agent role language
The current bundled agent-guide should be loud about 'DO NOT WRITE CODE. DO NOT IMPLEMENT.' Specifically:
- Lead with the role distinction prominently: the FIRST thing a chat agent reads should be 'You are a chat agent. Your job is to create wg tasks via `wg add`, NOT to do the work yourself.'
- Add concrete anti-patterns: 'Don't run `cargo build`. Don't open the editor. Don't grep for code. Use `wg add` to dispatch every code-touching action to a worker.'
- Add explicit list of things chat agents CAN do: `wg show`, `wg list`, `wg log`, `wg add`, `wg edit`, `wg publish`. Things they CAN'T do: `cargo`, `grep` on source, edit files in src/, etc.
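The CAN / CAN'T split above can be read as an allow-list with deny-by-default. The following is a minimal sketch of that reading, not code from the wg repository: `Verdict`, `ALLOWED_WG_SUBCOMMANDS`, and `classify_chat_command` are hypothetical names, and treating every unlisted command as "dispatch" is an assumption.

```rust
/// Verdict for a command a chat agent is about to run (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Orchestration command the chat agent may run directly.
    Allowed,
    /// Anything else: hand it to a worker via `wg add`.
    Dispatch,
}

/// The `wg` subcommands the spec lists as permitted for chat agents.
const ALLOWED_WG_SUBCOMMANDS: [&str; 6] = ["show", "list", "log", "add", "edit", "publish"];

/// Classify a shell command line. Hypothetical helper: only `wg <listed
/// subcommand>` is allowed; `cargo`, `grep`, editors, etc. fall through.
pub fn classify_chat_command(command_line: &str) -> Verdict {
    let mut words = command_line.split_whitespace();
    match (words.next(), words.next()) {
        (Some("wg"), Some(sub)) if ALLOWED_WG_SUBCOMMANDS.contains(&sub) => Verdict::Allowed,
        _ => Verdict::Dispatch,
    }
}
```

Under this sketch `classify_chat_command("cargo build")` yields `Verdict::Dispatch`. Note that `wg agent-guide` itself is not in the spec's list, so a real allow-list would presumably need to include it.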
Fix 3 (optional, codex-specific): codex chat spawn includes an extra system-prompt addendum
If codex's behavioral baseline still pulls toward 'do work' even with strengthened agent-guide, add a codex-specific addendum at chat spawn that says (loudly) 'STOP. Do not write code. Use wg add for any implementation. The user is talking to you to ORCHESTRATE work, not to receive it.' Inject this when spawning codex chat tabs specifically.
This is the same kind of asymmetry the codex bypass-flag fix addressed (different handler needs different treatment). Acceptable to ship.
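One way Fix 3 could look is sketched below, assuming the spawn path can tell which executor backs the chat tab. The `Executor` enum and `first_turn_prompt` function are hypothetical and do not reflect the actual contents of `codex_handler.rs`; only the addendum wording is taken from the spec.

```rust
/// Executor behind a chat tab (illustrative enum, not wg's real type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Executor {
    Claude,
    Codex,
}

/// Addendum wording quoted from the spec above.
pub const CODEX_CHAT_ADDENDUM: &str = "STOP. Do not write code. Use wg add for any implementation. \
The user is talking to you to ORCHESTRATE work, not to receive it.";

/// Build the first-turn prompt for a chat tab (hypothetical helper).
/// Codex gets the addendum prepended; claude prompts pass through unchanged.
pub fn first_turn_prompt(executor: Executor, base_prompt: &str) -> String {
    match executor {
        Executor::Codex => format!("{}\n\n{}", CODEX_CHAT_ADDENDUM, base_prompt),
        Executor::Claude => base_prompt.to_string(),
    }
}
```

Keeping the addendum in a single constant would let a unit test assert its presence in the codex prompt without duplicating the wording.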
Validate behavior empirically
The proof is empirical: a codex chat agent receiving a 'fix bug X' request should respond with 'I'll file this as a wg task' + actual wg add invocation, NOT with 'Let me look at the code...' followed by editing.
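The pass/fail criterion above could be checked mechanically if the commands a chat agent ran during one turn can be captured. That capture step is an assumption, and `TurnKind` and `classify_turn` below are hypothetical names for a rough sketch of the check.

```rust
/// Outcome of one chat-agent turn (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub enum TurnKind {
    /// Ran `wg add` and no non-wg commands.
    Dispatched,
    /// Ran at least one non-wg command (reading source, building, editing).
    Implemented,
    /// Ran only wg commands, none of them `wg add`.
    Neither,
}

fn is_wg(command: &str) -> bool {
    command.split_whitespace().next() == Some("wg")
}

fn is_wg_add(command: &str) -> bool {
    let mut words = command.split_whitespace();
    words.next() == Some("wg") && words.next() == Some("add")
}

/// Classify a turn from the command lines the agent executed (hypothetical helper).
pub fn classify_turn(commands: &[&str]) -> TurnKind {
    if commands.iter().any(|c| !is_wg(c)) {
        TurnKind::Implemented
    } else if commands.iter().any(|c| is_wg_add(c)) {
        TurnKind::Dispatched
    } else {
        TurnKind::Neither
    }
}
```

A 'fix bug X' repro would then expect `TurnKind::Dispatched` after the fix and `TurnKind::Implemented` before it.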
Validation
- Failing test or behavioral repro: spawn a codex chat agent, give it a code-touching request ('fix bug Y in src/foo.rs'). Pre-fix: agent reads source / makes edits. Post-fix: agent files `wg add` and waits for the worker.
- AGENTS.md is now layer-2 only (workgraph-project context only); same shape as CLAUDE.md
- grep AGENTS.md for inline role-contract content: zero matches (or only pointers to wg agent-guide)
- wg agent-guide content updated with stronger / clearer chat-agent role language
- Same behavioral test passes for claude chat agent (no regression)
- If Fix 3 implemented: codex chat spawn args include the system-prompt addendum
- cargo build + cargo test pass
- cargo install --path . was run before claiming done
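The 'grep AGENTS.md for inline role-contract content' item could be automated along these lines. This is a sketch under stated assumptions: the marker phrases are illustrative guesses at what inline contract text looks like, and `inline_contract_lines` is a hypothetical helper, not an existing wg function.

```rust
/// Phrases assumed to signal inline role-contract text (illustrative guesses).
const ROLE_CONTRACT_MARKERS: [&str; 2] = ["thin task-creator", "not implementer"];

/// Return (1-based line number, line) for every line that carries
/// role-contract wording without also pointing at `wg agent-guide`.
/// An empty result corresponds to "zero matches (or only pointers)".
pub fn inline_contract_lines(doc: &str) -> Vec<(usize, &str)> {
    doc.lines()
        .enumerate()
        .filter(|(_, line)| {
            let lower = line.to_lowercase();
            let has_marker = ROLE_CONTRACT_MARKERS.iter().any(|m| lower.contains(*m));
            has_marker && !lower.contains("wg agent-guide")
        })
        .map(|(index, line)| (index + 1, line))
        .collect()
}
```

Passing the contents of AGENTS.md (for example, read with `std::fs::read_to_string`) and asserting the result is empty would turn the checklist item into a regression test.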
Process note
This is exactly the kind of asymmetry a comprehensive doc-sync would have caught — both files have similar surface but different age. The doc-sync function template should be amended to: 'AGENTS.md and CLAUDE.md should be checked together; any drift between them is a bug, not an intentional difference.'
The autohaiku evaluator-grade-zero bug from earlier today (.workgraph paths) and this bug share a root cause: agent-visible documentation drift. Each instance feels small but they compound — agents make decisions based on stale text and we don't notice until something goes wrong.
Depends on
Required by
- (none)
Log
- 2026-05-02T02:31:00.687074625+00:00 Task paused
- 2026-05-02T02:31:00.778336100+00:00 Task published
- 2026-05-02T02:31:41.809518595+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=task, reason=Implementation + documentation fix requiring careful validation of agent-prompt behavior; Careful Programmer tradeoff matches the correctness-critical nature and exhaustive testing requirement emphasized in validation criteria.
- 2026-05-02T02:31:45.179235033+00:00 Spawned by coordinator --executor claude --model opus
- 2026-05-02T02:31:54.145495627+00:00 Starting: reading AGENTS.md, CLAUDE.md, and wg agent-guide source
- 2026-05-02T02:33:06.460170272+00:00 Plan: (1) rewrite AGENTS.md to layer-2-only mirror of CLAUDE.md; (2) strengthen src/text/agent_guide.md by moving chat-agent contract to lead with louder anti-patterns + can/can't lists; (3) add codex-specific 'STOP. Do not write code.' addendum in codex_handler.rs first-turn prompt; (4) add a unit test asserting the addendum appears.
- 2026-05-02T02:52:00.721293962+00:00 Validated: cargo build clean (warnings only). cargo test --bin wg -- agent_guide codex_handler: 15/15 pass. Full suite 3350/3351 (1 pre-existing flaky tmux test passes in isolation). cargo install --path . done. wg agent-guide now leads with STOP banner.
- 2026-05-02T02:52:33.417442984+00:00 Committed: bf583d80e — pushed to origin/wg/agent-1487/fix-agents-md
- 2026-05-02T02:52:42.139074209+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-05-02T02:56:37.192348383+00:00 PendingEval → Done (evaluator passed; downstream unblocks)