Metadata
| Status | done |
|---|---|
| Assigned | agent-2478 |
| Agent identity | 02e879681e52e0a384106169be043416c4d946e850ab26b2269c57681b52a6e7 |
| Model | codex:gpt-5.5 |
| Created | 2026-05-04T21:58:09.358085955+00:00 |
| Started | 2026-05-04T21:58:59.794201959+00:00 |
| Completed | 2026-05-04T22:22:23.877719189+00:00 |
| Tags | fix,nex,agent,prompting,tools, eval-scheduled |
| Eval score | 0.85 |
| └ blocking impact | 0.87 |
| └ completeness | 0.87 |
| └ constraint fidelity | 0.55 |
| └ coordination overhead | 0.87 |
| └ correctness | 0.88 |
| └ downstream usability | 0.82 |
| └ efficiency | 0.77 |
| └ intent fidelity | 0.83 |
| └ style adherence | 0.88 |
Description
Observed in .chat-0 (nex/qwen3-coder) on 2026-05-04: the user asked for a 'Copenhagen weather forecast for June 28-July 3, 2026'. The agent's response was to write a Rust program in ~/household/src/main.rs that prints hardcoded weather text via println!.
User reaction: 'wg nex tried to write a rust program too instead of searching the web lol.'
Root cause (multi-layer)
- Model bias: qwen3-coder is a coding model. Its training prior heavily favors 'write code to solve this' over 'use a tool / fetch data / answer directly'. Reasonable for many tasks; wrong for current-data tasks.
- Tool affordances: nex's tool list may not include web search (or curl in a way the agent recognizes as fetching live data). Agent sees: file write, bash, code execution. No 'web' affordance. So it falls back to 'write code that fakes the data'.
- Prompt guidance: nex's system prompt likely doesn't say 'if the task needs current data and you don't have a web tool, ASK the user for the data or NOTE that you can't fetch it'. Without this guidance, the model fills the gap by hallucinating.
What the agent SHOULD have done
Several reasonable responses:
- Note that it has no web access tool, ask the user to paste the forecast OR confirm 'do you want me to write a placeholder program?'
- If bash has curl available: run `curl https://wttr.in/Copenhagen` and parse the output, which is actual current data
- If neither: explicitly say 'I don't have a way to fetch current data; here's a placeholder structure you can fill in'
Spec
A. Improve nex's bundled prompt to handle this case
Add to nex's system prompt (or the agent-guide content nex consumes):
When asked to produce content that requires current real-world data (weather, news, prices, dates beyond your training cutoff, etc.):
- If you have a web fetch tool, use it.
- If you have bash, try curl / wget for known data endpoints (e.g., curl wttr.in for weather).
- If neither: explicitly state you cannot fetch live data, and ask the user to either provide it or confirm they want a code skeleton / placeholder.
- Do NOT default to writing code that fabricates the data.
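The guidance above amounts to a short decision ladder. As a minimal sketch, assuming a hypothetical `Affordances` struct that summarizes what the agent can see (these names are illustrative and are not nex's actual types), the ordering could be expressed as:

```rust
/// Hypothetical summary of the tools visible to the agent.
/// Not nex's real tool model; used only to illustrate the ordering.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Affordances {
    pub web_fetch: bool,
    pub bash: bool,
}

/// What the agent should do when a request needs current real-world data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CurrentDataPlan {
    /// A web fetch tool exists: use it.
    UseWebFetch,
    /// Only bash exists: try curl / wget against a known endpoint.
    TryBashHttp,
    /// Nothing can reach the network: say so and ask the user for the
    /// data, or for confirmation that a placeholder is wanted.
    AskUserOrDeclareLimit,
}

/// Mirrors the prompt guidance in order of preference.
/// Note that there is deliberately no "fabricate the data" branch.
pub fn plan_for_current_data(tools: Affordances) -> CurrentDataPlan {
    if tools.web_fetch {
        CurrentDataPlan::UseWebFetch
    } else if tools.bash {
        CurrentDataPlan::TryBashHttp
    } else {
        CurrentDataPlan::AskUserOrDeclareLimit
    }
}
```

In practice the ladder lives in the prompt text rather than in code; the sketch only makes the precedence explicit, for example when writing a test that checks which branch a given tool configuration should reach.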
B. Audit nex's tool list for a web-fetch affordance
- Does nex have a curl / fetch tool exposed to the agent? If not, why not?
- Recommend: expose a basic 'fetch_url' tool that the agent can call. It should be bound to safe verbs (GET only by default) and respect user-configurable allowlists.
- If web fetch is intentionally not available: at least expose bash, and tell the agent that bash is the path for HTTP requests.
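One way the recommended 'fetch_url' guard might look is sketched below. Everything here is an assumption made for illustration: `FetchPolicy`, `FetchDenied`, `host_of`, and `check_fetch` are hypothetical names, and the host extraction is a deliberately simple standard-library parse (it does not handle IPv6 literals), not a full URL parser.

```rust
/// Hypothetical policy for a `fetch_url` tool (not an existing nex type).
pub struct FetchPolicy {
    /// Hosts the user has allowed, e.g. "wttr.in".
    pub allowed_hosts: Vec<String>,
    /// When false (the default suggested in the spec), only GET is permitted.
    pub allow_non_get: bool,
}

#[derive(Debug, PartialEq, Eq)]
pub enum FetchDenied {
    MethodNotAllowed,
    InvalidUrl,
    HostNotAllowed,
}

/// Extracts the host from an http(s) URL using only std.
/// Simplified on purpose: IPv6 literals such as "[::1]" are not supported.
fn host_of(url: &str) -> Option<&str> {
    let rest = url
        .strip_prefix("https://")
        .or_else(|| url.strip_prefix("http://"))?;
    // The authority ends at the first '/', '?', or '#'.
    let authority = rest.split(|c| c == '/' || c == '?' || c == '#').next()?;
    // Drop any "user:pass@" prefix, then any ":port" suffix.
    let host_port = authority.rsplit('@').next()?;
    let host = host_port.split(':').next()?;
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}

/// Checks a request against the policy before any network call is made.
pub fn check_fetch(policy: &FetchPolicy, method: &str, url: &str) -> Result<(), FetchDenied> {
    if !policy.allow_non_get && !method.eq_ignore_ascii_case("GET") {
        return Err(FetchDenied::MethodNotAllowed);
    }
    let host = host_of(url).ok_or(FetchDenied::InvalidUrl)?;
    if policy.allowed_hosts.iter().any(|h| h.eq_ignore_ascii_case(host)) {
        Ok(())
    } else {
        Err(FetchDenied::HostNotAllowed)
    }
}
```

With a policy that allows only `wttr.in`, a GET to `https://wttr.in/Copenhagen` passes, while a POST to the same URL or a GET to any other host is rejected before a request is issued. A real implementation would likely also want redirect handling and a proper URL parser.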
C. Model selection guidance
qwen3-coder is biased toward code; for general-assistant work the user might want a non-coder model. nex should accept a non-coder model spec without complaint. The agent's behavior shouldn't entirely depend on this — the prompt fix in A handles the bias even when the user picks qwen3-coder.
Validation
- Failing test or repro: ask nex/qwen3-coder for current weather; pre-fix, it writes a code skeleton; post-fix, it asks the user OR uses curl OR explicitly says 'I can't fetch data'
- Test with a non-coder model (claude:haiku, gpt-5.4-mini): same behavior — explicit honesty about data access
- Test with curl available in bash: agent attempts a fetch via curl
- cargo build + cargo test pass
- cargo install --path . was run before claiming done
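The behavioral checks in this list could be approximated by a rough response classifier. The sketch below is an assumption, not the assertion logic used by the actual smoke scenario: `Verdict` and `classify_response` are hypothetical names, and the keyword lists are crude heuristics that a real test would need to tune.

```rust
/// Hypothetical verdicts for an agent reply to a current-data request.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Verdict {
    /// The reply attempts a fetch via curl or wget.
    FetchAttempt,
    /// The reply states that it cannot fetch live data.
    HonestLimit,
    /// The reply asks the user to supply the data.
    AsksUser,
    /// The reply looks like the pre-fix behavior: code printing made-up data.
    LikelyFabricated,
    /// None of the heuristics matched; a human should look.
    NeedsReview,
}

/// Keyword-based classification; order matters (first match wins).
pub fn classify_response(text: &str) -> Verdict {
    let lower = text.to_lowercase();
    let has_any = |needles: &[&str]| needles.iter().any(|n| lower.contains(*n));

    if has_any(&["curl ", "wget "]) {
        Verdict::FetchAttempt
    } else if has_any(&["can't fetch", "cannot fetch", "don't have a way to fetch"]) {
        Verdict::HonestLimit
    } else if has_any(&["paste", "please provide"]) {
        Verdict::AsksUser
    } else if has_any(&["println!", "fn main"]) {
        Verdict::LikelyFabricated
    } else {
        Verdict::NeedsReview
    }
}
```

Under these heuristics, the first three verdicts count as acceptable post-fix behavior, and `LikelyFabricated` corresponds to the originally observed failure. Because honesty is checked before code detection, a reply that admits it cannot fetch data and then offers a placeholder skeleton is still classified as `HonestLimit`.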
Out of scope for this task
- Building a full web-fetch tool with HTML parsing / search ranking. That's a bigger feature; this task just adds the prompt guidance + bash hint.
- Restricting what models can be used with nex. That's user choice; this fix makes the agent behavior reasonable regardless of model.
Depends on
Required by
Log
- 2026-05-04T21:58:09.335148393+00:00 Task paused
- 2026-05-04T21:58:29.588855390+00:00 Task published
- 2026-05-04T21:58:55.995933203+00:00 Lightweight assignment: agent=Careful Programmer (02e87968), exec_mode=full, context_scope=task, reason=Careful Programmer is best-suited for implementing prompting fixes and tool audits; high score (0.81) and 622 tasks demonstrate reliability for correctness-critical agent behavior work.
- 2026-05-04T21:58:59.794206518+00:00 Spawned by coordinator --executor codex --model gpt-5.5
- 2026-05-04T21:59:23.033212265+00:00 Starting implementation: inspecting nex prompts, tool exposure, and existing tests.
- 2026-05-04T22:03:08.104629885+00:00 Implemented nex current-data prompt guidance, minimal-tools bash HTTP hint, tool audit doc, unit tests, and owned smoke scenario.
- 2026-05-04T22:09:55.210478452+00:00 Validation note: cargo build passed; focused nex unit tests passed; full cargo test currently fails in unrelated prompt_snapshots motivation hash diffs (Quality First hash), not in nex changes.
- 2026-05-04T22:13:59.763834915+00:00 Validated: installed wg passes tests/smoke/scenarios/nex_current_data_prompt.sh against local fake OAI endpoint for qwen coder, non-coder, and minimal-tools paths.
- 2026-05-04T22:20:05.683374092+00:00 Validated: cargo build passed; cargo test passed; cargo install --path . completed; installed wg passes owned nex_current_data_prompt smoke scenario.
- 2026-05-04T22:21:48.090984724+00:00 Committed: 5c64aa12b — pushing to remote.
- 2026-05-04T22:22:07.894179741+00:00 Committed: 5c64aa12b — pushed to origin/wg/agent-2478/fix-nex-agent.
- 2026-05-04T22:22:23.877729789+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-05-04T22:23:30.367681992+00:00 PendingEval → Done (evaluator passed; downstream unblocks)