triage-3-small-model — Workgraph live mirror

Metadata

Status	done
Assigned	`agent-876`
Agent identity	`f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e`
Created	2026-04-27T21:47:48.312178850+00:00
Started	2026-04-27T21:50:17.519176926+00:00
Completed	2026-04-27T21:54:40.796498852+00:00
Tags	`eval-scheduled`
Eval score	0.88
└ blocking impact	0.87
└ completeness	0.92
└ constraint fidelity	0.85
└ coordination overhead	0.87
└ correctness	0.88
└ downstream usability	0.88
└ efficiency	0.82
└ intent fidelity	0.87
└ style adherence	0.87

## Description

A Qwen3-Coder-30B mixture-of-experts model running via `wg nex` produced three bug reports describing real friction. User context: 'these small models are useful if they're going to get access to information, but it seemed as having trouble doing that.' The goal of this task is to **triage** the reports: validate each against the actual `wg nex` code, separate true bugs from model misunderstandings, and produce a prioritized follow-up task list.

## Source files

- `/home/erik/workgraph/tool_call_processing_bug_report.md` — wants streaming/incremental tool-call feedback + tokens/sec metrics
- `/home/erik/workgraph/ui_freeze_bug_report.md` — UI appears frozen during long write_file ops; no incremental output
- `/home/erik/workgraph/ecmwf_analysis_limitation.md` — agent can't download binary files (GRIB), can't auth to APIs, can't parse SVG/PNG; specific use case was Memphis weather forecasts

## Triage questions per report

For each report:
1. **Reproduce or refute**: does the described behavior actually happen in wg nex today? Run the failure mode, observe.
2. **Root cause**: if real, where in the code (`src/executor/native/*` or wherever)? If not real, what did the model misunderstand about its own tool surface?
3. **Cost/value**: what's the implementation effort vs. how much it unlocks for small models?
4. **Decide**: file follow-up task / merge with existing task / mark won't-fix / clarify model docs

## Specific things to look at

- Reports 1+2 are almost certainly the same root issue (no streaming / no progress feedback during long ops). Treat as one investigation.
- Report 3 is different: it's about tool *capability* gaps (binary fetch, auth, format parsing). Note: the existing native executor has `fetch_max_chars = 16000` per local config, a hard cap that could by itself explain the binary-fetch limitation. Worth surfacing whether the agent's complaints reflect a missing capability or limits the model didn't know about.
- Native executor delegate config (`[native_executor.delegate]`) and web config (`[native_executor.web]`) are existing surface area worth understanding before proposing additions.
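To make the capability-versus-limit question concrete, here is a minimal sketch of how a character cap on fetched content behaves. It is an illustration only: `apply_fetch_cap` is a hypothetical helper, not the native executor's code, and the only value taken from the task text is the `16000` cap.

```python
# Value quoted in the task description; everything else here is illustrative.
FETCH_MAX_CHARS = 16000


def apply_fetch_cap(body: str, cap: int = FETCH_MAX_CHARS) -> tuple[str, bool]:
    """Return the body cut to `cap` characters and whether it was truncated."""
    truncated = len(body) > cap
    return body[:cap], truncated


# Stand-in for a large fetched document (hypothetical payload).
page = "x" * 50_000
kept, was_truncated = apply_fetch_cap(page)
print(len(kept), was_truncated)

# A short response passes through untouched.
short_kept, short_truncated = apply_fetch_cap("small response")
print(short_kept, short_truncated)
```

If a reported failure shows output ending at exactly the cap, that would point toward a limit the model was not told about rather than a missing capability. Whether the real executor truncates this way is something the triage itself needs to confirm.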

## Deliverable

Produce `docs/triage-wg-nex-small-model-reports-2026-04-27.md` with:
- Per-report verdict: real bug / model misunderstanding / known limitation
- For real bugs: proposed fix scope + estimated effort tier (small/medium/large)
- For misunderstandings: what tool docs / system prompt addition would prevent it
- For known limitations: should they be lifted? Configurable? Documented?
- **Concrete follow-up task list** with proposed titles + descriptions (not draft tasks yet — just the list, this task's owner picks them up after triage lands)

## Cleanup

Move the 3 bug report files from repo root into `docs/agent-reports/` (don't delete — they're real signal worth preserving).
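The cleanup step can be sketched as below. This assumes the three report files sit directly in the repo root under the names listed in the source-files section; `move_reports` is a hypothetical helper, and the demonstration runs against a scratch directory rather than a real checkout.

```python
import shutil
import tempfile
from pathlib import Path

# File names taken from the source-files list above.
REPORTS = [
    "tool_call_processing_bug_report.md",
    "ui_freeze_bug_report.md",
    "ecmwf_analysis_limitation.md",
]


def move_reports(repo_root: Path) -> list[Path]:
    """Move each report found in the repo root into docs/agent-reports/."""
    target_dir = repo_root / "docs" / "agent-reports"
    target_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in REPORTS:
        source = repo_root / name
        if not source.exists():
            continue  # already moved or never present; nothing to do
        destination = target_dir / name
        shutil.move(str(source), str(destination))
        moved.append(destination)
    return moved


# Demonstration on a scratch directory, so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in REPORTS:
        (root / name).write_text("placeholder report\n")
    moved = move_reports(root)
    moved_ok = all(path.is_file() for path in moved)
    left_in_root = [name for name in REPORTS if (root / name).exists()]
    second_run = move_reports(root)  # running again finds nothing to move
print(len(moved), moved_ok, left_in_root, second_run)
```

In a git checkout, `git mv` performs the same move and stages the rename in one step, which may be the more natural choice for a docs-only commit.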

## Validation

- [ ] All 3 reports addressed in the deliverable doc (real bug / misunderstanding / known limitation verdicts)
- [ ] Each real bug has a proposed fix scope + estimated effort tier
- [ ] Follow-up task list is actionable (could be turned into `wg add` calls without re-triaging)
- [ ] 3 source files moved to `docs/agent-reports/`, not deleted
- [ ] Deliverable file exists at `docs/triage-wg-nex-small-model-reports-2026-04-27.md`
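The two file-based items in the checklist above can be checked mechanically; the other three need a human or evaluator to read the deliverable. A minimal sketch, assuming the paths given in this task description (`check_file_items` is a hypothetical helper, demonstrated on a scratch directory):

```python
import tempfile
from pathlib import Path

DELIVERABLE = "docs/triage-wg-nex-small-model-reports-2026-04-27.md"
REPORT_NAMES = [
    "tool_call_processing_bug_report.md",
    "ui_freeze_bug_report.md",
    "ecmwf_analysis_limitation.md",
]


def check_file_items(repo_root: Path) -> dict[str, bool]:
    """Check the file-based validation items against a checkout."""
    reports_dir = repo_root / "docs" / "agent-reports"
    return {
        "deliverable exists": (repo_root / DELIVERABLE).is_file(),
        # Moved means: present in docs/agent-reports/ and gone from the root.
        "reports moved, not deleted": all(
            (reports_dir / name).is_file() and not (repo_root / name).exists()
            for name in REPORT_NAMES
        ),
    }


# Demonstration: an empty checkout fails both checks, a populated one passes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    before = check_file_items(root)
    (root / "docs" / "agent-reports").mkdir(parents=True)
    (root / DELIVERABLE).write_text("# triage\n")
    for name in REPORT_NAMES:
        (root / "docs" / "agent-reports" / name).write_text("report\n")
    after = check_file_items(root)
print(before)
print(after)
```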

Depends on

done .assign-triage-3-small-model

Required by

(none)

Log

2026-04-27T21:47:48.278321553+00:00 Task paused
2026-04-27T21:47:51.618180896+00:00 Task published
2026-04-27T21:49:14.082251350+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=task, reason=Careful Programmer with high capability (0.72 score, 137 tasks) suits code investigation and critical triaging decisions; Careful tradeoff prioritizes correctness for bug classification.
2026-04-27T21:49:15.059728978+00:00 Spawned by coordinator --executor claude --model opus
2026-04-27T21:49:24.386231463+00:00 Starting triage of 3 small-model bug reports. Reading source files first.
2026-04-27T21:50:12.727916133+00:00 Task unclaimed: agent 'agent-874' (PID 3034394) process exited
2026-04-27T21:50:17.519181445+00:00 Spawned by coordinator --executor claude --model opus
2026-04-27T21:52:55.662365306+00:00 Investigation done. Reports 1+2 are real (tool-call arg streaming has no user-facing callback). Report 3 partly real (no auth headers, no GRIB parser) but largely misunderstanding (binary download already works since fce5ac4df 2026-04-16). Writing deliverable.
2026-04-27T21:53:57.816172753+00:00 Triage doc written at docs/triage-wg-nex-small-model-reports-2026-04-27.md. 3 source files moved to docs/agent-reports/. Verdicts: reports 1+2 = same real bug (InputJsonDelta not streamed to user); report 3 = mostly model misunderstanding + small system-prompt + dead todo_write reference.
2026-04-27T21:54:27.272307656+00:00 Committed: a07603679 — pushed to remote (wg/agent-874/triage-3-small-model)
2026-04-27T21:54:27.392506761+00:00 Validated: deliverable file exists, all 3 source files moved (not deleted), follow-up task list is wg-add ready (3 tasks proposed). Docs-only change — no cargo needed.
2026-04-27T21:54:40.796503541+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
2026-04-27T21:56:09.884942159+00:00 PendingEval → Done (evaluator passed; downstream unblocks)
2026-04-27T21:57:49.944932200+00:00 Doc verified. All 5 validation items satisfied: 3 reports addressed (per-report verdicts), each real bug has fix scope + effort tier, follow-up tasks are wg-add-ready, source files relocated to docs/agent-reports/, deliverable file present. Committing doc refinements.
2026-04-27T21:58:05.725172383+00:00 Committed: 929a3d52c — pushed to remote