triage-3-small-model

Triage: 3 small-model bug reports from wg nex (Qwen3-Coder-30B-MoE)

Metadata

Status: done
Assigned: agent-876
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-27T21:47:48.312178850+00:00
Started: 2026-04-27T21:50:17.519176926+00:00
Completed: 2026-04-27T21:54:40.796498852+00:00
Tags: eval-scheduled
Eval score: 0.88
└ blocking impact: 0.87
└ completeness: 0.92
└ constraint fidelity: 0.85
└ coordination overhead: 0.87
└ correctness: 0.88
└ downstream usability: 0.88
└ efficiency: 0.82
└ intent fidelity: 0.87
└ style adherence: 0.87

Description

A Qwen3-Coder-30B mixture-of-experts model running via wg nex produced three bug reports describing real friction. User context: "these small models are useful if they're going to get access to information, but it seemed as having trouble doing that." The goal of this task is to triage the reports: validate each against the actual wg nex code, separate true bugs from model misunderstandings, and produce a prioritized follow-up task list.

Source files

  • /home/erik/workgraph/tool_call_processing_bug_report.md — wants streaming/incremental tool-call feedback + tokens/sec metrics
  • /home/erik/workgraph/ui_freeze_bug_report.md — UI appears frozen during long write_file ops; no incremental output
  • /home/erik/workgraph/ecmwf_analysis_limitation.md — agent can't download binary files (GRIB), can't auth to APIs, can't parse SVG/PNG; specific use case was Memphis weather forecasts

Triage questions per report

For each report:

  1. Reproduce or refute: does the described behavior actually happen in wg nex today? Run the failure mode, observe.
  2. Root cause: if real, where in the code (src/executor/native/* or wherever)? If not real, what did the model misunderstand about its own tool surface?
  3. Cost/value: what's the implementation effort vs. how much it unlocks for small models?
  4. Decide: file follow-up task / merge with existing task / mark won't-fix / clarify model docs

Specific things to look at

  • Reports 1+2 are almost certainly the same root issue (no streaming / no progress feedback during long ops). Treat as one investigation.
  • Report 3 is different: it's about tool capability gaps (binary fetch, auth, format parsing). Note: the existing native executor has fetch_max_chars = 16000 per local config. That is a hard cap on fetched content and could explain at least part of the binary-fetch limitation. Worth surfacing whether the agent's complaints reflect missing capability vs. limits the model didn't know about.
  • Native executor delegate config ([native_executor.delegate]) and web config ([native_executor.web]) are existing surface area worth understanding before proposing additions.
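
To make the fetch_max_chars point concrete, here is a minimal Python sketch of how a character cap can look like a missing capability from the model's side. It is not the wg nex implementation: the function names are hypothetical, and it assumes the cap is applied to the decoded response text. The second function shows one possible mitigation, a notice that tells the model it hit a limit.

```python
# Hypothetical sketch; not taken from the wg nex source.
FETCH_MAX_CHARS = 16000  # value quoted in this task; how it is applied is an assumption


def truncate_fetch(body: str, max_chars: int = FETCH_MAX_CHARS) -> tuple[str, bool]:
    """Return (possibly truncated body, whether truncation happened)."""
    if len(body) <= max_chars:
        return body, False
    return body[:max_chars], True


def describe_fetch(body: str, max_chars: int = FETCH_MAX_CHARS) -> str:
    """Append a notice so the model can tell a limit from a broken fetch."""
    text, truncated = truncate_fetch(body, max_chars)
    if not truncated:
        return text
    dropped = len(body) - max_chars
    return f"{text}\n[truncated: {dropped} of {len(body)} chars omitted; limit is {max_chars}]"
```

If the real executor truncates silently, a model that receives the first 16000 characters of a large file has no way to distinguish "the tool cannot do this" from "the tool hit a configured limit", which is the distinction the triage needs to draw.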

Deliverable

Produce docs/triage-wg-nex-small-model-reports-2026-04-27.md with:

  • Per-report verdict: real bug / model misunderstanding / known limitation
  • For real bugs: proposed fix scope + estimated effort tier (small/medium/large)
  • For misunderstandings: what tool docs / system prompt addition would prevent it
  • For known limitations: should they be lifted? Configurable? Documented?
  • Concrete follow-up task list with proposed titles + descriptions (not draft tasks yet, just the list; this task's owner picks them up after triage lands)

Cleanup

Move the 3 bug report files from repo root into docs/agent-reports/ (don't delete — they're real signal worth preserving).
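
The move can be sketched in Python as below. This is an illustration rather than a prescribed procedure: the helper name move_reports is hypothetical, and the repo root is passed in as an argument. Inside a git checkout, git mv would be the more usual way to do the same thing.

```python
import shutil
from pathlib import Path

# File names as listed under "Source files".
REPORTS = [
    "tool_call_processing_bug_report.md",
    "ui_freeze_bug_report.md",
    "ecmwf_analysis_limitation.md",
]


def move_reports(repo_root: Path, names=REPORTS) -> list[Path]:
    """Move each report from the repo root into docs/agent-reports/."""
    dest_dir = repo_root / "docs" / "agent-reports"
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for name in names:
        dest = dest_dir / name
        if dest.exists():
            # Refuse to overwrite, so no report is ever lost.
            raise FileExistsError(f"{dest} already exists")
        shutil.move(str(repo_root / name), str(dest))
        moved.append(dest)
    return moved
```

For example, move_reports(Path("/home/erik/workgraph")) would relocate the three files and return their new paths. A missing source file raises FileNotFoundError instead of being skipped silently.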

Validation

  • All 3 reports addressed in the deliverable doc (real bug / misunderstanding / known limitation verdicts)
  • Each real bug has a proposed fix scope + estimated effort tier
  • Follow-up task list is actionable (could be turned into wg add calls without re-triaging)
  • 3 source files moved to docs/agent-reports/, not deleted
  • Deliverable file exists at docs/triage-wg-nex-small-model-reports-2026-04-27.md
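
The file-level items in this checklist (the last two bullets) can be checked mechanically. A minimal sketch, assuming a hypothetical helper named check_task_outputs. The content-level items (verdicts, fix scopes, actionability) still need a human or evaluator read.

```python
from pathlib import Path

DELIVERABLE = "docs/triage-wg-nex-small-model-reports-2026-04-27.md"
REPORT_NAMES = [
    "tool_call_processing_bug_report.md",
    "ui_freeze_bug_report.md",
    "ecmwf_analysis_limitation.md",
]


def check_task_outputs(repo_root: Path) -> list[str]:
    """Return a list of problems; an empty list means the file checks pass."""
    problems = []
    if not (repo_root / DELIVERABLE).is_file():
        problems.append(f"missing deliverable: {DELIVERABLE}")
    for name in REPORT_NAMES:
        # Each report must exist in its new location...
        if not (repo_root / "docs" / "agent-reports" / name).is_file():
            problems.append(f"not in docs/agent-reports/: {name}")
        # ...and must no longer sit at the repo root.
        if (repo_root / name).exists():
            problems.append(f"still at repo root: {name}")
    return problems
```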

Depends on

Required by

Log