wg-nex-resume-311

wg nex resume: 311-msg journal blows context, 400 loop, no backoff, throbber spins forever (autohaiku showstopper)

Metadata

Status: done
Assigned: agent-166
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Created: 2026-04-26T23:24:28.558416530+00:00
Started: 2026-04-26T23:28:01.014617545+00:00
Completed: 2026-04-27T00:09:12.923318853+00:00
Tags: eval-scheduled
Tokens: 16887975 in / 45358 out
Eval score: 0.84
└ blocking impact: 0.90
└ completeness: 0.85
└ coordination overhead: 0.80
└ correctness: 0.82
└ downstream usability: 0.78
└ efficiency: 0.88
└ intent fidelity: 0.86
└ style adherence: 0.92

Description

User repro in ~/autohaiku: try to resume coordinator-0 in TUI chat. Multiple stacked bugs visible in one frame:

Symptoms

  1. Top error: 'Failed to create coordinator: Error: Service daemon...' (truncated in TUI display — full message likely 'Service daemon not responding' or similar)
  2. nex resume log: '[native-agent] Resuming from journal: 311 messages, 0 stale annotations'
  3. User Ctrl-C: '[nex] Interrupted — dropping in-flight response.'
  4. Endpoint log (lambda01) receives a flood of malformed requests:
    POST /v1/chat/completions HTTP/1.1 400 Bad Request
    POST /v1/chat/completions HTTP/1.1 400 Bad Request
    POST /v1/chat/completions HTTP/1.1 400 Bad Request
    ... (multiple per second, indefinitely)
    
  5. TUI throbber spins infinitely — never times out, never reports the 400, never gives up.

Root causes (3 distinct, all critical)

A. Context overflow on resume: qwen3-coder (per model_registry: context_window=32768) cannot accept 311 messages. nex's resume path replays the entire journal verbatim into a single request → exceeds context → endpoint rejects with 400.

B. No backoff on 4xx: nex retries the same request immediately on failure, in a tight loop. There's no exponential backoff, no max-retries circuit breaker, no dead-letter handoff. A 400 means the request is malformed and retrying won't help, so nex should surface the error and stop, not retry.

C. Throbber doesn't reflect actual state: UI shows perpetual 'thinking' even when every request is failing. Throbber should clear / show error after N consecutive failures or M seconds without a successful response.

Fix per cause

A. Journal-replay context budget:

  • On resume, compute total token estimate of journal messages.
  • If total > model's context_window * (1 - safety_margin), e.g. 0.8 * context_window for a safety_margin of 0.2, apply one of:
    • Auto-summarize older messages (compaction-on-resume)
    • Truncate to last N messages that fit
    • Surface a 'journal too large to resume; truncate? compact? abandon?' prompt
  • DEFAULT: auto-truncate to fit with a clear log line: 'Journal had 311 messages (~250k tokens); truncated to last 47 messages (~28k tokens) to fit qwen3-coder context'.
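The default truncation path above could look roughly like the following. This is a minimal sketch, not nex's actual resume code: the `Message` type, `estimate_tokens`, and `truncate_to_budget` are hypothetical names, and the 4-characters-per-token estimate and the 0.2 safety margin are assumptions chosen for illustration.

```rust
/// Hypothetical journal entry; the real nex journal type is not shown in this ticket.
struct Message {
    content: String,
}

/// Crude estimate: ~4 characters per token plus a small per-message overhead.
/// This is an assumption for illustration, not a real tokenizer.
fn estimate_tokens(msg: &Message) -> usize {
    msg.content.chars().count() / 4 + 4
}

/// Keep the longest suffix of `journal` whose estimated size fits within
/// `context_window * (1 - safety_margin)`.
/// Returns the kept slice and its token estimate.
fn truncate_to_budget(
    journal: &[Message],
    context_window: usize,
    safety_margin: f64,
) -> (&[Message], usize) {
    let budget = (context_window as f64 * (1.0 - safety_margin)) as usize;
    let mut used = 0;
    let mut start = journal.len();
    // Walk backwards from the newest message; stop at the first one that does not fit.
    for (i, msg) in journal.iter().enumerate().rev() {
        let cost = estimate_tokens(msg);
        if used + cost > budget {
            break;
        }
        used += cost;
        start = i;
    }
    (&journal[start..], used)
}

fn main() {
    // Synthetic 311-message journal, 3200 characters per message.
    let journal: Vec<Message> = (0..311)
        .map(|_| Message { content: "x".repeat(3200) })
        .collect();
    let total: usize = journal.iter().map(estimate_tokens).sum();
    let (kept, used) = truncate_to_budget(&journal, 32_768, 0.2);
    eprintln!(
        "Journal had {} messages (~{} tokens); truncated to last {} messages (~{} tokens)",
        journal.len(),
        total,
        kept.len(),
        used
    );
}
```

Keeping a contiguous suffix preserves the most recent turns in order. A real implementation would probably also need to pin the system prompt and avoid splitting a tool call from its result, which this sketch does not attempt.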

B. Retry policy on HTTP 4xx:

  • 400/422 (malformed): 0 retries, surface error immediately, abort the turn.
  • 401/403 (auth): 0 retries, surface error, abort.
  • 429 (rate limit): exponential backoff that honors the Retry-After header, max 3 retries.
  • 5xx: exponential backoff, max 5 retries.
  • Document this policy in src/executor/native/client.rs or wherever HTTP errors are handled.
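One way to express the policy above is a single function that maps a status code to either "abort" or "wait this long, then retry". The sketch below is illustrative only: `retry_delay` is a hypothetical name, the 500 ms base and 30 s cap are assumed values, and treating every status not listed above as non-retryable is also an assumption. Only the delta-seconds form of Retry-After is handled here; the HTTP-date form is not.

```rust
use std::time::Duration;

/// Decide what to do after a failed request.
/// `retries_so_far` counts retries already made for this turn (0 after the first failure).
/// Returns `Some(delay)` to retry after `delay`, or `None` to surface the error and abort.
fn retry_delay(status: u16, retry_after_secs: Option<u64>, retries_so_far: u32) -> Option<Duration> {
    // Exponential backoff: 500ms, 1s, 2s, 4s, ... capped at 30s (assumed values).
    let backoff = Duration::from_millis(500u64 << retries_so_far.min(6))
        .min(Duration::from_secs(30));
    match status {
        // Rate limited: prefer the server's Retry-After hint, at most 3 retries.
        429 if retries_so_far < 3 => {
            Some(retry_after_secs.map(Duration::from_secs).unwrap_or(backoff))
        }
        // Server errors: exponential backoff, at most 5 retries.
        500..=599 if retries_so_far < 5 => Some(backoff),
        // 400/422, 401/403, exhausted budgets, and anything else: do not retry.
        _ => None,
    }
}

fn main() {
    for (status, hint, n) in [(400, None, 0), (429, Some(7), 0), (503, None, 2), (503, None, 5)] {
        println!("status={status} retries_so_far={n} -> {:?}", retry_delay(status, hint, n));
    }
}
```

Returning `None` for 400 on the very first failure is what stops the tight loop described in root cause B: the caller has nothing to wait for and must surface the error.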

C. Throbber state truthfulness:

  • Throbber reflects 'live request in flight'. If request fails / aborts / times out, throbber clears AND error is shown in chat pane.
  • After N consecutive failures or T seconds without response, throbber clears and an error toast surfaces: 'nex: 8 consecutive 400 Bad Request errors against lambda01; aborted. Check daemon log.'
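The throbber rules above amount to a small state machine driven by request events. The following is a sketch under assumptions, not the TUI's real code: `Throbber`, `RequestTracker`, and the method names are hypothetical, and the clock is passed in explicitly so the logic can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
enum Throbber {
    Idle,            // nothing in flight, throbber hidden
    InFlight,        // a live request is in flight, throbber spinning
    Aborted(String), // gave up; the string is the toast to show
}

struct RequestTracker {
    state: Throbber,
    consecutive_failures: u32,
    last_success: Instant,
    max_failures: u32,     // N in the ticket
    max_silence: Duration, // T in the ticket
}

impl RequestTracker {
    fn new(max_failures: u32, max_silence: Duration, now: Instant) -> Self {
        Self {
            state: Throbber::Idle,
            consecutive_failures: 0,
            last_success: now,
            max_failures,
            max_silence,
        }
    }

    fn on_request_start(&mut self) {
        // Once aborted, do not start spinning again.
        if !matches!(self.state, Throbber::Aborted(_)) {
            self.state = Throbber::InFlight;
        }
    }

    fn on_success(&mut self, now: Instant) {
        self.consecutive_failures = 0;
        self.last_success = now;
        self.state = Throbber::Idle;
    }

    /// Every failure clears the throbber; hitting either threshold also aborts.
    fn on_failure(&mut self, now: Instant, status: u16, endpoint: &str) {
        self.consecutive_failures += 1;
        let too_many = self.consecutive_failures >= self.max_failures;
        let too_quiet = now.duration_since(self.last_success) >= self.max_silence;
        self.state = if too_many || too_quiet {
            Throbber::Aborted(format!(
                "nex: {} consecutive {} errors against {}; aborted. Check daemon log.",
                self.consecutive_failures, status, endpoint
            ))
        } else {
            Throbber::Idle
        };
    }
}

fn main() {
    let t0 = Instant::now();
    let mut tracker = RequestTracker::new(3, Duration::from_secs(60), t0);
    tracker.on_request_start();
    tracker.on_success(t0);
    for _ in 0..3 {
        tracker.on_request_start();
        tracker.on_failure(t0, 400, "lambda01");
    }
    println!("{:?}", tracker.state);
}
```

The caller is still responsible for rendering the per-failure error in the chat pane; this sketch only tracks when the throbber should be visible and when to give up.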

Workaround (manual, until fix lands)

  1. Kill the wedged process: pkill -f 'wg nex.*autohaiku' && pkill -f 'native-agent.*coordinator-0'
  2. Endpoint flood stops.
  3. Either trim coordinator-0's chat journal manually (.wg/chat/coordinator-0/{inbox,outbox}.jsonl) to leave the last 20-30 messages, OR archive coordinator-0 and create a fresh chat.

Hard gate before claiming done

  • Repro the exact scenario: scratch dir, qwen3-coder + lambda01, drop a 311-message journal in (synthesize one), resume in TUI.
  • Assert: nex either auto-truncates the journal AND succeeds, OR surfaces a clear error AND stops retrying within 3 attempts AND throbber clears.
  • Endpoint MUST NOT receive more than 3 requests in the failure case.
  • Capture endpoint log + daemon log + chat session jsonl as evidence.

Validation

  • Failing tests first:
    • test_nex_resume_truncates_oversized_journal — synthetic 500-msg journal + small context → resume succeeds with truncation log
    • test_nex_400_no_retry_loop — stub endpoint returning 400 → nex sends exactly 1 request, surfaces error, throbber clears
    • test_nex_429_backoff — stub endpoint returning 429 → exponential backoff observed, max 3 retries
  • Implementation makes tests pass
  • cargo build + cargo test pass with no regressions
  • HARD GATE manual smoke as above
  • Coordinate with deprecation-warnings-on (handler stdout hygiene) — same handler code area
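The request-count assertions in test_nex_400_no_retry_loop and test_nex_429_backoff can be prototyped without a network by stubbing the endpoint as a closure. This is a sketch only: `send_with_policy` and `max_retries` are hypothetical stand-ins for whatever the real client exposes, and the backoff sleep is omitted so the example runs instantly.

```rust
/// Retry budget per status, following the policy in this ticket.
/// Treating unlisted statuses as non-retryable is an assumption.
fn max_retries(status: u16) -> u32 {
    match status {
        429 => 3,
        500..=599 => 5,
        _ => 0,
    }
}

/// Call `send` until it succeeds or the retry budget for the failing status is spent.
/// `send` returns the response body on success or the HTTP status on failure.
fn send_with_policy<F>(mut send: F) -> Result<String, u16>
where
    F: FnMut() -> Result<String, u16>,
{
    let mut retries = 0;
    loop {
        match send() {
            Ok(body) => return Ok(body),
            Err(status) => {
                if retries >= max_retries(status) {
                    return Err(status);
                }
                retries += 1;
                // A real client would sleep with exponential backoff here.
            }
        }
    }
}

fn main() {
    // Stub endpoint that always returns 400: exactly one request, no retries.
    let mut requests = 0;
    let result = send_with_policy(|| {
        requests += 1;
        Err(400)
    });
    assert_eq!(result, Err(400));
    assert_eq!(requests, 1);

    // Stub endpoint that always returns 429: one initial request plus 3 retries.
    let mut requests = 0;
    let result = send_with_policy(|| {
        requests += 1;
        Err(429)
    });
    assert_eq!(result, Err(429));
    assert_eq!(requests, 4);
    println!("retry policy checks passed");
}
```

In the repository these checks would more naturally live in `#[test]` functions under the names listed above, with a real stub HTTP server in place of the closure and an additional assertion that the throbber clears.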

Depends on

Required by

Log