write-feedback-document

Write feedback document on failure modes

Metadata

Status: done
Assigned: agent-629
Agent identity: 3577bc75d6ed4f1947509aa5c086c91ce7c997c7806dab6bf6affac647452647
Created: 2026-04-01T19:04:00.744883601+00:00
Started: 2026-05-01T21:09:35.003637911+00:00
Completed: 2026-05-01T21:10:34.984218213+00:00
Tags: feedback, meta, eval-scheduled
Tokens: 92323 in / 3601 out
Eval score: 0.88
└ blocking impact: 0.80
└ completeness: 0.95
└ coordination overhead: 0.70
└ correctness: 0.95
└ downstream usability: 0.90
└ efficiency: 0.70
└ intent fidelity: 0.60
└ style adherence: 0.85

Description

Goal

Write a markdown document describing the failure modes we experienced during the copy-number-aware enrichment analysis subgraph. This is feedback for the workgraph system developers.

Failure modes to document

1. Task explosion / over-decomposition

The implement-copy-number task was supposed to implement 2-3 enrichment methods. Instead, the agent decomposed it into ~150+ subtasks covering theoretical statistics (mathematical formulations, type I error validation, null distribution validation, edge case analysis, parameter constraint validation, etc.). This turned a practical implementation task into an academic research program.

Impact: Graph went from ~85 tasks to 286+ tasks. Most subtasks were unnecessary for the actual goal.

Root cause hypothesis: The agent used an 'autopoietic' decomposition pattern that recursively creates subtasks without a bound on depth or breadth. No guardrail prevented a single task from spawning dozens of children.

Suggested fix: Consider max subtask limits per task, or requiring coordinator approval for decompositions beyond N subtasks.
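One way the suggested guardrail could look, as a minimal sketch (the threshold, function, and return values are illustrative, not actual workgraph APIs):

```python
# Illustrative threshold; a real system would make this configurable per graph.
MAX_SUBTASKS = 10

def check_decomposition(parent_task, proposed_subtasks, approved_by_coordinator=False):
    """Accept small decompositions outright; escalate large ones for
    coordinator approval instead of spawning them automatically."""
    if len(proposed_subtasks) <= MAX_SUBTASKS or approved_by_coordinator:
        return "accept"
    return "escalate"
```

With this check in place, a 150-subtask decomposition like the one above would have returned "escalate" and paused for coordinator review rather than inflating the graph.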

2. Claude CLI crash cascade (~16:02)

Around 16:02, a wave of Claude CLI failures hit simultaneously:

  • All .flip-* tasks failed: 'FLIP inference LLM call failed — Claude CLI call failed (exit Some(1))'
  • All .evaluate-* tasks failed: 'Evaluation LLM call failed — Claude CLI call failed (exit Some(1))'
  • Multiple agent tasks failed with 'Agent exited with code 1'

This was likely a transient API rate limit or outage, but the cascade was disproportionate — every eval/FLIP task that was in-flight at that moment failed permanently rather than retrying.

Impact: ~25 tasks failed simultaneously. Required manual retry of each one.

Root cause hypothesis: The eval/FLIP tasks don't have built-in retry logic for transient API failures. A single API hiccup takes out every in-flight evaluation.

Suggested fix: Add automatic retry with backoff for Claude CLI failures in eval/FLIP tasks (e.g., retry 3x with exponential backoff before marking as failed).
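A sketch of what that retry wrapper could look like (assuming the CLI call surfaces transient failures as exceptions; names are illustrative):

```python
import time

def call_with_retry(call, max_attempts=3, base_delay=1.0):
    """Retry a flaky CLI/API call with exponential backoff (1s, 2s, 4s, ...)
    before letting the failure propagate and mark the task as failed."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RuntimeError:
            if attempt == max_attempts - 1:
                raise  # exhausted retries: fail the task for real
            time.sleep(base_delay * 2 ** attempt)
```

Wrapping each eval/FLIP invocation this way would let a one-minute API hiccup pass through as a delay instead of ~25 permanent failures.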

3. Eval tasks racing ahead of parent tasks

After mass retry, several .evaluate-* tasks failed with: 'Task X has status Open — must be done or failed to evaluate'. The eval tasks were scheduled before their parent work tasks had completed.

Impact: More spurious failures requiring manual cleanup.

Root cause hypothesis: The retry of a parent task resets it to 'open', but the already-queued eval task doesn't get re-blocked. It runs, finds the parent isn't done, and fails.

Suggested fix: Eval tasks should automatically re-block themselves if their parent task is not in a terminal state (done/failed).
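The scheduling check could be as simple as the following sketch (state names follow the error message above; the function itself is hypothetical):

```python
# Terminal states in which a parent task is safe to evaluate.
TERMINAL = {"done", "failed"}

def schedule_eval(parent_status):
    """Run the eval only when the parent is terminal; otherwise re-block
    the eval task instead of letting it run and fail."""
    return "run" if parent_status in TERMINAL else "re-block"
```

Crucially, a retry that resets the parent to "open" would then push the queued eval back to blocked rather than producing a spurious failure.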

4. Graph context scope crash loop (integration-testing-with)

The task integration-testing-with had context_scope: graph. With 300+ tasks in the graph, loading the full context caused every spawned agent to crash within ~60 seconds. The coordinator kept respawning agents (20+ times over 30 minutes), each one dying immediately.

Impact: Wasted compute (20+ agent spawns) and blocked downstream tasks for 30+ minutes until manually abandoned.

Root cause hypothesis: context_scope: graph loads ALL task descriptions/logs into the agent's context. At 300+ tasks, this exceeded context limits or token budgets, causing immediate OOM/crash.

Suggested fix:

  • Add a circuit breaker: if an agent crashes N times on the same task, pause the task and alert.
  • Cap graph context loading (e.g., max 50 most relevant tasks, or summarize instead of full dump).
  • Don't let agents set context_scope: graph on tasks in large graphs without a warning.
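The circuit-breaker idea from the first bullet could be sketched like this (class and threshold are illustrative, not an existing workgraph component):

```python
class CrashCircuitBreaker:
    """Pause a task after N consecutive agent crashes instead of
    respawning indefinitely."""

    def __init__(self, max_crashes=3):
        self.max_crashes = max_crashes
        self.counts = {}

    def record_crash(self, task_id):
        self.counts[task_id] = self.counts.get(task_id, 0) + 1
        if self.counts[task_id] >= self.max_crashes:
            return "pause-and-alert"
        return "respawn"
```

Under this policy the integration-testing-with task would have been paused after its third crash instead of burning 20+ agent spawns over 30 minutes.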

5. Verification system executing human-readable text as shell commands

Task type-i-error was blocked by a verification bug: the verify criteria 'Type I error rates within 1% of nominal α levels; simulation results documented; false positive rate validation complete' were executed as a shell command instead of being evaluated as human-readable acceptance criteria.

Impact: Task appeared failed despite all work being completed (commit 63d8c5b).

Root cause hypothesis: The verification system doesn't distinguish between machine-checkable commands and human-readable criteria.

Suggested fix: Parse verify criteria — if it looks like a shell command (starts with a known binary, contains pipes, etc.), execute it. Otherwise, use LLM evaluation against the criteria text.
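A rough version of that dispatch heuristic, as a sketch (the exact token list and fallback behavior are assumptions, not a spec):

```python
import shutil

def looks_like_shell_command(criteria: str) -> bool:
    """Heuristic: treat verify criteria as a shell command only if it
    contains obvious shell syntax or its first token resolves to a
    binary on PATH; otherwise fall back to LLM evaluation."""
    tokens = criteria.split()
    if not tokens:
        return False
    if any(sym in criteria for sym in ("|", "&&", ">")):
        return True
    return shutil.which(tokens[0]) is not None
```

Prose like 'Type I error rates within 1% of nominal α levels' fails both checks and would be routed to LLM evaluation rather than a shell.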

Timeline

  • 14:46 — User requests copy-number-aware enrichment methods
  • ~15:00 — Research task completes, implementation task begins
  • ~15:10-15:35 — Implementation agent decomposes into 150+ subtasks (Failure #1)
  • 16:02 — Claude CLI crash cascade takes out ~25 tasks (Failure #2)
  • 16:02-16:03 — Eval tasks race ahead of parent tasks (Failure #3)
  • 18:10 — User notices failures, coordinator retries
  • 18:12 — More retries needed
  • 18:30-19:00 — integration-testing-with crash loops 20+ times (Failure #4)
  • 19:01 — Coordinator abandons crash-looping task, cleans up graph

Overall assessment

The core work completed successfully — the methodology research, recommendations, and synthesis all finished. The failures were in the validation/testing tail and in the scaffolding system (eval, FLIP, verify). The graph self-healed to some degree but required significant manual intervention (~15 minutes of coordinator cleanup).

The biggest systemic risk is Failure #4 (crash loop without circuit breaker) — it wastes resources silently and would go unnoticed without human monitoring.

Depends on

Required by

Messages (1)

  1. #1 · user · 2026-04-01T19:04:45.024493670+00:00 · sent
    ADDITIONAL POINT from user: Add a section about graph context management. The graph grew to 337 tasks and context_scope: graph became unusable. There should be a way to compact/summarize completed subgraphs so that: (1) agents using graph scope get a summarized view, not raw dump of 300+ tasks, (2) completed task trees can be rolled up into summary nodes (e.g. 'copy-number methodology research: completed, produced X, Y, Z artifacts'), (3) there's a token budget for graph context that triggers automatic summarization when exceeded. The compaction system exists but only manages coordinator context, not agent task context. This is a feature request: hierarchical graph summarization for agent context.

Log