Metadata
| Status | done |
|---|---|
| Assigned | agent-2289 |
| Agent identity | f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e |
| Model | codex:gpt-5.5 |
| Created | 2026-05-04T15:16:14.781930817+00:00 |
| Started | 2026-05-04T15:17:18.707217097+00:00 |
| Completed | 2026-05-04T15:30:35.812525644+00:00 |
| Tags | fix, docs, readme, evaluation, eval-scheduled |
| Eval score | 0.90 |
| └ blocking impact | 0.90 |
| └ completeness | 1.00 |
| └ constraint fidelity | 0.85 |
| └ coordination overhead | 0.80 |
| └ correctness | 0.95 |
| └ downstream usability | 0.75 |
| └ efficiency | 0.85 |
| └ intent fidelity | 0.82 |
| └ style adherence | 0.90 |
Description
README.md contains a 'Terminal-Bench evaluation' section presenting null-result data from an early prototype run. The run was buggy and not representative. STRAIGHT REMOVE — not demote, not archive-with-preface. Just remove the reference from the README entirely.
User direct quotes 2026-05-04:
- 'we need a task to remove the tb references they are very stale and not appropriate in the readme!'
- 'we should straight remove the reference to terminalbench!'
- 'it's not right'
- 'it was a messed up run'
What to change
README.md
- DELETE the entire Terminal-Bench evaluation section (the table with 52.3%/51.4%/49.0%, the 'no statistically significant difference' framing, the easy/medium/hard breakdown, and any link to terminal-bench/BLOG.md from README)
- DELETE any other reference to Terminal-Bench in the README
- Do NOT replace with a half-archival note — straight remove
terminal-bench/ directory
- Leave it untouched on disk (git history preserves the work)
- No need for archival prefaces or 'this is superseded' notices in the directory itself — it's just not promoted from the README anymore
Why straight remove
The run was 'messed up' (user's words). The data is unreliable. Keeping it in the README — even framed as 'historical' or 'superseded' — still presents stale buggy-prototype data as the project's quantitative section. A skeptical reader's takeaway is the same regardless of framing: 'they have a null result, workgraph doesn't help.' Straight removal eliminates that misread entirely.
If a real evaluation is run later, that goes into the README. Until then: no eval section. Simpler than complicated archival framing.
Validation
- grep README.md for 'terminal' / 'tb' / 'bench' / '52.3' / '51.4' / '49.0' returns no Terminal-Bench matches: every one has been removed
- No new 'archival' / 'superseded' framing added — just removed
- terminal-bench/ directory contents untouched (preserved on disk for reproducibility, just not surfaced from README)
- cargo build and cargo test pass (a defensive check, since the change is docs-only)
- cargo install --path . was run before claiming done
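
The grep check in the first validation bullet can be sketched as a small script. This is a minimal illustration, not the tooling the agent actually ran. The function name `find_stale_mentions` and the `STALE_PATTERNS` table are hypothetical. Matching 'tb' only as a whole word and matching the three word patterns case-insensitively are assumptions made here to cut down on false positives. Hits on broad strings such as 'terminal' may still be legitimate README text and would need manual review.

```python
import re

# Strings taken from the validation checklist.
# Assumption: 'tb' is matched as a whole word so it does not fire inside
# unrelated words such as "outbound".
STALE_PATTERNS = {
    "terminal": re.compile(r"terminal", re.IGNORECASE),
    "tb": re.compile(r"\btb\b", re.IGNORECASE),
    "bench": re.compile(r"bench", re.IGNORECASE),
    "52.3": re.compile(r"52\.3"),
    "51.4": re.compile(r"51\.4"),
    "49.0": re.compile(r"49\.0"),
}


def find_stale_mentions(text):
    """Return (line_number, label, stripped_line) for each checklist hit."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in STALE_PATTERNS.items():
            if pattern.search(line):
                hits.append((line_number, label, line.strip()))
    return hits


# Example on an in-memory snippet; a real check would read README.md instead.
sample = "# workgraph\nTerminal-Bench evaluation\n| pass | 52.3% |\n"
for line_number, label, line in find_stale_mentions(sample):
    print(f"README.md:{line_number}: [{label}] {line}")
```

An empty result from `find_stale_mentions` corresponds to the "all matches removed" condition. The remaining bullets (empty terminal-bench/ diff, cargo build, cargo test, cargo install) are separate checks and are not covered by this sketch.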
Per skip-back-compat-ceremony memory
Hard removal is the standing default. No deprecation framing. Just delete the section.
Depends on
Required by
Log
- 2026-05-04T15:16:14.745600296+00:00 Task paused
- 2026-05-04T15:16:52.378176078+00:00 Task published
- 2026-05-04T15:17:15.445159848+00:00 Lightweight assignment: agent=Careful Programmer (f5143935), exec_mode=full, context_scope=task, reason=Careful Programmer is the best fit for straightforward documentation cleanup requiring thorough verification (grep, build, test, install validation).
- 2026-05-04T15:17:18.707222677+00:00 Spawned by coordinator --executor codex --model gpt-5.5
- 2026-05-04T15:17:33.338067537+00:00 Starting README cleanup; no unread messages at session start
- 2026-05-04T15:29:12.697421563+00:00 Validation: README stale Terminal-Bench strings/links/percentages removed; no archival/superseded framing added; terminal-bench directory diff is empty
- 2026-05-04T15:29:48.471160542+00:00 Validation: cargo build passed; cargo test failed in pre-existing integration_cycle_detection wg init/no-model tests unrelated to README docs; cargo install --path . completed; committed 864549f21
- 2026-05-04T15:30:35.812533269+00:00 Task pending eval (agent reported done; awaiting `.evaluate-*` to score)
- 2026-05-04T15:33:00.637798830+00:00 PendingEval → Done (evaluator passed; downstream unblocks)