fix-readme-s

Fix: README's Terminal-Bench section is stale (early-prototype with bugs); demote to archival note

Metadata

Status: done
Assigned: agent-2289
Agent identity: f51439356729d112a6c404803d88015d5b44832c6c584c62b96732b63c2b0c7e
Model: codex:gpt-5.5
Created: 2026-05-04T15:16:14.781930817+00:00
Started: 2026-05-04T15:17:18.707217097+00:00
Completed: 2026-05-04T15:30:35.812525644+00:00
Tags: fix, docs, readme, evaluation, eval-scheduled
Eval score: 0.90
└ blocking impact: 0.90
└ completeness: 1.00
└ constraint fidelity: 0.85
└ coordination overhead: 0.80
└ correctness: 0.95
└ downstream usability: 0.75
└ efficiency: 0.85
└ intent fidelity: 0.82
└ style adherence: 0.90

Description

README.md contains a 'Terminal-Bench evaluation' section presenting null-result data from an early prototype run. The run was buggy and not representative. STRAIGHT REMOVE — not demote, not archive-with-preface. Just remove the reference from the README entirely.

User direct quotes 2026-05-04:

  • 'we need a task to remove the tb references they are very stale and not appropriate in the readme!'
  • 'we should straight remove the reference to terminalbench!'
  • 'it's not right'
  • 'it was a messed up run'

What to change

README.md

  • DELETE the entire Terminal-Bench evaluation section (the table with 52.3%/51.4%/49.0%, the 'no statistically significant difference' framing, the easy/medium/hard breakdown, and any link to terminal-bench/BLOG.md from README)
  • DELETE any other reference to Terminal-Bench in the README
  • Do NOT replace with a half-archival note — straight remove

terminal-bench/ directory

  • Leave it untouched on disk (git history preserves the work)
  • No need for archival prefaces or 'this is superseded' notices in the directory itself — it's just not promoted from the README anymore

Why straight remove

The run was 'messed up' (user's words). The data is unreliable. Keeping it in the README — even framed as 'historical' or 'superseded' — still presents stale buggy-prototype data as the project's quantitative section. A skeptical reader's takeaway is the same regardless of framing: 'they have a null result, workgraph doesn't help.' Straight removal eliminates that misread entirely.

If a real evaluation is run later, that goes into the README. Until then: no eval section. Simpler than complicated archival framing.

Validation

  • grep README.md for 'terminal' / 'tb' / 'bench' / '52.3' / '51.4' / '49.0' — all matches removed
  • No new 'archival' / 'superseded' framing added — just removed
  • terminal-bench/ directory contents untouched (preserved on disk for reproducibility, just not surfaced from README)
  • cargo build + cargo test pass (defensive — docs only)
  • cargo install --path . was run before claiming done
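The grep check in the first validation bullet could be scripted roughly as follows. This is a minimal sketch, not the project's actual tooling: `check_readme` is a hypothetical helper name, and matching 'tb' only as a whole word is an assumption made to avoid false hits inside unrelated words such as "outbound". The other patterns are taken from the list above as written, so a legitimate use of "terminal" or "bench" would also be flagged and would need a manual look.

```shell
#!/bin/sh
# Sketch: report leftover Terminal-Bench references in a README.
# check_readme is a hypothetical helper, not an existing project script.
# Exit status: 0 = no matches, 1 = matches found, 2 = file missing.
check_readme() {
    file="$1"
    [ -f "$file" ] || return 2

    # -n prints line numbers, -i ignores case, -E enables alternation.
    # Dots are escaped so "52.3" matches only the literal percentage.
    if grep -n -i -E 'terminal|bench|52\.3|51\.4|49\.0' "$file"; then
        return 1
    fi

    # Assumption: "tb" counts only as a standalone word (-w).
    if grep -n -i -w 'tb' "$file"; then
        return 1
    fi

    return 0
}
```

Run from the repository root, `check_readme README.md && echo "README is clean"` would print the confirmation only when no pattern matches; any offending lines are printed with their line numbers otherwise.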

Per skip-back-compat-ceremony memory

Hard removal is the standing default. No deprecation framing. Just delete the section.

Depends on

Required by

Log