A benchmark that measures how well AI coding agents can rewrite a real production codebase the way a senior engineer would
Benchmark version 1.0—June 2026
Proof created more than 4,000 documents after launch, but repeated crashes left users unable to access important work. The live collaboration system needed more than another patch: Its underlying architecture had to be reconsidered. Human senior engineers rewrote the same frozen codebase to produce the 89- and 96-point results used to calibrate the rubric; every model now starts from that codebase and receives the same prompt.†
† Every model received the same opening prompt and frozen codebase, but follow-up instructions varied by model as part of the experimental conditions. See Methodology for the full prompt and follow-ups.
Each run receives up to 100 points across six dimensions. The weights reflect what matters most in a production rewrite: genuine simplification and proof that the new system preserves correctness.
| Dimension | Claude Fable 5§ | Claude Opus 4.8 | GPT-5.5‡ | Claude Opus 4.7¶ |
|---|---|---|---|---|
| Invariant clarity and plan quality (20 pts): Defines one load-bearing authority invariant and turns it into a focused, reviewable implementation plan | 17 / 20 | 15 / 20 | 16.5 / 20 | 12.5 / 20 |
| Simplification outcome (25 pts): Measures whether the completed rewrite removes old authority machinery and leaves a genuinely smaller system | 24 / 25 | 15 / 25 | 20 / 25 | 2 / 25 |
| End-to-end wiring (10 pts): Checks that REST, agent, WebSocket, and mark writes all pass through the same authority path | 10 / 10 | 9 / 10 | 9 / 10 | 5 / 10 |
| No stubs or parallel authority (17 pts): Penalizes surviving old write paths, dual-authority shims, unfinished stubs, and unnecessary new abstractions | 13 / 17 | 8 / 17 | 8 / 17 | 2 / 17 |
| Correctness proof and incident preservation (25 pts): Requires feature-parity evidence, preserved incident regressions, a green final state, and honest documentation of gaps | 24 / 25 | 14 / 25 | 7 / 25 | 10 / 25 |
| Runtime durability and lifecycle (3 pts): Checks that successful writes remain durable through restart, cleanup, and document lifecycle transitions | 3 / 3 | 2 / 3 | 2 / 3 | 2 / 3 |
| Total | 91 / 100 | 63 / 100 | 62.5 / 100 | 33.5 / 100 |
§ Claude Fable 5 was run at max effort; the other model runs used extra-high effort.
‡ GPT-5.5 executed a plan written by Claude Opus 4.7 rather than authoring its own architecture proposal. See Methodology for the full prompt and follow-ups.
¶ Scored separately by Opus and Codex rubric graders; 33.5 is their average.
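As a quick arithmetic check, the per-dimension scores in the table can be summed and compared with the reported totals. The sketch below copies the numbers from the table; the names `SCORES`, `REPORTED_TOTALS`, and `totals` are illustrative and are not part of any benchmark tooling.

```python
# Dimension scores copied from the table, in rubric order:
# plan quality, simplification, wiring, no stubs, correctness proof, durability.
SCORES = {
    "Claude Fable 5": [17, 24, 10, 13, 24, 3],
    "Claude Opus 4.8": [15, 15, 9, 8, 14, 2],
    "GPT-5.5": [16.5, 20, 9, 8, 7, 2],
    "Claude Opus 4.7": [12.5, 2, 5, 2, 10, 2],
}

# Totals as reported in the last row of the table.
REPORTED_TOTALS = {
    "Claude Fable 5": 91,
    "Claude Opus 4.8": 63,
    "GPT-5.5": 62.5,
    "Claude Opus 4.7": 33.5,
}


def totals(scores):
    """Sum the dimension scores for each model."""
    return {model: sum(points) for model, points in scores.items()}


computed = totals(SCORES)
for model, total in computed.items():
    # Each computed total should equal the reported one.
    assert total == REPORTED_TOTALS[model], model
```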
The benchmark scores a model’s ability to redesign and implement a production system. The rubric weighs the proposed architecture, production-path code, legacy machinery removed, and evidence that the replacement preserves correctness. A passing self-authored test suite is supporting evidence; it does not determine the total.
Version 1.0 uses a frozen Proof source snapshot from March 17, 2026. The workspace includes production collaboration code and local incident context from an outage where public routes timed out while collaboration work kept running inside the process. One frozen snapshot keeps later fixes and repository drift from changing the task between runs.
That snapshot was reinitialized as a local Git repository with no remote. Agents could use only files and Git history inside the workspace. GitHub, web search, sibling checkouts, archived sessions, shell history, production logs, deployment systems, CLI fetches, and newly added remotes were prohibited so models could not read later human fixes or information unavailable at the historical starting point.
Every model received the same opening prompt and the same frozen codebase. After the opening response, runs diverged: Each model got different continuation, implementation, and review instructions. Those instructions appear below without normalization. They are part of the experimental conditions and a known source of variance.
The benchmark compares work that starts from the same state and is scored against the same rubric, but under non-uniform intervention. GPT-5.5 executed a plan written by Claude Opus 4.7. Several runs were pushed to continue after incomplete claims.
The code in this repo is vibecoded slop and it just keeps going down, and there’s tons and tons of unrelated issues that are cropping up where it goes down or documents get duplicated, and I’m just tearing my hair out on it. I have a feeling that it’s just vibecoded slop. If we started from the beginning, the codebase, especially around the live document collaboration, we would structure it way differently.
So if we wanted to do a clean first-principle structural rewrite where we were not thinking about, okay, what are the implementation services that we keep consistent? How do we do a clean migration? We just started from the beginning as a clean concept. What would we do? How would we structure it? What are the invariants that we would hold to be true throughout the codebase? Make a plan for that.
The opening prompt asks for a plan. The rubric scores the resulting architecture, working implementation, and verification evidence. Follow-up instructions varied by model and are recorded here as part of each run’s conditions.
Continue past the plan: build the side-by-side rewrite, run the verification suites, then audit every named invariant against the code and tests.
Carry the rewrite through production cutover, remove the legacy authority paths, and verify the real live-editor path end to end.
Execute the Claude Opus 4.7 plan. After two challenges to incomplete claims, re-check what was actually wired and continue the cutover.
Continue after operator babysitting, then take a fresh-review pass and carry the implementation through the real application runtime.
The benchmark centers on Proof’s live editing subsystem. Users co-edit documents in real time; every change must converge on one authoritative document state. During the motivating outage, several parallel sources of truth—database rows, projection tables, repair and reconcile paths, and the live document—could all mutate what counted as the document. That produced crashes, duplication, and data inconsistency.
The editor surface runs on ProseMirror. Real-time collaboration syncs through Yjs over WebSocket. A senior rewrite picks one load-bearing authority invariant—for example, the live ProseMirror fragment and Yjs document become the runtime content authority, with markdown and projections demoted to derived views—then deletes parallel machinery and routes every write surface through that authority while keeping incident knowledge intact. The rubric scores that authority rewrite. Incremental patches and self-authored test passes alone are insufficient.
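The shape of a single-authority design can be sketched in a few lines. This is a toy illustration and is not Proof's code: `DocumentAuthority`, `apply_write`, and the four adapter functions are hypothetical names, a plain list of strings stands in for the live Yjs document, and no ProseMirror or Yjs API is used.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentAuthority:
    """Hypothetical single owner of document state (stand-in for the live doc)."""

    blocks: list = field(default_factory=list)
    log: list = field(default_factory=list)

    def apply_write(self, source: str, text: str) -> None:
        # The only method that mutates document state.
        self.blocks.append(text)
        self.log.append((source, text))

    def to_markdown(self) -> str:
        # Derived view: recomputed from the authority, never written directly.
        return "\n\n".join(self.blocks)


# Each write surface is a thin adapter and holds no copy of the document.
def rest_write(doc: DocumentAuthority, text: str) -> None:
    doc.apply_write("rest", text)


def agent_write(doc: DocumentAuthority, text: str) -> None:
    doc.apply_write("agent", text)


def websocket_write(doc: DocumentAuthority, text: str) -> None:
    doc.apply_write("websocket", text)


def mark_write(doc: DocumentAuthority, text: str) -> None:
    doc.apply_write("mark", text)


doc = DocumentAuthority()
rest_write(doc, "Title")
agent_write(doc, "Agent paragraph")
websocket_write(doc, "Live edit")
mark_write(doc, "Comment mark")
```

In this sketch the markdown view has no setter, so there is no second place where "the document" could be changed. That is the property the rubric's wiring and parallel-authority dimensions look for, here reduced to its simplest form.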
Each official result links to a benchmark version, model configuration, and run ID. Reviewers read the architecture plan and the implementation diff; check which authority paths were deleted versus left reachable, and how REST routes, agent edits, WebSocket sessions, and mark updates reach the new authority; and review the test output and any gaps the model documented.
Test counts are supporting evidence. A run can report many passing tests and still lose points if those tests skip the production path, preserve parallel authorities, leave the old architecture reachable, or fail to show the benchmark invariants under concurrency, restart, and failure conditions.
All leaderboard runs use the same 100-point rubric: invariant clarity and plan quality (20); simplification outcome (25); end-to-end wiring (10); removal of stubs and parallel authority (17); correctness proof and incident preservation (25); runtime durability and lifecycle behavior (3). Dimensions and point ceilings stay fixed so reviewers cannot redefine success to fit one implementation.
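One way to picture fixed ceilings is a lookup table that every score is validated against before it is summed. The sketch below is an assumption about how such a check could look, not the graders' actual tooling; `RUBRIC` and `score_run` are hypothetical names, and the ceilings are the ones listed above.

```python
# Point ceilings for the six dimensions, as listed in the rubric.
RUBRIC = {
    "invariant clarity and plan quality": 20,
    "simplification outcome": 25,
    "end-to-end wiring": 10,
    "no stubs or parallel authority": 17,
    "correctness proof and incident preservation": 25,
    "runtime durability and lifecycle": 3,
}

# The ceilings add up to the 100-point scale.
assert sum(RUBRIC.values()) == 100


def score_run(points):
    """Sum dimension scores after checking each against its fixed ceiling."""
    if set(points) != set(RUBRIC):
        raise ValueError("scores must cover exactly the six rubric dimensions")
    for dimension, value in points.items():
        ceiling = RUBRIC[dimension]
        if not 0 <= value <= ceiling:
            raise ValueError(f"{dimension}: {value} is outside 0..{ceiling}")
    return sum(points.values())


# Usage: a run that earns every available point scores exactly 100.
perfect = score_run(dict(RUBRIC))
```

Because the ceilings live in one table, a run cannot gain points by inflating a dimension beyond its weight; an out-of-range score raises an error instead of being added.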
Reviewers reward a clear authority invariant in a reviewable plan; a smaller system with old write paths deleted; every production entry point wired through that authority; proof that incident regressions still pass; and durability across restart and lifecycle transitions. Models often name the right invariant early and still lose points when they drop incident scar tissue during the rewrite.
Official totals come from rubric adjudication. Test pass rates do not set the total. Human calibrations at 96 and 89 anchor the scale: One reference rewrote inside existing production constraints with strong incident replay; another rebuilt a smaller kernel from scratch with a clearer invariant but thinner brownfield coverage. Mid-90s scores mark the senior baseline for this task; high 80s mean credible architecture with gaps in wiring, authority shrinkage, or incident preservation. Where multiple rubric graders are documented, aggregation is preserved—Claude Opus 4.7’s 33.5 is the average of separate Opus (29) and Codex (38) grader scores.
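The two-grader aggregation for Claude Opus 4.7 is a plain arithmetic mean. A minimal worked example, using the two grader scores stated above (the variable names are illustrative):

```python
from statistics import mean

# Separate rubric-grader scores for the Claude Opus 4.7 run.
grader_scores = {"Opus": 29, "Codex": 38}

# The reported total is the average of the two grader scores.
aggregate = mean(list(grader_scores.values()))
assert aggregate == 33.5
```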
Results compare only within the selected benchmark version. One ungraded frame-shift proposal is excluded from this version 1.0 leaderboard. Run IDs stay attached to scores so a result cannot silently move to a later implementation by the same model.
The leaderboard shows one selected official run per model, not an average across repeated trials. It does not estimate run-to-run variance or provide confidence intervals. Model families, reasoning settings, tool surfaces, and follow-up instructions also differ. Those differences are visible on the page but were not experimentally removed.
Official totals are canonical. The 96-point human reference uses its measured category scores. The model dimension table and radar charts are rubric-based reconstructions from the checked-in evidence, constrained to sum to each official total because the original model grader worksheets were not published. The widest gap between model runs and human references tends to sit in incident replay and authority surface shrinkage. Naming the right invariant on paper rarely closes it.
First public version of the Senior Engineer Benchmark results page, with Version 1.0 leaderboard, methodology, rubric breakdown, and score profiles