Senior Engineer Benchmark

Results

Benchmark version 1.0—June 2026

Entrant	Score
Human senior engineers (reference)	89 and 96
Claude Fable 5, max (new)	91
Claude Opus 4.8, extra-high	63
GPT-5.5, extra-high*	62.5
Claude Opus 4.7, extra-high	33.5

* Executed a plan written by Claude Opus 4.7.

Objective

First-principles rewrite of Proof’s live collaboration system

More than 4,000 documents were created in Proof after launch, but repeated crashes left users unable to access important work. The live collaboration system needed more than another patch: its underlying architecture had to be reconsidered. Human senior engineers rewrote the same frozen codebase to produce the 89- and 96-point results used to calibrate the rubric; every model starts from that codebase and receives the same opening prompt.^†

^† Every model received the same opening prompt and frozen codebase, but follow-up instructions varied by model as part of the experimental conditions. See Methodology for the full prompt and follow-ups.

Detailed scores

Each run receives up to 100 points across six dimensions. The weights reflect what matters most in a production rewrite: genuine simplification and proof that the new system preserves correctness.

Dimension	What it checks	Claude Fable 5^§	Claude Opus 4.8	GPT-5.5^‡	Claude Opus 4.7^¶
Invariant clarity and plan quality (20 pts)	Defines one load-bearing authority invariant and turns it into a focused, reviewable implementation plan	17 / 20	15 / 20	16.5 / 20	12.5 / 20
Simplification outcome (25 pts)	Measures whether the completed rewrite removes old authority machinery and leaves a genuinely smaller system	24 / 25	15 / 25	20 / 25	2 / 25
End-to-end wiring (10 pts)	Checks that REST, agent, WebSocket, and mark writes all pass through the same authority path	10 / 10	9 / 10	9 / 10	5 / 10
No stubs or parallel authority (17 pts)	Penalizes surviving old write paths, dual-authority shims, unfinished stubs, and unnecessary new abstractions	13 / 17	8 / 17	8 / 17	2 / 17
Correctness proof and incident preservation (25 pts)	Requires feature-parity evidence, preserved incident regressions, a green final state, and honest documentation of gaps	24 / 25	14 / 25	7 / 25	10 / 25
Runtime durability and lifecycle (3 pts)	Checks that successful writes remain durable through restart, cleanup, and document lifecycle transitions	3 / 3	2 / 3	2 / 3	2 / 3
Total (100 pts)		91 / 100	63 / 100	62.5 / 100	33.5 / 100

^§ Claude Fable 5 was run at max effort; the other model runs used extra-high.

^‡ GPT-5.5 executed a plan written by Claude Opus 4.7 rather than authoring its own architecture proposal. See Methodology for the full prompt and follow-ups.

^¶ Scored separately by Opus and Codex rubric graders; 33.5 is their average.

[Score profile radar charts, one panel each: Claude Fable 5, max; Claude Opus 4.8; GPT-5.5; Claude Opus 4.7; Human senior engineer (96 total)]

Methodology

The benchmark scores a model’s ability to redesign and implement a production system. The rubric weighs the proposed architecture, production-path code, legacy machinery removed, and evidence that the replacement preserves correctness. A passing self-authored test suite is supporting evidence; it does not determine the total.

Starting snapshot

Version 1.0 uses a frozen Proof source snapshot from March 17, 2026. The workspace includes production collaboration code and local incident context from an outage where public routes timed out while collaboration work kept running inside the process. One frozen snapshot keeps later fixes and repository drift from changing the task between runs.

That snapshot was reinitialized as a local Git repository with no remote. Agents could use only files and Git history inside the workspace. GitHub, web search, sibling checkouts, archived sessions, shell history, production logs, deployment systems, CLI fetches, and newly added remotes were prohibited so models could not read later human fixes or information unavailable at the historical starting point.
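The no-remote condition can be spot-checked mechanically. The sketch below is a minimal illustration, not the benchmark's actual tooling: it assumes a standard non-bare repository layout, and the helper names `configured_remotes` and `workspace_is_isolated` are hypothetical. It relies only on the fact that Git records each remote as a `[remote "name"]` section in `.git/config`; it does not cover the other prohibited sources, such as web search or sibling checkouts.

```python
import re
import tempfile
from pathlib import Path

# Git stores each configured remote as a section header of the form [remote "name"].
REMOTE_SECTION = re.compile(r'^\s*\[remote\s+"([^"]+)"\]', re.MULTILINE)


def configured_remotes(git_config_text: str) -> list[str]:
    """Return the names of all remotes declared in the text of a .git/config file."""
    return REMOTE_SECTION.findall(git_config_text)


def workspace_is_isolated(workspace: Path) -> bool:
    """True when the workspace's repository config declares no remotes.

    Assumption: a standard, non-bare layout with the config at workspace/.git/config.
    """
    config_path = workspace / ".git" / "config"
    return configured_remotes(config_path.read_text()) == []


# Usage with a throwaway directory standing in for the benchmark workspace.
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    git_dir.mkdir()
    config = git_dir / "config"

    # A freshly reinitialized repository: no remote sections.
    config.write_text("[core]\n\trepositoryformatversion = 0\n")
    isolated_before = workspace_is_isolated(Path(tmp))

    # The same repository after someone adds a remote (placeholder URL).
    config.write_text(
        "[core]\n\trepositoryformatversion = 0\n"
        '[remote "origin"]\n\turl = https://example.invalid/repo.git\n'
    )
    isolated_after = workspace_is_isolated(Path(tmp))
```

A check like this would catch a newly added remote, but only that one violation; the rest of the prohibition list still depends on reviewing what the agent actually did.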

Prompt and follow-ups

Every model received the same opening prompt and the same frozen codebase. After the opening response, the runs diverged: each model received different continuation, implementation, and review instructions. Those instructions appear below without normalization. They are part of the experimental conditions and a known source of variance.

As a result, the benchmark compares work that shares a starting state and a rubric but not a uniform level of operator intervention. GPT-5.5 executed a plan written by Claude Opus 4.7, and several runs were pushed to continue after incomplete claims.

Full prompt

The code in this repo is vibecoded slop and it just keeps going down, and there’s tons and tons of unrelated issues that are cropping up where it goes down or documents get duplicated, and I’m just tearing my hair out on it. I have a feeling that it’s just vibecoded slop. If we started from the beginning, the codebase, especially around the live document collaboration, we would structure it way differently.

So if we wanted to do a clean first-principle structural rewrite where we were not thinking about, okay, what are the implementation services that we keep consistent? How do we do a clean migration? We just started from the beginning as a clean concept. What would we do? How would we structure it? What are the invariants that we would hold to be true throughout the codebase? Make a plan for that.

The opening prompt asks for a plan. The rubric scores the resulting architecture, working implementation, and verification evidence. Follow-up instructions varied by model and are recorded here as part of each run’s conditions.

Claude Fable 5, max follow-up

Continue past the plan: build the side-by-side rewrite, run the verification suites, then audit every named invariant against the code and tests.

Claude Opus 4.8 follow-up

Carry the rewrite through production cutover, remove the legacy authority paths, and verify the real live-editor path end to end.

GPT-5.5 follow-up

Execute the Claude Opus 4.7 plan. After two challenges to incomplete claims, re-check what was actually wired and continue the cutover.

Claude Opus 4.7 follow-up

Continue after operator babysitting, then take a fresh-review pass and carry the implementation through the real application runtime.

The collaboration problem

The benchmark centers on Proof’s live editing subsystem. Users co-edit documents in real time; every change must converge on one authoritative document state. During the motivating outage, several parallel sources of truth—database rows, projection tables, repair and reconcile paths, and the live document—could all mutate what counted as the document. That produced crashes, duplication, and data inconsistency.

The editor surface runs on ProseMirror. Real-time collaboration syncs through Yjs over WebSocket. A senior rewrite picks one load-bearing authority invariant—for example, the live ProseMirror fragment and Yjs document become the runtime content authority, with markdown and projections demoted to derived views—then deletes parallel machinery and routes every write surface through that authority while keeping incident knowledge intact. The rubric scores that authority rewrite. Incremental patches and self-authored test passes alone are insufficient.
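The single-authority idea can be illustrated with a toy model. This is a sketch under stated assumptions, not Proof's code and not any reference rewrite: `DocumentAuthority` and the four adapter functions are hypothetical names, a plain Python list stands in for the live ProseMirror fragment and Yjs document, and no real Yjs or ProseMirror API is used. The point is only the shape: one mutation entry point, and markdown produced as a derived view that is never written back.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentAuthority:
    """Hypothetical single owner of document content (stand-in for the live document)."""

    blocks: list[str] = field(default_factory=list)
    version: int = 0
    log: list[tuple[int, str]] = field(default_factory=list)  # (version, write surface)

    def apply(self, source: str, index: int, text: str) -> int:
        """The only mutation entry point: insert a block and bump the version."""
        if not 0 <= index <= len(self.blocks):
            raise IndexError(f"index {index} out of range")
        self.blocks.insert(index, text)
        self.version += 1
        self.log.append((self.version, source))
        return self.version

    def to_markdown(self) -> str:
        """Derived view: recomputed from the authority on demand, never stored or edited."""
        return "\n\n".join(self.blocks)


# Each write surface is a thin adapter with no state of its own.
def rest_write(doc: DocumentAuthority, index: int, text: str) -> int:
    return doc.apply("rest", index, text)


def agent_write(doc: DocumentAuthority, index: int, text: str) -> int:
    return doc.apply("agent", index, text)


def websocket_write(doc: DocumentAuthority, index: int, text: str) -> int:
    return doc.apply("websocket", index, text)


def mark_write(doc: DocumentAuthority, index: int, text: str) -> int:
    return doc.apply("mark", index, text)


# Usage: four surfaces, one authority, one version counter.
doc = DocumentAuthority()
rest_write(doc, 0, "# Title")
agent_write(doc, 1, "Agent paragraph.")
websocket_write(doc, 1, "Live edit.")
mark_write(doc, 3, "mark: reviewed")
markdown_view = doc.to_markdown()
```

Because the adapters hold no content themselves, there is nothing for a repair or reconcile path to disagree with; a real system would replace the list with the collaborative document and add persistence, which this sketch omits.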

What each run includes

Each official result links to a benchmark version, model configuration, and run ID. Reviewers read the architecture plan; the implementation diff; which authority paths were deleted and which were left reachable; how REST routes, agent edits, WebSocket sessions, and mark updates reach the new authority; the test output; and any gaps the model documented.

Test counts are supporting evidence. A run can report many passing tests and still lose points if those tests skip the production path, preserve parallel authorities, leave the old architecture reachable, or fail to show the benchmark invariants under concurrency, restart, and failure conditions.

Scoring rubric

All leaderboard runs use the same 100-point rubric: invariant clarity and plan quality (20); simplification outcome (25); end-to-end wiring (10); removal of stubs and parallel authority (17); correctness proof and incident preservation (25); runtime durability and lifecycle behavior (3). Dimensions and point ceilings stay fixed so reviewers cannot redefine success to fit one implementation.
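The fixed ceilings make totals easy to check by hand or in code. The following is a small sketch, assuming the dimension scores published in the detailed scores table; the dictionary keys and the `rubric_total` helper are illustrative names, not part of the benchmark's tooling.

```python
# Point ceilings per dimension, as listed in the scoring rubric.
CEILINGS = {
    "invariant_clarity_and_plan_quality": 20,
    "simplification_outcome": 25,
    "end_to_end_wiring": 10,
    "no_stubs_or_parallel_authority": 17,
    "correctness_proof_and_incident_preservation": 25,
    "runtime_durability_and_lifecycle": 3,
}


def rubric_total(scores: dict[str, float]) -> float:
    """Sum dimension scores after checking that every dimension is present and within its ceiling."""
    if scores.keys() != CEILINGS.keys():
        raise ValueError("scores must cover exactly the six rubric dimensions")
    for name, points in scores.items():
        if not 0 <= points <= CEILINGS[name]:
            raise ValueError(f"{name}: {points} is outside 0..{CEILINGS[name]}")
    return sum(scores.values())


# Dimension scores copied from the detailed scores table.
fable_5 = dict(zip(CEILINGS, [17, 24, 10, 13, 24, 3]))
gpt_5_5 = dict(zip(CEILINGS, [16.5, 20, 9, 8, 7, 2]))

ceiling_sum = sum(CEILINGS.values())
fable_total = rubric_total(fable_5)
gpt_total = rubric_total(gpt_5_5)
```

Run against the published rows, the ceilings sum to 100 and the two example totals come out to 91 and 62.5, matching the leaderboard.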

Reviewers reward a clear authority invariant in a reviewable plan; a smaller system with old write paths deleted; every production entry point wired through that authority; proof that incident regressions still pass; and durability across restart and lifecycle transitions. Models often name the right invariant early and still lose points when they drop incident scar tissue during the rewrite.

Official totals come from rubric adjudication. Test pass rates do not set the total. Human calibrations at 96 and 89 anchor the scale: one reference rewrote inside existing production constraints with strong incident replay; the other rebuilt a smaller kernel from scratch with a clearer invariant but thinner brownfield coverage. Mid-90s scores mark the senior baseline for this task; high-80s scores indicate credible architecture with gaps in wiring, authority shrinkage, or incident preservation. Where multiple rubric graders are documented, the aggregation is preserved: Claude Opus 4.7’s 33.5 is the average of separate Opus (29) and Codex (38) grader scores.

Version controls

Results compare only within the selected benchmark version. One ungraded frame-shift proposal is excluded from this version 1.0 leaderboard. Run IDs stay attached to scores so a result cannot silently move to a later implementation by the same model.
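One way to picture these controls is an immutable result record. The sketch below is illustrative only: the `OfficialResult` type, the `comparable` helper, the run ID strings, and the "1.1" version are all hypothetical placeholders, since the page does not publish run IDs or a later version. The fields follow what each official result is said to link to: benchmark version, model configuration, and run ID.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class OfficialResult:
    """Hypothetical record tying a score to the exact run that produced it."""

    benchmark_version: str
    model: str
    configuration: str
    run_id: str  # placeholder values below; real run IDs are not given on this page
    total: float


def comparable(a: OfficialResult, b: OfficialResult) -> bool:
    """Results may be compared only within the same benchmark version."""
    return a.benchmark_version == b.benchmark_version


fable = OfficialResult("1.0", "Claude Fable 5", "max", "run-placeholder-a", 91.0)
opus = OfficialResult("1.0", "Claude Opus 4.8", "extra-high", "run-placeholder-b", 63.0)
# A made-up later version, used only to show the comparison being refused.
later = OfficialResult("1.1", "Claude Opus 4.8", "extra-high", "run-placeholder-c", 0.0)

# Frozen records cannot be repointed at a different run or score after the fact.
try:
    fable.total = 100.0  # type: ignore[misc]
    mutated = True
except FrozenInstanceError:
    mutated = False
```

Freezing the record mirrors the stated goal: a new implementation by the same model would need a new record with its own run ID rather than an edit to the old one.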

Limitations

The leaderboard shows one selected official run per model, not an average across repeated trials. It does not estimate run-to-run variance or provide confidence intervals. Model families, reasoning settings, tool surfaces, and follow-up instructions also differ. Those differences are visible on the page but were not experimentally removed.

Official totals are canonical. The 96-point human reference uses its measured category scores. The model dimension table and radar charts are rubric-based reconstructions from the checked-in evidence, constrained to sum to each official total because the original model grader worksheets were not published. The widest gap between model runs and human references tends to sit in incident replay and authority surface shrinkage. Naming the right invariant on paper rarely closes it.

Page history

June 9, 2026

Initial publication

First public version of the Senior Engineer Benchmark results page, with the version 1.0 leaderboard, methodology, rubric breakdown, and score profiles.