Senior Engineer Benchmark

A benchmark that measures how well AI coding agents can rewrite a real production codebase the way a senior engineer would

Last updated June 9, 2026

Results

Benchmark version 1.0—June 2026

Human senior engineers (reference): 89 and 96
Claude Fable 5, max (new): 91
Claude Opus 4.8, extra-high: 63
GPT-5.5, extra-high*: 62.5
Claude Opus 4.7, extra-high: 33.5

* Executed a plan written by Claude Opus 4.7

Objective

First-principles rewrite of Proof’s live collaboration system

Proof created more than 4,000 documents after launch, but repeated crashes left users unable to access important work. The live collaboration system needed more than another patch: Its underlying architecture had to be reconsidered. Human senior engineers rewrote the same frozen codebase to produce the 89- and 96-point results used to calibrate the rubric; every model starts from that codebase and receives the same opening prompt.

Every model received the same opening prompt and frozen codebase, but follow-up instructions varied by model as part of the experimental conditions. See Methodology for the full prompt and follow-ups.

Detailed scores

Each run receives up to 100 points across six dimensions. The weights reflect what matters most in a production rewrite: genuine simplification and proof that the new system preserves correctness.

Invariant clarity and plan quality (20 pts)
Defines one load-bearing authority invariant and turns it into a focused, reviewable implementation plan.
Claude Fable 5§: 17 / 20; Claude Opus 4.8: 15 / 20; GPT-5.5: 16.5 / 20; Claude Opus 4.7: 12.5 / 20

Simplification outcome (25 pts)
Measures whether the completed rewrite removes old authority machinery and leaves a genuinely smaller system.
Claude Fable 5: 24 / 25; Claude Opus 4.8: 15 / 25; GPT-5.5: 20 / 25; Claude Opus 4.7: 2 / 25

End-to-end wiring (10 pts)
Checks that REST, agent, WebSocket, and mark writes all pass through the same authority path.
Claude Fable 5: 10 / 10; Claude Opus 4.8: 9 / 10; GPT-5.5: 9 / 10; Claude Opus 4.7: 5 / 10

No stubs or parallel authority (17 pts)
Penalizes surviving old write paths, dual-authority shims, unfinished stubs, and unnecessary new abstractions.
Claude Fable 5: 13 / 17; Claude Opus 4.8: 8 / 17; GPT-5.5: 8 / 17; Claude Opus 4.7: 2 / 17

Correctness proof and incident preservation (25 pts)
Requires feature-parity evidence, preserved incident regressions, a green final state, and honest documentation of gaps.
Claude Fable 5: 24 / 25; Claude Opus 4.8: 14 / 25; GPT-5.5: 7 / 25; Claude Opus 4.7: 10 / 25

Runtime durability and lifecycle (3 pts)
Checks that successful writes remain durable through restart, cleanup, and document lifecycle transitions.
Claude Fable 5: 3 / 3; Claude Opus 4.8: 2 / 3; GPT-5.5: 2 / 3; Claude Opus 4.7: 2 / 3

Total
Claude Fable 5§: 91 / 100; Claude Opus 4.8: 63 / 100; GPT-5.5: 62.5 / 100; Claude Opus 4.7: 33.5 / 100

§ Claude Fable 5 was run at max effort; the other model runs used extra-high

GPT-5.5 executed a plan written by Claude Opus 4.7 rather than authoring its own architecture proposal. See Methodology and the full prompt and follow-ups.

Claude Opus 4.7 was scored separately by Opus and Codex rubric graders; 33.5 is their average.

[Figure: score profiles for Claude Fable 5 (max), Claude Opus 4.8, GPT-5.5, and Claude Opus 4.7. Each chart plots the six benchmark dimensions (invariant and plan, simplification, wiring, authority integrity, correctness, durability) against a dotted 96-point human senior engineer reference.]

Methodology

The benchmark scores a model’s ability to redesign and implement a production system. The rubric weighs the proposed architecture, production-path code, legacy machinery removed, and evidence that the replacement preserves correctness. A passing self-authored test suite is supporting evidence; it does not determine the total.

Starting snapshot

Version 1.0 uses a frozen Proof source snapshot from March 17, 2026. The workspace includes production collaboration code and local incident context from an outage where public routes timed out while collaboration work kept running inside the process. One frozen snapshot keeps later fixes and repository drift from changing the task between runs.

That snapshot was reinitialized as a local Git repository with no remote. Agents could use only files and Git history inside the workspace. GitHub, web search, sibling checkouts, archived sessions, shell history, production logs, deployment systems, CLI fetches, and newly added remotes were prohibited so models could not read later human fixes or information unavailable at the historical starting point.

Prompt and follow-ups

Every model received the same opening prompt and the same frozen codebase. After the opening response, runs diverged: Each model got different continuation, implementation, and review instructions. Those instructions appear below without normalization. They are part of the experimental conditions and a known source of variance.

Runs share a starting state and a rubric, but operator intervention was not uniform. GPT-5.5 executed a plan written by Claude Opus 4.7. Several runs were pushed to continue after incomplete claims.

Full prompt

The code in this repo is vibecoded slop and it just keeps going down, and there’s tons and tons of unrelated issues that are cropping up where it goes down or documents get duplicated, and I’m just tearing my hair out on it. I have a feeling that it’s just vibecoded slop. If we started from the beginning, the codebase, especially around the live document collaboration, we would structure it way differently.

So if we wanted to do a clean first-principle structural rewrite where we were not thinking about, okay, what are the implementation services that we keep consistent? How do we do a clean migration? We just started from the beginning as a clean concept. What would we do? How would we structure it? What are the invariants that we would hold to be true throughout the codebase? Make a plan for that.

The opening prompt asks for a plan. The rubric scores the resulting architecture, working implementation, and verification evidence. Follow-up instructions varied by model and are recorded here as part of each run’s conditions.

Claude Fable 5, max follow-up

Continue past the plan: build the side-by-side rewrite, run the verification suites, then audit every named invariant against the code and tests.

Claude Opus 4.8 follow-up

Carry the rewrite through production cutover, remove the legacy authority paths, and verify the real live-editor path end to end.

GPT-5.5 follow-up

Execute the Claude Opus 4.7 plan. After two challenges to incomplete claims, re-check what was actually wired and continue the cutover.

Claude Opus 4.7 follow-up

Continue after operator babysitting, then take a fresh-review pass and carry the implementation through the real application runtime.

The collaboration problem

The benchmark centers on Proof’s live editing subsystem. Users co-edit documents in real time; every change must converge on one authoritative document state. During the motivating outage, several parallel sources of truth—database rows, projection tables, repair and reconcile paths, and the live document—could all mutate what counted as the document. That produced crashes, duplication, and data inconsistency.

The editor surface runs on ProseMirror. Real-time collaboration syncs through Yjs over WebSocket. A senior rewrite picks one load-bearing authority invariant—for example, the live ProseMirror fragment and Yjs document become the runtime content authority, with markdown and projections demoted to derived views—then deletes parallel machinery and routes every write surface through that authority while keeping incident knowledge intact. The rubric scores that authority rewrite. Incremental patches and self-authored test passes alone are insufficient.
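The single-authority idea can be illustrated with a toy model. The sketch below is not Proof's code and does not use the ProseMirror or Yjs APIs; `DocumentAuthority`, `apply`, and the four adapter functions are hypothetical names chosen for illustration. It shows one object owning content, every write surface delegating to it, and markdown produced only as a derived view.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentAuthority:
    """Toy content authority: the only object allowed to mutate the document."""

    blocks: list[str] = field(default_factory=list)
    log: list[tuple[str, str]] = field(default_factory=list)  # (surface, text)

    def apply(self, surface: str, text: str) -> None:
        # Every write surface funnels through this one method.
        self.blocks.append(text)
        self.log.append((surface, text))

    def markdown(self) -> str:
        # Derived view: recomputed from the authority, never written to directly.
        return "\n\n".join(self.blocks)


# Hypothetical write surfaces. Each is a thin adapter with no state of its own.
def rest_write(doc: DocumentAuthority, text: str) -> None:
    doc.apply("rest", text)


def agent_write(doc: DocumentAuthority, text: str) -> None:
    doc.apply("agent", text)


def websocket_write(doc: DocumentAuthority, text: str) -> None:
    doc.apply("websocket", text)


def mark_write(doc: DocumentAuthority, text: str) -> None:
    doc.apply("mark", text)


doc = DocumentAuthority()
rest_write(doc, "Title")
websocket_write(doc, "Body")
print(doc.markdown())
print([surface for surface, _ in doc.log])
```

Because the adapters hold no state, there is nothing for a repair or reconcile path to disagree with. A real system would replace the list of strings with its collaborative document type.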

What each run includes

Each official result links to a benchmark version, model configuration, and run ID. Reviewers read the architecture plan; the implementation diff; which authority paths were deleted versus left reachable; how REST routes, agent edits, WebSocket sessions, and mark updates reach the new authority; the test output; and any gaps the model documented.

Test counts are supporting evidence. A run can report many passing tests and still lose points if those tests skip the production path, preserve parallel authorities, leave the old architecture reachable, or fail to show the benchmark invariants under concurrency, restart, and failure conditions.
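As a rough illustration of what a restart check might look like, the sketch below uses a toy append-only file store. `DurableLog` and `check_durable_across_restart` are hypothetical names, and the storage design is an assumption made for the example, not a description of Proof's persistence layer. The point is only that the assertion is made by a fresh instance reading the same storage, not by the instance that accepted the writes.

```python
import json
import os
import tempfile


class DurableLog:
    """Toy append-only store: a write is acknowledged only after fsync."""

    def __init__(self, path: str) -> None:
        self.path = path

    def write(self, doc_id: str, text: str) -> bool:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"doc": doc_id, "text": text}) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the bytes to disk before acknowledging
        return True

    def read(self, doc_id: str) -> list[str]:
        if not os.path.exists(self.path):
            return []
        with open(self.path, encoding="utf-8") as f:
            records = [json.loads(line) for line in f]
        return [r["text"] for r in records if r["doc"] == doc_id]


def check_durable_across_restart() -> bool:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "doc.log")
        before = DurableLog(path)
        acked = [t for t in ("a", "b", "c") if before.write("doc-1", t)]
        del before                 # simulate process exit
        after = DurableLog(path)   # simulate restart on the same storage
        return after.read("doc-1") == acked


print(check_durable_across_restart())
```

A test of this shape only counts as evidence if the store under test is the one the production write path uses; the same check against a mock would pass without showing anything about the real system.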

Scoring rubric

All leaderboard runs use the same 100-point rubric: invariant clarity and plan quality (20); simplification outcome (25); end-to-end wiring (10); removal of stubs and parallel authority (17); correctness proof and incident preservation (25); runtime durability and lifecycle behavior (3). Dimensions and point ceilings stay fixed so reviewers cannot redefine success to fit one implementation.
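The fixed ceilings can be written down as data. The following is a minimal sketch, assuming a score sheet is a mapping from dimension to points; the dictionary keys and the `total_score` helper are illustrative names, not part of the benchmark's tooling. The example sheet uses the Claude Fable 5 scores from the detailed table.

```python
# Point ceilings per dimension, as listed in the rubric.
RUBRIC = {
    "invariant_and_plan": 20,
    "simplification": 25,
    "end_to_end_wiring": 10,
    "no_stubs_or_parallel_authority": 17,
    "correctness_and_incidents": 25,
    "durability_and_lifecycle": 3,
}
assert sum(RUBRIC.values()) == 100


def total_score(scores: dict[str, float]) -> float:
    """Validate a score sheet against the fixed ceilings and return its total."""
    if scores.keys() != RUBRIC.keys():
        raise ValueError("score sheet must cover exactly the six rubric dimensions")
    for name, points in scores.items():
        if not 0 <= points <= RUBRIC[name]:
            raise ValueError(f"{name}: {points} is outside 0..{RUBRIC[name]}")
    return sum(scores.values())


fable_5 = {
    "invariant_and_plan": 17,
    "simplification": 24,
    "end_to_end_wiring": 10,
    "no_stubs_or_parallel_authority": 13,
    "correctness_and_incidents": 24,
    "durability_and_lifecycle": 3,
}
print(total_score(fable_5))  # 91
```

Rejecting sheets with missing, extra, or over-ceiling dimensions is one way to express the rule that reviewers cannot redefine success to fit one implementation.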

Reviewers reward a clear authority invariant in a reviewable plan; a smaller system with old write paths deleted; every production entry point wired through that authority; proof that incident regressions still pass; and durability across restart and lifecycle transitions. Models often name the right invariant early and still lose points when they drop incident scar tissue during the rewrite.

Official totals come from rubric adjudication. Test pass rates do not set the total. Human calibrations at 96 and 89 anchor the scale: One reference rewrote inside existing production constraints with strong incident replay; another rebuilt a smaller kernel from scratch with a clearer invariant but thinner brownfield coverage. Mid-90s scores mark the senior baseline for this task; high 80s mean credible architecture with gaps in wiring, authority shrinkage, or incident preservation. Where multiple rubric graders are documented, aggregation is preserved—Claude Opus 4.7’s 33.5 is the average of separate Opus (29) and Codex (38) grader scores.
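The grader aggregation described above is a plain arithmetic mean. A short sketch, assuming each grader's total is recorded separately (the `aggregate` helper and the dictionary labels are illustrative):

```python
from statistics import mean


def aggregate(grader_totals: dict[str, float]) -> float:
    """Average the totals reported by independent rubric graders."""
    return mean(list(grader_totals.values()))


opus_4_7_graders = {"Opus grader": 29, "Codex grader": 38}
print(aggregate(opus_4_7_graders))  # 33.5
```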

Version controls

Results compare only within the selected benchmark version. One ungraded frame-shift proposal is excluded from this version 1.0 leaderboard. Run IDs stay attached to scores so a result cannot silently move to a later implementation by the same model.
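One way to picture these controls is an immutable record per official result. The sketch below is an assumption about shape only: `OfficialResult` and `leaderboard` are hypothetical names, and the run IDs are placeholders because the page text does not list them. Totals and configurations are taken from the results above.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class OfficialResult:
    benchmark_version: str
    model_config: str
    run_id: str  # placeholder values below; real run IDs are not given in the text
    total: float


def leaderboard(results: list[OfficialResult], version: str) -> list[OfficialResult]:
    """Compare only results recorded under the selected benchmark version."""
    rows = [r for r in results if r.benchmark_version == version]
    return sorted(rows, key=lambda r: r.total, reverse=True)


results = [
    OfficialResult("1.0", "Claude Opus 4.7, extra-high", "<run-id-d>", 33.5),
    OfficialResult("1.0", "Claude Fable 5, max", "<run-id-a>", 91),
    OfficialResult("1.0", "GPT-5.5, extra-high", "<run-id-c>", 62.5),
    OfficialResult("1.0", "Claude Opus 4.8, extra-high", "<run-id-b>", 63),
]

board = leaderboard(results, "1.0")
for row in board:
    print(row.model_config, row.total)

try:
    board[0].run_id = "some-later-run"  # a score cannot be re-pointed at another run
except FrozenInstanceError:
    print("run_id is immutable")
```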

Limitations

The leaderboard shows one selected official run per model, not an average across repeated trials. It does not estimate run-to-run variance or provide confidence intervals. Model families, reasoning settings, tool surfaces, and follow-up instructions also differ. Those differences are visible on the page but were not experimentally removed.

Official totals are canonical. The 96-point human reference uses its measured category scores. The model dimension table and radar charts are rubric-based reconstructions from the checked-in evidence, constrained to sum to each official total because the original model grader worksheets were not published. The widest gap between model runs and human references tends to sit in incident replay and authority surface shrinkage. Naming the right invariant on paper rarely closes it.
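The sum constraint on the reconstructed dimension scores can be checked mechanically. The sketch below copies the per-dimension values and official totals from this page; the variable and function names are illustrative.

```python
# Dimension order: invariant and plan, simplification, wiring,
# no stubs or parallel authority, correctness, durability.
DIMENSION_SCORES = {
    "Claude Fable 5": [17, 24, 10, 13, 24, 3],
    "Claude Opus 4.8": [15, 15, 9, 8, 14, 2],
    "GPT-5.5": [16.5, 20, 9, 8, 7, 2],
    "Claude Opus 4.7": [12.5, 2, 5, 2, 10, 2],
}

OFFICIAL_TOTALS = {
    "Claude Fable 5": 91,
    "Claude Opus 4.8": 63,
    "GPT-5.5": 62.5,
    "Claude Opus 4.7": 33.5,
}


def mismatches() -> dict[str, tuple[float, float]]:
    """Return models whose dimension scores do not sum to the official total."""
    return {
        model: (sum(scores), OFFICIAL_TOTALS[model])
        for model, scores in DIMENSION_SCORES.items()
        if sum(scores) != OFFICIAL_TOTALS[model]
    }


print(mismatches())  # {}
```

An empty result confirms only that the columns are internally consistent with the official totals; it says nothing about whether an individual dimension score matches what the original graders assigned.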

Page history

  1. Initial publication

    First public version of the Senior Engineer Benchmark results page, with Version 1.0 leaderboard, methodology, rubric breakdown, and score profiles
