Vibe Check: GPT-5.5 Has It All

OpenAI’s new model is a top-end senior engineer—and easy to talk to

April 23, 2026

Frontier models usually come with tradeoffs. You get more depth, but less speed. More agency, but less control. Better code, but worse prose. The surprising thing about GPT-5.5, the new OpenAI model out today, is how few of those tradeoffs it asks you to make.

It’s much faster than Opus 4.7, easier to collaborate with, better at writing than any OpenAI model we’ve used since GPT-4.5 and GPT-4o, and the strongest model we’ve tested on our new Senior Engineer Benchmark, which measures how well models can rewrite a slop-coded codebase the way a senior engineer would.

On that benchmark, GPT-5.5 with extra high reasoning reached 62.5 on its best run, while Opus 4.7 at a similar reasoning level landed in the low 30s. For reference, human senior engineers score in the high 80s and low 90s. Curiously, GPT-5.5’s best run came when it executed a plan written by Opus 4.7 rather than its own.

For a long time, OpenAI looked like it was trying to be everywhere at once: Sora for video, Atlas for browsing, consumer ChatGPT features, creative media tools, and whatever else might turn AI into the next mass-market platform. Meanwhile, Anthropic doubled down on work, and Claude became the default for coding agents, long-running engineering tasks, and professional workflows.

GPT-5.5 gives OpenAI something it badly needed: a fast, capable workhorse model for the professional tasks where most AI use happens.

GPT-5.5 is OpenAI’s clearest bid to reclaim the code-and-work narrative. It does not win everything. Opus 4.7 seems to write better plans and have a superior eye for design and product details. But GPT-5.5 is faster, steadier, and easier to trust for everyday professional work.

What OpenAI told us

OpenAI is pitching GPT-5.5 as a higher-capability model for complex work, especially tasks where stronger reasoning, higher reliability, and fewer retries yield a finished result faster and cheaper.

1 million-token context window

The context window remains 1 million tokens, with supported tools and rate limits similar to GPT-5.4.

Prompt caching

GPT-5.5 supports extended prompt caching for reusing long context across requests, but not in-memory caching for faster same-session reuse.

Medium reasoning by default

GPT-5.5 defaults to medium reasoning effort, unlike GPT-5.4, where the default was none.

No API availability at launch

GPT-5.5 launches in ChatGPT and Codex first, with the API coming later while OpenAI finishes additional safety and security validation.

More expensive than GPT-5.4

API pricing is set at $5 per 1 million input tokens and $30 per 1 million output tokens for GPT-5.5, with GPT-5.5 Pro at $30 and $180. OpenAI’s argument is that for harder tasks, better reasoning and fewer retries can lower the cost per completed task even when the per-token price is higher.

Pricing comparison

GPT-5.5: $5/1M input tokens, $30/1M output tokens

GPT-5.5 Pro: $30/1M input tokens, $180/1M output tokens

GPT-5.4: $2.50/1M input tokens, $15/1M output tokens

Opus 4.7: $5/1M input tokens, $25/1M output tokens
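
OpenAI’s cost-per-completed-task argument can be checked with a little arithmetic. The sketch below uses the list prices quoted above; the token counts and success rates are hypothetical, chosen only to illustrate the calculation, and the model of retries (independent attempts until one succeeds) is a simplifying assumption rather than anything OpenAI has published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """List prices in USD per 1 million tokens."""
    input_per_million: float
    output_per_million: float


# Prices as quoted in the comparison above.
PRICES = {
    "GPT-5.5": Pricing(5.00, 30.00),
    "GPT-5.4": Pricing(2.50, 15.00),
}


def cost_per_attempt(pricing: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single attempt at the task."""
    return (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    ) / 1_000_000


def cost_per_completed_task(
    pricing: Pricing, input_tokens: int, output_tokens: int, success_rate: float
) -> float:
    """Expected cost until the first successful attempt.

    Assumes independent retries, so expected attempts = 1 / success_rate.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt(pricing, input_tokens, output_tokens) / success_rate


if __name__ == "__main__":
    # Hypothetical workload and success rates, for illustration only.
    workload = {"input_tokens": 200_000, "output_tokens": 20_000}
    assumed_success = {"GPT-5.5": 0.9, "GPT-5.4": 0.4}
    for name, rate in assumed_success.items():
        cost = cost_per_completed_task(PRICES[name], success_rate=rate, **workload)
        print(f"{name}: ${cost:.2f} per completed task")
```

With these made-up inputs, GPT-5.5 comes out at about $1.78 per completed task against $2.00 for GPT-5.4. Because GPT-5.5’s list prices are exactly double GPT-5.4’s, at equal token usage it needs roughly double the success rate to break even; using fewer tokens per attempt would lower that bar.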

Why GPT-5.5 feels different

GPT-5.5 is built on a new pre-train—the broad, expensive training run that teaches the base model its underlying patterns before instruction tuning, tool use, and reasoning scaffolds are added in post-training. Post-training can make a model more obedient, safer, or more agentic. A new pre-train can change the model’s center of gravity.

OpenAI had already made a strong case that it was competitive again with GPT-5.4, which used the same pre-train as earlier GPT-5.x models. Releasing a new pre-train now suggests it wants to keep pressure on Anthropic—betting that the next answer to Claude starts with a different base model underneath, not just better scaffolding around the same one.

The most obvious change is speed. GPT-5.5 is much faster than Opus 4.7 in head-to-head tests, and conveys a low-friction competence. It is easier to iterate with, keep in the loop, and trust with everyday professional work. It also spends more time on planning and reviewing, asks more questions, and checks its work before moving on, especially at extra high reasoning.

GPT-5.5 is good at turning messy inputs into orderly, usable outputs: dashboards, curricula, run-of-show documents, consulting prose, and transcript-grounded writing. But the new pre-train does not solve everything. It can still be bland, struggle with Ruby, and trail Opus 4.7 on PowerPoint presentations, spatial composition, and ambitious prototypes.

The Reach Test

Dan Shipper, the multi-threaded CEO

“GPT-5.5 is my new daily driver. It’s what I reach for first on every coding task from vibe coding to serious engineering. And it’s my main model for most other agentic knowledge-work tasks from spreadsheets to research. It’s also the model I use by default in my OpenClaw setup.”

Kieran Klaassen, father of compound engineering

“GPT-5.5 feels very capable, and you can see it thinking harder. The planning and review cycles are longer, and on the best tasks it feels similar to Opus 4.7, which I had called the best model so far. But I’m mixed on it for product work. It can build deep functionality, but the design doesn’t always come together. The details are often good; the whole can feel random. It’s strong in a way I respect, but not yet in a way that consistently inspires me. To be a daily driver, I need a model that’s very good in all things, not just one or a few. It needs to be better at starting from scratch and filling in the blanks while still following instructions closely.”

Mike Taylor, PowerPoint engineer

“GPT-5.5 is the model I’d use when I need to get the job done without babysitting it. It’s less flashy than Opus, but it’s more natural, more accessible, and more client-ready. For dashboards, curricula, run-of-shows, and normal consulting docs, I trust it more. Opus still has more edge, and for high-stakes tasks I’m personally invested in that’s exactly what I want—especially for PowerPoint, sharp copy, or impressing a client. I’ll stick with Opus as my daily driver, but turn to GPT-5.5 when I need work I can use without thinking.”

Katie Parrott, AI-pilled writer by day, vibe coder by night

“I haven’t touched ChatGPT for writing in almost a year, but that changes now: I’m switching my writing workflow over to GPT-5.5 and adapting my writing plugin for Codex. This model gives me more confidence in the structure of a piece than Opus 4.7 does: the idea progression is cleaner, and the draft feels easier to revise. It still has some AI smell in the over-smoothed transitions and over-used constructions, and Opus can be better at punchy framing. But GPT-5.5 has the mix of speed and sensitivity to feedback that I need for writing every day.”

Naveen Naidu, Codex power user

“The thing that changed for me with GPT-5.5 is how many different kinds of work I started trusting Codex with. I used it across my own native iOS and Mac to-do app, Monologue backend work, MCP, the auth website, iOS and Mac client work, support drafts, and production debugging. One day it was building a native Swift app in one giant thread; another day it was implementing OAuth-only MCP across backend, frontend, and API surfaces under a deadline; another day it was reading Intercom history and drafting replies that sounded like me. Older Codex models already felt great for real engineering. Now I’m using it as my default model for almost everything.”

Legend:
Paradigm shift
Psyched about this release
It’s okay, but I wouldn’t use it every day
Trash release

Coding: Better at sustained engineering

Our coding testers came away with different impressions about GPT-5.5’s engineering strength depending on how much structure the task provided. Dan’s Senior Engineer Benchmark tested backend-heavy architectural rewriting with an explicit plan. Naveen tested technical vibe coding inside long Codex loops. Kieran’s LFG benchmark tested product-forward engineering with clear prompts, technical planning, and review. Mike tested the purest no-plan case: a one-shot vibe coding prompt meant to simulate a novice user.

GPT-5.5 looked best when the task gave it structure: a plan, a live product loop, or a harness that forced planning and review. In those settings, it was remarkably assertive. It carried plans through, deleted code, resisted ineffectual patches, and kept moving toward the target architecture or product shape.

Rewrites vibe coded codebases (almost) like a senior engineer

Most coding benchmarks test whether a model can solve a clean, bounded problem: fix a bug, pass a test, implement a feature, or complete a well-specified GitHub issue. Those benchmarks don’t match the reality of using AI on messy production code.

Dan’s new Senior Engineer Benchmark poses a harder question: If you give a model a messily coded codebase and ask it to clean the whole thing up, can it do the job the way a senior engineer would?

In our testing, GPT-5.5 consistently identified the right principles for a refactor and carried them through, including deleting thousands of lines of old code, over many hours of work. Other models, including Opus 4.7, identified similar principles but kept patching what was already there instead of rewriting or removing it.

62.5: The best model score came from GPT-5.5 xHigh executing an Opus 4.7 plan.

High 80s to low 90s: That is where actual senior engineers landed on the same rubric.

40s: GPT-5.5 on its own, along with GPT-5.5 high and GPT-5.4 high, clustered in the useful-patch band.

Low 30s: Opus 4.7 on its own scored last, despite producing the stronger plan.

The difference seems to be the plan. GPT-5.5’s own plan had the right concepts and made sense to a human, but Opus 4.7’s plan was terser and more engineering-spec-like: Here is exactly what I want you to do. That style gave GPT-5.5 something better to execute against.

Once GPT-5.5 had a great plan, it did something we haven’t seen from most models: it stayed focused. It stuck to the goal of the rewrite, deleted a bunch of code, ignored the way the codebase currently worked, and rebuilt toward the intended architecture from first principles. Opus 4.7 started toward the big rewrite, then nibbled at the edges rather than taking the full bite.
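
The plan-with-one-model, execute-with-another setup can be sketched in a few lines. This is not the harness used for the benchmark, which the article does not describe; the `Model` type, the `plan_then_execute` function, and the prompt wording are all hypothetical, and each model is represented as a plain function so the sketch runs without any API access.

```python
from typing import Callable

# A "model" here is any function that maps a prompt string to a response string.
# In practice each would wrap a real client call; those wrappers are not shown.
Model = Callable[[str], str]


def plan_then_execute(planner: Model, executor: Model, task: str) -> str:
    """Ask one model for a terse spec, then hand that spec to another model."""
    plan = planner(
        "Write a terse, step-by-step engineering spec for this task. "
        "State exactly what should be built, changed, and deleted.\n\n" + task
    )
    return executor(
        "Carry out the plan below exactly. Prefer rewriting or deleting old "
        "code over patching it.\n\nTask:\n" + task + "\n\nPlan:\n" + plan
    )


if __name__ == "__main__":
    # Stand-in models, used only to show the data flow.
    def fake_planner(prompt: str) -> str:
        return "1. Delete the legacy module. 2. Rebuild around the new interface."

    def fake_executor(prompt: str) -> str:
        return "EXECUTED:\n" + prompt.split("Plan:\n", 1)[1]

    print(plan_then_execute(fake_planner, fake_executor, "Clean up the codebase."))
```

Keeping the planner and executor as interchangeable functions makes it easy to try the pairings discussed here, such as one model writing the spec and another carrying it out.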

Holds a product thread together

GPT-5.5’s strongest coding advantage may be long-context product continuity—the ability to hold onto the objective across many rounds of feedback, checks, fixes, and direction changes without getting stuck making small, local changes that miss the larger goal.

The cleanest example was Dayline, a native iOS and Mac to-do app Naveen built in one long Codex thread, shaping the product as he went rather than starting from a polished plan. GPT-5.5 stayed oriented the whole time, tracking what Dayline was supposed to be across messy, imprecise feedback like “move this,” “the row interaction feels wrong,” and “keyboard focus needs to feel right.”

That is a different kind of coding win than a benchmark pass. Weaker models often lose the thread: they overcorrect to the most recent note, forget prior constraints, or treat each iteration like a separate task instead of one continuous product conversation. GPT-5.5 stayed in the loop. Naveen said this was the first app he’d built without looking at a single line of code.

LFG: Reliable at building, but wobbly on design

Kieran’s LFG benchmark tests full-product engineering—whether the model can build something that works, looks good, and feels coherent. Built around the /lfg command in Every’s compound engineering plugin, it uses clear product prompts plus a technical planning and review phase to test how much an engineering harness can improve the base prompt.

GPT-5.5’s planning and review time rose noticeably from GPT-5.4, and its functional output was strong. But LFG asks for the whole product to work and feel good, not just for the backend to be correct. Kieran’s read was that GPT-5.5 did well on non-Ruby backend work, but product engineers need the frontend, design, and language-specific implementation to hold together too.

On Rubber Duck, the hardest task, which asks the model to build a full e-commerce store with a custom product designer, GPT-5.5 delivered deep functionality and a working build. Only about 40 percent of the models we’ve tested have cleared this challenge.

GPT-5.5’s Rubber Duck build delivered a functional e-commerce flow with a custom product designer and polished details. (Image courtesy of Kieran Klaassen.)

Design was more uneven. On the landing page design benchmark, GPT-5.5 nailed some details: clean typography, solid buttons, and more original styling. But the page didn’t come together as a whole. “Some things are more structured, some things just don’t really make a lot of sense,” Kieran said. “The design, while being maybe less cliche and boring, also lacks coherence.”

GPT-5.5’s Drift landing page had strong individual details—clean type, solid buttons, less cliche styling—but the full composition did not quite cohere. (Image courtesy of Kieran.)

The same pattern showed in Cozy Island, which asks the model to build an interactive 3D island scene in Three.js from scratch, including procedural geometry, animated water and birds, swaying trees, and an orbiting camera. GPT-5.5 added good micro-details: rocks, trees, smoke, gentle motion. But it got the land color wrong, and its bird and fish animations were weaker than Opus 4.7’s.

GPT-5.5’s Cozy Island scene showed the same split: charming micro-details like rocks, trees, smoke, and gentle motion, but weaker animation and scene-level judgment than Opus 4.7. (Image courtesy of Kieran.)

GPT-5.5 also struggled to write Ruby on this benchmark, as has every other model we’ve tested on it.

Less reliable for one-shot vibe coding

Mike’s coding test was intentionally a no-plan vibe coding prompt. He wanted to see what a novice user would get from a blank page, without knowing to ask for plan mode or steer the model through a structured loop. In that setup, Opus 4.7 still looked stronger.

The task was to build a Typeform-style app where users speak instead of type to fill in structured data. A successful version needed working OpenAI and Anthropic connectors and a passing test suite. The demo had to run.

Opus 4.7 delivered. The connectors worked, the tests passed, and Mike had something he could use. GPT-5.5 scaffolded a reasonable architecture, but used placeholder code instead of working integrations and did not get the test suite running.

The lesson seems to be about setup. GPT-5.5 can be very strong from scratch when the task unfolds inside an iterative loop, but Mike’s test shows the risk when the model has to infer the plan and ship the demo from one loose prompt. For novice-style vibe coding, Opus still has the edge.

Writing: Smooth prose with stronger bones

GPT-5.5 is a better writer than we expected. Its writing is natural, accessible, and easy to read, but its real advantage is its grasp of structure. It tends to build clearer arguments than Opus, which makes its drafts easier to revise for publication. But it can still carry some AI tells: over-smoothed transitions, balanced constructions, and sentences that explain themselves a little too neatly.

Easier to keep reading

Mike preferred GPT-5.5’s writing overall. It felt more natural to him than Opus 4.7: easier to read, less showy, and more fully fleshed out. Opus 4.7 can say more in less space, which makes it effective for marketing copy or anything that needs to impress quickly. But it can also feel sparse: each line may be sharp, but the whole leaves you wanting more.

GPT-5.5’s writing style is clearer and easier to follow than Opus 4.7’s. (Image courtesy of Mike Taylor.)

GPT-5.5 explains itself, moves cleanly between ideas, and feels better suited to broader audiences. On course curriculum and run-of-show tasks, it chose simpler topic titles and clear agendas—less clever than Opus, but easier to imagine sending to a broad audience.

Naveen saw the same thing in customer support writing. GPT-5.5 read Intercom conversations, existing documentation, prior replies, product context, and internal comments, then drafted customer replies that sounded close to how he would write. It picked up the appropriate posture: be direct, be honest, apologize when the bug is ours, do not over-explain, and be clear about the issue.

Its drafts are easier to revise

Katie has used Claude as her writing default for months, but GPT-5.5 was the first OpenAI model in a long time to make her reconsider. Its prose still needed cleanup, but the draft had the thing she cares about most: a clean argumentative spine.

In a Working Overtime introduction test, Katie asked GPT-5.5 and Opus 4.7 to draft the opening for a piece about failing OKRs and building an AI assistant to manage ongoing work. On first read, Opus looked more energetic, with livelier turns of phrase.

Opus 4.7’s draft has more surface swagger, but it jumps too quickly from confession to shtick to conclusion. (Image courtesy of Katie Parrott.)

But read with fresher eyes, GPT-5.5’s draft was the stronger one. The model moved more patiently from missed goals to the limits of to-do lists and then to the deeper problem of project coordination. It understood the shape of the essay better. The prose still had some AI smell, especially in its tidy correlative constructions and over-smoothed explanatory cadence. But those are easier problems to edit than a draft whose idea progression does not hold.

GPT-5.5 produced the stronger Working Overtime draft: patient, well-structured, and easier to revise into publishable shape, with only a little AI smoothness left to edit out. (Image courtesy of Katie.)

Turns transcripts into writing without trying too hard

One place GPT-5.5 surprised us was in how well it made dictated text feel like written text. Mike asked it to write in his style from an audio transcript, and it pulled in the right details at the right moments—and didn’t overdo human vocal tics. Claude, by contrast, sometimes shoves in too many “ums” and “ahs.”

GPT-5.5’s restraint helped here. In generating prose, it did not produce the catchiest titles or the best vibe-check quotes, but when the task was to sound like a person without caricaturing them, GPT-5.5 was impressively grounded.

GPT-5.5 is the better model for everyday writing like consulting reports, curricula, agendas, summaries, support replies, and broad-audience prose. It may also be the better first-draft partner for essays when structure matters most. Opus 4.7 still has an edge on punchy framing and catchier titles, but GPT-5.5 gives you cleaner material to work from.

Knowledge work: The dependable operator

GPT-5.5’s clearest advantage over Opus 4.7 is in practical knowledge work. It excels at tasks where you need to give order to messy inputs, like voice-dictated brainstorming sessions. It’s fast and less prone to showing off. That makes it especially strong for dashboards, consulting documents, curricula, run-of-show plans, and turning transcripts into usable documentation.

It makes better dashboards

Mike Taylor’s biggest surprise was how well GPT-5.5 produced polished dashboards, with cleaner layouts and stronger visual presentation than Opus 4.7. This was unexpected because Opus 4.7’s launch story leaned heavily on improved vision, and in our testing it still wins on PowerPoint. But for dashboards, GPT-5.5 came out ahead.

GPT-5.5 was strongest on dashboard-style knowledge work: clean layouts, readable charts, and client-ready synthesis without much prompting. (Image courtesy of Mike Taylor.)
Opus 4.7’s dashboard was functional, but hard to read and less polished than GPT-5.5’s version. (Image courtesy of Mike.)

Accessible, sometimes at the cost of insight

GPT-5.5’s outputs tend to be easier for broad audiences. On curriculum and planning tasks, it produced reassuringly simple structures that most people would understand quickly. Not every deliverable needs to be clever. Many need to be clear, politic, and easy to adopt.

But Opus 4.7 still finds sharper insights. In one task, Mike noted that Opus did a better job surfacing an audience-segmentation problem—which users the course was ready for and how to target the material to the right audience.

Worse at prompting than Claude

GPT-5.5 was weaker than Claude at generating exercises and prompts for Every consulting workshops. GPT-5.5 is good at making orderly deliverables, but less apt when the job is to create a tool or instruction to give to another model.

That is an important distinction for people who increasingly use models to build workflows. GPT-5.5 can help produce the artifact. Claude is still better at designing the instructions that produce the artifact.

Better for client-safe work

Across consulting tasks, GPT-5.5 felt more dependable. It chose straightforward course titles, simple agendas, and clean run-of-show structures. Opus 4.7 often took bigger swings, but those swings also created rough edges.

Mike’s summary: he would use Opus 4.7 if he wanted to impress a client with wit and had time to micromanage. He would use GPT-5.5 if he was too busy to do the task himself and needed the job done without looking stupid. GPT-5.5 is less showy, but more professionally safe.

The verdict

Opus 4.7 often announces itself as a big model: slower, stranger, sharper at the edges, sometimes brilliant and sometimes annoying. GPT-5.5 is less theatrical. It returns usable work quickly and consistently. That makes it feel less like a lab instrument and more like a home appliance. You can leave it running in the background of writing, analysis, coding, and client work, and it’ll generally make your life more pleasant.

Reach for GPT-5.5 if…

  • You need strong everyday work quickly.

    GPT-5.5 is the better choice for dashboards, consulting documents, curricula, run-of-show plans, broad-audience writing, and transcript-grounded synthesis. It is also the go-to model when speed is at a premium, because its speed advantage over Opus 4.7 is large enough to change how you use it.

  • You are doing serious coding on a mature codebase.

    In our Senior Engineer Benchmark, GPT-5.5 at extra high reasoning came closest to the right architecture. It still needed review, but it was the strongest real-work implementation of the bunch.

  • You want to delegate without the model trying to overimpress.

    GPT-5.5 is less flashy but more grounded. That makes it easier to trust for routine professional work, especially when the audience values clarity over cleverness.

  • You want a strong first draft with a clean argument.

    GPT-5.5 was better at writing than we expected, especially for broad-audience prose and work that needs clear progression. It still needs editing, but its drafts often have a sound structure.

Reach for Opus 4.7 if…

  • You are making PowerPoints or visual compositions.

    Opus 4.7 remains the best model we’ve tested for generating slides, and it showed stronger spatial awareness than GPT-5.5.

  • You need the model to take a bigger swing on execution.

    On open-ended builds like Mike’s pure vibe coding test, GPT-5.5 can stop at a sensible scaffold. Opus 4.7 is more likely to push through to a working artifact, even when the task is messy.

GPT-5.5 is the model for when the work needs to keep moving. It is not always the model that produces the most memorable line, the boldest visual, or the prettiest artifact. But it is fast, grounded, and unusually good at turning vague professional intent into something workable. For a frontier model, GPT-5.5 is almost boringly useful. That may be the point.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

