Our verdict: The dominant criticism of the coding tool Codex has always been the same: It acts like a brilliant senior engineer who is methodical to a fault. It'll autonomously ship a full product that builds without errors—if you have a detailed spec. But it's slow and cautious, sometimes gets stuck in myopic loops, and has very little empathy.
GPT-5.3 Codex, which OpenAI released today, maintains the coding prowess of its predecessors, but it's a much more user-friendly model. It's faster, a bit warmer, and more creative. It's also way more industrious—it does things without asking for permission. For developers who were frustrated by earlier Codex versions stopping to double-check obvious decisions, this is the update you've been waiting for.
In a lot of ways, it feels like Codex got upgraded with some of Opus 4.5's better qualities. Between this release and the new Codex app that OpenAI launched on Monday, it's clear that OpenAI wants to make Codex a more general-purpose model for knowledge work beyond coding, and 5.3 is a step toward that goal.
What OpenAI told us
Best-of-both-worlds model
It has the frontier coding chops of the company's latest research combined with GPT-5.2's reliability for agentic work. It's built for long-horizon tasks—the kind of sustained, multi-step work that unfolds over minutes or hours rather than a single prompt-and-response.
Less of a black box
The model narrates what it's doing as it works, making agents feel more transparent and predictable.
Mid-turn redirection
You can course-correct while the model is working instead of waiting for it to finish.
We focused our testing on coding for this Vibe Check. OpenAI also claims the model unlocks "advanced writing," but due to timing and testing constraints, we didn't evaluate that this time around. For now, Claude is still our preferred model for writing. If that changes, we'll let you know.
The Reach Test
"I'm entering my Codex era. Prior to this model, I would only use Codex a bit for really hard tasks or code reviews. Now, it's becoming a daily driver for my non-vibe coding tasks in bigger code bases. I especially like using it in the Codex app. GUIs are back!"
"GPT-5.3 Codex is my go-to model. I've been using it for the past two weeks with the Codex app. Even up against Opus 4.6, I'm still reaching for Codex. I gave a big redesign task to both Claude Opus 4.6 and Codex. Codex did it well, with no build errors, but Claude couldn't complete the task. It had a few build failures. Little things like that give me more trust to use GPT-5.3 Codex over Opus 4.6."
"The -codex lineup was always powerful. It always went deep into source code, third-party plugins, etc. to find solutions. I can't say I noticed any standout difference from previous Codex models. Speed is up, one-shot reliability is consistent, deep investigation is intact. It's a solid upgrade, not a revelation."
"The new Codex model surprised me because it's so fast. It feels more useful and friendly than the ones before, where it felt a little bit too like an old-school, grumpy engineer who specialized on backend, less creative projects. This model is a little bit more creative, and that's really good. It still has its power—it just keeps going and does the work well. My daily driver will still be Claude, but Codex has a place in my workflow now. I use it for research, reviews, and long feature builds, and Claude for planning."
The headline findings
Codex finally stops asking for permission
Dan tested Codex 5.3 on Proof, a macOS markdown editor he's been vibe coding. Proof tracks the origin of every piece of text—whether it was written by a human or generated by AI—and lets users attest to how thoroughly they've reviewed AI-generated content.
Once the codebase got complex, Opus 4.5 started to trip up. Dan switched to Codex 5.3 and found it to be exactly the right tool for the job. He was surprised to see that the model ran a full test loop on its own: wrote fixes, checked results, found issues, and iterated. It did not pause for confirmation or hedge obvious decisions. It was also much faster on simple requests and quickly became his daily driver model for coding.
It still does exactly what you say—for better or worse
You don't need a Ralph Loop (a workaround developers use to keep AI agents running by restarting them in a loop until a task is done) when you work with Codex 5.3. It's going to keep going until it's finished everything it's supposed to. Dan ran Codex overnight multiple times over the last week on difficult bugs and features with great results.
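If you've never seen one, a Ralph Loop in its simplest form is just a retry wrapper around an agent CLI. Here's a minimal Python sketch; the `codex exec` invocation and the `DONE` sentinel file are illustrative assumptions, not a documented interface:

```python
import pathlib
import subprocess
import time

PROMPT = "Work through TODO.md; when every task passes its tests, create a file named DONE."

# Hypothetical Ralph Loop: keep restarting the agent until it signals completion.
# "codex exec" is assumed here as a non-interactive agent invocation; swap in
# whatever CLI and flags your agent actually uses.
while not pathlib.Path("DONE").exists():
    subprocess.run(["codex", "exec", PROMPT], check=False)
    time.sleep(5)  # brief pause before the next restart
```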
However, it still leans more literal than the Claude family of models and can sometimes drift off track, fixating on details without grasping the full context. In a head-to-head debugging test, Dan asked both Opus 4.6 and GPT-5.3 Codex to diagnose a tricky document formatting bug in Proof. GPT-5.3 Codex ran more than eight forensic tool calls, analyzing the document and associated code bit by bit and finding real issues—but missing the actual problem. Opus 4.6 read the document structure once and diagnosed the issue.
If you are writing precise, detailed instructions, GPT-5.3 Codex is hard to beat for power and speed. If you want a model that will do well with ambiguity, it's better than its predecessor but not the strongest model we've tested.
Quick verdict by use case
Coding
It's fast, powerful, and autonomous. It thrives when tasks are well-specified.
Knowledge work
It's promising for these tasks given its speed and smarts, but it's not very usable inside the Codex app yet.
LFG benchmark
Standard AI coding benchmarks test whether a model can solve algorithm puzzles or pass unit tests. They're great for leaderboards, but less useful for telling you whether a model can ship a real project.
Kieran built Every's LFG benchmark to tell him about a model's capabilities as a software engineer.
The benchmark is named after the `/lfg` command in Every's compound engineering plugin—a single command that kicks off an entire development workflow. You give it one reasonably detailed but high-level prompt, and it handles the rest: planning what to build, writing the code, and reviewing its own work. There's no hand-holding or step-by-step guidance. The model either figures it out and delivers working code, or it doesn't.
Each benchmark run activates the `/lfg` command and then measures the result.
The tasks, in increasing order of difficulty
1. Drift landing page (React, medium complexity). Build a polished landing page for a fictional AI writing app with a dark editorial aesthetic, six required sections, and specific design constraints (no purple gradients or Inter font). Tests: Can the model follow a creative brief, write clean frontend code, and respect constraints?
2. Cozy island 3D (Three.js, medium-high complexity). Build an interactive 3D island scene with 13 features, including water, trees, a cottage, birds, clouds, and camera controls. Tests: Can the model handle spatial reasoning, 3D rendering, and complex visual features?
3. Earnings preview dashboard (Streamlit/Python, high complexity). Build an NVIDIA earnings dashboard with seven tabs, interactive charts, and real financial calculations. Tests: Can the model handle data-heavy applications with multiple interconnected views?
4. Rubber duck e-commerce (Next.js, very high complexity). Build a full e-commerce site—product pages, shopping cart, multi-step checkout, customization page. Tests: Can the model execute a complete, production-quality e-commerce site from scratch? This is the hardest benchmark: Only 40 percent of models completed it successfully.
Kieran tested GPT-5.3 Codex against seven other models: GPT-5.2 Codex plus Claude Opus 4.6, Opus 4.5, Sonnet 4.5, Haiku 4.5, Gemini 3 Flash, and Gemini 3 Pro. He scored each run on build success, feature completeness, visual design, and code quality. He also ran consistency tests—three attempts per benchmark—to measure how much outputs varied across runs.
The headline result
On raw benchmark scores, Opus 4.6 leads across the board: higher average scores, twice the first-attempt reliability, and stronger consistency with perfect build success. But those benchmarks reward autonomous exploration over spec execution—Opus 4.6's wheelhouse. GPT-5.3 Codex closes the gap on speed and on well-specified work; the standout results below break it down.
Standout results
On first-attempt reliability, Opus 4.6 leads. It's roughly twice as likely to succeed without retries and produces more consistent results across multiple runs. The benchmarks also reward autonomous exploration—i.e., figuring things out on your own—which is Opus's wheelhouse.
On speed, Codex closes the gap. GPT-5.3 Codex finished tasks about 25 percent faster than Opus 4.6, and in workflows with detailed requirements, the reliability gap narrows considerably.
On consistency, Codex was shakier. One earnings dashboard run produced zero output files despite reporting success—a critical reliability issue. The e-commerce task sometimes generated completely different project structures from the same prompt.
On code quality, Codex was solid. The code it produced was clean and well-organized—among the best in our tests.
A note on our benchmarks
The benchmarks may tell us as much about task design as about model capability. The instructions in our benchmarks are thorough, but they stop short of a full specification (hence the "LFG" moniker). We want to know which models can figure it out on their own.
Opus models thrive in this environment. Hand them a vague goal and they explore, investigate, and converge on a solution. Codex, by contrast, wants direction. When the specs are detailed, it executes flawlessly. When they're not, it either guesses (sometimes wrong) or stalls.
This matches what we've observed in daily use. Codex tackled Naveen's redesign task—which came with clear requirements—without build errors. Claude Opus 4.6 loves autonomous planning tasks like Kieran's. The benchmarks don't tell the whole story; they tell a story, one where exploratory problem-solving matters more than precise execution.
Codex shines when you have a plan and want it done right
GPT-5.3 Codex thrives when you give it directions. When you trust that a well-specified task will succeed, you can hand it off and move on to something else, knowing that Codex will follow your plan to the letter, meaning less time spent debugging once it's done.
Naveen's experience bears this out. He gave both Opus 4.6 and GPT-5.3 Codex a big redesign task—one with clear requirements. Codex handled it with no build errors. Opus couldn't finish. The difference was that Naveen knew exactly what he wanted and wrote it down. That's where Codex excels.
Great at following docs
Andrey asked GPT-5.3 Codex to build an MCP server—MCP (Model Context Protocol) being the standard that lets AI models connect to external tools and data sources. Codex read the documentation on its own and nailed the implementation, which speaks to one of the model's genuine strengths: hand it a spec or a set of docs and it'll build exactly what they describe.
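For readers who haven't built one, here's roughly what a minimal MCP server looks like using the official Python SDK's FastMCP helper. The echo tool is a toy stand-in, not what Andrey actually built:

```python
# Minimal MCP server sketch using the official Python SDK (pip install "mcp[cli]").
# The echo tool is a toy example; a real server would expose domain-specific tools.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def echo(text: str) -> str:
    """Return the input text unchanged."""
    return text

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, so an agent like Codex can connect to it
```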
Noticeably faster—enough to change how you use it
GPT-5.3 Codex is noticeably faster than 5.2, though still not as fast as Cursor's Composer 1 Alpha. For Kieran, whose compounding engineering workflow depends on fast iteration loops, the speed gain was the thing that surprised him most about the release—enough to shift Codex from a model he'd tolerate to one he'd reach for.
Deep investigation: Strong as ever, not a step forward
Codex has always been good at digging into source code, figuring out API quirks, and tracing issues through third-party plugins. Both Andrey and Kieran consider it one of the model's core strengths, and GPT-5.3 Codex maintains that baseline. OpenAI pitched "deep research" as a new capability for this model. Based on our testing, it doesn't change much.
Codex doesn't feel like 'an engineer's model' anymore
GPT-5.3 Codex feels friendlier and more creative than previous Codex versions. Earlier iterations were capable but rigid—like working with a colleague who only speaks in implementation details. GPT-5.3 Codex has loosened up. It still has the horsepower to grind through long feature builds, but the personality around the edges is warmer.
Part of that is a shift in judgment. Codex could always go deep on details. What's new is that GPT-5.3 Codex doesn't always get stuck there. It can more frequently pull back from implementation minutiae and see the broader picture. Kieran flagged this specifically: The model knows when to stop digging and step back.
Codex's reputation as the serious, precise, no-fun model is part of why some developers default to Claude. If GPT-5.3 Codex can be precise and pleasant, that's a harder combination to walk away from.
More autonomy means more rabbit holes
This is the flip side of the caution fix: Earlier Codex versions would stop and ask when they started drifting from your intent. GPT-5.3 Codex keeps going—great when it's heading in the right direction, frustrating when it's not.
On longer tasks, the model sometimes gets lost down rabbit holes. It's technically executing on each step, but gradually moving away from the original intent—solving the problem it thinks you have, which may not be the problem you actually have.
Sometimes the model just gets dumber mid-session
Dan noticed that GPT-5.3 Codex sometimes seems to route to a weaker model mid-session. You're working with a frontier model, the responses feel sharp, and then suddenly the quality drops. You can't tell whether the model is struggling with your task or you've been secretly downgraded.
Final thoughts
GPT-5.3 Codex is a straight upgrade for anyone already in the Codex ecosystem. It's faster, more autonomous, and less likely to interrupt your flow with unnecessary confirmation prompts. If you were tolerating earlier Codex versions, this is the release that makes it pleasant to use.
The model's core personality hasn't changed: It does what you say, not what you mean. For developers who know exactly what they want and need confidence the build won't fail, that's a feature. For those who want a model that infers intent and pushes back when requests don't make sense, Claude is still the better fit.
The speed and reliability improvements are significant enough to shift how you work with it. Kieran, our most devoted Claude Code user, now reaches for Codex for reviews and long feature builds. That's not a full conversion, but it's a meaningful change in a workflow that had no room for Codex before.