Today, both OpenAI and Anthropic released new, significantly improved models: GPT-5.3 Codex and Opus 4.6. We've been testing them thoroughly internally on real production use cases, and we've come to a conclusion:
The models are converging.
Opus 4.6 has all of the things we love about 4.5, but with the thorough, precise style that made Codex the go-to for hard coding tasks. And Codex 5.3 is still a powerful workhorse, but it finally picked up some of Opus's warmth, speed, and willingness to just do things without asking permission.
From this, we can only conclude that both labs are moving steadily toward a sort of Ur-coding model: one that's wicked smart, highly technical, fast, creative, and pleasant to work with.
Why the convergence? Because a great coding agent turns out to be the basis for a great general-purpose work agent. The behaviors that make AI useful for software development—parallel execution, tool use, planning before acting, knowing when to dig deep versus when to ship—are the same behaviors that make AI useful for any knowledge work.
And that is the holy grail of AI.
Okay, but really, which one is better?
These models are very close in abilities, so there's not a clear winner across the board.
If you're a Codex person, you're probably going to love 5.3. If you're an Opus person, you're going to stick with 4.6. Most of us are mixing and matching internally.
However, if you put a gun to my head and told me to pick, here's how I'd put it:
Opus 4.6
Pick Opus when you want maximum upside on hard, open-ended tasks.
Opus 4.6 has a higher ceiling as a model, but it also has higher variance. It's more parallelized by default and more creative. I used it on a feature for our Monologue iOS app that the team had been working on off and on for two months. It just built it. Naveen Naidu, general manager of Monologue, was stunned to see it. But Opus also sometimes reports success when it's actually failed, or makes changes you didn't ask for. You have to watch it.
Codex 5.3
Pick Codex when you want steady, reliable autonomous execution.
Codex 5.3 is an excellent model, and its output is more reliable. It is extremely smart and can work autonomously for long periods on difficult coding tasks. It is very fast—faster than Opus—and doesn't make the dumb mistakes that Opus makes. Cora general manager and die-hard Claude Code devotee Kieran Klaassen is even making room for it in his workflow. However, at least in our testing, it doesn't quite reach the same heights as Opus 4.6.
The Reach Test: Head-to-head
Which are we reaching for?
- 50/50: vibe code with Opus, serious engineering with Codex
- Opus, with Codex for planning and review
- Codex, with some Opus for certain tasks
Opus vs. Codex by dimension
- Research and planning: Opus. It spent 15 minutes reading forums, competitor apps, and codebases to solve a problem the team had been stuck on for months, and its plans are also significantly more detailed.
- Parallelization: Opus. It kicks off multiple tasks at once by default.
- Complex, well-architected builds: Codex. It produced zero build errors on a significant iOS UI redesign; Opus produced tons of them.
- Long, underspecified feature builds: Opus. It extends the frontier of vibe coding, again.
- Speed: Codex. It's noticeably faster; Opus's thoroughness costs time.
- Empathy and creativity: Opus. It figures out what you mean; Codex does what you say.
- Claim reliability: Codex. Opus sometimes reports success when it has actually failed.
The LFG benchmark: Head-to-head
Kieran built LFG bench—a set of internal benchmarks that ask frontier models to do four tasks of increasing difficulty:
- Landing page (React)—This tests the model's ability to follow a creative brief and respect constraints
- 3D island scene (Three.js)—This looks at the model's knack for spatial reasoning and complex visuals
- Earnings dashboard (Streamlit)—How does the model do with data-heavy tasks requiring multiple views?
- E-commerce site (Next.js)—This is the hardest test: can the model build a full production website end-to-end?
The gap widened on hard tasks. On the simple landing page, both models performed well. On the e-commerce site—11 features including full checkout—Opus shipped everything. Codex produced a beautiful design but was missing the entire checkout flow.
About the LFG benchmark
LFG bench runs the /lfg command in Every's compound engineering plugin—which bundles planning, coding, and code review into a single step—inside both Codex and Claude Code harnesses. You give it one reasonably detailed but high-level prompt, and it handles the entire workflow, with no hand-holding.
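If you want a feel for the shape of that workflow, here's a minimal sketch of wiring up a single run yourself. The prompt text, output paths, and CLI invocations (claude -p, codex exec) are illustrative assumptions, not the actual LFG bench harness or Every's plugin configuration.

```python
"""Sketch of an LFG-bench-style run: one high-level prompt into each harness.

All names, paths, and CLI calls below are assumptions for illustration;
adjust them to match your installed CLIs and plugin setup.
"""
import subprocess
from pathlib import Path

# Hypothetical prompt in the spirit of the hardest LFG task.
PROMPT = (
    "/lfg Build a production-ready Next.js e-commerce site: product listing, "
    "product detail pages, cart, and a full checkout flow."
)

# Assumed non-interactive invocations of each harness's CLI.
HARNESSES = {
    "claude-code": ["claude", "-p", PROMPT],
    "codex": ["codex", "exec", PROMPT],
}


def run_benchmark(output_dir: str = "lfg-runs") -> None:
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    for name, cmd in HARNESSES.items():
        # Capture each harness's transcript so the two builds can be compared side by side.
        result = subprocess.run(cmd, capture_output=True, text=True)
        (out / f"{name}.log").write_text(result.stdout + result.stderr)
        print(f"{name}: exit code {result.returncode}")


if __name__ == "__main__":
    run_benchmark()
```

In the real benchmark, the compound engineering plugin handles planning, coding, and review inside each harness; the point of the sketch is just the workflow's shape: one prompt in, one finished build out.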
The results may tell us as much about task design as model capability. We want to know which model can figure things out on its own. Opus thrives in this environment: hand it a vague goal, and it explores, investigates, and converges. Codex wants direction. When specs are detailed, it executes flawlessly; when they're not, it guesses or stalls.
Want to learn more?
Read our detailed Vibe Checks