Today, both OpenAI and Anthropic released new, significantly improved models: GPT-5.3 Codex and Opus 4.6. We've been testing them thoroughly internally on real production use cases, and we've come to a conclusion:
The models are converging.
Opus 4.6 has all of the things we love about 4.5, but with the thorough, precise style that made Codex the go-to for hard coding tasks. And Codex 5.3 is still a powerful workhorse, but it finally picked up some of Opus's warmth, speed, and willingness to just do things without asking permission.
From this, we can only conclude that both labs are moving steadily toward a sort of Ur-coding model: one that's wicked smart, highly technical, fast, creative, and pleasant to work with.
Why the convergence? Because a great coding agent turns out to be the basis for a great general-purpose work agent. The behaviors that make AI useful for software development—parallel execution, tool use, planning before acting, knowing when to dig deep versus when to ship—are the same behaviors that make AI useful for any knowledge work.
And that is the holy grail of AI.
Okay, but really, which one is better?
These models are very close in abilities, so there's not a clear winner across the board.
If you're a Codex person, you're probably going to love 5.3. If you're an Opus person, you're going to stick with 4.6. Most of us are mixing and matching internally.
However, if you put a gun to my head and told me to pick, here's how I'd put it:
Opus 4.6
Pick Opus when you want maximum upside on hard, open-ended tasks.
Opus 4.6 has a higher ceiling as a model, but it also has higher variance. It's more parallelized by default and more creative. I used it on a feature for our Monologue iOS app that the team had been working on off and on for two months. It just built it. Naveen Naidu, general manager of Monologue, was stunned to see it. But Opus also sometimes reports success when it's actually failed, or makes changes you didn't ask for. You have to watch it.
Codex 5.3
Pick Codex when you want steady, reliable autonomous execution.
Codex 5.3 is an excellent model, and its output is more reliable. It is extremely smart and can work autonomously for long periods on difficult coding tasks. It is very fast—faster than Opus—and doesn't make the dumb mistakes that Opus makes. Cora general manager and die-hard Claude Code devotee Kieran Klaassen is even making room for it in his workflow. However, at least in our testing, it doesn't quite reach the same heights as Opus 4.6.
The Reach Test: Head-to-head
Which are we reaching for?
- 50/50: vibe code with Opus, serious engineering with Codex
- Opus, with Codex for planning and review
- Codex, with some Opus for certain tasks
Opus vs. Codex by dimension
- Research and planning: Opus. It spent 15 minutes reading forums, competitor apps, and codebases to solve a problem the team had been stuck on for months, and its plans are also significantly more detailed.
- Parallelization: Opus. It kicks off multiple tasks at once by default.
- Complex, well-architected builds: Codex. It produced zero build errors on a significant iOS UI redesign; Opus produced tons of them.
- Long, underspecified feature builds: Opus. It extends the frontier of vibe coding, again.
- Speed: Codex. It's noticeably faster; Opus's thoroughness costs time.
- Empathy and creativity: Opus. It figures out what you mean; Codex does what you say.
- Claim reliability: Codex. Opus sometimes reports success when it has actually failed.
The LFG benchmark: Head-to-head
Kieran built LFG bench—a set of internal benchmarks that ask frontier models to do four tasks of increasing difficulty:
- Landing page (React)—This tests the model's ability to follow a creative brief and respect constraints
- 3D island scene (Three.js)—This looks at the model's knack for spatial reasoning and complex visuals
- Earnings dashboard (Streamlit)—How does the model do with data-heavy tasks requiring multiple views?
- E-commerce site (Next.js)—This is the hardest test: can the model build a full production website end-to-end?
The gap widened on hard tasks. On the simple landing page, both models performed well. On the e-commerce site—11 features including full checkout—Opus shipped everything. Codex produced a beautiful design but was missing the entire checkout flow.
About the LFG benchmark
LFG bench runs the /lfg command in Every's compound engineering plugin—which bundles planning, coding, and code review into a single step—inside both Codex and Claude Code harnesses. You give it one reasonably detailed but high-level prompt, and it handles the entire workflow, with no hand-holding.
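If you want a feel for the shape of that workflow, here's a minimal sketch of wiring up a single run yourself. The prompt text, output paths, and CLI invocations (claude -p, codex exec) are illustrative assumptions, not the actual LFG bench harness or Every's plugin configuration.

```python
"""Sketch of an LFG-bench-style run: one high-level prompt into each harness.

All names, paths, and CLI calls below are assumptions for illustration;
adjust them to match your installed CLIs and plugin setup.
"""
import subprocess
from pathlib import Path

# Hypothetical prompt in the spirit of the hardest LFG task.
PROMPT = (
    "/lfg Build a production-ready Next.js e-commerce site: product listing, "
    "product detail pages, cart, and a full checkout flow."
)

# Assumed non-interactive invocations of each harness's CLI.
HARNESSES = {
    "claude-code": ["claude", "-p", PROMPT],
    "codex": ["codex", "exec", PROMPT],
}


def run_benchmark(output_dir: str = "lfg-runs") -> None:
    out = Path(output_dir)
    out.mkdir(exist_ok=True)
    for name, cmd in HARNESSES.items():
        # Capture each harness's transcript so the two builds can be compared side by side.
        result = subprocess.run(cmd, capture_output=True, text=True)
        (out / f"{name}.log").write_text(result.stdout + result.stderr)
        print(f"{name}: exit code {result.returncode}")


if __name__ == "__main__":
    run_benchmark()
```

In the real benchmark, the compound engineering plugin handles planning, coding, and review inside each harness; the point of the sketch is just the workflow's shape: one prompt in, one finished build out.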
The results may tell us as much about task design as model capability. We want to know which model can figure things out on its own. Opus thrives in this environment: hand it a vague goal, and it explores, investigates, and converges. Codex wants direction. When specs are detailed, it executes flawlessly; when they're not, it guesses or stalls.
Want to learn more?
Read our detailed Vibe Checks