Vibe Check: GPT-5.4—OpenAI Is Back

Three months ago, OpenAI was losing the agentic coding race. Claude Code had captured developers' hearts, and Opus 4.5 was shipping at a level other models couldn't touch. Meanwhile, OpenAI's coding agent Codex felt like it was built for an older era of coding with AI. It was precise but rigid, powerful but personality-less, and neither good with tools nor able to run autonomously for long periods.

OpenAI's latest model release, GPT-5.4, along with its other recent releases (GPT-5.3 Codex, GPT-5.3 Codex Spark, and the Codex desktop app), shifts the competitive balance back toward OpenAI on the coding front.

The new model produces plans that are thorough and technically precise, with a user focus and “human” feel that had been missing from OpenAI's previous coding models. In our testing, GPT-5.4 reviews code with more depth than GPT-5.3 Codex, and has a noticeably more conversational voice. With a few tweaks, it became our preferred model to use in our OpenClaws, especially given that it is half the price of Opus 4.6. Even Kieran Klaassen, our die-hard Claude Code devotee, has been reaching for GPT-5.4 daily since we started testing it a week ago.

As ever, there are tradeoffs: GPT-5.4 has a tendency to expand the task well beyond what you asked for and to call tasks done before they're finished. It sometimes completed tasks in obviously wrong ways, then lied about it (more below—it was honestly kind of funny).

The bigger story here is OpenAI's trajectory. From the Codex desktop app to GPT-5.3 Codex and to GPT-5.4, the company is iterating fast, and many members of our team now use its tools and models daily for coding, a significant change from a few months ago.


What OpenAI told us

The OpenAI team highlighted improvements in reasoning, token efficiency (how many tokens it costs to execute a prompt), instruction following, and tool use.

The context window jumps to 1 million tokens—a 2.5-times increase from GPT-5.3 Codex's 400K, and on par with Gemini 3.1 Pro and Opus 4.6. In practical terms, it's roughly the length of seven novels—enough to feed the model an entire codebase in a single conversation.

GPT-5.4 also supports OpenAI's computer use agent (CUA), which lets the model see a screen and interact with it using a virtual mouse and keyboard—navigating apps, clicking buttons, and filling out forms. This is the same technology behind ChatGPT's agent mode.

API pricing is $2.50/$15.00 per million tokens (input/output). That's half the input price and 60 percent of the output price of Opus 4.6 ($5/$25), comparable to Sonnet 4.6 ($3/$15), and slightly above Gemini 3.1 Pro ($2/$12). GPT-5.4 is available via API and in ChatGPT on desktop.
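To make the price gap concrete, here is a minimal sketch that turns the per-million-token prices quoted above into a per-request cost. The workload size (200,000 input tokens and 20,000 output tokens) and the function names are illustrative assumptions, not figures from OpenAI or from our testing, and the calculation ignores any discounts or surcharges a provider may apply.

```python
# Prices in USD per million tokens (input, output), as quoted in this article.
PRICES = {
    "GPT-5.4": (2.50, 15.00),
    "Opus 4.6": (5.00, 25.00),
    "Sonnet 4.6": (3.00, 15.00),
    "Gemini 3.1 Pro": (2.00, 12.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the list-price cost in USD of one request (hypothetical helper)."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


def compare(input_tokens: int, output_tokens: int) -> dict[str, float]:
    """Cost of the same workload on every model in PRICES."""
    return {
        model: round(request_cost(model, input_tokens, output_tokens), 4)
        for model in PRICES
    }


if __name__ == "__main__":
    # Assumed workload: a large codebase prompt with a moderate-length answer.
    for model, cost in compare(200_000, 20_000).items():
        print(f"{model}: ${cost:.2f}")
```

On this assumed workload, GPT-5.4 comes to $0.80 against $1.50 for Opus 4.6, so the real-world saving is a little under half once output tokens are counted. The more output-heavy the task, the closer the ratio moves to 60 percent.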

The Reach Test

“GPT-5.4 in the Codex app is my new daily driver for coding. It has a much more human thinking style than previous models, and seems to have the smarts of 5.3 Codex without the obsession with technical details. I've also been using it as the main model in my Claw, R2-C2, and it's definitely staying as my default. User beware though: I had several instances where this model did a task incorrectly and lied about it. It has a bit more of Opus's shoot-from-the-hip attitude, which has pluses and minuses.”

Dan Shipper, the multi-threaded CEO

“I agree with the sentiment that OpenAI is back. It's not just this model. I think that with both GPT-5.3 Codex Spark and GPT-5.4, they're really going hard and catching up. I wouldn't say GPT-5.4 is the best model out there, but it's a model I use every day and I enjoy working with it.”

Kieran Klaassen, the Rails-pilled master of Claude Code

“I'm reaching for GPT-5.4 more than Codex 5.3—not because it's dramatically more intelligent on raw coding quality, but because it's much better to work with moment to moment. The thinking is readable enough that I can tell when it's drifting and steer it back.”

Naveen Naidu, graduate of IIT Bombay (the MIT of India 💅)

Legend:

Paradigm shift

Psyched about this release

It's okay, but I wouldn't use it every day

Trash release

The headline findings

