Vibe Check: Opus 4.8—Anthropic Should’ve Rounded Up to 5

Opus 4.8 tops both our Senior Engineer benchmark and our writing tests. It’s the most complete model we’ve tested. We just wish it had an app to match.

May 28, 2026

Anthropic is so back.

They’ve had the wind at their backs for the past year, riding the Claude Code wave into the rest of knowledge work. But Opus 4.7 was a hard-to-use, hard-to-love model, and the Codex desktop app is clean, fast, and feels like the future. I switched to Codex full time, and even Every’s most devoted Claude users like Kieran Klaassen and Katie Parrott found themselves reaching for GPT models in a way they hadn’t in at least a year.

But Opus 4.8 is a legitimately great model, jumping to the top of the pack in the rankings (and our hearts). It bests GPT-5.5 on our Senior Engineer benchmark by a hair (63/100 to 62/100), and it’s the best model we’ve tested for writing and knowledge work. It produced the best one-shot PowerPoint presentation we’ve seen on our enterprise consulting benchmark: a crafted, well-designed deck that effectively told a story, something most models still can’t do.

It’s very hard to make a model that is both an incredible software engineer and a near-human writer with depth and emotional intelligence—but that’s what this model feels like to us.

They could have called this Opus 5 and none of us would have blinked.

There are two catches. The first is that output quality is heavily dependent on effort level. Opus 4.8 at extra-high is a competitive senior engineer, while at high it’s an adequate one. Opus 4.8 at high delivers mostly clean, expressive prose, while Opus 4.8 at medium succumbs to AI’s worst writing tendencies.

Second, the model is better than the app around it. Opus 4.8 is strong enough to make us want to move back to Claude. But the Claude app is a mess. It has three different tabs—Chat, Code, Cowork—that bear the scars of the harness’s progression through time and Anthropic’s org chart. That makes the experience feel slow and messy, and it doesn’t allow us to easily get the most out of the model.

We get into all of this below.

What’s new

Anthropic is positioning Opus 4.8 as a stronger model for complex work, especially coding, agentic tasks, and long-running reasoning.

Best on senior-engineer coding, at the right effort

At extra-high, Opus 4.8 scored 63 on our Senior Engineer Benchmark, a hair past GPT-5.5’s 62 and a 30-point jump over Opus 4.7. At high, it falls to 42.

The best writing model we’ve tested

Opus 4.8 at high effort scored 79.6 on our writing benchmark, ahead of Sonnet 4.6 (74.5), GPT-5.5 (73), and Opus 4.7 (63). It also left fewer AI tells than any model apart from Sonnet 4.6.

Strong everyday knowledge work, with one caveat

Faster than 4.7, better at explaining itself than GPT-5.5, and unusually good at adopting your voice from a style guide. But it hangs back and waits for instructions where GPT-5.5 runs ahead.

A 1 million-token context window

Big enough to hold an entire codebase, a book-length manuscript, or weeks of meeting notes in a single session, and it carries context across that span better than 4.7 did.

The model is ahead of the app

Opus 4.8 is good enough to pull us back to Claude, but the Chat/Code/Cowork split keeps Codex as the better daily harness.

The Reach Test

🥇

“This is my favorite frontier model. Its performance is a major improvement over Opus 4.7 across coding, writing, knowledge work, and even psychology and interpersonal advice. It’s hard to improve a model across all those dimensions at once, which is why I think they should’ve rounded it up to Opus 5. Calling it 4.8 undersells the jump. The catch: Codex is still by far a better harness than the Claude Desktop app, and so GPT-5.5 remains my daily driver. But I’m now switching between Codex and Claude all the time.”

Dan Shipper, the multi-threaded CEO

🥇

“Opus 4.8 is my favorite model right now. It feels deep without being overwhelming, and it communicates better than GPT-5.5 or Opus 4.7. Its work is readable and easier to follow across coding and product tasks. It’s slower than GPT-5.5 and sometimes too noisy in comments, but I’ve already moved some autonomous workflows from GPT-5.5 high to Opus 4.8 at extra-high because it performs well and feels less mechanical.”

Kieran Klaassen, father of compound engineering

“I lost a bit of trust in Anthropic after Opus 4.7. But Opus 4.8 is a model I can trust to get the work done, whatever it is. It’s a major quality-of-life update: more intuitive, easier to collaborate with, and better at carrying context and direction across a long session than Opus 4.7. GPT-5.5 is still faster, which makes it my go-to for iterative work, but Opus 4.8 has the brains and the personality to make me want to stick with it for code and copy.”

Katie Parrott, AI-pilled writer by day, vibe coder by night

Legend:
Paradigm shift
Psyched about this release
It’s okay, but I wouldn’t use it every day
Trash release

Senior coder at high, architect at extra-high

Quiet bench, tighter seam, and a much longer runway between the obvious turn and the second pass. The shape holds when the brief is clear, and flattens when the brief is loose.

First station: Strong close on a familiar loop

Phase glass, ribbon ladder, static river around the handoff. Quiet markers hold their lane while the outer pass keeps folding back into the first frame of the task, with minor lift at the edges and a denser closure at the margin.

Velvet checkpoints, longer weather, and a measured hinge across the rollout path. The middle layer keeps echoing earlier notes without dropping the thread or flattening the edge cases through the first half of the bench.

Second station: Restraint as the tell

Lattice note, amber fork, and a small weather system over the review shelf. The structure reads calmer at first glance, then starts to narrow under pressure. Winter syntax, patient seams, and a quiet bend through the finish line.

The close end keeps its shape through the turn while the open end opens up the frame. The two halves trade tension across the pass.

Best overall, with pesky tells

Tight lattice across the prose tracks, a clear line on the cover pass, and a steady seam through the middle of the brief. The graders kept finding the same handful of soft tells across the lower-effort runs.

The runner-up keeps the score honest, and the leaderboard reads narrower than the headline suggests. The shape of the gap matters more than the raw number, especially when the next dial sits one notch over.

Surface tells, soft repetitions, and a familiar pattern across the lower-effort passes. The structure holds at high while the medium loop keeps slipping into the same handful of habits across the sample set.

Fast and versatile, but cautious

Quiet hand on the daily loop, a measured pace through the back half of the brief, and a careful posture on the open-ended pass. The hand-off feels lighter than it reads on paper.

Velvet checkpoints across the open seam, a slow turn through the first frame, and a longer hinge on the agentic side of the bench. The model keeps deferring small decisions back to the operator across the whole run.

The verdict

The shape of the weekend was the same across every chair: open the model, watch it think, hand it the work, and let it return a tighter draft than the last release would have managed on the same loop.

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.
