Vibe Check:
Opus 4.7 Stopped Reading Between the Lines

Anthropic’s latest Opus is more precise, more literal, and the best coding model we’ve tested on well-specified tasks—but it won’t fill in the gaps for you anymore

April 17, 2026

Anthropic’s latest Opus 4.7 model, released yesterday, is a sharper tool than its predecessor—but it also needs a sharper operator. It delivered the best results we’ve seen on our LFG coding benchmark, but it hedges or stalls when you don’t tell it exactly what you want.

Every didn’t get advance access to this release, so we’ve spent the last day testing it on our most important use cases. The variable across our testing was specificity. With a detailed brief, 4.7 cleared our hardest coding benchmark and produced consulting prose that one of our testers called “better than reading my own.” With less direction, it waits for clearer instructions or guesses wrong.

Anthropic researcher Alex Albert, who joined us on our testing livestream, confirmed that 4.6 had been doing a meaningful amount of prompt engineering on the user’s behalf that 4.7 doesn’t, which shifts the burden onto you to specify exactly what you want. The new model waits for explicit permission that its predecessor took for granted. So the prompts you’ve tuned on 4.6 over the last two months are likely to give you disappointing results at first.

Alex walked us through the pattern Anthropic has taken with its models over the past year: Sonnet 3.7 (released in March 2025) was too eager, Opus 4 (May 2025) got dialed back, Opus 4.6 (February 2026) was doing too much, and now Opus 4.7 has been reined in again. That’s four re-tunings in about a year, and Alex told us it’s deliberate—a “perpetual back-and-forth,” as he called it.

We previously postulated that Claude and OpenAI’s coding tool Codex are converging toward a single general-purpose work agent. Opus 4.7 complicates that interpretation by suggesting that Anthropic is willing to zigzag on the way there. Whether you should reach for 4.7 depends on whether this new direction fits your work.

What’s new?

Anthropic is pitching Opus 4.7 as more rigorous, more precise, and better at verifying its own work. Most of the claims track with what we’re seeing.

Self-verification

4.7 reviews its own output against the original request before reporting back. We saw this behavior during testing—the model catches its own logic errors mid-plan without being asked. It’s a prompting pattern power users have been applying manually for a year, now baked into the model.
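For anyone still on an older model, the manual version of that pattern is a second pass that critiques the first. Here is a minimal sketch of the critique prompt; the helper name and the exact wording are ours, not Anthropic’s, and the commented-out client calls are hypothetical stand-ins for whatever API wrapper you use:

```python
def build_verification_prompt(original_request: str, draft_output: str) -> str:
    """Compose a second-pass prompt asking the model to check its own
    draft against the original request before anything is reported back."""
    return (
        "You previously produced the output below in response to a request.\n"
        "Re-read the request, then check the output against it.\n"
        "List any requirements it misses, any logic errors, and anything\n"
        "it added that was not asked for. If the output fully satisfies\n"
        "the request, reply only with: OK\n\n"
        f"--- Original request ---\n{original_request}\n\n"
        f"--- Draft output ---\n{draft_output}\n"
    )

# The loop around it would look roughly like this (hypothetical client):
# critique = ask_model(build_verification_prompt(request, draft))
# if critique.strip() != "OK":
#     draft = ask_model(f"Revise the draft to address:\n{critique}")
```

The point of 4.7’s change is that this second call now happens inside the model, before it reports back.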

Long-horizon coherence

4.7 holds its thread better on multi-hour tasks. Albert had it build an apartment-hunting dashboard for his girlfriend that pulls listings from Craigslist and Zillow on a twice-daily schedule overnight—the kind of “press play, walk away” workflow that 4.6 would start but not sustain.

Benchmark bumps

Anthropic’s blog post about the release highlighted significant gains on several key benchmarks: stronger results on SWE-bench Pro’s hardest tasks, a jump from 58 to 70 percent on CursorBench (Cursor’s internal benchmark built from developer sessions), and three times as many resolved production tasks on Rakuten-SWE-Bench versus 4.6. These improvements track with what Kieran Klaassen saw on our LFG coding benchmark.

Vision

Opus 4.7 processes images at more than three times the resolution of prior Claude models. Albert told us this allows 4.7 to catch tiny details earlier models would miss, including misaligned buttons and off-by-a-few-pixels layouts during front-end iteration.

A new effort level

Anthropic added “extra high” between “high” and “max,” and made it the new default in Claude Code. Use max for benchmark runs and complex architectural work, extra high for asynchronous handoffs like a data analysis you can check back on in a few hours, and high or medium for interactive work where iteration is key. The toggle lives in the bottom right of the desktop app.
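That mapping from task type to effort level can be captured in a small routing helper. This is a sketch only: the task categories and level names come from the guidance above and the Claude Code toggle, and how you actually pass a level to an API or CLI is deliberately left out because that surface may differ:

```python
# Suggested Opus 4.7 effort levels by task type, per the guidance above.
EFFORT_BY_TASK = {
    "benchmark_run": "max",          # benchmark runs
    "architecture": "max",           # complex architectural work
    "async_handoff": "extra high",   # e.g., a data analysis you check on later
    "interactive": "high",           # iterative, back-and-forth work
    "quick_iteration": "medium",     # fast loops where latency matters most
}

def pick_effort(task_type: str) -> str:
    """Return the suggested effort level for a task type, falling back to
    'extra high'—the new Claude Code default—for anything unlisted."""
    return EFFORT_BY_TASK.get(task_type, "extra high")
```

Routing like this keeps the decision out of your head: the expensive levels are reserved for work you walk away from, and interactive sessions stay responsive.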

One for the consultants

Opus 4.7 is significantly stronger at generating PowerPoint presentations. Anthropic baked “substantially better vision” into the model, which lets it check its own work while generating slides, producing more consistent and coherent decks.

The Reach Test

“Opus 4.7 has big model smell—it’s weird. It breaks a lot of existing prompts, and so my initial experience with it hasn’t been obviously mind-blowing. But that also feels like part of what makes it a big model: It’s harder to use at first, and it seems to have hidden corners and powers that aren’t obvious on first contact. Because it’s more literal and seems less emotionally intelligent than previous Claude models, I need to find a new niche for it in my workflows. But I suspect it has powers that we’ll only really understand in the coming weeks and months.”

Dan Shipper, the multi-threaded CEO

“This model feels like a real step up in depth and capabilities in my compound engineering workflow. 4.5 to 4.6 was a smaller step change than 4.6 to 4.7 is. It’s the kind of model you have to go deep on to really discover what it’s capable of—it’s a little more cautious, but when you push it, it goes deeper. Less showy, less wow on day one, but I think this one is going to be a very good daily driver. Elegant and detailed. I’m stoked.”

Kieran Klaassen, the Rails-pilled master of Claude Code

“My job requires explaining complex topics simply, and most models get in my way more than they help. They pad, hedge, or can’t tell when a sentence is carrying weight and when it’s filler. Opus 4.7 is the first model where it was thrilling to read its writing, because it had no fluff. I caught myself reading its draft and thinking, ‘Damn, that’s better than what I had.’ It still doesn’t do a good impression of me given a transcript of how I talk (my hardest benchmark), and it went worryingly off-brand where it thought it could do better. But for the actual writing—the part that’s supposed to make someone nod and pay attention—it’s a genuine step up. It also has the distinct honor of making the best PowerPoint I’ve ever seen in an LLM.”

Mike Taylor, head of tech consulting

“I use AI to do the unglamorous stuff that has to be right to keep Every running, such as finances and team operations. What I’ve relied on from Claude for months is that it notices things I didn’t ask it to notice. Last month, 4.6 caught a data error in our P&L that would have made one of our products look wildly unprofitable. I didn’t ask it to check; it just did. I ran the same analysis on 4.7 this afternoon, and it handed back a clean, correct summary, missing the thing 4.6 would have flagged. The numbers are fine. The instincts aren’t there yet. For now, I’m keeping 4.6 in the driver’s seat.”

Brandon Gell, COO

“Opus 4.7 is a bit too slow and regimented for my liking to be a daily driver for writing—I’m moving to Sonnet there on the assumption that 4.6 isn’t long for this world. But I’ve been impressed by what I’ve seen from 4.7 on non-writing tasks like data analysis and a few automations I’ve been building for the editorial team. I’m still getting used to dialing in the effort levels, and I suspect finding the right one for the right task—plus updating my style guides and tweaking my prompts—will change my impression considerably. Ask me again in a week.”

Katie Parrott, AI-pilled writer by day, vibe coder by night
Legend: Paradigm shift · Psyched about this release · It’s okay, but I wouldn’t use it every day · Trash release

Subscribers only

Get full access to the verdicts, benchmarks, and model comparisons.

Parallel lanes, narrower ceiling

First station: Strong close on a familiar loop

Second station: Restraint as the tell

Third station: Diagnosis without follow-through

Sharper edges, softer voice

Lane A: Clean contour through a short run

Lane B: Numbers-and-story, steady hand

Lane C: Tidy seams, quieter voice

Less unprompted noticing, sharper on instruction

Verification pass: A cleaner but incomplete frame

Synthesis pass: Noticing returns when the task builds

The verdict

Reach for the sharper lane if…

Hold the softer lane if…

Rewrite the rail this weekend

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Discover Every’s upcoming workshops and camps, and access recordings from past events.

For sponsorship opportunities, reach out to [email protected].

