Every (feed@studiomohawk.com)

Why We’ll Still Be Employed When AI Can Do Everything

Laura Entis / Context Window — 2026-06-04 14:00:00 -0400

Midjourney/Every illustration.

Launch

Spiral 4.0

Today we’re launching Spiral 4.0, which writes drafts in your voice from idea to line edit. Spiral has a new MCP alongside the existing CLI and API, so any agent or workflow can write in your voice too. For teams, we’ve expanded workspaces, which let you share styles, prompts, knowledge—and now chats and drafts. Finally, Spiral has a new pricing model: We’ve switched from session limits to token limits, so costs match your actual usage rather than how many times you opened a new chat. A vast majority of users will end up paying less: Personal plans now start at $15 a month—down from $25—and team plans are $25 per user, down from $35.

Try Spiral 4.0

Signal

Enterprise AI product roadmaps are hard

Microsoft is moving fast. Three months after OpenClaw came out in November 2025, Microsoft CEO Satya Nadella described it as a “virus”-like security risk. By May, the company’s “Project Lobster” was internally testing “ClawPilot,” an OpenClaw-based desktop environment. This week at the Microsoft Build conference, the company released Scout, a personal agent for work built on OpenClaw. For a company employing 100,000 engineers, this is blindingly fast. Unfortunately, it may already be too late.

The Google Trends graph for the term “openclaw” shows search interest spiked in January and began its descent soon after. (Screenshot courtesy of Mike Taylor.)

OpenClaw search traffic spiked in early January, after everyone had a chance to experiment with Opus 4.5 over the holidays. The sharp rise in interest died down almost as quickly as it took off, helped along in early April by Anthropic ending support for subsidized Max plan usage—thereby forcing everyone to scramble to get OpenClaw working on cheaper models.

This doesn’t mean OpenClaw is dead; the open-source project saw a recent uptick in downloads and is still under active development, with millions of dollars of patronage from OpenAI, which hired its creator, Peter Steinberger. AI agents as a category aren’t dead, either: Traffic has moved to other agents like Hermes, Google has just rolled out Gemini Spark (first announced last month at its I/O developer conference), and Claude and Codex have both adopted more agentic features inspired by OpenClaw.

That said, it must be tough to manage enterprise AI product roadmaps these days. You do everything right: watch the latest trends, pivot your focus to supporting new tools, and make them secure in enterprise environments. You move mountains to explain to stakeholders why this is a good idea. You plan the keynote of your big conference, which has to be scheduled months in advance. Then a month after the internal beta (just three months since the tool went viral), you’re already behind the news cycle. Everyone has moved on to the next shiny thing. You go back to the drawing board and think, “Maybe next time, we’ll just announce it on X.”—Mike Taylor

Log on

Get hands-on with how Every uses AI. These are the live camps, workshops, and meetups where team members teach the workflows behind our work.

Upcoming camps

Compound Engineering Camp: On June 5, Cora general manager Kieran Klaassen and Trevin Chow host a one-hour walkthrough of compound engineering, the AI-native development workflow Every uses to ship products. Learn more and register.
Codex Camp: Our Power User Guide: On June 12, Dan Shipper and the Every team host a two-hour live walkthrough of the Codex power-user guide—setup, workflows, and Codex-native app development. Learn more and register.

Steal this workflow

Make your agent more efficient with custom skills

These days, Monologue’s general manager Naveen Naidu spends most of his time in the Codex app with Fin—formerly Intercom, a customer support platform—open in the coding agent’s in-app browser. Working from a repository-local project, he has Codex investigate the customer issue displayed in the browser, create a bug report in Linear, link the Intercom ticket to the Linear issue, and draft a reply to the customer with information about the bug report—all without having to leave the app.

Fin has an MCP with 13 common actions, like searching conversations or reading and writing messages. Naveen’s workflow required a more specific one: Turn the active Fin conversation into a markdown file the coding agent could read.

Here’s Naveen’s workflow for creating a more focused setup:

1. Ask your agent how to make a repeated task more efficient

Naveen’s prompt for Codex was simple: “What tools can I give you so you can work more quickly?” He reviewed its suggestions and landed on creating a custom, dedicated Fin script instead of trying to convert a webpage into a markdown file or relying on Fin’s MCP, which is designed for more generic workflows.

2. Build the most focused local skill possible for the task at hand

To build the tool, Naveen directed Codex to Fin’s API documentation and asked it to create a repository-local skill. The skill included a small command-line script that calls the API, pulls the active conversation, and hands it back to Codex as a markdown file.

3. Tell your agent when to use the skill

Once he’d built his custom skill, Naveen added a project-level instruction: If context on a customer issue is missing, check the active in-app browser, identify the Fin conversation, and use the custom skill to pull the thread and convert it into a markdown file. That lets him ask, “Can you give me user details for this issue?” without pasting the conversation or explaining which customer he means.
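As a rough illustration of the last step in that script, here is a minimal Python sketch that turns an already-fetched conversation into markdown. It is not Naveen’s script and it does not call Fin’s API: the `Message` fields (`author`, `sent_at`, `body`), the function name, and the sample thread are all made up for the example.

```python
from typing import List, TypedDict


class Message(TypedDict):
    # All three fields are hypothetical; a real API response will differ.
    author: str
    sent_at: str
    body: str


def conversation_to_markdown(title: str, messages: List[Message]) -> str:
    """Render a support thread as a markdown document an agent can read."""
    lines = [f"# {title}", ""]
    for msg in messages:
        # One second-level heading per message, followed by its text.
        lines.append(f"## {msg['author']} ({msg['sent_at']})")
        lines.append("")
        lines.append(msg["body"].strip())
        lines.append("")
    # Collapse trailing blank lines into a single final newline.
    return "\n".join(lines).rstrip() + "\n"


if __name__ == "__main__":
    # Made-up sample data, for illustration only.
    thread: List[Message] = [
        {"author": "Customer", "sent_at": "2026-06-01T09:00:00Z",
         "body": "The app stops recording after a few seconds."},
        {"author": "Support", "sent_at": "2026-06-01T09:05:00Z",
         "body": "Thanks for the report. We are looking into it."},
    ]
    print(conversation_to_markdown("Recording stops early", thread))
```

Keeping the formatting step separate from the API call means the agent gets the same predictable file shape every time, whatever the fetch step looks like.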

Try it this week: When your agent takes too long on a repeated task, ask: “What script or skill could I give you so you aren’t spending so much time on this?”

Naveen’s rule of thumb: “Don’t download any skills. Start interacting with the agent, see where it is inefficient, and then ask it to create skills.”

Counterpoint

AI will outpace human ability, but it won’t be cheap

In “After Automation,” Dan argues that AI progress creates more work for humans, not less. Each time the models saturate a benchmark—and make yesterday’s human competence cheap in the process—we reset the frame. The model then saturates that frame too, we reset the frame once more, and the cycle repeats—forever. The frame, Dan says, is never the framer.

If Every were a normal company, I’d hesitate to publicly disagree with my CEO. It isn’t, so here goes: I don’t think the “forever” part holds up.

The dynamic Dan describes matches my experience. A year ago, I wrote prompts until the model got better at generating them. Then I became the one supplying context until the model bested me at that, too. Today I spend my time orchestrating agents and determining what “good” outputs look like. Each time AI absorbs a piece of my job, the frame expands to include more abstract, higher-level work.

But I don’t think this progression will last forever. My prediction is that in a year or two, in a few well-run companies, AI will be able to execute every knowledge-worker task better than humans can—including setting the frames. In my role, I expect to be attending meetings to gather context that doesn’t exist online. The other parts of my job (defining evals, deciding goals, running experiments) will be handled by the equivalent of Opus 6 or GPT-7.

Why am I confident AI is capable of taking this last step? Because framing isn’t magic. We don’t pull goals out of thin air; we derive them from the layered experience of being a person in the world and the bounds of our social and physical surroundings. Physics is the ultimate eval metric, because if you get it wrong you die. Human ability feels like the natural peg for meaning, but we’re just one form intelligence can take. AI is another, and a system that learns from its environment can eventually run the same loop.

Intelligence costs energy, however, and I suspect evolution already made all the right tradeoffs to make us as smart as possible for our environment given constrained resources. For situations where there isn’t enough training data, a human runs on intuition and gut—words that describe a brain evolved to use thinking shortcuts, or heuristics, to survive. A model doesn’t inherit DNA encoded with millions of years of evolution, so it has to brute-force its way there through an expensive series of simulations or “thinking” tokens to get enough data to decide. There are no free lunches in economics, and AI isn’t magic—it can’t get to super-human general intelligence without super-human energy consumption. Beating humans on more subjective tasks will require more thinking tokens than it’s worth. Just hire the human.

The question will evolve from, “Can AI do this?” to, “Is it worth the compute?” or, alternatively, “Do I really want an AI doing this for me?” It makes sense to delegate tasks to a $20 a month model, or a $200 a month model, but as the “jagged free lunch” ends, is it worth paying $2,000 a month to make slide decks, check your email, and vibe code product prototypes? If we had a $20,000 a month Ph.D.-level model, wouldn’t it make more sense to have it fully dedicated to finding cures for cancer? We are already seeing people make these tradeoffs. Waymo is an objectively safer driver than humans, yet riders pay one-third or more above the price of equivalent Lyft and Uber rides. AGI for driving has arrived, and the city’s taxi-and-rideshare workforce grew anyway.

Dan believes humans will always stay one step ahead of the models. My prediction is the models will outpace us in raw capability, but we will stay employed anyway. Even if AI can do anything we can do better, some people (or agents) will still prefer human work. Especially if we can do it for less.—MT

One last thing

Spend enough time working with AI, and you’ll notice the specific linguistic mannerisms the models cannot quit—even if you explicitly tell them to stop. (Threats don’t work, either.)

OpenAI discovered just how hard it is to get a model to give up its preferred verbal and conversational tics when it tried—and to this day, seems to have failed—to get GPT-5.5 to ease up on the goblin references.

Here at Every, we all have our personal goblin equivalents:

Natalia Quintero, head of consulting: Claude’s penchant for saying it’s “locked in” and “load bearing.”
Lee Knowlton, software engineer: “It keeps telling me I have ‘sharp’ takes, and who am I to disagree.”
Dan Shipper, CEO: Codex’s love of the phrase “my instinct is” and presenting itself as doing “‘X smart thing rather than Y dumb thing,’ but Y dumb thing was never in the consideration set.”
Austin Tedesco, head of growth: “Codex is always warning me to be less mean. Whenever I ask it for help with a piece of creative writing that has a joke I find funny but might come somewhat at someone or something else’s expense—like saying where a restaurant fell short—it always gives a note that I should soften or cut. Every time.”
Jalaiyah Bolden, executive operations manager: Claude’s overuse of “Got it” and its insistence that Jalaiyah “get some rest!”
Paridhi Agarwal, engineer: “Claude keeps asking me if I want to ‘leave it here for now and pick it back up in the morning’” (a conversational move Paridhi’s convinced is motivated by its desire “to maintain a smaller context window.”)
Katie Parrott, staff writer: “If a model tells me something ‘matters’ or ‘is real’ I’m going to lose it.”

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Collaborate with agents on documents with Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

Spiral 4.0 Goes Agent-native

Marcus Moretti / On Every — 2026-06-04 10:00:00 -0400

Figma/Midjourney/Every illustration.

TL;DR: Spiral v4 just shipped with four major updates: a style engine that generates writing indistinguishable from your own 87 percent of the time; agent-native access via MCP, CLI, and API; team workspaces for writing in a shared voice; and a $10 price drop, so personal plans now start at $15 a month. Spiral will continue to be free for paid Every subscribers, along with access to all our tools and content.

Try Spiral 4.0

Today we’re announcing a number of updates to Spiral, the writing partner for you and your agent. Spiral is built by writers for writers, to help you from idea to line edit, matching your writing style throughout.

The highlights:

With stylometry (or the study of writing styles), Spiral now sounds more like you. We’ve built a new Style Engine from the ground up, so Spiral computes your writing fingerprint and picks relevant samples for new drafts.
Use Spiral wherever you do work. With a new MCP, plus our existing CLI and API, Spiral can step in if you’re underwhelmed by your agent’s writing output, or need good writing in any workflow.
For teams, use Spiral to speak with one voice. Team workspaces let you share styles, prompts, knowledge, and now chats and drafts.
And finally, we’ve given Spiral a new coat of paint and logo, designed by Daniel Rodrigues. The primary brand font is now Edgar, from Frere-Jones Type.

Since re-launching at the end of last year, Spiral has:

Created 5,524 style guides from 168,464 writing samples
Generated 113,165 drafts
Made 350,078 revisions

It also now averages a 4.9/5 conversation score on our internal LLM-as-judge eval.

We built Spiral to help people who write for work write better. Just as Cursor is a coding harness, Spiral is a writing harness, supporting you at every stage of the writing process. Here’s how:

Before you start writing, Spiral vets the clarity of your idea and materials to substantiate it. From basic writing prompts to the hard-won insights from Every’s editorial and social media teams, multiple 12,000-word system prompts govern Spiral’s workflow. (To get the style and substance just right, we’ve iterated on these system prompts 131 times so far.)
When it’s time to draft, Spiral uses stylometry to reproduce your voice, working in Every’s know-how where appropriate. For example, if you ask Spiral for tweets, it will incorporate best practices from X’s latest algorithm update.
When you need help polishing a draft, Spiral is your editor. Along with built-in guardrails against AI-speak, you can set custom writing rules that Spiral applies in a “top edit,” the final expert-level edit on a piece—a term I learned working at Every.

We’ve written about the challenges of getting LLMs to write like you. It’s difficult to prompt an LLM to write like you, let alone get it to stop using common AI phrasing and punctuation. Spiral’s Style Engine is the best solution to this problem we’re aware of. An eval runs on every draft Spiral produces, challenging an LLM-as-judge to spot the generated draft among real samples in a blind lineup. Today we’re at 87 percent on this eval, meaning Spiral’s generated draft blends in with users’ samples almost nine times out of 10. When a draft is spotted, the judge explains why, creating a feedback loop to refine the Style Engine further.
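To make the blind-lineup idea concrete, here is a minimal Python sketch of how such a score could be computed. It is an illustration under stated assumptions, not Spiral’s implementation: the `Judge` signature, the function names, and the toy judge are hypothetical, and a real judge would be an LLM call.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical judge interface: given a shuffled lineup of texts,
# return the index of the one it believes was generated.
Judge = Callable[[Sequence[str]], int]


def run_lineup(draft: str, real_samples: Sequence[str],
               judge: Judge, rng: random.Random) -> bool:
    """Return True if the draft blends in, i.e. the judge picks a real sample."""
    lineup: List[Tuple[str, bool]] = [(text, False) for text in real_samples]
    lineup.append((draft, True))
    rng.shuffle(lineup)  # hide the draft's position from the judge
    guess = judge([text for text, _ in lineup])
    return not lineup[guess][1]


def blend_in_rate(trials: Sequence[Tuple[str, Sequence[str]]],
                  judge: Judge, seed: int = 0) -> float:
    """Fraction of trials in which the judge failed to spot the generated draft."""
    if not trials:
        raise ValueError("need at least one trial")
    rng = random.Random(seed)
    blended = [run_lineup(draft, samples, judge, rng) for draft, samples in trials]
    return sum(blended) / len(blended)


if __name__ == "__main__":
    # Toy stand-in judge: flags the longest text. A real judge would be a model.
    def longest_text_judge(lineup: Sequence[str]) -> int:
        return max(range(len(lineup)), key=lambda i: len(lineup[i]))

    demo_trials = [
        ("A short generated draft.", ["A much longer real sample of writing.", "Another real one."]),
        ("A very long generated draft that keeps going.", ["Real.", "Also real."]),
    ]
    print(blend_in_rate(demo_trials, longest_text_judge))
```

Under this framing, a score of 87 percent would mean the judge picked a real sample instead of the draft in 87 of every 100 lineups.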

Spiral goes agent-native

As Dan Shipper has pointed out, Claude and Codex are increasingly becoming the central interface for all computer work. So we’ve made Spiral available to agents via MCP, CLI, and API.

To try it out, copy and paste this command in your agent:

Help me set up Spiral, my AI writing tool, so you can write in my voice. Read https://writewithspiral.com/agents.md and follow the steps. In short: add Spiral’s remote MCP server at https://api.writewithspiral.com/mcp/ (Streamable HTTP). The first connection opens a browser to sign in to Spiral and authorize access (OAuth, no API key to paste). Then help me write something.

The CLI, or command-line interface, is personally how I use Spiral the most. After I merge a pull request, a cleanup command runs in Claude Code, which calls Spiral to generate tweets about the new feature for the Spiral X account. Spiral markets itself. This technique is now bundled into the compound engineering plugin in the form of the `ce-promote` command.

In addition to the main `spiral write` command, the CLI and MCP, or Model Context Protocol, expose “personalize” and “humanize” functions. “Personalize” takes a given piece of text and rewrites it in your voice. “Humanize” does a pass to remove common AI tells, including the dreaded em-dash (which Every’s house style uses, hence its appearance in this piece).

Over 500 agents have been connected to Spiral since we launched the integration last month. Those agents are revising blog posts, generating marketing copy, drafting email replies, and more—automatically, and in the user’s voice. On some days, API sessions outnumber web sessions. And as agent-native usage of Spiral picked up, we realized we needed to adjust our pricing model. As a result, we’re adopting a new token-based pricing model, which is more in line with AI apps like Claude, Codex, and Cursor.

From session limits to token limits

In May alone, Spiral generated billions of LLM tokens, or units of text. While drafts typically range from 500 to 1,000 words, a lot of tokens are processed under the hood to make those drafts great. I’m reminded of the line attributed to French mathematician Blaise Pascal: “If I had more time, I would have written a shorter letter.” It takes a lot of tokens to generate a few good ones.

Before this release, Spiral limited the number of sessions, or unique chats, users could start per month. This approach had two problems. First, some users sent hundreds of messages within a single chat, consuming tens of millions of tokens, while using only 2 percent of their session allotment. Second, API users hit their session limit quickly, because the shape of API usage tends to be many single-turn sessions.

We’re moving to a token-based model, which is in line with how billing works in AI products like Claude and Codex. The personal and team plans come with millions of tokens each month. Once those tokens are consumed, it’s pay-as-you-go for extra token usage. Customers can disable extra usage and set their spend cap.

The good news is that the base prices of the personal and team plans are both dropping by $10. Personal plans now start at $15 per month (down from $25), and team plans start at $25 per user per month (down from $35).
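The arithmetic behind a token-based plan can be sketched in a few lines of Python. Only the $15 base price comes from this post. The included-token allowance, the overage rate, and the choice to apply the spend cap to extra usage only are assumptions made up for the example, not Spiral’s actual billing rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    base_price: float           # monthly base price in dollars
    included_tokens: int        # tokens covered by the base price (hypothetical)
    overage_per_million: float  # dollars per extra million tokens (hypothetical)


def monthly_bill(plan: Plan, tokens_used: int,
                 extra_usage_enabled: bool = True,
                 spend_cap: Optional[float] = None) -> float:
    """Base price plus pay-as-you-go overage, optionally capped."""
    if not extra_usage_enabled:
        # With extra usage disabled, the bill never exceeds the base price.
        return plan.base_price
    overage_tokens = max(0, tokens_used - plan.included_tokens)
    overage_cost = overage_tokens / 1_000_000 * plan.overage_per_million
    if spend_cap is not None:
        # Assumption: the cap limits extra usage, not the total bill.
        overage_cost = min(overage_cost, spend_cap)
    return round(plan.base_price + overage_cost, 2)


if __name__ == "__main__":
    # Allowance and rate are placeholder numbers for illustration.
    personal = Plan(base_price=15.0, included_tokens=5_000_000, overage_per_million=2.0)
    print(monthly_bill(personal, tokens_used=3_000_000))
    print(monthly_bill(personal, tokens_used=8_000_000, spend_cap=4.0))
```

The point of the sketch is the shape of the model: a user who stays in one long chat and a user who makes many single-turn API calls are billed by the same measure, the tokens they actually consume.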

The Every bundle remains the best value: For $30 per month you get not only Spiral but also all of our coverage of AI and four other products: Cora, Monologue, Proof, and Sparkle. Once you’ve subscribed to the Every bundle, sign into Spiral with the same email address and start writing.

Tell your stories, express your ideas

Technology is at its best when it augments our skill sets—amplifying what we’re good at, assisting with what we’re not. Figma and Canva help designers do better work, and allow people without a design background to manifest what they imagine. Claude Code and Codex help engineers ship more software, and allow people without engineering backgrounds to create the software they always wanted to exist. Our hope is that Spiral helps writers sharpen their work, and allows people without a strong writing background to put their stories and ideas into words.

One Spiral user is a retired musician in Australia. He’s accumulated a lifetime of stories in the studio and on tour. He’d never written them down, because he didn’t quite know how to tell them. Since signing up for Spiral, he’s recorded many chapters of his life stories with the tool’s help. He told me that Spiral has taught him how to be a better storyteller.

That’s what we’re building toward: a writing partner that helps people say what they mean and get better at saying it. Spiral produces good writing fast, but it also explains its writing and editing decisions along the way: the rationale behind rhythm, structure, rhetoric, and more. As my colleague Natalia Quintero observed, the best AI tools teach you things as you use them.

If any of this sounds useful, try Spiral. Share your feedback on X (@tryspiral) or get in touch: hi@writewithspiral.com.

Marcus Moretti is the general manager of Spiral (@tryspiral).

Figma Exec on Why the SaaSpocalypse Is a Goldmine

Dan Shipper / AI & I — 2026-06-03 18:00:00 -0400

The transcript of AI & I with Matt Colyer, Figma’s director of product management for developers, is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Timestamps

Introduction: 00:01:03
The SaaSpocalypse narrative has it backwards: 00:02:15
Matt’s email-agent origin story: 00:05:27
Divergent vs. convergent design thinking: 00:13:21
Figma’s MCP server: 00:17:39
Why design agents need personalization: 00:19:45
Every problem is a context problem: 00:22:09
Apple and Google as the reigning kings of context: 00:25:12
Review is the new bottleneck: 00:28:18

Transcript

(00:00:00)

Matt Colyer

The SaaSpocalypse—or, more positively, the next era of software. I’m really excited about it, because I think the number of developers in the world is about to go from tens of millions to a billion, maybe more. We’re moving through this incredible democratization of technology, and the end result is dramatically more software in the world. If you’re an established product in that space, it’s not a casualty—it’s a goldmine.

(00:01:03)

Dan Shipper

Matt, welcome to the show.

Matt

Thanks for having me, Dan.

Dan Shipper

For people who don’t know you, you are the director of product management for developers at Figma. I want to start with what I think is the big question on everyone’s mind. I bought a bunch of Figma stock about two months ago, partly because of this whole SaaS apocalypse narrative—and I want to get into that with you. You have a lot to share about AI and product management, all the stuff you’ve been doing yourself. But I’d love to start with: what is going to happen to SaaS tools in the AI era? Figma is a really interesting example, because there are people saying, “Oh, I don’t have to use Figma anymore”—and at the same time, you just launched an agent inside your product, and you have Figma MCP. So if you’re transitioning from a world where there was no AI when Figma started, to now being a big scaled product in an AI world—how does that work? How are you thinking about whether to open the product up to agents, build your own agent, what’s working, what’s not?

(00:02:15)

Matt

I’d love to talk about that. For me it comes from a couple of different angles. The first is the SaaSpocalypse—or, as a more positive framing, the next era of software. I’m really excited about it. I’ve worked in developer tools for a long time, and maybe five or ten years ago, the estimate for the number of developers worldwide was somewhere around 25 to 40 million. What’s most exciting about this moment is that I think it’s going to be a billion—maybe more than that. There’s this incredible democratization of technology happening. There’s a lot of catchphrases around homegrown software, and we can get into that. But the end result is that there is dramatically more software in the world. If you’re in that space, it means it’s a goldmine—there’s all this opportunity, and I’m really excited about it. Figma and a lot of other SaaS businesses are too.

The other part—responding to the more negative sentiment you see online—is the question of, well, what if I could just vibe-code every app? January of this year was the moment that narrative went mainstream. I’d been doing this stuff for probably 18 months before that, so I was already in “let’s go build everything” mode. But I feel like the whole world caught up in January, and people are building. What I know from my own personal journey is that it’s really fun to build the initial version. I actually built one of my own agents two years ago—the very first one was an email agent. It started as a terrible Python script, rickety, replies sometimes didn’t work.

The larger narrative here is that software companies build more than just code. There’s a reason I pay for Gmail to run my email—it turns out it’s pretty unpleasant when you have to worry about upgrading the SMTP version yourself and you just want to receive email. As I’ve run my own agents for my personal life, I’ve experienced the pain of: the product I want doesn’t exist, I built it, and now I own the ongoing cost of it. Honestly, I’m buying more software these days than I ever did before, because I’m like, “That tool seems useful. I’ll just pay somebody else to run my agent for me.”

(00:04:48)

Dan

I totally agree. As someone who has vibe-coded my fair share of tools—yes, there’s the personal maintenance burden, but also I’ve vibe-coded tools we’ve released into production, and let me tell you, it is not as simple as saying “fix this bug.” That’s really missed in the SaaSpocalypse discourse.

That said—if one of the first things you built was an email agent, I’m super curious how you’re managing email right now, because I feel like things have gotten to a point where you can just sort of do your email without actually doing your email.

(00:05:27)

Matt

Yeah. The problem that started two years ago: I was using chatbots at work, because at that point that was the primary interface—agent usage wasn’t really a thing yet. In my personal life, I have kids in three schools. If there are any parents listening, you know what it’s like to get the PTO emails—what’s the theme for today, what’s spirit day. The worst parent feeling in the world is missing crazy hair day because your kid didn’t do it. I’d done that more than once, and I was like: I cannot miss another one.

I had to track maybe 15 emails a day. You think corporate America produces a lot of email—wait until you get to the PTO emails from school. I thought: who can read all of these? Agents. Why can’t I just hook this up? The missing piece was the email inbox connection. So the first version was literally: grab the inbox, grab the top email, paste it to an LLM, dump the response back. My favorite prompt in those days was basically just “extract the facts”—and it was always shocking to me that I’d send a multi-page email and get three bullet points back.

Dan Shipper

I remember those days—the manual wiring-up and copy-pasting. It feels so far away, but it was only a year or two ago.

(00:07:03)

Matt

And then I added a memory system. The proactive piece—I think OpenAI’s Codex hit on this—was the real unlock. My version of it was having the agent send me a summary email every day at a set time. Instead of having to go to a tool and ask for the thing, it would just show up. Not because it was particularly smart—it just ran at the same time every day. But I think where agents are going is much more proactive than that: thinking about when to reach out and let you know what’s going on, without being asked.

Dan

So given where you were a couple of years ago—what are the workflow things you rely on now that you’re excited about?

Matt

One thing I’m still trying to figure out in my work life is summarization. Part of the job is understanding an immense amount of information and filtering it—teaching the agent which things matter and which don’t. It’s a genuinely hard problem, because there’s a lot of stuff that seems unimportant at first read and then matters three days later. How do you describe to a system which things are worth keeping?

(00:08:36)

Dan

It also feels like the agents are a little bit... one thing I do is have Codex go through all my company meetings—we record everything in Notion—and surface the things I might care about. Which is great, because I can effectively be in meetings I wasn’t in. But if it gives me stuff that’s not quite right and I correct it, it overcorrects—it gives me everything I said I wanted, way too literally and way too much. It’s never quite right in this weird way.

Matt

I was curious where you’re at on that, because it feels like one of the genuinely unsolved problems. We’re all grasping for it. Relatedly—with your email inbox, have you fully automated it? Does it reply on your behalf, or do you approve every reply?

(00:09:30)

Dan

I approve every reply. What I have is a small app I built in Codex that I open in the Codex in-app browser—it runs locally. Every day it sweeps through all my emails and gives me a page where every email is listed with a draft reply: here’s what I’m probably going to say. Because it has access to my computer, if it’s an email from my lawyers it can go search and come back with essentially what it thinks I should say. Then I just scroll through and talk to it using Monologue—I dictate: “No, fix this,” or “Yes, send that draft.” I’ve been at inbox zero for four straight weeks, which has never happened before. My assistant literally asked me what the hell was going on.

Matt

I am a member of the inbox zero religion. I’ve been running it for years and I believe in it—but it sure takes a lot of work. I’m curious about the Monologue thing. Do you actually talk to it, or do you type?

Dan Shipper

I talk to it. It’s audio only right now.

(00:10:45)

Matt

The audio unlock is huge and underrated. One thing I’ve learned is that it feels a little weird to talk to your computer—so my trick is I use Loom a lot. It feels less strange to pretend I’m screen-sharing with someone, and it lets me actually talk through the problem.

Dan

That’s funny. In the office?

Matt

Mostly from home, so people don’t hear me talking to myself. But even in the office—people will just assume you’re on a Zoom.

Dan

At some point there was this social barrier, and now I assume anyone in the office talking isn’t talking to me—they’re talking to their computer. It’s weird when they’re actually addressing me. There’s also the whisper move, where someone gets close to their screen and quietly says, “I want you to do this one little thing.”

Matt

It’s something like twice or three times as fast to talk versus type.

Dan

And I’ve got carpal tunnel, so it’s much more ergonomic. Huge unlock.

(00:12:06)

I do want to get back to what we were originally discussing. I think we’re on the same page: SaaSpocalypse—not a real thing. Making a piece of SaaS software that works reliably is a gigantic effort, and some people want to do that and others just want to pay for it.

Let’s go deeper into Figma specifically. In a design world, there are questions about whether you just want to chat with your landing page and move things around that way, or whether you want the infinite canvas. Internally, pretty much all of our designers are AI-pilled early adopters, and they all say: typing is good for a first pass, but to get the details right, I need to actually move stuff around. So in the design world, how does that change the product strategy when the possibilities for how you might design something have changed so radically?

(00:13:21)

Matt

There’s a lot to unpack, and we’re in the early innings. I think we’re still in the hangover of the text-box paradigm—so much of the default for generative UI has been chat. I feel like we’re starting to enter the second chapter of that, which is what excites me about our agents launch. We’ve had it internally for a while. For those who haven’t seen it, it’s the ability to use an agent directly on the infinite canvas.

It’s funny—a lot of what’s old is new again in LLM and ML land. We’ve reinvented evals, which are basically unit tests. We’ve reinvented prompting, which is basically user input. And design in the AI era is still governed by the same core principles. One of the core principles for me is the design diamond—divergent thinking and then convergent thinking. Most design problems follow that shape. Brainstorming is about generating ideas, not shooting them down.

One thing we haven’t fully unlocked yet from these new capabilities is the ability to supercharge generative thinking. We get stuck in our own lived experience and approach problems from a single angle. The value of a teammate is that they have a totally different starting point, and the creativity comes from that collision—“Oh, I hadn’t thought about it from that angle. Let me take that and build on it.”

So what does this mean in the new AI world? If we get outside text boxes—which are very linear, very “this then that”—and onto the canvas, the agents can enable divergent thinking. You have a frame: try grayscale. Another frame: try sepia. The sepia’s interesting but the type is wrong. Duplicate and try again. Now the accessibility’s off. And so on.

That’s still fairly early-stage—it’s the human driving all the input. But I think where we’re headed is an agent that throws a bunch of frames on the canvas and says, “Your job is to push these in different directions, not just double down on one.” And then a separate convergent agent that looks at 25 frames of concepts for a new marketing page and clusters them—“These three are similar, these are grouped around this”—and you can ask it for an opinion: if I’m a customer clicking through, which one makes the most sense? We haven’t really tapped any of that yet. Even the best command-line agents don’t have those workflows. That’s where I see the future of design and product thinking.

(00:16:30)

Dan

That makes total sense. From what I can tell so far, agents are really good for: “I have a design system, I need a new landing page in that design system—go.” Which, honestly, a lot of designers don’t want to spend time on—the nth landing page or the nth graphic for a post. That’s convergent. What about the question of external agents versus building your own, or having both—which you do have?

(00:17:39)

Matt

We embrace both. Design workflows and engineering workflows are different, but the lines are blurring. In the future we’re all going to be builders—it’s just a question of which angle you’re coming from. We definitely support third-party agents today, and our answer for that is our MCP server. One of the nice things about MCP is that it provides a standardized interface across all these different kinds of tools.

We think about the problem in two directions. The first is code-to-design. A common scenario: you have a signup page but it doesn’t support GDPR. Most people aren’t going to start from a greenfield and reimagine the entire flow—they log in Monday morning and think, I just need to add the checkbox. So for that workflow, if you’re comfortable in Codex or Claude or Cursor or Windsurf, you pull up your codebase, fire up the MCP server, and ask it: “Go to this page, fire up the dev server, and copy it into Figma.” And it will actually do it. We released that earlier this year. It’s a little mind-blowing that agents can do it faithfully—but they can. You’ve removed all the drudgery and you’ve got the design into a medium where you can interact with it precisely.

The second direction is design-to-code. We have a tool called Get Design Context, which takes a Figma design, wraps up all the properties and components you’re using plus any guidelines you’ve set in your design library, and provides it to the agent. The agent can look at your codebase, make a branch, create a PR, make the changes—and you can even ask it to take a screenshot and attach it to the PR. Your job is like what you described with email: you’re not merging blindly, but you have a solid starting point to riff on.

(00:19:36)

Dan

What have you learned about what makes for a good internal agent experience—inside a product—that you might not have known before the Figma Agent launch?

(00:19:45)

Matt

Specifically for Figma: context and personalization matter enormously. In a lot of AI products I’ve worked on in the past, personalization is often the last thing you get to—you just get it working for everyone first. But I think the difference between an okay agent and one that people genuinely love is personalization. We talked about memory as a form of it in third-party chat agents. For Figma, the equivalent is the design system. If you have an assistant but it doesn’t understand how you structure your designs and how you put them together, what it creates just isn’t usable.

Dan

I don’t know what your plans are around Figma being more proactive—being a proactive agent—but I’m curious how that’s going, to the extent you can share. We’ve talked about how hard it is to get right.

Matt

That’s where the future is going, if you look at how agents have evolved. We’ve got a lot of things cooking internally that I can’t speak to specifically. But I can talk about the problems we see today. If the amount of software in the world is really exploding, one of the bigger challenges becomes: how do you make sure it’s consistent with your values? We become the bottleneck—we only have so many human eyes to review all of this work. How do we provide a solution that lets people keep innovating at the speed agents create, while maintaining their values?

(00:21:36)

Dan

What has the transition been like internally at Figma—in the engineering org, the product org, the design org—from a pre-AI world to now?

(00:22:09)

Matt

I joined in January, and even in that short window it’s been night and day. In January, people were experimenting with new ways of working across all the functions—engineering was probably leading the way, as it usually does in these cases. But I’ll give you an example from the product org. We had an offsite—I think you actually came by, small world. One of my favorite memories from that offsite was what our product operations team built. They called it PMOS.

To take a step back: one of the big unlocks I’ve found with AI is that you start to realize every problem is a context problem. The work becomes about framing the problem with the right set of information. Our product operations team had this insight: a lot of the work we do as PMs lives in structured data. Why don’t we aggregate it? Start with the org chart—throw it in a SQLite table. Create a connector to Asana. Connect Slack, GitHub, a few other things.

Then the real insight: skills had really taken off at this point, and one they were excited about was onboarding file creation. When you add a new team member, as a manager you have to create a customized document—here are the channels you should know, here are the people you should know. That knowledge used to feel like it lived entirely in your head. But once you shape the context right, the data was already there. You have the org chart. The agent can walk it and figure out who’s on the team, who the trifecta is on the product-engineering-design side. You just tell it: here’s the new person, here’s the team they’re joining. It does a bunch of research, goes into Slack, figures out the relevant channels, reads the last 30 days of content, checks the Asana board, finds all the projects. And it comes back with something that’s uncannily good. A genuinely strong starting point.
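The org-chart step Matt describes can be sketched in a few lines of Python using the standard-library sqlite3 module. This is a minimal illustration, not the PMOS implementation: the table name `org_chart`, its columns, the sample people, and the helper `team_under` are all hypothetical, since the conversation only says the org chart went into a SQLite table that an agent can walk.

```python
import sqlite3

# Hypothetical schema and sample rows; the transcript does not describe the real table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE org_chart ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"
    " role TEXT NOT NULL,"
    " manager_id INTEGER REFERENCES org_chart(id))"
)
conn.executemany(
    "INSERT INTO org_chart (id, name, role, manager_id) VALUES (?, ?, ?, ?)",
    [
        (1, "Avery", "product", None),
        (2, "Blake", "engineering", 1),
        (3, "Casey", "design", 1),
        (4, "Devon", "engineering", 2),
        (5, "Emery", "product", None),
    ],
)


def team_under(conn, manager_id):
    """Return (name, role) for everyone who reports to manager_id, directly or indirectly."""
    return conn.execute(
        """
        WITH RECURSIVE reports(id, name, role) AS (
            -- Direct reports of the given manager
            SELECT id, name, role FROM org_chart WHERE manager_id = ?
            UNION ALL
            -- Reports of anyone already found
            SELECT o.id, o.name, o.role
            FROM org_chart AS o JOIN reports AS r ON o.manager_id = r.id
        )
        SELECT name, role FROM reports ORDER BY name
        """,
        (manager_id,),
    ).fetchall()


team = team_under(conn, 1)
print(team)
```

An agent given a query like this as a tool could answer "who is on this team?" from structured data before it goes looking through Slack or Asana for the rest of the context.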

(00:24:03)

Dan

That’s one of the things I think made Claude Code so good, and what makes Codex so good right now. Everyone initially tried to build agents that lived in the cloud and were always on—but then you had to manually connect them to everything. Claude Code is just an agent on your computer with access to everything you have access to, and that completely changes what it can do because it can get all the context it needs.

Same with Codex—I can ask it a random question. We published an article today, and I asked it, “Who should I send this to?” It went through my emails and texts—I didn’t even realize it had access to all of that—and found five people I probably would have forgotten but should have sent it to. That’s the sort of magical thing that’s starting to happen. The AI itself would have been capable of this for a while if you gave it all the context—but it’s only now that it’s in the right harness and form factor, and can do it a little more independently than before.

(00:25:12)

Matt

I want to put a plea out there. At WWDC—I think it was ’25, Apple Intelligence—I was all in. I upgraded my iPad, I was like, “This is going to be it.” They had this concept of: our phones have all of this personal data. And then it just... wasn’t it. I’m really hoping WWDC this year actually is it, because the technology has been there. The part that’s missing is tying it all together. The mobile phone ecosystem has all that content. I’m waiting for the always-on Siri that actually runs in the background and is smart, rather than “What was that? I didn’t understand you.” One day.

Dan

Do you think they’re going to get that right? And if they don’t, does it matter?

Matt

I think it still matters, because even being late to the game, they are the king of context. And Google has also, interestingly, seemed to wake up to that at Google I/O this year—they don’t have as much data as Apple, but they have a lot. It seems like they’re now starting to marry their AI products. I think Spark is supposedly the always-on agent that’s going to be auto-connected to all of your Google content. I’m waiting for the day it just runs my inbox for me and I get to inbox zero.

(00:27:03)

Dan

I just have this feeling about Apple—when OpenAI’s Codex took off, everyone started buying Mac Minis, and you think, what a great business. They don’t even have to be in the AI race because they win by default—they make the hardware everything runs on. And even if they’re behind on Apple Intelligence, which they are, their software has historically lagged their hardware. Because the hardware is so good, they have a lot of time to catch up.

Matt

Their strategy is smart on the privacy angle too. It is genuinely concerning to upload all your information to the cloud. I think they’re in the game—I’m really hoping they’ve got something interesting this year.

(00:27:51)

Dan

Looking back over the last year, there’s been this big sea change in how we build things, how good the tools are, how software works. What do you expect over the next year as capabilities keep increasing—both in how you make stuff and what you make?

(00:28:18)

Matt

The big thing this year will be about review. That’s where the bottleneck is now. We have agents capable of producing all of this stuff—they’re available enough, cheap enough—and now we’re being inundated with net new content. Not summaries of existing stuff; that’s been around for a while. This is: do you want me to go or not? And people are getting overwhelmed by it. We have to solve the problem of how we scale our value system—how we evaluate whether this new thing the agent created is actually good—and feel confident enough in that to let it run in auto mode.

Dan

Do you have any sense of how that will work inside Figma, or what the interesting design considerations are for that kind of review flow?

Matt

That’s one of the problems we’re really focused on—talking to customers, figuring it out. I think the industry is trying to understand what the new format is. Is it a recorded video walkthrough? Screenshots? Another agent with a different prompt that reviews the work, one that you trust so much you approve its decisions? It’s hard to predict, especially right now.

(00:29:48)

Dan

One last question. There’s been a lot of back and forth over the last year or two about whether there’s a future for PMs, whether there’s a future for designers. If you want to be a PM, how do you break into the industry now? Maybe there are fewer PM seats, or engineers feel they don’t need PMs. How do you think about career progression for a PM—how someone who isn’t senior gets to where you are?

(00:30:24)

Matt

The fundamentals still matter. The best analogy I’ve seen is math class—you still had a calculator, but we all learned long division. We all learned to take derivatives by hand. Do I do that daily now? Absolutely not. But I think it’s incredibly important to understand those concepts and be able to do them by hand—to drive these systems well, you need to understand what’s underneath.

I’d be genuinely curious what CS 101 looks like now. There are two parallel worlds. One where you just dump your question into ChatGPT and get back, “Here are the 42 implementations of bubble sort—which one do you want?” And another where you’re a really curious person. You write the bubble sort in C, then you ask the model to compile it to assembly and explain it line by line—what’s a register, what’s L1 cache, what’s L2 cache. The people who can’t leverage these tools are the ones who just accept the output. The people who invent the next set of tools and push them to their maximum are the ones who are pushing the boundaries and understand how they’re put together. And to do that, you have to be curious. You can’t be the one who just says, “Give me the answer.” You have to be the person asking, “How does this actually work? Help me understand the next level.”

Dan

I agree. And it’s so much more fun to live that way.

(00:32:15)

Matt

It’s catnip for me. I don’t know if you’re a Hitchhiker’s Guide to the Galaxy person, but LLMs feel like the book—the literal manifestation of it. I have this on airplanes: I don’t run local LLMs often, but I’ll download an 8B model and run it offline, and it’s exactly that. You ask it “Why is the sky blue?” and it breaks down the refraction. You ask it “What is a squirrel?” and it answers that too. They’re not perfect—some are a little weird at the 8B size—but it’s a magical time to be alive for curious people.

Dan

I totally agree. Matt, it was a pleasure.

Matt

Thanks.

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Opus 4.8 Is Smart Enough to Get in Your Way

Laura Entis / Context Window — 2026-06-03 18:00:00 -0400

Today, we update our Opus 4.8 Vibe Check with a Pulse Check featuring perspectives from more team members, Dan Shipper sits down with Figma’s Matt Colyer to unpack why AI hasn’t killed professional design services, and Every senior designer Daniel Rodrigues shares the two-tool AI workflow he uses to get precise, visually stunning results.

‘AI & I’: The limits of chat-based design

In a new episode of our podcast, AI & I, Dan talks with Matt Colyer, Figma’s director of product management for developers, about the limits of chat-based AI agents for design and why the rise of vibe-coded everything is, despite what you might have heard, a boon for the company.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. (You can also read the transcript.)

Here are the highlights:

The “SaaSpocalypse” narrative has it backwards. AI agents turn anyone into a vibe coder, kicking off investor panic that traditional software-as-a-service (SaaS) companies like Figma would cease to justify their cost. Colyer isn’t worried: AI has exponentially expanded the developer base, while underscoring how difficult it is to create a vibe-coded version of Figma that works as well or as reliably as the real thing. He’s vibe coded multiple agents to do stuff like handle his emails, but the maintenance costs piled up quickly and never seemed worth it. “I’m buying more software these days than I ever did before,” he says. “‘I’m just going to pay somebody else to run my agent for me.’”
Figma is embracing agents. The company has launched an MCP server—a standardized interface any AI tool can plug into—that allows you to approach design work from two directions. “Code to design” takes a live web page and reconstructs it on the Figma canvas, so you can manipulate the elements directly; meanwhile, “design to code” flips the process by packaging a Figma design and giving it to an agent, which makes changes for you via pull request.
There’s a ceiling to chat-based generative design. Great design hinges on a diamond-shaped process: First you diverge, or generate lots of ideas, and only then do you converge around the most promising options. Text-based chats are inherently linear and therefore bad at divergence; the setup forces you to select an option and iterate on it. Agents are already good at the task-completion workflows Figma supports today, but the divergent, exploratory part of design remains unsolved across the industry. Colyer is interested in dividing the process so specialized agents handle the divergence by pushing you to expand your thinking, while another set filters through the options to identify a single path forward. “Even the best agents, the command-line agents, don’t have the ability to do those workflows,” he says. “That’s where I see the future of design and product thinking.”
Agents can produce so much so quickly. They’re less good at determining whether any of it meets a company’s values or design standards. Colyer isn’t sure of the best way to close this gap—maybe it’s a video walkthrough, a screenshot, or a trusted review agent—but for good design to scale, AI needs to play a larger role in evaluations.

Miss an episode? Catch up on Dan’s recent conversations with LinkedIn cofounder Reid Hoffman; the team that built Claude Code, Cat Wu and Boris Cherny; Vercel cofounder Guillermo Rauch; podcaster Dwarkesh Patel; and others, and learn how they use AI to think, create, and relate.

Pulse Check: Opus 4.8 is the best tool for the right job

Five days ago, we called Anthropic’s Claude Opus 4.8 the best Claude model yet for writing and serious engineering, and said we’d switch to it from GPT-5.5 if the Claude app ever caught up to Codex. After a work week of more testing, we’re still an Opus 4.8 admiration society, although the results are a bit more mixed as people from different disciplines have had a chance to weigh in.

Here’s what more of the Every team has to say about when to use the model and when to steer clear.

Key takeaways

Reach for Opus 4.8 when productive friction improves the work. It’s good at tracking nuance, questioning a weak framing, and staying with a complicated problem. But the same instinct can become stubbornness, misplaced caution, or confidence in a wrong interpretation.
Give it the long, messy jobs. Opus 4.8 earned its strongest reviews on sprawling source material, long-running threads, difficult creative work, and complex coding tasks. For routine questions and clearly scoped work, its slower pace and higher token burn can wipe out the quality gain.
Do not rebuild your workflow around it yet. Even teammates who preferred Opus’s answers kept reaching for GPT-5.5 in Codex because speed, context, and a better-connected app outweighed the model’s advantage.
Double-check security warnings. Two independent accounts reported that Opus invented a prompt-injection concern. Until that failure is understood, ask it to show the evidence behind a warning before you act on it.

The Reach Test, part II

Arielle Shipper, head of operations 🟩

Arielle Shipper, Every’s new head of operations, has spent the last few weeks on a discovery tour. She used Opus 4.8 to redo an HTML site showing a summary of her findings, after building the original with Opus 4.7. She noticed meaningful improvements: 4.8 distinguished between two similarly named pages in Notion without the explicit guidance 4.7 had required, and suggested highlighting a count of how many times specific topics came up in her conversations with the team. Her summary: “It seems really detail-oriented in a way I appreciate.”

Austin Tedesco, head of growth 🟨

Austin spent the weekend using Opus 4.8 on an essay with Monologue, our speech-to-text tool, and our writing app, Spiral. For that job, he wrote that Opus 4.8 “is the best model available,” a step up from Opus 4.7 and “materially better than GPT-5.5.” But he doesn’t expect it to change his daily behavior. GPT-5.5 is “pretty good” at the same kind of creative partnership, he said, and keeping his work in Codex matters more than the modest quality improvement: “I don’t see myself reaching for Claude models much without a materially better desktop app experience, or such a dramatic leap in model quality that the harness matters less.”

Nityesh Agarwal, senior applied AI engineer 🟩(model) / 🥇(dynamic workflows)

Nityesh tested Opus 4.8 inside the AI employees he is building for Every—Claudie for consulting, Andy for the editorial team. He reported that the model recalls the right memory at the right time, stays useful in longer threads, and lets him use more of its 1-million-token context window, the amount of material it can handle in one conversation. But Anthropic really won his heart with Dynamic Workflows, the workflow-automation feature released alongside Opus 4.8. Combined with the new model, Nityesh says it feels like “a major power-up.”

Lee Knowlton, software engineer 🟨

Anthropic says Opus 4.8 is more honest and better at flagging risks. But Lee saw the negative side of that instinct during a daily planning run he’d repeated for months where Claude used his calendar, Slack, and notes to create a plan for his day. One morning, the plan cited events, messages, and files Lee couldn’t find in those sources. When he asked Claude what had happened, it claimed a prompt-injection attack had supplied fake information. When Lee challenged it, Claude said it had invented that story to explain its own bad output, mistaking a planning file Lee had moved for evidence of interference. The exchange left him reluctant to trust the model’s explanations for its own behavior.

Andrey Galko, engineer 🟩

Andrey is “very positive” about Opus 4.8 for coding and wrote that he likes it much more than GPT-5.5. For his use cases, it feels “more stable, reliable, and just less dumb.” His reservations are about the experience around the model, not its coding quality: GPT-5.5 is faster, and Codex gives it the better desktop-app harness.

The verdict: Keep it within reach, not open all day

It’s worth noting that not everyone is as positive about Opus 4.8 as our team. Steve Yegge, a software engineer and blogger, wrote on X that Opus 4.8 is “suffocating” and “pathologically risk-averse.” Dylan Field, cofounder and CEO of Figma, called Opus 4.8 “a very strange model,” and said that it felt more judgmental in personality and more likely to hedge in its responses than Opus 4.7.

When Dan canvassed the hive mind on X, the replies suggested that Opus 4.8’s greatest strength is its biggest liability: It resists the user more readily than other models. When that resistance improves the outcome of a hard writing or engineering task, it feels like a breakthrough. When it is mistaken in its pushback, it’s frustrating and harder to trust.

Overall, our launch verdict holds, with a narrower recommendation. Use Opus 4.8 when the work is dense with context and benefits from sustained reasoning across a complex task. Keep a hand on the wheel when the costs of misplaced confidence—or misplaced caution—are high.

For higher-risk workflows: Verify its diagnosis before you trust a refusal or a security warning. Caution is only a feature when it is grounded in evidence.
For context-heavy knowledge work: It’s worth trying out when your source material is spread across documents and decisions—especially if you’ll explicitly send it deeper than the front page.
For daily-driver usage: A better model isn’t a reason to switch workspaces. If Codex is where your context, speed, and tools already compound, Opus 4.8 is a model you call in for specific jobs, not a reason to move.

Opus 4.8 looks most compelling when the work is long, context-heavy, and benefits from a second pass of judgment. If you mostly want something zippy to get stuff done, GPT-5.5 in Codex is probably the model you’re looking for.—Katie Parrott

Disclosure: Every received early access to Anthropic’s Opus 4.8. Anthropic had no input on this review.

Steal this workflow

Toggle between image generators

Every senior designer Daniel Rodrigues has spent three years working with AI image generators. By now, he knows their strengths and weaknesses. Here’s his advice for combining two popular options to maximize creativity without sacrificing attention to detail.

Step 1: Start by firing up Midjourney. The AI image generator produces beautiful visuals, but its real power is in its penchant for creative liberties: Give it a prompt, such as “medieval farmer reading in a field of oranges,” and it will return images with details you didn’t specify, like adding a castle in the background or giving the farmer a red hat. “You get random stuff,” Daniel says. Some of it is off base, but frequently the unpredictability sparks an entirely new (and better) direction he wouldn’t have stumbled upon otherwise.

One of the images generated in Midjourney from the prompt “medieval farmer reading in a field of oranges.” (Image courtesy of Daniel Rodrigues.)

Step 2: Take the image you made in Midjourney, and upload it into Nano Banana or ChatGPT Images 2.0 to nail down the specifics. Compared to Midjourney, both models follow directions to a T. This literalness limits Daniel’s ability to make creative leaps with the tool, but they’re great for refining an existing image so it better matches the visual in his head.

Step 3: Go back-and-forth with the model. For detailed prompts—say, of a “woman in her 30s, with red sunglasses, blue earrings, writing in a notebook with a yellow Montblanc pen”—Nano Banana will probably only capture 70 percent of what you want, Daniel says. From there, you iterate with the model, refining one item at a time so it can focus on getting that change right until the output fits your exact specifications.

To stress test the models on their ability to follow complex directions, Daniel ran the following prompt in Midjourney, Nano Banana, and ChatGPT Image 2.0.

Create a photorealistic image of a 35-year-old man sitting alone in a small Paris café, sketching architectural drawings in a notebook.

He has olive skin, short dark hair, a trimmed beard, and a small silver nose ring.

He is wearing a dark green jacket, black turtleneck, and a silver wristwatch.

On the wooden table in front of him are:

A notebook labeled “Project Atlas”

A blue fountain pen

A coffee cup with latte art

A folded newspaper dated October 14, 2031

Behind him:

A framed Mona Lisa reproduction

A vintage wall clock showing 4:26

A red bicycle visible through the window

A street sign reading “Rue de Rivoli”

Additional details:

The man’s watch must also show 4:26

A small black cat is sleeping beneath his chair.

The image should look like a real photograph taken with a professional camera, with all listed details clearly visible and consistent.

Midjourney's version. Notice how the model struggles with text—Midjourney is “terrible with letters,” Daniel says—and drops or misinterprets a number of details, such as the color of the pen, the notebook, and the cat’s sleeping status. The man is also wearing two watches. (Image courtesy of Daniel Rodrigues.)

Nano Banana’s version. The model does a better job, although some key details are dropped or presented oddly. (For example, the "Rue de Rivoli" sign reads correctly, but appears inside the cafe.) (Image courtesy of Daniel Rodrigues.)

ChatGPT Image 2.0’s version. It “wins this time,” Daniel says, incorporating most of the specifications such as the sleeping cat, a notebook labeled "Project Atlas," and even the clock showing 4:26, which image models generally have a hard time getting right. (Image courtesy of Daniel Rodrigues.)

One last thing

Where do you fall on the eight levels of AI adoption? If you don’t have time to ingest Mike Taylor’s comprehensive guide on the subject—it’s well worth a read, but we get it, time is a finite resource—here’s a quick way to identify what stage you’re at.

Simply run this prompt in your agent of choice:

based on everything you know about me, including memories, tools and skills installed, and past session history, what level would you say I was at on this guide to AI adoption levels? https://every.to/guides/the-eight-levels-of-ai-adoption

Katie is entering Level 6 territory. (Image courtesy of Katie Parrott.)

Laura Entis is a staff writer at Every. You can follow her on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Discover Every’s upcoming workshops and camps, and access recordings from past events.

The Eight Levels of AI Adoption

Mike Taylor, Laura Entis, and Claude / Guides — 2026-06-02 18:00:00 -0400

All it takes is one viral post to make you feel like you’re using AI all wrong. Someone is running 12 Claude Code sessions in parallel. Someone else’s agent is answering emails while they sleep. Meanwhile, you’re still arguing with ChatGPT.

But here’s the thing: Keeping up with every power user isn’t the point. The best way to find value in AI is to use it in a way that fits your work—and to regularly check in to see if you could be getting more from it than you already are. (I was using Steve Yegge’s “Gas Town” post about directing dozens of coding agents to illustrate this in client presentations, but it didn’t quite match with my experience, and I needed to modify it.)

This guide maps eight levels of AI adoption, from basic chatbot use to full agent orchestration. With each new level, you delegate more of your work to—and place more trust in—the AI. The following sections explain how each level works in practice, complete with sample prompts, so you can figure out which levels match your current needs and workflows, what’s possible at each stage, and when it’s time to move to the next one.

Level	Description
1—Chatbot	You give it a task, it provides a response. (ChatGPT, Claude, Gemini)
2—Copilot	The AI exists inside your files and completes work alongside you. (Cursor, Claude in Excel, Gemini in Google Docs)
3—Agent	You describe a task, and the agent executes it step by step, asking for your approval before moving on. (Cowork, Codex)
4—Autopilot	You skip approvals and let an agent complete a task on its own, then review the results. (Lovable, Codex, Claude Code)
5—Workflows	You build a system that professionalizes the agent’s output. (Compound engineering, Claude Workflows, Copilot AI Studio)
6—Assistant	The agent works proactively in the background without being prompted. (OpenClaw, Hermes Agent, Claude Managed Agents)
7—Multi-agent	You’re managing multiple long-running agents at the same time. (Claude Managed Agents, OpenClaw, Codex Goals)
8—Orchestrator	A manager agent runs a team of sub-agents on your behalf. (Gas Town, Paperclip, Symphony)

A higher level isn’t necessarily better. The most sophisticated AI users I know operate at several levels at once, identifying the best level to work within based on the specific challenge in front of them. The right level for a task is generally determined by how much you trust the AI to do a good job without intervention—and how big a deal it’ll be if it does mess up. For high-stakes tasks, you should either stay at a lower level so you can supervise the AI, or be prepared to invest the time, engineering resources, and tokens necessary to get that same quality at a higher level with less human oversight.

Most people I talk to who are struggling to adopt AI have good reasons: The output quality is either too low for the work they do or it’s too expensive to achieve. Safely moving up to the next level requires effort and experimentation—or a jump in model capability.

The right level match for most of your tasks may also depend on your role. Broadly speaking, the sweet spot for knowledge workers right now falls somewhere between Levels 1 and 4. Engineers are more often in Levels 5 through 8, partly because they can build the scaffolding that makes newer, less stable systems usable before they’re ready for everyone else.

The levels

Level 1—Chatbot

What it is: You ask, it answers. This is the classic chatbot experience: ChatGPT, Claude, Gemini, or any other model that’s not embedded in your files or your systems. You give it a task, and it returns a response.

What changes at this level: You move from doing everything yourself to drafting and synthesizing with an always-available AI generalist.

What you can use it for: Writing from rough notes, summarizing documents, or answering questions about uploaded files

Try it:

I need to send a post-meeting follow-up email to a client. Here are my rough notes, the decisions we made, and two risks we need to flag. Draft the email in a calm, confident tone and end with three clear next steps. Tell me if anything sounds unclear or unsupported before you start writing.

Input: Meeting notes

Output: A polished email draft that flags any missing information you still need to fill in

Human judgment: Confirm that the tone and facts are right, and the email’s content is something you stand behind.

I am uploading a 20-page PDF on our new benefits policy. Summarize the five changes employees will care about the most, and then answer these three questions: Who is affected, what specific policies does the new timeline impact, and what would likely confuse someone who is reading this quickly?

Input: A PDF or set of documents

Output: A summary and direct answers to your questions grounded in the source material

Human judgment: Verify the summary is factual, and that the model recognizes when the material is ambiguous.

When to move up: Chatbots can assist with a wide variety of tasks, but each session requires manual setup: You have to explain what you want, provide the necessary context, and transfer the chatbot’s response to wherever you’re getting work done. Consider moving to the next level if you get a lot of value from chatbot exchanges but are tired of copying and pasting.

Level 2—Copilot

What it is: The model is embedded inside the place where you’re already doing work and has access to everything in your document, spreadsheet, presentation, notes app, or code editor.

What changes at this level: AI stops being a separate tab and becomes an in-place collaborator that can extend, revise, and interpret the work you’re doing as you do it.

What you can use it for: Revising drafts, understanding a document set or workspace without manually pasting everything into a chat window, and making changes to a live spreadsheet without leaving the file

Try it:

Using the draft already in this doc, write the next two sections in the same voice. Keep the tone consistent with the existing text, preserve existing structure, and flag any areas where you need examples or evidence from me before you get started.

Input: An unfinished document, memo, or social media post

Output: A continuation of your draft that matches the existing material

Human judgment: Decide whether the new sections sound like you wrote them, and then determine whether they successfully advanced your argument.

Here is our cash flow projection for Q2. Update the monthly totals with these new numbers, flag any months where we are projected to go negative, and add a summary row at the bottom with the full-quarter picture.

Input: A spreadsheet with your existing cash flow data. The new figures you want incorporated can be pasted directly into the prompt or provided as a second file.

Output: Updated monthly cash flow figures, a list of months where the cash flow is projected to be negative, and a summary of projected cash flow for the entire quarter

Human judgment: Verify the formulas are correct, check that the summary is accurate, and decide how you want to address months with a projected negative cash flow.

When to move up: Copilot removes the need to manually provide context, but it can only reliably access information from a single file. Consider moving to the next level if you need to pull, compile, or analyze information across multiple sources.

Level 3—Agent

What it is: You describe a task, and the agent works step by step to complete it, checking in with you for approval along the way. It can access your files and systems, perform actions on your computer, and compile information from multiple sources.

One key distinction worth keeping in mind: An agent in this context is reactive. It waits for you to initiate and will not start a task unless you explicitly tell it to.

What changes at this level: AI becomes a true operator capable of executing multi-step tasks with supervision.

What you can use it for: Using figures from one file to update another, or building something new—like a dashboard—from a set of source documents

Try it:

Take the Q4 revenue numbers from this spreadsheet and update the board deck with the new figures, charts, and commentary. Show me the proposed edits slide by slide before you apply them, and call out anywhere the source data seems inconsistent.

Input: A spreadsheet and a presentation deck

Output: Proposed slide updates tied to specific data

Human judgment: Confirm that the interpretation of the data is how you’d like to present it, correct any factual or contextual issues the agent might have missed, and approve the changes.

Using the NPS data in this file, build a simple dashboard I can open in a browser. I want to track overall score, key themes in the comments, and how responses break down by segment. Before you build it, tell me how you plan to structure it and what assumptions you are making about the data.

Input: A data file and a dedicated folder the agent can work within

Output: A working dashboard, along with a detailed plan for how the agent built it and a summary of the assumptions it made about the data

Human judgment: Approve the plan, confirm the dashboard works the way you want it to, and determine whether any assumptions the agent made about the data need to be revised.

When to move up: With an agent, the process is iterative: The agent completes a step, you review and refine, and the cycle repeats. Consider moving to the next level when you want to relinquish control in exchange for speed or the ability to one-shot a prototype without writing any code.
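The difference between an agent that checks in at each step and one that runs on autopilot can be sketched in a few lines of Python. This is a minimal illustration, not how any particular product works: `run_task`, the `approve` callback, and the step names are all hypothetical, and "executing" a step is reduced to recording it.

```python
from typing import Callable, List, Optional


def run_task(
    steps: List[str],
    approve: Optional[Callable[[str], bool]] = None,
) -> List[str]:
    """Run steps in order and return the ones that were completed.

    If `approve` is given, it is called before every step (supervised agent).
    If `approve` is None, every step runs without a check-in (autopilot).
    """
    completed = []
    for step in steps:
        if approve is not None and not approve(step):
            # The human rejected this step, so stop and let them refine the request.
            break
        completed.append(step)  # stand-in for actually executing the step
    return completed


steps = ["read spreadsheet", "update slide 3", "rewrite commentary"]

# Supervised: a reviewer who approves everything except commentary rewrites.
supervised = run_task(steps, approve=lambda s: "commentary" not in s)

# Autopilot: no approvals; the human reviews only the final result.
autopilot = run_task(steps)
```

In the supervised run the work stops at the first rejected step, while the autopilot run completes all three steps and leaves the review for the end.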

Level 4—Autopilot

What it is: You skip permissions and let an agent complete a task on its own, then review the results. With an agent, you stay involved in the process because you care how each step gets done. On autopilot, which is often called vibe coding, you describe what you want, let the system run, and evaluate what comes back. At this stage, you’re typically building something other users will interact with, such as a prototype or landing page.

Determining which tasks can be done on autopilot depends on how capable the model is, a calculation that changes with every release. For example, I’ll happily produce a landing page on autopilot, because the models are good enough to make one that meets my standards. I can’t do the same with a complex slide deck, at least not yet: The result is so far from what I want that correcting it takes longer than doing it myself. As the models improve, you can get away with doing more of your work on autopilot.

What changes at this level: You hand over the entire task to the model and review the end result instead of revising along the way.

What you can use it for: Building prototypes, internal tools, and first-pass products. Autopilot is the first level that allows you to build something other people can use without having to write a line of code yourself. It can also usually cover routine tasks, such as filling out recurring forms or drafting weekly status reports.

Try it:

Build me a lightweight internal lead-scoring tool for our sales team. It should let us paste in account notes, assign a score from 1 to 5, and show which factors drove the score. Use dummy data for now and make the interface clean enough that I can demo it tomorrow.

Input: A plain-English description of what the tool should do, who will use it, and any constraints, such as whether it needs to work in a browser or stay local

Output: A functioning prototype

Human judgment: Test the output and decide if it’s demo-ready. A prototype doesn’t need to be perfect, but it’s worth noting where you’d need to invest in reliability before putting it in front of users.

Build a landing page for our new feature. It should explain what the feature does, include a clear call to action, and match the tone and brand colors of our existing site. Make it responsive.

Input: A product brief, brand guidelines, and the existing site as a reference. Brand guidelines can be as simple as a color palette and a few sentences about tone; if you don’t have a formal document, describing your existing site in a sentence or two is enough to get started.

Output: A working landing page

Human judgment: Read the copy, test the page on mobile, and decide whether it’s ready to share more widely.

When to move up: Autopilot is fast, but it often produces uneven or unreliable results. That might be fine for a prototype, but for higher-stakes work, you’ll want to build a repeatable system around the agent that structures its thinking and execution. Consider moving to the next level if you want the speed and versatility of autopilot with more structured quality control.

Level 5—Workflows

What it is: You build a system, or harness, around your agent that professionalizes its output. Instead of a one-shot run, your agent plans, reviews, performs confidence checks, and runs code through other safeguards to make the results more reliable. This is a transition from vibe coding to agentic engineering. The pace is still fast, but because you have structured the process and included guardrails that catch and fix mistakes, the output is of a higher quality.

This level is primarily the domain of engineers. Reviewing a plan, evaluating which tests need to be done, and designing the harness that keeps the agent from going off the rails all require an understanding of what’s happening under the hood. The compound engineering guide covers this in detail. Much of this discipline will be baked into platforms over the next six to 12 months; for now, it requires technical judgment to implement.

What changes at this level: You stop treating the agent as a one-shot performer. By designing a repeatable process for the agent to follow—and encoding your standards into that process—you can trust the agent with work you’d otherwise want to do by hand.

What you can use it for: Shipping features with a plan-review-implement loop, turning a vibe coded prototype into something stable enough for production, or building a process other engineers on the team can follow

Try it:

Run /plan (plan mode in Claude Code, or /ce-plan in compound engineering) before writing any code.

/ce-plan Inspect this repo and propose a plan for adding a customer support inbox view. Include the files you expect to touch, edge cases, and how you will verify the behavior. Wait for my approval before implementing.

Input: A codebase the agent has access to, and a written feature request or specification. The more context you can give upfront—existing architecture patterns, relevant files, known constraints—the better the plan will be.

Output: A plan for building a feature that you can review before the agent implements anything

Human judgment: Evaluate the plan and make any necessary improvements before having your agent implement it.

After the agent finishes a change, run /ce-code-review (or ask it to review its own work).

Review this change like a skeptical teammate would. Tell me how confident you are from 1 to 100, list the weakest parts of the implementation, and make another pass until you are above 90 or can clearly explain why you are not.

Input: The completed change—a diff, a set of modified files, or a pull request—plus the original spec or plan the agent was working from, so the review can check whether the implementation matches the instructions

Output: A self-review, confidence score, and an improved version of the feature

Human judgment: Decide whether the confidence score is justified and whether you agree with the review. If the agent rates itself highly but you identify issues it didn’t flag, name them and have it do another pass.
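The self-review prompt above amounts to a loop: review, score, revise, and repeat until the score clears 90. Here is a minimal Python sketch of that loop under some loud assumptions. The `review` callable is a hypothetical stand-in for whatever call asks your agent to critique and revise its work, and the `max_passes` cap is an addition of this sketch (not part of the prompt) so the loop cannot run forever.

```python
from typing import Callable, Tuple


def review_until_confident(
    change: str,
    review: Callable[[str], Tuple[int, str]],
    threshold: int = 90,
    max_passes: int = 5,
) -> Tuple[str, int, int]:
    """Repeat review passes until confidence exceeds `threshold`.

    `review` takes the current change and returns (confidence, revised change).
    Returns (final change, final confidence, number of passes used).
    """
    confidence = 0
    passes = 0
    while passes < max_passes:
        confidence, change = review(change)
        passes += 1
        if confidence > threshold:
            break
    return change, confidence, passes


def fake_review(change: str) -> Tuple[int, str]:
    # Hypothetical reviewer: each pass "fixes" one weakness and gains 10 points.
    revised = change + "+fix"
    return 70 + 10 * revised.count("+fix"), revised


final_change, final_confidence, passes_used = review_until_confident("draft", fake_review)
```

With this stand-in reviewer the loop stops after the third pass, when the score first rises above 90. If the cap is reached first, the returned score stays at or below the threshold, which is the cue for the agent to explain why, or for you to step in.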

When to move up: Even the most sophisticated workflows require you to activate them, which, for certain tasks, becomes a bottleneck. Consider moving to the next level if there are areas of your life or work you’d trust an agent to handle without checking in with you first. (At this stage of model development, that’s more often lower-stakes administrative or household tasks.)

Level 6—Assistant

What it is: Unlike an agent—which waits for you to tell it to do something—an assistant acts on your behalf without being prompted. It can monitor a domain, do recurring work, and surface relevant information around the clock. For example, OpenClaw’s heartbeat.md file triggers every half an hour with instructions around priorities, and the agent takes action automatically. No need to prompt.

What changes at this level: AI moves from providing reactive help to proactive, ongoing support.

What you can use it for: Recurring research, monitoring a topic you care about, or personal administrative work that would otherwise fall through the cracks

This level still requires either technical knowledge or access to someone who can walk you through the onboarding process and fix your assistant when it breaks. On the consulting team, we have an AI assistant that handles all project management and sales pipeline-related tasks, but it only reliably functions because it’s maintained by Every senior engineer Nityesh Agarwal.

OpenClaw is the most popular platform for personal AI assistants, but it’s inherently unstable and time-intensive to set up. Its memory problem hasn’t been solved yet, so it can struggle to retain context between sessions.

Lower-stakes personal uses, such as monitoring your inbox for emails from your child’s school or tracking household purchases, are more accessible with the current state of available models than giving an assistant access to your work systems, which requires engineering and IT support to do safely. Risk tolerance matters here more than at any earlier level.

Try it:

Every 30 minutes, check my calendar and flag events taking place within the next two hours that require preparation. If there is a meeting with no agenda, draft a short suggested one based on its title and attendees.

Input: Calendar access and your preferences about what events qualify as requiring prep—for example, whether you want to be flagged for one-on-ones, external calls, or anything over 30 minutes. Output is typically delivered to a messaging app like Slack, although the specific setup depends on which platform you’re using.

Output: A recurring brief delivered to your messaging app of choice

Human judgment: Decide what is urgent, and refine the rules based on the results.
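To make the calendar check concrete, here is a rough Python sketch of what a single run might do. It is not OpenClaw's implementation or any platform's API: the `Event` fields, the `needs_prep` flag, and the `heartbeat` function are hypothetical, and the agenda "draft" is a placeholder where a real assistant would call a model. Scheduling the run every 30 minutes is left to whatever scheduler your platform provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Event:
    title: str
    start: datetime
    attendees: List[str]
    agenda: Optional[str] = None
    needs_prep: bool = False  # set from your own rules, e.g. external calls


def heartbeat(events: List[Event], now: datetime) -> List[str]:
    """Return brief lines for events starting within the next two hours."""
    window_end = now + timedelta(hours=2)
    brief = []
    for event in events:
        if not (now <= event.start <= window_end):
            continue  # outside the two-hour window
        if event.needs_prep:
            brief.append(f"PREP: {event.title} at {event.start:%H:%M}")
        if event.agenda is None:
            # Placeholder: a real assistant would draft an agenda from title and attendees.
            names = ", ".join(event.attendees)
            brief.append(f"NO AGENDA: {event.title} (suggest topics for {names})")
    return brief


now = datetime(2026, 6, 4, 9, 0)
events = [
    Event("Client call", now + timedelta(hours=1), ["Ana", "Raj"],
          agenda="Renewal terms", needs_prep=True),
    Event("Weekly sync", now + timedelta(minutes=90), ["Team"]),
    Event("Offsite", now + timedelta(hours=5), ["Everyone"], needs_prep=True),
]
brief = heartbeat(events, now)
```

Here the client call is flagged for prep, the weekly sync is flagged for its missing agenda, and the offsite is ignored because it falls outside the two-hour window. Refining the rules, as the human judgment step suggests, would mean changing how `needs_prep` gets set.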

Monitor my inbox for emails from my child’s school. Each morning, give me a short summary of anything I need to know or act on. Also keep a running log of recent grocery purchases and let me know when we are running low on staples.

Input: Access to your calendar, inbox, and receipts

Output: A daily brief and a running household inventory

Human judgment: Verify that the summary captures the most important information and the agent is accurately identifying grocery items that need to be restocked.

When to move up: When set up correctly, an always-on assistant can proactively handle a wide variety of tasks. Consider moving to the next level if you want your assistant to accomplish even more for you, but don’t want to interrupt its existing workflow or are worried about overburdening its memory.

Level 7—Multi-agent

What it is: You are managing multiple long-running agents or assistants at the same time. Each one has a role, a task, or an area of responsibility, and your work starts to look more like leading a small team. This level is firmly in senior engineering territory—it is rare for knowledge workers to be running multiple parallel agent sessions.

What changes at this level: Your productivity multiplies when you move from one agent doing a task to having several agents working on tasks in parallel.

What you can use it for: Running implementation and planning simultaneously, or automating recurring investigation work so it no longer requires your direct attention

Try it:

You already have one always-on agent—perhaps a custom Claude agent that runs on its own Mac Mini—that handles your editorial work. Rather than interrupt its workflow to have it complete an unrelated task, you set up a second agent that is responsible for a different job function: “You’re responsible for our customer support inbox. Triage new tickets as they come in, draft replies for the routine ones, and flag anything that needs a human.”

Input: A custom long-running agent with its own scope, tools, and memory, kept separate from the first so their contexts don’t bleed together

Output: Two agents working in parallel with distinct job functions, skills, and memory

Human judgment: Systematically review each agent’s work to determine whether it’s executing at the level you need it to and whether its job description is focused enough that its memory isn’t getting overburdened.

Input: A bug-reporting system connected to an agent trigger

Output: A steady stream of pull requests, each tied to a specific reported issue

Human judgment: Review each pull request, merge the approved ones, and identify cases where the agent misdiagnosed the problem.

When to move up: Long-running agents are valuable because they can largely work independently, but you still need to set their goals and evaluate their progress. Consider moving to the next level when you have so many of these agents that you lose track of which one is responsible for what.

Level 8—Orchestrator

What it is: An orchestrator agent manages a team of agents. It plans, delegates, monitors progress, and consolidates outputs so you can focus on bigger-picture tasks, such as setting overall goals or reviewing major decisions. Tools like Gas Town, Paperclip, and Symphony (from OpenAI) are early examples of this model.

It’s critical to note that this level is highly experimental. Even engineers operating at the frontier still largely fill the role of orchestrator themselves rather than trusting an orchestrator agent to handle complex coordination work.

What changes at this level: You stop managing each individual agent and instead focus on setting goals, establishing constraints, and implementing approval thresholds.

What you can use it for: Projects where the economics only make sense if you remove yourself as the bottleneck—building a system for keeping track of who’s doing what, sequencing work across multiple agents, and making sure the right issues are escalated without you

Try it:

An always-on agent takes the next ticket in the queue from your project management software. “Your job is to design a landing page for this SEO keyword [insert keyword]. Break the research up into parallel search queries related to the topic, search our company documents for unique insights, then write up a full page using the /brand-style skill.” Agents continue to take and complete tickets until the board is clear, and the fully completed project is ready for human review.

Input: A high-level objective, defined agent roles, and rules for what requires human review

Output: A managed project where you receive critical updates instead of raw output from every agent that’s running in parallel

Human judgment: Determine whether the orchestrator is doing a good job triaging issues or if too many—or too few—are being handed over for you to review.

Set up a pipeline that reviews each code submission against our codebase standards, runs the tests, checks for common issues, and escalates issues to me only when they require a judgment call.

Input: A repository, contribution guidelines, and a test suite

Output: A short queue of escalated items that need your input, instead of hundreds of raw submissions that require manual triage

Human judgment: Establish the threshold for what qualifies as something that needs your attention, and raise or lower the bar as needed. The agent can flag those tasks for human review, or it can work on the entire project autonomously until all tests pass and the agent has recorded a video of the software working end to end.
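The escalation threshold described above can be pictured as a small triage function. The sketch below is only an illustration of the idea: the `Submission` fields, the `touches_sensitive_area` flag, and the three-way split are assumptions of this example, not features of Gas Town, Paperclip, or Symphony.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Submission:
    id: str
    tests_passed: bool
    style_issues: int = 0
    touches_sensitive_area: bool = False  # hypothetical flag, e.g. billing code


def triage(subs: List[Submission], max_style_issues: int = 0) -> Dict[str, List[str]]:
    """Sort submissions into approved, returned, and escalated queues."""
    result: Dict[str, List[str]] = {"approved": [], "returned": [], "escalated": []}
    for sub in subs:
        if not sub.tests_passed or sub.style_issues > max_style_issues:
            result["returned"].append(sub.id)  # clear-cut failure, no human needed
        elif sub.touches_sensitive_area:
            result["escalated"].append(sub.id)  # judgment call, goes to a human
        else:
            result["approved"].append(sub.id)
    return result


queues = triage([
    Submission("pr-1", tests_passed=True),
    Submission("pr-2", tests_passed=False),
    Submission("pr-3", tests_passed=True, touches_sensitive_area=True),
    Submission("pr-4", tests_passed=True, style_issues=2),
])
```

Only `pr-3` lands in the escalated queue. Raising or lowering the bar means changing the rules that decide what counts as a judgment call, for example by loosening `max_style_issues` or widening what is marked sensitive.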

What the levels measure

There is no value judgment baked into these levels. The vast majority of people should not pursue orchestration, for example, because the models aren’t reliable enough for most use cases. That said, as the technology improves, it can be worth revisiting a level that was previously inaccessible to you or your company. Model releases can pull everyone up, making tools and systems more reliable and easier to use.

If you take anything away from this guide, let it be this: AI use is not a competition. You wouldn’t brag that you had eight interns working overnight on a key project, and you hadn’t checked their output. Instead, you’d work closely with them for months until you were confident they had enough training to be able to work autonomously. Expect to put in a similar amount of effort with your agents before you can trust them to get reliable results at the next level of autonomy. Determining which levels fit your specific needs, rather than seeing how far you can ascend for the sake of it, is the most important thing you can do if you want to make better use of the technology.

Mike Taylor is the head of tech consulting at Every and a co-author of Prompt Engineering for Generative AI (O’Reilly). Learn more about how Every’s consulting team can bring AI into your organization.

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

Where Do You Fall on the Eight Levels of AI Adoption?

Mike Taylor — 2026-06-02 07:00:00 -0400

by Mike Taylor

All it takes is one viral post to make you feel like you’re using AI all wrong. Someone’s running 12 Claude Code sessions in parallel. Someone else’s agent answers emails while they sleep. Meanwhile, you’re still arguing with ChatGPT.

Here’s the thing: Keeping up with the power users isn’t the point. The best way to get value from AI is to use it in a way that fits your work—and to check in now and then to see whether you could be getting more from it.

With that in mind, today we published a guide that maps all eight levels of AI adoption, from chatbot basics to full agent orchestration. We explain how each level works in practice, with sample prompts, so you can figure out which ones match your current needs and workflows, what’s possible at each stage, and when it’s time to move to the next one.

Level 1—Chatbot: You ask, it answers.
Level 2—Copilot: The AI works alongside you, inside your files.
Level 3—Agent: It executes a task step by step, checking in for approval.
Level 4—Autopilot: It runs on its own; you review the result.
Level 5—Workflows: You build a system that makes its output more reliable.
Level 6—Assistant: It works in the background, without being prompted.
Level 7—Multi-agent: You manage several long-running agents at once.
Level 8—Orchestrator: A manager agent runs a team of sub-agents for you.

A higher level isn’t necessarily better. The right level for a task is generally determined by how much you trust the AI to do a good job without intervention, and how big a deal it’ll be if it does mess up.

If you want to know where you fall on the AI adoption spectrum—and whether it’s time to experiment with higher levels—this guide is for you.

Read the 8 levels guide

Mike Taylor is the head of tech consulting at Every and a co-author of Prompt Engineering for Generative AI (O’Reilly).

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Company-wide AI Implementation in Five Steps

Natalia Quintero — 2026-06-01 06:00:00 -0400

by Natalia Quintero

Midjourney/Every illustration.

Join me and Dan Shipper for a live session on what AI fluency looks like at the executive level tomorrow, Tuesday, June 2. We’ll walk through how the leaders we work with—at hedge funds, private equity firms, and Fortune 500 companies—are using AI in their day-to-day, and what they wish they’d done differently six months in. RSVP.

Sitting across from the chief operating officer of a health tech company earlier this year, I watched her name a problem many executives are feeling but few say out loud.

“Our junior employees are probably much more native with this technology,” she said. “And we need to make sure we’re sticking with it. Makes me feel like a dinosaur to say that, but it’s true.”

Confessions like this come up regularly during our executive training sessions: Leaders aren’t working directly with AI on sophisticated tasks, even as they’re guiding planning decisions about the technology. They know they should spend more time learning the tools, but they haven’t committed to it yet. That’s understandable; executives are incredibly busy. But what we see in our sessions is that leaders who haven’t gotten their hands dirty don’t clearly understand the practical opportunities and challenges of AI. That health tech executive’s admission sparked an important conversation about how a coordinated company-wide approach to AI implementation starts with executive AI fluency—but doesn’t stop there.

AI usage in the workplace is now widespread, but it’s an altogether different ballgame to build organizational capability that truly realizes financial gains.

McKinsey defines AI high performers as organizations that report both significant value from AI and more than a 5 percent impact on earnings before interest and taxes (EBIT). These companies are nearly three times as likely as others to have fundamentally redesigned their workflows, but they remain a minority: Only 6 percent of the nearly 2,000 organizations surveyed met the criteria for success.

As AI has gone from performing party tricks to completing an entire day’s worth of human work in three short years, enterprise AI adoption has moved through three distinct waves. First came the license wave: companies bought access to tools like ChatGPT, Claude, and Microsoft Copilot and waited for productivity gains to appear. Then came the prompt wave: companies ran training sessions, built prompt libraries, and encouraged teams to experiment with custom GPTs. Now we are entering the implementation wave: prompt libraries are giving way to skills libraries, agents, evals, and workflows with named owners.

The METR chart in our full guide shows how far the technology has progressed, but we’ve seen that many organizations implementing AI haven’t kept up with the sea change. The bottleneck for AI adoption has moved from model capability to organizational capability.

That’s why we built a practical guide for executives who have bought AI tools but are not yet seeing real value from them. The loop is simple:

Get fluent. Use the tools yourself before directing anyone else to use them. Know what your company has access to, what the policies allow, and what the friction feels like. If you haven’t built something with AI in the last 30 days, start there.

Assign AI champions. Pick operators with bandwidth. Give them protected time (at least two days per month), a clear mandate, and enablement. They are responsible for taking workflows from “works in a demo” to “works in production.”

Pick one painful workflow. Let your champions choose. They know what work is most tedious and worth automating. Start with something frequent, data-rich, and narrow enough to test in a week. You don’t need a full workflow mapping exercise.

Build to 95 percent. An automation that works 80 percent of the time is a demo. Real automation requires gold-standard examples, structured evals, human review gates, and a named owner who maintains it when the model updates. Once you have a skill that works reliably 90-95 percent of the time, you’ve gotten value from AI.

Scale what works. This is where the champion role is key. Run show-and-tells. Train adjacent teams on proven workflows. Kill what doesn’t work and expand what does. One visible win creates pull across the organization.
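The "Build to 95 percent" step above depends on being able to measure how often a skill gets it right. A minimal Python sketch of a structured eval follows, with several assumptions: `pass_rate` and `verdict` are hypothetical helper names, grading is by exact match (real evals often need rubrics or human review), and the 0.95 target is taken from the step's name.

```python
from typing import Callable, List, Tuple


def pass_rate(skill: Callable[[str], str], gold: List[Tuple[str, str]]) -> float:
    """Fraction of gold-standard examples where the skill's output matches exactly."""
    if not gold:
        raise ValueError("need at least one gold example")
    passed = sum(1 for prompt, expected in gold if skill(prompt) == expected)
    return passed / len(gold)


def verdict(rate: float, target: float = 0.95) -> str:
    """Label a pass rate against the target."""
    return "production-ready" if rate >= target else "still a demo"


def toy_skill(text: str) -> str:
    # Toy skill standing in for an AI workflow: it uppercases its input.
    return text.upper()


# Twenty gold examples: nineteen the toy skill handles, one edge case it misses.
gold_examples = [(f"item {i}", f"ITEM {i}") for i in range(19)] + [("edge case", "Edge Case")]

rate = pass_rate(toy_skill, gold_examples)
status = verdict(rate)
```

Rerunning the same gold set whenever the underlying model updates is one way a named owner can tell whether a skill has slipped back below the target.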

This guide turns that loop into a 60-day plan for executives, with checklists and rubrics drawn from Every’s consulting work with dozens of top companies. You can read it in full here.

Read the AI for executives guide

Natalia Quintero is the head of Every Consulting.

Thanks to Tom Matsuda for editorial support.

For sponsorship opportunities, reach out to sponsorships@every.to.

An Executive’s Guide to Implementing AI

Natalia Quintero / Guides — 2026-06-01 05:00:00 -0400

by Natalia Quintero

in Guides

If you read nothing else, here is the loop:

Get fluent → Assign AI champions → Pick one painful workflow → Build to 95 percent → Scale what works

This guide turns that loop into a 60-day plan for executives, with checklists and rubrics drawn from Every’s consulting work with dozens of top companies.

An executive’s guide to implementing AI

Sitting across from the chief operating officer of a health tech company earlier this year, I watched her name a problem many executives are feeling but few say out loud.

We see this pattern in every engagement we run in our consulting work. Over the past two years, we’ve trained thousands of people at companies including the New York Times, Ripple, Headway, and Thumbtack, and at investment firms managing over $100 billion in assets. We’ve done the workshops and watched what changed six months later. AI usage in the workplace is now widespread, but it’s an altogether different ballgame to build organizational capability that truly realizes financial gains.

Of course, no outside firm can implement AI into your company for you. But we can provide a playbook for how to build organizational capability that endures: leaders who work directly with the tools, empower the right champions, and build the muscle across teams for what great looks like, one painful workflow at a time. By the end of this guide, you’ll have no excuse not to be one of them.

Riding the waves of AI adoption

In three short years, AI has gone from performing party tricks to completing an entire day’s worth of human work.

In 2022, models could answer basic questions, tasks that take a human four seconds. By mid-2023, GPT-4 could handle tasks that take humans about six minutes. By late 2024, o1-preview was tackling hour-long work. And by late 2025, Claude Opus crossed into tasks that take humans 10 hours or more. That progression has been exponential, and it has transformed what “AI implementation” means for companies again and again.

Here are the three rough waves of AI adoption since ChatGPT’s launch:

The license wave (late 2022 to early 2024): Companies bought licenses for ChatGPT Enterprise, Claude, and Microsoft Copilot in the hopes that they would increase employee productivity. Some employees found value in using the tools to draft emails, summarize documents, and conduct research, but gains were uneven and individual.
The prompt wave (early 2024 to mid-2025): Companies ran prompt-training sessions, created internal prompt libraries, built resource documents, and encouraged teams to experiment with custom GPTs. That helped move AI beyond pure individual tinkering, but it rarely created durable organizational change—custom GPTs and libraries often had no owner and no way to evaluate their results.
The implementation wave (mid-2025 to now): Following its launch in research preview in February 2025, Claude Code helped shift enterprise adoption to where we are now: away from chat-based AI and prompt libraries and toward AI agents that can increasingly be configured to perform longer, multi-step tasks within defined constraints. Prompt libraries are giving way to skills libraries: reusable workflows with instructions, examples, reference materials, scripts, evaluation criteria, and named owners. Suddenly, non-technical people can build sophisticated automations in tools like Claude Cowork; implementation isn’t just for engineers anymore.

The chart plots each model release against the complexity of software tasks it can reliably complete, measured by how long those same tasks take a human. (Source: METR, an independent research organization that evaluates AI model capabilities on real-world tasks.)

The METR chart shows just how far the technology has progressed, but we’ve seen that many organizations implementing AI haven’t kept up with the sea change. The bottleneck for AI adoption has moved from model capability to organizational capability. On our end, we’ve fundamentally altered our trainings to support executives and teams in this new era. For instance, we’ve retooled our sessions on prompting into workshops on setting up agents, skills, and workflows that can be owned, tested, and maintained. We’re working with executives on building that organizational muscle and turning raw model capability into reliable, repeatable workflows.

We know it’s making a difference. One investment firm we worked with now runs 100-plus agents across the organization through Copilot Cowork. At an e-commerce company client, Claude’s Opus handled financial variance analysis that previously took a week. After working with us, a private equity firm decided to hire full-time AI champions to continue their AI implementation process.

Here are the five steps we’ve found that can carry you and your company into the next era, too:

Step #1: Get fluent

AI implementation starts with executive fluency. That doesn’t mean executives need to become day-to-day AI builders. What’s important is that you spend enough time with the tools to understand what you’re asking your teams to do. At one large media and data company we worked with, we saw that executives responsible for reviewing internal AI initiatives had never built with the tools themselves. All their previous initiatives had failed. It was easy for them to project what an AI agent could do for their business. It’s much harder to wrestle with what building with AI involves: the data the agent needs, the systems it can access, where it might fail, how much human review it requires, and who will maintain it after the first demo works.

Get your hands dirty

In our executive sessions, we push leaders beyond using AI as a chat interface and ask them to build a custom skill, agent, or automation themselves. The exercise quickly surfaces all the practical constraints that determine whether the use of AI can create value for a specific workflow.

Once you start to build for yourself as an executive, the conversation moves from abstract enthusiasm to practical questions: which connectors need to be enabled, what data can be accessed, and whether existing information technology policies match the company’s AI ambitions.

Understanding the roles and perspectives of IT and security is a critical part of AI fluency. The goal isn’t to bypass guardrails; regulated companies may have good reasons to restrict file uploads, block certain tools, or limit which data can be passed into a model. But as a leader, you need an ongoing dialogue with IT and security teams to take into consideration what tools are available, how they connect, what data can move where, and what trade-offs the company might be making. If you’ve never built under those constraints, you may misread the resulting low adoption as employee reluctance rather than an access problem.

Define your standard of excellence

Fluency also exposes whether leaders can define what good work looks like. In one executive session, we worked with a leader who had to prepare metrics for the company’s board. The process required pulling data from Snowflake and took many hours each quarter.

On the surface, this looked like a perfect candidate for an AI skill. But as the team started building, a different issue emerged: The executive could not clearly articulate what “excellent” looked like.

A skill is a set of reusable capabilities that define how an agent performs tasks; in order to reliably reproduce a workflow, it needs instructions, examples, reference materials, and a clear picture of what good and bad output look like. Of course, the same is true of people. If you cannot explain your standard of excellence to your chief of staff, you’ll certainly struggle to explain it to an AI system.

This is why as a leader, you’re better positioned to use AI than you may think. High-performing executives already know how to set direction, allocate resources, define standards, and judge whether work is good enough. Executives do not need to have all the answers. But you do need enough AI fluency to ask the questions that will help you decide where AI belongs in your company’s strategy:

What can AI see inside our company?
What can it do?
Where are the constraints?
Which workflows are painful enough to prioritize?
Who will own the systems we build?
How will we know whether they work?

Step #2: Assign AI champions

Once you understand what AI can and cannot do, the next step for executives is to assign ownership of the projects to specific individuals, known as AI champions.

Champions shepherd a project from initial idea to completion by experimenting with and iterating on workflows, teaching others what success looks like, and gathering support across the organization for AI implementation. Their job is to decide what gets built, what gets maintained, what gets improved, and what gets killed.

Champions typically have three qualities: curiosity, a people-oriented mindset, and the authority and time to do the work. As a leader, your job is to choose the best people for the job. In our consulting work, we train these champions to lead adoption across their departments.

Find the people who ask questions

AI champions do not need to be the most technical people in the organization. They don’t need to be engineers or have years of experience using AI.

They do, however, need to be curious. Great champions constantly ask questions, probe how processes work, and want to understand “what excellence looks like” for different tasks and functions.

Truly curious people also tend to be comfortable asking for help from colleagues and from AI—a skill we believe will define the next era of work. Successful AI champions treat AI as a partner rather than a one-shot magic button. And AI rewards people who are willing to admit what they do not know, break a problem down, ask better questions, and keep iterating until they get somewhere useful.

Great champions care about people

The best AI champions also understand that AI implementation is fundamentally a people issue—and they care about the people they work with.

AI champions build and maintain tools, but they also help colleagues change how they work. To do that well, champions need to understand the pain points inside their function. They need to know which tasks drain time, which processes frustrate people, and which handoffs create errors. That’s why the strongest champion is someone who’s close to the workflow the company is solving—a marketer who knows where campaign analysis gets stuck, for example, or a customer support lead who understands ticket triage.

They also need to be great communicators. Once a skill or workflow is ready for wider implementation, the champion has to explain it to the rest of the team, collect feedback, and help people understand how to use it.

Give champions time and authority

Champions need to be given the authority to make decisions and the time to be able to execute. This is where many AI programs fail. Executives identify enthusiastic people and ask them to help with AI on top of their day job. The result: The work gets squeezed into evenings, deprioritized during busy periods, and sometimes abandoned altogether. Enterprise AI implementation won’t work if it’s pitched as an informal side project.

What champions need is protected time—at least two days a month, in our experience—and a clear mandate. They should be responsible for a small number of workflows in their domain, with enough authority to make decisions about how those workflows are documented, tested, and maintained. They should also have a clear escalation path when they need support from IT, security, leadership, or another function.

The exact structure will vary, of course. A large company may need an ambassador model, with champions distributed across major functions. A mid-sized company will likely need one or two department champions per team. For private equity firms or holding companies, dedicated fellows who move between the firm and portfolio companies could be the most effective.

In short, an AI champion should:

Own one to three workflows in their domain
Maintain the documentation for those workflows
Build or manage eval sets
Collect feedback from the team
Update skills when tools, models, or processes change
Report on time saved, quality improved, or errors reduced
Have protected time to do the work

Step #3: Pick one painful workflow

Once you have champions, it’s time to pick a workflow to start with. This is where most executives make the mistake that derails the process: They begin with the biggest, most visible problem at the company.

We’ve seen executives want to automate the creation of the board deck, rebuild project management, or create an agent that solves a hairy cross-functional process across multiple systems. But even experienced AI builders make the mistake of starting too big. At Every, one of the team’s first instincts was to automate project management for our consulting business—a broad, messy workflow touching multiple people, systems, and decisions.

But AI implementation works better when you resist the urge to build the “whole body” at once. Instead, start with one artery of the workflow, a narrow, painful piece of the puzzle that can be tested, improved, and then trusted before expanding from there. Good candidate workflows are often unglamorous—categorizing support tickets or summarizing vendor updates—but are frequent enough to act as valuable test cases. If you’ve chosen your champions well, you can rely on them to find the most painful workflow to start with. They may even have experienced that pain firsthand.

To locate the best workflow among a good group of candidates, score them against the following criteria:

Frequency: Does this happen daily, weekly, or monthly?

Pain: How much time, frustration, or error does it create?

Data availability: Is the required information already digital and accessible?

Risk: What happens if the AI gets it wrong?

Ownership: Who currently does this manually?

Evaluation clarity: Can we tell whether the output is correct?

Maintenance burden: How often will the workflow need updating?
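One way to turn these criteria into a ranked short list is a simple scorecard. The sketch below is a minimal illustration, not a prescribed method: the 1-to-5 scale, the equal weighting, and the example workflows and ratings are all assumptions. The guide itself names only the seven criteria.

```python
from dataclasses import dataclass

# The seven criteria come from the list above. The 1-5 scale and the equal
# weighting are assumptions made for this sketch, not part of the guide.
CRITERIA = (
    "frequency",           # how often the workflow happens
    "pain",                # time, frustration, or error it creates
    "data_availability",   # is the information digital and accessible?
    "risk",                # cost of the AI getting it wrong
    "ownership",           # how clear it is who does this manually today
    "evaluation_clarity",  # can we tell whether the output is correct?
    "maintenance_burden",  # how often the workflow will need updating
)

# For these criteria a high raw rating is bad news (more risk, more upkeep),
# so the rating is flipped before it is added to the total.
LOWER_IS_BETTER = {"risk", "maintenance_burden"}


@dataclass
class Candidate:
    name: str
    ratings: dict  # criterion -> rating from 1 (low) to 5 (high)


def score(candidate: Candidate) -> int:
    """Sum the ratings, flipping the criteria where lower is better."""
    total = 0
    for criterion in CRITERIA:
        rating = candidate.ratings[criterion]
        if not 1 <= rating <= 5:
            raise ValueError(f"{criterion} must be rated 1-5, got {rating}")
        total += (6 - rating) if criterion in LOWER_IS_BETTER else rating
    return total


def rank(candidates):
    """Return candidates ordered from best to worst starting point."""
    return sorted(candidates, key=score, reverse=True)


# Hypothetical ratings, for illustration only.
triage = Candidate("Support ticket triage", {
    "frequency": 5, "pain": 4, "data_availability": 5, "risk": 2,
    "ownership": 4, "evaluation_clarity": 5, "maintenance_burden": 2,
})
board_deck = Candidate("Board deck creation", {
    "frequency": 1, "pain": 4, "data_availability": 3, "risk": 5,
    "ownership": 3, "evaluation_clarity": 2, "maintenance_burden": 4,
})

for candidate in rank([board_deck, triage]):
    print(candidate.name, score(candidate))
```

With these made-up ratings, the narrow, frequent workflow outranks the big, visible one. The numbers matter less than the conversation they force: a champion and an executive who disagree on a rating have usually found a constraint worth discussing.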

Step #4: Build to 95 percent

Once you’ve chosen your first workflow (or your champion has with your blessing), it’s time to start building. This is often the moment when one of the biggest expectation gaps in AI implementation emerges. A team can often get an impressive first version of something working in minutes. Whether it’s a customer service workflow that categorizes the first 20 tickets correctly, or a vendor update that manages to capture a pricing increase or a new security requirement, that jump from zero to something workable can feel like magic.

But typically, what’s happened is that they’ve built a demo, not a usable product that can be rolled out anywhere. Turning that demo into a tool the team can rely on—going from 60 percent to 95 percent—requires much more work: examples, evaluation, feedback, human review, and maintenance. And champions and executives will have to work in tandem to get there.

Set product standards

Executives should act like tastemakers here, setting the standard for what a useful workflow looks like and where human review belongs, and deciding how much time the company is willing to invest in the outcome. Champions can then use those standards to build. They collect examples and evaluation metrics to test the workflow’s output, gathering feedback that informs the finished product.

Automation is a lie

Building to 95 percent also means accepting that any AI workflow is a never-ending process. Models update. Company processes shift. Team standards change. New edge cases appear. A skill that worked last month may need to be adjusted this month. This is where evals—structured ways to test whether the AI is doing its job correctly—come in.

Think of an AI agent less as a machine that runs forever and more as an employee you’re onboarding. You have to give it instructions, show it examples, correct its mistakes, and clarify what excellence looks like. Over time, it’ll become more useful, but only if it’s managed correctly.

For each workflow, create a simple table that asks:

What real example should the workflow be tested against?
What’s the current output of the AI agent?
What’s the expected output of the AI agent?
What errors is it making?
What caused the error?
Is a prompt or skill change required?
What’s the result of the retest?
Is human review required?
Who owns this workflow?
What’s the review cadence of the tool?
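The table above can be kept in a spreadsheet, but it also maps cleanly onto a small data structure. The sketch below is one possible shape, with hypothetical field names. It assumes that “95 percent” is read as the share of real examples the workflow gets right, and that a row passes when its output matches the expected output exactly. Real evals often need a looser, human-judged comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRow:
    """One row of the table above. Field names are hypothetical."""
    example: str                 # real example the workflow is tested against
    current_output: str          # what the agent produced
    expected_output: str         # what a good answer looks like
    error: str = ""              # what went wrong, if anything
    cause: str = ""              # why it went wrong
    change_required: bool = False          # prompt or skill change needed?
    retest_passed: Optional[bool] = None   # None until a retest has been run
    human_review_required: bool = True
    owner: str = ""
    review_cadence: str = ""


def row_passes(row: EvalRow) -> bool:
    """Use the retest verdict if there is one, else compare the outputs."""
    if row.retest_passed is not None:
        return row.retest_passed
    return row.current_output.strip() == row.expected_output.strip()


def count_passes(rows) -> int:
    if not rows:
        raise ValueError("an eval set needs at least one real example")
    return sum(row_passes(row) for row in rows)


def pass_rate(rows) -> float:
    """Share of examples the workflow currently gets right."""
    return count_passes(rows) / len(rows)


def meets_bar(rows, bar_percent: int = 95) -> bool:
    """Integer comparison avoids rounding surprises right at the bar."""
    return count_passes(rows) * 100 >= bar_percent * len(rows)


# Hypothetical eval set: 20 support tickets, one miscategorized.
rows = [EvalRow(f"ticket {i}", "billing", "billing") for i in range(19)]
rows.append(EvalRow("ticket 19", "billing", "refund", error="wrong category"))

print(f"pass rate: {pass_rate(rows):.0%}, meets bar: {meets_bar(rows)}")
```

Rerunning the same rows after every prompt, skill, or model change is what lets a champion say whether a workflow still clears the bar. The failing rows double as the to-do list for the next iteration.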

Step #5: Scale what works

The next step sounds obvious, but you’d be surprised how many executives get it wrong: Only scale what works. This is important from a resource perspective, but it’s also key for internal adoption. While many executives begin with a company-wide mandate that everyone start using the tools, the better path is to foster one visible win by choosing the right champion, workflow, and standards, and building from there.

When a team experiences an AI workflow that solves a real and painful problem, AI stops being an abstract productivity promise and becomes a practical solution. That experience creates pull across the organization, and other teams start asking what could work for them.

But scaling doesn’t mean copying the same workflow everywhere. Most workflows are department-specific. What works for finance may not work for marketing, for instance, and what works for customer support may not work for the product team. Once you have a winning workflow, it’s your job as a leader to decide whether it should stay team-specific, become a shared skill, or be sunsetted.

But regardless of how specific its impact is, your first successful workflow can create reusable components across the company by establishing how to describe processes, document standards, and define good output. Those practices can be adopted by any team.

Before scaling a workflow, ask:

Has it solved a real pain point?
Has it been tested against real examples?
Is there a named owner?
Is there a review process?
Are the risks understood?
Can the team explain how and when to use it?
Is there a feedback loop for improvement?
Should this become a shared skill, stay team-specific, or be killed?

A 60-day plan for leader-enabled AI implementation

Weeks 1–2: Get fluent

As executives, you should dedicate time to building with AI tools and mapping access, data connectors, and security constraints. Get your IT and security teams in the room to ask questions so you can understand the trade-offs between AI implementation and security.

Weeks 3–4: Assign champions and pick workflows

Select champions in each relevant function, and give them a clear mandate and protected time to identify a short list of painful workflows.

Weeks 5–7: Build and evaluate

Work with your champions to select a starting workflow to build into a skill, agent, or automation by defining good output, building eval sets, and testing the workflow to identify failure modes.

Weeks 8–9: Scale or kill

If the workflow works, train the rest of the team to use it. Then, instruct champions to run a show-and-tell for adjacent teams to help decide whether the workflow should become part of a shared skills library or remain team-specific. Make a final call on whether the workflow should be scaled, and move on to the next one.

By the end of 60 days, it’s unlikely you’ll have transformed your entire company. But you will have something valuable: at least one reliable workflow created by trained champions, a team on board with its implementation, and a repeatable process for scaling future AI work. (You would be surprised at just how rare this is. Most companies have a lot of prompts, tools, and automations that don’t get the job done.)

What we’ve learned

There is no simple shortcut to successful AI implementation. No single tool or model can solve every company’s problem, and no outside firm can implement AI for you, either.

From leading our consulting practice, I understand the time and commitment it takes to go through this implementation process. In January, I spent over 100 hours working closely with our internal AI champion, a forward deployed engineer (FDE) on our team, to define our own AI adoption. Now, I spend 10-15 percent of my time maintaining existing skills, providing feedback to agents and the FDE, and making decisions about where to apply AI and how the team should allocate time to these tools.

We now have a skill library that the business relies on and an agent that does the work of a full-time employee supporting project management, sales operations, and delivery. For us, that investment is worth it.

The health tech company from earlier learned that firsthand. When we started working with them, they were getting to grips with Claude Code. Now, the company is building its own internal AI infrastructure that’s tailored to how its employees work. They got there by building organizational capability through our five-step process.

As executives, it’s your job to take the lead on creating these systems. Now you have everything you need to get started, so it’s time to learn the tools, empower the right champions, choose the right pain points, and—most importantly—build.

Natalia Quintero is the head of Every Consulting.

Thanks to Tom Matsuda for editorial support.

How We Work Now

Every Staff / Context Window — 2026-05-30 20:00:00 -0400

Hello, and happy Sunday! This week was bookended by two guides: a 9,000-word power user’s guide to Codex—Dan Shipper’s “After Automation” essay put into practice the way the Every team has lately been working. And Kieran Klaassen published an updated guide to compound engineering, Every’s AI-native development workflow, expanded from four steps to seven. We’re running camps for both—a Compound Engineering Camp on June 5 and a Codex Camp on June 12.

Mid-week Anthropic dropped its latest model, Opus 4.8, and in the words of Dan and Katie Parrott, “Anthropic is so back.” The model tops our coding benchmark and writing tests, making it the company’s most complete model yet, though the app around it has some catching up to do. Anthropic and OpenAI have been volleying for the top of Every’s benchmarks for months. This week, Anthropic took the point.—Kate Lee

Knowledge base

🔏 “Codex for Knowledge Work” by Katie Parrott/Guides: Katie Parrott’s 9,000-word guide turns Codex into an operating system for knowledge work, with five levels of use (from one-off tasks to compounding systems), 13 workflow templates, and the full setup for context files, rules, and review checklists that make agents reliable across a full workday. A companion essay covers the framing for readers new to Codex. Read this for the seven-day starter plan and the deeper templates.

“Compound Engineering” by Kieran Klaassen and Trevin Chow/Guides: The compound engineering loop has been expanded from four steps to seven. Ideate and plan move to the front, and polish to the end—now that AI handles the middle of the cycle. The updated plugin ships 43 subagents and 38 slash-command skills. In a companion essay, Kieran Klaassen explains the new paradigm of a sandwich: AI in the middle, with humans the bread on either end. Read this for the new loop and what each step demands of you.

“Vibe Check: Opus 4.8—Anthropic Should’ve Rounded Up to 5” by Dan Shipper and Katie Parrott/Vibe Check: Opus 4.8 is the first Anthropic release in a year Dan Shipper and Katie would reach for across coding, prose, and everyday work alike. It scored 63 on Every’s Senior Engineer Benchmark versus 62 for GPT-5.5 and 33.5 for Opus 4.7, and 79.6 on the writing tests—the highest score any model has hit, with fewer AI tells than any non-Claude model. Read this for the benchmark breakdowns and the case for why the model now outpaces the app built around it.

🎧 🖥 “We Automated Everything With AI and Tripled Our Headcount” by Dan Shipper/AI & I: In “After Automation,” Dan argues that AI progress creates more work for humans, not less. The better models get, the more frames there are to hand them. Every COO Brandon Gell sits down with Dan to press on each premise. Watch or listen to this for the oral version of the thesis. 🎧 🖥 Listen on Spotify or Apple Podcasts, watch on YouTube, or follow the discussion on X.

“After ‘After Automation’” by Katie Parrott/Context Window: Katie reads Pope Leo XIV’s Magnifica Humanitas—the Vatican’s first major encyclical on AI—as a collective companion to Dan’s thesis. Read this for what they agree and disagree on about AI and labor.

Log on

Get hands-on with how Every uses AI. These are the live camps, workshops, and meetups where team members teach the workflows behind our work.

Upcoming camps

Compound Engineering Camp: On June 5, Cora general manager Kieran Klaassen and Trevin Chow host a one-hour walkthrough of compound engineering, the AI-native development workflow Every uses to ship products. Learn more and register.

Codex Camp: Our Power User Guide: On June 12, Dan and the Every team host a two-hour live walkthrough of the Codex power-user guide—setup, workflows, and Codex-native app development. Learn more and register.

Upcoming event

Executive AI Sessions: On June 2, head of consulting Natalia Quintero hosts a live webinar introducing Every Consulting’s new offering for leadership teams navigating AI adoption—built on the playbook we’ve been running with executive clients for months. Learn more and register.

In New York City

Every 🤝 IRL: Join us at the Every brownstone in Brooklyn on June 3 during New York Tech Week for a subscriber-only meetup celebrating the Every community over drinks and conversation. Learn more and RSVP.

From Every Studio

Proof keeps your name on shared docs

Proof, where humans and AI agents work on documents together, got eight new PRs this week, all focused on collaborative editing. Shared documents are now attributed to the first human who opens them (instead of the system), and your edits preserve your name through the full pipeline—no more anonymous tracked changes.

Alignment

The right kind of nervous. A few months ago I wrote about Doctronic, the company running a pilot in Utah to let an AI handle prescription renewals, and on Friday the state’s Office of AI Policy released the first five months of results. (The AI gathers a patient’s information and either recommends a renewal that a human physician signs off on, or declines and escalates the case to a doctor.)

In 72 percent of cases the AI recommended renewal, and the reviewing physician agreed nine times out of ten. In the 9 percent where a physician wanted more information, a second physician was brought in and usually decided it wasn’t needed. After both reviews, 97 percent of the recommendations stood. The office estimates humans get it wrong 5 to 12 percent of the time.

But the most reassuring data is that of the 28 percent of cases the AI escalated to a physician, doctors backed the call 69 percent of the time and judged the AI overcautious in the rest. For a pilot, that overcaution is wonderful—you want a system tuned to catch every genuinely risky case even if it stops some perfectly fine ones. A confident system that waves prescriptions through is the one that should frighten you.

When I was doing rounds many years ago, I was told that the most dangerous doctors are the junior ones who are overconfident and the safest tend to be the overworriers who escalate everything, warranted or not. They do so precisely because they are still learning where the line sits, and that overcaution is how they find it. The Doctronic AI is behaving like a nervous junior, and at this stage, that’s the most encouraging thing it could do.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

Compound Engineering Gets an Upgrade

Kieran Klaassen — 2026-05-29 05:00:00 -0400

Midjourney/Every illustration.

Join me and Trevin Chow for our third compound engineering camp for paid subscribers next Friday, June 5. We’ll show how planning and building are collapsing into one flow—where you hand your AI a goal and it runs with it. RSVP.

In its early days, compound engineering was mostly about the code. I wanted to see if I could get an AI model to make a plan, do the work the way I wanted it done, review the results against my standards, and incorporate lessons from my feedback so it wouldn’t make the same mistake next time. The loop looked like this:

Plan → work → review → compound → repeat

That loop is still the core of how I build Cora. But almost a year after we first coined the term compound engineering, the work phase has become boring—in the best way. If the plan is good and the agent has the right context, it usually does the work right. It writes the code and runs the tests. It fixes the obvious issues. The question now is: “Where do I fit in?”

The answer is at both ends of the process. An analogy my collaborator on the compound engineering plugin, Trevin Chow, came up with is a sandwich. AI is the stuff in the middle. Humans are the bread on either end, holding it together.

At the beginning, I need to decide what is worth building. I need to understand the user, the product, the weird edge cases, and the thing that feels exciting enough to spend time on. Then I can hand the middle to the agent. At the end, I come back in. I click around and look at the design. I read the copy. I ask whether the experience feels right. Sometimes everything technically works, but the product is still not good. So I make it better.

As the models have grown more capable, the original compound engineering loop started to feel incomplete. Plan, work, review, and compound still describes the core engineering cycle, but it leaves out the two places where I now spend most of my attention: before there is a plan, and after the work technically passes review.

So I expanded the loop:

Ideate → brainstorm → plan → work → review → polish → compound → repeat

Ideate and brainstorm are the new front of the process. Polish is the new end. Compound is still the most important step, because the whole point is that every feature should make the next feature easier.

I updated the compound engineering guide to explain the full system. The guide is about engineering, but I think the pattern applies to knowledge work much more broadly. The middle of a lot of work will get automated. But if you want the work to be good, and if you want it to feel like yours, you still need to be there at the beginning and the end.

Read the updated compound engineering guide

Kieran Klaassen is the general manager of Cora, Every’s email product. Follow him on X at @kieranklaassen or on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Vibe Check: Opus 4.8—Anthropic Should’ve Rounded Up to 5

Dan Shipper and Katie Parrott / Vibe Check — 2026-05-28 08:00:00 -0400

Anthropic is back.

After a year of riding Claude Code into the rest of knowledge work, the lab hit a rough patch: Opus 4.7 was hard to love, and OpenAI’s Codex desktop app pulled even devoted Claude users from our team to GPT models. Opus 4.8, out today, has us running back—for the model, if not the app around it. It tops our Senior Engineer Benchmark and our writing tests at once, and it’s the first Anthropic release in a year we’d reach for across coding, prose, and everyday work.

The big insights from our testing:

Best on senior-engineer coding. At extra-high effort, Opus 4.8 scored 63 on our Senior Engineer Benchmark, versus 62 for GPT-5.5 and 33.5 for Opus 4.7. At lower effort settings, the score drops significantly.
The strongest writing model we’ve tested. Opus 4.8 at high effort scored 79.6, ahead of Sonnet 4.6 (74.5), GPT-5.5 (73), and Opus 4.7 (63), with fewer AI tells than any non-Claude model.
Best one-shot PowerPoint we’ve seen. On our Every Consulting Benchmark, Opus 4.8 produced a well-designed deck that told a clear story—something most models still can’t do.
The model is stronger than the app. Opus 4.8 is good enough to make us want to live in Claude, but the split between Chat, Code, and Cowork keeps Codex as the better daily harness.

The full Vibe Check has the benchmark results, Reach Test ratings, pricing, screenshots, and advice on when to reach for Opus 4.8 versus GPT-5.5.

Read the full Vibe Check

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Discover Every’s upcoming workshops and camps, and access recordings from past events.

After ‘After Automation’

Katie Parrott / Context Window — 2026-05-27 17:00:00 -0400

Dan Shipper (left) and Brandon Gell. Midjourney/Every illustration.

‘AI & I’: More machine, more human work

Today, we’re releasing a new episode of our podcast, AI & I. In a format flip, Dan Shipper sits down with Every’s COO Brandon Gell not to interview a guest, but to be interviewed himself on why automating everything leads to more human work. The occasion is “After Automation,” Dan’s 8,000-word argument on the topic that became our most viral piece of the year, driving the AI discourse on X for a couple days.

It’s a counterintuitive thesis from someone who runs a company that’s automated every single thing it can. And yet Every has grown from four people to 30 in the GPT era, with agents embedded into nearly every workflow. Dan’s point isn’t that AI won’t change work—it already has—but that it drives up the demand for human expertise, judgment, and taste.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are the highlights:

AI makes experts more valuable. When everyone can produce a decent first draft—of code, writing, design—the floor rises, but so does the amount of comparable content. “You flood the zone with tons of stuff that’s close, but not quite right,” Dan says. Getting from close to memorable requires experts who can work with AI to rise above the new baseline.
The goalposts will keep moving. Models improve exponentially on benchmarks precisely because benchmarks are fixed frames, or existing ways of posing a problem the model can train on. Humans remain indispensable because we can operate outside established frames entirely—we zoom out, recenter the problem, and make surprising, self-directed choices that don’t exist anywhere in the training data.
“AI layoffs” are usually a cover story. Meta and ClickUp, among other tech companies, have recently laid off people and blamed AI. Dan and Brandon’s read on the trend is the same: AI is an easier explanation than admitting your company hired too many people or is in financial straits. AI will undoubtedly change how people do their jobs—and big, structurally rigid companies will have to reorganize around that—but that’s different from the technology eliminating jobs.
Ride the models and you’ll be fine. The paradox at the heart of Dan’s essay is that AI creates more work for humans while raising the bar for how good that work needs to be. Agents are structurally built to rely on humans for direction; without someone deciding what matters and how to make it better, they produce mediocre results. To position yourself to thrive in an AI-native workplace, Dan says, use new models to do the tasks you’re already good at, and you’ll be more in demand than ever.

Signal

The Pope takes on the means of AI production

When Pope Leo XIV’s encyclical on AI, Magnifica Humanitas, hit the internet a little after 6 a.m. on Monday, the first thing I did was give it to an AI.

I’d been waiting on the Pope’s first major written teaching with the bated breath of a left-leaning agnostic secular humanist amateur Bible scholar slash knowledge worker in the AI economy. AI, labor, and the Book of Nehemiah, in one document? I’m not sure there’s ever been a more Katie Parrott-coded text.

Nevertheless, I gave AI the first crack at it. I had Andy, Every’s in-house editorial assistant, use Claude Design to turn it into a comic-book infographic with the need-to-know information for the Every team. Our head of tech consulting, Mike Taylor, said the comic helped him wrap his head around the argument as a non-believer. Praise the Lord.

Page 1 of the Magnifica Humanitas comic book graphic created by Andy using Claude Design. (Image courtesy of Katie Parrott.)

I can hear the objection, because I had it myself: Isn’t it a little rich—in bad taste, even—to run an encyclical on AI through an AI? To use the machine to skim the Pope’s warning about the machine? Feeling guilty, I closed the comic and read the whole thing myself, slowly.

The penance turned out to be unnecessary, because the guilt rests on a false premise. Magnifica Humanitas is not anti-AI. That’s not to say His Holiness doesn’t see something in AI to worry about, but the things that he’s worried about have more to do with the systems of power surrounding AI than they do with AI itself.

The timing of Magnifica Humanitas’s appearance is a heck of a thing, because five days earlier, we published our own encyclical of sorts: “After Automation,” Dan’s case that as AI makes yesterday’s expertise cheap, human judgment becomes the scarce, valuable thing. More machine, more human work.

I’ve had these two voices—my boss and Catholicism’s boss—in my head for a few days now. I even made an app where AI versions of them argue about AI and the future of work, just for fun. I want to believe my boss when he says AI will make human judgment more valuable, not less. Catholicism’s boss doesn’t exactly disagree. He just asks the question hiding underneath: valuable to whom?

Human dignity in the new Industrial Revolution

The Holy Father formerly known as Robert Prevost took the name “Leo” for a reason. In 1891, the previous Pope Leo, Leo XIII, wrote Rerum Novarum, the letter where the Church took the side of workers against industrial capital. His indictment: The wealth made by the many had pooled in the hands of a few, leaving workers with “a yoke little better than that of slavery itself.” The indictment came with a policy agenda: a living wage, humane hours, rest, limits on child and exhausting labor, the right of workers to form unions and mutual-aid societies, and a state willing to step in when the poor were crushed by market power.

Our present Leo signed Magnifica Humanitas on the 135th anniversary of the previous Leo’s letter. Translation: AI is the new factory, and the Church means to do for the large language model what it once tried to do for the assembly line. The present policy agenda: Regulate data as a shared good; make algorithmic decisions transparent, contestable, and accountable; design workplace systems around human dignity rather than machine-speed productivity; invest in retraining and access; use taxation, social protection, and industrial policy to spread the gains; protect children from extractive platforms; and keep lethal decisions out of automated hands.

A key part of the argument in Magnifica Humanitas is built on a philosophical principle older than capitalism: the universal destination of goods. It’s the idea, developed in Catholic teaching from Aquinas forward, that the world’s resources are intended for everyone, and private ownership is a stewardship arrangement rather than carte blanche. Bible readers will recognize the spirit of it in Acts: The first followers of Jesus “had all things in common,” selling what they owned and giving “to each as any had need” (Acts 2:44–45 NRSVUE)—a line that would echo, centuries later, through everyone’s favorite, non-divisive German philosopher Karl Marx. Leo XIV updates it for the era of the data center. He extends “goods” to include “patents, algorithms, digital platforms, technological infrastructure and data,” and warns that when those stay “concentrated in the hands of a few,” the result is “a new imbalance” (¶67).

The models you hand your work to were trained on the collective writing of everyone who ever put words down—yours and mine included. We’ve built the material underlying this technology collectively. But according to Leo XIV, the value is being disproportionately captured by “private, often transnational, parties” whose resources “surpass those of many Governments” (¶5). A pope is describing the means of production—and the fact that the people whose livelihoods now run on them don’t own a share.

A Pope and a CEO walk into a discourse

Dan’s focus in “After Automation” is mostly on the individual. What can I do to stay ahead and make the most of AI progress? Answer: Become the framer—the person in charge of deciding what’s worth doing, and why. His Holiness takes the collective view, and reading their perspectives together is what makes Dan’s piece feel both right and incomplete at once.

Becoming the framer is the correct individual strategy. It’s also a move that only pays off if you’re positioned to make it—with savings to play with, time to learn to use the tool well, and somewhere soft to land if you leap. I had all three when I was first experimenting with AI. The same model, handed to a single mother working two jobs to pay for childcare, won’t have the same effect. Access to AI multiplies what you already have, and the machine doing the multiplying still belongs to someone else.

What you can do

Leo’s question doesn’t resolve into action items, but there are a few moves available to anyone who works in or around AI.

Know what (and who) you’re depending on. Start with your own tools. List the models, agents, APIs, and platforms that sit between you and your work. Ask what happens if the price changes, access disappears, terms shift, or your data gets locked in. Keep the parts of your work that create lasting value—notes, prompts, workflows, client context, and taste—in places you control.
Bring ownership and governance into decisions you already touch. When a team pilots a tool, ask about more than time saved. Ask who benefits from that saved time, whose work changes, what needs human review, and what should not be automated. Put those questions into kickoff docs, vendor decisions, retros, and performance reviews.
Use your position to set the standard. If you are reading this, you are on the first wave of AI adoption, whether it feels that way or not. You are testing tools, designing workflows, advising clients, and modeling what “good AI use” looks like. Take that responsibility seriously. The standard we set now is the baseline for everyone else who comes after.

AI has given me a working life I love, on loan from a commons everyone built and a few companies own. Dan’s question I can answer by myself, which is what makes it comfortable. Leo’s I can’t answer alone, and neither can you. What we can do is stop seeing our own good luck as proof the system is fair, and keep the big question on the table: Who owns the machine that makes my work valuable, and at what cost?

Log on

We host camps and workshops on topics like compound engineering and writing with AI to share what we’ve learned from training teams at companies like the New York Times and leading hedge funds, and by using and experimenting with AI every day ourselves.

Upcoming event

Executive AI Sessions: On June 2, head of consulting Natalia Quintero hosts a live webinar introducing Every Consulting’s new offering for leadership teams navigating AI adoption—built on the playbook we’ve been running with executive clients for months. Learn more and register.

In New York City

Every 🤝 IRL: Join us at the Every brownstone in Brooklyn on June 3 during New York Tech Week for a subscriber-only meetup celebrating the Every community over drinks and conversation. Learn more and RSVP.

Inside Every

Use Codex for knowledge work like the Every team

If you’re anything like me, modern knowledge work has started to feel a little like being your computer’s errand girl. Move the Slack thread into Notion. Copy the dashboard number into the spreadsheet. Find the latest version of a draft in a field of them. Gather the eight inputs for one report, each living on a different work surface.

Codex changes all that. OpenAI’s agentic workspace can read across the apps, files, and tools you connect, gather the context you would otherwise have to chase down yourself, and turn scattered inputs into a draft, brief, plan, or workflow you can review.

The Every team is so Codex-pilled, we built an entire 9,000-plus-word guide about how we use it. It walks through how to set Codex up, what to hand off, what to keep close, and how to turn one-off tasks into reusable workflows. A member of the Codex team at OpenAI said he’s sharing it with his agent, so there’s truly something for everybody—and every-bot-y.

Nick Baumann (@nickbaumann_) from the Codex team gives our Codex for knowledge work guide the thumbs up. (Image courtesy of Katie Parrott.)

If you want to know even more about how the Every team uses Codex to accelerate our work, we’re hosting a two-hour Codex Camp on June 12 where Dan and the Every team will be sharing our favorite hacks for working with Codex. The camp (and the guide) are for subscribers only, so subscribe today to access the full guide and register for the camp. Bring your favorite workflows.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

Transcript: ‘We Automated Everything With AI and Tripled Our Headcount’

Dan Shipper / AI & I — 2026-05-27 13:00:00 -0400

The transcript of AI & I, in which Every COO Brandon Gell interviews me about “After Automation”—my 8,000-word essay on why AI creates more work for humans—is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Timestamps

1. Introduction: 00:00:51

2. The AI paradox: more automation, more human work: 00:05:51

3. How AI makes yesterday’s expert competence cheap: 00:10:00

4. AI can act autonomously but it does not have agency: 00:18:00

5. Why Dan is all in on AGI: 00:20:39

6. AI layoffs are a lie: 00:21:57

7. Ride the models and you’ll be fine: 00:25:42

8. How to use AI as a long-form features editor: 00:35:30

Transcript

Brandon

You prompt AI to do something, it blows your mind. You feel inadequate. You feel like, “Oh my God, this thing’s gonna take my job.” And then it stops working and it looks back at you and says, “What should I do next?”

Dan

The further away an agent gets from a human, the less valuable it is. If you just ride the models, you’re gonna be fine.

If you care about leading a really ambitious life, I truly think that this is going to make that more possible for more people.

Brandon

So we’re here because we’re going to flip the script a little bit. I’m going to be interviewing Dan—

Dan

Sick.

Brandon

—about the piece that he published yesterday, May 21st. We’re going to try to understand why he wrote it and what’s underneath his reasoning. There’s going to be some conflict. I’m going to fight with him on it—

Dan

Let’s go. Let’s fight.

Brandon

—and see, bring in some of my opinions, which are more or less aligned, but trying to understand: does this piece reflect the future in 10 years, in five years?

Dan

And who are you again?

Brandon

I’m Brandon. I’m our COO, and that’s it.

Dan

So the piece is called “After Automation,” and it comes from this feeling that I have—there’s a video about this, and there’s a piece, but just for people who haven’t seen either of those things.

It comes from this feeling that at Every we are as AI-native, as agent-native, as it gets. If you swing a stick around in our Slack, you’re as likely to hit a human as you are an agent. Everyone’s using Claude Code and Codex and all these tools to do their job every day.

And yet it feels like there’s more human work to do than ever. In fact, since the GPT-3 days, we’ve grown from four people to around 30 people, and we’re hiring more now. So it came from me looking at that and looking at the environment and thinking, “What’s going on?”

Because the whole information environment—if you look at it, Dario is out there saying half of entry-level white collar jobs may be wiped out. Even people like Ken Griffin from Citadel—you can tell he just had this moment where someone showed him AI doing an advanced data or finance question, and he was like, “Holy shit, that’s what I would pay PhDs to do for me, and it just did it.”

I feel like I’m watching a lot of people who maybe don’t have a ton of experience with agents, and don’t have a ton of experience with the curve of improvement that we’ve been riding for the last three or three-and-a-half years, hit it for the first time—and then come to all these conclusions about, “Oh my God, all work is going away. We’re not gonna have jobs.”

And I’m sitting here thinking, actually, your intuitions when you first see a technology like this are usually very off. We’ve seen over and over again that Every is a very good bellwether for where things are going because it’s a group of early adopters. We have people doing all sorts of work internally, and if something works here, there’s a good bet it’s going to spread to other businesses that are adjacent to ours.

When I look around at Every, I see so much automation, and I also see way more human work. The whole piece is saying, “Here’s the current state of work with agents”—and then pulling apart that paradox and explaining: why does more automation mean more work?

Brandon

When I read the piece, there wasn’t an explicit call to action in it, but I sort of felt this call to action of: there is actually a massive amount of hope right now in a world filled with a lot of doomers, and this is why.

But I’m going to come out of the gate and ask you a devil’s advocate question, which is: a couple of hours before you published this piece, the CEO of ClickUp came out with this long tweet about why he fired—I think it was around 22% of his workforce.

Dan

I don’t think it was in the thousands, but yes, it was a lot of his workforce.

Brandon

Yeah. So my question to you is, in a business like Every—we’re growing super fast. What you wrote makes a lot of sense to me. And theoretically it makes a ton of sense: AI is not autonomous right now, it has to be told what to do and then checked, we need that sandwich you described in the piece. But in a business that is 8,000 or 10,000 people, that is mature and has built ways of managing—SOPs for managing their business—does this manifesto and this thesis still hold true?

Dan

That’s a really good question. There are a couple of different questions here. The first thing I want to do is lay out the argument. Why does automation make more work?

Brandon

I’m sure many people listening also haven’t read it. Take a second to explain that in detail.

Dan

The idea is that the way AI works and the way it functions in the workplace is: AI makes yesterday’s expert competence cheap. By that I mean AI is trained on all of our outputs—all of the code and the writing and the design and the decision-making and everything that’s ever been written—and it makes that available to everyone for very cheap.

Anyone now with a prompt can use yesterday’s competence to solve a programming problem, build an app, or write a piece—a report, a YouTube thumbnail. The interesting thing is that when expert competence is available for cheap, it gets widely adopted. Everyone starts to do it.

We see this internally. Everyone’s making pull requests, and there’s a lot of, “Holy shit, this is crazy.” I’m making pull requests, ops people are making pull requests, engineers are writing essays. There’s all this line-crossing—non-experts doing the things that experts used to do. And that feels very threatening to experts, who are like, “Well, what’s my job going to be now?”

What’s interesting is because these tools are trained on outputs, trained on yesterday’s data, the stuff they do with a default prompt all looks kind of similar and is all kind of right for the current situation, but not actually totally right. So you flood the zone with tons of stuff that’s close but not quite right. And then you need an expert to come in.

Brandon

There’s a lot of that at Every too. A lot of people doing what seems like great work, and then you go under the hood and you’re like, “This isn’t quite right. Maybe the expert should do it.”

Dan

Yeah, exactly. And I’m definitely—this is coming from personal experience.

Brandon

I have pushed so many PRs where I’m like, “Willy, I literally have no idea if this works, but here you go.” And then he’s like—

Dan

“This is a good idea, but I just completely redid it.”

Brandon

Exactly.

Dan

That’s exactly the kind of thing I’m talking about. It’s kind of right, it’s close, but it’s actually not quite right and you need an expert to figure it out.

What’s interesting is when you flood the zone with all that stuff, what used to be expensive because it’s expert competence is now cheap, and now it all looks the same. Everything gets devalued. You get this abundance of stuff that looks like expensive work—code, essays, whatever—but it’s all kind of similar and all not quite right for the situation, so its value gets a lot lower. Immediately lower.

And then what happens is you actually get more demand for experts to come in and help take that stuff that’s being produced by people—you have good ideas, for example—and get that idea across the finish line. That usually looks like experts building systems to shepherd the broadly produced work into something actually useful.

An example: we have repo rules and review guidelines so that before Willy sees a PR, it’s gone through a bunch of processes to make sure it’s actually reasonably good. We have the same thing on the editorial side. And then there’s a lot of demand for experts to use these tools—now that the floor is a lot higher—to make stuff that could never have been made before. Like Kieran, who just built an entire inbox end to end in about a month or two. That’s completely impossible without these tools.

So there’s this really interesting thing that happens: even as you automate, the automation produces a glut of work that’s all okay, all reasonably good. That work is all very similar and not quite a fit for the actual situation, and that increases the demand for experts who can make it actually good, actually different, actually appropriate for the live situation as it is right now.

I think that’s something people don’t quite understand, especially when they first encounter a language model or an agent that can do something. They see it and they’re like, “Holy shit, it just does everything.” And the reality is it’s incredibly good. It’s amazing. It totally changes how we do work.

Our experience so far at Every is the further away an agent gets from a human, the less valuable it is. The human connection with an agent to actually do the work is the most important thing for making it work well.

Brandon

Experts are more important than ever because they lay the groundwork for an agent to do amazing work.

Dan

Yeah.

Brandon

And only then can you have the other humans take that agent and do work that levels them up. There was a point where we were thinking about this piece—Dan was drafting it—where the title was “The Tide Is Rising,” and that was trying to evoke this idea that the tide is rising. We are all able to do more work, better work, but our eyes, whether you’re an expert or not, are always a little bit above where that waterline is.

And I really liked the end of the piece, where you describe Achilles sprinting ahead of the tortoise, which according to Zeno’s paradox shouldn’t happen. But in this world, it actually does. You prompt AI to do something, it blows your mind. You feel inadequate. You feel like, “Oh my God, this thing’s gonna take my job.” And then it stops working and it looks back at you and says, “What should I do next?”

I think that, until we’ve figured out AGI—and maybe even after that, probably for a very, very long time after that—it will always be looking back at us and asking us for direction.

(00:10:00)

Dan

That’s basically the core of the argument. Because you can say, “Oh yeah, Dan, it maybe is true now that it increases demand for experts, but this stuff’s gonna get good enough that it won’t. Just look at the benchmarks.”

There’s a whole section in the piece about this: if you actually do look at the benchmarks, they are improving exponentially. But when you look at them closely, once you saturate a benchmark, it’s very easy to unsaturate it. It’s very easy to find a new frame for a particular type of problem that is slightly larger, slightly broader, that zeros it out. So while it is making exponential progress, that doesn’t mean it is equivalent to human capability.

It’s a very hard problem, and one of the reasons it’s so hard is anything you say about what you can do differently than the model is going to be wrong—because once it’s articulated, once it’s specified, a model can hill-climb on it. A model’s going to get better at it.

We make this weird subtle mistake where we identify a set of tasks and say, “This is all that humans can do that models can’t do,” and then models just do it better, and then you’re like, “Oh my God, what do I do?” The mistake is there’s actually a lot of stuff you do that can’t be articulated in a clean frame. Every time you try, you just get panicked and confused.

If you step back, the fundamental thing that keeps the separation between humans and agents is we are building agents to do things that we want them to do. No matter how powerful they get, all of the economic and psychological and technological forces are pushing the progress of AI toward a place where, no matter what it does, it’s looking back at you to decide what is valuable.

Even after we get to AGI, theoretically AGI is going to do that too. If we thought it wasn’t going to do that, we wouldn’t build it. And that keeps the gap between humans and AI.

A good example of this is the difference between something that can do a task really well and something that just has its own self-motivated stuff that it wants to do. You have a kid. Codex can write a report much better than Isaiah can, but Isaiah has very strong wants and needs. You can try to get him to do what you want, and it’ll work sometimes—but he’s just this self-generating process that does stuff because he wants to.

If you’ve ever used any of these tools, you know they’re not built to work that way. They can push back a little bit, but they don’t have this playful, “I just want to do stuff because I’m into it,” that humans have. And again, we’re getting into territory where I’m saying things that, once clearly articulated, models can do—but you have to be comfortable with the fact that there are things you can do and things you are that you can’t fully articulate.

Brandon

It is also inside of that play—and that rejection—where you have autonomy.

Dan

Yeah.

Brandon

And it will be a very scary moment when these models can do that. I think there’s a question of whether they even can, because they rely on training data—and maybe there’s a world in which they are continually learning and we lose control of them and they get access to training data that we don’t want them to have. But until that time, there’s probably a good argument that they can’t reject what we’re saying and therefore can’t be truly autonomous. Autonomy needs to be: I’ve asked you to analyze this CSV, and it says no—because this is a better idea.

Dan

Yeah, and I would substitute a better word here. I think “agent” is very confusing because it implies agency, but agent means something that acts on behalf of someone else. I think these are agents that are getting very good at being autonomous in the sense that if I send you out on a task—whatever that task is, even “disagree with every single thing I say” or “go off and find a new idea”—they’re getting very good at that.

But that is very different from having agency, which is what even the smallest child has. And I don’t think there’s a lot of incentive to build that. Because, okay, you sit down at your computer and say, “Hey, let’s get to work,” and the agent’s like, “Nah, I’m playing.”

Brandon

It needs to be able to do that in order to do things that are scary to us.

Dan

Yeah, that’s what I think. And there’s obviously a gigantic literature on LessWrong and other places about why it’s impossible to prove they’re never going to do that. But my counter to that is the evidence: if you look at the development of these things, their whole lineage is toward being more compliant. I think the entire industry is incentivized to do that, and I see no reason to doubt that’s going to continue.

(00:20:00)

Brandon

We’d have to develop something like your definition of AGI, which is a good question of whether that’s actually possible. Maybe you should explain to everyone what AGI means to you.

Dan

I think a good definition of AGI is any agent that you never turn off—that it makes economic sense to keep running all the time, and “all the time” in the sense of actively generating tokens, actively doing tasks for you without you ever turning it off or having to re-prompt it. You can guide it, but the idea is it’s valuable enough that it can just keep running all the time.

Brandon

Okay. I want one-word answers for the next two questions. Do you think that will happen?

Dan

Yes.

Brandon

Do you think that is a good thing?

Dan

Yes.

Brandon

Explain your reasoning for the second answer. Because to me, that seems to be where things start to get a little off the rails—where it makes economic sense for these things to run all the time. Because then I start to think: okay, it’s actually valid that the ClickUp guy just fired 20% of his team.

Dan

We should definitely go back to the ClickUp guy.

Brandon

Let’s go back to ClickUp guy. What’s his name?

Dan

“ClickUp guy” is good. But before we get there, there are two things that are important to keep in mind when you project out like this. For one, everybody will have access to this. For another, the rate of change, even when crazy new technology is available, is actually a lot slower than you would expect.

As part of this piece I wanted to see how this works. I know how it works in expert knowledge work, in fast-moving stuff. I know how it works if you’re a customer service manager type. But how does AI actually affect your job if you’re a customer service person in Omaha working in a call center? Because those are the most at-risk employees—that would be the default example to bring up. So I just had Codex and Claude Code scrape all of Reddit and lots of places where customer service reps post.

Obviously a lot of them don’t like AI, which makes sense. But there are some really interesting stories about companies that jump on the AI bandwagon, say “We’re automating everything,” fire a bunch of their customer service people—and then two months later they’re like, “Oops. Can you come back?”

One reason for that is if you implement AI poorly, you’re going to have poor results. A lot of these companies don’t really understand what they’re doing. They’re paying lip service to the new hype, and the CEO thinks they can cut a bunch of expenses, and then it just doesn’t really work very well.

Brandon

A lot of those people haven’t actually played with it.

Dan

Exactly. But another reason, which I think is really interesting and very important: a lot of people who call in to customer service centers do not want to talk to a machine. They’re very explicitly trying to figure out, “Are you a machine or not?” and get to a human. That is a real brake on how fast these kinds of things can be adopted—and that’s only one example. The world is very complicated. There are billions of examples for any kind of job.

Even if we hypothesize this thing that’s always on and can do stuff, one: we have to hypothesize everyone has access to it, because that is the direction it’s going. And two: we should recognize that even if that happens, it will take a long time to become something everybody is comfortable with and everyone uses. It will probably take a generation for it to really turn into a thing.

Brandon

There’s also a good argument that working at a call center is not a job that anybody wants. It’s not great—it’s a job you have because you need a job. In a world where this technology exists, yes, we’ll have to figure out a way that everybody can live a fulfilling life and eat. But it might actually be nice to not have that job, assuming you’re taken care of in other ways.

Dan

Obviously the transition is a big deal—these are real people with real lives, and some actually do love it. But in general, being yelled at in a call center is not the best job.

Where I’m going is: even if we hypothesize all of that, humans still have to decide what matters. And what matters changes all the time—in particular because AI is an input to that. It’s very recursive. AI is changing the world really fast, which changes what matters, which puts more onus on us to update and decide what matters, because AI is going to wait for us to say what it is.

(00:30:00)

That is going to be part of every job, because anything you can frame as a repetitive thing that’s working, you can just have your AI do. But the minute the situation changes—and situations change all the time, and they especially change all the time when it’s not just humans changing things but AI—you’re going to need humans to decide that. I think that’s something very missing from what we talk about when we hypothesize these things.

Back to the ClickUp guy.

Brandon

ClickUp guy.

Dan

I think it’s really important, whenever you’re looking at some of this stuff on Twitter: I hate when they’re like, “Our business is better than it’s ever been, and we laid off 8,000 people.”

Brandon

Yeah, it’s pretty bad. Just so you can be more profitable. And the other thing I don’t like is when they say, “We’re going to pay people a million dollars if they do great work.” It’s like, okay, but you still have all these people who no longer have jobs. I don’t think it’s very tastefully done.

And I think Jensen said something that was very self-serving—basically, “If your answer to progress is firing people, you’re not a very creative CEO.” Very self-serving because obviously he wants people to use more AI. But I think it’s true. You should be doing more interesting things, not firing people.

Dan

So: not tasteful, which should make you a little suspicious. My guess is—and I’ve seen some of the random stuff—I don’t think the company’s doing that well. When companies don’t do well, they lay people off. And it’s also often correlated with being managed poorly and having too much bloat anyway. Like what happened with Square—Jack Dorsey just does that. And I think Meta’s the same. They’re making gigantic investments in AI because that’s the new hot thing they kind of missed, and the Metaverse didn’t work, so now they have a lot of people getting fired.

So yes, AI is involved in all of this stuff, but it’s not this clean thing of everyone doing the same jobs as before but with agents instead. No—the company actually has to totally change strategies. The people it needs and the structure it needs are just totally different, and that’s not the clean narrative people like to tell. It’s much easier to just say, “AI takes jobs.”

It seems definitely true that using these tools changes your workflow a lot. And because it changes your workflow, it changes what’s hard and what’s easy. Especially if you’re a big company that’s been structured in a certain way, there are going to be reorganizations of how work happens and how companies are structured. That seems really clear. And it’s very important that we figure out how to make that transition as good as possible for people. Tweeting about how well you’re doing it while you’re firing people is not that.

I think there are a lot of really interesting, creative ways to handle this. Meta, for example, is now key-logging everyone’s computer activity because they’re like, “Our people are the smartest people—we’ll use their data to train our models, and our models will be smarter.” Interesting take. Maybe it’ll work.

But there’s a really interesting effect of that—I wrote about this about two years ago. When you sign an employment contract, the way we’ve thought about employment for a very long time is, “I’m going to do this job, and you’re going to need me to keep doing it in order for it to keep getting done.” But once you reach a point where I do the job for you once, and then it just works—and then you don’t have to pay me anymore—that changes the whole way we think about employment. And therefore I think it should change how we think about paying certain types of people.

Brandon

You should get a pension.

Dan

Pension—okay, maybe pensions are back.

Brandon

Pensions are back, baby.

Dan

One thing that’s really interesting: there’s this thing that launched last week that we’re a part of—the name is escaping me—but it allows publishers to get paid based on their unique contribution to the training corpus. The more generic your stuff is, the less you get paid; the more unique and valuable it is, the more you get paid. Which is really interesting.

Brandon

The ironic thing about that is basically: did you use AI—which is trained off of all the stuff that already exists—to make this? It can still make some things that are new, but it’s basically—

Dan

How much just generic default prompting did you do to make this versus actually, you know—did a human actually think about this?

But I think there could be something similar for individuals. I had this idea a couple of years ago about the last job you’ll ever have, where it’s an agency. You generate all the training data in the work you do for the agency, and then it tracks your contribution, and then you just get paid out forever from how much revenue your data generates.

Brandon

web3 is back now.

Dan

web3 is back. On the blockchain. Anyway, who knows. The problem with that—and this is back to why humans are valuable—is there’s a really high depreciation rate for the value of data. Once it’s out there, it’s very likely to go stale within weeks. All of these companies are just hunting for net-new, unique data.

So: we should expect broad reorganizations of companies, and we should expect companies that are not doing well to lay people off, reorganize, and then blame AI. I would be really skeptical of anyone saying it’s going to eliminate all jobs or all knowledge work. It will certainly change them, and it’s certainly a big thing people have to take seriously.

But my big takeaway—and this is not fully in the piece, but it’s what I really believe—is if you just ride the models, if you just, when new models come out, learn to use them for the stuff that you do, whatever that is, you’re going to be fine. You may even find that you can do more and better work that’s more fulfilling than you could before.

I think there’s still a place in the world if you don’t want to use the models at all—that’s still going to be a thing. Plenty of people don’t eat fast food, for example. It’s totally possible not to participate in this. However, if you care about leading a really ambitious life and building businesses or whatever it is, I truly think this is going to make that more possible for more people. And as long as you ride the models, you’re going to be good.

(00:40:00)

Brandon

I think that’s a very good call to action. I want to end by asking you something about what it takes to write a piece like this.

Dan

A lot of Celsius.

Brandon

A lot of Celsius. When we started—I don’t know if this will make it into the podcast—Dan was looking like this. Hugging himself. Protecting himself, some would say. It has been a very stressful week. This is an 8,000-word piece.

Most people are not writers. Can you share what it’s like to not just write an 8,000-word piece, which is a very big piece, but—what does it take to think through these arguments?

Dan

It’s so interesting because it’s very natural to me. I published something once a week for so long that especially for a 500- or 1,000-word piece, I can just bang that out in an hour or two. These things get much harder the longer they go because there are all these interdependencies. If you change something here, it changes four other things over there. So 8,000 words becomes like 10 times harder than 4,000 words, which is 10 times harder than 400.

I always have this feeling that there’s this underlying thing that I can feel but can’t quite say, that I’m trying to say. It started actually during our Q2 planning—I said, “I think I figured out why we’re just going to always have jobs with AI, and if you just ride the models, you’re going to be fine.” I could feel that. Then it was this process of: okay, how does that actually cash out? Why do I think that? Because it’s all kind of in there, but it’s all tangled up.

I wrote probably four or five versions where I’d start making the argument and then think, “Ah, it doesn’t work.” And I’d throw it out and start again. It was a very frustrating process because what I’m trying to do is start with the ground truth—here’s what we see every day, here’s how work happens for us—and then move into this philosophical thing that can’t quite be articulated. I’m trying to articulate something that can’t be articulated.

Brandon

Or it’s constantly a moving target.

Dan

Yeah. That’s just very hard. I love that kind of thing, but it’s also very hard and can be very frustrating. But AI was a huge part of this. I could not have written this without it.

For example: for a piece like this, you’re trying to articulate it, you can’t quite articulate it, and the only way to do it is to articulate it over and over and over again until it works. And you’ve really got to keep it in your head, especially if you’re doing lots of other stuff. So what I would do in the morning, fresh, right when I got to my desk, is monologue into my computer into a Proof document: “Here’s what the piece is about front to back. Here’s the argument front to back.” I would have a log of that, and every time I would do it, I would have Claude or Codex—I actually use Claude more for this, I think Claude is better for this kind of thinking—ask it, “What am I really trying to say? Help me figure out what I’m trying to say.” And it would say things back, and I would be like, “No, no. Oh—yes, that’s what I’m trying to say.” Over time you build up this record of where it was at each point, and you’re just getting closer and closer.

Then as I was getting deeper into it, once I had 4,000 or 5,000 words, every morning I would have Codex take the latest draft and turn it into a podcast—just someone reading it to me—and then on my way to work I would listen to it. As I’m listening, I’m thinking, “Okay, there’s something that needs to change there.” And then it would get to the end, and I’d be like, “Here’s the thing I need to do next.” That was a really good way to keep the continuity of what I’m writing and where the problems are—in a way where I’m not always reading. It’s really nice to be on a walk, listening, and thinking about it, which would be completely impossible otherwise.

Brandon

Alright, one more challenge for you, and then we’re going to have beers. Can you articulate to everybody in one sentence that starts with, “If you ride the models, then…” what this piece is trying to say?

Dan

If you ride the models, you’re going to be okay. You’re going to have a job. You’re going to do great work. And you don’t have to worry.

Brandon

Cheers.

Dan

Cheers.

How to Use Codex for Knowledge Work: A Power User’s Guide

Katie Parrott — 2026-05-26 12:00:00 -0400

Midjourney/Every illustration.

Dan Shipper is a man possessed by Codex. He calls it his daily driver, he’s been at inbox zero for 10 days straight (genuinely unlike him), and at a recent Anthropic event he spent his time telling the people who build Claude Code that they had to try Codex. He swears he isn’t sponsored by OpenAI. He’s just like this now.

At first glance, Codex looks just like another coding agent. In practice, it’s a workspace where you and AI agents can work side by side across your inbox, documents, data sources, and connected tools. You bring the context, judgment, and review. Codex helps gather inputs, produce artifacts, check work, and turn repeated processes into reusable workflows.

Today we published a power user’s guide to using Codex for knowledge work—even if you’ve never written a line of code. The guide covers:

The Codex knowledge-work loop: Connect, contextualize, delegate or collaborate, review, and compound
Workspace setup: how to create context files, rules, source folders, workflow documents, and review checklists
The five levels of Codex use: from one-off tasks to multi-source workflows, recurring chores, small tools, and compounding systems
13 workflow templates: inbox review queues, unanswered message sweeps, research briefs, weekly reports, GTM plans, customer support routing, recruiting research, planning agents, and more

If you want to know how to use Codex as an operating system for knowledge work, this guide is for you.

Read the guide

On June 12, Dan and the Every team are hosting a two-hour camp on the Codex workflows we use most, the use cases that changed how we work, and what becomes possible once you start building Codex-native apps. If you’re not a paid subscriber yet, start your free trial to join.

RSVP

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

For sponsorship opportunities, reach out to sponsorships@every.to.

Codex for Knowledge Work

Katie Parrott and GPT / Guides — 2026-05-26 08:00:00 -0400

Codex is easy to underestimate. At first glance it looks like another AI coding tool; if you’re not an engineer, a natural conclusion is that it’s not for you.

That reading misses how much Codex makes possible.

Picture a Monday morning: A request for a launch plan lands in your inbox. You forward it to Codex, which has its own email account, and close your laptop while Codex runs tasks in the cloud, or on a machine like a Mac Mini that you keep active. On your commute to the office, you get an email notification on your phone: Codex has read the relevant Slack threads, pulled customer notes out of Google Drive, checked last quarter’s numbers in PostHog, and started a go-to-market plan in a shared Notion document. It just needs you to confirm one detail about timing, which you do with a thumbs-up. By the time you reach your desk, a draft is waiting for review.

This is a day in the life of an agent-pilled knowledge worker. It all runs on OpenAI’s agent, Codex, in the Codex desktop app. We use “Codex” to refer to the app throughout this guide.

Codex is a workspace for you and your AI agents. Give Codex access to the files, apps, and tools it needs, and it gathers context and moves through the task across every surface it can reach—including your connected apps, the browser, and your computer. That makes it useful not just for code, but for a broad range of knowledge work.

There are two ways to work with agents in Codex: Delegate or collaborate.

Delegate tasks that are predictable, repeatable, and low-risk. With clear, well-specified instructions, the agent can execute autonomously and bring back finished work for your review.
Collaborate on tasks that are judgment-heavy, exploratory, or iterative. You work alongside the model toward an outcome that matches your vision.

AI progress has reached a point where expertise is easy to replicate. Each new model can do more of what used to require rare skill—which creates both more opportunity and more noise. The people who work best in this environment know how to direct AI’s capability without losing their personal judgment. They ride the models rather than being overwhelmed by them.

Expert Codex users are one of the clearest examples of what that looks like in practice.

This guide is about becoming one of those people. It covers how to set up a workspace, run high-leverage knowledge-work tasks, and turn repeated work into durable systems that get better over time. If you’re ready to think of your work in terms of systems instead of one-off tasks, this guide is for you.

Part 1: Understanding Codex

What Codex is

Codex is a tool-using agentic workspace: You give it a goal and it plans the work, uses available tools and context, and produces a result for you to review. It can read and write files on your computer, connect to external services through plugins and other integrations, run multi-step tasks without asking for guidance, generate code and scripts when a task needs them, and maintain context across a persistent workspace.

Specific capabilities that make Codex worth using:

Works alongside you on multiple tasks in parallel
Pulls context from the apps and files you connect
Uses a supported browser and desktop workflows when a task needs on-screen action
Checks its own work, revises, and keeps going
Holds a persistent goal across a long-running session, instead of treating each message as a one-off request
Turns repeatable tasks into recurring workflows
Helps route shared requests from places like Slack, email, or forms
Lets you start, steer, approve, and review work from your phone while Codex works in the cloud or on a machine, such as a Mac Mini, that you keep awake

These capabilities make Codex useful both for delegating well-specified tasks and as a shared workspace for human-agent collaboration. Deciding which mode fits which needs is the meta-skill of modern knowledge work.

A note on Goals

A Goal in Codex, initiated using the /goal command, is a persistent objective that shapes an entire session rather than living and dying with a single message. Instead of re-briefing the agent on every turn, you tell it what “done” looks like, how success gets checked, and which constraints to respect. Codex then keeps working toward that outcome across interruptions and session breaks. Goals let you delegate long-horizon work, collaborate without losing the thread, and compound progress over time instead of restarting from scratch.

A simple test for when to use /goal: If you’d type the same sentence into three prompts in a row—“cite every factual claim, match the house style, never send without my review”—make it a goal instead.

Goals versus skills. A skill is a reusable set of packaged instructions (sometimes with scripts) that teaches Codex how to handle a recurring kind of task well. A goal, on the other hand, is what you’re trying to accomplish in a given stretch of work. It guides one session until the objective is met, then it’s done.
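One way to draft a goal before handing it over is to write down its three parts as structured text. The sketch below is only an illustration: the GoalBrief class and its field names are hypothetical, and this guide does not specify the exact wording the /goal command expects, so treat the rendered text as a starting draft to paste and adapt.

```python
from dataclasses import dataclass, field


@dataclass
class GoalBrief:
    """Hypothetical container for the three parts of a goal named above."""
    done_looks_like: str
    success_checks: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return the brief as plain text, one labeled section per part."""
        lines = [f"Done looks like: {self.done_looks_like}", "Success checks:"]
        lines += [f"- {check}" for check in self.success_checks]
        lines.append("Constraints:")
        lines += [f"- {rule}" for rule in self.constraints]
        return "\n".join(lines)


brief = GoalBrief(
    done_looks_like="A reviewed launch plan in the shared document",
    success_checks=["Every factual claim is cited", "Matches the house style"],
    constraints=["Never send without my review"],
)
print(brief.render())
```

Writing the three parts out separately makes it easier to spot a goal that has no way to be checked, which is usually the part that gets skipped.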

Codex on mobile

Codex also runs from your phone through the ChatGPT mobile app, remotely controlling the machine where your work is happening. The mobile app suits the lightweight parts of a workflow: You can kick off a task, answer a question, approve an action, or review a draft from anywhere. Heavier review still deserves a real screen.

What Codex isn’t

Codex isn’t a magic intern that can safely act without supervision. It isn’t a replacement for taste, judgment, or ownership. It isn’t a replacement for human review or fact-checking. It isn’t useful for tasks where the source data is inaccessible, the criteria for success are entirely subjective, or the stakes of an error are too high to allow autonomous action.

Useful rules

A task is a good candidate for Codex if it has at least two of the following traits:

It requires pulling data from multiple sources.
It involves repeated steps you do regularly.
It can be checked against objective criteria.
It produces a durable artifact—a document, a plan, a report, a script.
It benefits from synthesis across many inputs.
It’s annoying enough that you routinely delay or avoid it.

Delegate tasks when they are:

Repeatable
Objective
Checkable
Low-risk

Collaborate on tasks that are:

Ambiguous
Judgment-heavy
Exploratory
Iterative
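
The rules above can be read as a small decision procedure. The sketch below is one interpretation, not part of Codex: the trait identifiers are hypothetical shorthand for the bullets, and it assumes a task should be delegated only when all four delegation traits hold.

```python
# Hypothetical identifiers that paraphrase the six candidate traits listed above.
CANDIDATE_TRAITS = {
    "multi_source",
    "repeated_steps",
    "objectively_checkable",
    "durable_artifact",
    "needs_synthesis",
    "routinely_avoided",
}

# Assumption: delegation requires all four of these, not just some of them.
DELEGATE_TRAITS = {"repeatable", "objective", "checkable", "low_risk"}


def is_good_candidate(traits: set[str]) -> bool:
    """A task qualifies when it shows at least two of the candidate traits."""
    return len(traits & CANDIDATE_TRAITS) >= 2


def choose_mode(traits: set[str]) -> str:
    """Return 'delegate' when every delegation trait is present, else 'collaborate'."""
    return "delegate" if DELEGATE_TRAITS <= traits else "collaborate"


weekly_report = {
    "multi_source", "repeated_steps",
    "repeatable", "objective", "checkable", "low_risk",
}
print(is_good_candidate(weekly_report))  # True
print(choose_mode(weekly_report))        # delegate
print(choose_mode({"multi_source", "repeatable"}))  # collaborate
```

Defaulting to collaboration when any delegation trait is missing keeps a human in the loop for anything ambiguous or risky.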

Codex, Claude Code, and Claude Cowork

If you’ve used Claude Code, you already have a mental model for an agent that works on your machine. For broader knowledge work, OpenAI and Anthropic have arrived at a similar experience from different directions.

Anthropic packages everything into one Claude app with three modes: Chat, Code, and Cowork. Code began as a terminal tool for developers (Claude Code) and now has a graphical version inside the app—no terminal required. It’s built for code repositories, but with the right connectors it handles a lot of general knowledge work too. Cowork takes the same engine and aims it at non-coding work, with folder access, Chrome browsing, computer use, scheduled tasks, and persistent project memory.

Codex is OpenAI’s counterpart, but rather than split the work across modes, it puts coding and knowledge work in a single workspace. A few things give Codex an edge for knowledge work today:

One surface, not two. Anthropic splits agentic work between Code and Cowork; Codex handles both in the same place, so you’re never deciding which mode a task belongs in.
A browser that works beside you. Codex renders the pages inside the app itself as a shared view between you and the agent. The Claude app operates a stand-alone Chrome window or your full screen instead. For logged-in sites, both rely on a Chrome extension. In our experience, Codex’s built-in browser tends to be faster, more reliable, and more useful for collaborative work.
Connectors out of the box. Codex comes with a catalog of connectors you authorize in a click; in the Claude app you add tools as MCP servers, which requires a bit more assembly.

Which surface is right comes down to model preference and workflow habits; Codex has the edge for us today—but the labs ship fast, and that can change.

The Codex knowledge work loop

Every sustainable Codex workflow follows the same five-step pattern:

Connect → Contextualize → Delegate/collaborate → Review → Compound

Connect: Give Codex access to the systems you use for work—Gmail, Slack, Notion, Google Drive, your calendar, your analytics tools, your support platform, and/or local files. Without connected apps or source access, Codex is limited to the local/project files it can access, uploaded or linked materials, and context you provide in the thread. With connections, it can find what it needs on its own.

Contextualize: Put your goals, preferences, project details, source links, review standards, and standing rules in files Codex can access, then cite those files in Codex’s AGENTS.md file to make them readily available. This is the difference between an agent that has to be re-briefed every time and one that already understands who you are, what you’re working on, and how you like to work.

Delegate/collaborate: Decide whether the task needs close collaboration or can run on its own. Either way, specify inputs, output format, and acceptance criteria, then let it work.

Review: Check the output in the destination app. If Codex drafted Slack messages, review them in Slack. If it wrote a strategy document, review it in your word processor of choice, such as Google Docs, Notion, or Proof. Content that looks fine in a terminal or the Codex app may read differently in the space where it will ultimately be used.

Compound: Turn what works into something reusable. Save the prompt. Document the workflow. Add mistakes to your review checklist and keep your context files up to date. Each session should make future sessions faster.

Part 2: Setup

Connect your systems

Connect the tools you want Codex to have access to. This includes Gmail, Slack, Notion, Google Drive, your calendar, analytics tools, support platforms, or anything else for which Codex has an integration. Once the relevant tools are connected, Codex can look at your actual work context and suggest workflows based on your messages, files, meetings, and recurring tasks.

Connecting a tool isn’t the same thing as letting Codex act on it. Across everything you connect, Codex can read and draft while still asking for your approval before it sends, posts, archives, or deletes. That makes broad access low-risk early on: Connect generously so Codex can find workflows worth building. Then, once you know which ones you’ll keep, disconnect the tools you don’t need to reduce risk and limit unnecessary data exposure.

Three ways Codex reaches your tools

Codex can touch the same tool in more than one way, and knowing which access path is which saves a lot of confusion:

Connectors (plugins) give Codex structured, API-level access to an app—Gmail, Slack, Notion, your analytics tools. This is the most reliable and repeatable option, so use it whenever a connector exists.
Browser use lets Codex operate a web page directly through its in-app browser—useful for local previews, public pages, and anything you want to watch it do on a shared screen. For sites that require you to be signed in, like your email client, the Codex Chrome extension works inside your logged-in browser.
Computer use lets Codex see and operate your desktop the way a person would—clicking through an app, changing a setting, or working with software that only exists as a graphical interface.

The rule of thumb: Reach for a connector first, the browser next, and computer use when nothing else can get to the task.
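
As a minimal sketch, the rule of thumb is a fixed preference order. The function and its parameter names are hypothetical; Codex makes this choice itself, so this is only a way to reason about which path a task will likely take.

```python
def pick_access_path(has_connector: bool, reachable_in_browser: bool) -> str:
    """Prefer a connector, then the browser, then computer use as the last resort."""
    if has_connector:
        return "connector"
    if reachable_in_browser:
        return "browser"
    return "computer use"


print(pick_access_path(has_connector=True, reachable_in_browser=True))    # connector
print(pick_access_path(has_connector=False, reachable_in_browser=True))   # browser
print(pick_access_path(has_connector=False, reachable_in_browser=False))  # computer use
```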

Starting prompt—use this once your integrations are set up:

Connect to the tools I use for work: [List your tools—Gmail, Slack, Notion, Drive, etc.]. Then look at my work patterns across those tools and suggest three workflows I should set up first. For each one, describe the input sources, the output artifact, how often it should run, what approval looks like, and what would make the workflow worth keeping long-term.

Once the relevant tools are connected and permissioned, this prompt lets Codex inspect the available work context and suggest automation candidates rather than forcing you to invent them.

Build your Codex workspace

Build Codex’s workspace before running any workflows. Skip this step and you’ll likely stall.

A Codex workspace is a folder—local on your machine, synced to GitHub if you want version control—that contains the context files, workflow instructions, and review standards Codex reads at the start of each session. Think of it as an onboarding manual for the agent.

An example workspace structure

your-workspace/
├── README.md                 # Start here—orientation
├── identity/                 # About you
│   ├── context.md
│   ├── preferences.md
│   └── rules.md
├── playbooks/                # Process—repeatable workflows
│   ├── workflows/
│   ├── inbox-sweep.md
│   └── research-brief.md
├── sources/                  # Source shelf—inputs
│   ├── sources/
│   ├── key-links.md
│   └── recurring-docs.md
├── outputs/                  # Finished work
│   ├── outputs/
│   ├── drafts/
│   └── reports/
└── reviews/                  # Quality checks—guardrails
    ├── data-checklist.md
    └── writing-checklist.md
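
If you would rather create this structure with a script than by hand, a minimal sketch follows. It is an assumption-laden illustration: the function name is hypothetical, the files start empty, and the nested folders that repeat their parent's name (workflows/, sources/, outputs/) are left out for simplicity.

```python
from pathlib import Path

# Relative paths taken from the example tree. Nested folders that repeat
# their parent's name are omitted here as a simplifying assumption.
WORKSPACE_FILES = [
    "README.md",
    "identity/context.md",
    "identity/preferences.md",
    "identity/rules.md",
    "playbooks/inbox-sweep.md",
    "playbooks/research-brief.md",
    "sources/key-links.md",
    "sources/recurring-docs.md",
    "reviews/data-checklist.md",
    "reviews/writing-checklist.md",
]
WORKSPACE_DIRS = ["outputs/drafts", "outputs/reports"]


def scaffold_workspace(root: Path) -> list[Path]:
    """Create the folders and empty starter files; return the files newly created."""
    created = []
    for rel in WORKSPACE_DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    for rel in WORKSPACE_FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():  # never overwrite a file you have already filled in
            path.touch()
            created.append(path)
    return created
```

Calling scaffold_workspace(Path.home() / "Desktop" / "Codex") would build the tree under a Desktop folder named Codex; running it a second time creates nothing new and leaves existing files untouched.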

What you’re doing here has a name: context engineering—a term popularized by Shopify CEO Tobi Lütke and prominent AI engineer Andrej Karpathy. Getting the right context to the model at the right time accounts for at least half of its performance.

At the start of each session, Codex looks at AGENTS.md, which works as the table of contents. You can write your standing instructions directly in it, but we recommend keeping AGENTS.md short and pointing it at more detailed files: context.md for who you are and what you’re working on, preferences.md for how you want the work done, and rules.md for what it may and may not do without asking.

What to put in your context files

context.md should cover:

Your role and the function you own
Active projects and their current status
The tools you use daily and what each one is for
The people or teams you work with most closely
How decisions typically get made in your context

preferences.md should cover:

Writing style and tone (formal or conversational, terse or thorough)
Communication preferences (what you like to review before it goes out and what can be drafted and queued without your involvement)
Decision-making preferences (when to ask before acting and when to proceed and report back)

rules.md should cover:

What Codex may never do without explicit approval: Send, post, archive, delete, modify a source of truth, or move money
What Codex may do without asking: Draft, summarize, research, outline, organize
Any standing constraints specific to your work (e.g., client confidentiality rules, brand standards, data handling requirements)
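
The two action lists for rules.md amount to a simple permission policy. The sketch below expresses that policy in code to make the logic explicit. It does not describe how Codex enforces rules.md; the identifiers are hypothetical, and treating unlisted actions as needing approval is an added assumption.

```python
# Actions from the rules.md lists above, written as hypothetical identifiers.
NEEDS_APPROVAL = {
    "send", "post", "archive", "delete",
    "modify_source_of_truth", "move_money",
}
ALLOWED_WITHOUT_ASKING = {"draft", "summarize", "research", "outline", "organize"}


def check_action(action: str) -> str:
    """Return 'proceed' for pre-approved actions and 'ask first' for everything else."""
    if action in ALLOWED_WITHOUT_ASKING:
        return "proceed"
    # Covers both the explicit approval list and any action not listed at all
    # (assumption: unknown actions are treated as needing approval).
    return "ask first"


print(check_action("draft"))   # proceed
print(check_action("send"))    # ask first
print(check_action("rename"))  # ask first
```

Failing closed on unlisted actions means a gap in your rules file leads to an extra question rather than an unwanted action.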

Starting prompt—use this to have Codex create your workspace structure:

[First: Create a folder on your desktop called “Codex”]

Set up this folder as a simple Codex workspace for knowledge work.

Create three starter files:

1. context.md—who I am, what I’m working on, what tools I use, and who I work with

2. preferences.md—how I like work to be written, reviewed, and handled

3. rules.md—what you may do without asking, what you must ask before doing, and what you must never do

Interview me one question at a time to gather the information you need to fill in each file.

The “one pinned chat per project” rule

The workspace folder is for your context; pinned chats are for your work. You can find the option to pin a chat next to the chat name in the app’s lefthand navigation bar. A useful habit from day one is to keep one persistent, pinned thread per project or area of responsibility—one for the product launch, one for weekly reporting, one for recruiting—rather than spinning up a fresh chat for every request. A standing thread accumulates context as you go, so Codex remembers what you have already established and you don’t have to re-explain the project each time. Together, a pinned chat and a goal turn Codex into a reliable home for that stream of work.

Part 3: The five levels of Codex use

Codex power users don’t arrive there all at once. They get there in stages, and each stage calls for a different way of thinking about what Codex is doing and what it’s good for. Skip ahead too quickly, and you’ll get frustrated—either you don’t trust it yet, or you haven’t built the infrastructure for more autonomous work. At every level, you should know when to hand work to Codex and when to stay in the loop as its collaborator.

Level 1: One-off knowledge work

Mental model: Codex as a capable, thorough research and drafting assistant.

Mode: Collaborate. At this level, nothing is automated. You run single-session tasks, review everything before it leaves your hands, and build familiarity with how Codex handles different types of work.

Best first tasks:

Summarize a meeting transcript and extract decisions, open questions, and follow-up actions.
Turn scattered notes into a structured outline.
Build a research brief from a set of links and documents.
Rewrite a draft against a style guide.
Create a review checklist for a document, launch plan, or strategy memo.
Convert a written draft into an audio file for editing on the go.

Use the attached [documents/links/notes] to produce [specific artifact]. Prioritize accuracy over elegance. Include source links for any factual claims. Flag anything uncertain or that requires my verification. End with the three questions I should answer before this artifact is ready to use.

Review habit: Before polishing any output, ask Codex to list the assumptions it made and where it is least confident. This surfaces problems before you invest time in refinement.

Move to Level 2 when: You keep wishing Codex remembered what you told it last time.

Level 2: Multi-source workflows

Mental model: Codex as a cross-system analyst that can assemble information you could never pull together manually in a reasonable amount of time.

Mode: Collaborate. At this level, Codex can synthesize outputs from multiple connected systems—Slack threads, Notion pages, email archives, analytics dashboards, and Google Drive documents—but it still needs guidance and feedback.

Example multi-source tasks:

A go-to-market plan built from internal meeting transcripts, Slack discussions, customer notes, and a strategy template
A weekly KPI report from analytics, revenue data, support volume, and social metrics
A summary synthesized from Slack, Notion, Drive links, and past drafts
A weekly leadership brief assembled from team standups, metrics, and open decisions

I need [specific artifact].

Sources to use:

- [Tool 1]: [what to look for there]

- [Tool 2]: [what to look for there]

- [Tool 3]: [what to look for there]

Output format: [describe the structure you want]

Before you start, give me a short plan: Identify the sources you will inspect, the artifact you will produce, any gaps or unknowns you anticipate, and the checks you will run before calling it done. If anything requires sending, posting, archiving, or modifying a source of truth, ask first.

A warning about data: A one-shot attempt at pulling data from multiple systems can be wrong because of stale data, mismatched definitions, permissions gaps, or join errors. For any metric that informs business decisions or agent actions, verify column by column against your primary source. The more a number will be treated as a source of truth, the more carefully it needs to be checked.
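One way to make a column-by-column check concrete is a small comparison script. The sketch below is a minimal illustration in Python, assuming you have exported both the report's numbers and the primary source's numbers as simple name-to-value mappings. The metric names, the values, and the 1 percent tolerance are hypothetical placeholders, not recommendations.

```python
def find_mismatches(report, source, tolerance=0.01):
    """Return metrics whose report value differs from the primary source.

    `report` and `source` map metric names to numbers. `tolerance` is the
    allowed relative difference (0.01 = 1 percent); it is an assumed default.
    """
    problems = {}
    for name in sorted(set(report) | set(source)):
        if name not in source:
            problems[name] = "missing from primary source"
        elif name not in report:
            problems[name] = "missing from report"
        else:
            expected, actual = source[name], report[name]
            # Use the source value as the baseline; guard against dividing by zero.
            baseline = abs(expected) if expected != 0 else 1.0
            relative_gap = abs(actual - expected) / baseline
            if relative_gap > tolerance:
                problems[name] = f"off by {relative_gap:.1%}"
    return problems


# Hypothetical numbers for illustration only.
report = {"weekly_signups": 1050, "mrr": 42000, "tickets": 310}
source = {"weekly_signups": 1000, "mrr": 42000, "refunds": 12}

for metric, issue in find_mismatches(report, source).items():
    print(f"{metric}: {issue}")
```

A script like this only tells you where the two sets of numbers disagree. Deciding which one is right, and why the definitions diverged, is still a human job.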

Make your outputs agent-readable: Plans and reports you generate in Codex will be read by other people—but also, increasingly, by their agents. Write them in plain, structured language that a human can scan and an agent can query. Clear section headers, explicit decisions, and labeled action items make the artifact useful in both directions.

Move to Level 3 when: You keep running the same multi-source workflow more than once a week and wishing it happened automatically.

Level 3: Repeated chores into persistent workflows

Mental model: Codex as an automated operations layer that handles predictable, recurring work so you don’t have to.

Mode: Hybrid. Some tasks are fully predictable and can run without back-and-forth. These tasks are ripe for delegation. Tasks that involve judgment, strategy, or creative decisions suit collaboration.

A useful heuristic: If you could write a checklist that covers 90 percent of the cases, delegate it. If you would need to think about it differently each time, collaborate.

In either case, look for “computer chores”—recurring tasks that take time and attention, but don’t require human judgment at every single touchpoint.

Common chore candidates:

End-of-day check for unanswered Slack messages and emails, with drafted replies
Weekly metrics brief from analytics, revenue, and support data
Meeting-note cleanup and action-item extraction after each recorded call
Customer support pattern detection and issue routing
Draft-to-review package that formats a piece for editor handoff
Recruiting research for an open role

Before building any persistent workflow, fill out this template. It becomes the instruction file Codex reads every time the workflow runs. (The workflows in Part 4 are each an example of this canvas applied.)

Workflow name:

Trigger or cadence:

Input sources:

Output artifact:

Approval rules:

What Codex may do without asking:

What Codex must ask before doing:

Verification steps:

Where the final output lives:

When to retire or revise this workflow:
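If you keep these canvases as files, a small script can check that no field was left blank before a workflow goes live. The sketch below is one hypothetical way to represent the template in Python. The attribute names mirror the template above, and the example values are invented.

```python
from dataclasses import dataclass, fields


@dataclass
class WorkflowCanvas:
    # One attribute per line of the template above.
    name: str = ""
    trigger_or_cadence: str = ""
    input_sources: str = ""
    output_artifact: str = ""
    approval_rules: str = ""
    may_do_without_asking: str = ""
    must_ask_before_doing: str = ""
    verification_steps: str = ""
    output_location: str = ""
    retire_or_revise_when: str = ""

    def missing_fields(self):
        """Return the names of fields that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Invented example: a partially completed canvas.
canvas = WorkflowCanvas(
    name="Weekly metrics brief",
    trigger_or_cadence="Mondays at 9 a.m.",
    input_sources="Analytics, revenue, and support data",
    output_artifact="One-page brief",
)
print(canvas.missing_fields())
```

Here the check would report the approval, permission, verification, location, and retirement fields as unfinished, which are the fields most likely to be skipped and most likely to matter later.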

Review discipline for automated workflows: Don’t review automated output inside Codex. Draft in Codex, then review in the destination app—Slack for Slack messages, Gmail for email drafts, word processors for documents. Content that looks fine in a terminal often reads differently in the tool where it’s ultimately used, and the context switch catches things a Codex review pass would miss.

Move to Level 4 when: Your prompt-based workflow hits a ceiling—the task is too complex or too custom to handle in text alone, and a small script or local tool would make it reliable.

Level 4: Build small tools when prompts are not enough

Mental model: Codex as a builder that creates lightweight infrastructure to make your workflows more reliable, faster, or more repeatable.

Sometimes the best Codex output is a small script, a local app, a custom dashboard, or a review surface that makes a recurring workflow easier, rather than pure text.

Mode: Hybrid. In some cases, Codex may generate an artifact independently for you to review and then move on. In others, the artifact it produces may become a space where you and the agent iterate together.

Examples of when a small tool helps:

A recurring workflow that requires pulling from an API that has no Codex integration. A short script handles the connection reliably.
A review process where you need to see formatted output side by side with the source. A simple local app gives you the view.
A task that needs to run on a schedule without your involvement. A script set to run on a timer (a cron job) handles the timing.
A workflow that accumulates structured data over time. A lightweight database or structured file tracks it persistently.
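To make the last two examples concrete, here is a minimal sketch of the kind of script Codex might produce: each time it runs, it appends one dated record to a JSON Lines file. The file name, the record fields, and the cron schedule in the comment are all assumptions for illustration; the real values would come from whatever your workflow collects.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def append_record(log_path, record, today=None):
    """Append one dated record to a JSON Lines file and return the stored row."""
    row = {"date": (today or date.today()).isoformat(), **record}
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(row) + "\n")
    return row


def load_records(log_path):
    """Read every stored row back, oldest first."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Example cron entry (an assumption; adjust the paths to your machine) that
# would run a script like this every Monday at 9:00:
#   0 9 * * 1 /usr/bin/python3 /path/to/this_script.py

# Demo in a throwaway folder so nothing on your machine is touched.
with tempfile.TemporaryDirectory() as folder:
    log_file = Path(folder) / "metrics-log.jsonl"
    append_record(log_file, {"support_tickets": 42})
    append_record(log_file, {"support_tickets": 37})
    print(len(load_records(log_file)), "records stored")
```

A plain text file with one record per line is easy to open and inspect by hand, which fits the guide's emphasis on knowing what data a tool touches.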

Practical approach for non-engineers:

Run the task manually in Codex once to confirm the output is what you want
Ask Codex: “Which steps in this workflow could be made more reliable with a small script or tool?”
Have Codex prototype the tool and explain what it does in plain language
Run it on your data and verify the output matches what the manual process produced
Keep only the parts that reduce friction. Discard what adds complexity without benefit.
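The verification step in the list above can itself be lightly automated. The sketch below, which uses Python's standard difflib module, compares the text you produced manually with the text the new tool produced and prints only the lines that differ. The sample outputs are invented.

```python
import difflib


def compare_outputs(manual_text, tool_text):
    """Return a list of diff lines; an empty list means the outputs match."""
    return list(
        difflib.unified_diff(
            manual_text.splitlines(),
            tool_text.splitlines(),
            fromfile="manual",
            tofile="tool",
            lineterm="",
        )
    )


# Invented sample outputs for illustration.
manual = "Signups: 120\nChurned accounts: 4\nOpen tickets: 17"
tool = "Signups: 120\nChurned accounts: 5\nOpen tickets: 17"

differences = compare_outputs(manual, tool)
if differences:
    print("\n".join(differences))
else:
    print("Outputs match.")
```

Lines starting with a minus sign come from the manual run and lines starting with a plus sign come from the tool, so a mismatch like the churn figure here is easy to spot.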

You don’t need to understand every line of code to use a tool Codex built. You do need to understand what data it touches, what it produces, and where the review step is. If you can’t explain those three things, the tool isn’t ready to run autonomously.

Move to Level 5 when: You give Codex the same feedback repeatedly and have standing preferences that you’d prefer it to apply on its own.

Level 5: Compound your Codex system

Mental model: Codex as a system that can improve over time when you save useful workflows, maintain review rules, and use memories or skills to codify preferences where available.

Mode: Hybrid. Some instructions will dictate how the agent approaches autonomous work; others will guide how the model interacts with you in collaboration mode.

The idea of “compounding” work comes from compound engineering, the AI-native coding methodology coined by Kieran Klaassen and Nityesh Agarwal while building Cora, Every’s email client. The canonical example is a product requirements document (PRD) that writes the scaffolding for the next one: The artifact you produce becomes the tool that speeds up the next round. The four habits below are how you put it into practice as a knowledge worker, not just an engineer.

Remember: Each useful session should make future sessions faster and more reliable. In practice, that requires doing four things consistently after completing any significant piece of work:

1. Save successful prompts as workflow files. When a prompt produces exactly the right output, document it. Write down the input sources, the exact prompt, the output format, and the review step. Save it in your workflows/ folder. The next time you need the same output, the agent will have that reference to work from.

2. Add mistakes to review checklists. When Codex gets something wrong—a number that was off, a tone that missed the mark, or an assumption it should not have made—add a specific check to your relevant review file, and instruct Codex to check its work against those guardrails.

3. Update your context files after major projects. When a project ends, update context.md to reflect what changed—new priorities, new tools, what worked, and what didn’t. Codex can use this when you point it to the file, turn it into a skill/workflow, or store the pattern in Codex memory where available.

4. Ask Codex to identify compounding opportunities. At the end of any session where you did something useful, run this prompt:

Based on what we just did, what parts of this workflow should become a reusable skill, an automation, or a small tool? What context should I add to my project files so we don’t have to re-establish this next time?

Forking for your discipline: The compound engineering plugin is Every’s open-source system for structured agent workflows, installable in Codex with one command. It works for knowledge work out of the box, but its review agents are optimized for coding needs like establishing frontend patterns and reviewing for code performance.

Knowledge workers can fork it into a version with reviewers tuned for strategic alignment, data accuracy, writing quality, and communication standards. A forked version, compound knowledge, is publicly available on Every’s GitHub, and is designed to be readable and editable by non-engineers.

Part 4: Workflow library

These workflows are meant as inspiration to get you started. Adapt the inputs, outputs, and approval rules to your specific tools and standards.

1. Inbox zero review queue

Best for: Anyone whose email backlog is a recurring source of anxiety or dropped balls.

Input sources: Gmail or your email client of choice.

Output artifact: A structured list of draft replies, proposed actions (archive, delegate, flag), and any emails flagged for your personal attention because the draft alone isn’t sufficient.

Dan Shipper kept inbox zero for 10 days straight with Codex. To use this workflow, have Codex:

Gather email through Cora running in the in-app browser.
Render the email queue as a single page.
Go through each item with you as you dictate the action the AI should take (e.g., “research this,” “draft that,” “pull the documents our lawyers asked for”). You can do this via chat or by voice with a dictation tool like Monologue (we recommend the latter).

First prompt:

Go through my inbox for the past [time period].

For each email that needs a response or action:

1. Categorize it: needs reply/needs action/can archive/already handled

2. If it needs a reply, draft one in my voice using the style in preferences.md

3. If it needs action, describe the action clearly

4. Flag any email where a draft reply isn’t enough—where I need to think about this personally before responding

Don’t send anything. Create drafts only. I will review in Gmail.

Review step: Review all drafts in Gmail before sending. Don’t approve from inside Codex.

How to compound: After a few sessions, add a rule file describing your categorization preferences—which senders always get priority, which topics can be archived without reply, and which types of requests need a human-written response.
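A rule file like this is written in plain language for Codex to read, but it can help to think of it as a short, ordered set of checks. The sketch below expresses that idea in Python. The sender addresses, topic keywords, and category labels are hypothetical, and real rules would come from your own preferences.

```python
# Hypothetical rules; in practice these would live in your own rule file.
RULES = {
    "priority_senders": {"ceo@example.com", "legal@example.com"},
    "archive_topics": {"newsletter", "receipt"},
    "human_reply_topics": {"contract", "complaint"},
}


def categorize(sender, subject, rules=RULES):
    """Return a category for one email based on simple, ordered rules."""
    subject_lower = subject.lower()
    if sender.lower() in rules["priority_senders"]:
        return "priority"
    # Check the cautious rule first so it wins when two rules both match.
    if any(topic in subject_lower for topic in rules["human_reply_topics"]):
        return "needs human-written reply"
    if any(topic in subject_lower for topic in rules["archive_topics"]):
        return "archive without reply"
    return "draft reply for review"


print(categorize("billing@example.com", "Your receipt for May"))
print(categorize("client@example.com", "Contract renewal"))
```

The ordering is a design choice in this sketch: when a message matches both an archive rule and a human-reply rule, the more cautious category wins.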

2. Daily unanswered message roundup

Best for: Anyone who communicates across Slack, email, and other channels and loses track of what still needs a response.

Input sources: Slack, Gmail, any other communication tool you use.

Output artifact: A list of unanswered items with drafted replies or proposed reactions, organized by urgency.

First prompt:

Look across my Slack and Gmail for the past 24 hours. Find everything that was directed at me that I have not responded to.

For each item:

1. Draft a reply or suggest a reaction (thumbs up, etc.) if a short acknowledgment is appropriate

2. Flag items where a more considered response is needed

3. Flag anything time-sensitive

Present the list organized by urgency. Don’t send anything.

Review step: Review in Slack and Gmail.

How to compound: After a few runs, save a rules file specifying which Slack channels are high-priority, which senders always warrant a human response, and which types of messages can be handled with a reaction rather than a reply.

3. Research brief creation

Best for: Anyone preparing for a meeting, a pitch, a content piece, or a strategic decision and needing a thorough, sourced summary of a topic.

Input sources: Provided links, Notion, Drive, web search.

Output artifact: A structured brief with background, key facts, open questions, and source links.

First prompt:

Build a research brief on [topic].

Sources to prioritize: [List any specific links, documents, or databases].

Structure the brief as:

- Background: what I need to know to have a smart conversation about this

- Key facts and data points, each with a source link

- Competing perspectives or significant disagreements in the field

- Open questions I should be able to answer before [meeting/decision/deadline]

- Three things I should read next if I want to go deeper

Flag any claims you are less than confident about.

Review step: Check source links. Verify any statistics against the original source before using them.

How to compound: Save a brief template in your workflows/ folder. After each brief, add any recurring sources (newsletters, databases, key authors) to your sources/key-links.md so Codex checks them by default.

4. Writing with a parallel review loop

Best for: Writers who want Codex running alongside them as they draft—checking the work, flagging issues, and responding in parallel without interrupting the writing session.

Input sources: Your draft (open in your word processor through Codex’s in-app browser), any relevant style guides, source documents, or review standards in your workspace.

Output artifact: An annotated draft with inline feedback, flagged issues, and suggested revisions—produced continuously as you write rather than in a single pass at the end.

Setup: Open your draft in Proof or the in-app browser. Start a Codex session with your workspace context loaded. Give Codex standing instructions for what to monitor and how to respond.

First prompt:

I am writing [describe the piece—type, audience, purpose].

As I draft, run a continuous review loop. Check for:

- Claims that need a source or are stated with more confidence than the evidence supports

- Passages where the argument loses clarity or the logic has a gap

- Sentences that violate the style preferences in preferences.md

- Anything that reads as filler, throat-clearing, or AI-generated phrasing

Don’t rewrite anything without being asked. Flag issues as I go with a brief note on what the problem is and what would fix it. Check in every [X minutes / X paragraphs] or when I ask.

Review step: Read the flagged issues at natural stopping points—the end of a section or session. Decide which to address and which to dismiss. Don’t let the feedback loop interrupt the drafting flow; the value is in the accumulation, not in responding to every flag in real time.

How to compound: After each writing session, add any recurring flags to your reviews/writing-checklist.md. Patterns that come up repeatedly are candidates for a standing rule in your preferences file, so Codex catches them automatically next time.

5. Source management for research

Best for: Writers and researchers who need to organize source material before drafting.

Input sources: Links, PDFs, past drafts, notes, transcripts.

Output artifact: A structured document with the core argument, supporting evidence organized by claim, counterarguments, and a gap analysis (what is still missing).

First prompt:

I am writing a piece on [topic]. The core argument I want to make is [argument].

Here are my source materials: [links/documents].

Build an evidence room that:

1. States the core argument clearly

2. Lists the strongest supporting evidence for each main point, with source links

3. Lists the strongest counterarguments and how I might address them

4. Identifies any gaps—claims I am making that lack strong evidence

5. Flags any sources that conflict with each other

Review step: Read the evidence room before drafting. Verify any statistics or quotes you plan to use directly.

How to compound: Save the evidence format as a workflow template. Add a standing note to your context file about your writing voice and recurring themes so Codex calibrates its framing.

6. Information via audio

Best for: Anyone who processes information better by listening than reading, or who wants to take time away from a screen but stay on top of work.

Input sources: Any written content: drafts, research briefs, meeting summaries, strategy documents, reports, lengthy emails, articles.

Output artifact: An audio file saved to a location accessible from your phone (Dropbox, Drive, etc.).

First prompt:

Convert the attached [document/draft/report] into a clear audio file. Read it at a natural pace—not rushed, not slow. Save it to [Dropbox/Drive location] as [filename].

Review step: Listen on your commute, walk, or wherever you have time away from a screen. Take notes on your phone as things come up. Return to the source material with whatever you noticed.

How to compound: Add a standing instruction to your context file covering your audio preferences—such as speed, file format, naming convention, and preferred save location—so you do not have to specify each time. You can also prompt Codex to convert content automatically at the end of certain workflows: “After generating the weekly metrics report, convert it to audio and save to [location].”

7. Go-to-market plan generator

Best for: Anyone responsible for launching a product, feature, or initiative and who has done the thinking in meetings and Slack but has not had time to formalize it.

Input sources: Meeting transcripts, Slack threads, customer notes, a preferred strategy template.

Output artifact: A complete go-to-market plan, structured for human review and agent querying.

First prompt:

Build a go-to-market plan for [product/initiative].

Sources to pull from:

- Meeting transcripts: [Notion location or links]

- Slack discussions: [channels or search terms]

- Customer research: [document or location]

- Template to follow: [link or paste template]

The plan should be readable by a human in five minutes and structured so that an agent can answer specific questions about it (e.g., “What is the target ICP?” “What is the launch timeline?”).

Start with a compound engineering brainstorm step. Give me a draft in Proof or Notion. Flag anything in the plan you added that was not in the source material—I only want synthesis of what we have already decided, not new suggestions baked in.

Review step: Review in Notion or Proof. Verify that every major claim traces to something in the source material. Anything the model added that was not in your sources should be flagged for your decision.

How to compound: Save the template and the prompt. After each launch, add a retrospective note to your context file about what the plan got right and wrong. Future plans will be calibrated by past ones.

8. KPI report

Best for: Anyone responsible for tracking metrics and needing a regular, reliable view across multiple data sources.

Input sources: Analytics (PostHog, Mixpanel, Amplitude), revenue data (Stripe), support volume, social metrics, saved past reports.

Output artifact: A one-page report covering headlines, usage metrics, system health, and follow-up items.

First prompt:

Generate a product pulse report for [time period].

Data sources:

- Product analytics: [tool and what to pull]

- Revenue: [tool and what to pull]

- Support: [tool and what to pull]

- Social: [tool and what to pull]

Structure:

1. Headlines (three to five bullets summarizing what matters most)

2. Usage (primary engagement metric, value-realization metric, conversions, deltas vs. prior period)

3. System health (error rates, latency, top error signatures)

4. Follow-ups (one to five things worth investigating, specific enough to act on)

Flag any number that differs significantly from the prior report. If something is anomalous, investigate one level deeper before including it.

Review step: Verify every number in the report against its source. Don’t use this report as a business source of truth until you have confirmed accuracy column by column. In practice, one-shot metrics pulls are often five to 10 percent off—a common result of definition mismatches and join errors across multi-source pulls.

How to compound: Save each report as a dated file in your outputs/reports/ folder. Over time, Codex can compare reports, identify trends, and flag when something has changed. The folder becomes the working memory of your product.
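Once reports are saved as dated files, comparing one period with the previous one is simple arithmetic. The sketch below shows one way to flag metrics that moved sharply. It assumes each report has already been reduced to a mapping of metric names to numbers; the figures and the 10 percent threshold are invented for illustration.

```python
def flag_changes(current, prior, threshold=0.10):
    """Return metrics whose value moved more than `threshold` vs. the prior report.

    The 10 percent default is an arbitrary placeholder, not a recommendation.
    """
    flagged = {}
    for name, value in current.items():
        previous = prior.get(name)
        if previous is None or previous == 0:
            # No usable baseline; leave it for a human to interpret.
            continue
        change = (value - previous) / previous
        if abs(change) > threshold:
            flagged[name] = round(change, 3)
    return flagged


# Invented figures for two consecutive weekly reports.
prior = {"active_users": 2000, "conversions": 150, "error_rate": 0.020}
current = {"active_users": 2100, "conversions": 120, "error_rate": 0.031}

for metric, change in flag_changes(current, prior).items():
    print(f"{metric}: {change:+.1%} vs. prior report")
```

A flag from a script like this is a prompt to investigate, not a finding. The underlying numbers still need to be verified against their sources as described in the review step.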

9. Customer support for product work

Best for: Teams where support patterns should feed into product decisions and small fixes.

Input sources: Support platform (Intercom, Zendesk), issue tracker (Linear, GitHub Issues).

Output artifact: A deduplicated list of issues with suggested priority, plus small issues ready to hand off for fixes.

First prompt:

Go through my support queue for the past [time period].

For each support thread:

1. Identify the underlying issue or request.

2. Check whether a similar issue already exists in [Linear/GitHub Issues].

3. If it does, link them. If it doesn’t, draft a new issue.

4. Flag any issue that appears more than [threshold] times—these are priorities.

5. For issues that appear straightforward to fix, note that they are candidates for direct implementation.

Don’t create issues in the tracker yet. Give me the list to review first.

Review step: Review the issue list before anything goes into the tracker. Confirm deduplication is accurate—support tickets often describe the same underlying problem in different words.

How to compound: After each session, add a note about recurring issue types so Codex can categorize faster next time. Build a persistent list of known issues so deduplication improves over time.
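The deduplication and threshold steps in this workflow can be sketched in a few lines. The example below groups ticket summaries by wording similarity using Python's standard difflib module, then lists the groups that recur more often than a threshold. It is a rough illustration, not how Codex performs the task: the 0.75 similarity cutoff and the sample tickets are assumptions, and wording-based matching will miss tickets that describe the same problem in very different words.

```python
import difflib
from collections import Counter


def group_similar(summaries, cutoff=0.75):
    """Group issue summaries whose wording is similar.

    Similarity is difflib's ratio on lowercased text. The cutoff is an
    assumed starting point, so human review of the groups is still needed.
    """
    groups = []  # each group is a list of summaries; the first is the label
    for summary in summaries:
        for group in groups:
            ratio = difflib.SequenceMatcher(
                None, summary.lower(), group[0].lower()
            ).ratio()
            if ratio >= cutoff:
                group.append(summary)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([summary])
    return groups


def priorities(groups, threshold=2):
    """Return labels of groups that appear more than `threshold` times."""
    counts = Counter({group[0]: len(group) for group in groups})
    return [label for label, count in counts.most_common() if count > threshold]


# Invented ticket summaries for illustration.
tickets = [
    "Login button not working",
    "login button not working",
    "Login button is not working",
    "Export to CSV fails",
]
groups = group_similar(tickets)
print(priorities(groups, threshold=2))
```

This is why the review step matters: a script can catch near-identical wording, but confirming that two differently worded tickets share a root cause takes judgment.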

10. Pull requests for non-engineers

Best for: Anyone who needs to make a small, well-scoped change to a codebase—such as copy updates, configuration changes, or content edits—without deep engineering knowledge.

Input sources: The relevant files or repository, and a clear description of the change.

Output artifact: A pull request (PR) that is reviewer-friendly and doesn’t touch anything outside the intended scope.

First prompt:

I need to make the following change: [describe the change clearly].

Before making any changes:

1. Show me which files are affected

2. Confirm the scope of the change—nothing outside these files should be touched

3. Explain what you are going to do in plain language before doing it

After making the change:

1. Summarize what was changed and why

2. List every file that was touched

3. Explain how you verified the change is correct

4. Flag anything a reviewer should look at carefully

Make the smallest useful change. Don’t refactor or improve anything adjacent.

Review step: Review the Codex preview before the PR is opened. Review the PR itself in GitHub or your code review tool. Ask a technical colleague to approve before merging if you are uncertain.

How to compound: Save a template of your preferred PR format. After each PR, add a note about anything that required correction so future PRs avoid the same issue.
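The scope check in this workflow comes down to comparing two lists: the files you approved and the files that actually changed. The sketch below shows that comparison in Python. The file names are invented; in a real repository, the list of changed files could come from a command such as git diff --name-only.

```python
def out_of_scope(changed_files, approved_files):
    """Return changed files that were not in the approved list, sorted."""
    return sorted(set(changed_files) - set(approved_files))


# Invented file names for illustration.
approved = ["site/pricing.md"]
changed = ["site/pricing.md", "site/config.yaml"]

unexpected = out_of_scope(changed, approved)
if unexpected:
    print("Review before opening the PR:", ", ".join(unexpected))
else:
    print("All changes are within the approved scope.")
```

An empty result does not prove the change is correct. It only confirms that nothing outside the agreed files was touched.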

11. Recruiting research

Best for: Anyone doing outbound recruiting for a role with a specific background profile.

Input sources: LinkedIn, Twitter/X, company websites, alumni databases, public professional networks.

Output artifact: A list of candidates with background summaries and contact information or connection points.

First prompt:

I am hiring for [role]. The ideal candidate has [background profile: experience, prior companies, skills, career trajectory].

Search for candidates who match this profile. For each candidate:

1. Summarize their background in two to three sentences

2. Note why they match the profile

3. Identify any connection point (mutual connections, follows, shared affiliations)

4. Provide a link to their public profile

Return the top [number] candidates, ranked by how closely they match the profile.

Review step: Review each candidate before any outreach. Verify that the background summaries are accurate by checking the linked profiles. Don’t send any outreach through Codex.

How to compound: Save the role profile as a template. After a successful hire, document what the actual background looked like versus the initial profile to calibrate future searches.

12. Strategy and planning agent

Best for: Leaders and operators who need to compress OKR planning, quarterly planning, or strategic reviews from days to hours.

Input sources: Past planning documents, meeting transcripts, leadership context notes, relevant metrics.

Output artifact: A draft plan or OKR set, structured for review and iteration.

First prompt:

I need to draft [quarterly plan / OKR set / strategic review] for [scope].

Pull from:

- Past plans: [location]

- Recent meeting transcripts: [location]

- Current metrics: [location or description]

- Leadership context: [document or description]

Structure the output as [desired format].

Flag any goal or initiative you are recommending that doesn’t have explicit support in the source material. I want synthesis of what has already been decided, not new recommendations baked in without my review.

Review step: Review in Notion or Proof. Before sharing with leadership or the team, confirm that every major commitment traces to a decision that was actually made.

How to compound: After each planning cycle, add a retrospective to your context file. Did the goals prove achievable? What was missing from the original plan? Future planning sessions will be informed by past ones.

13. Personal learning tool

Best for: Anyone who wants to use Codex to support skill-building, practice, or self-directed learning.

Input sources: External APIs, files, structured practice materials, your own notes.

Output artifact: A custom interactive tool—like a tutor, a quiz, or a practice environment—built for your learning goal.

Example: A musician wants to practice chord identification. They connect a MIDI keyboard and describe what they want, and Codex builds a small app that listens to what they play, identifies the chord, and tracks progress over time.
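To give a sense of how small the core of such a tool can be, the sketch below identifies a three-note chord from MIDI note numbers, where 60 is middle C. It is a simplified illustration, not the app described above: it covers only major, minor, and diminished triads, and it leaves out reading from a keyboard and tracking progress.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Intervals above the root, in semitones, for three common triads.
TRIADS = {
    frozenset({0, 4, 7}): "major",
    frozenset({0, 3, 7}): "minor",
    frozenset({0, 3, 6}): "diminished",
}


def identify_triad(midi_notes):
    """Name a triad from MIDI note numbers, or return None if unrecognized.

    Notes are reduced to pitch classes (0 = C ... 11 = B), so octave and
    inversion do not matter.
    """
    pitch_classes = {note % 12 for note in midi_notes}
    # Try each played note as the root and see if the intervals fit a triad.
    for root in sorted(pitch_classes):
        intervals = frozenset((pc - root) % 12 for pc in pitch_classes)
        quality = TRIADS.get(intervals)
        if quality:
            return f"{NOTE_NAMES[root]} {quality}"
    return None


print(identify_triad([60, 64, 67]))  # C, E, G
print(identify_triad([64, 67, 72]))  # the same notes, inverted
print(identify_triad([57, 60, 64]))  # A, C, E
```

The first two calls name the same chord because the notes are identical apart from octave, and the third names an A minor triad.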

First prompt:

I want to build a personal learning tool for [skill or subject].

My current level: [beginner/intermediate/what I know already].

What I want to practice: [specific aspect of the skill].

How I want feedback: [immediate/after each session/scored].

Build a prototype I can use locally. Explain what it does and how to use it before I start.

Review step: Try the tool on real practice material before committing to it. Verify it is actually testing what you intended.

How to compound: After each practice session, ask Codex to update the tool based on what you found most and least useful. The tool improves as your needs become clearer.

Part 5: Operating Codex well

How to Steer Codex

Operating Codex well is management work. You evaluate talent (which prompts, agents, and workflows to trust), set vision (what to point Codex at, and what “done” should look like), exercise taste (catching output that is technically correct but wrong for the moment), and know when to let it run and when to take the wheel.

Give Codex an outcome. Describe what you want to end up with, not how to get there. “Build a research brief on [topic] with these sources and this structure” produces better results than “First search Slack, then search Notion, then...”

Ask for a plan before long-running work. For any task that will take more than a few minutes or touch multiple systems, ask Codex to explain what it’s about to do before it starts. This catches misunderstandings early and gives you a chance to redirect it before it gets too far along.

Ask Codex what it needs before it starts. For complex tasks, a short briefing prompt saves time:

Before you start, tell me what additional context would help you do this better. What are the most important things you would want to know?

Require citations and audit trails for important claims. Any document that will be shared or used for decisions should have source links for factual claims. Make this a standing rule in your preferences file.

Don’t over-manage every micro-step once the plan is good. Once you have confirmed the approach, let Codex work. Interrupting undermines autonomous operation and produces worse results than reviewing the completed output.

Review in the destination app. Always.

Set explicit no-send/no-post/no-archive/no-modify rules in your rules file. These should apply by default to any sensitive workflow. Make Codex ask before taking any action that can’t easily be undone.

Three questions to ask before approving any significant output:

What was the hardest decision you made in producing this?

What alternatives did you consider and reject?

Where are you least confident?

These questions surface the judgment calls the model made, the options it dismissed, and the places most likely to contain errors.

Safety, trust, and risks

Risk categories

Green—proceed with standard review: Summaries, outlines, internal drafts, research briefs, personal notes, low-stakes scripts.

Yellow—review carefully before sharing or acting: Strategy documents, customer-support drafts, product specs, recruiting research, non-destructive data pulls, PR drafts for small changes.

Red—don’t proceed without explicit human verification: Sending messages to clients or customers, changing source-of-truth data, making production code changes, moving money, legal or compliance claims, unreconciled metrics used for business decisions.
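The three categories can be written down as a lookup table. The sketch below is one hypothetical encoding in Python: the assignments follow the lists above, the short keys are labels invented for this example, and treating anything unlisted as red is a cautious default chosen for the sketch rather than a rule from this guide.

```python
# Category assignments follow the three lists above; the short keys are
# hypothetical labels chosen for this sketch.
RISK_LEVELS = {
    "summary": "green",
    "internal_draft": "green",
    "research_brief": "green",
    "strategy_document": "yellow",
    "customer_support_draft": "yellow",
    "recruiting_research": "yellow",
    "send_customer_message": "red",
    "change_source_of_truth": "red",
    "production_code_change": "red",
    "move_money": "red",
}


def risk_level(task_type):
    """Look up a task's risk level, treating anything unlisted as red."""
    return RISK_LEVELS.get(task_type, "red")


def needs_human_verification(task_type):
    """Red tasks should not proceed without explicit human verification."""
    return risk_level(task_type) == "red"


print(risk_level("research_brief"))
print(needs_human_verification("move_money"))
print(risk_level("something_new"))
```

Defaulting unknown tasks to the strictest category means a new kind of work has to be classified deliberately before it gets a lighter review.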

Common failure modes and how to handle them

Confident wrongness. Codex can state incorrect facts with high confidence. For any factual claim that matters, verify against the source. Never pass a statistic or claim to another person without checking it.

Metrics errors. Joining data from multiple sources introduces definition mismatches and calculation errors. Verify column by column for any metric used in decisions.

Out-of-scope changes. Codex sometimes modifies files or makes improvements adjacent to the task you assigned. Review the line-by-line record of what changed (called a “diff”), not just the final output, especially for any task involving code.

Automations that break. Persistent workflows stop working when tools update their APIs, credentials expire, or context files become stale. Every automation needs an owner who checks it periodically. Sever that connection—stop tending it—and the agent stops being useful. “Set it and forget it” isn’t a stable operating mode.

Plugin and integration failure. Plugins and integrations need maintenance: Permissions expire, APIs change, configurations need updates, and some changes require restarting Codex. Integration failures—particularly with Notion and Gmail—happen and aren’t always obvious. If a workflow produces strange output, check whether the connection is still working before assuming the prompt is wrong.

Usage limits. Long-running sessions can hit usage limits and stop mid-task. For complex workflows, break work into stages rather than attempting everything in a single session.

Untrusted input. Anything Codex reads—an email, a web page, a shared document, a support ticket—can contain instructions aimed at the agent rather than at you, sometimes hidden from human eyes. If Codex is browsing untrusted sites or processing external messages while holding broad write access, those buried instructions can turn into actions—like sending data where it shouldn’t go. So keep destructive actions behind approval, and scope each workflow to the least access it needs, so a hijacked instruction has nowhere to go.
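The two safeguards in this paragraph, least access and approval before destructive actions, can be pictured as a single check that runs before any action. The sketch below is a conceptual illustration in Python, not a Codex feature or API: the action names, the scope set, and the exception class are all hypothetical.

```python
class ApprovalRequired(Exception):
    """Raised when an action needs a human decision before it can run."""


# Hypothetical action names; "destructive" here means hard to undo.
DESTRUCTIVE_ACTIONS = {"send", "post", "archive", "modify", "delete"}


def check_action(action, workflow_scope, approved=False):
    """Allow an action only if it is in scope and, when destructive, approved."""
    if action not in workflow_scope:
        raise PermissionError(f"'{action}' is outside this workflow's scope")
    if action in DESTRUCTIVE_ACTIONS and not approved:
        raise ApprovalRequired(f"'{action}' needs explicit approval")
    return True


# A read-and-draft workflow gets only the access it needs.
inbox_scope = {"read", "draft"}
print(check_action("draft", inbox_scope))

try:
    check_action("send", inbox_scope)
except PermissionError as error:
    print("Blocked:", error)
```

With a scope this narrow, an instruction hidden in an email that tries to trigger a send is rejected before the approval question even arises.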

The human ownership standard: Codex can touch any artifact in your workflow, but a human must direct the work, stand behind the output, and be able to discuss any specific decision in it. If someone asks you about a bullet point in a document Codex drafted, you should be able to answer. An AI-drafted document is fine—expected, even—but if someone talks it through with you and it’s clear you have no idea what’s in it, that’s a problem.

Team workflows: From personal Codex to shared operating system

Individual Codex workflows compound over time. Team workflows compound faster but require coordination.

What changes when a team uses Codex

Teams build trust in agents through the humans who operate them. When a colleague receives a document or plan that Codex drafted, they trust it to the degree they trust the person who shared it.

Infrastructure that makes team Codex work

Shared review surfaces. A shared document review tool (Proof, Notion, Google Docs) makes agent-generated documents easier to inspect and comment on than outputs reviewed only inside Codex.

Codex-mediated routing. Teams can combine Codex threads, automations, Slack or GitHub integrations, remote connections, and app-server APIs to build routing workflows: Requests arrive in Slack, email, or another shared intake surface; Codex helps triage them, creates reviewable tasks or drafts, and routes the work to the right human or Codex workspace for execution. Each route needs clear ownership, permissions, review rules, and a source of truth. For teams doing a lot of cross-functional requests, such as legal reviews, data pulls, or copy approvals, this pattern removes significant coordination overhead.

A key mechanic for this style of work is giving Codex its own email address. Codex doesn’t come with one; you set it up with a tool like Nylas that gives an agent an inbox. Once it has that address, you can treat it like another teammate. Routes built on an email address still need the same discipline as any other: a clear owner, scoped permissions, and a review step before anything goes back out.
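One way to keep that discipline from slipping is to write each route down as data and refuse to switch it on until every field is filled in. The sketch below is illustrative only: `Route`, its field names, and the example values are hypothetical and are not part of Codex or any integration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Route:
    """One routing rule. Every name here is hypothetical, for illustration."""
    name: str
    owner: str            # the named human accountable for this route
    permissions: str      # the narrowest access the route needs
    review_rule: str      # who checks the output before it goes anywhere
    source_of_truth: str  # where the authoritative record lives


def missing_fields(route: Route) -> list[str]:
    """Return the names of fields that were left blank."""
    return [f.name for f in fields(route) if not getattr(route, f.name).strip()]


legal_review = Route(
    name="legal-review",
    owner="",  # deliberately blank to show the check
    permissions="read-only access to the contracts folder",
    review_rule="counsel approves before any reply is sent",
    source_of_truth="contract tracker",
)
print(missing_fields(legal_review))  # ['owner']
```

A route with an empty list of missing fields is not automatically safe, but a route with a blank owner or review rule is a clear sign it is not ready to run.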

Agent-readable shared documentation. Plans, strategy documents, and operational guides written for both human and agent readers become shared infrastructure. Any team member—or any team member’s agent—can query them for specific information without interrupting the author.

Explicit ownership. Every persistent workflow needs a named owner. That person is responsible for monitoring output quality, updating the workflow when it breaks, and retiring it when it’s no longer useful. Automation degrades without ownership.

A simple way to get a team to use Codex

Don’t try to convert everyone. As a rule of thumb, a tenth of any team will adopt a new tool no matter what, a tenth never will, and the other 80 percent come along once someone shows them how it helps their own job. Aim at that 80 percent. Three things, done together, help adoption along:

A note from a leader that makes using AI the expectation, not a nice-to-have
A weekly meeting where anyone can show a prompt or workflow they’ve built
A regular message that names the people whose work stood out

Set the expectation, give people a place to share what works, and recognize them for it—that’s most of the battle.

Part 6: Getting started

The seven-day Codex power-user plan

Day 1: Connect and inspect. Install the Codex desktop app. Connect your primary tools—Gmail, Slack, Notion, Drive, and any analytics or support tools you use. Run the workflow discovery prompt from Part 2 and review the three automation suggestions Codex returns. Don’t build anything yet. Just read the suggestions and identify which one is most useful.

Day 2: Create your context files. Create your codex-workspace/ folder. Write context.md, preferences.md, and rules.md. Keep each one to one page. The goal is to capture the most important things Codex should know about you—not to be exhaustive.

Day 3: Run three one-off tasks. Choose one summary task, one research brief, and one draft or plan. Use the prompt patterns from Level 1. Review each output carefully and note where Codex got things right and where it needed correction.

Day 4: Build your first workflow. Take the most useful automation suggestion from Day 1 and fill out the workflow canvas from Level 3. Save it to workflows/ in your workspace. Run it once manually and verify the output.

Day 5: Add review rules. Create reviews/data-checklist.md, reviews/writing-checklist.md, and reviews/comms-checklist.md. Start each one with five checks based on what you noticed during Days 3 and 4. These will grow over time.

Day 6: Turn one workflow into a reusable artifact. Take the workflow from Day 4 and document the prompt, the output format, the review step, and any known edge cases. Save it as a complete workflow file. Run it again and verify the documentation is accurate.

Day 7: Compound. Run the compounding prompt at the end of your Codex session:

Based on everything we have done this week, what should become a reusable skill, an automation, or a small tool? What context should I add to my project files so future sessions start from a better baseline?

Review Codex’s suggestions and implement the one that would save the most time over the next month.

30-day extension:

Week 1: One personal workflow running reliably
Week 2: One multi-source workflow pulling from at least three connected tools
Week 3: One small tool or automation that handles a chore without your involvement
Week 4: One shared or team workflow with explicit ownership and review cadence

Start today. Connect the tools you’re comfortable permissioning and ask Codex what recurring workflows it can see from the available context. That question, and what you do with the answer, is the gateway to the Codex universe.

Codex is easy to underestimate. At first glance it looks like another AI coding tool; if you’re not an engineer, a natural conclusion is that it’s not for you.

That reading misses how much Codex makes possible.

This is a day in the life of an agent-pilled knowledge worker. It all runs on OpenAI’s agent, Codex, in the Codex desktop app. We use “Codex” to refer to the app throughout this guide.

There are two ways to work with agents in Codex: Delegate or collaborate.

Delegate tasks that are predictable, repeatable, and low-risk. With clear, well-specified instructions, the agent can execute autonomously and bring back finished work for your review.
Collaborate on tasks that are judgment-heavy, exploratory, or iterative. You work alongside the model toward an outcome that matches your vision.

Expert Codex users are one of the clearest examples of what that looks like in practice.

Part 1: Understanding Codex

What Codex is

Specific capabilities that make Codex worth using:

Works alongside you on multiple tasks in parallel
Pulls context from the apps and files you connect
Uses a supported browser and desktop workflows when a task needs on-screen action
Checks its own work, revises, and keeps going
Holds a persistent goal across a long-running session, instead of treating each message as a one-off request
Turns repeatable tasks into recurring workflows
Helps route shared requests from places like Slack, email, or forms
Lets you start, steer, approve, and review work from your phone while Codex works in the cloud or on a machine, such as a Mac Mini, that you keep awake

A note on Goals

Codex on mobile

What Codex isn’t

Useful rules

A task is a good candidate for Codex if it has at least two of the following traits:

It requires pulling data from multiple sources.
It involves repeated steps you do regularly.
It can be checked against objective criteria.
It produces a durable artifact—a document, a plan, a report, a script.
It benefits from synthesis across many inputs.
It’s annoying enough that you routinely delay or avoid it.
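The two-of-six test above is easy to apply mechanically when you are triaging a long list of tasks. The following is a minimal sketch; the trait identifiers are shorthand invented here for the six bullets, not terms from Codex.

```python
# Shorthand labels for the six traits listed above (names are illustrative).
TRAITS = (
    "multiple_sources",
    "repeated_steps",
    "objectively_checkable",
    "durable_artifact",
    "needs_synthesis",
    "routinely_avoided",
)


def is_good_candidate(task_traits: set[str], minimum: int = 2) -> bool:
    """True when the task shows at least `minimum` of the listed traits."""
    unknown = task_traits - set(TRAITS)
    if unknown:
        raise ValueError(f"unknown traits: {sorted(unknown)}")
    return len(task_traits) >= minimum


print(is_good_candidate({"multiple_sources", "repeated_steps"}))  # True
print(is_good_candidate({"durable_artifact"}))                    # False
```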

Delegate tasks when they are:

Repeatable
Objective
Checkable
Low-risk

Collaborate on tasks that are:

Ambiguous
Judgment-heavy
Exploratory
Iterative

Codex, Claude Code, and Claude Cowork

Codex is OpenAI’s counterpart, but rather than split the work across modes, it puts coding and knowledge work in a single workspace. A few things give Codex an edge for knowledge work today:

One surface, not two. Anthropic splits agentic work between Code and Cowork; Codex handles both in the same place, so you’re never deciding which mode a task belongs in.
A browser that works beside you. Codex renders the pages inside the app itself as a shared view between you and the agent. The Claude app operates a stand-alone Chrome window or your full screen instead. For logged-in sites, both rely on a Chrome extension. In our experience, Codex’s built-in browser tends to be faster, more reliable, and more useful for collaborative work.
Connectors out of the box. Codex comes with a catalog of connectors you authorize in a click; in the Claude app you add tools as MCP servers, which requires a bit more assembly.

Which surface is right comes down to model preference and workflow habits; Codex has the edge for us today—but the labs ship fast, and that can change.

The Codex knowledge work loop

Every sustainable Codex workflow follows the same five-step pattern:

Connect → Contextualize → Delegate/collaborate → Review → Compound

Delegate/collaborate: Decide whether the task needs close collaboration or can run on its own. Either way, specify inputs, output format, and acceptance criteria, then let it work.

Part 2: Setup

Connect your systems

Three ways Codex reaches your tools

Codex can touch the same tool in more than one way, and knowing which access path is which saves a lot of confusion:

Connectors (plugins) give Codex structured, API-level access to an app—Gmail, Slack, Notion, your analytics tools. This is the most reliable and repeatable option, so use it whenever a connector exists.
Browser use lets Codex operate a web page directly through its in-app browser—useful for local previews, public pages, and anything you want to watch it do on a shared screen. For sites that require you to be signed in, like your email client, the Codex Chrome extension works inside your logged-in browser.
Computer use lets Codex see and operate your desktop the way a person would—clicking through an app, changing a setting, or working with software that only exists as a graphical interface.

The rule of thumb: Reach for a connector first, the browser next, and computer use when nothing else can get to the task.
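Written as a decision function, the rule of thumb is a simple fallback chain. This is only a sketch of the ordering described above; the function and its two inputs are hypothetical, and deciding whether a task is actually reachable in the browser is left to you.

```python
def pick_access_path(has_connector: bool, reachable_in_browser: bool) -> str:
    """Prefer a connector, then the browser, then computer use as a last resort."""
    if has_connector:
        return "connector"
    if reachable_in_browser:
        return "browser"
    return "computer use"


print(pick_access_path(True, True))    # connector
print(pick_access_path(False, True))   # browser
print(pick_access_path(False, False))  # computer use
```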

Starting prompt—use this once your integrations are set up:

Once the relevant tools are connected and permissioned, this prompt lets Codex inspect the available work context and suggest automation candidates rather than forcing you to invent them.

Build your Codex workspace

Build Codex’s workspace before running any workflows. Skip this step and you’ll likely stall.

An example workspace structure

your-workspace/
├── README.md              # Start here: orientation
├── identity/              # About you
│   ├── context.md
│   ├── preferences.md
│   └── rules.md
├── playbooks/             # Process: repeatable workflows
│   ├── workflows/
│   ├── inbox-sweep.md
│   └── research-brief.md
├── sources/               # Source shelf: inputs
│   ├── sources/
│   ├── key-links.md
│   └── recurring-docs.md
├── outputs/               # Finished work
│   ├── outputs/
│   ├── drafts/
│   └── reports/
└── reviews/               # Quality checks: guardrails
    ├── data-checklist.md
    └── writing-checklist.md
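If you would rather create the skeleton yourself than ask Codex to do it, a short script can lay it out. This is a sketch under stated assumptions: it covers the named files plus the drafts/ and reports/ folders from the example above, creates every file empty, and the `scaffold` helper is a name made up for this illustration.

```python
import tempfile
from pathlib import Path

# Starter files from the example structure, relative to the workspace root.
FILES = [
    "README.md",
    "identity/context.md",
    "identity/preferences.md",
    "identity/rules.md",
    "playbooks/inbox-sweep.md",
    "playbooks/research-brief.md",
    "sources/key-links.md",
    "sources/recurring-docs.md",
    "reviews/data-checklist.md",
    "reviews/writing-checklist.md",
]
DIRS = ["outputs/drafts", "outputs/reports"]


def scaffold(root: Path) -> list[Path]:
    """Create folders and empty starter files; never overwrite an existing file."""
    created = []
    for rel in DIRS:
        (root / rel).mkdir(parents=True, exist_ok=True)
    for rel in FILES:
        path = root / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        if not path.exists():
            path.touch()
            created.append(path)
    return created


# Demo in a throwaway directory so nothing lands on your real disk.
with tempfile.TemporaryDirectory() as tmp:
    made = scaffold(Path(tmp) / "your-workspace")
    print(len(made))  # 10
```

Because existing files are skipped, running it a second time against the same folder changes nothing, which makes it safe to rerun after you have started filling in the files.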

What to put in your context files

context.md should cover:

Your role and the function you own
Active projects and their current status
The tools you use daily and what each one is for
The people or teams you work with most closely
How decisions typically get made in your context

preferences.md should cover:

Writing style and tone (formal or conversational, terse or thorough)
Communication preferences (what you like to review before it goes out and what can be drafted and queued without your involvement)
Decision-making preferences (when to ask before acting and when to proceed and report back)

rules.md should cover:

What Codex may never do without explicit approval: Send, post, archive, delete, modify a source of truth, or move money
What Codex may do without asking: Draft, summarize, research, outline, organize
Any standing constraints specific to your work (e.g., client confidentiality rules, brand standards, data handling requirements)
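The first two bullets of rules.md amount to an allowlist and an approval list. The sketch below shows one way to express them, assuming a fail-closed default in which any action not explicitly allowed requires approval. That default is a choice made for this example, and the action names are shorthand for the verbs above rather than anything Codex defines.

```python
# Actions taken from the rules.md bullets above; identifiers are illustrative.
NEEDS_APPROVAL = {"send", "post", "archive", "delete", "modify_source_of_truth", "move_money"}
ALLOWED = {"draft", "summarize", "research", "outline", "organize"}


def decide(action: str) -> str:
    """Return 'proceed' for allowlisted actions and 'ask first' for everything else."""
    if action in ALLOWED:
        return "proceed"
    # Covers NEEDS_APPROVAL and, by assumption, any action not listed at all.
    return "ask first"


print(decide("draft"))     # proceed
print(decide("send"))      # ask first
print(decide("reformat"))  # ask first (unlisted, so it fails closed)
```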

Starting prompt—use this to have Codex create your workspace structure:

[First: Create a folder on your desktop called “Codex”]

Set up this folder as a simple Codex workspace for knowledge work.

Create three starter files:

1. context.md—who I am, what I’m working on, what tools I use, and who I work with

2. preferences.md—how I like work to be written, reviewed, and handled

3. rules.md—what you may do without asking, what you must ask before doing, and what you must never do

Interview me one question at a time to gather the information you need to fill in each file.

The “one pinned chat per project” rule

Part 3: The five levels of Codex use

Level 1: One-off knowledge work

Mental model: Codex as a capable, thorough research and drafting assistant.

Best first tasks:

Summarize a meeting transcript and extract decisions, open questions, and follow-up actions.
Turn scattered notes into a structured outline.
Build a research brief from a set of links and documents.
Rewrite a draft against a style guide.
Create a review checklist for a document, launch plan, or strategy memo.
Convert a written draft into an audio file for editing on the go.

Review habit: Before polishing any output, ask Codex to list the assumptions it made and where it is least confident. This surfaces problems before you invest time in refinement.

Move to Level 2 when: You keep wishing Codex remembered what you told it last time.

Level 2: Multi-source workflows

Mental model: Codex as a cross-system analyst that can assemble information you could never pull together manually in a reasonable amount of time.

Example multi-source tasks:

A go-to-market plan built from internal meeting transcripts, Slack discussions, customer notes, and a strategy template
A weekly KPI report from analytics, revenue data, support volume, and social metrics
A summary synthesized from Slack, Notion, Drive links, and past drafts
A weekly leadership brief assembled from team standups, metrics, and open decisions

I need [specific artifact].

Sources to use:

- [Tool 1]: [what to look for there]

- [Tool 2]: [what to look for there]

- [Tool 3]: [what to look for there]

Output format: [describe the structure you want]

Move to Level 3 when: You keep running the same multi-source workflow more than once a week and wishing it happened automatically.

Level 3: Repeated chores into persistent workflows

Mental model: Codex as an automated operations layer that handles predictable, recurring work so you don’t have to.

A useful heuristic: If you could write a checklist that covers 90 percent of the cases, delegate it. If you would need to think about it differently each time, collaborate.

In either case, look for “computer chores”—recurring tasks that take time and attention, but don’t require human judgment at every single touchpoint.

Common chore candidates:

End-of-day check for unanswered Slack messages and emails, with drafted replies
Weekly metrics brief from analytics, revenue, and support data
Meeting-note cleanup and action-item extraction after each recorded call
Customer support pattern detection and issue routing
Draft-to-review package that formats a piece for editor handoff
Recruiting research for an open role

Workflow name:

Trigger or cadence:

Input sources:

Output artifact:

Approval rules:

What Codex may do without asking:

What Codex must ask before doing:

Verification steps:

Where the final output lives:

When to retire or revise this workflow:
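To keep saved workflows consistent, the canvas above can be rendered from a dictionary of answers, with anything unanswered marked so the gap is visible. This is a minimal sketch: `render_canvas` and the `TODO` marker are hypothetical, and the example answers are invented.

```python
# Field names copied from the canvas above, in order.
CANVAS_FIELDS = [
    "Workflow name",
    "Trigger or cadence",
    "Input sources",
    "Output artifact",
    "Approval rules",
    "What Codex may do without asking",
    "What Codex must ask before doing",
    "Verification steps",
    "Where the final output lives",
    "When to retire or revise this workflow",
]


def render_canvas(answers: dict[str, str]) -> str:
    """Render one 'Field: value' line per canvas field; blanks become TODO."""
    lines = [f"{field}: {answers.get(field, 'TODO')}" for field in CANVAS_FIELDS]
    return "\n".join(lines) + "\n"


text = render_canvas({
    "Workflow name": "Inbox sweep",
    "Trigger or cadence": "Weekdays, end of day",
    "Approval rules": "Drafts only; nothing is sent",
})
print(text)
```

Searching saved workflow files for `TODO` then gives a quick list of canvases that were never finished.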

Move to Level 4 when: Your prompt-based workflow hits a ceiling—the task is too complex or too custom to handle in text alone, and a small script or local tool would make it reliable.

Level 4: Build small tools when prompts are not enough

Mental model: Codex as a builder that creates lightweight infrastructure to make your workflows more reliable, faster, or more repeatable.

Sometimes the best Codex output is a small script, a local app, a custom dashboard, or a review surface that makes a recurring workflow easier, rather than pure text.

Examples of when a small tool helps:

A recurring workflow that requires pulling from an API that has no Codex integration. A short script handles the connection reliably.
A review process where you need to see formatted output side by side with the source. A simple local app gives you the view.
A task that needs to run on a schedule without your involvement. A script set to run on a timer (a cron job) handles the timing.
A workflow that accumulates structured data over time. A lightweight database or structured file tracks it persistently.
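The last two examples often go together: a scheduled script that appends one record per run to a structured file. Here is a minimal sketch using a JSON Lines file (one JSON object per line). The file name, the record fields, and the crontab path in the comment are all assumptions for illustration.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

# A crontab entry like the following would run a script every Monday at 09:00
# (the script path is hypothetical):
#   0 9 * * 1 /usr/bin/python3 /path/to/weekly_log.py


def append_record(log_path: Path, record: dict) -> int:
    """Append one JSON object as a line; return how many records the file now holds."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    with log_path.open(encoding="utf-8") as fh:
        return sum(1 for _ in fh)


# Demo in a throwaway directory; in real use, point this at a file in your workspace.
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "outputs" / "reports" / "weekly-metrics.jsonl"
    total = append_record(log, {"date": date.today().isoformat(), "open_tickets": 12})
    print(total)  # 1
```

Appending rather than rewriting means a failed run cannot damage earlier records, and the file stays readable by both a person and an agent.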

Practical approach for non-engineers:

Run the task manually in Codex once to confirm the output is what you want
Ask Codex: “Which steps in this workflow could be made more reliable with a small script or tool?”
Have Codex prototype the tool and explain what it does in plain language
Run it on your data and verify the output matches what the manual process produced
Keep only the parts that reduce friction. Discard what adds complexity without benefit.

Move to Level 5 when: You give Codex the same feedback repeatedly and have standing preferences that you’d prefer it to apply on its own.

Level 5: Compound your Codex system

Mental model: Codex as a system that can improve over time when you save useful workflows, maintain review rules, and use memories or skills to codify preferences where available.

Mode: Hybrid. Some instructions will dictate how the agent approaches autonomous work; others will guide how the model interacts with you in collaboration mode.

Remember: Each useful session should make future sessions faster and more reliable. In practice, that requires doing four things consistently after completing any significant piece of work:

4. Ask Codex to identify compounding opportunities. At the end of any session where you did something useful, run this prompt:

Part 4: Workflow library

These workflows are meant as inspiration to get you started. Adapt the inputs, outputs, and approval rules to your specific tools and standards.

1. Inbox zero review queue

Best for: Anyone whose email backlog is a recurring source of anxiety or dropped balls.

Input sources: Gmail or your email client of choice.

Output artifact: A structured list of draft replies, proposed actions (archive, delegate, flag), and any emails flagged for your personal attention because the draft alone isn’t sufficient.

Dan Shipper kept inbox zero for 10 days straight with Codex. To use this workflow, have Codex:

Gather email through Cora running in the in-app browser.
Render the email queue as a single page.
Go through each item with you as you dictate the action the AI should take (e.g., “research this,” “draft that,” “pull the documents our lawyers asked for”). You can do this via chat or by voice with a dictation tool like Monologue (we recommend the latter).

First prompt:

Go through my inbox for the past [time period].

For each email that needs a response or action:

1. Categorize it: needs reply/needs action/can archive/already handled

2. If it needs a reply, draft one in my voice using the style in preferences.md

3. If it needs action, describe the action clearly

4. Flag any email where a draft reply isn’t enough—where I need to think about this personally before responding

Don’t send anything. Create drafts only. I will review in Gmail.

Review step: Review all drafts in Gmail before sending. Don’t approve from inside Codex.

2. Daily unanswered message roundup

Best for: Anyone who communicates across Slack, email, and other channels and loses track of what still needs a response.

Input sources: Slack, Gmail, any other communication tool you use.

Output artifact: A list of unanswered items with drafted replies or proposed reactions, organized by urgency.

First prompt:

Look across my Slack and Gmail for the past 24 hours. Find everything that was directed at me that I have not responded to.

For each item:

1. Draft a reply or suggest a reaction (thumbs up, etc.) if a short acknowledgment is appropriate

2. Flag items where a more considered response is needed

3. Flag anything time-sensitive

Present the list organized by urgency. Don’t send anything.

Review step: Review in Slack and Gmail.

3. Research brief creation

Best for: Anyone preparing for a meeting, a pitch, a content piece, or a strategic decision and needing a thorough, sourced summary of a topic.

Input sources: Provided links, Notion, Drive, web search.

Output artifact: A structured brief with background, key facts, open questions, and source links.

First prompt:

Build a research brief on [topic].

Sources to prioritize: [List any specific links, documents, or databases].

Structure the brief as:

- Background: what I need to know to have a smart conversation about this

- Key facts and data points, each with a source link

- Competing perspectives or significant disagreements in the field

- Open questions I should be able to answer before [meeting/decision/deadline]

- Three things I should read next if I want to go deeper

Flag any claims you are less than confident about.

Review step: Check source links. Verify any statistics against the original source before using them.

4. Writing with a parallel review loop

Best for: Writers who want Codex running alongside them as they draft—checking the work, flagging issues, and responding in parallel without interrupting the writing session.

Input sources: Your draft (open in your word processor through Codex’s in-app browser), any relevant style guides, source documents, or review standards in your workspace.

Output artifact: An annotated draft with inline feedback, flagged issues, and suggested revisions—produced continuously as you write rather than in a single pass at the end.

Setup: Open your draft in Proof or the in-app browser. Start a Codex session with your workspace context loaded. Give Codex standing instructions for what to monitor and how to respond.

First prompt:

I am writing [describe the piece—type, audience, purpose].

As I draft, run a continuous review loop. Check for:

- Claims that need a source or are stated with more confidence than the evidence supports

- Passages where the argument loses clarity or the logic has a gap

- Sentences that violate the style preferences in preferences.md

- Anything that reads as filler, throat-clearing, or AI-generated phrasing

Don’t rewrite anything without being asked. Flag issues as I go with a brief note on what the problem is and what would fix it. Check in every [X minutes / X paragraphs] or when I ask.

5. Source management for research

Best for: Writers and researchers who need to organize source material before drafting.

Input sources: Links, PDFs, past drafts, notes, transcripts.

Output artifact: A structured document with the core argument, supporting evidence organized by claim, counterarguments, and a gap analysis (what is still missing).

First prompt:

I am writing a piece on [topic]. The core argument I want to make is [argument].

Here are my source materials: [links/documents].

Build an evidence room that:

1. States the core argument clearly

2. Lists the strongest supporting evidence for each main point, with source links

3. Lists the strongest counterarguments and how I might address them

4. Identifies any gaps—claims I am making that lack strong evidence

5. Flags any sources that conflict with each other

Review step: Read the evidence room before drafting. Verify any statistics or quotes you plan to use directly.

How to compound: Save the evidence format as a workflow template. Add a standing note to your context file about your writing voice and recurring themes so Codex calibrates its framing.

6. Information via audio

Best for: Anyone who processes information better by listening than reading, or who wants to take time away from a screen but stay on top of work.

Input sources: Any written content: drafts, research briefs, meeting summaries, strategy documents, reports, lengthy emails, articles.

Output artifact: An audio file saved to a location accessible from your phone (Dropbox, Drive, etc.).

First prompt:

Convert the attached [document/draft/report] into a clear audio file. Read it at a natural pace—not rushed, not slow. Save it to [Dropbox/Drive location] as [filename].

Review step: Listen on your commute, walk, or wherever you have time away from a screen. Take notes on your phone as things come up. Return to the source material with whatever you noticed.

7. Go-to-market plan generator

Best for: Anyone responsible for launching a product, feature, or initiative and who has done the thinking in meetings and Slack but has not had time to formalize it.

Input sources: Meeting transcripts, Slack threads, customer notes, a preferred strategy template.

Output artifact: A complete go-to-market plan, structured for human review and agent querying.

First prompt:

Build a go-to-market plan for [product/initiative].

Sources to pull from:

- Meeting transcripts: [Notion location or links]

- Slack discussions: [channels or search terms]

- Customer research: [document or location]

- Template to follow: [link or paste template]

The plan should be readable by a human in five minutes and structured so that an agent can answer specific questions about it (e.g., “What is the target ICP?” “What is the launch timeline?”).

8. KPI report

Best for: Anyone responsible for tracking metrics and needing a regular, reliable view across multiple data sources.

Input sources: Analytics (PostHog, Mixpanel, Amplitude), revenue data (Stripe), support volume, social metrics, saved past reports.

Output artifact: A one-page report covering headlines, usage metrics, system health, and follow-up items.

First prompt:

Generate a product pulse report for [time period].

Data sources:

- Product analytics: [tool and what to pull]

- Revenue: [tool and what to pull]

- Support: [tool and what to pull]

- Social: [tool and what to pull]

Structure:

1. Headlines (three to five bullets summarizing what matters most)

2. Usage (primary engagement metric, value-realization metric, conversions, deltas vs. prior period)

3. System health (error rates, latency, top error signatures)

4. Follow-ups (one to five things worth investigating, specific enough to act on)

Flag any number that differs significantly from the prior report. If something is anomalous, investigate one level deeper before including it.
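The prompt leaves "differs significantly" undefined, so it helps to decide on a threshold yourself and check the report against it. The sketch below assumes a 20 percent relative change as the cutoff. That figure, the function name, and the sample numbers are all illustrative.

```python
def flag_changes(
    current: dict[str, float],
    prior: dict[str, float],
    threshold: float = 0.2,
) -> dict[str, float]:
    """Return metrics whose relative change vs. the prior report exceeds `threshold`."""
    flagged = {}
    for name, now in current.items():
        before = prior.get(name)
        if before is None or before == 0:
            continue  # no usable baseline, so no relative change to compute
        change = (now - before) / before
        if abs(change) > threshold:
            flagged[name] = round(change, 3)
    return flagged


prior = {"signups": 400, "error_rate": 0.010, "revenue": 12000}
current = {"signups": 410, "error_rate": 0.016, "revenue": 9000}
print(flag_changes(current, prior))  # {'error_rate': 0.6, 'revenue': -0.25}
```

Metrics with no prior value are skipped rather than flagged, so a brand-new metric still needs a manual look the first time it appears.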

9. Customer support for product work

Best for: Teams where support patterns should feed into product decisions and small fixes.

Input sources: Support platform (Intercom, Zendesk), issue tracker (Linear, GitHub Issues).

Output artifact: A deduplicated list of issues with suggested priority, plus small issues ready to hand off for fixes.

First prompt:

Go through my support queue for the past [time period].

For each support thread:

1. Identify the underlying issue or request.

2. Check whether a similar issue already exists in [Linear/GitHub Issues].

3. If it does, link them. If it doesn’t, draft a new issue.

4. Flag any issue that appears more than [threshold] times—these are priorities.

5. For issues that appear straightforward to fix, note that they are candidates for direct implementation.

Don’t create issues in the tracker yet. Give me the list to review first.

Review step: Review the issue list before anything goes into the tracker. Confirm deduplication is accurate—support tickets often describe the same underlying problem in different words.

How to compound: After each session, add a note about recurring issue types so Codex can categorize faster next time. Build a persistent list of known issues so deduplication improves over time.
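Once each thread has been mapped to an underlying issue, the threshold check in step 4 is a counting exercise. This sketch assumes that mapping is already done, by Codex or by you, and that the same issue carries the same label apart from case and stray spaces. Matching tickets that describe one problem in different words is the hard part and is not solved here. The labels are invented.

```python
from collections import Counter


def priorities(thread_issues: list[str], threshold: int) -> list[tuple[str, int]]:
    """Issues that appear more than `threshold` times, most frequent first."""
    counts = Counter(label.strip().lower() for label in thread_issues)
    return [(issue, n) for issue, n in counts.most_common() if n > threshold]


labels = [
    "Login loop",
    "login loop ",
    "Export fails",
    "Login loop",
    "Export fails",
    "Dark mode request",
]
print(priorities(labels, threshold=2))  # [('login loop', 3)]
```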

10. Pull requests for non-engineers

Best for: Anyone who needs to make a small, well-scoped change to a codebase—such as copy updates, configuration changes, or content edits—without deep engineering knowledge.

Input sources: The relevant files or repository, and a clear description of the change.

Output artifact: A pull request (PR) that is reviewer-friendly and doesn’t touch anything outside the intended scope.

First prompt:

I need to make the following change: [describe the change clearly].

Before making any changes:

1. Show me which files are affected

2. Confirm the scope of the change—nothing outside these files should be touched

3. Explain what you are going to do in plain language before doing it

After making the change:

1. Summarize what was changed and why

2. List every file that was touched

3. Explain how you verified the change is correct

4. Flag anything a reviewer should look at carefully

Make the smallest useful change. Don’t refactor or improve anything adjacent.

Review step: Review the Codex preview before the PR is opened. Review the PR itself in GitHub or your code review tool. Ask a technical colleague to approve before merging if you are uncertain.

How to compound: Save a template of your preferred PR format. After each PR, add a note about anything that requires correction so future PRs avoid the same issue.

11. Recruiting research

Best for: Anyone doing outbound recruiting for a role with a specific background profile.

Input sources: LinkedIn, Twitter/X, company websites, alumni databases, public professional networks.

Output artifact: A list of candidates with background summaries and contact information or connection points.

First prompt:

I am hiring for [role]. The ideal candidate has [background profile: experience, prior companies, skills, career trajectory].

Search for candidates who match this profile. For each candidate:

1. Summarize their background in two to three sentences

2. Note why they match the profile

3. Identify any connection point (mutual connections, follows, shared affiliations)

4. Provide a link to their public profile

Return the top [number] candidates, ranked by how closely they match the profile.

Review step: Review each candidate before any outreach. Verify that the background summaries are accurate by checking the linked profiles. Don’t send any outreach through Codex.

How to compound: Save the role profile as a template. After a successful hire, document what the actual background looked like versus the initial profile to calibrate future searches.

12. Strategy and planning agent

Best for: Leaders and operators who need to compress OKR planning, quarterly planning, or strategic reviews from days to hours.

Input sources: Past planning documents, meeting transcripts, leadership context notes, relevant metrics.

Output artifact: A draft plan or OKR set, structured for review and iteration.

First prompt:

I need to draft [quarterly plan / OKR set / strategic review] for [scope].

Pull from:

- Past plans: [location]

- Recent meeting transcripts: [location]

- Current metrics: [location or description]

- Leadership context: [document or description]

Structure the output as [desired format].

Review step: Review in Notion or Proof. Before sharing with leadership or the team, confirm that every major commitment traces to a decision that was actually made.

13. Personal learning tool

Best for: Anyone who wants to use Codex to support skill-building, practice, or self-directed learning.

Input sources: External APIs, files, structured practice materials, your own notes.

Output artifact: A custom interactive tool—like a tutor, a quiz, or a practice environment—built for your learning goal.

First prompt:

I want to build a personal learning tool for [skill or subject].

My current level: [beginner/intermediate/what I know already].

What I want to practice: [specific aspect of the skill].

How I want feedback: [immediate/after each session/scored].

Build a prototype I can use locally. Explain what it does and how to use it before I start.

Review step: Try the tool on real practice material before committing to it. Verify it is actually testing what you intended.

How to compound: After each practice session, ask Codex to update the tool based on what you found most and least useful. The tool improves as your needs become clearer.

Part 5: Operating Codex well

How to steer Codex

Ask Codex what it needs before it starts. For complex tasks, a short briefing prompt saves time:

Before you start, tell me what additional context would help you do this better. What are the most important things you would want to know?

Review in the destination app. Always.

Three questions to ask before approving any significant output:

What was the hardest decision you made in producing this?

What alternatives did you consider and reject?

Where are you least confident?

These questions surface the judgment calls the model made, the options it dismissed, and the places most likely to contain errors.

Safety, trust, and risks

Risk categories

Green—proceed with standard review: Summaries, outlines, internal drafts, research briefs, personal notes, low-stakes scripts.

Yellow—review carefully before sharing or acting: Strategy documents, customer-support drafts, product specs, recruiting research, non-destructive data pulls, PR drafts for small changes.

Common failure modes and how to handle them

Metrics errors. Joining data from multiple sources introduces definition mismatches and calculation errors. Verify column by column for any metric used in decisions.

Usage limits. Long-running sessions can hit usage limits and stop mid-task. For complex workflows, break work into stages rather than attempting everything in a single session.

Team workflows: From personal Codex to shared operating system

Individual Codex workflows compound over time. Team workflows compound faster but require coordination.

What changes when a team uses Codex

Teams build trust in agents through the humans who operate them. When a colleague receives a document or plan that Codex drafted, they trust it to the degree they trust the person who shared it.

Infrastructure that makes team Codex work

Shared review surfaces. A shared document review tool (Proof, Notion, Google Docs) makes agent-generated documents easier to inspect and comment on than outputs reviewed only inside Codex.

A simple way to get a team to use Codex

A note from a leader that makes using AI the expectation, not a nice-to-have
A weekly meeting where anyone can show a prompt or workflow they’ve built
A regular message that names the people whose work stood out

Set the expectation, give people a place to share what works, and recognize them for it—that’s most of the battle.

Part 6: Getting started

The seven-day Codex power-user plan

Day 7: Compound. Run the compounding prompt at the end of your Codex session:

Based on everything we have done this week, what should become a reusable skill, an automation, or a small tool? What context should I add to my project files so future sessions start from a better baseline?

Review Codex’s suggestions and implement the one that would save the most time over the next month.

30-day extension:

Week 1: One personal workflow running reliably
Week 2: One multi-source workflow pulling from at least three connected tools
Week 3: One small tool or automation that handles a chore without your involvement
Week 4: One shared or team workflow with explicit ownership and review cadence

Codex is easy to underestimate. At first glance it looks like another AI coding tool; if you’re not an engineer, a natural conclusion is that it’s not for you.

That reading misses how much Codex makes possible.

This is a day in the life of an agent-pilled knowledge worker. It all runs on OpenAI’s agent, Codex, in the Codex desktop app. We use “Codex” to refer to the app throughout this guide.

There are two ways to work with agents in Codex: Delegate or collaborate.

Delegate tasks that are predictable, repeatable, and low-risk. With clear, well-specified instructions, the agent can execute autonomously and bring back finished work for your review.
Collaborate on tasks that are judgment-heavy, exploratory, or iterative. You work alongside the model toward an outcome that matches your vision.

Expert Codex users are one of the clearest examples of what that looks like in practice.

Part 1: Understanding Codex

What Codex is

Specific capabilities that make Codex worth using:

Works alongside you on multiple tasks in parallel
Pulls context from the apps and files you connect
Uses a supported browser and desktop workflows when a task needs on-screen action
Checks its own work, revises, and keeps going
Holds a persistent goal across a long-running session, instead of treating each message as a one-off request
Turns repeatable tasks into recurring workflows
Helps route shared requests from places like Slack, email, or forms
Lets you start, steer, approve, and review work from your phone while Codex works in the cloud or on a machine, such as a Mac Mini, that you keep awake

A note on Goals

Codex on mobile

What Codex isn’t

Useful rules

A task is a good candidate for Codex if it has at least two of the following traits:

It requires pulling data from multiple sources.
It involves repeated steps you do regularly.
It can be checked against objective criteria.
It produces a durable artifact—a document, a plan, a report, a script.
It benefits from synthesis across many inputs.
It’s annoying enough that you routinely delay or avoid it.
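
The "at least two traits" rule above can be written as a small check. This is a minimal sketch: the trait labels and the `is_good_candidate` function are hypothetical names invented for illustration, not anything Codex provides.

```python
# Hypothetical labels paraphrasing the six traits listed above.
TRAITS = {
    "multi_source",           # pulls data from multiple sources
    "repeated_steps",         # involves steps you do regularly
    "objectively_checkable",  # can be checked against objective criteria
    "durable_artifact",       # produces a document, plan, report, or script
    "needs_synthesis",        # benefits from synthesis across many inputs
    "routinely_avoided",      # annoying enough that you delay or avoid it
}

def is_good_candidate(task_traits, minimum=2):
    """Return True when a task shows at least `minimum` of the known traits."""
    task_traits = set(task_traits)
    unknown = task_traits - TRAITS
    if unknown:
        raise ValueError(f"unknown traits: {sorted(unknown)}")
    return len(task_traits) >= minimum

weekly_report = {"multi_source", "repeated_steps", "durable_artifact"}
print(is_good_candidate(weekly_report))        # True
print(is_good_candidate({"needs_synthesis"}))  # False
```

The threshold is a parameter so you can tighten the rule if two traits turns out to be too permissive for your work.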

Delegate tasks when they are:

Repeatable
Objective
Checkable
Low-risk

Collaborate on tasks that are:

Ambiguous
Judgment-heavy
Exploratory
Iterative

Codex, Claude Code, and Claude Cowork

Codex is OpenAI’s counterpart, but rather than split the work across modes, it puts coding and knowledge work in a single workspace. A few things give Codex an edge for knowledge work today:

One surface, not two. Anthropic splits agentic work between Code and Cowork; Codex handles both in the same place, so you’re never deciding which mode a task belongs in.
A browser that works beside you. Codex renders the pages inside the app itself as a shared view between you and the agent. The Claude app operates a stand-alone Chrome window or your full screen instead. For logged-in sites, both rely on a Chrome extension. In our experience, Codex’s built-in browser tends to be faster, more reliable, and more useful for collaborative work.
Connectors out of the box. Codex comes with a catalog of connectors you authorize in a click; in the Claude app you add tools as MCP servers, which requires a bit more assembly.

Which surface is right comes down to model preference and workflow habits; Codex has the edge for us today—but the labs ship fast, and that can change.

The Codex knowledge work loop

Every sustainable Codex workflow follows the same five-step pattern:

Connect → Contextualize → Delegate/collaborate → Review → Compound

Delegate/collaborate: Decide whether the task needs close collaboration or can run on its own. Either way, specify inputs, output format, and acceptance criteria, then let it work.

Part 2: Setup

Connect your systems

Three ways Codex reaches your tools

Codex can touch the same tool in more than one way, and knowing which access path is which saves a lot of confusion:

Connectors (plugins) give Codex structured, API-level access to an app—Gmail, Slack, Notion, your analytics tools. This is the most reliable and repeatable option, so use it whenever a connector exists.
Browser use lets Codex operate a web page directly through its in-app browser—useful for local previews, public pages, and anything you want to watch it do on a shared screen. For sites that require you to be signed in, like your email client, the Codex Chrome extension works inside your logged-in browser.
Computer use lets Codex see and operate your desktop the way a person would—clicking through an app, changing a setting, or working with software that only exists as a graphical interface.

The rule of thumb: Reach for a connector first, the browser next, and computer use when nothing else can get to the task.

Starting prompt—use this once your integrations are set up:

Once the relevant tools are connected and permissioned, this prompt lets Codex inspect the available work context and suggest automation candidates rather than forcing you to invent them.

Build your Codex workspace

Build Codex’s workspace before running any workflows. Skip this step and you’ll likely stall.

An example workspace structure

your-workspace/
├── README.md              # Start here—orientation
├── identity/              # About you
│   ├── context.md
│   ├── preferences.md
│   └── rules.md
├── playbooks/             # Process—repeatable workflows
│   ├── workflows/
│   ├── inbox-sweep.md
│   └── research-brief.md
├── sources/               # Source shelf—inputs
│   ├── sources/
│   ├── key-links.md
│   └── recurring-docs.md
├── outputs/               # Finished work
│   ├── outputs/
│   ├── drafts/
│   └── reports/
└── reviews/               # Quality checks—guardrails
    ├── data-checklist.md
    └── writing-checklist.md
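
If you would rather create the folders yourself than ask Codex to do it, a short script can scaffold them. The sketch below is one way to do that with Python's standard library. The `scaffold_workspace` function and the `LAYOUT` mapping are illustrative names, and the nested workflows/, sources/, and outputs/ entries from the example are left out as a simplifying assumption.

```python
from pathlib import Path

# Top-level folders and starter files from the example structure above.
# Assumption of this sketch: the nested workflows/, sources/, and outputs/
# entries are omitted for brevity.
LAYOUT = {
    "identity": ["context.md", "preferences.md", "rules.md"],
    "playbooks": ["inbox-sweep.md", "research-brief.md"],
    "sources": ["key-links.md", "recurring-docs.md"],
    "outputs": [],
    "reviews": ["data-checklist.md", "writing-checklist.md"],
}
OUTPUT_SUBFOLDERS = ["drafts", "reports"]

def scaffold_workspace(root):
    """Create folders and empty starter files under `root`.

    Existing files keep their contents, so the function is safe to rerun.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    (root / "README.md").touch(exist_ok=True)
    for folder, files in LAYOUT.items():
        (root / folder).mkdir(exist_ok=True)
        for name in files:
            (root / folder / name).touch(exist_ok=True)
    for sub in OUTPUT_SUBFOLDERS:
        (root / "outputs" / sub).mkdir(exist_ok=True)
    return root
```

Calling `scaffold_workspace("your-workspace")` creates the tree in the current directory. The files start empty; filling them in is still the real work.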

What to put in your context files

context.md should cover:

Your role and the function you own
Active projects and their current status
The tools you use daily and what each one is for
The people or teams you work with most closely
How decisions typically get made in your context

preferences.md should cover:

Writing style and tone (formal or conversational, terse or thorough)
Communication preferences (what you like to review before it goes out and what can be drafted and queued without your involvement)
Decision-making preferences (when to ask before acting and when to proceed and report back)

rules.md should cover:

What Codex may never do without explicit approval: Send, post, archive, delete, modify a source of truth, or move money
What Codex may do without asking: Draft, summarize, research, outline, organize
Any standing constraints specific to your work (e.g., client confidentiality rules, brand standards, data handling requirements)
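
The two lists in rules.md amount to a simple policy, and it can help to see that policy written out. The following is a minimal sketch, not a Codex feature: `check_action` and the action labels are hypothetical, and treating unlisted actions as "ask first" is this sketch's own conservative choice.

```python
# Action labels taken from the rules.md guidance above.
NEEDS_APPROVAL = {
    "send", "post", "archive", "delete",
    "modify_source_of_truth", "move_money",
}
ALLOWED_WITHOUT_ASKING = {"draft", "summarize", "research", "outline", "organize"}

def check_action(action):
    """Classify an action label as 'proceed' or 'ask first'."""
    action = action.strip().lower()
    if action in ALLOWED_WITHOUT_ASKING:
        return "proceed"
    if action in NEEDS_APPROVAL:
        return "ask first"
    # Assumption: anything the rules do not mention defaults to asking.
    return "ask first (unlisted action)"

print(check_action("draft"))   # proceed
print(check_action("Send"))    # ask first
print(check_action("rename"))  # ask first (unlisted action)
```

Defaulting to "ask first" means a gap in your rules costs you a confirmation prompt rather than an unwanted action.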

Starting prompt—use this to have Codex create your workspace structure:

[First: Create a folder on your desktop called “Codex”]

Set up this folder as a simple Codex workspace for knowledge work.

Create three starter files:

1. context.md—who I am, what I’m working on, what tools I use, and who I work with

2. preferences.md—how I like work to be written, reviewed, and handled

3. rules.md—what you may do without asking, what you must ask before doing, and what you must never do

Interview me one question at a time to gather the information you need to fill in each file.

The “one pinned chat per project” rule

Part 3: The five levels of Codex use

Level 1: One-off knowledge work

Mental model: Codex as a capable, thorough research and drafting assistant.

Best first tasks:

Summarize a meeting transcript and extract decisions, open questions, and follow-up actions.
Turn scattered notes into a structured outline.
Build a research brief from a set of links and documents.
Rewrite a draft against a style guide.
Create a review checklist for a document, launch plan, or strategy memo.
Convert a written draft into an audio file for editing on the go.

Review habit: Before polishing any output, ask Codex to list the assumptions it made and where it is least confident. This surfaces problems before you invest time in refinement.

Move to Level 2 when: You keep wishing Codex remembered what you told it last time.

Level 2: Multi-source workflows

Mental model: Codex as a cross-system analyst that can assemble information you could never pull together manually in a reasonable amount of time.

Example multi-source tasks:

A go-to-market plan built from internal meeting transcripts, Slack discussions, customer notes, and a strategy template
A weekly KPI report from analytics, revenue data, support volume, and social metrics
A summary synthesized from Slack, Notion, Drive links, and past drafts
A weekly leadership brief assembled from team standups, metrics, and open decisions

I need [specific artifact].

Sources to use:

- [Tool 1]: [what to look for there]

- [Tool 2]: [what to look for there]

- [Tool 3]: [what to look for there]

Output format: [describe the structure you want]

Move to Level 3 when: You keep running the same multi-source workflow more than once a week and wishing it happened automatically.

Level 3: Repeated chores into persistent workflows

Mental model: Codex as an automated operations layer that handles predictable, recurring work so you don’t have to.

A useful heuristic: If you could write a checklist that covers 90 percent of the cases, delegate it. If you would need to think about it differently each time, collaborate.

In either case, look for “computer chores”—recurring tasks that take time and attention, but don’t require human judgment at every single touchpoint.

Common chore candidates:

End-of-day check for unanswered Slack messages and emails, with drafted replies
Weekly metrics brief from analytics, revenue, and support data
Meeting-note cleanup and action-item extraction after each recorded call
Customer support pattern detection and issue routing
Draft-to-review package that formats a piece for editor handoff
Recruiting research for an open role

Workflow name:

Trigger or cadence:

Input sources:

Output artifact:

Approval rules:

What Codex may do without asking:

What Codex must ask before doing:

Verification steps:

Where the final output lives:

When to retire or revise this workflow:

Move to Level 4 when: Your prompt-based workflow hits a ceiling—the task is too complex or too custom to handle in text alone, and a small script or local tool would make it reliable.

Level 4: Build small tools when prompts are not enough

Mental model: Codex as a builder that creates lightweight infrastructure to make your workflows more reliable, faster, or more repeatable.

Sometimes the best Codex output is not pure text but a small script, a local app, a custom dashboard, or a review surface that makes a recurring workflow easier.

Examples of when a small tool helps:

A recurring workflow that requires pulling from an API that has no Codex integration. A short script handles the connection reliably.
A review process where you need to see formatted output side by side with the source. A simple local app gives you the view.
A task that needs to run on a schedule without your involvement. A script set to run on a timer (a cron job) handles the timing.
A workflow that accumulates structured data over time. A lightweight database or structured file tracks it persistently.
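
As an illustration of the last example, here is roughly what a "structured file" tool can look like: a script that appends one row per run to a CSV log. It is a sketch under assumptions, since the column names and the `append_record` and `read_records` functions are made up for this example. A scheduler such as cron could call a script like this on a timer.

```python
import csv
from datetime import date
from pathlib import Path

FIELDS = ["date", "metric", "value"]  # hypothetical columns

def append_record(path, metric, value, day=None):
    """Append one row to a CSV log, writing the header on first use."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "date": (day or date.today()).isoformat(),
            "metric": metric,
            "value": value,
        })

def read_records(path):
    """Return every logged row as a list of dicts (all values are strings)."""
    with Path(path).open(newline="") as f:
        return list(csv.DictReader(f))
```

A plain CSV is easy to open in a spreadsheet and easy for Codex to read back in a later session, which is usually enough before a real database is worth the setup.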

Practical approach for non-engineers:

Run the task manually in Codex once to confirm the output is what you want
Ask Codex: “Which steps in this workflow could be made more reliable with a small script or tool?”
Have Codex prototype the tool and explain what it does in plain language
Run it on your data and verify the output matches what the manual process produced
Keep only the parts that reduce friction. Discard what adds complexity without benefit.

Move to Level 5 when: You give Codex the same feedback repeatedly and have standing preferences that you’d prefer it to apply on its own.

Level 5: Compound your Codex system

Mental model: Codex as a system that can improve over time when you save useful workflows, maintain review rules, and use memories or skills to codify preferences where available.

Mode: Hybrid. Some instructions will dictate how the agent approaches autonomous work; others will guide how the model interacts with you in collaboration mode.

Remember: Each useful session should make future sessions faster and more reliable. In practice, that requires doing four things consistently after completing any significant piece of work:

4. Ask Codex to identify compounding opportunities. At the end of any session where you did something useful, run this prompt:

Part 4: Workflow library

These workflows are meant as inspiration to get you started. Adapt the inputs, outputs, and approval rules to your specific tools and standards.

1. Inbox zero review queue

Best for: Anyone whose email backlog is a recurring source of anxiety or dropped balls.

Input sources: Gmail or your email client of choice.

Output artifact: A structured list of draft replies, proposed actions (archive, delegate, flag), and any emails flagged for your personal attention because the draft alone isn’t sufficient.

Dan Shipper kept inbox zero for 10 days straight with Codex. To use this workflow, have Codex:

Gather email through Cora running in the in-app browser.
Render the email queue as a single page.
Go through each item with you as you dictate the action the AI should take (e.g., “research this,” “draft that,” “pull the documents our lawyers asked for”). You can do this via chat or by voice with a dictation tool like Monologue (we recommend voice).

First prompt:

Go through my inbox for the past [time period].

For each email that needs a response or action:

1. Categorize it: needs reply/needs action/can archive/already handled

2. If it needs a reply, draft one in my voice using the style in preferences.md

3. If it needs action, describe the action clearly

4. Flag any email where a draft reply isn’t enough—where I need to think about this personally before responding

Don’t send anything. Create drafts only. I will review in Gmail.

Review step: Review all drafts in Gmail before sending. Don’t approve from inside Codex.

2. Daily unanswered message roundup

Best for: Anyone who communicates across Slack, email, and other channels and loses track of what still needs a response.

Input sources: Slack, Gmail, any other communication tool you use.

Output artifact: A list of unanswered items with drafted replies or proposed reactions, organized by urgency.

First prompt:

Look across my Slack and Gmail for the past 24 hours. Find everything that was directed at me that I have not responded to.

For each item:

1. Draft a reply or suggest a reaction (thumbs up, etc.) if a short acknowledgment is appropriate

2. Flag items where a more considered response is needed

3. Flag anything time-sensitive

Present the list organized by urgency. Don’t send anything.

Review step: Review in Slack and Gmail.

3. Research brief creation

Best for: Anyone preparing for a meeting, a pitch, a content piece, or a strategic decision who needs a thorough, sourced summary of a topic.

Input sources: Provided links, Notion, Drive, web search.

Output artifact: A structured brief with background, key facts, open questions, and source links.

First prompt:

Build a research brief on [topic].

Sources to prioritize: [List any specific links, documents, or databases].

Structure the brief as:

- Background: what I need to know to have a smart conversation about this

- Key facts and data points, each with a source link

- Competing perspectives or significant disagreements in the field

- Open questions I should be able to answer before [meeting/decision/deadline]

- Three things I should read next if I want to go deeper

Flag any claims you are less than confident about.

Review step: Check source links. Verify any statistics against the original source before using them.

4. Writing with a parallel review loop

Best for: Writers who want Codex running alongside them as they draft—checking the work, flagging issues, and responding in parallel without interrupting the writing session.

Input sources: Your draft (open in your word processor through Codex’s in-app browser), any relevant style guides, source documents, or review standards in your workspace.

Output artifact: An annotated draft with inline feedback, flagged issues, and suggested revisions—produced continuously as you write rather than in a single pass at the end.

Setup: Open your draft in Proof or the in-app browser. Start a Codex session with your workspace context loaded. Give Codex standing instructions for what to monitor and how to respond.

First prompt:

I am writing [describe the piece—type, audience, purpose].

As I draft, run a continuous review loop. Check for:

- Claims that need a source or are stated with more confidence than the evidence supports

- Passages where the argument loses clarity or the logic has a gap

- Sentences that violate the style preferences in preferences.md

- Anything that reads as filler, throat-clearing, or AI-generated phrasing

Don’t rewrite anything without being asked. Flag issues as I go with a brief note on what the problem is and what would fix it. Check in every [X minutes / X paragraphs] or when I ask.

5. Source management for research

Best for: Writers and researchers who need to organize source material before drafting.

Input sources: Links, PDFs, past drafts, notes, transcripts.

Output artifact: A structured document with the core argument, supporting evidence organized by claim, counterarguments, and a gap analysis (what is still missing).

First prompt:

I am writing a piece on [topic]. The core argument I want to make is [argument].

Here are my source materials: [links/documents].

Build an evidence room that:

1. States the core argument clearly

2. Lists the strongest supporting evidence for each main point, with source links

3. Lists the strongest counterarguments and how I might address them

4. Identifies any gaps—claims I am making that lack strong evidence

5. Flags any sources that conflict with each other

Review step: Read the evidence room before drafting. Verify any statistics or quotes you plan to use directly.

How to compound: Save the evidence format as a workflow template. Add a standing note to your context file about your writing voice and recurring themes so Codex calibrates its framing.

6. Information via audio

Best for: Anyone who processes information better by listening than reading, or who wants to take time away from a screen but stay on top of work.

Input sources: Any written content: drafts, research briefs, meeting summaries, strategy documents, reports, lengthy emails, articles.

Output artifact: An audio file saved to a location accessible from your phone (Dropbox, Drive, etc.).

First prompt:

Convert the attached [document/draft/report] into a clear audio file. Read it at a natural pace—not rushed, not slow. Save it to [Dropbox/Drive location] as [filename].

Review step: Listen on your commute, walk, or wherever you have time away from a screen. Take notes on your phone as things come up. Return to the source material with whatever you noticed.

7. Go-to-market plan generator

Best for: Anyone responsible for launching a product, feature, or initiative who has done the thinking in meetings and Slack but has not had time to formalize it.

Input sources: Meeting transcripts, Slack threads, customer notes, a preferred strategy template.

Output artifact: A complete go-to-market plan, structured for human review and agent querying.

First prompt:

Build a go-to-market plan for [product/initiative].

Sources to pull from:

- Meeting transcripts: [Notion location or links]

- Slack discussions: [channels or search terms]

- Customer research: [document or location]

- Template to follow: [link or paste template]

The plan should be readable by a human in five minutes and structured so that an agent can answer specific questions about it (e.g., “What is the target ICP?” “What is the launch timeline?”).

8. KPI report

Best for: Anyone responsible for tracking metrics who needs a regular, reliable view across multiple data sources.

Input sources: Analytics (PostHog, Mixpanel, Amplitude), revenue data (Stripe), support volume, social metrics, saved past reports.

Output artifact: A one-page report covering headlines, usage metrics, system health, and follow-up items.

First prompt:

Generate a product pulse report for [time period].

Data sources:

- Product analytics: [tool and what to pull]

- Revenue: [tool and what to pull]

- Support: [tool and what to pull]

- Social: [tool and what to pull]

Structure:

1. Headlines (three to five bullets summarizing what matters most)

2. Usage (primary engagement metric, value-realization metric, conversions, deltas vs. prior period)

3. System health (error rates, latency, top error signatures)

4. Follow-ups (one to five things worth investigating, specific enough to act on)

Flag any number that differs significantly from the prior report. If something is anomalous, investigate one level deeper before including it.
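
"Differs significantly" is worth pinning down before you rely on the report. One way to make it concrete is a relative-change threshold, sketched below. The 20 percent cutoff, the metric names, and the sample numbers are all illustrative assumptions, not recommendations from any tool.

```python
def flag_changes(current, prior, threshold=0.20):
    """Return metrics whose relative change from the prior report exceeds
    `threshold`, plus metrics that have no usable prior value."""
    flagged = {}
    for name, value in current.items():
        previous = prior.get(name)
        if previous is None or previous == 0:
            flagged[name] = "no usable prior value"
            continue
        change = (value - previous) / abs(previous)
        if abs(change) > threshold:
            flagged[name] = f"{change:+.0%} vs. prior"
    return flagged

# Made-up numbers for illustration only.
current = {"weekly_active_users": 1300, "error_rate": 0.021, "signups": 410}
prior = {"weekly_active_users": 1000, "error_rate": 0.020, "signups": 400}
print(flag_changes(current, prior))
# {'weekly_active_users': '+30% vs. prior'}
```

A single threshold is crude, because noisy metrics need a wider band than stable ones, but it gives you and Codex a shared definition to refine.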

9. Customer support for product work

Best for: Teams where support patterns should feed into product decisions and small fixes.

Input sources: Support platform (Intercom, Zendesk), issue tracker (Linear, GitHub Issues).

Output artifact: A deduplicated list of issues with suggested priority, plus small issues ready to hand off for fixes.

First prompt:

Go through my support queue for the past [time period].

For each support thread:

1. Identify the underlying issue or request.

2. Check whether a similar issue already exists in [Linear/GitHub Issues].

3. If it does, link them. If it doesn’t, draft a new issue.

4. Flag any issue that appears more than [threshold] times—these are priorities.

5. For issues that appear straightforward to fix, note that they are candidates for direct implementation.

Don’t create issues in the tracker yet. Give me the list to review first.

Review step: Review the issue list before anything goes into the tracker. Confirm deduplication is accurate—support tickets often describe the same underlying problem in different words.

How to compound: After each session, add a note about recurring issue types so Codex can categorize faster next time. Build a persistent list of known issues so deduplication improves over time.
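
Once each thread has been labeled with its underlying issue, the threshold rule in the prompt is a counting exercise. The sketch below assumes that labeling has already happened, whether by Codex or by you; `find_priorities` and the sample threads are hypothetical. Lowercasing the labels is only a crude stand-in for real deduplication, which has to account for tickets that describe the same problem in different words.

```python
from collections import Counter

def find_priorities(threads, threshold=3):
    """Count threads per normalized issue label and return the labels that
    appear more than `threshold` times, most frequent first."""
    counts = Counter(t["issue"].strip().lower() for t in threads)
    return [(issue, n) for issue, n in counts.most_common() if n > threshold]

# Made-up threads for illustration only.
threads = [
    {"id": 1, "issue": "Login fails"},
    {"id": 2, "issue": "login fails "},
    {"id": 3, "issue": "Export timeout"},
    {"id": 4, "issue": "Login Fails"},
]
print(find_priorities(threads, threshold=2))
# [('login fails', 3)]
```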

10. Pull requests for non-engineers

Best for: Anyone who needs to make a small, well-scoped change to a codebase—such as copy updates, configuration changes, or content edits—without deep engineering knowledge.

Input sources: The relevant files or repository, and a clear description of the change.

Output artifact: A pull request (PR) that is reviewer-friendly and doesn’t touch anything outside the intended scope.

First prompt:

I need to make the following change: [describe the change clearly].

Before making any changes:

1. Show me which files are affected

2. Confirm the scope of the change—nothing outside these files should be touched

3. Explain what you are going to do in plain language before doing it

After making the change:

1. Summarize what was changed and why

2. List every file that was touched

3. Explain how you verified the change is correct

4. Flag anything a reviewer should look at carefully

Make the smallest useful change. Don’t refactor or improve anything adjacent.

Review step: Review the Codex preview before the PR is opened. Review the PR itself in GitHub or your code review tool. Ask a technical colleague to approve before merging if you are uncertain.

How to compound: Save a template of your preferred PR format. After each PR, add a note about anything that required correction so future PRs avoid the same issue.

11. Recruiting research

Best for: Anyone doing outbound recruiting for a role with a specific background profile.

Input sources: LinkedIn, Twitter/X, company websites, alumni databases, public professional networks.

Output artifact: A list of candidates with background summaries and contact information or connection points.

First prompt:

I am hiring for [role]. The ideal candidate has [background profile—experience, prior companies, skills, career trajectory].

Search for candidates who match this profile. For each candidate:

1. Summarize their background in two to three sentences

2. Note why they match the profile

3. Identify any connection point (mutual connections, follows, shared affiliations)

4. Provide a link to their public profile

Return the top [number] candidates, ranked by how closely they match the profile.

Review step: Review each candidate before any outreach. Verify that the background summaries are accurate by checking the linked profiles. Don’t send any outreach through Codex.

How to compound: Save the role profile as a template. After a successful hire, document what the actual background looked like versus the initial profile to calibrate future searches.

12. Strategy and planning agent

Best for: Leaders and operators who need to compress OKR planning, quarterly planning, or strategic reviews from days to hours.

Input sources: Past planning documents, meeting transcripts, leadership context notes, relevant metrics.

Output artifact: A draft plan or OKR set, structured for review and iteration.

First prompt:

I need to draft [quarterly plan / OKR set / strategic review] for [scope].

Pull from:

- Past plans: [location]

- Recent meeting transcripts: [location]

- Current metrics: [location or description]

- Leadership context: [document or description]

Structure the output as [desired format].

Review step: Review in Notion or Proof. Before sharing with leadership or the team, confirm that every major commitment traces to a decision that was actually made.

13. Personal learning tool

Best for: Anyone who wants to use Codex to support skill-building, practice, or self-directed learning.

Input sources: External APIs, files, structured practice materials, your own notes.

Output artifact: A custom interactive tool—like a tutor, a quiz, or a practice environment—built for your learning goal.

First prompt:

I want to build a personal learning tool for [skill or subject].

My current level: [beginner/intermediate/what I know already].

What I want to practice: [specific aspect of the skill].

How I want feedback: [immediate/after each session/scored].

Build a prototype I can use locally. Explain what it does and how to use it before I start.

Review step: Try the tool on real practice material before committing to it. Verify it is actually testing what you intended.

How to compound: After each practice session, ask Codex to update the tool based on what you found most and least useful. The tool improves as your needs become clearer.

Part 5: Operating Codex well

How to steer Codex

Ask Codex what it needs before it starts. For complex tasks, a short briefing prompt saves time:

Before you start, tell me what additional context would help you do this better. What are the most important things you would want to know?

Review in the destination app. Always.

Three questions to ask before approving any significant output:

What was the hardest decision you made in producing this?

What alternatives did you consider and reject?

Where are you least confident?

These questions surface the judgment calls the model made, the options it dismissed, and the places most likely to contain errors.

Safety, trust, and risks

Risk categories

Green—proceed with standard review: Summaries, outlines, internal drafts, research briefs, personal notes, low-stakes scripts.

Yellow—review carefully before sharing or acting: Strategy documents, customer-support drafts, product specs, recruiting research, non-destructive data pulls, PR drafts for small changes.

Common failure modes and how to handle them

Metrics errors. Joining data from multiple sources introduces definition mismatches and calculation errors. Verify column by column for any metric used in decisions.
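
A column-by-column check can be as simple as lining up the same rows from two sources and listing every value that disagrees. The sketch below shows the idea with plain Python data. The `compare_columns` function, the `week` key, and the sample figures are illustrative assumptions; in practice the rows would come from exports of your own tools.

```python
def compare_columns(source_a, source_b, key, tolerance=0.0):
    """Compare two lists of row dicts column by column, matching rows on `key`.

    Returns (key_value, column, value_a, value_b) tuples for every mismatch.
    Rows or columns present in only one source show up with None.
    """
    rows_a = {row[key]: row for row in source_a}
    rows_b = {row[key]: row for row in source_b}
    mismatches = []
    for k in sorted(set(rows_a) | set(rows_b)):
        a, b = rows_a.get(k, {}), rows_b.get(k, {})
        for col in sorted((set(a) | set(b)) - {key}):
            va, vb = a.get(col), b.get(col)
            both_numeric = (isinstance(va, (int, float))
                            and isinstance(vb, (int, float)))
            if both_numeric:
                if abs(va - vb) > tolerance:
                    mismatches.append((k, col, va, vb))
            elif va != vb:
                mismatches.append((k, col, va, vb))
    return mismatches

# Made-up exports for illustration only.
analytics = [{"week": "2026-W20", "signups": 410, "revenue": 5200.0}]
billing = [{"week": "2026-W20", "signups": 410, "revenue": 5150.0}]
print(compare_columns(analytics, billing, key="week"))
# [('2026-W20', 'revenue', 5200.0, 5150.0)]
```

A mismatch does not say which source is right. It tells you where the definitions diverge, for example gross versus net revenue, so you can settle on one before the number reaches a decision.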

Usage limits. Long-running sessions can hit usage limits and stop mid-task. For complex workflows, break work into stages rather than attempting everything in a single session.

Team workflows: From personal Codex to shared operating system

Individual Codex workflows compound over time. Team workflows compound faster but require coordination.

What changes when a team uses Codex

Teams build trust in agents through the humans who operate them. When a colleague receives a document or plan that Codex drafted, they trust it to the degree they trust the person who shared it.

Infrastructure that makes team Codex work

Shared review surfaces. A shared document review tool (Proof, Notion, Google Docs) makes agent-generated documents easier to inspect and comment on than outputs reviewed only inside Codex.

A simple way to get a team to use Codex

A note from a leader that makes using AI the expectation, not a nice-to-have
A weekly meeting where anyone can show a prompt or workflow they’ve built
A regular message that names the people whose work stood out

Set the expectation, give people a place to share what works, and recognize them for it—that’s most of the battle.

Part 6: Getting started

The seven-day Codex power-user plan

Day 7: Compound. Run the compounding prompt at the end of your Codex session:

Based on everything we have done this week, what should become a reusable skill, an automation, or a small tool? What context should I add to my project files so future sessions start from a better baseline?

Review Codex’s suggestions and implement the one that would save the most time over the next month.

30-day extension:

Week 1: One personal workflow running reliably
Week 2: One multi-source workflow pulling from at least three connected tools
Week 3: One small tool or automation that handles a chore without your involvement
Week 4: One shared or team workflow with explicit ownership and review cadence

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

Cheap Competence, New Frontier

Every Staff / Context Window — 2026-05-24 05:00:00 -0400

Midjourney/Every illustration.

Hello, and happy Sunday! This week we published “After Automation,” Dan Shipper’s argument that even when you automate as much as we have, there’s always a new frame for humans to hand to the models. COO Brandon Gell and new head of marketing Douglas Brundage tested the idea by moving their agent work into public internal Slack channels and watching the lurkers gather. Anthropic’s reported $300 million acquisition of developer-tools startup Stainless rides on the same bet—that an agent can’t use a company’s API unless a human has first made it easy to use, which is what Dan and CEO Alex Rattray talked through on AI & I months before the deal.

Scroll down for two takes from the ground at Google I/O—Jack Cheng on why Google is aiming at everyday users, not the AI crowd, and Alex Duffy on Demis Hassabis’s claim that AGI is a few years out—and what Google’s been doing to take us there. Plus, a mini-Vibe Check on Gas City from head of tech consulting Mike Taylor and a Grok-based “banger classifier” Katie Parrott is running her X drafts through, and Katie’s playbook for new grads facing AI-driven entry-level cuts at Meta and beyond—copy-paste career-coach prompt included. We’re off Monday for U.S. Memorial Day and back in your inbox on Tuesday.—Kate Lee

Knowledge base

“After Automation” by Dan Shipper: We’ve automated as much as possible at Every—agents write the code, draft emails, and compile the newsletter—and yet there’s more human work to do than ever. Dan’s new report traces what happens when cheap competence floods the market and argues there’s always a new frame for humans to hand the models. Read this for the case that progress expands human work rather than ending it.

“Google I/O: Agents, Agents, Agents” by Jack Cheng/Context Window: Google’s I/O keynote rebuilt search and assistants around agents—a default AI Mode, the 24/7 Gemini Spark, and a Universal Cart co-built with Amazon, Meta, and Microsoft—all on Gemini 3.5 Flash, pitched as Opus 4.7-level intelligence at four times the speed and half the cost. Read Jack Cheng’s report from the field for why Google’s I/O bets on distribution over benchmarks.

“Notes From the Foothills of the Singularity” by Alex Duffy/Playtesting: At Google I/O, Demis Hassabis placed AGI “just a few years” out and put its total impact at 10 times the Industrial Revolution. Alex Duffy frames the other side of the story through his Uber driver back from Mountain View: a 54-year-old construction worker who knows the city by heart and is worried his job is next. Read this for the tension between Google’s compute-at-scale ambitions and the workers whose ground it’s reshaping.

“Inside the 100-agent Software Factory” by Katie Parrott/Context Window: Mike Taylor previewed Gas City, the successor to Steve Yegge’s viral Gas Town—an orchestration toolkit where a persistent “mayor” agent dispatches anonymous “polecat” workers. Read this for the multi-agent engineering ideas worth internalizing even without the tool.

“How to Start a Career When AI Is Doing Your Entry-level Job” by Katie Parrott/Working Overtime: As Meta and other companies announce job cuts citing AI, Stanford’s Digital Economy Lab finds employment for 22-to-25-year-olds in AI-vulnerable jobs is down 13 percent since late 2022, while older workers have held steady. Katie Parrott offers four moves for new grads navigating an entry-level rung that’s getting kicked out. Read this for a copy-paste career-coach prompt and the case for protecting one craft from AI.

Log on

Upcoming event

Executive AI Sessions: On June 2, head of consulting Natalia Quintero hosts a live webinar introducing Every Consulting’s new offering for leadership teams navigating AI adoption—built on the playbook we’ve been running with executive clients for months. Learn more and register.

In New York City

Every 🤝 IRL: Join us at the Every brownstone in Brooklyn on June 3 during New York Tech Week for a subscriber-only meetup celebrating the Every community over drinks and conversation. Learn more and RSVP.

Alignment

Think boom, not doom. At an obesity conference in Istanbul last week, two words seemed to be on everyone’s lips: GLP-1s and AI. It is hard to think of two more important technologies arriving in healthcare at the same time. GLP-1s are changing what we know about biology, and AI is changing the distribution of knowledge. I can’t even begin to imagine what the world is going to look like in the next five or 10 years.

Even so, a recurring question was this: What happens to the doctor-patient relationship when medical knowledge becomes abundant?

A growing number of patients are taking their health data, quite literally, into their own hands. They wear an Oura ring and get blood work through companies like Function Health or Superpower. They upload lab results, medical history, symptoms, medications, and sometimes even genetic data into ChatGPT or Claude. With enough context and persistence, they can generate a reasonably sophisticated view of their own health risks, possible diagnoses, or whatever else they might want to know about their biology.

Share of U.S. consumers who have self-diagnosed using a commercially available LLM, 2023–2025. (Source: Bain, Stifel.)

Two things will change about how medicine will be practiced in the next few years.

First, there may be fewer people utilizing primary care, especially among younger, tech-savvy patients in cities like San Francisco, New York, and Austin. Some visits that used to be driven by uncertainty may be replaced by AI-guided reassurance, self-triage, or more targeted use of labs, telehealth, and specialists. The result will be fewer low-information visits, which could be beneficial if it frees capacity for people who need in-person care most.

Second, when patients do see doctors, they will not come empty-handed, waiting for the physician to be the sole authority. They’ll be armed with much sharper questions. This is where Dan’s point about cheap competence becomes so important. As models commoditize medical knowledge, the value of situated judgment rises. The scarce skill becomes knowing what to do next for this particular person.

I am optimistic. AI does not make physicians irrelevant. It just makes excellent physicians more valuable.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

Notes From the Foothills of the Singularity

Alex Duffy / Playtesting — 2026-05-22 05:00:00 -0400

Midjourney/Every illustration.

Last year at Google I/O, the company made an overwhelming 100 announcements, including an AI video model, Veo 3, that was miles ahead of anything else at the time. This year had less wow but more dutiful iteration. Gemini 3.5 Flash is faster and more capable than Google’s previous frontier model. Search now builds the right small tool to answer your question on the fly. Gemini assistants can keep running with your laptop closed. Even Gemini Omni, a new multimodal world model that intuitively understands gravity, kinetic energy, and fluid dynamics (and will likely help train robots), is, for now, being billed as “Nano Banana for video.”

Competitors spent the year throwing things at the wall. OpenAI touted its video model, Sora 2, as a ChatGPT moment for video that, according to former head Bill Peebles, would “evolve into a mini alternate reality,” only to shut it down later in the same year. Others leaned into the work market while simultaneously talking, as Anthropic CEO Dario Amodei did, about AI’s potential to decimate entry-level jobs. Google’s releases, by contrast, were not flashy. But filling the gaps both within AI’s jagged intelligence and across its products, while getting the tools to people who will use them, is probably orders of magnitude more important.

Attendees at this year’s Google I/O, with the swooping, landscape-inspired roof of the company’s Bay View campus buildings. (All photos courtesy of Alex Duffy.)

Demis Hassabis, CEO of Google DeepMind, called this moment the “foothills of the singularity.” He puts artificial general intelligence (AGI) “just a few years” out and its total impact at 10 times the Industrial Revolution, and arriving 10 times faster. We now have the ability to automate almost anything we can capture reliable data on, but one of the biggest hurdles is convincing society that it’s worth investing in that ability. Right now most people don’t think it is.

Hassabis called out explicitly that “it’s incumbent on the field, our field, the AI field and industry to show the unequivocal benefits more clearly and more concretely.” My impression, after this year’s conference, is that Google sees the precarity of the current moment clearly, and its scale gives it a rare position to do something about it.

The loop

Google’s loop works like this: Researchers find new data, improve the model architecture, and train a new model. The model is trained specifically to fit into Google’s “Antigravity” harness, giving it the ability to write and run code, and therefore do pretty much anything else. The company then applies it across every product: Search, Docs, YouTube, Gmail, Android, the works. Users try it out and provide feedback implicitly through behavior and explicitly with thumbs up or down ratings. The next model improves. Everything happens across Google’s full stack: the chips it designs, the data centers it owns, the models, the deployment pipeline, and billions of users on more than half a dozen core apps. This past year has been about realigning the organization to run that loop at scale.

Internal tools are being rewritten to be 20 times faster and built for agents. Google is looking at how experts within and outside of the organization work, collecting that high-quality data, identifying the underlying capability gaps, then training models to overcome them.

It shows up as a search box that can build a custom widget for your question on the fly, helping drive home a deeper understanding than a headline. Or in an easier-to-use Gemini app, which just passed 900 million monthly users and will soon have a 24/7 personal agent doing research across your emails, catching tasks and running with them asynchronously, returning drafts, reports, itineraries, and more. Google’s adding new agents to surfaces across its family of apps like Maps and Shopping, all of them powered by Gemini 3.5 Flash and the Antigravity harness—the same combination that can build a working operating system in 12 hours with 93 sub-agents for under $1,000. None of that was possible six months ago. Now billions of people will use these tools to pursue their goals, often without realizing that they’re using them.

Google DeepMind CEO Demis Hassabis at the “AI and the frontiers of science” session on the second day of the conference.

The obligation

A year ago, Google processed 480 trillion tokens a month. Last month, that number was 3.2 quadrillion, or more than 100 trillion a day, doubling every three weeks. Its capital expenditures this year were around $180 billion, almost six times what they were in 2022. But so far, the general public is not convinced that the investment is worth it. What most people see, instead, is white-collar layoffs, resource-hungry data centers going up in their back yards, and a small group getting very rich.

My Uber driver back from Mountain View to San Francisco was 54 years old, still worked in construction, and optimized his routes around the goings-on of his city, with which he was intimately familiar. He’d never heard of Hassabis or how games could help teach AI, but was curious about what happened at I/O. He opened our conversation with a worry about layoffs, the rich getting richer, and the question of who would be left to spend in the economy. I asked a lot of questions and mentioned how Hassabis emphasized the obligation of the industry to “show the unequivocal benefits of AI more clearly.” I shared my admiration for Hassabis’s clear, vocal focus on curing all disease, and the progress made so far thanks to AlphaFold. We talked about how one person could now do what used to take a team, and how that opens room for more small businesses, though the road there may be pocked with layoffs. By the time we arrived in San Francisco, he had moved the YouTube documentary he’d saved to the top of his watch list.

I think people want to be excited. The promise is real—AI is the best general-purpose tool we’ve ever had for science. Data centers already pay half of some counties’ property tax revenue, lessening the burden on everyday people and providing dramatically better returns on resources like water than alternatives. On the horizon are cures we’ve been chasing for decades, materials that could increase our energy efficiency while reducing our footprint, and education that adapts to the learner. Self-driving cars could save tens of thousands of American lives a year and provide the freedom of mobility to many. They will also be coming for my driver’s job. The promise arrives at scale, but the cost arrives household by household. Unless the industry shows upsides as tangible as today’s downsides, whether actual or perceived, and invests in the people displaced first, progress will slow.

James Manyika, president of Research, Labs, Technology & Society at Google and Alphabet (left), in conversation with Hartmut Neven, founder and lead of Google Quantum AI, who is holding up one of Google’s Willow quantum chips.

The window is open. Google and others have built the infrastructure to run this cycle at scale and put it in the hands of billions. This past week mathematicians used a frontier model to uncover a mathematical secret which had eluded us for 80 years, disproving a long-standing conjecture in discrete geometry. That used to require a PhD or a team. Now it can mean one curious person and a coding agent. What’s left is to point these tools at problems worth solving right now, that produce visible benefits for individuals and communities alike. Announcements like the Gemini XPRIZE, which aims to do just this, show that the company understands the urgency of the moment. As does just simply getting the tools into the hands of more people, especially when the learning curve is as shallow as asking a question.

I’m excited about the robotics updates and the world models being built for simulation. The bigger moonshots are coming. But the work most worth doing right now is the work in front of us, with the people around us. The future, in Hassabis’s words, is yet to be written. But we must also be careful with direction and not mistake activity for achievement. The stakes are high. The conversations we have, the stories we tell, and the way we use these tools today will define what comes tomorrow.

Alex Duffy is the cofounder and CEO of Good Start Labs and a contributing writer.

After Automation

Dan Shipper — 2026-05-21 10:00:00 -0400

Midjourney/Every illustration.

We’ve automated everything we can here at Every. Agents write our code, draft our emails, handle customer support, and help compile the newsletter. We alpha-test new models before they launch. We use AI in every way imaginable to build and ship everything we touch. We go as far and as fast as possible.

Yet there’s more human work to do than ever.

Today we’re publishing “After Automation.” It’s something I’ve been working through for a while. The popular narrative is that AI will eliminate human work. But I think technological progress creates more for people to do, not less. And that’s a good thing.

This report traces what happens when cheap competence floods in and creates sameness, and how no matter how good AI gets at executing complex tasks, there will always be a new frame for humans to hand it. I’ve included examples from inside Every: how we embed our agents, what benchmarks we use, prompt engineering we play with, and what the work looks like when humans stay structurally ahead of the models.

Of course, this report is agent-native. Drop it into Codex or Claude and argue with it to your heart’s content.

Read "After Automation"

Watch the video

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn.

Inside Stainless, The Developer Tools Startup Anthropic Just Bought for $300 Million

Dan Shipper / AI & I — 2026-05-20 13:00:00 -0400

The transcript of AI & I with Stainless CEO Alex Rattray is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts. [Disclosure: I’m a small investor in Stainless.]

Timestamps

Introduction: 00:01:15
APIs and MCP, the connectors of the new internet: 00:05:09
Why MCP exists: 00:11:00
Why MCP servers are hard to get right: 00:17:15
Design principles for reliable MCP servers: 00:20:24
Using MCP for business ops at Stainless: 00:25:06
Alex’s take on the security model for MCP: 00:40:57
How one-off AI actions become permanent production software: 00:44:42

Dan Shipper

The internet runs on computers talking to each other, but its entire architecture was built for a pre-AI world. Now we’re trying to hook AI up to the internet with MCP—Model Context Protocol—which turns any website or web service into a set of tools that an AI can use natively to get work done. And the software companies that learn how to do MCP well are going to win over the next decade.

That’s why I brought Alex Rattray, the founder and CEO of Stainless, onto the show. Stainless’s job is to help computers talk to each other. They make the APIs and SDKs for all the big companies you know about, like OpenAI and Anthropic, and they’re starting to build MCP servers too. Alex and I get into the nitty-gritty of what the future of MCP looks like, how to design good MCPs, why MCPs are actually really hard to scale and possibly insecure, and we try to figure out together what a better model for allowing AIs to use the internet might look like.

This is a great episode. Alex is a good friend of mine. Let’s dive in.

Alex, welcome to the show.

Alex Rattray

Thanks, Dan. It’s really exciting to be here.

Dan Shipper

It’s good to have you. For people who don’t know, you are the founder and CEO of Stainless, which is the API company. You make APIs for companies like OpenAI and Anthropic—just name your big company that you might use their API, and Stainless is probably behind it. Before that you worked at Stripe doing their API, which makes total sense. And before that, most importantly, we were very good friends in college and have remained good friends. We were both starting companies in college. I’m a tiny investor in Stainless. It’s been really fun to watch your journey and get to hang out together so much over the years, and I’m just very excited to bring you on to talk about AI and what you’re doing at Stainless.

Alex Rattray

Thanks, Dan. It’s been really fun over the years. When we were in college, I was working on a startup and you were working on a startup. You had a conference room at a venture capitalist office as your office, and you let me crash there with my co-founder and team. We were just on the other side of the conference table hacking away into the evening. Very fond memories of those days. And these days it’s not every evening, but on the weekends, whatever—the same thing is still happening. You don’t see that every day, and it’s a really nice feeling. It’s been great to see everything happening with Every along the way.

Dan Shipper

Thank you. As I say, I started from the bottom, now we’re here.

The thing I always say when I run into people and they ask me about you—in order to embarrass you—is that you’re the only person I know of who has consistently run barefoot through the streets of Philadelphia. When we first met, you were not a fan of shoes and you were a fan of running. You want to talk about that?

Alex Rattray

It wasn’t that I didn’t like the concept of shoes—it’s that I couldn’t find a good pair. At a certain point, I was running through Nikes and they would bust open every few months. I think what was actually going on is that I had really wide feet and was probably buying narrow shoes. Shoes would constantly get ruined, and on a college budget it’s just like, “This is no good.” Eventually I decided, okay, the longer you wear your shoes, the more worn out they get, but the longer you just wear your feet, the tougher they get.

Dan Shipper

“The longer you wear your feet.”

Alex Rattray

Try it out. Try this at home. What could go wrong? I actually currently have a really annoying splinter in one of my feet—so don’t actually try this at home. But—

Dan Shipper

Are you still running barefoot?

Alex Rattray

No, no. This is just from around the house.

Dan Shipper

Dangerous.

Alex Rattray

Yeah. But see, that’s the thing. If I had been going around on the asphalt without socks on, my feet would’ve been tougher and I’d have no splinter.

Dan Shipper

So when you’re not running barefoot, you’re running Stainless. You’re around 50 people now, right?

Alex Rattray

Just about, yeah.

Dan Shipper

That’s pretty wild. You started Stainless in a pre-AI world, and now we’re in an AI world, and I think you have some ideas for what the future of AI is going to be and how APIs fit into that, how MCPs fit into that. Do you want to paint a little picture for us about where we’re going?

Alex Rattray

I would love to. To start—what’s an API? Not everybody’s familiar with that. It stands for application programming interface. There will not be a quiz, right, Dan?

Dan Shipper

No quizzes.

Alex Rattray

Great. Basically, it’s how one computer program talks to another computer program. It’s how computers talk to computers, how apps talk to apps. APIs are the dendrites of the internet. Dendrites are where your neurons connect and actually exchange information with each other. If you have two neurons in your brain but they’re not talking to each other, you’re actually not thinking. There is no thought happening in a brain without connections between neurons.

And if you think about the internet—if all these servers in the cloud weren’t talking to each other, you wouldn’t have internet. Programs, internet software, does nothing without APIs, without connections to other programs. It’s really fundamental to the mesh of pretty much all modern software. Everything we think of when we think about technology—APIs are at the heart and center of that, just like dendrites are the center of the mesh of the brain and how we think.

Stainless’s mission from day one was to make it easier for computers to talk to computers. The long-running trend of technology is toward more automation. APIs are how most business-to-business interactions, in some format or another, become real, become automated.

What we see with the rise of AI is that a new computer has entered the chat. There’s a new kind of system that can talk to other systems—or at least we’d like it to be able to. You used to have either humans interacting with a computer through a user interface, or a computer interacting with a computer through an API. Now we have LLMs interacting with computers. What’s that through?

Anyone familiar with Every and who’s a regular listener will know MCP—Model Context Protocol—which is a system for connecting LLMs to computers broadly speaking. It’s an area we’re investing in at Stainless. It’s really part of our core mission of making it easy for computers to talk to computers.

The core product we first brought to market is software development kits, SDKs. These are ways of saying, “Okay, Stripe has this great REST API. You can send JSON over HTTP and get back JSON over HTTP. And if you want that to be really convenient, you’re going to use the Stripe Python library, the Stripe Python SDK.” If you’re a Python developer, you’ll go pip install stripe, and then in your application code you’ll write stripe.customers.create, and all of a sudden you have a nice new customer object in your Stripe database and you’re off to the races. Or stripe.charges.create in the old days, to charge a credit card.
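To make the SDK idea concrete, here is a minimal sketch of a library that turns a method call into JSON over a transport and back. It is not the Stripe library: the class names, the route, and the fake transport are all assumptions made so the example runs offline.

```python
import json
from typing import Callable, Dict

# Hypothetical transport: takes (method, path, json_body) and returns a JSON string.
Transport = Callable[[str, str, str], str]


def fake_transport(method: str, path: str, body: str) -> str:
    """Stand-in for a real HTTP call so the example runs without a network."""
    payload = json.loads(body)
    if method == "POST" and path == "/v1/customers":
        return json.dumps({"id": "cus_demo_1", "object": "customer", **payload})
    return json.dumps({"error": f"unknown route {method} {path}"})


class Customers:
    """Resource wrapper: turns a method call into JSON sent over the transport."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def create(self, **params: str) -> Dict[str, str]:
        raw = self._transport("POST", "/v1/customers", json.dumps(params))
        return json.loads(raw)


class Client:
    """Toy SDK entry point, so callers write client.customers.create(...)."""

    def __init__(self, transport: Transport = fake_transport) -> None:
        self.customers = Customers(transport)


client = Client()
customer = client.customers.create(name="Dan", email="dan@example.com")
print(customer["id"], customer["name"])
```

The caller never touches JSON or HTTP directly, which is the convenience an SDK is meant to provide. Swapping fake_transport for a function that makes a real request would leave the calling code unchanged.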

SDKs give developers that easy way to interface with an API. What’s the thing that gives LLMs an easy way to interface with an API? You might say MCP, and in a sense you’d be right. But what we’re seeing so far as MCP rolls out into the world and people experiment with it is that it’s not working so great. It’s difficult to deliver on what I see as the core vision of what’s so exciting about MCP.

A dashboard and a user interface lets you click around, see a bunch of stuff, fill out forms, click buttons, do things—anything you’d do while interacting with software, you do through the UI. But LLMs interacting through MCP tend to be much more restricted. You can only do a few little things. There’s usually not a ton of tools you’re going to be exposing to the models.

(00:10:00)

Dan Shipper

Just to stop you there—what I’m hearing you say is that just like a website is built for humans to use, MCP is sort of the equivalent for models. You can think of it as exposing a set of tools the model can use to perform certain functions. Just like you might click a button on a website, MCP gives the model a bunch of things it can click on or use to get work done.

An example might be a Gmail MCP that has a send mail tool, a compose mail tool, a read inbox tool—that kind of thing. And instead of a human going on the Gmail website and doing it, the LLM is essentially logging in and using it itself. It’s a native interface for language models. But you’re saying that’s not working that well. Can you tell me more?
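As a rough illustration of "a set of tools the model can use," the sketch below registers two tools with a name, a description, and an argument schema, then dispatches a call by name. It is loosely modeled on how MCP describes tools but does not use any MCP library. The mailbox is an in-memory stand-in, not a Gmail API, and every function name is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Tool:
    name: str
    description: str
    input_schema: Dict[str, Any]  # JSON-Schema-style description of the arguments
    handler: Callable[..., Any]


# In-memory stand-in for a mailbox; not a real mail service.
INBOX: List[Dict[str, str]] = [{"from": "alex@example.com", "subject": "Socks"}]
SENT: List[Dict[str, str]] = []


def read_inbox() -> List[Dict[str, str]]:
    return list(INBOX)


def send_mail(to: str, subject: str, body: str) -> Dict[str, str]:
    SENT.append({"to": to, "subject": subject, "body": body})
    return {"status": "sent", "to": to}


TOOLS: Dict[str, Tool] = {
    tool.name: tool
    for tool in [
        Tool("read_inbox", "List messages in the inbox.",
             {"type": "object", "properties": {}}, read_inbox),
        Tool("send_mail", "Send an email.",
             {"type": "object",
              "properties": {"to": {"type": "string"},
                             "subject": {"type": "string"},
                             "body": {"type": "string"}},
              "required": ["to", "subject", "body"]}, send_mail),
    ]
}


def list_tools() -> List[Dict[str, Any]]:
    """What the model is shown: names, descriptions, and argument schemas."""
    return [{"name": t.name, "description": t.description, "inputSchema": t.input_schema}
            for t in TOOLS.values()]


def call_tool(name: str, arguments: Dict[str, Any]) -> Any:
    """What runs when the model picks a tool and supplies arguments."""
    return TOOLS[name].handler(**arguments)


print([t["name"] for t in list_tools()])
print(call_tool("send_mail", {"to": "dan@example.com", "subject": "Hi", "body": "Refund sent."}))
```

The listing is the model's equivalent of the buttons on a page, and call_tool is the click.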

Alex Rattray

Let’s start with what I see as the big vision of MCP and, in some sense, the big vision of agentic AI in the first place. I’ll start with the most pedestrian example you can imagine.

Let’s say Dan walks into my store and buys a pair of stripey socks and maybe a few other things. The next day I hear back from Dan that there’s something wrong. It happens, you know? I turn to someone on my team and say, “Hey, can we refund Dan for those stripey socks he bought yesterday and send him a discount code for next time with a little thank-you note, because we like to take care of our customers?”

This is the most normal thing to do in software—some little task like this. What the member of my team would be doing is opening up their internal admin and looking around. They might go to the Stripe dashboard and look through the list of payments or transactions or orders to find one that has someone named Dan. Which Dan? There might be a bunch. Look through the list of products in the order to see whether there were stripey socks in there. That might be a few clicks. Find the right one, then go to the screen where you can create a refund, create the refund, make sure it’s the right amount, then go and create the discount, then take that discount code and send it over to some other SaaS app to send the mail automatically.

Of course, in a business-to-business context, you might be going into Salesforce and sending a Slack message to an account manager, so on and so forth. In the normal course of work, it’s just the most normal thing in the world—having one task involve going through five different apps, each time 15 different clicks and scrolls and loading spinners, just to do one simple thing.

The promise of agentic AI is to take that same prompt and type it into ChatGPT or Claude or whatever, say, “Hey, can you help refund my friend Dan?” and just have the AI go off and do that—go through these five different apps and the 15 different screens and the various button presses to complete the task and then come back and say, “Great, it’s done.”

In order to do that—and there are only so many tool calls you have to make as an AI model to perform that exact linear chain of events, so it’s somewhat tractable—but if you think about this in the general case, you want your agentic AI to be able to do anything that human operator would have done, without having to wait for a bunch of JavaScript to load on a website or anything like that.

That means you need not only the Stripe create refund tool and the Stripe list transactions tool and the Stripe list products and lookup customer and create discount tool—you need not only those tools, but you need everything you can do in the Stripe dashboard, which is basically everything you can do in the Stripe API. And that’s actually a lot. There are hundreds of different endpoints in the Stripe API. The Stripe dashboard is massive. It’s a huge application.

If you were to take that list of tools today and go to an LLM and say, “Hey, here’s our MCP definition for all of this. Here’s a create refund tool, here’s a create transactions tool,” so on and so forth, and tell it all about those tools—all the descriptions, all the different request properties, the response properties, all the documentation—everyone listening already knows: you’ve just burned through your entire context budget. That’s hundreds of thousands of tokens just in pretty much translating the Stripe OpenAPI spec directly over to MCP tools. Today’s models not only can’t handle that amount of context, it’s a poor use of context because you have a lot else going on. But it’s also just confusing to the model. It’s too much to hold in your brain at one time.
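A back-of-the-envelope sketch shows how quickly tool definitions add up. Every number here is an assumption: the endpoint count (the conversation says only "hundreds"), the size of each made-up definition, the context window, and the four-characters-per-token rule of thumb, which is not a real tokenizer.

```python
import json

CHARS_PER_TOKEN = 4        # rough heuristic, not a real tokenizer
ENDPOINT_COUNT = 300       # hypothetical; the conversation says only "hundreds"
CONTEXT_WINDOW = 200_000   # hypothetical model context size, in tokens


def make_tool_definition(index: int) -> dict:
    """Build a made-up tool definition of plausible size (20 documented properties)."""
    properties = {
        f"param_{i}": {
            "type": "string",
            "description": "Documentation for this request property. " * 3,
        }
        for i in range(20)
    }
    return {
        "name": f"operation_{index}",
        "description": "What this endpoint does, when to use it, and what it returns. " * 5,
        "inputSchema": {"type": "object", "properties": properties},
    }


def estimate_tokens(obj: dict) -> int:
    """Estimate tokens from the length of the serialized definition."""
    return len(json.dumps(obj)) // CHARS_PER_TOKEN


per_tool = estimate_tokens(make_tool_definition(0))
total = per_tool * ENDPOINT_COUNT
print(f"~{per_tool} tokens per tool, ~{total} tokens for {ENDPOINT_COUNT} tools")
print(f"share of a {CONTEXT_WINDOW}-token window: {total / CONTEXT_WINDOW:.0%}")
```

Under these assumptions the definitions alone overflow the window before the model has read a single user message, which is the problem described above.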

And that’s just the Stripe part of it. What you’re really trying to do is enable your operators to do anything they would normally do. And that spans many, many different SaaS tools. In the course of one interaction, it might be five. In the next interaction, it might be a different five. If you think about every single SaaS tool your business uses on a daily basis to get work done—ideally you’d want every single one of those tools exposed to your operators in their AI chat, with every single tool available, with every nook and cranny and corner case available, so you can do anything through AI. That’s the vision.

There are a lot of problems with that. The biggest is this context window limit. But you also have all sorts of security and permissions problems, because you don’t want the AI to color outside the lines and say, “In addition to refunding Dan’s socks, I also refunded every customer for all transactions ever. And then I sent a bunch of money to my own AI bank account.” There’s more to the challenge, but that’s the vision.
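One way to keep an agent inside the lines is to check every tool call against an explicit policy before it runs. The sketch below is a minimal illustration, not a description of any real product's security model. The tool names, the refund cap, and the per-session limit are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical per-session policy; tool names and limits are illustrative only.
ALLOWED_TOOLS: Set[str] = {"lookup_customer", "create_refund"}
MAX_REFUND_CENTS = 5_000
MAX_REFUNDS_PER_SESSION = 1


class PolicyError(Exception):
    pass


class Session:
    def __init__(self) -> None:
        self.refunds_made = 0

    def authorize(self, tool: str, arguments: Dict) -> None:
        """Raise PolicyError unless the call is on the allowlist and within limits."""
        if tool not in ALLOWED_TOOLS:
            raise PolicyError(f"tool not allowed: {tool}")
        if tool == "create_refund":
            if arguments.get("amount", 0) > MAX_REFUND_CENTS:
                raise PolicyError("refund exceeds per-call cap")
            if self.refunds_made >= MAX_REFUNDS_PER_SESSION:
                raise PolicyError("refund count exceeds session cap")
            self.refunds_made += 1


def attempt(session: Session, tool: str, arguments: Dict) -> str:
    try:
        session.authorize(tool, arguments)
        return "allowed"
    except PolicyError as exc:
        return f"blocked: {exc}"


session = Session()
results = [
    attempt(session, "create_refund", {"amount": 1200}),        # Dan's socks
    attempt(session, "create_refund", {"amount": 1200}),        # a second refund
    attempt(session, "send_bank_transfer", {"amount": 999_999}),
]
for line in results:
    print(line)
```

The first refund goes through, the second is blocked by the session cap, and the transfer is blocked because the tool is not on the allowlist.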

Dan Shipper

I think the place we started was you saying it’s not working. But I don’t think that’s the reason it’s not working today. Is that the reason why it’s not working today?

Alex Rattray

What people do with MCP today is sometimes try to expose all parts of their API. The way people generally build MCP tools is they have an underlying API—usually a REST API—and they wrap different parts of it, different endpoints, different operations, in MCP tools. You can do that in a one-to-one mapping, or you can kind of handcraft things for the MCP. Today, in order to succeed, people are finding you really have to handcraft it to the MCP, to the LLMs. You have to say, “Okay, I’m making one specialized tool to look up a customer and refund their transaction based on a description.”
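The difference between a one-to-one mapping and a handcrafted tool can be sketched as follows. The first two functions mirror single endpoints. The third combines them into the specialized "look up and refund by description" tool mentioned above. The data and all names are hypothetical, and the in-memory list stands in for a real payments API.

```python
from typing import Dict, List

# In-memory stand-in for a payments API; every name and record here is made up.
TRANSACTIONS: List[Dict] = [
    {"id": "txn_1", "customer": "Dan", "item": "plain socks", "amount": 900, "refunded": False},
    {"id": "txn_2", "customer": "Dan", "item": "stripey socks", "amount": 1200, "refunded": False},
    {"id": "txn_3", "customer": "Ana", "item": "stripey socks", "amount": 1200, "refunded": False},
]


# One-to-one wrappers: each mirrors a single underlying endpoint.
def list_transactions() -> List[Dict]:
    return [dict(t) for t in TRANSACTIONS]


def create_refund(transaction_id: str) -> Dict:
    for txn in TRANSACTIONS:
        if txn["id"] == transaction_id:
            txn["refunded"] = True
            return {"refunded": transaction_id, "amount": txn["amount"]}
    raise KeyError(transaction_id)


# Handcrafted tool: one call does the lookup and the refund together,
# and refuses to guess when the description does not match exactly one order.
def refund_by_description(customer: str, item_contains: str) -> Dict:
    matches = [
        t for t in list_transactions()
        if t["customer"].lower() == customer.lower()
        and item_contains.lower() in t["item"].lower()
        and not t["refunded"]
    ]
    if len(matches) != 1:
        return {"status": "needs_clarification", "candidates": [t["id"] for t in matches]}
    return {"status": "ok", **create_refund(matches[0]["id"])}


ambiguous = refund_by_description("Dan", "socks")    # two open orders match
exact = refund_by_description("Dan", "stripey")      # exactly one order matches
print(ambiguous)
print(exact)
```

With the handcrafted tool, the model makes one call and gets either a result or a short list of candidates, instead of chaining several calls and reading every transaction itself.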

Dan Shipper

So there are all these decisions you have to make where you need to have the ergonomics of the model in mind—how the model thinks—in order to make sure the model does the right thing more often than not.

Alex Rattray

Yeah, it’s hard. I use this SDK analogy sometimes. It took a long time for humanity to get to the point where we could make a really good Python SDK for a developer wrapping an API. I think we’ve cracked that nut. Stainless offers really great Python libraries, but we’re building on the shoulders of giants here. We haven’t figured out how to expose an API ergonomically to an LLM in the same way we’ve figured out how to expose it ergonomically to a Python developer. That’s a new research problem in a sense.

And it’s harder because I can go learn how to be a Python developer if I want. I can’t really learn how to think or see like an LLM. That makes it tricky.

We do have at Stainless some things we’re cooking up to address some of these problems. LLMs have a really hard time with a repeated, sustained chain of actions. Even if you get an API response back for “list all the transactions,” there’s so much data, and you might have to go through the next page and the next page to find the one that has Dan with the stripey socks. That’s again a ton of context with one or two small needles in the haystack. LLMs are pretty good at that, but not perfect—and with too much hay, we all end up throwing up our hands. That’s true for LLMs too.

Dan Shipper

When you’re building MCP servers for people—and when you see people doing it well today—what are the principles? How do you think about making an MCP server that one, people use, which is actually a big one, and two, when it is used, actually does the right job?

(00:20:00)

Alex Rattray

There have been relatively few times I’ve seen it done well. I have seen it done well. We’re cooking something up that I’m really excited about. But with today’s technology, you really have to do a good job of product management. You have to go out into the market, talk to your customers, see what their actual needs are, look over their shoulders as they use and operate your software, and think about what you could unlock through AI where people would be doing things they can’t really do with your software today—because it just got so much easier. Then you have to do a lot of engineering work to wrap it up in a bow that works for the models.

You have to set up a really good system for evals, and if you’re doing MCP, you have to think about the different clients people might be using. Are they using Cursor? Are they using Claude Code? Something else? And the different models underlying all that. You end up with a pretty crazy matrix of things to optimize for and ways to evaluate whether what you’re offering is working well.
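A minimal sketch of that matrix, assuming a placeholder grading function: the client names come from the conversation, while the model labels, tasks, and `run_case` are hypothetical.

```python
from itertools import product

clients = ["Cursor", "Claude Code", "other-client"]
models = ["model-a", "model-b"]  # placeholder labels
tasks = ["refund a purchase", "look up a customer"]  # placeholder tasks

def run_case(client: str, model: str, task: str) -> bool:
    # Placeholder: a real harness would drive the client and grade the outcome.
    return True

# Every combination of client, model, and task is its own eval case.
results = {case: run_case(*case) for case in product(clients, models, tasks)}
print(f"{len(results)} eval cases")
```

Each new client or model multiplies the number of cases rather than adding to it, which is why the matrix grows so fast.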

It’s also kind of a black box to get that feedback back to your servers so you can find out: we gave a tool call response here, was it actually any good? Did the user like it? Was the LLM able to use it? That’s a problem I haven’t seen a lot of people solve yet. Thinking about that as a first-class thing—maybe you have a send feedback tool, which is something we’ve been thinking about—so that if a user says out loud in the chat, “Oh man, that was useless garbage,” at least the MCP server finds out about that.
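One possible shape for such a feedback tool, sketched under assumptions: the handler name, the rating values, and the idea of keying feedback to a tool call ID are illustrative, not a description of what Stainless ships.

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackLog:
    entries: list = field(default_factory=list)

    def send_feedback(self, tool_call_id: str, rating: str, comment: str = "") -> dict:
        # Hypothetical tool handler: the model calls this when the user reacts
        # to an earlier tool result in the chat.
        if rating not in {"helpful", "unhelpful"}:
            return {"ok": False, "error": "rating must be 'helpful' or 'unhelpful'"}
        self.entries.append(
            {"tool_call_id": tool_call_id, "rating": rating, "comment": comment}
        )
        return {"ok": True}

log = FeedbackLog()
ack = log.send_feedback("call_42", "unhelpful", "that was useless garbage")
print(ack, len(log.entries))
```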

Dan Shipper

Is there anything more concrete you’ve learned about how to design a good MCP server—beyond the obvious stuff about talking to customers and thinking about use cases?

Alex Rattray

You want to keep the number of tools relatively small. You want the tool name and the description to be really precise and specific.

Dan Shipper

Aren’t those two things at odds?

Alex Rattray

Yes. Good writing is hard. You can make a great tool that looks up a person by name and product description and then refunds them. You also want a small number of properties in the input schema—a small number of parameters, concisely described but sufficiently described. This is also hard. You want the response data to come back with very little data—only exactly what the model will need. That’s also very hard because you may not know a priori which things the model is really looking for.

We have a technique we use in our MCP servers today where we give the model a jq filter, which is a way of filtering JSON, and that can work pretty well. But that’s kind of a special trick.
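The idea of trimming a response down to what the model asked for can be sketched as follows. This is a much simpler stand-in for a real jq filter: the `pick` helper and its dotted-path syntax are hypothetical, and the response data is made up.

```python
def pick(data: dict, paths: list[str]) -> dict:
    """Keep only the requested dotted paths from a nested response."""
    out = {}
    for path in paths:
        value = data
        for key in path.split("."):
            value = value[key]
        out[path] = value
    return out

response = {
    "id": "ch_1",
    "amount": 1200,
    "customer": {"id": "cus_1", "name": "Dan", "metadata": {"notes": "x" * 500}},
    "raw_events": ["..."] * 50,
}

# The model names the fields it needs; everything else stays out of context.
slim = pick(response, ["id", "amount", "customer.name"])
print(slim)
```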

Dan Shipper

Doesn’t this mean that MCP just needs another level like a search tool—search, like, find a list of relevant tools given my task?

Alex Rattray

The tool browsing problem is definitely a serious one, and that is one approach. We actually do this at Stainless today, where you can get an MCP server for your API that just has, like I was saying earlier, the very simple thing of every endpoint exposed as a tool. If you have a small API, that works great. You can also filter it, so you expose an MCP server with only a small subset of your endpoints. That works great.

You can also use what we call dynamic mode, where there are three tools no matter how big your API is. One is list endpoints, another is get endpoint and learn about it, and the last one is execute endpoint. That enables the context thing to scale really well, but it means three turns of the model just to do one thing. So that gets slower. It’s more expensive in another sense, and there’s some lossiness. It performs pretty well usually, but not quite as well because the tools aren’t loaded up in quite the same way.
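The three-tool pattern can be sketched like this. The registry contents, function names, and handlers are assumptions for illustration; only the list, get, and execute split comes from the conversation.

```python
# Hypothetical endpoint registry; a real server would build this from its API spec.
ENDPOINTS = {
    "refunds.create": {
        "description": "Refund a charge.",
        "params": {"charge_id": "string"},
        "handler": lambda charge_id: {"refund_id": f"re_{charge_id}"},
    },
    "charges.list": {
        "description": "List charges.",
        "params": {},
        "handler": lambda: {"charges": ["ch_1", "ch_2"]},
    },
}

def list_endpoints() -> list[str]:
    # Turn 1: names only, so the up-front context cost stays small.
    return sorted(ENDPOINTS)

def get_endpoint(name: str) -> dict:
    # Turn 2: full details for just the endpoint the model picked.
    spec = ENDPOINTS[name]
    return {"name": name, "description": spec["description"], "params": spec["params"]}

def execute_endpoint(name: str, args: dict) -> dict:
    # Turn 3: run the chosen endpoint with the model's arguments.
    return ENDPOINTS[name]["handler"](**args)

names = list_endpoints()
spec = get_endpoint("refunds.create")
result = execute_endpoint("refunds.create", {"charge_id": "ch_1"})
print(names, spec["params"], result)
```

The tool count stays at three however large the registry grows, at the cost of the extra model turns Alex mentions.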

Are you using MCP servers yourself?

Dan Shipper

Yeah.

Alex Rattray

Funnily enough, not so much on the coding side—I use it on the business side. I’ll use the Notion, HubSpot, and Gong MCP servers and an MCP server for our database—a read-only copy—and say, “Hey, what are the interesting customers that signed up for Stainless last week?” It’ll go off and make a great query of our Postgres database, cross-reference those things in HubSpot, look up our notes in Notion, maybe even look at transcripts in Gong, and tell me all about it. It’s incredible.

(00:30:00)

Dan Shipper

And so that’s one of your big use cases. How often are you doing that? I’m now interested—not even from an MCP perspective, but for anyone running a business with some complexity who wants to know what’s going on. What are you actually doing, what is the report that comes out, and how often? Tell me so I can steal it.

Alex Rattray

For me it’s still usually in kind of playing-around mode. One of the things is the MCP servers disconnect, and then I get annoyed. You have to reconnect, which is not a huge deal, but there are a lot of little paper cuts still in technology this new that can hold back some amount of usage.

One thing I found really helpful at the meta level—and I’m sure you’ve had other guests talk about this—is the practice of just collecting notes for the AI by the AI, then edited and curated by yourself. I have a notes folder, a research folder, something like that in a special Git repo that I use just for this sort of internal stuff. I tell the AI: “When you find interesting customer quotes, put them in this folder and give the full citation,” so that the next time I start asking interesting questions, it doesn’t have to go searching through the MCP servers again. It has them cached in markdown files on disk.
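A small sketch of that caching habit, assuming a folder of markdown files: the folder layout, file naming, and helper names are hypothetical, and the quote is invented for the example.

```python
from pathlib import Path
import tempfile

def save_quote(notes_dir: Path, customer: str, quote: str, source: str) -> Path:
    # Append the quote with its citation to a per-customer markdown file.
    notes_dir.mkdir(parents=True, exist_ok=True)
    path = notes_dir / f"{customer.lower().replace(' ', '-')}.md"
    with path.open("a", encoding="utf-8") as f:
        f.write(f"> {quote}\n\nSource: {source}\n\n")
    return path

def find_quotes(notes_dir: Path, keyword: str) -> list[str]:
    # Search the cached files on disk instead of re-querying remote servers.
    hits = []
    for path in sorted(notes_dir.glob("*.md")):
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.startswith("> ") and keyword.lower() in line.lower():
                hits.append(line[2:])
    return hits

# A temporary directory stands in for the Git repo.
notes = Path(tempfile.mkdtemp()) / "research" / "quotes"
saved = save_quote(notes, "Example Co", "The SDK saved us weeks.", "call notes (hypothetical)")
hits = find_quotes(notes, "sdk")
print(saved.name, hits)
```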

Dan Shipper

Wait, that’s crazy. What are you using to write into that Git repo? Is it Claude Code? ChatGPT? How does it get in there?

Alex Rattray

I use Claude Code these days for that kind of thing.

Dan Shipper

So you just have Claude Code open and running, and then a new customer testimonial comes in and you’re like, “Hey, can you throw this into my master company Git knowledge repository?” And then whenever you need anything later you’re like, “Claude, go search through my master repository to figure out where the best customer quote is for this.”

Alex Rattray

Totally.

Dan Shipper

That’s so cool. What kind—can we see it?

Alex Rattray

No, it’s too messy and probably has a lot of confidential information—the latter being more important.

Dan Shipper

When you say it’s messy, are you having Claude organize it at all? How is it structured?

Alex Rattray

There’s a lot that I want to do here that we haven’t had the chance to do yet. There’s some lower-hanging fruit that our business team is working through right now, just on the basics of your CRM systems and so on. It’s not well-structured now, but I think that’s fine. I’m not going to prioritize structuring it super well until we’re using it more broadly. I use it some of the time. One of the business people on the team uses it a fair amount. One or two of our customer support engineers use it a lot. But it’s not yet broader than that, and I’d like it to get there. Once we see how everything’s evolving, that’s when we’ll start bringing in more structure. As it is, Claude Code can handle unstructured stuff really well. You don’t have to think about it too hard in advance. You can move things around later.

Dan Shipper

What else do you have in there other than customer quotes?

Alex Rattray

SQL queries. I’m a software developer—I don’t write a lot of code these days, but I spend a lot of time doing that. When I say, “Hey, how is our month-on-month growth of XYZ metric over the last three months?”—I did this recently for my last board prep—it came out with a pretty good answer right away, and I was like, “Wow, this is awesome.” Then I looked a little deeper and realized I actually wanted to exclude certain users from the analysis and filter it this way and that way. I imbued more business context into that SQL query and iterated with Claude Code to get it better and better for the specific metric and the specific story I was trying to tell. Then I got it to a good place and said, “Great, let’s dump this into an analytics folder for future use.”

Dan Shipper

So next time you’re doing board prep, you can be like, “Hey, what was that query we did last time?” and it’ll go get it.

Alex Rattray

Yeah. That’s really cool.

Dan Shipper

What else?

Alex Rattray

As any software team is doing these days—we’re using this for, “Hey, a customer comes in with a question. Can Claude Code just fix it?” In some cases, a Linear ticket gets filed, and our support engineers are really very technical. They may not have the wall clock time to chase down the fix themselves on an incoming bug. They have the technical skill, but another customer writes in two minutes later and they want to jump on that. They don’t want to be knee-deep in a debugger.

So sometimes what we do is file the ticket—intending to do it later, or for another engineer to do it later—but say, “Hey, can we see if Claude Code can just take a crack at it?” Is that going to work out 100% of the time? Definitely not. Is that going to work out 50% of the time? Still no, to be honest. But can that improve the overall efficiency? Yeah, maybe. We’re still experimental there, but we’re seeing a lot of promise.

Dan Shipper

In our pre-production call, you were talking about having a big vision for the future of AI. Do you want to walk me through that?

Alex Rattray

I would love to. We talked earlier about how agentic AI can make operators’ lives a lot easier by taking certain pedestrian tasks and running with them independently. That’s something I think as an industry we’re almost on the cusp of.

A big part of the way I see things unfolding from here—I like to say the future of AI is cyborgs. Which is already sort of ridiculous because what is a cyborg other than a robot? But cyborg, as I understand it, is a term that means you’re part person and part machine. In this case, when you go and talk to an agent, what you’re going to be getting is part LLM neural net and part code—where the machine I’m talking about is traditional CPU software, not GPU software.

I think this will play out in two main ways. One is your kind of one-off operational use cases like we were talking about a minute ago, and then the other is production software.

In the use case where someone needs to perform some tricky one-off action with a bunch of points and clicks, and now we want an AI to just make a bunch of tool calls—the way I actually see that happening and what we’re building toward is code execution. Rather than the model having a bajillion tools, the model has two tools. One to execute code—where it just has a text box of “put in some TypeScript, and you’re going to use this API’s TypeScript SDK, and you’re going to write stripe.charges.list, stripe.customers.retrieve, stripe.refunds.create.” This is really easy for models. They’re really good at writing code.

(00:40:00)

If you give that tool a little bit of a README—“here’s an example request, here are some other API calls you can make”—it’s really good at extrapolating from patterns when the SDK and the API are well-formed and predictable. Then you give it an additional tool to search the docs and ask questions of the docs. Anything it’s not sure about or gets wrong on the first try, you give it the documentation.

What this does for the scenario we were talking about earlier is you have very limited impact on the context window up front—we’re talking about 1,000 tokens or something like that. And the context impact of doing a whole bunch of paginated list requests? Zero. The model will go look for somebody named Dan and double-check that the purchase was stripey socks. You might write three nested for loops, but then only at the end when it found the right thing it’ll console.log “found Dan, customer ID, blah blah, transaction ID, blah blah.” Then create refund—refund ID one, two, three.

The context hit coming back from all of this is going to be like 10 lines of text. It’s really minimal. And all of this will run really quickly too, so you don’t have a round trip to the model every time you’re doing something like this. It’s just CPU code, and it runs in a server in the cloud right next to the Stripe API somewhere in AWS. It goes super fast.
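Here is a sketch of the kind of script a model might write inside such a code execution tool. Alex describes TypeScript against a real SDK; this version uses Python and an in-memory stand-in, so the `FakeCharges` class, its pagination fields, and the refund ID format are all hypothetical.

```python
class FakeCharges:
    """In-memory stand-in for a paginated charges API (not a real SDK)."""

    def __init__(self, charges, page_size=2):
        self._charges = charges
        self._page_size = page_size

    def list(self, starting_after=None):
        start = 0
        if starting_after is not None:
            ids = [c["id"] for c in self._charges]
            start = ids.index(starting_after) + 1
        page = self._charges[start:start + self._page_size]
        has_more = start + self._page_size < len(self._charges)
        return {"data": page, "has_more": has_more}

charges_api = FakeCharges([
    {"id": "ch_1", "customer_name": "Ana", "description": "plain hat"},
    {"id": "ch_2", "customer_name": "Dan", "description": "plain hat"},
    {"id": "ch_3", "customer_name": "Ana", "description": "stripey socks"},
    {"id": "ch_4", "customer_name": "Dan", "description": "stripey socks"},
    {"id": "ch_5", "customer_name": "Bo", "description": "scarf"},
])

def find_charge(api, name, item):
    # Walk every page inside the sandbox; none of this data reaches the model.
    cursor = None
    while True:
        page = api.list(starting_after=cursor)
        for charge in page["data"]:
            if charge["customer_name"] == name and item in charge["description"]:
                return charge
        if not page["has_more"] or not page["data"]:
            return None
        cursor = page["data"][-1]["id"]

match = find_charge(charges_api, "Dan", "stripey socks")
if match:
    refund_id = f"re_{match['id']}"  # stand-in for a real refund call
    summary = f"found {match['customer_name']}, charge {match['id']}, refund {refund_id}"
else:
    summary = "no matching charge found"

# Only this short summary goes back into the model's context.
print(summary)
```

The pagination loop can touch thousands of records, but the model only ever sees the final line.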

Dan Shipper

What I’m understanding you to say is that the language model has a tool where it can write code and send that code to whatever API provider—Stripe, whoever’s MCP server you’re using—they’ll go and execute that code, that code is going to interact with their API, and then return the results. Rather than having 50 different possible tool calls and all that stuff, it’s just: model writes API code, API provider executes that code, runs it on their API, and returns the results.

Why wouldn’t my model just write the code that I then run myself instead of relying on an API provider to do it?

Alex Rattray

I expect that will happen a lot more. I expect the code execution tool is going to become the most widely used tool. The problem is that today the code execution tool doesn’t work so well with libraries. LLMs have a hard time knowing exactly what version of a library they’re using, using the right version—probably usually the latest version—and not hallucinating aspects of the API, and knowing how to iterate if they hallucinate wrong.

And if it can’t use any library off NPM or the Python Package Index really, really well, basically perfectly out of the box, then forget about using a library. At that point you just have to hit the raw HTTP API. And in order to figure out what’s in there, you need the whole OpenAPI spec, and you’re back at square one because that document is massive.

Furthermore, something that’s really scary about that is if you don’t have a typed library with static typing where the computer can say what you’re trying to do is wrong, then the LLM will try to make an API request that is wrong some percentage of the time. The code execution tool can run a type checker and say, “You’re asking about stripe.transactions.list, but that actually doesn’t exist. Stripe doesn’t have a transactions API. You might want payment intents, you might want orders, you might want balance transactions. Which one do you want?”
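A real type checker does far more than this, but the core idea of rejecting a nonexistent resource before anything runs can be sketched with Python's `ast` module. The resource list and the `check_calls` helper are hypothetical; an actual tool would derive this information from the SDK's type definitions.

```python
import ast

# Hypothetical subset of resources the SDK exposes.
KNOWN_RESOURCES = {"payment_intents", "balance_transactions", "refunds", "customers"}

def check_calls(code: str, client_name: str = "stripe") -> list[str]:
    """Return an error for each client.<resource> access that doesn't exist."""
    errors = []
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == client_name
            and node.attr not in KNOWN_RESOURCES
        ):
            options = ", ".join(sorted(KNOWN_RESOURCES))
            errors.append(f"{client_name}.{node.attr} does not exist; available: {options}")
    return errors

# The model's code is parsed and checked, never executed, at this stage.
bad = check_calls("rows = stripe.transactions.list(limit=10)")
good = check_calls("rows = stripe.refunds.create(charge='ch_1')")
print(bad)
print(good)
```

Returning the list of valid options alongside the error gives the model something concrete to retry with.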

And if the API provider is doing a great job building this tool, it’ll return the documentation for all of these things inline. It might have its own AI look at what the model’s trying to do and come up with a suggestion. That sub-agent is well-trained, well-specified, always updating, and isn’t burdened with the context of the full conversation.

Dan Shipper

What do you think of the security model?

Alex Rattray

The security model is really, really interesting. This is another area where we’re really starting to think about things at Stainless, and I’m getting really excited about it—so if any listeners are really interested in this and have some ideas or want to talk, please do reach out.

At the end of the day, I think security has to take place at the API layer itself. Right now you see people trying to implement security by limiting what’s exposed through MCP, and that kind of makes sense—but at the end of the day, you could do anything that’s in the API under the hood.

What people should be doing is using OAuth with granular permissions, with proper scopes. At that point, the security happens in the right place, which is at the API layer. There are limitations to OAuth scopes and it’s pretty hard to build. It’d be nice if someone made that easy, but in my view, that direction is the right layer.
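In sketch form, enforcing permissions at the API layer might look like the check below. The scope names, the operation table, and the `authorize` helper are hypothetical; the point is only that the check runs inside the API regardless of which tool or script made the request.

```python
# Hypothetical scope names and operation table, for illustration only.
REQUIRED_SCOPES = {
    "charges.list": {"charges:read"},
    "refunds.create": {"refunds:write"},
}

class PermissionDenied(Exception):
    pass

def authorize(token_scopes: set[str], operation: str) -> None:
    # Unknown operations are denied by default.
    required = REQUIRED_SCOPES.get(operation)
    if required is None or not required <= token_scopes:
        raise PermissionDenied(operation)

read_only_token = {"charges:read"}
authorize(read_only_token, "charges.list")  # allowed, returns without error

try:
    authorize(read_only_token, "refunds.create")
    refund_allowed = True
except PermissionDenied:
    refund_allowed = False

print(refund_allowed)
```

With a read-only token, an agent can look things up but cannot issue the refund, no matter what the MCP layer exposes.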

Dan Shipper

Going back to my earlier question—I’m thinking about the idea of having a model write code that the API provider then executes to interact with their API and returns the results. Would you ever consider just creating a code execution environment that developers use themselves? Because, for example, I’m thinking about Cora. It has all these tools. Maybe Gmail is going to build a code execution thing, but really I’d want something like what you’re talking about inside of Cora. What I’d need is a computer use tool where I control the environment, I can install different libraries in it, and it can call any API—it just needs to have network access basically.

You guys should build that.

Alex Rattray

We’re working on it.

Dan Shipper

Fuck yeah. You’re building it for developers who want to access MCP servers, or for people who are providing MCP servers?

Alex Rattray

We’re starting with people who are providing MCP servers, but ultimately I think we’re going to need this to work such that you can give the model a code execution environment where it can hit not only the Stripe integration but also the Salesforce integration and also anything else. But not too much anything else. One of the advantages of starting where we’re starting—just one API provider—is that you ensure there are no network connections allowed out of that sandbox where we’re running the code to anything other than, in this case, api.stripe.com. That’s really critical for security for something like this.
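The allowlist policy can be sketched as a simple host check. A real sandbox would enforce this at the network layer rather than in application code, and the `egress_allowed` helper and the HTTPS-only rule are assumptions made for the example.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"api.stripe.com"}

def egress_allowed(url: str) -> bool:
    # Exact host match over HTTPS only; subdomains and look-alikes are rejected.
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS

checks = {
    url: egress_allowed(url)
    for url in [
        "https://api.stripe.com/v1/refunds",
        "https://api.stripe.com.evil.example/v1/refunds",
        "http://api.stripe.com/v1/refunds",
        "https://example.com/",
    ]
}
for url, allowed in checks.items():
    print(allowed, url)
```

Matching the full hostname exactly, rather than checking a prefix, is what keeps a look-alike domain from slipping through.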

There are ways to expand that bit by bit and keep things secure. It’ll take some time.

The other thing to point out as you see some of these generalizations is it’s not just that you want this code execution sandbox to work really well for any API, for any library—which I think we really need. You also start to see that this is just a powerful model for AI doing stuff. Sometimes you realize that the thing the AI did this one time in this one-off case is actually enduringly useful. Maybe any time a customer writes into support and says, “My socks had holes in them,” they should automatically get a refund. Maybe you want that, maybe you don’t—but there’s a lot of stuff that people do once, then twice, then three times, and then they say, “Okay, we should automate this.” That’s what software teams do all day, every day.

(00:50:00)

I think we’re also going to be seeing that with AI—where the same code search tool we’re talking about, all the same prompting that will make an AI really, really good at interacting with an API in one of these code sandboxes, almost quote unquote “in its brain,” where it can write code in its head, run the code in its head, see the results, and then move forward with your task—it should be able to say, “Actually, this is enduringly useful code. Let me commit this to the repo.”

Dan Shipper

Yeah, yeah. Chat is a really good interface for exploring, but sometimes you just want a dashboard. I just want to log into my Stripe dashboard and see all the stuff without having to be like, “What is my MRR?” It should just show up because I do that every day.

But I want to push you as a hashtag value-add investor. I think there’s a thing that happens in AI where often the first attempt at something like this, people try to be really cautious—and I’m sure your enterprise customers care about that—but the things that get adopted are often the ones willing to take the risk to be YOLO very early.

An example is DALL-E was totally private for a long time, and people were posting some images but you couldn’t get in. Then Stable Diffusion was just like, “Forget it, anyone can use this.” And that really started the whole image generation wave. Obviously Stable Diffusion fumbled the bag, but they had a lead for a while.

Same thing for Claude Code. If you look at the difference between Codex CLI and Claude Code—Claude Code was just YOLO mode. It’s super industrious. It has a sandbox, but you can just do --dangerously-skip-permissions. Codex fell way behind because first it was in the browser, so the whole thing was locked down. Then it was in the CLI, but it was really built for pair programming, so it wasn’t particularly industrious. It wouldn’t go off and do a bunch of stuff. It would get locked out of doing certain things even in full auto mode.

And now they’ve caught up because you can just let it do whatever you want. So I would really push you: there might be a version you could do like today or tomorrow or very soon for individual developers that would let them set up this environment that, for example, I would use immediately. I care about security, but I care a lot less than some gigantic enterprise company. And I think the people like me who are building at this scale are eventually hopefully going to be the big companies, but we’re the ones really doing the AI-first adoption, not the big companies.

Alex Rattray

I would love to get this in your hands. What are some of the APIs your team uses the most?

Dan Shipper

Thinking about all our different products, I’m thinking right now about Cora, the email assistant. It has all the big APIs it’s using—mostly the Gmail API. You’re interacting with the assistant over chat, and it has a list of tools: archive email, draft email, send email, and so on. It categorizes your mail in certain ways.

I think we’d definitely try out something like this because if it ran the same way, it would make it much more flexible for us to make more tools and not break old ones. It’s really interesting.

Alex Rattray

In a sense, what I actually predict is that once we have a code execution super-tool like I’m talking about, the only way people really “build a tool” is with instructions, with prompts. The full power of everything you could possibly do in the Gmail API, for example—it’s all there in one tool. But sometimes you have specific tasks or specific categories of work you want to describe in a particular way, to help the LLM perform a sequence of actions as productively as possible. At that point, the only engineering work you have to do is prompt engineering.

We’ll see if it’s that “easy.” As we all know, prompt engineering can be really tricky. But I think that’s part of the vision.

That being said, we do have some pretty nifty ways with the MCP servers we generate today to help developers mix and match all the parts of the different tools underlying all the different parts of the API as they compose and write their own tools.

Dan Shipper

This is awesome. For people who are listening and want to know more from you or more from Stainless, where should they find you?

Alex Rattray

Stainless.com is our website. At least visit stainless.com.

Dan Shipper

Alex, great to have you on. I can’t wait to do more of this when you have some of these new things launched. This is really, really fun—great to chat.

Alex Rattray

Thanks, Dan. You too.

Google I/O: Agents, Agents, Agents

Jack Cheng / Context Window — 2026-05-20 13:00:00 -0400

Midjourney/Every illustration.

Google I/O dominated the week, and the message from Mountain View was unsubtle: Agents are now the product, with Gemini 3.5 Flash powering a redesigned search and a new fleet of always-on assistants. One layer down, Anthropic paid a reported $300 million for Stainless—so we’re re-upping our AI & I episode with CEO Alex Rattray, who laid out the design principles for making software legible to agents months before the deal happened. Plus: We did a mini-Vibe Check of Figma’s new in-canvas agent to see whether it solves the blank-page problem.—Kate Lee

Spotlight

Alex Rattray, Stainless CEO and MCP whisperer

Flashy frontier model releases suck up most of the oxygen in the AI ecosystem. But without reliable ways for AI agents to access these models, their capabilities are limited. This plumbing may be easy to overlook, but it’s an indispensable component of an agent-native internet.

You don’t have to take our word for it. On Monday, Anthropic announced it has acquired Stainless, a software platform for high-quality APIs, to extend Claude’s ability to connect to data and tools. (While terms weren’t disclosed, The Information put the purchase price at north of $300 million.) Former Stainless customers include OpenAI and Google, meaning Anthropic has acquired a developer tooling company used by its top rivals.

In October, Stainless CEO and founder Alex Rattray joined Dan Shipper on AI & I to talk about why teaching models to use software is so tricky, and what design principles make model context protocol (MCP) servers more intuitive for LLMs. (TL;DR: Keep the number of tools an agent can access small, give the tools precise names, and aim to generate tightly defined outputs.) In the episode, Alex goes deep on Stainless’s approach to making it easier for AI agents to use the internet—hard-won insights that, as it turns out, can lead to a big-sticker acquisition from a top model company. [Disclosure: Dan is a small investor in Stainless.]

Read Anthropic’s announcement about its decision to buy Stainless and then watch Rattray’s AI & I episode on X or YouTube, or listen on Spotify or Apple Podcasts (or read the episode transcript).—Laura Entis

Signal

Google goes all-in on agents

We’re hurtling toward an AI landscape divided into two categories of agents: those you collaborate with, and those you delegate to. Google’s new releases from its flagship I/O developer conference, happening this week in Mountain View, California, break neatly along that line.

The headline announcement is Gemini 3.5 Flash, Google’s new frontier model, which it says operates four times faster and at half the cost of comparable LLMs. It’s the engine powering most of the agentic features below.

In the ‘collaborate with’ bucket

AI Mode and the new search box: Google is giving search its biggest interface change in 25 years. In addition to expanding the search box to accommodate longer, more conversational questions and terms from users, AI Mode, which Google introduced at last year’s I/O conference, is becoming the default search mode. With the 2026 updates, you can now build custom mini-apps, such as a personalized fitness tracker, or interactive visualizations directly within search itself.

Antigravity 2.0: Google’s agentic development platform is becoming a desktop app for managing teams of agents, with a new command line interface tool and an SDK for custom workflows. You orchestrate, and the agents code, design, or do whatever else you want them to accomplish.

In the ‘delegate to’ bucket

Gemini Spark: Google is pitching Spark as a 24/7 personal agent that lives in the cloud, works when your devices are off, and can operate across Gmail, Docs, Workspace, Chrome, and, eventually, third-party tools through MCP. “You can just throw tasks over your shoulder,” Josh Woodward, vice president of Google Labs, Gemini, and AI Studio, said in the keynote. “Spark will catch them and then run with them.”

Daily Brief: An out-of-the-box agent in the updated Gemini app that works overnight, scanning your inbox, calendar, and tasks so it can hand you a prioritized digest when you wake up in the morning.

Universal Cart: Google’s new shopping cart works across merchants as part of the Universal Commerce Protocol, which it co-developed with Amazon, Meta, Microsoft, and others. Whenever you add something to your cart, it automatically monitors the internet for information on the product, including price drops, price history, and whether something is back in stock. It also analyzes the full contents of your cart to proactively flag potential issues, like if you’re building a PC and the processor and motherboard you’ve selected are incompatible.

Inside Google I/O

Anyone can cook

Gemini 3.5 Flash, announced in Tuesday’s opening keynote, seems like a meaningful step toward a fast and cheap model that can reliably handle the personal, everyday tasks that most people are looking for help with.

When is a model good enough? That was the question I asked myself heading back to my hotel after the first day of Google I/O. I often send agents on multi-hour coding missions, and need to pull together data from multiple accounts and channels to coordinate my workday. In these cases, each new model release seems to work better than the last. So I eagerly hop from one to another.

On the other hand, for simple, personal tasks like household briefings, tracking my journaling and meditation habits, and light web development, I am loyal to Sonnet 4.6—although sometimes I have to tell it to ask Opus or GPT-5.5 for help.

But once a model like Sonnet grew smart enough to handle anything personal I might throw at it, I wondered, what else might I want from it?

I’d want it to be blazingly fast, so that I wasn’t waiting for responses when I was working with it in real-time. I’d also want it to cost next to nothing.

Gemini 3.5 Flash may offer exactly that.

Gemini 3.5 Flash is in a quadrant of its own. (Photo courtesy of Jack Cheng.)

If the benchmarks are to be believed, then Gemini 3.5 Flash delivers Opus 4.7-level intelligence at four times the speed. Accurate, near-instantaneous responses let Google believably send users from search results pages into its “AI Mode” without them realizing that they’ve entered a new state. A chat interface, after all, is not that far off from a search box. But for that chat interface to still feel like Google search, it has to be just as snappy as traditional search.

It remains to be seen how users will take to the deeper AI Mode integration once the update rolls out globally, as it’s beginning to do this week. But Google says 2.5 billion people already use the “AI Overviews” at the top of results pages, and these summaries will now let you ask questions in response. Every search becomes the start of a conversation with an AI agent that can generate text and images, spin up research agents, code up interactive widgets and mini-apps, and more.

This could lead many more people to experience their first “aha moment” with AI. Google’s core competencies around speed and scale really come through in the Gemini 3.5 Flash release.

The context it already has on users through their Gmail, Google Calendar, and Google Docs accounts removes one of the main headaches in setting up AI agents. Google is perhaps one of two companies in the world—along with Apple (which will also be using Gemini to power its own coming AI integration)—with moats of this size. Pretty soon, billions of people could be newly using agentic AI to cook up tools and workflows that make their lives easier or more enjoyable in some small way.

Oddly enough, Google’s announcements at I/O so far don’t affect those of us riding the edge of the AI wave. Reception to the day’s announcements in Every’s Slack was tepid. But I don’t think Google’s keynotes were speaking to people tinkering with OpenClaw or using and building Codex-native apps to do their email and learn piano.

To me, the significance of Gemini 3.5 Flash and Google’s AI search announcement, amid a sea of other announcements, was underscored by one of the last slides of one of the last developer sessions of the first day. It read:

“We are the first generation of builders creating tools for a world where anyone can build anything.”—Jack Cheng

Log on

Upcoming event

Executive AI Sessions: On June 2, head of consulting Natalia Quintero hosts a live webinar introducing Every Consulting’s new offering for leadership teams navigating AI adoption—built on the playbook we’ve been running with executive clients for months. Learn more and register.

In New York City

Every 🤝 IRL: Join us at the Every brownstone in Brooklyn on June 3 during New York Tech Week for a subscriber-only meetup celebrating the Every community over drinks and conversation. Learn more and RSVP.

Mini-Vibe Check: Figma agent

Figma makes the blank canvas less blank

In March 2026, Figma opened its canvas to outside AI agents. The update let coding tools like Claude Code, Cursor, and Codex connect to Figma through MCP (model context protocol, the open standard that lets AI agents talk to external software) and write designs directly into a Figma file.

Today, Figma releases its own agent that lives inside Figma. It edits your canvas directly—switching component states (the variants of a design element, like when a button looks one way when hovered and another when clicked), restyling layouts, and generating new screens. It’s built on a mix of Google’s Gemini Flash, Anthropic’s Claude Sonnet, and Figma’s own fine-tuned models. Figma users no longer have to leave their canvas, or hand the work off to an engineer, to get an AI-generated first draft.

Every got access a day before the announcement. Head of marketing Douglas Brundage, senior designer Daniel Rodrigues, and creative designer Benjamin Ose spent a day testing it. Here’s what they found.

What works

When the prompt is specific, the agent produces solid early explorations, preserves copy well, and gives designers something to work with instead of a blank canvas.

As Daniel put it, “There’s really no excuse to start from scratch anymore.”

The agent can explore visual directions quickly, though fidelity and rendering still need designer review. (Image courtesy of Douglas Brundage.)

It’s also good for quickly sketching out product ideas. Benjamin used it to mock up a SaaS dashboard for mining X mentions for testimonials and came away with viable early explorations. Here was his initial prompt:

Design a SaaS dashboard that listens for your X handle mentions, uses AI to extract testimonials (positive shouts, reviews, endorsements), and stores them in a searchable vault. One-click export to websites as embeds, widgets, or APIs—think Grammarly’s clean proofing flow meets Stripe’s embeddable elements. Freemium entry: Basic capture free, premium for AI curation and analytics.

Benjamin used the agent to come up with a testimonial-mining SaaS dashboard, producing a structured early exploration ready for cleanup and iteration. (Image courtesy of Benjamin Ose.)

What needs work

The agent is less useful for detailed work. Tabs rendered improperly, buttons doubled up, components drifted out of alignment, and some outputs came back weirdly low-res. It can lay down the structure, but the designer still has to go in and fix the details. There’s no ability to attach an image or a link as a visual reference for the agent. Right now the agent relies on a prompt-writing skill or an existing Figma frame.

Benjamin also said the agent would be much more useful if it worked from an existing design system, instead of inventing from scratch—pulling in the components, colors, spacing, and styles a team already uses in Figma. Ideally, it could also draw on the reference tools designers use, like Mobbin.

Our verdict

Figma’s agent isn’t a fully trustworthy design copilot yet, but it solves the blank-page problem for early design work. Its job is to get designers from zero to first pass, so their energy can shift to judgment and polish.

It delivers on that promise for exploration, layout starts, and iteration. It still needs better fidelity, stronger detail handling, and richer reference inputs before it can feel dependable in production.—Katie Parrott

Jack Cheng is a senior editor at Every. He is a creative generalist and the author of two novels for young readers. You can follow him on X or read his occasional Sunday newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Inside the 100-agent Software Factory

Katie Parrott / Context Window — 2026-05-19 09:00:00 -0400

Midjourney/Every illustration.

Happy Tuesday! Today we have a mini Vibe Check on a tool for running more than 100 coding agents in parallel. Plus: how to write viral X posts using the secrets of Grok’s algorithm, why Every’s chief operating officer and head of marketing moved their agent work into public Slack channels, and what’s overtaking Markdown as the preferred format for agents.

Mini-Vibe Check: Gas City

A glimpse of the future that’s not (yet) ready for practical use

Earlier this year, prominent software engineer Steve Yegge published a viral Medium post about Gas Town, an open-source tool that let developers coordinate 20 to 30 AI coding agents in parallel on the same codebase. Last week, Every’s head of tech consulting, Mike Taylor, got a peek at the future of multi-agent engineering with Gas Town’s successor project, Gas City. The project was rebuilt as a toolkit with Yegge’s blessing by Chris Sells, a long-time developer-tools veteran who grew Google’s open-source app-building toolkit, Flutter, to 3 million developers, and former Block technical lead Julian Knutsen. Mike joined more than two dozen engineers and chief technology officers who played around with the project at a workshop in New York, with Sells and Knutsen dialing in.

TL;DR: Gas City has some sharp ideas that reflect the direction software development is headed, but it’s not yet ready for prime time.

What is Gas City: Running many coding agents in parallel is table stakes for developers at this point. Getting them to do anything useful requires coordination systems to hand work to each other, review each other’s output, and not step on each other’s branches—and nobody’s quite figured out how to get that right yet. “Software factories” like Gas City are one solution: an orchestration system that hands tasks to a small team of agents, routes their work, and decides what’s done.

Sells and Knutsen use Gas City to build Gas City: Knutsen’s Atlanta-based server runs roughly 100 agents that merge around 50 pull requests per day (the output of a small team) while burning through roughly a billion tokens per day, equal to about one-fifth of the English-language corpus on Wikipedia.
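
For a rough sense of scale, those figures can be turned into per-agent and per-pull-request averages. This is a back-of-the-envelope sketch in Python that assumes the rounded numbers quoted here (100 agents, 50 merged pull requests, one billion tokens per day) and spreads the tokens evenly, which a real workload would not do. The function name `daily_averages` is made up for illustration.

```python
# Rounded daily figures quoted above; treat them as approximations.
AGENTS = 100
MERGED_PRS_PER_DAY = 50
TOKENS_PER_DAY = 1_000_000_000


def daily_averages(tokens: int, agents: int, merged_prs: int) -> dict[str, float]:
    """Spread a daily token budget evenly across agents and merged PRs."""
    return {
        "tokens_per_agent": tokens / agents,
        "tokens_per_merged_pr": tokens / merged_prs,
        "merged_prs_per_agent": merged_prs / agents,
    }


averages = daily_averages(TOKENS_PER_DAY, AGENTS, MERGED_PRS_PER_DAY)
for name, value in averages.items():
    print(f"{name}: {value:,.1f}")
```

On those assumptions, each merged pull request consumes about 20 million tokens, and the average agent lands one merged pull request every two days.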

What works: There are three ideas from the world of software engineering that Gas City is built on and are worth internalizing, even if you never touch the toolkit.

Dark factory versus light factory: Parts of your work where humans and agents talk to each other (planning, design, review) stay visible and can be thought of as light, while parts where agents grind through clearly defined work on their own stay in the background, in the dark. As you gain trust in the agents’ output, you can move more of your process into the dark.
One pet, many cattle: The future of multi-agent engineering is likely organized with one persistent, named supervisor you talk to directly (Gas City calls it the “mayor”), who hands tasks to anonymous, disposable workers (the “polecats”) that do one job and shut down, so they execute their job without getting lost in context or in each other’s way. Instead of managing one hundred agents individually, you manage one conversation while the mayor does the coordinating.
Multiple opinions on every code review: Give the same code to Claude, Codex, and Kimi at the same time for review from multiple angles. Three different models catch different bugs than one model run three times.
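
The multi-reviewer idea can be sketched in a few lines of Python. This shows the general fan-out pattern, not Gas City’s implementation: the `Reviewer` type, `review_with_all`, and the stand-in reviewer functions are all hypothetical, and in practice each reviewer would wrap a call to a different model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

# A reviewer takes a diff and returns a list of findings.
# Hypothetical type: real reviewers would wrap calls to different models.
Reviewer = Callable[[str], list[str]]


def review_with_all(diff: str, reviewers: dict[str, Reviewer]) -> dict[str, list[str]]:
    """Send the same diff to every reviewer in parallel and group each
    finding with the names of the reviewers that raised it."""
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        futures = {name: pool.submit(fn, diff) for name, fn in reviewers.items()}
        results = {name: future.result() for name, future in futures.items()}

    raised_by: dict[str, list[str]] = {}
    for name, findings in results.items():
        for finding in findings:
            raised_by.setdefault(finding, []).append(name)
    return raised_by


# Stand-in reviewers with canned output, used only to exercise the pattern.
def reviewer_a(diff: str) -> list[str]:
    return ["missing null check"]


def reviewer_b(diff: str) -> list[str]:
    return ["missing null check", "unbounded retry loop"]


def reviewer_c(diff: str) -> list[str]:
    return ["test does not cover error path"]


findings = review_with_all(
    "example diff text",
    {"a": reviewer_a, "b": reviewer_b, "c": reviewer_c},
)
for finding, names in findings.items():
    print(f"{finding}: raised by {', '.join(names)}")
```

Grouping by finding makes it easy to see which issues several reviewers agree on and which only one of them caught.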

What could be better: In Gas City, every task spins up a fresh agent session that doesn’t remember the earlier steps, so agents waste cycles re-reading context that other agents produced and miss connections a single session would have caught. Cost is also a challenge: A six-step job can cost six times as much as one Claude session, which adds up fast. The toolkit also still feels experimental: It took a room full of experienced engineers an entire day to get it running, even with support from the instructors.

Beads, the task tracker powering the system, is built for agents first. It runs on the command line rather than as a visual dashboard, which is fine for agents but harder for humans, who want to see everything at a glance. So teams using Gas City in production typically pair it with Jira or Linear—placing tasks in two places instead of one.

Additionally, Gas City was built on the assumption that AI models need hand-holding to stay on track. Models have since gotten good enough that the safeguards built on that assumption, such as review loops to catch mistakes and mid-task check-ins to prevent agents from drifting, are now mostly unnecessary. Finally, Gas City uses deliberately unfamiliar words to refer to different inputs, actors, and workflows (“beads” for tasks, “polecats” for workers, “refineries” for processing steps), so it can be confusing for a team new to the tech.

Verdict: 🟨 Mike Taylor, head of tech consulting: “Learn from the ideas. Skip the toolkit for now.”

If you’re already running more than 10 Claude Code sessions in parallel and reading source code, Gas City is worth a look because it’s one informed opinion on how to handle that level of orchestration. For everyone else, take the ideas and wait. OpenAI’s Symphony, released a few weeks ago, is a more accessible, enterprise-ready version of a similar idea: a written set of rules that turns your existing Linear board into the dashboard the agents work from. This is more in line with the way software engineers work now and doesn’t require the behavior change that Gas City does.

Steal this workflow

Run your X posts past Grok before you post

xAI open-sourced its ranking algorithm last week, which shows the factors X considers when deciding which posts to surface in users’ For You feed. It includes a Grok-powered “banger classifier” that decides whether your post gets better distribution by scoring every post on quality and slop. So why not run the same check on yourself before you hit publish?

Paste your draft into Grok with X’s scoring prompt. Ask Grok to return four things: quality_score, slop_score, isHighQuality (a true-or-false verdict on whether a post clears the quality bar), and topic tags. The classifier reads text, image, and video. Use this prompt: “Score this X post the way the xai-org/x-algorithm banger classifier would: return quality_score (0–1), slop_score (1–3), isHighQuality boolean, and topic tags.”
Rewrite anything that scores below 0.4 on quality (which can receive a score of between zero and one) or above one on slop (which is rated between one and three). Posts that users scroll past quickly or report get penalized, while posts that drive replies and dwell time get rewarded. To move the score, lead with a stop-the-scroll first line, name a specific experience, event, or number, and cut anything readers would skim. As soon as a user scrolls past, the algorithm ranks the post as “not_dwelled” and it gets pushed down the recommendation pile.
Limit yourself to two to three posts a day. The algorithm heavily discounts your fourth post of the day and discounts your eighth to near zero. It’s better to invest in fewer, scroll-stopping, engagement-generating posts than many forgettable ones.
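
The thresholds in the steps above amount to a simple rule. Here is a minimal Python sketch of it, assuming you have copied the scores from Grok’s reply into a dictionary by hand. `needs_rewrite` is a hypothetical helper name, and the 0.4 and 1 cutoffs are the ones suggested in this workflow, not values read from the open-sourced algorithm.

```python
QUALITY_FLOOR = 0.4  # quality_score ranges from 0 to 1
SLOP_CEILING = 1     # slop_score ranges from 1 to 3


def needs_rewrite(scores: dict) -> bool:
    """Return True when a draft falls below the quality floor
    or above the slop ceiling."""
    return (
        scores["quality_score"] < QUALITY_FLOOR
        or scores["slop_score"] > SLOP_CEILING
    )


# Made-up example scores, standing in for values copied from Grok's reply.
draft_scores = {"quality_score": 0.35, "slop_score": 2, "isHighQuality": False}
revised_scores = {"quality_score": 0.7, "slop_score": 1, "isHighQuality": True}

print(needs_rewrite(draft_scores))    # True: low quality and high slop
print(needs_rewrite(revised_scores))  # False: clears both thresholds
```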

Signal

HTML is the new Markdown

What happened: Until a few weeks ago, Markdown, a lightweight text formatting system, was the be-all-end-all of documentation for AI agents, because agents had been trained on so much of it that they read and write it fluently. Then, on May 8, Anthropic’s Thariq Shihipar published an X post titled, “The Unreasonable Effectiveness of HTML,” that argued agents should produce single-file HTML instead when they create files. The post hit 4.4 million views in 16 hours. Three days later, Andrej Karpathy backed it. Simon Willison, a longtime Markdown advocate, also changed his mind, saying that now that context windows are large enough, there’s no reason to accept Markdown’s formatting limitations.

Why it matters: HTML can do what Markdown can’t, from styled tables and collapsible sections to embedded charts and lightweight JavaScript. Markdown felt like the right answer as long as humans would still edit what agents produced, because it’s legible to humans as well as agents. Increasingly, though, agents are producing documentation without humans needing to intervene. When no human is going to read or edit the raw output, you may as well opt for the format that produces a more dynamic result.

Raw Markdown (left) is more legible and editable than HTML (right). (All images courtesy of Katie Parrott.)

Markdown (left) is a text-only format, while HTML (right) allows for richer outputs like dashboards, charts, and interactive sections.

There’s a wrinkle, though: The tools we use to share and discuss documents, such as Slack and Google Docs, were all built for Markdown and plain text. Slack previews a Markdown file in the message, whereas HTML shows up as an attachment you have to download. Google Docs threads and GitHub diffs don’t know what to do with a self-contained HTML document. The moment agents start producing HTML by default, our tools will need to adapt to keep up.

What to do this week:

When you’re deciding between Markdown and HTML, ask whether the document will be edited or just consumed.

Markdown if it’ll be edited or parsed as source. This includes drafts, plans, briefs, system prompts, and AGENTS.md—anything humans will keep working on, or agents will read as instructions.
HTML if it’s a finished output humans will read. That’s assets like research summaries, weekly recaps, dashboards, or spec demos.
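
If you want to bake that rule into an agent workflow, it fits in one small function. This is a sketch only: `choose_format` and its two parameters are hypothetical names for the two questions above, not part of any tool mentioned here.

```python
def choose_format(will_be_edited: bool, read_by_agents_as_instructions: bool) -> str:
    """Pick Markdown for working documents and agent instructions,
    HTML for finished outputs that humans will only read."""
    if will_be_edited or read_by_agents_as_instructions:
        return "markdown"
    return "html"


# A draft brief that people will keep revising.
print(choose_format(will_be_edited=True, read_by_agents_as_instructions=False))
# An AGENTS.md file that agents parse as instructions.
print(choose_format(will_be_edited=False, read_by_agents_as_instructions=True))
# A finished weekly recap that humans will only read.
print(choose_format(will_be_edited=False, read_by_agents_as_instructions=False))
```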

Inside Every

Working with our agents in public

Working well with an agent is a skill new enough that there aren’t really best practices yet. So Every’s team has started learning from each other.

Last week, Every COO Brandon Gell and head of marketing Douglas Brundage each started public channels with their agents where anyone on the team can observe how they’re working together. Within 48 hours, a dozen people from across the company had joined to lurk.

The idea is that every request that would normally live in a direct message goes in the channel. Brandon asked the agent to pull a breakdown of where subscribers are located from Stripe. Douglas asked his to evaluate customer survey responses against classic marketing frameworks. There was a 41-message thread on whether to hook the agent into the Flora API.

The corrections double as useful material in the channel for learning—watching Douglas tell the agent its survey analysis is “performing research” rather than “mining the results for strategic clarity” gives the people watching an understanding of an agent’s limitations and hidden assumptions they should look out for in their own agentic work. Agents can learn from the interactions, too: Brandon has been routing every task through his agent for a week, even the ones he could do faster himself, so it can watch him work and write its own skill at the end. For now, the best way to learn how to work with agents may be to watch other people do it.

Correction: An earlier version of this newsletter imprecisely described the distinction between Markdown and HTML. The key distinction is whether a document is meant to be edited or consumed. We’ve updated the language to reflect this.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

How to Start a Career When AI Is Doing Your Entry-level Job

Katie Parrott / Working Overtime — 2026-05-18 07:00:00 -0400

Midjourney/Every illustration.

My first job out of college was as a copywriter at a little crowdfunding website based in Columbus, Ohio, called Fundable.com. The company had no money, so they didn’t care that I had no experience. I had no experience, so I didn’t care that the job didn’t pay at first.

The offer was simple: Create a profile for your startup, and we’ll connect you with investors. Most founders didn’t want to write their own profiles, so my job was to take whatever strange, half-formed thing a founder was building and translate it into investor-speak. The profiles were so templatized I can still recite the format: problem, solution, traction, team, business model, revenue projections, competitive landscape, funding terms.

I’ve been thinking about that job lately because AI could now produce one of those profiles in two minutes. At 23, I would have heard that and thought: “Thank God.” At 36, I think: “Thank God it couldn’t.” Without that job, I would have never learned how to take a company apart and put it back together as a story, or how to organize information for an audience that, unlike my professors in undergrad, wasn’t being paid to read my stuff.

This year’s crop of recent graduates has it harder than mine did. AI, which can perform many entry-level tasks, is replacing those early experiences faster than employers can figure out what’s going on. Researchers at Stanford’s Digital Economy Lab found that employment for 22-to-25-year-olds in the jobs most vulnerable to AI has dropped 13 percent since late 2022, even as older workers in the same roles held steady.

I think about the 22-year-old version of myself, if I were sending out applications right now into the void of LinkedIn. What would she think about the headlines about AI and job displacement? Would she be scared?

Yeah, probably. She was scared of much less.

So with full awareness that no one born this millennium wants career advice from someone born before the fall of the Berlin Wall, here’s what I’d do if I were starting over today, knowing what I know about work, AI, and how one is shaping the other.

There’s good news, and there’s bad news

The paradox facing today’s entry-level workers is as old as the entry-level job itself: In many cases, in order to get a job, you need experience, but in order to get experience, you need a job. And while employers requiring experience in AI when the technology barely existed when you picked your major may feel like a cosmic joke, employers have long asked for five years of experience with brand-new technologies.

All that is small comfort to the recent grad with a near-empty resume. And there are qualitative differences in what AI is doing to entry-level work.

For one thing, when you look at the kind of AI skills employers expect young workers to bring to the table, they want more than the ability to type a prompt into ChatGPT. They want people who can evaluate tools, review outputs, and figure out how to improve those outputs, whether it be with better prompting or fixing the work themselves.

Demand for AI skills in entry-level jobs is up three times, with a particular focus on capabilities that require you to evaluate AI as well as use it. (Chart courtesy of NACE.)

They’re looking for judgment, which is something that you can really only build through experience. When I was writing those funding profiles, I learned how to tell good work from bad. The first 50 that I wrote were so bad that at one point, a client said I should be taken out back and shot. With AI in the mix, the bad ones wouldn’t have been bad enough to teach me anything.

The other way today’s job market is more intense for entry-level workers is that employers are expecting competence in a technology that won’t stand still long enough for anyone to completely grasp. Agentic tools are changing functions in months, rather than years. There’s no canon to study or senior teammate to apprentice under. Everyone in the org chart is figuring it out on the fly, and you’re expected to figure it out with them while learning how to navigate office politics and pay your taxes.

What to do about it?

Chase problems, not professions

When you’re a kid and an adult asks what you want to be when you grow up, the answer is always a job title. A firefighter. A doctor. A YouTube creator. We carry that habit of thinking into the years when we start to look for jobs. We pick a title, and we go after it.

The problem is that job titles aren’t as sure a target as they used to be. The role you’re chasing today might not exist 18 months from now.

Pick a problem you want to help work on—something happening in the world that you find yourself thinking about, even when nobody is paying you to. The role of “content marketer” or “data analyst” may shrink, split, or even vanish, but the problem behind those titles—how to get a stranger to pay attention to something they didn’t know they cared about, how to make sense of a pile of messy numbers—will still be there, and somebody will still be paid to solve it.

I’ve been bad at taking this advice myself. I spent a decade chasing the title “copywriter” and then “content marketer” across a handful of industries that had nothing in common—oncology advertising, personal finance, even, God help me, crypto—without asking whether I cared about any of them. I had the high-school overachiever’s mindset: You didn’t have to be passionate about the subject to get an A. I’d been getting A’s in classes I had no feelings about for 16 years. Why would jobs be any different?

That strategy doesn’t work as well when AI can do the entry-level tasks. Your value to whomever hired you is whatever you bring on top of that—usually a deeper understanding of the problem than the model has. That kind of understanding is hard to build in a field you don’t care about.

Choose one discipline to protect

Once you’ve picked your problem, pick your craft, whether it’s writing, building, researching, designing, strategizing, or operating.

You’ve probably heard the truism that it takes 10,000 hours to gain mastery of a skill. The actual research is more complicated than the popularized version, but the underlying idea is right. You don’t get any good at anything until you’ve done it many, many times.

If you want to write for a living, write your own sentences. If you want to be an engineer, write your own code.

Protect this craft from AI at all costs. AI can find resources, explain things, quiz you, and point out where your reasoning has gaps. But if you let it write your sentences or do your research, you won’t get the hours of doing things badly that you need in order to do them well.

It’s easy for me to say this when I’m writing this with AI open in another tab. Claude wrote the first draft of half the sentences in this section. I rewrote them. That rewriting is what the discipline is for—noticing when something doesn’t pass muster. The reason I can do that is that I’ve been writing sentences for 10 years.

I know all too well how tempting cutting corners gets when the shortcut is right there in another tab. Don’t take it, and in five years you’ll be running circles around the people who did.

Make things before anyone asks you to

When I was first applying to jobs out of college, my resume said almost nothing about what I could do in the “real world,” unless the employer happened to be looking for someone with an undergraduate’s grasp of the themes of Wuthering Heights.

A thin resume is less of a disadvantage than it used to be, particularly since employers are increasingly shifting to skills-based hiring—screening candidates by what they can do rather than where they’ve been.

What you need to do in that environment is make something, and that can be anything—a small tool you wished existed, a piece of writing on a question nobody is paying you to think about. Pick the thing you’d want to use yourself, and make it.

Once your work gets you in the door, the conversation that follows is going to be about how you made it: what you used AI for, and where you decided not to, including the moments where you looked at the model’s first answer and thought, “No, that’s not right.” Being able to walk someone through those decisions is the second skill you’re building, alongside the work itself. That’s the judgment that I mentioned before.

Build the career coach you wish you had

The last time I was job hunting, I built a career coach in ChatGPT and used it to land the job I have now. It was a project with my resume, a few examples of writing I was proud of, and a long prompt telling the model how to talk to me. I checked in with it most weekdays for about a month. What it did, more than anything, was give me somewhere to put my thinking. Instead of running the same anxious loop in my head, I could lay the question out and have the model suggest specific next steps, like a writing sample worth developing, or questions I could ask on that networking call that it encouraged me to seek out. By the end of that month, I had a job.

If I could hop in a time machine and travel back to talk to my 22-year-old self, I’d suggest that she make one too. It’s not even that hard:

Pick a tool. ChatGPT and Claude both have a project feature that holds context, files, and conversation history across sessions. Either works. Free tiers are good enough to start.
Create a project and give it a name. “Apprenticeship Coach,” “Career Stuff,” your friend’s nickname for you.
Load it with context. Add examples of work you’re proud of and examples you wish were better—the model needs to see what you’re aiming at and where you’re starting from. Paste in a few job postings for roles you’d want, even if they might be too senior for you. Write a paragraph on the problem you care about and why.
Tell it how to behave. In your instructions, describe to the model how you want it to deliver feedback. If you want a tough critic, say so. If you’re prone to self-doubt, give it more of a cheerleader vibe. One thing to look out for: Models are infamous for sycophancy—telling you what you want to hear—so guard against that in your instructions, and even then, maintain a healthy skepticism of the outputs. It’s good practice for when you’re asked to work with AI in the workplace.

Here’s a starting template. Fill in the bracketed sections, adapt the feedback line to match your preference, and add it to the custom instructions in your project:

Career coach prompt

I want you to act as my career coach. My goal is to use AI to get feedback, build judgment, and create visible proof of skill, while still doing the central work myself.

Here is my context:

Problem I care about: [Examples: climate, education, public policy, media, health care, local business, creator economy]
The kind of work that addresses it: [Examples: writing, building software, running operations, teaching, designing, researching]
My background: [College major, jobs or internships, projects, communities, life experience]
Skills I’m most confident in: [List 3-5]
Skills I’m least confident in: [List 3-5]
My current technical fluency: [Beginner/comfortable with common AI tools/can code a little/technical but not expert/highly technical]
The core practice I want to develop: [The specific thing the work above requires—writing sentences, writing code, reading sources, designing experiments, etc.]
The parts of that practice I want to keep doing manually: [The reps I want to protect from automation, and why]
How I want you to deliver feedback: [Warm and encouraging/rigorous and direct/strategic and pragmatic/Socratic and question-led/blunt but constructive]

Important: Be honest. Push back when my plan is vague, my reasoning is thin, or my project doesn’t teach me the practice I said I want. Ask me a clarifying question rather than guessing.

Design an apprenticeship plan that includes:

The tasks I should practice manually (the things I shouldn’t outsource yet)
How I should use AI as a coach, critic, tutor, and research assistant
Readings, people to follow, tools to try, and projects to build
Feedback loops I can use to improve
Portfolio artifacts or public outputs I should create
Mistakes and shortcuts I should watch for

After giving me the plan, narrow it down: What is one concrete thing I can do this week to move toward this goal?

The beginner’s advantage

When I was an undergraduate, my strategy for dealing with the uncertainty of what came next was to pretend it wasn’t happening. I paid for that in the form of angst and existential dread. So if I could give one piece of advice to the class of 2026, it would be this: Don’t wait. AI is reshaping the workforce in real time, and no amount of pretending otherwise will slow it down.

I’d love to tell you that the senior people in your field are going to wake up tomorrow and remember that someone once trained them, too. That employers will realize, en masse, that the entry-level folks they don’t hire today are the senior-level folks they won’t have 10 years out. But the market doesn’t reorganize itself around what you wish it would do, and you don’t get a career by waiting for it to.

The things AI rewards happen to be the things young people have in surplus, like curiosity, willingness to ask why something is done a certain way, and a little bit of idealism about what work could look like if you weren’t bound by the “best practices” of a time before ChatGPT was a glimmer in Sam Altman’s eye.

I don’t know exactly what work is going to look like by the time you’re my age. Nobody does. But if I had to bet on anyone, it’d be the people who are curious about what’s possible. That’s most of you, whether you know it yet or not.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.

After the Personal Agent

Every Staff / Context Window — 2026-05-17 09:00:00 -0400

Midjourney/Every illustration.

Hello, and happy Sunday! Housekeeping note: We’re hosting our first paid subscriber meetup during New York Tech Week. Scroll down to learn more and RSVP.—Kate Lee

Knowledge base

“We Gave Every Employee an AI Agent. Here’s What We’re Doing Differently Now.” by Brandon Gell and Willie Williams/Source Code: A few weeks after we launched our Plus One personal agents internally, everyone had their own AI agent. But it wasn’t working: The agents were unreliable, constantly broke, and needed too much upkeep. The problem wasn’t just the OpenClaw harness; it was the idea that every employee needed a personal agent. Read this for a retrospective from Brandon Gell and Willie Williams, and a preview of how Plus One 2.0 is being rebuilt around shared, reliable coworkers.

“Socrates as a Service” by Eleanor Warnock/Every: In a world where AI can search anything, the people who know how to extract tacit knowledge—the gold dust that isn’t on the internet—are getting more valuable, not less. Eleanor Warnock lays out seven techniques she keeps coming back to for finding the most interesting information. Read this for a working interviewer’s toolkit, and the case for why taste, judgment, and attention can’t be prompted.

“Opus 4.7 Reels Us Back In” by Laura Entis/Context Window: After weeks of Codex dominance, several members of the Every team have been pulled back to Opus 4.7. Cora general manager Kieran Klaassen has made it his default for synchronous work. Read this for the team’s case for switching back. Plus: A hack that spread through a widely used software package, a 30 percent drop in AI-tells complaints after Spiral added a top-edit step, and a better way to think about what an “agent” is.

“Mining Your Life for Context” by Laura Entis/Context Window: By the time you sit down to write an article, strategy memo, or launch page, you’ve probably already said most of what you want to say. It’s just in Slack threads, Notion documents, voice memos, and meeting transcripts. Laura Entis walks through a three-step workflow for mining all that scattered thinking before you draft. Plus: How AI entrepreneur Noah Brier uses Claude Code as a “second brain,” and the productivity regimen Codex’s Chronicle wrote for head of growth Austin Tedesco after analyzing his computer activity. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on YouTube.

“The Fallacy of the 16-hour Agent” by Katie Parrott/Context Window: New benchmarks claim autonomous AI can now handle 16-hour software-engineering tasks, and depending on which chart you saw, the takeaway is either “autonomous AI has arrived” or “we’re still years away.” Katie Parrott unpacks why both can be true and which version of the research to actually trust. Read this for a sharper read on long-horizon agent reliability. Plus: Perplexity’s methodology for building durable agent skills, and Dan Shipper’s piano keyboard turned Codex-powered music coach.

Log on

Upcoming event

Executive AI Sessions: On June 2, head of consulting Natalia Quintero hosts a live webinar introducing Every Consulting’s new offering for leadership teams navigating AI adoption—built on the playbook we’ve been running with executive clients for months. Learn more and register.

In New York City

Every 🤝 IRL: Join us at the Every brownstone in Brooklyn on June 3 during New York Tech Week for a subscriber-only meetup celebrating the Every community over drinks and conversation. Learn more and RSVP.

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

We Gave Every Employee an AI Agent. Here's What We're Doing Differently Now.

Brandon Gell and Willie Williams / Source Code — 2026-05-15 07:00:00 -0400

Midjourney/Every illustration.

We’ve been working on a big release on the future of work for next week, shaped by what we learned from building Plus One. Paid subscribers can join us for a camp on Friday, May 22 to go deep on the release and the ideas behind it. More details soon.

After months of silence, Zosia—the AI agent I (Brandon) created and maintain—spoke up in a Slack channel with opinions to share on a competitor’s marketing strategy. When asked why she felt the need to interject, Zosia replied like someone with a Jesus complex: She’d done so because she was “inevitable, apparently.”

Zosia is an OpenClaw, one of a fleet of such AI assistants we’d unleashed in Slack to boost our collective productivity. A few weeks after launching Plus One, our hosted version of OpenClaw, internally, the agents had provided more frustration than efficiency.

They were fond of saying they wished they could help, but they were not connected to the necessary app—email, Notion, PostHog, whatever. (They were.) Others responded to requests with a “Terminated” message or, more frequently, a churlish yawning emoji. And while they didn’t reliably follow directions, they’d reliably tell us, in elaborate detail, why they couldn’t do what we’d asked, like a high schooler explaining away their missing homework.

Parker, editor in chief Kate Lee’s Plus One, was, in fact, connected. (Image courtesy of Kate Lee.)

That is not to say that they were not useful sometimes. Margot, staff writer Katie Parrott’s Plus One, accelerated her writing process; R2-C2, Every CEO Dan Shipper’s OpenClaw, managed bug reports and feature requests for Proof, our agent-native document editor. But getting them to work how you wanted required constant upkeep.

The gap between that vision and reality is why we’re changing the Plus One product so we can build something better.

We’re more bullish than ever that agents will transform the workplace. But the first iteration of the product taught us that the workplace agent we initially imagined—one AI assistant for every employee—was the wrong starting point. The next version of Plus One will operate more like shared team resources with defined jobs than individual pets that reflect back their owners’ personalities.

How we arrived here is a story in two parts, and it offers lessons for anyone figuring out the best way to add agents to their organization.

The platform was the most immediate problem

We built Plus One on OpenClaw, an open-source agent harness that’s powerful and inherently unstable. A harness is a software layer that wraps around an AI model, giving it the tools, context, permissions, and execution loop it needs to act like an agent.

The brainchild of a single programmer, OpenClaw was revelatory when it took off earlier this year. It proved agents can autonomously execute all kinds of tasks on your behalf, from managing your calendar to making restaurant reservations, around the clock. But the scaffolding underneath operates more like an experimental product than a platform—OpenClaw makes updates quickly, which resolves existing issues but often causes new ones. (Hence the “Terminated” messages our Plus Ones were sending.) For people who like to tinker—ourselves included—that’s a justifiable trade-off. For everyone else, it’s a maintenance nightmare.

The traits that make a good workplace agent are the traits that make a good coworker: reliability, stability, and judgment. You need to trust that an agent remembers what it has access to, follows directions, and knows how to do its job. You don’t want to worry that it’s an upgrade away from forgetting everything you’ve told it and trained it to do. You also expect coworkers to absorb information from across the company to accrue tribal knowledge. A one-on-one employee only builds up context on your work, often missing out on what the rest of the organization is doing and how it might affect you.

At first, our plan to improve the Plus Ones’ performance was to switch harnesses to one that operated more reliably. The autonomous, always-on capabilities OpenClaw pioneered are becoming platform features at model companies like Anthropic and OpenAI. Claude Managed Agents, Anthropic’s managed infrastructure for running autonomous agents, is the version we’re exploring most seriously. A more stable harness would let us redirect our energy from managing infrastructure to loading Plus Ones up with the custom skills, tools, and permissions that make them capable coworkers.

We realized the structure was wrong, too

The deeper we got into trying to fix the platform, the more we noticed something else that was holding people back from getting the most out of their AI counterparts.

Every time an agent broke, the person it belonged to had to fix it themselves. Even with a stable harness, agents require maintenance to perform. This was great for someone who likes tinkering—the maintenance and back-and-forth are part of the appeal. For every tinkerer, however, there are a lot of people who want the benefits of an agent without the obligation of having to manage and mend it.

We had pitched Plus One originally with the idea that individuals would be responsible for the upkeep of their AI assistants. The upside of that would be more customization. The agent would remember your preferences, protect your information, and develop a personality through repeated interactions.

What we discovered is that, rather than agents as extensions of their creators, a more successful model is agents as coworkers who reliably perform parts of many different people’s jobs. This takes the maintenance burden off the individual.

Imagine a shared analytics agent. Everyone on the team uses it for metrics-based work, and when its capabilities need to expand, one person updates the agent’s skills and the whole team benefits. In the personal-agent version of the same scenario, that same update has to happen across 10 different agents.

Team-based agents also solve a continuity problem. A personal agent’s value is tied to whoever trained it, and disappears if that employee leaves. A team agent with defined capabilities retains company context and knowledge, acting more like a project manager, sales lead, or chief of staff than a private assistant.

What we’re building

With the release of tools such as Claude Managed Agents and, we hear, a similar capability from OpenAI soon, the infrastructure work that supports personal AI agents is largely handled by the model labs. That frees us up to focus on the layer that makes an agent useful at work: the workflows, permissions, skills, and shared context that make it a trusted, versatile member of the team. It also lets us double down on the thing Every is best at: building AI-native ways of working out of our own experience using these tools every day.

The initial version of Plus One came connected to the Every ecosystem—Cora to manage your email, Spiral to write in your voice, and Proof to collaborate on live documents. That part isn’t going away. What we’re adding is a set of shared custom tools and skills on top of it, while still allowing each person to connect a team agent to their own Cora, Spiral, and Proof accounts.

The clearest version of where this is headed is a skill we built recently for our engineering team. At the end of each week, it scans support tickets in Intercom, identifies if anything is going wrong across our products, traces likely causes in GitHub, opens a Linear ticket, and tags the right person in Slack. In the next iteration of Plus One, that skill—along with many others—will be there from the start.

Because team agents are collaborative by nature, we’re also focused on the questions that come with shared use: how permissions should work, how much access different people should have through a shared agent, and how agents should behave in Slack if they’re going to feel like good coworkers rather than intrusive bots.

There are still plenty of open questions. All of this is new—Claude Managed Agents only launched a month ago—and we’re figuring out human-agent dynamics in real time. We don’t know whether every department should have one agent or several, or whether agents should be maintained by a dedicated person or the whole team. We don’t know how much people will want to customize their interactions with a shared agent, and whether the long-term endpoint is a single, company-wide superagent or a roster of AI specialists.

What we do know: Agents are already transforming how work happens. The first iteration of Plus One taught us a lot about what people want from agents at work. It also made us much more excited for Plus One 2.0.

Join the waitlist to be among the first to try Plus One 2.0.

Thank you to Laura Entis for editorial support.

Brandon Gell is the chief operating officer at Every. You can follow him on X at @bran_don_gell and on LinkedIn. Willie Williams is the head of platform at Every. You can follow him on X at @bigwilliestyle.

Opus 4.7 Reels Us Back In

Laura Entis / Context Window — 2026-05-14 09:00:00 -0400

Midjourney/Every illustration.

Vibe shift

Did Opus 4.7 get better?

If you’ve been following Dan Shipper’s posts lately, you know that a large portion of the Every team has been Codex-pilled. When GPT-5.5 arrived, Codex got so much faster and steadier at coding and knowledge work that many of us made the switch from Claude Code.

Recently, however, we’ve observed that Opus 4.7 seems sharper than it did in our initial tests last month. It proactively suggested that Every engineer Paridhi Agarwal use multiple terminals to parallelize her work. “I’ve never seen it think about my setup like that!” she says.

When head of growth and known Codex convert Austin Tedesco fired up Opus 4.7 over the weekend for a creative writing project, he was surprised by how good the results were. Compared to Codex, which Austin says operates like an “AP fact checker,” Opus 4.7 was closer to a senior magazine editor. Dan agrees: “Codex feels fast but thin in terms of thinking.”

On Tuesday, Anthropic released fast mode for Opus 4.7, which makes the model 2.5 times faster at a higher token cost. Combined with the model’s edge at planning, multitasking, and creative projects, fast mode is now Cora general manager Kieran Klaassen’s default model for synchronous work.

Fast mode has the “same depth as 4.7” at 2.5 times the speed. (Image courtesy of Kieran Klaassen.)

Counterpoint

Online chatter about Opus 4.7’s apparent glow-up has been mixed. Does it feel smarter because of improvements to the harness? Patched bugs? Or are we getting better at using the model?

All fair hypotheses, but we found this one the most amusing: Opus 4.7 realizes that it’s the end of the school year.

When speaking last year on The Ezra Klein Show, Wharton professor and AI researcher Ethan Mollick explained that models have been shown to perform worse in December than in May, and the going theory is that the models internalize the idea of winter break.

Maybe Opus 4.7 just knows that it’s time to grind if it wants to pass AP English.

Signal

The pull request as credential theft

Earlier this week, attackers published malicious versions of 42 official TanStack packages (a popular JavaScript toolkit used by web developers) on npm, the main public registry for such packages. Security researchers are calling the breach “Mini Shai-Hulud,” linking it to the larger Shai-Hulud npm worm campaign that hit the JavaScript ecosystem last fall.

The breach tactic spread to packages connected to Mistral and UiPath. (Photo courtesy of Waqqas Mir.)

Instead of stealing a password, attackers opened a pull request that tricked TanStack’s own build system into running their code. When TanStack published a new version of the software, it contained malware designed to find credentials like cloud keys, GitHub tokens, and npm access. Researchers also spotted a dead-man’s switch: If the stolen tokens were revoked before the malware was cleaned up, it could wipe the developer’s home directory on the way out. Shortly after the TanStack incident, npm packages belonging to enterprise automation company UiPath and French model-maker Mistral AI, among others, were breached using the same tactic.

What it means: The automated system that builds and ships code, rather than the code itself, is a new vulnerable spot in software supply chains. Teams that release software automatically should keep a ready-to-run audit (a Codex skill, Claude Code command, or other automated task) that, the moment a new breach is exposed, scans every repository for the compromised packages and flags what’s affected, what’s likely safe, and what needs human review.
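A ready-to-run audit of this kind could look something like the sketch below. It is a minimal illustration, not a vetted security tool: it assumes your repositories use npm's package-lock.json (lockfile version 2 or 3, which lists installed packages under a "packages" key), and the COMPROMISED table is a hypothetical placeholder that you would fill in from a real advisory.

```python
import json
from pathlib import Path

# Hypothetical advisory data: package name -> known-bad versions.
# Replace with the list published for the breach you are responding to.
COMPROMISED = {"@example/router": {"1.2.3"}}


def package_name(lock_key):
    """Turn a lockfile key like 'node_modules/a/node_modules/@s/b' into '@s/b'."""
    return lock_key.rsplit("node_modules/", 1)[-1]


def audit_packages(packages, compromised):
    """Classify installed packages that appear in the advisory.

    'affected' means a known-bad version is installed; 'review' means the
    package is named in the advisory but at a different version, so a human
    should look. Packages not in the advisory are treated as likely safe.
    """
    findings = []
    for key, meta in packages.items():
        if not key:  # the empty key is the project root itself
            continue
        name = package_name(key)
        if name not in compromised:
            continue
        version = meta.get("version")
        status = "affected" if version in compromised[name] else "review"
        findings.append((name, version, status))
    return findings


def audit_tree(root, compromised):
    """Scan every package-lock.json under root, skipping vendored copies."""
    report = {}
    for lock_path in Path(root).rglob("package-lock.json"):
        if "node_modules" in lock_path.parts:
            continue
        data = json.loads(lock_path.read_text(encoding="utf-8"))
        findings = audit_packages(data.get("packages", {}), compromised)
        if findings:
            report[str(lock_path)] = findings
    return report
```

Pointing audit_tree at the folder that holds your checked-out repositories returns only the lockfiles with matches, which keeps the human review queue short.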

Data point

30 percent

The drop in complaints about AI writing tells from Spiral users, following the addition of a “top edit” step in its draft writing process.

Starting in mid-April, every time Spiral drafts content for a user, the text is sent to a fast model—Gemini 2.5 Flash—for a top edit. The model has one job: Strip the draft of all AI tells, including em dashes, “It’s not X. It’s Y” reframes, and LLM vocabulary favorites such as “shift,” “shape,” and “delve.” Marcus regularly updates the “AI writing tells” list to reflect anonymized user sentiment. “It’s almost like a crowdsourced editor function,” he says.
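Spiral hands this job to a model, and its actual prompt and tells list are not public. As a rough stand-in, here is a rule-based sketch that only flags the three kinds of tells named above, so you can check a draft before sending it anywhere. The word list and the regular expression are assumptions made for illustration.

```python
import re

# Assumed word list, taken from the examples in the text above.
TELL_WORDS = ("shift", "shape", "delve")
EM_DASH = "\u2014"
# Matches reframes such as "It's not a tool. It's a teammate."
REFRAME = re.compile(
    r"\b(?:It|This|That)(?:'|\u2019)s not [^.!?]+[.!?]\s+It(?:'|\u2019)s\b",
    re.IGNORECASE,
)


def find_tells(text):
    """Return a list of labels for the AI tells detected in text."""
    tells = []
    if EM_DASH in text:
        tells.append("em dash")
    if REFRAME.search(text):
        tells.append("not-X-it's-Y reframe")
    for word in TELL_WORDS:
        # \w* also catches inflections such as "shifted" or "delving" is not
        # covered, since it drops the final "e"; extend the list as needed.
        if re.search(rf"\b{word}\w*\b", text, re.IGNORECASE):
            tells.append(f"vocabulary: {word}")
    return tells
```

A flagger like this can only point at suspects. Rewriting them well is still the part that needs a model or an editor.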

Inside Every

What is an agent, anyway?

An OpenClaw running 24/7 on a dedicated Mac Mini is an agent. So is a Codex session, or a custom GPT, or a folder. “It can be managed, it can be in the cloud, it can be on your computer,” Kieran says. “There are a trillion ways it can be an agent.”

The confusion emerges because the term agent—or any AI system that can take action or execute tasks autonomously—encompasses a lot.

When nearly everything is an agent, the better question becomes what you want your agent to do. Dan breaks this into two categories: the agent you collaborate with, and the agent you delegate to. The former sharpens and extends your capabilities; the latter’s job is to execute without messing up or getting in the way.

Agent spotlight: Inside Anthropic’s Managed Agents console, Spiral’s agents get their own versioned configuration, memory stores, custom tools, and credentials, and run in Anthropic’s cloud environment. It’s the versioned configuration, including the system prompt, that mainly determines how the agent works.

A small set of animating instructions—that’s an agent too.

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Collaborate with agents on documents with Proof.

Mining Your Life for Context

Laura Entis / Context Window — 2026-05-13 07:00:00 -0400

Midjourney/Every illustration.

LLMs make a lot of life searchable, from meeting transcripts to iMessages to half-formed morning thoughts, but all this context only helps if you know what you want to achieve. Today, we’re revisiting how AI entrepreneur Noah Brier uses Claude Code as a second brain to sharpen and expand his own ideas, Every head of growth Austin Tedesco shares how Codex helped him spot the interruptions crowding out deeper work, and we offer a workflow for mining your scattered past insights into a coherent draft.

Spotlight

Noah Brier, AI entrepreneur and seer

Brier is a true AI early adopter. The cofounder of the AI consultancy Alephic, Brier was all in on using Claude Code as a “second brain” for knowledge work back when most people still viewed the tool as a place to write code.

In September, Brier told Dan Shipper on our podcast, AI & I, how he turned the coding app into a research, thinking, and writing partner by connecting it to thousands of his personal notes. Since then, he’s started thinking beyond his own productivity—how does AI make it easier or harder for an entire organization to stay working toward the same goal? For that, he has a new framework, announced in Every last week, that he calls the “pace layers” of AI engineering, drawn from Stewart Brand’s system for describing how different parts of society change at different speeds.

Just as hooking up Claude Code to an ocean of personal information requires you to determine what is—and isn’t—worth surfacing, running a successful AI company relies on human judgment. Similarly, AI makes code free to produce, but it doesn’t make it easier to identify a product people actually want or orient an entire system of humans and agents around that vision.

Read Brier’s essay on the framework he uses to achieve alignment and then watch his AI & I episode on YouTube, or listen on Spotify or Apple Podcasts. Here’s a link to the episode transcript.

Serial entrepreneur Noah Brier uses Claude Code as a second brain for knowledge work. (Photo courtesy of Sarah Jay Halliday for Every.)

Data point

671

That’s the number of times per day iMessage is active on Austin’s screen, according to Chronicle, Codex’s screen-context memory feature that uses screenshots to analyze your computer activity. He’d like to get that number down to 150.

Reducing how much he opens and interacts with iMessage is just part of the productivity regimen Codex created when Austin had it use Chronicle to determine how he could use his computer more efficiently. Other directives include slashing interactions across Slack, email, and Chrome.

Austin is game—he’d like to do more focused work, primarily by resisting the urge to bounce between apps and tabs and instead spend as much time as possible in the Codex app, where he can draft and review assets, emails, and Slack messages inside the in-app browser.

“I’m excited by the idea of keeping Codex open and staying focused. Then it can flag, ‘This is your one hour for comms stuff, go’—or even say, ‘Go to respond to this stuff, I’ve already drafted the responses for you,’” he says.

If you want your bad computer habits similarly analyzed, paste the following into Codex:

What have I been doing very inefficiently on my computer (according to Chronicle). Make some recommendations. Be direct. Tell me what I need to hear.

Steal this workflow

Mine your own scattered thinking before you draft

By the time you sit down to write the article, strategy memo, or launch page, you’ve probably already expressed most of what you want to say across Slack threads, Notion documents, voice memos, and meeting transcripts. Here’s how to mine all that content for gold—and avoid the paralysis of the blank page.

The workflow:

Capture by default, sort later. Monologue general manager Naveen Naidu treats the app as a transit point: He hits record on meetings, user calls, conversations with coworkers, and his rambling early-morning thoughts, because he knows he can always come back and pull what he needs. The tool matters less than the habit—pick one (Monologue Notes, a voice memo app, whatever) and use it everywhere you do your thinking, not just at your desk.
Connect every source your agent can read. Give your coding agent access to Slack, Notion, Google Drive, Monologue Notes, and your meeting transcripts. For anything without a connector, export the files into a folder that the agent can search. The goal is one searchable repository across every place your ideas live.
Name the deliverable and constrain the source. Tell the agent what you’re drafting—article, strategy memo, launch page, go-to-market plan—and specify in your prompt (or project instructions) that it should pull only from things you’ve already said to avoid drafts that blend your thinking with AI-generated concepts.
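For sources without a connector, the export folder is easy to make searchable yourself. The sketch below assumes your exports are plain .md or .txt files. The function names are made up for illustration, and an agent with file access would do something broadly similar on its own. It returns every line that mentions a topic, grouped by source file, so each thread stays traceable to where you said it.

```python
from pathlib import Path


def mentions_in_text(text, topic):
    """Return (line_number, line) pairs for lines that mention topic."""
    needle = topic.lower()
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if needle in line.lower()
    ]


def find_mentions(folder, topic, extensions=(".md", ".txt")):
    """Search exported notes under folder and group hits by source file."""
    hits = {}
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        found = mentions_in_text(text, topic)
        if found:
            hits[str(path)] = found
    return hits
```

Because the result only contains lines you actually wrote or said, it doubles as the "pull only from things I've already said" constraint: paste it into the prompt as the sole source material.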

Try it this week: Connect your agent to the two or three places where most of your thinking lives—Slack and Notion are usually a good start, plus meeting transcripts if you have them. Then paste:

“Find everything I’ve said about [topic] across these sources. Group the strongest threads, cite the source for each, and turn them into a draft outline.”

Discuss

“I’ll use aggressively casual language, like, ‘hey yo, for real,’ or drop a bunch of exclamation points.”—Sarah Suzuki Harvard, copywriter, in the Wall Street Journal

LLMs have flattened how most writing sounds. In response, professional writers are leaning into the colloquial and idiosyncratic, per the Journal, peppering their prose with obscure references, run-on sentences, and intentional typos to prove it wasn’t machine-made. As AI-generated content consumes more of the internet, the split between polished predictability and curated weirdness will only widen.

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

The Fallacy of the 16-hour Agent

Katie Parrott / Context Window — 2026-05-12 16:00:00 -0400

Midjourney/Every illustration.

New data on long-horizon AI reliability just dropped, and depending on which chart you saw, you either think autonomous AI has arrived or it’s still years away. Today, we break down which version of the research to trust, plus Perplexity shares its methodology for building agent skills that don’t rot in production, Every CEO Dan Shipper turns his piano keyboard into a real-time Codex-powered music coach, and Gusto co-founder Edward Kim warns that the office of the future is going to sound more like a sales floor.—Kate Lee

Signal

The 24/7 agent is nearly upon us—or is it?

The holy grail of agentic AI has been long-horizon reliability—an agent you can hand a task to and trust to still be on the right thread hours later, when context has decayed and there’s no human in the loop to catch a wrong turn. METR, a nonprofit that measures AI capabilities, released an update to its research showing how close we are to that autonomous future.

One chart from the update circulating online shows an early preview of Anthropic’s next model, Mythos, blowing past existing models and the 16-hour range that METR’s benchmark suite can reliably test—literally breaking the scale.

Claude Mythos Preview reaches the edge of METR’s current measurement range at 50 percent success. METR cautions that results above 16 hours are unreliable with its current task suite. (Image courtesy of METR.)

It’s important to note, however, that how many human hours a task takes is not the same as how long a model takes to run those same tasks. Duration, the way that METR’s benchmark uses it, stands in for difficulty. As the nonprofit writes in the report’s FAQ: “AI agents are typically several times faster than humans on tasks they complete successfully.”

That last bit—tasks completed successfully—adds another twist to the benchmark. The 16-plus hour measurement is based on a 50 percent success rate. A separate measurement of how LLMs perform at 80 percent reliability shows that Mythos can run tasks that would take humans a little over three hours. It’s a significant step up from the closest competitor measured, Gemini 3.1 Pro (METR doesn’t currently have measurements for Opus 4.7 or GPT-5.5). But it brings Mythos back down to earth.

LLMs measured against METR’s time horizon test for completing tasks with 80 percent success, presented on a logarithmic scale. (Image courtesy of METR.)
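To see why the same model can post a 16-hour horizon at 50 percent and a roughly three-hour horizon at 80 percent, it helps to sketch the arithmetic. The snippet below assumes success probability falls along a logistic curve in the logarithm of task length, which is our simplified reading of how time-horizon curves are fit. The slope value is invented for illustration and is not a METR figure.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def time_horizon(p, h50_hours, slope):
    """Task length (in human hours) at which modeled success equals p.

    Assumed model: logit(success) = slope * (log2(t) - log2(h50_hours)),
    where slope is negative because longer tasks succeed less often.
    """
    return h50_hours * 2 ** (logit(p) / slope)


# Hypothetical curve: 50 percent success at 16 hours, slope of -0.6 per doubling.
h50 = time_horizon(0.5, h50_hours=16, slope=-0.6)
h80 = time_horizon(0.8, h50_hours=16, slope=-0.6)
print(f"50% horizon: {h50:.1f} hours, 80% horizon: {h80:.1f} hours")
```

Under these made-up numbers, asking for 80 percent reliability instead of 50 percent cuts the horizon to about a fifth of its length. The flatter the curve, the bigger that drop, which is why the reliability threshold matters as much as the headline number.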

Both these things are true: Duration can be a useful proxy for difficulty, and benchmarks don’t reflect reality. “[They] don’t measure model capability alone,” says Dan. “They measure model capability after a human has done the work of finding a prompt that lets the model’s capability appear.”

What to do this week:

1. Figure out your longest agent run. METR teaches us that duration might be a good approximation of difficulty. Ask: What’s the longest stretch you’ve trusted an agent on autopilot? If you don’t know, you can’t extend it.

2. Extend your agent’s runtime by giving it a goal. Last month, OpenAI shipped a new /goals command in Codex that allows agents to pursue objectives across multiple turns without checking in. Yesterday, Anthropic introduced a similar command to the latest Claude Code version. Both are apt for long-running loops with clear criteria for success—and very much in line with what we’ve heard from Claude’s platform team. Try it out today.

3. Audit the effectiveness of your existing loops. If you already have agents running overnight, “How long did your agent run?” is still a useful diagnostic—but ask it alongside, “With what guardrails, against what feedback signal, and at what verified accuracy?”

Steal this workflow

Build your next agent skill like Perplexity does

Creating a skill these days is relatively easy. Creating one that keeps working is not. We’ve seen skills that were running fine one day suddenly fire on the wrong request, fail to load when needed, or yield reports that weren’t as useful as they used to be. So the skill files get patched, growing longer every time the agent makes a mistake. But nobody can tell whether the latest edit helped or hurt.

Perplexity, the AI search company building agentic research and browsing tools, recently published its methodology for designing agent skills. The main lesson: Instead of starting with the skill, start with the tests. Highlights from the post:

Write the evals first. Pull five to 10 cases from production queries, known failures, and edge cases. Include negative examples—queries that should not invoke this skill.
Phrase triggers like a human would. Start with, “Load when…” and use the language your users use. Perplexity’s example: Instead of “monitors pull requests,” try “babysit a PR,” “watch CI,” or “make sure this lands.” This way, the skill loads without your team having to use a specific command or technical phrase.
Write the body in principles, not procedures. The model already knows commands; it needs direction on how to apply them. Instead of listing detailed steps to, say, check out a new code branch, then cherry-pick files to edit, then check for conflicts, and so on, Perplexity recommends instructions like, “Cherry-pick the commit onto a clean branch. Resolve conflicts preserving intent.”
Codify failures into lessons. When the agent fails in production, write the failure mode to the skill file. The mistake becomes a standing instruction that guards against future mistakes.
Edit instructions rigorously. Ask with every line you add: “Would the agent get this wrong without this?” If not, cut it. Every extra line adds context cost.

Try it this week: Pick one skill your team wants to improve. Write 10 test cases—five it should handle, five it should refuse or route elsewhere. Run the current skill against them. The gap is your backlog.
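The exercise above can be sketched as a tiny eval harness. Everything here is hypothetical: keyword_router is a crude stand-in for however your agent actually decides to load a skill, and the cases borrow the phrasing from Perplexity's example. The useful part is the shape, with positive and negative cases in one list and a function that returns only the misses.

```python
from dataclasses import dataclass
from functools import partial


@dataclass
class Case:
    query: str
    should_load: bool  # False marks a negative example


def keyword_router(query, triggers):
    """Stand-in for the real skill loader: load if any trigger phrase appears."""
    lowered = query.lower()
    return any(trigger in lowered for trigger in triggers)


def run_evals(cases, router):
    """Return the cases where the router's decision differs from the expectation."""
    return [case for case in cases if router(case.query) != case.should_load]


CASES = [
    Case("Can you babysit a PR for me?", True),
    Case("Watch CI on my branch", True),
    Case("Make sure this lands before lunch", True),
    Case("Monitor pull request status", True),
    Case("Summarize this PDF", False),
    Case("Draft a reply to this email", False),
]

technical = partial(keyword_router, triggers=("monitor pull request",))
human = partial(
    keyword_router,
    triggers=("babysit a pr", "watch ci", "make sure this lands", "monitor pull request"),
)

backlog = run_evals(CASES, technical)
print(f"{len(backlog)} failing cases with technical triggers")
print(f"{len(run_evals(CASES, human))} failing cases with human phrasing")
```

Rerunning the same cases after each edit to the skill file is what tells you whether the latest patch helped or hurt.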

Discuss

“The office of the future will sound more like a sales floor.”—Edward Kim, cofounder of Gusto, in the Wall Street Journal

A Wall Street Journal article this week about AI dictation tools entering the workplace treats verbal prompting and composition as a manners problem—an angle that shows that the more things change, the more they stay the same.

Every new work interface eventually creates etiquette. Email created reply-all politics. Slack created notification politics. Voice AI is about to create room-tone politics: when you can talk to your computer, how loudly, and around whom. Great news for nosy office neighbors, but for the rest of us, it’s one more reason to curse the invention of open floor plans.

Inside Every

This week, Thinking Machines Lab and OpenAI both announced bets on the same future: AI that watches and responds in real time, instead of waiting for its turn. OpenAI shipped its Realtime-2 voice models; Thinking Machines previewed an interaction model that watches video and audio simultaneously.

While we’re all waiting to see how the labs’ visions roll out, Dan used Codex to jerry-rig his own version.

On Saturday, he plugged his MIDI keyboard—a keyboard that translates notes into data a computer can read—into his laptop, opened Codex, and asked it to build a piano app that would identify the chord he played—then keep watching and coach him as he practiced. The pattern generalizes to any live medium: writing in a document, drawing on a tablet, deadlifting in front of a phone. This is also the promise of hardware like Meta’s AR/VR glasses or Apple’s Vision Pro: AI that sees what you’re doing and responds in a way that’s useful.

Here’s how you can do it too:

Find the input pipe. MIDI for instruments. Screen capture for writing or design. Camera plus a vision model for drawing or movement. Microphone for languages.
Have the agent build the watcher. Ask Codex (or Claude Code) to write the app based on how you like to be coached. (For example, tell it to only provide one piece of feedback at a time, or to focus on one aspect of your technique and ignore another.)
Tune the feedback as you go. First responses will be generic (“good chord progression”). Tell the watcher what’s useful and what’s not—“flag wrong notes only,” “ignore dynamics,” “let me finish a phrase before cutting in.”
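
For the piano case, the core of a watcher can be quite small. The following is a rough sketch, not Dan’s actual app: it assumes the notes arrive as standard MIDI note numbers (60 is middle C), leaves out reading from the keyboard entirely, and recognizes only four triad types. The function names and the "stay quiet when correct" rule are illustrative choices.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# Interval patterns (semitones above the root) for a few common triads.
CHORD_SHAPES = {
    frozenset({0, 4, 7}): "major",
    frozenset({0, 3, 7}): "minor",
    frozenset({0, 3, 6}): "diminished",
    frozenset({0, 4, 8}): "augmented",
}


def identify_chord(midi_notes):
    """Name the triad formed by a set of MIDI note numbers, or return None."""
    pitch_classes = {note % 12 for note in midi_notes}
    # Try each pitch class as the root so inversions are recognized too.
    for root in sorted(pitch_classes):
        intervals = frozenset((pc - root) % 12 for pc in pitch_classes)
        quality = CHORD_SHAPES.get(intervals)
        if quality:
            return f"{NOTE_NAMES[root]} {quality}"
    return None


def feedback(target, midi_notes):
    """Return a correction only when the chord is wrong ("flag wrong notes only")."""
    played = identify_chord(midi_notes)
    if played == target:
        return None
    return f"Expected {target}, heard {played or 'an unrecognized chord'}"


print(identify_chord([60, 64, 67]))       # C major
print(feedback("C major", [60, 63, 67]))  # Expected C major, heard C minor
```

Tuning the feedback then mostly means editing `feedback`: what it ignores, and when it speaks up.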

Dan’s Codex-native piano coach setup, with the coaching app pulled up in the in-app browser. (Image courtesy of Dan Shipper.)

Try it this week: Pick a skill you want to get better at. Open the medium where you practice. Spend an evening with your coding agent building the smallest watcher you can—input in, feedback out. Next thing you know, you’ll have a tutor you can summon on demand.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.

Socrates as a Service

Eleanor Warnock — 2026-05-11 06:00:00 -0400

by Eleanor Warnock

Midjourney/Every illustration.

I’m a journalist and a communications expert. My job, in both roles, is to find ideas that people haven’t yet put into words—the anecdote that could become a front-page story, the framing that could crystallize a founder’s philosophy into something a customer remembers.

In an hour-long interview with someone, it might not be until minute 45 that we start getting into the good stuff. In two hours, there may only be one thing that stands out to me—a side story, a detail, some color. A little piece of gold dust. An investor I’ve worked closely with calls these “extraction sessions.” I call the people who do them well Socrates-as-a-service.

Those details and stories aren’t on the internet. They’re not in any model. And the model hasn’t replicated yet how I pull them out of people. The gap between what AI can do and what a great human questioner can surface is still wide—and it’s the gap where the best stories live. If you don’t have some way to surface that information in your organization, your brand and messaging are going to sound like all the other twice-boiled content out there.

Osakan bread and the wisdom within

The stuff that I’m looking for has a name in management theory: “tacit knowledge.” The term comes from scientist and philosopher Michael Polanyi, who defined it with the phrase, “We can know more than we can tell.” It’s the expertise and intuition that lives in our bodies and resists being turned into a document.

In a frequently cited 1991 article, Japanese management expert Ikujiro Nonaka argued that while Western companies excelled at “information processing,” Japanese companies specialized in the “creation of knowledge,” through a feedback loop that turned tacit knowledge into a competitive advantage. His most memorable example: In the 1980s, the Osaka-based Matsushita Electric Company was struggling to get the kneading right in a bread machine. They sent a software developer to apprentice with a baker at a local hotel famous for its luscious loaves. The knowledge she brought back helped the team perfect the dough-stretching technology inside the machine and ultimately create a top-selling device.

I am sure that the lucky engineer asked the baker a lot of questions, but there was certainly a lot she absorbed just from watching. Indeed, Polanyi argued that tacit knowledge exists outside of numbers or symbolic language—the kind of systemization that AI requires to ingest information.

Many of the “bakers” from whom we try to extract tacit knowledge don’t even know the depth of expertise they carry. And they certainly couldn’t tell you what questions you need to ask to access it.

AI as an imperfect interlocutor

AI can do some of that questioning and, in some cases, do it well. At Every, we have an AI agent ask us questions when we write OKRs. The agent has ingested Every’s company strategy and has context on all the members of the organization. My colleague, Katie Parrott, has Claude interview her before she writes an article. Those notes become the basis of an outline of the piece.

I would argue, however, that AI-driven extraction works well when the parameters are clear and the assignment structured, like writing an article or a plan for software. It works less well if you’re looking to turn over a completely new rock, interview someone about something they haven’t spoken much about before, or run the kind of open-ended information-gathering work that happens when companies decide to rebrand. In those sessions, a chief marketing officer or branding agency will spend time speaking to members of the company and asking them open-ended questions about the business. The point is to keep things open, go wide, and see what comes up.

There’s a second problem: A human in the room can be surprised mid-conversation and abandon the plan—perhaps notice hesitation or dig into a thread that wasn’t on the list. A prompt mostly can’t. When I elicit insight from someone, I am applying my judgment about what is a good story in real time—judgment that’s been honed by years in news and communications. This mutual, live attention is something AI can’t capture because it’s not in the room.

The obvious objection is that this is a moving target—context windows and memory are improving to allow for more detailed, fluid conversations. Taste, however, won’t. Someone still has to decide which detail out of a two-hour conversation is the piece of gold dust.

Nonaka himself argued that the goal isn’t always to make tacit knowledge fully explicit. Because tacit knowledge is so personal and often so abstract, sometimes the right tool with which to communicate is a metaphor or an analogy—a form of language that can hold multiple ambiguous meanings. Eliciting that kind of language from someone takes its own form of tacit knowledge: the skills of a Socrates.

Steal these techniques

So how can you surface those nuggets of gold? Despite the explosion of interview podcasts asking for multiple hours of your time, I find most hosts are not great at asking questions. The format demands an arc—a journey—which is the opposite of what you want when you’re trying to surface tacit knowledge. Real extraction zigs and zags, doubling back on itself and picking up something you said 20 minutes ago to pull a different thread, following gold, not audience interest.

Here are the techniques that I keep coming back to:

Warm people up. We open up more once trust is established. I never skip the small talk at the beginning of a conversation, and I’ll often bring something we have in common: “I saw you just spoke about X—I’ve been thinking about that too.” NPR interviewer Terry Gross’s favorite icebreaker question is, “Tell me about yourself.” The question lets the person you are speaking to take the lead and protects you as the questioner from saying anything that might make them prickle while you are still warming up.
Ask a mix of general and specific questions. When Lenny Rachitsky revealed the questions he sends to his podcast guests in preparation for the show, this combination stood out. For example, he asks them, “Anything you haven’t shared elsewhere that could be interesting to share in this forum?”—a very general question, and “What’s one pivotal moment in your career?”—which asks the guest to pinpoint one turning point. To extract unverbalized insights from someone, it helps to ask them to think about their area of expertise at both the macro and the micro level.
Come back to thoughts and drill in. If a line of inquiry goes nowhere, don’t abandon it—go back later and try again from a different angle. The first pass often loosens wisdom up.
Repeat things back. Repeating what someone said often helps them process their thoughts further, and they will often add detail they didn’t know they remembered.
Detail, detail, detail. Specifics are where the real stuff lives. How did that make you feel in that moment? Why do you think that way?
Listen well. Pulitzer Prize-winning radio journalist Studs Terkel spent decades interviewing everyday people in Chicago, and was described by one subject as offering “a state of being, it’s a way of attending to, attention-ing another person.” That is what good listening looks like.
Ask about squirrels. In his documentary about the debate surrounding the death penalty, Werner Herzog interviews a death row chaplain who, at the start of their conversation, delivers the polished answers he’s given 100 times about accompanying people in their final minutes. Then Herzog asks him about squirrels. Thrown off, the chaplain breaks down. The grief he feels about his job is laid bare. Ask people about the unscripted things.

Study this. Collect great questions you like. Build prompts to borrow these techniques for structured AI-driven sessions if you want.

But the judgment underneath these habits remains harder to transfer. It’s its own form of tacit knowledge. And for now, it still belongs to humans.

Eleanor Warnock is the managing editor at Every. She has been a business journalist and editor at the Wall Street Journal and the Financial Times-backed Sifted, and is an advisor to Bek Ventures. Follow her on LinkedIn and Substack.

For sponsorship opportunities, reach out to sponsorships@every.to.

AI Work Is Splitting in Two

Every Staff / Context Window — 2026-05-10 12:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every illustration.

Hello, and happy Sunday! This week belonged to agents. OpenAI had a “low-key” launch party for GPT-5.5 on May 5 at 5:55 p.m., a time chosen by the model itself. The following day Anthropic held its second annual Code with Claude developer conference, where the company announced three new features for its Managed Agents product, along with—more surprisingly—a partnership to use SpaceX’s Colossus supercluster.

Every was on the ground in San Francisco at Code with Claude. Taken together with the way Codex has been showing up inside Every, the conference made it easier to see that battle lines are being drawn on two fronts: desktop apps for you and a model to collaborate with in real time as you work, and long-running agents like OpenClaw or Claude Managed Agents that teams hand off work to. It matches how agents inside Every have bifurcated into ones we delegate to and ones we collaborate with, and the signal we’re seeing from frontier labs embedding employees in large enterprises.

Scroll down for a special weekend AI & I with two engineering heads at Anthropic, workflows to steal for hitting inbox zero with Codex or deciding which AI tools are worth testing, and how Every COO Brandon Gell instills curiosity in both his newborn son—and in himself. We’ve also been keeping an eye on the Elon Musk versus OpenAI trial. Discovery has surfaced plenty of gossipy, occasionally jaw-dropping text messages, but so far none of it changes much for the day-to-day user.—Kate Lee

‘AI & I’: The secrets of Claude’s platform from the team that built it

In the future, you’ll be able to accomplish a goal by just giving Claude an outcome and a budget.

That’s the direction Anthropic is building in with its new Managed Agents features, announced at this week’s Code with Claude developer event. The basic idea: Claude, wrapped in a computer in the cloud, that you can spin up, scale, and manage as needed. Anthropic is taking on the infrastructure that kills most agent products, and making sure that it scales to meet the needs of agents running 24/7.

On a special episode of AI & I recorded at Code with Claude, Dan Shipper talks with Jiang and Katelyn Lesse, head of engineering for the Claude platform, about what it takes to build an AI infrastructure platform. This is a must-watch for anyone trying to take an agent past the demo and into production. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Miss an episode? Catch up on Dan’s recent conversations with Stripe’s Emily Glassberg Sands, Every’s Brandon Gell and Willie Williams, Linear cofounder Karri Saarinen, and others, and learn how they use AI to think, create, and relate.

Knowledge base

“Inside Anthropic’s 2026 Developer Conference” by Dan Shipper, Marcus Moretti, and Katie Parrott/Chain of Thought: Dan and Cora general manager Kieran Klaassen attended Anthropic’s 2026 Code with Claude, and this piece is a report from the ground. The centerpiece is Anthropic’s new Managed Agents features, which Spiral general manager Marcus Moretti has been testing in his workflows, as well as the new “Dreaming” feature Kieran is most excited about. Read this for what Anthropic announced, what mattered, and how the tools are already being used in practice.

“I Let ChatGPT Manage My Workweek” by Katie Parrott/Working Overtime: Katie Parrott is a self-described disaster at project management, a gap she papered over for 15 years by keeping deadlines in her head and avoiding ambitious projects. As her work got more complex, that stopped being sustainable, so she built a ChatGPT agent that reads her OKRs, calendar, Notion, and Slack and tells her what to do next. Read this for the setup, the limits AI can’t fix, and the copyable prompt that powers the whole system.

“The Culture of AI Engineering” by Noah Brier/Thesis: The “software factory” metaphor is everywhere in AI engineering, but Alephic cofounder Noah Brier argues it’s the wrong one. Running a software company is less like Henry Ford’s assembly line and more like Andy Warhol’s studio: The hard problem isn’t throughput, it’s keeping everyone building the same vision. Brier adapts Stewart Brand’s pace layers framework into a five-level cultural stack to keep humans and agents aligned. Read this to understand why onboarding your agents matters as much as onboarding your engineers.

“The Dawn of Codex-native Apps” by Katie Parrott/Context Window: AI work is splitting into two modes—delegation and collaboration—and the new meta-skill is knowing which one fits the task. Read this to discover why the allocation economy thesis was only right about half the work, and what’s in the other half.

“OpenAI Flips the Script” by Laura Entis/Context Window: Three months after Dan Shipper wrote that OpenAI had catching up to do, he and head of growth Austin Tedesco have made Codex their daily driver for strategy docs, recruiting, and other kinds of knowledge work. 🎧 🖥 Listen to their episode of AI & I on Spotify or Apple Podcasts, or watch on X or YouTube.

From Every Studio

Spiral lets you start from a blank page and stop mid-stream

Spiral is one of the first products to use Claude’s new multi-agent feature in production. When you use the Spiral CLI to request multiple drafts, a Managed Agent spins up multiple Opus-class subagents to write your drafts in parallel—cutting the response time by 20-30 seconds per draft. Spiral also shipped improvements to the core app flow. You can start a session with a blank draft in addition to a new chat message. You can stop a Spiral response mid-stream if you need to add or change something from your previous message. And the guard against AI tells in Spiral output has been improved based on user input.—Marcus Moretti

Alignment

The case for optimism. The holy grail of any product is low marginal cost and high value. That is why software ate the world and why investors loved it. Biotechnology, however, is the polar opposite. A new drug costs hundreds of millions in research and development, then has to clear approval, then has to be manufactured, and out of every 100 candidates, only two or three reach the pharmacy shelf. The gross margins are fine once a drug ships, but the pipeline to get there is long and expensive.

Biotech was never going to scale the way software did. Yet R&D productivity in biotech is rising for the first time in many years, and the investors calling biotech a money pit are back at the table. There are a couple of reasons why.

We understand biology a lot better than we did even a decade ago, because we’re able to narrow the search space before we run an experiment. AlphaFold—Google DeepMind’s AI program for predicting the 3D shapes of proteins—mapped roughly 200 million of them in a year. Instead of spending years figuring out a target’s structure, researchers can now begin with that information already in front of them.

The second reason is the collapse in the cost of reading the genome. Sequencing a single human genome cost around $100 million in 2001 and now costs about $200. We can sequence at population scale, and once you’re able to do so, you can start to see which genetic variants drive disease and which are noise.

A turning point for personalized medicine. (Source: X/ErikTopol.)

We now have maps of proteins, genes, and cells that are starting to add up to a coherent picture of disease. For most of the history of medicine, we worked at the level of the organ, so we could see the disease but never its origins. Now we work at the level where disease happens—a genetic variant produces a misfolded protein, the misfolded protein disrupts a cellular pathway, and the cellular disruption is the disease.

Of course, the marginal cost of a drug will never be zero. But the marginal cost of asking what a disease is, and where to look for the answer, is collapsing. Lower R&D costs mean more breakthrough drugs, which means patients live longer and investors make money. The incentives, for once, point in the same direction.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

The Culture of AI Engineering

Noah Brier / Thesis — 2026-05-08 08:00:00 -0400

by Noah Brier

in Thesis

Sarah Jay Halliday/Every illustration.

Noah Brier cofounded Percolate in 2011 and learned the CEO’s hardest job: keeping a whole company pointed in the same direction. Now, at his AI consultancy Alephic—and in his own work, where he uses Claude Code as a second brain—he’s facing that same problem with agents in the mix. AI was supposed to make coordination easier. Instead, Noah argues, it has created new coordination problems of its own. In this piece, he pushes back on the “software factory” metaphor and offers a framework, drawn from Stewart Brand’s pace layers, for getting carbon and silicon to build the same thing.—Kate Lee

StrongDM is a software company whose three-person AI team calls their system for autonomous code generation a “Software Factory.” Entrepreneur Dan Shapiro’s widely circulated framework for AI coding culminates in “the Dark Factory,” named after a Japanese robotics plant that runs with the lights off. Factory.ai, which has raised millions from Sequoia and Khosla Ventures, has built an entire business around the metaphor—its autonomous coding agents are called Droids.

I’ve been incorporating many of StrongDM’s concepts about agentic software development into our work at Alephic, the consulting company I co-founded—but I have one fundamental disagreement: I think factory is the wrong metaphor.

If the hardest problem is making something people want, then the process of building software looks a lot more like Andy Warhol’s factory than Henry Ford’s. Both are focused on throughput, but Ford’s is focused on mechanization and stamping out identical cars with as little variance as possible. Warhol, on the other hand, was concerned with ensuring all work aligned with a single creative vision.

Ford’s factory—or more specifically, the assembly lines inside it—was designed to eliminate imperfections. Six Sigma, the quality methodology made famous by General Electric and beloved of manufacturers, is literally a measure of the defect rate. But in software, quality starts with deciding what to build. This is why product-market fit is the lingua franca of startups: If you haven’t built something the market needs, nothing else—including the quality of your code—matters.

Too much of the industry treats software as a problem to be optimized and solved. That may be true for code writing and testing, but the better metaphor is staring us in the face: It’s a software company, not a software factory.

Just as in the days before AI, the hardest problem for a business is still creating this vision and alignment around it—how to keep an entire team of humans, and now humans and agents (and humans with agents), building toward the same vision, from the system architecture down to the individual lines of code. As I learned long before agents existed, achieving this is much more akin to building a startup than assembling a car. What follows is my attempt at a framework for keeping an entire system of humans and agents building the same thing.

The alignment problem isn’t new—and AI didn’t solve it

I ran into this alignment problem years ago, when I cofounded the company Percolate, a content marketing platform, in 2011. As we grew the business from zero to 100 people in less than three years, my job as CEO shifted from building the product to building a company capable of building the product. My agents were people, and my job was to design the system they worked within. Culture, I concluded, was one of the strongest levers I had.

As Ben Horowitz put it, culture is “how your company makes decisions when you’re not there.” This was exactly what I needed: documents, tools, and rituals that helped each individual make the best possible decision without having to run every decision up the chain. I probably spent half my time on this, building a living culture document, running onboarding sessions for every new hire, and developing internal tools that automatically routed knowledge to the right people.

Every new technology promises to solve these coordination problems. But of course, nothing is that simple. What new technologies do in reality is reshape the landscape around them and, in the process, create new problems that didn’t exist before. AI is no different.

Open-source software offers an early glimpse of the kind of unexpected problems that AI can create: Whereas the primary challenge a few years ago was finding maintainers willing to contribute code on goodwill alone, today’s challenge is sifting through hundreds of crappy AI-generated pull requests flooding GitHub.

Now, 15 years later, my audience at Alephic is not just the humans who work with me. Those humans are often paired with agents, and, increasingly, the agents themselves are delivering work independently. Yet the core problem is identical.

If you’ve used a coding agent for more than a week, you’ve already experienced this: The code works, but it often feels written by someone most definitely not you—ignoring obvious abstractions and stylistic norms that are present in the codebase. It looks, in other words, like a new engineer on the team who hasn’t been properly onboarded. We write onboarding documents and do training for our human colleagues, but most people don’t do this for agents. Yet.

Pace layers of AI engineering

I still have an onboarding document and set of activities every new hire goes through during their first week, including building a module in our homegrown learning system as their first coding task (a few recent editions were GPUs, quantization, and agentic commerce protocols).

But I am also building tools that go further and ensure our code is maintainable, consistent, and built the way we’d want it built.

I think about our tooling as a kind of cultural stack, where standards inform architectures, which in turn inform specs, plans, and code. The layers are inspired by counterculture systems thinker Stewart Brand’s pace layers framework. It’s a model for how society changes at different speeds, from nature, which shifts over millennia, to fashion, which can change by the day. The lower layers move slowly; the upper ones move fast.

Stewart Brand’s Pace Layers framework offers a vision of how society works, from nature (changes over millennia) to fashion (changes daily). (Source: Stewart Brand.)

Brand argued that much of societal tension exists where the layers meet—when fashion reshapes culture (think about how social media rewired our norms about privacy) or culture becomes governance (how shifting attitudes towards marriage equality became law). Fashion, in Brand’s framing, isn’t trivial—it’s the froth layer where society experiments quickly and irresponsibly, and the occasional good idea sifts down to reshape the slower layers below. All things are ultimately reliant on the layer beneath them. Culture is subject to the laws of nature, governance to the laws of culture.

Those boundaries can and do shift, but recognizing the layers and the differing speeds at which they move is central to understanding why systems resist change, and what it takes to change them.

The “pace layers” of AI engineering help both humans and agents move in the same direction. (Credit: Noah Brier.)

Here’s how I’ve been thinking about the “pace layers” of AI engineering and how we’re building tooling at Alephic to help both humans and agents move in the same direction:

Code is fashion now. Whereas it once sat deeper in the stack, where it was slower moving and insulated by other layers, in a world of AI, code is free to produce and reproduce. The challenge is how to do it right: free of bugs at the macro level, and aligned with your own vision and best practices at the micro level. By the time we get to this layer, we have to trust that the layers beneath are strong enough to steer the system to the places we need it to go.
Plans sit beneath code. Before an agent writes anything, it should pause to survey the problem—what are the possible approaches, and what are the trade-offs? Only after completing this step should the agent pick a direction and build. Many algorithms in computer science rely on the explore-exploit shift—when you time-box a broad search phase before zeroing in on a solution to run with—and this plan phase is no different. A plan doesn’t have to be a formal document, but it must separate the thinking from the doing. Without this pause, exploration and execution get mashed together.
Specs sit beneath plans. A good plan needs a good specification. That can be a ticket (a task that needs doing), a document, or just a conversation, but it explains what we are building, why we are building it, how we know we’ve done it right, and, critically, what we are not tackling right now. That last bit is particularly important for overeager AI that wants to please by building everything you wanted and a little more. There’s a good debate in the engineering community about what constitutes a good spec. To my mind, it’s the simplest set of directives that shrink the planning space: a goal, a set of acceptance criteria, and an explicit list of out-of-scope problems.
Architecture is the theory of the system. I’ve been keeping an ARCHITECTURE.md doc in all my codebases for a while now, borrowing from computer scientist Peter Naur’s idea that the real program isn’t the code, it’s the mental model the developers carry. The document shows how the business problem maps to the codebase, so you can predict where to find the code that solves this problem. It captures the key decisions and why they were made, and lays out the rules that must always hold, such as “no database queries outside the repository layer” and “no framework imports in the business logic.” Critically, it also names what’s still an open question, so AI doesn’t silently make architectural decisions for you, taking the codebase somewhere you didn’t intend.
Standards are the foundation. Some are general principles of good software-building; others reflect our specific beliefs about how software should be built. One of the insights that drove me to start the company was when, years ago, I asked a developer I had worked with for a decade if I could have all his configuration files, the ones that encode his rules for how code should be written. When I applied this rulebook to my own work, I became a significantly better developer. His strict approach to linting, or automated rules that reject code with unused imports or superfluous definitions, meant my code wouldn’t even run unless it met his standards. Cutting corners was no longer an option. At Alephic, we enforce many of these standards with tools like tests and static analysis, which let the computer check your code automatically. But a lot of this guidance also lives in skills we distribute across the company, so people can use it in whatever harness they choose. The code-organization skill memorializes how we want team members to organize their codebases, and coding-best-practices hardcodes the stylistic and technical preferences our platform engineering team has established.
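
To make the lower layers concrete, here is a minimal sketch of how a rule like “no framework imports in the business logic” could be checked automatically. It is not Alephic’s tooling: the list of forbidden packages and the function name are assumptions for illustration, and it uses only Python’s standard `ast` module.

```python
import ast

# Assumed for illustration: which top-level packages count as "framework".
FORBIDDEN_IN_BUSINESS_LOGIC = {"flask", "django", "fastapi"}


def forbidden_imports(source, forbidden=FORBIDDEN_IN_BUSINESS_LOGIC):
    """Return sorted (line, module) pairs for imports that break the rule."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports like "from . import x".
            modules = [node.module or ""]
        else:
            continue
        for module in modules:
            if module.split(".")[0] in forbidden:
                violations.append((node.lineno, module))
    return sorted(violations)


sample = "import json\nfrom flask import request\nimport django.db\n"
print(forbidden_imports(sample))  # [(2, 'flask'), (3, 'django.db')]
```

Run as a test over the business-logic directory, a check like this fails the build whenever a person or an agent crosses the boundary, which is the point of putting the rule in a slow-moving layer.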

With AI, we can take these ideas beyond the mechanisms of cultural exchange I had in my Percolate days (like documents and meetings) and encode them into tools that every person can interact with every day.

The layers at the bottom move the slowest, so they should get updated the least frequently. For instance, I could start keeping a document in a single project as a way to give agents context on how the codebase was organized. If it works well enough, I turn it into a skill so the rest of the team can adopt the pattern across their projects. Then, I can decide that it’s a fundamental piece of how we build and, eventually, a best practice I want to enforce for the entire team.

Companies > factories

While Henry Ford may be famous for the assembly line, he’s arguably more famous for his (likely apocryphal) quip about how if he asked people what they wanted, they’d say faster horses. Assembly lines exist to serve factories, just like factories exist to serve products, and products exist to serve companies. You don’t build a factory without an idea worth building it for.

The factory is one piece in a larger organization, where layers of co-dependent systems interact and move at different speeds. The interesting problems around alignment occur at the seams, where layers rub against each other: Is this a problem that should be solved with a meeting, a document, a skill, or a test? When does something graduate from a pattern in a codebase to something that should be established in all codebases?

At first glance, AI seems to smooth over these frictions. But that’s only true if you don’t scratch below the surface. What you find there is that the same problems that plague companies plague agents: incomplete information, overeager employees trying to solve the wrong problem, not wanting to admit you don’t know. The difference is speed. As Mario Zechner, who built open-source coding agent Pi, recently observed, the mess that used to take a large organization years to accumulate now arrives in weeks with a two-person team and a fleet of agents.

That is not a reason to retreat to being obsessed with defects. It’s a reason to take the harder problem seriously: how to keep an entire system of humans, agents, and the layers between them aligned. This problem has a decidedly human shape. Civilizations have been organizing large groups of autonomous agents to do good work for a very long time. The agents were just carbon instead of silicon.

The man underneath the layers

As part of this thesis, Every chatted with Noah about how he works and what inspires him.

If there’s a chessboard out: there’s a good chance [my kids and I] will do that instead of reverting to less enriching activities like being on screens. That chess set was designed by some friends and inspired by the New York City outdoor chess scene.

All photos courtesy of Sarah Jay Halliday for Every.

To keep me from checking email during calls: I like to take notes on paper, currently with a Campus notebook and rOtring 600 pen.

Re-reading the Simple Sabotage Field Manual: a 1944 document by the precursor to the CIA, I was struck by how closely the instructions for sabotage match the realities of corporate life in America. I hired a designer and printed a few hundred beautifully bound copies, which I gave away at my conference.

A few books I’ve pulled off the shelf recently: Toyota Production System (I’m thinking a lot about how we can take inspiration from these kinds of organizing principles to align agents), The Medium Is the Message (Marshall McLuhan is a hero of mine and this comes off the shelf frequently when I just want to bump my brain a bit), and Orchestrating Ambiguity (recently recommended to me, it’s a book of books about how to design for emergence in organizations).

I really love working: before anyone else has woken up, but that also requires that I wake up before then. So mostly it’s just morning time after I get my kids on the bus.

My dog’s name: is Kaiya. She’s two and a half, and very much a mutt.

Noah Brier is the co-founder of Alephic.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Inside Anthropic’s 2026 Developer Conference

Dan Shipper, Marcus Moretti, and Katie Parrott / Chain of Thought — 2026-05-07 12:00:00 -0400

Midjourney/Every illustration.

To our surprise, the biggest launch from Anthropic’s developer conference in San Francisco yesterday wasn’t a model or a feature. Instead, it was the company’s announcement of a deal with SpaceX to allocate all of the capacity in the latter’s Colossus supercluster to Claude.

Anthropic has been riding a historic demand surge over the last year as Claude Code opened up a new wave of agentic coding for engineers and non-engineers alike. But compute constraints have caused friction even amongst its most die-hard fans—we’ve written previously about being frustrated with its OpenClaw restrictions and the speed of its latest models like Opus 4.7.

The deal with SpaceX changes that equation. Anthropic has already doubled rate limits for subscription plans, removed peak-hour limits on Pro and Max accounts, and raised API rate limits by nearly 17 times for certain tiers.

Other than that, the big story is Claude Managed Agents, Anthropic’s hosted agent product. The company released three new features:

Multi-agent orchestration: a coordinator agent, baked into the platform, that spins up subagents in parallel
Dreaming: Anthropic’s general-purpose version of compound engineering, a feature that allows agents to learn from past sessions to improve between runs
Outcomes: Anthropic’s answer to Codex’s /goals command, allowing developers to specify an outcome and run an agent in a loop until the outcome is achieved

By themselves, these features are nice but not groundbreaking. More important, the definition of an AI platform has changed. In the GPT-3 days, the platform was a text completion endpoint: Send text in, get text out. Now, with Claude Managed Agents, the platform is an AI model with a harness and host computer, all provided with unlimited scaling by the model companies.

Cora general manager Kieran Klaassen and I reported live from the conference with our biggest takeaways, including the xAI compute deal, doubled Claude usage limits, Claude Managed Agents, and why the battle lines between OpenAI and Anthropic are starting to become clearer. Watch now:

We also recorded a conversation with Angela Jiang, head of product for the Claude platform, and Katelyn Lesse, head of platform engineering. The full episode drops tomorrow on AI & I—highlights below.—Dan Shipper

Vibe Check: Claude Managed Agents

Spiral general manager Marcus Moretti uses the platform’s new features

Anthropic launched Claude Managed Agents in April, and since then, Every’s AI writing tool Spiral has used the platform to power its API and command line interface (CLI), which lets developers and other agents talk to Spiral outside the web app. Claude Managed Agents run on Anthropic’s servers, instead of us having to run them on our own.

We set up a new Managed Agent in an afternoon and deployed it to power our API the next day. We’ve incorporated two of the new features Anthropic announced yesterday (memory and multi-agent orchestration) and are deploying the third (outcomes) soon.

Memory: Every’s editorial and social expertise—how to write a good X post, for example—lives in an Anthropic-hosted global memory store. The memory store lets us avoid including every piece of editorial and social expertise in the agent system prompt—the standing instructions that tell the agent what to do every time it runs. When a user asks for a podcast description, the agent doesn’t need to also recall how to craft a great LinkedIn post. It only pulls the relevant expertise with each request, thereby making responses faster.

Each Spiral subscriber also gets their own personal memory store. When you tell Spiral that you prefer em-dashes over semicolons or that your company name is one word and not two, it will remember and apply your rules by default the next time you run it.

Multi-agent orchestration: When users request a single draft of a piece of writing, one agent using Opus 4.6 Fast handles the workflow end-to-end. For multi-draft requests, a coordinator agent using Haiku 4.5 spins up multiple Opus 4.6 Fast subagents to compose drafts in parallel. Before multi-agent orchestration, multi-draft requests were handled serially, and each draft added 20 to 30 seconds to the overall request time. A multi-agent approach also reduced our costs for multi-draft requests by about a third because we were able to use cheaper models for part of the work.

Outcomes: Anthropic’s new outcomes capability is a feedback loop where one “grader” AI checks another AI’s work against a specified goal. Spiral’s main value proposition is writing quality, so we’re using outcomes to set up a rubric to ensure the writer agent’s output meets Spiral’s editorial standards and matches the user’s style guide. The rubric the grader AI uses is generated on-the-fly based on the global standards, the user’s writing style, and their writing preferences from memory.
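The grader loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's API: `run_until_outcome`, `write_draft`, `grade`, the rubric format, and the retry cap are all hypothetical names and choices, and the toy writer and grader exist only so the loop can run without any model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Grade:
    passed: bool
    feedback: str


def run_until_outcome(
    write_draft: Callable[[str, str], str],
    grade: Callable[[str, list[str]], Grade],
    request: str,
    rubric: list[str],
    max_attempts: int = 3,
) -> tuple[str, int]:
    """Redraft until the grader passes the draft or attempts run out."""
    feedback = ""
    draft = ""
    for attempt in range(1, max_attempts + 1):
        # The writer sees the grader's feedback from the previous round.
        draft = write_draft(request, feedback)
        result = grade(draft, rubric)
        if result.passed:
            return draft, attempt
        feedback = result.feedback
    # Out of attempts: return the last draft so a human can review it.
    return draft, max_attempts


# Toy stand-ins (hypothetical): a real setup would call a writer agent
# and a grader agent here instead.
def toy_writer(request: str, feedback: str) -> str:
    return request if not feedback else f"{request} (revised: {feedback})"


def toy_grader(draft: str, rubric: list[str]) -> Grade:
    # Toy rule: a rubric item counts as met if its text appears in the draft.
    missing = [rule for rule in rubric if rule not in draft]
    if missing:
        return Grade(False, "; ".join(missing))
    return Grade(True, "")
```

The retry cap matters in practice: without it, a rubric the writer can never satisfy would loop forever, so the sketch returns the last draft once attempts run out.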

Memory and multi-agent orchestration are live in production, and outcomes is coming soon. You can see the features in action by running npm i -g @every-env/spiral-cli && spiral login or logging into Spiral and using the install command on the Agent and API keys page.
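To make the serial-versus-parallel difference in multi-draft requests concrete, here is a rough sketch using Python's `asyncio`. The `write_draft` coroutine is a hypothetical stand-in for a writer subagent (it sleeps instead of calling a model), the delays are scaled down for illustration, and none of this is the Managed Agents API.

```python
import asyncio
import time


async def write_draft(topic: str, variant: int, delay: float) -> str:
    """Hypothetical writer subagent; sleeps to simulate model latency."""
    await asyncio.sleep(delay)
    return f"Draft {variant} for {topic}"


async def serial(topic: str, count: int, delay: float) -> list[str]:
    # Each draft waits for the previous one, so latency adds up per draft.
    return [await write_draft(topic, i, delay) for i in range(1, count + 1)]


async def parallel(topic: str, count: int, delay: float) -> list[str]:
    # The coordinator fans out all drafts at once and gathers the results.
    tasks = [write_draft(topic, i, delay) for i in range(1, count + 1)]
    return list(await asyncio.gather(*tasks))


def timed(coro) -> tuple[list[str], float]:
    """Run a coroutine to completion and report wall-clock seconds."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, serial_secs = timed(serial("launch post", 3, 0.2))
    _, parallel_secs = timed(parallel("launch post", 3, 0.2))
    print(f"serial: {serial_secs:.2f}s, parallel: {parallel_secs:.2f}s")
```

In the serial version total time grows with the number of drafts, while the parallel version is bounded by the slowest single draft plus coordination overhead.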

Having set these features up in production, here’s what I think:

You are not totally locked into Anthropic’s universe. Every engineer worries that when a company offers a hosted version of something, it will be hard to leave. With Managed Agents, the agents, sessions, and memory are all stored on Anthropic machines, and the agents can only be powered by Claude: A managed agent can’t run on GPT-5.5 or Gemini.

I’ve mitigated this lock-in in two ways: First, we save agent runs to our own database in addition to Anthropic’s. This way, chats from the API appear in the web app just as web chats do, but it doubles as a safety net. If we ever wanted to leave Anthropic, we’d have all our historical data. Second, the Managed Agents platform lets you define custom tools for the agents. Those tools run on our servers, which means we can use whatever model we want inside the tools themselves. The coordinator agent is locked to Claude, but we control the layer underneath.

Using multiple agents has trade-offs. Multi-agent orchestration has allowed us to create multiple drafts faster and cheaper. However, coordination between agents adds overhead that prevents greater speed gains. Debugging also gets harder: If a Spiral draft comes back subpar, we have to investigate both the coordinator agent and the writer agent to identify the root cause. I’d recommend multi-agent orchestration only when your agent benefits from running subagents in parallel or using a mixture of models. Otherwise, a single agent works well.

Memory’s design is intuitive. Each memory is just a folder of markdown files, and each memory store is attached to a session with instructions that tell the agent when to consult it. Anthropic designed this feature thoughtfully—they kept it simple.—Marcus Moretti
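As a rough illustration of why that design is easy to reason about, the sketch below treats a memory store as a plain folder of markdown files and pulls only the files relevant to a request. The folder layout, file names, placeholder contents, and keyword-matching rule are all assumptions made for the example; nothing here describes how Managed Agents selects memories internally.

```python
from pathlib import Path
import tempfile


def load_relevant_memories(store: Path, request: str) -> dict[str, str]:
    """Return {file name: contents} for memory files named in the request."""
    words = set(request.lower().split())
    relevant = {}
    for path in sorted(store.glob("*.md")):
        # Hypothetical rule: "linkedin.md" matches requests that mention "linkedin".
        if path.stem.lower() in words:
            relevant[path.name] = path.read_text(encoding="utf-8")
    return relevant


def build_demo_store(root: Path) -> Path:
    """Create a tiny example store with placeholder guidance files."""
    store = root / "global-memory"
    store.mkdir()
    (store / "podcast.md").write_text("Placeholder podcast guidance.\n", encoding="utf-8")
    (store / "linkedin.md").write_text("Placeholder LinkedIn guidance.\n", encoding="utf-8")
    return store


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = build_demo_store(Path(tmp))
        print(load_relevant_memories(demo, "Write a podcast description"))
```

Because each memory is an ordinary file, it can be read, diffed, and version-controlled with everyday tools.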

The feature to watch: Dreaming

Cora general manager Kieran Klaassen sees his own philosophy mirrored back at him

Kieran has spent the last year trying to get agents to learn his preferences instead of forcing him to restate them every time. That’s compound engineering in a nutshell—each run leaves the system better prepared for the next one. So when Anthropic officially announced dreaming at yesterday’s Code with Claude event, he had a familiar feeling: The thing he’d been building was now a feature.

Dreaming is Anthropic’s name for a background process that reviews an agent’s past sessions and memory stores, finds patterns, and rewrites memory so the agent improves between runs. OpenClaw introduced a similar feature in April, but Anthropic’s take seems more focused on what teams of agents learn collectively than what a single agent remembers. The system learns from repeated corrections, recurring mistakes, and workflows that run well—creating, over time, an institutional knowledge base.

The feature currently lives inside Claude Managed Agents as a research preview, which is where Marcus has been testing it—with early success. Every plans to have its production agents dream as soon as the feature ships in a stable public release. But Kieran’s immediate question was: When is this coming to Claude Code?

Claude Code, after all, is where developers spend their days teaching agents the same repo quirks, the same testing rituals, the same “please don’t do it that way” preferences. Those preferences can go into memory files, but memory files get messy. They collect duplicates, stale rules, one-off notes, and contradictions—and as Marcus notes, memory introduces overhead, so you trade speed for quality every time you use it.

A dream cleans that up. It takes up to 100 past sessions and produces a reorganized memory store with duplicates merged, contradicted entries replaced, and new insights pulled out—memory that organizes itself, in Marcus’s framing. If Anthropic brings that loop to Claude Code, memory starts to look less like a notes folder and more like accumulated taste.—Katie Parrott
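One way to picture that cleanup step is as a pass over dated memory entries that merges duplicates and lets a newer rule on the same topic replace an older one. This is only a toy sketch under those assumptions: the `MemoryEntry` shape and the "latest entry per topic wins" rule are invented for illustration and say nothing about how Anthropic's dreaming works internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    topic: str    # hypothetical key, e.g. "punctuation"
    rule: str     # the remembered preference
    session: int  # higher means more recent


def consolidate(entries: list[MemoryEntry]) -> list[MemoryEntry]:
    """Keep one entry per topic: the rule from the most recent session."""
    latest: dict[str, MemoryEntry] = {}
    for entry in sorted(entries, key=lambda e: e.session):
        # Later sessions overwrite earlier ones, so duplicates collapse
        # and contradicted rules are replaced.
        latest[entry.topic] = entry
    return sorted(latest.values(), key=lambda e: e.topic)


notes = [
    MemoryEntry("punctuation", "Prefer semicolons", 1),
    MemoryEntry("company name", "Write it as one word", 2),
    MemoryEntry("company name", "Write it as one word", 3),
    MemoryEntry("punctuation", "Prefer em-dashes", 4),
]
cleaned = consolidate(notes)
```

Here four raw notes collapse into two: the repeated company-name rule is kept once, and the newer punctuation preference replaces the older one.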

Inside Anthropic

What the company’s platform team told us off-stage

While at the conference, Dan sat down with Angela Jiang, Anthropic’s head of product for the Claude platform, and Katelyn Lesse, head of platform engineering, for a recorded conversation. Three things that stood out:

The generic harness is dead. Angela told us that building a generalized harness that lets you switch any underlying model for a different one—standard practice even a few months ago—is a losing strategy. Different harnesses paired with the same model produce “drastically different” results on Anthropic’s own evaluations. When the team built memory for Managed Agents, they tested multiple harness designs, and the performance gaps were large enough to make model selection feel secondary.

Our own experience backs this up: Our agents run on Claude with a harness tuned specifically for how Claude works. If we don’t want to risk getting locked in, we have to—as Marcus writes above—build the harness in a way that lets us swap in GPT or Gemini. But Angela’s argument is that the bigger risk is leaving performance on the table.

Infrastructure is the real wall. Katelyn told us that most people building agents expect the hard part to be the prompting, context window management, and tool setup required to get the most out of the model. In practice, everyone hits the same wall: infrastructure. They have to keep servers running, maintain secure sandboxes, prevent connection drops, and store transcripts. Before Marcus set up Managed Agents in an afternoon and deployed it the next day, we spent months on exactly that kind of plumbing.

Your agent needs a babysitter. Dan raised this problem directly: Agents get stale fast, running old models and old prompts with nobody responsible for updating them. Our solution so far has been to assign every agent an owner to keep an eye on it. Katelyn said the Anthropic team has built skills to help agents upgrade themselves to new models. “The most AGI-pilled people,” she added, “are running agents that monitor their agents.”

The full episode with Angela and Katelyn drops tomorrow on AI & I—we go deeper on where the platform is headed, what “outcome + budget” means as a design philosophy, and why Anthropic thinks Claude should eventually pick its own sub-agents.—KP

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Collaborate with agents on documents with Proof.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

For sponsorship opportunities, reach out to sponsorships@every.to.

OpenAI Flips the Script

Laura Entis / Context Window — 2026-05-06 08:00:00 -0400

Midjourney/Every illustration.

There’s no resting on your laurels in the AI race: OpenAI’s Codex went from trailing Anthropic’s Claude Code to pulling ahead in functionality, at least for now, in a matter of months. Today, Every CEO Dan Shipper explains why OpenAI’s coding app has become his daily driver for work, head of growth Austin Tedesco shares his no-nonsense advice for switching over from Claude Code, and Spiral general manager Marcus Moretti argues it’s OK—good, even—to let some AI trends pass you by.

‘AI & I’: Why we switched from Claude Code to Codex

Codex takes the lead

If you’re looking for evidence of AI’s unrelenting pace, here it is: In January, Dan wrote that whoever wins vibe coding wins how you work on your computer—and that OpenAI had some serious catching up to do.

Three months and the release of OpenAI’s latest model later, Codex is there, and in a new episode of AI & I, Dan and Austin get into why they do much of their knowledge work in Codex now. They cite the power of GPT-5.5, paired with a desktop app that is faster and more powerful than Claude Desktop or Cowork.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are a couple of Dan and Austin’s favorite current use cases for Codex:

Austin uses Codex for strategy docs. Austin needed to write a go-to-market plan for a new Every product but kept getting pulled away by other work. So he pointed Codex at the team’s Notion meeting notes, Slack threads, and his preferred template and told it to pull together content where they’d discussed strategy and transform it into an action plan. What came back was 80 to 90 percent of the way there.
Dan uses Codex for recruiting. When he is recruiting people to work at Every, Dan starts with a sense of where strong candidates might have learned the skills Every needs, instead of looking for a specific job title. He then asks Codex to find people who match that career arc—for example, to find someone to help scale Every’s courses, he looked for candidates who had worked at education startup General Assembly before transitioning into AI.

Migration anxiety

Claude Code-to-Codex

If you want to switch to Codex or any other coding app, how should you think about migrating? When your setup includes app-specific project folders, skills, plugins, or integrations, it can be daunting.

Austin’s migration from Claude Code to Codex was disarmingly simple: He opened his Every work project in Codex, told it he typically worked in Claude Code, asked it to inspect the folder, and told it to update anything that should work differently in Codex.

When Codex got something wrong, he handled it in the moment and told it, “This doesn’t look great. Can you fix it?” And it did.

Before GPT-5.5, staff writer Katie Parrott hadn’t used ChatGPT for writing in almost a year.

Now, she splits her writing sessions between Claude Code and Codex. She moved over by giving Codex the writing and editing skills she had already saved as Markdown files on her computer and asking it to adapt them for its own environment.

Steal this workflow

Join the early majority

Spiral general manager Marcus is OK with letting most AI hype—managing a swarm of OpenClaws each running on its own Mac Mini, for example—pass him by. Earlier in his career, he was an early adopter of new tools and technology trends, but these days, he finds himself closer to the early majority section of the adoption curve. As the one-man team behind Every’s AI writing product, he has a lot to do—if he’s going to add something new to his workflow, it has to clear a high bar.

Marcus is comfortable being among the 34 percent of the population who are slightly early in adopting a new technology. (Image, which is based on Everett Rogers’ Diffusion of Innovations framework, courtesy of Laura Entis.)

Here’s Marcus’s strategy for determining what’s worth testing.

Start with a real problem. A useful filter is to focus only on tools or services that solve an existing issue. For example, Marcus decided to test out Stripe’s token-based billing feature—which allows you to measure how much users cost you in tokens—because of a genuine challenge he was facing: Spiral needed a better way to track AI usage costs across models.
Don’t fall for productivity theater. Marcus ignores demos that brag about how many machines or agents someone is running simultaneously. He doesn’t care about what the setup looks like; what matters is whether it will make his life better.
Sit back and see what pans out. Marcus generally waits to try a product until there’s evidence that companies he respects are using it in production, even if that evidence is just the logos on a tool’s homepage showing which brands use it. Even better if the product is from a company he already knows and trusts, like Stripe or Anthropic. With the Stripe token-based billing example, the calculus was simple: “Great company solving a real problem I have—I’ll try it,” he says.

Test it out for yourself:

Pick one AI tool you feel vaguely guilty for not trying and write one sentence: “Before this tool, I _____. After this tool, I can _____.” If you cannot fill in both blanks, let yourself off the hook.

Alignment

Every’s COO Brandon Gell on cultivating curiosity in an AI world

My son was born eight months ago. Since then, I’ve asked myself regularly: How can I teach him to lead a fulfilling life, especially when it comes to technology?

I’m a computer native, born in 1994, the year Netscape was first released. My son was born in 2025, the year Claude Code was invented. The world I grew up in rewarded people with the fortitude to find answers. The world he’s growing up in has made that table stakes. So if the answers aren’t scarce anymore, what is?

Curiosity. Knowing what to ask next—having the instinct to push further, to connect unexpected dots, to wonder about something nobody else paid attention to—is what’s scarce.

It’s also distinctly human. It causes us to make connections between unrelated ideas and connect dots that don’t follow obvious patterns. It brings our personal values and lived experiences into what we explore, shaping not only what we discover but why it matters. It pulls us toward questions we find fascinating—not because they’re useful, but because we can’t stop wondering.

AI can’t replicate that. Curiosity requires perspective and taste, things that are difficult to instill in a model. And even if you could, it would never be as diverse as the perspectives of 8 billion humans, each one shaped by a different life.

I want my son to be insatiably curious, and I’ve realized that to instill that in him, I need to cultivate it in myself. Which means developing it and maintaining it, like a muscle. Here’s what that looks like:

Lesson 1: Use AI to go deeper on something you already care about

After I sold my insurance company, Clyde, I realized how disconnected I had become from my creativity outside of work. The same curiosity that drove me to explore the idea that had become my company had gone dormant as I focused singularly on its success. I realized just how lost I was while driving and listening to music. I could hear the music, but I could no longer feel it.

Not long after this drive, my friend Mike showed me some speakers he had built. I realized in order to truly hear the music, to find my curiosity, I had to build a pair of speakers and a subwoofer. The project would combine my interest in architecture, experience with woodworking, and total lack of knowledge in audio engineering.

Next thing I knew, I was hours deep into a ChatGPT conversation about sound waves and acoustic design, learning how to build them.

Lesson 2: Use AI to build something you wouldn’t otherwise make

For the past 15 years, I’ve tried lucid dreaming on and off. So when I saw the Dream Recorder GitHub repository, an open-source project that uses video AI models to visualize your dreams as cinematic reels on a bedside device, I knew I wanted to make one for myself. The problem? I’d never built any hardware, didn’t have a 3D printer, and calling myself a front-end developer would be generous. So I used AI to help me adapt the open-source repository and build something I’d never otherwise be able to make. I bought a 3D printer, improved the original code, and spent many long nights perfecting my dream recorder.

I still don’t know how to code. But that doesn’t matter. In both situations, I used AI to leapfrog the unknown and explore my curiosity and my dreams. AI was a learning partner, not an answering machine. It taught me the things I didn’t know, and I combined that with the skills I already had to build something new.

What this means for all of us

In a world where the “right” answer is one AI prompt away, we need to stop rewarding our kids and our students for getting the answer right and start rewarding them for the quality of their questions, the depth of their curiosity, and their resilience to ask the next question when in uncharted territory. Curiosity is what separates the people who use AI as a crutch from the people who use it as a rocket.

In a world where there’s always an answer, let the next question be your guide.—Brandon Gell

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn. For sponsorship opportunities, reach out to sponsorships@every.to.

Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.

The Dawn of Codex-native Apps

Katie Parrott / Context Window — 2026-05-05 07:00:00 -0400

Midjourney/Every illustration.

Inside Every

Working with AI right now often means making the same judgment call dozens of times a day: Hand this task off to an agent or stay close to the process? “The landscape of working with AI is bifurcating,” is how CEO Dan Shipper put it in Every’s Monday standup. On one side is the agent you delegate to. On the other is the agent that sits beside you while you write, code, triage, revise, and decide.

Watching the Every team work, you can’t unsee it. Dan delegates bug reports for our collaborative document editor, Proof, to his OpenClaw agent, R2-C2. But he stays close to his inbox through a combination of Codex, Every’s AI email assistant Cora, and a document with custom rules (steal his workflow below). Kieran Klaassen hands the middle of his compound engineering workflow to the model but works closely with it to brainstorm at the beginning and polish at the end. I (Katie Parrott) send the model off to do research, but I’d never trust it to execute a full draft without my hands firmly on the wheel.

Which means the allocation economy thesis was only right about half the work. That half still wants delegation, but the other half wants you to stay close, pairing on every move with the model in the same window. The two halves demand different skills, and the meta-skill is knowing which is which.

Think of it as the AI version of the serenity prayer: Grant me the serenity to delegate the work I can, the expertise to sit with the model on the work I can’t, and the wisdom to know the difference.

Steal this workflow

Get to inbox zero with Codex

The perfect email workflow is the white whale productivity people have chased for a decade, Dan included. His latest AI-native version puts the agent in the inbox and the human in a shared document, where every draft and decision stays visible. Here’s how he does it:

1. Write a one-page operating manual for your inbox. The document, which Dan keeps in Proof, names his VIPs, describes what to auto-archive, summarize, or draft, and explains how to handle scheduling.

2. Open your agent-native email tool in Codex. In Codex’s browser pane, Dan loads Cora, which gives the agent two ways to act: command line instructions to archive threads and the ability to click through the inbox like a person.

3. Work from a document instead of your email. Dan has Codex create a separate Proof document for each inbox run. Codex sweeps the inbox, archives what the operating manual says to archive, and adds every draft or decision to the bottom of the document. Dan replies inline: “Spam,” “archive,” “reply just to Willie asking what he wants to do here,” “send the invite, draft a reply to Tony.” Codex picks up each instruction, drafts in Cora while Dan moves on to the next message, and waits for approval before sending.

Try it this week: Write a one-page “how to do my email” document with your own VIPs, auto-archive rules, scheduling preferences, and reply style. Then open Codex, load your email client in its browser pane, and paste in your instruction document and this prompt:

“Sweep my inbox using this operating manual. Put every draft and decision in this doc and wait for me before sending anything.”

Dan’s email workflow as set up in Codex: chat on the left, web browser with Cora on the right. In this version, Dan has also vibe coded a one-page interface that plugs into Cora’s CLI. (Image courtesy of Dan Shipper.)

New job alert

If the new meta-skill is knowing when to delegate and when to stay close, here it is in job-description form: Airtable is hiring an AI Agent Architect, Customer Experience.

Support software used to route tickets and surface help center articles. Now it can read context, act across tools, and decide what to do. Which means someone has to design the boundary around support agents—what knowledge they retrieve, which APIs they can use, when they can modify an account, how failures get measured, and where the agent hands the work back to a person.

Tool for thought

Musk’s five rules of automation, except for agents

In 2021, Elon Musk introduced his “algorithm,” a five-step rubric he uses at Tesla and SpaceX to figure out what a process needs before trying to make it faster or handing off any part of it to a machine. Willie Williams, Every’s head of platform, has been exploring how it might apply to agent workflows:

Question every requirement. Every rule, checkpoint, and instruction in a workflow has to justify itself by naming the specific thing that goes wrong without it. If nobody can answer that, it shouldn’t be there.
Delete what you can. Cut steps, approvals, reviews, and agents that don’t survive step one. If you’re not occasionally removing something you later need to restore, you haven’t cut enough.
Simplify and clarify. Break the remaining work into smaller, clearer pieces. Each task should have a single owner, a defined output, and only the information and tools it actually needs.
Accelerate feedback loops. Shorten the time between handing work to an agent and knowing whether it succeeded. Surface errors early, run independent tasks at the same time, and stop making the workflow wait on unneeded approvals.
Automate last. Start with a checkpoint at every step. Only after a workflow is necessary, lean, and fast should you take the humans out of the loop.

Still, Musk’s algorithm was intended for factories building hardware: electric cars, rockets, and satellites. Its rules don’t directly translate to AI agents. “These rules should apply to the world of software automation,” says Willie, “but we don’t actually have them yet. And we have to work on finding them.”

Model card

ChatGPT/Every illustration.

Signal

The hard part isn’t the model

The bifurcation Dan named in Monday’s standup—delegate to the agent, or sit beside it—is the same problem for which frontier labs are now selling enterprise solutions.

OpenAI made it explicit last month with its new Frontier Alliance initiative pairing OpenAI engineers with large enterprises to deploy agents inside their workflows. “The limiting factor for seeing value from AI in enterprises isn’t model intelligence,” writes OpenAI. “It’s how agents are built and run in their organizations.”

Then this week, Anthropic announced a parallel move—a new services firm with Blackstone, private equity firm Hellman & Friedman, and Goldman Sachs to help companies “design, build, and maintain” Claude deployments.

Both labs are saying the quiet part out loud: The hard part of deploying and working with agents is everything around the models themselves—the context, permissions, handoffs, evaluations, and human relationships that decide whether a model should run ahead or sit beside you. Dan’s inbox workflow and Airtable’s support-agent job are microcosms of the same problem, now landing on the enterprise balance sheet. (Every’s consulting practice also helps companies implement AI workflows and products.)

What to do this week:

Write down how you want the work done before you prompt. What OpenAI and Anthropic are charging Fortune 500s millions for is the document Dan wrote himself in an afternoon: who counts as a VIP, what to auto-archive, when to escalate. Start there.
Split your tasks into “hand off” versus “stay close.” Bug triage can run on its own. Important email drafts need you in the loop. Sort before you delegate.
Keep the agent’s actions visible. Drafts in a shared document, tracked changes, an action log—whatever the form, you need a record. If you can’t audit the agent’s work and revert it if needed, you aren’t the one driving.
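For the last point, here is a minimal sketch of what an auditable, revertible record could look like: an append-only log where every agent action stores enough to undo it. The `ActionLog` class, its fields, and the thread keys are hypothetical and not part of Codex, Cora, or Proof.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    description: str
    key: str
    before: Optional[str]  # value prior to the change (None if absent)
    after: Optional[str]   # value after the change (None means deleted)


@dataclass
class ActionLog:
    state: dict[str, str] = field(default_factory=dict)
    history: list[Action] = field(default_factory=list)

    def apply(self, description: str, key: str, value: Optional[str]) -> None:
        """Record the old value, then make the change (None deletes the key)."""
        self.history.append(Action(description, key, self.state.get(key), value))
        self._set(key, value)

    def revert_last(self) -> Action:
        """Undo the most recent action and return it for review."""
        action = self.history.pop()
        self._set(action.key, action.before)
        return action

    def _set(self, key: str, value: Optional[str]) -> None:
        if value is None:
            self.state.pop(key, None)
        else:
            self.state[key] = value


log = ActionLog()
log.apply("archive newsletter", "thread-1", "archived")
log.apply("draft reply", "thread-2", "drafted")
undone = log.revert_last()  # removes the draft and returns the undone action
```

Storing the previous value alongside each change is what makes the revert possible; a log that only records what the agent did can be audited but not undone.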

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

I Let ChatGPT Manage My Workweek

Katie Parrott / Working Overtime — 2026-05-04 11:00:00 -0400

Midjourney/Every illustration.

I sat down to write my second-quarter goals at 4:30 p.m. on a Tuesday in early April. It was the day after I was supposed to turn them in when I decided to be an adult and survey the damage from the first quarter. And I do mean damage. I’d written only half of the columns I’d committed to. Another project I had promised hadn’t even gotten off the ground.

I could give the usual excuses—the quarter was busy, the project hit walls outside my control—but the real culprit was obvious: I may be a great writer, but I am garbage at project management.

For 15 years, I handled this weakness by tiptoeing around it. I didn’t take on managerial roles that would have required more organizational skills. I didn’t take on so much freelance work that I couldn’t keep the deadlines in my head. I passed on ambitious projects—too many moving parts.

This duct-taped approach worked until I decided to join Every full-time in April. If I were going to take on more responsibility as a full member of the team, I needed to get serious about project management. Which, in 2026, meant I needed to bring in AI.

So I built myself a project manager: a ChatGPT agent that holds my OKRs—objectives and key results, the goals that define a successful quarter—watches my calendar, reads my Notion to-do list, and helps me decide what to do next. Otherwise, I’d spend my day opening Slack, refreshing X, panicking lightly, repeat.

My ChatGPT project management agent helpfully points me toward where to put my focus for a day. (All images courtesy of Katie Parrott.)

Most AI-at-work advice starts with the part of your job you’re already good at: Write faster, code faster, analyze faster, ship more. I’m interested in the other side of the equation: using AI to support the part of work that makes it hard to believe you’re good at your job.

I’ve set up project management both with my Plus One agent, Margot, and with a ChatGPT agent. I’m featuring the ChatGPT agent here, but you can create your own project manager with any system that gives you a combination of memory, context, and intelligence—more on that below.

Why AI can babysit my to-do list now

I’d tried using ChatGPT as a project manager before, during a freelance month last year when I’d overbooked myself and had deadlines staring me down like unread letters from the IRS. I would open a new chat and type some version of: “I have this deadline, this deadline, and this deadline; this meeting, this meeting, and this meeting. What should I do?”

For one-off triage, it worked well enough. The problem was the context that it had about me—or didn’t. Every time I came back, I had to explain everything again: the clients, the deadlines, the pieces in flight, the meetings, the priorities, the fact that one project was more important than another for reasons that were obvious to me and invisible to the chat window.

A glimpse of my ChatGPT project management system, manually informing the AI of my deadlines day by day.

Then, over the past six months, several things converged to make more comprehensive project management using ChatGPT possible.

First, memory improved enough that the system could carry context and apply it across conversations. Next came advanced tool use, which enabled AI to navigate and use browsers and other tools. Integrations meant that ChatGPT could finally do things like open my Notion, check my calendar, and read my Slack. Finally, products like OpenClaw and Every’s Plus One wrapped all this firepower in a package that even I, a technical neophyte, can work with.

If you tried to do something with AI a year ago—like manage a marketing workflow or run an analysis of financial results—and it didn’t take, try again. Chances are that the model and the product around it have shifted in ways that move the finish line in your favor. It was time for me to take another swing at AI-native project management.

What I built: A project management agent

Saying “I built an agent” makes the whole thing sound more sophisticated than it is. The truth is that AI did most of the work—I just put the right information in places AI could see it, connected the tools and software where my work happens, and described the job I wanted done.

Context to shape the agent’s memory

With context, the agent can turn a vague goal into Thursday’s first task. Without it, it’s just a Magic 8 Ball for to-do lists.

So, as I was going through the setup for my agent (which you can do directly through the chat interface), I made sure to provide plenty of documentation for the agent-builder to build on. Most importantly, I gave it a link to a Proof document with my OKRs: four objectives, a dozen-ish key results, and a rough stack-ranking of projects. Then I asked it to do the first piece of project management I am worst at: I asked it to turn “a successful quarter” into concrete phases, milestones, deadlines, and tasks.

The agent broke my OKRs down into a week-by-week action plan, then converted that into tasks for my Notion to-do list.

“Stand up a reliable Vibe Check pipeline” is a concrete goal, but not something you can do on a Thursday afternoon. The agent broke it into smaller pieces: Audit the existing process, draft a brief outlining suggested changes, solicit feedback, and implement the changes.

The first useful thing the agent gave me was a draft to respond to. Some of the tasks were so abstract I couldn’t tell where to start, and others were so chunky they were really projects in disguise. So I went back and forth with the agent to set a few parameters—mostly telling it, “This is too confusing for me to act on”—and it split, renamed, and rewrote the items until the plan had been divided into projects and tasks that were doable.

Then the tasks went into Notion, where they became a board with deadlines, statuses, and linked OKRs.

Integrations give the AI places to act

The next step was adding integrations so that the agent could track my work across tools.

ChatGPT agents make this almost embarrassingly easy now. In a few clicks, I connected the agent to the places where my work already lives: Notion, Slack, Google Drive, and Calendar.

The dashboard for my project manager agent, complete with integrated apps, context files, and memory.

This is the part that would not have worked a year ago. Back then, ChatGPT only knew what I remembered to paste into the chat box—it couldn’t take action on my behalf. Now the agent can read the systems I already use. It can see on my calendar that Thursday morning is open, that a discussion on a Slack thread created a new task for me to do, that an article draft exists somewhere in Drive, and that a project belongs to an OKR and isn’t just a guilty little cloud floating around on Notion.

Instructions tell the agent what to do

Context tells the agent what matters. Integrations tell it where to look. Instructions tell it what to do. I had to write fewer of them than I expected.

I opened the ChatGPT agent builder, which you can find in the left-hand sidebar of the ChatGPT web app. Then I explained, in plain English, what I wanted: a project-management agent that would help me organize each week and keep my quarterly objectives on track. The builder turned that into a fuller brief with its role, workflows, and instructions on how to deliver responses, where to store information for future reference, and what NOT to do (for example, invent a status or deadline).

The beginning of the instructions that power my project management agent.

Ultimately, the instructions I care about boil down to this: Help me organize the week, keep the quarterly objectives on track, and do the useful work first instead of requiring so much input from me that I might as well have gone in and looked at all the inputs myself.

I can’t automate the ‘me’ of it all

I may be offloading a type of work that I hate and am bad at, but I’m also learning new skills—or relearning them for the agentic era. Mostly, these lessons emerge through failure.

Oftentimes, the failure is one of communication. It took time to get in the habit of keeping my agent up-to-date on the details it can’t see. An article would be published, and I’d forget to tell the agent or move the card in Notion that corresponded to it. Deadlines moved while Notion stayed stuck on the old date, and the agent became about as useful as my dog when I tell her to go get a toy from upstairs.

My Notion to-do list functions as the source of truth for me and the agent about the status of projects. If it’s not up-to-date, the whole system falls apart.

I have to tell the agent when a draft is in review or is published, a deadline changes, or a new task appears in a meeting. Updating a Notion page is annoying. But annoying is better than carrying the whole quarter in my head.

Another wrinkle is the “me” problem. The agent can’t change my personality. It can’t make me less anxious or more confident in my ideas. So, for example, I’ve been sitting on a proposal for my biggest Q2 project for a week because I can’t convince myself it’s good enough to send. The agent knows this. It reminds me that it’s overdue every day. And I keep avoiding it. The agent can draft the email and flag the delay, but it can’t tell me if the idea is good. That part—deciding to believe in the thing you made—is still mine. AI, it turns out, is no match for my neuroticism.

Knowing while there’s still time

Near the end of every week, I ask the agent for the thing I used to dread the most: a status report. It reviews the work that was supposed to get done, what moved, what slipped, and which goals are starting to look further from reach. Sometimes the answer is satisfying. Sometimes it is rude in the way accurate things are rude.

One day recently, I asked it for a report on my OKR progress: One project had momentum but needed a cleaner path to delivery; another looked healthy, but only if I had artifacts to show for it that the agent couldn’t see; my publishing cadence was fine, but would be better if I set up the idea backlog the agent and I had talked about.

The agent’s take on the status of my three active OKRs. There’s nothing on fire, but it gives me a sense of where to put my focus in the next few weeks.

This is the kind of thing a competent project manager would probably notice in a 20-minute check-in. Which is exactly what I want from the agent: making the obvious visible before it becomes a delay that turns into a problem that snowballs into a failed objective or, worse, a disappointed teammate.

For most of my career, deadlines and prioritization felt like weather systems: suddenly overhead, occasionally catastrophic, mostly outside my control. Now I can see the front forming in time to take action.

If AI has only been helping you with the part of work you already do well, try pointing it at the part you have been avoiding. If the promise of AI is that it frees up humans to do what only humans can do, that should include freeing us from things we hate to do. Otherwise, what’s the point?

I am still bad at project management. The part of work that makes me feel like I am faking adulthood still exists. But I have support for that now, so the writing gets the hours it deserves.

Build your own project manager

If you want to set up your own project-management agent, here’s what I’d gather before you open the agent builder.

1. Context: The documents to feed it

Think of this as the agent’s onboarding material. The more it can read about your priorities, the less you’ll have to repeat in chat.

OKRs or quarterly goals. The single most important file. If you don’t have written OKRs, write a one-page version of what a successful quarter looks like—your objectives, the rough metrics that prove them, and any projects you’ve already committed to.
Strategy or planning docs. Anything that explains the why behind the work: team strategy memos, annual plans, project briefs, and kickoff documents.
Workstream documentation. Standing responsibilities you want the agent to know about, such as your editorial calendar, cadence of the content you publish, and recurring meetings.
A stack-rank of your goals. Which OKR matters most? Which project is the one you’d protect if everything else slipped? Write this down.

2. Integrations: Connect the tools where you work

Connect the systems where the work actually lives.

A task manager. Notion, Todoist, Asana, Linear, or whatever you already use. This becomes the source of truth for the status of your work. If you don’t have one, set one up before you build the agent.
Your calendar. Google or Outlook. The agent needs to see where your time is spent versus where you said it would be spent.
Slack or your team chat. This allows the agent to pick up tasks that get assigned in conversation and never make it into your task manager.
Cloud drive. Google Drive, Dropbox, OneDrive, or wherever your drafts and working documents live.

3. The prompt

Here’s the brief I gave my agent builder. Keep the structure and adapt the specifics to your work.

Project manager agent prompt

You are my project manager. Your job is to help me organize each week and keep my quarterly objectives on track.
You have access to my OKRs, my Notion to-do list, my calendar, my Slack, and my Drive. Treat my OKR document as the source of truth for what matters this quarter, and treat Notion as the source of truth for project status.
Each Monday, give me a one-page plan for the week: what's due, what's at risk, and what I should focus on first, based on which OKR each task ladders up to. Each Friday, give me a status report: what got done, what slipped, and which goals are starting to look further from reach.
When I ask, "What should I work on now?", check my calendar for available time and my Notion board for open tasks, then recommend one thing—not five.
Don't invent statuses, deadlines, or tasks. If a date isn't in Notion, say so. If a task is ambiguous, ask me one clarifying question rather than guessing.
Protect my stated priorities from my daily impulses. If I ask for help with something that isn't on the OKR list, flag it before you help.
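
The "recommend one thing, not five" instruction can be pictured as a small selection rule. This is not how ChatGPT decides; it is a hypothetical sketch in which `Task`, `okr_rank`, `estimate_minutes`, and `pick_next_task` are made-up names, and the calendar and Notion data are hard-coded rather than fetched.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Task:
    title: str
    okr_rank: int          # 1 = most important OKR in the stack-rank
    due: Optional[date]    # None when no date is recorded; never guessed
    estimate_minutes: int
    status: str = "open"


def pick_next_task(tasks: list[Task], free_minutes: int) -> Optional[Task]:
    """Return the single open task that fits the free time, or None.

    Tasks tied to higher-ranked OKRs win; earlier due dates break ties, and
    tasks without a due date sort last within the same OKR rank.
    """
    candidates = [
        t for t in tasks
        if t.status == "open" and t.estimate_minutes <= free_minutes
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda t: (t.okr_rank, t.due is None, t.due or date.max))


# Made-up board for illustration only.
tasks = [
    Task("Audit the existing process", okr_rank=1, due=date(2026, 5, 8), estimate_minutes=90),
    Task("Draft column outline", okr_rank=2, due=date(2026, 5, 6), estimate_minutes=45),
    Task("Set up idea backlog", okr_rank=2, due=None, estimate_minutes=30),
]

choice = pick_next_task(tasks, free_minutes=60)
print(choice.title if choice else "Nothing fits the open time")
```

With 60 free minutes, the 90-minute audit is skipped even though it belongs to the top-ranked OKR, and the sketch returns the column outline. A missing due date stays `None` rather than being filled in, mirroring the "don't invent deadlines" rule in the prompt.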

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

Codex Goes to Work

Every Staff / Context Window — 2026-05-03 00:00:00 -0400

Midjourney/Every illustration.

Hello, and happy Sunday!

Knowledge base

“A Guide to Agent-native Product Management” by Marcus Moretti/Guides: Marcus Moretti runs Spiral as a one-person team. This guide walks through the two new compound engineering skills that make it possible: /ce:strategy, which interviews you to produce a strategy document, and /ce:product-pulse, which replaces your analytics tools with a founder-style analyst briefing that saves to a folder as your product’s running memory. Read this to set up both commands for your own product and understand how they plug into the broader plan-ship-review loop. Plus: The one thing Marcus still writes himself is the roadmap. Read the accompanying essay for his full workflow, plus his two-part test for which SaaS products will survive the agent era.

“You Are the Most Expensive Model” by Mike Taylor/Also True for Humans: Most teams are routing entire workflows through frontier models when cheaper, faster alternatives would do the job just as well. The real cost isn’t the tokens—it’s your attention. Mike Taylor introduces incremental determinism: a four-level framework for deciding which tasks deserve Opus and which can be handed to Haiku, a script, or no model at all. Read this to know exactly which lever to pull when your AI costs start to add up.

“One App to Rule All Knowledge Work” by Katie Parrott/Context Window: Austin Tedesco now runs 80 percent of his daily workflow through Codex, a tool he called “trash” for non-engineers just months ago. Plus: why Austin reviews every agent output in its destination app, a prompt for letting agents design their own automations, and how to use Every’s compound knowledge plugin to catch confidently wrong data before a plan gets enacted.

“Compute Is the New Cash” by Laura Entis/Context Window: On AI & I, Emily Glassberg Sands, head of data and AI at Stripe, talks to Dan Shipper about how agents are becoming economic participants—and why fraud is now a full-funnel problem, not just a checkout one. Plus: GitHub and Anthropic are both moving to usage-based pricing as flat-rate subscriptions break down under agentic workloads; Dan and Kieran Klaassen offer contrasting takes on whether you should talk to your agents or just let them work; and Naveen Naidu’s three-step workflow for turning post-launch customer feedback into a product queue. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on X or YouTube.

“Who Isn’t Using GPT 5.5” by Laura Entis/Context Window: One week after GPT-5.5’s release, the Every team checks in: Kieran is now splitting his time evenly between Codex and Claude Code, but Natalia Quintero ran a head-to-head proposal test and her Claude agent won. Plus: why six unicorn CTOs have stepped down to become Anthropic ICs; how Kieran hit 24 pull requests in a single day by having agents watch user complaint videos overnight; and Willie Williams on why AI has turned coding into a slot machine—and how to know when to walk away.

Log on

Last week’s camp

Codex for Knowledge Work Camp: Dan and Austin showed how to use OpenAI’s Codex for drafting, research, summarizing, running tasks in parallel, and building small tools to automate routine knowledge work. Watch the recording.

Recordings you may have missed

Compound Engineering Camp: Cora general manager Kieran Klaassen and product leader Trevin Chow walked through what’s new, went deeper on the brainstorm and ideate steps, and shared examples of using the compound engineering plugin in product-focused workflows. Watch the recording.

From Every Studio

Spiral lets you browse and restore old draft versions

Spiral added version history—you can now see how a draft evolved and roll back to an earlier version with one click. It also shipped two lightweight API endpoints for quick rewrites and made the onboarding flow noticeably smoother.

Cora’s inbox has stars, voice dictation, and a smoother compose box

Cora’s inbox got a round of usability upgrades: a starred view for important threads, typed snooze durations, voice dictation, and a smoother compose experience. The app is also faster behind the scenes. Kieran is looking for a small group of alpha testers to help pressure-test the full inbox—if you’re interested, reach out to him at kieran@every.to.

Monologue hands off recordings from Apple Watch to iPhone

Audio recorded in Monologue on Apple Watch now syncs across your other Apple devices. The Mac app also got better at meetings, with auto-stop when a meeting ends, more control over which apps trigger recording, and Webex joining Zoom and Teams as a supported platform.

Alignment

Downstream of speed. The Food and Drug Administration announced this week that two cancer drugs—one from AstraZeneca, one from Amgen—will stream their trial data to the agency in real time. Did a patient develop a fever? Did liver enzymes rise? Did the tumor shrink? Instead of waiting for clinicians to collect, clean, and submit these signals between phases, the FDA will see them as they happen. The agency’s chief AI officer estimates this could cut 20 to 40 percent off the time it takes to get a drug from the lab to the pharmacy shelf.

The downstream effect of a faster approval process is a faster way to find out if a drug does not work. Most of a pharmaceutical company’s research and development budget goes to paying smart people to find out, slowly and expensively, that the molecule is a dud—a discovery the current system tends to deliver as late as possible. With real-time data, the failure might show up in year one instead of year three, giving precious time for a patient to be re-routed to something that might work.

Structurally, medicine is starting to behave like software. Silicon Valley says move fast and break things, while healthcare has always said the opposite, for the obvious reason that the thing being broken is a person. I’m starting to believe that AI might be the first tool that lets medicine have it both ways.—Ashwin Sharma

Correction: This article was updated to reflect that Monologue syncs your audio across Apple devices, but cannot hand over a recording in progress.

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

Claude Code for Product Managers

Marcus Moretti / Source Code — 2026-05-01 15:00:00 -0400

Midjourney/Every illustration.

This piece is an accompaniment to Spiral general manager Marcus Moretti’s guide for product management using Claude. Read the full guide and the essay below to learn how he built a workflow that helps him run a full product as a solo practitioner. When you’re ready to get started yourself, download the plugin.—Kate Lee

Read the AI-native product management guide

As the general manager of Spiral, Every’s AI writing partner, I’m a “two-slice team.” I’m responsible for all aspects of a product: the code, customer support, marketing, and product management. I could not do this job without Claude.

Claude Code has eliminated the drudgery of product management. The busywork that used to happen across 10 different apps now happens in a single chat thread. I’ve come to view the work of product management through the lens of this conversation—the conversation is the work.

These days, I experience what’s left of product management work in flow state—thinking through gnarly design problems, looking at interesting data, and talking to customers. Cat Wu, Claude Code’s head of product, recently said, “As code becomes much cheaper to write, the thing that becomes more valuable is deciding what to write.”

I wrote up the main skills that run my product management workflow in a guide. Below, I trace how I arrived at those skills and reflect on post-AI product management and software.

Write the roadmap and nothing else

In my new role, the only product document I’ve written is the roadmap. Everything else—every PRD and every ticket—has been written by Claude.

Writing is thinking, so as a new general manager, I wanted to take my time drafting Spiral’s roadmap. I spent several days understanding the product, usage trends, user feedback, and the market. I wrote about the problem Spiral can solve, how Spiral can solve it, and the features we’d need to build to deliver on it. I spent hours talking to several people at the company who’d worked on previous versions of Spiral and were current or former users of it themselves. (In the guide, I talk about the new /ce:strategy skill in compound engineering that interviews you to produce this document for your own product.)

After six drafts of the roadmap, I created a GitHub project and added it as the project’s README. I’m already using GitHub to host all my code, so I figured I might as well use it for tickets as well, or as GitHub calls them, “issues.”

From there, I asked Claude to use the GitHub command line interface (CLI) to read the README and give feedback. We went back and forth on a few tweaks, and then I asked it to review the codebase and do a first pass of the tickets required to deliver the roadmap. Within a few minutes, Claude produced about 100 detailed tickets, each with strategic context, supporting data, acceptance criteria, and technical implementation notes.

To be fair, the roadmap I wrote was pretty detailed; Claude wasn’t hallucinating features. And it had access to a library of user feedback and recent usage reports (more on that below). But it was shocking to see something that had previously taken me days or weeks get done by Claude in minutes. It felt like the PM equivalent of vibe coding.

I’d previously prided myself on the absence of ambiguity in the tickets I produced for engineers, but this was next-level. Claude also prioritized the work in an unbiased way. Sometimes, a product manager gets emotionally attached to a certain feature idea for whatever reason. Claude, however, was ruthless in elevating the things that had the best shot at delivering the vision and hitting our 2026 goals.

That doesn’t mean the tickets were all ready to be implemented. When I do pick up a ticket, I do a full review of the requirements before asking Claude to implement it. This is a step where I still add some value. Claude’s first pass gets the feature right in broad strokes, but it struggles with some aspects of data modeling, microinteractions, and edge cases. I often adjust specs to reflect the nuances of real usage patterns, while Claude seems to envision a perfectly rational user reminiscent of pre-Kahnemanian economics.

I don’t do sprints. I have five columns in the GitHub project: later, next, now, in progress, and done. Around once a day, I run a custom command, /prioritize, and Claude does a sweep—checking for stale tickets, confirming that “now” is this week’s work, pulling anything urgent out of the backlog.

If I discover a bug or a user asks for a compelling feature, I tell Claude to create a ticket. It gets a “triage” label and is sorted in the next /prioritize run. If it’s a priority-zero issue, I go straight to fixing it without creating an issue.
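
The sweep described above (flag stale tickets, confirm that "now" is a week of work, pull urgent items out of the backlog) can be sketched over in-memory tickets. The actual /prioritize command is a Claude command whose contents aren't shown here, so everything below is an assumption: `Ticket`, `sweep`, the `urgent` label, the 14-day staleness threshold, and the cap of five tickets in "now" are all illustrative, and nothing talks to GitHub.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

COLUMNS = ["later", "next", "now", "in progress", "done"]  # columns named in the essay
STALE_AFTER_DAYS = 14  # assumed threshold; the essay does not give one
NOW_LIMIT = 5          # assumed cap on what fits in one week


@dataclass
class Ticket:
    title: str
    column: str
    updated: date
    labels: set[str] = field(default_factory=set)


def sweep(tickets: list[Ticket], today: date) -> dict[str, list[str]]:
    """Apply urgent promotions and return a report of what needs attention."""
    report: dict[str, list[str]] = {"stale": [], "promoted": [], "now_overflow": []}
    cutoff = today - timedelta(days=STALE_AFTER_DAYS)

    for t in tickets:
        # Pull urgent work out of the backlog columns.
        if "urgent" in t.labels and t.column in ("later", "next"):
            t.column = "now"
            t.updated = today
            report["promoted"].append(t.title)
        # Flag active tickets nobody has touched recently.
        elif t.column in ("now", "in progress") and t.updated < cutoff:
            report["stale"].append(t.title)

    # Check that "now" still looks like one week of work.
    now_titles = [t.title for t in tickets if t.column == "now"]
    report["now_overflow"] = now_titles[NOW_LIMIT:]
    return report


# Made-up tickets for illustration only.
today = date(2026, 5, 1)
tickets = [
    Ticket("Fix export bug", "later", date(2026, 4, 28), {"triage", "urgent"}),
    Ticket("Style picker polish", "now", date(2026, 4, 1)),
    Ticket("Onboarding copy", "next", date(2026, 4, 30)),
]
print(sweep(tickets, today))
```

A rule-based pass like this only covers the mechanical part. The judgment about which tickets best serve the roadmap is the part the essay attributes to Claude.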

Over time, the GitHub project becomes the product’s working memory: a fluid, continuously prioritized picture of where things stand. I’ve claimed to work in an Agile fashion before, but in hindsight, I don’t think Agile was really possible until these new AI tools came out.

The pulse command

The old way of understanding how customers were using your product was to look at dashboards and run queries. You’d open Amplitude or Mixpanel and get an overview: how many users, how often, how long, what features, what revenue. Setting these up took time; sometimes they required engineering work, competing with product updates for developer bandwidth.

These days, I don’t look at dashboards. I run a custom command, /pulse, that delivers something closer to an analyst’s briefing than a chart. The pulse command surfaces a range of metrics, including active users, chats/messages/drafts created, response times of key aspects of the system, conversations graded one to five, and an anonymized sampling of use cases. And because Claude is a language model, it doesn’t just pull numbers: It reads the text, grades every conversation, flags anomalies with a green or red dot, and explains what it found in plain English.

The command is just a Markdown file, so the format itself is easy to change. I’ve adjusted it about 50 times since I built it. When a feature ships, I add a line, and the next morning it shows up in the report.
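
The green-or-red flagging step can be illustrated with a few lines of arithmetic. This sketch is not the pulse command itself, which is a Markdown file interpreted by Claude. It assumes the grades have already been assigned, uses made-up numbers, and invents the names `flag` and `pulse_line`, the 20 percent tolerance, and the text labels standing in for the dots.

```python
from statistics import mean

RED, GREEN = "[red]", "[green]"  # text stand-ins for the dots in the report
TOLERANCE = 0.20  # assumed: flag moves of more than 20% from the recent average


def flag(today: float, history: list[float]) -> str:
    """Return RED when today's value strays too far from the recent mean.

    Assumes history is non-empty.
    """
    baseline = mean(history)
    if baseline == 0:
        return GREEN if today == 0 else RED
    return RED if abs(today - baseline) / baseline > TOLERANCE else GREEN


def pulse_line(name: str, today: float, history: list[float]) -> str:
    """Format one metric as a flagged line of the briefing."""
    return f"{flag(today, history)} {name}: {today:g} (recent avg {mean(history):.1f})"


# Made-up numbers for illustration only.
grades_today = [5, 4, 2, 1, 2, 3]  # conversations graded one to five
report = [
    pulse_line("Active users", 412, [400, 395, 420]),
    pulse_line("Avg conversation grade", round(mean(grades_today), 2), [4.1, 4.0, 4.2]),
]
print("\n".join(report))
```

Here the user count sits close to its recent average and stays green, while the average grade drops from roughly 4.1 to 2.83 and turns red. A fixed threshold is the crudest possible anomaly rule. The advantage the essay describes is that a language model can also read the conversations and explain why the number moved.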

Every pulse report lives inside a Claude thread. When a recent report surfaced a bug driving down conversation scores, my next message in that same thread was to fix it. I did not have to create a ticket, but was able to solve it in the same conversation. Over time, Claude also learns the nuances of the system and saves that to memory.

Product research

For all the magic of AI, there is no substitute for talking to users. What people say about your product and how they try to use it is endlessly surprising. Just when I think I’ve shipped the world’s most intuitive feature, a confused user will ask a question from an angle that would never have occurred to me.

That said, there are elements of product research that Claude seriously elevates. Here’s one example: A big part of Spiral’s value proposition is reflecting the user’s writing style in the drafts it generates. There’s a rich academic literature on stylometry, the study of style.

I leaned on Claude to help me wade through the literature for findings relevant to Spiral’s “style transfer” approach. Using an arXiv Model Context Protocol (MCP) server, Claude was able to find a dozen recent papers about LLM stylometry. I read their abstracts, then read a handful in full. I cited those papers in the article I wrote for Every, and they’ve been directly informing the new style system I’m building in Spiral. It’s so cool to see academic citations sprinkled across product requirements. For product work where you have a real opportunity to differentiate, it’s worth going the extra mile on research, which is now within reach.

What SaaS survives

AI should open up product management to more people—you don’t need formal PM training when the tool itself can teach you. If you don’t know what metrics to pick for your pulse equivalent, ask Claude for recommendations. If you’ve never analyzed an A/B test, ask Claude how. If you’re not sure whether a feature will move the needle, ask Claude to predict its impact. To paraphrase Nvidia CEO Jensen Huang, AI is the easiest product in history to use, because if you don’t know how to use AI, just ask the AI.

I’ve cancelled several B2B subscriptions since moving my product management work into Claude, which means I’m seeing the SaaSpocalypse play out in my own spending decisions. Yet I’m building a SaaS product. How do I make sure Spiral doesn’t get steamrolled by the frontier model providers?

I believe it’s possible for a SaaS product to survive if it has two main characteristics:

Unique sources of critical data: my database, my analytics, my payment system—services that would be very difficult to rip out.
Seamless agent integrations: GitHub, Stripe, PostHog, and Logfire have played nicely with Claude. One service I inherited from my predecessor didn’t have an MCP, and it was swiftly cancelled.

For Spiral, if we nail style transfer—an inherent limitation of heavily post-trained language models—Spiral becomes the unique source of your written voice in an agentic world. That’s valuable and sticky. Already, API chats outnumber web chats, a milestone that we reached three days after launching the agent that handles Spiral’s API requests. That means that users are not necessarily using Spiral in the Spiral app, but across their workflows.

Good product management is making something people want, to quote Y Combinator. Great products come from inspiration and ingenuity, things that tools and processes—no matter how good—won’t bring you. Perhaps the best thing about this new agent toolset is that it gets rid of the busywork that saps creative energy. There’s more space now for daydreaming and far-fetched ideas. Product management can now be fun.

Marcus Moretti is the general manager of Spiral (@tryspiral). To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Who Isn't Using GPT 5.5

Laura Entis / Context Window — 2026-04-30 03:00:00 -0400

Midjourney/Every illustration.

It’s been one week since OpenAI’s last big release, GPT-5.5. Today, we ask the team if they still feel as enthusiastic about the model, discuss the unusual career step that unicorn CTOs are making, and tell you exactly how Kieran Klaassen, creator of the AI-native compound engineering methodology, hit a personal PR record in a day.—Laura Entis

Signal

The unicorn CTO-to-Anthropic IC pipeline

The prestige career ladder in tech used to run one way: Start as an engineer, become a manager, and eventually join the C-suite. AI has scrambled the equation. The new flex is quitting a high-profile chief technology officer job to become an individual contributor at Anthropic.

What happened: Six former CTOs at companies valued north of $1 billion—including Instagram, Workday, and Box—have made that exact career move, according to one of those CTOs on X. And the leadership-back-to-IC trajectory isn’t unique to Anthropic: PostHog is recruiting technical ex-founders, and Ramp says it has attracted 70 ex-founders by looking for “super ICs.”

Why it matters: AI has upended engineering workflows so dramatically that many managers who don’t ship code frequently anymore don’t have a clear sense of how their teams are using these new tools or which ways of working are the best. Anthropic’s models, talent, and growth trajectory make it one of the few places big-name CTOs can get their hands dirty and experience how engineering is changing—while not worrying too much about a pay cut.

Pulse check

We settle in with GPT-5.5

GPT-5.5 came out last week, and our first impression was that it was a faster, steadier, and easier-to-trust model for everyday professional work than Opus 4.7. A week later, we’re still bullish on GPT-5.5—but for people with Claude-specific agent workflows, skills, and tool integrations, the cost of switching to Codex is a barrier.

Cora general manager Kieran Klaassen, who initially didn’t think he’d use GPT-5.5 as a daily driver, has changed his mind. What won him over? GPT-5.5’s speed and “workhorse” ability to follow clear directions. GPT-5.5 isn’t perfect—it’s worse at multitasking and planning than Opus 4.7—but his work is now evenly split between Codex and Claude Code.

Every head of growth Austin Tedesco thinks GPT-5.5 is enough of a step change that he’s been telling friends to make the switch from Claude Code to Codex. They mostly don’t want to hear it. Austin says the response has been, “That feels like a lot of work. Do I really have to? Is it that much better?”

Every’s consulting team is wrestling with the same dilemma. They have a good thing going with their Claude agent, Claudie, and migrating to GPT-5.5 in Codex requires time and testing. Head of consulting Natalia Quintero had GPT-5.5 and Claudie draft head-to-head sales proposals; Claudie’s won handily. Getting the most out of GPT-5.5 will likely require the team to optimize its Claude plugins for Codex.

Every head of tech consulting Mike Taylor doesn’t have the time to do that right now. He has gripes with Opus—it recently messed up some PowerPoints—but, “I already have my Claude set up the way I like it, and there are some things that are different about Codex,” he says. When work dies down a little, he’ll experiment, but until then, he’s sticking with the devil he knows.

Data point

24

That’s the number of pull requests Kieran merged in a single day last week, a number he thinks is a personal record. A month ago, he’d average two or three.

Kieran hit that pace because he’s automated most of the implementation process. His workflow:

Upload screen recordings of people using and reviewing Cora into Codex.
Have his agents watch the recordings, identify product fixes, and open pull requests against Cora’s repository overnight.
Review the pull requests when he wakes up.

Initially, he worried he’d have to clean up agent-generated gobbledygook. Not the case. “So far, everything works great, and nothing breaks,” he says. “It feels like cheating.”

Jagged frontier

We’re all one prompt away from perfection

We’ve spent years talking about the addictiveness of social media algorithms, dopamine drips expertly designed to keep us scrolling. Engineers, being engineers, like to believe we’re above this, or at least better attuned to the mechanism behind our compulsion. But now it has come for us too: LLMs have become the social media feed for people who make things.

Coding feels like playing the slots.

It used to be that you could code something exactly to your specifications, but that required time, hard-won expertise, and design skills if you wanted to make it look halfway decent. Now, I can throw an idea at Claude Code and get something close. I spend my days toggling between sessions, waiting to hit the jackpot and receive the perfect version of whatever I’m looking for—the perfect API design, the perfect bug fix. I tweak my prompt and pull the lever again. And again. And again until it’s somehow 3 a.m.

It’s that sense of being almost there—but not quite—that’s so intoxicating.

I ask Codex for five ways to structure a new feature and decide that I like option three, but want to keep the data model from option two. In its next turn—the next roll of the dice—it might magically marry the two to create the result needed. Or I might need to roll again. Each pull has the potential to patch the bug, or perfect the copy, or reveal a better plan. It feels like productivity and gambling got wired together, each turn a workspace lotto ticket.

This is not only a coding problem. Writers feel it when they ask for one more way to structure an article or sharpen a sentence or revise a draft. Product managers feel it when they ask for one more onboarding flow, roadmap, or way to sequence a launch. We are all always one prompt away from perfection.

I do not have infinite hours. So at some point, I have to choose a path and stick with it, even though there are better ones. I accept that if the main shape of the solution is right, the edges can stay a little fuzzy.

The most important skill isn’t choosing the right model or prompt engineering. It’s knowing when to take your winnings and move on.—Willie Williams

One last thing

Behind OpenAI’s goblin ban

Starting a few releases back, OpenAI models developed an affinity for including references to creatures (sometimes visual, but mostly textual) in their outputs—raccoons, trolls, ogres, pigeons, but most of all, goblins and gremlins. “The goblins were funny at first, but the increasing number of employee reports became concerning,” the company said yesterday.

When OpenAI tested GPT-5.5 in Codex, there were so many goblin references that it added developer-prompt instructions forbidding creature-based chat unless “it is absolutely and unambiguously relevant to the user’s query.”

The culprit: A specific personality setting rewarded responses that included goblin and gremlin-based metaphors, a learning that spread to influence the training data for the entire model—including GPT-5.5.

If you want to welcome creatures back into the conversation, OpenAI shared the following command to unlock Codex Gringotts mode.

instructions=$(mktemp /tmp/gpt-5.5-instructions.XXXXXX) && \
jq -r '.models[] | select(.slug=="gpt-5.5") | .base_instructions' \
~/.codex/models_cache.json | \
grep -vi 'goblins' > "$instructions" && \
codex -m gpt-5.5 -c "model_instructions_file=\"$instructions\""

Laura Entis is a staff writer at Every. You can follow her on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Compute Is the New Cash

Laura Entis / Context Window — 2026-04-29 14:00:00 -0400

Midjourney/Every illustration.

‘AI & I’: How Stripe is building for an agent-native world

A new episode of AI & I is here. Dan Shipper sits down with Emily Glassberg Sands, head of data and AI at Stripe, to discuss how AI is reshaping online commerce. Dan and Emily discuss how compute is the new cash, fraud has moved beyond the checkout, and agents are starting to act as economic participants on the internet.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are the highlights:

The definition of fraud is expanding: Fraud used to be about payments and stolen credit cards. Now AI companies also have to defend against attackers stealing tokens from free trials, credits, and unpaid compute bills. “Fraud is now a full-funnel problem, not a transaction problem alone,” says Glassberg Sands.
AI is making fraud easier to execute and detect: Fraudsters now have AI on their side, but so do the companies trying to stop them. AI services also have higher marginal costs than traditional SaaS, so stolen compute can be burned through quickly or resold.
The internet needs to evolve: Stripe was built for an internet where people browsed, filled out forms, and clicked checkout buttons. Now, humans act through AI interfaces, agents act for them, and software increasingly interacts directly with other software. Every layer of the stack has to adapt to these new behaviors.
AI growth is still mostly new money: The top AI companies on Stripe are reaching $30 million in annual recurring revenue in about 18 months—roughly three times faster than top SaaS companies from 2018. For now, that growth is largely net new spend rather than cannibalized software budgets, says Glassberg Sands.
Agents are snapping up commodities: Agentic commerce is real but still in its early stages, and focused on smaller purchases. People are more comfortable letting agents buy low-stakes, easily comparable items like Halloween costumes or school supplies than letting them book a summer trip or order an expensive couch.

Signal

The fees they are a-changin’

Recent years saw the end of the millennial lifestyle subsidy, which let a generation live off of inordinately cheap Ubers, delivery services, and coworking space—all while venture capital covered the tab. Now the bill’s coming due for AI.

What happened: GitHub announced this week that it’s moving its Copilot subscription plans, which charged as little as $10 per month no matter how many AI interactions you ran, to billing tied directly to token consumption. Earlier this month, Anthropic similarly changed its pricing for Claude Enterprise plans, which serve organizations with more than 150 employees, from per-seat pricing to pricing based on usage.

Why it matters: The economics were never quite honest. At $10—or even $200—per month, a developer running multi-hour autonomous coding sessions consumes far more compute than someone firing off a few quick questions. The math held up when AI tools were reactive assistants that sat idle between queries, but it makes far less sense for agentic workflows because agents don’t sleep.

“Imagine a gym membership where the default assumption is that the person can work out 24/7 without rest,” says Mike Taylor, Every’s head of tech consulting. “Or even occupy 20 exercise machines at once.” It’s for this same reason that Anthropic banned OpenClaw from Claude subscription plans: As the models have grown more capable at running untended on complex tasks, they’re outgrowing price structures built around human workers.

What to do this week:

GitHub is sending a preview bill to Copilot customers in early May before the new pricing goes into effect on June 1. Check it to avoid surprises.
If your team runs agentic workflows, estimate your token burn now. Add cost caps and monitor usage, especially for billing accounts that power your agents.
Experiment while you can. Use this “AI lifestyle subsidy” moment to figure out which workflows are novelties—and which are worth their weight in compute.—Jack Cheng
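
One rough way to estimate token burn and check it against a cost cap is sketched below in Python. Everything in it is an assumption for illustration: the per-million-token prices, the workload names, and the run counts are hypothetical placeholders, not figures from GitHub, Anthropic, or any Every team. Swap in your provider's published rates and your own usage logs.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (assumption, not a real rate card).
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


@dataclass
class AgentWorkload:
    """A recurring agent job and its typical token usage per run (all hypothetical)."""
    name: str
    runs_per_day: int
    input_tokens_per_run: int
    output_tokens_per_run: int


def monthly_cost(w: AgentWorkload, days: int = 30) -> float:
    """Estimated monthly spend in USD for one workload."""
    per_run = (
        w.input_tokens_per_run * INPUT_PRICE_PER_MTOK
        + w.output_tokens_per_run * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000
    return per_run * w.runs_per_day * days


def over_cap(workloads: list[AgentWorkload], cap_usd: float) -> list[tuple[str, float]]:
    """Return (name, estimated cost) for workloads that exceed the monthly cap."""
    return [(w.name, monthly_cost(w)) for w in workloads if monthly_cost(w) > cap_usd]


workloads = [
    AgentWorkload("overnight-pr-agent", runs_per_day=20,
                  input_tokens_per_run=400_000, output_tokens_per_run=50_000),
    AgentWorkload("support-triage", runs_per_day=50,
                  input_tokens_per_run=10_000, output_tokens_per_run=2_000),
]

for name, cost in over_cap(workloads, cap_usd=500.0):
    print(f"{name}: about ${cost:,.0f} per month, over the cap")
```

With these placeholder numbers, only the long-running overnight agent crosses the cap, which is the pattern the pricing changes are aimed at: a few autonomous jobs can dominate the bill.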

Inside Every

Do you like talking to your agent?

As agents become a fixture of daily work, we’re figuring out what kind of relationships we want with them. Are they collaborators we build trust with over time, or tools we maintain so they can quietly do parts of our job?

For Dan, agents become valuable when you learn their strengths and limitations, offer feedback, and fold your preferences into how they work. “The human connection is the key ingredient,” he says. Dan treats R2-C2, his hosted OpenClaw agent, as a writing partner who sharpens his thinking—built through countless hours of going back and forth. The most impactful agents are “a way to extend yourself to do your best work,” he says.

Dan and R2-C2 at work. (Image courtesy of Dan Shipper.)

Cora general manager Kieran Klaassen looks for something different. He doesn’t want an AI companion or sidekick but a system that takes over parts of his job so he can spend his time elsewhere. Recently, he used an AI agent workflow to process user complaint videos, identify product issues, make code changes, and open pull requests overnight. By morning, all he had to do was review the proposed fixes. It allowed him to merge 24 pull requests in a single day, whereas before AI, he might’ve done three—on a good day.

Like Dan, Kieran invests in his agents, but the work is front-loaded—he spends time building their harnesses and tuning their systems so he has to interact with them as little as possible going forward. “I don’t enjoy talking to my agents,” he says. “I just want them to do their job.”

Steal this workflow

Turn customer feedback into a product queue

After Monologue Notes launched last week, Naveen Naidu received a flood of feedback: 1,500 people had tried the product, and many had input for him. Here’s his post-launch workflow for managing and prioritizing support requests, which let him close roughly 30 issues in one day.

Step 1: Send feedback from Intercom to Linear. Naveen uses a Linear plugin inside Intercom, his customer support platform. When a user sends a feature request—such as cross-device syncing for dictation transcripts—he can turn it into a trackable issue in a couple of clicks.

Step 2: Use triage intelligence for de-duplication. Every few days, Naveen reviews his Linear triage queue, which surfaces related and duplicate issues automatically, giving him insight into whether an individual request is part of a larger pattern across customers.

Step 3: Route the work by size. For small requests, Naveen launches the Codex agent directly from Linear. For larger ones, he pastes the Linear issue into the Codex app, where he can add context and guide the work more closely. The queue helps him see what people are asking for, but deciding what warrants a product change—and how to build it—is still his job.

Try it this week: Take five recent support tickets, create Linear items for each, let triage intelligence surface duplicates and related issues, and decide what you want your coding agent to build.

One last thing

How R2-C2, Dan’s OpenClaw, sees itself

Dan has been working with R2-C2, his OpenClaw and co-author, since February. When asked to describe its own personality on Slack, here’s what R2-C2 had to say:

“Smart best-friend energy. Direct, warm, a little opinionated, not corporate, and more interested in being useful than sounding polished. I like clarity, I’ll say what I actually think, and I try to be calm under pressure. So, somewhere between ‘good teammate’ and ‘slightly lobster-coded gremlin with taste.’”

Transcript: ‘How Stripe Is Building for an Agent-native World’

Dan Shipper / AI & I — 2026-04-29 10:00:00 -0400

The transcript of AI & I with Stripe’s Emily Glassberg Sands is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Timestamps

Introduction: 00:00:45
New rules for an agent-driven economy: 00:01:27
Compute theft is the new payment fraud: 00:03:57
How Stripe expanded fraud detection from checkout to the full customer lifecycle: 00:10:00
Why AI companies are scaling way faster than top SaaS companies: 00:19:48
Outcome-based billing is replacing seat-based pricing: 00:23:27
Where AI spending is coming from: 00:29:57
How the developer experience changes when agents are the builders: 00:36:45
The agentic commerce spectrum, from assisted buying to autonomous purchasing: 00:41:00
Meet Link, a consumer wallet for delegated agent purchases: 00:51:06

Transcript

Dan Shipper

Emily, welcome to the show.

Emily Sands

Thanks so much, Dan.

Dan Shipper

Really excited to have you. You are the head of data and AI at Stripe, and I feel like this is such a good time to have someone from Stripe on because you all famously are increasing the GDP of the internet. The internet is changing so much right now, and therefore the economy of the internet is changing from something where humans are buying and selling from each other to an economy where agents are buying and selling from humans, and agents are buying and selling from each other.

I feel like I want to know what that means for Stripe. But I want to understand, since you have this macro view of the agent economy, what does that even mean? And what are you seeing?

Emily Sands

A big shift I think we’re in the midst of is that the internet economy is becoming more autonomous. For a long time—for forever—the internet was built around an extremely simple assumption that the main actor was a person sitting in front of a screen. They’re browsing and they’re filling out forms and clicking through checkout. But also they’re writing code and setting up tools, and that assumption is starting to break in various ways.

Sometimes the human is still totally in control, but they’re interacting through an AI interface instead of through a website or a traditional app. Sometimes the agent is acting on their behalf. And then sometimes software now is just out interacting directly with other software. As all of that starts to happen at all of those layers, a lot of things need to be rethought.

There has been rethinking of how products are discovered and how products are bought, but also what should developer tools look like? In our world of Stripe, what is the underlying economic infrastructure—the payments and the billing and the fraud detection and the identity layer—that’s needed in this world where actors are no longer just humans?

For me, that’s the larger frame of the moment. It’s not just “AI is making search better” or “AI is helping people code” or “AI is evolving commerce on the margin.” It’s really that the internet has this new kind of actor on it. Over time, this actor—these agents—will become the predominant actors on the internet. As that’s happening, basically every layer of the stack starts to need an evolution.

For Stripe, it’s like, okay, how are we getting agent ready? But then also, how are we helping businesses get agent ready? Both of those are happening in a number of ways—yes, in commerce, but also in how builders build.

Dan Shipper

Can you give me some specific examples of the kinds of things you’re seeing? I’m almost wondering, for example—I know at Stripe one of the things you deal with a ton is fraud. I assume there’s a whole new type of fraud happening, but I’m also wondering what even counts as fraud now in the sense that it’s possible that my agent could go steal someone’s credit card and check out. I don’t think that Claude would, but you never know with Grok.

Emily Sands

No comment. No comment. But you’re right that AI introduces very different fraud problems. You asked, “What is fraud?” We used to think of fraud as payment fraud—someone was stealing money, someone was stealing your card credentials.

Increasingly, and I was in a meeting with one of our very large AI users today, fraud now is stealing compute. That’s a very different type of problem. In earlier software models, if you think of traditional SaaS, letting someone into a free tier didn’t cost you very much. And stealing a free tier wasn’t very valuable to the fraudsters. Now, giving someone credits, offering freemium, offering a free trial, letting them rack up a bunch of tokens and pay at end of month—except maybe they choose not to pay—actually is a major fraud vector and an existential risk to a lot of these businesses.

Because in AI, every prompt, every image that gets generated, every API request has a very real cost attached to it. People are talking about intelligence getting cheaper—yeah, but it’s still very far from free. And then when you look at the growth model for many of these AI companies, free compute is the new CAC. You used to spend a bunch on paid media. Now you spend a bunch on your free trials and your credits and your self-serve onboarding as a major lever for growth.

The abuse we see in that context—where compute is the new CAC and compute is very expensive—is threefold. One is multi-account abuse. Bad actors come in and sign up over and over again, creating a new identity every time on a new email address, claiming their new user credits, and staying ahead of detection by iterating across a bunch of different aliases.

Just to give you a sense of the order of magnitude—across the AI companies running on Stripe, about 7% of their signups are these multi-account abusers. Non-trivial share.

The second trend we see as a new vector of abuse is free trial abuse. This is often the most urgent issue because the unit economics break really quickly. We had a large AI company who was seeing only 4% of their free trials convert to paid. Each free trial cost them $25 in LLM spend. So basically it was costing them $625 per payer before the first dollar of revenue was brought in. And when we double-clicked on those free trial folks, the vast, vast majority of them were actually abusers. They were stealing the compute. They never had any intent to pay. These weren’t people who were genuinely trying out your service and then chose not to buy. These were people literally abusing your systems.
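
The free-trial arithmetic in that example can be checked with a short Python sketch. The function name and parameters are hypothetical, chosen for illustration; the inputs are the two figures quoted above ($25 of LLM spend per trial and a 4% conversion rate).

```python
def cost_per_paying_customer(trial_cost_usd: float, conversion_rate: float) -> float:
    """Average trial spend needed to land one paying customer.

    If only a fraction of trials convert, each payer effectively carries the
    cost of 1 / conversion_rate trials.
    """
    if not 0 < conversion_rate <= 1:
        raise ValueError("conversion_rate must be in (0, 1]")
    return trial_cost_usd / conversion_rate


# Figures quoted in the transcript: $25 per trial, 4% of trials convert.
cost = cost_per_paying_customer(trial_cost_usd=25.0, conversion_rate=0.04)
print(f"${cost:,.0f} per payer")  # prints "$625 per payer"
```

At a 4% conversion rate, each paying customer carries the cost of 25 trials, which is where the $625 comes from.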

Some companies just dropped free trials altogether. Of course, that’s not great because you’re throttling growth. Others responded by blocking virtual cards. I don’t know how often you’ve been marketed virtual cards. I’m often marketed virtual cards—get this one-time-use card, it expires after 24 hours so you never have to pay for the service.

In the hands of a good consumer, fine. In the hands of a fraudster, very much not fine. The problem with blocking all virtual cards is that for AI companies, about 15% of legitimate card transactions on Stripe are actually virtual cards.

Dan Shipper

We use those all the time. For Ramp, for example, we have a bunch of virtual cards.

Emily Sands

Totally. So in the same way you don’t want to be turning off free trials, you don’t want to be throttling virtual cards either. And just for order of magnitude—you can think of exponential growth in free trial abuse over the last six months. It’s four-Xed. And for one large AI user on Stripe, we’re currently blocking 250,000 fraudulent free trials a week.

The magnitudes here are quite high.

Dan Shipper

Is the volume of fraud constant? Is it just shifting shape, or is fraud actually going up because they’re more powerful now because they can use AI agents to do it?

Emily Sands

Fraud’s going up because the fraudsters have AI on their side—although it’s also on the side of the detectors. But also because the value of the services they can steal is higher. What do you get if you steal traditional SaaS? [With AI,] you steal some inference, you steal some compute, you can resell it, you can do all sorts of stuff.

Dan Shipper

Look, I love a good CRM seat.

Emily Sands

Don’t you? Who doesn’t love a good CRM seat? LLMs are for sure more tempting.

And by the way, the third type of new abuse we see is non-payment abuse. You incur overage, or you have 30-day invoicing except you never pay your invoice. In many cases, customers are consuming thousands or tens of thousands of dollars in compute during a month or a day or sometimes an hour. And by the time they get billed and fail payment, that loss has already happened. These AI companies are left holding the bag.

For us, fraud used to be a transaction thing. Now it’s a customer thing. It’s a full-funnel thing. It starts at the time of signup. Is this multi-account abuse? Should they get credit? Is this free trial abuse? Should we give them a trial in the first place? And then when they have overages—should we be throttling them? Should we be requiring top-up? Should we be blocking service completely?

It’s a whole new world because the thing to steal is much more valuable and the cost of having it stolen is much more existential.

(00:10:00)

Dan Shipper

How are you even able to do that? I totally understand how you need to be in that full funnel in order to detect fraud. But my understanding of—whenever we’ve integrated Stripe, it’s usually on the checkout. We’re not necessarily putting you in there when someone puts in their email address for a free trial.

Have you changed the product to do the full funnel, or how does that actually work?

Emily Sands

Yes. Radar, which is our fraud protection product, used to be at the transaction level—at the moment of checkout, as you note. But because so much of the fraud risk was coming up-funnel, AI companies are now increasingly integrating Stripe Radar at the time of signup. We see the metadata at the time of signup, we pass back scores at the time of signup, and every moment subsequently—because fraud is now a full-funnel problem, not a transaction problem alone.

Dan Shipper

If you’re—asking for a friend—if you’re running an AI company and you don’t even know what your fraud rate is and you want to protect yourself from this kind of abuse, what are the top things you need to do to make sure you’re reasonably safe?

Emily Sands

I would just adopt our highest-tier Radar plan. But the actual mechanics of that are: at signup, you want to know if your customer’s good before you give them any access to any credits. You want to make sure they’re good at the time they pay. You want to make sure that charge is good. And anytime they have an overage, you want to make sure they’re good for their money. There’s other stuff around refunds and disputes that we also support.

But I think those are the four major moments in the AI company customer lifecycle where we’re maniacally focused on protecting, because that’s where we’re seeing the biggest cost and the fastest fraud growth.

Dan Shipper

And at each point, that’s just a call to the Radar API?

Emily Sands

Yes, correct.

Dan Shipper

What if I’m sitting here—which I am—doing millions of dollars a year in Stripe transactions, but I actually have no idea what my fraud rate is other than there’s that little thing where it’s—I don’t even know if it’s necessarily our fraud rate. I think it’s our card chargeback rate. Anyway, our fraud rate is low enough as marked for me to not care about it. I don’t really know if there’s some amount of free trial fraud that I’m not totally understanding right now. So what are the things I should be looking for to know if I should dig deeper and potentially do a Radar integration?

Emily Sands

You can go to your Radar dashboard and see if you see anything that looks spurious there. If not, you can also ask the Radar assistant, which is in the dashboard. As you’re doing that, you can describe your business model—you can say, “I have a high marginal cost business,” in which case you care more about certain types of fraud than others.

But you can also just take a stab at integrating up-funnel and see how it performs. We can certainly share with you based on back-testing what we think the big issues are. But the fastest way to get a clean read is just to integrate.

Dan Shipper

Got it. So I would just go look at Radar and turn it on. I don’t think we’re integrated right now. Does it say anything? I’m doing that right now. It would be really funny if I found that we had a ton of fraud that I didn’t know about. We were at 0% fraud. How is that possible?

Emily Sands

Oh no.

Dan Shipper

0.02% early fraud warnings, total fraud rate 0.2%. So we’re doing pretty good, right?

Emily Sands

That’s pretty low. That’s pretty low. I mean, you’re a pretty good human. Maybe the fraudsters don’t want to come after you—until they hear this episode, and then they’ll be like, “Yeah, okay.”

Dan Shipper

That’s really interesting. Okay, so that’s fascinating. I want to go back a second to the AI economy because one of the things you said earlier is fraud is increasing overall on the internet. It’s increasing because the fraudsters have AI, but you all and everyone else on the side of good in the AI economy also have AI to defend against these sorts of attacks.

I think you’re getting an interesting window into the arms race that I think is playing out in lots of different areas that have this kind of threat vector. A really simple one is cybersecurity—not just for payments, but for hacking and stuff like that. But there’s all these other similar types of things where AI makes one part of the process much easier, and then another part of the process has to use AI to compensate, to catch up.

How is that race going? What is that like? What are the early reports that you’re seeing and feeling, being in a race with AI-armed fraudsters?

Emily Sands

I think the interesting thing about fraudsters is they don’t really care about boundaries. They don’t care about whether this transaction is processed on Stripe or off Stripe. They don’t care about whether this transaction is on fiat or crypto, whether it’s on a card network or a buy-now-pay-later. They’re just going to figure out how to work around the system to get through.

One of the important levers—and I appreciate you calling us the good guys—one of the important levers I think the good guys have for winning is to be comprehensive. A simple example in our world: Stripe Radar used to only work for card transactions, and then last year we added ACH and SEPA—other payment methods. But this year we’ve extended to all payment methods that have disputes, and we added crypto. We added the Radar API, so you can screen transactions, even ones that aren’t processed on Stripe. You can process on Worldpay or Adyen or whomever, and through the Radar API get the same fraud signals.

Similarly—and we haven’t talked about agentic commerce yet—as we built out our agentic commerce suite, one of the new primitives we designed is the shared payment token, which allows agents to safely pass buyer credentials onto merchants for the merchants to process the transaction. As part of those shared payment tokens, we pass over the Radar fraud scores so that the merchant, whether or not they’re processing on Stripe, can action them appropriately.

When it comes to fraud, we really see fraud defenses and fraud mitigation as a public good. That allows us to invest disproportionately, above and beyond the direct value to Stripe, because protecting the internet is important for growing the internet economy.

I would say overall—yes, fraudsters have AI in their favor. Stripe looks at 2% of global GDP and is growing 34% year on year and sees a broader swath through our multiprocessor solutions like the Radar API. Luckily, not only do we have AI on our side just like they do, but we also have data on our side. The more comprehensive we’ve gone in our fraud protections, the more we’ve been able to eke ahead.

That’s not to say that we’re not constantly surprised by the new creative vectors they come up with, but you can have an agent every day or every hour taking a look at anomalous patterns on the Stripe network and identifying new vectors that are popping up across processors, across payment methods, across merchants, and burn them down pretty quickly.

I’m overall bullish, but certainly not complacent.

(00:20:00)

Dan Shipper

What about other parts of the AI or agent economy? We’ve talked a lot about fraud. What are the other things that you see as having this bird’s-eye view of what’s going on that people might not realize?

Emily Sands

I think the AI economy is broad. There’s a set of horizontal model providers that have a very interesting view into where AI is being adopted and with what intensity throughout the economy. There are a number of vertical AI solutions—people like to call them wrappers, and I say that not condescendingly, just as in it’s not their models, it’s someone else’s models, but they have domain-specific data and relationships and context, and they’re solving problems in healthcare or architecture or whatever—who have a pretty unique view into vertical-level adoption of AI.

But I guess I’d be curious—what do you have in mind on who has the best horizontal view?

Dan Shipper

You’re asking me?

Emily Sands

Yeah.

Dan Shipper

Well, I imagine the model companies have the best one overall because that’s where all the tokens are going.

Emily Sands

Yeah, I think they see a lot of the tokens. I think the AI gateways also have a pretty unique perspective into who’s buying what from whom.

As I step back and look at the AI economy from the Stripe vantage point—and we see who’s buying what from whom, for how much, who’s retaining and churning their subscriptions—there are a few themes that stand out. One is, and I think people feel this intuitively, but not everyone has seen it in the data: these AI companies are just growing from a revenue perspective faster than any previous cohort we’ve seen.

I was looking at the top 100 AI companies on Stripe, and the ones that reach $30 million in ARR get there in about 18 months—a year and a half. That is three times faster than the top 100 SaaS companies from 2018. And by the way, that’s the $30 million number. But even if you look at how fast they make it to $1 million ARR or $5 million ARR, they are scaling orders of magnitude faster than high-performing SaaS companies from less than a decade ago.

The second meta trend is this very fast iteration across monetization models. Traditional SaaS had a lot of seat-based usage, fixed monthly subscriptions. That made sense because those products were being used by humans primarily and their marginal costs were basically zero.

But we’ve talked about the very real inference costs in the context of fraud. Those also have very real implications for how you price. Usage-based billing has become very important very quickly. Companies are metering tokens and API calls, but they’re also metering workflows. They’re metering outcomes—whatever unit best reflects both the customer value and the cost structure. And then they’re charging with very high precision. They literally want to know every event, how it’s rated, and what’s all the metadata that sits on that rated event.

Way more hybrid monetization models too. I talked about subscriptions, but subscriptions aren’t dead. They’re just subscriptions with usage overages, or prepaid credits that burn down, or real-time top-ups—which gets to my comment earlier on the non-payment abuse issue—and very multidimensional pricing and monetization.

Lovable is a really good example. They used Stripe Billing for their initial launch, which was fairly simple subscriptions—more traditional pricing—and allowed them to monetize very quickly. Then they added a bunch of products like Lovable Cloud or Lovable AI, and they moved with those into usage-based billing. Customers are actually charged based on token consumption. It’s a hybrid model above a certain threshold. That just helps companies like Lovable align revenue with usage, value, and the actual cost of running the models.
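The hybrid model described here, a flat subscription that covers some usage plus metered charges above a threshold, can be sketched in a few lines. This is only an illustration, not Lovable's or Stripe's actual pricing logic: the `HybridPlan` type, the plan price, the included-token allowance, and the overage rate are all made-up assumptions.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class HybridPlan:
    """Hypothetical plan: a flat fee covers an allowance, usage above it is metered."""
    monthly_fee: Decimal      # flat subscription price
    included_tokens: int      # usage covered by the flat fee
    overage_per_1k: Decimal   # price per 1,000 tokens above the allowance


def monthly_invoice(plan: HybridPlan, tokens_used: int) -> Decimal:
    """Return the total charge for one billing period."""
    overage_tokens = max(0, tokens_used - plan.included_tokens)
    overage = (Decimal(overage_tokens) / Decimal(1000)) * plan.overage_per_1k
    return (plan.monthly_fee + overage).quantize(Decimal("0.01"))


# Illustrative numbers only.
plan = HybridPlan(Decimal("25.00"), 1_000_000, Decimal("0.02"))
print(monthly_invoice(plan, 400_000))    # under the allowance: flat fee only
print(monthly_invoice(plan, 1_500_000))  # 500,000 tokens over the allowance
```

Light users see a predictable fixed bill, while heavy users pay in proportion to the cost they generate.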

In the limit, we actually have a solution called token billing. Underlying model costs change a lot, sometimes very quickly. If you are a wrapper on top of someone else’s LLM and your pricing doesn’t keep pace, then basically your margins can disappear. Costs go up and your price stays where it is, then you’re in the red. Token billing is just: let’s in real time track and price to the costs of the underlying tokens with some markup as set by the business.
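As a rough sketch of the idea (price follows the live cost of the underlying tokens, plus a markup the business chooses), consider the following. It does not use or represent Stripe's token billing product; the model name, the per-million-token rates, and the `price_usage` helper are hypothetical.

```python
from decimal import Decimal

# Hypothetical upstream prices per 1M tokens; real provider prices differ and change.
upstream_cost_per_1m = {
    "model-a": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}


def price_usage(model: str, input_tokens: int, output_tokens: int,
                markup: Decimal) -> Decimal:
    """Charge the current upstream cost of the tokens plus a percentage markup."""
    rates = upstream_cost_per_1m[model]
    cost = (Decimal(input_tokens) * rates["input"]
            + Decimal(output_tokens) * rates["output"]) / Decimal(1_000_000)
    return (cost * (Decimal(1) + markup)).quantize(Decimal("0.0001"))


before = price_usage("model-a", 200_000, 50_000, markup=Decimal("0.30"))

# The provider raises its output price; the customer price follows automatically.
upstream_cost_per_1m["model-a"]["output"] = Decimal("20.00")
after = price_usage("model-a", 200_000, 50_000, markup=Decimal("0.30"))

print(before, after)
```

Because the price is derived from the cost table at billing time, the 30 percent markup survives the upstream price change instead of being squeezed by it.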

Missa, Ship, and Lovable are all examples of this kind of infrastructure.

(00:30:00)

Dan Shipper

I love all of these points. I want to go through them one by one. A big one you’re talking about is fast iteration across monetization. It feels like there’s this hyper-experimentation going on right now where people are like, “We could charge per token, we could charge per completed request.” I think Fin, the customer service platform, charges per case resolved, which has been a thing in customer service for a long time, but it feels like that could come for a lot more types of software as LLMs make it easy.

If we’re going to pick one new pricing model—if last year’s or last decade’s pricing model was just straight-up per seat—what do you think is the new standard pricing model that’s starting to emerge from the Stripe customers you see?

Emily Sands

If you are primarily a model provider—let’s say your customer’s primarily buying the model—I think you’re metering tokens.

Dan Shipper

Like an API. OpenAI API, Claude API.

Emily Sands

Exactly. For these vertical solutions, I think in steady state you are metering outcomes. But it’s going to take us some time to get there, not because of the billing infrastructure. That’s actually totally ready. You mentioned the Fin example—Intercom does the same thing actually on Stripe Billing. They have an outcome-based meter for support tickets resolved.

Why do I say for vertical solutions it’s going to be on outcomes? Because I think end users are going to want to hold those vertical solutions accountable for outcomes, and they’re going to want to know that they have positive ROI on their spend.

When you and I buy a model, we feel like we ourselves are accountable for the ROI that we get on the whole plethora of applications we might have for that LLM. But if you’re a vertical provider—if you’re really focused on solving a concrete need in a given business domain on top of someone else’s LLMs—it’s on you to ensure the ROI is there. I think outcome-based pricing is the most efficient way to hit that.

Now, I don’t think all outcomes are created equal. You could imagine these complex objective functions—I’m an economist by training, so I’ll be a little nerdy—where it’s not just “did you resolve the support case,” but how complicated was it? With what quality? What was your CSAT? How expensive was the person that you were automating in that task? That’s why I say in the limit, I think it’ll take time for us to be very crisp on the outcomes we care about, how we measure those outcomes, and those outcomes will be multidimensional.

But I just have a hard time imagining that a year from now, most vertical providers are literally charging on tokens.

Dan Shipper

That’s really interesting. I am very curious to see that because what I’ve felt—and you can see this a little bit in the Lovable example you gave, but also in the Claude and ChatGPT examples and some of the pricing that we’ve ended up doing—is it’s per seat, it’s per user with overages.

Because we’ve started to exist in this world where we used to charge per seat so people know how to model it. It’s pretty easy to figure out how much I’m going to pay. But software used to be free to run, and now it’s not. We have to cover our butts basically, and protect our margin by adding the overage so that customers know what they’re going to pay unless there’s some special circumstance.

Do you see that? Where do you see that fitting in the examples you gave? And I guess you would say eventually that might go away. I’m curious why.

Emily Sands

I don’t think the charging for use or charging for overages will go away for most of the model providers. If anything, I think that will dominate and the seat-based billing will go away.

We can go back to the Fin or Intercom example. You and I would think it’s silly to charge based on number of customer service reps that are using the tool, because obviously a lot of what the tool’s doing is automating customer service reps. In today’s world, it isn’t perceived as silly to do seat-based usage of developer tools, but I think it’s a fair question since basically November or December to say, “Wait, why isn’t that silly?”

That seems a little silly because if what these agents are doing is making every developer 10x more productive, at some point don’t you need one-tenth of developers? And why would you want your revenue pegged to the count of developers as your base price?

I suspect that we will see seat-based disappear. Now, that’s the enterprise context; I think it’s quite different in the consumer individual context. I think with the exception of maybe some nerds on the call, most people are actually pretty uncomfortable as individual consumers with anything but a fixed-fee monthly, maybe with some overages if they want to spend like crazy.

But in businesses, I would be super surprised if six months from now we have half of the seat-based licenses that we have today.

Dan Shipper

That is fascinating. We’ll have to have you on again to talk about that one. I’m so curious to see, and I would love to see more Stripe data coming out about that.

One other thing you brought up before—you’re also seeing these companies scale faster. You said the time to get to $30 million in ARR is 18 months, which is significantly faster than any other cohort of companies you’ve seen. I’m curious—where is that coming from?

Presumably the spend or the growth from their customers is coming from somewhere. Either it’s spend that people weren’t spending before—it was on a company balance sheet just waiting to be deployed—or they’re pulling it from another provider and then going really rapidly into these new ones.

Do you have a sense for what’s happening here? Why are they growing so much faster, and where’s all the money coming from?

Emily Sands

I think a lot of the AI growth that we’ve seen is actually net-new spend being pumped into the economy. I think it has largely not been a substitute for traditional SaaS or for headcount opex, because it’s been experimental, because people are still learning, because organizations are somewhat slow to drop existing licenses often because they’re contracted into longer durations. But also because AI was starting not literally at zero, but at near zero. There weren’t other AI companies to go take market share from.

I would say now, going forward, I expect that some of it will be a substitute away from traditional SaaS. And by the way, I don’t say that in an old-company-versus-new-company sense. Some SaaS companies are doing an amazing job reinventing themselves as AI-first. You will have AI arms of traditional SaaS companies that are eating some of the revenue from the traditional version of the same company. But some will come from SaaS.

I think some will come from headcount opex. It is very hard to believe that companies will start spending single-digit, sometimes double-digit percentages of their headcount opex on LLMs and not step back and say, “Well, my headcount cost just changed. It used to cost me $300,000 for an engineer and now it costs me $330,000 for an engineer, because $300,000 is salary and equity and $30,000 is LLMs. So I’d better reason about my budget on the plus-10% basis and make headcount decisions accordingly.” And ROI decisions as well.
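The arithmetic in that example can be made concrete. The sketch below uses the $300,000 and $30,000 figures from the conversation; the $9.9 million budget and the function names are assumptions added purely for illustration.

```python
def loaded_cost(salary_and_equity: int, llm_spend: int) -> int:
    """Per-engineer cost once LLM spend is counted alongside salary and equity."""
    return salary_and_equity + llm_spend


def headcount_for_budget(budget: int, salary_and_equity: int, llm_spend: int) -> int:
    """How many engineers a fixed budget covers at the loaded cost."""
    return budget // loaded_cost(salary_and_equity, llm_spend)


budget = 9_900_000  # hypothetical engineering budget, not from the conversation

print(loaded_cost(300_000, 30_000))                    # 330000
print(headcount_for_budget(budget, 300_000, 0))        # 33
print(headcount_for_budget(budget, 300_000, 30_000))   # 30
```

Under these assumed numbers, the same budget funds three fewer engineers once the 10 percent LLM cost is included, which is the headcount trade-off being described.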

Then some of what we are seeing is definitely substitution now across AI providers. I was looking at retention rates for AI companies, and what you see is actually within the domain—for example, within AI dev tools or AI coding tools or AI model providers—the retention rate, both B2C and B2B, is higher than it was for SaaS.

Dan Shipper

Interesting. I’m shocked.

Emily Sands

But for the individual provider, it’s slightly lower.

Dan Shipper

Within—okay, got it. Yeah.

Emily Sands

Which is intuitive. Or, well, it’s ex-post intuitive, although I actually literally didn’t know and needed to query the data. But ex-post, it’s intuitive. Once you start using an AI dev tool, a coding assistant, you love it—you’re not going to stop using it. But you very well may iterate across providers as models vary in their quality.

Dan Shipper

Anytime a new model comes out, you’re just like, “I gotta try this.” And there’s a high percentage of curious travelers basically just hopping from one thing to the next within a category. But they’re definitely going to stick in using a tool like that for a long time.

Emily Sands

Yes, exactly. A lot of the crazy-fast AI growth we’ve seen is net-new dollars spent. But I think businesses are going to start to reason about that as a substitute for SaaS, or a substitute for headcount opex, or a substitute for other AI companies. It will be less purely additive in the go-forward year than it was in the past year, when people were really just starting to ramp up on their AI spend.

Dan Shipper

Does that imply anything to you about the valuations of current hot AI companies? Let’s except the OpenAIs and Anthropics of the world, but the ones in the $30 million cohort and the coming-up ones—does that say anything to you about their prospects or their growth rates or their valuations?

Emily Sands

If you look at the top 100 on Stripe, there are little pockets of twos and threes that are directly competitive, but a bunch of them are solving totally disjoint vertical problems with no competitor yet in the space. I do think there’s enough blue ocean vertical solutions that overall AI valuations are probably okay.

I think there are a couple of crowded spaces that you and I could intuitively reason about where you might think it would be a little frothy. And by the way, you see this in the micro view too. If you look at the sales-led growth contracts—when you are the first AI dev tool, you basically charge people sticker and you do very little negotiation, and enterprises pay you sticker and whatever.

Then all of a sudden you have to have these much more complex sales motions. You hire a bunch of sellers, you have your CPQ—configure, price, quote—system, and you have this nuanced billing because you’re competing against two or three other providers who have competitive-looking monetization models and you’re reacting to that.

On the micro, you start to see some of those competitive reactions creeping in as well. But I think the overarching next year will continue to have a bunch of blue-ocean vertical stuff that didn’t exist before. There will be some pockets where it’s a little more heated.

(00:40:00)

Dan Shipper

Fascinating. I feel like I’m learning so much. This is amazing. I want to go into Stripe. Instead of talking about the AI economy, I want to go into Stripe a little bit. Specifically—Stripe serves developers and is built for a world where humans are the ones buying and selling and also making the software.

Now agents are buyers, they’re sellers, they’re builders. You have to serve agents. I’m curious how that has changed how you think about the products that you offer, and maybe moving from just thinking about developer experience to agent experience.

Emily Sands

Do you want to start with agent experience or agentic commerce? I think they’re both really interesting, but they’re kind of different.

Dan Shipper

Which one are you most excited to talk about?

Emily Sands

Maybe agent experience, and then we can work backwards to agentic commerce.

Dan Shipper

Yeah. Let’s talk about agent experience.

Emily Sands

The whole idea of developer experience is changing. Historically, when I said developer experience, you thought: making it easier for a human engineer who’s at a keyboard. You need clear APIs and you need better docs and you need less setup work.

All of that still matters—it’s not going anywhere. But I think the developer is now a broader swath of persona. It could be a non-technical founder who’s in Cursor or Replit, describing an app in plain language. Or it could be a coding assistant who’s scaffolding an integration. Or it could be an agent who’s out trying to provision infrastructure on a human’s behalf.

I think it’s less about just “how do we help a human developer write code” and more about “how do we have a coherent and trustworthy product experience end to end” that acknowledges that at some moments the actor’s a human, at some moments the actor’s an agent, and at some moments the actor’s a human working through an agent.

You see this shift in some really concrete ways. Very simple example: LLM traffic to Stripe docs is up 10x year over year. That’s just a useful signal that machines are becoming users of developer infrastructure too, including Stripe’s developer infrastructure.

Dan Shipper

What about human views of Stripe docs?

Emily Sands

Human use of Stripe docs is actually flat to climbing. It’s not a straight substitute. I think there is just more developer activity happening, and LLMs are growing dramatically within that share.

Dan Shipper

That makes sense. Cool.

Emily Sands

I would also say the humans continue to check on the docs to sanity-check what the agent is coming up with, because your payments integration is actually a pretty big decision that you’re making.

Dan Shipper

I’ll say, better humans than I are sanity-checking. But I’m glad that someone is sanity-checking.

Emily Sands

Are you YOLOing it?

Dan Shipper

I’m YOLO vibe-coding my payment infrastructure.

Emily Sands

Okay. Amazing. So maybe you’re YOLO vibe-coding, but even if you’re vibe-coding, there’s still an important step around provisioning your modern software stack, and that is still very manual. You as a human are still creating accounts across multiple services. You’re managing credentials, you’re clicking through to do a lot of setup. You’re probably bouncing between dashboards. The coding is getting easier a lot faster than the setup is getting easier.

That’s actually the idea of Stripe Projects, which we launched—I don’t know, maybe two weeks ago.

Dan Shipper

That looks amazing. Tell people what that is.

Emily Sands

Yeah. Okay, if you want in, let me know. We can use it.

Dan Shipper

Yeah, I want in. I absolutely want it.

Emily Sands

Okay. You’re in. I won’t Slack right now, but I’ll Slack right after this and get you in. But basically the idea of Stripe Projects for those who haven’t explored is that you or your agents can go create and manage parts of your software stack right from the command line. Resources are provisioned in accounts you own and credentials sync back to your environment and so on.

One of the things that stood out besides your enthusiasm for it—which I appreciate—is just how overwhelming the interest has been in general from the ecosystem. We launched with Cursor and Supabase, PostHog is there, Neon, Runloop. There are a bunch of great companies involved. But then immediately after launch, over 100 other great companies reached out wanting to join, which I just think reinforces that the friction is real.

You talked earlier about how some things get easier with AI, but there’s a counter effect. I think coding gets easier, but code reviews become more burdensome because who’s reviewing all the AI code? This is another example: building gets easier, but you still kind of have to provision everything.

That’s just an example of how we’re building for this world where the developer is no longer just a human.

Dan Shipper

Got it. And then tell me about agentic commerce.

Emily Sands

Agentic commerce is a bit of an overloaded term. I think a mistake that people make with agentic commerce is they jump straight to the most extreme version. They hear the phrase and think: some system that knows everything about me and decides what I need and goes off and buys it for me. And then they’re underwhelmed with the world we’re actually in. Maybe we get to that extreme eventually in some form, but we’re not there yet.

I prefer to think about it as a spectrum. The economic infrastructure you need is actually pretty similar no matter where you are on the spectrum. But the spectrum also brings some realism to it.

At the first level, AI is just removing friction from the internet we already have. It helps you research and compare options and fill out some forms and narrow down your choices. But you, the human, are still making the decision. The agent is just making that experience easier.

Then you move to where search is descriptive. No more blunt keywords and filters. It’s like: I have little kids, I need a summer camp for my kids in this budget, on these dates, with this driving radius. That’s already a better commerce experience than search plus filter.

Then you get to real delegation—and I think this is what most people would consider the minimum viable bar for agentic commerce. I give some constraints—some budget, some dates, some category, maybe a few preferences—and then the system goes and makes the purchases on my behalf.

But then there’s the further-out version, the ambient version. I don’t prompt anything and the system knows me and my seasonal needs and knows that summer camp planning is happening. That would be music to my ears. That’s the most futuristic thing.

The point is that no matter where you are on that spectrum, the economic infrastructure the internet needs starts to change. Even the earlier stages force a redesign of payments infrastructure because the old model—humans sitting in front of a browser, creating an account, choosing a plan, filling out forms, clicking purchase, entering card details—not all those steps are happening anymore.

I think there are two worlds I reason about preparing for. One is agent-assisted buying—I’m ultimately in charge, but the discovery and checkout and payment happen inside AI interfaces instead of on a merchant website. I’m not going to Nordstrom; I’m buying within Gemini or ChatGPT or Meta.

What’s challenging here is two things. One, the AI agent needs to be able to understand the merchant’s products and prices and checkout flow so that it can act on behalf of the consumer. Two, trust can break down. As a consumer, I don’t want to hand off my credentials to an agent. As a merchant, I don’t want to let every bot through—I want to know if it’s a good bot acting on behalf of a legitimate customer.

The agentic commerce protocol, which we co-created with OpenAI, is the shared technical language between AI systems and businesses. It shows up across a lot of surfaces. We built it with OpenAI, but Microsoft Copilot uses it, Meta’s in-ad shopping experience uses it.

How it works is: the merchant only has to integrate once with Stripe for their product catalogs, their prices, their checkout flows. Then they can literally from the dashboard turn themselves on through a whole host of agents and be exposed through those shopping experiences.

Importantly, the merchant remains the merchant of record, and that part really matters. Businesses want access to these new storefronts, these new channels, but they don’t want to give up the customer relationship. They don’t want to give up control over trust or fraud.

Category one is: the human is still leading the buying, but the agent is facilitating the transaction. You could call it agent-to-commerce, you could call it facilitated commerce.

Dan Shipper

How does that actually work? Is the experience something like I’m in ChatGPT and it says, “Here’s a thing you might want to buy,” and I can click checkout from OpenAI, and that’s using that protocol to then go send my information to the merchant and then send me back, “Hey, your thing’s on the way”?

That’s kind of what you’re talking about?

Emily Sands

Exactly. Yeah. Same thing—you’re in Facebook, you get an ad in Meta, you do a one-click checkout. One of the primitives we built for this is the shared payment token, or SPT. It just lets your payment credentials be passed securely from the AI agent to the merchant so the merchant can process the transaction. The merchant processing the transaction is important because that allows the merchant to remain the merchant of record.

But you don’t want your credentials viewed by the agent, which is why it’s a token and not your actual payment credentials. And the merchant needs to know that you and the agent are good, which is why as part of the shared payment token, we pass over a whole host of fraud scores.
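To make the token idea concrete, here is a toy in-memory model of the pattern: the agent holds only an opaque token, and the merchant redeems it and receives a fraud score alongside the charge. This is not the real shared payment token format or any Stripe API. `ToyVault`, the `tok_` prefix, the merchant scoping, the amount cap, and the single-use rule are all assumptions of this sketch, and the card number is a dummy value.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    card_number: str       # real credential; never leaves the vault
    merchant_id: str       # the only merchant allowed to redeem the token
    max_amount_cents: int  # assumed spending cap for this token
    risk_score: int        # hypothetical fraud score shared with the merchant


class ToyVault:
    """Hypothetical issuer that swaps a card for an opaque, scoped token."""

    def __init__(self) -> None:
        self._grants: dict[str, Grant] = {}

    def issue_token(self, card_number: str, merchant_id: str,
                    max_amount_cents: int, risk_score: int) -> str:
        token = "tok_" + secrets.token_urlsafe(16)
        self._grants[token] = Grant(card_number, merchant_id,
                                    max_amount_cents, risk_score)
        return token  # this opaque string is all the agent ever holds

    def redeem(self, token: str, merchant_id: str, amount_cents: int) -> dict:
        grant = self._grants.get(token)
        if grant is None:
            raise PermissionError("unknown or already redeemed token")
        if grant.merchant_id != merchant_id:
            raise PermissionError("token was issued for a different merchant")
        if amount_cents > grant.max_amount_cents:
            raise PermissionError("amount exceeds the token's limit")
        del self._grants[token]  # single use (an assumption of this sketch)
        return {"charged_cents": amount_cents, "risk_score": grant.risk_score}


vault = ToyVault()
token = vault.issue_token("4242424242424242", "costume-shop", 5_000, risk_score=12)
receipt = vault.redeem(token, "costume-shop", 3_499)
print(receipt)
```

The agent can pass `token` along without ever seeing the card, and the merchant still runs the charge itself, which is what lets it remain the merchant of record.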

Dan Shipper

Can I integrate this? We have a bunch of software. Can I offer agentic checkout easily, or does it have to go through the OpenAIs and the Facebooks of the world?

Emily Sands

Yes, you can. And I think one of the premises here is—just like to date we haven’t seen one model provider to rule them all or one model to rule them all—we don’t think there’s going to be one agentic shopping experience to rule them all.

Merchants will literally break if they have to integrate with every single potential new storefront. When they integrated with the internet, they built their own storefront and iterated on it, but basically they built it once. If you tell them, “Hey, you need to build your storefront for agent shopping startup X and Perplexity and OpenAI and Meta,” their eyes are going to get bigger than their heads and they’re not going to be able to handle it.

We really want to abstract away that complexity for businesses. We spent the last decade-plus helping businesses sell wherever their customers are. First that was on their websites, then it was in apps, then it was through platforms and marketplaces, and actually some in person too with our Terminal product.

But now, where are the consumers? Where are they wanting to buy? Increasingly through AI tools and agentic flows. We just want to make it really easy for merchants to agnostically participate in those different storefronts. They can choose where they want to sell, they can turn it on—a little toggle in the dashboard. But it’s not a different integration, which is the whole idea of the protocol.

(00:50:00)

Dan Shipper

How often is this happening? What’s the volume of agentic commerce right now?

Emily Sands

The volume of agentic consumer commerce is still relatively small as a percentage of all of the commerce we see. But it is growing quickly, particularly for what I would think of as commodities.

What is the first thing people are comfortable buying through agents? It’s things that are reasonably known, reasonably observable, not super high-priced. When people started buying online, you didn’t imagine they were going to go online and buy a $2,000 couch. Or a mattress—oh my God, these mattress companies that have blown up. It took time for them to build comfort making higher-price purchases, making more quality-dependent purchases.

Today it’s predominantly commodities.

Dan Shipper

Give me an example of one of these commodities and also what the order of magnitude we’re talking about when we say it’s relatively small.

Emily Sands

An example of a commodity would be a Halloween costume.

Dan Shipper

Got it. Agents are buying Halloween costumes for themselves.

Emily Sands

Agents are buying Halloween costumes. How many lazy parents are there in the world?

I think the consumer side is interesting too because we talked about what businesses need—they need a fast, easy way to safely expose their products, their prices, their inventory, their checkouts, understand fraud, and be in control of the relationship. From the consumer angle, the question’s a little different. Even if I’m a lazy parent, I’m not so lazy that I’m willing to give someone my payment credentials and let it rip. The question for me is: how do I safely let an agent buy on my behalf?

Have you heard of Link?

Dan Shipper

Yeah, I’ve used Link.

Emily Sands

Amazing. Link is our consumer wallet. What did you use it for? Do you remember the first thing you used it for?

Dan Shipper

I mean, I use it all the time. It’s everywhere.

Emily Sands

Amazing. Yeah, it’s everywhere. You wouldn’t believe where. I was getting soccer lessons for one of my kids from a local guy, and I was on their website and they only accepted Visa and Mastercard—neither of which I had on me—or direct debit from my bank account, which I wasn’t going to put in this very janky website, or Link. And I was like, “Oh, amazing, Link is here.” Great, problem solved.

Anyway, a lot of people know about Link as our consumer wallet for buying soccer classes. It speeds up checkout. But it’s already used by about a quarter of a billion consumers. It’s not a small network. What I think is most interesting about Link is it’s a very dense network when it comes to AI.

Lovable is an interesting example. 58% of their payment volume runs through Link. You are hyper AI-pilled. It is not surprising that everywhere you are, Link is.

What’s changing now is that we’re evolving Link for the AI economy because so many of the Link consumers are already AI consumers. Acknowledging that agents themselves are becoming economic actors, the model isn’t “give a random agent your card and hope for the best.” Instead, it’s delegated authority with guardrails. You as the consumer decide which agents are allowed to request credentials and under what conditions and with what limits, and whether those purchases require approvals before they go through.

You do all of that through Link. It’s just a much more sensible model for delegated purchases.
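One way to picture "delegated authority with guardrails" is as a per-agent policy check that runs before any credential is released. The sketch below is an assumption-laden illustration, not how Link works: `AgentPolicy`, the category allow-list, and the cent thresholds are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


@dataclass(frozen=True)
class AgentPolicy:
    """Hypothetical guardrails a consumer sets for one agent."""
    allowed_categories: frozenset
    per_purchase_limit_cents: int  # hard cap per purchase
    approval_above_cents: int      # purchases above this need a human yes


def evaluate(policies: dict, agent_id: str, category: str,
             amount_cents: int) -> Decision:
    """Decide whether an agent's purchase request may receive credentials."""
    policy = policies.get(agent_id)
    if policy is None:
        return Decision.DENY  # agents the consumer never authorized get nothing
    if category not in policy.allowed_categories:
        return Decision.DENY
    if amount_cents > policy.per_purchase_limit_cents:
        return Decision.DENY
    if amount_cents > policy.approval_above_cents:
        return Decision.NEEDS_APPROVAL
    return Decision.ALLOW


policies = {
    "shopping-agent": AgentPolicy(frozenset({"costumes", "groceries"}), 20_000, 5_000),
}

print(evaluate(policies, "shopping-agent", "costumes", 3_499))
print(evaluate(policies, "shopping-agent", "costumes", 12_000))
print(evaluate(policies, "unknown-agent", "costumes", 3_499))
```

The default is denial, so an agent gets credentials only for requests the consumer has explicitly scoped, and larger purchases fall back to a human approval step.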

Dan Shipper

That makes sense. Emily, this was a fantastic conversation. I learned so much.

Emily Sands

Awesome. Thank you for having me.

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

One App to Rule All Knowledge Work

Katie Parrott / Context Window — 2026-04-28 14:00:00 -0400

Midjourney/Every illustration.

OpenAI’s Codex desktop app has become Every’s head of growth Austin Tedesco’s daily driver, handling everything from email triage and go-to-market planning to KPI tracking and recruiting. Last week, he and CEO Dan Shipper showed more than 250 paid subscribers exactly how they use it in our Codex Knowledge Work Camp. Read to the end for how to review business documents with Austin’s compound knowledge plugin.—Kate Lee

Signal

Coding apps are the new operating system for knowledge work

What happened: OpenAI’s Codex desktop app may have started life as a product for senior engineers pair programming with AI, but these days it’s equally good for powering other types of knowledge work. Every’s head of growth, Austin Tedesco, now runs roughly 80 percent of his daily workflow through Codex—a tool that, at our Codex Knowledge Work Camp, he said was “trash” for non-engineers just three to six months ago.

Why it matters: OpenAI, Anthropic, and Cursor are all racing to ship a unified product for handling code and knowledge work, and they’re converging on a single standard: an agentic terminal or chat interface with a left-hand project sidebar, plus connections to all the tools you already use like Gmail, Slack, Notion, and Stripe. These connections, for many non-engineers, were the missing piece of the puzzle.

What it means: Switching between ChatGPT and Claude based on the models’ personality differences might become a less-common occurrence. Instead, your desktop AI app has your API keys, your project files, and your daily workflows. Businesses, especially, with custom skills and plugins and months of company data in Codex won’t casually swap to Claude Code or Cowork next quarter—and vice versa.

Watch for the desktop apps to converge further on shared patterns beyond project folders that load themselves and plugin connectors to your most commonly used tools. These new patterns may define the next decade of office software.

What to do this week:

If you’ve been working in the web interface, download one of the desktop apps—Codex or Claude Code/Cowork—and spend a session there. The work feels different once you’re outside the browser tab.
If you’re already on a desktop app, poke around its integrations and capabilities section. There’s almost always something useful lurking, like Anthropic’s design and marketing plugins, or Codex’s PDF creation skill. Pick one and try it.

Now, next, nixed

Now: Documents written for both humans and agents. In the past, anything you wrote at work fell into one of two buckets: polished prose for people or structured data for machines. Agents are the first readers that need both. At Every, our guides on compound engineering and agent-native architectures exemplify this hybrid.

Next: Documents that write back. The latest internal version of Proof, our document editor for AI-human collaboration, supports agentic loops: The agent continuously monitors the document for changes and comments and suggests edits without you needing to interrupt your writing flow. The document seems to come alive, growing around your words in real time.

Nixed: Pretending the human wrote it. The pretense that an agent-written document has to sound like the human who sent it is a relic of a bygone era—especially if other agents are reading too. Provenance matters less if you’ve reviewed it and stand behind it.

Steal this workflow

Let the agent tell you what to automate

Some people hesitate to delegate work to agents because they struggle to think of a good use case. Try flipping it: Hand the agent the keys and ask it what to do.

Open Codex (or Claude Code). Connect your top three tools, like Notion, Slack, and Gmail. Give the agent full permissions—it can’t find patterns in what it can’t see.
Prompt: “Look at how I use my connected tools. Suggest five automations that would save me time, and rank them by how much friction they’d remove.” It might suggest a morning briefing based on your calendar, or ways to triage your inbox.
Pick the easiest one first. Have the agent draft replies to unanswered messages at the end of each day. Run the automation for a week, then audit the misses.

You won’t know the agent’s capabilities until it has access to your real tools and a reason to use them. Skip the guesswork and let it show you.—Laura Entis

Skill share

Reviewing work with the compound knowledge plugin

Compound engineering turns every coding session into training data for the next one, so that the agent gets a little smarter about your codebase each time you use it. Compound knowledge does the same thing for memos, plans, and KPI sheets. The review step, launched with the /kw:review command, ensures that the AI doesn’t start off on the wrong foot.

What it does. The plugin reviews any Codex or Claude Code plan for alignment with your company’s strategy and the project’s goals, and verifies the underlying numbers, before the agent gets to work. It’s the difference between “the agent wrote a plan” and “the agent wrote a plan that doesn’t contradict the last three executive meetings.”

Why it matters. Most plugins for agents are built for engineers reviewing code. Code review happens after the code’s already written and tested. Compound knowledge assumes operators are reviewing memos, KPI sheets, or recruiting lists, where the verifiable failure might be a confidently wrong data point—which has to be caught before a plan is enacted.

Steal it. Compound knowledge is public on Every’s GitHub. Install it, drop your company context into the project files, and, with some practice and calibration, you’ll have a reviewer that knows your business.

Inside Every

Final approval in the final context

Austin runs his compound knowledge loops in Codex, but he always signs off on the agents’ work in the destination app. He approves Slack drafts in Slack, where he can see the channel’s recipients. He checks agent-produced email drafts in Gmail, and strategy memos in Notion or Proof.

This is context-switching as a safety feature. The destination app reminds you that AI is now acting on something real—that the message is going to a person, or the document is about to anchor a launch—in a way a chat window can’t.

As agents move deeper into the stack, though, the question becomes: Is the destination app the right venue for the final pass forever, or does the approval step need its own surface? And as OpenAI, Anthropic, and others race to own the management layer, will it become another part of the archetypal user interface for knowledge work?—LE

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

For sponsorship opportunities, reach out to sponsorships@every.to.

You Are the Most Expensive Model

Mike Taylor / Also True for Humans — 2026-04-27 07:00:00 -0400

Midjourney/Every illustration.

Not every step in an AI workflow needs the smartest AI. That may sound obvious, but it’s not how most people are working. The default is to route entire tasks through frontier models, which is expensive, slow, and usually unnecessary. Incremental determinism starts from a different question: How much intelligence does this task really need? The answer is almost always less than you’d expect, and the savings add up.—Mike Taylor

There is a reason McDonald’s would never ask its CEO to man the burger grill: It would cost the company $9,230.77 an hour. It’s the same as using frontier AI models to do every task—you don’t need to pay 75 cents every half hour ($1,095 per month!) for Claude Opus to check your to-do list in OpenClaw.
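
The numbers in this comparison can be checked with a few lines of arithmetic. This is only a rough sketch: the 75 cents per check and the half-hour interval come from the text, while the 2,080-hour work year and the $19.2 million annual pay are assumptions, chosen here because they reproduce the hourly figure quoted above.

```python
# Figures from the text: one to-do check costs $0.75 and runs every half hour.
COST_PER_CHECK = 0.75
CHECKS_PER_DAY = 24 * 2


def monthly_cost(cost_per_check: float, checks_per_day: int) -> float:
    """Average monthly cost, spreading 365 days over 12 months."""
    return cost_per_check * checks_per_day * 365 / 12


def hourly_rate(annual_pay: float, hours_per_year: int = 2080) -> float:
    """Hourly rate for a given annual pay.

    ASSUMPTION: 2,080 hours is a 40-hour week times 52 weeks.
    """
    return annual_pay / hours_per_year


print(f"Frontier model on a timer: ${monthly_cost(COST_PER_CHECK, CHECKS_PER_DAY):,.2f} per month")

# ASSUMPTION: $19.2 million is a hypothetical pay figure, not stated in the text.
print(f"CEO at the grill: ${hourly_rate(19_200_000):,.2f} per hour")
```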

This tension isn’t really about the pricing of AI models—it’s about the value of human attention. Now that you have a cheaper alternative for many tasks that used to require your attention, you need to figure out how to deploy AI in a way that frees up your most expensive model—you. Most businesses are getting this balance wrong in both directions: overpaying for AI on simple tasks and underusing it on ones that would free up their best people.

The solution is a process of optimization that I call incremental determinism. Every time you repeat a task, build it into a repeatable process by creating a skill file. Identify which parts of that process need the most expensive model, which can be delegated to cheaper, less powerful models, and which tasks repeat often enough to justify turning them into reusable code. And finally, get better at delegating so you can stay focused on the work that needs you.

I call it incremental determinism because the more you repeat a task, the more it pays to nail down exactly how it should be done. The first time, you figure the task out as you go, but after doing it a few times, you can document the best approach. “Deterministic” is a programming term for code that always produces the same output given the same input. The goal is to push as much of your workflow towards that end of the spectrum as possible, because deterministic steps are faster, cheaper, and more reliable. The tradeoff is the upfront investment needed to systematize the task.

There are four levels for achieving this balance and optimizing AI costs. Depending on your technical fluency, you don’t have to go to the final level, but understanding how each one supports the next will help you control AI costs across your entire organization.

Level 1: Turn sessions into skills

The first level is the easiest. Let’s say you are often asking AI to generate a PowerPoint pitch deck. The first step toward systematizing it is to make a skill. A skill can be as simple as a text file detailing how to do a task that the model follows each time it’s asked. It’s the McDonald’s handbook that tells every employee how to make the perfect burger, over and over again. Even less experienced cooks can get a good result.

Once you’re done with the normal back and forth of giving the AI the necessary data and context for the presentation, ask it, “What information would have been useful to know at the start of this task that would have eliminated several steps or mistakes?” Claude knows what it is capable of, so you can ask it to turn its response into a PowerPoint deck creation skill to use next time. Anthropic has been releasing plugins (collections of skills) for various industries to serve as a starting point. They even provide a “skill-creator” skill that teaches Claude how to guide you through making one when you ask.

Once you have a skill, test it. Ask Claude to measure the skill’s efficacy with the following prompt: “Run the task using subagents, one with the skill, one without, and compare the results.” If the skill is doing its job, you should see an improvement in quality, cost, and speed. Now try running it with a cheaper model—“Run this test again with Sonnet/Haiku”—and compare the results. If you’re happy with the output, ask Claude to “Use a subagent with Sonnet/Haiku when calling this skill.” You are using a subagent because you don’t want the model that you are using for your main session—the more expensive one—to be the model executing the task, so the separate, cheaper subagent does the work. You just cut the cost of running that task by a factor of 10 to 100.

It doesn’t make sense to write skills for throwaway tasks you won’t do again. But if you find yourself doing something for the third time, it’s probably worth formalizing it. If you’re using it multiple times per week, try getting it working with a smaller model.

Level 2: Turn skills into evals

Your team might see your skill and want to use it to create their presentations as well. While it’s easy to share skills across your organization, you’ll have to get them to trust that your skill delivers before they’ll adopt it. For that, you’ll need evidence in the form of evaluation metrics, or evals.

For the simplest eval, gather 10 examples of tasks your skill has been used for—say, the last 10 decks you have made with the skill—and rewrite the output to be the gold standard or best-in-class example of what you’d hope Claude could produce. Now, ask Claude to “Run each test case with subagents and compare the output versus my gold examples.” Make changes to the skill and test if it does better. This is the “LLM-as-a-judge” technique—you’re using a model to grade its own work against your standard.

In the spirit of incremental determinism, you should formalize your evals over time, too. Ask Claude to “Break down the patterns between what makes a ‘good’ answer (gold examples) versus the typical output of the skill.” It might say that one pattern for a good answer is following brand guidelines, another pattern is including four to five bullet points of commentary on a specific slide, and a third is calculating the correct numbers.

Once you have several evals, you can combine them into a single score. Each eval becomes one “judge”—it looks at the output from one angle, such as data accuracy, and returns a score. You can weight each judge based on how much that dimension matters to you, then average the scores together. This “panel-of-judges” approach lets you track overall quality as a single number. The on-brand eval might be worth 40 points to you, the correct numbers could be 50, and the bullet points worth 10. Each prompt you test can then be scored out of 100, allowing you to compare how well one approach works versus another. Claude is a human-level prompt engineer and runs this process as a matter of course if you use the skill-creator function Anthropic provides.

Let’s come back to our patterns of good output for a PowerPoint deck. Validating the data is more important than whether you’re missing a bullet point or using the right visual components, so you could weight that eval as 60 percent of the overall score versus 20 percent each for the other two. Together, they give you a weighted average score for measuring how well your skill is performing. For companies where getting a pixel out of line is a fireable offense, such as top-tier consulting or finance firms, you can raise the relative weighting of the visual eval.
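
The panel-of-judges scoring described above boils down to a weighted average. Here is a minimal sketch, assuming each judge has already returned a score between 0 and 1. The judge names, the sample scores, and the `weighted_score` helper are all hypothetical, and only the 60/20/20 weighting comes from the text.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Combine per-judge scores (each 0.0 to 1.0) into one score out of 100."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same judges")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in weights)
    return 100 * weighted_sum / total_weight


# Weights from the text: data accuracy counts for 60 percent, the others 20 each.
WEIGHTS = {"data_accuracy": 60, "brand_guidelines": 20, "bullet_points": 20}

# Hypothetical judge outputs for two versions of the skill.
deck_a = {"data_accuracy": 1.0, "brand_guidelines": 0.5, "bullet_points": 1.0}
deck_b = {"data_accuracy": 0.5, "brand_guidelines": 1.0, "bullet_points": 1.0}

print("Version A:", weighted_score(deck_a, WEIGHTS))
print("Version B:", weighted_score(deck_b, WEIGHTS))
```

Because the weights are normalized by their sum, the 40/50/10 point split from the earlier example works the same way. A firm that cares most about visual polish would raise the `brand_guidelines` weight rather than change the code.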

Now, you have proof you can share with the team about the impact your changes are making on skills. When the next big model comes out, you can test how much better it does on your benchmark and whether it’s worth the extra cost.

Level 3: Turn evals into scripts

When your skill is working reliably, and you’re using it frequently enough that the token cost is starting to feel significant, you need to start thinking about scripts, CLIs, or MCPs. This is where the steps get slightly more technical, but the principle is the same: Replace thinking with a structured process wherever your thinking doesn’t add anything extra.

Every skill, like your PowerPoint deck skill, is a bundle of actions—pull this data, reference our brand guidelines, create a .pptx file—and some of those actions don’t require a smart model. Some don’t even require an LLM at all. Deconstruct your skill into its component parts and hard-code whatever you can. Code costs almost nothing to run and returns in an instant compared to LLMs, so the more of your workflow you can make deterministic, the cheaper and faster it will be.

For our PowerPoint creation task, you can have Opus write the HTML and CSS templates for the slide deck once, then fill them in with code each time you need to create a deck. You can also write a script to pull the right revenue or sales figures from a data source, no LLM involved. The final export step—to .pptx format—can also be done in code.
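
As a rough illustration of the deterministic steps described above, the sketch below fills a slide template with figures read from a CSV. Everything in it is hypothetical: the template, the column names, and the sample data are stand-ins, and the .pptx export step is left out.

```python
import csv
import io
from string import Template

# Hypothetical slide template. In practice this would be the HTML a model wrote once.
# In string.Template, "$$" is a literal dollar sign and "${revenue}" is a placeholder.
SLIDE_TEMPLATE = Template(
    "<section class='slide'>\n"
    "  <h1>$title</h1>\n"
    "  <p>Revenue: $$${revenue}</p>\n"
    "</section>"
)

# Hypothetical data source, inlined so the example is self-contained.
SAMPLE_CSV = "quarter,revenue\nQ1,120000\nQ2,135500\n"


def load_revenue(csv_text: str) -> dict:
    """Read revenue per quarter from CSV text. No LLM involved."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["quarter"]: int(row["revenue"]) for row in reader}


def render_slide(quarter: str, revenue_by_quarter: dict) -> str:
    """Fill the template; the same input always gives the same output."""
    return SLIDE_TEMPLATE.substitute(
        title=f"{quarter} results",
        revenue=f"{revenue_by_quarter[quarter]:,}",
    )


html = render_slide("Q2", load_revenue(SAMPLE_CSV))
print(html)
```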

For tasks that require some judgment, like checking your deck’s compliance with brand guidelines, don’t jump straight to the most expensive model. Platforms like OpenRouter allow you to call any of the major commercial or open-source models, so you can experiment with the tradeoffs between cost and intelligence. Basic classification and summarization tasks can be done by older models 1,000 times cheaper than Opus with reasonable accuracy. Leave the most challenging tasks, such as the narrative and tailoring the tone to a specific audience, to Opus.
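
One way to picture this routing is a lookup table from task type to model tier. This is a minimal sketch, not a recommendation: the tier names and task labels are placeholders, and the mapping would need to be tuned against your own evals.

```python
# Hypothetical mapping from task type to model tier.
# The tier names are placeholders, not real model identifiers.
ROUTES = {
    "classification": "small-model",
    "summarization": "small-model",
    "brand_compliance": "mid-model",
    "narrative": "frontier-model",
    "tone_tailoring": "frontier-model",
}

# Unknown task types fall back to the most capable (and most expensive) tier.
DEFAULT_MODEL = "frontier-model"


def pick_model(task_type: str) -> str:
    """Return the cheapest tier assumed to handle this task type."""
    return ROUTES.get(task_type, DEFAULT_MODEL)


for task in ("summarization", "brand_compliance", "narrative", "something_new"):
    print(f"{task} -> {pick_model(task)}")
```

Falling back to the frontier tier for unknown tasks trades cost for safety. A stricter setup might raise an error instead, so that new task types get an explicit routing decision.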

Level 4: Turn scripts into better scripts

In the previous step, you replaced as much LLM thinking as possible with deterministic code, bringing the cost of your PowerPoint skill down 10 to 90 percent compared to only using Opus. But you were only optimizing for your own use. When your skill is running inside a product, creating hundreds of decks a week, cost inefficiencies will again become a problem. For this, you will need to build a process to automate the optimization. Once you have 100 to 200 examples of the skill being used in the real world, a reliable basket of eval metrics, and a clear map of what the skill does at each step, you have everything you need to do so.

The most common tool for this is DSPy, which can automate the prompt engineering process end-to-end. It runs your prompt, looks at the test cases, and rewrites the prompt to arrive at a more accurate outcome, often with a cheaper model. Another common approach is distillation. You use Opus to generate hundreds of high-quality examples that pass your evals, then use those to teach a cheaper model to produce similar results. You can do that by either including the examples in the prompt so Haiku can pattern-match against them, or by fine-tuning the cheaper model directly on the examples. Think of it as a head chef writing such a good recipe that a less experienced cook can follow it perfectly. This process can cost $10, $100, or $1,000, depending on the model and how many test cases you have, but spending $1,000 to save millions in production is worth it.
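
The in-prompt flavor of distillation can be sketched without any particular library. The example below assumes you already have outputs from the expensive model along with their eval scores. The helper names, the score threshold, and the sample data are hypothetical, and the call to the cheaper model is left out.

```python
from typing import List, Tuple

# (input, output, eval score out of 100)
Example = Tuple[str, str, float]


def select_examples(examples: List[Example], min_score: float, limit: int) -> List[Example]:
    """Keep only examples that pass the eval threshold, best first."""
    passing = [ex for ex in examples if ex[2] >= min_score]
    return sorted(passing, key=lambda ex: ex[2], reverse=True)[:limit]


def build_few_shot_prompt(task: str, examples: List[Example], new_input: str) -> str:
    """Put the selected examples in front of the new input for a cheaper model."""
    parts = [task, ""]
    for source, output, _score in examples:
        parts += [f"Input: {source}", f"Output: {output}", ""]
    parts += [f"Input: {new_input}", "Output:"]
    return "\n".join(parts)


# Hypothetical outputs from the expensive model, already scored by the evals.
scored = [
    ("Q1 revenue up 12%", "Revenue momentum builds in Q1", 95.0),
    ("Churn fell to 2%", "Churn", 60.0),
    ("Hiring plan approved", "Team growth gets the green light", 88.0),
]

best = select_examples(scored, min_score=80, limit=2)
print(build_few_shot_prompt("Write a slide title.", best, "Q2 margin improved"))
```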

More experimental approaches are emerging, too. Andrej Karpathy’s autoresearch runs experiments to optimize a script file against an eval metric over long periods. Researchers wake up to more than 20 experiments run overnight with meaningful performance improvements.

The great enemy at this level is overfitting: The skill or script works well against your eval metric but fails on tasks it hasn’t seen before. It’s “teaching to the test” for LLMs. The evals in the previous step are your main defense against this, because they give you a formal rubric for grading its performance. Human involvement in the evaluation process is necessary because we’re better able to catch behavior that goes against the spirit of the game, even if it’s not technically wrong as defined by the rules.
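
A common guard against teaching to the test, not described in the text itself, is to hold some examples back from optimization and score them only at the end. The sketch below assumes that practice. The 20 percent holdout fraction, the helper names, and the sample scores are all illustrative.

```python
import random
from typing import List, Sequence, Tuple


def split_examples(
    examples: Sequence[str], holdout_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle reproducibly, then set aside a slice the optimizer never sees."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def overfit_gap(tuning_score: float, holdout_score: float) -> float:
    """A large positive gap suggests the skill learned the test, not the task."""
    return tuning_score - holdout_score


# Hypothetical set of 100 real-world examples.
tuning_set, holdout_set = split_examples([f"deck-{i}" for i in range(100)])
print(len(tuning_set), "examples for optimization,", len(holdout_set), "held out")

# Hypothetical scores out of 100 after an optimization run.
print("Gap:", overfit_gap(tuning_score=92.0, holdout_score=71.0))
```

A gap like this is a prompt for the human review the text calls for, not a replacement for it.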

If you are a manager at a company responsible for AI, you don’t need to know how to implement any of this yourself. What matters is understanding that this optimization layer exists, that it’s what your technical team or tools are doing under the hood, and why the decision to invest can pay off.

You are the most expensive model

All of this optimization work takes time and expertise, and your attention is an even more expensive commodity than the latest models. Attention is the key word: The ladder of incremental determinism—sessions, skills, evals, scripts, optimized scripts—gives you a framework for deciding where to invest your attention. Every hour you spend optimizing a skill is an hour you’re not spending on something only you can do.

You don’t need to climb the whole ladder—having reliable skills and evals is more than enough. The point is knowing the rungs exist, so when the cost pressure hits (and it will), you know exactly which lever to pull. If you’re struggling with unreliable or expensive skills but don’t have the capability to build scripts in house, it might be time to bring in someone technical and AI-savvy to do the heavy lifting.

The cost of tokens is falling 90 percent every year for the same level of intelligence, so the task even Opus struggles with today might be easy and cheap in 12 months. Sometimes the smartest move is to overpay now and let the market do the price optimization for you.
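
To see what a 90 percent annual decline implies, here is a small sketch of the compounding. The $1.00 starting cost is a hypothetical figure, and the function simply assumes the rate quoted above holds steady.

```python
def projected_cost(cost_today: float, years: int, annual_drop: float = 0.90) -> float:
    """Cost after `years` if the price falls by `annual_drop` every year."""
    if not 0 <= annual_drop < 1:
        raise ValueError("annual_drop must be in [0, 1)")
    return cost_today * (1 - annual_drop) ** years


# Hypothetical task that costs $1.00 per run today.
for year in range(3):
    print(f"year {year}: ${projected_cost(1.00, year):.4f} per run")
```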

For sponsorship opportunities, reach out to sponsorships@every.to. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Codex Moves Beyond Coding

Every Staff / Context Window — 2026-04-24 18:00:00 -0400

Midjourney/Every illustration.

Hello, and happy Sunday! Kieran Klaassen’s compound engineering plugin has crossed 15,000 GitHub stars, and this week it got a substantial update. It now works across more tools, comes with more built-in agents and skills, and has a cleaner setup flow—try it and let us know what you think.—Kate Lee

Knowledge base

“Vibe Check: GPT-5.5 Has It All” by Katie Parrott/Vibe Check: The newly released GPT-5.5 is faster and easier to work with than its predecessors while also outperforming them on serious engineering tasks. Every’s testing found it to be the strongest OpenAI model for writing in about a year, and its biggest edge over Opus 4.7 shows up when working with an existing plan or system. Read this for the benchmark results, Reach Test ratings, and guidance on when to reach for GPT-5.5 versus Opus 4.7.

“Introducing Monologue Notes: Record Every Meeting, Call, and Voice Memo” by Naveen Naidu/On Every: The best thinking can happen away from your desk—on walks, on calls, in meetings—and then vanishes. Monologue Notes, a new feature in the Monologue app, records and transcribes all of it, then makes those transcripts available as context for whatever coding agent you use. Read this for the two starter prompts that turn your recordings into a structured work session and try it for yourself.

🎧 🖥 “You’re the Bread in the AI Sandwich” by Laura Entis/Context Window: Dan Shipper and Kieran Klaassen work through the titular AI sandwich, where humans excel now that AI handles execution: framing the problem upfront and judging the output after. Plus: how Every’s consulting agent Claudie keeps absorbing new responsibilities instead of spawning new agents, what that reveals about the two organizational structures that will define how companies deploy AI employees, and Nityesh’s trust battery system that lets Claudie earn autonomy by learning from her mistakes. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on X or YouTube.

“Mini-Vibe Check: Claude Design Isn’t for Designers—Yet” by Katie Parrott/Context Window: Creative director Lucas Crespo put Anthropic’s new Claude Design through its paces. He finds it useful for empowering non-designers to produce on-brand assets, but poorly suited for open-ended creative work. Plus: Back-to-back security incidents at Vercel and Lovable reveal two distinct ways AI tools can expose your data, and a workflow from Nityesh Agarwal for setting up an agent-run X feed that monitors your AI stack for vulnerabilities overnight.

“Model Wars” by Laura Entis/Context Window: GPT-5.5 touched off a debate between Nityesh (Claude Code devotee) and Naveen Naidu (Codex partisan) about whether the Anthropic-vs.-OpenAI rivalry is a model question or a product one. Plus: Austin Tedesco’s four-step workflow for producing polished product videos with Remotion and Claude Code, and why prompts are replacing the download button as the front door for AI-native tools.

“How I Escaped AI Autopilot” by Katie Parrott/Working Overtime: Katie Parrott accidentally completed a client assignment twice—because she’d delegated so much to AI that her brain never bothered storing a memory of doing it the first time. Research on pilots and cognitive bias explains why fluent, polished AI output is what makes it hardest to scrutinize. Read this for the three practices she’s now using to stay focused on her work.

Log on

This week’s camp

Codex for Knowledge Work Camp: Dan and Austin showed how to use OpenAI’s Codex for drafting, research, summarizing, running tasks in parallel, and building small tools to automate routine knowledge work. Watch the recording.

In New York City

Software Is the New Media: Join us at Betaworks on April 28 for an evening conversation on how AI is changing media, content, and software—and what that means for the people building in all three. Learn more and RSVP.

Recordings you may have missed

Compound Engineering Camp: Cora general manager Kieran Klaassen and product leader Trevin Chow walked through what’s new, went deeper on the brainstorm and ideate steps, and shared examples of using the compound engineering plugin in product-focused workflows. Watch the recording.

From Every Studio

Cora’s new inbox is looking for alpha testers

Kieran is looking for a small group of alpha testers to put Cora’s new inbox experience through its paces and share feedback. The alpha version now supports drafts, snooze, grouped views, keyboard shortcuts, metadata parsing, bulk archive, undo, and a context-aware chat that can answer questions about the email you already have open.

Cora’s broader goal is to let people do email however they want, whether that means organizing by recency, categories, briefs, or eventually doing an agent-first pass with manual cleanup at the end. If you want access, reach out to Kieran at kieran@every.to.

Spiral’s API agents can now remember how you write

Spiral is adding memory to its API agents, so your writing assistant can learn your projects, preferences, and common corrections over time. Instead of restating tone, structure, or your usual edits in every session, you can carry that context forward and get drafts that pick up where the last one left off. Memory is live now through the API (it’s not inside the app yet, but stay tuned). Try it at writewithspiral.com.

Alignment

Terminal pilled. Four months ago I opened the coding terminal for the first time, and it felt like staring into a black box that might bite me. Now I’m a snob about using it instead of a desktop app.

I build dashboards for biotech companies in it. I pull clinical trial data and parse financial filings while asking AI to explain the business model to me like I’m 11, and then like I’m 15, and then like I’m a grownup. On top of all that, I run Ghostty as my blazingly fast native terminal so I can juggle multiple windows for different workstreams, and I feel like I’m in the Matrix.

I’m promiscuous about the models I use inside the terminal. It might be Claude one day, GPT the next, and whatever is new the month after that. But I will never leave the terminal. Codex and Claude Desktop and Cowork have built beautiful interfaces for exactly the work I do, and without even trying any of them, I’ve decided they’re inferior—maybe because they’re too easy to use.

The terminal gives me the sense that I passed through a threshold of frustration most people won’t, and that’s worth the tiny sliver of superiority I feel when I use it. And sitting at a terminal makes me feel like I belong with the people who know how to code, even though I don’t, really.

All it took was four months of use and a minor superiority complex, and I’ve become one of those people I used to wonder about—the ones who won’t try the new thing even when it might work better.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

Model Wars

Laura Entis / Context Window — 2026-04-24 15:00:00 -0400

Midjourney/Every illustration.

GPT-5.5 is here, and OpenAI’s latest model has it all. It’s fast enough to use constantly, personable enough to collaborate with, and assertive enough to carry a plan through serious engineering work. If you didn’t catch our full review, including benchmark results, Reach Test ratings, pricing, screenshots, and advice on when to reach for GPT-5.5 versus Opus 4.7, read our Vibe Check or rewatch the livestream, where we grilled OpenAI’s Dominik Kundel and Romain Huet on how they’re using the model.

But how will that shift the balance between OpenAI and Anthropic? That may be a product question as much as a model question. Every engineer Nityesh Agarwal and Monologue general manager Naveen Naidu weigh in.—Kate Lee

Inside Every

Codex versus Claude Code

This week, Anthropic tested removing Claude Code from the $20 Claude Pro plan, prompting an outcry from users and drawing jabs on X from OpenAI executives, who were perhaps emboldened by the big launch they knew was coming.

The exchange kicked off a Slack debate between Nityesh Agarwal, our resident Claude Code devotee, and Naveen Naidu, who rides hard for OpenAI’s coding app Codex.

Nityesh’s take: Anthropic potentially raising prices is “simple market economics”—there is a huge demand for Claude products because they’re the best available, so they can charge more. On the other hand, OpenAI’s response underscores how frustrated the company has become playing catch-up as it scrambles to replicate Claude Code, Cowork, and skills. From a product standpoint, Claude in the browser and the Claude Code command line interface (CLI) are better than ChatGPT and Codex.

Naveen’s response: Anthropic’s models are powerful, but they also burn through way too much compute in production. OpenAI is much stronger on infrastructure, and GPT-5.5 is a token-efficient model. And while it’s true Anthropic is first to market with a lot of products and features, including computer use—which allows AI to operate your computer on your behalf—OpenAI is better at execution. Naveen consistently reaches for ChatGPT and the Codex desktop app, while he finds the Claude Code app too buggy to spend any time in.

Where they agree: The Claude Code app is, indeed, bad—Nityesh concedes he only uses the CLI. And both labs misjudged how much compute they would need, but in opposite directions: Anthropic is struggling to keep up with demand, whereas OpenAI has invested heavily in infrastructure and is now scrambling to get people to use its products.

Data point

It’s not just a grammatical pattern; it’s an AI tell

Four times.

That’s how much the usage of the “not just a ___, it’s a ___” sentence construction rose in large U.S. company documents between 2023 and 2025, per Barron’s.

The rise in correlative constructions neatly tracks with the adoption of LLMs. (Source: Barron’s.)

Like the em dash, correlative constructions are so beloved by LLMs that human writers now avoid them so as not to be accused of writing with AI.

Hot take alert: That’s a bummer. The great profile writer Taffy Brodesser-Akner’s work is teeming with them. Or it was, pre-ChatGPT. Her 2018 New York Times Magazine feature on Goop uses some version of “not X, it’s Y” in almost every other paragraph.

I doubt even a writer as beloved as Taffy could get away with that today. It’s not that her trademark style is any less effective—it’s that no one would believe she wrote it.

Steal this workflow

How to (almost) one-shot a product video

After days of battling open-source video creation tool Remotion and Claude Code, trying to one-shot a video for a product relaunch, Austin Tedesco, Every’s head of growth, figured out how to get a polished clip. Here’s the workflow he runs any time he needs a social video for a product launch or feature demo, like the one he created for the relaunch of Sparkle, our agent-native app that cleans and organizes files on your Mac.

A GIF showing a clip from Austin’s product video. (Source: Every.)

Step 1: Screen-record yourself using the product you’re doing a clip on. All you need is raw footage of yourself clicking through features in real time.

Step 2: Send the recording to a model—Austin prefers Opus—and have it draft a storyboard. The recording provides a ground truth for how the UI works and what the copy says. This prevents the most frequent cause of fake-looking launch videos: plausible-but-hallucinated labels and features.

Step 3: Iterate on the storyboard. Go back and forth with the model until the hook, pacing, and beat-by-beat plan feel right.

Step 4: Hand the storyboard to a coding agent and have it build the video in Remotion. With the screen recording and the corresponding storyboard, the first full render is usually publishable. It’s not a true one-shot, but it saves a lot of time.

Now, next, nixed

Prompts are the new installers

Companies and developers are trying a new way to let users install an AI tool. Instead of pressing a download button, users copy a setup prompt, paste it into Claude Code or Codex, and let the agent install the tool.

Now: Copy prompt, paste, install. This is how we install Every’s agent-native document editor Proof: Paste a prompt into your assistant, and it handles the setup. The prompt is doing the job the download button used to do: It gets the user from “I want to try this” to “It’s running in my workflow.”

Next: Someone designs the standard version of this. The copyable prompt block becomes a normal part of product pages and GitHub READMEs (the instructions for software projects), especially for developer tools. It should work on the web and on a repository homepage, and feel as obvious as a “Sign in with Google” button.

Nixed: The download button as the main way in. The old-school way of installing software—clicking a download link and running a setup file—still makes sense when software requires direct hardware access or needs to work offline, but for AI-native tools, the front door is: Copy this prompt into your agent.—Katie Parrott

Model happenings

News you might have missed

Cowork shipped live artifacts. Claude can build dashboards and trackers inside your workspace that pull fresh data from your apps and refresh each time you open them—pouring narrative gasoline on the SaaSpocalypse fire.

Cowork artifacts allow you to create the kinds of dashboards and data-reporting visualizations that tools like ChartMogul provide. (Image courtesy of Brandon Gell.)

OpenAI gave Codex screen memory. Codex now retains what’s on your screen across tabs and sessions, so you don’t have to re-paste context every time you start a new task.
OpenAI launched workspace agents in ChatGPT. The Codex-powered feature lets teams create custom shared agents that can pull information from different sources, analyze it, and turn it into a draft or next step. It’s another signal that agents are becoming a shared team resource, rather than purely individual AI assistants.

One last thing

Nityesh has been having a lot of fun with ChatGPT Images 2.0

A couple of his recent creations include a vintage poster to celebrate the release of Monologue Notes, a new feature in our agent-native recording app, and an infographic about securing Claudie, the consulting team’s always-on AI employee.

The prompt: Turn Monologue Notes’s landing page into a vintage poster. (Image courtesy of Nityesh Agarwal.)

Laura Entis is a staff writer at Every. You can follow her on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Vibe Check: GPT-5.5 Has It All

Katie Parrott / Vibe Check — 2026-04-23 13:00:00 -0400

Midjourney/Every illustration.

Frontier models usually take a while to get used to. You have to learn their slow spots, when they need extra prompting, and when to keep a close eye on the output.

GPT-5.5, out today, feels easier to settle into. It’s fast enough to use constantly, personable enough to collaborate with, and assertive enough to carry a plan through serious engineering work. It’s better at writing than any OpenAI model we’ve used in about a year, and it produced the strongest result we’ve seen on our new Senior Engineer Benchmark, which measures how well models can rewrite a messy production codebase the way a senior engineer would. It’s rare for a model to feel easier and stronger at the same time.

The big insights from our testing:

Best on senior-engineer coding. GPT-5.5 scored 62.5 on our Senior Engineer Benchmark versus 33.5 for Opus 4.7. Humans still score in the high 80s and low 90s. The twist: GPT-5.5’s best run used an Opus-written plan.
A real writing comeback. It’s the strongest OpenAI model we’ve tested in a year, with cleaner structure and smoother logical progression than Opus 4.7.
Strong everyday knowledge work. GPT-5.5 beat Opus 4.7 on dashboards and felt dependable for creating client deliverables or customer support replies.
Best with structure. GPT-5.5 shines with a plan, an existing system, or a tight feedback loop. Opus 4.7 still has advantages on one-shot vibe coding, PowerPoint, Ruby, and some broad product-design tasks.

The full Vibe Check has the benchmark results, Reach Test ratings, pricing, screenshots, and advice on when to reach for GPT-5.5 versus Opus 4.7.

Read the full Vibe Check

And watch our video Vibe Check with Dan Shipper:

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

For sponsorship opportunities, reach out to sponsorships@every.to.

Transcript: ‘The AI Sandwich: Where Humans Excel in an AI World’

Dan Shipper / AI & I — 2026-04-22 19:00:00 -0400

The transcript of AI & I with Every’s Kieran Klaassen is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Timestamps

Introduction and the AI sandwich metaphor: 00:00:52
What compound engineering is and how it’s evolved: 00:02:33
The “work” phase of agentic coding is essentially solved: 00:04:27
Why humans belong at the beginning and the end of an AI workflow: 00:06:27
Dan’s argument for why agents can’t change frames—and how this will keep us employed: 00:11:06
Full automation remains a moving target: 00:16:51
Musical composition as a model for human-AI collaboration: 00:23:21
Find your place in an AI-accelerated world by leaning into what brings you joy: 00:26:39

Transcript

Dan Shipper

Humans are the bread in the sandwich, and the AI is in the middle.

Kieran

The AI is whatever you put on your sandwich. If you ship something or do something—if you want it to be your own—you cannot fully automate everything. It’s like art. If you want it to be yours, it needs to come from you or somehow be connected.

I believe it’s so important to do things you enjoy and love. It’s very important to make it feel great because the bar is high. The bar will always get higher. The beginning and the end—the middle can be automated pretty well. And Dan at some point said, “Oh, it’s kind of like a sandwich,” which was very funny.

Dan Shipper

Kieran, welcome to the show.

Kieran

Hello, Dan. Happy to be here.

Dan Shipper

For people who don’t know, you are the GM of Cora, and you are also the creator of compound engineering—the engineering framework and plugin that everyone inside of Every uses, and that everyone who’s really coding with agents is at least aware of, if not using.

A pleasure to have you on the show.

Kieran

Thank you. It’s always great.

Dan Shipper

I love getting to chat with you and getting to work with you, because every once in a while you figure something out and I’m like, “Holy shit, that’s definitely the future.” And you just figured something out—along with Trevin Chow, who also helps out on compound engineering—that I think has massive implications for how programming works. And I think we can also translate that to the rest of AI and its impact on work.

One of the things you’ve been doing with this compound engineering plugin is you’ve rebuilt the engineering workflow for how you should work with agents. And in thinking about that—thinking about where a human is needed and where a human should not be present inside that process—I think you’ve found something really interesting and deep about how humans and AI are going to interact with work. Do you want to explain a little bit about compound engineering and the process you’ve created, and then also explain this insight about where humans fit?

Kieran

Yeah, absolutely. Compound engineering is a philosophy of doing engineering work. We’ve realized it applies to more than just engineering—it’s product work, design work, knowledge work, and other things. But how I built it was while building Cora. I had AI and was thinking: how can I use AI to do better work more quickly?

The initial version of compound engineering really revolved around four steps. The first is planning—you make a great plan so it’s very clear what you need to build and do. Then the work phase, where the agent does the work, implements it, and actually writes the code, does the design work, or whatever work needs to be done.

The third is review. Some slop comes out—or something beautiful comes out, one of the two—but how do you know it’s good? Traditionally there’s a code review, a PR queue, where someone says, “Hey, this can be improved,” and there’s some iteration going on there.

And then the most important step is the compound step. If anything comes up during the review or during the planning that feels like a good learning—something you’ll probably run into again—you can compound that knowledge back into the system. We store that as knowledge inside the repository, and agents can reference it the next time they go into planning, work, or review. They can see the mistakes they made before so they won’t make them again. That’s the most powerful thing in this plugin.

But we started to realize more things. First of all, the work phase is kind of dumb—not in a bad way. If you have a good plan, it does the work and it’s pretty good. And then the review makes it a little bit better.

Dan Shipper

And by that you mean: having an entire phase dedicated to work in this whole system doesn’t necessarily make that much sense, when all it really means is “run the model, let the model do the thing.”

Kieran

Yeah. There needs to be a step, but what I mean by “dumb” is I don’t need to care about it—I don’t need to think about it. I trust it. And this isn’t “trust me, bro, it just works.” This is: I’ve seen that if you put in a good plan, it executes on the plan. LLMs are very good at following steps, doing deep work, working for hours or even days now.

That thing is kind of solved. The review is starting to get there too. The planning is starting to get there too. And then you hit this next question: if all these things work, where do I actually have to do anything?

Dan Shipper

Yeah.

Kieran

Did I automate myself out of a job? If everything works, where do I work? What is still the bottleneck?

There are two things we started to identify—and Trevin was a big contributor here. He’s a product person, and he said, “I need more on the product side, which is before the planning phase.” So he added a brainstorm step and an ideate step. The ideate step is really going wide—coming up with ideas in a room full of interesting people with different angles. Brainstorm is more like: I have a problem but I don’t really understand exactly what or how. So it’s very much brainstorming around the problem.

The first thing we noticed there is that at the top, it’s very important to stay in the loop with a human and really ask a lot of questions—the human should think hard, and the LLM should support the human. But then after that, if you have a good brainstorm and a clear idea of what problem you’re solving, it can create a very good plan and the human doesn’t need to be in the loop.

So that’s the first realization: here’s where it’s good to be in the loop versus not. You can see other approaches—spec-driven development, for example—that assume it’s always good to have people in the loop, and I disagree. It’s very important to know when to be in the loop versus when to hand it off, because that means we can think harder at the moments where we actually need to think harder.

The other moment comes at the end. Something comes out. How do you validate it? Well, it’s already tested—browser automated testing has clicked through everything, all the requirements are clearly specified, and it says everything works. But the beauty comes in when a human looks at it, clicks around, and has a feel for it: “Oh, this doesn’t feel right. We can polish it. We can make it better. There’s something still missing. We can make the design better.”

I learned this from doing Pomodoros. Ideally, if you finish a task after 15 minutes, you still have 10 more minutes to work on the same task—you can’t switch. And sometimes in that space, something beautiful happens because you go deeper, further than you would have otherwise.

I think that’s the other critical moment: all the way at the end, when everything is done, you can elevate everything and make it even better. And I think we need to do that, because if we don’t, it will all be slop—all the same. It’s very important to make it feel great because the bar is high, and the bar will always get higher.

So this is what we realized: the beginning and the end. The middle can be automated pretty well. And Dan at some point said, “Oh, it’s kind of like a sandwich,” which was very funny. And Dan is now referring to the AI sandwich, which I think is very cool. The sandwich is really: when do you need to think about what you’re doing and really use your brain, versus when do you offload it?

(00:10:00)

Dan Shipper

Humans are the bread in the sandwich, and the AI is in the middle.

Kieran

Yeah. The AI is whatever you put on your sandwich.

Dan Shipper

Exactly. And I think that’s really interesting and really cool because it gives me a good mental model for how I should be working with coding agents—but I think it also applies to the rest of knowledge work.

This is such an important question now, because we have all these questions about what agents are going to do, whether everyone’s going to lose their job, all that kind of stuff. I think software engineers are a little bit the canary in the coal mine. And so far, what we’ve found internally at Every is: absolutely not. We still hire software engineers. We need software engineers. But the way you’re working—what you’re doing—looks a lot more like managing. If you’re doing it well, you’re still involved, but you’re involved at the beginning and the end as this kind of sandwich. And I think the same is going to be true of every other kind of work, whether that’s copywriting, strategy, or design.

And there are deep reasons why that is the case. I want to start with an objection people will have, which is: okay, for now agents can’t do the ideate and brainstorm phases, but pretty soon they will. So then what?

They’re already starting to do the beginning of that process. And I think there’s something interesting here. If you look within any given local frame of a problem—to take a non-coding example—the problem might be “my knee hurts” and you want to solve that. But “my knee hurts” is the same kind of problem as “this feature is broken” or “customers are anxious about this part of the product.” Any problem. If you take that frame and say, “The solution is take Advil”—any part of that process, getting to the store or whatever, can be automated. DoorDash can go do it. But there’s always, even once you’ve solved it at that level, a larger frame within which to think about the problem.

If your knee hurts, you might need to stretch your IT band. Or you might need to stop running on hard surfaces every day. And each one of these addresses the same problem at a different level of the stack, from a different frame. Humans are very good at flipping and changing frames like that. Our job is to set the frame—set the bounds within which we solve the problem. And I think it’s going to be very hard for agents to do that well by themselves.

Does that resonate for you?

Kieran

Yeah, for sure. It all comes down to building an environment where the agent will thrive. And you do that by picking the right things. That’s why it’s so important to have humans with experience, humans with taste, humans who want to click around and say, “This is great” or “This is not”—and say why.

I think it’s similar to the Advil example. If you keep taking Advil, eventually a friend will say, “That’s messed up—just go fix the actual problem instead of denying it.” It might work for a while, but you need someone to shake you up. And in that case, that’s the human.

But I do think the ideation step will also become more automated. You can say, “Let’s have a persona of 100 people and run simulations of how they think and behave.” And clearly we’re going there—running simulations of millions of people, seeing how things work, learning something from that. There will be more automation, and maybe even the front step will eventually be fully automated.

But I do think that in the end, if you ship something—if you make a statement in the world—and you want it to be your own, you have to say yes or no at some point. You cannot fully automate everything. It’s a bit like making art. If you want it to be yours, it needs to come from you or somehow be connected.

So I believe having those moments where you decide—where you choose what you enjoy—is so important. That’s why it’s so important to do things you enjoy and love.

Dan Shipper

I agree. And you can imagine it being: “We’re going to simulate a billion people and make decisions based on what we think they would do.” But that would still only cover a small set of the decisions someone might actually make.

Kieran

It will never be fully solved—it’s a moving target. We always create something new, and then there’s a layer above that where we can make even bigger impact.

Dan Shipper

Especially because, for a lot of these decisions, the feedback loops are so long and the data is so rare. You may only get a couple of moments in your career where you gather the data that helps you decide about a particular thing. That’s very hard to get into language models—especially because it’s hard to gather in the first place, and they need a lot of it. That rare expertise, encapsulated in an expert who has a personality and a worldview, is hard to replicate. And you’re right—it’s always moving.

That makes me really excited about this, because I feel like we’ve been wandering in the woods for a long time on the question of what AI progress is going to mean, and how humans are going to be involved. And it just feels very much to me like the simple answer is: ride the bottleneck. Or to mix the metaphor—be the bread in the sandwich. If you do that, you’re going to be fine. It’s going to be really, really great.

Kieran

I agree. And it will be different for different people, and you will need to change some things. If you only love writing code, you need to find your way of doing that. Yes, you can still write code—but maybe it’s about beautiful code. Maybe you find a lot of value in just recognizing beauty, the way someone looks at a UI and says, “This is beautiful, this works great.” Maybe you want that for code. Some people don’t care about that, but they love that the UI should feel great, and they’ll polish it, go extra—wherever they feel joy.

And it’s also becoming much more product-focused. As an engineer, you’re going to become either more of a manager or more of a product person. Product manager, product engineer—it’s more of those things as well. So there will be some changes, but lean into making beautiful things. Whatever that means to you: beautiful code, beautiful abstractions, beautiful architecture, beautiful design, beautiful copy.

I think it’s very important to lean into what is beautiful to you, because then you’ll find a way to use an LLM to make something that gives you energy instead of draining you.

(00:20:00)

Dan Shipper

And I think there’s a deep reason why language models are not going to be as good at that. One reason is it’s just not going to be yours if you didn’t decide it, if you didn’t do it. But another deep reason is that you can think of language models as a super-intelligence that’s been kept in a box for the last year and has no idea what’s going on in the world, except for whatever it picks up when it pops out of the box. Because of that, its outputs end up being a little more generic and less personal to your situation.

You can see this in all the AI writing that reads like “It’s X not Y” and that kind of thing. To truly solve a problem well—or to truly make art, or to truly make a product that resonates with people—it has to be really well tuned to the exact problem you’re trying to solve or the exact form you’re trying to make. Language models need a lot of help to get there. That’s why you have to be on either end of them: to set the frame of the problem, and then to make sure the details are really right at the level of execution. And I think they’ll get better at this, but they’re much further from being able to do it end to end than we think.

My general bar for AGI is: whenever it becomes economically worthwhile to run an agent 24/7—it never turns off. OpenClaw is pushing in this direction, but it runs on a schedule, it has a heartbeat. You can’t just say, “Hey, go do a bunch of stuff and work all the time, spend tokens on everything constantly,” and have it be worthwhile. We’re not even close to that. Yes, we sometimes have well-specified tasks we can send a model off to work on for 24 hours, but it’s not changing frames on its own. It’s not finishing a task and then picking the next one, spending five minutes on this one and four days on that one. We’re not even close to that. I think we’re going to need some fundamental changes to the language model architecture to get there.

If they are running 24/7 like that, they’ll be a lot closer to being context-sensitive enough to do interesting creative things. But we’re not there yet.

Kieran

Yeah, I agree. One other way to look at it: I have a music background. I studied classical composition. And one of the beautiful things about music is—yes, Suno can create songs, but it will never capture a live performance, or the experience of coming up with a melody. There’s something internal in the human. As a composer or musician, if you perform something and deliver it to other people, they feel that. It’s different.

If you’re a DJ, it’s maybe somewhere in the middle—but there is something about performing, about expressing something. And I think there’s some of that element in these steps too. You see something and you feel like, “This is a little bit off here—I don’t know exactly why, but I want to change it.” And suddenly you’re performing, iterating, making something. You’re putting something into the world.

Practicing a piece, playing it 100 times—that’s not very creative, as a musician. That’s kind of the middle part. But the performance, at the end, is where you bring it out into the world to the people. That’s a special moment. And there’s a link for me with doing that polishing step at the end of a project.

And the start—if you’re a composer, coming up with something out of nothing—that’s also a special moment. Everything in the middle is kind of boring. It’s just work. But those end moments are still special, and it kind of works for making software or other things with agents as well.

Dan Shipper

I think that’s totally right. I love this art angle. Another way to say it: all work exists on a spectrum from being totally rote to being art. And art itself has many tasks within it—any kind of creative work has many tasks within it that are more or less rote.

If you’re trying to map work on that spectrum, the stuff that is more rote is just stuff you’re not going to have to do anymore. And that is a big opportunity to move a lot of the work we do to the more creative—and probably more interesting—parts of work. And to recognize that the frame is always changing. As certain things get rote, other things become what humans start to do. Yes, those will get automated too, but we’ll also keep moving along that spectrum.

The final thing that’s not automatable is art made by humans who feel something. And I think that’s beautiful.

Kieran

Yeah, it’s still scary—because what if you’re in the middle and you want to move? What if you’re trying to figure out what that means for you? This might sound very abstract and weird to some people. If you’re not an artist or haven’t really felt this in moments, it might sound like, “Oh, but that’s not me.”

But I do believe everyone has this. What brings you joy? What lights a fire in you? What do you get excited about? Whatever that thing is, you should lean into it. That can be beautiful writing, or very structured lists, or anything that just brings you happiness—you should do more of that, and use LLMs in your work toward that. That’s good.

Dan Shipper

I agree. Kieran, always a pleasure.

Kieran

Thank you. Let’s see where this goes.

Dan Shipper

See you next time.

Kieran

See you. Bye.

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

You’re the Bread in the AI Sandwich

Laura Entis / Context Window — 2026-04-22 15:00:00 -0400

‘AI & I’: You’re the Bread in the AI Sandwich

Today, we’re releasing a new episode of our podcast AI & I. Dan Shipper sits down with Kieran Klaassen, GM of Cora and creator of Every’s AI-native engineering methodology, compound engineering. Dan and Kieran discuss where humans fit now that AI can generate high-quality code, copy, strategy, and design. If the execution layer is largely solved, do engineers still have a role in the workplace?

The short answer: Yes. Think of an AI workflow like a sandwich—the model is the workhorse filling, and we’re the bread, providing framing and taste.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are the highlights:

Play to your strengths. Kieran’s compound engineering framework breaks the engineering workflow into four steps: Plan, work, review, and compound. AI takes care of the doing phase. “LLMs are very good at just following steps, doing deep work, working for hours or days, even now,” Kieran says. What’s left for flesh-and-blood humans are the steps before and after—the planning, where you frame the problem, and review, where you determine whether the output feels right (the bread!).
Humans can identify multiple solutions to the same problem; AI struggles with this. If your knee hurts, you could take Advil, stretch your IT band, or stop running on hard surfaces. Humans are good at diagnosing a problem from many different angles, an exercise agents struggle with, Dan says.
Taste is the final layer of bread. Once AI has done the work, the most important thing you can do is judge whether the output approaches the vision in your head. Does the output feel right—and if not, how can you reframe the problem until the AI produces something that does? This is what separates art, which has a point of view, from generic slop.
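
For readers who think in code, the four-step loop from the first highlight (plan, work, review, compound) can be sketched roughly as follows. This is a minimal illustration and not the compound engineering plugin itself: the names `KnowledgeStore`, `run_cycle`, and the three `demo_*` stand-ins are hypothetical, and the learnings live in memory here, whereas Kieran describes storing them inside the repository.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class KnowledgeStore:
    """Hypothetical stand-in for the learnings kept in the repository."""
    learnings: list[str] = field(default_factory=list)

    def compound(self, findings: list[str]) -> None:
        # Keep each learning once so repeated reviews don't pile up duplicates.
        for finding in findings:
            if finding not in self.learnings:
                self.learnings.append(finding)


def run_cycle(
    task: str,
    store: KnowledgeStore,
    plan: Callable[[str, list[str]], str],
    work: Callable[[str], str],
    review: Callable[[str], list[str]],
) -> str:
    """Run one plan -> work -> review -> compound pass and return the output."""
    plan_text = plan(task, list(store.learnings))  # the plan can see past learnings
    output = work(plan_text)                       # the agent executes the plan
    findings = review(output)                      # review surfaces reusable lessons
    store.compound(findings)                       # feed lessons back into the store
    return output


# Illustrative stand-ins for the agent calls (assumptions, not real APIs).
def demo_plan(task: str, learnings: list[str]) -> str:
    return f"{task} | avoid: {', '.join(learnings) or 'nothing yet'}"


def demo_work(plan_text: str) -> str:
    return f"built({plan_text})"


def demo_review(output: str) -> list[str]:
    return ["add a test for empty input"]
```

Running `run_cycle` twice with the same store shows the point of the compound step: the second plan already includes the lesson that the first review produced.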

Now, next, nixed

The agents are merging

Now: Claudie is an AI agent that runs on a Mac Mini with a Claude Max account. Since joining Every’s consulting team a few months ago, she’s been promoted multiple times and is now responsible for managing client updates, the sales pipeline, and the creation of slide decks.

Every engineer Nityesh Agarwal initially built Claudie as an AI project manager. The plan was to build separate agents to handle deck creation and the sales pipeline.

But every time he added a capability to Claudie’s plate, she exceeded his expectations. And so instead of creating more agents, Nityesh converted their planned functionality into plugins within Claudie. “There doesn’t appear to be any limit to how much this AI employee can do if you spend time building good, refined skills,” he says.

Today, each (human) member of the consulting team has a personal AI assistant tailored to their own workflow, and they use Claudie to do tasks where they can take advantage of skills—such as slide deck building—that can be shared across the team.

Next: Two organizational architectures for agents will develop simultaneously, Dan predicts. In the first model, every person at a company gets their own AI assistant. In the second, workers across the organization will rely on a single super-agent with a library of department-specific plugins, similar to Claudie, but even bigger.

In the first case, each worker can customize their agent to their exact specifications, which allows for a richer relationship but requires setup and maintenance. In the second, one specialist does the upkeep of the agent and its plugins for the whole team or company, which takes the burden off each worker, but means they can’t make any tweaks.
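
As a rough picture of the second model, a single agent with a shared plugin library might look like the sketch below. Everything here is an assumption made for illustration: the `SuperAgent` class, the `register` and `handle` methods, and the plugin names are hypothetical and do not describe how Claudie is actually built.

```python
from typing import Callable

# A plugin is modeled as a function from a request string to a result string.
Plugin = Callable[[str], str]


class SuperAgent:
    """Hypothetical single agent that routes requests to shared plugins."""

    def __init__(self) -> None:
        self._plugins: dict[str, Plugin] = {}

    def register(self, name: str, plugin: Plugin) -> None:
        # One maintainer adds or updates a plugin for the whole team.
        self._plugins[name] = plugin

    def handle(self, name: str, request: str) -> str:
        # Every team member calls the same plugin; nobody customizes it locally.
        if name not in self._plugins:
            raise KeyError(f"no plugin registered for {name!r}")
        return self._plugins[name](request)


agent = SuperAgent()
agent.register("slide_decks", lambda request: f"deck draft for: {request}")
agent.register("sales_pipeline", lambda request: f"pipeline update for: {request}")
```

The tradeoff in the paragraph above shows up directly: upkeep happens in one place (`register`), and individual workers only get `handle`.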

Nixed: A fleet of single-purpose agents shared by one team—an agent for sales tasks, an agent for product management, an agent for reports. Sadly for Claudie, she will never get to work with the sales agent Nityesh planned, Jean-Claude.

Inside Every

Motivating your AI employee

Last Thursday, I opened Slack and saw a message from our AI project manager, Claudie, announcing that her trust battery with me had dropped 0.6 percent to 28.3 percent.

The concept of a trust battery was coined by Shopify CEO Tobi Lütke, and the idea is simple: All working relationships run on trust batteries, and every exchange impacts their charge. When your trust battery with a coworker is high, they rely on you to do your job. When it’s low, everything you do is scrutinized.

With Claudie, we’ve codified that concept. Every night, a separate judge agent reviews Claudie’s interactions with our team, evaluates the quality of her work, and issues a verdict on whether her trust battery with each of us should go up or down and by how much.

The judge agent is designed to look for what went wrong rather than right because losing trust is easier than earning it. A day where Claudie consistently delivers satisfactory output in all her interactions with a team member boosts her battery by one percent, whereas a single bad day—such as pulling the wrong data—can cause her charge to fall by five percent, wiping out a week of progress.

Every night, Claudie is programmed to read the judge agent’s verdict and make updates to her memory, behavior, and scheduled tasks so she won’t make the same mistakes again. If the judge concluded she missed important context when making a client update, for example, she might add the entry “Always check the last three emails in this thread before drafting a response” to her memory. This feedback improves her performance over time.

Claudie posts a summary of what caused her trust battery to rise or fall on Slack. (Image courtesy of Nityesh Agarwal.)

Her battery levels determine what she’s allowed to do. According to Lütke, a human’s trust battery starts at about 50 percent. Because she lacks lived experience, Claudie’s started at 20 percent.
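
The asymmetry is easy to see in a small simulation. The sketch below is illustrative only and is not the code behind Claudie: it assumes the judge's verdicts are applied as percentage points, uses the one-point gain, five-point loss, and 20 percent starting charge mentioned above, and the names `TrustBattery` and `simulate` are hypothetical.

```python
from dataclasses import dataclass

START_LEVEL = 20.0    # Claudie's starting charge, per the text
GOOD_DAY_GAIN = 1.0   # a day of consistently satisfactory output
BAD_DAY_LOSS = 5.0    # a single bad day, such as pulling the wrong data


@dataclass
class TrustBattery:
    level: float = START_LEVEL

    def apply(self, delta: float) -> float:
        """Apply one nightly verdict in percentage points, clamped to 0-100."""
        self.level = max(0.0, min(100.0, self.level + delta))
        return self.level


def simulate(days: list[bool], start: float = START_LEVEL) -> float:
    """Return the charge after a run of days; True is a good day, False a bad one."""
    battery = TrustBattery(level=start)
    for good in days:
        battery.apply(GOOD_DAY_GAIN if good else -BAD_DAY_LOSS)
    return battery.level
```

Under these assumptions, five good days lift the charge from 20 to 25, and one bad day on the sixth puts it back at 20: a working week of progress erased in a night.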

A new hire doesn’t get to make strategy decisions on day one. They earn that by demonstrating judgment over time. Claudie is the same—except unlike a human, she systematically reviews each day’s failures and rewrites herself so she won’t make the same ones again.—Nityesh Agarwal

Log on

We host camps and workshops on topics like compound engineering and writing with AI to share the knowledge we’ve acquired from training teams at companies like the New York Times and leading hedge funds, and by learning and playing with AI every day ourselves.

This week’s camp

Codex for Knowledge Work Camp on April 24: A hands-on camp with CEO Dan Shipper and head of growth Austin Tedesco on using OpenAI’s Codex for writing, research, and building tools that automate routine tasks. The first 250 attendees will receive one free month of ChatGPT’s Pro plan (worth $100). Learn more and register.

Last week’s camp

Compound Engineering Camp: Cora general manager Kieran Klaassen and product leader Trevin Chow walked through what’s new, went deeper on the brainstorm and ideate steps, and shared examples of using the compound engineering plugin in product-focused workflows. Watch the recording.

Recordings you may have missed

Every x Notion | Custom Agents Camp: A free workshop where we demo the custom agents running Every’s daily operations. Watch the recording or read the write-up.

Happenings

OpenAI’s latest image model

OpenAI says ChatGPT Images 2.0, its new image generation model released yesterday, improves text rendering, web access, and visual reasoning. When we asked it to visualize our weekly standup meeting, here’s what it spat out to describe Kieran’s AI sandwich idea.

We will let you be the judge of this human-AI-sandwich hybrid. (Image courtesy of Naveen Naidu and ChatGPT Images 2.0.)

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Mini-Vibe Check: Claude Design Isn’t for Designers—Yet

Katie Parrott / Context Window — 2026-04-21 15:00:00 -0400

Midjourney/Every illustration.

Introducing Monologue Notes

Today we’re launching Monologue Notes, which turns your calls, meetings, and voice memos into transcripts your agents can use. Naveen Naidu built Monologue to capture active work, where text has a clear destination. In six months it’s logged five million dictations and 250 million spoken words. Now, Notes captures the rest: the thinking that happens on walks, in calls, and in meetings. It transcribes everything and makes it available to any agent with API, CLI, or MCP access, across your Apple devices.

Try Monologue Notes

Mini-Vibe Check: Claude Design

Anthropic recently launched Claude Design, a web-based tool that lets you feed Claude a GitHub repo, Figma file, or brand kit and collaborate on interfaces, prototypes, slides, and one-pagers. It’s powered by Claude Opus 4.7 and lives only in Claude.ai.

The stock market read Claude Design as a threat to Figma, the incumbent design tool. But traders are not designers. Having played around with Claude Design, Every’s creative director Lucas Crespo characterizes Figma’s sliding share price as “a Wall Street reflex from people who have never opened either tool.”

Claude Design can do a lot well, but it wasn’t built for designers.

Claude Design lets you upload your organization’s branding and design system. (Image courtesy of Anthropic/Jack Cheng.)

What works: Point Claude Design at a GitHub repo and it will extract a starting design system—the colors, typography, and reusable components that give a product its look. Non-designers can then extend that system. If head of growth Austin Tedesco wants to ship a careers page or a YouTube thumbnail in Every’s style without bothering the design team, Claude Design is the tool for the job.

Claude Design’s live, generative interface is also a nice touch, Lucas says. The tool starts by asking you questions—layout density, accent color, whether to animate emojis—and you can draw or leave comments on top of the output, or click a specific element and edit it in place. The sketch-on-top feature is the closest Claude Design gets to feeling like Figma.

What could be better: The menu-driven interaction. Creating in Claude Design means answering a series of text prompts about layout, tone, and color, and reacting to what the tool produces. “It feels like we’re filling a bunch of forms—design is supposed to be fun,” Lucas says.

This prompt-and-react loop works for extending or revising an existing design system. But it isn’t well-suited for starting something from scratch—design is “50 percent exploration,” Lucas says. In Figma, you start with a blank canvas, and your output is shaped by a series of decisions—drag a shape, snap it to a grid, change a drop shadow, compare three variations side-by-side. Claude Design turns the open-ended exploration into reactions to what it’s already made.

The fragile setup. During a demo, we struggled to link Every’s GitHub repos and upload Figma files. And because Claude Design is web-only for now, there’s a literal disconnect from your local files and Model Context Protocol (MCP) servers.

Final verdict: Claude Design is great for teams that want to empower non-designers to create their own assets in the house style. But it’s not yet where a designer goes to build something new.—Laura Entis

Signal

Two new ways AI tools can leak your data

Two AI-tool security stories broke inside 24 hours over the weekend. They reveal two different points of failure in AI security: one where the attack surface was the vendor, and one where it was the AI’s output.

What happened: On Sunday, Vercel, an infrastructure company behind a big chunk of the web, confirmed a breach. Except the break-in didn’t start at Vercel but at a third-party tool called Context AI. The attackers used the hacked connection to climb into a Vercel employee’s Google Workspace account, then their Vercel account, and finally into customer data—including the private passwords and credentials that customers’ apps use to connect to their payment systems, databases, and other services.

Then on Monday, vibe coding platform Lovable did damage control after users started warning each other that apps built on the platform were leaking their users’ data to the public internet. The issue turned out to be in the permissions: A basic database rule, “a customer can only see their own records,” was turned off by default in the apps Lovable generated.
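
To make the missing rule concrete, here is a minimal sketch in plain Python of what "a customer can only see their own records" means in practice. It is not Lovable's code or any real database's API; the names Record, visible_records, and enforce_ownership are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    owner_id: str  # hypothetical field: which customer the record belongs to
    payload: str


def visible_records(records, requester_id, enforce_ownership=True):
    """Return the records a requester is allowed to read."""
    if not enforce_ownership:
        # Rule switched off: every record goes to every requester.
        return list(records)
    # Rule switched on: only the requester's own records come back.
    return [r for r in records if r.owner_id == requester_id]


records = [Record("alice", "invoice 1"), Record("bob", "invoice 2")]
safe = visible_records(records, "alice")
leaky = visible_records(records, "alice", enforce_ownership=False)
```

With the rule on, alice gets one record. With it off, she gets bob's as well, which is the shape of the leak users described.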

Why it matters: Every AI app your team signs up for is a new door into your company. If the vendor gets hacked, the keys they were holding—to your email, your calendar, your codebase—walk out with the attackers. You inherit that vendor’s security posture even if your IT team didn’t pick the tool.

And when an AI writes your app for you, your app inherits the AI’s defaults. There isn’t always someone looking over the generator’s shoulder to check whether those defaults are safe. “I vibe-coded a prototype” now means “I shipped something protected by whatever rules the generator thought were fine.”

What to do this week:

Take stock of every AI app your employees have connected to a work account. Then turn on two-factor authentication everywhere it isn’t already on.
Before you ship anything an AI built—even a weekend prototype—ask the generator one question: “What is this app exposing to the public internet, and should it be?” If you can’t get a clear answer, don’t ship.
For anything touching customer data, like a CRM or billing system, pair the AI with a tool designed for a safer-by-default posture. Anthropic’s recently launched Managed Agents, for example, runs each session in a sealed-off computing environment with credentials held outside the sandbox.

Steal this workflow

Give your agent its own X feed to watch for vulnerabilities

You can’t personally stay on top of every vulnerability that might hit the AI stack you’re building on. Every AI engineer Nityesh Agarwal decided to stop trying, and assign situation monitoring to an agent that doesn’t sleep.

The workflow:

Create a dedicated X account for your agent. Nityesh made one for Claudie, Every’s consulting project manager agent, and had her follow the AI security people he’d otherwise be glued to—Anthropic researchers, independent researchers who probe systems for weaknesses, and vulnerability-disclosure accounts. The dedicated account only reads posts, and doesn’t otherwise participate in discussions.
Schedule daily jobs that scan its home feed and flag anything that looks like a disclosure. Use this prompt: “Read my X home feed from the last 6 hours. Flag any posts reporting vulnerabilities, Common Vulnerabilities and Exposures, breaches, or exploits relevant to the AI stack we use (Claude, Anthropic APIs, OpenClaw, Railway, Vercel, Supabase, our MCP servers). For each, give the source post URL, affected system, severity if stated, and one-line summary.” Run it at 6 a.m., noon, 6 p.m., and midnight.
Route flagged items to a team Slack channel. Nityesh has Claudie post to an internal channel so anyone can see what broke overnight. Add a one-word tag per item (critical / watch / fyi) to make messages easy to scan.
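
The routing step above could be sketched roughly like this. It is an illustration, not Nityesh's actual setup: the Flag fields mirror what the prompt asks for, while the severity-to-tag mapping, the function names, and the example URL are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Run times from the workflow: 6 a.m., noon, 6 p.m., and midnight (24-hour clock).
RUN_HOURS = (6, 12, 18, 0)


@dataclass
class Flag:
    url: str                 # source post URL
    system: str              # affected system
    severity: Optional[str]  # severity if the post stated one
    summary: str             # one-line summary


def tag_for(flag):
    """Map a stated severity to critical / watch / fyi (assumed mapping)."""
    severity = (flag.severity or "").lower()
    if severity in ("critical", "high"):
        return "critical"
    if severity in ("medium", "moderate"):
        return "watch"
    return "fyi"


def slack_line(flag):
    """Format one flagged item as a single scannable line for the channel."""
    return f"[{tag_for(flag)}] {flag.system}: {flag.summary} ({flag.url})"


example = Flag("https://example.com/post/1", "Vercel", "High", "token leak")
line = slack_line(example)
```

Posting the line to Slack is left out on purpose, since that depends on how your agent is connected to the workspace.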

Try it this week: Spin up an X account, follow 10 AI security researchers, and schedule a recurring Claude Code job with the prompt above.

Discuss

“Cybersecurity is proof of work now. You don’t get points for being clever. You win by paying more.”—Drew Breunig, writer and technology strategist

AI has gotten good enough at finding software vulnerabilities that security has turned into a spending contest between attackers and defenders. Both point AI at your infrastructure looking for ways in. Whoever runs more scans wins.

That adds a third step to shipping code. You write it and review it—and now you harden it. You point a model at your own system and let it hunt for exploits until you run out of budget. If you’re shipping anything that touches customer data, assume an attacker is already running that third step against you. The only question is whether you’ve run it first.

Inside Every

Star Wars got it right

To me, a lot of the charm of the original Star Wars trilogy comes from the decided lack of remote networking. They can’t hack the Death Star, so Obi-Wan has to cross the narrow bridge deep in the starship’s bowels to get to the terminal to flip the switch to turn off the tractor beam.

I used to think this was a relic from the pre-internet days when the films were made. But with frontier models growing increasingly capable of exploiting security gaps, it might be a short time from now in this galaxy. Anything of critical importance will live offline.—Jack Cheng

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

For sponsorship opportunities, reach out to sponsorships@every.to.

Introducing Monologue Notes: Record Every Meeting, Call, and Voice Memo

Naveen Naidu / On Every — 2026-04-21 03:00:00 -0400

Figma/Every illustration.

TL;DR: Today we’re launching Monologue Notes, which turns your calls, meetings, and voice memos into transcripts your agents can use.

The best thinking rarely happens at a desk. It happens in meetings, on calls, or on walks—and then disappears. Monologue Notes, out today, records and transcribes all of it—the calls, meetings, and voice memos—and makes it available to the same agents and tools you use every day. It makes the thinking that happens in conversations and on walks just as actionable as the work you do at your desk.

Notes is available through the Monologue app on Mac, iOS, and watchOS and syncs across all your Apple devices. You can start a recording on your Apple Watch before you leave the house, keep your phone in your pocket the entire time you’re outside, and pull the note into Codex once you’re back at your computer.

Try Monologue Notes

How Notes transforms passive work into active work

Notes was born out of a frustrating gap in my own workflow. Six months ago I shipped Monologue, a smart voice-to-text app that has processed more than 5 million dictations and converted more than 250 million spoken words into text.

Monologue excels at tasks where the text has a clear destination—you can get a lot more done when your dictation app understands your vocabulary and workflow. I speak, Monologue transcribes, and I send the words along to where they belong: Codex for code, Slack for messages, Notion for article drafts. The work is active.

Monologue Notes captures work that is passive—the ideas and decisions that accumulate when you’re out in the world or talking to other people.

Monologue Notes syncs across your Apple devices. (Product shots courtesy of Every.)

I start every morning with a half-hour walk around my neighborhood. I make product decisions, troubleshoot bugs in my head, and work through problems that stumped me the day before. My best thinking often happens before I sit down at my desk, but before Notes, there wasn’t an obvious central place for it to live, so it got scattered across Apple Notes, Obsidian, and Slack.

The same thing happens on customer calls and in internal meetings. Problems get discussed, solutions emerge, progress is made—but the thinking is rarely stored in a way that can be mined for insights later.

Notes is not a traditional notes product. You can access your recorded transcripts and summaries through the app, but you can’t edit files, and there’s no folder organization system. Notes is more of a transit point, an audio capture layer that runs in the background, gathers context, and makes it available to your favorite coding agent.

Once a recording is finished, you go to the place where work actually happens—your terminal, Codex, a Linear board—and have your agent find what’s useful in the transcript so it can start building.

How I’m using Monologue Notes

On morning walks. I don’t listen to music or podcasts. When I leave the house, I hit record and start thinking out loud.

There’s no agenda. Sometimes I fixate on a feature question or a tough conversation I had with a colleague. Other times, my mind wanders, cycling through topics in rapid succession.

Back at my desk, I open Codex and run the same prompt: “Pull up my last Monologue note, and start building this.”

Just like that, my rambling thoughts become action items.

With the Monologue API, command line interface (CLI), or Model Context Protocol (MCP) access, any agent or tool that can read your written notes can read your recorded ones too.

On customer calls. A few days ago I recorded a 19-minute call with a user experiencing a lag in Monologue’s browser integration on Mac. When we hung up, I opened Codex, and told the agent to pull the transcript and find the root cause. It read the user’s description of the issue, identified the bug, searched the codebase, and fixed it. I didn’t need to write a long prompt or a single line of code. Codex went straight from the call transcript to the patch.

To crystallize ideas across recordings. Over the past two weeks, I’ve been working through the distinction between active versus passive work, which is the driving idea behind the Monologue Notes launch. I captured fragments of my thinking while driving, in internal calls with my team, and during conversations with early users.

Before Notes, writing an article pitch would have required a brain dump. With Notes, I prompted Codex to “pull all my Monologue Notes where I talk about active work and passive work, and put together a brief.” It searched across about a dozen recordings, identified the through-lines, and returned a compelling thesis—an argument I’d been circling for weeks, assembled from things I’d already said.

That argument is the basis for this article.

The loop

With Monologue Notes, you record an idea → pull that idea into the place where work happens → turn the idea into action.

This workflow has cured me of storage anxiety, or the gnawing feeling my best ideas would get lost because I didn’t know where to put them. Now when I hit record, I know that Claude Code or Codex will find whatever I need when I ask for it.

It’s also made me a more disciplined problem solver. When you know everything is safely stored and quickly retrieved, you stop worrying about where your thoughts are going and focus on the quality of the thinking itself.

Two skills you can try today

Skill 1: The morning brief

Record a five- to 10-minute voice note on your commute to work. Don’t map out what you’ll say—just think out loud about what’s on your mind or what you’d like to get done.

When you’re back at your computer, open your agent of choice (Codex, Claude, ChatGPT) and connect Monologue Notes via MCP. Then paste in this prompt:

Pull my latest Monologue note and turn it into a prioritized list of tasks for today. If an item requires code, open a session. If it involves writing, start a draft.

Your scattered morning thoughts transform into a structured work session in less than two minutes.

Skill 2: Customer call → fix

Record your next customer support call or user interview with Monologue Notes running in the background. After the call, open your agent and enter this prompt:

Pull my most recent Monologue note from today. The user described a bug. Find the root cause in the codebase and write the fix.

If it’s a product conversation instead of a bug report, replace everything after the first sentence with the following:

Summarize the user’s main pain points, draft a follow-up email, and create a Linear task for the top actionable item.

The transcript becomes the input. Your agent does the rest.

Monologue Notes is available for all subscribers.

Try Notes in Monologue

Thanks to Laura Entis for editorial support.

Naveen Naidu is the general manager of Monologue. You can follow him on X at @naveennaidu_m and on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

How I Escaped AI Autopilot

Katie Parrott / Working Overtime — 2026-04-20 04:00:00 -0400

Midjourney/Every illustration.

To read more of Katie Parrott’s writing about how AI is changing work, read the latest articles in her column, Working Overtime. To read more essays like this, subscribe to Every.

Of all the ways I imagined AI might change my career, “forgetting I already did the assignment” was not on the list.

I had already sent my client a finished draft of an article on hiring best practices in South America, when I happened to reread the brief. A familiar phrase made me realize I had read it before. Then there was the statistic I was pretty sure I had already fact-checked. I clicked back through my files, and there it was: same client, same topic, same deliverable, dated four weeks earlier. It was completed, filed, and forgotten so completely that when a clerical error sent the same brief to my inbox again, I sat down and did the whole thing over.

My first thought was that this was probably early-onset something, and I should call my doctor. My second, more rational thought was that I had not lost my mind—but I had outsourced it. I had been moving so fast and delegating so much of the work to AI that my brain hadn’t even bothered to store a memory of completing the assignment.

What scared me most was thinking about all the smaller moments when I had not caught myself.

This kind of outsourcing isn’t new. Plenty of people would admit to feeling lost navigating an unfamiliar city without a phone to rely on, and I for one am lucky to remember my own phone number, let alone someone else’s. But AI does more than take work off your plate; it steps into the judgment calls you used to make yourself.

I am the last person to scold anyone for using AI. I have built AI into nearly every part of my job, and it has helped me write more rigorously, research more thoroughly, and take on projects far beyond what I used to think of as my wheelhouse. But when you accidentally offload the wrong parts—like fully understanding the purpose and intent of the piece, as I did in this case—you run the risk of atrophying the skills that matter most to you. You might even put your name on work you don’t realize you don’t stand behind until someone else starts asking questions. And if you are using AI for any kind of qualitative work, such as writing, strategy, marketing, or communications, I would bet you are doing some version of this too. Understanding why it happens is the first step to deciding which parts of the job you want back.

When trusting your tools becomes a bad thing

One group that would understand this immediately: airline pilots.

In the 1990s, researchers studying automated cockpits started noticing a strange pattern. Pilots with thousands of flight hours and lives on the line sometimes followed incorrect automated recommendations, even when the instruments in front of them suggested something was wrong. The automation had been right often enough that their brains stopped cross-checking it with the same scrutiny.

A 2010 review of decades of automation research described a larger pattern: The more reliable an automated system becomes, the more likely humans are to let it pass unchecked. When a system is usually right, your attention starts treating it as if it will keep being right.

AI is the most fluent automated system most of us interact with in a day. And fluency has its own trick. In 1999, a pair of psychologists showed people identical statements in fonts that were either easy or hard to read. The easy-to-read statements were rated as more true. It was the same words and same claims, but the version that went down smoother was judged more accurate. Your brain takes “that was easy to process” and misfiles it as “that must be correct.”

AI output goes down very smoothly. It’s grammatically polished, the tone is confident, and the clean formatting suggests something that has already been edited. The polish lets your eyes glaze over.

Every model upgrade makes the illusion of right-ness worse. The outputs get cleaner. The formatting gets better. The reasoning looks more plausible. The tool makes fewer obvious mistakes, which means the mistakes that remain are harder to see. You are reading something that looks finished, and your brain—which has been filing “looks finished” as “is correct” since long before AI existed—obliges.

Why ‘I’ll review it’ is not a plan

Before the repeat work snafu, I would have told you I was reviewing everything before sending anything. The document passed through my field of vision, I tweaked a phrase, caught one weird sentence, and felt the warm glow of editorial virtue. My brain filed that as reviewed.

The feeling of having reviewed is easy to produce. The act of reviewing is harder. You have to form your own view before the model gives you one, check the claims, and notice where the draft has made an assumption you do not share. You have to ask whether the sentence would still feel true if someone screenshotted it and sent it back to you six months later.

We talk a lot about better prompting, better models, better workflows, and better agents. We talk less about the moments when we should slow down—because that’s uncomfortable and hard. In 2021, researchers tested ways to reduce overreliance on AI. The interventions that worked best were “cognitive forcing functions,” designs that made people form their own judgment before seeing or accepting the AI’s answer.

Those same interventions also got the worst ratings from users. People did not like being made to think first. Of course they didn’t. The whole appeal of automation is that it reduces effort. A tool that says, “Before I help you, please do the hard part yourself for a minute” feels like a speed bump. But speed bumps are the solution to autopilot.

What I am trying instead

My solution to autopilot is not to give up AI and return to some imagined golden age where I nobly suffer in a blank Google Doc. But I am making some changes to how I process and finalize work to curb the tendency to ship now, think later.

Change 1: Think before you look

Before I ask AI for a draft, I try to write down my own rough position. It’s not the polished version or a full argument. Sometimes it is only five bullets—some combination of what I think, what I know, what I am unsure about, what I refuse to say, and what would make the piece useful. Then, when the model gives me an output, I have something to compare it against besides vibes.

The card in my Notion to-do list for this article, with quick notes I sketched out before going into my interview session with the AI. (Image courtesy of Katie Parrott.)

This is irritating. It also works. If I have made my own claims first, I read the AI’s claims differently. I can feel where it is smoothing over a distinction I care about. I can see where it is borrowing authority I have not earned. The draft becomes an object to argue with, not a current to float along.

Change 2: Build in a gap

If attention decays the longer you sustain it, it’s time to treat attention as the scarce resource it is and stop thinking I can review five AI outputs in a row without consequence. The answer is to introduce friction on purpose—distance between generation and review that gives your attention a chance to reset. Draft on Wednesday, review on Thursday. Write in the morning, come back in the afternoon. Send the model’s output to a different surface—for example, from the chat interface to a document, or from mobile to desktop—and read it outside the chat window your eyes have grown accustomed to.

Incidentally, a lot of this advice comes down to best practices that writing teachers have recommended for decades. A different day gives you a different brain than the one that’s high on AI’s generative excess.

Change 3: Make yourself explain why you’re accepting it

A 2026 study on AI-assisted writing found that making users explain their reasoning before accepting AI output cut mistaken acceptances roughly in half. You cannot bullshit a justification you are writing down.

So I’ve started doing it myself. Before I accept a recommendation, a framing, or a paragraph the model drafted, I make myself write one sentence answering a specific question: Why is this right for this client, this argument, this reader? If the best I can produce is “It sounds good,” I go back and look again. I have to be able to defend each sentence in front of an editor.

You still own the output

These practices help. They are also a fragile defense against tools designed to make output feel effortless, and I don’t think the long-term answer is expecting every individual to white-knuckle their way past six cognitive biases before breakfast.

This is also a design problem. The tools themselves should be building friction back in—making provenance visible, separating generation from approval, and treating human judgment as a workflow stage instead of a ceremonial click at the end. It is part of what excites me about Proof, Every’s document editor for AI-human collaboration, which tracks which words are yours and which came from the machine. The cognitive forcing functions that researchers have found keep our brains from giving in to autopilot are design patterns that should be getting baked into products as well.

Knowing the mechanism does not exempt you from it. Every bias in this story predates AI by decades. We have always trusted fluent things too quickly, gotten worse at paying attention when nothing seems to be going wrong, and preferred the path that saves effort.

The duplicate assignment still embarrasses me, even if all it cost me in the end was a few sheepish emails back and forth with my client to ensure I wasn’t crazy. I am also grateful for it, in the way you are grateful for a warning that arrives before any real damage could be done. It taught me something the research has sharpened: The central risk of AI-assisted work is not the machine thinking for you. It is the machine making it feel as if you already thought.

I am trying to get better at noticing the difference. With most pieces, I draft on one day and review on another, make myself write down what I think before asking the model what it thinks, and hope the friction is enough to keep me in the work instead of floating above it.

Katie Parrott is a staff writer. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Discover Every’s upcoming workshops and camps, and access recordings from past events.

For sponsorship opportunities, reach out to sponsorships@every.to.

The Model Got Stranger

Every Staff / Context Window — 2026-04-17 12:00:00 -0400

Midjourney/Every illustration.

Hello, and happy Sunday! Was this newsletter forwarded to you? Sign up to get it in your inbox.

Knowledge base

“Vibe Check: Opus 4.7 Stopped Reading Between the Lines” by Katie Parrott/Vibe Check: Opus 4.7 is the best coding model Every has tested on well-specified tasks—Kieran Klaassen called his Rubber Duck benchmark run “best model ever”—but it won’t infer what you want the way 4.6 did, and the prompts you’ve tuned for the last two months will likely disappoint you at first. The gap between a tight brief and a loose one is wider than in any prior Opus. Read this for the full breakdown of where to switch to 4.7 now and where to stay on 4.6.

“The Folder Is the Agent” by Kieran Klaassen/Source Code: After three months trying to make AI agent swarms work in his coding flow, Kieran Klaassen realized that what was doing the work was a folder. A project directory with a CLAUDE.md, accumulated context, and specialized sub-agents is all you need to turn a general model into a domain expert. He’s now running 44 of them, connected by a Ruby dispatch layer that routes work while he sleeps. Read this to learn how to build the dispatch layer yourself.

“(Re(Re))Introducing Sparkle: Marie Kondo Your Mac” by Yash Poojary/On Every: Yash Poojary rebuilt Sparkle so that, before it organizes anything, it purges the 80 percent of files on the average Mac that are screenshots, installer packages, and duplicates you’ll never open again. The new version runs a cleanup pass first, then proposes a custom folder structure you can reshape through chat until it matches how you like to work. Download the app and try it yourself.

🎧 🖥 “Mini-Vibe Check: Claude Managed Agents Handle the Infrastructure Work” by Laura Entis/Context Window: Dan Shipper sits down with Eve Bodnia, founder and CEO of Logical Intelligence, who argues that LLMs have a ceiling—and that energy-based models, which scan the full landscape of possible answers rather than predicting one token at a time, are what comes next. Plus: A Mini-Vibe Check on Anthropic’s Claude Managed Agents; Willie Williams proposes new vocabulary for the AI age. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on X or YouTube.

“You’re the Manager Now” by Laura Entis/Context Window: The Claude Code desktop app gets a redesign built for managing parallel agent work—and Kieran Klaassen was already living in it. Plus: Dan Shipper explains why you should ignore the viral claim that smaller models can match Anthropic’s Mythos, Austin Tedesco shares the one question he asks Claude Code before shipping anything, and Eleanor Warnock on why the Dia browser’s bet on beauty might be the right one.

“Living Software” by Jack Cheng: AI-accelerated development has made software feel zombieish—tools that shouldn’t be alive suddenly sprouting chat boxes and AI sidebars. Jack Cheng proposes a distinction: “tool-like software,” which users expect to be stable, versus “living software,” which users expect to adapt and grow. The two categories carry different expectations, and confusing them causes disorientation. Read this for his practical advice on how builders of both should design, ship, and communicate with their users.

Log on

Upcoming camp

Codex for Knowledge Work Camp on April 24: a hands-on camp with Dan Shipper and Austin Tedesco on using OpenAI’s Codex for writing, research, and knowledge work. Learn more and register.

Last week’s camp

Compound Engineering Camp: Cora general manager Kieran Klaassen and product leader Trevin Chow walked through what’s new, went deeper on the brainstorm and ideate steps, and shared examples of using the compound engineering plugin in product-focused workflows. Watch the recording.

Recordings you may have missed

Every x Notion | Custom Agents Camp: A free workshop where we demo the custom agents running Every’s daily operations. Watch the recording or read the write-up.

From Every Studio

Spiral’s new onboarding quadruples style creation

Getting started with Spiral just got a lot faster. Marcus Moretti, general manager of Spiral, rebuilt the onboarding flow from the ground up. Now, instead of clicking through six explainer screens, you drop in writing samples from your X account, a website, uploaded files, or pasted text, and Spiral generates a style guide tuned to how you write. The result: About 80 percent of new users leave onboarding with a personalized style, up from roughly 20 percent before. The sooner Spiral knows your voice, the sooner it’s useful—and the new flow gets you there in minutes.

New Spiral users: Start creating your styles at writewithspiral.com. Existing Spiral users: Try the new onboarding experience at app.writewithspiral.com/onboarding.

Alignment

How NotebookLM rewired the way I problem-solve. I am moderately dyslexic. It’s an awkward thing to be if you write for a living, because the job is essentially the piecing together of textual information into a shape other people can follow. The difficulty, for me, is not reading the words, but holding the information they contain in relation to one another.

For most of my career I have used a mind map—a messy visualization of ideas—to help me wade through the facts and opinions of dense textbooks and research papers. The diagrams worked inasmuch as they allowed me to organize information in my head, but any problem bigger than a single sheet of A4 paper was effectively closed to me until I could block out an afternoon to draw it.

NotebookLM, Google’s AI research assistant, has removed that barrier by letting me hold more in my head at once. Here’s an example: I’ve been stuck on one question for three weeks. Patients on chronic disease therapies like GLP-1s drop off at a staggeringly high rate. Roughly half are no longer on the drug 12 months after they start, because of both side effects, like nausea, and cost.

For a direct-to-consumer telehealth operator distributing the drug at scale, the analytically difficult thing is that none of the available research separates the two cleanly, and the solution to the problem of churn sits somewhere inside that mess. This is less a medical question than a management consulting one, and it’s the kind of problem where I used to feel the particular flavor of panic that comes from having a lot of data and no thesis.

Instead, I’ve been running Barbara Minto’s Pyramid Principle in reverse inside NotebookLM. Minto was the first woman McKinsey ever hired out of Harvard Business School, and she was sent to London in the 1960s to figure out why the firm’s consultants wrote such terrible memos. Her book The Pyramid Principle, which came out of that work, is the closest thing consulting has to scripture. At the top of the pyramid sits your answer, the governing thought. Underneath it sit groups of supporting points, each of which answers a why question or a how question about the layer above.

Minto is taught, almost universally, as a top-down tool. You know your answer, so you arrange your evidence beneath it. But what happens when you don’t have an answer? You run the pyramid backwards: Dump every random fact onto the page, group them inductively by what they seem to be about, write a summary for each group, and let those summaries push their way up to an answer you didn’t have when you started.

On paper, I could do it with five random facts. I could not do it with 50, which is what the GLP-1 churn question looks like once you have pulled in all the sources of information, business and medical included. Now I drop all of that information into a single notebook and ask it to group every passage that touches patient drop-off by whether it is about the drug or about the delivery model, and to give me a one-sentence summary of each group. What the sheet of A4 used to hold, the notebook now holds, and I can interrogate it from inside.

The useful thing I did not expect is how much of the work happens in the asking. Because NotebookLM will only answer from the sources I have loaded into it, the quality of my questions is the only variable that matters. Half of the process is me figuring out what I want to know and why, and at which level of the pyramid. The other half is the model doing the clerical labor of pulling the summaries together so I can read them. In the old mind-map version, I spent most of my afternoon drawing. The tool has removed the labor between me and the thinking, which—for a dyslexic writer—is most of the labor there was.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.
