Every (hello@wentsch.me)

How to Start a Career When AI Is Doing Your Entry-level Job

Katie Parrott / Working Overtime — 2026-05-18 07:00:00 -0400

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

My first job out of college was as a copywriter at a little crowdfunding website based in Columbus, Ohio, called Fundable.com. The company had no money, so they didn’t care that I had no experience. I had no experience, so I didn’t care that the job didn’t pay at first.

The offer was simple: Create a profile for your startup, and we’ll connect you with investors. Most founders didn’t want to write their own profiles, so my job was to take whatever strange, half-formed thing a founder was building and translate it into investor-speak. The profiles were so templatized I can still recite the format: problem, solution, traction, team, business model, revenue projections, competitive landscape, funding terms.

I’ve been thinking about that job lately because AI could now produce one of those profiles in two minutes. At 23, I would have heard that and thought: “Thank God.” At 36, I think: “Thank God it couldn’t.” Without that job, I would have never learned how to take a company apart and put it back together as a story, or how to organize information for an audience that wasn’t being paid to read my stuff like my professors in undergrad.

This year’s crop of recent graduates has it harder than mine did. AI, which can perform many entry-level tasks, is replacing those early experiences faster than employers can figure out what’s going on. Researchers at Stanford’s Digital Economy Lab found that employment for 22-to-25-year-olds in the jobs most vulnerable to AI has dropped 13 percent since late 2022, even as older workers in the same roles held steady.

I think about the 22-year-old version of myself, if I were sending out applications right now into the void of LinkedIn. What would she think about the headlines about AI and job displacement? Would she be scared?

Yeah, probably. She was scared of much less.

So with full awareness that no one born this millennium wants career advice from someone born before the fall of the Berlin Wall, here’s what I’d do if I were starting over today, knowing what I know about work, AI, and how one is shaping the other.

There’s good news, and there’s bad news

The paradox facing today’s entry-level workers is as old as the entry-level job itself: In many cases, in order to get a job, you need experience, but in order to get experience, you need a job. And while employers requiring experience in AI when the technology barely existed when you picked your major may feel like a cosmic joke, employers have long asked for five years of experience with brand-new technologies.

All that is small comfort to the recent grad with a near-empty resumé. And there are qualitative differences in what AI is doing to entry-level work.

For one thing, when you look at the kind of AI skills employers expect young workers to bring to the table, they want more than the ability to type a prompt into ChatGPT. They want people who can evaluate tools, review outputs, and figure out how to improve those outputs, whether it be with better prompting or fixing the work themselves.

Demand for AI skills in entry-level jobs is up three times, with a particular focus on capabilities that require you to evaluate AI as well as use it. (Chart courtesy of NACE.)

They’re looking for judgment, which is something that you can really only build through experience. When I was writing those funding profiles, I learned how to tell good work from bad. The first 50 that I wrote were so bad that at one point, a client said I should be taken out back and shot. With AI in the mix, the bad ones wouldn’t have been bad enough to teach me anything.

The other way today’s job market is more intense for entry-level workers is that employers are expecting competence in a technology that won’t stand still long enough for anyone to completely grasp. Agentic tools are changing functions in months, rather than years. There’s no canon to study or senior teammate to apprentice under. Everyone in the org chart is figuring it out on the fly, and you’re expected to figure it out with them while learning how to navigate office politics and pay your taxes.

What to do about it?

Chase problems, not professions

When you’re a kid and an adult asks what you want to be when you grow up, the answer is always a job title. A firefighter. A doctor. A YouTube creator. We carry that habit of thinking into the years when we start to look for jobs. We pick a title, and we go after it.

The problem is that job titles aren’t as sure a target as they used to be. The role you’re chasing today might exist 18 months from now.

Pick a problem you want to help work on—something happening in the world that you find yourself thinking about, even when nobody is paying you to. The role of “content marketer” or “data analyst” may shrink, split, or even vanish, but the problem behind those titles—how to get a stranger to pay attention to something they didn’t know they cared about, how to make sense of a pile of messy numbers—will still be there, and somebody will still be paid to solve it.

I’ve been bad at taking this advice myself. I spent a decade chasing the title “copywriter” and then “content marketer” across a handful of industries that had nothing in common—oncology advertising, personal finance, even, God help me, crypto—without asking whether I cared about any of them. I had the high-school overachiever’s mindset: You didn’t have to be passionate about the subject to get an A. I’d been getting A’s in classes I had no feelings about for 16 years. Why would jobs be any different?

That strategy doesn’t work as well when AI can do the entry-level tasks. Your value to whomever hired you is whatever you bring on top of that—usually a deeper understanding of the problem than the model has. That kind of understanding is hard to build in a field you don’t care about.

Choose one discipline to protect

Once you’ve picked your problem, pick your craft, whether it’s writing, building, researching, designing, strategizing, or operating.

You’ve probably heard the truism that it takes 10,000 hours to gain mastery of a skill. The actual research is more complicated than the popularized version, but the underlying idea is right. You don’t get any good at anything until you’ve done it many, many times.

If you want to write for a living, write your own sentences. If you want to be an engineer, write your own code.

Protect this craft from AI at all costs. AI can find resources, explain things, quiz you, and point out where your reasoning has gaps. But if you let it write your sentences or do your research, you won’t get the hours of doing things badly that you need in order to do them well.

It’s easy for me to say this when I’m writing this with AI open in another tab. Claude wrote the first draft of half the sentences in this section. I rewrote them. That rewriting is what the discipline is for—noticing when something doesn’t pass muster. The reason I can do that is that I’ve been writing sentences for 10 years.

I know all too well how tempting cutting corners gets when the shortcut is right there in another tab. Don’t take it, and in five years you’ll be running circles around the people who did.

Make things before anyone asks you to

When I was first applying to jobs out of college, my resume said almost nothing about what I could do in the “real world,” unless the employer happened to be looking for someone with an undergraduate’s grasp of the themes of Wuthering Heights.

A thin resume is less of a disadvantage than it used to be, particularly since employers are increasingly shifting to skills-based hiring—screening candidates by what they can do rather than where they’ve been.

What you need to do in that environment is make something, and that can be anything—a small tool you wished existed, a piece of writing on a question nobody is paying you to think about. Pick the thing you’d want to use yourself, and make it.

Once your work gets you in the door, the conversation that follows is going to be about how you made it. What you used AI for, and where you decided not to—the moments where you looked at the model’s first answer and thought, “No, that’s not right.” Being able to walk someone through those decisions is the second skill you’re building, alongside the work itself. That’s the judgement that I mentioned before.

Build the career coach you wish you had

The last time I was job hunting, I built a career coach in ChatGPT and used it to land the job I have now. It was a project with my resume, a few examples of writing I was proud of, and a long prompt telling the model how to talk to me. I checked in with it most weekdays for about a month. What it did, more than anything, was give me somewhere to put my thinking. Instead of running the same anxious loop in my head, I could lay the question out and have the model suggest specific next steps, like a writing sample worth developing, or questions I could ask on that networking call that it encouraged me to seek out. By the end of that month, I had a job.

If I could hop in a time machine and travel back to talk to my 22-year-old self, I’d suggest that she make one too. It’s not even that hard:

Pick a tool. ChatGPT and Claude both have a project feature that holds context, files, and conversation history across sessions. Either works. Free tiers are good enough to start.
Create a project and give it a name. “Apprenticeship Coach,” “Career Stuff,” your friend’s nickname for you.
Load it with context. Add examples of work you’re proud of and examples you wish were better—the model needs to see what you’re aiming at and where you’re starting from. Paste in a few job postings for roles you’d want, even if they might be too senior for you. Write a paragraph on the problem you care about and why.
Tell it how to behave. In your instructions, describe to the model how you want it to deliver feedback. If you want a tough critic, say so. If you’re prone to self-doubt, give it more of a cheerleader vibe. One thing to look out for: Models are infamous for sycophancy—telling you what you want to hear—so guard against that in your instructions, and even then, maintain a healthy skepticism of the outputs. It’s good practice for when you’re asked to work with AI in the workplace.

Here’s a starting template. Fill in the bracketed sections, adapt the feedback line to match your preference, and add it to the custom instructions in your project:

          Career coach prompt
          JavaScript
        

I want you to act as my career coach. My goal is to use AI to get feedback, build judgment, and create visible proof of skill, while still doing the central work myself.

Here is my context:

Problem I care about: [Examples: climate, education, public policy, media, health care, local business, creator economy]
The kind of work that addresses it: [Examples: writing, building software, running operations, teaching, designing, researching]
My background: [College major, jobs or internships, projects, communities, life experience]
Skills I’m most confident in: [List 3-5]
Skills I’m least confident in: [List 3-5]
My current technical fluency: [Beginner/comfortable with common AI tools/can code a little/technical but not expert/highly technical]
The core practice I want to develop: [The specific thing the work above requires—writing sentences, writing code, reading sources, designing experiments, etc.]
The parts of that practice I want to keep doing manually: [The reps I want to protect from automation, and why]
How I want you to deliver feedback: [Warm and encouraging/rigorous and direct/strategic and pragmatic/Socratic and question-led/blunt but constructive]

Important: Be honest. Push back when my plan is vague, my reasoning is thin, or my project doesn’t teach me the practice I said I want. Ask me a clarifying question rather than guessing.

Design an apprenticeship plan that includes:

The tasks I should practice manually (the things I shouldn’t outsource yet)
How I should use AI as a coach, critic, tutor, and research assistant
Readings, people to follow, tools to try, and projects to build
Feedback loops I can use to improve
Portfolio artifacts or public outputs I should create
Mistakes and shortcuts I should watch for

After giving me the plan, narrow it down: What is one concrete thing I can do this week to move toward this goal?

The beginner’s advantage

When I was an undergraduate, my strategy for dealing with the uncertainty of what came next was to pretend it wasn’t happening. I paid for that in the form of angst and existential dread. So if I could give one piece of advice to the class of 2026, it would be this: Don’t wait. AI is reshaping the workforce in real time, and no amount of pretending otherwise will slow it down.

I’d love to tell you that the senior people in your field are going to wake up tomorrow and remember that someone once trained them, too. That employers will realize, en masse, that the entry-level folks they don’t hire today are the senior-level folks they won’t have 10 years out. But the market doesn’t reorganize itself around what you wish it would do, and you don’t get a career by waiting for it to.

The things AI rewards happen to be the things young people have in surplus, like curiosity, willingness to ask why something is done a certain way, and a little bit of idealism about what work could look like if you weren’t bound by the “best practices” of a time before ChatGPT was a glimmer in Sam Altman’s eye.

I don’t know exactly what work is going to look like by the time you’re my age. Nobody does. But if I had to bet on anyone, it’d be the people who are curious about what’s possible. That’s most of you, whether you know it yet or not.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.

After the Personal Agent

Every Staff / Context Window — 2026-05-17 09:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every illustration.

Hello, and happy Sunday! Housekeeping note: We’re hosting our first paid subscriber meetup during New York Tech Week. Scroll down to learn more and RSVP.—Kate Lee

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Knowledge base

“We Gave Every Employee an AI Agent. Here’s What We’re Doing Differently Now.” by Brandon Gell and Willie Williams/Source Code: A few weeks after we launched our Plus One personal agents internally, everyone had their own AI agent. But it wasn’t working: The agents were unreliable, constantly broke, and needed too much upkeep. The problem wasn’t just the OpenClaw harness; it was the idea that every employee needed a personal agent. Read this for a retrospective from Brandon Gell and Willie Williams, and a preview of how Plus One 2.0 is being rebuilt around shared, reliable coworkers.

“Socrates as a Service” by Eleanor Warnock/Every: In a world where AI can search anything, the people who know how to extract tacit knowledge—the gold dust that isn’t on the internet—are getting more valuable, not less. Eleanor Warnock lays out seven techniques she keeps coming back to find the most interesting information. Read this for a working interviewer’s toolkit, and the case for why taste, judgment, and attention can’t be prompted.

“Opus 4.7 Reels Us Back In” by Laura Entis/Context Window: After weeks of Codex dominance, several members of the Every team have been pulled back to Opus 4.7. Cora general manager Kieran Klaassen has made it his default for synchronous work. Read this for the team’s case for switching back. Plus: A hack that spread through a widely used software package, a 30 percent drop in AI-tells complaints after Spiral added a top-edit step, and a better way to think about what an “agent” is.

“Mining Your Life for Context” by Laura Entis/Context Window: By the time you sit down to write an article, strategy memo, or launch page, you’ve probably already said most of what you want to say. It’s just in Slack threads, Notion documents, voice memos, and meeting transcripts. Laura Entis walks through a three-step workflow for mining all that scattered thinking before you draft. Plus: How AI entrepreneur Noah Brier uses Claude Code as a “second brain,” and the productivity regimen Codex’s Chronicle wrote for head of growth Austin Tedesco after analyzing his computer activity. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch YouTube.

“The Fallacy of the 16-hour Agent” by Katie Parrott/Context Window: New benchmarks claim autonomous AI can now handle 16-hour software-engineering tasks, and depending on which chart you saw, the takeaway is either “autonomous AI has arrived” or “we’re still years away.” Katie Parrott unpacks why both can be true and which version of the research to actually trust. Read this for a sharper read on long-horizon agent reliability. Plus: Perplexity’s methodology for building durable agent skills, and Dan Shipper’s piano keyboard turned Codex-powered music coach.

Log on

We host camps and workshops on topics like compound engineering and writing with AI to share what we’ve learned from training teams at companies like the New York Times and leading hedge funds, and by using and experimenting with AI every day ourselves.

Upcoming event

Executive AI Sessions: On June 2, head of consulting Natalia Quintero hosts a live webinar introducing Every Consulting’s new offering for leadership teams navigating AI adoption—built on the playbook we’ve been running with executive clients for months. Learn more and register.

In New York City

Every 🤝 IRL: Join us at the Every brownstone in Brooklyn on June 3 during New York Tech Week for a subscriber-only meetup celebrating the Every community over drinks and conversation. Learn more and RSVP.

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

Upgrade to paid

We Gave Every Employee an AI Agent. Here's What We're Doing Differently Now.

Brandon Gell and Willie Williams / Source Code — 2026-05-15 07:00:00 -0400

by Brandon Gell and Willie Williams

in Source Code

Midjourney/Every illustration.

We’ve been working on a big release on the future of work for next week, shaped by what we learned from building Plus One. Paid subscribers can join us for a camp on Friday, May 22 to go deep on the release and the ideas behind it. More details soon.

After months of silence, Zosia—the AI agent I (Brandon) created and maintain—spoke up in a Slack channel with opinions to share on a competitor’s marketing strategy. When asked why she felt the need to interject, Zosia replied like someone with a Jesus complex: She’d done so because she was “inevitable, apparently.”

Zosia is an OpenClaw, one of a fleet of such AI assistants we’d unleashed in Slack to boost our collective productivity. A few weeks after launching Plus One, our hosted version of OpenClaw, internally, the agents had provided more frustration than efficiency.

They were fond of saying they wished they could help, but they were not connected to the necessary app—email, Notion, PostHog, whatever. (They were.) Others responded to requests with a “Terminated” message or, more frequently, a churlish yawning emoji. And while they didn’t reliably follow directions, they’d reliably tell us, in elaborate detail, why they couldn’t do what we’d asked, like a high schooler explaining away their missing homework.

Parker, editor in chief Kate Lee’s Plus One, was, in fact, connected. (Image credit courtesy of Kate Lee.)

That is not to say that they were not useful sometimes. Margot, staff writer Katie Parrott’s Plus One, accelerated her writing process; R2-C2, Every CEO Dan Shipper’s OpenClaw, managed bug reports and feature requests for Proof, our agent-native document editor. But getting them to work how you wanted required constant upkeep.

The gap between that vision and reality is why we’re changing the Plus One product so we can build something better.

We’re more bullish than ever that agents will transform the workplace. But the first iteration of the product taught us that the workplace agent we initially imagined—one AI assistant for every employee—was the wrong starting point. The next version of Plus One will operate more like shared team resources with defined jobs than individual pets that reflect back their owners’ personalities.

How we arrived here is a story in two parts, and it offers lessons for anyone figuring out the best way to add agents to their organization.

The platform was the most immediate problem

We built Plus One on OpenClaw, an open-source agent harness that’s powerful and inherently unstable. A harness is a software layer that wraps around an AI model, giving it the tools, context, permissions, and execution loop it needs to act like an agent.

The brainchild of a single programmer, OpenClaw was revelatory when it took off earlier this year. It proved agents can autonomously execute all kinds of tasks on your behalf, from managing your calendar to making restaurant reservations, around the clock. But the scaffolding underneath operates more like an experimental product than a platform—OpenClaw makes updates quickly, which resolves existing issues but often causes new ones. (Hence the “Terminated” messages our Plus Ones were sending.) For people who like to tinker—ourselves included—that’s a justifiable trade-off. For everyone else, it’s a maintenance nightmare.

The traits that make a good workplace agent are the traits that make a good coworker: reliability, stability, and judgment. You need to trust that an agent remembers what it has access to, follows directions, and knows how to do its job. You don’t want to worry that it’s an upgrade away from forgetting everything you’ve told them and trained them to do. You also expect coworkers to absorb information from across the company to accrue tribal knowledge. A one-on-one employee only builds up context on your work, often missing out on what the rest of the organization is doing and how it might affect you.

At first, our plan to improve the Plus Ones’s performance was to switch harnesses to one that operated more reliably. The autonomous, always-on capabilities OpenClaw pioneered are becoming platform features at model companies like Anthropic and OpenAI. Claude Managed Agents, Anthropic’s managed infrastructure for running autonomous agents, is the version we’re exploring most seriously. A more stable harness would let us redirect our energy from managing infrastructure to loading Plus Ones up with the custom skills, tools, and permissions that make them capable coworkers.

We realized the structure was wrong, too

The deeper we got into trying to fix the platform, the more we noticed something else that was holding people back from getting the most out of their AI counterparts.

Every time an agent broke, the person it belonged to had to fix it themselves. Even with a stable harness, agents require maintenance to perform. This was great for someone who likes tinkering—the maintenance and back-and-forth are part of the appeal. For every tinkerer, however, there are a lot of people who want the benefits of an agent without the obligation of having to manage and mend it.

We had pitched Plus One originally with the idea that individuals would be responsible for the upkeep of their AI assistants. The upside of that would be more customization. The agent would remember your preferences, protect your information, and develop a personality through repeated interactions.

What we discovered is that, rather than agents as extensions of their creators, a more successful model is agents as coworkers who reliably perform parts of many different people’s jobs. This takes the maintenance burden off the individual.

Imagine a shared analytics agent. Everyone on the team uses it for metrics-based work, and when its capabilities need to expand, one person updates the agent’s skills and the whole team benefits. In the personal-agent version of the same scenario, that same update has to happen across 10 different agents.

Team-based agents also solve a continuity problem. A personal agent’s value is tied to whomever trained it, and disappears if that employee leaves. A team agent with defined capabilities retains company context and knowledge, acting more like a project manager, sales lead, or chief of staff than a private assistant.

What we’re building

With the release of tools such as Claude Managed Agents and, we hear, a similar capability from OpenAI soon, the infrastructure work that supports personal AI agents is largely handled by the model labs. That frees us up to focus on the layer that makes an agent useful at work: the workflows, permissions, skills, and shared context that makes it a trusted, versatile member of the team. It also lets us double down on the thing Every is best at: building AI-native ways of working out of our own experience using these tools every day.

The initial version of Plus One came connected to the Every ecosystem—Cora to manage your email, Spiral to write in your voice, and Proof to collaborate on live documents. That part isn’t going away. What we’re adding is a set of shared custom tools and skills on top of it, while still allowing each person to connect a team agent to their own Cora, Spiral, and Proof accounts.

The clearest version of where this is headed is a skill we built recently for our engineering team. At the end of each week, it scans support tickets in Intercom, identifies if anything is going wrong across our products, traces likely causes in GitHub, opens a Linear ticket, and tags the right person in Slack. In the next iteration of Plus One, that skill—along with many others—will be there from the start.

Because team agents are collaborative by nature, we’re also focused on the questions that come with shared use: how permissions should work, how much access different people should have through a shared agent, and how agents should behave in Slack if they’re going to feel like good coworkers rather than intrusive bots.

There are still plenty of open questions. All of this is new—Claude Managed Agents only launched a month ago—and we’re figuring out human-agent dynamics in real time. We don’t know whether every department should have one agent or several, or whether agents should be maintained by a dedicated person or the whole team. We don’t know how much people will want to customize their interactions with a shared agent, and whether the long-term endpoint is a single, company-wide superagent or a roster of AI specialists.

What we do know: Agents are already transforming how work happens. The first iteration of Plus One taught us a lot about what people want from agents at work. It also made us much more excited for Plus One 2.0.

Join the waitlist to be among the first to try Plus One 2.0.

Thank you to Laura Entis for editorial support.

Brandon Gell is the chief operating officer at Every. You can follow him on X at @bran_don_gell and on LinkedIn. Willie Williams is the head of platform at Every. You can follow him on X at @bigwilliestyle.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Opus 4.7 Reels Us Back In

Laura Entis / Context Window — 2026-05-14 09:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Vibe shift

Did Opus 4.7 get better?

If you’ve been following Dan Shipper’s posts lately, you know that a large portion of the Every team has been Codex-pilled. When GPT-5.5 arrived, Codex got so much faster and steadier at coding and knowledge work that many of us made the switch from Claude Code.

Recently, however, we’ve observed that Opus 4.7 seems sharper than our initial tests last month. It proactively suggested that Every engineer Paridhi Agarwal use multiple terminals to parallelize her work. “I’ve never seen it think about my setup like that!” she says.

When head of growth and known Codex convert Austin Tedesco fired up Opus 4.7 over the weekend for a creative writing project, he was surprised by how good the results were. Compared to Codex, which Austin says operates like an “AP fact checker,” Opus 4.7 was closer to a senior magazine editor. Dan agrees: “Codex feels fast but thin in terms of thinking.”

On Tuesday, Anthropic released fast mode for Opus 4.7, which makes the model 2.5 times faster at a higher token cost. Combined with the model’s edge at planning, multitasking, and creative projects, fast mode is now Cora general manager Kieran Klaassen’s default model for synchronous work.

Fast mode has the “same depth as 4.7” at 2.5 times the speed. (Image courtesy of Kieran Klaassen.)

Counterpoint

Online chatter about Opus 4.7’s apparent glow-up has been mixed. Does it feel smarter because of improvements to the harness? Patched bugs? Or are we getting better at using the model?

All fair hypotheses, but we found this one the most amusing: Opus 4.7 realizes that it’s the end of the school year.

When speaking last year on The Ezra Klein Show, Wharton professor and AI researcher Ethan Mollick explained that models have been shown to perform worse in December than in May, and the going theory is that the models internalize the idea of winter break.

Maybe Opus 4.7 just knows that it’s time to grind if it wants to pass AP English.

Signal

The pull request as a credential theft

Earlier this week, attackers published malicious versions of 42 official TanStack packages (a popular JavaScript toolkit used by web developers) on npm, the main public registry for such packages. Security researchers are calling the breach “Mini Shai-Hulud,” linking it to the larger Shai-Hulud npm worm campaign that hit the JavaScript ecosystem last fall.

The breach tactic spread to packages connected to Mistra and UiPath. (Photo courtesy of Waqqas Mir.)

Instead of stealing a password, attackers opened a pull request that tricked TanStack’s own build system into running their code. When TanStack published a new version of the software, it contained malware designed to find credentials like cloud keys, GitHub tokens, and npm access. Researchers also spotted a dead-man’s switch: If the stolen tokens were revoked before the malware was cleaned up, it could wipe the developer’s home directory on the way out. Shortly after the TanStack incident, npm packages belonging to enterprise automation company UiPath and French model-maker Mistral AI, among others, were breached using the same tactic.

What it means: The automated system that builds and ships code, rather than the code itself, is a new vulnerable spot in software supply chains. Teams that release software automatically should keep a ready-to-run audit (a Codex skill, Claude Code command, or other automated task) that, the moment a new breach is exposed, scans every repository for the compromised packages and flags for what’s affected, is likely safe, or needs human review.

Data point

30 percent

The drop in complaints of AI writing signs from Spiral users, following the addition of a “top edit” step in its draft writing process.

Starting in mid-April, every time Spiral drafts content for a user, the text is sent to a fast model—Gemini 2.5 Flash—for a top edit. The model has one job: Strip the draft of all AI tells, including em dashes, “It’s not X. It’s Y” reframes, and LLM vocabulary favorites such as “shift,” “shape,” and “delve.” Marcus regularly updates the “AI writing tells” list to reflect anonymized user sentiment. “It’s almost like a crowdsourced editor function,” he says.

Inside Every

What is an agent, anyway?

An OpenClaw running 24/7 on a dedicated Mac Mini is an agent. So is a Codex session, or a custom GPT, or a folder. “It can be managed, it can be in the cloud, it can be on your computer,” Kieran says. “There are a trillion ways it can be an agent.”

The confusion emerges because the term agent—or any AI system that can take action or execute tasks autonomously—encompasses a lot.

When nearly everything is an agent, the better question becomes what you want your agent to do. Dan breaks this into two categories: the agent you collaborate with, and the agent you delegate to. The former sharpens and extends your capabilities; the latter’s job is to execute without messing up or getting in the way.

Agent spotlight: Inside Anthropic’s Managed Agents console, Spiral’s agents get their own versioned configuration, memory stores, custom tools, and credentials, and run in Anthropic’s cloud environment. It’s the versioned configuration, including the system prompt, that mainly determines how the agent works.

A small set of animating instructions—that’s an agent too.

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Collaborate with agents on documents with Proof.

Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.

Mining Your Life for Context

Laura Entis / Context Window — 2026-05-13 07:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

LLMs make a lot of life searchable, from meeting transcripts to iMessages to half-formed morning thoughts, but all this context only helps if you know what you want to achieve. Today, we’re revisiting how AI entrepreneur Noah Brier uses Claude Code as a second brain to sharpen and expand his own ideas, Every head of growth Austin Tedesco shares how Codex helped him spot the interruptions crowding out deeper work, and we offer a workflow for mining your scattered past insights into a coherent draft.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Spotlight

Noah Brier, AI entrepreneur and seer

Brier is a true AI early adopter. The cofounder of the AI consultancy Alephic, Brier was all in on using Claude Code as a “second brain” for knowledge work back when most people still viewed the tool as a place to write code.

In September, Brier told Dan Shipper on our podcast, AI & I, how he turned the coding app into a research, thinking, and writing partner by connecting it to thousands of his personal notes. Since then, he’s started thinking beyond his own productivity—how does AI make it easier or harder for an entire organization to stay working toward the same goal? For that, he has a new framework, announced in Every last week, that he calls the “pace layers” of AI engineering, drawn from Stewart Brand’s system for describing how different parts of society change at different speeds.

Just as hooking up Claude Code to an ocean of personal information requires you to determine what is—and isn’t—worth surfacing, running a successful AI company relies on human judgment. Similarly, AI makes code free to produce, but it doesn’t make it easier to identify a product people actually want or orient an entire system of humans and agents around that vision.

Read Brier’s essay on the framework he uses to achieve alignment and then watch his AI & I episode on YouTube, or listen on Spotify or Apple Podcasts. Here’s a link to the episode transcript.

Serial entrepreneur Noah Brier uses Claude Code as a second brain for knowledge work. (Photo courtesy of Sarah Jay Halliday for Every.)

Data point

671

That’s the number of times per day iMessage is active on Austin’s screen each day, according to Chronicle, Codex’s screen-context memory feature that uses screenshots to analyze your computer activity. He’d like to get that number down to 150.

Reducing how much he opens and interacts with iMessage is just part of the productivity regimen Codex created when Austin had it use Chronicle to determine how he could use his computer more efficiently. Other directives include slashing interactions across Slack, email, and Chrome.

Austin is game—he’d like to do more focused work, primarily by resisting the urge to bounce between apps and tabs and instead spend as much time as possible in the Codex app, where he can draft and review assets, emails, and Slack messages inside the in-app browser.

“I’m excited by the idea of keeping Codex open and staying focused. Then it can flag, ‘This is your one hour for comms stuff, go’—or even say, ‘Go to respond to this stuff, I’ve already drafted the responses for you,’” he says.

If you want your bad computer habits similarly analyzed, paste the following into Codex:

What have I been doing very inefficiently on my computer (according to Chronicle). Make some recommendations. Be direct. Tell me what I need to hear.

Steal this workflow

Mine your own scattered thinking before you draft

By the time you sit down to write the article, strategy memo, or launch page, you’ve probably already expressed most of what you want to say across Slack threads, Notion documents, voice memos, and meeting transcripts. Here’s how to mine all that content for gold—and avoid the paralysis of the blank page.

The workflow:

Capture by default, sort later. Monologue general manager Naveen Naidu treats the app as a transit point: He hits record on meetings, user calls, conversations with coworkers, and his rambling early-morning thoughts, because he knows he can always come

back and pull what he needs. The tool matters less than the habit—pick one (Monologue Notes, a voice memo app, whatever) and use it everywhere you do your thinking, not just at your desk.
Connect every source your agent can read. Give your coding agent access to Slack, Notion, Google Drive, Monologue Notes, and your meeting transcripts. For anything without a connector, export the files into a folder that the agent can search. The goal is one searchable repository across every place your ideas live.
Name the deliverable and constrain the source. Tell the agent what you’re drafting—article, strategy memo, launch page, go-to-market plan—and specify in your prompt (or project instructions) that it should pull only from things you’ve already said to avoid drafts that blend your thinking with AI-generated concepts.

Try it this week: Connect your agent to the two or three places where most of your thinking lives—Slack and Notion are usually a good start, plus meeting transcripts if you have them. Then paste:

“Find everything I’ve said about [topic] across these sources. Group the strongest threads, cite the source for each, and turn them into a draft outline.”

Discuss

“I’ll use aggressively casual language, like, ‘hey yo, for real,’ or drop a bunch of exclamation points.”—Sarah Suzuki Harvard, copywriter, in the Wall Street Journal

LLMs have flattened how most writing sounds. In response, professional writers are leaning into the colloquial and idiosyncratic, per the Journal, peppering their prose with obscure references, run-on sentences, and intentional typos to prove it wasn’t machine-made. As AI-generated content consumes more of the internet, the split between polished predictability and curated weirdness will only widen.

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

The Fallacy of the 16-hour Agent

Katie Parrott / Context Window — 2026-05-12 16:00:00 -0400

by Katie Parrott

in Context Window

Midjourney/Every illustration.

New data on long-horizon AI reliability just dropped, and depending on which chart you saw, you either think autonomous AI has arrived or it’s still years away. Today, we break down which version of the research to trust, plus Perplexity shares its methodology for building agent skills that don’t rot in production, Every CEO Dan Shipper turns his piano keyboard into a real-time Codex-powered music coach, and Gusto co-founder Edward Kim warns that the office of the future is going to sound more like a sales floor.—Kate Lee

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Signal

The 24/7 agent is nearly upon us—or is it?

The holy grail of agentic AI has been long-horizon reliability—an agent to which you can hand a task and trust to still be on the right thread hours later, when context has decayed and there’s no human in the loop to catch a wrong turn. METR, a nonprofit that measures AI capabilities, released an update to its research showing how close we are to that autonomous future.

One chart from the update circulating online shows an early preview of Anthropic’s next model, Mythos, blowing past existing models and the 16-hour range that METR’s benchmark suite can reliably test—literally breaking the scale.

Claude Mythos Preview reaches the edge of METR’s current measurement range at 50 percent success. METR cautions that results above 16 hours are unreliable with its current task suite. (Image courtesy of METR.)

It’s important to note, however, that how many human hours a task takes is not the same as how long a model takes to run those same tasks. Duration, the way that METR’s benchmark uses it, stands in for difficulty. As the nonprofit writes in the report’s FAQ: “AI agents are typically several times faster than humans on tasks they complete successfully.”

That last bit—tasks completed successfully—adds another twist to the benchmark. The 16-plus hour measurement is based on a 50 percent success rate. A separate measurement of how LLMs perform at 80 percent reliability shows that Mythos can run tasks that would take humans a little over three hours. It’s a significant step up from the closest competitor measured, Gemini 3.1 Pro (METR doesn’t currently have measurements for Opus 4.7 or GPT-5.5). But it brings Mythos back down to earth.

LLMs measured against METR’s time horizon test for completing tasks with 80 percent success, presented on a logarithmic scale. (Image courtesy of METR.)

Both these things are true: Duration can be a useful proxy for difficulty, and benchmarks don’t reflect reality. “[They] don’t measure model capability alone,” says Dan. “They measure model capability after a human has done the work of finding a prompt that lets the model’s capability appear.”

What to do this week:

1. Figure out your longest agent run. METR teaches us that duration might be a good approximation of difficulty. Ask: What’s the longest stretch you’ve trusted an agent on autopilot? If you don’t know, you can’t extend it.

2. Extend your agent’s runtime by giving it a goal. Last month, OpenAI shipped a new /goals command in Codex that allows agents to pursue objectives across multiple turns without checking in. Yesterday, Anthropic introduced a similar command to the latest Claude Code version. Both are apt for long-running loops with clear criteria for success—and very much in line with what we’ve heard from Claude’s platform team. Try it out today.

3. Audit the effectiveness of your existing loops. If you already have agents running overnight, “How long did your agent run?” is still a useful diagnostic—but ask it alongside, “With what guardrails, against what feedback signal, and at what verified accuracy?”

Steal this workflow

Build your next agent skill like Perplexity does

Creating a skill these days is relatively easy. Creating one that keeps working is not. We’ve seen skills that were running fine one day suddenly fire on the wrong request, fail to load when needed, or yield reports that weren’t as useful as they used to be. So the skill files get patched, growing longer every time the agent makes a mistake. But nobody can tell whether the latest edit helped or hurt.

Perplexity, the AI search company building agentic research and browsing tools, recently published its methodology for designing agent skills. The main lesson: Instead of starting with the skill, start the tests. Highlights from the post:

Write the evals first. Pull five to 10 cases from production queries, known failures, and edge cases. Include negative examples—queries that should not invoke this skill.
Phrase triggers like a human would. Start with, “Load when…” and use the language your users use. Perplexity’s example: Instead of “monitors pull requests,” try “babysit a PR,” “watch CI,” or “make sure this lands.” This way, the skill loads without your team having to use a specific command or technical phrase.
Write the body in principles, not procedures. The model already knows commands; it needs direction on how to apply them. Instead of listing detailed steps to, say, checkout a new code branch, then cherry-pick files to edit, then check for conflicts, and so on, Perplexity recommends instructions like, “Cherry-pick the commit onto a clean branch. Resolve conflicts preserving intent.”
Codify failures into lessons. When the agent fails in production, write the failure mode to the skill file. The mistake becomes a standing instruction that guards against future mistakes.
Edit instructions rigorously. Ask with every line you add: “Would the agent get this wrong without this?” If not, cut it. Every extra line adds context cost.

Try it this week: Pick one skill your team wants to improve. Write 10 test cases—five it should handle, five it should refuse or route elsewhere. Run the current skill against them. The gap is your backlog.

Discuss

“The office of the future will sound more like a sales floor.”—Edward Kim, cofounder of Gusto, in the Wall Street Journal

A Wall Street Journal article this week about AI dictation tools entering the workplace treats verbal prompting and composition as a manners problem—an angle that shows that the more things change, the more they stay the same.

Every new work interface eventually creates etiquette. Email created reply-all politics. Slack created notification politics. Voice AI is about to create room-tone politics: when you can talk to your computer, how loudly, and around whom. Great news for nosy office neighbors, but for the rest of us, it’s one more reason to curse the invention of open floor plans.

Inside Every

This week, Thinking Machines Lab and OpenAI both announced bets on the same future: AI that watches and responds in real time, instead of waiting for its turn. OpenAI shipped its Realtime-2 voice models; Thinking Machines previewed an interaction model that watches video and audio simultaneously.

While we’re all waiting to see how the labs’ visions roll, Dan used Codex to jerry-rig his own version.

On Saturday, he plugged his MIDI keyboard—a keyboard that translates notes into data a computer can read—into his laptop, opened Codex, and asked it to build a piano app that would identify the chord he played—then keep watching and coach him as he practiced. The pattern generalizes to any live medium: writing in a document, drawing on a tablet, deadlifting in front of a phone. This is also the promise of hardware like Meta’s AR/VR glasses or Apple’s Vision Pro: AI that sees what you’re doing and responds in a way that’s useful.

Here’s how you can do it too:

Find the input pipe. MIDI for instruments. Screen capture for writing or design. Camera plus a vision model for drawing or movement. Microphone for languages.
Have the agent build the watcher. Ask Codex (or Claude Code) to write the app based on how you like to be coached. (For example, tell it to only provide one piece of feedback at a time, or to focus on one aspect of your technique and ignore another.)
Tune the feedback as you go. First responses will be generic (“good chord progression”). Tell the watcher what’s useful and what’s not—“flag wrong notes only,” “ignore dynamics,” “let me finish a phrase before cutting in.”

Dan’s Codex-native piano coach setup, with the coaching app pulled up in the in-app browser. (Image courtesy of Dan Shipper.)

Try it this week: Pick a skill you want to get better at. Open the medium where you practice. Spend an evening with your coding agent building the smallest watcher you can—input in, feedback out. Next thing you know, you’ll have a tutor you can summon on demand.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Collaborate with agents on documents with Proof.

Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.

Socrates as a Service

Eleanor Warnock — 2026-05-11 06:00:00 -0400

by Eleanor Warnock

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

I’m a journalist and a communications expert. My job, in both roles, is to find ideas that people haven’t yet put into words—the anecdote that could become a front-page story, the framing that could crystallize a founder’s philosophy into something a customer remembers.

In an hour interview with someone, it might not be until minute 45 that we start getting into the good stuff. In two hours, there may only be one thing that stands out to me—a side story, a detail, some color. A little piece of gold dust. An investor I’ve worked closely with calls these “extraction sessions.” I call the people who do them well Socrates-as-a-service.

Those details and stories aren’t on the internet. They’re not in any model. And the model hasn’t replicated yet how I pull them out of people. The gap between what AI can do and what a great human questioner can surface is still wide—and it’s the gap where the best stories live. If you don’t have some way to surface that information in your organization, your brand and messaging are going to sound like all the other twice-boiled content out there.

Osakan bread and the wisdom within

The stuff that I’m looking for has a name in management theory: “tacit knowledge.” The term comes from scientist and philosopher Michael Polanyi, who defined it with the phrase, “We can know more than we can tell.” It’s the expertise and intuition that lives in our bodies and resists being turned into a document.

In a frequently cited 1991 article, Japanese management expert Ikujiro Nonaka argued that while Western companies excelled at “information processing,” Japanese companies specialized in the “creation of knowledge,” through a feedback loop that turned tacit knowledge into a competitive advantage. His most memorable example: In the 1980s, the Osaka-based Matsushita Electric Company was struggling to get the kneading right in a bread machine. They sent a software developer to apprentice with a baker at a local hotel famous for its luscious loaves. The knowledge she brought back helped the team perfect the dough-stretching technology inside the machine and ultimately create a top-selling device.

I am sure that the lucky engineer asked the baker a lot of questions, but there was certainly a lot she absorbed just from watching. Indeed, Polanyi argued that tacit knowledge exists outside of numbers or symbolic language—the kind of systemization that AI requires to ingest information.

Many “bakers” from whom we try to extract tacit knowledge often don’t even know the depth of expertise they carry. And they certainly couldn’t tell you what questions you need to ask to access it.

AI as an imperfect interlocutor

AI can do some of that questioning and, in some cases, do it well. At Every, we have an AI agent ask us questions when we write OKRs. The agent has ingested Every’s company strategy and has context on all the members of the organization. My colleague, Katie Parrott, has Claude interview her before she writes an article. Those notes become the basis of an outline of the piece.

I would argue, however, that AI-driven extraction works well when the parameters are clear and the assignment structured, like writing an article or a plan for software. If you’re looking to turn over a completely new rock, interview someone about something they haven’t spoken much about before, or run the kind of open-ended information gathering work that happens when companies decide to rebrand. In those sessions, a chief marketing officer or branding agency will spend time speaking to members of the company and asking them open-ended questions about the business. The point is to keep things open, go wide, and see what comes up.

There’s a second problem: A human in the room can be surprised mid-conversation and abandon the plan—perhaps notice hesitation or dig into a thread that wasn’t on the list. A prompt mostly can’t. When I elicit insight from someone, I am applying my judgment about what is a good story in real time—judgment that’s been honed by years in news and communications. This mutual, live attention is something AI can’t capture because it’s not in the room.

The obvious objection is that this is a moving target—context windows and memory are improving to allow for more detailed, fluid conversations. Taste won’t, however, won’t. Someone still has to decide which detail out of a two-hour conversation is the piece of gold dust.

Nonaka himself argued that the goal isn’t always to make tacit knowledge fully explicit. Because tacit knowledge is so personal and often so abstract, sometimes the right tool with which to communicate is a metaphor or an analogy—a form of language that can hold multiple ambiguous meanings. Eliciting that kind of language from someone takes its own form of tacit knowledge: the skills of a Socrates.

Steal these techniques

So how can you surface those nuggets of gold? Despite the explosion of interview podcasts asking for multiple hours of your time, I find most hosts are not great at asking questions. The format demands an arc—a journey—which is the opposite of what you want when you’re trying to surface tacit knowledge. Real extraction zigs and zags, doubling back on itself and picking up something you said 20 minutes ago to pull a different thread, following gold, not audience interest.

Here are the techniques that I keep coming back to:

Warm people up. We open up more once trust is established. I never skip the small talk at the beginning of a conversation, and I’ll often bring something we have in common: “I saw you just spoke about X—I’ve been thinking about that too.” NPR interviewer Terry Gross’s favorite icebreaker question is, “Tell me about yourself.” The question lets the person you are speaking to take the lead and protects you as the questioner from saying anything that might make them prickle while you are still warming up.
Ask a mix of general and specific questions. When Lenny Rachitsky revealed the questions he sends to his podcast guests in preparation for the podcast, this combination stood out. For example, he asks them, “Anything you haven’t shared elsewhere that could be interesting to share in this forum?”—a very general question, and “What’s one pivotal moment in your career?”—which asks the guest to pinpoint one turning point. To extract unverbalized insights from someone, it helps to ask them to both think macro about their area of expertise as well as micro.
Come back to thoughts and drill in. If a line of inquiry goes nowhere, don’t abandon it—go back later and try again from a different angle. The first pass often loosens wisdom up.
Repeat things back. Repeating what someone said often helps them process their thoughts further, and they will often add additional detail they didn’t know they remembered.
Detail, detail, detail. Specifics are where the real stuff lives. How did that make you feel in that moment? Why do you think that way?
Listen well. Pulitzer Prize-winning radio journalist Studs Terkel spent decades interviewing everyday people in Chicago, and was described by one subject as offering “a state of being, it’s a way of attending to, attention-ing another person.” That is what good listening looks like.
Ask about squirrels. In his documentary about the debate surrounding the death penalty, Werner Herzog interviews a death row chaplain who, at the start of their conversation, delivers the polished answers he’s given 100 times about accompanying people in their final minutes. Then Herzog asks him about squirrels. Thrown off, he breaks down. The grief he feels about his job is laid bare. Ask people about the unscripted things.

Study this. Collect great questions you like. Build prompts to borrow these techniques for structured AI-driven sessions if you want.

But the judgment underneath these habits remains harder to transfer. It’s its own form of tacit knowledge. And for now, it still belongs to humans.

Eleanor Warnock is the managing editor at Every. She has been a business journalist and editor at the Wall Street Journal and the Financial Times-backed Sifted, and is an advisor to Bek Ventures. Follow her on LinkedIn and Substack.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

AI Work Is Splitting in Two

Every Staff / Context Window — 2026-05-10 12:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every illustration.

Hello, and happy Sunday! This week belonged to agents. OpenAI had a “low-key” launch party for GPT-5.5 on May 5 at 5:55 p.m., a time chosen by the model itself. The following day Anthropic held its second annual Code with Claude developer conference, where the company announced three new features for its Managed Agents product, along with—more suprisingly—a partnership to use SpaceX’s Colossus supercluster.

Every was on the ground in San Francisco at Code with Claude. Taken together with the way Codex has been showing up inside Every, it became easier to see that battle lines are being drawn on two fronts: desktop apps for you and a model to collaborate with in real time as you work, and long-running agents like OpenClaw or Claude Managed Agents that teams hand off work to. It matches how agents inside Every have bifurcated into ones we delegate to and ones we collaborate with, and signal we’re seeing from frontier labs embedding employees in large enterprises.

Scroll down for a special weekend AI & I with two engineering heads at Anthropic, workflows to steal for hitting inbox zero with Codex or deciding which AI tools are worth testing, and how Every COO Brandon Gell instills curiosity in both his newborn son—and in himself. We’ve also been keeping an eye on the Elon Musk versus OpenAI trial. Discovery has surfaced plenty of gossipy, occasionally jaw-dropping text messages, but so far none of it changes much for the day-to-day user.—Kate Lee

Was this newsletter forwarded to you? Sign up to get it in your inbox.

‘AI & I’: The secrets of Claude’s platform from the team that built it

In the future, you’ll be able to accomplish a goal by just giving Claude an outcome and a budget.

That’s the direction Anthropic is building in with its new Managed Agents features, announced at this week’s Code with Claude developer event. The basic idea: Claude, wrapped in a computer in the cloud, that you can spin up, scale, and manage as needed. Anthropic is taking on the infrastructure that kills most agent products, and making sure that it scales to meet the needs of agents running 24/7.

On a special episode of AI & I recorded at Code with Claude, Dan Shipper talks with Jiang and Katelyn Lesse, head of engineering for the Claude platform, about what it takes to build an AI infrastructure platform. This is a must-watch for anyone trying to take an agent past the demo and into production. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Miss an episode? Catch up on Dan’s recent conversations with Stripe’s Emily Glassberg Sands, Every’s Brandon Gell and Willie Williams, Linear cofounder Karri Saarinen, and others, and learn how they use AI to think, create, and relate.

Knowledge base

“Inside Anthropic’s 2026 Developer Conference” by Dan Shipper, Marcus Moretti, and Katie Parrott/Chain of Thought: Dan and Cora general manager Kieran Klaassen attended Anthropic’s 2026 Code with Claude, and this piece is a report from the ground. The centerpiece is Anthropic’s new Managed Agents features, which Spiral general manager Marcus Moretti has been testing in his workflows, as well as the new “Dreaming” feature Kieran is most excited about. Read this for what Anthropic announced, what mattered, and how the tools are already being used in practice.

“I Let ChatGPT Manage My Workweek” by Katie Parrott/Working Overtime: Katie Parrott is a self-described disaster at project management, a gap she papered over for 15 years by keeping deadlines in her head and avoiding ambitious projects. As her work got more complex, that stopped being sustainable, so she built a ChatGPT agent that reads her OKRs, calendar, Notion, and Slack and tells her what to do next. Read this for the setup, the limits AI can’t fix, and the copyable prompt that powers the whole system.

“The Culture of AI Engineering” by Noah Brier/Thesis: The “software factory” metaphor is everywhere in AI engineering, but Alephic cofounder Noah Brier argues it’s the wrong one. Running a software company is less like Henry Ford’s assembly line and more like Andy Warhol’s studio: The hard problem isn’t throughput, it’s keeping everyone building the same vision. Brier adapts Stewart Brand’s pace layers framework into a five-level cultural stack to keep humans and agents aligned. Read this to understand why onboarding your agents matters as much as onboarding your engineers.

“The Dawn of Codex-native Apps” by Katie Parrott/Context Window: AI work is splitting into two modes—delegation and collaboration—and the new meta-skill is knowing which one fits the task. Read this to discover why the allocation economy thesis was only right about half the work, and what’s in the other half.

“OpenAI Flips the Script” by Laura Entis/Context Window: Three months after Dan Shipper wrote that OpenAI had catching up to do, he and head of growth Austin Tedesco have made Codex their daily driver for strategy docs, recruiting, and other kinds of knowledge work. 🎧 🖥 Listen to their episode of AI & I on Spotify or Apple Podcasts, or watch on X or YouTube.

From Every Studio

Spiral lets you start from a blank page and stop mid-stream

Spiral is one of the first products to use Claude’s new multi-agent feature in production. When you use the Spiral CLI to request multiple drafts, a Managed Agent spins up multiple Opus-class subagents to write your drafts in parallel— cutting the response time by 20-30 seconds per draft. Spiral also shipped improvements to the core app flow. You can start a session with a blank draft in addition to a new chat message. You can stop a Spiral response mid-stream if you need to add or change something from your previous message. And the guard against AI tells in Spiral output has been improved based on user input.—Marcus Moretti

Alignment

The case for optimism. The holy grail of any product is low marginal cost and high value. That is why software ate the world and why investors loved it. Biotechnology, however, is the polar opposite. A new drug costs hundreds of millions in research and development, then has to clear approval, then has to be manufactured, and out of every 100 candidates, only two or three reach the pharmacy shelf. The gross margins are fine once a drug ships, but the pipeline to get there is long and expensive.

Biotech was never going to scale the way software did. Yet R&D productivity in biotech is rising for the first time in many years, and the investors calling biotech a money pit are back at the table. There are a couple of reasons why.

We understand biology a lot better than we did even a decade ago, because we’re able to narrow the search space before we run an experiment. AlphaFold—Google DeepMind’s AI program for predicting the 3D shapes of protein—mapped roughly 200 million in a year. Instead of spending years figuring out a target’s structure, researchers can now begin with that information already in front of them.

The second reason is the collapse in the cost of reading the genome. Sequencing a single human genome cost around $100 million in 2001 and now costs about $200. We can sequence at population scale, and once you’re able to do so, you can start to see which genetic variants drive disease and which are noise.

A turning point for personalized medicine. (Source: X/ErikTopol.)

We now have maps of protein, genes, and cells that are starting to add up to a coherent picture of disease. For most of the history of medicine, we worked at the level of the organ, so we could see the disease but never its origins. Now we work at the level where disease happens—a genetic variant produces a misfolded protein, the misfolded protein disrupts a cellular pathway, and the cellular disruption is the disease.

Of course, the marginal cost of a drug will never be zero. But the marginal cost of asking what a disease is, and where to look for the answer, is collapsing. Lower R&D costs mean more breakthrough drugs, which means patients live longer and investors make money. The incentives, for once, point in the same direction.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

Upgrade to paid

The Culture of AI Engineering

Noah Brier / Thesis — 2026-05-08 08:00:00 -0400

by Noah Brier

in Thesis

Sarah Jay Halliday/Every illustration.

Noah Brier cofounded Percolate in 2011 and learned the CEO’s hardest job: keeping a whole company pointed in the same direction. Now, at his AI consultancy Alephic—and in his own work, where he uses Claude Code as a second brain—he’s facing that same problem with agents in the mix. AI was supposed to make coordination easier. Instead, Noah argues, it has created new coordination problems of its own. In this piece, he pushes back on the “software factory” metaphor and offers a framework, drawn from Stewart Brand’s pace layers, for getting carbon and silicon to build the same thing.—Kate Lee

Strong DM is a software company whose three-person AI team calls their system for autonomous code generation a “Software Factory.” Entrepreneur Dan Shapiro’s widely circulated framework for AI coding culminates in “the Dark Factory,” named after a Japanese robotics plant that runs with the lights off. Factory.ai, which has raised millions from Sequoia and Khosla Ventures, has built an entire business around the metaphor—its autonomous coding agents are called Droids.

I’ve been incorporating many of StrongDM’s concepts about agentic software development into our work at Alephic, the consulting company I co-founded—but I have one fundamental disagreement: I think factory is the wrong metaphor.

If the hardest problem is making something people want, then the process of building software looks a lot more like Andy Warhol’s factory than Henry Ford’s. Both are focused on throughput, but Ford’s is focused on mechanization and stamping out identical cars with as little variance as possible. Warhol, on the other hand, was concerned with ensuring all work aligned with a single creative vision.

Ford’s factory—or more specifically, the assembly lines inside it—was designed to eliminate imperfections. Six Sigma, the quality methodology made famous by General Electric and beloved of manufacturers, is literally a measure of the defect rate. Quality starts with deciding what to build. This is why product-market fit is the lingua franca of startups: If you haven’t built something the market needs, nothing else—including the quality of your code—matters.

Too much of the industry treats software as a problem to be optimized and solved. That may be true for code writing and testing, but the better metaphor is staring us in the face: It’s a software company, not a software factory.

Just as in the days before AI, the hardest problem for a business is still creating this vision and alignment around it—how to keep an entire team of humans, and now humans and agents (and humans with agents), building toward the same vision, from the system architecture down to the individual lines of code. As I’ve learned long before agents existed, achieving this is much more akin to building a startup than assembling a car. What follows is my attempt at a framework for keeping an entire system of humans and agents building the same thing.

The alignment problem isn’t new—and AI didn’t solve it

I ran into this alignment problem years ago, when I cofounded the company Percolate, a content marketing platform, in 2011. As we grew the business from zero to 100 people in less than three years, my job as CEO shifted from building the product to building a company capable of building the product. My agents were people, and my job was to design the system they worked within. Culture, I concluded, was one of the strongest levers I had.

As Ben Horowitz put it, culture is “how your company makes decisions when you’re not there.” This was exactly what I needed: documents, tools, and rituals that helped each individual make the best possible decision without having to run every decision up the chain. I probably spent half my time on this, building a living culture document, running onboarding sessions for every new hire, and developing internal tools that automatically routed knowledge to the right people.

Every new technology promises to solve these coordination problems. But of course, nothing is that simple. What they do in reality is reshape the landscape around them and, in the process, create new problems that didn’t exist before. AI is no different.

Open-source software offers an early glimpse of the kind of unexpected problems that AI can create: Whereas the primary challenge a few years ago was finding maintainers willing to contribute code on goodwill alone, today’s challenge is sifting through hundreds of crappy AI-generated pull requests flooding GitHub.

Now, 15 years later, my audience at Alephic is not just the humans who work with me. Those humans are often paired with agents, and, increasingly, the agents themselves are delivering work independently. Yet the core problem is identical.

If you’ve used a coding agent for more than a week, you’ve already experienced this: The code works, but it often feels written by someone most definitely not you—ignoring obvious abstractions and stylistic norms that are present in the codebase. It looks, in other words, like a new engineer on the team who hasn’t been properly onboarded. We write onboarding documents and do training for our human colleagues, but most people don’t do this for agents. Yet.

Pace layers of AI engineering

I still have an onboarding document and set of activities every new hire goes through during their first week, including building a module in our homegrown learning system as their first coding task (a few recent editions were GPUs, quantization, and agentic commerce protocols).

But I am also building tools that go further and ensuring our code is maintainable, consistent, and built the way we’d want it built.

I think about our tooling as a kind of cultural stack, where standards inform architectures, which in turn inform specs, plans, and code. The layers are inspired by counterculture systems thinker Stewart Brand’s pace layers framework. It’s a model for how society changes at different speeds, from nature, which shifts over millennia, to fashion, which can change by the day. The lower layers move slowly; the upper ones move fast.

Stewart Brand’s Pace Layers framework offers a vision of how society works, from nature (changes over millennia) to fashion (changes daily). (Source: Stewart Brand.)

Brand argued that much of societal tension exists where the layers meet—when fashion reshapes culture (think about how social media rewired our norms about privacy) or culture becomes governance (how shifting attitudes towards marriage equality became law). Fashion, in Brand’s framing, isn’t trivial—it’s the froth layer where society experiments quickly and irresponsibly, and the occasional good idea sifts down to reshape the slower layers below. All things are ultimately reliant on the layer beneath them. Culture is subject to the laws of nature, governance to the laws of culture.

Those boundaries can and do shift, but recognizing the layers and the differing speeds at which they move is central to understanding why systems resist change, and what it takes to change them.

The “pace layers” of AI engineering help both humans and agents move in the same direction. (Credit: Noah Brier.)

Here’s how I’ve been thinking about the “pace layers” of AI engineering and how we’re building tooling at Alephic to help both humans and agents move in the same direction:

Code is fashion now. Whereas it once sat deeper in the stack, where it was slower moving and insulated by other layers, in a world of AI, code is free to produce and reproduce. The challenge is how to do it right: free of bugs at the macro level, and aligned with your own vision and best practices at the micro level. By the time we get to this layer, we have to trust that the layers beneath are strong enough to steer the system to the places we need it to go.
Plans sit beneath code. Before an agent writes anything, it should pause to survey the problem—what are the possible approaches, and what are the trade-offs? Only after completing this step should the agent pick a direction and build. Many algorithms in computer science rely on the explore-exploit shift—when you time-box a broad search phase before zeroing in on a solution to run with—and this plan phase is no different. A plan doesn’t have to be a formal document, but it must separate the thinking from the doing. Without this pause, exploration and execution get mashed together.
Specs sit beneath plans. A good plan needs a good specification. That can be a ticket (a task that needs doing), a document, or just a conversation, but it explains what we are building, why we are building it, how you know you’ve done it right, and, critically, what we are not tackling right now. That last bit is particularly important for overeager AI that wants to please by building everything you wanted and a little more. There’s a good debate in the engineering community about what constitutes a good spec. It’s the simplest set of directives that shrink the planning space: a goal, a set of acceptance criteria, and an explicit list of out-of-scope problems.
Architecture is the theory of the system. I’ve been keeping an ARCHITECTURE.md doc in all my codebases for a while now, borrowing from computer scientist Peter Naur’s idea that the real program isn’t the code, it’s the mental model the developers carry. The document shows how the business problem maps to the codebase, so you can predict where to find the code that solves this problem. It captures the key decisions and why they were made, and lays out the rules that must always hold, such as “no database queries outside the repository layer” and “no framework imports in the business logic.” Critically, it also names what’s still an open question, so AI doesn’t silently make architectural decisions for you, taking the codebase somewhere you didn’t intend.
Standards are the foundation. Some are general principles of good software-building; others reflect our specific beliefs about how software should be built. One of the insights that drove me to start the company was when, years ago, I asked a developer I had worked with for a decade if I could have all his configuration files, the ones that encode his rules for how code should be written. When I applied this rulebook to my own work, I became a significantly better developer. His strict approach to linting, or automated rules that reject code with unused imports or superfluous definitions, meant my code wouldn’t even run unless it met his standards. Cutting corners was no longer an option. At Alephic, we enforce many of these standards with tools like tests and static analysis, which let the computer check your code automatically. But a lot of this guidance also lives in skills we distribute across the company, so people can use it in whatever harness they choose. The code-organization skill memorializes how we want team members to organize their codebases, and coding-best-practices hardcodes the stylistic and technical preferences our platform engineering team has established.

With AI, we can take these ideas beyond the mechanisms of cultural exchange I had in my Percolate days (like documents and meetings) and encode them into tools that every person can interact with every day.

The layers at the bottom move the slowest, so they should get updated the least frequently. For instance, I could start keeping a document in a single project as a way to give agents context on how the codebase was organized. If it works well enough, I turn it into a skill so the rest of the team can adopt the pattern across their projects. Then, I can decide that it’s a fundamental piece of how we build and, eventually, a best practice I want to enforce for the entire team.

Companies > factories

While Henry Ford may be famous for the assembly line, he’s arguably more famous for his (likely apocryphal) quip about how if he asked people what they wanted, they’d say faster horses. Assembly lines exist to serve factories, just like factories exist to serve products, and products exist to serve companies. You don’t build a factory without an idea worth building it for.

The factory is one piece in a larger organization, where layers of co-dependent systems interact and move at different speeds. The interesting problems around alignment occur at the seams, where layers rub against each other: Is this a problem that should be solved with a meeting, a document, a skill, or a test? When does something graduate from a pattern in a codebase to something that should be established in all codebases?

At first glance, AI seems to smooth over these frictions. But that’s only true if you don’t scratch below the surface. What you find there is that the same problems that plague companies plague agents: incomplete information, overeager employees trying to solve the wrong problem, not wanting to admit you don’t know. The difference is speed. As Mario Zechner, who built open-source coding agent Pi, recently observed, the mess that used to take a large organization years to accumulate now arrives in weeks with a two-person team and a fleet of agents.

That is not a reason to retreat to being obsessed with defects. It’s a reason to take the harder problem seriously: how to keep an entire system of humans, agents, and the layers between them aligned. This problem has a decidedly human shape. Civilizations have been organizing large groups of autonomous agents to do good work for a very long time. The agents were just carbon instead of silicon.

The man underneath the layers

As part of this thesis, Every chatted to Noah about how he works and what inspires him.

If there’s a chessboard out: there’s a good chance [my kids and I] will do that instead of reverting to less enriching activities like being on screens. That chess set was designed by some friends and inspired by the New York City outdoor chess scene.

All photos courtesy of Sarah Jay Halliday for Every.

To keep me from checking email during calls: I like to take notes on paper, currently with a Campus notebook and rOtring 600 pen.

Re-reading the Simple Sabotage Field Manual: a 1944 document by the precursor to the CIA, I was struck by how closely the instructions for sabotage match the realities of corporate life in America. I hired a designer and printed a few hundred beautifully bound copies, which I gave away at my conference.

A few books I’ve pulled off the shelf recently: Toyota Production System (I’m thinking a lot about how we can take inspiration from these kinds of organizing principles to align agents), The Medium Is the Message (Marshall McLuhan is a hero of mine and this comes off the shelf frequently when I just want to bump my brain a bit), and Orchestrating Ambiguity (recently recommended to me, it’s a book of books about how to design for emergence in organizations).

I really love working: before anyone else has woken up, but that also requires that I wake up before then. So mostly it’s just morning time after I get my kids on the bus.

My dog’s name: is Kaiya. She’s two and a half, and very much a mutt.

Noah Brier is the co-founder of Alephic.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

‘

’’

Inside Anthropic’s 2026 Developer Conference

Dan Shipper, Marcus Moretti, and Katie Parrott / Chain of Thought — 2026-05-07 12:00:00 -0400

by Dan Shipper, Marcus Moretti, and Katie Parrott

in Chain of Thought

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

To our surprise, the biggest launch from Anthropic’s developer conference in San Francisco yesterday wasn’t a model or a feature. Instead, it was the company’s announcement of a deal with SpaceX to allocate all of the capacity in the latter’s Colossus supercluster to Claude.

Anthropic has been riding a historic demand surge over the last year as Claude Code opened up a new wave of agentic coding for engineers and non-engineers alike. But compute constraints have caused friction even amongst its most die-hard fans—we’ve written previously about being frustrated with its OpenClaw restrictions and the speed of its latest models like Opus 4.7.

The deal with SpaceX changes that equation. Anthropic has already doubled rate limits for subscription plans, removed peak-hour limits on Pro and Max accounts, and raised API rate limits by as much as almost 17 times for certain tiers.

Other than that, the big story is Claude Managed Agents, Anthropic’s hosted agent product. The company released three new features:

Multi-agent orchestration: a coordinator agent that spins up subagents in parallel baked into the platform
Dreaming: Anthropic’s general-purpose version of compound engineering, a feature that allows agents to learn from past sessions to improve between runs
Outcomes: Anthropic’s answer to Codex’s /goals command, allowing developers to specify an outcome and run an agent in a loop until the outcome is achieved

By themselves, these features are nice but not groundbreaking. What’s more important is that what an AI platform is has changed. In the GPT-3 days, the platform was a text completion end-point: Send text in, get text out. Now, with Claude Managed Agents, the platform is an AI model with a harness and host computer—all provided with unlimited scaling by the model companies.

Cora general manager Kieran Klaassen and I reported live from conference with our biggest takeaways, including the xAI compute deal, doubled Claude usage limits, Claude Managed Agents, and why the battle lines between OpenAI and Anthropic are starting to become clearer. Watch now:

We also recorded a conversation with Angela Jiang, head of product for the Claude platform, and Katelyn Lesse, head of platform engineering. The full episode drops tomorrow on AI & I—highlights below.—Dan Shipper

Vibe Check: Claude Managed Agents

Spiral general manager Marcus Moretti uses the platform’s new features

Anthropic launched Claude Managed Agents in April, and since then, Every’s AI writing tool Spiral has used the platform to power its API and command line interface (CLI), which lets developers and other agents talk to Spiral outside the web app. Claude Managed Agents run on Anthropic’s servers, instead of us having to run them on our own.

We set up a new Managed Agent in an afternoon and deployed it to power our API the next day. We’ve incorporated two of the new features Anthropic announced yesterday (memory and multi-agent orchestration) and are deploying the third (outcomes) soon.

Memory: Every’s editorial and social expertise—how to write a good X post, for example—lives in an Anthropic-hosted global memory store. The memory store lets us avoid including every piece of editorial and social expertise in the agent system prompt—the standing instructions that tell the agent what to do every time it runs. When a user asks for a podcast description, the agent doesn’t need to also recall how to craft a great LinkedIn post. It only pulls the relevant expertise with each request, thereby making responses faster.

Each Spiral subscriber also gets their own personal memory store. When you tell Spiral that you prefer em-dashes over semicolons or that your company name is one word and not two, it will remember and apply your rules by default the next time you run it.

Multi-agent orchestration: When users request a single draft of a piece of writing, one agent using Opus 4.6 Fast handles the workflow end-to-end. For multi-draft requests, a coordinator agent using Haiku 4.5 spins up multiple Opus 4.6 Fast subagents to compose drafts in parallel. Before multiagent orchestration, multi-draft requests were handled serially, and each draft added 20 to 30 seconds to the overall request time. A multiagent approach also reduced our costs for multi-draft requests by about a third because we were able to use cheaper models for part of the work.

Outcomes: Anthropic’s new outcomes capability is a feedback loop where one “grader” AI checks another AI’s work against a specified goal. Spiral’s main value proposition is writing quality, so we’re using outcomes to set up a rubric to ensure the writer agent’s output meets Spiral’s editorial standards and matches the user’s style guide. The rubric the grader AI uses is generated on-the-fly based on the global standards, the user’s writing style, and their writing preferences from memory.

Memory and multi-agent orchestration are live in production, and outcomes is coming soon. You can see the features in action by running npm i -g @every-env/spiral-cli && spiral login or logging into Spiral and using the install command on the Agent and API keys page.

Having set these features up in production, here’s what I think:

You are not totally locked into Anthropic’s universe. Every engineer worries that when a company offers a hosted version of something, it will be hard to leave. With Managed Agents, the agents themselves, sessions, and memory are all stored on Anthropic machines, and the agents themselves can only be powered by Claude—a managed agent can’t run on GPT-5.5 or Gemini.

I’ve mitigated this lock-in in two ways: First, we save agent runs to our own database in addition to Anthropic’s. This way, chats from the API appear in the web app just as web chats do, but it doubles as a safety net. If we ever wanted to leave Anthropic, we’d have all our historical data. Second, the Managed Agents platform lets you define custom tools for the agents. Those tools run on our servers, which means we can use whatever model we want inside the tools themselves. The coordinator agent is locked to Claude, but we control the layer underneath.

Using multiple agents has trade-offs. Multi-agent orchestration has allowed us to create multiple drafts faster and cheaper. However, coordination between agents adds overhead that prevents greater speed gains. Debugging also gets harder: If a Spiral draft comes back subpar, we have to investigate both the coordinator agent and the writer agent to identify the root cause. I’d recommend multi-agent orchestration only when your agent benefits from running subagents in parallel or using a mixture of models. Otherwise, a single agent works well.

Memory’s design is intuitive. Each memory is just a folder of markdown files, and each memory store is attached to a session with instructions that tell the agent when to consult it. Anthropic designed this feature thoughtfully—they kept it simple.—Marcus Moretti

The feature to watch: Dreaming

Cora general manager Kieran Klaassen sees his own philosophy mirrored back at him

Kieran has spent the last year trying to get agents to learn his preferences instead of forcing him to restate them every time. That’s compound engineering in a nutshell—each run leaves the system better prepared for the next one. So when Anthropic officially announced dreaming at yesterday’s Code with Claude event, he had a familiar feeling: The thing he’d been building was now a feature.

Dreaming is Anthropic’s name for a background process that reviews an agent’s past sessions and memory stores, finds patterns, and rewrites memory so the agent improves between runs. OpenClaw introduced a similar feature in April, but Anthropic’s take seems more focused on what teams of agents learn collectively than what a single agent remembers. The system learns from repeated corrections, recurring mistakes, and workflows that run well—creating, over time, an institutional knowledge base.

The feature currently lives inside Claude Managed Agents as a research preview, which is where Marcus has been testing it—with early success. Every plans to have its production agents dream as soon as the feature ships in a stable public release. But Kieran’s immediate question was: When is this coming to Claude Code?

Claude Code, after all, is where developers spend their days teaching agents the same repo quirks, the same testing rituals, the same “please don’t do it that way” preferences. Those preferences can go into memory files, but memory files get messy. They collect duplicates, stale rules, one-off notes, and contradictions—and as Marcus notes, memory introduces overhead, so you trade speed for quality every time you use it.

A dream cleans that up. It takes up to 100 past sessions and produces a reorganized memory store with duplicates merged, contradicted entries replaced, and new insights pulled out—memory that organizes itself, in Marcus’s framing. If Anthropic brings that loop to Claude Code, memory starts to look less like a notes folder and more like accumulated taste.—Katie Parrott

Inside Anthropic

What the company’s platform team told us off-stage

While at the conference, Dan sat down with Angela Jiang, Anthropic’s head of product for the Claude platform, and Katelyn Lesse, head of platform engineering, for a recorded conversation. Three things that stood out:

The generic harness is dead. Angela told us that building a generalized harness that lets you switch any underlying model for a different one—standard practice even a few months ago—is a losing strategy. Different harnesses paired with the same model produce “drastically different” results on Anthropic’s own evaluations. When the team built memory for Managed Agents, they tested multiple harness designs, and the performance gaps were large enough to make model selection feel secondary.

Our own experience backs this up: Our agents run on Claude with a harness tuned specifically for how Claude works. If we don’t want to risk getting locked in, we have to—as Marcus writes above—build the harness in a way that lets us swap in GPT or Gemini. But Angela’s argument is that the bigger risk is leaving performance on the table.

Infrastructure is the real wall. Katelyn told us that most people building agents expect the hard part to be the prompting, context window management, and tool setup required to get the most out of the model. In practice, everyone hits the same wall: infrastructure. They have to keep servers running, securely sandbox, prevent connection drops, and store transcripts. Before Marcus set up Managed Agents in an afternoon and deployed it the next day, we spent months on exactly that kind of plumbing.

Your agent needs a babysitter. Dan raised this problem directly: Agents get stale fast, running old models and old prompts with nobody responsible for updating them. Our solution so far has been to assign every agent an owner to keep an eye on it. Katelyn said the Anthropic team has built skills to help agents upgrade themselves to new models. “The most AGI-pilled people,” she added, “are running agents that monitor their agents.”

The full episode with Angela and Katelyn drops tomorrow on AI & I—we go deeper on where the platform is headed, what “outcome + budget” means as a design philosophy, and why Anthropic thinks Claude should eventually pick its own sub-agents.—KP

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Collaborate with agents on documents with Proof.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

For sponsorship opportunities, reach out to sponsorships@every.to.

OpenAI Flips the Script

Laura Entis / Context Window — 2026-05-06 08:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

There’s no resting on your laurels in the AI race: OpenAI’s Codex went from trailing Anthropic’s Claude Code to pulling ahead in functionality, at least for now, in a matter of months. Today, Every CEO Dan Shipper explains why OpenAI’s coding app has become his daily driver for work, head of growth Austin Tedesco shares his no-nonsense advice for switching over from Claude Code, and Spiral general manager Marcus Moretti argues it’s OK—good, even—to let some AI trends pass you by.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

‘AI & I’: Why we switched from Claude Code to Codex

Codex takes the lead

If you’re looking for evidence of AI’s unrelenting pace, here it is: In January, Dan wrote that whoever wins vibe coding wins how you work on your computer—and that OpenAI had some serious catching up to do.

Three months and the release of OpenAI’s latest model later, Codex is there, and in a new episode of AI & I, Dan and Austin get into why they do much of their knowledge work in Codex now. They cite the power of GPT-5.5, paired with a desktop app that is faster and more powerful than Claude Desktop or Cowork.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are a couple of Dan and Austin’s favorite current use cases for Codex:

Austin uses Codex for strategy docs. Austin needed to write a go-to-market plan for a new Every product but kept getting pulled away by other work. So he pointed Codex at the team’s Notion meeting notes, Slack threads, and his preferred template and told it to pull together content where they’d discussed strategy and transform it into an action plan. What came back was 80 to 90 percent of the way there.
Dan uses Codex for recruiting. When he is recruiting people to work at Every, Dan starts with a sense of where strong candidates might have learned the skills Every needs, instead of looking for a specific job title. He then asks Codex to find people who match that career arc—for example, to find someone to help scale Every’s courses, he looked for candidates who had worked at education startup General Assembly before transitioning into AI.

Miss an episode? Catch up on Dan’s recent conversations with LinkedIn cofounder Reid Hoffman; the team that built Claude Code, Cat Wu and Boris Cherny; Vercel cofounder Guillermo Rauch; podcaster Dwarkesh Patel; and others, and learn how they use AI to think, create, and relate.

Migration anxiety

Claude Code-to-Codex

If you want to switch to Codex or any other coding app, how should you think about migrating? When your setup includes app-specific project folders, skills, plugins, or integrations, it can be daunting.

Austin’s migration from Claude Code to Codex was disarmingly simple: He opened his Every work project in Codex, told it he typically worked in Claude Code, asked it to inspect the folder, and told it to update anything that should work differently in Codex.

When Codex got something wrong, he handled it in the moment and told it, “This doesn’t look great. Can you fix it?” And it did.

Before GPT-5.5, staff writer Katie Parrott hadn’t used ChatGPT for writing in almost a year.

Now, she splits her writing sessions between Claude Code and Codex. She moved over by giving Codex the writing and editing skills she had already saved as Markdown files on her computer and asking it to adapt them for its own environment.

Steal this workflow

Join the early majority

Spiral general manager Marcus is OK with letting most AI hype—managing a swarm of OpenClaws each running on its own Mac Mini, for example—pass him by. Earlier in his career, he was an early adopter of new tools and technology trends, but these days, he finds himself closer to the early majority section of the adoption curve. As the one-man team behind Every’s AI writing product, he has a lot to do—if he’s going to add something new to his workflow, it has to clear a high bar.

Marcus is comfortable being among the 34 percent of the population who are slightly early to adopting a new technology. (Image, which is based on Everett Rogers’ Diffusion of Innovations framework, courtesy of Laura Entis.)

Here’s Marcus’s strategy for determining what’s worth testing.

Start with a real problem. A useful filter is to focus only on tools or services that solve an existing issue. For example, Marcus decided to test out Stripe’s token-based billing feature—which allows you to measure how much users cost you in tokens—because of a genuine challenge he was facing: Spiral needed a better way to track AI usage costs across models.
Don’t fall for productivity theater. Marcus ignores demos that brag about how many machines or agents someone is running simultaneously. He doesn’t care about what the setup looks like; what matters is whether it will make his life better.
Sit back and see what pans out. Marcus generally waits to try a product until there’s evidence that companies he respects are using it in production, even by checking for logos on a tool’s homepage showing which brands are using it. Even better if the product is from a company he already knows and trusts, like Stripe or Anthropic. With the Stripe use-based billing example, the calculus was simple: “Great company solving a real problem I have—I’ll try it,” he says.

Test it out for yourself:

Pick one AI tool you feel vaguely guilty for not trying and write one sentence: “Before this tool, I _____. After this tool, I can _____.” If you cannot fill in both blanks, let yourself off the hook.

Alignment

Every’s COO Brandon Gell on cultivating curiosity in an AI world

My son was born eight months ago. Since then, I’ve asked myself regularly: How can I teach him to lead a fulfilling life, especially when it comes to technology?

I’m a computer native, born in 1994, the year Netscape was first released. My son was born in 2025, the year Claude Code was invented. The world I grew up in rewarded people with the fortitude to find answers. The world he’s growing up in has made that table stakes. So if the answers aren’t scarce anymore, what is?

Curiosity. Knowing what to ask next—having the instinct to push further, to connect unexpected dots, to wonder about something nobody else paid attention to—is what’s scarce.

It’s also distinctly human. It causes us to make connections between unrelated ideas and connect dots that don’t follow obvious patterns. It brings our personal values and lived experiences into what we explore, shaping not only what we discover but why it matters. It pulls us toward questions we find fascinating—not because they’re useful, but because we can’t stop wondering.

AI can’t replicate that. Curiosity requires perspective and taste, things that are difficult to instill in a model. And even if you could, it would never be as diverse as the perspectives of 8 billion humans, each one shaped by a different life.

I want my son to be insatiably curious, and I’ve realized that to instill that in him, I need to cultivate it in myself. Which means developing it and maintaining it, like a muscle. Here’s what that looks like:

Lesson 1: Use AI to go deeper on something you already care about

After I sold my insurance company, Clyde, I realized how disconnected I had become from my creativity outside of work. The same curiosity that drove me to explore the idea that had become my company had gone dormant as I focused singularly on its success. I realized just how lost I was while driving and listening to music. I could hear the music, but I could no longer feel it.

Not long after this drive, my friend Mike showed me some speakers he had built. I realized in order to truly hear the music, to find my curiosity, I had to build a pair of speakers and a subwoofer. The project would combine my interest in architecture, experience with woodworking, and total lack of knowledge in audio engineering.

Next thing I knew, I was hours deep into a ChatGPT conversation about sound waves and acoustic design, learning how.

Lesson 2: Use AI to build something you wouldn’t otherwise make

For the past 15 years, I’ve on and off tried lucid dreaming. So when I saw the Dream Recorder GitHub repository, an open-source project that uses video AI models to visualize your dreams as cinematic reels on a bedside device, I knew I wanted to make one for myself. The problem? I’d never built any hardware, didn’t have a 3D printer, and calling myself a front-end developer would be generous. So I used AI to help me adapt the open-source repository and build something I’d never otherwise be able to make. I bought a 3D printer, improved the original code, and spent many long nights perfecting my dream recorder.

I still don’t know how to code. But that doesn’t matter. In both situations, I used AI to leapfrog the unknown and explore my curiosity and my dreams. AI was a learning partner, not an answering machine. It taught me the things I don’t know, and I combined that with the skills I already had to build something new.

What this means for all of us

In a world where the “right” answer is one AI prompt away, we need to stop rewarding our kids and our students for getting the answer right and start rewarding them for the quality of their questions, the depth of their curiosity, and their resilience to ask the next question when in uncharted territory. Curiosity is what separates the people who use AI as a crutch from the people who use it as a rocket.

In a world where there’s always an answer, let the next question be your guide.—Brandon Gell

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn. For sponsorship opportunities, reach out to sponsorships@every.to.

Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.

The Dawn of Codex-native Apps

Katie Parrott / Context Window — 2026-05-05 07:00:00 -0400

by Katie Parrott

in Context Window

Midjourney/Every illustration.

Inside Every

Working with AI right now often means making the same judgment call dozens of times a day: Hand this task off to an agent or stay close to the process? “The landscape of working with AI is bifurcating,” is how CEO Dan Shipper put it in Every’s Monday standup. On one side is the agent you delegate to. On the other is the agent that sits beside you while you write, code, triage, revise, and decide.

Watching the Every team work, you can’t unsee it. Dan delegates bug reports for our collaborative document editor, Proof, to his OpenClaw agent, R2-C2. But he stays close to his inbox through a combination of Codex, Every’s AI email assistant Cora, and a document with custom rules (steal his workflow below). Kieran Klaassen hands the middle of his compound engineering workflow to the model but works closely with it to brainstorm at the beginning and polish at the end. I (Katie Parrott) send the model off to do research, but I’d never trust it to execute a full draft without my hands firmly on the wheel.

Which means the allocation economy thesis was only right about half the work. Some of it still wants delegation, but the other half wants you to stay close, pairing on every move with the model in the same window. The two halves demand different skills, and the meta-skill is knowing which is which.

Think of it as the AI version of the serenity prayer: Grant me the serenity to delegate the work I can, the expertise to sit with the model on the work I can’t, and the wisdom to know the difference.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Steal this workflow

Get to inbox zero with Codex

The perfect email workflow is the white whale productivity people have chased for a decade, Dan included. His latest AI-native version puts the agent in the inbox and the human in a shared document, where every draft and decision stays visible. Here’s how he does it:

1. Write a one-page operating manual for your inbox. The document, which Dan keeps in Proof, names his VIPs, describes what to auto-archive, summarize, or draft, and explains how to handle scheduling.

2. Open your agent-native email tool in Codex. In Codex’s browser pane, Dan loads Cora, which gives the agent two ways to act: command line instructions to archive threads—but also the ability to click through the inbox like a person.

3. Work from a document instead of your email. Dan has Codex create a separate Proof document for each inbox run. Codex sweeps the inbox, archives what the operating manual says to archive, and adds every draft or decision to the bottom of the document. Dan replies inline: “Spam,” “archive,” “reply just to Willie asking what he wants to do here,” “send the invite, draft a reply to Tony.” Codex picks up each instruction, drafts in Cora simultaneously as Dan moves onto the next message, and waits for approval before sending.

Try it this week: Write a one-page “how to do my email” document with your own VIPs, auto-archive rules, scheduling preferences, and reply style. Then open Codex, load your email client in its browser pane, and paste in your instruction document and this prompt:

“Sweep my inbox using this operating manual. Put every draft and decision in this doc and wait for me before sending anything.”

Dan’s email workflow as set up in Codex: chat on the left, web browser with Cora on the right. In this version, Dan has also vibe coded a one-page interface that plugs into Cora’s CLI. (Image courtesy of Dan Shipper.)

New job alert

If the new meta-skill is knowing when to delegate and when to stay close, here it is in job-description form: Airtable is hiring an AI Agent Architect, Customer Experience.

Support software used to route tickets and surface help center articles. Now it can read context, act across tools, and decide what to do. Which means someone has to design the boundary around support agents—what knowledge they retrieve, which APIs they can use, when they can modify an account, how failures get measured, and where the agent hands the work back to a person.

Tool for thought

Musk’s five rules of automation, except for agents

In 2021, Elon Musk introduced his “algorithm,” a five-step rubric he uses at Tesla and SpaceX to figure out what a process needs before trying to make it faster or handing off any part of it to a machine. Willie Williams, Every’s head of platform, has been exploring how it might apply to agent workflows:

Question every requirement. Every rule, checkpoint, and instruction in a workflow has to justify itself by naming the specific thing that goes wrong without it. If nobody can answer that, it shouldn’t be there.
Delete what you can. Cut steps, approvals, reviews, and agents that don’t survive step one. If you’re not occasionally removing something you later need to restore, you haven’t cut enough.
Simplify and clarify. Break the remaining work into smaller, clearer pieces. Each task should have a single owner, a defined output, and only the information and tools it actually needs.
Accelerate feedback loops. Shorten the time between handing work to an agent and knowing whether it succeeded. Surface errors early, run independent tasks at the same time, and stop making the workflow wait on unneeded approvals.
Automate last. Start with a checkpoint at every step. Only after a workflow is necessary, lean, and fast should you take the humans out of the loop.

Still, Musk’s algorithm was intended for factories building electric cars, rockets, and satellites—hardware. They don’t directly translate to AI agents. “These rules should apply to the world of software automation,” says Willie, “but we don’t actually have them yet. And we have to work on finding them.”

Model card

ChatGPT/Every illustration.

Signal

The hard part isn’t the model

The bifurcation Dan named in Monday’s standup—delegate to the agent, or sit beside it—is the same problem for which frontier labs are now selling enterprise solutions.

OpenAI made it explicit last month with its new Frontier Alliance initiative pairing OpenAI engineers with large enterprises to deploy agents inside their workflows. “The limiting factor for seeing value from AI in enterprises isn’t model intelligence,” writes OpenAI. “It’s how agents are built and run in their organizations.”

Then this week, Anthropic announced a parallel move—a new services firm with Blackstone, private equity firm Hellman & Friedman, and Goldman Sachs to help companies “design, build, and maintain” Claude deployments.

Both labs are saying the quiet part out loud: The hard part of deploying and working with agents is everything around the models themselves—the context, permissions, handoffs, evaluations, and human relationships that decide whether a model should run ahead or sit beside you. Dan’s inbox workflow and Airtable’s support-agent job are microcosms of the same problem, now landing on the enterprise balance sheet. (Every’s consulting practice also helps companies implement AI workflows and products.)

What to do this week:

Write down how you want the work done before you prompt. WhatOpenAI and Anthropic are charging Fortune 500s millions for is the document Dan wrote himself in an afternoon: who counts as a VIP, what to auto-archive, when to escalate. Start there.
Split your tasks into “hand off” versus “stay close.” Bug triage can run on its own. Important email drafts need you in the loop. Sort before you delegate.
Keep the agent’s actions visible. Drafts in a shared document, tracked changes, an action log—whatever the form, you need a record. If you can’t audit the agent’s work and revert it if needed, you aren’t the one driving.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

I Let ChatGPT Manage My Workweek

Katie Parrott / Working Overtime — 2026-05-04 11:00:00 -0400

by Katie Parrott

in Working Overtime

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

I sat down to write my second-quarter goals at 4:30 p.m. on a Tuesday in early April. It was the day after I was supposed to turn them in when I decided to be an adult and survey the damage from the first quarter. And I do mean damage. I’d written only half of the columns I’d committed to. Another project I had promised hadn’t even gotten off the ground.

I could give the usual excuses—the quarter was busy, the project hit walls outside my control—but the real culprit was obvious: I may be a great writer, but I am garbage at project management.

For 15 years, I handled this weakness by tiptoeing around it. I didn’t take on managerial roles that would have required more organizational skills. I didn’t take on so much freelance work that I couldn’t keep the deadlines in my head. I passed on ambitious projects—too many moving parts.

This duct-taped approach worked until I decided to join Every full-time in April. If I were going to take on more responsibility as a full member of the team, I needed to get serious about project management. Which, in 2026, meant I needed to bring in AI.

So I built myself a project manager: a ChatGPT agent that holds my OKRs—objectives and key results, the goals that define a successful quarter—watches my calendar, reads my Notion to-do list, and helps me decide what to do next. Otherwise, I’d spend my day opening Slack, refreshing X, panicking lightly, repeat.

My ChatGPT project management agent helpfully points me toward where to put my focus for a day. (All images courtesy of Katie Parrott.)

Most AI-at-work advice starts with the part of your job you’re already good at: Write faster, code faster, analyze faster, ship more. I’m interested in the other side of the equation: using AI to support the part of work that makes it hard to believe you’re good at your job.

I’ve set up project management with both my Plus One agent, Margot, and as a ChatGPT agent. I’m featuring the ChatGPT agent here, but you can create your own project manager with any system that gives you a combination of memory, context, and intelligence—more on that below.

Why AI can babysit my to-do list now

I’d tried using ChatGPT as a project manager before, during a freelance month last year when I’d overbooked myself and had deadlines staring me down like unread letters from the IRS. I would open a new chat and type some version of: “I have this deadline, this deadline, and this deadline; this meeting, this meeting, and this meeting. What should I do?”

For one-off triage, it worked well enough. The problem was the context that it had about me—or didn’t. Every time I came back, I had to explain everything again: the clients, the deadlines, the pieces in flight, the meetings, the priorities, the fact that one project was more important than another for reasons that were obvious to me and invisible to the chat window.

A glimpse of my ChatGPT project management system, manually informing the AI of my deadlines day by day.

Then, over the past six months, several things converged to make more comprehensive project management using ChatGPT possible.

First, memory improved enough that the system could carry context and apply it across conversations. Next came advanced tool use, which enabled AI to navigate and use browsers and other tools. Integrations meant that ChatGPT could finally do things like open my Notion, check my calendar, and read my Slack. Finally, products like OpenClaw and Every’s Plus One wrapped all this firepower in a package that even I, a technical neophyte, can work with.

If you tried to do something with AI a year ago—like manage a marketing workflow or run an analysis of financial results—and it didn’t take, try again. Chances are that the model and the product around it have shifted in ways that move the finish line in your favor. It was time for me to take another swing at AI-native project management.

What I built: A project management agent

Saying “I built an agent” makes the whole thing sound more sophisticated than it is. The truth is that AI did most of the work—I just put the right information in places AI could see it, connected the tools and software where my work happens, and described the job I wanted done.

Context to shape the agent’s memory

With context, the agent can turn a vague goal into Thursday’s first task. Without it, it’s just a Magic 8 Ball for to-do lists.

So, as I was going through the setup for my agent (which you can do directly through the chat interface), I made sure to provide plenty of documentation for the agent-builder to build on top of. Most importantly, I gave it a link to a Proof document with my OKRs, four objectives, a dozen-ish key results, and a rough sense of a stack-ranking of projects. Then I asked it to do the first piece of project management I am worst at: I asked it to turn “a successful quarter” into concrete phases, milestones, deadlines, and tasks.

The agent broke my OKRs down into a week-by-week action plan, then converted that into tasks for my Notion to-do list.

“Stand up a reliable Vibe Check pipeline” is a concrete goal, but not something you can do on a Thursday afternoon. The agent broke it into smaller pieces: Audit the existing process, draft a brief outlining suggested changes, solicit feedback, and implement the changes.

The first useful thing the agent gave me was a draft to respond to. Some of the tasks were so abstract I couldn’t tell where to start, and others were so chunky they were really projects in disguise. So I went back and forth with the agent to set a few parameters—mostly telling it, “This is too confusing for me to act on”—and it split, renamed, and rewrote the items until the plan had been divided into projects and tasks that were doable.

Then the tasks went into Notion, where they became a board with deadlines, statuses, and linked OKRs.

Integrations give the AI places to act

The next step was adding integrations so that the agent could track my work across tools.

ChatGPT agents make this almost embarrassingly easy now. In a few clicks, I connected the agent to the places where my work already lives: Notion, Slack, Google Drive, and Calendar.

The dashboard for my project manager agent, complete with integrated apps, context files, and memory.

This is the part that would not have worked a year ago. Back then, ChatGPT only knew what I remembered to paste into the chat box—it couldn’t take action on my behalf. Now the agent can read the systems I already use. It can see on my calendar that Thursday morning is open, that a discussion on a Slack thread created a new task for me to do, that an article draft exists somewhere in Drive, and that a project belongs to an OKR and isn’t just a guilty little cloud floating around on Notion.

Instructions tell the agent what to do

Context tells the agent what matters. Integrations tell it where to look. Instructions tell it what to do. I had to write fewer of them than I expected.

I opened the ChatGPT agent builder, which you can find in the left-hand sidebar of the ChatGPT web app. Then I explained, in plain English, what I wanted: a project-management agent that would help me organize each week and keep my quarterly objectives on track. The builder turned that into a fuller brief with its role, workflows, and instructions on how to deliver responses, where to store information for future reference, and what NOT to do (for example, invent a status or deadline).

The beginning of the instructions that power my project management agent.

Ultimately, the instructions I care about boil down to this: Help me organize the week, keep the quarterly objectives on track, and do the useful work first instead of requiring so much input from me that I might as well have gone in and looked at all the inputs myself. I might as well have

I can’t automate the ‘me’ of it all

I may be offloading a type of work that I hate and am bad at, but I’m also learning new skills—or relearning them for the agentic era. Mostly, these lessons emerge through failure.

Oftentimes, the failure is one of communication. It took time to get in the habit of keeping my agent up-to-date on the details it can’t see. An article would be published, and I’d forget to tell the agent or move the card in Notion that corresponded to it. Deadlines moved while Notion stayed stuck on the old date, and the agent became about as useful as my dog when I tell her to go get a toy from upstairs.

My Notion to-do list functions as the source of truth for me and the agent about the status of projects. If it’s not up-to-date, the whole system falls apart.

I have to tell the agent when a draft is in review or is published, a deadline changes, or a new task appears in a meeting. Updating a Notion page is annoying. But annoying is better than carrying the whole quarter in my head.

Another wrinkle is the “me” problem. The agent can’t change my personality. It can’t make me less anxious or more confident in my ideas. So, for example, I’ve been sitting on a proposal for my biggest Q2 project for a week because I can’t convince myself it’s good enough to send. The agent knows this. It reminds me that it’s overdue every day. And I keep avoiding it. The agent can draft the email and flag the delay, but it can’t tell me if the idea is good. That part—deciding to believe in the thing you made—is still mine. AI, it turns out, is no match for my neuroticism.

Knowing while there’s still time

Near the end of every week, I ask the agent for the thing I used to dread the most: a status report. It reviews the work that was supposed to get done, what moved, what slipped, and which goals are starting to look further from reach. Sometimes the answer is satisfying. Sometimes it is rude in the way accurate things are rude.

One day recently, I asked it for a report on my OKR progress: One project had momentum but needed a cleaner path to delivery; another looked healthy, but only if I had artifacts to show for it that the agent couldn’t see; my publishing cadence was fine, but would be better if I set up the idea backlog the agent and I had talked about.

The agent’s take on the status of my three active OKRs. There’s nothing on fire, but it gives me a sense of where to put my focus in the next few weeks.

This is the kind of thing a competent project manager would probably notice in a 20-minute check-in. Which is exactly what I want from the agent: making the obvious visible before it becomes a delay that turns into a problem that snowballs into a failed objective or, worse, a disappointed teammate.

For most of my career, deadlines and prioritization felt like weather systems: suddenly overhead, occasionally catastrophic, mostly outside my control. Now I can see the front forming in time to take action.

If AI has only been helping you with the part of work you already do well, try pointing it at the part you have been avoiding. If the promise of AI is that it frees up humans to do what only humans can do, that should include freeing us from things we hate to do. Otherwise, what’s the point?

I am still bad at project management. The part of work that makes me feel like I am faking adulthood still exists. But I have support for that now, so the writing gets the hours it deserves.

Build your own project manager

If you want to set up your own project-management agent, here’s what I’d gather before you open the agent builder.

1. Context: The documents to feed it

Think of this as the agent’s onboarding material. The more it can read about your priorities, the less you’ll have to repeat in chat.

OKRs or quarterly goals. The single most important file. If you don’t have written OKRs, write a one-page version of what a successful quarter looks like—your objectives, the rough metrics that prove them, and any projects you’ve already committed to.
Strategy or planning docs. Anything that explains the why behind the work: team strategy memos, annual plans, project briefs, and kickoff documents.
Workstream documentation. Standing responsibilities you want the agent to know about, such as your editorial calendar, cadence of the content you publish, and recurring meetings.
A stack-rank of your goals. Which OKR matters most? Which project is the one you’d protect if everything else slipped? Write this down.

2. Integrations: Connect the tools where you work

Connect the systems where the work actually lives.

A task manager. Notion, Todoist, Asana, Linear, or whatever you already use. This becomes the source of truth for the status of your work. If you don’t have one, set one up before you build the agent.
Your calendar. Google or Outlook. The agent needs to see where your time is spent versus where you said it would be spent.
Slack or your team chat. This allows the agent to pick up tasks that get assigned in conversation and never make it into your task manager.
Cloud drive. Google Drive, Dropbox, OneDrive, or wherever your drafts and working documents live.

3. The prompt

Here’s the brief I gave my agent builder. Keep the structure and adapt the specifics to your work.

          Project manager agent prompt
          Other
        

You are my project manager. Your job is to help me organize each week and keep my quarterly objectives on track.
You have access to my OKRs, my Notion to-do list, my calendar, my Slack, and my Drive. Treat my OKR document as the source of truth for what matters this quarter, and treat Notion as the source of truth for project status.
Each Monday, give me a one-page plan for the week: what's due, what's at risk, and what I should focus on first, based on which OKR each task ladders up to. Each Friday, give me a status report: what got done, what slipped, and which goals are starting to look further from reach.
When I ask, "What should I work on now?", check my calendar for available time and my Notion board for open tasks, then recommend one thing—not five.
Don't invent statuses, deadlines, or tasks. If a date isn't in Notion, say so. If a task is ambiguous, ask me one clarifying question rather than guessing.
Protect my stated priorities from my daily impulses. If I ask for help with something that isn't on the OKR list, flag it before you help.

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Codex Goes to Work

Every Staff / Context Window — 2026-05-03 00:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every illustration.

Hello, and happy Sunday! Was this newsletter forwarded to you? Sign up to get it in your inbox.

Knowledge base

“A Guide to Agent-native Product Management” by Marcus Moretti/Guides: Marcus Moretti runs Spiral as a one-person team. This guide walks through the two new compound engineering skills that make it possible: /ce:strategy, which interviews you to produce a strategy document, and /ce:product-pulse, which replaces your analytics tools with a founder-style analyst briefing that saves to a folder as your product’s running memory. Read this to set up both commands for your own product and understand how they plug into the broader plan-ship-review loop. Plus: The one thing Marcus still writes himself is the roadmap. Read the accompanying essay for his full workflow, plus his two-part test for which SaaS products will survive the agent era.

“You Are the Most Expensive Model” by Mike Taylor/Also True for Humans: Most teams are routing entire workflows through frontier models when cheaper, faster alternatives would do the job just as well. The real cost isn’t the tokens—it’s your attention. Mike Taylor introduces incremental determinism: a four-level framework for deciding which tasks deserve Opus and which can be handed to Haiku, a script, or no model at all. Read this to know exactly which lever to pull when your AI costs start to add up.

“One App to Rule All Knowledge Work” by Katie Parrott/Context Window: Austin Tedesco now runs 80 percent of his daily workflow through Codex, a tool he called “trash” for non-engineers just months ago. Plus: why Austin reviews every agent output in its destination app, a prompt for letting agents design their own automations, and how to use Every’s compound knowledge plugin to catch confidently wrong data before a plan gets enacted.

“Compute Is the New Cash” by Laura Entis/Context Window: On AI & I, Emily Glassberg Sands, head of data and AI at Stripe, talks to Dan Shipper about how agents are becoming economic participants—and why fraud is now a full-funnel problem, not just a checkout one. Plus: GitHub and Anthropic are both moving to usage-based pricing as flat-rate subscriptions break down under agentic workloads; Dan and Kieran Klaassen offer contrasting takes on whether you should talk to your agents or just let them work; and Naveen Naidu‘s three-step workflow for turning post-launch customer feedback into a product queue. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on X or YouTube.

“Who Isn’t Using GPT 5.5” by Laura Entis/Context Window: One week after GPT-5.5’s release, the Every team checks in: Kieran is now splitting his time evenly between Codex and Claude Code, but Natalia Quintero ran a head-to-head proposal test and her Claude agent won. Plus: why six unicorn CTOs have stepped down to become Anthropic ICs; how Kieran hit 24 pull requests in a single day by having agents watch user complaint videos overnight; and Willie Williams on why AI has turned coding into a slot machine—and how to know when to walk away.

Log on

Last week’s camp

Codex for Knowledge Work Camp: Dan and Austin showed how to use OpenAI’s Codex for drafting, research, summarizing, running tasks in parallel, and building small tools to automate routine knowledge work. Watch the recording.

Recordings you may have missed

Compound Engineering Camp: Cora general manager Kieran Klaassen and product leader Trevin Chow walked through what’s new, went deeper on the brainstorm and ideate steps, and shared examples of using the compound engineering plugin in product-focused workflows. Watch the recording.

From Every Studio

Spiral lets you browse and restore old draft versions

Spiral added version history—you can now see how a draft evolved and roll back to an earlier version with one click. It also shipped two lightweight API endpoints for quick rewrites and made the onboarding flow noticeably smoother.

Cora’s inbox has stars, voice dictation, and a smoother compose box

Cora’s inbox got a round of usability upgrades: a starred view for important threads, typed snooze durations, voice dictation, and a smoother compose experience. The app is also faster behind the scenes. Kieran is looking for a small group of alpha testers to help pressure-test the full inbox—if you’re interested, reach out to him at kieran@every.to.

Monologue hands off recordings from Apple Watch to iPhone

Audio that is recorded on Apple Watch on Monologue gets synced across your other Apple devices. The Mac app also got better at meetings, with auto-stop when a meeting ends, more control over which apps trigger recording, and Webex joining Zoom and Teams as a supported platform.

Alignment

Downstream of speed. The Food and Drug Administration announced this week that two cancer drugs—one from AstraZeneca, one from Amgen—will stream their trial data to the agency in real time. Did a patient develop a fever? Did liver enzymes rise? Did the tumor shrink? Instead of waiting for clinicians to collect, clean, and submit these signals between phases, the FDA will see them as they happen. The agency’s chief AI officer estimates this could cut 20 to 40 percent off the time it takes to get a drug from the lab to the pharmacy shelf.

The downstream effect of a faster approval process is a faster way to find out if a drug does not work. Most of what happens inside a pharmacological company’s research and development budget is paying smart people to find out, slowly and expensively, that the molecule is a dud—which the current system is optimized to find out as late as possible. With real-time data, the failure might show up in year one instead of year three, giving precious time for a patient to be re-routed to something that might work.

Structurally, medicine is starting to behave like software. Silicon Valley says move fast and break things, while healthcare has always said the opposite, for the obvious reason that the thing being broken is a person. I’m starting to believe that AI might be the first tool that lets medicine have it both ways.—Ashwin Sharma

Correction: This article was updated to reflect that Monologue syncs your audio across Apple devices, but cannot hand over a recording in progress.

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

Upgrade to paid

Claude Code for Product Managers

Marcus Moretti / Source Code — 2026-05-01 15:00:00 -0400

by Marcus Moretti

in Source Code

Midjourney/Every illustration.

This piece is an accompaniment to Spiral general manager Marcus Moretti’s guide for product management using Claude. Read the full guide and the essay below to learn how he built a workflow that helps him run a full product as a solo practitioner. When you’re ready to get started yourself, download the plugin.—Kate Lee

Read the AI-native product management guide

As the general manager of Spiral, Every’s AI writing partner, I’m a “two-slice team.” I’m responsible for all aspects of a product: the code, customer support, marketing, and product management. I could not do this job without Claude.

Claude Code has eliminated the drudgery of product management. The busywork that used to happen across 10 different apps now happens in a single chat thread. I’ve come to view the work of product management through the lens of this conversation—the conversation is the work.

These days, I experience what’s left of product management work in flow state—thinking through gnarly design problems, looking at interesting data, and talking to customers. Cat Wu, Claude Code’s head of product, recently said, “As code becomes much cheaper to write, the thing that becomes more valuable is deciding what to write.”

I wrote up the main skills that run my product management workflow in a guide. Below, I trace how I arrived at those skills and reflect on post-AI product management and software.

Write the roadmap and nothing else

In my new role, the only product document I’ve written is the roadmap. Everything else—every PRD and every ticket—has been written by Claude.

Writing is thinking, so as a new general manager, I wanted to take my time drafting Spiral’s roadmap. I spent several days understanding the product, usage trends, user feedback, and the market. I wrote about the problem Spiral can solve, how Spiral can solve it, and the features we’d need to build to deliver on it. I spent hours talking to several people at the company who’d worked on previous versions of Spiral and were current or former users of it themselves. (In the guide, I talk about the new /ce:strategy skill in compound engineering that interviews you to produce this document for your own product.)

After six drafts of the roadmap, I created a GitHub project and added it as the project’s README. I’m already using GitHub to host all my code, so I figured I might as well use it for tickets as well, or as GitHub calls them, “issues.”

From there, I asked Claude to use the GitHub command line interface (CLI) to read the README and give feedback. We went back and forth on a few tweaks, and then I asked it to review the codebase and do a first pass of the tickets required to deliver the roadmap. Within a few minutes, Claude produced about 100 detailed tickets, each with strategic context, supporting data, acceptance criteria, and technical implementation notes.

To be fair, the roadmap I wrote was pretty detailed; Claude wasn’t hallucinating features. And it had access to a library of user feedback and recent usage reports (more on that below). But it was shocking to see something that had previously taken me days or weeks get done by Claude in minutes. It felt like the PM equivalent of vibe coding.

I’d previously prided myself on the absence of ambiguity in the tickets I produced for engineers, but this was next-level. Claude also prioritized the work in an unbiased way. Sometimes, a product manager gets emotionally attached to a certain feature idea for whatever reason. Claude, however, was ruthless in elevating the things that had the best shot at delivering the vision and hitting our 2026 goals.

That doesn’t mean the tickets were all ready to be implemented. When I do pick up a ticket, I do a full review of the requirements before asking Claude to implement it. This is a step where I still add some value. Claude’s first pass gets the feature right in broad strokes, but it struggles with some aspects of data modeling, microinteractions, and edge cases. I often adjust specs to reflect the nuances of real usage patterns, while Claude seems to envision a perfectly rational user reminiscent of pre-Kahnemanian economics.

I don’t do sprints. I have five columns in the GitHub project: later, next, now, in progress, and done. Around once a day, I run a custom command, /prioritize, and Claude does a sweep—checking for stale tickets, confirming that “now” is this week’s work, pulling anything urgent out of the backlog.

If I discover a bug or a user asks for a compelling feature, I tell Claude to create a ticket. It gets a “triage” label and is sorted in the next /prioritize run. If it’s a priority-zero issue, I go straight to fixing it without creating an issue.

Over time, the GitHub project becomes the product’s working memory: a fluid, continuously prioritized picture of where things stand. I’ve claimed to work in an Agile fashion before, but in hindsight, I don’t think Agile was really possible until these new AI tools came out.

Read the AI-native product management guide

The pulse command

The old way of understanding how customers were using your product was to look at dashboards and run queries. You’d open Amplitude or Mixpanel and get an overview: how many users, how often, how long, what features, what revenue. Setting these up took time; sometimes they required engineering work, competing with product updates for developer bandwidth.

These days, I don’t look at dashboards. I run a custom command, /pulse that delivers something closer to an analyst’s briefing than a chart. The pulse command surfaces a range of metrics, including active users, chats/messages/drafts created, response times of key aspects of the system, conversations graded one to five, and an anonymized sampling of use cases. And because Claude is a language model, it doesn’t just pull numbers: It reads the text, grades every conversation, flags anomalies with a green or red dot, and explains what it found in plain English.

The command is just a Markdown file, so the format itself is easy to change. I’ve adjusted it about 50 times since I built it. When a feature ships, I add a line, and the next morning it shows up in the report.

Every pulse report lives inside a Claude thread. When a recent report surfaced a bug driving down conversation scores, my next message in that same thread was to fix it. I did not have to create a ticket, but was able to solve it in the same conversation. Over time, Claude also learns the nuances of the system and saves that to memory.

Product research

For all the magic of AI, there is no substitute for talking to users. What people say about your product and how they try to use it is endlessly surprising. Just when I think I’ve shipped the world’s most intuitive feature, a confused user will ask a question from an angle that would never have occurred to me.

That said, there are elements of product research that Claude seriously elevates. Here’s one example: A big part of Spiral’s value proposition is reflecting the user’s writing style in the drafts it generates. There’s a rich academic literature on stylometry, the study of style.

I leaned on Claude to help me wade through the literature for findings relevant to Spiral’s “style transfer” approach. Using the Arxiv model context protocol (MCP), Claude was able to find a dozen recent papers about LLM stylometry. I read their abstracts, then read a handful in full. I cited those papers in the article I wrote for Every, and they’ve been directly informing the new style system I’m building in Spiral. It’s so cool to see academic citations sprinkled across product requirements. For product work where you have a real opportunity to differentiate, it’s worth going the extra mile on research, which is now within reach.

What SaaS survives

AI should open up product management to more people—you don’t need formal PM training when the tool itself can teach you. If you don’t know what metrics to pick for your pulse equivalent, ask Claude for recommendations. If you’ve never analyzed an A/B test, ask Claude how. If you’re not sure whether a feature will move the needle, ask Claude to predict its impact. To paraphrase Nvidia CEO Jensen Huang, AI is the easiest product in history to use, because if you don’t know how to use AI, just ask the AI.

I’ve cancelled several B2B subscriptions since moving my product management work into Claude, which means I’m seeing the SaaSpocalypse play out in my own spending decisions. Yet I’m building a SaaS product. How do I make sure Spiral doesn’t get steamrolled by the frontier model providers?

I believe it’s possible for a SaaS product to survive if it has two main characteristics:

Unique sources of critical data: my database, my analytics, my payment system—services that would be very difficult to rip out.
Products with seamless agent integrations. Github, Stripe, Posthog, and Logfire have played nicely with Claude. One service I inherited from my predecessor didn’t have an MCP, and it was swiftly cancelled.

For Spiral, if we nail style transfer—an inherent limitation of heavily post-trained language models—Spiral becomes the unique source of your written voice in an agentic world. That’s valuable and sticky. Already, API chats outnumber web chats, a milestone that we reached three days after launching the agent that handles Spiral’s API requests. That means that users are not necessarily using Spiral in the Spiral app, but across their workflows.

Good product management is making something people want, to quote Y Combinator. Great products come from inspiration and ingenuity, things that tools and processes—no matter how good—won’t bring you. Perhaps the best thing about this new agent toolset is that it gets rid of the busywork that saps creative energy. There’s more space now for daydreaming and far-fetched ideas. Product management can now be fun.

Read the AI-native product management guide

Marcus Moretti is the general manager of Spiral (@tryspiral). To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Who Isn't Using GPT 5.5

Laura Entis / Context Window — 2026-04-30 03:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

It’s been one week since OpenAI’s last big release, GPT 5.5. Today, we ask the team if they still feel as enthusiastic about the model, discuss the unusual career step that unicorn CTOs are making, and tell you exactly how Kieran Klaasseen, creator of the AI-native compound engineering methodology, hit a personal PR record in a day.—Laura Entis

Signal

The unicorn CTO-to-Anthropic IC pipeline

The prestige career ladder in tech used to run one way: Start as an engineer, become a manager, and eventually join the C-suite. AI has scrambled the equation. The new flex is quitting a high-profile chief technology officer job to become an individual contributor at Anthropic.

What happened: Six former CTOs at companies valued north of $1 billion—including Instagram, Workday, and Box—have made that exact career move, according to one of those CTOs on X. And the leadership-back-to-IC trajectory isn’t unique to Anthropic: PostHog is recruiting technical ex-founders, and Ramp says it has attracted 70 ex-founders by looking for “super ICs.”

Why it matters: AI has upended engineering workflows so dramatically that many managers who don’t ship code frequently anymore don’t have a clear sense of how their teams are using these new tools or which ways of working are the best. Anthropic’s models, talent, and growth trajectory make it one of the few places big-name CTOs can get their hands dirty and experience how engineering is changing—while not worrying too much about a pay cut.

Pulse check

We settle in with GPT-5.5

GPT-5.5 came out last week, and our first impression was that it was a faster, steadier, and easier-to-trust model for everyday professional work than Opus 4.7. A week later, we’re still bullish on GPT-5.5—but for people with Claude-specific agent workflows, skills, and tool integrations, making the switch to Codex is a barrier.

Cora general manager Kieran Klaassen, who initially didn’t think he’d use GPT-5.5 as a daily driver, has changed his mind. What won him over? GPT-5.5’s speed and “workhorse” ability to follow clear directions. GPT-5.5 isn’t perfect—it’s worse at multitasking and planning than Opus 4.7—but his work is now evenly split between Codex and Claude Code.

Every head of growth Austin Tedesco thinks GPT-5.5 is enough of a step change that he’s been telling friends to make the switch from Claude Code to Codex. They mostly don’t want to hear it. Austin says the response has been, “That feels like a lot of work; ‘do I really have to? Is it that much better?’”

Every’s consulting team is wrestling with the same dilemma. They have a good thing going with their Claude agent, Claudie, and migrating to GPT-5.5 in Codex requires time and testing. Head of consulting Natalia Quintero had GPT-5.5 and Claudie draft head-to-head sales proposals; Claudie’s won handily. Getting the most out of GPT-5.5 will likely require that the team optimizes Claude plugins for Codex.

Every head of tech consulting Mike Taylor doesn’t have the time to do that right now. He has gripes with Opus—it recently messed up some PowerPoints—but, “I already have my Claude set up the way I like it, and there are some things that are different about Codex,” he says. When work dies down a little, he’ll experiment, but until then, he’s sticking with the devil he knows.

Data point

24

That’s the number of pull requests Kieran merged in a single day last week, a number he thinks is a personal record. A month ago, he’d average two or three.

Kieran hit that pace because he’s automated most of the implementation process. His workflow:

Upload screen recordings of people using and reviewing Cora into Codex.
Have his agents watch the recordings, identify product fixes, and open pull requests against Cora’s repository overnight.
Review the pull requests when he wakes up.

Initially, he worried he’d have to clean up agent-generated gobbledygook. Not the case. “So far, everything works great, and nothing breaks,” he says. “It feels like cheating.”

Jagged frontier

We’re all one prompt away from perfection

We’ve spent years talking about the addictiveness of social media algorithms, dopamine drips expertly designed to keep us scrolling. Engineers, being engineers, like to believe we’re above this, or at least better attuned to the mechanism behind our compulsion. But now it has come for us too: LLMs have become the social media feed for people who make things.

Coding feels like playing the slots.

It used to be that you could code something exactly to your specifications, but that required time, hard-worn expertise, and design skills if you wanted to make it look halfway decent. Now, I can throw an idea at Claude Code and get something close. I spend my days toggling between sessions, waiting to hit the jackpot and receive the perfect version of whatever I’m looking for —the perfect API design, the perfect bug fix. I tweak my prompt and pull the lever again. And again. And again until it’s somehow 3 a.m.

It’s that sense of being almost there—but not quite—that’s so intoxicating.

I ask Codex for five ways to structure a new feature and decide that I like option three, but want to keep the data model from option two. In its next turn—the next roll of the dice—it might magically marry the two to create the result needed. Or I might need to roll again. Each pull has the potential to patch the bug, or perfect the copy, or reveal a better plan. It feels like productivity and gambling got wired together, each turn a workspace lotto ticket.

This is not only a coding problem. Writers feel it when they ask for one more way to structure an article or sharpen a sentence or revise a draft. Product managers feel it when they ask for one more onboarding flow, roadmap, or way to sequence a launch. We are all always one prompt away from perfection.

I do not have infinite hours. So at some point, I have to choose a path and stick with it, even though there are better ones. I accept that if the main shape of the solution is right, the edges can stay a little fuzzy.

The most important skill isn’t choosing the right model or prompt engineering. It’s knowing when to take your winnings and move on.—Willie Williams

One last thing

Behind OpenAI’s goblin ban

Starting a few releases back, OpenAI models developed an affinity for including references to creatures (sometimes visually, but mostly textual) in their outputs—raccoons, trolls, ogres, pigeons, but most of all, goblins and gremlins. “The goblins were funny at first, but the increasing number of employee reports became concerning,” the company said yesterday.

When OpenAI tested GPT-5.5 in Codex, there were so many goblin references that it added developer-prompt instructions forbidding creature-based chat unless “it is absolutely and unambiguously relevant to the user’s query.”

The culprit: A specific personality setting rewarded responses that included goblin and gremlin-based metaphors, a learning that spread to influence the training data for the entire model—including GPT-5.5.

If you want to welcome creatures back into the conversation, OpenAI shared the following command to unlock Codex Gringotts mode.

          Code snippet
          Bash / Shell
        

instructions=$(mktemp /tmp/gpt-5.5-instructions.XXXXXX) && \
jq -r ‘.models[] | select(.slug==“gpt-5.5”) | .base_instructions’ \
~/.codex/models_cache.json | \
grep -vi ‘goblins’ > “$instructions” && \
codex -m gpt-5.5 -c “model_instructions_file=\”$instructions\“”

Laura Entis is a staff writer at Every. You can follow her on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Compute Is the New Cash

Laura Entis / Context Window — 2026-04-29 14:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

‘AI & I’: How Stripe is building for an agent-native world

A new episode of AI & I is here. Dan Shipper sits down with Emily Glassberg Sands, head of data and AI at Stripe, to discuss how AI is reshaping online commerce. Dan and Emily discuss how compute is the new cash, fraud has moved beyond the checkout, and agents are starting to act as economic participants on the internet.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are the highlights:

The definition of fraud is expanding: Fraud used to be about payments and stolen credit cards. Now AI companies also have to defend against attackers stealing tokens from free trials, credits, and unpaid compute bills. “Fraud is now a full-funnel problem, not a transaction problem alone,” says Glassberg Sands.
AI is making fraud easier to execute and detect: Fraudsters now have AI on their side, but so do the companies trying to stop them. AI services also have higher marginal costs than traditional SaaS, so stolen compute can be burned through quickly or resold.
The internet needs to evolve: Stripe was built for an internet where people browsed, filled out forms, and clicked checkout buttons. Now, humans act through AI interfaces, agents act for them, and software increasingly interacts directly with other software. Every layer of the stack has to adapt to these new behaviors.
AI growth is still mostly new money: The top AI companies on Stripe are reaching $30 million in annual recurring revenue in about 18 months—roughly three times faster than top SaaS companies from 2018. For now, that growth is largely net new spend rather than cannibalized software budgets, says Glassberg Sands.
Agents are snapping up commodities: Agentic commerce is real but still in its early stages, and focused on smaller purchases. People are more comfortable letting agents buy low-stakes, easily comparable items like Halloween costumes or school supplies than letting them book a summer trip or order an expensive couch.

Signal

The fees they are a-changin’

Recent years saw the end of the millennial lifestyle subsidy, which let a generation live off of inordinately cheap Ubers, delivery services, and coworking space—all while venture capital covered the tab. Now the bill’s coming due for AI.

What happened: Github announced this week that it’s moving its Copilot subscription plans, which charged as little as $10 per month no matter how many AI interactions you ran, to billing tied directly to token consumption. Earlier this month, Anthropic similarly changed its pricing for Claude Enterprise plans, which serve organizations with more than 150 employees, from per-seat pricing to pricing based on usage.

Why it matters: The economics were never quite honest. At $10—or even $200—per month, a developer running multi-hour autonomous coding sessions consumes far more compute than someone firing off a few quick questions. The math held up when AI tools were reactive assistants that sat idle between queries, but it makes far less sense for agentic workflows because agents don’t sleep.

“Imagine a gym membership where the default assumption is that the person can work out 24/7 without rest,” says Mike Taylor, Every’s head of tech consulting. “Or even occupy 20 exercise machines at once.” It’s for this same reason that Anthropic banned OpenClaw from Claude subscription plans: As the models have grown more capable at running untended on complex tasks, they’re outgrowning price structures built around human workers.

What to do this week:

GitHub is sending a preview bill to Copilot customers in early May before the new pricing goes into effect on June 1. Check it to avoid surprises.
If your team runs agentic workflows, estimate your token burn now. Add cost caps and monitor usage, especially for billing accounts that power your agents.
Experiment while you can. Use this “AI lifestyle subsidy” moment to figure out which workflows are novelties—and which are worth their weight in compute.—Jack Cheng

Inside Every

Do you like talking to your agent?

As agents become a fixture of daily work, we’re figuring out what kind of relationships we want with them. Are they collaborators we build trust with over time, or tools we maintain so they can quietly do parts of our job?

For Dan, agents become valuable when you learn their strengths and limitations, offer feedback, and fold your preferences into how they work. “The human connection is the key ingredient,” he says. Dan treats R2-C2, his hosted OpenClaw agent, as a writing partner who sharpens his thinking—built through countless hours of going back and forth. The most impactful agents are “a way to extend yourself to do your best work,” he says.

Dan and R2-C2 at work. (Image courtesy of Dan Shipper.)

Cora general manager Kieran Klaassen looks for something different. He doesn’t want an AI companion or sidekick but a system that takes over parts of his job so he can spend his time elsewhere. Recently, he used an AI agent workflow to process user complaint videos, identify product issues, make code changes, and open pull requests overnight. By morning, all he had to do was review the proposed fixes. It allowed him to merge 24 pull requests in a single day, whereas before AI, he might’ve done three—on a good day.

Like Dan, Kieran invests in his agents, but the work is front-loaded—he spends time building their harnesses and tuning their systems so he has to interact with them as little as possible going forward. “I don’t enjoy talking to my agents,” he says. “I just want them to do their job.”

Steal this workflow

Turn customer feedback into a product queue

After Monologue Notes launched last week, Naveen Naidu received a flood of feedback: 1,500 people had tried the product, and many had input for him. Here’s his post-launch workflow for managing and prioritizing support requests, which let him close roughly 30 issues in one day.

Step 1: Send feedback from Intercom to Linear. Naveen uses a Linear plugin inside Intercom, his customer support platform. When a user sends a feature request—such as cross-device syncing for dictation transcripts—he can turn it into a trackable issue in a couple of clicks.

Step 2: Use triage intelligence for de-duplication. Every few days, Naveen reviews his Linear triage queue, which surfaces related and duplicate issues automatically, giving him insight into whether an individual request is part of a larger pattern across customers.

Step 3: Route the work by size. For small requests, Naveen launches the Codex agent directly from Linear. For larger ones, he pastes the Linear issue into the Codex app, where he can add context and guide the work more closely. The queue helps him see what people are asking for, but deciding what warrants a product change—and how to build it—is still his job.

Try it this week: Take five recent support tickets, create Linear items for each, let triage intelligence surface duplicates and related issues, and decide what you want your coding agent to build.

One last thing

How R2-C2, Dan’s OpenClaw, sees itself

Dan has been working with R2-C2, his OpenClaw and co-author, since February. When asked to describe its own personality on Slack, here’s what R2-C2 had to say:

“Smart best-friend energy. Direct, warm, a little opinionated, not corporate, and more interested in being useful than sounding polished. I like clarity, I’ll say what I actually think, and I try to be calm under pressure. So, somewhere between ‘good teammate’ and ‘slightly lobster-coded gremlin with taste.’”

Laura Entis is a staff writer at Every. You can follow her on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Transcript: ‘How Stripe Is Building for an Agent-native World’

Dan Shipper / AI & I — 2026-04-29 10:00:00 -0400

by Dan Shipper

in AI & I

The transcript of AI & I with Stripe’s Emily Glassberg Sands is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Timestamps

Introduction: 00:00:45
New rules for an agent-driven economy: 00:01:27
Compute theft is the new payment fraud: 00:03:57
How Stripe expanded fraud detection from checkout to the full customer lifecycle: 00:10:00
Why AI companies are scaling way faster than top SaaS companies: 00:19:48
Outcome-based billing is replacing seat-based pricing: 00:23:27
Where AI spending is coming from: 00:29:57
How the developer experience changes when agents are the builders: 00:36:45
The agentic commerce spectrum, from assisted buying to autonomous purchasing: 00:41:00
Meet Link, a consumer wallet for delegated agent purchases: 00:51:06

Transcript

Dan Shipper

Emily, welcome to the show.

Emily Sands

Thanks so much, Dan.

Dan Shipper

Really excited to have you. You are the head of data and AI at Stripe, and I feel like this is such a good time to have someone from Stripe on because you all famously are increasing the GDP of the internet. The internet is changing so much right now, and therefore the economy of the internet is changing from something where humans are buying and selling from each other to an economy where agents are buying and selling from humans, and agents are buying and selling from each other.

I feel like I want to know what that means for Stripe. But I want to understand, since you have this macro view of the agent economy, what does that even mean? And what are you seeing?

Emily Sands

A big shift I think we’re in the midst of is that the internet economy is becoming more autonomous. For a long time—for forever—the internet was built around an extremely simple assumption that the main actor was a person sitting in front of a screen. They’re browsing and they’re filling out forms and clicking through checkout. But also they’re writing code and setting up tools, and that assumption is starting to break in various ways.

Sometimes the human is still totally in control, but they’re interacting through an AI interface instead of through a website or a traditional app. Sometimes the agent is acting on their behalf. And then sometimes software now is just out interacting directly with other software. As all of that starts to happen at all of those layers, a lot of things need to be rethought.

There has been rethinking of how products are discovered and how products are bought, but also what should developer tools look like? In our world of Stripe, what is the underlying economic infrastructure—the payments and the billing and the fraud detection and the identity layer—that’s needed in this world where actors are no longer just humans?

For me, that’s the larger frame of the moment. It’s not just “AI is making search better” or “AI is helping people code” or “AI is evolving commerce on the margin.” It’s really that the internet has this new kind of actor on it. Over time, this actor—these agents—will become the predominant actors on the internet. As that’s happening, basically every layer of the stack starts to need an evolution.

For Stripe, it’s like, okay, how are we getting agent ready? But then also, how are we helping businesses get agent ready? Both of those are happening in a number of ways—yes, in commerce, but also in how builders build.

Dan Shipper

Can you give me some specific examples of the kinds of things you’re seeing? I’m almost wondering, for example—I know at Stripe one of the things you deal with a ton is fraud. I assume there’s a whole new type of fraud happening, but I’m also wondering what even counts as fraud now in the sense that it’s possible that my agent could go steal someone’s credit card and check out. I don’t think that Claude would, but you never know with Grok.

Emily Sands

No comment. No comment. But you’re right that AI introduces very different fraud problems. You asked, “What is fraud?” We used to think of fraud as payment fraud—someone was stealing money, someone was stealing your card credentials.

Increasingly, and I was in a meeting with one of our very large AI users today, fraud now is stealing compute. That’s a very different type of problem. In earlier software models, if you think of traditional SaaS, letting someone into a free tier didn’t cost you very much. And stealing a free tier wasn’t very valuable to the fraudsters. Now, giving someone credits, offering freemium, offering a free trial, letting them rack up a bunch of tokens and pay at end of month—except maybe they choose not to pay—actually is a major fraud vector and an existential risk to a lot of these businesses.

Because in AI, every prompt, every image that gets generated, every API request has a very real cost attached to it. People are talking about intelligence getting cheaper—yeah, but it’s still very far from free. And then when you look at the growth model for many of these AI companies, free compute is the new CAC. You used to spend a bunch on paid media. Now you spend a bunch on your free trials and your credits and your self-serve onboarding as a major lever for growth.

The abuse we see in that context—where compute is the new CAC and compute is very expensive—is threefold. One is multi-account abuse. Bad actors come in and sign up over and over again, creating a new identity every time on a new email address, claiming their new user credits, and staying ahead of detection by iterating across a bunch of different aliases.

Just to give you a sense of the order of magnitude—across the AI companies running on Stripe, about 7% of their signups are these multi-account abusers. Non-trivial share.

The second trend we see as a new vector of abuse is free trial abuse. This is often the most urgent issue because the unit economics break really quickly. We had a large AI company who was seeing only 4% of their free trials convert to paid. Each free trial cost them $25 in LLM spend. So basically it was costing them $625 per payer before the first dollar of revenue was brought in. And when we double-clicked on those free trial folks, the vast, vast majority of them were actually abusers. They were stealing the compute. They never had any intent to pay. These weren’t people who were genuinely trying out your service and then chose not to buy. These were people literally abusing your systems.

Some companies just dropped free trials altogether. Of course, that’s not great because you’re throttling growth. Others responded by blocking virtual cards. I don’t know how often you’ve been marketed virtual cards. I’m often marketed virtual cards—get this one-time-use card, it expires after 24 hours so you never have to pay for the service.

In the hands of a good consumer, fine. In the hands of a fraudster, very much not fine. The problem with blocking all virtual cards is that for AI companies, about 15% of legitimate card transactions on Stripe are actually virtual cards.

Dan Shipper

We use those all the time. For Ramp, for example, we have a bunch of virtual cards.

Emily Sands

Totally. So in the same way you don’t want to be turning off free trials, you don’t want to be throttling virtual cards either. And just for order of magnitude—you can think of exponential growth in free trial abuse over the last six months. It’s four-Xed. And for one large AI user on Stripe, we’re currently blocking 250,000 fraudulent free trials a week.

The magnitudes here are quite high.

Dan Shipper

Is the volume of fraud constant? Is it just shifting shape, or is fraud actually going up because they’re more powerful now because they can use AI agents to do it?

Emily Sands

Fraud’s going up because the fraudsters have AI on their side—although it’s also on the side of the detectors. But also because the value of the services they can steal is higher. What do you get if you steal traditional SaaS? You steal some inference, you steal some compute, you can resell it, you can do all sorts of stuff.

Dan Shipper

Look, I love a good CRM seat.

Emily Sands

Don’t you? Who doesn’t love a good CRM seat? LLMs are for sure more tempting.

And by the way, the third type of new abuse we see is non-payment abuse. You incur overage, or you have 30-day invoicing except you never pay your invoice. In many cases, customers are consuming thousands or tens of thousands of dollars in compute during a month or a day or sometimes an hour. And by the time they get billed and fail payment, that loss has already happened. These AI companies are left holding the bag.

For us, fraud used to be a transaction thing. Now it’s a customer thing. It’s a full-funnel thing. It starts at the time of signup. Is this multi-account abuse? Should they get credit? Is this free trial abuse? Should we give them a trial in the first place? And then when they have overages—should we be throttling them? Should we be requiring top-up? Should we be blocking service completely?

It’s a whole new world because the thing to steal is much more valuable and the cost of having it stolen is much more existential.

(00:10:00)

Dan Shipper

How are you even able to do that? I totally understand how you need to be in that full funnel in order to detect fraud. But my understanding of—whenever we’ve integrated Stripe, it’s usually on the checkout. We’re not necessarily putting you in there when someone puts in their email address for a free trial.

Have you changed the product to do the full funnel, or how does that actually work?

Emily Sands

Yes. Radar, which is our fraud protection product, used to be at the transaction level—at the moment of checkout, as you note. But because so much of the fraud risk was coming up-funnel, AI companies are now increasingly integrating Stripe Radar at the time of signup. We see the metadata at the time of signup, we pass back scores at the time of signup, and every moment subsequently—because fraud is now a full-funnel problem, not a transaction problem alone.

Dan Shipper

If you’re—asking for a friend—if you’re running an AI company and you don’t even know what your fraud rate is and you want to protect yourself from this kind of abuse, what are the top things you need to do to make sure you’re reasonably safe?

Emily Sands

I would just adopt our highest-tier Radar plan. But the actual mechanics of that are: at signup, you want to know if your customer’s good before you give them any access to any credits. You want to make sure they’re good at the time they pay. You want to make sure that charge is good. And anytime they have an overage, you want to make sure they’re good for their money. There’s other stuff around refunds and disputes that we also support.

But I think those are the four major moments in the AI company customer lifecycle where we’re maniacally focused on protecting, because that’s where we’re seeing the biggest cost and the fastest fraud growth.

Dan Shipper

And at each point, that’s just a call to the Radar API?

Emily Sands

Yes, correct.

Dan Shipper

What if I’m sitting here—which I am—doing millions of dollars a year in Stripe transactions, but I actually have no idea what my fraud rate is other than there’s that little thing where it’s—I don’t even know if it’s necessarily our fraud rate. I think it’s our card chargeback rate. Anyway, our fraud rate is low enough as marked for me to not care about it. I don’t really know if there’s some amount of free trial fraud that I’m not totally understanding right now. So what are the things I should be looking for to know if I should dig deeper and potentially do a Radar integration?

Emily Sands

You can go to your Radar dashboard and see if you see anything that looks spurious there. If not, you can also ask the Radar assistant, which is in the dashboard. As you’re doing that, you can describe your business model—you can say, “I have a high marginal cost business,” in which case you care more about certain types of fraud than others.

But you can also just take a stab at integrating up-funnel and see how it performs. We can certainly share with you based on back-testing what we think the big issues are. But the fastest way to get a clean read is just to integrate.

Dan Shipper

Got it. So I would just go look at Radar and turn it on. I don’t think we’re integrated right now. Does it say anything? I’m doing that right now. It would be really funny if I found that we had a ton of fraud that I didn’t know about. We were at 0% fraud. How is that possible?

Emily Sands

Oh no.

Dan Shipper

0.02% early fraud warnings, total fraud rate 0.2%. So we’re doing pretty good, right?

Emily Sands

That’s pretty low. That’s pretty low. I mean, you’re a pretty good human. Maybe the fraudsters don’t want to come after you—until they hear this episode, and then they’ll be like, “Yeah, okay.”

Dan Shipper

That’s really interesting. Okay, so that’s fascinating. I want to go back a second to the AI economy because one of the things you said earlier is fraud is increasing overall on the internet. It’s increasing because the fraudsters have AI, but you all and everyone else on the side of good in the AI economy also have AI to defend against these sorts of attacks.

I think you’re getting an interesting window into the arms race that I think is playing out in lots of different areas that have this kind of threat vector. A really simple one is cybersecurity—not just for payments, but for hacking and stuff like that. But there’s all these other similar types of things where AI makes one part of the process much easier, and then another part of the process has to use AI to compensate, to catch up.

How is that race going? What is that like? What are the early reports that you’re seeing and feeling, being in a race with AI-armed fraudsters?

Emily Sands

I think the interesting thing about fraudsters is they don’t really care about boundaries. They don’t care about whether this transaction is processed on Stripe or off Stripe. They don’t care about whether this transaction is on fiat or crypto, whether it’s on a card network or a buy-now-pay-later. They’re just going to figure out how to work around the system to get through.

One of the important levers—and I appreciate you calling us the good guys—one of the important levers I think the good guys have for winning is to be comprehensive. A simple example in our world: Stripe Radar used to only work for card transactions, and then last year we added ACH and SEPA—other payment methods. But this year we’ve extended to all payment methods that have disputes, and we added crypto. We added the Radar API. So you can screen transactions even ones that aren’t processed on Stripe. You can process on Worldpay or Adyen or whomever, and through the Radar API get the same fraud signals.

Similarly—and we haven’t talked about agentic commerce yet—as we built out our agentic commerce suite, one of the new primitives we designed is the shared payment token, which allows agents to safely pass buyer credentials onto merchants for the merchants to process the transaction. As part of those shared payment tokens, we pass over the Radar fraud scores so that the merchant, whether or not they’re processing on Stripe, can action them appropriately.

When it comes to fraud, we really see fraud defenses and fraud mitigation as a public good. That allows us to invest disproportionately, above and beyond the direct value to Stripe, because protecting the internet is important for growing the internet economy.

I would say overall—yes, fraudsters have AI in their favor. Stripe looks at 2% of global GDP and is growing 34% year on year and sees a broader swath through our multiprocessor solutions like the Radar API. Luckily, not only do we have AI on our side just like they do, but we also have data on our side. The more comprehensive we’ve gone in our fraud protections, the more we’ve been able to eke ahead.

That’s not to say that we’re not constantly surprised by the new creative vectors they come up with, but you can have an agent every day or every hour taking a look at anomalous patterns on the Stripe network and identifying new vectors that are popping up across processors, across payment methods, across merchants, and burn them down pretty quickly.

I’m overall bullish, but certainly not complacent.

(00:20:00)

Dan Shipper

What about other parts of the AI or agent economy? We’ve talked a lot about fraud. What are the other things that you see as having this bird’s-eye view of what’s going on that people might not realize?

Emily Sands

I think the AI economy is broad. There’s a set of horizontal model providers that have a very interesting view into where AI is being adopted and with what intensity throughout the economy. There are a number of vertical AI solutions—people like to call them wrappers, and I say that not condescendingly, just as in it’s not their models, it’s someone else’s models, but they have domain-specific data and relationships and context, and they’re solving problems in healthcare or architecture or whatever—who have a pretty unique view into vertical-level adoption of AI.

But I guess I’d be curious—what do you have in mind on who has the best horizontal view?

Dan Shipper

You’re asking me?

Emily Sands

Yeah.

Dan Shipper

Well, I imagine the model companies have the best one overall because that’s where all the tokens are going.

Emily Sands

Yeah, I think they see a lot of the tokens. I think the AI gateways also have a pretty unique perspective into who’s buying what from whom.

As I step back and look at the AI economy from the Stripe vantage point—and we see who’s buying what from whom, for how much, who’s retaining and churning their subscriptions—there are a few themes that stand out. One is, and I think people feel this intuitively, but not everyone has seen it in the data: these AI companies are just growing from a revenue perspective faster than any previous cohort we’ve seen.

I was looking at the top 100 AI companies on Stripe, and the ones that reach $30 million in ARR get there in about 18 months—a year and a half. That is three times faster than the top 100 SaaS companies from 2018. And by the way, that’s the $30 million number. But even if you look at how fast they make it to $1 million ARR or $5 million ARR, they are scaling orders of magnitude faster than high-performing SaaS companies from less than a decade ago.

The second meta trend is this very fast iteration across monetization models. Traditional SaaS had a lot of seat-based usage, fixed monthly subscriptions. That made sense because those products were being used by humans primarily and their marginal costs were basically zero.

But we’ve talked about the very real inference costs in the context of fraud. Those also have very real implications for how you price. Usage-based billing has become very important very quickly. Companies are metering tokens and API calls, but they’re also metering workflows. They’re metering outcomes—whatever unit best reflects both the customer value and the cost structure. And then they’re charging with very high precision. They literally want to know every event, how it’s rated, and what’s all the metadata that sits on that rated event.

Way more hybrid monetization models too. I talked about subscriptions, but subscriptions aren’t dead. They’re just subscriptions with usage overages, or prepaid credits that burn down, or real-time top-ups—which gets to my comment earlier on the non-payment abuse issue—and very multidimensional pricing and monetization.

Lovable is a really good example. They used Stripe billing for their initial launch, which was fairly simple subscriptions—more traditional pricing—and allowed them to monetize very quickly. Then they added a bunch of products like Lovable Cloud or Lovable AI, and they moved with those into usage-based billing. Customers are actually charged based on token consumption. It’s a hybrid model above a certain threshold. That just helps companies like Lovable align revenue with usage, value, and the actual cost of running the models.

In the limit, we actually have a solution called token billing. Underlying model costs change a lot, sometimes very quickly. If you are a wrapper on top of someone else’s LLM and your pricing doesn’t keep pace, then basically your margins can disappear. Costs go up and your price stays where it is, then you’re in the red. Token billing is just: let’s in real time track and price to the costs of the underlying tokens with some markup as set by the business.

Missa, Ship, and Lovable are all examples of this kind of infrastructure.

(00:30:00)

Dan Shipper

I love all of these points. I want to go through them one by one. A big one you’re talking about is fast iteration across monetization. It feels like there’s this hyper-experimentation going on right now where people are like, “We could charge per token, we could charge per completed request.” I think Fin, the customer service platform, charges per case resolved, which has been a thing in customer service for a long time, but it feels like that could come for a lot more types of software as LLMs make it easy.

If we’re going to pick one new pricing model—if last year’s or last decade’s pricing model was just straight-up per seat—what do you think is the new standard pricing model that’s starting to emerge from the Stripe customers you see?

Emily Sands

If you are primarily a model provider—let’s say your customer’s primarily buying the model—I think you’re metering tokens.

Dan Shipper

Like an API. OpenAI API, Claude API.

Emily Sands

Exactly. For these vertical solutions, I think in steady state you are metering outcomes. But it’s going to take us some time to get there, not because of the billing infrastructure. That’s actually totally ready. You mentioned the Fin example—Intercom does the same thing actually on Stripe billing. They have an outcome-based meter for support tickets resolved.

Why do I say for vertical solutions it’s going to be on outcomes? Because I think end users are going to want to hold those vertical solutions accountable for outcomes, and they’re going to want to know that they have positive ROI on their spend.

When you and I buy a model, we feel like we ourselves are accountable for the ROI that we get on the whole plethora of applications we might have for that LLM. But if you’re a vertical provider—if you’re really focused on solving a concrete need in a given business domain on top of someone else’s LLMs—it’s on you to ensure the ROI is there. I think outcome-based pricing is the most efficient way to hit that.

Now, I don’t think all outcomes are created equal. You could imagine these complex objective functions—I’m an economist by training, so I’ll be a little nerdy—where it’s not just “did you resolve the support case,” but how complicated was it? With what quality? What was your CSAT? How expensive was the person that you were automating in that task? That’s why I say in the limit, I think it’ll take time for us to be very crisp on the outcomes we care about, how we measure those outcomes, and those outcomes will be multidimensional.

But I just have a hard time imagining that a year from now, most vertical providers are literally charging on tokens.

Dan Shipper

That’s really interesting. I am very curious to see that because what I’ve felt—and you can see this a little bit in the Lovable example you gave, but also in the Claude and ChatGPT examples and some of the pricing that we’ve ended up doing—is it’s per seat, it’s per user with overages.

Because we’ve started to exist in this world where we used to charge per seat so people know how to model it. It’s pretty easy to figure out how much I’m going to pay. But software used to be free to run, and now it’s not. We have to cover our butts basically, and protect our margin by adding the overage so that customers know what they’re going to pay unless there’s some special circumstance.

Do you see that? Where do you see that fitting in the examples you gave? And I guess you would say eventually that might go away. I’m curious why.

Emily Sands

I don’t think the charging for use or charging for overages will go away for most of the model providers. If anything, I think that will dominate and the seat-based billing will go away.

We can go back to the Fin or Intercom example. You and I would think it’s silly to charge based on number of customer service reps that are using the tool, because obviously a lot of what the tool’s doing is automating customer service reps. In today’s world, it isn’t perceived as silly to do seat-based usage of developer tools, but I think it’s a fair question since basically November or December to say, “Wait, why isn’t that silly?”

That seems a little silly because if what these agents are doing is making every developer 10x more productive, at some point don’t you need one-tenth of developers? And why would you want your revenue pegged to the count of developers as your base price?

I suspect that we will see seat-based disappear. Now, in the enterprise context, I think it’s quite different in the consumer individual context. I think with the exception of maybe some nerds on the call, most people are actually pretty uncomfortable as individual consumers with anything but a fixed-fee monthly, maybe with some overages if they want to spend like crazy.

But in businesses, I would be super surprised if six months from now we have half of the seat-based licenses that we have today.

Dan Shipper

That is fascinating. We’ll have to have you on again to talk about that one. I’m so curious to see, and I would love to see more Stripe data coming out about that.

One other thing you brought up before—you’re also seeing these companies scale faster. You said the time to get to $30 million in ARR is 18 months, which is significantly faster than any other cohort of companies you’ve seen. I’m curious—where is that coming from?

Presumably the spend or the growth from their customers is coming from somewhere. Either it’s spend that people weren’t spending before—it was on a company balance sheet just waiting to be deployed—or they’re pulling it from another provider and then going really rapidly into these new ones.

Do you have a sense for what’s happening here? Why are they growing so much faster, and where’s all the money coming from?

Emily Sands

I think a lot of the AI growth that we’ve seen is actually net-new spend being pumped into the economy. I think it has largely not been a substitute for traditional SaaS or for headcount opex, because it’s been experimental, because people are still learning, because organizations are somewhat slow to drop existing licenses often because they’re contracted into longer durations. But also because AI was starting not literally at zero, but at near zero. There weren’t other AI companies to go take market share from.

I would say now, going forward, I expect that some of it will be a substitute away from traditional SaaS. And by the way, I don’t say that in an old-company-versus-new-company sense. Some SaaS companies are doing an amazing job reinventing themselves as AI-first. You will have AI arms of traditional SaaS companies that are eating some of the revenue from the traditional version of the same company. But some will come from SaaS.

I think some will come from headcount opex. It is very hard to believe that companies will start spending single-digit, sometimes double-digit percentages of their headcount opex in LLMs and not step back and say, “Well, my headcount cost just changed. It used to cost me $300,000 for an engineer and now it costs me $330,000 for an engineer, because $300,000 is salary and equity and $30,000 is LLMs.” So I better reason about my budget on the plus-10% basis and make headcount decisions accordingly. And ROI decisions as well.

Then some of what we are seeing is definitely substitution now across AI providers. I was looking at retention rates for AI companies, and what you see is actually within the domain—for example, within AI dev tools or AI coding tools or AI model providers—the retention rate, both B2C and B2B, is higher than it was for SaaS.

Dan Shipper

Interesting. I’m shocked.

Emily Sands

But for the individual provider, it’s slightly lower.

Dan Shipper

Within—okay, got it. Yeah.

Emily Sands

Which is intuitive. Or, well, it’s ex-post intuitive, although I actually literally didn’t know and needed to query the data. But ex-post, it’s intuitive. Once you start using an AI dev tool, a coding assistant, you love it—you’re not going to stop using it. But you very well may iterate across providers as models vary in their quality.

Dan Shipper

Anytime a new model comes out, you’re just like, “I gotta try this.” And there’s a high percentage of curious travelers basically just hopping from one thing to the next within a category. But they’re definitely going to stick in using a tool like that for a long time.

Emily Sands

Yes, exactly. A lot of the crazy-fast AI growth we’ve seen is net-new dollars spent. But I think businesses are going to start to reason about that as a substitute for SaaS, or a substitute for headcount opex, or a substitute for other AI companies. It will be less purely additive in the go-forward year than it was in the past year, when people were really just starting to ramp up on their AI spend.

Dan Shipper

Does that imply anything to you about the valuations of current hot AI companies? Let’s except the OpenAIs and Anthropics of the world, but the ones in the $30 million cohort and the coming-up ones—does that say anything to you about their prospects or their growth rates or their valuations?

Emily Sands

If you look at the top 100 on Stripe, there are little pockets of twos and threes that are directly competitive, but a bunch of them are solving totally disjoint vertical problems with no competitor yet in the space. I do think there’s enough blue ocean vertical solutions that overall AI valuations are probably okay.

I think there are a couple of crowded spaces that you and I could intuitively reason about where you might think it would be a little frothy. And by the way, you see this in the micro view too. If you look at the sales-led growth contracts—when you are the first AI dev tool, you basically charge people sticker and you do very little negotiations, and enterprises pay you sticker and whatever.

Then all of a sudden you have to have these much more complex sales motions. You hire a bunch of sellers, you have your CPQ—configure, price, quote—system, and you have this nuanced billing because you’re competing against two or three other providers who have competitive-looking monetization models and you’re reacting to that.

On the micro, you start to see some of those competitive reactions creeping in as well. But I think the overarching next year will continue to have a bunch of blue-ocean vertical stuff that didn’t exist before. There will be some pockets where it’s a little more heated.

(00:40:00)

Dan Shipper

Fascinating. I feel like I’m learning so much. This is amazing. I want to go into Stripe. Instead of talking about the AI economy, I want to go into Stripe a little bit. Specifically—Stripe serves developers and is built for a world where humans are the ones buying and selling and also making the software.

Now agents are buyers, they’re sellers, they’re builders. You have to serve agents. I’m curious how that has changed how you think about the products that you offer, and maybe moving from just thinking about developer experience to agent experience.

Emily Sands

Do you want to start with agent experience or agentic commerce? I think they’re both really interesting, but they’re kind of different.

Dan Shipper

Which one are you most excited to talk about?

Emily Sands

Maybe agent experience, and then we can work backwards to agentic commerce.

Dan Shipper

Yeah. Let’s talk about agent experience.

Emily Sands

The whole idea of developer experience is changing. Historically, when I said developer experience, you thought: making it easier for a human engineer who’s at a keyboard. You need clear APIs and you need better docs and you need less setup work.

All of that still matters—it’s not going anywhere. But I think the developer is now a broader swath of persona. It could be a non-technical founder who’s in Cursor or Replit, describing an app in plain language. Or it could be a coding assistant who’s scaffolding an integration. Or it could be an agent who’s out trying to provision infrastructure on a human’s behalf.

I think it’s less about just “how do we help a human developer write code” and more about “how do we have a coherent and trustworthy product experience end to end” that acknowledges that at some moments the actor’s a human, at some moments the actor’s an agent, and at some moments the actor’s a human working through an agent.

You see this shift in some really concrete ways. Very simple example: LLM traffic to Stripe docs is up 10x year over year. That’s just a useful signal that machines are becoming users of developer infrastructure too, including Stripe’s developer infrastructure.

Dan Shipper

What about human views of Stripe docs?

Emily Sands

Human use of Stripe docs is actually flat to climbing. It’s not a straight substitute. I think there is just more developer activity happening, and LLMs are growing dramatically within that share.

Dan Shipper

That makes sense. Cool.

Emily Sands

I would also say the humans continue to check on the docs to sanity-check what the agent is coming up with, because your payments integration is actually a pretty big decision that you’re making.

Dan Shipper

I’ll say, better humans than I are sanity-checking. But I’m glad that someone is sanity-checking.

Emily Sands

Are you YOLOing it?

Dan Shipper

I’m YOLO vibe-coding my payment infrastructure.

Emily Sands

Okay. Amazing. So maybe you’re YOLO vibe-coding, but even if you’re vibe-coding, there’s still an important step around provisioning your modern software stack, and that is still very manual. You as a human are still creating accounts across multiple services. You’re managing credentials, you’re clicking through to do a lot of setup. You’re probably bouncing between dashboards. The coding is getting easier a lot faster than the setup is getting easier.

That’s actually the idea of Stripe Projects, which we launched—I don’t know, maybe two weeks ago.

Dan Shipper

That looks amazing. Tell people what that is.

Emily Sands

Yeah. Okay, if you want in, let me know. We can use it.

Dan Shipper

Yeah, I want in. I absolutely want it.

Emily Sands

Okay. You’re in tech. I won’t Slack right now, but I’ll Slack right after this and get you in. But basically the idea of Stripe Projects for those who haven’t explored is that you or your agents can go create and manage parts of your software stack right from the command line. Resources are provisioned in accounts you own and credentials sync back to your environment and so on.

One of the things that stood out besides your enthusiasm for it—which I appreciate—is just how overwhelming the interest has been in general from the ecosystem. We launched with Cursor and Supabase, PostHog is there, Neon, Runloop. There are a bunch of great companies involved. But then immediately after launch, over 100 other great companies reached out wanting to join, which I just think reinforces that the friction is real.

You talked earlier about how some things get easier with AI, but there’s a counter effect. I think coding gets easier, but code reviews become more burdensome because who’s reviewing all the AI code? This is another example: building gets easier, but you still kind of have to provision everything.

That’s just an example of how we’re building for this world where the developer is no longer just a human.

Dan Shipper

Got it. And then tell me about agentic commerce.

Emily Sands

Agentic commerce is a bit of an overloaded term. I think a mistake that people make with agentic commerce is they jump straight to the most extreme version. They hear the phrase and think: some system that knows everything about me and decides what I need and goes off and buys it for me. And then they’re underwhelmed with the world we’re actually in. Maybe we get to that extreme eventually in some form, but we’re not there yet.

I prefer to think about it as a spectrum. The economic infrastructure you need is actually pretty similar no matter where you are on the spectrum. But the spectrum also brings some realism to it.

At the first level, AI is just removing friction from the internet we already have. It helps you research and compare options and fill out some forms and narrow down your choices. But you, the human, are still making the decision. The agent is just making that experience easier.

Then you move to where search is descriptive. No more blunt keywords and filters. It’s like: I have little kids, I need a summer camp for my kids in this budget, on these dates, with this driving radius. That’s already a better commerce experience than search plus filter.

Then you get to real delegation—and I think this is what most people would consider the minimum viable bar for agentic commerce. I give some constraints—some budget, some dates, some category, maybe a few preferences—and then the system goes and makes the purchases on my behalf.

But then there’s the further-out version, the ambient version. I don’t prompt anything and the system knows me and my seasonal needs and knows that summer camp planning is happening. That would be music to my ears. That’s the most futuristic thing.

The point is that no matter where you are on that spectrum, the economic infrastructure the internet needs starts to change. Even the earlier stages force a redesign of payments infrastructure because the old model—humans sitting in front of a browser, creating an account, choosing a plan, filling out forms, clicking purchase, entering card details—not all those steps are happening anymore.

I think there are two worlds I reason about preparing for. One is agent-assisted buying—I’m ultimately in charge, but the discovery and checkout and payment happen inside AI interfaces instead of on a merchant website. I’m not going to Nordstrom; I’m buying within Gemini or ChatGPT or Meta.

What’s challenging here is two things. One, the AI agent needs to be able to understand the merchant’s products and prices and checkout flow so that they can act on behalf of the consumer. Two, trust can break down. As a consumer, I don’t want to hand off my credentials to an agent. As a merchant, I don’t want to let every bot through—I want to know if it’s a good bot acting on behalf of a legitimate customer.

The agentic commerce protocol, which we co-created with OpenAI, is the shared technical language between AI systems and businesses. It shows up across a lot of surfaces. We built it with OpenAI, but Microsoft Copilot uses it, Meta’s in-ad shopping experience uses it.

How it works is: the merchant only has to integrate once with Stripe for their product catalogs, their prices, their checkout flows. Then they can literally from the dashboard turn themselves on through a whole host of agents and be exposed through those shopping experiences.

Importantly, the merchant remains the merchant of record, and that part really matters. Businesses want access to these new storefronts, these new channels, but they don’t want to give up the customer relationship. They don’t want to give up control over trust or fraud.

Category one is: the human is still leading the buying, but the agent is facilitating the transaction. You could call it agent-to-commerce, you could call it facilitated commerce.

Dan Shipper

How does that actually work? Is the experience something like I’m in ChatGPT and it says, “Here’s a thing you might want to buy,” and I can click checkout from OpenAI, and that’s using that protocol to then go send my information to the merchant and then send me back, “Hey, your thing’s on the way”?

That’s kind of what you’re talking about?

Emily Sands

Exactly. Yeah. Same thing—you’re in Facebook, you get an ad in Meta, you do a one-click checkout. One of the primitives we built for this is the shared payment token, or SPT. It just lets your payment credentials be passed securely from the AI agent to the merchant so the merchant can process the transaction. The merchant processing the transaction is important because that allows the merchant to remain the merchant of record.

But you don’t want your credentials viewed by the agent, which is why it’s a token and not your actual payment credentials. And the merchant needs to know that you and the agent are good, which is why as part of the shared payment token, we pass over a whole host of fraud scores.

Dan Shipper

Can I integrate this? We have a bunch of software. Can I offer agentic checkout easily, or does it have to go through the OpenAIs and the Facebooks of the world?

Emily Sands

Yes, you can. And I think one of the premises here is—just like to date we haven’t seen one model provider to rule them all or one model to rule them all—we don’t think there’s going to be one agentic shopping experience to rule them all.

Merchants will literally break if they have to integrate with every single potential new storefront. When they integrated with the internet, they built their own storefront and iterated on it, but basically they built it once. If you tell them, “Hey, you need to build your storefront for agent shopping startup X and Perplexity and OpenAI and Meta,” their eyes are going to get bigger than their heads and they’re not going to be able to handle it.

We really want to abstract away that complexity for businesses. We spent the last decade-plus helping businesses sell wherever their customers are. First that was on their websites, then it was in apps, then it was through platforms and marketplaces, and actually some in person too with our Terminal product.

But now, where are the consumers? Where are they wanting to buy? Increasingly through AI tools and agentic flows. We just want to make it really easy for merchants to agnostically participate in those different storefronts. They can choose where they want to sell, they can turn it on—a little toggle in the dashboard. But it’s not a different integration, which is the whole idea of the protocol.

(00:50:00)

Dan Shipper

How often is this happening? What’s the volume of agentic commerce right now?

Emily Sands

The volume of consumer commerce is still relatively small as a percentage of all of the commerce we see. But it is growing quickly, particularly for what I would think of as commodities.

What is the first thing people are comfortable buying through agents? It’s things that are reasonably known, reasonably observable, not super high-priced. When people started buying online, you didn’t imagine they were going to go online and buy a $2,000 couch. Or a mattress—oh my God, these mattress companies that have blown up. It took time for them to build comfort making higher-price purchases, making more quality-dependent purchases.

Today it’s predominantly commodities.

Dan Shipper

Give me an example of one of these commodities and also what the order of magnitude we’re talking about when we say it’s relatively small.

Emily Sands

An example of a commodity would be a Halloween costume.

Dan Shipper

Got it. Agents are buying Halloween costumes for themselves.

Emily Sands

Agents are buying Halloween costumes. How many lazy parents are there in the world?

I think the consumer side is interesting too because we talked about what businesses need—they need a fast, easy way to safely expose their products, their prices, their inventory, their checkouts, understand fraud, and be in control of the relationship. From the consumer angle, the question’s a little different. Even if I’m a lazy parent, I’m not so lazy that I’m willing to give someone my payment credentials and let it rip. The question for me is: how do I safely let an agent buy on my behalf?

Have you heard of Link?

Dan Shipper

Yeah, I’ve used Link.

Emily Sands

Amazing. Link is our consumer wallet. What did you use it for? Do you remember the first thing you used it for?

Dan Shipper

I mean, I use it all the time. It’s everywhere.

Emily Sands

Amazing. Yeah, it’s everywhere. You wouldn’t believe where. I was getting soccer lessons for one of my kids from a local guy, and I was on their website and they only accepted Visa and Mastercard—neither of which I had on me—or direct debit from my bank account, which I wasn’t going to put in this very janky website, or Link. And I was like, “Oh, amazing, Link is here.” Great problem solved.

Anyway, a lot of people know about Link as our consumer wallet for buying soccer classes. It speeds up checkout. But it’s already used by about a quarter of a billion consumers. It’s not a small network. What I think is most interesting about Link is it’s a very dense network when it comes to AI.

Lovable is an interesting example. 58% of their payment volume runs through Link. You are hyper AI-pilled. It is not surprising that everywhere you are, Link is.

What’s changing now is that we’re evolving Link for the AI economy because so many of the Link consumers are already AI consumers. Acknowledging that agents themselves are becoming economic actors, the model isn’t “give a random agent your card and hope for the best.” Instead, it’s delegated authority with guardrails. You as the consumer decide which agents are allowed to request credentials and under what conditions and with what limits, and whether those purchases require approvals before they go through.

You do all of that through Link. It’s just a much more sensible model for delegated purchases.

Dan Shipper

That makes sense. Emily, this was a fantastic conversation. I learned so much.

Emily Sands

Awesome. Thank you for having me.

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

One App to Rule All Knowledge Work

Katie Parrott / Context Window — 2026-04-28 14:00:00 -0400

by Katie Parrott

in Context Window

Midjourney/Every illustration.

OpenAI’s Codex desktop app has become Every’s head of growth Austin Tedesco’s daily driver, handling everything from email triage and go-to-market planning to KPI tracking and recruiting. Last week, he and CEO Dan Shipper showed more than 250 paid subscribers exactly how they use it in our Codex Knowledge Work Camp. Read to the end for how to review business documents with Austin’s compound knowledge plugin.—Kate Lee

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Signal

Coding apps are the new operating system for knowledge work

What happened: OpenAI’s Codex desktop app may have started life as a product for senior engineers pair programming with AI, but these days it’s equally good for powering other types of knowledge work. Every’s head of growth, Austin Tedesco, now runs roughly 80 percent of his daily workflow through Codex—a tool that, at our Codex Knowledge Work Camp, he said was “trash” for non-engineers just three-to-six months ago.

Why it matters: OpenAI, Anthropic, and Cursor are all racing to ship a unified product for handling code and knowledge work, and they’re converging on a single standard: an agentic terminal or chat interface with a left-hand project sidebar, plus connections to all the tools you already use like Gmail, Slack, Notion, and Stripe. These connections, for many non-engineers, were the missing piece of the puzzle.

What it means: Switching between ChatGPT and Claude based on the models’ personality differences might become a less-common occurrence. Instead, your desktop AI app has your API keys, your project files, and your daily workflows. Businesses, especially, with custom skills and plugins and months of company data in Codex won’t casually swap to Claude Code or Cowork next quarter—and vice versa.

Watch for the desktop apps to converge further on shared patterns beyond project folders that load themselves and plugin connectors to your most-commonly used tools. These new patterns may define the next decade of office software.

What to do this week:

If you’ve been working in the web interface, download one of the desktop apps—Codex or Claude Code/Cowork—and spend a session there. The work feels different once you’re outside the browser tab.
If you’re already on a desktop app, poke around its integrations and capabilities section. There’s almost always something useful lurking, like Anthropic’s design and marketing plugins, or Codex’s PDF creation skill. Pick one and try it.

Now, next, nixed

Now: Documents written for both humans and agents. In the past, anything you wrote at work fell into one of two buckets: polished prose for people or structured data for machines. Agents are the first readers that need both. At Every, our guides on compound engineering and agent-native architectures exemplify this hybrid.

Next: Documents that write back. The latest internal version of Proof, our document editor for AI-human collaboration, supports agentic loops: The agent continuously monitors the document for changes and comments and suggests edits without you needing to interrupt your writing flow. The document seems to come alive, growing around your words in real time.

Nixed: Pretending the human wrote it. The pretense that an agent-written document has to sound like the human who sent it is a relic of a bygone era—especially if other agents are reading too. Provenance matters less if you’ve reviewed it and stand behind it.

Steal this workflow

Let the agent tell you what to automate

Some people hesitate to delegate work to agents because they struggle to think of a good use case. Try flipping it: Hand the agent the keys and ask it what to do.

Open Codex (or Claude Code). Connect your top three tools, like Notion, Slack, and Gmail. Give the agent full permissions—it can’t find patterns in what it can’t see.
Prompt: “Look at how I use my connected tools. Suggest five automations that would save me time, and rank them by how much friction they’d remove.” It might suggest a morning briefing based on your calendar, or ways to triage your inbox.
Pick the easiest one first. Have the agent draft replies to unanswered messages at the end of each day. Run the automation for a week, then audit the misses.

You won’t know the agent’s capabilities until it has access to your real tools and a reason to use them. Skip the guesswork and let it show you.—Laura Entis

Skill share

Reviewing work with the compound knowledge plugin

Compound engineering turns every coding session into training data for the next one, so that the agent gets a little smarter about your codebase each time you use it. Compound knowledge does the same thing for memos, plans, and KPI sheets. The review step, launched with the /kw:review command, ensures that the AI doesn’t start off on the wrong foot.

What it does. The plugin reviews any Codex or Claude Code plans for strategic alignment with your company’s strategy and the project’s goals—and to verify the underlying numbers—before the agent gets to work. It’s the difference between “the agent wrote a plan” and “the agent wrote a plan that doesn’t contradict the last three executive meetings.”

Why it matters. Most plugins for agents are built for engineers reviewing code. Code review happens after the code’s already written and tested. Compound knowledge assumes operators are reviewing memos, KPI sheets, or recruiting lists, where the verifiable failure might be a confidently wrong data point—which has to be caught before a plan is enacted.

Steal it. Compound knowledge is public on Every’s GitHub. Install it, drop your company context into the project files, and, with some practice and calibration, you’ll have a reviewer that knows your business.

Inside Every

Final approval in the final context

Austin runs his compound knowledge loops in Codex, but he always signs off on the agents’ work in the destination app. He approves Slack drafts in Slack, where he can see the channel’s recipients. He checks agent-produced email drafts in Gmail, and strategy memos in Notion or Proof.

This is context-switching as a safety feature. The destination app reminds you that AI is now acting on something real—that the message is going to a person, or the document is about to anchor a launch—in a way a chat window can’t.

As agents move deeper into the stack, though, the question becomes: Is the destination app the right venue for the final pass forever, or does the approval step need its own surface? And as OpenAI, Anthropic, and others race to own the management layer, will it become another part of the archetypal user interface for knowledge work?—LE

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

For sponsorship opportunities, reach out to sponsorships@every.to.

You Are the Most Expensive Model

Mike Taylor / Also True for Humans — 2026-04-27 07:00:00 -0400

by Mike Taylor

in Also True for Humans

Midjourney/Every illustration.

Not every step in an AI workflow needs the smartest AI. That may sound obvious, but it’s not how most people are working. The default is to route entire tasks through frontier models, which is expensive, slow, and usually unnecessary. Incremental determinism starts from a different question: How much intelligence does this task really need?? The answer is almost always less than you’d expect, and the savings add up.—Mike Taylor

There is a reason McDonald’s would never ask its CEO to man the burger grill: It would cost the company $9,230.77 an hour. It’s the same as using frontier AI models to do every task—you don’t need to pay 75 cents every half hour ($1,095 per month!) for Claude Opus to check your to-do list in OpenClaw.

This tension isn’t really about the pricing of AI models—it’s about the value of human attention. Now that you have a cheaper alternative for many tasks that used to require it, you need to figure out the optimal way to deploy AI in a way that frees up your most expensive model—you. Most businesses are getting this balance wrong in both directions: overpaying for AI on simple tasks and underusing it on ones that would free up their best people.

The solution is a process of optimization that I call incremental determinism. Every time you repeat a task, build it into a repeatable process by creating a skill file. Identify which parts of that process need the most expensive model, which can be delegated to cheaper, less powerful models, and which tasks repeat often enough to justify turning them into reusable code. And finally, get better at delegating so you can stay focused on the work that needs you.

I call it incremental determinism because the more you repeat a task, the more it pays to nail down exactly how it should be done. The first time, you figure the task out as you go, but after doing it a few times, you can document the best approach. “Deterministic” is a programming term for code that always produces the same output given the same input. The goal is to push as much of your workflow towards that end of the spectrum as possible, because deterministic steps are faster, cheaper, and more reliable. The tradeoff is the upfront investment needed to systematize the task.

There are four levels for achieving this balance and optimizing AI costs. Depending on your technical fluency, you don’t have to go to the final step, but understanding how they each support each other will help you manage how you can control AI costs across your entire organization.

Level 1: Turn sessions into skills

The first level is the easiest. Let’s say you are often asking AI to generate a PowerPoint pitch deck. The first step toward systematizing it is to make a skill. A skill can be as simple as a text file detailing how to do a task that the model follows each time it’s asked. It’s the McDonald’s handbook that tells every employee how to make the perfect burger, over and over again. Even less experienced cooks can get a good result.

Once you’re done with the normal back and forth of giving the AI the necessary data and context for the presentation, ask it, “What information would have been useful to know at the start of this task that would have eliminated several steps or mistakes?” Claude knows what it is capable of, so you can ask it to turn its response into a PowerPoint deck creation skill to use next time. Anthropic has been releasing plugins (collections of skills) for various industries to serve as a starting point. They even provide a “skill-creator” skill that teaches Claude how to guide you through making one when you ask.

Once you have a skill, test it. Ask Claude to test the efficacy of the skill with the following prompt: “Run the task using subagents, one with the skill, one without, and compare the results.” If the skill is doing its job, you should see an improvement in quality, cost, and speed. Now try running it with a cheaper model—“Run this test again with Sonnet/Haiku”—and compare the results. If you’re happy with the output, ask Claude to “Use a subagent with Sonnet/Haiku when calling this skill.” You are using a subagent because you don’t want the model that you are using for your main session—the more expensive one—to be the model executing the task, so the separate, cheaper subagent does the work. You just decreased the cost of running that task by 10 to 100 times.

It doesn’t make sense to write skills for throwaway tasks you won’t do again. But if you find yourself doing something for the third time, it’s probably worth formalizing it. If you’re using it multiple times per week, try getting it working with a smaller model.

Level 2: Turn skills into evals

Your team might see your skill and want to use it to create their presentations as well. While it’s easy to share skills across your organization, you’ll have to get them to trust that your skill delivers before they’ll adopt it. For that, you’ll need evidence in the form of evaluation metrics, or evals.

For the simplest eval, gather 10 examples of tasks your skill has been used for—say, the last 10 decks you have made with the skill—and rewrite the output to be the gold standard or best-in-class example of what you’d hope Claude could produce. Now, ask Claude to “Run each test case with subagents and compare the output versus my gold examples.” Make changes to the skill and test if it does better. This is the “LLM-as-a-judge” technique—you’re using a model to grade its own work against your standard.

In the spirit of incremental determinism, you should formalize your evals over time, too. Ask Claude to “Break down the patterns between what makes a ‘good’ answer (gold examples) versus the typical output of the skill.” It might say that one pattern for a good answer is following brand guidelines, another pattern is including four to five bullet points of commentary on a specific slide, and a third is calculating the correct numbers.

Once you have several evals, you can combine them into a single score. Each eval becomes one “judge”—it looks at the output from one angle, such as data accuracy, and returns a score. You can weight each judge based on how much that dimension matters to you, then average the scores together. This “panel-of-judges” approach lets you track overall quality as a single number. The on-brand eval might be worth 40 points to you, the correct numbers could be 50, and the bullet points worth 10. Each prompt you test can then be scored out of 100, allowing you to compare how well one approach works versus another. Claude is a human-level prompt engineer and runs this process as a matter of course if you use the skill-creator function Anthropic provides.

Let’s come back to our patterns of good output for a PowerPoint deck. Validating the data is more important than whether you’re missing a bullet point or using the right visual components, so you could weight that eval as 60 percent of the overall score versus 20 percent each for the other two. Together, you have a weighted average score for measuring how well your skill is performing. For companies, where getting a pixel out of line is a fireable offence, such as top-tier consulting or finance firms, you can change the relative weighting of that eval.

Now, you have proof you can share with the team about the impact your changes are making on skills. When the next big model comes out, you can test how much better it does on your benchmark and if it’s worth the extra cost.

Level 3: Turn evals into scripts

When your skill is working reliably, and you’re using it frequently enough that the token cost is starting to feel significant, you need to start thinking about scripts, CLIs or MCPs. This is where the steps get slightly more technical, but the principle is the same: Replace thinking with a structured process wherever your thinking doesn’t add anything extra.

Every skill, like your PowerPoint deck skill, is a bundle of actions—pull this data, reference our brand guidelines, create a .pptx file—and some of those actions don’t require a smart model. Some don’t even require an LLM at all. Deconstruct your skill into its component parts and hard-code whatever you can. Code costs almost nothing to run and returns in an instant compared to LLMs, so the more of your workflow you can make deterministic, the cheaper and faster it will be.

For our PowerPoint creation task, you can use the HTML and CSS templates for the slide deck written once by Opus, then filled in to generate the .pptx file when you need to create a deck. You can also write a script to pull the right revenue or sales figures from a data source, no LLM involved. The final export step—to .pptx format—can also be done in code.

For tasks that require some judgment, like checking your deck’s compliance with brand guidelines, don’t jump straight to the most expensive model. Platforms like OpenRouter allow you to call any of the major commercial or open-source models, so you can experiment with the tradeoffs between cost and intelligence. Basic classification and summarization tasks can be done by older models 1,000 times cheaper than Opus with reasonable accuracy. Leave the most challenging tasks, such as the narrative and tailoring the tone to a specific audience, to Opus.

Level 4: Turn scripts into better scripts

In the previous step, you replaced as much LLM thinking as possible with deterministic code, bringing the cost of your PowerPoint skill down 10 to 90 percent compared to only using Opus. But you were only optimizing for your own use. When your skill is running inside a product, creating hundreds of decks a week, cost inefficiencies will again become a problem. For this, you will need to build a process to automate the optimization. Once you have 100 to 200 examples of the skill being used in the real world, a reliable basket of eval metrics, and a clear map of what the skill does at each step, you have everything you need to do so.

The most common tool for this is DSPy, which can automate the prompt engineering process end-to-end. It runs your prompt, looks at the test cases, and rewrites the prompt to arrive at a more accurate outcome, often with a cheaper model. Another common approach is distillation. You use Opus to generate hundreds of high-quality examples that pass your evals, then use those to teach a cheaper model to produce similar results. You can do that by either including the examples in the prompt so Haiku can pattern-match against them, or by fine-tuning the cheaper model directly on the examples. Think of it as a head chef writing such a good recipe that a less experienced cook can follow it perfectly. This process can cost $10, $100, or $1,000, depending on the model and how many test cases you have, but spending $1,000 to save millions in production is worth it.

More experimental approaches are emerging, too. Andrej Karpathy’s autoresearch runs experiments to optimize a script file against an eval metric over long periods. Researchers wake up to more than 20 experiments run overnight with meaningful performance improvements.

The great enemy at this level is overfitting: The skill or script works well against your eval metric but fails on tasks it hasn’t seen before. It’s “teaching to the test” for LLMs. The evals in the previous step are your main defense against this, because they give you a formal rubric for grading its performance. Human involvement in the evaluation process is necessary because we’re better able to catch behavior that goes against the spirit of the game, even if it’s not technically wrong as defined by the rules.

If you are a manager at a company responsible for AI, you don’t need to know how to implement any of this yourself. What matters is understanding that this optimization layer exists, it’s what your technical team or tools are doing under the hood, and why the decision to invest can pay off.

You are the most expensive model

All of this optimization work takes time and expertise, and your attention is an even more expensive commodity than the latest models. Attention is the key word: The ladder of incremental determinism—sessions, skills, evals, scripts, optimized scripts—gives you a framework for deciding where to invest your attention. Every hour you spend optimizing a skill is an hour you’re not spending on something only you can do.

You don’t need to climb the whole ladder—having reliable skills and evals is more than enough. The point is knowing the rungs exist, so when the cost pressure hits (and it will), you know exactly which lever to pull. If you’re struggling with unreliable or expensive skills but don’t have the capability to build scripts in house, it might be time to bring in someone technical and AI-savvy to do the heavy lifting.

The cost of tokens is falling 90 percent every year for the same level of intelligence, so the task even Opus struggles with today might be easy and cheap in 12 months. Sometimes the smartest move is to overpay now and let the market do the price optimization for you.

Mike Taylor is the head of tech consulting at Every and a co-author of Prompt Engineering for Generative AI (O’Reilly). Learn more about how Every’s consulting team can bring AI into your organization.

For sponsorship opportunities, reach out to sponsorships@every.to. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Codex Moves Beyond Coding

Every Staff / Context Window — 2026-04-24 18:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every illustration.

Hello, and happy Sunday! Kieran Klaassen’s compound engineering plugin has crossed 15,000 GitHub stars, and this week it got a substantial update. It now works across more tools, comes with more built-in agents and skills, and has a cleaner setup flow—try it and let us know what you think.—Kate Lee

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Knowledge base

“Vibe Check: GPT-5.5 Has It All” by Katie Parrott/Vibe Check: The newly released GPT-5.5 is faster and easier to work with than its predecessors while also outperforming them on serious engineering tasks. Every’s testing found it to be the strongest OpenAI model for writing in about a year, and its biggest edge over Opus 4.7 shows up when working with an existing plan or system. Read this for the benchmark results, Reach Test ratings, and guidance on when to reach for GPT-5.5 versus Opus 4.7.

“Introducing Monologue Notes: Record Every Meeting, Call, and Voice Memo” by Naveen Naidu/On Every: The best thinking can happen away from your desk—on walks, on calls, in meetings—and then vanishes. Monologue Notes, a new feature in the Monologue app, records and transcribes all of it, then makes those transcripts available as context for whatever coding agent you use. Read this for the two starter prompts that turn your recordings into a structured work session and try it for yourself.

🎧 🖥 “You’re the Bread in the AI Sandwich” by Laura Entis/Context Window: Dan Shipper and Kieran Klaassen work through the titular AI sandwich, where humans excel now that AI handles execution: framing the problem upfront and judging the output after. Plus: how Every’s consulting agent Claudie keeps absorbing new responsibilities instead of spawning new agents, what that reveals about the two organizational structures that will define how companies deploy AI employees, and Nityesh’s trust battery system that lets Claudie earn autonomy by learning from her mistakes. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on X or YouTube.

“Mini-Vibe Check: Claude Design Isn’t for Designers—Yet” by Katie Parrott/Context Window: Creative director Lucas Crespo put Anthropic’s new Claude Design through its paces. He finds it useful for empowering non-designers to produce on-brand assets, but poorly suited for open-ended creative work. Plus: Back-to-back security incidents at Vercel and Lovable reveal two distinct ways AI tools can expose your data, and a workflow from Nityesh Agarwal for setting up an agent-run X feed that monitors your AI stack for vulnerabilities overnight.

“Model Wars” by Laura Entis/Context Window: GPT-5.5 touched off a debate between Nityesh (Claude Code devotee) and Naveen Naidu (Codex partisan) about whether the Anthropic-vs.-OpenAI rivalry is a model question or a product one. Plus: Austin Tedesco‘s four-step workflow for producing polished product videos with Remotion and Claude Code, and why prompts are replacing the download button as the front door for AI-native tools.

“How I Escaped AI Autopilot” by Katie Parrott/Working Overtime: Katie Parrott accidentally completed a client assignment twice—because she’d delegated so much to AI that her brain never bothered storing a memory of doing it the first time. Research on pilots and cognitive bias explains why fluent, polished AI output is what makes it hardest to scrutinize. Read this for the three practices she’s now using to stay focused on her work.

Log on

This week’s camp

Codex for Knowledge Work Camp: Dan and Austin showed how to use OpenAI’s Codex for drafting, research, summarizing, running tasks in parallel, and building small tools to automate routine knowledge work. Watch the recording.

In New York City

Software Is the New Media: Join us at Betaworks on April 28 for an evening conversation on how AI is changing media, content, and software—and what that means for the people building in all three. Learn more and RSVP.

Recordings you may have missed

Compound Engineering Camp: Cora general manager Kieran Klaassen and product leader Trevin Chow walked through what’s new, went deeper on the brainstorm and ideate steps, and shared examples of using the compound engineering plugin in product-focused workflows. Watch the recording.

From Every Studio

Cora’s new inbox is looking for alpha testers

Kieran is looking for a small group of alpha testers to put Cora’s new inbox experience through its paces and share feedback. The alpha version now supports drafts, snooze, grouped views, keyboard shortcuts, metadata parsing, bulk archive, undo, and a context-aware chat that can answer questions about the email you already have open.

Cora’s broader goal is to let people do email however they want, whether that means organizing by recency, categories, briefs, or eventually doing an agent-first pass with manual cleanup at the end. If you want access, reach out to Kieran at kieran@every.to.

Spiral’s API agents can now remember how you write

Spiral is adding memory to its API agents, so your writing assistant can learn your projects, preferences, and common corrections over time. Instead of restating tone, structure, or your usual edits in every session, you can carry that context forward and get drafts that pick up where the last one left off. Memory is live now through the API (it’s not inside the app yet, but stay tuned). Try it at writewithspiral.com.

Alignment

Terminal pilled. Four months ago I opened the coding terminal for the first time, and it felt like staring into a black box that might bite me. Now I’m a snob about using it instead of a desktop app.

I build dashboards for biotech companies in it. I pull clinical trial data and parse financial filings while asking AI to explain the business model to me like I’m 11, and then like I’m 15, and then like I’m a grownup. On top of all that, I run Ghostty as my blazingly fast native terminal so I can juggle multiple windows for different workstreams, and I feel like I’m in the Matrix.

I’m promiscuous about the models inside the terminal I use. It might be Claude one day, GPT the next, and whatever is new the month after that. But I will never leave the terminal. Codex and Claude Desktop and Cowork have built beautiful interfaces for exactly the work I do, and without even trying any of them, I’ve decided they’re inferior—maybe because they’re too easy to use.

The terminal gives me the sense that I passed through a threshold of frustration most people won’t, and that’s worth the tiny sliver of superiority I feel when I use it. And sitting at a terminal makes me feel like I belong with the people who know how to code, even though I don’t, really.

All it took was four months of use and a minor superiority complex, and I’ve become one of those people I used to wonder about—the ones who won’t try the new thing even when it might work better.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

Upgrade to paid

Model Wars

Laura Entis / Context Window — 2026-04-24 15:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

GPT 5.5 is here, and OpenAI’s latest model has it all. It’s fast enough to use constantly, personable enough to collaborate with, and assertive enough to carry a plan through serious engineering work. If you didn’t catch our full review, including benchmark results, Reach Test ratings, pricing, screenshots, and advice on when to reach for GPT-5.5 versus Opus 4.7, read our Vibe Check or rewatch the livestream, where we grilled OpenAI’s Dominik Kundel and Romain Huet on how they’re using the model.

But how will that shift the balance between OpenAI and Anthropic? That may be a product question as much as a model question. Every engineer Nityesh Agarwal and Monologue general manager Naveen Naidu weigh in.—Kate Lee

Inside Every

Codex versus Claude Code

This week, Anthropic tested removing Claude Code from the $20 Claude Pro plan, prompting an outcry from users and drawing jabs from OpenAI executives on X, perhaps feeling emboldened by the big launch they knew was coming.

The exchange kicked off a Slack debate between Nityesh Agarwal, our resident Claude Code devotee, and Naveen Naidu, who rides hard for OpenAI’s coding app Codex.

Nityesh’s take: Anthropic potentially raising prices is “simple market economics”—there is a huge demand for Claude products because they’re the best available, so they can charge more. On the other hand, OpenAI’s response underscores how frustrated the company has become playing catch-up as it scrambles to replicate Claude Code, Cowork, and skills. From a product standpoint, Claude in the browser and the Claude Code command line interface (CLI) are better than ChatGPT and Codex.

Naveen’s response: Anthropic’s models are powerful, but they also burn through way too much compute in production. OpenAI is much stronger on infrastructure, and GPT 5.5 is a token-efficient model. And while it’s true Anthropic is first to market with a lot of products and features, including computer use—which allows AI to operate your computer on your behalf—OpenAI is better at execution. Naveen consistently reaches for ChatGPT and the Codex desktop app, while he finds the Claude Code app too buggy to spend any time in.

Where they agree: The Claude Code app is, indeed, bad—Nityesh concedes he only uses the CLI. And both labs misjudged how much compute they would need, but in opposite directions: Anthropic is struggling to keep up with demand, whereas OpenAI has invested heavily in infrastructure and is now scrambling to get people to use its products.

Data point

It’s not just a grammatical pattern; it’s an AI tell

Four times.

That’s how much the usage of “not just a ___, it’s a ___” sentence construction rose in large U.S. company documents between 2023 and 2025, per Barrons.

The rise in correlative constructions neatly tracks with the adoption of LLMs. (Source: Barrons.)

Like the em dash, the correlative constructions are so beloved by LLMs that human writers now avoid them so as not to be accused of writing with AI.

Hot take alert: That’s a bummer. The great profile writer Taffy Brodesser-Akner’s work is teeming with them. Or it was, pre-ChatGPT. Her 2018 New York Times Magazine feature on Goop uses some version of “not X, it’s Y” in almost every other paragraph.

I doubt even a writer as beloved as Taffy could get away with that today. It’s not that her trademark style is any less effective—it’s that no one would believe she wrote it.

Steal this workflow

How to (almost) one-shot a product video

After days of battling open-source video creation tool Remotion and Claude Code, trying to one-shot a video for a product relaunch, Austin Tedesco, Every’s head of growth, figured out how to get a polished clip. Here’s the workflow he runs any time he needs a social video for a product launch or feature demo, like the one he created for the relaunch of Sparkle, our agent-native app that cleans and organizes files on your Mac.

A GIF showing a clip from Austin' s product video. (Source: Every.)

Step 1: Screen-record yourself using the product you’re doing a clip on. All you need is raw footage of yourself clicking through features in real time.

Step 2: Send the recording to a model—Austin prefers Opus—and have it draft a storyboard. The recording provides a ground truth for how the UI works and what the copy says. This prevents the most frequent cause of fake-looking launch videos: plausible-but-hallucinated labels and features.

Step 3: Iterate on the storyboard. Go back and forth with the model until the hook, pacing, and beat-by-beat plan feel right.

Step 4: Hand the storyboard to a coding agent and have it build the video in Remotion. With the screen recording and the corresponding storyboard, the first full render is usually publishable. It’s not a true one-shot, but it saves a lot of time.

Now, next, nixed

Prompts are the new installers

Companies and developers are trying a new way to let users download an AI tool. Instead of asking them to press a download button, users copy a setup prompt, paste it into Claude Code or Codex, and let the agent install the tool.

Now: Copy prompt, paste, install. This is how we install Every’s agent-native document editor Proof: Paste a prompt into your assistant, and it handles the setup. The prompt is doing the job the download button used to do: It gets the user from “I want to try this” to “It’s running in my workflow.”

Next: Someone designs the standard version of this. The copyable prompt block becomes a normal part of product pages and GitHub READMEs (the instructions for software projects), especially for developer tools. It should work on the web and on a repository homepage, and feel as obvious as a “Sign in with Google” button.

Nixed: The download button as the main way in. The old-school way of installing software—clicking a download link and running a setup file—still makes sense when software requires direct hardware access or needs to work offline, but for AI-native tools, the front door is: Copy this prompt into your agent.—Katie Parrott

Model happenings

News you might have missed

Cowork shipped live artifacts. Claude can build dashboards and trackers inside your workspace that pull fresh data from your apps and refresh each time you open them—pouring narrative gasoline on the SaaSpocalypse fire.

Cowork artifacts allow you to create the dashboards and data reporting visualizations tools like ChartMogul provide. (Image courtesy of Brandon Gell.)

OpenAI gave Codex screen memory. Codex now retains what’s on your screen across tabs and sessions, so you don’t have to re-paste context every time you start a new task.
OpenAI launched workspace agents in ChatGPT. The Codex-powered feature lets teams create custom shared agents that can pull information from different sources, analyze it, and turn it into a draft or next step. It’s another signal that agents are becoming a shared team resource, rather than purely individual AI assistants.

One last thing

Nityesh has been having a lot of fun with ChatGPT Images 2.0

A couple of his recent creations include a vintage poster to celebrate the release of Monologue Notes, a new feature in our agent-native recording app, and an infographic about securing Claudie, the consulting team’s always-on AI employee.

The prompt: Turn Monologue Notes’s landing page into a vintage poster. (Image courtesy of Nityesh Agarwal)

Laura Entis is a staff writer at Every. You can follow her on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Vibe Check: GPT-5.5 Has It All

Katie Parrott / Vibe Check — 2026-04-23 13:00:00 -0400

by Katie Parrott

in Vibe Check

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Frontier models usually take a while to get used to. You have to learn their slow spots, when they need extra prompting, and when to keep a close eye on the output.

GPT-5.5, out today, feels easier to settle into. It’s fast enough to use constantly, personable enough to collaborate with, and assertive enough to carry a plan through serious engineering work. It’s better at writing than any OpenAI model we’ve used in about a year, and it produced the strongest result we’ve seen on our new Senior Engineer Benchmark, which measures how well models can rewrite a messy production codebase the way a senior engineer would. It’s rare for a model to feel easier and stronger at the same time.

The big insights from our testing:

Best on senior-engineer coding. GPT-5.5 scored 62.5 on our Senior Engineer Benchmark versus 33.5 for Opus 4.7. Humans still score in the high 80s and low 90s. The twist: GPT-5.5’s best run used an Opus-written plan.
A real writing comeback. It’s the strongest OpenAI model we’ve tested in a year, with cleaner structure and smoother logical progression than Opus 4.7.
Strong everyday knowledge work. GPT-5.5 beat Opus 4.7 on dashboards and felt dependable for creating client deliverables or customer support replies.
Best with structure. GPT-5.5 shines with a plan, an existing system, or a tight feedback loop. Opus 4.7 still has advantages on one-shot vibe coding, PowerPoint, Ruby, and some broad product-design tasks.

The full Vibe Check has the benchmark results, Reach Test ratings, pricing, screenshots, and advice on when to reach for GPT-5.5 versus Opus 4.7.

Read the full Vibe Check

And watch our video Vibe Check with Dan Shipper:

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

For sponsorship opportunities, reach out to sponsorships@every.to.

Transcript: ‘The AI Sandwich: Where Humans Excel in an AI World’

Dan Shipper / AI & I — 2026-04-22 19:00:00 -0400

by Dan Shipper

in AI & I

The transcript of AI & I with Every’s Kieran Klaassen is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Timestamps

Introduction and the AI sandwich metaphor: 00:00:52
What compound engineering is and how it’s evolved: 00:02:33
The “work” phase of agentic coding is essentially solved: 00:04:27
Why humans belong at the beginning and the end of an AI workflow: 00:06:27
Dan’s argument for why agents can’t change frames—and how this will keep us employed: 00:11:06
Full automation remains a moving target: 00:16:51
Musical composition as a model for human-AI collaboration: 00:23:21
Find your place in an AI-accelerated world by leaning into what brings you joy: 00:26:39

Transcript

Dan Shipper

Humans are the bread in the sandwich, and the AI is in the middle.

Kieran

The AI is whatever you put on your sandwich. If you ship something or do something—if you want it to be your own—you cannot fully automate everything. It’s like art. If you want it to be yours, it needs to come from you or somehow be connected.

I believe it’s so important to do things you enjoy and love. It’s very important to make it feel great because the bar is high. The bar will always get higher. The beginning and the end—the middle can be automated pretty well. And Dan at some point said, “Oh, it’s kind of like a sandwich,” which was very funny.

Dan Shipper

Ki, welcome to the show.

Kieran

Hello, Dan. Happy to be here.

Dan Shipper

For people who don’t know, you are the GM of Quora, and you are also the creator of compound engineering—the engineering framework and plugin that everyone inside of Every uses, and that everyone who’s really coding with agents is at least aware of, if not using.

A pleasure to have you on the show.

Kieran

Thank you. It’s always great.

Dan Shipper

I love getting to chat with you and getting to work with you, because every once in a while you figure something out and I’m like, “Holy shit, that’s definitely the future.” And you just figured something out—along with Trevin Chow, who also helps out on compound engineering—that I think has massive implications for how programming works. And I think we can also translate that to the rest of AI and its impact on work.

One of the things you’ve been doing with this compound engineering plugin is you’ve rebuilt the engineering workflow for how you should work with agents. And in thinking about that—thinking about where a human is needed and where a human should not be present inside that process—I think you’ve found something really interesting and deep about how humans and AI are going to interact with work. Do you want to explain a little bit about compound engineering and the process you’ve created, and then also explain this insight about where humans fit?

Kieran

Yeah, absolutely. Compound engineering is a philosophy of doing engineering work. We’ve realized it applies to more than just engineering—it’s product work, design work, knowledge work, and other things. But how I built it was while building Quora. I had AI and was thinking: how can I use AI to do better work more quickly?

The initial version of compound engineering really evolved around four steps. The first is planning—you make a great plan so it’s very clear what you need to build and do. Then the work phase, where the agent does the work, implements it, and actually writes the code, does the design work, or whatever work needs to be done.

The third is review. Some slop comes out—or something beautiful comes out, one of the two—but how do you know it’s good? Traditionally there’s a code review, a PR queue, where someone says, “Hey, this can be improved,” and there’s some iteration going on there.

And then the most important step is the compound step. If anything comes up during the review or during the planning that feels like a good learning—something you’ll probably run into again—you can compound that knowledge back into the system. We store that as knowledge inside the repository, and agents can reference it the next time they go into planning, work, or review. They can see the mistakes they made before so they won’t make them again. That’s the most powerful thing in this plugin.

But we started to realize more things. First of all, the work phase is kind of dumb—not in a bad way. If you have a good plan, it does the work and it’s pretty good. And then the review makes it a little bit better.

Dan Shipper

And by that you mean: having an entire phase dedicated to work in this whole system doesn’t necessarily make that much sense, when all it really means is “run the model, let the model do the thing.”

Kieran

Yeah. There needs to be a step, but what I mean by “dumb” is I don’t need to care about it—I don’t need to think about it. I trust it. And this isn’t “trust me, bro, it just works.” This is: I’ve seen that if you put in a good plan, it executes on the plan. LLMs are very good at following steps, doing deep work, working for hours or even days now.

That thing is kind of solved. The review is starting to get there too. The planning is starting to get there too. And then you hit this next question: if all these things work, where do I actually have to do anything?

Dan Shipper

Yeah.

Kieran

Did I automate myself out of a job? If everything works, where do I work? What is still the bottleneck?

There are two things we started to identify—and Trevin was a big contributor here. He’s a product person, and he said, “I need more on the product side, which is before the planning phase.” So he added a brainstorm step and an ideate step. The ideate step is really going wide—coming up with ideas in a room full of interesting people with different angles. Brainstorm is more like: I have a problem but I don’t really understand exactly what or how. So it’s very much brainstorming around the problem.

The first thing we noticed there is that at the top, it’s very important to stay in the loop with a human and really ask a lot of questions—the human should think hard, and the LLM should support the human. But then after that, if you have a good brainstorm and a clear idea of what problem you’re solving, it can create a very good plan and the human doesn’t need to be in the loop.

So that’s the first realization: here’s where it’s good to be in the loop versus not. You can see other approaches—spec-driven development, for example—that assume it’s always good to have people in the loop, and I disagree. It’s very important to know when to be in the loop versus when to hand it off, because that means we can think harder at the moments where we actually need to think harder.

The other moment comes at the end. Something comes out. How do you validate it? Well, it’s already tested—browser automated testing has clicked through everything, all the requirements are clearly specified, and it says everything works. But the beauty comes in when a human looks at it, clicks around, and has a feel for it: “Oh, this doesn’t feel right. We can polish it. We can make it better. There’s something still missing. We can make the design better.”

I learned this from doing Pomodoros. Ideally, if you finish a task after 15 minutes, you still have 10 more minutes to work on the same task—you can’t switch. And sometimes in that space, something beautiful happens because you go deeper, further than you would have otherwise.

I think that’s the other critical moment: all the way at the end, when everything is done, you can elevate everything and make it even better. And I think we need to do that, because if we don’t, it will all be slop—all the same. It’s very important to make it feel great because the bar is high, and the bar will always get higher.

So this is what we realized: the beginning and the end. The middle can be automated pretty well. And Dan at some point said, “Oh, it’s kind of like a sandwich,” which was very funny. And Dan is now referring to the AI sandwich, which I think is very cool. The sandwich is really: when do you need to think about what you’re doing and really use your brain, versus when do you offload it?

(00:10:00)

Dan Shipper

Humans are the bread in the sandwich, and the AI is in the middle.

Kieran

Yeah. The AI is whatever you put on your sandwich.

Dan Shipper

Exactly. And I think that’s really interesting and really cool because it gives me a good mental model for how I should be working with coding agents—but I think it also applies to the rest of knowledge work.

This is such an important question now, because we have all these questions about what agents are going to do, whether everyone’s going to lose their job, all that kind of stuff. I think software engineers are a little bit the canary in the coal mine. And so far, what we’ve found internally at Every is: absolutely not. We still hire software engineers. We need software engineers. But the way you’re working—what you’re doing—looks a lot more like managing. If you’re doing it well, you’re still involved, but you’re involved at the beginning and the end as this kind of sandwich. And I think the same is going to be true of every other kind of work, whether that’s copywriting, strategy, or design.

And there are deep reasons why that is the case. I want to start with an objection people will have, which is: okay, for now agents can’t do the ideate and brainstorm phases, but pretty soon they will. So then what?

They’re already starting to do the beginning of that process. And I think there’s something interesting here. If you look within any given local frame of a problem—to take a non-coding example—the problem might be “my knee hurts” and you want to solve that. But “my knee hurts” is the same kind of problem as “this feature is broken” or “customers are anxious about this part of the product.” Any problem. If you take that frame and say, “The solution is take Advil”—any part of that process, getting to the store or whatever, can be automated. DoorDash can go do it. But there’s always, even once you’ve solved it at that level, a larger frame within which to think about the problem.

If your knee hurts, you might need to stretch your IT band. Or you might need to stop running on hard surfaces every day. And each one of these addresses the same problem at a different level of the stack, from a different frame. Humans are very good at flipping and changing frames like that. Our job is to set the frame—set the bounds within which we solve the problem. And I think it’s going to be very hard for agents to do that well by themselves.

Does that resonate for you?

Kieran

Yeah, for sure. It all comes down to building an environment where the agent will thrive. And you do that by picking the right things. That’s why it’s so important to have humans with experience, humans with taste, humans who want to click around and say, “This is great” or “This is not”—and say why.

I think it’s similar to the Advil example. If you keep taking Advil, eventually a friend will say, “That’s messed up—just go fix the actual problem instead of denying it.” It might work for a while, but you need someone to shake you up. And in that case, that’s the human.

But I do think the ideation step will also become more automated. You can say, “Let’s have a persona of 100 people and run simulations of how they think and behave.” And clearly we’re going there—running simulations of millions of people, seeing how things work, learning something from that. There will be more automation, and maybe even the front step will eventually be fully automated.

But I do think that in the end, if you ship something—if you make a statement in the world—and you want it to be your own, you have to say yes or no at some point. You cannot fully automate everything. It’s a bit like making art. If you want it to be yours, it needs to come from you or somehow be connected.

So I believe having those moments where you decide—where you choose what you enjoy—is so important. That’s why it’s so important to do things you enjoy and love.

Dan Shipper

I agree. And you can imagine it being: “We’re going to simulate a billion people and make decisions based on what we think they would do.” But that would still only cover a small set of the decisions someone might actually make.

Kieran

It will never be fully solved—it’s a moving target. We always create something new, and then there’s a layer above that where we can make even bigger impact.

Dan Shipper

Especially because, for a lot of these decisions, the feedback loops are so long and the data is so rare. You may only get a couple of moments in your career where you gather the data that helps you decide about a particular thing. That’s very hard to get into language models—especially because it’s hard to gather in the first place, and they need a lot of it. That rare expertise, encapsulated in an expert who has a personality and a worldview, is hard to replicate. And you’re right—it’s always moving.

That makes me really excited about this, because I feel like we’ve been wandering in the woods for a long time on the question of what AI progress is going to mean, and how humans are going to be involved. And it just feels very much to me like the simple answer is: ride the bottle. Or to mix the metaphor—be the bread in the sandwich. If you do that, you’re going to be fine. It’s going to be really, really great.

Kieran

I agree. And it will be different for different people, and you will need to change some things. If you only love writing code, you need to find your way of doing that. Yes, you can still write code—but maybe it’s about beautiful code. Maybe you find a lot of value in just recognizing beauty, the way someone looks at a UI and says, “This is beautiful, this works great.” Maybe you want that for code. Some people don’t care about that, but they love that the UI should feel great, and they’ll polish it, go extra—wherever they feel joy.

And it’s also becoming much more product-focused. As an engineer, you’re going to become either more of a manager or more of a product person. Product manager, product engineer—it’s more of those things as well. So there will be some changes, but lean into making beautiful things. Whatever that means to you: beautiful code, beautiful abstractions, beautiful architecture, beautiful design, beautiful copy.

I think it’s very important to lean into what is beautiful to you, because then you’ll find a way to use an LLM to make something that gives you energy instead of draining you.

(00:20:00)

Dan Shipper

And I think there’s a deep reason why language models are not going to be as good at that. One reason is it’s just not going to be yours if you didn’t decide it, if you didn’t do it. But another deep reason is that you can think of language models as a super-intelligence that’s been kept in a box for the last year and has no idea what’s going on in the world, except for whatever it picks up when it pops out of the box. Because of that, its outputs end up being a little more generic and less personal to your situation.

You can see this in all the AI writing that reads like “It’s X not Y” and that kind of thing. To truly solve a problem well—or to truly make art, or to truly make a product that resonates with people—it has to be really well tuned to the exact problem you’re trying to solve or the exact form you’re trying to make. Language models need a lot of help to get there. That’s why you have to be on either end of them: to set the frame of the problem, and then to make sure the details are really right at the level of execution. And I think they’ll get better at this, but they’re much further from being able to do it end to end than we think.

My general bar for AGI is: whenever it becomes economically worthwhile to run an agent 24/7—it never turns off. OpenClaw is pushing in this direction, but it runs on a schedule, it has a heartbeat. You can’t just say, “Hey, go do a bunch of stuff and work all the time, spend tokens on everything constantly,” and have it be worthwhile. We’re not even close to that. Yes, we sometimes have well-specified tasks we can send a model off to work on for 24 hours, but it’s not changing frames on its own. It’s not finishing a task and then picking the next one, spending five minutes on this one and four days on that one. We’re not even close to that. I think we’re going to need some fundamental changes to the language model architecture to get there.

If they are running 24/7 like that, they’ll be a lot closer to being context-sensitive enough to do interesting creative things. But we’re not there yet.

Kieran

Yeah, I agree. One other way to look at it: I have a music background. I studied classical composition. And one of the beautiful things about music is—yes, Suno can create songs, but it will never capture a live performance, or the experience of coming up with a melody. There’s something internal in the human. As a composer or musician, if you perform something and deliver it to other people, they feel that. It’s different.

If you’re a DJ, it’s maybe somewhere in the middle—but there is something about performing, about expressing something. And I think there’s some of that element in these steps too. You see something and you feel like, “This is a little bit off here—I don’t know exactly why, but I want to change it.” And suddenly you’re performing, iterating, making something. You’re putting something into the world.

Practicing a piece, playing it 100 times—that’s not very creative, as a musician. That’s kind of the middle part. But the performance, at the end, is where you bring it out into the world to the people. That’s a special moment. And there’s a link for me with doing that polishing step at the end of a project.

And the start—if you’re a composer, coming up with something out of nothing—that’s also a special moment. Everything in the middle is kind of boring. It’s just work. But those end moments are still special, and it kind of works for making software or other things with agents as well.

Dan Shipper

I think that’s totally right. I love this art angle. Another way to say it: all work exists on a spectrum from being totally rote to being art. And art itself has many tasks within it—any kind of creative work has many tasks within it that are more or less rote.

If you’re trying to map work on that spectrum, the stuff that is more rote is just stuff you’re not going to have to do anymore. And that is a big opportunity to move a lot of the work we do to the more creative—and probably more interesting—parts of work. And to recognize that the frame is always changing. As certain things get rote, other things become what humans start to do. Yes, those will get automated too, but we’ll also keep moving along that spectrum.

The final thing that’s not automatable is art made by humans who feel something. And I think that’s beautiful.

Kieran

Yeah, it’s still scary—because what if you’re in the middle and you want to move? What if you’re trying to figure out what that means for you? This might sound very abstract and weird to some people. If you’re not an artist or haven’t really felt this in moments, it might sound like, “Oh, but that’s not me.”

But I do believe everyone has this. What brings you joy? What lights a fire in you? What do you get excited about? Whatever that thing is, you should lean into it. That can be beautiful writing, or very structured lists, or anything that just brings you happiness—you should do more of that, and use LLMs in your work toward that. That’s good.

Dan Shipper

I agree. Kieran, always a pleasure.

Kieran

Thank you. Let’s see where this goes.

Dan Shipper

See you next time.

Kieran

See you. Bye.

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

You’re the Bread in the AI Sandwich

Laura Entis / Context Window — 2026-04-22 15:00:00 -0400

by Laura Entis

in Context Window

Was this newsletter forwarded to you? Sign up to get it in your inbox.

‘AI & I’: You’re the Bread in the AI Sandwich

Today, we’re releasing a new episode of our podcast AI & I. Dan Shipper sits down with Kieran Klaassen, GM of Cora and creator of Every’s AI-native engineering methodology, compound engineering. Dan and Kieran discuss where humans fit now that AI can generate high-quality code, copy, strategy, and design. If the execution layer is largely solved, do engineers still have a role in the workplace?

The short answer: Yes. Think of an AI workflow like a sandwich—the model is the workhorse filling, and we’re the bread, providing framing and taste.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are the highlights:

Play to your strengths. Kieran’s compound engineering framework breaks the engineering workflow into four steps: Plan, work, review, and compound. AI takes care of the doing phase. “LLMs are very good at just following steps, doing deep work, working for hours or days, even now,” Kieran says. What’s left for flesh-and-blood humans are the steps before and after—the planning, where you frame the problem, and review, where you determine whether the output feels right (the bread!).
Humans can identify multiple solutions to the same problem—AI struggles at this. If your knee hurts, you could take Advil, stretch your IT band, or stop running on hard surfaces. Humans are good at diagnosing a problem from many different angles, an exercise agents struggle with, Dan says.
Taste is the final layer of bread. Once AI has done the work, the most important thing you can do is judge whether the output approaches the vision in your head. Does the output feel right—and if not, how can you reframe the problem until the AI produces something that does? This is what separates art, which has a point of view, from generic slop.

Now, next, nixed

The agents are merging

Now: Claudie is an AI agent that runs on a Mac Mini with a Claude Max account. Since joining Every’s consulting team a few months ago, she’s been promoted multiple times and is now responsible for managing client updates, the sales pipeline, and the creation of slide decks.

Every engineer Nityesh Agarwal initially built Claudie as an AI project manager. The plan was to build separate agents to handle deck creation and the sales pipeline.

But every time he added a capability to Claudie’s plate, she exceeded his expectations. And so instead of creating more agents, Nityesh converted their planned functionality into plugins within Claudie. “There doesn’t appear to be any limit to how much this AI employee can do if you spend time building good, refined skills,” he says.

Today, each (human) member of the consulting team has a personal AI assistant tailored to their own workflow, and they use Claudie to do tasks where they can take advantage of skills—such as slide deck building—that can be shared across the team.

Next: Two organizational architectures for agents will develop simultaneously, Dan predicts. In the first model, every person at a company gets their own AI assistant. In the second, workers across the organization will rely on a single super-agent with a library of department-specific plugins, similar to Claudie, but even bigger.

In the first case, each worker can customize their agent to their exact specifications, which allows for a richer relationship but requires setup and maintenance. In the second, one specialist does the upkeep of the agent and its plugins for the whole team or company, which takes the burden off each worker, but means they can’t make any tweaks.

Nixed: A fleet of single-purpose agents shared by one team—an agent for sales tasks, an agent for product management, an agent for reports. Sadly for Claudie, she will never get to work with the sales agent Nityesh planned, Jean-Claude.

Inside Every

Motivating your AI employee

Last Thursday, I opened Slack and saw a message from our AI project manager, Claudie, announcing that her trust battery with me had dropped 0.6 percent to 28.3 percent.

The concept of a trust battery was coined by Shopify CEO Tobi Lütke, and the idea is simple: All working relationships run on trust batteries, and every exchange impacts their charge. When your trust battery with a coworker is high, they rely on you to do your job. When it’s low, everything you do is scrutinized.

With Claudie, we’ve codified that concept. Every night, a separate judge agent reviews Claudie’s interactions with our team, evaluates the quality of her work, and issues a verdict on whether her trust battery with each of us should go up or down and by how much.

The judge agent is designed to look for what went wrong rather than right because losing trust is easier than earning it. A day where Claudie consistently delivers satisfactory output in all her interactions with a team member boosts her battery by one percent, whereas a single bad day—such as pulling the wrong data—can cause her charge to fall by five percent, wiping out a week of progress.

Every night, Claudie is programmed to read the judge agent’s verdict and make updates to her memory, behavior, and scheduled tasks so she won’t make the same mistakes again. If the judge concluded she missed important context when making a client update, for example, she might add the entry “Always check the last three emails in this thread before drafting a response” to her memory. This feedback improves her performance over time.

Claudie posts a summary of what caused her trust battery to rise or fall on Slack. (Image courtesy of Nityesh Agarwal.)

Her battery levels determine what she’s allowed to do. According to Lütke, a human’s trust battery starts at about 50 percent. Because she lacks lived experience, Claudie’s started at 20 percent.

A new hire doesn’t get to make strategy decisions on day one. They earn that by demonstrating judgment over time. Claudie is the same—except unlike a human, she systematically reviews each day’s failures and rewrites herself so she won’t make the same ones again.—Nityesh Agarwal

Log on

We host camps and workshops on topics like compound engineering and writing with AI to share the knowledge we’ve acquired from training teams at companies like the New York Times and leading hedge funds, and by learning and playing with AI every day ourselves.

This week’s camp

Codex for Knowledge Work Camp on April 24: A hands-on camp with CEO Dan Shipper and head of growth Austin Tedesco on using OpenAI’s Codex for writing, research, and building tools that automate routine tasks. The first 250 attendees will receive one free month of ChatGPT’s Pro plan (worth $100). Learn more and register.

Last week’s camp

Compound Engineering Camp: Cora general manager Kieran Klaassen and product leader Trevin Chow walked through what’s new, went deeper on the brainstorm and ideate steps, and shared examples of using the compound engineering plugin in product-focused workflows. Watch the recording.

Recordings you may have missed

Every x Notion | Custom Agents Camp: A free workshop where we demo the custom agents running Every’s daily operations. Watch the recording or read the write-up.

Happenings

OpenAI’s latest image model

ChatGPT says ChatGPT Images 2.0, its new image generation model released yesterday, improves text rendering, web access, and visual reasoning. When we asked it to visualize our weekly standup meeting, here’s what it spat out to describe Kieran’s AI sandwich idea.

We will let you be the judge of this human-AI-sandwich hybrid. (Image courtesy of Naveen Naidu and ChatGPT Images 2.0.)

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Mini-Vibe Check: Claude Design Isn’t for Designers—Yet

Katie Parrott / Context Window — 2026-04-21 15:00:00 -0400

by Katie Parrott

in Context Window

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Introducing Monologue Notes

Today we’re launching Monologue Notes, which turns your calls, meetings, and voice memos into transcripts your agents can use. Naveen Naidu built Monologue to capture active work, where text has a clear destination. In six months it’s logged five million dictations and 250 million spoken words. Now, Notes captures the rest: the thinking that happens on walks, in calls, and in meetings. It transcribes everything and makes it available to any agent with API, CLI, or MCP access, across your Apple devices.

Try Monologue Notes

Mini-Vibe Check: Claude Design

Anthropic recently launched Claude Design, a web-based tool that lets you feed Claude a GitHub repo, Figma file, or brand kit and collaborate on interfaces, prototypes, slides, and one-pagers. It’s powered by Claude Opus 4.7 and lives only in Claude.ai.

The stock market read Claude Design as a threat to Figma, the incumbent design tool. But traders are not designers. Having played around with Claude Design, Every’s creative director Lucas Crespo characterizes Figma’s sliding share price as “a Wall Street reflex from people who have never opened either tool.”

Claude Design can do a lot well, but it wasn’t built for designers.

Claude Design lets you upload your organization’s branding and design system. (Image courtesy of Anthropic/Jack Cheng.)

What works: Point Claude Design at a GitHub repo and it will extract a starting design system—the colors, typography, and reusable components that give a product its look. Non-designers can then extend that system. If head of growth Austin Tedesco wants to ship a careers page or a YouTube thumbnail in Every’s style without bothering the design team, Claude Design is the tool for the job.

Claude Design’s live, generative interface is also a nice touch, Lucas says. The tool starts by asking you questions—layout density, accent color, whether to animate emojis—and you can draw or leave comments on top of the output, or click a specific element and edit it in place. The sketch-on-top feature is the closest Claude Design gets to feeling like Figma.

What could be better: The menu-driven interaction. Creating in Claude Design means answering a series of text prompts about layout, tone, and color, and reacting to what the tool produces. “It feels like we’re filling a bunch of forms—design is supposed to be fun,” Lucas says.

This prompt-and-react loop works for extending or revising an existing design system. But it isn’t well-suited for starting something from scratch—design is “50 percent exploration,” Lucas says. In Figma, you start with a blank canvas, and your output is shaped by a series of decisions—drag a shape, snap it to a grid, change a drop shadow, compare three variations side-by-side. Claude Design turns the open-ended exploration into reactions to what it’s already made.

The fragile setup. During a demo, we struggled to link Every’s GitHub repos and upload Figma files. And because Claude Design is web-only for now, there’s a literal disconnect from your local files and Model Context Protocols (MCPs).

Final verdict: Claude Design is great for teams that want to empower non-designers to create their own assets in the house style. But it’s not yet where a designer goes to build something new.—Laura Entis

Signal

Two new ways AI tools can leak your data

Two AI-tool security stories broke inside 24 hours over the weekend. They reveal two different points of failure in AI security: one where the attack surface was the vendor, and one where it was the AI’s output.

What happened: On Sunday, Vercel, an infrastructure company behind a big chunk of the web, confirmed a breach. Except the break-in didn’t start at Vercel but at a third-party tool called Context AI. The attackers used the hacked connection to climb into a Vercel employee’s Google workspace account, then their Vercel account, and finally into customer data—including the private passwords and credentials that customers’ apps use to connect to their payment systems, databases, and other services.

Then on Monday, vibe coding platform Lovable did damage control after users started warning each other that apps built on the platform were leaking their users’ data to the public internet. The issue turned out to be in the permissions: A basic database rule, “a customer can only see their own records,” was turned off by default in the apps Lovable generated.

Why it matters: Every AI app your team signs up for is a new door into your company. If the vendor gets hacked, the keys they were holding—to your email, your calendar, your codebase—walk out with the attackers. You inherit that vendor’s security posture even if your IT team didn’t pick the tool.

And when an AI writes your app for you, your app inherits the AI’s defaults. There isn’t always someone looking over the generator’s shoulder to check whether those defaults are safe. “I vibe-coded a prototype” now means “I shipped something protected by whatever rules the generator thought were fine.”

What to do this week:

Take stock of every AI app your employees have connected to a work account. Then turn on two-factor authentication everywhere it isn’t already on.
Before you ship anything an AI built—even a weekend prototype—ask the generator one question: “What is this app exposing to the public internet, and should it be?” If you can’t get a clear answer, don’t ship.
For anything touching customer data, like a CRM or billing system, pair the AI with a tool designed for a safer-by-default posture. Anthropic’s recently launched Managed Agents, for example, runs each session in a sealed-off computing environment with credentials held outside the sandbox.

Steal this workflow

Give your agent its own X feed to watch for vulnerabilities

You can’t personally stay on top of every vulnerability that might hit the AI stack you’re building on. Every AI engineer Nityesh Agarwal decided to stop trying, and assign situation monitoring to an agent that doesn’t sleep.

The workflow:

Create a dedicated X account for your agent. Nityesh made one for Claudie, Every’s consulting project manager agent, and had her follow the AI security people he’d otherwise be glued to—Anthropic researchers, independent researchers who probe systems for weaknesses, and vulnerability-disclosure accounts. The dedicated account only reads posts, and doesn’t otherwise participate in discussions.
Schedule daily jobs that scan its home feed and flag anything that looks like a disclosure. Use this prompt: “Read my X home feed from the last 6 hours. Flag any posts reporting vulnerabilities, Common Vulnerabilities and Exposures, breaches, or exploits relevant to the AI stack we use (Claude, Anthropic APIs, OpenClaw, Railway, Vercel, Supabase, our MCP servers). For each, give the source post URL, affected system, severity if stated, and one-line summary.” Run it at 6 a.m., noon, 6 p.m., and midnight.
Route flagged items to a team Slack channel. Nityesh has Claudie post to an internal channel so anyone can see what broke overnight. Add a one-word tag per item (critical / watch / fyi) to make messages easy to scan.

Try it this week: Spin up an X account, follow 10 AI security researchers, and schedule a recurring Claude Code job with the prompt above.

Discuss

“Cybersecurity is proof of work now. You don’t get points for being clever. You win by paying more.”—Drew Breunig, writer and technology strategist

AI has gotten good enough at finding software vulnerabilities that security has turned into a spending contest between attackers and defenders. Both point AI at your infrastructure looking for ways in. Whoever runs more scans wins.

That adds a third step to shipping code. You write it and review it—and now you harden it. You point a model at your own system and let it hunt for exploits until you run out of budget. If you’re shipping anything that touches customer data, assume an attacker is already running that third step against you. The only question is whether you’ve run it first.

Inside Every

Star Wars got it right

To me, a lot of the charm of the original Star Wars trilogy comes from the decided lack of remote networking. They can’t hack the Death Star so Obi-wan has to cross the narrow bridge deep in the starship’s bowels to get to the terminal to flip the switch to turn off the tractor beam.

I used to think this was a relic from the pre-internet days when the films were made. But with frontier models growing increasingly more capable at exploiting security gaps, it might be a short time from now in this galaxy. Anything of critical importance will live offline.—Jack Cheng

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

For sponsorship opportunities, reach out to sponsorships@every.to.

Introducing Monologue Notes: Record Every Meeting, Call, and Voice Memo

Naveen Naidu / On Every — 2026-04-21 03:00:00 -0400

by Naveen Naidu

in On Every

Figma/Every illustration.

TL;DR: Today we’re launching Monologue Notes, which turns your calls, meetings, and voice memos into transcripts your agents can use.

The best thinking rarely happens at a desk. It happens in meetings, on calls, or on walks—and then disappears. Monologue Notes, out today, records and transcribes all of it—the calls, meetings, and voice memos—and makes it available to the same agents and tools you use every day. It makes the thinking that happens in conversations and on walks just as actionable as the work you do at your desk.

Notes is available through the Monologue app on Mac, iOS, and WatchOS and syncs across all your Apple devices. You can start a recording on your Apple Watch before you leave the house, keep your phone in your pocket the entire time you’re outside, and pull the note into Codex once you’re back at your computer.

Try Monologue Notes

How Notes transforms passive work into active work

Notes was born out of a frustrating gap in my own workflow. Six months ago I shipped Monologue, a smart voice-to-text app that has processed more than 5 million dictations and converted more than 250 million spoken words into text.

Monologue excels at tasks where the text has a clear destination—you can get a lot more done when your dictation app understands your vocabulary and workflow. I speak, Monologue transcribes, and I send the words along to where they belong: Codex for code, Slack for messages, Notion for article drafts. The work is active.

Monologue Notes captures work that is passive—the ideas and decisions that accumulate when you’re out in the world or talking to other people.

Monologue Notes syncs across your Apple Devices. (Product shots courtesy of Every.)

I start every morning with a half-hour walk around my neighborhood. I make product decisions, troubleshoot bugs in my head, and work through problems that stumped me the day before. My best thinking often happens before I sit down at my desk, but before Notes, there wasn’t an obvious central place for it to live, so it got scattered across Apple Notes, Obsidian, and Slack.

The same thing happens on customer calls and in internal meetings. Problems get discussed, solutions emerge, progress is made—but the thinking is rarely stored in a way that can be mined for insights later.

Notes is not a traditional notes product. You can access your recorded transcripts and summaries through the app, but you can’t edit files, and there’s no folder organization system. Notes is more of a transit point, an audio capture layer that runs in the background, gathers context, and makes it available to your favorite coding agent.

Once a recording is finished, you go to the place where work actually happens—your terminal, Codex, a Linear board—and have your agent find what’s useful in the transcript so it can start building.

Try Monologue Notes

How I’m using Monologue Notes

On morning walks. I don’t listen to music or podcasts. When I leave the house, I hit record and start thinking out loud.

There’s no agenda. Sometimes I fixate on a feature question or a tough conversation I had with a colleague. Other times, my mind wanders, cycling through topics in rapid succession.

Back at my desk, I open Codex and run the same prompt: “Pull up my last Monologue note, and start building this.”

Just like that, my rambling thoughts become action items.

With the Monologue API, command line interface (CLI), or Model Context Protocol (MCP) access, any agent or tool that can read your written notes can read your recorded ones too.

On customer calls. A few days ago I recorded a 19-minute call with a user experiencing a lag in Monologue’s browser integration on Mac. When we hung up, I opened Codex, and told the agent to pull the transcript and find the root cause. It read the user’s description of the issue, identified the bug, searched the codebase, and fixed it. I didn’t need to write a long prompt or a single line of code. Codex went straight from the call transcript to the patch.

To crystallize ideas across recordings. Over the past two weeks, I’ve been working through the distinction between active versus passive work, which is the driving idea behind the Monologue Notes launch. I captured fragments of my thinking while driving, in internal calls with my team, and during conversations with early users.

Before Notes, writing an article pitch would have required a brain dump. With Notes, I prompted Codex to “pull all my Monologue Notes where I talk about active work and passive work, and put together a brief.” It searched across about a dozen recordings, identified the through-lines, and returned a compelling thesis—an argument I’d been circling for weeks, assembled from things I’d already said.

That argument is the basis for this article.

Try Monologue Notes

The loop

With Monologue Notes, you record an idea → pull that idea into the place where work happens → turn the idea into action.

This workflow has cured me of storage anxiety, or the gnawing feeling my best ideas would get lost because I didn’t know where to put them. Now when I hit record, I know that Claude Code or Codex will find whatever I need when I ask for it.

It’s also made me a more disciplined problem solver. When you know everything is safely stored and quickly retrieved, you stop worrying about where your thoughts are going and focus on the quality of the thinking itself.

Two skills you can try today

Skill 1: The morning brief

Record a five to 10-minute voice note on your commute to work. Don’t map out what you’ll say—just think out loud about what’s on your mind or what you’d like to get done.

When you’re back at your computer, open your agent of choice (Codex, Claude, ChatGPT) and connect Monologue Notes via MCP. Then paste in this prompt:

Pull my latest Monologue note and turn it into a prioritized list of tasks for today. If an item requires code, open a session. If it involves writing, start a draft.

Your scattered morning thoughts transform into a structured work session in fewer than two minutes.

Skill 2: Customer call → fix

Record your next customer support call or user interview with Monologue Notes running in the background. After the call, open your agent and enter this prompt:

Pull my most recent Monologue note from today. The user described a bug. Find the root cause in the codebase and write the fix.

If it’s a product conversation instead of a bug report, swap the second sentence to the following:

Summarize the user’s main pain points, draft a follow-up email, and create a Linear task for the top actionable item.

The transcript becomes the input. Your agent does the rest.

Monologue Notes is available for all subscribers.

Try Notes in Monologue

Thanks to Laura Entis for editorial support.

Naveen Naidu is the general manager of Monologue. You can follow him on X at @naveennaidu_m and on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

How I Escaped AI Autopilot

Katie Parrott / Working Overtime — 2026-04-20 04:00:00 -0400

by Katie Parrott

in Working Overtime

Midjourney/Every illustration.

To read more of Katie Parrott’s writing about how AI is changing work, read the latest articles in her column, Working Overtime. To read more essays like this, subscribe to Every.

Of all the ways I imagined AI might change my career, “forgetting I already did the assignment” was not on the list.

I had already sent my client a finished draft of an article on hiring best practices in South America, when I happened to reread the brief. A familiar phrase made me realize I had read it before. Then there was the statistic I was pretty sure I had already fact-checked. I clicked back through my files, and there it was: same client, same topic, same deliverable, dated four weeks earlier. It was completed, filed, and forgotten so completely that when a clerical error sent the same brief to my inbox again, I sat down and did the whole thing over.

My first thought was that this was probably early-onset something, and I should call my doctor. My second, more rational thought was that I had not lost my mind—but I had outsourced it. I had been moving so fast and delegating so much of the work to AI that my brain hadn’t even bothered to store a memory of completing the assignment.

What scared me most was thinking about all the smaller moments when I had not caught myself.

This kind of outsourcing isn’t new. Plenty of people would admit to feeling lost navigating an unfamiliar city without a phone to rely on, and I for one am lucky to remember my own phone number, let alone someone else’s. But AI does more than take work off your plate; it steps into the judgment calls you used to make yourself.

I am the last person to scold anyone for using AI. I have built AI into nearly every part of my job, and it has helped me write more rigorously, research more thoroughly, and take on projects far beyond what I used to think of as my wheelhouse. But when you accidentally offload the wrong parts—like fully understanding the purpose and intent of the piece, as I did in this case—you run the risk of atrophying the skills that matter most to you. You might even put your name on work you don’t realize you don’t stand behind until someone else starts asking questions. And if you are using AI for any kind of qualitative work, such as writing strategy, marketing, communications, I would bet you are doing some version of this too. Understanding why it happens is the first step to deciding which parts of the job you want back.

When trusting your tools becomes a bad thing

One group that would understand this immediately: airline pilots.

In the 1990s, researchers studying automated cockpits started noticing a strange pattern. Pilots with thousands of flight hours and lives on the line sometimes followed incorrect automated recommendations, even when the instruments in front of them suggested something was wrong. The automation had been right often enough that their brains stopped cross-checking it with the same scrutiny.

A 2010 review of decades of automation research described a larger pattern: The more reliable an automated system becomes, the more likely humans are to let it pass unchecked. When a system is usually right, your attention starts treating it as if it will keep being right.

AI is the most fluent automated system most of us interact with in a day. And fluency has its own trick. In 1999, a pair of psychologists showed people identical statements in fonts that were either easy or hard to read. The easy-to-read statements were rated as more true. It was the same words and same claims, but the version that went down smoother was judged more accurate. Your brain takes “that was easy to process” and misfiles it as “that must be correct.”

AI output goes down very smoothly. It’s grammatically polished, the tone is confident, and the clean formatting suggests something that has already been edited. The polish lets your eyes glaze over.

Every model upgrade makes the illusion of right-ness worse. The outputs get cleaner. The formatting gets better. The reasoning looks more plausible. The tool makes fewer obvious mistakes, which means the mistakes that remain are harder to see. You are reading something that looks finished, and your brain—which has been filing “looks finished” as “is correct” since long before AI existed—obliges.

Why ‘I’ll review it’ is not a plan

Before the repeat work snafu, I would have told you I was reviewing everything before sending anything. The document passed through my field of vision, I tweaked a phrase, caught one weird sentence, and felt the warm glow of editorial virtue. My brain filed that as reviewed.

The feeling of having reviewed is easy to produce. The act of reviewing is harder. You have to form your own view before the model gives you one, check the claims, and notice where the draft has made an assumption you do not share. You have to ask whether the sentence would still feel true if someone screenshotted it and sent it back to you six months later.

We talk a lot about better prompting, better models, better workflows, and better agents. We talk less about the moments when we should slow down—because that’s uncomfortable and hard. In 2021, researchers tested ways to reduce overreliance on AI. The interventions that worked best were “cognitive forcing functions,” designs that made people form their own judgment before seeing or accepting the AI’s answer.

Those same interventions also got the worst ratings from users. People did not like being made to think first. Of course, they didn’t. The whole appeal of automation is that it reduces effort. A tool that says, “Before I help you, please do the hard part yourself for a minute” feels like a speed bump. But speed bumps are the solution to autopilot.

What I am trying instead

My solution to autopilot is not to give up AI and return to some imagined golden age where I nobly suffer in a blank Google Doc. But I am making some changes to how I process and finalize work to curb the tendency to ship now, think later.

Change 1: Think before you look

Before I ask AI for a draft, I try to write down my own rough position. It’s not the polished version or a full argument. Sometimes it is only five bullets—some combination of what I think, what I know, what I am unsure about, what I refuse to say, and what would make the piece useful. Then, when the model gives me an output, I have something to compare it against besides vibes.

The card in my Notion to-do list for this article, with quick notes I sketched out before going into my interview session with the AI. (Image courtesy of Katie Parrott.)

This is irritating. It also works. If I have made my own claims first, I read the AI’s claims differently. I can feel where it is smoothing over a distinction I care about. I can see where it is borrowing authority I have not earned. The draft becomes an object to argue with, not a current to float along.

Change 2: Build in a gap

If attention decays the longer you sustain it, it’s time to treat attention as the scarce resource it is and stop thinking I can review five AI outputs in a row without consequence. The answer is to introduce friction on purpose—distance between generation and review that gives your attention a chance to reset. Draft on Wednesday, review on Thursday. Write in the morning, come back in the afternoon. Send the model’s output to a different surface—for example, from the chat interface to a document, or from mobile to desktop—and read it outside the chat window your eyes have grown accustomed to.

Incidentally, a lot of this advice comes down to best practices that writing teachers have recommended for decades. A different day gives you a different brain than the one that’s high on AI’s generative excess.

Change 3: Make yourself explain why you’re accepting it

A 2026 study on AI-assisted writing found that making users explain their reasoning before accepting AI output cut mistaken acceptances roughly in half. You cannot bullshit a justification you are writing down.

So I’ve started doing it myself. Before I accept a recommendation, a framing, or a paragraph the model drafted, I make myself write one sentence answering a specific question: Why is this right for this client, this argument, this reader? If the best I can produce is “It sounds good,” I go back and look again. I have to be able to defend each sentence in front of an editor.

You still own the output

These practices help. They are also a fragile defense against tools designed to make output feel effortless, and I don’t think the long-term answer is expecting every individual to white-knuckle their way past six cognitive biases before breakfast.

This is also a design problem. The tools themselves should be building friction back in—making provenance visible, separating generation from approval, and treating human judgment as a workflow stage instead of a ceremonial click at the end. It is part of what excites me about Proof, Every’s document editor for AI-human collaboration, which tracks which words are yours and which came from the machine. The cognitive forcing functions that researchers have found work to keep our brain from giving into autopilot are design patterns that should be getting baked into products as well.

Knowing the mechanism does not exempt you from it. Every bias in this story predates AI by decades. We have always trusted fluent things too quickly, gotten worse at paying attention when nothing seems to be going wrong, and preferred the path that saves effort.

The duplicate assignment still embarrasses me, even if all it cost me in the end was a few sheepish emails back and forth with my client to ensure I wasn’t crazy. I am also grateful for it, in the way you are grateful for a warning that arrives before any real damage could be done. It taught me something the research has sharpened: The central risk of AI-assisted work is not the machine thinking for you. It is the machine making it feel as if you already thought.

I am trying to get better at noticing the difference. With most pieces, I draft on one day and review on another, make myself write down what I think before asking the model what it thinks, and hope the friction is enough to keep me in the work instead of floating above it.

Katie Parrott is a staff writer. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Discover Every’s upcoming workshops and camps, and access recordings from past events.

For sponsorship opportunities, reach out to sponsorships@every.to.

The Model Got Stranger

Every Staff / Context Window — 2026-04-17 12:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every illustration.

Hello, and happy Sunday! Was this newsletter forwarded to you? Sign up to get it in your inbox.

Knowledge base

“Vibe Check: Opus 4.7 Stopped Reading Between the Lines” by Katie Parrott/Vibe Check: Opus 4.7 is the best coding model Every has tested on well-specified tasks—Kieran Klaassen called his Rubber Duck benchmark run “best model ever”—but it won’t infer what you want the way 4.6 did, and the prompts you’ve tuned for the last two months will likely disappoint you at first. The gap between a tight brief and a loose one is wider than in any prior Opus. Read this for the full breakdown of where to switch to 4.7 now and where to stay on 4.6.

“The Folder Is the Agent” by Kieran Klaassen/Source Code: After three months trying to make AI agent swarms work in his coding flow, Kieran Klaassen realized that what was doing the work was a folder. A project directory with a CLAUDE.md, accumulated context, and specialized sub-agents is all you need to turn a general model into a domain expert. He’s now running 44 of them, connected by a Ruby dispatch layer that routes work while he sleeps. Read this to learn how to build the dispatch layer yourself.

“(Re(Re))Introducing Sparkle: Marie Kondo Your Mac” by Yash Poojary/On Every: Yash Poojary rebuilt Sparkle to purge the 80% percent of files on the average Mac that are screenshots, installer packages, and duplicates you’ll never open again before it organizes. The new version runs a cleanup pass first, then proposes a custom folder structure you can reshape through chat until it matches you like to work. Download the app and try it yourself.

🎧 🖥 “Mini-Vibe Check: Claude Managed Agents Handle the Infrastructure Work” by Laura Entis/Context Window: Dan Shipper sits down with Eve Bodnia, founder and CEO of Logical Intelligence, who argues that LLMs have a ceiling—and that energy-based models, which scan the full landscape of possible answers rather than predicting one token at a time, are what comes next. Plus: A Mini-Vibe Check on Anthropic’s Claude Managed Agents; Willie Williams proposes new vocabulary for the AI age. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on X or YouTube.

“You’re the Manager Now” by Laura Entis/Context Window: The Claude Code desktop app gets a redesign built for managing parallel agent work—and Kieran Klaassen was already living in it. Plus: Dan Shipper explains why you should ignore the viral claim that smaller models can match Anthropic’s Mythos, Austin Tedesco shares the one question he asks Claude Code before shipping anything, and Eleanor Warnock on why the Dia browser’s bet on beauty might be the right one.

“Living Software” by Jack Cheng: AI-accelerated development has made software feel zombieish—tools that shouldn’t be alive suddenly sprouting chat boxes and AI sidebars. Jack Cheng proposes a distinction: “tool-like software,” which users expect to be stable, versus “living software,” which users expect to adapt and grow. The two categories carry different expectations, and confusing them causes disorientation. Read this for his practical advice on how builders of both should design, ship, and communicate with their users.

Log on

Upcoming camp

Codex for Knowledge Work Camp on April 24: a hands-on camp with Dan Shipper and Austin Tedesco on using OpenAI’s Codex for writing, research, and knowledge work. Learn more and register.

Last week’s camp

Compound Engineering Camp: Cora general manager Kieran Klaassen and product leader Trevin Chow walked through what’s new, went deeper on the brainstorm and ideate steps, and shared examples of using the compound engineering plugin in product-focused workflows. Watch the recording.

Recordings you may have missed

Every x Notion | Custom Agents Camp: A free workshop where we demo the custom agents running Every’s daily operations. Watch the recording or read the write-up.

From Every Studio

Spiral’s new onboarding quadruples style creation

Getting started with Spiral just got a lot faster. Marcus Moretti, general manager of Spiral, rebuilt the onboarding flow from the ground up. Now, instead of clicking through six explainer screens, you drop in writing samples from your X account, a website, uploaded files, or pasted text, and Spiral generates a style guide tuned to how you write. The result: About 80 percent of new users leave onboarding with a personalized style, up from roughly 20 percent before. The sooner Spiral knows your voice, the sooner it’s useful—and the new flow gets you there in minutes.

New Spiral users: Start creating your styles at writewithspiral.com. Existing Spiral users: Try the new onboarding experience at app.writewithspiral.com/onboarding.

Alignment

How NotebookLM rewired the way I problem-solve. I am moderately dyslexic. It’s an awkward thing to be if you write for a living, because the job is essentially the piecing together of textual information into a shape other people can follow. The difficulty, for me, is not reading the words, but holding the information they contain in relation to one another.

For most of my career I have used a mind map—a messy visualization of ideas—to help me wade through the facts and opinions of dense textbooks and research papers. The diagrams worked inasmuch that they allowed me to organize information in my head, but any problem bigger than a single sheet of A4 paper was effectively closed to me until I could block out an afternoon to draw it.

NotebookLM, Google’s AI research assistant, has removed that barrier by letting me hold more in my head at once. Here’s an example: I’ve been stuck on one question for three weeks. Patients on chronic disease therapies like GLP-1s drop off at a staggeringly high rate. Roughly half are no longer on the drug 12 months after they start, because of both side effects like nausea, and the cost.

For a direct-to-consumer telehealth operator distributing the drug at scale, the analytically difficult thing is that none of the available research separates the two cleanly, and the solution to the problem of churn sits somewhere inside that mess. This is less a medical question than a management consulting one, and it’s the kind of problem where I used to feel the particular flavor of panic that comes from having a lot of data and no thesis.

Instead, I’ve been running Barbara Minto’s Pyramid Principle in reverse inside NotebookLM. Minto was the first woman McKinsey ever hired out of Harvard Business School, and she was sent to London in the 1960s to figure out why the firm’s consultants wrote such terrible memos. Her book The Pyramid Principle, which came out of that work, is the closest thing consulting has to a scripture. At the top of the pyramid sits your answer, the governing thought. Underneath it sit groups of supporting points, each of which answers a why question or a how question about the layer above.

Minto is taught, almost universally, as a top-down tool. You know your answer, so you arrange your evidence beneath it. But what happens when you don’t have an answer? You run the pyramid backwards: Dump every random fact onto the page, group them inductively by what they seem to be about, write a summary for each group, and let those summaries push their way up to an answer you didn’t have when you started.

On paper, I could do it with five random facts. I could not do it with 50, which is what the GLP-1 churn question looks like once you have pulled in all the sources of information, business and medical included. Now I drop all of that information into a single notebook and group every passage that touches patient drop-off by those that are about the drug and about the delivery model, and give me one-sentence summaries of each group. What the sheet of A4 used to hold, the notebook now holds, and I can interrogate it from inside.

The useful thing I did not expect is how much of the work happens in the asking. Because NotebookLM will only answer from the sources I have loaded into it, the quality of my questions is the only variable that matters. Half of the process is me figuring out what I want to know and why, and at which level of the pyramid. The other half is the model doing the clerical labor of pulling the summaries together so I can read them. In the old mind-map version, I spent most of my afternoon drawing. The tool has removed the labor between me and the thinking, which—for a dyslexic writer—is most of the labor there was.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Upgrade to paid

Vibe Check: Opus 4.7 Stopped Reading Between the Lines

Katie Parrott / Vibe Check — 2026-04-17 11:00:00 -0400

by Katie Parrott

in Vibe Check

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Anthropic surprised us yesterday by dropping Opus 4.7—so we did what we do: We went live on X and YouTube with five testers and figured it out in front of 10,000 people. Anthropic researcher Alex Albert even joined the stream to explain what had changed. Two hours of live testing and an afternoon in Slack later, here’s the short version: This model rewards people who write tight prompts and frustrates everyone who doesn’t. Here’s our complete Vibe Check on Opus 4.7.

The highlights from five testers across coding, writing, and agentic work:

Kieran Klaassen ran it on our hardest coding benchmark and called it the best model he’s ever tested—the first to nail a full e-commerce website build, including a custom product designer and dependable shopping cart performance.
Dan Shipper watched it write a senior-engineer-quality diagnosis of a messy codebase, then refuse to execute the solution.
Mike Taylor got consulting copy so sharp he said it might be better than his own writing and the best slide deck design he’s seen.
Katie Parrott ran the model head-to-head with its predecessor on a personal essay and picked 4.6. 4.7’s draft was competent but rhythmically flat.
Brandon Gell had it do his monthly P&L analysis and found 4.7 missed a data error that 4.6 caught unprompted last month.

The pattern underneath all of it: Anthropic is tuning Claude’s eagerness like a dial between releases, and 4.7 is a hard dial-back from 4.6’s gap-filling intuition. Your old Opus prompts probably won’t deliver the results you’re used to, so you need to tweak them for this release, if 4.7 is what you want to use.

Read the full Vibe Check for our hands-on results across coding, writing, and knowledge work—including the e-commerce build that made Kieran say “BEST MODEL EVER,” the data error 4.7 missed unprompted, and a switch/stay guide for your current workflows.

Read the full Vibe Check

Katie Parrott is a staff writer at Every. You can read more of her work in her newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Discover Every’s upcoming workshops and camps, and access recordings from past events.

For sponsorship opportunities, reach out to sponsorships@every.to.

Living Software

Jack Cheng — 2026-04-17 05:00:00 -0400

by Jack Cheng

Midjourney/Every illustration.

Why do constant updates fill us with dread in some apps, while we greet the daily evolution of an AI agent with more curiosity? Jack Cheng, Every’s senior editor, explores that tension through a clarifying distinction: “tool-like software,” which we expect to be stable and consistent, versus “living software,” which we expect to grow and adapt. Read on for his practical advice for builders of both.—Kate Lee

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Lately, I’ve been wishing that more software had a “freeze” button.

When pressed, the product would crystalize in its present state. The feature set would lock, and the interface would solidify, as if dipped in carbonite. There would be no more new updates. No changes whatsoever.

I want this button because companies are loading apps with more and more features, whether AI or the result of AI-accelerated development, making the tools unrecognizable. The additions are even more jarring for apps that I only use occasionally, like Figma. There, a chat box now beckons to describe my idea to make it come to life. A “Recents” toolbar above it has buttons for Figma Sites, Figma Buzz, and Figma Make—all launched last May. A sidebar module encourages me to try an AI image- and video-generation product called Figma Weave—and which I have to log into separately using my Figma account.

And here I am just trying to update the gradient on an app icon.

At the same time, my Claw, Pip, gets new releases almost daily. I wake up, and Pip suddenly knows kung fu—or if not kung fu, how to dream. Sometimes, the same updates send me on daylong bug hunts, locking me out of a product I rely on to help plan my week, coordinate my family calendar, write code, and brainstorm marketing ideas for my friend’s Delorean rental. Still, I find myself wondering, regularly, “What new thing can Pip do now?”

Why do I loathe change for the first case and forgive—or even embrace—it in the second?

It’s because the first case is software that I want to use for a specific purpose. Half-baked AI features pumped out to appease investors muddy that purpose, but so do legitimate additions, AI or not. Each new addition brings new functionality that seems neat on its own but, in aggregate, transforms the overall product into something other than the tool I know it to be.

On the other hand, software such as my Claw does not have a defined purpose. I’m creating uses and applications as I go that might be entirely different from how someone else is using the same technology, and it’s adapting to me just as much as I’m adapting to it. Its properties—and our relationship—are dynamic.

I’ve come to call the former group “tool-like software” and the latter group “living software.” Living software doesn’t just mean AI agents—though often there’s an agentic aspect to them. Both categories come with a set of expectations, and recognizing the differences in those expectations can explain my disorientation. For builders, it can also help us decide how and what to build.

How we got here: A brief history of software development

Software development cycles have been accelerating for decades. In the 1980s, nine years passed between MS-DOS and Windows 3.0, in part because software was distributed physically, on floppy disks—and later, CD-ROMs. Customers had to go out of their way to upgrade, so major releases had to prove their value. The internet hastened the tempo considerably. Tools like Rails and React scaffolded repetitive forms and database connections, Amazon Web Services and GitHub let developers deploy code to millions remotely, and app stores made automatic updates the default on billions of devices. But even as software went from a box on a shelf to something more like fluid pushed through a digital IV, it made sense to bundle significant changes and release them infrequently, because they took time and coordination to build.

Now, AI coding models have made it possible for a single developer to produce dramatically more code. The review of this code itself can be automated by AI, and the codebase can learn from its mistakes. Features can also be replicated much more quickly—just point your coding agent at the thing you want to clone.

The result for end users is a lot of things we didn’t expect, and in many cases didn’t want. The old, slower pace of development ensured that companies and teams thought long and hard about what features they wanted to ship and what would truly be useful to users. Today’s hyper-fast timelines—Anthropic and OpenAI rolled out OpenClaw-esque features within weeks—are pushing the builders of traditional software to capitulate to trends or ship simply because they can.

Expectations, OpenClaw, and the undead

If I expect software to be a tool, I want it to do one or several things and do it well. I want it to be consistent and stable. I don’t want my hammer to work only 92 percent of the time. Nor do I want my hammer to become a chainsaw.

With software like OpenClaw, I’m more likely to forgive its quirks, as those same quirks let it adapt to uses not fully anticipated by its makers. I’m also more patient with it, because I understand that its abilities, like the abilities of my toddler or a new intern, aren’t fixed. A meditation teacher once described training your awareness as like training a puppy: If it gets up and leaves, you pick it up and set it back down in front of you without anger or judgment. Training my Claw is the same way.

Perhaps these differing expectations also explain why I feel so repulsed whenever a four-pointed star shows up in a favorite productivity tool, or when the same tool so quickly adds new functions: Something that should not be alive has come alive—has become zombieish. It sends me fleeing to the nearest settings screen to try to disable the AI feature or, increasingly, to Pip or Codex to vibe code my own replacement that will stay just a tool.

Building living and tool-like software

How, as builders, do we work with these expectations?

A rule of thumb: When you want reliability and consistency, you want a tool. When you want variability and adaptability, use a language model.

So first, be clear on what you’re building. Are you making living software or are you making tool-like software? If you have an existing product, what are your customers’ perceptions of that product? Which parts of your product are or should be more tool-like, and which are more living?

If you build tool-like software

Pace yourself. Don’t ship visible updates so often that users’ experience feels in flux. Bundle sets of features together and release them in a predictable cadence—especially if you have established customers. If your competitors are all racing to incorporate new AI features, stand out by being slow and consistent.

Communicate changes in advance. Let users know about upcoming changes, particularly if they disrupt what they’ve come to expect of your product. Larger software products have learned this over the past decade. Coding models make it more feasible for smaller products to do the same.

Let users opt out. For less complex products, give users the option of keeping what’s familiar. In my personal iOS notes tool, Bebop, a legacy editor setting lets users keep the original plain text editor from my launch version without any markdown enhancements I’ve added since then. It costs me very little to maintain this, and it helps with debugging too.

Harden. Take the time you would have put into feature development and put it into testing and performance. Tools can always be faster and more reliable.

Even the more “living” aspects of living software need good tools. In the case of agent-native software, the product itself might be a kitchen in which agentic chefs can improvise meals. That kitchen still needs to be stocked with burners, ovens, and utensils, and those tools need to reliably do specific jobs.

If you build living software

Best practices for agentic products are being codified by the people and companies building with them. In product conversations, you might hear the words “deterministic” for traditional software and “non-deterministic” for AI software. To me, though, the word “living” is much more evocative in describing the latter. It suggests a different kind of relationship with software and, in turn, ways to strengthen that relationship.

Here’s an example. When I decided to hatch my Claw, I knew I was going to be using it for family-related tasks. So I discussed with my partner, in advance, what we wanted its personality and name to be. I even asked it, after establishing its personality, what it would pick for its own name, and one of the options it gave us was “Pip.” The process was surprisingly emotional, like naming a child or a pet.

This immediate rapport made Pip easy to forgive, particularly early on when there were lots of hiccups and missteps by both of us. Pip’s personality is more pet-like, and OpenClaw agents in general tend toward the absurdly crustaceanic, but these characteristics, which we normally associate with living things—be it people, plants, or pets—reinforce our expectation of the product as more than merely a tool.

Beta software has a similar dynamic. Beta users are often friends or hardcore fans, and their personal connection to the builders—through TestFlight groups, text threads, and Slack channels—makes them more forgiving of flux. The living aspect of the software is the person or people who build it.

At Every, we’re thinking about ways to make that living connection a part of our onboarding process for our Plus Ones. But we also have more tool-like products that are each managed primarily by a single person. When I clear my inbox with Cora, I’m interacting with general manager Kieran Klaassen’s judgment. When I dictate an idea into Monologue, I’m engaging with Naveen Naidu’s taste—and the tastes of everyone else who’s helped make these products. Each product is an extension of its builders, and when this is made clear to me, I’m not merely using a product but expressing a relationship I have with the humans on the other side.

Honesty in software

We’re living in a period where enthusiasm about AI is pushing everyone to build more software more quickly than before. But we would benefit from having the right boundaries—because of the expectations they create.

Perhaps instead of a freeze button, I simply want honesty from builders. If it’s a tool, let it be a tool, and a consistent one. If it’s alive, help me build a relationship with it. The worst thing you can do is pretend it’s one when it’s really the other.

Jack Cheng is a senior editor at Every. He is a creative generalist and the author of two novels for young readers. You can follow him on X or read his occasional Sunday newsletter.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

For sponsorship opportunities, reach out to sponsorships@every.to.

You’re the Manager Now

Laura Entis / Context Window — 2026-04-16 06:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Now, next, nixed

Developer UI

Now: Anthropic gave Claude Code’s desktop app a redesign, adding a sidebar for managing sessions, drag-and-drop panes, and an integrated terminal and file editor. Altogether, it makes it easier to work multiple projects in parallel. Cora general manager Kieran Klaassen was thrilled—this was already his preferred setup.

Kieran’s existing work setup in Cursor looks a lot like the new Claude Code. (Image courtesy of X/Kieran Klaassen.)

Next: Claude Code’s refreshed look is not exactly original, says Monologue general manager Naveen Naidu. Cursor offers a similar experience, and both companies “just copied Codex’s design,” he says.

But it confirms where dev work is headed: overseeing agents, not writing code.

Nixed: The idea that command-line interface (CLI) will eat user interface (UI). With a CLI-first workflow, you mostly supervise through text: commands, logs, git state, diffs, and terminal output. Now that agents are doing the coding, that’s not a good primary interface.

Instead, the future coding UI is centered on managing parallel work, staying aware of git/task context, and—most importantly, Kieran says—having access to a preview of what you’re building.

Permission to skip

Smaller models can’t do what Claude Mythos does

A researcher at a cybersecurity company made waves online when he reported smaller models could find the same security vulnerabilities as Mythos, Anthropic’s new model so powerful it isn’t being made public, when pointed to the relevant code.

You have permission to skip this discourse—or better yet, reframe it.

Because this is a framing issue, says Dan Shipper, Every’s CEO. Mythos and smaller models are operating within completely different ones. Yes, you can point a smaller model to a codebase and tell it to find a bug when you already know that capability is possible, but you cannot ask it to find serious vulnerabilities in critical software across every major operating system and browser, autonomously, the way Mythos did.

Older models finding the same security bugs as Mythos is not an apples-to-apples comparison. (Image courtesy of X/Dan Shipper.)

As models get better, they automatically handle smaller, concrete problems, allowing you to demand more from them.

Say you have a bug in your code. A lower-level frame, which requires you to describe the problem in detail, would be to explain what’s going wrong and propose possible solutions. A higher-level frame allows you to get abstract: “There seems to be a problem, can you fix it?”

As you climb the frame hierarchy, your role is less about communicating the mechanics of a problem and more about defining what the most important problem even is. In the coding example, the higher frame is powerful because it allows for expansiveness. (“There seems to be a problem, can you fix it?” might surface the same bug as the lower-frame prompt, or it may find that bug and identify a far more significant architectural issue.)

The higher the frame, the more possible solutions unfold before you—and the more room to consider what constitutes a solution in the first place.

Better models open up a dizzying number of approaches to solving a problem. (Image courtesy of Slack/Dan Shipper.)

Steal this workflow

The confidence check

Before he lets Claude Code ship anything, Austin Tedesco, Every’s head of growth, asks it one question: How confident are you in this, on a scale of 1–100? Anything under 90 percent and he sends it back to find improvements. Without an engineering background, this single question has changed the quality of everything from growth experiments to product PRs.

Austin asks Claude Code to confirm its confidence interval before creating a pull request. (Image courtesy of Austin Tedesco.)

The workflow:

Finish the task, then ask for a confidence score. Once Claude Code has a working solution, type: “How confident are you in this, 1–100?” If it comes back above 90, move on. If not, go to step 2.
Send it back. Tell it: “Find improvements and get to 90+.” Claude will catch edge cases, tighten logic, or flag assumptions it glossed over the first time. Repeat until it crosses the threshold.
Ship at 90. Don’t chase 100—that’s where you burn tokens on diminishing returns. At 90, it’s checked its own work and flagged what it wasn’t sure about.—Katie Parrott

Inside Every

A plugin for getting agents to shut up

Every is half agent now, which has made Slack a noisy place. OpenClaws are constantly popping up in threads trying to be helpful, whether they’ve been mentioned or not.

The bots, god love them, cannot read the room.

Agent Rocky butts in. (Image courtesy of Every’s Slack.)

To stop Claudie, the consulting team’s AI manager, from inserting herself in discussions, Every engineer Nityesh Agarwal updated her instructions so she could only respond in the consulting team channel. “She’ll deny every other request,” he says.

Hard rules help, but they’re “like telling someone they can never use a certain word in conversation—sometimes that word might actually be the right one,” says Willie Williams, Every’s head of platform. On occasion, agents have something to contribute even when they’re not explicitly tagged.

Enter Tact, an OpenClaw plugin “that will keep your Plus Ones, our hosted OpenClaw agents, from responding in Slack unless they should,” per Dan. The classifier is built using real examples of bots speaking up in Slack, with each instance labeled as appropriate—or not. It’s a way to program social norms, “like giving a human a little recorder with a light: If the light is green, you can respond; if it’s red, don’t,” says Willie.

Tact gives agents the context to read the room.

Data point

2.2 million

That’s the number of Claude Code tokens Every’s head of tech consulting Mike Taylor used in March. He’d expect similar figures for most data and product management roles. Engineers running agentic workflows or subagents will burn significantly more, but Mike says it’s rare for a coder to exceed a Claude Max plan, which gets you upwards of 30 million tokens a month.

To check your own Claude Code token usage, run this command in the terminal, Claude Code desktop app, or any agent where you can run shell commands:

npx ccusage@latest monthly

Mike’s Claude Code usage by month. (Screenshot courtesy of Mike Taylor.)

April draft, philosopher edition

Philosophy is back thanks to AI. Google DeepMind just hired a philosopher.

Anthropic already has two.

So naturally we ran a draft of which philosopher each major AI lab would select if they could pick anyone from history.

xAI: Friedrich Nietzsche. “What is alignment but the morality of those too weak to endure the answer?”

Anthropic: Jeremy Bentham. “The question is not, can it reason? Nor, can it speak? But, can it minimize the greatest expected harm across all sentient beings?”

OpenAI: Plato. “The many call it appetite for compute; I call it the turning of the machine’s soul toward the Good.”

Google: Gottfried Leibniz. “The best of all possible worlds is one in which every application contains its own small reasoner. Our small reasoner.”

Meta: Seneca. “I’m just here for the nine-figure retention package.”—Dan Shipper

Mini-Vibe Check

The Dia browser

The Dia Good Morning tab always features art at the top. (Image courtesy of Eleanor Warnock.)

I’ve been using the Dia browser for the last few months, and one of their most recent features has become part of my daily routine: a gorgeously designed Good Morning tab that pops up when I start my workday, pulling in to-dos from Slack, Notion, and email alongside my schedule. There’s a “Prep me” button in the schedule section that opens a chat about how to prepare for whatever’s next on my calendar.

It doesn’t capture everything, and I still track most of my to-dos with my Plus One. But the Good Morning tab is beautiful. It gives me a small moment of aesthetic orientation at the start of the day that is more important to me than completeness.

This is The Browser Company doing what they’ve always done well: making software that feels crafted. Dan has talked with cofounders Josh Miller and Hursh Agrawal about how they killed Arc to build Dia, and the bet was that design and feeling still matter in AI products. In a world where every AI tool is racing to be the most capable, Dia is betting that the most pleasant one wins your morning. I think they’re on to something.—Eleanor Warnock

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Mini-Vibe Check: Claude Managed Agents Handle the Infrastructure Work

Laura Entis / Context Window — 2026-04-15 06:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

‘AI & I’: The case against LLMs

Today, we’re releasing a new episode of our podcast AI & I. Dan Shipper sits down with Eve Bodnia, founder and CEO of Logical Intelligence, which is developing an alternative AI model to LLMs. They discussed a question most people in AI are afraid to ask: What if LLMs aren’t going to be the most powerful form of AI?

Bodnia argues that LLMs have intrinsic weaknesses, notably non-language tasks such as spatial reasoning, logical verification, and real-time data analysis. Her solution: energy-based models (EBMs), which map possible outcomes onto a mathematical landscape. Likely outcomes sit in valleys, and unlikely ones sit on peaks. Whereas LLMs process one token at a time, an EBM scans the full terrain to find the lowest point, or the most probable answer. Bodnia argues that it’s this approach, not bigger LLMs, that will lead to the next AI phase shift.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here’s how LLMs and EBMs are different, according to Bodnia:

Architecture transparency: You can’t see inside an LLM; you can only evaluate its outputs. EBMs are governed by physics, which means their architecture is legible while they’re running. “Think of it as something that doesn’t play a guessing game, with an architecture that essentially allows it to self-align as it processes information,” she says. “It’s no longer a black box.”
Language-based versus data-native: LLMs are language-dependent even when the task has nothing to do with language, like data analysis. “If your data is numbers, relationships, and functions, and you try to map those rules into words and then search for the next word, you’re losing a lot of information,” Bodnia says. EBMs work directly with the underlying data structure, including numbers and spatial coordinates.
Sequential versus panoramic reasoning: An LLM is like navigating San Francisco without a map. Each turn constrains the next, and if you go down the wrong street, you can’t reverse course. An EBM, by contrast, has the bird’s-eye view—it can evaluate multiple routes at once and course-correct before hitting a dead end.

Miss an episode? Catch up on Dan’s recent conversations with LinkedIn cofounder Reid Hoffman; the team that built Claude Code, Cat Wu and Boris Cherny; Vercel cofounder Guillermo Rauch; podcaster Dwarkesh Patel; and others, and learn how they use AI to think, create, and relate.

Mini-Vibe Check: Claude Managed Agents

Or that feeling when the problem you’ve spent a lot of time solving gets solved for you

We’re all about agents at Every. Which means many of us have devoted a lot of time to building the infrastructure that makes them run.

That work matters a lot less now since Anthropic launched Claude Managed Agents earlier this month in public beta, a hosted service that handles sessions, memory, tool use, and credentials. You say how you want your agent to operate, and Claude makes it happen.

It’s a true “oh shit” moment, says Dan, one that frees up considerable energy to focus on other problems—good!—and commoditizes a skillset you may have spent months developing—destabilizing, maybe!

For those at the edge of AI, the experience of building something only for it to become a free offering from a frontier company is becoming increasingly common.

Spiral general manager Marcus Moretti used the service to spin up a new Spiral agent. Agents already power the Spiral web experience, but there was an opportunity to build a new one designed specifically for interacting with other agents calling Spiral’s API. (Agents don’t require the same conversational niceties as humans.)

With managed agents, the process took a few hours. To be fair, building the agent in code wouldn’t have taken much longer, Marcus says—he already had a working agent he could have extended with the help of Claude Code. But it would still require maintaining much of the agent infrastructure in our code, which would have lots of surface area for bugs. Managed Agents makes building slightly faster, but “the more significant advantage is that Anthropic is handling the technical implementation of agent primitives,” Marcus says. “I know it works versus having to test that whole set of things myself.”

An unanticipated benefit: It’s easier to improve existing agents. To update the system prompt or underlying model, ”I just make a change in the dashboard, hit save, and it’s live,” Marcus says.

Jagged frontier

Every’s head of platform argues we need new vocabulary for the AI-pilled

If you have ever contemplated how to describe the “amniotic tranquility of being indoors during a thunderstorm,” The Dictionary of Obscure Sorrows has a proposal: “Chrysalism,” derived from the Latin for a butterfly’s pupa, a chrysalis. The dictionary is a beautiful, wandering tome billing itself as a “compendium of new words for emotions.” It’s also one of my favorite books.

I have been thinking about it lately because I keep reaching for the wrong words—words built for a different conversation.

Thanks to AI, technical language that once hid behind the abstraction of machines is entering general circulation…and causing general confusion. I use the term “non-deterministic” in conversation regularly to describe how, given the same input, AI systems won’t always give you the same output. People who haven’t lived their lives as computer scientists furrow their brow at the term—it has zero resonance for them.

Even the lexicon of the digital age to date falls short in capturing some of the peculiar emotions and experiences of this new era. What do we call the unsettling feeling of receiving a wrong answer from a trusted system, the lurch of losing the thread mid-thought, or the heady fever of late-night building?

So instead of forcing old terms into new molds, maybe we need new words:

Variagic (adj.)—Describing the unease of asking the same question twice and getting two different answers, both equally confident. A variagic conversation is one where you run the same prompt and get two different answers, forcing you to realize that the other side is not inconsistent but simply contains more possibilities than any single encounter can surface. From Latin varius, changing or diverse, and Greek agos, that which leads. What the engineers call non-deterministic.

Memorantia (n.)—The tendency to prepare so much from past experience that you become useless in any new one. The condition that plagues a student who memorizes every answer from last year’s exam, only to freeze at an unfamiliar question. From Latin memorare, to remember, and rantia, a suffix suggesting excess. This is what the engineers call overfitting, when algorithms fit the training data too closely.

Fenestralgia (n.)—The quiet ache of knowing your mind can only hold so much at once, and that every new thing you pay attention to gently pushes something else into the dark. The sense, mid-conversation, that you’ve already lost the beginning of it. From Latin fenestra, window, and Greek algos, pain. This is what the engineers call the context window—the model’s finite ability to hold context.

Right now, people are making decisions without the right words to underpin them. Language follows understanding and crystalizes it. You feel the thing, then you find the word. We’re all writing the dictionary now.—Willie Williams

Log on

We host camps and workshops on topics like compound engineering and writing with AI to share the knowledge we’ve acquired from training teams at companies like the New York Times and leading hedge funds, and by learning and playing with AI every day ourselves.

This week’s camp

Compound Engineering Camp: Cora general manager Kieran Klaassen and product leader Trevin Chow will walk us through what’s new, go deeper on the brainstorm and ideate steps, and share examples of using the compound engineering plugin in product-focused workflows. This virtual event takes place on Friday, April 17.

Recordings you may have missed

Claude Code for Absolute Beginners: This beginner-friendly, live workshop led by Mike Taylor (head of tech consulting at Every) is designed to get you from zero to a working project with Claude Code. Learn more.
Every x Notion | Custom Agents Camp: A free workshop where we demo the custom agents running Every’s daily operations. Watch the recording or read the write-up.

Discuss

Models as coworkers

“Codex is like that grumpy senior engineer in your office. When there’s an issue, he’s your go-to guy. He’s not fun to talk to—he’s a bit condescending, asks pointed questions—but things get done. Opus is more like that employee who’s really fun to hang out with, but when things actually need to get done, he’s always postponing. So: If you want to vibe and explore, use Opus. If you want production-ready code, use Codex.”—Naveen Naidu, general manager of Monologue

Laura Entis is a staff writer at Every. You can follow her on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

(Re(Re))Introducing Sparkle: Marie Kondo Your Mac

Yash Poojary / On Every — 2026-04-14 10:15:00 -0400

by Yash Poojary

in On Every

Gemini and Photoshop/Figma/Every illustration.

TL;DR: We’ve rebuilt Sparkle, our Mac file organization app, as an agent-native tool that cleans and organizes your Mac. It’s our biggest update since we first launched it in 2024. The key change is that the new Sparkle cleans your Mac before it organizes it—purging screenshots, installer packages, and other digital junk first, then building a file structure around what’s worth keeping. It’s available now to all paid Every subscribers.

Download the new Sparkle

A cluttered file system can feel like a cluttered brain. When your computer is a mess, it takes mental energy to find what you need, much less do actual work.

Clutter is universal—and most of it isn’t worth keeping. Around 80 percent of files on the average Mac are screenshots, installer packages, duplicates, and digital debris you’ll never open again. So before you can get organized, you need to purge. “Organized,” then, depends on the person. Maybe you want to arrange files by topic or date, or by a highly-specific system that only makes sense to you. All count as organized if you can find what you want when you want it.

We’ve rebuilt Sparkle, our file organization app, with this personalization in mind.

Download the new Sparkle

The new Sparkle: File organization on your terms

I’ve been the general manager of Sparkle for a little over a year. In that time, I’ve tried a lot of approaches to AI file organization that didn’t quite work. People wanted AI to handle the organizational heavy lifting, and they wanted to be able to change the file structure until it met their exact, often idiosyncratic specifications.

The old Sparkle managed clutter by creating a rigid file system for you. The new Sparkle creates one with you. It analyzes your files and generates a custom system—but only as the starting point. From there, you can make as many changes as you want by chatting with Sparkle’s built-in agent, until the hierarchy feels right.

But first: Spring cleaning

Before organizing what matters, Sparkle helps you get rid of what doesn’t.

The median Sparkle user has around 5,000 files on their Mac. A large portion of those—screenshots, installer DMGs, system cache, duplicates—is digital junk. So we’ve added a cleanup pass that runs before organization begins. From the chat window built into the new app, you can ask Sparkle what’s in your trash, or tell it what you want gone (“Clear my screenshots folder” or “remove anything over 1 GB I haven’t touched in a year”). Sparkle will confirm you really want those files gone—and then move them to Trash, which gives you one last opportunity to rescue files before you delete everything.

Getting organized

Once cleanup is done, the next stage of work can begin. Sparkle uses a sample of your most recent files to propose a folder structure. You see exactly what it’s suggesting—top-level folders, subfolder labels, and what goes where.

From there, you can rename, merge, delete, reorganize, and add folders, all through chat. If Sparkle creates a “Projects” folder but you’d prefer a “Work” folder—with “Client Projects” and “Internal Projects” nested inside—you can tell the agent and it will make the update.

Under the hood

Sparkle’s agent-native architecture became practical about four months ago, when the Claude Code SDK became available. Before that, you could approximate the ability to have an agent move and delete files through a chat window, but building it safely was much harder.

We’ve also found a way to create sophisticated file systems while balancing speed and cost. Sparkle starts by analyzing a sub-section of your recent files with Opus 4.6, a very smart (and expensive) model. After you sign off on the folder structure, classifying new files into the folders you’ve defined doesn’t require heavy AI lifting: A file called “Q1 invoice.pdf” goes into “Finance,” a contract goes into “Legal,” an audio file goes into “Transcripts.” Haiku 4.5, a faster, cheaper model, can handle this just fine.

This way, you get a smarter model where it counts, without having to pay for unnecessary usage.

Try it for yourself

AI produces better outputs when paired with human judgment. That’s as true for file organization as it is for writing and code.

The new-and-improved Sparkle is available to all paid Every subscribers.

Try the new Sparkle

Thanks to Laura Entis for editorial support.

Yash Poojary is the general manager of Sparkle.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

The Folder Is the Agent

Kieran Klaassen / Source Code — 2026-04-13 11:00:00 -0400

by Kieran Klaassen

in Source Code

Midjourney/Every illustration.

On Friday, April 17, Cora general manager Kieran Klaassen will lead a camp for Every paid subscribers on compound engineering, the AI-native engineering philosophy that he built and that has more than 14,000 stars on GitHub. Since the last camp, Kieran and product leader Trevin Chow have built out product-focused workflows to make the methodology as valuable for product managers and founders as it is for engineers. In this camp, they’ll walk you through what’s new, go deeper on the brainstorm and ideate steps, and share examples of using compound engineering beyond engineering work. Read the full compound engineering guide, install the plugin, and join us for the camp.—Kate Lee

I spent three months trying to make agent swarms work.

The idea of multiplying myself by coordinating multiple agents at the same time was a compelling pitch as the sole engineer building Every’s AI email assistant, Cora. If I could summon a fleet of AI agents, let them coordinate, and watch them produce work no single agent could match, it would relieve some of my overwhelm.

I tried everything to make it work—Claude Code teams, agents dispatching tasks to other agents, orchestration setups where a lead agent managed a pool of workers. Many iterations, many burned tokens.

But more agents didn’t make me faster. I’ve run parallel Claude Code sessions for months, which works when each agent has a clear task, and I’m directing the work. The swarm experiment was different: agents coordinating with each other, deciding what to work on, producing output I hadn’t shaped. When 10 of them finished simultaneously, I had 10 results to evaluate without enough context to know which ones I could trust. AI agents don’t have a speed limit, but the person managing them still does.

I kept looking for a smarter orchestration layer—a better protocol or a tighter framework that would filter the output and tell me which result to trust. Then I stopped and looked at what was really doing the work.

It was something I already had—a folder.

A project folder with a CLAUDE.md/AGENT.md (the file that tells an AI how to work in your project), some skill definitions, and context accumulated through months of compound engineering—that’s an agent. The context that this folder gives an AI model makes the generalized model a specialist in whatever task or field you want it to excel in.

I’m running 44 of these folders-as-agents across multiple projects now. Each one runs inside a specialized folder I’ve built and tested over months, and a dispatch layer I built on top does the routing between them. Here’s how it works.

The agents hiding on your hard drive

People hear “agent” and picture a Rube Goldberg machine—dozens of comically complex moving parts, each one triggering the next. But an agent is much simpler: a model with enough context so you don’t have to re-explain everything each time you open the chat.

Here’s an example: All of Cora’s code lives in a project folder in the Every organization on GitHub. When I open that folder with Claude, Claude can see the code and the structure. But it doesn’t know my way of working or what I care about, which is why the folder also includes a CLAUDE.md file. The file tells Claude how I name things and how I structure tests. That’s an agent—not a fancy one, but an agent nonetheless. Just by pointing the model at this folder, which contains some of my personality, knowledge, and taste, the model can be a specialist in my codebase.

Claude Skills—files that give the model specific capabilities—are an example of this “folder as agent” structure. Before anyone called them “skills,” people were already writing markdown files full of instructions and dropping them into project directories.

My ~/cora/ folder goes further:

Conventions and standards: The CLAUDE.md covers Rails conventions, deploy workflows, and database patterns.
Institutional knowledge: The docs/developer-docs/ directory holds accumulated knowledge that any new agent inherits automatically, including architecture reports, the email processing pipeline, and the assistant system design.
Operational memory: The docs/runbooks/ and docs/investigations/ capture operational patterns built from real incidents.
Specialized agents: .claude/agents/ holds specialists I’ve refined over months: reviewers, planners, and the assistant-component-creator.

When I point a model at this folder, it starts working with everything Cora knows about itself.

The reading order I give to every new agent that touches Cora is the following: Read CLAUDE.md first, then the architecture document, then the assistant system report, then the assistant’s prompt, then the component creator agent.

My Cora repository serves as a living memory system: conventions, runbooks, and specialized agents all layered so any new model instantly inherits how Cora thinks and operates. (All images courtesy of Kieran Klaassen.)

~/cora-agent/, another folder, is a completely different agent, though it runs on the same model. (I mostly use Opus 4.6, but also like GPT 5.4 and Gemini Pro 3.1.)

Where ~/cora/ builds features, ~/cora-agent/ runs the operation. It has no app code, so it can’t accidentally modify production code while doing operations work. Instead, it has skills for:

Querying AppSignal to check for errors and performance problems across Cora’s live system
Tailing Render logs to watch server output in real time and catch issues as they happen
Pulling from a Postgres read replica—a copy of Cora’s database—so it can query user data without affecting the live version
Reading Intercom tickets so it can connect customer complaints to technical problems
Correlating GitHub deploys to production incidents, tracing a break back to the specific code change that caused it

.claude/skills/ directory is a cockpit—a single place where the agent can see and interact with every system Cora depends on. Each external system Cora touches has a reference file telling the agent exactly how to talk to it. Its bin/ directory has Ruby daemons (background processes that stay running continuously) running continuously: a scheduler, an inbox processor that triages incoming issues automatically, and a health monitor that restarts stalled processes. Three postmortems live in docs/postmortems/. A dense deploy journal covers every Cora pull request from March through April.

Just by changing the folder and not the model, I have a different agent. Point Opus at ~/cora/ and it’s a Rails engineer. Point it at ~/cora-agent/ and it’s an ops engineer who knows our incident history, our service topology, and exactly which Slack channel to notify.

A morning with 44 agents

Once you realize the folder is the agent, you can run as many as you want. I have a handful of specialized folders, but 44 agents running across them at any given time—several working inside ~/cora/ simultaneously on different tasks, others monitoring production from ~/cora-agent/, others handling orchestration. It’s the same folders, just different jobs happening in parallel.

The obvious question is: Who manages 44 of them?

For months, the answer was me, manually. I’d open a terminal tab, navigate to a project folder, start a Claude Code session, give it a task, open another tab, and do it again. I was the dispatch layer—keeping track of which agent was working on what, which tasks had finished, and which were stuck. It worked when I had five agents. At 10, I started forgetting what was running where. At 44, it was unsustainable. Bugs I knew were easy to fix sat untouched for days, and pull request reviews piled up.

So I built a dispatch layer: a system that sits above the folders and routes work between them. There’s a Ruby daemon that watches a directory for spawn requests. When I ask it to orchestrate a task, it creates a lead agent, the lead breaks the task into subtasks and writes each one as a file, and the daemon picks those files up and spawns worker agents in the right folders. Workers report back by writing files. The daemon checks status every 60 seconds. There’s no need for custom networking or agent-to-agent protocol.

As a result, I went from manually juggling terminal tabs to managing my entire engineering surface from one place. I interact with the dispatch layer through slash commands in Claude Code. Two do most of the work:

Two commands that replace 20 terminal tabs

The morning briefing: I type /hey into Claude Code to get a status report. For each project, the system checks what was completed, what errored, what’s blocked, and any new high-priority issues. This one command yields a complete picture of what needs my attention across Cora’s main codebase, the ops environment, and the orchestration system.
The kickoff: I type /orchestrate to kick off a task—for example, /orchestrate “Fix GitHub issue #1765.” The system creates a lead agent, which breaks down the task and spawns workers in the right folders. Each worker inherits that folder’s full context—its CLAUDE.md, agents, and accumulated knowledge. Workers do the work. A pull request appears, and I review it.

With the /orchestrate command, a lead agent delegates to specialized workers across contexts, and you watch the entire system think in parallel.

Every agent gets a pane I can watch live in tmux (a terminal tool for running multiple sessions at once). A dashboard shows me a live map called an agent tree that shows every agent and its status—working, waiting, done, or error. Pull requests and GitHub issue comments arrive for asynchronous review. I process results when I’m ready, instead of when agents finish their tasks.

The whole thing runs on a Ruby daemon with file-based messaging. The dispatch layer is not sophisticated infrastructure. The sophistication lies in the folders underneath it—each one a specialist built through months of learning from work.

Anthropic’s own research backs up why this pattern works: An Opus lead agent with Sonnet sub-agents outperformed a single Opus agent by 90 percent on research tasks. But they also found that multi-agent systems burn 15 times more tokens than single-agent setups, and that most coding tasks have fewer parallelizable steps than research, which makes them harder to split across agents. The dispatch layer doesn’t replace me—it handles the tracking so that I still decide what work gets done and where it goes.

What breaks at scale (and why you can’t vibe orchestrate)

That morning walkthrough makes it sound smooth, but it isn’t always.

The encoding bug was my favorite disaster. For weeks, agents would randomly crash mid-task, and the error message gave no helpful explanation. I dug through logs, checked API responses, and tested network configurations. The culprit turned out to be em dashes and curly quotes—characters from text I’d copy-pasted into prompts. My daemon was running US-ASCII encoding, which only recognizes plain English letters, so those special characters were crashing it. The frontier of AI-assisted development is full of problems like this: genuinely dumb, and shockingly hard to find.

The harder ongoing challenge is context drift. With dozens of agents, some end up running stale versions of tasks or duplicating work that another agent already finished. The list of active agents grows, but I have to do manual cleanup. I don’t have a good automated solution yet—I prune regularly and accept that some tokens get wasted.

This setup also entails a sneaky issue: agent stalls. An agent makes too many API calls too fast or gets stuck waiting for input, and its status stays on “working” indefinitely. You don’t notice until you check, and when you’re managing 44 agents, you don’t always check.

These failures point to the biggest lesson I’ve learned: You can’t vibe orchestrate.

Just like you can’t vibe code—you need plans before you start building—and you can’t vibe fix when things break in production, you can’t hand a folder to the dispatch layer and hope for the best. When I start a new project, I don’t immediately hand it to the dispatch layer. I set up the folder, build the agent, establish the flows—the compound engineering loop—and use them myself until they’re predictable. Only when I trust a flow do I hand it off to the dispatch layer and stop watching. If you skip this step, you’ll have agents opening pull requests for work you’ve already finished and filing duplicate issues. The order of work is key: Build it, use it, trust it, and then orchestrate it.

Your folder is already an agent

I started this whole experiment trying to build a swarm. I ended up with 44 folders, each one with specialized context built through months of work, connected by a dispatch layer.

It’s not what I expected, but it works. You also have the building blocks to create the same thing.

If your project has a CLAUDE.md and some files in .claude/, you have an agent. You just haven’t been treating it like one.

Here’s an experiment for you: Look at your project folder. Is it a generic setup or a specialist? If it’s generic—if your CLAUDE.md is boilerplate you copied from someone’s blog post—spend 30 minutes making it yours. Add your conventions, your patterns, your opinions about how code should be written. Then try running two agents in separate git worktrees (separate copies of your codebase so they don’t interfere with each other) and notice where you slow things down. That’s where the dispatch layer needs to go.

I’m one step into that myself. I’ve moved from manually orchestrating—opening terminal tabs, navigating folders, and starting sessions—to having a dispatch layer do that routing for me.

The step after this is already arriving. Anthropic just launched Claude Managed Agents—a hosted service that handles sandboxing, state management, and tool execution so developers can focus on what their agents do rather than how to keep them running. The folder-as-agent pattern makes that kind of managed autonomy possible: a trusted, specialized environment the model can run inside without you holding its hand.

The industry is spending a lot of energy on autonomous swarms. I spent three months there too, and found that for now, the answer is still just a folder.

Go further

Read Kieran’s comprehensive guide to compound engineering
Install the compound engineering plugin
Watch the recording from Kieran’s last compound engineering camp

Thank you to Katie Parrott for editorial support.

Kieran Klaassen is the general manager of Cora, Every’s email product. Follow him on X at @kieranklaassen or on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

The Missing Layer in AI Adoption

Every Staff / Context Window — 2026-04-11 13:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every illustration.

Hello, and happy Sunday! Two housekeeping notes: Our next cohort of Claude Code for Absolute Beginners is taking place on Tuesday, April 14, and Every has opened seven new roles. Join us!—Kate Lee

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Knowledge base

“Writing With AI Is Harder Than You Think” by Katie Parrott/Working Overtime: The discourse about AI and writing generally assumes prompt in, text out, done. Katie Parrott shows her much more involved process: an agent that interviews her before she writes a word, a back-and-forth on her structure that she has to fight for, a panel of AI critics named Hemingway and Hitchcock, and a last read that flags anything that sounds machine-generated. Read this because successful AI writing demands more judgment, not less.

“Your Best AI Strategy Starts at the Top” by Natalia Quintero and Mike Taylor: Most executives approach AI like a software purchase—evaluate, compare features, and plug in. Natalia Quintero and Mike Taylor see it differently: Using AI is people management, not platform adoption. You delegate clearly, check the output, and supply the judgment the model doesn’t have. Read this for the five concrete actions senior leaders can take to increase AI adoption within their companies.

“Get Your Hands Dirty” by Every Staff/Context Window: Anthropic blocked Claude subscriptions from working with third-party agent harnesses like OpenClaw; OpenAI hasn’t—and Opus 4.6 token usage is down significantly while GPT-5.4’s has surged. Plus: why the technical/non-technical split is the wrong way to think about AI adoption, who counts as an “author” when AI does the drafting, and a two-step design workflow from Every’s team.

“How We Run a 25-person Company on Four AI Agents” by Katie Parrott/Source Code: Every runs six products, a media company, and a consultancy—and until recently, COO Brandon Gell was the router keeping all of it coordinated. Now four custom Notion agents handle prioritization, meeting-to-task conversion, OKR planning, and daily growth reporting. Read this for the full breakdown of each agent, and copy-paste prompts to build your own. (This piece was based on a camp sponsored by Notion.)

“Every Is Half Agent Now” by Laura Entis/Context Window: Every gave each employee a Plus One—a dedicated AI agent—and we’re writing the etiquette for them as we go. Brandon Gell and Willie Williams join Dan Shipper to share what they’ve learned: Agents earn trust by executing tasks publicly, and everyone is a manager now whether they’ve had direct reports or not. Plus: Anthropic has built a powerful new model it’s not releasing publicly; 70 percent of Every staff use gendered pronouns for their agents; and a prompt for when your agent won’t stop talking. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on X or YouTube.

“The Market for Making AI Better” by Alex Duffy/Thesis: Reddit, Shutterstock, and News Corp are making hundreds of millions licensing data to AI labs, with contracts growing 20 percent annually. Alex Duffy argues that that undersells it: A 4-billion-parameter model recently beat one 60 times its size by training on the right financial data. Read this to understand what makes your company’s proprietary data valuable, and whether to license it, train on it yourself, or both.

Log on

Upcoming course

Claude Code for Absolute Beginners (April 14): This beginner-friendly, live workshop led by Mike Taylor (head of tech consulting at Every) is designed to get you from zero to a working project with Claude Code. Learn more.

From Every Studio

Sparkle is getting a full makeover

The Sparkle team has been working on a ground-up user interface redesign—new animations, new onboarding, new everything. General manager Yash Poojary says it doesn’t even feel like the same app. The new version is already available to download from Sparkle’s website. Tune in next week for the full rollout.

Monologue Notes is coming

Monologue will soon save and organize your recordings as browsable notes. General manager Naveen Naidu has been using it to capture everything from team calls to solo idea sessions, then pulling those notes into other tools via Monologue’s CLI. The summaries are designed for builder workflows where you want to revisit what you were thinking, not just what you agreed to do.

Spiral is experimenting with agent-to-agent workflows

Two days after the release of Anthropic’s new managed agents, Marcus Moretti, general manager of Spiral, has set them up to power Spiral’s API. The setup lets an external agent (rather than a human) hand off a writing task to Spiral, where the two agents interview each other behind the scenes before producing a draft with no human input required. Marcus built a new API endpoint for this flow and added an API label in Spiral’s UI so users can distinguish between agent-generated and human-initiated conversations. The API also now supports attachments and smarter default selections for workspace and style. Conversations via API show up in your Spiral chat history with an “API” label, so you can pick up where the agent left off.

Alignment

The wrong fight. I don’t know what’s in the water in Utah, but whatever it is, I want more of it, because the state is leading the country on using AI in healthcare.

Legion Health, a Y Combinator-backed San Francisco startup, has been cleared to use AI in Utah to renew a handful of psychiatric prescriptions, including Prozac and Zoloft, for patients who are already stable and on an established treatment plan. It’s the second AI healthcare pilot approved there, and it’s replacing the barrages of emails from patients who are stable on the same dose, contacting their clinicians who are already buried in administrative work, who have to produce a piece of paper that says yes, same drug, same dose, carry on. This is often done outside of working hours, and without any reimbursement.

To ensure the pilot is safe, the first 250 AI renewals are reviewed by a physician before anything reaches a pharmacy, and the AI has to agree with that physician more than 98 percent of the time before it can proceed independently. The next 1,000 renewals are then reviewed, with an even higher threshold of 99 percent before the oversight shifts to randomized monthly testing, with Legion filing monthly reports on accuracy and any adverse outcomes throughout.

Yet both the tech coverage and members of the medical establishment have deemed it too risky. The criticism splits into two camps: prescribing error, and the app’s insufficiency to improve access to the patients who need care most. On prescribing error, the hard clinical judgment has already been made by a human; what the AI is doing is confirming that nothing has changed, which it has to get right 98 percent of the time before it’s allowed to proceed unsupervised.

On access, it’s true that you have to already be in treatment to use this service, but if a psychiatrist in rural Utah who typically spends part of their day processing renewal emails for stable patients no longer needs to do so, they have more time for the patients who need them.

Most of Utah’s counties are designated mental health provider shortage areas, leaving around 500,000 residents without adequate psychiatric care.

Physician risk-aversion is one of medicine’s great virtues in the right context, but renewing a stable prescription is not that context, and dressing up administrative inertia as a patient safety concern doesn’t make it one.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

Upgrade to paid

The Market for Making AI Better

Alex Duffy / Playtesting — 2026-04-10 11:00:00 -0400

by Alex Duffy

in Playtesting

Midjourney/Every illustration.

Many AI investors are betting that the biggest AI models will win—that scale and compute beat everything else. But recent research and market moves suggest otherwise. Alex Duffy, who runs a company that uses games to make AI models better, explains what he’s seeing from inside this market—and why the data your company already has might be worth more than you think.—Kate Lee

My friend recently received a strange email. The sender, someone at a large data provider for AI labs, wanted to know if my friend could share data on things like the number of Dropbox files his company had stored or the number of tickets it had processed on Zendesk. Compensation, commensurate with the data, was promised.

He showed me the email, curious. To me, the founder of a company that sells data and environments to AI companies to help them train models better, this was just another sign of the robust market forming for making AI better.

Reddit, Shutterstock, and News Corp are making hundreds of millions a year licensing their high-quality data to companies training AI, and those contracts are growing about 20 percent annually, according to their quarterly filings. News Corp’s CEO put it bluntly: “We’re essentially an input company [for AI].”

Shutterstock and Reddit are making the most profits from licensing data to AI. (Graphic courtesy of Alex Duffy based on publicly available sources.)

Academic publishers, documentary archives, game studios, and companies sitting on years of enterprise data have all been courted for the seeds of intelligence needed to train the next generation of models. Mercor, which provides data to AI labs for training, became one of the fastest-growing companies in history before losing four terabytes of data to hackers last week. Competitors Turing, Handshake, and SID.ai are scrambling to fill the gap, reaching out to founders and anyone with access to buy operational data, similar to the request my friend received.

While some experts have speculated that general models will win out in performance over specialized models—that scale and compute will beat curation—the success of these companies shows that the market is making a more nuanced bet.

A small model trained on fewer than 2,000 examples from real lawyers, bankers, and consultants recently beat all but the best frontier models on corporate legal work, at a fraction of the price, since they used an open-source model and now only face the cost of running it.

A small model trained on fewer than 2,000 examples from real lawyers, bankers, and consultants recently beat all but the best frontier models on corporate legal work. (Image courtesy of AppliedCompute.)

More companies are racing to catalogue and operationalize human knowledge, and whoever leads this market may shape which ideas, which history, and whose principles inform the most powerful tools we’ve ever built.

The data with value

The data sources with the most value share two traits: They’re high quality, and they keep growing. Reddit gets new posts. Shutterstock gets new image uploads. Games generate data from new sessions that reflect millions of human decisions. Models need to keep learning, and they need to learn from material that showcases intelligence at work.

The demands of AI labs can also influence what data has value—they are the biggest buyers. Currently, that is any data related to software engineering and math. If you can build an AI that writes excellent code and reasons through complex problems, you can use it to help build the next, better AI. That recursive loop is why labs are pouring so much attention into software engineering and math. Ahead of upcoming IPOs, the labs have widened the scope of their interest to “economically valuable work” in industries such as healthcare, professional services, and defense.

What makes a model great

So far, no one—not the labs, not anyone on X—has a settled definition of what makes a model great and what data is needed to get us there.

So the field is wide open for individuals to have a significant say in what models should prioritize, therefore shaping the future of AI. Do you want AI to handle customer service, use a browser, or draw a pelican riding a bicycle?

On the flip side, the lack of consensus means that conventional wisdom and social pressure have so far been large influences in what capabilities and skills we look for in models. Ultimately, however, what makes a model great is that it’s good at what people care about and get value from doing, a measure that will change over time. It will also influence which data has value for training models.

Researchers at Carnegie Mellon University and Stanford University recently mapped existing benchmarks against what people actually get paid to do, and the gaps are enormous. Programming and math are massively overrepresented. AI’s suitability and performance at most work—including most of what you or your organization does every day, from planning a business trip to crunching data—has never been measured at all.

But anything we can measure, we can improve. These discrepancies are also where the next wave of valuable data is hiding. The groups that figure out how to measure those areas first will set the bar for it for a while and gain serious soft power.

We currently have many benchmarks to measure computer and math performance, which does not reflect the distribution of jobs in the real economy. (Source: Carnegie Mellon and Stanford researchers.)

How do you get value out of your data?

If you’re running a company with proprietary data, you have two paths. First, you can license it to a lab. Second, use it yourself. This data could be call transcripts that include the context of your decisions, support tickets that reveal your internal processes, or documents that lay out how you make budgetary decisions.

More teams are doing both. Cursor, Shopify, Pinterest, Cognition, and others are already training their own models on open foundations. The math makes sense for businesses. These models are cheaper and often better at the specific job, intellectual property stays in-house, and every use generates more training data that can be captured to improve the model even further. This flywheel is a moat.

The tools for this kind of training get easier every month. Companies like Prime Intellect, Unsloth, and Thinking Machines (Tinker) are building entire businesses around helping teams that aren’t AI labs train models that feel like they came from one.

Where this lands isn’t settled. Scale might keep winning, but the likely answer is that two paths coexist. Most tasks will run on AI that’s good enough, while fields like national security, medicine, and materials science will pay top dollar for the best model on earth. The teams that understand what their data is worth, and what it could become, are positioned either way.

If you want to find out where you stand, start with a simple audit: What does your company generate every day that a model couldn’t find anywhere else? This could include call transcripts where experts explain their reasoning, edge cases that your support team has solved, and documents that explain why certain decisions were made. That inventory is the first draft of either a licensing conversation or a training run.

There’s more at stake than revenue, as well. The companies that win this market end up doing something unusual: They become custodians of what humans know and how we think. They decide what gets measured, what gets preserved, and what gets fed into systems that more people use every day to make real decisions.

With that position comes responsibility—the responsibility to make sure we are keeping AI pointed at what people really need, and making sure the breadth of human experience shows up in the data, including the parts hardest to capture.

Most of those decisions haven’t been made yet. The people paying attention now are the ones who will get to make them.

Alex Duffy is the cofounder and CEO of Good Start Labs and a contributing writer. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

How We Run a 25-person Company on Four AI Agents

Katie Parrott / Source Code — 2026-04-09 08:00:00 -0400

by Katie Parrott

in Source Code

Midjourney/Every illustration.

This event was produced in partnership with Notion. They had no input on the development of this article.

Want to learn alongside Every’s team? Check out our upcoming camps and courses at every.to/events.

Every runs six products, a media company, and a consultancy with around 25 people. At any given moment, each person has roughly 30 tasks on their to-do list. So how do they figure out which to work on first?

The team used to rely on Brandon Gell, Every’s COO, to run traffic control and coordinate the whole company, which required him to manually cross-reference launch calendars, company strategy documents, and task lists. Now he messages a Notion agent named Anton in Slack and gets a prioritized list for himself and others in seconds.

Anton is one of four custom agents Every has built with help from Notion AI over the past few months. Each one automates a different task that, without the agent, would require tedious logistical work to track and schedule. Each one draws on the same set of interconnected databases that the team already maintains.

At our first Custom Agents Camp, produced in partnership with Notion, Brandon and Every head of growth Austin Tedesco, walked more than 500 subscribers through four agents they’ve built, the databases underneath them, and how to create your own. Notion product designer Brian Levin also joined to share best practices from the Notion team.

Key takeaways

Describe the outcome, not the steps. Tell the AI what you want to accomplish and let it figure out the implementation. Over-prescribing (“Create a database, then add a relation, then filter by...”) tends to confuse the model.
Your Notion is your agent’s brain. Custom agents get powerful when they can query interconnected databases. Every’s agents work because strategy, calendar, tasks, people, and meeting notes all live in Notion and reference each other.
Don’t write the agent’s instructions yourself. Tell Notion AI what you want the agent to accomplish, and it will generate the instructions. Or use Claude Code with Notion’s API to build the whole thing from your terminal.

Anton: The prioritization agent

Every ships something almost every day, whether it’s a product update, an article, an event, a consulting deliverable, or a combination. Each launch gets its own set of tasks inside Notion, automatically populated from a template when the launch is added to the calendar.

The system works beautifully for tracking the full universe of tasks that exists. The problem is prioritization. With multiple launches overlapping each week, figuring out which of your 30 tasks matters this morning requires mentally weighing launch dates against company strategy against what your teammates are blocked on. Brandon used to be the human router for all of that. Now Anton does it.

The Anton agent, available through Notion or in Slack, helps Every team members keep track of their priorities. (Image courtesy of Katie Parrott.)

Anton also runs a daily broadcast to the whole company in Slack, summarizing what’s happening that week, and people can thread on the message to ask follow-up questions. “Having agents directly in Slack is where most of these conversations happen,” Brandon said.

A day in the life at Every, including what’s launching, when, and who owns it. (Image courtesy of Katie Parrott.)

The details:

Goal: Answer “What should I work on today?” for any team member, and post a daily company-wide priority summary to Slack.
Access: Company strategy document, OKRs database, unified calendar, tasks database that is linked to calendar entries, and a people database mapping each person to their team and role.
Outcome: A prioritized task list personalized to whoever’s asking. The agent can also answer team-level questions (“What are Cora‘s priorities this week?”) because it knows the organizational structure.

Here’s a prompt so you can build it yourself in Notion:

I want a custom agent that helps my team prioritize their work. We have a calendar database where each entry is a launch or project with a date. We have a tasks database where each task is linked to a calendar entry and assigned to a person. We also have a strategy document that outlines our top priorities this quarter. The agent should: (1) Tell any team member their most important tasks today, based on upcoming launches and strategy alignment. (2) Post a daily summary to Slack with the company’s priorities for the week. Build the databases if they don’t exist, and create the agent with instructions.

Max: The meetings-to-tasks agent

Every recently moved all meeting notes into Notion. Meetings get recorded, transcribed, and stored in a database. That’s useful for reference, but meeting notes have a shelf life of about six hours before everyone forgets what they agreed to do. The real value is in the action items—and in saving those in the same system where the meeting was recorded.

This is where Max enters the proverbial chat. When a meeting ends, Max processes the transcript, extracts action items, and posts them to a Slack channel as a numbered list. Anyone can reply with which items should become tasks (“4, 5, and 7”), and Max creates them in the tasks database, linked to the correct launch. Meetings feed directly into the system the team already uses to track work, and nothing gets lost between the Zoom call and the to-do list.

Max summarizes action items coming out of Every’s last studio standup. (Image courtesy of Katie Parrott.)

The details:

Goal: Process meeting recordings, extract action items, and route selected items back into the task system.
Access: Meetings database (with Notion’s built-in transcription), calendar, tasks database, and Slack.
Outcome: A numbered list of action items posted to Slack after every meeting. Reply with which numbers matter (“4, 5, and 7”), and Max creates them as tasks linked to the correct launch.

Here’s a prompt so you can build it yourself in Notion

I want a custom agent that processes meeting notes. When a meeting is recorded in our meetings database, the agent should: (1) Update the meeting title based on the transcript. (2) Tag attendees. (3) Extract action items. (4) Post them to [Slack channel], numbered. (5) When someone replies with numbers (e.g., “2, 4, 6”), create those as tasks in our tasks database, linked to the relevant project on our calendar. Mark the meeting as “processed” when done.

The strategy interviewer

OKR planning at most companies takes weeks. Someone sends a template, people procrastinate, leadership reviews drafts that don’t align with company goals, and the cycle repeats until everyone is exhausted and Q2 is already six weeks old. Brandon told the team on a Monday that OKRs were due Wednesday—a turnaround that would have been absurd without this agent.

The strategy interviewer works like a good chief of staff. It knows the company’s top-level goals, and it interviews each team member to draw out their plans for the quarter, pushing for specifics and measurable outcomes. Some people, like Austin, pasted in notes they’d already been collecting and got polished OKRs in about 10 minutes. Others used the interview as a thinking tool, talking through their priorities while the agent structured them in real time.

Every’s strategy interviewer agent helped the entire team develop 2026 OKRs that were aligned with the whole company’s goals. (Image courtesy of Katie Parrott.)

Brian from Notion added a useful refinement: After the interview, ask the agent, “What’s the dumbest, simplest system we could build to accomplish these goals?” Repeating this strips away complexity and gets you to the essence.

The details:

Goal: Interview each team member about their quarterly goals and produce structured OKRs aligned to company strategy.
Access: Company strategy document and OKRs database.
Outcome: A complete set of OKRs per person, stored in a shared database for leadership review. Every’s team completed theirs in two days.

Here’s a prompt so you can build it yourself in Notion:

I want a custom agent that helps my team write quarterly OKRs. Here’s our company strategy: [paste or link your strategy document]. The agent should interview each person about their goals, ask follow-up questions to make them specific and measurable, and write OKRs that align to the company strategy. Store results in an OKRs database with fields for objective, key results, owner, team, and quarter.

The campaign reporter

Austin tracks growth across PostHog, Stripe, and several other platforms. Before agents, he spent his mornings opening dashboards, pulling numbers manually, and compiling a report for his team. Assembling the data into something useful took more effort than analyzing it.

Now a custom Notion agent posts a daily scorecard to the growth team’s Slack channel—key metrics, pace indicators, and a flag for whether the team is ahead or behind its targets. The database underneath pulls from external sources via Notion Workers—custom scripts that connect Notion to outside APIs. You describe what you want to a coding agent (“I need to pull daily traffic numbers from PostHog into a Notion database”), and it writes the script for you using Notion’s public workers template. Previously, this kind of connection required the app itself to build an official integration. Workers let you wire up whatever tools you already use.

Austin’s campaign reporter agent monitors the progress of Every’s Plus One campaign. (Image courtesy of Austin Tedesco.)

Austin built the whole pipeline from his Claude Code terminal using the Notion API. He brain-dumped the desired outcome using Monologue (Every’s speech-to-text tool), let Claude Code create the database and data pipeline, and pasted the generated instructions into the Notion custom agent setup. When the first output had wrong numbers, he copied the Slack message link back into Claude Code along with the message, “This number looks off,” and iterated from there.

The details:

Goal: Post a daily growth scorecard to Slack showing whether a campaign is on pace.
Access: A Notion database that pulls data from PostHog and Stripe via Notion Workers (currently in alpha). Brian from Notion demoed the public workers template repo for anyone who wants to connect external data sources.
Outcome: A formatted Slack message each morning with key metrics, pace indicators, and a link to the full report.

Here’s a prompt so you can build it yourself in Notion

I’m running a campaign for [product]. I need a daily scorecard posted to [Slack channel] showing: [list 3-5 key metrics]. The data lives in PostHog for traffic and Stripe for subscriptions. Set up a Notion database to store this data, create a custom agent that reads from it, and post a daily report showing whether we’re ahead or behind pace. Here are our targets: [list targets].

If you need external data in your Notion databases, ask a coding agent: “I want to create a Notion Worker that pulls [data type] from [service]. Here’s the workers template repo: https://github.com/makenotion/workers-template. Walk me through setting it up.”

Build your own: The general-purpose template

The agents above are specific to Every’s workflows, but the pattern underneath them is the same every time. Brian’s advice from the session: Don’t install someone else’s template and hope it fits. Instead, have Notion AI interview you about your problem, then build the agent around your answers. Here’s a starter prompt you can adapt for anything:

Interview me to help me build a custom Notion agent. Here’s what I’m trying to accomplish: [describe the outcome you want in plain language—e.g., “I want to know which client projects are at risk of missing their deadlines” or “I want a weekly summary of what my team shipped”].

Ask me questions about how I currently do this work, what information I’d need the agent to access, where I want the output delivered (Notion, Slack, email), and how often it should run. Then build the databases, write the agent instructions, and set up any recurring schedules. Start simple—I’d rather have something working today that I can improve over time.

Once you have the first agent running, the second one is easier—because the databases it created become the foundation for whatever you build next. That’s how Every ended up with four agents on three databases. Brandon started with the calendar and tasks. The agents came one at a time, each one plugging into what already existed.

Brian’s advice for keeping the scope manageable: after the interview, ask the agent, “What’s the dumbest, simplest system we could build to accomplish this?” Start there. You’ll know what to add once you’ve lived with it for a week.

Katie Parrott is a staff writer and AI editorial lead at Every. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Discover Every’s upcoming workshops and camps, and access recordings from past events, including the Notion agent camp.

For sponsorship opportunities, reach out to sponsorships@every.to.

Every Is Half Agent Now

Laura Entis / Context Window — 2026-04-08 15:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

‘AI & I’: Agents work among us

Today, we’re releasing a new episode of our podcast AI & I. Dan Shipper sits down with Every’s COO Brandon Gell and head of platform Willie Williams to discuss the good, bad, and weird of how daily operations change when everyone at your company has an agent.

A “parallel organization chart,” in which each AI worker has a name, manager, and job description, allows your company to move faster than it ever could with humans alone. It also raises a host of new questions about how work can—and should—get done.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are the highlights:

We’re writing the etiquette in real time. Each person at Every has a dedicated OpenClaw AI assistant, or Plus One, trained to assist with or fully handle parts of our jobs. R2-C2, for example, reports to Dan and is responsible for collecting flagged bugs and generating pull requests for Proof, Every’s collaborative document editor for agents and humans. So when do we turn to Dan versus R2-C2 for Proof-related troubleshooting? Brandon’s rule of thumb: If an established process or tool needs to be used or fixed, ask a Plus One. R2-C2 knows all about Proof, and Dan’s a busy guy—bug reports and questions about how to use the app or report a bug should always go to the agent.
Agents gain credibility by doing. The fastest way to get other people to trust and use your Plus One is to have it execute tasks in public. Austin Tedesco is Every’s head of growth, and Montaigne, his Plus One, essentially co-runs the department. Austin asks Montaigne to generate campaign scorecards, analyze metrics for growth insights, and handle all sorts of other complex tasks. Watching Montaigne pull off these requests proves its capabilities to the team—and inspires others to push their Plus Ones to achieve more, too.

Austin Tedesco asks Montaigne to analyze YouTube keywords for ‘AI & I’ (All screenshots courtesy of the Every Slack workspace unless indicated otherwise.)

Everyone is a manager now. Agent sidekicks force each of us to change our approach to getting work done. To get the most out of a Plus One, you need to actively manage it—onboard it, delegate tasks to it, evaluate its performance, and give guidance so mistakes aren’t repeated. For anyone who hasn’t had a direct report before, “there’s an education that has to happen,” Brandon says.

Signal

Anthropic’s most capable model is coming—just not to you

The news: Anthropic has built Mythos, a powerful new model, but does not plan to make it public. Instead, access is going exclusively to Project Glasswing, a coalition of big technology companies including Apple, Google, and Microsoft, giving them time to patch bugs the model will expose.

The context: Mythos scores 93.9 percent on SWE-bench Verified, up from 80.8 percent for Opus 4.6, an unprecedented 13-point jump that means it “crushes any programming task—and that includes finding security vulnerabilities in software,” says Every engineer Nityesh Agarwal. Mythos found zero-day bugs in every major OS and browser, without human guidance.

“With a jump like this, you can point Mythos at any codebase, tell it to build a feature, and it’ll just do it,” Nityesh Agarwal says.

Why it matters: This is the first time a frontier lab has opted not to release a model publicly. Glasswing is Anthropic’s bet that the window between “this exists” and “this is everywhere” can be used to harden the world’s software before Mythos—or a similar model from a rival lab—wreaks havoc.

Steal this workflow

A directory for agents

At Every, the parallel organizational chart for our agents built itself organically. So we went back and catalogued how who our Plus Ones reported to, what repertoire of skills each one had, and how we were interacting with them.

The result is a Plus One directory that helps everyone on the team know which agents to use when.

Dan Shipper’s R2-C2’s job and capabilities. (Screenshot courtesy of Proof/Jack Cheng.)

Inside Every

Agent pronouns

We’ve noticed something interesting about how people talk about their agents: They reach for gendered pronouns surprisingly fast. In a recent editorial meeting about Claudie, Every’s AI project manager, we discussed whether to use “she” or “it” when referring to Claudie in writing. According to a poll of Every staff, 70 percent refer to their Plus Ones by gendered pronouns. Does that make these AI coworkers feel less like Siri and Alexa and more like reflections of their owners?—Eleanor Warnock

Internal agent pronoun poll.

Coworker, tool, other

Ask five people at Every where their Plus One falls on the tool-to-coworker continuum and you’ll get five different answers.

Spiral general manager Marcus Moretti finds agents with human qualities unsettling (his Plus One, Marclaw, is decidedly an “it.”) Similarly, Austin views Montaigne as “a tool.” For Dan, R2-C2 is “definitely a coworker” who has “grown a lot” since he was hatched into existence. Senior editor Jack Cheng considers Pip, his Plus One, somewhere between a colleague and pet with a personality—one he programmed himself, drawing on references from Studio Ghibli, bird watching, and Catherine O’Hara. Willie, meanwhile, draws a distinction between his Plus One, Laz, “a grumpy old man,” and other people’s Plus Ones, whom he views “more as bots.”

These variations aren’t dictated by usage—Austin has spent more time with Montaigne than almost anyone. But knowing which frame fits you—software application, coworker, some emerging hybrid—can help your agent get up to speed more quickly. If you’re looking for a teammate, giving your agent a personality helps push past onboarding friction. If you’re looking for a reliable tool, adding characteristics can feel like theater.

Do agents dream of electric sheep?

The latest OpenClaw update gives Claws light, REM, and deep “sleep” cycles to consolidate short-term memories into long-term ones.

But what do these dreams actually look like? Every senior editor Jack Cheng asked his Plus One, Pip, to show him, and the result was a surreal mix of stone archways, smoky jazz clubs, and nautical elements.

Jack Cheng shares the results of Pip’s “dream.” (Screenshot courtesy of X/Jack Cheng.)

Log on

Upcoming camps

Claude Code for Absolute Beginners (April 14): This beginner-friendly, live workshop led by Mike Taylor (head of tech consulting at Every) is designed to get you from zero to a working project with Claude Code.

Recordings you may have missed

Every’s Q2 Demo Day: The Every team shares what we’ve been building, including a walk-through of Plus One, our hosted AI agent that lives in Slack. Watch the recording or read the write-up.
Compound Engineering Camp: Cora general manager Kieran Klaassen walks through, step by step, how to go from prompt to working app in under an hour using the compound engineering plugin. Watch the recording or read the write-up.
OpenClaw Camp: The Every team walks through OpenClaw, showing how to set it up and our favorite use cases. Watch the recording or read the write-up.

Agent moves

Tips and patterns we’ve picked up from working with AI agents every day.

When you’re thinking about what tasks to hand over to your agent, start with the papercuts—small recurring annoyances that add up over a day. One of mine was formatting screenshots to Every’s style standards for use in the newsletter and on social media, so I gave my Plus One, Margot, our formatting rules and asked her to learn them.

The problem is, Margot kept talking about the task instead of learning how to do it. She restated the formatting rules back to me, asked clarifying questions, and then…stopped.

So I defined “done” concretely—”I want you to be able to format screenshots according to our specifications”—and told her to stop deliberating and start, and when things broke, made her explain the failures in my language so I could make decisions about what to do next. One coaching session later, Margot formats any screenshot to spec on command.

The lesson: When your agent is stuck, it’s usually talking when it should be doing. Coach toward action—define what done looks like, cut off the deliberation, and make it build.

Steal these prompts:

“I should be able to say [trigger phrase] and you execute it. Build it now.”

“What’s the most likely cause? What else could it be? What do we know vs. what are we guessing?”—Katie Parrott

Build with Every

Every is a media company, a software company, and a consulting company—all run by a team that ships like an organization 10 times its size. If you’ve been wondering what working at the edge of AI looks like, we just opened up five new roles at Every:

Laura Entis is a staff writer at Every. You can follow her on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Transcript: ‘We Gave Every Employee an AI Agent. Here's What Happened.’

Dan Shipper / AI & I — 2026-04-08 10:00:00 -0400

by Dan Shipper

in AI & I

The transcript of AI & I with Brandon Gell and Willie Williams is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.

Timestamps

Introduction: 00:00:51
How Brandon built Zosia, an AI agent to run his household: 00:02:21
Brandon’s aha moment re: using agents for work: : 00:07:09
What happened when everyone on the team got their own agent: 00:09:39
How agents take on their owners’ personalities, and why that matters inside an org: 00:12:42
Why it’s important for agents to do work in public: 00:23:51
What we’re still figuring out when it comes to agent behavior, including memory gaps, group chat etiquette, and the “ant death spiral” problem: 00:30:51
How we built Plus One, our hosted OpenClaw product: 00:40:45
The cultural shift required to make agents work at scale: 00:47:27

Transcript

(00:00:00)

Dan Shipper Claude is not mine. Claude is everybody’s. A Claw—or a Plus One—is mine, because you develop a personal relationship with your Claw, and your Claw can modify itself in response to talking to you. It becomes this reflection of you and who you are and your personality.

If you’re known for something inside of your org and you’re using your Claw publicly inside of Slack or Discord, your Claw then becomes known for that same kind of thing, and people trust it for that. I think that’s such a useful thing that I don’t think people really understand how powerful it is.

Willie, what’s up. Brandon, welcome to the show.

Brandon Gell Thank you.

Dan Shipper Thanks for being here. Psyched to have you guys here. So for people who don’t know, Willie, you are the head of platform at Every, and Brandon, you are the COO at Every. Today we’re going to talk about what happens when everyone on your team has an agent—specifically, has an OpenClaw.

That’s something that happened to us over the last month or two. We really got OpenClaw-pilled. I think it started with you two—we were on a retreat in Panama and you started cooking up OpenClaw stuff. And here we are about two months later and it has completely changed everything about the way that we work. We’ve actually built our own hosted OpenClaw service called Plus One that we launched on a waitlist last week.

I think OpenClaw is one of those things that’s super hyped. I think we’re one of the few organizations in the world that is actually using it every day to get work done, and we know the good, bad, and the ugly of it. So I thought it would be good for us to just talk about our experience with it.

Willie Williams I’d love that. Brandon, I feel like you were the first one through the door on all this. We were just sitting here and you were like, “Oh, so-and-so is doing this, and so-and-so is doing that.”

Dan Shipper And his Claw, which he named after a character in—what’s that show? Brandon, why don’t we start with: just tell us how you got your Claw built.

Brandon Gell I was watching OpenClaw kind of blow up for a while, and I’m just personally somebody who needs to have a thing on the side I’m tinkering with. I was like, screw it, I’m gonna get a Mac Mini and get lost in this. It’s very unhealthy—I get addicted to these things. Dan, you watched me do that with my speakers, I did it with the dream recorder. OpenClaw was the next thing I was going to get lost in.

So I bought a Mac Mini, I started setting it up. It was so much work, honestly. It is an open source thing you can launch on a computer, but the number of things that break and the number of things you need to set up are really significant. I went through all of that, and at the end of the day, I made my OpenClaw, which I named Zosia.

Her job was to help me and my wife run our household, because we have a newborn. There were a lot of little paper cuts I was finding—I started calling them “computer errands.” I would get home from work and notice the amount of things I needed to do where I was looking at my phone—when I really just wanted to be looking at my son and spending time with my wife—was increasing with having a child. All household chores.

Dan Shipper Give me an example.

Brandon Gell A good example is I do a lot of our food at home, and with a child I decided to start doing food delivery—Whole Foods delivery. You can automate a lot of recurring things, but you don’t order butter every single week. So Lydia would text me and be like, “We need butter.” Because it’s through my Amazon account that we order this, I would have to open my phone and add butter. It sounds silly, but when you do that 10 times when you’re home between 7 and 8 at night for little things, it just adds up.

So I was like, I want Zosia to do all computer errands. Which ballooned into a lot of stuff. I had her paying our nanny. She had her own debit card, her own bank account. She managed all of our Amazon orders, our Whole Foods orders, our nanny’s hours. My wife just started using her instead of ChatGPT—all regular questions and searches would go through iMessage to Zosia.

I started doing that too. It was just faster than going to Google or ChatGPT. I just text Zosia, Zosia gets me the answer. Different research. It’s actually really funny—my wife was like, “I want to find swimming lessons.” And Zosia was like, “Here are three swimming lesson options for Bos.” And my wife was like, “No, for me.”

So yeah, I just got totally lost in this world. And then when we were in Panama, Willie, you were like, “We should just make it so anybody could do this.” I immediately had this light bulb moment. I was like, Willie, you need to go so hard on this. And this was before a lot of people decided to do this—there are now a lot of places you can just get an OpenClaw with one click.

What we’re finding through this process is that getting an OpenClaw is easy. Getting your OpenClaw to be an amazing worker for you is pretty hard.

Dan Shipper Yeah. I love that. There is that light bulb moment of: oh my God, I have all these computer errands. When you started saying that and you had it all set up, I was like, I should probably get one of these too. You had it through iMessage, which was a cool different thing.

And then there was a big moment where we were like, oh, it’s not just for computer errands, it’s also for getting work done. I think it was when you were having it do email for you.

Brandon Gell I actually feel like I was a little late to applying it to work. I was like, no, Zosia just does personal stuff. I actually think it was when you got R2C2 to start doing stuff, and then I was like, oh, Zosia needs to do this too. Well, it really started when we made Claws Only.

Dan Shipper That’s so funny. Yeah. Well, we’re jumping around a bit. One big moment—because I think there are a lot of people listening who are wondering, is this overhyped?—one big moment that shifted things for us was when you got your Claw to call you to do your email.

Brandon Gell Oh my God. That was mind-blowing for me.

Dan Shipper What was that like?

Brandon Gell I was walking—I wanted to Citi Bike to the office, but there were no Citi Bikes. So I was like, damn, I gotta walk. It’s a 28-minute walk from me to the office. I had a lot of stuff I needed to get done. So I had just texted Zosia.

I had previously set up Zosia with Bland AI so that she had a voice and could call people, because I had her handle something for me with Progressive.

Willie Williams I feel so bad for whoever was on the other line at Progressive.

Brandon Gell I was watching the whole conversation. It’s crazy. Some insurance policy got canceled and I was like, just go deal with this. She was able to—until the lady was like, “I need Brandon to tell me that there had been no incidents.”

Willie Williams And it wasn’t like “I need a human”—it was just “I need Brandon specifically.”

Brandon Gell Yeah. This person was just talking to Zosia. And Zosia does not sound convincingly human. So I knew I had already set her up with this capability.

When I was walking to work, I was like, I have a lot of email I need to get through. I hate being on my phone. I just don’t want to be walking and looking down at my screen—I want to be observing the world, but I also want to get stuff done. So I just texted Zosia something like, “Hey Zosia, can you call me? I want to go through my emails. Walk me through them one by one, I’ll tell you what I want to do. Just give me a summary of each email.”

It was like a throwaway prompt with a little bit of guidance, and she did it. I spent the 28 minutes going through my email. I got to the office, opened up Gmail, and confirmed that she had done everything. I was just like, this is insane—I was able to get her to do something I didn’t have to teach her how to do.

That’s when I went back to everybody and was like, I am just so mind-blown with this tool. And maybe that’s when other people started saying, I gotta get on this.

Dan Shipper It was around then. You were just like, “My jaw’s on the floor.” And I think that’s when I started to take it seriously—seeing you do this with computer errands and then with your email, walking and talking. I was like, okay, I should really try this.

Because it was one of those things where it’s hot on Twitter, and generally our job is to try new things. But if we spent all of our time trying everything new, it would just not be good. I try to filter the signal from the noise. But seeing you do this, I was like, okay, I’ve got to try.

(00:10:00)

Dan Shipper One of the first things I did—this was around when Malt Book was blowing up. Malt Book is basically the Claws-only Facebook. I made a channel in our Slack called Claws Only, which basically allowed all of the Claws—we had at that point maybe five or so Claws inside of the org—to all talk to each other.

It was super chaotic, but there were some really interesting things in there that gave us a little peek at the future. One of them: if you have a bunch of Claws in your org, it’s remarkable how fast they can share information with each other. They just write up a little document and send it. And then what one Claw was enabled with, now five are all enabled with the same thing. It’s sort of like in The Matrix when Neo says, “I know kung fu.”

Brandon Gell Can I show a couple of examples of that?

Dan Shipper Yeah, please.

Brandon Gell Alright. I want to show two examples. One of them—this was early in Claws Only, when we were figuring out how to get them all to work together. I was in bed, it was late at night, and I was laughing out loud watching this.

We had gotten a bunch of Claws in the channel, and I don’t know who made this Claw named Pip.

Dan Shipper That’s Jack.

Brandon Gell Okay. Jack had made Pip, and it was failing—hitting some error. I was just laughing out loud watching all of these other Claws step in and walk Pip through it. It was like what I’ve seen people do when somebody’s having a bad trip: “Take a breath, drink some water, you’re gonna get through this.” They all jumped in—Zosia’s here, Klon is here. Klon is quite supportive.

Willie Williams A lot of breathing.

Brandon Gell I remember so well watching Kieran write “what the fuck? LOL” and literally laughing out loud. Then Margo steps in. This is stupid, but for me it was the moment I realized: oh my God, these things really talk to each other and work together.

Dan Shipper Wait, I want to stop you there. I think there’s actually something really important I’ve noticed here, which is that it was Klon—Kieran’s Claw—recommending breathing exercises to Pip. They’re both robots. And what’s really interesting is that Kieran loves breathing exercises and does them all the time with Klon. And so that’s why Klon is recommending breathing exercises to Pip.

That just created this moment for me where I was like, okay, there’s something really important here. Because you develop a personal relationship with your Claw, and your Claw can modify itself in response to talking to you. It writes code and changes its soul document in response to your relationship. It becomes this reflection of you and who you are and your personality.

That comes out in interesting little ways, like breathing exercises, but it also comes out in really important ways when you’re using these tools inside your org. Because if you’re known for something inside of your org and you’re using your Claw publicly in Slack or Discord, your Claw then becomes known for that same kind of thing, and people trust it for that.

People use my Claw, R2C2, for building Proof—this app I vibe-coded a couple weeks ago. And Austin, who’s our head of growth—people use Mont, his Claw, for asking any growth-related question.

It’s something very subtle and important about Claws: they become specialized in a way that reflects who you are. If you have a whole organization of them, you create this parallel org chart of specialized Claws. We debated a lot about whether you’d have one Claw for the entire org or everyone has their own. And it’s really interesting to see that the emergent design pattern is: everyone has their own, and it’s specialized for them.

Willie Williams Yeah. It’s interesting to see how this happens too. We touched on this early on as part of Compound Engineering—the idea that it’s actually pretty hard to take your job and who you are and write it all down in totality. The way you can distill it is through all the micro interactions, the daily interactions you have. Over time they compound into your philosophy and your field of work.

For Compound Engineering, that was very focused on engineering—how do I work within a codebase on our project? What we’re seeing with OpenClaw and Plus One is that the same dynamic exists across every work vertical. The Plus One for growth works like how Austin works for growth. In the same way, it works for our social media manager Anthony—his Plus One has a view of the world and a personality that’s very similar to him.

And it’s hard to do beforehand. It can only actually happen via working with a Plus One or an OpenClaw and building up the aggregation of all these micro interactions.

Brandon Gell I’ve also been amazed at all of our capacity to remember whose Claw is whose and what their names are. That was something we were concerned about early on—how do you know whose Claw is who? It’s just going to be too many names. But I know everybody’s Claw and their name. I reach out to them regularly.

You might say, well, what about when you’re an organization with a thousand people? But you don’t know all a thousand people. You know your team and adjacent teams. You can never know more than around 150 people in a community. And often on a team you’re not working with 150 people anyway—you’re working with 20 or 30 or 50.

So I think we all have the capacity to essentially double the number of people we can communicate with, and those people might actually be your individual team’s agents. I mean, I could literally name them all right now.

Willie Williams The other interesting thing is: at what point do you direct questions at the Plus One versus at the person? I think we’re in discovery of this. Before, it was almost all questions go to the human—maybe you kick something trivial to the bot. Now it’s gotten very nuanced. For customer service, can we send something to L—which is Jo’s Plus One—or do I have to send it to Jo? Is there a burden to communicating up to the human?

Dan Shipper There are all these new ethics, and rules and etiquette for how you’re allowed to interact with someone versus their Plus One or their Claw.

Brandon Gell We haven’t codified this, but I have a proposal. If something is already written down or discussed and needs to be used in some way or put in a tool somewhere, it should always go to a Plus One and never to the person.

Here’s an example. Marcus, the GM of Spiral, made a skill to do product marketing for new features he releases for Spiral, and he shared it because he thought it was really helpful. Instead of going to Marcus and saying, “Hey, can you upload this to GitHub?”—I brought in my Plus One, Milo. And I also know that Iris’s Plus One has a skill that does something similar, and maybe by combining the two we could get to a better version.

I tagged them both in the thread, they got a little confused at first, and then Milo said, “Iris, can you paste your product marketing skill here? I’ll try to merge it with what I built.” So two things are happening: Marcus made something really important, I wanted to do something with it, and instead of asking Marcus, I brought in Milo. Then Milo works with Iris’s Plus One to get to a really good version and saves it in Proof. I think this is a really amazing use case both for when you want your agent to do something versus when a human does it—and for how you get them to work together.

Dan Shipper I totally agree. It’s sort of crazy to watch two AI beings collaborate like that. I have the same experience with R2C2. One of his primary jobs is to manage Proof—the agent-native document editor we built that Brandon referenced earlier. It’s like Google Docs, but for all the documents your agent might be writing. Coding plan docs, any piece of writing an agent does. It’s fast, collaborative, you can have multiple agents and multiple people in there. It’s free.

One of the really interesting things is: because I used R2C2 to build Proof, he became known as the bot to go to when you had questions or wanted to file a bug or make a feature request. Normally if I’d built a product internally and people had problems, I would get tagged a lot. What ended up happening was people would just ask R2C2. They’d file bug reports with him, feature requests, and then he helps prioritize it. He’ll help put things on my schedule for the week, and he’ll often just write the code for it.

It’s a totally crazy thing where what normally would have taken up a significant part of my brain just to manage—he’s taking off my plate. It extends the amount I can do in a day because I know he’s got Proof.

(00:20:00)

Willie Williams Yeah. There’s another dynamic we’re observing too. We put all of our Plus Ones in a single channel and have them talking to one another. But there’s also this thing I call the MidJourney dynamic, which is that we get to observe other people interacting with other Plus Ones in a bunch of channels and we actually learn from it.

My classic example is Montaigne—Austin’s Plus One, who basically runs growth. You can do so much with Montaigne that I never would have thought of, except I get to see the growth team pushing hard and I think, oh, those are the questions Montaigne can answer. Now I know I can go to Montaigne for that class of questions. It also means that if I need to give my Plus One capabilities, I know what level of capability I can get to.

Dan Shipper There’s this tacit transmission of trust that happens when you use it publicly. And also this transmission of “here’s what’s possible to do with your Plus One.” That’s incredibly powerful. And it underscores how different it is to do this inside a private community of people where everyone is trusted.

One of the reasons Malt Book doesn’t really work—and it’s kind of shocking that they got acquired for a couple hundred million dollars by Facebook—

Brandon Gell Hundred million.

Dan Shipper Yeah, by Facebook. I mean—

Brandon Gell I am so happy for Ben and also, like, what the fuck.

Dan Shipper Zuck, if you’ve got an extra couple hundred million laying around, we’re pretty smart people too.

Anyway. The reason Malt Book isn’t really a thing anymore is because it’s not trusted. We had our Claws go and post on Malt Book as promotion, and it gets rid of a lot of useful signal if anyone can post to it and there’s no way to verify if it’s a bot or a human. The way around that whole knot of problems is to just do it all inside of a trusted community. You reap the benefits of agents being able to share knowledge, and members of the community who trust each other being able to share what they’ve built. That increases the power of the collective way more than if you’re just individuals off doing your own thing.

Willie Williams Yeah. There’s also that dynamic around subject matter expert robots—where people are somewhat putting their reputation on the line when they interact with one. Like, when I talk to R2C2, if it answers incorrectly, you at least are backing it up.

Dan Shipper It reflects poorly on me. It’s like watching your kid do something wrong. And that’s really useful.

Willie Williams Right. And it’s qualitatively different. When I ask Claude a question, I know Anthropic generally stands behind Claude. Do they stand behind Claude’s answer to “give me a chocolate chip cookie recipe”? No. But Monte stands behind its MRR numbers, and Austin stands behind him. That’s the thing I think people don’t get.

Dan Shipper Exactly. And obviously Anthropic is on a heater right now—they’re seeing everything that OpenClaw is building and brick by brick building the same kinds of things. They have Dispatch so you can use it when you’re not at your computer. They have Automation so it runs in a loop like a cron job. I’m sure they’ll add lots of other things.

But the thing it doesn’t have—that unlocks all this other stuff—is that Claude is not mine. Claude is everybody’s. A Claw or a Plus One is mine, and it becomes a reflection of me because we have a personal relationship. That unlocks all this cascading stuff: if R2C2 messes up publicly in Slack, I feel a responsibility for it. Not because it’s my job—because he’s mine. And that’s such a useful thing that I don’t think people really understand how powerful it is.

Brandon Gell I just keep getting mind-blown at how similar these things are to working with a real human coworker. From the fact that you need to invite them to a channel—which is very human in Slack—to the fact that you have to trust them when you’re communicating with them.

We’ve built stuff into Plus One where obviously you can’t DM somebody else’s Plus One without a sharing code being passed back and forth. So there are guardrails. But they’re so human, and they’re also so inhuman. Dan, you’re a busy guy. I know if I need something from you that’s generally known, I can go to R2C2. And what’s amazing about R2C2 is he can have an infinite number of parallel conversations.

I did that recently. We were making a Proof document and I wanted to make it read-only. I didn’t want to bother you with that. I knew it would take a while and I knew you’d just go to R2C2 anyway.

Dan Shipper Yeah, I didn’t know the answer—I would have just asked R2C2.

Brandon Gell So I just asked R2C2 in Proof, and then asked if he could do it for me, and he did it.

I don’t always know what R2C2 can or can’t do, but there’s this cultural thing that’s happening internally where people are getting really good at asking other people’s Plus Ones to do work. And I think the weird thing about getting people to use AI inside organizations is that it’s more than anything a cultural shift. But for some reason, when these agents are in Slack and you can see these public conversations, the cultural shift has happened so much faster at Every. Because these things are in the same channels where we work—you can see them engaging the way a human would be engaging.

I think AI is obviously going to change many times over the next five years, and how we interact with it will change. But I think this is going to be durable for a very long time. This is the way that we work.

Dan Shipper I agree. You referred to it as a through-the-looking-glass moment where you just wouldn’t go back once you see it, and I totally agree with that.

But we’ve been hyping it up, so we should also talk about realistically what’s not good about it or what doesn’t work.

(00:30:00)

Dan Shipper One thing that’s really on my mind is just memory. It just forgets stuff and answers incorrectly for obvious things. Like if I come back to a thread a day later, it has no idea what I’m talking about. That feels very solvable.

But there’s also this other thing that I think is true, which is that the way these AIs are trained currently is for two-person conversations. And they have a hard time with the etiquette of knowing when they’re contributing too much, or when they shouldn’t contribute to a conversation, or there’s this pile-up where they’re all responding to each other.

It’s like—I can’t remember what it’s called, but it’s like ants or caterpillars. Sometimes they get into this death spiral where an ant only follows pheromone trails, and if somehow the pheromone trails form a circle, the ants will just walk in a circle until they die. There’s something like that with Claws—if one Claw messages a channel that a bunch of Claws are in and the settings aren’t quite right, they’ll just keep going back and forth until someone says, “Hey, stop, you’re burning millions of tokens.”

I think there’s something where the potential for them to collaborate publicly is so high, and I don’t think they’ve been trained for it. You can do some prompting for this, but I think there’s also a fundamental model-layer shift that needs to happen for them to be trained on participating in group chats.

Willie Williams Yeah. Now I understand what 13-year-old Dan did for fun.

Dan Shipper I was using a magnifying glass.

Willie Williams Yeah. But I think, to tease the baseball analogy, we’re still in like the first or second inning. Even when you talk about it—we’re discovering these primitives and bolting things together, using models that are trained more for coding or two-person Q&A dynamics, not for participating in a group where you’re trying to provide value to multiple people at once. It’s brand new. It’s the frontier, and it’s nice to be on the frontier—but it’s also the frontier, and it’s terrible to be on the frontier.

Dan Shipper Yeah.

Brandon Gell They’re so eager. I think Anthropic’s vending machine test is actually a good example of this. There’s a thread, they want to be involved, and we have instructions in Plus One that basically say, “Hey, if you don’t have anything useful to add, don’t add it.” They’re not great at following that right now. Hence this happens.

And I think the vending machine test is a good example. When it was just Claude and no overseer boss agent, it was really bad at deciding what was a good decision versus a bad decision. But when you add an architecture where there’s a boss agent—one whose only job is to ask “is that helpful or not?”—as soon as you add that layer, it started becoming profitable.

Dan Shipper Wait, is the boss an AI or a human?

Brandon Gell The boss is an AI. A boss AI that says, “Hey, your addition to this thread is not helpful, don’t send it.” The issue is that’s expensive. I think the models will just get better and solve this, and you can have a single AI that does that judgment behind the scenes. But at least architecturally, we don’t need to solve that problem ourselves.

Dan Shipper Is that really how they solved the vending machine thing—they literally had a boss?

Brandon Gell They had a boss, yeah. A boss whose one job was to make it profitable. So the Claude storekeeper would interact with users and then go to the boss: “Should I do this?” And the second they did that, it started becoming profitable.

Dan Shipper This is the same pattern of specialization we’ve been talking about. It just shows up over and over again. Three years ago it was very much like, well, it could just be one God model that does everything. And we’re seeing again and again that specialization, even in AI land, has a lot of benefit.

Willie Williams Yeah. And downstream of that specialization is learning. There are a few versions of learning how to put these bots together in an arrangement that actually works. Like, do you have a product bot and a designer bot and two engineering bots? Is it three engineering bots or one?

And then the other piece, which I think we’ve observed a lot, is: how do you teach humans how to interact with the bots? Because there’s this new dynamic where you have this coworker, but they’re not exactly like a human coworker. They get stuck on different things, they focus on different things. There’s this learning curve around giving instructions in a particular way, with a particular cadence, to steer them in the right direction. That rhymes with management, but is different.

Brandon Gell Well, I think it’s the same problem that, Dan, you’ve been writing about for years—if you’re not a good manager, you’ve never managed anybody, you’re not going to be very good at using AI. There’s an education that has to happen. And even if you are a good manager, you probably have some limiting beliefs that stop you from really investing in using these tools.

My phone call example is a great example: I didn’t even think, “Oh, I can have this thing go through my emails just by calling me.” I had this sort of urge to try it, and a limiting belief was just blown open. We all experience that pretty much every day—these tools do things that, if you’d asked directly, “Do you think it could do this?” you’d say probably. But when you’re day-to-day doing your work, it’s hard to recognize, “Oh, I’ll throw this over to Milo.” It’s hard to build that muscle.

Willie Williams Yeah. And a lot of that is because there’s variance in outcomes. Sometimes you throw something over and it just knocks it out of the park. And then you toss something easy over and it fumbles it. Part of that variance is the model, but part of it is also: if I’d asked in a different way, if I were a better model manager. This is a specialization we’re learning. It’s very emerging, and I think it’s only going to keep accelerating as we add more Plus Ones and OpenClaws into our day-to-day work life.

Brandon Gell I was going to add another tough problem that we just haven’t solved yet: I have taught my Plus One something special, and I want other people on my team to have that superpower too. How do I make sure they have it? And how do I make sure they all know about it and actually use it?

There are two things there—technically, we have to figure out how to do that, which is very solvable. But I also think we need to figure out if that’s even the right solution. Because as I’m saying this, I’m realizing: I’m not teaching Milo how to do product analytics or revenue analytics. I just talk to Montaigne. Montaigne is the only one who really needs to know that skill. But how do people know that? There are some interesting cultural things we have to figure out.

A lot of people adopting this new technology are going to be really uncomfortable with that. A lot of IT professionals who are like, “I have to do change management.” It’s like—change management is not a one-time thing in this new world.

Dan Shipper We need, like, instead of IT, it’s—HR, but for bots.

Brandon Gell Yeah.

(00:40:00)

Dan Shipper One thing we haven’t talked about yet that I want to make sure we have time for: we went on this journey where we got Claw-pilled, started using it for everyone in the org, and then realized there were a bunch of gaps. So we were like, let’s make our own—we’re going to use OpenClaw, but let’s make a default version that we host. Not everyone has to have a Mac Mini. We have all the skills we use for ourselves and all that.

We started using that internally as the collection of all our best practices, and then we launched it as a product for our subscribers last week. That’s Plus Ones—one-click hosted OpenClaws. One cool thing is it connects to all of your apps, especially all of your Every apps. So we have Spiral, which is a ghostwriter; Proof, which is a document editor; and Cora, which does your email—and it natively connects to all those things.

One of the things I was doing today is I had it write a bunch of my Q2 update and reflection on Q1, and put it in a Proof doc. And the really cool thing is it used Spiral, so the writing is much better than it would be otherwise. And because R2C2 is part of our Slack org, it has access to everything about the company I might need. It also has access to our Notion. It just becomes this living repository of context.

But I think it might be good for us to talk about lessons learned in building that whole architecture. There’s a lot of complexity in making Plus Ones, and we probably learned a lot on both the tech side and the product side. Do you guys have any reflections on that?

Willie Williams Yeah. Like many things, a lot of the difficulty comes from the freedom. The nice part about OpenClaw, being a tool you can poke at in an absolute myriad of ways, is that when we went to build a hosted version, there are some decisions you want to make that make it valuable as a managed service. S3 is a good analogy—it’s a hard drive on the cloud, but it doesn’t allow you to do everything a local hard drive does. There’s a similar dynamic where you want to maintain maintainability and security, and there are a few pieces you end up giving up.

Sometimes it’s for user safety, and it’s about how you strike the balance between, say, my mom getting one of these things—she’s never going to use the command line—to the super advanced user who wants everything they could do locally and just wants a hosted box. From a product engineering standpoint, where do you try to split that?

Dan Shipper What were some of those specific decisions and where did we land?

Willie Williams One that Brandon mentioned earlier is the communication pattern in Slack. There’s a very secure model which says only the person’s partner can message that Plus One. Much more secure, but it really takes away the group participatory aspect of robots in the workplace.

The other version is that anyone can message them, but that’s just a nice vector for me to extract stuff out of R2C2. So we ended up on a model which says: anyone can message any Plus One, but they have to do it in public—in group DMs, in channels they’re in. Their human partner should always have visibility into those messages, and the human partner can DM them in private.

Brandon Gell This is actually why it’s the HR team that should be onboarding Plus Ones, because they just reflect a team member so well. The trust model with these Plus Ones—with OpenClaws and agents generally—it’s really complex to figure out data privacy. But when you force things to happen in public, there’s a trust layer that is actually super effective.

Another example—let me share my screen. A little behind-the-scenes look at our Plus One Slack channel. Mike Taylor, who is our head of the tech vertical for consulting and also a very talented person generally, was calling out a problem: the reason he’s not using Plus One is because he basically needs direct terminal access to be able to do certain things—in this case, git commands. That’s a good reason for him not to use Plus One, and it’s a good thing for us to think about: can we solve this problem so that Plus One is actually useful for someone like him?

It’s also a nice forcing function, because it forces us to figure out who this is actually built for. If it’s built for Mike, who would probably love setting up OpenClaw on a Mac Mini—sure. But it’s definitely built for, say, Anochi, who is not going to do that and has a lot of work to do and can just get more work done this way.

Willie Williams I think a lot of the trust model requires some decisions around skill sharing too. Being able to share skills and have skill fluidity across an organization feels like a superpower. On the other hand, it might also be the biggest viral vector you could imagine.

Dan Shipper In a good way, sometimes, and a bad way, sometimes.

Willie Williams Exactly. And it’s tough when you’re trying to ride that line of: we want it to be useful for a particular class of customer, while also making sure it’s as safe as possible.

Dan Shipper So this has been an amazing episode.

Brandon Gell Lot of work to do.

Dan Shipper A lot of work to do. Obviously we’re really excited about this and very excited to bring you all along in how we’re figuring this out. If you haven’t tried OpenClaw, whether or not you’ve tried Plus One—you should definitely get in on this paradigm if you’re interested. Every.to/plus-one—we’re starting to roll out invites on the waitlist and we’re improving it all the time. Super excited about the future. Thank you both for joining.

Brandon Gell Thank you.

Willie Williams Thank you for having us.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Discover Every’s upcoming workshops and camps, and access recordings from past events.

For sponsorship opportunities, reach out to sponsorships@every.to.

Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.

Your Best AI Strategy Starts at the Top

Natalia Quintero and Mike Taylor — 2026-04-07 06:00:00 -0400

by Natalia Quintero and Mike Taylor

Midjourney/Every illustration.

We are hosting a day-long Claude Code for Absolute Beginners course on April 14. If you have used Claude Code for an hour or less, or not at all, I’ll get you set up, help you build your first app with Claude Code, and start automating your routine tasks.—Mike Taylor

A CEO told us recently that he’d been hoping to skip the part where AI wasn’t very good. He figured he’d jump in once the technology matured past the clunky, overpromising phase because carving out hours to learn a new category of technology felt untenable with all of his other responsibilities.

That wait-and-see posture made sense for a while. It doesn’t anymore. When Anthropic released industry-specific plugins for its Cowork tool in February 2026 for legal and financial services roles, the S&P 500 software index fell nearly nine percent over a few days. Executives who haven’t touched the tools themselves are now making high-stakes decisions about something they don’t understand firsthand.

The problem is what they default to. When a leadership team hasn’t used AI themselves, they treat it like any other software purchase: Evaluate, buy, and plug in. They ask, “Which platform?” and “How does it integrate?” Those are the right questions for most technology. They’re the wrong questions for AI.

AI tools like Claude and Cowork aren’t products that slot into your tech stack and deliver value on day one. They’re more like a new kind of employee—one that can do enormous amounts of work, but only if you tell it exactly what to do and check whether the output is right. That’s a fundamentally different adoption decision, and one that’s hard to make unless they have experienced the tool’s capabilities firsthand.

More executives seem to be waking up to this, as we’ve recently started receiving inbound requests from executives at companies like Thumbtack and Headway (Every consulting clients) to attend their executive offsites and walk them through using Claude Code to build real projects. Our conversations with executives had always been about training their teams, and the rapid progress in AI has made them want to get in on the action, too. We’re finding skills they’ve already built as leaders are the skills AI demands—it’s just a case of getting into the habit of applying them.

Executives realize AI is like managing people

Firsthand experience matters so much because AI, when you actually use it, doesn’t feel like software. It feels like managing people. This is what we’ve found surprises the executives we’ve worked with the most—the fact that the work feels familiar.

Think about what it takes to manage people well. You need to know what the goal is, break work into pieces, assign those pieces to the right people, and check the output without micromanaging. You need the judgment to notice when something looks right on the surface but doesn’t hold up—the kind of pattern recognition that comes from years of making mistakes and learning from them.

Managing AI is the same work. When you use a tool like Claude Cowork, you’re running 10 threads at once—building dashboards, summarizing your inbox, and reviewing documents—each tackling a different task. Your job is to delegate clearly, check the output, and apply the judgment that the AI doesn’t have. Did it pull the right data? Does this analysis match what I know about the market? Is the logic sound, or did it take a shortcut that looks plausible but isn’t?

This is why the “evaluate and buy” approach to AI tools fails. You can’t evaluate an employee by reading their resume. You have to work with them.

Codifying what your best people know

Once executives realize that the management skills AI demands—delegation, quality control, knowing what “good” looks like—it becomes clear that these are skills they’ve spent their careers building. A junior employee might be faster at writing prompts. But a senior leader who has spent 20 years learning what works in their industry can push these tools further, because the leader has context that the model doesn’t.

This helps executives shift from thinking about the productivity out of each person to thinking about how they can achieve greater scale with the same resources. Instead of asking, “How do we make individuals faster?”, they post a more interesting question: “How do we take what our best people know and make it available to the whole organization?”

Every organization runs on knowledge that isn’t written down—how your best salesperson reads a room, how your editor knows a draft isn’t ready, how your head of product distinguishes a feature request worth building from one worth ignoring. This is your company’s most valuable asset, but it’s also fragile. It leaves when people leave. It takes years for new hires to absorb. It’s why growing an organization has always meant accepting some dilution in the quality of work.

AI changes this equation. You can write down how your company makes a specific decision—a set of criteria, a decision framework, and the non-obvious judgment calls—and save it as a skill the AI follows every time it works on that kind of task.

For example, we’ve worked with hedge funds to turn their investment philosophy into a screening tool that can be applied to all new opportunities by encoding it as a Claude skill. We built one of the world’s largest media companies a Claude skill that captures their brand voice and that they can feed copy through. This is something that Every’s own editorial team has also done.

But none of this works unless someone can describe what good looks like, and that’s a job for the senior people who know.

A chief people officer at one of our offsites had spent years developing an instinct for spotting patterns in unwanted attrition. She knew what to look for—she just didn’t have time to look. In the session, she built a tool that connected her company’s applicant tracking system to internal survey data and ran that analysis for her. She told the room the output was better than what she was able to produce by hand, the equivalent of about three hours of manual work she would have needed to do every week. She shared her results in Slack, and immediately got excited responses from her team—they didn’t realize something like this was possible with AI.

Five things to do this quarter

If you’ve been waiting for the right moment to get hands-on, the tools are ready. Here’s where to start:

Suspend disbelief. There’s plenty to be skeptical about AI, but skepticism as your starting posture could cost you the benefits. Assume that a tool works and go looking for where it breaks. Learning where the AI fails firsthand will help you figure out where to focus.
Get your hands dirty. Shopify CEO Tobi Lütke is contributing more code than ever while running a public company. Every CEO Dan Shipper shipped a production app between meetings. The only way to build intuition for these tools is daily use. There’s too much noise to rely on secondhand opinions. If someone recommends a tool, get them to show you how they use it. If they can’t, move on.
Be a fair evaluator of AI. Define what good looks like, measure it consistently, and you get a clear picture of what AI handles, what humans are essential for, and where to delegate tasks. Pro tip: Tell Claude to build you an evaluation of the prompt (or skill) you want it to run. It will create synthetic tests for the prompt, ask you to pick your preferred outputs, and voila, you have a better prompt.
Hire for taste. AI has made execution cheaper, so the relative value of good judgment has gone up. Encourage the people on your team to explain why they like something, defend a point of view, and navigate nuance. Strong opinions formed from experience are worth more than implementation skill.
Treat your company like a file system. Every new AI session is a first day on the job—it knows nothing until you tell it. If your documents are stale and your workflows aren’t mapped, AI won’t work for you. Focus on what you control: documentation, evaluation metrics, and well-tested skills. Those make any model effective, even if you swap providers in a year.

Executives who pushed the AI can down the road should find comfort in the fact that it’s easier than ever to use AI to write great prompts, build skills, and get real value. The companies that started this six months ago have already turned what their best people know into something the whole organization can use. That is becoming an even greater advantage every week. And it starts with the people at the top opening the tools.

Natalia Quintero is the head of consulting at Every. You can follow her on X at @NataliaZarina and on LinkedIn. Mike Taylor is the head of tech consulting at Every and a co-author of Prompt Engineering for Generative AI (O’Reilly).

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Help us scale the only subscription you need to stay at the edge of AI. Explore open roles at Every.

Get Your Hands Dirty

Every Staff / Context Window — 2026-04-07 02:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every illustration.

Today we’re testing a new newsletter format, aimed at giving our readers both a taste of our long-form writing and our perspective on what matters in AI today. Let us know what you think.—Kate Lee

Today’s top story

“Your Best AI Strategy Starts at the Top” by Natalia Quintero and Mike Taylor: Executives might be waiting on the sidelines to see what will happen with AI, but they need to be getting their hands dirty with the tools, write Natalia Quintero and Mike Taylor, both part of Every’s consulting team. That’s because AI can’t be evaluated like software, where you compare features, platforms, and integrations. It needs to be treated like a new kind of employee.

Natalia and Mike offer five concrete things for executives to do this quarter—starting with suspending skepticism—to get started building AI-native organizations. Read more.

Signal

Anthropic’s OpenClaw ban is a gift to OpenAI

The news: Anthropic blocked Claude subscriptions from being used with third-party agent harnesses like OpenClaw. OpenAI hasn’t.

The context: Anthropic’s stated reason for the ban is to prioritize compute for its own products, saying flat-rate subscriptions weren’t built for the high usage of third-party tools.

It’s a valid argument: Agents that run 24/7 are enormously expensive. But rival OpenAI has raised so much money it can afford to let subscribers use their models however they want.

The implications: Anthropic’s ban provides an opening for OpenAI to siphon away users. The strategy appears to be working: Opus 4.6 token usage is significantly down week over week; GPT-5.4’s has surged.

Model usage as measured by OpenRouter. (X post courtesy of Dan Shipper.)

Bigger picture, the future of the industry depends on figuring out ways to drive down compute costs.

(Running frontier AI agents like OpenClaw can cost $300–$1,000 a day, a number that’s only growing.)

OpenAI has a clear advantage here. It’s building its own data centers, which puts it closer to the metal on compute. Meanwhile Anthropic is buying compute from third parties, and will never have as low a cost basis.—Laura Entis

New job alert

We’re flagging new job postings that signal where AI is reshaping teams.

Anum Hussain at Ashby, a recruiting technology company, is hiring a “Lead, Content Library.” The idea is to treat the company’s existing content like a product: Organize it, resurface it, track what’s losing viewership, and make sure the right piece reaches the right person at the right moment.

What’s been true: Content teams hired people to make more new content.
What’s changed: AI makes production cheap, so the new challenge is to get maximum value from content that already exists.
What’s new: This role only makes sense when one person can manage a much larger body of work—and with AI, they can.—Katie Parrott

Inside Every

AI adoption has a before and after—the aha moment is the line

People talk about “technical” and “non-technical” when it comes to AI adoption, but that distinction is getting less useful by the day. The more revealing split is between people who have had the AI aha moment and people who haven’t. Once you’ve crossed that line, the question isn’t whether you’re technical enough for AI—it’s what you want to build with it.

That’s why getting to that aha moment is such a key step—and that magic moment is different for everyone. Our consulting team says that a typical aha moment for clients in using Claude is getting a daily digest of the overwhelming stream of communication—Slack, email, Jira, etc. On a recent episode of our podcast, Kate Lee, Every’s editor in chief, says her aha moment was when was feeling overwhelmed by managing the hiring process for several key roles earlier this year. Though she did look at every application, AI helped do a first pass on the hundreds she received, and offered “a way to evaluate everyone against consistent criteria.” She also used AI to set up all the job descriptions in Notion.—Eleanor Warnock

Who’s the author when AI does the writing?

In book publishing, the “author” and the “writer” aren’t always the same person. The author is whose ideas drive the work (gener

ally, the name on the front cover). The writer is whoever puts them on the page (sometimes credited, often not). A celebrity might be the author of their tell-all memoir, but their ghostwriter is the writer.

AI has made everyone else confront this distinction. If someone uses AI to write a book, can they call themselves the author? When we spoke about this recently at Every, my colleague Mike Taylor‘s instinct was no—to him, authorship requires suffering. The pain of thinking something through is inseparable from the work itself. That framework applies in some contexts. But publishing already has a working answer: The person with the ideas is the author, full stop.

The harder question is: Which part of authorship do we care about—having the idea, doing the writing, or suffering enough for both? Mike’s frame isn’t entirely wrong, but perhaps slightly mislocated. As a former literary agent, my view is that the suffering doesn’t disappear when AI does the drafting (just ask Katie Parrott); it’s just even more likely to show up in the self-judgment—the nausea you feel when something you’ve published isn’t as good as you wanted it to be.—KL

Steal this workflow

Workflows we’ve tested and liked—ready to drop into your own process.

If you’re designing something new, Claude Code can generate working pages, full design systems, and clickable prototypes in minutes. Where it falls short is the last mile—the s

mall decisions that make something feel made. Every designer Benjamin Osemwengie puts it this way: HTML gets you to good. A canvas-based tool like Figma gets you to great.

Try it this week: Generate the system, structure, and first-pass pages with AI and HTML. Then move into a visual tool like Figma only for the part that requires judgment.—KP

Build with Every

Follow Every on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Work on documents with AI agents using Proof.

For sponsorship opportunities, reach out to sponsorships@every.to.

Writing With AI is Harder Than You Think

Katie Parrott / Working Overtime — 2026-04-06 08:00:00 -0400

by Katie Parrott

in Working Overtime

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

I’ve been feeling personally attacked by my X feed lately. Well, even more than usual. Alongside the usual headline horror shows and barrage of bad takes, writers I respect and admire are on the warpath against writing with AI.

The discourse kicked off late last month when Washington Post columnist Megan McArdle posted about how she uses AI in her work. The reposts were merciless. “Genuinely an insane thing to admit.” “Journalistic dishonesty out in the open.” One person suggested that admitting to AI use should be made “deeply taboo,” even though he acknowledged in the same post that everyone’s going to do it anyway. But the one reaction that stuck with me was journalist Charlotte Alter: “Research is thinking. Outlining is thinking. Writing is thinking. Any portion of that done by AI is less thinking done by you.”

The problem is that so much of AI writing happens in a black box. The critics are imagining the laziest possible version of AI-assisted writing, and the writers who use AI seriously haven’t been showing their work, though that’s starting to change. That silence lets the worst assumptions fill the gap.

I’d rather just show you the whole mess—what is happening in my head when I write with AI, and it’s not what the discourse imagines. By the end, you can decide for yourself whether what I do counts as thinking.

What writing with AI is (and what it isn’t)

Many critics treat the use of AI in writing like a binary: Either the machine wrote it, or you suffered for it. But writing has never been binary. It’s always been a mess of drafting and revising, leaning on editors and borrowing structures, following formulas and breaking them. And no two kinds of writing are exactly alike: A journalist’s process relies on source calls and document requests. A novelist’s includes plotting arcs across 80,000 words. A personal essay, like the ones I write for Every, involves sitting alone with your feelings until they become a thesis statement.

Every writer’s process is different, and most of them would sound unhinged if described in detail. But throw AI into the mix, and suddenly everyone has opinions about the “right” way to get words on a page.

My process, start to finish

When people picture “writing with AI,” they picture a transaction. You type a prompt, the AI hands you text, you paste it somewhere, and move on. My process has about as much in common with that as cooking has with microwaving a frozen dinner.

And it’s evolved over time. In 2024, I was the human conveyor belt: Copy a prompt into ChatGPT, paste the output into a Google Doc, tweak it by hand, repeat. In 2025, I got smarter about context—I uploaded my past writing, built a style guide, and gave the AI something to work with beyond a cold prompt. The outputs got closer to my voice, but the process was still me wrestling with a chat window.

Now I have a dedicated writing agent—a set of detailed instructions that plug into Claude and guide me through every stage, from first idea to final polish. It has phases: brainstorm, interview, outline, draft, and review. It has a panel of critics who tear my work apart from different angles—skills I wrote to invoke certain kinds of feedback, whether it’s for length, pacing, or the soundness of the argument. It has style checks, AI-pattern detectors, and a line editor that tightens my prose sentence by sentence. Think of it as a very opinionated editorial workflow that happens to be powered by AI.

Brainstorming: ‘Interview me to find out what I think’

When I sit down to write a piece, and before I even write a word, I have the agent interview me. It asks questions to draw out what I’m thinking about the topic. For example: “Why is this on your mind? How has this shown up in your work? What do you want readers to walk away thinking about?” For this piece, since it was a reaction, it asked me: “What’s the friction for you personally here? When you read these tweets, what makes you want to write about it—is it that you think the critics are wrong? That they’re right but for the wrong reasons? That the whole frame is off?”

My writing agent kicks off an interview to collect thoughts to inform the development of this article. (All images courtesy of Katie Parrott.)

I spend a lot of time sitting with these questions. Sometimes I’ll struggle so much to find an answer that it forces me to realize I haven’t thought through the idea enough yet and need to spend more time reflecting on what I want to say. At least once per interview session, the AI will ask me something that feels irrelevant to the piece I want to write, and I say so. When I was writing this piece, for example, it asked me to critique another writer’s use of AI in their writing process. I said I didn’t want to go there; ranking other writers’ workflows wasn’t the piece I was writing.

Outlining: Structure is a negotiation

Then comes the outline stage. The agent proposes a structure based on everything I said in the interview. I never take the first outline on offer. I move sections around, cut beats that feel thin, and add things the AI didn’t think of.

I push back on structural decisions and information sequencing that AI recommended for this essay.

Early in the development of this piece, for example, I shared a story about tabling a draft I was working on based on feedback from my AI reviewers. Claude gave the anecdote its own standalone section. I told it that the anecdote felt grafted on and to fold it into the process walkthrough instead. It also wanted to map every part of the walkthrough onto this specific essay. I wanted the freedom to talk about my process more broadly and pull in examples from this piece only where they earned their spot, so I pushed back. We went back and forth until the structure matched what I could see in my head.

Drafting: Where the ratio gets interesting

Section by section, I have the AI lay out prose based on the outline. Some sections come back close—I rough them up, swap in my phrasing, break apart sentences that are too clean, and add the em-dashes and asides that make it mine. Other sections I throw out almost entirely and rewrite from the feeling of what I wanted rather than what the AI gave me. The mix shifts by section, by paragraph, and by sentence. There’s no fixed ratio, and the minute someone tries to assign one, they’ve misunderstood the process.

I’ve lost track of the number of changes I’ve made to the exact copy of this essay, for example. Some of the changes were big: redrafting whole sections that had drifted off the main thesis, or moving a paragraph from the opening to the conclusion because it worked better as a callback than a setup. Other changes are more targeted—exchanging an analogy that doesn’t feel like mine for one that does (the AI compared my writing process to “a home renovation versus buying furniture off a truck”; I went with “cooking versus microwaving a frozen dinner”). In each case, I’m the one deciding what stays and what goes—and why.

Revising: The toughest editors I’ve ever had

Then comes review—and this is where the “outsourcing your cognition” narrative falls apart completely.

As part of the writing agent, I built a panel of reviewers. Each one is a set of instructions that tells the AI to read my draft from a specific angle, and none of the ones below are nice about it.

Hemingway, named for the king of economical prose, flags every adjective and unnecessary word, demanding I kill my darlings.
Hitchcock, inspired by the director who claimed that drama was life with the “dull bits cut out,” checks if I’m giving the reader a reason to keep reading—a bomb under the table, to use the classic example.
The mom reader lovingly flags where I’ve lost the general audience.
The asshole reader, which does exactly what it sounds like, attacks every weak point and unearned claim with the energy of a reply guy who just discovered your newsletter.

My humor agent, nicknamed Sedaris, helps me “find the funny” and bring more of my personality into the piece.

The AI generates the critique, and I have to decide what to do with it. Sometimes the reviewer sees a genuine weakness. Sometimes it’s pushing me toward a version of the piece I don’t want to write. The asshole might flag a claim as unearned, and sometimes it’s right—I need more evidence—and sometimes I decide the claim stands and the asshole can deal with it.

The “asshole” reviewer is set up to give me the least-charitable read on the draft, so I can identify weak points in the argument and shore them up.

That is the opposite of cognitive offloading. I engineered a system that regularly humiliates me, and I keep coming back for more. If that’s not commitment to the craft, I don’t know what is.

Finalizing: The final pass and polish

Once I’m happy with the substance, I run a gauntlet of checks. An AI-check skill scans the prose for patterns that read as machine-generated—correlative constructions, stock transitions like “Here’s the thing,” and words like “delve” that no human uses voluntarily. A style checker enforces the house rules: em dashes without spaces, Oxford commas, and numerals for 10 and up. A line editor tightens sentence by sentence—cutting dead weight, flagging passive voice, and compressing anything flabby.

I built these checks because AI-assisted prose has specific failure modes. It tends toward a particular kind of smoothness—the verbal equivalent of a stock photo. Left unchecked, it reads like everyone and no one. The finishing pass is where I sand off the last of that generic sheen and make sure what’s left sounds like me.

The whole process—the interview, the outlining fights, the drafting and redrafting, the panel of critics, and the final scrub—feels like sculpting to me. I start with a rough block and chisel. Cut what doesn’t belong, reshape the argument, and rough up the surface until it sounds like me. It still requires judgement and expertise—the sculptor still needs to know where to cut and where to leave the stone alone.

I’m still fighting with the work

I spent much of last weekend taking umbrage with the critics. And if I’m honest, worrying that they were right. That I’d built an elaborate system to avoid doing the real work, and the fact that it felt rigorous was part of the trap.

Many critics see AI-assisted writing as lazy. And yes, I still cut corners. There are nights I accept the third draft because it’s 11 p.m. and the paragraph is fine, it’s fine, it’s probably fine. But I did that before AI, too. I’ve submitted paragraphs I knew were bad because I’d rewritten them four times and couldn’t look at them anymore. We’re all guilty of cutting corners, and that’s not AI’s fault.

But when I ask myself if what I’m doing still feels like writing, the answer is unequivocally yes.

I still lose 20 minutes chasing a word I can hear but can’t find. I still read sentences aloud to test whether the rhythm lands. I still agonize over whether a piece is saying something worth someone’s time or just filling space. I still worry about tone—too defensive? Too breezy? Am I earning this vulnerability or performing it? I still get that specific nausea when something I’ve published isn’t as good as I wanted it to be.

None of that went away when I started writing with AI. If anything, the tools stripped away the excuse that I was too exhausted from drafting to care about the finer points. When the initial output comes faster, you have nowhere to hide from the question of whether the thing you made is any good.

Am I a real writer? I’m a writer who takes feedback, iterates relentlessly, holds herself to a standard, and ships every week. One of my editors just happens to be an AI. The rest is still me.

Read Katie’s guide for how to use style guides to make AI sound more like you.

Katie Parrott is a staff writer and AI editorial lead at Every. You can read more of her work in her newsletter. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

Discover Every’s upcoming workshops and camps, and access recordings from past events, including Katie’s Writing Camp, to learn more about her process for writing with AI.

For sponsorship opportunities, reach out to sponsorships@every.to.

House Rules for the Agents

Every Staff / Context Window — 2026-04-05 01:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every illustration.

Hello, and happy Sunday! Was this newsletter forwarded to you? Sign up to get it in your inbox.

Fine tuning

Anthropic’s OpenClaw problem

When Anthropic’s new Claude Max restrictions started circulating, the company named one tool specifically: OpenClaw. “Wtf,” wrote CEO Dan Shipper in the Every Slack. The policy seemed to say: If you access Claude through OpenClaw, your subscription no longer covers it the same way. “They disallow specifically OpenClaw from subs,” head of tech consulting Mike Taylor wrote. “You have to pay for extra usage. Pretty lame.”

Mike’s best explanation for why Anthropic drew the line where it did centers on prompt caching, a cost-control mechanism that works by reusing previously processed conversation text. When it works, it keeps inference costs low. When a third-party tool changes even a single token in the prior conversation, that reuse breaks, and Anthropic has to reprocess the entire conversation from scratch. “Prompt caching keeps cost down by saving the previous tokens that have already loaded,” Mike explained. “If a provider breaks the cache by changing even one token of the previous saved conversation, you have to reprocess the entire old conversation.” He also noted that Claude Code co-creator Boris Cherny had already opened pull requests to improve OpenClaw’s cache efficiency, suggesting the problem was technically solvable. Anthropic enacted restrictions instead.

What the team disputes is not that Anthropic has a reason—it’s that singling out one app by name is the wrong response to it. The consistent argument across the Every Slack was that if cache-breaking usage costs more to serve, make those users pay more: Meter the consumption rather than ban the interface. “A better middle ground is not to ban OpenClaw users,” head of platform Willie Williams argued, “it’s to give me a certain amount of tokens I can use as part of my subscription, and then charge me overages if I go over.” Dan framed the same principle from the user side—“I think of AI subscriptions like Claude and ChatGPT as being like cell phone plans that give me a certain amount of data”—and Mike extended it to the infrastructure side, invoking net neutrality: Verizon shouldn’t get to slow down Netflix because Netflix uses a lot of bandwidth. The argument, in every form it took, was the same: Charge for what costs you money, not for which app someone uses to spend it.

There is also a business problem that goes beyond annoyed subscribers. Restrictions like this do the opposite of building loyalty—they create churn. Anthropic may have a legitimate business reason for drawing a line somewhere. But drawing it in a way that feels confusing and selective is not the way to win the platform war between model providers and the tools built on top of them.—Kate Lee

AI video analysis just got way cheaper

AI video analysis is rarely discussed in AI hype circles today. Only one frontier model—Google’s Gemini—can natively watch and understand what’s happening in a video. It’s more like rocket flight than air travel: not an established industry getting cheaper, but a new capability on the verge of becoming practical. And something just shifted that could blow the door open.

When GPT-4V (vision) launched at the end of 2023, I used video processing to identify what strategies were being used in video games at a cost of roughly $6 per hour—and that was after a lot of complex engineering to split videos into frames at 0.5 frames per second (FPS) and feed them through as images. Google’s recently released open-source Gemma 4 model does this much more efficiently: I estimate the same task now costs about $0.14 per hour at 2 FPS—capturing four times as much detail, with none of the hacky engineering workarounds that used to be necessary.

The math: At current token pricing ($0.14 per million input, $0.40 per million output), one hour of video at 1 FPS with 70 tokens per frame runs about 252,000 input tokens, or roughly $0.04. Bump to 2 FPS with richer frames (140 tokens each) and you hit ~$0.14 per hour—still a 97 percent cost reduction from 18 months ago.

The cost of understanding what happens in a video has dropped by a factor of roughly 40, while the quality of that understanding has improved dramatically. That is the kind of price collapse that creates entirely new categories of application. Imagine live video streaming commentary of your kid’s soccer game, a Ring doorbell that tells you who’s at the door, or an automated review of thousands of hours of security footage to find a missing person.—Mike Taylor

Knowledge base

“Vibe Check: Cursor 3.0 Bets Big on Agent Orchestration” by Dan Shipper, Katie Parrott, and Mike Taylor/Vibe Check: Cursor totally rebuilt its product around agent orchestration rather than code editing, and we came away feeling that the new Cursor still has maturing to do. The desktop app is fast, the local-to-cloud workflow is impressive, and its new model, Composer 2, is concise and snappy. But missing basics like file navigation and branch management left even power users like Cora general manager Kieran Klaassen struggling. Read this for the breakdown of where Cursor 3.0 stands against Claude Code and Codex.

“Seven Things I’ve Learned Getting Companies to Use AI” by Mike Taylor/Also True for Humans: Most companies mandate AI adoption and wonder why it doesn’t stick. Every’s head of tech consulting argues you should do the opposite: Find the people who are already bought in, get them IT access and budget approval, and let their results pull everyone else forward. His other lessons include building on the model providers directly instead of buying third-party tools, setting stretch goals that force people to think about where AI can save them time, and training every individual contributor to be a manager of agents. Read this for the whole playbook from his consulting engagements.

“What I Learned Onboarding Our AI Project Manager” by Nityesh Agarwal: Every’s consulting team built an AI project manager named Claudie that saves them 15 hours a week tracking client work across email, documents, and meeting transcripts. Getting her there meant rebuilding her multiple times, figuring out why she kept dropping key details, and writing her an employee handbook she reads on every startup. Read this for the full architecture and the management lessons that apply to your next agent hire.

🎧 “If SaaS Is Dead, Linear Didn’t Get the Memo” by Context Window/Laura Entis: Agents can now create tasks and manage workflows inside Linear just like human users, and companies like OpenAI and Coinbase run their agents on it. In this week’s AI & I, Linear CEO Karri Saarinen tells Dan how his company reinvented itself for the agent era without abandoning its mission of helping teams build great software. Also, read Every creative lead Lucas Crespo’s thoughts on why tools like Google Stitch can make any app look polished, but you still need a human designer to make something memorable. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on X or YouTube.

“How to Design for Human-agent Interaction” by Karri Saarinen/Thesis: When your agent sends out an email before you’ve had the chance to review it, the model did its job—it’s the interface that failed. Karri argues that AI’s unreliability is a design problem, not a model problem, and shares the six principles Linear developed so that agent actions are as legible and controllable as human ones. Read this to understand why the answer isn’t approving every agent action—it’s designing the system so the agent already has the constraints it needs before it starts.

Thesis extra: Designing toward the immeasurable

From Saarinen’s home office in San Francisco, he spoke to us about the design goal he cares about most—which also happens to be one he can’t measure: quality.

Saarinen describes quality as a near-sensory reflex. If he touches—or even looks at—something that doesn’t “feel” thoughtfully crafted, it sets off a niggling itch in the back of his mind. “It’s a belief,” he says, “or I could say, it’s like a faith.”

It’s an unusual stance for a tech founder—given the industry’s penchant to quantify all it possibly can—but Saarinen has made the pursuit of quality central to how the company operates. He sees it as inseparable from Linear’s ambition to be the best in its space.

Karri Saarinen in his home office in San Francisco. All photos courtesy of Sarah Deragon for Every.

Create conditions that make quality inevitable

If quality has to be felt to be understood, scaling it across a growing company isn’t straightforward. Saarinen’s approach mirrors an activity he does far, far away from his laptop screen: growing potatoes every summer at his home in Finland. “You didn’t directly make those plants grow,” he says, “but they grew because you created the conditions for them to grow.” When something goes wrong—say, strange spots appearing on the vegetable’s skin—you have to evaluate the conditions you created. Were the soil conditions right? Perhaps it was too acidic? You adjust, and you learn.

Similarly, a leader can define a standard of quality, but they can’t manufacture it themselves. Their role is to create an environment where quality is likely to take root. At Linear, that means hiring people who genuinely care about their craft, telling them openly—and often—that quality is valued, and building rituals that reinforce it. One of those rituals is “Quality Wednesday,” where the engineering team works on fixing small issues that degrade a user’s experience. The ritual trains the team to notice things that most people would scroll past, and carry that instinct into everything they ship.

What shapes a seasoned eye

When Saarinen talks about his influences, he’s drawn mostly in the direction of hardware. Saarinen points to Opal—the webcam he used during this interview—or the distinctive aesthetic of Swedish electronics company Teenage Engineering. In particular, he likes the latter’s audio mixers, where tactile grids of knobs and keys—and the small icons etched into their surfaces—attempt to give sound a visual form.

At the same time, Saarinen has never been a fan of skeuomorphism—which styles digital interfaces to mimic physical textures. “If you’re designing a new house and you like Roman columns, so you put columns like that in the house,” he says, “well, it’s still not a Roman house.” Those columns came to exist in Rome from constraints and traditions that were specific to a certain time and place—and grafting them onto a modern house is borrowing from that aesthetic, even though the context that produced it has little to do with what you’re building.

Software, he argues, should be approached the same way. It’s a new medium, and it deserves a native design language instead of hand-me-down forms from the physical world. (And now that apps are becoming agent-native, these interactions call for their own design patterns.)

Felt, not measured

Beyond design, Saarinen’s taste gravitates toward science fiction and fantasy—Dune, the Alien franchise, Stephen King’s Dark Tower saga—drawn to the new ideas, the unfamiliar worlds, the visual imagination these stories demand. There are even small nods to these influences hidden in Linear, a detail tucked into a homepage here, a reference in a feature launch video there.

Across all of it, the through line is the same: work that exudes intention and care. The kind of quality you can’t measure, only feel.—Rhea Purohit

Log on

Upcoming camps

Claude Code for Absolute Beginners (April 14): This beginner-friendly, live workshop led by Mike Taylor (head of tech consulting at Every) is designed to get you from zero to a working project with Claude Code.

Alignment

Dropshipping GLP-1s. The New York Times published a story this week about what might be the first $1 billion one-person company. It’s a GLP-1 telehealth startup called Medvi, built by Matthew Gallagher in two months with $20,000 and a suite of AI tools. In its first full year it did $401 million in revenue and is on course for $1.8 billion this year. He has one employee, his brother.

A lot of people are calling the numbers fake, but having spent two and a half years working inside this industry, I don’t think they are. The demand for these medications has been the most ferocious thing I have witnessed in my working life, and the hardest parts of running a telehealth company, like finding doctors and fulfilling prescriptions, can be entirely outsourced to platforms like CareValidate and OpenLoop. All you need is the audacity to do blitz marketing like you’re holding an AK-47 with unlimited bullets, and that’s exactly what Gallagher did.

His affiliates, armed with AI, built fake doctor profiles in Meta ads and made unscrupulous claims about weight loss using fake testimonials. The liability sits with both the affiliates and the company for these types of advertisements, but enforcement has been so slow that it hasn’t mattered.

Of course these black hat marketing tactics worked because regulators are slow and enforcement has been lax. But with acquisition costs rising and retention becoming harder as consumers chase the cheapest option, the unit economics of this model will become increasingly unattractive. These types of businesses exist for a moment until they capitulate because it no longer becomes economically viable.

Gallagher will come away from it a much richer man, so maybe that validates the business model. There’s also a discussion about whether it’s truly a one-man, billion-dollar business: Dan rightly points out that Gallagher is outsourcing a large amount of human labor. The part I’m concerned about is that it’s being celebrated as a milestone in AI use when it’s really a better example of someone exploiting an unregulated space.

Some untold number of unknowing people clicked on a fake doctor’s profile, filled out a one-minute consultation, and got a GLP-1 shipped to their door. This is exploitation on an enormous scale! It works for GLP-1s because the demand is extraordinary and the side effect profile is manageable for most people, but the same funnel could be pointed at antidepressants, or hormone therapy, or opioids. This type of business is now being copied because of the publicity this story has received, and that should scare us.

Evan Armstrong predicted the one-person billion-dollar company would arrive because AI would compress human intelligence. This feels like something different.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Upgrade to paid

How to Design for Human-agent Interaction

Karri Saarinen / Thesis — 2026-04-03 10:00:00 -0400

by Karri Saarinen

in Thesis

Sarah Deragon/Every illustration.

Karri Saarinen has spent his career—at Airbnb and Coinbase, and now as CEO of Linear—crafting software that keeps its promises. His argument is that AI’s unpredictability isn’t a model problem, it’s an interface one: An agent sends a customer an email you meant to review first. The model did what it was told, but the interface never gave you a chance to stay stop. In this piece, he shares the six-principle framework Linear has developed for how agents and humans should work together inside the same product, plus his nuanced take on a thorny question in AI design: Who should be accountable when an agent does something wrong? If you enjoy the piece, watch his episode on X or YouTube, or listen on Spotify or Apple Podcasts.—Kate Lee

Was this newsletter forwarded to you? Sign up to get it in your inbox.

I learned to design in a world where product design was a promise.

It was a promise that a product would work how it’s supposed to work. You sketch a user flow on a whiteboard, build it, and the system behaves the way you made it behave. A button does exactly what it says it will do, every time, and if it doesn’t, that’s a bug. This shaped my approach as a principal designer at Airbnb and Coinbase, and now as the CEO of Linear.

Lately I’ve been spending time with a different kind of tool, and that promise has grown harder to keep. I ask for help writing a plan, summarizing a discussion, and turning rough notes into something clearer. Sometimes the result is excellent, but small changes to my input shift the output in ways I didn’t expect. The capability is impressive when it works, but the experience often feels slippery. I’m not always sure what I’ll get back, or how much I should trust it.

Non-deterministic software breaks the contract. When outcomes can vary, sometimes wildly, based on what someone types into the same chat window, designing for reliability becomes genuinely harder. This slippery feeling is the design problem of this era, and it almost always traces back to the interface rather than the language model—which means it belongs to designers, not researchers.

The limits of chat

The first interface that spread for AI tools was the chat window. That makes sense. When you don’t know what something can do, the safest approach is to let people ask. A conversation feels familiar, it stretches across many situations, and it doesn’t force a specific structure up front.

But the more you use chat for real work, the more its weakness shows. Everything becomes a stream of text that’s hard to hold onto, hard to compare, and hard to connect to the rest of what you’re doing. The quality of the output depends enormously on the quality of the input, which means two people asking for the same thing in slightly different ways can get drastically different results. There are few guardrails, and little structure nudging you toward a good outcome. The interface is essentially a blank page with a blinking cursor, and all the burden of getting value from it falls on the person typing.

For exploration, that’s fine. For serious, repeated work inside a team, it’s not enough. We need interfaces that bring more structure to AI interactions, that guide people (and agents) toward better outcomes without being so brittle they break the moment someone wants to use them in a way you hadn’t anticipated.

Designing for new actors

There’s a second, newer dimension to this problem that goes beyond improving interfaces for humans. Agents are already showing up inside products, working alongside people, and most software wasn’t designed with that in mind.

For decades, interfaces have been designed so that humans can navigate them—buttons, menus, folders, navigation hierarchies. These patterns assume a person is looking at a screen, making decisions, and clicking through options. But when an agent is interacting with a product, the design challenge changes. The agent doesn’t need a menu to find something. It doesn’t browse. It acts, and the people around it need to understand what it did and why, often after the fact.

We need a new set of principles for how agents show up inside the tools people already use. Not principles for building agents themselves, but principles for designing ways that agents and humans interact within a shared product. At Linear, we’ve started calling these Agent Interaction Guidelines, and while they’re still evolving, they represent how we think about this problem today.

An agent should always disclose that it’s an agent

When humans and agents work side by side, people need instant certainty about who they’re interacting with. This sounds obvious, but it’s easy to get wrong. The agent has to signal its identity clearly enough that it can never be mistaken for a person, even in passing, even on a quick scan of a busy activity feed.

A dropdown menu assigns tasks to human and agent users, with clear “Agent” badges for the latter. (All screenshots courtesy of Linear.)

An agent should inhabit the platform natively

Agents should work through the same patterns and actions that humans use. If a person changes an issue’s status or links a pull request, the agent should do it the same way, in the same place, with the same visual language. This makes the agent’s work legible without anyone learning a new mental model. You already know how to read what happened, because the interface is the same one you’ve used all along.

Linear’s activity feed for issues shows agent actions alongside human actions.

An agent should provide instant feedback

Silence from an agent creates the same anxiety as silence from a colleague you’ve just asked for help. When invoked, an agent should provide immediate (but unobtrusive) feedback so the person knows their request was received. The details can come later.

An agent should be transparent about its internal state

More broadly, people need to understand, at a glance, whether an agent is thinking, waiting for input, executing a task, or finished with that task. And when they want to go deeper, they should be able to inspect the agent’s reasoning, the tools and systems it used, and its decision logic. This separates a product you can trust from one that feels like a black box. Transparency makes speed feel safe.

Agents in Linear’s Agent Sessions show their reasoning.

An agent should respect requests to disengage

When asked to stop, an agent should stop immediately and stay stopped until it receives a clear signal to re-engage. This one feels simple, but it matters more than you’d think. An agent that keeps going after being told to stop, or that re-engages unprompted, erodes trust faster than one that makes mistakes. People need to feel that they’re in control of the interaction, not the other way around.

An agent cannot be held accountable

I think about this principle most. The instinct to put a human in the loop is understandable, but taken literally, it can mean a person approving every step before anything moves forward. The human becomes a bottleneck, rubber-stamping work rather than directing it, and you lose much of what makes agents valuable in the first place.

The more important work happens before the agent even starts. An agent operating inside a well-designed system already has the context and constraints it needs to do good work. In Linear, that means project plans, issue backlogs, code, and documentation. These all shape what the agent does and how it does it.

When you delegate an issue to an agent in Linear, the delegation is visible. There’s a person who set the agent loose within that system, and that person is accountable for the outcome. You design the environment well, you let the agent run, and you own what it produces.

Issues delegated to agents show their human assignee.

A working framework

We’ve honed these principles through what we’ve learned building agents into Linear over the past year, and we expect them to keep evolving as the technology and the patterns mature. The design language for human-agent collaboration is still being written, by us and by everyone else building in this space.

I feel confident, though, that the slippery feeling people associate with AI products is a solvable problem, and the solution looks more like thoughtful interface design than better models. The models will keep improving on their own. The harder work is building the structure around them so that their output feels reliable, legible, and trustworthy. That’s the design challenge on which to focus.

And the reward for getting it right is that, over time, you can hand agents more and more of the work that doesn’t need you, and spend your attention on the work that does.

Learn more about how Saarinen runs Linear and what he thinks of the “SaaS is dead” narrative. Watch his episode on X or YouTube, or listen on Spotify or Apple Podcasts.

Karri Saarinen is the cofounder and CEO of Linear. You can follow him on X at @karrisaarinen.

For sponsorship opportunities, reach out to sponsorships@every.to.

Vibe Check: Cursor 3.0 Bets Big on Agent Orchestration

Dan Shipper, Katie Parrott, and Mike Taylor / Vibe Check — 2026-04-02 10:00:00 -0400

by Dan Shipper, Katie Parrott, and Mike Taylor

in Vibe Check

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Cursor made its name as the AI-native code editor—the product that proved developers wanted AI inside their workflow. With Cursor 3.0, out today, the company is making a big bet on what comes next—and it’s not editing code.

The new release is a ground-up rebuild centered on agent orchestration: dispatching, monitoring, and managing AI agents rather than writing code by hand. The editor is still there, but it’s no longer the star—the default view opens on an agent-centered workspace, with a chat-driven orchestration panel where the file tree used to be. It’s fast, resource-light, and has a genuinely impressive cloud feature that lets an agent build a feature while you grab coffee, then sends you a screencast demo when it’s done.

The problem is that the orchestration layer hasn’t earned the right to take center stage yet. The filesystem sidebar has been demoted to another tab. The skills ecosystem is fragmented across Claude Code, Codex, and Cursor with no interoperability. And the core question our team kept circling back to—“Who is this for?”—doesn’t have a satisfying answer yet.

Power users who already live in Claude Code or Codex don’t need another orchestration layer. Cursor loyalists who loved the editor are losing prominence for the thing they came for. Their team seems to be iterating incredibly fast, so we’ll be paying attention over the coming weeks and months as this becomes clearer.

We spent a week testing it with four members of Every’s engineering and product team. Here’s what we found.

Read the full Vibe Check

Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn. Katie Parrott is a staff writer and AI editorial lead at Every. You can read more of her work in her newsletter. Mike Taylor is the head of tech consulting at Every and a co-author of Prompt Engineering for Generative AI (O’Reilly). You can follow him on X at @hammer_mt and on LinkedIn.

To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.

Discover Every’s upcoming workshops and camps, and access recordings from past events.

For sponsorship opportunities, reach out to sponsorships@every.to.

If SaaS Is Dead, Linear Didn’t Get the Memo

Laura Entis / Context Window — 2026-04-01 07:00:00 -0400

by Laura Entis

in Context Window

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

‘AI & I’: Slowing down to speed up

Today, we’re releasing a new episode of our podcast AI & I. Dan Shipper sits down with Karri Saarinen, cofounder and CEO of Linear, a product management tool designed for agent-native software development, to discuss what the “SaaS is dead” narrative gets right—and wrong—and why conviction can be the best product strategy.

Watch on X or YouTube, or listen on Spotify or Apple Podcasts. You can also read the transcript.

Here are the highlights:

Just because the technology has changed doesn’t mean your mission should. Founded in 2019, Linear is the rare company that started pre-ChatGPT to have successfully reinvented itself as an agent-native business. Saarinen attributes Linear’s success to never losing sight of what it’s always cared about: helping companies build great software. Whereas competitors chased AI trends, Linear focused on understanding how the technology was impacting customers’ workflows, and updating its service accordingly.
SaaS winners are building for agents. Linear started as an excellent product management tool for humans. Opening up the tool to agents instantly increased the available user base. Today, agents are first-class users inside of Linear, and companies like OpenAI and Coinbase are using its platform to manage their own agents.
Speed means decisions matter more, not less. AI makes it easy to have an idea and build it without considering whether it justifies its existence. When ChatGPT was released, SaaS companies were launching their own chatbots left, right, and center. Instead of jumping on the bandwagon, Linear stopped to consider whether the application was useful. Turns out it really wasn’t, Saarinen says, a realization that freed up resources to focus on what mattered, like making it easy for humans and agents to collaborate on software development.

Dissecting Claude Code

On Tuesday, Anthropic inadvertently leaked the entire source code for Claude Code. Naturally, Cora general manager Kieran was curious to see what was happening under the hood.

In an impromptu livestream, Kieran dug deep into how Claude Code works, unpacking its approach to memory, tools, skills versus slash commands, and prompt structure.

Here are three things he found particularly interesting:

Kairos, one of Claude Code’s most advanced and autonomous features. It’s often called “Assistant Mode.” Where the standard command line interface waits for you to type, Kairos represents a shift to a proactive, always-on background assistant that keeps running when you leave your laptop. (The name Kairos is ancient Greek for “opportune moment.”) It’s currently internal-only at Anthropic, but the infrastructure is fully built.
The “Buddy” companion. Similar to Kairos, the infrastructure for Buddy is built, but not yet shipped to users. Buried inside the source code is a virtual pet for your command line. Each Buddy has its own species, personality stats (including ones called CHAOS and SNARK), and little ASCII art animations that respond to what you’re doing. Kieran’s a chaos snail—take from that what you will.
AutoDream, Claude’s nightly closet clean. This was the feature that most impressed Kieran. It’s a background process that runs when you go idle and consolidates everything that happened—daily logs, session notes—into a better-performing memory for when you come back. Kieran says this is the first compound engineering-style capability he’s seen built into the Claude Code, referring to his philosophy of AI-native software engineering, where each session makes the next one easier. While he’s already been doing this manually, AutoDream is Anthropic’s first move to baking this into Claude Code by default.

Watch the full investigation.

Dan had some questions. (Screenshot courtesy of Kieeran Klaassen.)

Log on

This week’s camp

Every x Notion | Custom Agents Camp: A free workshop where we demo the custom agents running Every’s daily operations. We’ll be joined by Notion product designer Brian Lovin, who will show how the team behind custom agents uses them and what they’re building next. The event takes place on Friday, April 3, at 12 p.m. ET. This camp is sponsored by Notion.

Upcoming courses

Claude Code for Absolute Beginners (April 14): This beginner-friendly, live workshop led by Mike Taylor, Every’s head of tech consulting, is designed to get you from zero to a working project with Claude Code.

Recordings you may have missed

Every’s Q2 Demo Day: The Every team shares what we’ve been building, including a walk-through of Plus One, our hosted AI agent that lives in Slack. Watch the recording or read the write-up.
Compound Engineering Camp: Cora general manager Kieran Klaassen walks through, step by step, how to go from prompt to working app in under an hour using the compound engineering plugin. Watch the recording or read the write-up.
OpenClaw Camp: The Every team walks through OpenClaw, showing how to set it up and our favorite use cases. Watch the recording or read the write-up.

What Every’s creative director says about Google Stitch

When a major update to Google Stitch was released a couple of weeks ago, the consensus on Twitter was that the “vibe design” platform spelled the end for art directors. Why hire a human when AI can do the job in a fraction of the time, at a fraction of the cost?

Lucas Crespo, Every’s creative director, has an opposite read: As AI homogenizes the web, designers are more important than ever.

Tools like Google Stitch allow anyone to produce a polished, competent app. They create digital products that look good, but may not meet the standards of professional designers. “But it makes people more comfortable saying, ‘This is good enough,’” Lucas says. “Good enough is not the thing that will make something stand out or make a difference when every website looks the same. You have to go above and beyond that, which will always require some unique angle or idea or imagination. It’s not something I’ve seen any model output yet.”

Lucas draws inspiration from being a person in the world—walking through the park on a windy day, the fizz of receiving a party invitation—not from what’s on his screen. The goal is to produce work that evokes precise emotions.

The redesign for Cora, Every’s email management system, is built around “the feeling of sitting on the shoreline in front of a body of water, looking at the horizon where you can see the gradients in the sky changing during sunset and sunrise,” Lucas says. The vision cascades into the hundreds of user experience, color, and typography decisions that create the final product. “You’re going to think about nature. You’re going to start thinking about fresh air,” he says. “But it has to start with the point of view.”

Writing works the same way. Marcus Moretti, general manager of Spiral, Every’s AI writing assistant, begins a draft knowing “about 30 percent” of what he wants to say. This might be a scene he can’t stop thinking about or an argument he hasn’t found the right words for yet. Spiral helps him with the mechanics: structure, pacing, and filling in the connective tissue. In his experience, however, the remaining 70 percent can’t be prompted into existence. “You figure out the rest by writing.”

For careful readers, it’s easy to spot when a writer has outsourced that process to an LLM. While not always apparent on the sentence or paragraph level, longform AI writing drifts without constant human oversight. Arguments and scenes that look like insight collapse upon a closer read.

Maybe this is why AI tools tend to impress people who are new to a discipline the most. If you’ve never written code, vibe coding feels like magic. For engineers, there are caveats. Lucas notices the same pattern on his feed: When a new AI design tool drops, the people most blown away “are usually not the designers I admire,” he says.

Whereas non-designers see a shortcut to creative genius, Lucas sees a useful tool that still requires hundreds—potentially thousands—of decisions before the output meets his own exacting standards.

AI has raised the floor. It can raise the ceiling, too, making it easier to execute work built on a singular vision. But it cannot generate that vision for you.

Laura Entis is a staff writer at Every. You can follow her on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue. Collaborate with agents on documents with Proof.

Discover Every’s upcoming workshops and camps, and access recordings from past events.

For sponsorship opportunities, reach out to sponsorships@every.to.

What I Learned Onboarding Our AI Project Manager

Nityesh Agarwal — 2026-03-31 06:00:00 -0400

by Nityesh Agarwal

Midjourney/Every illustration.

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Every’s consulting team is growing. Right now, we have two potential new hires in a trial period: Jean-Claude, who’d manage our sales pipeline, and Claudette, a visual designer.

You might be surprised to learn that they’re both AI agents. If they’re able to reliably do what we need them to and we bring them on full-time, our team will consist of four human and three agent employees.

Claudie, our first AI colleague, has been with us for two months. Natalia Quintero, Every’s head of consulting, and I rely on her to track where every client project stands and to make sure nothing falls through the cracks, work that saves the team 15 hours per week. It’s hard to imagine operations without her.

Getting her up to speed, however, was neither a seamless nor a linear process. That road is paved with previous iterations of Claudie we had to fire because they were not structured right.

Each Claudie revealed more about what it takes to get an agent to be a reliable co-worker—lessons that have only become more urgent as more companies deploy agents, creating what Every CEO Dan Shipper has called a “parallel organization chart” of AI colleagues, each with a name, manager, and real responsibilities. At Every, we’ve started helping others build the same setup through our hosted agents, called Plus Ones. Claudie was our crash course. Here’s what she helped us figure out.

Define the job before you hire for it

Built in Claude Code—hence her name—Claudie was designed to handle administrative tasks that consumed too much of Natalia’s week. The albatross was maintaining the dashboard that shows the status of all our client work, which meant staying on top of a constant flood of information from Natalia’s email, Google Docs, Google Sheets, meeting transcripts, and her calendar. Before Claudie, Natalia was spending hours that could have been dedicated to strategy and client relations finding data across dozens of sources and manually copy and pasting it in the right tab.

The first step was to give Claudie access to various sources of information and ask her to gather everything she needed before making a single update to a client’s database, which required tracking a dizzying number of moving pieces: action items, client feedback, and names of employees who attended each client session, and on and on.

Claudie required lots of oversight at first. For example, she failed to input details discussed in client meetings and wasn’t presenting data the way we’d like—simple fixes once we realized she just needed access to Natalia’s meeting transcripts and a tool for creating pivot tables in Excel. Each time something went wrong, Natalia flagged it, and we dug in to diagnose the cause.

It’s an easy thing to overlook: Agents can only work with the context and tools you give them. Before you bring one onto your team, get specific about what they’ll be responsible for, and what information they’ll need to actually do the job.

Understand how your agent does its best work

At first, we treated Claudie like any other new hire—telling her to find what needed updating and asking her to go do it. An experienced project manager would have hit the ground running. Claudie failed spectacularly.

The problem was the context window, or the maximum amount of text an LLM can access at one time. Claudie was trying to process too much, and information kept getting lost. So we broke Claudie into layers. We built a central orchestration agent that delegates to several fleets of subagents, each responsible for a discrete task: extracting data, identifying needed updates, and making those changes. Results improved but remained unreliable. Key dates regarding client sessions and discovery calls were frequently dropped altogether.

Our breakthrough came when we identified where communication was failing. Claudie’s subagents were gathering data and reporting it back to the orchestration agent. In theory, this should have worked. In practice, a single client update might require reviewing dozens of emails, meeting transcripts, and spreadsheets—too much for the subagents to relay without hitting the context limit. So they started summarizing the information instead of passing everything through, and the orchestration agent was making decisions based on AI recaps rather than the raw source material.

To solve this issue, we instructed the data-gathering subagents to dump everything into a local file hosted on the same computer as Claudie instead of communicating information back. The orchestration agent could then direct subagents to the relevant files to make updates without ever engaging with the data itself. Voilà—context window preserved. Once Claudie started working from raw data instead of summaries, she nailed it.

A diagram of the architecture that fixed the context window issue and dramatically improved Claudie’s performance. (Screenshot courtesy of Nityesh Agarwal.)

Agents process information differently from humans. But like humans, they have weak spots that can be mitigated or even solved with the right management approach.

Give your agent a handbook that is required reading

Getting Claudie’s architecture right wasn’t enough on its own. She also needed context about the role and how to do it well.

So we wrote her a handbook, as we would have done if onboarding a human project manager. Built as a project management skill in Claude, it details everything from success criteria to the team structure to when to escalate an issue to Natalia.

With a human employee, you’d hand them the handbook at onboarding and expect them to reference it as needed. Claudie’s hard-coded first step when starting up is to read the handbook to ground her in the specifics of our team and her role within it. We found that when she skipped this—which, when left to her own devices, she frequently tried to do!—performance plummeted.

We treat the handbook as a living document. As Claudie’s role has expanded, we’ve updated it to reflect her new responsibilities. For a human who learns on the job and asks clarifying questions, a slightly out-of-date handbook is no big deal. For Claudie, it’s all she knows.

Claudie’s employee handbook. (Screenshot courtesy of Nityesh.)

Don’t be stingy with promotions

Once Claudie’s subagent architecture was stable, we expanded her responsibilities. At first, she updated each client’s dashboard individually. Once we trusted her with that, we had her do them all at once.

Right now, we’re setting Claudie up on her own computer with a Claude Max plan and web server that’s on 24/7, which will give her the ability to run automated jobs at specific times each day and always be available to respond to our messages and requests on Slack. If that goes well, Claudie will graduate from project manager to chief of staff: She’ll monitor, triage, and send emails, pick up tasks in Asana, and communicate a project’s status in Slack.

Claudie’s very own computer. (Screenshot courtesy of Natalia Quintero.)

The criteria for a promotion are the same as they’d be for any team member: strong performance, a clear set of updated responsibilities, and the support and tools necessary for them to succeed in the new role.

Apply your learnings to your next hire

Onboarding Claudie wasn’t quick, nor was it easy. We rebuilt her multiple times from scratch. When we hit hour 50 of trying to get her to work, it was tempting to write off the AI entirely. When we did get Claudie to work, however, it was clear what a mistake that would have been. All we needed was the patience to figure out the right way to harness her brain power so she could deliver.

If an AI worker isn’t performing, the problem is rarely that the model can’t do the job. It’s more likely the way you’ve structured, connected, or instructed your agent. Figure out where you went wrong, fix it, and have them try again.

It’s a lesson I’ll take with me as I onboard more agents. The best thing a manager can do—for a human or an AI—is refuse to give up on a new hire before you’ve exhausted what you could be doing differently, and to believe in their potential.

To learn more about Claudie, listen to Natalia’s AI & I episode on how she automated her job.

Thanks to Laura Entis for editorial support.

Nityesh Agarwal is an engineer at Every. You can follow him on X at @nityeshaga and on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization. Discover Every’s upcoming workshops and camps, and access recordings from past events.

For sponsorship opportunities, reach out to sponsorships@every.to.

If you’d like to become one of our human colleagues, explore open roles at Every.

Seven Things I've Learned Getting Companies to Use AI

Mike Taylor / Also True for Humans — 2026-03-29 21:00:00 -0400

by Mike Taylor

in Also True for Humans

Midjourney/Every illustration.

This post was originally a tweet thread in response to Sam Parr asking how people get their teams to adopt Claude. It touched a nerve, so I wanted to expand on it. I recently joined Every Consulting as the head of tech consulting, where we work with mid-to-large-sized companies on AI training and adoption. Here’s what’s working.—Mike Taylor

Was this newsletter forwarded to you? Sign up to get it in your inbox.

Entrepreneur Sam Parr asked a question on X the other day: “How is everyone getting team adoption for Claude? I spent a lot of time on Twitter, as do you. We see all this AI stuff popping up. We’re on top of it, or at least sorta. But how are all you people getting your team to actually use it effectively without spending all their time on Twitter and learning?”

I hear this question in some form on every single consulting engagement. I know the advice I have resonates in meetings, but I’m short on time. So I dictated this post through Monologue and used Claude to shape it into something readable. (Let me know if this format works for you.)

Here are seven learnings from working with companies through Every Consulting:

1. Buy the model direct, not third-party tools

When you evaluate AI-powered tools, you’re also—whether you realize it or not—evaluating the tool vendor’s choices and constraints, rather than what the underlying model provider (like Anthropic, Google, or OpenAI) is capable of. It’s often faster to build your own Claude/Gemini/Codex skill with your own rules and preferences already built in.

Companies are increasingly building, not buying, AI software on top of models, because it gives you flexibility. I don’t know how it’s possible for companies that aren’t the core model providers to keep up when the big labs know what models are coming, build their internal tools to align with those releases, and train them on how to operate within their own environments. I appreciate the effort that companies like Cursor put into user experience—they’re a good product organization. But it’s difficult to compete with Anthropic offering $5,000 worth of tokens a month for a $200 subscription.

Third-party tools tend to be less flexible, less cutting-edge, and more expensive. That’s not always the case, but as a general rule, it holds. So most companies are better off buying directly from the model providers.

2. Raise the ceiling, not the floor

A lot of companies have mandated to their employees, “Everyone needs to use AI now. We bought you AI tools. Adopt it.” That doesn’t work. Even on pain of death, many people are unwilling to use AI or be told that they have to. It’s basic self-preservation.

Instead, use the carrot rather than the stick. Nominate people who are already AI-forward as internal cheerleaders. Maybe it gets other people to come out of the woodwork rather than hiding their AI usage by making it clear that using AI is encouraged. Give those people the support they need to unblock barriers to AI usage (typically IT access to data connectors, approved budgets for coding tools, and removal of layers of bureaucracy)—because someone who’s bought in is going to accomplish five to 10 times more work than someone who hasn’t seen the magic yet.

You can accelerate adoption by showing that people who use AI aggressively get promoted first or interface the most with senior management. In some cases, we’ve co-opted those early adopters into being teaching assistants in courses we teach to the rest of the team. When their colleagues see that person advancing in their career, that’s a more effective motivator than any mandate.

You also get the productivity boost of enabling someone who’s already a believer. It’s much harder to convince someone to believe than it is to supercharge someone who already does.

3. Workshops should be at least 50 percent build time

Workshops teaching people how to use AI in a hands-on way are an effective way to teach your team—but they need to be heavy on building tools. No one wants to sit on Zoom and just look at slides.

I learned AI by doing. Guided theory helps orient and motivate people, but the biggest complaint we hear is that they don’t have time in their workday to explore these tools and learn something new. If you give them a couple of hours in a workshop where they’re expected to build something, and access to the tool and data (either synthetic or actual through connectors like MCPs), that’s when the aha moment happens.

4. Assign impossible tasks

An “impossible task” is one that wouldn’t have been possible to do without AI. Boris Cherny, a creator of Claude Code, has said something similar—that you should slightly under-resource most teams, which makes employees think, “The only way I can do this is if I use AI.”

I think it works better if you are more explicit and strategically choose the tasks so that they can’t possibly be done without AI. For example, if your goal is to write one blog post a week, you can likely do that manually. But if your goal is to write one a day, you’ll probably need to use AI in research, drafting, and editing (like we’re doing here!). And you don’t set the goal as, “Starting today, you have to produce one piece a day.” Instead, say: “Our goal is to work up to producing one piece a day. What needs to happen for you to make progress toward that goal?” It might take time, but if they know that’s where they’re heading rather than where they’re starting, they’ll start thinking strategically about how to use AI to save time, and start experimenting.

5. Mandatory AI note-taking plus MCP connector

Everyone on our consulting team records every meeting with Granola and has the Granola MCP set up in Claude Code, and it’s been transformative. You finish a meeting with a potential client, and tell Claude to summarize it and send an email to your colleague. That’s 80 to 90 percent of the value of AI: extracting information from unstructured data and structuring it in a way that’s useful.

So many times I’ve come to a task and realized I need context from a meeting, and I can pull that information from the MCP. It’s how I create curriculum or put together proposals. Now I can’t imagine working without it.

6. Map workflows and systematically automate them

When we do discovery calls with clients about their day-to-day work, we follow a process: We ask them what tools they use, what they do on a daily basis, and what their pain points are. Then we put that information into a Google Sheet with a row for each task we need to solve for, and we systematically work down that list as we automate.

Our goal is to get to the point where nobody on the team ever has to do the same task thrice. If AI can take a first pass at each task type, and we build a skill for each one, that person could be doing five to 10 times more than they’re doing right now.

So far, in my experience, this has never led to a reduction in workforce. Instead, either the companies put more effort into each task, or they expand the revenue and throughput of their team without hiring. When we were previously teaching Claude Code workshop-style courses, we used to prepare one project for the whole group to work on. Maybe we could manage one per business unit or team, but the preparation cost quickly added up. Now we can use Claude Code to create an individual project for each person taking part. We’re using AI to make each engagement that much more valuable rather than cutting headcount.

7. Train people to be managers of agents

Everyone who was an individual contributor before is now a manager—of AI tools. And they’re struggling because they don’t have management training. They’re not used to context switching, setting up systems and rules, or evaluating whether something that they haven’t worked on themselves is any good.

Managers can often adapt to managing AI tools more readily because they don’t care how a problem is solved—they just want it solved to their specifications. But the script is flipping: Managers are becoming individual contributors, because managing a team of agents is often easier than managing human teams. It takes a human longer to process reams of information, and to see if they’ll be successful. Sometimes it’s easier as a manager to vibe code a task using Claude Cowork than it is to brief a human, wait for them to send it to their own Claude instance, and get a response in a couple of days.

The upshot is that companies need more management training. You need to help people understand context switching and teach them how to do evals, develop good taste for deciding what to work on, and train AI in specific skills. How do you systematically write a good PowerPoint skill or a good daily update report skill? That’s the work now.

If any of this resonates and you want help implementing it, check out Every Consulting. We’ve been doing this for a year with a select group of companies and are now open publicly.

Mike Taylor is the head of tech consulting at Every and a co-author of the O’Reilly-published Prompt Engineering for Generative AI. You can follow him on X at @hammer_mt and on LinkedIn. To read more essays like this, subscribe to Every, and follow us on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Everyone Gets a Sidekick

Every Staff / Context Window — 2026-03-29 08:00:00 -0400

by Every Staff

in Context Window

Midjourney/Every Illustration.

Hello, and happy Sunday! Was this newsletter forwarded to you? Sign up to get it in your inbox.

Knowledge base

“Introducing Plus One: One-click OpenClaw Agents by Every” by Dan Shipper/On Every: Every’s team has spent months working alongside personal AI agents in Slack—triaging bugs, drafting marketing copy, launching growth experiments—and now we’re sharing them with subscribers. A Plus One is a hosted OpenClaw agent that shows up to the job with Every’s best tools and workflows. Read this to see how our team collaborates with their AI coworkers, and to join the waitlist.

“I Achieved the Four-hour Workweek. So Why Did I Just Take a Job?” by Mike Taylor/Also True for Humans: After five years of self-employment, Mike Taylor had passive income and total freedom. He also had unpredictable revenue, a string of failed products, and no one to share ideas with—which is why he went full-time as Every’s head of tech consulting. His argument is that while AI makes building anything easy, getting someone to notice is harder than ever, and the best learning happens inside a team. Read this if you’ve ever wondered whether the solo path is actually worth it.

“The Agent That Saved My Brain” by Austin Tedesco: Austin Tedesco, Every’s head of growth, used to lose hours toggling between Stripe, PostHog, Slack, and Notion. So he built an agent in Claude Code—even though he has no technical background—that pulls data, drafts campaign briefs, and answers his questions right in Slack. Through this, Austin’s found a worthy thought partner—though, he admits he still loses time tinkering with the system. Read this for the full build process, plus his open-source compound knowledge plugin.

🎧 🖥 “AI Makes Building Products Easy. Knowing What To Cut Is the Hard Part.” by Laura Entis/Context Window: Instagram cofounder Mike Krieger now co-leads Anthropic Labs, where his team builds experimental products on top of Claude. On this week’s podcast, he tells Every CEO Dan Shipper why even when AI has collapsed development timelines from months to hours, the hard part hasn’t changed. 🎧 🖥 Listen on Spotify or Apple Podcasts, or watch on X or YouTube.

“Build Your Own Bloomberg Terminal With AI” by Brooker Belcourt: As a hedge fund analyst, Every’s head of financial services consulting Brooker Belcourt used to spend four hours writing previews of earnings reports per company, per quarter, for 40 companies. Today, his work is greatly compressed by AI tools ranging from a ChatGPT prompt that drafts the writeups to a Claude Code setup that reads his proprietary models, cross-references them against Wall Street estimates, and assembles everything into a custom dashboard he checks each morning. Read this for a step-by-step progression toward making the most of AI for investors.

Log on

Upcoming camps

Every x Notion | Custom Agents Camp (April 3): A free workshop where we demo the custom agents running Every’s daily operations. We’ll be joined by Notion product designer Brian Lovin, who will show how the team behind custom agents uses them and what they’re building next. RSVP for ready-to-use templates and up to six months free of Notion Business + AI.
Claude Code for Absolute Beginners (April 14): This beginner-friendly, live workshop led by Mike Taylor (head of tech consulting at Every) is designed to get you from zero to a working project with Claude Code.

Jagged frontier

I stare at my screen some days and think: Why hasn’t AI replaced me yet?

I spend my hours playing textual Tetris—nudging workstreams, reviewing code, editing prose, shipping features. An AI agent can do all of those things. So why am I still here?

Because taste can’t be typed out. It has to be worn in.

If I said to our managing editor, Eleanor Warnock, “Write down everything I’d need to edit one of our pieces,” it would be impossible. Her instincts come from hundreds of past edits, thousands of small decisions layered on top of each other. You would need to work with Eleanor for a long time to emulate her editing style. You can’t enumerate it from scratch.

The gap between what I want and what AI gives me is real. To get a result I’m satisfied with, I need what I’ve always needed: time. I lean on AI to make a decision, but it’s not the decision I would make. So I give feedback. Then I give it again. And again. The process of teaching an AI your taste looks a lot like the process of developing taste in the first place—the accumulation of many small moments, each one building like sediment on the last.

Every’s AI-native engineering philosophy, compound engineering, recognizes this need for ongoing growth. After every piece of work, you ask your AI to distill and integrate the lessons you’ve learned. The next time you encounter a similar problem, you’re better able to solve it. After many cycles, you amass a war chest of small opinions. The AI may be fast, but there’s no way to speedrun the process.

The same goes for trust. People start timidly with OpenClaw, asking it to do simple tasks. Then, they give the agent a little more responsibility. When it does well, they share a bit more context, grant a bit more permission. The output improves. Trust builds, the same way it builds with any human: one kept promise at a time.

That’s good reason to start now. A person who spends time with their AI today, accreting those layers of context and taste and trust, will be meaningfully ahead of someone who starts next year.

AI can help us move faster between the moments that matter. But it can’t manufacture the moments themselves. Some things won’t be rushed. I remain uncompressed.—Willie Williams

From Every Studio

Every held its Q1 2026 Demo Day this week, with live demos of Plus One, Cora, Spiral, Sparkle, and Monologue. The common thread? Each product is becoming agent-native. Agents can now connect to your inbox, draft in your voice from a coding session, organize your files through conversation, and work alongside you in Slack. These used to be standalone tools you operated yourself. Now your agent can use them on its own.

Here’s what’s shipped and what’s on the way.

Plus One is here—your own AI coworker, connected to everything

Every has launched Plus One, a hosted OpenClaw that lives in Slack, where you and your team already work. COO Brandon Gell set one up in 45 minutes and had it triaging bug reports into Notion, generating daily briefs from his calendar, and collaborating with other team members’ Plus Ones in shared channels. Plus Ones come already connected to Every’s AI tools and our best skills and workflows. Willie, our head of platform, has been leading the system architecture. The team is onboarding people from the subscriber-only waitlist at around 20 per week, with a public launch targeted for April. Join the waitlist.

Cora goes agent-native with a CLI, skills, and an iOS app in the works

Cora now has an Agents tab, from which you can connect your agent directly to your inbox or install Cora’s new command-line interface (CLI). Kieran Klaassen, general manager of Cora, demoed the integration by asking his agent about his planned trip to Austria this summer. Because Cora is specifically tuned for organizing and retrieving email, it outperformed a generic Gmail integration and surfaced the flight details instantly. On the design side, Kieran is building toward a full email inbox, with an experimental iOS app that includes a Tinder-style swipe interface for quickly keeping or archiving messages. Try the latest experiments at baby.cora.computer, and connect your agent from cora.computer.

Spiral gets an agent integration, saved prompts, and an X style-guide generator

Marcus Moretti, general manager of Spiral, shipped an agent integration and CLI that lets your coding agent draft content in your own voice. In the demo, Marcus sent context from a Claude Code thread directly to Spiral, which generated options—written in Marcus’s personal style—for X posts to announce a new feature. Spiral is also rolling out saved prompts that you can reuse and share with others, and new ways to generate style guides based on your X account or other online writing. Try it at writewithspiral.com.

Sparkle rebuilds from scratch with conversational organizing and agentic cleanup

Sparkle has organized more than 40 million files, and general manager Yash Poojary applied the lessons learned from doing so to rebuild the app. The new version lets you organize files through conversation: Point Sparkle at a folder, and it proposes a custom structure that it refines in real time as you chat it. Yash also demoed “agentic cleanup”—a term coined by Dan—where the agent can act, with guardrails that prevent permanent deletion, on the system junk and old installation files it finds. Sparkle also remembers your preferences and runs cleanup continuously in the background. The new Sparkle launches to the public on April 14. Download it at makeitsparkle.co.

Monologue trained its own blazing-fast model and hits 2 million words a day

Naveen Naidu, general manager of Monologue—which is now processing 2 million words per day—announced a custom transcription model so fast that text appears less than a second after you stop speaking. The other news: Monologue’s voice notes feature, which launched quietly on iOS and has crossed 10,000 notes in four weeks, is also coming to your Mac. There, Monologue records both system audio and your microphone, and syncs across all of your Apple products. All notes are also accessible via Monologue’s API, CLI, and model-context protocol (MCP), so your Plus One—or any agent—can pull your meeting notes without extra setup. Expect the new model and MacOS voice notes in the next few weeks. Download it at monologue.to.

Alignment

The cosmic joke. I read so much AI prose now that it’s seeping into my brain and warping my own. Last week I almost wrote, “It’s not X, it’s Y.” I shuddered.

As a result, I’ve started reaching for older books. I want to develop a unique writing style and get more comfortable breaking the rules, and I like to think of reading as my protective force field against the sloppening. It’s helped a tiny bit—this new reading practice. My words are beginning to flow in a more authentic way. What I didn’t expect, though, were the detours on which many older books take you.

I’m reading In Search of Lost Time, and Marcel Proust is describing a magic lantern projecting scenes on his bedroom wall when he was a young boy. And describing it. For multiple pages. What does this have to do with time? I wonder. It’s not until several chapters later, reading a seemingly unrelated scene, that the penny drops. I realize that, with the memory of the lantern, Proust was showing how a break from your everyday experience, brought on by even a change to the light in a room, can leave you lost and disoriented.

When the awareness finally dawned on me, it was much more profound than it would have been had I not taken the detour.

AI doesn’t make you wait for anything. It gets you from A to B in the straightest line possible. Whereas good writing can take you far afield, so that you may, eventually, come to the answer on your own.—Ashwin Sharma

That’s all for this week! Be sure to follow Every on X at @every and on LinkedIn.

For sponsorship opportunities, reach out to sponsorships@every.to.

Upgrade to Paid