
The transcript of AI & I with Scott Wu is below. Watch on X or YouTube, or listen on Spotify or Apple Podcasts.
Timestamps
- Introduction: 00:02:02
- Why Scott thinks AGI is here: 00:02:32
- Scott’s personal journey as a founder: 00:09:27
- Why the fundamentals of computer science still matter: 00:16:55
- How the future of programming will evolve: 00:22:30
- A new workflow for the AI-first software engineer: 00:26:50
- How Devin stacks up against Claude Code: 00:29:33
- Reinforcement learning to build better coding agents: 00:40:05
- What excites Scott about AI beyond Cognition: 00:50:05
Transcript
(00:00:00)
Dan Shipper
Scott, welcome to the show.
Scott Wu
Yeah. Thanks for having me.
Dan Shipper
Of course. So for people who don't know, you are the co-founder and CEO of Cognition. You are the makers of Devin, the AI software engineering agent, and recently the acquirers, the lucky winners of Windsurf. It seems like you've had a really crazy couple of months.
Scott Wu
Yeah, it's been a fun few months for us. I was going to say— I mean, it's been an interesting time for everyone, I think, in the AI coding space. But a crazy few months, especially for us.
Dan Shipper
So the thing I want to start with first is you said something on John Collison's podcast recently—that you think AGI is here. Explain.
Scott Wu
Sure. So it's at least a little bit facetious obviously, but I think it's worth thinking about that 10 years ago, what would we have called AGI? And we would've said, okay, it's gotta be able to pass the Turing test. Obviously it's gotta be able to converse with you and just come off as a human and be able to relate to you and think in much the same way that a human does. It's gotta be able to solve tough technical problems, it's gotta be able to do a lot of the same tasks. It's gotta be able to interact with the real world the same way that a human would. And I think to first order, we've basically done all of those things. I mean, passing the Turing test, we've obviously done. We have OpenAI and others have released work on getting a gold medal at the IMO and the IOI, solving incredibly hard technical problems, building agents that can go and actually interact and reason in the world. And obviously I think there's an interesting question of, well, there's still so much more that humans do, and there's so much more to it.
And I think that's true. Maybe one way to put it is that my view is that's going to be true for quite some time to come. And sometimes people ask about AGI from a perspective of like, well, AGI is when humans have nothing left to do at all or something like that. Or one of the definitions that people use is like, AGI does 80 percent of knowledge work or things like that. I think these things are really hard to define and really hard to clarify, because humans specifically do the parts that are not automated. I mean, it's kind of like whether 80 percent of human work has been automated. I claim that it already has been a long time actually, because as soon as we had the tractor and as soon as we had—it's like if you think about what people did 1,000 years ago, we are doing way less than 20 percent of that work. And so a bit facetious for sure, but I guess my point is like I think we have a lot of levels of AI development that occur. I'm not sure there's one hard cutoff on what counts as AGI, but I think it's also very clear that we've hit a lot of the things that people would've considered insane just a few years ago.
Dan Shipper
I think that's such a good point. That's why I hate the 80 percent of knowledge work definitions because knowledge work changes. It's not like a static thing. And I think people underestimate that once you automate one level of work, there's always another level of work above that. And I mean, we've all seen this over the last three years. A lot of the stuff that I do today—a lot of the stuff I was doing three years ago is automated now, and I'm just doing more per unit work, which is really interesting.
Scott Wu
Yeah. And as humans, we're always human-centric in our view, I guess is the way to say it. We're very proud of ourselves and our work, which we should be. But you can imagine at some point it's like you'll just be able to think a couple thoughts and then have all of this come out and happen in reality. And we'll still be saying, oh well AI can't do that. Humans are still doing all the important work. And of course, it's like AI at that point, or technology in general at that point will have made us 10,000 times faster by virtue of doing 99.99 percent of the work. But it's kind of very hard to define what counts as the percentage of labor.
Dan Shipper
Yeah. So I have a particular definition of AGI that I'd love to bat around with you. Feel free to criticize, poke holes in it, but also I'm interested in what you think. So the definition of AGI that I like actually comes from child psychology. So when children are born, they are essentially totally dependent on their caregiver. You can't leave them alone for any length of time. And as they get older, you can leave them for progressively longer stretches of time to be on their own. So an infant or like a toddler, you can leave for like five minutes or 10 minutes or something like that in their room. As they become children you can leave them alone for like hours or more; as teenagers, they go away for maybe a night at a time. And then they go to college and they're fully autonomous. And so my definition of—and I think if you look at the development of agents and just AI in general, it has followed that same trajectory. So when GPT-3 was first on the scene, we were just at the tab complete level of autonomy. And now we're seeing, you know, Devin or GPT-5 or Claude Code run for 10 or 15, 20 minutes at a time, and you can sort of see this smooth lengthening of that leash in the same way you see a smooth lengthening of the leash for children. And so I think a good definition of AGI is when it is economically profitable to never turn your AI off. It's always working, it's always doing something. And when enough people are doing that, I think that counts as AGI.
Scott Wu
Yeah. I think that's super fair. I'd say one thing is obviously it's very dependent on dynamics in the sense that, well, if everybody has an AGI then the AGIs are competing with one another for their usefulness or something. And so there's like some amount—but I think that's right. I think there is a point when you can truly just have an always-on agent that's going and doing meaningful work and producing value off of it. I think the one thing I would push back against is the idea of economic value. Just because, as we're saying, so much of economic value depends on how substitutable the thing you're providing is, or things like that. But no, that's cool. I feel like your point about the doubling time of how long these agents can operate, I mean, it has continued. It is insane how long that trajectory has continued. There's always this saying that you can never trust an exponential curve, or you can never keep predicting points in an exponential curve. And yet, they've kept coming.
Dan Shipper
They certainly have. I'm curious, actually, one of the things that this makes me think about is because I'm thinking about growing up and I'm thinking about the process of growing up for you. I think that Devin is your first—Cognition is the first company you started, and is it—
Scott Wu
So I actually started a company before this. It was called Lunchclub. So I ran it for about five years.
Dan Shipper
Okay, so you're more of a veteran than I thought.
Scott Wu
Well, I’m still very much a noob.
Dan Shipper
I am curious though what has that been like? You've been running Devin, or you've been running Cognition for at least a few years. What were you like when it started and what did you believe about yourself in the world and what are you like now?
Scott Wu
Yeah. It's honestly crazy. I think especially because I had run a company before. In the sense that, I think, over the last decade or so there are a lot of different great companies that got built. But the pace of what's happening in AI and the trajectory of AI has already, I think, been vastly different from a lot of that, and it's already gone much faster. I think for me there were a few elements for sure. There's a little bit of the chip on the shoulder of I feel like I could do better. I feel like I could do more, but honestly, I think there was also just a feeling of like, I just have to try. That's kind of how I thought about it at the time. The way I thought about it was if you try and you build something really meaningful and it works out, great, obviously. That's great. But what happens if that doesn't happen?
And the question for me was kind of like, would you rather try and give it everything you had and just find out that it didn't work out and you weren't the one and whatever else, or would you rather not try and wonder about whether you could have done it? And I think for me the answer was pretty clear that I just wanted to give it a go. And so that's a lot of what it was like for me. In practice obviously it was almost kind of like walking our way into building a company. It's not even necessarily the case that we were saying like, oh, we're going to build a company. I think at the time it was really just exploring ideas in AI and looking into the things that we found really interesting, which, naturally—we're just nerds. And so AI coding is like the coolest thing. And so we were messing around with a bunch of these—it was me and a bunch of my friends who I've known for years and years. But as it became more clear that AI coding is going to take off and stuff like RL is going to really work over the next while and it's going to unlock a lot of these product experiences, I think there's a real question of like, is this the thing that we want to commit and spend all of our focus and all of our effort on? And that was the trade-off for me.
Dan Shipper
Yeah. And how are you different now?
(00:10:00)
Scott Wu
Yeah. And now I think I've been having a great time, to be honest. And I consider myself very lucky in that, and I think there are a few things I think about that are really nice. I think obviously the problem that we work on is great. I think of the people that you work with as the most important thing. Someone gave me this advice a long time ago that you're going to spend most of your time working and so you might as well work with people that you really like. And I still always think about that. And so I think I've learned more things. I've hopefully gotten a little bit better over the course of the last couple years. But in a lot of ways I think it has been much the same way. I think—do you know the line of like leaving it all on the field? Yeah. And I really like that mindset of like, you want to go and try your best, but you also want to be able to live with the outcome. And I think that's how we very much think about it today, like, we give it everything that we have. We do the best that we can. You can control the inputs. You can't control what happens at the end. And we just want to be able to say that we gave it everything we have.
Dan Shipper
I feel that. I mean, I definitely feel like inside of Every—I just love the people that I work with, and that makes it so much better to run a company when everyone's having a great time together, you know? And I think another way to frame leave it on the field that I've felt myself is I would do this even if it failed. And even if it didn't make a lot of money, and I think that's actually somewhat rare today. And it sounds like that is a big part of your journey too, because these are things that you're just interested in playing around with yourself anyway, regardless of whether it was going to be a gigantic company.
Scott Wu
Everything could blow up today. Who knows what happens—we have a huge AI winter or something happens with the hardware or whatever and then AI collapses and our company collapses and whatever—god forbid—and I would still be like, that was an amazing two years. I had a great time. I'm really happy that I got to do this for sure.
Dan Shipper
But how has it changed your relationship to something that you used to love and just play around with having the amount of responsibility that you currently do to do that thing?
Scott Wu
Yeah, that's a good question. I mean, I grew up doing a lot of coding, obviously. I mean, funnily enough, despite all of this—or arguably because of it—in my day-to-day work, I get to do less coding now than I did before. And so I don't know exactly what that implies and I obviously like to do more coding whenever I can. But I certainly have not gotten to the point of saturating my desire to write code, if that makes sense. I do think there's something really satisfying about just building the—it's funny, I think one of the things that comes to mind here for me is the otter logo of Devin. I don't know if you've seen the otter on Twitter or on things like that. It has always been an informal logo of ours. There's a question of whether to make it the official logo, but like random company stuff. But the reason I say it is because it actually is really how we think about it internally, which is Devin is just like our little buddy, a cute little otter with his own computer. And it's just typing away and doing tasks for you. And that's kind of how we've always thought about it. I think if we were—I mean we are all programmers ourselves, obviously if we really felt like this was going to be the end of programming, I think we would be way less excited about this problem. And I think it's kind of just like teaching your own buddy how to code and then starting on that journey in a way that's been really fun for us.
Dan Shipper
I think that's a really good segue into one of the things I wanted to talk about, which is just how you see the discipline of programming changing. I'm curious if this is similar to what you're seeing—on my end there's obviously people who don't use AI at all, but then there's this sort of bifurcation between, I think, more traditional engineers who are adding AI into their existing processes, and then kind of AI-first engineers who maybe only learned to code with AI, or maybe they are senior engineers from the past but are just going full on into AI. And they're AI-first. And they're only touching the code if they absolutely have to, which is a very different mindset. And I think that group of engineers has—there's a whole different set of ways of thinking and different primitives for how to do good engineering if you're only orchestrating agents. I'm curious if that's what you're seeing or how you see the landscape evolving.
Scott Wu
No, I think that's definitely the case. And it reminds me of one of my favorite fun facts actually, that I like to share. Do you know that teachers actually used to picket and protest against the idea of calculators?
Dan Shipper
I did not know that.
Scott Wu
So when calculators first came out, there were a lot of protests of like we can't have this, this threatens math education and all that good stuff. And obviously, I mean, we did okay as a society, despite having calculators in our lives. And I guess the point that I want to make with it is like, sure, I think there are some things that go away and maybe people are—for example, maybe people have their multiplication tables memorized a little bit less in the post-calculator era. But obviously if you just look at the combination of humans with the tools and what they can do, the answer is much, much more today. And so I think what is going to happen, or what's actively happening already, is I think there's going to be a somewhat different education path for how to be a really great engineer in the post-AI age. The thing that's interesting, it's like there's so many levels of AI improvement that are actively happening right now that kind of change that answer.
But from what we can see, I think we can kind of imagine that a lot of what that looks like is more about really deeply understanding logical fundamentals, being able to break down problems and articulate the answers to them. Being able to think about different strategic trade-offs, thinking about architectures and so on. And less about just going and debugging your Kubernetes or knowing all of these kinds of obscure libraries or understanding some very particular like esoteric syntax or something like that. And I think that trend is already happening, you know? And so some people say that that means that computer science has no value. Some people say that computer science has way more value. I tend to think it's more of the latter. And the reason for that is because obviously you are still the one at the helm making the decisions. And a lot of how you make decisions and how you decide what to build and how you think about the trade-offs that you're making all goes back to computer science fundamentals.
Dan Shipper
Yeah, I agree with that. I think if you want to make the analogy, and I think it is a good analogy to make, it's to say, well, you're becoming a manager instead of an IC when you use these agents. The best managers—if you're an engineering manager—typically have technical backgrounds. Or the best CEOs too, for software products, for example, tend to be able to go deep into how everything works to help resolve issues, and also to have good expectations set for what can be done.
Scott Wu
Yeah. It's like turning bricklayers into architects is one of the things that we've said as well. And yeah, to your point, if anything, the technical architect at the company is actually like the sickest software engineer. It's not somebody who's just walked in and—it's like if you say like, oh, this person's an insane software engineer. What do you mean? Usually you don't mean that they type really fast, you know? What you mean is like they can break down problems. They just have a really great feel for things. They just think really logically, they cover all the cases.
And I think those are the same skills that you're going to want to have. I think the thing that's kind of interesting is, and this is by the way not true just in code, but I would kind of describe it as a lot of professions today—for people who just get started, there's almost like a hazing experience where you spend your first few years doing the most boring stuff.
And then you get to graduate and do the interesting stuff. And I think what we are going to have now is a little bit more like an officer's school of going straight into learning interesting things. And that's good. If anything, that's probably more true in like investment banking or something like that—you spend like your first three years just going through spreadsheets. And then you get to do some of the cool stuff.
Dan Shipper
I think it's not even intentional hazing, but you have to go through six months of learning what a while loop is and what an if statement is before you can build anything interesting. And with Devin or Claude Code or ChatGPT, you can build something on your first prompt, and that's a huge difference.
Scott Wu
Yeah. I think it's not intentional hazing anywhere, or at least in most places, we like to think.
Dan Shipper
Well, I mean, in investment banking. I don't know.
Scott Wu
Perhaps, I guess I was going to say, it's kind of like there's a lot of this work that has to be done. And somebody has to do it. And so it naturally ends up being the most junior team members that go and have to take it on. But now that's Devin. And then you get to—to your point, it's kind of like skipping one of the rungs of the ladder and being able to be a manager directly and being able to be an architect directly.
(00:20:00)
Dan Shipper
Totally. Well, but I want to get deeper into this. So we're talking about what the future of software engineering looks like. What does the future of the landscape look like? And in particular, what I'm really interested in is the day-to-day of what software engineers are doing in this new world. And I'm curious for specifics, because I think the best way to think about this stuff is just to look at what people are doing right now, because there are people who are living that way right now. I assume some of them are inside of Cognition. I assume some of them are your customers. So what I want to understand is what does that actually look like and what are the new interesting things you're learning about the way that engineering works from this perspective?
Scott Wu
Yeah, for sure. So I'll give the long-term answer and the short-term answer. It's my favorite topic to talk about, by the way—what is the future of software engineering—because it is, I think, still a pretty open question. I think in the long term, it's very clear that obviously these systems will continue to get more powerful. And a lot of what that looks like is just you as an engineer being able to operate at higher and higher levels of abstraction. And it's kind of in the same way that we've made the jump from assembly to Python or JavaScript or something, we are going to make that leap from looking at a bunch of boilerplate Python code to just being able to express your ideas in English of what you want to build. And so at some point you're not looking at your code, you're just looking at your own product and you're—I actually think the J.A.R.V.I.S., Iron Man-style future is in a lot of ways correct. In terms of what we'll have—a lot of the interfaces are going to change pretty significantly when you have an intelligent agent that can go and execute tons of things for you and you can just go and work with all these things. It's not obvious that keyboard and mouse is necessarily the right input format in that world. And so that's the long-term future.
What does that mean for us today? Obviously we're not all working with our own personal J.A.R.V.I.S. quite yet today. But I think in code you kind of see these different form factors emerging. And I think the older one that has existed is what I'd call kind of the IDE category of basically making you faster when your hands are on the keyboard. And that's all the tab complete and the chat with code base and all the tools there. And then the newer school is kind of this fuller agentic thing, running agents asynchronously in the background, having them take on full tasks and—and the simple way to describe it is up until the point where the agents are capable enough to handle everything and let you just operate 100 percent in that higher layer of abstraction, you want to have both. Because you want to have agents for the things that they can go and take on and just do it entirely independently. And then you want to have the synchronous IDE experience for the things that really need you at the wheel. And I would guess that that phase lasts for about three years or so. For the next three years we will have both IDEs and agents and then at some point beyond that, it's kind of like everything will just be dictating what you want to some kind of agent form factor. Obviously it's not the—again, there's a cutoff of what counts as an IDE and what counts as an agent? And the interfaces—
Dan Shipper
Yeah. When you talk about, for example, the IDE or background agents, are you counting Claude Code as a background agent? Like where does the new CLI stuff fit into this worldview?
Scott Wu
Yeah, so it's all a spectrum for sure, and I think about these CLI agents as—it's certainly somewhere in the middle. I would describe that as a bit more, it's a bit closer to a synchronous agent. And so if you compare it to a Cascade or something in Windsurf, it operates a little bit more like that where yes, it is an agent that can do multiple steps, but one you're kind of meant to be checking in with it more deeply. And two, it is not fully autonomous in the sense that it doesn't go all the way and create pull requests for you and work with all of your systems and it usually doesn't go and test all your code for you or things like that. And so it's more a spectrum than a binary for sure. I think in general we will be kind of operating on the spectrum and things will be gradually shifting more and more agentic and more and more autonomous. But we will have the spectrum at least for the next couple years. And then I think the natural question is what that experience should look like of using this suite of tools across the spectrum. And so you have the full synchronous things—I think tab complete itself is probably the most synchronous thing.
It's like you are still really going and dictating every line of code. Tab is just helping you go a little bit faster. Tab is kind of speculative decoding—that's my nerd view on it, if you know what I mean by that. Oh, so speculative decoding is a nice little trick in language models because it's easier to process input than it is to generate output, much the same way as it is easier to process input for humans than it is to generate output. And so what you do is you have a big model and a small model, and you're trying to get the output from the big model. But what you do is you have the small model generate the output first, and then you have the big model take a look and say, oh, is this what you would've said? And then it goes through and reads each one and then it decides, oh, this was correct, this was not correct. As soon as it's not correct, then the big model resumes and does its output completion. But it's kind of like having your small model work for you anyway, so that's cool. So tab complete is exactly that. It's just funny, like, the human is the biggest model—that is the most expensive one to call and you have the human process. But obviously it's a super synchronous experience. And then it goes all the way to these fully autonomous agents. And I think having the suite of tools is great. And then the question is kind of how do you split up your tasks into which ones should go into which buckets? And also for more complex tasks that are going across buckets, how do you use these tools in tandem with one another? And honestly, I think that's a pretty unanswered question today.
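For readers who want to see the loop Scott is sketching, here is a minimal, simplified speculative-decoding sketch in Python (greedy-match variant). The toy model functions and tiny vocabulary are stand-ins invented for illustration, not any real model or library API.

```python
def target_model(context):
    # Toy "big" model: deterministically cycles through a tiny vocabulary.
    vocab = ["the", "otter", "writes", "code", "all", "day"]
    return vocab[len(context) % len(vocab)]

def draft_model(context):
    # Toy "small" model: agrees with the target most of the time, but
    # stumbles whenever the context length is a multiple of 3.
    return "um" if len(context) % 3 == 0 else target_model(context)

def speculative_decode(target, draft, prompt, k=4, max_new=12):
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1. The small/draft model cheaply proposes the next k tokens.
        proposal = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # 2. The big/target model reads the proposals and accepts each one until
        #    the first mismatch; reading input is cheaper than generating output,
        #    which is the whole trick (and Scott's tab-complete analogy).
        accepted = []
        for tok in proposal:
            if target(tokens + accepted) == tok:
                accepted.append(tok)
            else:
                break
        tokens.extend(accepted)
        # 3. From the mismatch point, the big model generates one token itself,
        #    then hands the next chunk back to the draft model.
        tokens.append(target(tokens))
    return tokens

print(" ".join(speculative_decode(target_model, draft_model, prompt=["the"])))
```

In the tab-complete analogy, the editor's suggestion plays the draft model's role and the human reviewing and accepting it plays the big model's role.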
Frankly, I think there are—and I think there's a lot of progress that different folks are making in the space. I think there's a lot of really great thinkers in the space. But I think how you start with a synchronous experience and then hand off to an async and go back—a simple example I might give is a lot of what you want to do—let's say in the world today with humans, you sit down and you're like, alright, I have this project idea. I think we should build it. What's the first thing that you do? It's not necessarily, I mean, hopefully it's not that you just immediately sit down and start typing code. A lot of it is you're just fleshing out the details, thinking about all the decisions that you're going to have to make. We should build this new feature and this feature is only going to apply to users in X, Y, and Z buckets. And if they're in bucket X, then they already have something that looks like this. And so we need to go replace that in the UI with the new thing that we're trying to build. If they're in Y then it's a totally fresh thing, so we should walk them through the onboarding. You're fleshing out all the details of what you need to do. And then I think the next step is typically something like building out a clear spec, thinking about the technical details and building out all of that. And then depending on the cycle, obviously there's more things. And then at some point it's handing off the implementation. And I guess my point is, I think AI coding agents should be able to help you through the full cycle, or this kind of suite of things should be able to help you through this whole thing. But naturally there are parts of it that you want to be doing synchronously and parts of it that you want to be doing asynchronously. So going and making these decisions of how you handle users in different buckets or whatever, is something where it's like you should be able to have the J.A.R.V.I.S. experience of talking directly live with your agent and your agent is like, oh, by the way, yeah, there's this edge case, there's this case, there's this case.
How do you want to handle each of these? Let's talk it through together. By the way, I looked deeper into the code paths and here are the things that I found, but obviously the decision is yours ultimately. And then maybe there's other things of kind of just building out the PRD itself. And then at some point when you're going and doing the literal implementation of the code and testing and just making sure that works, that's something that probably should happen more async. You don't have to be involved once you've kind of fleshed out all the details in the actual implementation itself. And then maybe when it comes time to do the PR review cycle, then you want to be synchronously there, you want to be able to read the diffs. And so there are a lot of these kinds of flows in our work today where naturally you want to be able to go from sync to async, to sync to async. And I think there's still a pretty open question of how you should be working between those together.
Dan Shipper
One thing I'm curious for your take on is, I think the AI coding space had this really interesting shift three months ago, where prior to three or four months ago everybody at Every, including me, was using Cursor and Windsurf. And then Claude Code came out and, I mean, I used Claude Code with Opus before it came out and I was like, holy fucking shit, this is crazy. And literally overnight everybody at Every, and I think a lot of people around the space, switched to these new CLI form factors. What do you make of that? Obviously people are still using Windsurf, they're still using Cursor, whatever, but a lot of the momentum went to CLIs. Why do you think it was so successful and what do you make of it?
Scott Wu
Yeah. Well, first of all, I think it's an incredible product experience to be clear. And I think the Anthropic team has done very, very well with that. I think there are a few things going on here. But broadly, the way I would describe it is the capabilities—well, the capabilities change every week, honestly. But there's, I would say, relatively meaningful step function changes every few months, let's say. And the thing that's really interesting is the correct form factor or the correct interface is a pretty tight function of the capabilities, if that makes sense. If you were trying to do full autonomous things—I'm making fun of myself a little bit here. If you were trying to do fully autonomous things in the GPT-3.5 era it's like, there were things that you can do, but obviously good luck younger Scott.
(00:30:00)
And obviously there are a lot of different versions of this that happened. But I think there were a few shifts that happened. I think the capabilities really improved to a point where you could start to handle a lot of day-to-day things in a more autonomous way, for one. And then I think the other thing is just I think Anthropic very clearly put a lot of love into the experience. I actually think of it as—I think a bit less—so something I'll say, maybe this is a controversial take.
I actually don't know that CLI itself is the most important part of the experience. And I would claim the reason I say that—we have this debate actually internally at Cognition of what is the form factor? Should the form factor, should it live in the IDE? Should it live in Slack? Should it be its own web app or whatever? And I think the answer that we kind of come to is that the form factor is just a software engineer, if you know what I mean by that. And it's kind of a non-answer, but also kind of hopefully it does say something which is basically it's less a question of, well, where do you go to interact with your tools? It is more a question of what do the tools do for you? And how do you expect to work with them? And so I think from that perspective engineers are spending a lot of time in the terminal. They're also spending a lot of time in the IDE.
They're also spending a lot of time on Slack. All of these are reasonable things to think about and to work with. And I think as time goes on, I think probably a lot of these tools will be integrated with more and more of these so that you can kind of call them from anywhere. But I think the bigger question, which I think Claude Code took a bit of a different spin on, is how should you be working with the tool? If that makes sense. And I think the way that I would describe it is kind of their view of the tool is the tool is you, whereas if you're in Windsurf and you're doing a tab completion, it's obviously very kind of something that augments you. If you're calling something with Devin then it's very much kind of like the software engineer sitting next to you and they have their own virtual machine where all of that operates and they've spun up the repo themselves. Claude Code is kind of handing the reins over to your AI buddy to take the wheel of your computer. Which I think is an interesting paradigm, frankly.
Dan Shipper
I agree. I think the thing that struck me about it was, it was a full send to a new version of agentic engineering where previous, except for Devin—but previous iterations were always, well, the AI is on the side and the CLI form factor was actually, no, no, no. All you need is to talk to the AI. And that was the first time that that happened for something that you are using on your own computer, which I think is the other really crucial component, at least for me, because having it be able to run bash commands on your computer makes it way more extendable and customizable than if it's in some environment that you don't fully control. Or sometimes, for example, in Cursor, the environment would be spun up and then spun down. And Devin, I think it's—I think it's more consistent. Last time I checked it was a consistent environment, but still harder to customize than your own machine.
Scott Wu
Yeah. And so I think there's a lot of difference—we talk about the entire kind of—there's a hyperspace of all of the decisions that you can make. There's synchronous vs. asynchronous. There's local vs. remote in terms of the environment that it operates in. There's in the IDE vs. in other things, there's single player vs. multiplayer and whatever. And I think what we're seeing is just, there's obviously some exponential amount of possibilities in the space. I think a lot of different single points of the space are pretty interesting and I think Devin is one point of the space. I think Windsurf and other IDEs are another point of the space. I think Anthropic unlocked a new point in the space with Claude Code. I think that, frankly, I think that there will be a lot of these points that exist for quite a while, and I think the full suite kind of should have a lot of these different experiences because obviously it depends a lot on your particular use case or your flow which one is best at each point in time.
Dan Shipper
What do you think the trade-offs are of the point in hyperspace that you're in? So, in particular, I think the thing that makes Devin unique, at least in my testing, has been it's an agent that lives on its own computer in the cloud persistently that you can talk to at any time which is just—it is a different bet than pretty much any other big company has made.
Scott Wu
Yeah, for sure. I think for us what we've seen is that Devin is basically—it is the highest power experience for a lot of kinds of deep things that you want to do. At the same time, it comes with tradeoffs largely in terms of onboarding and getting set up. And so naturally, I think that kind of leads to certain use cases that are things that are a lot of the engineering toil or a lot of these messy, repetitive things that you have to do over and over again, processing and understanding a big code base or things like that. And the reason I say that is because Devin has a lot of features to your point, like extensive repo setup where you can go and custom configure the entire VM that Devin operates in. There's a lot of memory and knowledge of learning a particular task over time.
There's a lot of deep work in just building a good representation of the code base and all the deep wiki and search work that we do to really understand these things. And on the other hand, it takes time to set up. And so I think the trade off of Devin kind of in that space is that it does take a decent amount of effort to onboard Devin. I mean, it is almost like onboarding a software engineer. But once you do, there are a lot of tasks because Devin has its own environment and because it can go and learn how to test things and run all the tests itself that it just can do that basically nothing else out there can do. And I think it's interesting for us because even before this Windsurf acquisition, we were already thinking about this question of, well, what should we do to basically make it a lot easier and make an experience that's a lot more accessible for folks.
And we were talking about a few different ideas with more synchronous experiences. And then obviously everything happened with Windsurf and it was a great opportunity for us. And so a lot of how we see it today is I think there is going to be a lot of work in terms of—it's not just onboarding the agent, but also in human software engineers themselves, learning how to work with more and more async agents, which makes sense. I think it naturally has to kind of transition from a sync to an async environment. And so that's kind of what has led to a lot of our thinking today is using Windsurf or having Windsurf as kind of a really fast time to value option that you can immediately just kind of download and use and get a lot of value out of. Over time, you learn how to work with a Cascade agent or you learn how to kind of use the deep wiki indexing in Windsurf and then naturally that takes you kind of to more of the async flow.
Dan Shipper
That makes a lot of sense. I'm curious, one of the things you said earlier that stuck out to me and I think is really true is that there is a tight dependency between model capability and what the right affordance or what the right harness is to use that model. So for example, with Claude Code, it becomes possible to do a lot of the CLI stuff because Opus 4 and other models of that generation are good enough to make that work. Whereas for GPT-3.5, if you did the CLI, that would've been horrible. And I think my question about that is, you guys don't have your own models as far as I know. Maybe you have fine tuned versions of models, but you're not building your own coding models. Is that true?
Scott Wu
We do a lot of post-training of models. So we do fine tuning and RL and things like that. We don't pre-train base models.
Dan Shipper
I guess, why not? Given that there's a lot of overlap between the capability of the model and what the right harness or affordance is to use it, does that make you worried about competing with OpenAI or Anthropic when they have this very tight coupling between these two things that are evolving together very rapidly?
Scott Wu
Yeah. So in model training there's—it's typically split into these two categories, pre-training and post-training. Obviously it's somewhat nebulous—there are things in between as well. But pre-training is kind of this initial phase of basically taking the entirety of the internet or just every piece of data that you can get your hands on. And doing a massive training run and producing a model that can talk like somebody on the internet. And then post-training is really fine-tuning on these very specific capabilities and collecting very custom data sets of the capabilities that you want to express and teaching your model to express those. And then obviously there's all the rest that happens after the model stage of just kind of figuring out how to build a great product experience and how to use the model capabilities to the fullest and to deliver—the way I'd put it is, the pre-training part is just not really where we have an opinion. Frankly it's—obviously I think the foundation labs are their own massive kind of layer of the stack where they're spending billions and billions of dollars doing this massive pre-training and kind of basically getting as much data as they can find to go and do this.
(00:40:00)
On the other hand, I think post-training very much is and some of the examples that I'd give you are things like, well, we want Devin to be able to predict its own confidence, to have an opinion on how likely it is to be able to do this, or how well it understands the task, or things like that. Or maybe a more direct, practical one is like, alright, one thing that engineers do a lot day-to-day is pulling up the Datadog, finding the corresponding logs and using that to debug what went wrong and then making the right edits. That's a very specific flow that you obviously need to have custom training data for—the models don't just learn that on their own. All of these things fit actually quite naturally into post-training as a category. And so I think from our perspective it is really a question of what we think we spike most on and where we want to focus. I think as a startup, your edge always has to be speed and focus. And I think for us it's kind of like I think we know what our core DNA is about. And a lot of it is just understanding the nuances of real world software engineering and basically teaching that to the models in a way that you can build a great product experience. And that's what we focus on.
Dan Shipper
In the post-training world right now, RL environments are really hot. And it strikes me that you all have been probably purposely building the perfect RL environment for post-training software engineers. Tell me about that.
Scott Wu
Sure. Yeah, I mean, it's one of the beautiful things about code. And people talk about this obviously, but the fact that code has a much cleaner feedback loop because you could run the code or you have all—there's all the version control. You could see every commit that was made in history. You have so many of these tools, which you would love to have in creating these things. And I guess the only thing I would say here is it really does come down to just building the exact custom environments for the use cases that you care about. And so all this we just talked about like Datadog or other things.
There are honestly hundreds of these within code, and random things come up—COBOL, for instance. It turns out there's still a bunch of COBOL out there in the world, and it's not something the language models are super adapted to, understandably. But that is real work and real stuff that takes a lot of time for people today. I've said this before, but my high-level view on RL is that the platonic ideal of RL is that you can go and solve any benchmark. And once we have that, which I think we are getting closer and closer to having, then the question is just, okay, well, what's the benchmark? That's what a lot of these application-layer companies are thinking about: what is the benchmark? What is the exact set of tasks and environments? What are the tools you're going to use? What are the decisions you'd make? How are you going to decide whether it was a success or a failure?
And if you have described all of those things exactly and you've kind of collected enough data points around that, then you can train a model that just does it. It's kind of insane to think about that. But obviously it just means that, to your point, having the right environments and the right use cases is even more important.
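Scott's framing of a benchmark as tasks, tools, and a success check maps onto a very small amount of structure. Here is a hedged sketch, with invented names and interfaces rather than Devin's actual ones, of how an application-layer team might specify such an environment:

```python
# A hypothetical environment spec in the sense described above: the benchmark
# is fully defined by the task, the tools the agent may use, a setup step that
# provisions the sandbox, and a programmatic check that decides success.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EnvSpec:
    task: str                   # what the agent is asked to do
    tools: List[str]            # e.g., ["shell", "browser", "editor"]
    setup: Callable[[], None]   # provisions the repo, services, and data
    verify: Callable[[], bool]  # the reward: did the agent actually succeed?

def run_episode(agent, env: EnvSpec) -> float:
    """Run one rollout and return a binary reward for RL or evaluation."""
    env.setup()
    agent.solve(env.task, env.tools)  # 'agent' is a placeholder interface, not a real API
    return 1.0 if env.verify() else 0.0
```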
Dan Shipper
And I'll say that's very hard. Even answering the question of how we decide whether this is good is actually quite hard. It strikes me, though, that there are maybe two ways to solve the kinds of problems we're talking about, like making a great software engineer. One is having RL environments set up that mirror the kinds of problems a normal software engineer would encounter, and then using that to generate data to train the model to solve those problems. The Datadog example. That basically means having a company enumerate: what are the likely things people are going to have to do, and what are our users saying they want to do? Collect all that data and then train the model. On the other hand, there's a lot of talk about continual learning, which I kind of think is already happening; it's just very simple and efficient. So on the other end of the spectrum, instead of having to do all that work, if you just made the model more sample-efficient, able to try stuff and learn, you wouldn't have to do as much of the RL environment, post-training type stuff. So why make the bet over here, on the RL environment, post-training side, instead of on the more sample-efficient learning side?
Scott Wu
Yeah, it's a really good question. I'll give you my high-level view on this, almost as a philosophical question. Even if we are going to have the full—we talked about all the reasons that AGI or ACI is a bad term—but the full intelligence that can do everything we want it to do, at some point it has to learn all the practicalities of the real world. And I would argue we're bottlenecked by that right now. We're not bottlenecked by pure logical reasoning.
We can do some pretty insane logical reasoning with language models. So how do you learn the practicalities of what somebody actually does, for an accountant who's doing all their work every day, or for a paralegal, and build a model that is intelligent about that? Obviously the best way to learn something is to do it. So you need actual data, in some form, of what this particular paralegal does. And then, to your point, there's kind of a fork in the road of where that data comes from. I would add a third option, just for completeness' sake; I think most people would agree with it.
The three that come to mind are: one, the data exists in the world already; you just have to go get it. That was the pre-training view of the world. You take more and more of the internet, you train on it all, and because everybody has stuff on the internet, if you keep doing that over and over, eventually you get a model that knows everything. Two is, well, it has to be built by experts themselves. You have to go and really curate an environment and figure out exactly these 500 environments for your one task, and then do that for every task you could possibly imagine. And three, to your point, is you have an agent that can go out and do it by itself. So two and three are the versions you're describing: do we go and handcraft the environments vs. is there some continual learning that works out?
Dan Shipper
Yeah. It does it and fails, you onboard it to your company and it fucks up how to log into Datadog a few times, but then it figures it out, you know?
Scott Wu
Yeah. The short answer is basically, I think we will get to three. But a lot of the problems you solve along the way actually apply to both. So there's one, the pre-training world. Personally, I think we're converging on a lot of the capabilities of pre-training at this point. Two is this RL world, which is actively the world we're all in, of going and doing custom RL for particular capabilities. And then three is this long-term continual learning, which will obviously unlock some really big things. But I would just point out that for both two and three, to your point, a lot of what you have to do is create actual agents that can operate in the real world.
The way I think about it is, the most important thing making two more successful right now is that you can be much tighter about curating the reward function. A simple example is one of our evals, one of our environments: there's a Grafana dashboard that you need to set up, and for some reason it's not working, so please figure out what went wrong. It's very much meant to be a messy, real-world software scenario; this is the kind of stuff you'll run into day to day. The way the task works is you go and install the packages, you run the code, and you find some error. It turns out the error is because the version lock of the packages was slightly off, the kind of stuff you run into all the time as an engineer. So you have to find this, realize you've got to downgrade the version, and that leads to the next thing, and you figure out what went wrong with that error, and then you get the thing running. At the end there's a dashboard, and the really beautiful thing about an eval like this is you can make the eval simply: what does the dashboard say?
And the thing is, if you did not get the Grafana dashboard running, you will never be able to answer that question. And if you do get it running, you will always get the right answer. That curation just means you have a much tighter feedback signal. For three, you would want to be able to do this kind of thing live in the real world and have that feedback cycle. The problem is, it's a lot tougher: you could get the dashboard running and then have the wrong number, and how are we going to know that was wrong? It's not an unsolvable problem; I think we will make more and more progress on it over time. But my point is that either way, building the full environments and the tooling for agents that can do this is the primitive you need in order to do both of these routes.
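The outcome-based check Scott describes can be illustrated with a short grading function. This is a hypothetical sketch: the endpoint, the expected value, and the JSON shape are placeholders, not a real Grafana API or Cognition's eval code.

```python
# Hypothetical outcome-based grader: it only asks what the finished dashboard
# reports, so there is no partial credit for "almost" getting it running.
import json
import urllib.request

EXPECTED_VALUE = 1234.0                               # answer baked into the eval (illustrative)
DASHBOARD_URL = "http://localhost:3000/api/summary"   # placeholder endpoint, not a real Grafana API

def grade() -> float:
    try:
        with urllib.request.urlopen(DASHBOARD_URL, timeout=5) as resp:
            value = json.load(resp)["value"]
    except Exception:
        return 0.0   # dashboard never came up or is unreadable: automatic failure
    return 1.0 if abs(value - EXPECTED_VALUE) < 1e-6 else 0.0
```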
Dan Shipper
That makes sense. The thing that comes to mind is a worry, which I'm curious for your opinion on—the Grafana eval. In that case, how generalizable have you found being good at that eval is? So the fact that it can set up a Grafana dashboard—if Grafana changes the way that their whole system works and that breaks the eval, are you training—does it end up being too brittle if it's trained for these specific kinds of environments that could end up changing pretty fast?
(00:50:00)
Scott Wu
Yeah. So the first answer is yes, you naturally do need a bunch of environments. But the second, meta-level answer I'd give you is that it really depends on how you set up the task. If the task is literally just, alright, recall exactly from memory what packages are needed to run this version of Grafana, then yeah, it's not very generalizable, because as soon as the next version comes out, that knowledge is stale. But the task can be set up so that you're going to go and Google this, then read the Grafana docs page and use that to understand what went wrong, then look at the error in your logs, find the file it corresponds to, read that file, and use it. My point is that we as humans all figure it out somehow, and the way we broadly figure it out is that we actually interact with the real world in a way that gives us the information we need, rather than pulling it all from memory. As long as your agent is set up to do that, the skill is something that is very generalizable.
Dan Shipper
What's something in AI that you're excited about that has nothing to do with Cognition or Devin?
Scott Wu
Oh, interesting. I've always felt like personal agents should just be a thing. I'm surprised it hasn't happened already. That's the way I would put it. Obviously there was Operator, and deep research is another good example, I think. But I mean just a mass-consumer agent that you can have on your phone that takes care of things for you. It feels like the capabilities are there for that. Maybe I'm wrong, but it feels like something like that would be so valuable.
Dan Shipper
Everyone always gives the example of scheduling my dentist appointment or whatever. What does it do?
Scott Wu
Yeah, exactly. Everyone always gives the example of booking flights for some reason, which is great, don't get me wrong, but it's not the only thing I would have my personal agent do. I would love to have my personal agent go and call to book a reservation, deal with various messages, make sure the package delivery went through. It feels like there should be something there. Or just buying my Amazon orders for me: hey, I need another order of X or Y, and it just goes and gets it. I've always thought that should be something. It's funny, because we obviously built Devin entirely for coding, but we've messed around with it just to see, hey, can Devin do this? And yeah, we actually order all of our Amazon packages with Devin now, which is amazing.
Again, the Slack message or the Linear ticket is not the right form factor for that, and somebody else should build the right one, but the agent capabilities are there, I guess is my point. With Devin, because you have memory and because you're storing secret keys, you might as well be able to handle your Amazon account as well. And because you have the browser, that is actually the set of things you need to be able to do. But it feels like something like that hasn't really taken off the way I would imagine. I would imagine that 12 months from now we will have this, but I would love to have it sooner if possible.
Dan Shipper
That's really interesting. Well, I hope to have you back on the show 12 months from now talking about the future of personal agents and hopefully some really cool updates to Devin.
Scott Wu
Awesome. Thank you so much for having me.
Thanks to Scott Nover for editorial support.
Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn, and Every on X at @every and on LinkedIn.
We build AI tools for readers like you. Write brilliantly with Spiral. Organize files automatically with Sparkle. Deliver yourself from email with Cora. Dictate effortlessly with Monologue.
We also do AI training, adoption, and innovation for companies. Work with us to bring AI into your organization.
Get paid for sharing Every with your friends. Join our referral program.