The transcript of AI & I with Simon Last is below.
Timestamps
- Introduction: 00:01:57
- How AI changes the way we build the foundational elements of software: 00:02:28
- Simon’s take on the impact of AI on data structures: 00:10:07
- The way Simon would rebuild Notion with AI: 00:13:05
- How to design good interfaces for LLMs: 00:23:39
- An inside look at how Notion ships reliable AI systems at scale: 00:28:22
- The tools Simon uses to code: 00:35:41
- Simon’s thoughts on scaling inference compute as a new paradigm: 00:38:16
- How the growing capabilities of AI will redefine human roles: 00:49:10
- Simon’s AGI timeline: 00:50:28
Transcript
Dan Shipper (00:01:59)
Simon, welcome to the show.
Simon Last (00:02:00)
Hey, thanks for having me.
Dan Shipper (00:02:01)
So for people who don't know you, you are the cofounder of Notion. This is, I think, at least as far as I could find, the first interview that you've done outside of internal interviews for Notion. So I really appreciate you coming on.
Simon Last (00:02:13)
Yeah, of course. I tend to keep a low profile, but I'm happy to do it.
Dan Shipper (00:02:15)
Great. And you're leading the AI initiatives at Notion. As far as I can tell, you were also really pushing AI before it became a thing internally, which is really interesting. And the place where I want to start with you is: obviously Notion is really well known for building thinking tools, and you were building thinking tools before there were even thinking machines. The way that you went about that is you created a text editor and hooked it up to a relational database, and you thought a lot about how to create the right primitives to allow people to interact with that in a really flexible way, to build whatever they wanted to build or think however they wanted to think. And that was in a pre-AI era. And so where I wanted to start with you is to ask what you think the right primitives are for thinking with AI.
Simon Last (00:03:01)
Yeah, that's a good question. It's probably helpful to start with what the new primitives are. The way I think about it is we've got the foundation models, or the model itself, which I think of as a thinking box: you can give it a bunch of context and some task, and then it goes and does one thing for you. It could involve some reasoning and it could involve formatting it as an action—so doing something. And then the other tool is embeddings—just really good semantic search. So I think those are the new primitives that didn't really exist before.
I think a lot of the same primitives still matter a lot. Obviously, a relational database is a pretty fundamental concept. If you're trying to track any information, it's pretty useful to do that. You don't just want to shove it into a text file. You want it in a structured format that's consistent, that you can query, and that can connect things. The good news is all the primitives still matter. But now you can plug in these thinking boxes on top to actually automate some of the tasks that a human would have done in the past, especially things that are cumbersome and that you don't want to do.
The way I think about the primitives that connect to AI: you've got databases, and you have a UI around the database that a human can look at and the AI can use. The permission model is really important as well. There are a lot of coding agent tools coming out. It's super cool, but one issue with that is you don't really want it to just make a Postgres database for you every time. What's the permission model? What can it read or write? How can I see the schema? It's actually really nice and important to have a permission model that the user can understand, so they can control what the AI can read or write. So I think a lot of the same primitives really matter, and I just think about what we're adding on top. Whereas before your database might've been essentially just data that you plug in with manual data entry or some lightweight integration, now you can actually put this reasoning box on top and much more fluidly transform information, pipe it in and out, or do reasoning steps on top of it.
Dan Shipper (00:05:06)
What do you think about chat as one of the primitives? And do you think that's going to continue to be a main way that we interact with these tools? Or are there other primitives that are going to become more important?
Simon Last (00:05:13)
Yeah, I think chat—probably some version of it—is here to stay. The human interface is just so intuitive. You just talk to it. The big issue with chat is that you get this empty text box and most people don't know what to type in there. It's really great for someone that wants to explore the system and figure out what it can do, but not so great if you just want it to do some task for you. And this is actually true not just of chat, but of anything. This is actually one of Notion’s biggest challenges: there are a lot of features and it actually takes a little bit of exploration to figure them out. We call those people tool-makers—people that are interested in exploring the boundaries of the tool and making their own little custom software. But one big discovery for us over the years is that most people just don't care about that. They just want a solution to the problem that they have, and they want it presented to them. They don't really have the patience to go figure out this complex tool, which is totally understandable. So I think chat is a low-level primitive that makes sense to have, but the real goal is to connect people to some workflow or use case that's solving their problem. And it's probably not the best interface all the time.
Dan Shipper (00:06:31)
Yeah. We do a lot of work with big companies and I see that all the time: probably five to 10 percent of their people want to play around with chat. They want to learn how all the AI stuff works. And then everyone else is like, let me just do my job. And usually I think what works is letting those 5 to 10 percent find the workflows and then giving the workflows to everybody else, so that they don't have to chat with it, or they can start with a chat that's pre-filled with the common things that they're doing. One of the interesting things I think about with chat, and I'm curious what your thoughts are: often, in pre-AI UI, you had to make the updates to the state of the application yourself. Checking a radio box or whatever is discrete: it's either checked or it's not checked. And it's also usually along one dimension. But with chat, you can move in a fuzzier, more continuous way through multiple dimensions at a time. Have you thought about that change from discrete to continuous, or single dimension to multidimensional? And how do you think those things work together best?
Simon Last (00:07:33)
My mental model is that, unless we're talking about embedding sliders or a make-it-funnier sort of thing, where the actual parameter is continuous, it works like this: you have your software state, and you think of it like a JSON blob, and then you have UI controls to manipulate that. And like you said, it's typically just editing this key to be false instead of true, or something like that, and the user can only do one thing at a time. Then I think of the AI as: you can give it some high-level instruction and it can go execute a sequence of commands, like a cascade of things, which are turning lots of the knobs. So that's how I think about it. Yeah, I guess the user's mental model can be fuzzier, but ultimately it still maps all the way down to: what are the knobs that it's turning? It's just that maybe the user has a fuzzier understanding of it, and then it's going and doing 10 things for you. It still works in the same way. And also, it introduces this new challenge of explaining to the user what happened, especially if it's a complex state.
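A minimal sketch of that mental model in TypeScript: the state shape, the command type, and the instruction-to-commands step are all hypothetical, just to show a single UI edit versus an AI-issued cascade of knob turns.

```typescript
// Hypothetical app state treated as a JSON blob: each key is a "knob."
type AppState = Record<string, boolean | number>;

// A discrete UI control edits exactly one key at a time.
type Command = { key: string; value: boolean | number };

function applyCommand(state: AppState, cmd: Command): AppState {
  return { ...state, [cmd.key]: cmd.value };
}

const initial: AppState = { darkMode: false, notificationsEnabled: true, sidebarWidth: 240 };

// A human clicks one checkbox: a single command.
const afterClick = applyCommand(initial, { key: "darkMode", value: true });

// An AI, given a fuzzy instruction ("make this quieter and roomier"),
// would emit a cascade of commands that turn several knobs at once.
const aiCommands: Command[] = [
  { key: "notificationsEnabled", value: false },
  { key: "sidebarWidth", value: 320 },
];
const afterAI = aiCommands.reduce(applyCommand, afterClick);

console.log(afterAI); // same low-level knobs, just many of them turned in one go
```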
Dan Shipper (00:08:30)
Yeah. What do you think about that? Or what have you found in doing that?
Simon Last (00:08:33)
I think about it as: what is the thing that's changing, and what is the most efficient, understandable way to present that? One that we've explored in the past is asking it to do edits across multiple documents. And we essentially just came up with a UX, nothing too crazy, where it groups the edits by page and then shows you the diff across each one. And then you can zoom in and look at the ones that you care about. But it's pretty tough. It's just a fundamentally hard problem, if it's doing something complicated, to then explain the complicated thing. Yeah, it's just hard.
Dan Shipper (00:09:07)
Yeah, that makes sense. One of the things I find is that even if you get it to summarize what it did, the summaries are so high level that it's saying a lot without saying anything at all. And getting it to be concrete enough, but not too detailed, is a really difficult challenge for some reason.
Simon Last (00:09:25)
Yeah. I think that's probably just fundamentally hard, especially if the thing is complicated: you're not going to fully understand it until you read the whole thing. Depending on the use case, though, you probably can go pretty far by calibrating that prompt to the appropriate level of granularity. I guess if you were to pick it apart, there's the problem of summarizing at the appropriate level of granularity, where maybe it's just missing an important detail that you actually wanted included. And then there's the more fundamental problem that you do want to reduce the information, and so it makes sense to drop some things.
Dan Shipper (00:10:08)
I want to go back to the relational database point. The mental model I have for relational databases, and you may have a different one, is that it's more effective to have a schema for a relational database if you know what the data is going to be used for. So for example, it's easier to have a relational database for a CRM where I know I'm going to use it to keep track of customers, so I have a customer table. And what's interesting about embeddings is they're able to capture so many more dimensions of what a piece of information is relevant to that you can use them for storing information in situations where you don't know what the information is going to be used for in the future. Obviously, so far with Notion, you've had to solve using a relational database to store information when you don't know what it's going to be used for. And I'm curious how you think embeddings change that picture, if at all.
Simon Last (00:11:00)
That's a really good question. I'll first address the point that it's hard to design it when you don't know what it's useful for. I think that's a really good pointer: don't design schemas when you don't know what they're for yet. This is something that I've been playing around with—AI helping you design schemas. We've tried versions where it just comes up with all the properties you might want, and it can come up with a lot of things, but not all of them are useful. I've had a lot more success with only giving the minimal schema that's required for the actual tasks that the user currently cares about. Each property should have a purpose; that really focuses the task and makes it more effective.
One point there, in terms of how I think about embeddings vs. deterministic querying, which I think is what you're getting at: I just think of them as two different tools that you have in your toolbox. Ideally you have both, and you can even maybe combine them. This is something that we're working on a lot: Q&A over databases. When do you turn to a deterministic SQLite query, and when do you turn to an embedding? I think it really depends on the question. Sometimes you want one, sometimes the other, and a lot of it is performance and cost, like latency concerns.
You could just make everything embeddings, or you could just map a model over every row of the database every time. Then you don't need embeddings or SQL either, right? Everything's unstructured. But I think that would be undesirable from a performance perspective, and it also wouldn't be fully deterministically accurate, which I think people care about if you're asking, how many sales did we do last quarter? Do you really want the model—
Dan Shipper (00:12:28)
It can make it up a little bit and get close to it, but it won't actually be right.
Simon Last (00:12:34)
Yeah. It seems a bit scary. So it depends on the question. Let's say I have a customer database or something and I ask, how many sales last quarter? I really do want a column with an amount and then to sum over it. But if I'm asking, do we have any customers in the entertainment space or something, maybe I want to be flexible on that. So yeah, I just think of these as tools in the toolbox and you want both. And then the challenge is in defining that routing or mapping layer: figuring out which tool is best for the job, combining them, and then presenting the user with the best result.
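Here is a rough sketch of that routing layer in TypeScript. The `classifyQuestion`, `runSql`, and `semanticSearch` helpers are hypothetical stubs, not Notion's implementation; in practice the classifier would likely be a small LLM call rather than a regex.

```typescript
// Route a natural-language question over a customer database to the right tool.
// All helpers are placeholder stubs so the sketch stays self-contained.

async function runSql(sql: string): Promise<number> {
  // Stand-in for a deterministic query engine summing a real column.
  return 0;
}

async function semanticSearch(query: string, topK: number): Promise<string[]> {
  // Stand-in for an embedding index over customer records.
  return [];
}

async function classifyQuestion(q: string): Promise<"aggregate" | "semantic"> {
  // Crude heuristic here; a real system might use a small LLM classifier.
  return /how many|total|sum|last quarter/i.test(q) ? "aggregate" : "semantic";
}

async function answer(question: string): Promise<string> {
  const route = await classifyQuestion(question);
  if (route === "aggregate") {
    // "How many sales last quarter?" -> deterministic sum, no guessing.
    const total = await runSql("SELECT SUM(amount) FROM deals WHERE quarter = 'last'");
    return `Total: ${total}`;
  }
  // "Any customers in the entertainment space?" -> fuzzy semantic retrieval.
  const hits = await semanticSearch(question, 5);
  return hits.join("\n");
}
```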
Dan Shipper (00:13:05)
You very famously, a couple of years, I think, into Notion’s life, went to Kyoto and stripped it all down and pivoted the company and it became what Notion is today. And I'm curious, let's just assume as a thought experiment that you're going to have a second Kyoto. You're going to go strip away everything that Notion currently is and rebuild it with AI, how would you do it? Or how would you think about it from scratch? What would you do differently now that these tools are here?
Simon Last (00:13:36)
Yeah. That's how I operate. When I'm thinking of a new project, I like to be pretty unencumbered by the way things currently work. But then the magic is also about taking this unencumbered, crazy idea and, ideally, finding an incremental roadmap for it. So I think there's a lot of detail in there. I don't just want to make up crazy ideas. I want to actually ship stuff incrementally, but still get to the crazy place. The really key, exciting thing to me is this thinking box: there's plenty of knowledge work that people don't really want to be doing, or that's too expensive for them to do because you'd have to hire humans to do it. Can we automate that stuff? One big principle would probably be that there are fewer humans touching the database, and the AI should be managing it for you. We were talking about customers: let's say you have a CRM-style thing. Ideally, you never need to update any of the fields, right? If the deal closes, it should know the amount based on your email. If someone talks in Slack about how the deal's at risk, that should be in the structure somewhere. You shouldn't need to update stuff. I think in the AI world, the database becomes more of an implementation detail, and hopefully the user interacts more with the processed outputs of it rather than the raw database itself. So maybe for sales, you really care about a daily progress bar, or seeing something about the productivity of your retail people, or something like that. And those should all just be presented to you directly. And then the database is just this background thing that's implementing the things you care about.
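One way to picture that "the AI manages the database" idea, as a hedged sketch: the `DealRecord` schema and the extraction step are made up for illustration; a real version would be an LLM call constrained to the database's actual schema.

```typescript
// Hypothetical CRM record the AI keeps up to date instead of a human.
type DealRecord = {
  customer: string;
  amount: number | null;
  status: "open" | "closed" | "at_risk";
};

// Stand-in for a "thinking box" call that reads an unstructured message
// (an email, a Slack thread) and proposes a structured update.
async function extractUpdate(message: string): Promise<Partial<DealRecord>> {
  // A real system would prompt an LLM constrained to the schema;
  // crude pattern matching keeps this sketch self-contained.
  if (/at risk/i.test(message)) return { status: "at_risk" };
  const closed = message.match(/closed .*?\$(\d+)/i);
  if (closed) return { status: "closed", amount: Number(closed[1]) };
  return {};
}

async function ingest(message: string, record: DealRecord): Promise<DealRecord> {
  const update = await extractUpdate(message);
  return { ...record, ...update }; // no manual data entry by the user
}

// Example: a Slack message flips the status without anyone editing a field.
ingest("Heads up, the Acme deal is at risk", {
  customer: "Acme",
  amount: null,
  status: "open",
}).then(console.log);
```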
Dan Shipper (00:15:09)
I love that, especially the first point: that you shouldn't have to interact with the database. What it reminds me of is there's this constant thing with Notion, and with any other kind of tool like this, where, especially if you're using it inside of a company, you're always asking: is this up to date? There's 5 percent of things that make it into Notion, but then there's 95 percent that's completely unwritten. And I think companies operate better when more of that stuff is legible, written down, and updated. Having worked in a big company, I've always thought companies should have librarians who are just responsible for that. I was the guy for a particular product. My previous company was a co-browsing company and I sold it, and I was the co-browse guy internally at Pega, this big public enterprise software company. It had a huge sales force, and I had written all these documents about how the product should be sold and what the details are and whatever. And even though I'd written everything down, all the sales people would still ping me on chat: you're the co-browse guy, what about this question? I'd be like, see my doc. But one, discoverability was really poor for them. And two, there's always that thing in your head where you're wondering: is this up to date? And it seems like what you're saying is that there's an opportunity now, without someone having to do it manually, to take a lot of that stuff that would ordinarily not be written down and get it into a format where it's recorded for other people to use. Is that kind of what you're saying?
Simon Last (00:16:27)
We're definitely excited about that as a use case. With the current Q&A in Notion and third-party connectors, you can get at least part of the way: if the salesperson has a question, they can ask the AI. That's pretty cool. But a lot of the time you didn't write the doc in the first place, and then once you write it, you want to maintain it. Those are both really interesting use cases that would be super exciting. A fun thing about these thinking boxes is that now you can treat a knowledge base like a database where the operations on it can be semantic. I think that's pretty exciting. It's thinking about: how can pieces of information conflict with each other, and how would you resolve that?
Dan Shipper (00:17:02)
Yeah, that’s super interesting. I love the word thinking box and it makes me think, what is thinking? What do you think the boundaries are of what that thinking box can do vs. not do?
Simon Last (00:17:13)
Yeah, I don't think there are that many boundaries. The abstraction is pretty complete already; it's more that the models still kind of suck. You give it vision—assuming it's multimodal, right? There's not much more to it. Maybe there are robot actuation commands, but those can be represented in the same models too, so the abstractions are already complete. And then the really critical thing, the critical shape of it, is that it has some context, and it has these tools that it can use, and those tools produce observations, and then you just loop on that. And that's an agent that can do anything, assuming the model actually works—they don't yet. Depends on the use case, but—
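A bare-bones sketch of that shape (context, tools, observations, loop) in TypeScript. The tool set and the `think` stub are placeholders; in a real agent, `think` would be a model call that decides the next step.

```typescript
// The "critical shape" of an agent: context + tools -> observations -> loop.
// Everything below is a placeholder sketch, not a production agent.

type ModelStep =
  | { kind: "tool"; tool: string; input: string }
  | { kind: "done"; answer: string };

const tools: Record<string, (input: string) => Promise<string>> = {
  search: async (q) => `stub results for "${q}"`,
  read: async (id) => `stub contents of ${id}`,
};

// Stand-in for the model deciding what to do next, given the transcript so far.
async function think(context: string[]): Promise<ModelStep> {
  return context.length < 3
    ? { kind: "tool", tool: "search", input: "next thing to look up" }
    : { kind: "done", answer: "final answer based on observations" };
}

async function runAgent(task: string): Promise<string> {
  const context: string[] = [task];
  for (let step = 0; step < 10; step++) { // hard cap so the loop terminates
    const next = await think(context);
    if (next.kind === "done") return next.answer;
    const observation = await tools[next.tool](next.input);
    context.push(`${next.tool}(${next.input}) -> ${observation}`);
  }
  return "gave up after too many steps";
}

runAgent("summarize last week's updates").then(console.log);
```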
Dan Shipper (00:17:55)
I don't know if you saw, but Anthropic dropped a new computer-use model for Claude the other day. And one of the things that they touted in that release is that you don't actually have to explicitly identify tools. Instead, the model just understands that there's a browser in front of it that has certain things, and that the computer has applications that allow it to do things. So the tools are implicit rather than explicit. What do you think the trade-offs are there? Do you think explicit tools are actually what's best, or should it be implicit, and where?
Simon Last (00:18:25)
Yeah, super excited to see that. I mean, on a technical level it's still a tool. It's just that the tools are like, click this coordinate and type this. I guess that's true. So it's still just implemented as tools; it's just that the click tool is pretty powerful. Click and type: you can do a lot of stuff. And then the observation is seeing what happened afterwards. I was super interested to see it. It's something I've been expecting to start working, and it seems like it doesn't quite work yet, just early signs, but it's cool that they're showing it to the world.
The way I'm thinking about it is that you want to give the AI the most convenient way to do the task possible. There's some quality constraint around whether it can do it, there's some performance and latency constraint, and then maybe something around how users can observe and control it. So I think computer use is going to be very open-ended. Like you said, at least currently, the quality seems much lower than if you were to give it a more specialized tool. The latency is very bad—super slow. You could get much better results by giving it a code API. For example, say your goal is to download a recipe. You can have it go to Google Search and find the recipe, or, if you give it a recipe search API, that's going to be done in less than a second. And then there's the controllability thing, which is pretty important, especially if it's doing something autonomous for you. The shape of this that I feel bullish on: it doesn't seem that interesting to me to have it control your computer while you're watching.
The interesting thing seems to be that you ask it to do something, and then it goes off and comes back to you when it's done, and you can go do something else. It has its own computer. And I want to be able to control what it has access to; I think that's pretty important. If you're giving it a computer, it's pretty open-ended, so we need to develop some of those controls around it. But I'm excited about it. Ultimately, the way I think about it is that it's just another tool in the toolbox, and the ultimate answer probably looks like a mixture: when you can get an API, that's much better, and that will always be better; and when you can't, it's nice to have this escape hatch where it does stuff on a computer.
Dan Shipper (00:20:36)
I think that's totally right. For tasks that are repeated and that you know are going to happen inside your application, having a specific API that just does them really quickly is great. It's like you have muscle memory for figuring out how to pick up a glass, and maybe there's a tool in your head that's really tuned for picking up a glass to drink. And then you fall back to this slower, more open-ended thing that can do much more for the tasks that the more specific tools can't handle.
Simon Last (00:21:05)
Yep. Yeah, another angle that's interesting is the market dynamics angle. I think we might start seeing people shutting that down or not wanting it. There's gonna be a race where people who use a computer-use agent are gonna want it to access all their stuff, and the companies that manage those tools might not want that, because it's like a third party. There's already a whole industry around preventing bots from accessing websites and stuff. But now bots are useful for real work, so what are we gonna do about that? I'm not sure how that's going to play out, but I think it'll be really interesting.
Dan Shipper (00:21:40)
What would be your guess?
Simon Last (00:21:42)
I think people are definitely gonna want to do this, and they're going to have legitimate reasons to do so, unlike in the past, where maybe it was scamming or hacking. Now it's: no, I'm actually trying to perform this task, I'm paying for your software, so you should let me do that. I think that makes sense. I could see a world where, probably, the ideal outcome is that everyone allows it, but they get paid for it in some way. I don't know what the shape of that is exactly, but I think that's ideal. If you make some software that's valuable and people are using it in this way, somehow value should accrue to you.
Dan Shipper (00:22:21)
Do you think we're going to see a world where there are interfaces that are specifically for verified humans and then LLM-friendly ones? I guess that's an API, or something different from a traditional API: interfaces built specifically for LLMs.
Simon Last (00:22:36)
Yeah. I think so. I feel like my job description is to design those—there are all these quirks. The quirks will go away over time. I saw someone tweet the other day that you're gonna have an alternate form of your website that's just plain HTML with divs and buttons. I love that idea. I think it's a tricky race, because on the one hand the current models are not good at many things and you do need to design those custom things, but as the models get better, maybe you need a bit less of that, and maybe they can also just build their own. I think eventually the model can just build its own scaffolding, right? You give it something and it's like, alright, I'm going to make a whole Python code repo, and maybe inside of that it's going to figure out the problem I was just describing: which things can I use code for? That's better. And which things do I need to call out to some browser for? That's way less ideal, but I'll do it if I need to. I feel like the ultimate abstraction just closes over all of this. So, as a human, whether it's using a code API or a browser is an implementation detail.
Dan Shipper (00:23:42)
You said that part of your job is figuring out good interfaces for LLMs. What are the current properties of a good LLM interface?
Simon Last (00:23:49)
That's super fun. Okay. So yeah, there's a bunch of principles. One is that you want to align to things the model's been trained on as much as possible. What does that mean? Just as a concrete example, one way we've tried representing a Notion page is as this XML tree, which is much more faithful to the way it's persisted. But the model just wants to speak Markdown. And so—
Dan Shipper (00:24:10)
Interesting. But they're prompted in XML?
Simon Last (00:24:15)
So, it's prompted in XML, but typically the XML that's used in prompting is pretty simple. It's like wrapping things in a tag, and it's very good at that. The format that we came up with is a much more complicated form of XML where each Notion block is in a tree, and there are many layers of nesting, and there are rules about which blocks can contain other blocks. Describing the spec actually takes several thousand tokens. The trouble with that is that the models can do it, but you're actually harming the model's ability on other tasks. My mental model is that as it's generating the tokens, it has to attend to all your complex instructions about the formatting and also do the reasoning to answer the question, and it definitely makes it worse. It's better if you can speak a language it already knows how to speak, and that's just a matter of what it was trained on. Markdown is a good example: everyone trains on Markdown, and so it's just really good at that. You don't need to give any extra instructions for it.
Dan Shipper (00:25:05)
Even if Markdown has a more complicated structure, are you flattening the tree into a more linear—?
Simon Last (00:25:15)
I think Markdown is also just simpler. It's a very simple, kind of lossy language with very few ways it can fail. So yeah, I think that's one class of things: aligning to the model. Another one is that you want the structure of your output to be as simple as possible for the output that you need. I think that's really key for any formatting. You might want it to be a bit looser, depending on what you're doing, but you want to really go hard on making it as simple as possible while still doing the task that you care about.
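A toy illustration of the contrast Simon describes. The XML block names below are invented, not Notion's actual block spec; they just show the kind of nesting rules a model would have to juggle versus plain Markdown.

```typescript
// The same page content, shown two ways.

// A custom nested-XML representation: faithful to how blocks are persisted,
// but the model needs a multi-thousand-token spec to get the nesting rules right.
const asCustomXml = `
<page>
  <heading level="1">Q3 planning</heading>
  <bulleted_list>
    <list_item>
      <text>Hire two engineers</text>
      <bulleted_list>
        <list_item><text>One infra, one product</text></list_item>
      </bulleted_list>
    </list_item>
  </bulleted_list>
</page>`;

// The Markdown version is flatter and lossier, but it's the language the model
// has seen everywhere in training, so it needs no extra instructions.
const asMarkdown = `
# Q3 planning
- Hire two engineers
  - One infra, one product
`;

console.log(asCustomXml.length > asMarkdown.length); // the tree form is also just more tokens
```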
Dan Shipper (00:25:52)
What would be an example of a time when you learned that or that stood out to you?
Simon Last (00:25:55)
I think the XML structure applies there too. When we first started doing it, our original principle was to perfectly map the way it's actually persisted and displayed to the user. That's ideal, right? There's no lossiness at all. But then there are all these little quirks, just little things that it gets wrong, and it was often easier to just simplify. Even if it's somewhat lossy, it's worth it, because at least you can control that; whereas if it's too hard for the model, it's the end of the line. Then there are the basics: describe your task as simply as possible, use few-shot examples. Another class of learnings is that if you're working on the prompt and you notice some class of issues, my first line of defense is to try to make that class of issues impossible in the system, or to put validation around it.
That's the ideal. And if I can't think of a way to do that, then I'm going to try to make the prompt better: add an example or change instructions. I think that's a really fun one, and it really depends on the task, but I'll give you one example. In this example, I'm making fake test data, and there's one prompt that describes the kind of fake test data that you want to make, and another prompt that actually writes out the fake test data in detail. One little constraint is that I don't want it to generate too much, just because it'll be too many tokens and take too long. So I have this constraint in the description that it should only make up to 10 records or something like that. But sometimes it just wouldn't follow that instruction. That's annoying. One little trick is, while it's generating the description, ask it to estimate the number of records that will be needed for this test data. Just forcing it to output that aligns it much better, because it's not just an instruction it can maybe ignore; it had to actually produce a number. And then also, if the number it produces is too high, I can actually just throw an error and have it try again.
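A sketch of that trick in TypeScript. The output shape, the `generateDescription` stub, and the retry logic are assumptions for illustration; the point is just that the model commits to a concrete number you can validate deterministically.

```typescript
// Make the model commit to a record count up front, then validate it and
// retry when it overshoots. generateDescription stands in for the real LLM call.

type TestDataPlan = { description: string; estimatedRecordCount: number };

async function generateDescription(task: string): Promise<TestDataPlan> {
  // Real version: an LLM prompted to describe the fake data *and* output
  // the number of records it expects to need.
  return { description: `fake data for: ${task}`, estimatedRecordCount: 8 };
}

const MAX_RECORDS = 10;

async function planTestData(task: string, maxAttempts = 3): Promise<TestDataPlan> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const plan = await generateDescription(task);
    // Because the model produced a concrete number, this check is deterministic,
    // and throwing the plan back for another attempt is cheap.
    if (plan.estimatedRecordCount <= MAX_RECORDS) return plan;
  }
  throw new Error(`could not get a plan under ${MAX_RECORDS} records`);
}

planTestData("a small CRM with a few deals").then((plan) => console.log(plan));
```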
Dan Shipper (00:27:57)
That's really interesting. I think there's two principles packed in there. One is the thing you said, which is making a certain class of error impossible, which maybe it's like changing your prompt, or maybe it's not even calling the AI in a situation where that error might come up or something like that, which is really interesting to me. The other one is I think you're implicitly doing a little bit of a chain-of-thought thing where by asking it to output how many examples it thinks are necessary you're aligning it in the same way that chain-of-thought works.
Simon Last (00:28:22)
Yeah. I think doing structured chains-of-thought that are specific to your task is really useful.
Dan Shipper (00:28:26)
One of the things that you all have is probably one of the most scaled AI applications in the world right now. And one of the things I think is really interesting to dig into is that previously software was deterministic, and now it's very stochastic. It's much more squishy. And especially at scale, releasing squishy software to the world is scary. I'm curious what you learned about doing that, what you learned about good evals, all that kind of stuff.
Simon Last (00:28:53)
Yeah, it's really annoying. Prior to 2022, I never did any AI stuff, really, besides taking some classes in college. And I definitely miss the days when I could write a QA doc and write some tests and it all kind of works, and I have a good mental model of how it's not going to fail. The AI thing is so hard because it can fail. There's the problem that it can fail in some cases, but then there's this additional meta-problem that you might not even know the cases where it can fail. And usually what happens is, as you're ratcheting on a prompt, you end up discovering more and more of these. And sometimes you discover really major ones: after you have a huge eval set, you find some new one and it's like, oh man, this totally breaks it. So even just discovering the distribution of the possible errors is really hard. And I've definitely been led astray multiple times, thinking that I'd solved it and then finding a whole new class of errors that are really hard to solve. So I would say that's just really hard for evals. The way I think about it is you've got deterministic evals and you've got non-deterministic evals. If it's possible to make a deterministic eval, that's great. I love to design workflows such that there are some classifier elements within them, producing an enum or a yes-no value or something like that. Those are great because they're super easy to eval: you can just collect a dataset of inputs and the correct outputs and then get a score. So if there's some complex workflow, I love to come up with classifiers within it. That's one big strategy. And then there are non-deterministic evals. I found that—
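A minimal deterministic eval of the kind Simon describes, scoring a classifier-shaped step against a labeled dataset. The `classifyIntent` stub and the example labels are hypothetical.

```typescript
// Deterministic eval: compare predicted enums against labeled expected outputs
// and report a plain accuracy score you can track over time.

type Label = "aggregate" | "semantic";
type Example = { input: string; expected: Label };

async function classifyIntent(input: string): Promise<Label> {
  // Stand-in for the real prompted model call being evaluated.
  return /how many|sum|total/i.test(input) ? "aggregate" : "semantic";
}

async function runEval(dataset: Example[]): Promise<number> {
  let correct = 0;
  for (const ex of dataset) {
    const predicted = await classifyIntent(ex.input);
    if (predicted === ex.expected) correct++;
  }
  return correct / dataset.length;
}

runEval([
  { input: "How many sales did we close last quarter?", expected: "aggregate" },
  { input: "Any customers in the entertainment space?", expected: "semantic" },
]).then((accuracy) => console.log(`accuracy: ${accuracy}`));
```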
Dan Shipper (00:30:26)
Is the vibe of this correct?
Simon Last (00:29:38)
Yeah. So just using an AI to evaluate something. The trouble is, if you have an AI evaluating, you need to eval your eval now. If running a prompt is hard, now you have a whole other one. I've found that you have to be pretty careful; I've definitely learned to be pretty cautious about these. It's easy to come up with an idea for a model-graded eval that sounds good, but then in practice you try making it and you just slog on it for a while, discovering boundary cases of the eval itself. So I've found that they work best when the thing you're trying to evaluate is quite targeted and you can describe it very clearly. And you want to make the shape of the evaluation task such that the model you're using to run it is extremely good at the task, to the point where you can actually trust it and you don't need to spend a bunch of time evaluating that. Making it very clear and narrow really helps, and then using the appropriate model for it. But yeah, that's really hard; evaluation in general is really hard. Another thing is that you definitely want a really solid loop around logging, collecting datasets, collecting issues and labeling them, and then optimizing the loop that lets you improve a prompt and make sure it's not regressing on the previous examples. So yeah, there's a whole lot of stuff to do in there, but yeah, it's really annoying.
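For the non-deterministic side, here is a sketch of a narrowly scoped model-graded check in the spirit Simon describes: a yes/no question the grading model should be near-perfect at, rather than an open-ended rubric. The `gradeWithModel` call is a hypothetical stub for an LLM judge.

```typescript
// A narrow model-graded eval: reduce the judgment to a yes/no question that is
// easy for the grading model, then treat the verdict like a classifier output.

async function gradeWithModel(prompt: string): Promise<"yes" | "no"> {
  // Stub for an LLM call constrained to answer only "yes" or "no".
  return "yes";
}

async function summaryNamesOwner(summary: string, owner: string): Promise<boolean> {
  const verdict = await gradeWithModel(
    `Does the following summary explicitly name ${owner} as the owner of the task? ` +
      `Answer only "yes" or "no".\n\nSummary:\n${summary}`
  );
  return verdict === "yes";
}

// Verdicts like this can then be scored against labeled examples, exactly like
// the deterministic classifier eval above.
summaryNamesOwner("Q3 launch is on track; Alice owns the rollout.", "Alice").then(console.log);
```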
Dan Shipper (00:31:53)
About that part, using feedback or whatever to improve a prompt and then make sure you're not regressing. How are you doing that?
Simon Last (00:31:59)
Yeah, I think it really just boils down to having really good evals that are appropriate to the prompt, having good datasets that capture the distribution of errors that you care about, and then making it easy to run the evals and flag regressions. It's simple, but it's just annoying to set up, and when you have lots of prompts, you have to do it many times. So I've found you want solid, standardized ways to do this.
Dan Shipper (00:32:28)
Are you using all homegrown tools for this or are you using— Anthropic has an evals library, OpenAI has one, there's a lot of open-source ones.
Simon Last (00:32:34)
We use Braintrust for storing data sets and running evals, but all the actual evals we write ourselves.
Dan Shipper (00:32:41)
One of the things you said earlier that struck me is this idea of exploring the distribution of errors you can get. And sometimes you'll have a big eval set and then you'll find something in the distribution where you're like, it breaks everything. What have you learned about doing that exploration to minimize the chances you find something totally unexpected?
Simon Last (00:32:58)
Yeah. I don't try to do evals too early in a project. I think you can actually go too hard in the other direction as well. When I'm starting a new thing, I feel like you want to start with more of a vibe check and really be flexible about how the task is structured, because I've found that, especially early on, there are a lot of returns to just changing the structure of the flow. And if you spend too much time collecting datasets, you're just going to trip yourself up a lot. So I feel like there's a mode switch when you say, alright, let's actually productionize this, where you want to switch to intensively finding issues. And I think it helps to have data labelers dedicated to that. And then it's just hard: figuring out, as best you can, how it will actually be used, and then making your dataset map to that. Ultimately that's the game. And then after you deploy it, the game is around actually flagging those examples and saving them.
Dan Shipper (00:33:57)
When I interact with Notion AI, you're going to go find a bunch of presumably embedded text from my Notion and then put it in the context, and that's going to help you answer the question. How do you think about how much to pack into the context? Are you trying to get all the information in there, even if it's not as relevant, or are you being more selective about it, and why?
Simon Last (00:34:17)
Yeah. It's an empirical question for sure. I think it's changing all the time. Even just this year, it's changed so much because the models have all gotten pretty long context. Some people will tweet things like, “RAG is dead” or something. But there's also a latency and cost concern.
Dan Shipper (00:34:37)
And attention is different in the middle of the context—
Simon Last (00:34:40)
It also just doesn't work yet. Yeah. So I think we're definitely still strongly in the world where you want to limit the context. If you can remove irrelevant stuff, there are definitely returns to that, even with the latest models. And if you ask OpenAI or Anthropic, they'll tell you that as well. So I don't know. It's hard. It's really an empirical question, though. I'm hoping the attention gets way better, the context gets longer, it gets faster to process, and caching gets better. Then maybe I'd like that world, because we could worry less about removing irrelevant stuff, and that makes our job easier. I'm very bullish on being able to do that; I feel like we're not quite there yet. And we've certainly expanded the context that we show the models. Our original Q&A was on the original GPT-4, which had 8,000 tokens of context, so obviously we had to be super constrained. We could only show a few thousand tokens, especially if you want to go multi-turn and not have it just forget things. Obviously now it's much longer, so we can show more, but it's still definitely constrained by quality, cost, and latency.
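A small sketch of what "limit the context" can look like mechanically: rank retrieved chunks and pack them under a token budget. The chunk shape and the four-characters-per-token estimate are simplifications, not Notion's retrieval pipeline.

```typescript
// Trim retrieved chunks to a context budget, keeping the most relevant first.

type Chunk = { text: string; relevance: number }; // relevance e.g. from embedding similarity

function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4); // rough heuristic, not a real tokenizer
}

function packContext(chunks: Chunk[], budgetTokens: number): Chunk[] {
  const ranked = [...chunks].sort((a, b) => b.relevance - a.relevance);
  const selected: Chunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    const cost = estimateTokens(chunk.text);
    if (used + cost > budgetTokens) continue; // drop what doesn't fit
    selected.push(chunk);
    used += cost;
  }
  return selected; // less relevant material never reaches the prompt
}

const packed = packContext(
  [
    { text: "Q3 sales summary ...", relevance: 0.92 },
    { text: "Office snack policy ...", relevance: 0.12 },
  ],
  2000
);
console.log(packed.length);
```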
Dan Shipper (00:35:47)
What are you personally coding with? Are you using Cursor? Are you using Claude 3.5? Are you using o1? What's your workflow or stack?
Simon Last (00:35:57)
I use Cursor. Cursor is really good. The thing I like about Cursor is just the autocomplete they build is really good. It's way better than Copilot.
Dan Shipper (00:36:03)
Are you using Composer?
Simon Last (00:36:04)
Is that the one where it's tab completion and it can do arbitrary edits?
Dan Shipper (00:36:08)
Composer is like a window that pops out and it makes multi-file edits and all that kind of stuff.
Simon Last (00:36:11)
Oh so, the command-k.
Dan Shipper (00:36:12)
Yeah. Command-i.
Simon Last (00:36:14)
Ah command-i. Oh, I don't think I’ve tried that.
Dan Shipper (00:36:18)
Oh, you gotta try it. It's really cool. Cursor Composer is much better at doing multi-file edits. It's a little bit more agentic.
Simon Last (00:36:22)
Yeah, I don't think I've actually even tried that. I've used the command-k and then I've used the command-i. Or, no, the sidebar chat.
Dan Shipper (00:36:30)
I think you have to do less, because in the sidebar chat you have to go scroll through each step and then press apply, and it takes a while, and Composer does all of that a little bit more automatically.
Simon Last (00:36:38)
Yeah. I haven't tried doing multi-file edits much. I guess my mental model's probably not that good, but I should try it. I feel like I don't understand if it's good or not, which I should. But yeah, I use a lot of autocomplete, and my typical workflow is autocomplete all the time; obviously that's just ambiently there. And then I'll ask for code. Sometimes I'll use the Notion or Claude or ChatGPT interface as well. And my mental model there is more about a specific function I want to write; that's typically the abstraction level that I ask it to code at. Anything beyond that, I don't know. No one's really cracked the multi-step coding agent thing yet. But it can write pretty good functions if you give it good instructions and context. And I've tried using the retrieval over code, which has helped me a few times. We have a pretty big codebase, and so I can ask, what's the function that does this thing?
Dan Shipper (00:37:26)
Is that Cursor's retrieval over code, or is it somewhere else?
Simon Last (00:37:28)
It's in Cursor, yeah. That's part of the sidebar.
Dan Shipper (00:37:30)
I see. I see. That's really cool. And models-wise, are you using Sonnet or using o1?
Simon Last (00:37:36)
Sonnet is my day-to-day workhorse. o1—yeah, I've been playing with it a bunch. I've had some success with o1 and o1-mini on coding stuff, but it feels like it's not that much better than Sonnet at coding, and it's slow. It also feels weird to prompt right now. I've definitely gotten weird results from it, especially when you give it a bunch of context. It seems like they trained it a lot on these math and coding puzzles, which have very little context and a lot of reasoning. And it's really good at that, but only if you can put your problem in that shape. I've found it a bit finicky. I'm really excited about the paradigm, though, and I'm curious to see when they produce the final model that's been fully trained on a bigger distribution of inputs.
Dan Shipper (00:38:21)
I think that's what I was going to ask you about. What do you think about scaling inference compute as a new paradigm and where do you think it's going?
Simon Last (00:38:29)
Yeah, it's super exciting. I think it makes a lot of sense intuitively. I was initially underwhelmed when it first came out, because I had intuitively thought: wouldn't it be better if it reasoned in the latent space more? Wouldn't that be more powerful? But then I was talking to some friends about it, and actually it makes a lot of sense for it to be language, because the model has all this prior over language already, and the reasoning is over language. So it makes a lot of sense. It's like the dumbest possible thing you could do: just let it think more using language. And my understanding is the tricky part is making the reinforcement learning work, and there are all these details that OpenAI figured out but no one else has yet. But yeah, I feel pretty excited about it. I was pretty impressed by the graph where, as you scale up inference compute, it gets smarter, and that makes a lot of sense. I feel like the thing that's going to be the real unlock, though, is putting tools in the chain-of-thought.
And that's where it gets really interesting, because right now the shape of the problem you need to give it is: you give it all the context it needs upfront, it thinks a lot about that, and then it produces an answer. I'm sure there are plenty of things that are like that, but it's actually a pretty specific tool. The big unlock that I'm excited for is putting tools in there, and then you get reinforcement learning over that. That feels really interesting.
I feel like that's when agents are going to actually work: when you can give it some high-level task, give it these tools—maybe it's a browser. Then it enters a hard loop of think a bunch, use some tools, see the outputs, think more, and keep looping on that. I feel like that can solve a lot of stuff. And then whoever can figure out that kind of long-horizon reinforcement over that seems—
Dan Shipper (00:40:22)
I think tool use is coming. I think they said at Dev Day that it's coming before the end of the year, which would be pretty cool.
Simon Last (00:40:27)
It makes sense. It's the obvious thing that they should do. But it's probably hard to figure it out.
Dan Shipper (00:40:30)
So you talked about them being trained on these math reasoning problems. What do you think about zooming models in on those kinds of problems? I don't know, there's an endless amount of other things that you could have it do while they're trying to make it reason. Do you think that allows it to come up with new things or be creative? If you let one of these things run, let's say you could just scale the inference time indefinitely, would it come up with new things, or do you think that's a different type of thinking that requires a different type of training loop?
Simon Last (00:41:02)
Yeah. Obviously I don't know what OpenAI is doing, but speculating, it seems like the place where you can get this reinforcement to work is when the results can be verified in some way. So you're touching reality in some way. It's not that the model is making stuff up and a human says, oh, this one's slightly better. You're writing some code and there are unit tests and they have to pass, or there's a math problem and there's an answer to it. For this one reasoning step, there's a correct thing to think.
I think that's really interesting because it lets them scale up the training. And with that as a tool, you can now discover new things, I think, just by mining it, right? If you know how to verify some problem, you can spend lots of time thinking about the answer and keep going until it's correct. So in that domain, you can be really creative, or at least find creative ways to solve things.
I think you're maybe pointing out something a bit fuzzier though, maybe an aesthetic creativity or something like that. That feels like a different thing. It feels like not the direction that the companies are going now. It feels like they're really doubling down on these verifiable things.
Dan Shipper (00:42:24)
It's really interesting to me: When you look at how mathematicians or scientists talk about coming up with new theories there's usually a big aesthetic component, like the idea of beauty or simplicity or whatever is driving them. And when you're coming up with new theories, they're often not verifiable when you first come up with them. So an example is we didn't verify a lot of relativity for a long time after Einstein came up with it. Do you worry that focusing solely on training loops where each step can be verified or at least the outcome can be verified or some of the thinking steps can be verified limits the ability of these things to think in ways that are valuable?
Simon Last (00:43:02)
Yeah. I think to some extent it's all they can do, probably, because if we've exhausted all the human data and you need to make more data and you want it to be good, you have to verify it. So some of that stuff might be off limits until we— I don't know, though. One way to think about it is that when Einstein came up with his theory, he was doing some verification: presumably he was writing down equations and making sure they worked out. But he also had this built-in aesthetic, which is a bit fuzzier, of what makes a nice theory. And that was based on his previous learning in physics and all the stuff he learned in his life. I wouldn't be surprised if OpenAI makes some models and trains them on all the physics and then has them go mining to produce new physics and new math. And I wouldn't be surprised if something analogous to that aesthetic developed: you keep mining and it discovers 10,000 new theorems that are all proven to be correct, and then you start asking high-level questions about math. I wouldn't be surprised at all if it had some analogous thing going on there.
Dan Shipper (00:43:58)
Yeah, I guess if the theorems that it's seeing are representative of that aesthetic, the aesthetic is in there, even if we're not really even talking about it.
Simon Last (00:44:06)
Yeah. You would hope that the aesthetic is pointing to the truth in some way and it's coming from the truth. The simpler theory is nicer because it's more likely to be correct. And to the extent that those are true, I would expect the model to learn them, too.
Dan Shipper (00:44:25)
Which is a big assumption. It's definitely more likely to be useful—or beautiful.
Simon Last (00:44:29)
Yeah, humans have additional constraints. In some ways the aesthetic is a computational shortcut. The human Go player can prune out whole classes of moves that maybe the AI would brute-force more. Maybe Einstein had to do that due to the limits of human cognition. Maybe the only way Einstein could have done it was with the aesthetic shortcuts.
Dan Shipper (00:44:45)
Yeah, I think the beauty of LLMs is they get you the thinking machine without having to brute-force—they do the same kind of aesthetic shortcut or intuitive thinking. Early AI attempts, like Deep Blue solving chess, just went through the branching tree. But when you go outside of chess, you get to these problems where the branching tree is so computationally expensive to traverse that it doesn't work anymore. And language models have figured out how to set the frame of possibilities correctly so that they don't run into that computational wall.
Simon Last (00:45:20)
I think it's both. If it's producing a new theorem, you'll probably have to try many times. But then it can also have these thinking shortcuts. So I can see the best of both worlds. You couldn't get a human to do that much work.
Dan Shipper (00:45:27)
Yeah. You're trying many different times and you're examining different, smaller parts of the tree, but you're not brute-forcing the entire tree, because the theorem that it produces—
Simon Last (00:45:35)
Yeah, there’s probably still shortcuts, it's not gonna do stuff that obviously doesn't make sense. But then maybe it's just a really hard task and there's many things that could make sense.
Dan Shipper (00:45:43)
Yeah. I could only find one other interview with you, so this is coming from that interview. One of the things that you said there is that one of your core philosophies is: if you can build something that you want, that's a great place to start. Something you build that you're going to use yourself. You were the first users of Notion, all that kind of stuff. And I'm curious if you have anything that you feel like you're building for yourself right now, or if there are things like that that you want right now.
Simon Last (00:46:08)
Yeah. I would say that principle isn't universal, because obviously, if you're making something for diabetes patients, someone needs to make that, and maybe not only people with diabetes, but it's helpful if you have it. So it's a luxury. I think of it as a wonderful luxury that, if you're deciding what to do with your life, it's nice to pick one of those things. And I really cherish that Notion is a thing we can use every day, and I lean into it and try to compound on it. What's something that I'm doing? I guess a lot of the AI stuff. It ties back to a lot of the stuff that we've talked about that I'm excited about. There are cumbersome tasks that me and the people around me are doing all the time that we could stop doing. Within the realm of knowledge work, we've talked about some of them: keeping documentation up to date, keeping databases up to date, not having to do manual data entry and continually update things. That's super annoying. There's a common class of tasks around rolling something up and summarizing things from a database or wherever, like your weekly update. Especially in a big organization, there are many levels of weekly update, and it's super annoying. It's a huge use of everyone's time. Why are we doing that? The data's there. The docs are written and the Slack channels have been responded to. It's all there, and you could write an update at any level of granularity that people care about, if you just have the right context and the right instructions.
Dan Shipper (00:47:30)
I totally feel that. We have a product studio, so we incubate little products, and everything that we incubate we use ourselves, which I think is great. And I think one of the really cool things about AI is that it means the low-hanging fruit hasn't been picked yet. So you can make something for yourself and there's not 10 years of tech nerds who have made it before—it's all new. So everything gets made. And one of the things we're making that's been delightful, which what you said reminds me of, is that every morning now I get a little podcast of the meetings or Discord conversations that I missed (we use Discord to chat about things). And it's the NotebookLM stuff, where it's this sort of NPR “Fresh Air” type thing, with two hosts talking about what happened. And I can get caught up while I'm doing dishes or whatever. It's really fun. I'm super excited for where that's going.
Simon Last (00:48:26)
Actually, I just wrote down an idea very similar to that the other day. That's so fun. I'd love to see that. I definitely vibe a lot with the feeling of openness in the space. It's really exciting. One thing I really appreciate, that really keeps me going, is that so many things haven't been tried, and also the technology is so overpowered in some ways. I've found it very common that I can just come up with a random technical idea, just, oh, I wonder if it would work if I did it this way, and then usually it does work, which is pretty cool. It's like, once you develop the intuition, maybe there's a paper about it you could go read, but it's really just simple; most of the LLM papers are simple ideas.
Dan Shipper (00:48:55)
Yeah and sometimes there isn't. And it's like you're on the frontier.
Simon Last (00:48:59)
Yeah. It's easy to be on the frontier and just to try stuff and it often works, which is such a good feeling. It's oh, I had this creative idea and I tried it and it worked. What's better than that? That's pretty amazing.
Dan Shipper (00:49:07)
Totally. So I put out a tweet or an X post asking what I should ask you. And someone who submitted a question is Linus, who used to work on the AI team. Linus asked how working with AI has changed what you think machines are good at vs. what humans are good at?
Simon Last (00:49:25)
Yeah, I think it's always shifting. It's only continued to change in that regard. I'm pretty AGI-pilled, so I think at some point there will be nothing left. But right now, there are many things that humans are better at. Actually, maybe a better question is: what should humans do? It will quickly become the case that you're actually just not better at anything. I feel like the most important thing there is making sure that the shape of the future is aligned with what we care about, and that we agree to it, consent to it, and support it. And a lot of it is around us orchestrating the tasks that the AIs are doing. What do you want them to be doing? That's a really important question. And then once you have it going and doing that, making sure it is doing the right thing and that you have a way to observe it and check on it. That feels like the key.
Dan Shipper (00:50:15)
So a way to boil that down is wanting is what humans are good at.
Simon Last (00:50:20)
I think so. Yeah. It’s defining the high-level goal and making sure that's what's being done. Ultimately, at the end of the day, that's everything. And then over time, all of the details of how it gets done will be whittled away, and the AIs will be doing all of it, which is great. I think that's a great world.
Dan Shipper (00:50:35)
You said you're AGI-pilled. What does that mean?
Simon Last (00:50:38)
I would assign a pretty decent probability to AGI in the next 10 years—and not necessarily requiring huge paradigm shifts.
Dan Shipper (00:50:49)
Do you have a personal definition of AGI? What does that mean to you?
Simon Last (00:50:53)
Yeah. I like OpenAI’s definition: can it do all the economically useful tasks? I think the critical question is around the intelligence explosion. Do you think it's possible? What's the probability of that? How fast will it happen? And then the key question there is: when will AIs be good enough at doing all the AI research tasks to significantly contribute to making the models better? And I'm pretty concerned about that. I find the case pretty compelling if you actually break down what you'd need an AI to actually do. You're reading the literature and talking to colleagues and gathering your context. Then you're coming up with high-level ideas and agendas. Then maybe you're designing experiments, and then you're writing code to run the experiments, and then you're analyzing the experiments, which is just writing more code, and then you're looking at the results and deciding what to do next. All these things feel like things a sufficiently good foundation model could at least make a lot of progress on. And eventually, I don't see how you need to change the shape of it that much. And then if you get that dynamic really kicking in, I think it'll be a really big effect.
Dan Shipper (00:51:55)
Yeah. The thing that's interesting about that, going back to the OpenAI definition of AGI, and I'm curious for your thoughts: the thing that strikes me about that idea, that AGI is achieved when AI can do all the economically valuable tasks, is that AI changes what the economically valuable tasks are. And so you're creating a moving target, in the same way that I think the Turing Test turned out to be a moving target. In a world where humans are leveraging AI to do more economically valuable tasks than we could have done before, it's a self-fulfilling AGI that never quite gets achieved, because we keep changing the tasks.
Simon Last (00:52:24)
We've seen that over the past few years: people are always raising the bar. There are memes about it, right? It's like, oh, but it can't do this one particular thing that I really care about. It's not a static world. I saw a tweet the other day talking about how just 68 years ago in South Korea, most people were farmers. And now, obviously, hardly anyone is a farmer. It's been fully automated and the economy is 50x bigger or something. That's pretty crazy. And I feel like we're headed into a similar dynamic where the jobs people do are going to completely change. There's going to be a bunch of new stuff that we do, and it's going to create 100x more economic value than all the economic value that we have now. AGI is happening somewhere on that curve, which is going to be pretty weird, I think. And a lot of that's going to come from the fact that when you discover an economically useful task with these models, if you buy more computers, you can just scale it up more, in a much more elastic way than with humans, where you have to train people up and hire and so on. So I expect it to look pretty weird: a bunch of classes of tasks are going to get scaled up, it's gonna produce a lot of value, and then it's gonna make the shape of the economy look different, in a confusing way.
Dan Shipper (00:53:34)
So in preparation for this interview, I asked Notion AI what I should ask you, and one of the questions it asked that I thought was really interesting is: How do you see the role of human creativity and intuition change in a world with AI?
Simon Last (00:54:45)
I don't think I assign a special value to humans being creative or intuitive. We've been talking around this multiple times, but we were all surprised, I think, when DALL-E came out, and it was like, oh, I thought that was going to be the boring work, but it's actually making beautiful images. It's art, or at least the technical part of art, of implementing the idea. I feel like it's a category error to assign special value to any particular thing that a human can do and that AI might not be good at right now. What really matters is more the will of what you want to be done, and your ability to communicate that, and to observe it and keep it on the road. That's what really matters. Because it's gonna be better than us. It's gonna be better than us at being creative and being intuitive and these things.
Dan Shipper (00:54:30)
How have you developed that will in yourself?
Simon Last (00:54:39)
I think openness is important. Just openness to experience, especially in a world that's changing. And I think being ambitious is important. Especially with these crazy AIs, we should really ratchet up our ambition of what we want humanity to do, and not be so encumbered by previous failures when this new tech unlocks a lot of stuff. That's pretty key. Yeah. And I think being optimistic.
Dan Shipper (00:55:00)
I love it. And I think that's probably as good a place as any to end. I really appreciate you taking the time to do this. I learned a lot. This is great.
Thanks to Scott Nover for editorial support.
Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn, and Every on X at @every and on LinkedIn.
We also build AI tools for readers like you. Automate repeat writing with Spiral. Organize files automatically with Sparkle. Write something great with Lex.