OpenAI launched a new model, o1 (previously code-named Strawberry), yesterday. It’s significantly better at reasoning tasks, scoring in the 89th percentile in competitive programming, and exceeding Ph.D.-level smarts on physics, biology, and chemistry questions.
It’s been taught to use chain of thought reasoning to answer each question it’s given rather than just blurting out a response.
Chain of thought, of course, has been around for a long time. It’s the practice of asking a language model to solve problems by thinking out loud. You’re probably better at doing long division if you write out the steps one by one than you are at doing it in your head. Language models are the same way: Chain of thought creates a tunnel of reason that keeps the AI on track.
Chain of thought used to be just a prompting technique that would improve outputs in the original GPT models.
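As a rough illustration of what that prompting technique looked like, here is a minimal sketch that builds two versions of the same question: one that asks only for the answer and one that asks the model to reason out loud first. The function names and exact wording are assumptions made for this example, not anything taken from OpenAI; the "Let's think step by step" phrasing is a commonly used chain-of-thought cue.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer only, leaving no room for intermediate reasoning."""
    return f"{question}\nAnswer with just the final result."


def chain_of_thought_prompt(question: str) -> str:
    """Ask the model to write out its reasoning before committing to an answer."""
    return (
        f"{question}\n"
        "Let's think step by step. Write out each step, "
        "then give the final result on its own line starting with 'Answer:'."
    )


# A long-division question, echoing the analogy above.
question = "What is 7,392 divided by 24?"
print(direct_prompt(question))
print()
print(chain_of_thought_prompt(question))
```

Either string would be sent to a model as the user message. The only difference is that the second one invites the step-by-step working that o1 now produces on its own.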
o1 is different because it’s been trained via reinforcement learning to always use chain of thought in its responses without any extra prompting required. Now, when you ask a question in ChatGPT with o1 enabled, an expandable thinking indicator pops up that lets you see its thought process.
It also gets the classic strawberry problem correct. Hooray! I’ve been playing around with o1 a lot for the last day and will have much more to say over the next few weeks, but I wanted to give you a quick reaction today.
A new paradigm in AI: Test-time compute
Well, I’m glad I named this column Chain of Thought because it turns out Chain of Thought is probably the next big paradigm in AI progress. (Better to be lucky and partial to polysemy than good, as the saying goes.)
As I mentioned in my article on Strawberry, the key ingredients for AI progress so far have been more data and more compute during training.
The interesting update from Strawberry is that OpenAI has found a way to add a new dimension on which to improve performance: compute during inference. The company has found that when Strawberry takes longer to respond to a prompt—in other words, when it’s given more time to think—it generally responds more accurately.
This wasn’t necessarily the case with previous models. The longer GPT-4 was left to run in an autonomous loop, the more likely it was to go off the rails or get stuck in a meaningless rabbit hole. Because o1 has been trained to perform better on chain of thought reasoning, it seems to be able to better stay on track.
The success of o1 gives OpenAI a new way to approach performance improvements. Instead of doing a training run for GPT-7 that requires the entire energy output of the sun, it can do something with a shorter feedback loop: giving o1 more time to think before it responds to a prompt.
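To make the idea of trading inference compute for accuracy concrete, here is a toy simulation. It is not how o1 works (OpenAI has not published that level of detail). It swaps in a much simpler technique, majority voting over repeated attempts, and the solver, its 60 percent success rate, and the answer values are all assumptions invented for the sketch.

```python
import random
from collections import Counter

# Hypothetical task: the right answer and a few plausible wrong ones.
CORRECT = "42"
WRONG = ["41", "43", "44"]


def noisy_solver(rng: random.Random, p_correct: float = 0.6) -> str:
    """Stand-in for one model attempt: right with probability p_correct."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice(WRONG)


def answer_with_budget(rng: random.Random, attempts: int) -> str:
    """Spend more inference compute: sample several attempts, keep the most common answer."""
    votes = Counter(noisy_solver(rng) for _ in range(attempts))
    return votes.most_common(1)[0][0]


def accuracy(attempts: int, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of simulated trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(answer_with_budget(rng, attempts) == CORRECT for _ in range(trials))
    return hits / trials


for budget in (1, 5, 15):
    print(f"attempts={budget:2d}  accuracy={accuracy(budget):.2f}")
```

In this toy setup, accuracy climbs as the attempt budget grows, which is the general shape of the claim: the same underlying model can do better when it is allowed to spend more compute at answer time.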
o1 and the allocation economy
Imagine a future when you ask ChatGPT to do something, and you don’t expect it to respond immediately to every task. Instead, for a really important or difficult task, you might say, “Go spend a few hours on this,” and come back to it later.
This could turn into another skill necessary for model managers in the allocation economy: knowing when to turn to an expensive, long-running model like o1 and how to get the most out of it. Running o1 on a big query—one that will require it to think for a long time—is the equivalent of a bet. You won’t know if it worked for minutes, hours, or days. So you need to get good at knowing which bets to take and how to formulate a prompt that will be most likely to succeed.
Today, if you’re using o1 yourself, I think you won’t notice a huge difference for most use cases. For me, probably 10 to 20 percent of prompts require the reasoning that o1 can provide, and from my early testing, it does seem better.
But the real winners are businesses that are building with this stuff. We have several internal product incubations at Every that I think will get significantly better just from dropping in o1 as a replacement for Claude or GPT-4o. The proof will be in the pudding—and I’ll report back—but I’m really excited about it.
A few quick hits
Riemann Hypothesis remains unsolved by o1
As mentioned in my Strawberry piece, The Information reported that o1 is able to solve math problems it hadn’t been able to before. Andrej Karpathy joked on X that o1 refuses to solve the Riemann Hypothesis, one of the most famous unsolved problems in mathematics and a conjecture closely tied to the distribution of prime numbers. I’ll be curious to see how far it can get on new problems.
Can o1 create new knowledge?
If you want to understand how far o1 will get in creating completely new knowledge about the world, here’s a thought experiment:
How would a version of o1 trained only on writing from 1500 and earlier perform? Or 1800 and earlier? Or 1900 and earlier? Would it discover heliocentricity? Calculus? Would it discover the steam engine? The assembly line?
My guess is that o1 would seem like it was stuck in the past no matter how long you left it to run. If you asked the 1500 version to predict the motion of the stars, it would be able to use the Ptolemaic geocentric system common at the time, but it probably wouldn’t posit heliocentricity (which only gained traction in the 1700s). To us, it would seem parochial, like talking to a smart, but dead, ancestor.
I’ll have more to say on this at a later date, but I think that o1 is more like an extension of the LLMs-as-Ph.D.s thesis I laid out a few weeks ago than something that can create entirely new knowledge.
One model to rule them all—or a pantheon of gods?
One of the big questions in AI is whether there will be one model to rule them all, or a pantheon of gods. In other words, will everyone be using GPT-7 in the future? Or will there be room for many different models suited for different tasks?
My take is that we’re headed toward a world with one or two big winners, and a pantheon too. ChatGPT is good enough for most tasks, and OpenAI continues to increase its lead as a general-purpose chatbot.
But there is a long tail of tasks for which other, more specialized models are going to be valuable. o1 underscores this: it’s significantly better at math, for example, but Claude is still a much better writer in my early testing.
Links
- The team at Devin, the programming agent, reviewed o1 and found it helped Devin perform significantly better.
- Watch Ammar Reshi, the head of design at ElevenLabs, use o1 and Cursor to create an iOS weather app in 10 minutes.
- Here are a few interesting demos of o1 completing tasks in physics, genetics, and economics.
Dan Shipper is the cofounder and CEO of Every, where he writes the Chain of Thought column and hosts the podcast AI & I. You can follow him on X at @danshipper and on LinkedIn, and Every on X at @every and on LinkedIn.