Sponsored By: Notion
This essay is brought to you by Notion, your all-in-one workspace solution. Startups can now enjoy 6 months of Notion Plus with unlimited AI, a value of up to $6,000—for free! Thousands of startups around the world trust Notion for everything from engineering specs to new hire onboarding. Access the limitless power of AI right inside Notion to build and scale your company.
When applying, select Every in the drop-down partner list and enter code EveryXNotion at the end of the form.
I’m working on a series of pieces trying to create a map of the psychology and behavior of language models: compression, expansion, translation, and remixing.
I realized that in order to write it, I needed a clear, lower-level explanation about how language models work. In other words, I needed to write a prequel to my piece about language models as text compressors, a Phantom Menace of this series, if you will (except, hopefully, you know, good). This is that article.—Dan Shipper
If we want to wield language models in our work and still call the results creative, we’ll have to understand how they work—at least at a high level.
There are plenty of excellent guides about the internal mechanisms of language models, but they’re all quite technical. (One notable exception is Nir Zicherman’s piece in Every about LLMs as food.) That’s a shame because there are only a handful of simple ideas you need to understand in order to get a basic understanding of what’s going on under the hood.
I decided to write those ideas out for you—and for me—in as jargon-free a way as possible. The explanation below is deliberately simplified, but it should give you a good intuition for how things work. (If you want to go beyond the simplifications, I suggest putting this article into ChatGPT or Claude.)
Ready? Let’s go.
Let’s pretend you’re a language model
Imagine you’re a very simple language model. We’re going to give you a single word and make you good at predicting the next word.
I’m your trainer. My job is to put you through your paces. If you get the problems right, I’ll stick my hands into your brain and futz around with your neural wiring to make it more likely that you do that again in the future. If you get it wrong, I’ll futz again, but this time I’ll try to make it more likely you don’t do that again.
Here are a few examples of how I want you to work:
If I say “Donald,” you say: “Trump.”
If I say “Kamala,” you say: “Harris.”
Now it’s your turn. If I say “Joe,” what do you say?
Seriously, try to guess before going on to the next paragraph.
If you guessed “Biden,” congrats—you’re right! Here’s a little treat. (If you guessed wrong, I would’ve slapped you on the wrist.)
This is actually how we train language models. There’s a model (you) and there’s a training program (me). The training program tests the model and adjusts it based on how well it does.
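If you want to see the shape of that loop in code, here is a deliberately tiny Python sketch. The score table, the +1/-1 nudges, and the function names are all stand-ins I made up for illustration; a real model adjusts billions of numerical weights with calculus rather than bumping entries in a table, but the rhythm is the same: predict, check, nudge, repeat.

```python
import random
from collections import defaultdict

# The "model": a table of scores, one row per prompt word, one score per
# candidate next word. (A real model stores billions of numerical weights
# instead of a lookup table, but the training loop has the same shape.)
scores = defaultdict(lambda: defaultdict(float))

training_pairs = [("Donald", "Trump"), ("Kamala", "Harris"), ("Joe", "Biden")]
vocabulary = [last_name for _, last_name in training_pairs]

def predict(prompt_word):
    """The model's guess: the highest-scoring candidate, or a random one if untrained."""
    candidates = scores[prompt_word]
    if not candidates:
        return random.choice(vocabulary)
    return max(candidates, key=candidates.get)

# The trainer: test the model, then futz with its wiring (the scores).
for _ in range(20):  # a few passes over the training data
    for prompt_word, right_answer in training_pairs:
        guess = predict(prompt_word)
        if guess == right_answer:
            scores[prompt_word][guess] += 1.0         # treat: do this more
        else:
            scores[prompt_word][guess] -= 1.0         # wrist slap: do this less
            scores[prompt_word][right_answer] += 1.0  # and steer toward the right answer

print(predict("Joe"))  # after training, this reliably prints "Biden"
```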
We’ve tested you on simple problems, so let’s move on to something harder.
Predicting the next word isn’t always so simple
If I say “Jack,” what do you say?
Try to guess again before going on to the next paragraph.
Obviously, you say: “of” as in, “Jack of all trades, master of none! That’s what my mom was afraid would happen to me if I didn’t focus on my school work.”
What? That’s not what immediately flashed through your mind? Oh, you were thinking “Nicholson”? Or maybe you thought “Black.” Or maybe “Harlow.”
That’s understandable. Context can change which word we think will come next. The earlier examples were the first names of celebrity politicians followed by their last names. So you, the language model, predicted that the next word in the sequence would be another celebrity name. (If you thought of “rabbit,” “in the box,” or “beanstalk,” we might need to futz around in your brain!)
If you’d had more context before the word “jack”—maybe a story about who I am, my upbringing, my relationship with my parents, and my insecurities about being a generalist—you might have been more likely to predict “jack of all trades, master of none.”
So how do we get you to the right answer? If we just beefed up your smarts—let’s say, by throwing all of the computing power in the world into your brain—you still couldn’t reliably predict “of” from “jack” alone. You’d need more context to know which “jack” we’re talking about.
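To make that concrete, here is a toy illustration in Python. The probabilities and labels are invented purely for the sake of the example, but they show the key point: the same last word, “Jack,” points to very different next words depending on what came before it.

```python
# Invented, illustrative numbers: how a model's next-word odds for "Jack"
# might shift once it knows more about the surrounding context.
next_word_odds = {
    # "Jack" arriving after a string of celebrity first names
    "celebrity first names": {"Nicholson": 0.40, "Black": 0.30, "Harlow": 0.20, "of": 0.10},
    # "Jack" arriving after a story about insecurity over being a generalist
    "generalist anxiety story": {"of": 0.85, "Nicholson": 0.05, "Black": 0.05, "Harlow": 0.05},
}

def most_likely_next_word(context):
    """Pick the highest-probability candidate for the given context."""
    odds = next_word_odds[context]
    return max(odds, key=odds.get)

print(most_likely_next_word("celebrity first names"))     # Nicholson
print(most_likely_next_word("generalist anxiety story"))  # of
```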
This is how language models work. Before predicting the word that comes after “jack,” they spend a lot of time interrogating it, asking, “What kind of ‘jack’ are we talking about?” They do this until they’ve narrowed down “jack” enough to make a good guess.
A mechanism called “attention” is responsible for this. Language models pay attention to any word in the prompt that might be relevant to the last word, and use those relevant words to update their understanding of what that last word means. Then they predict what comes next.
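Here is a stripped-down sketch of that attention step in Python, using made-up four-number vectors for each word. A real model learns vectors with thousands of dimensions, passes them through extra learned transformations, and repeats this calculation across many layers, but the core arithmetic is just this: score relevance, turn the scores into weights, and blend.

```python
import numpy as np

words = ["Donald", "Trump", "Kamala", "Harris", "Jack"]

# One made-up vector per word in the prompt. In a real model these are
# learned during training and have thousands of dimensions, not four.
vectors = np.array([
    [0.9, 0.1, 0.0, 0.2],   # Donald
    [0.8, 0.2, 0.1, 0.1],   # Trump
    [0.7, 0.1, 0.1, 0.3],   # Kamala
    [0.8, 0.3, 0.0, 0.2],   # Harris
    [0.2, 0.9, 0.1, 0.4],   # Jack  <- the word being interrogated
])

last_word = vectors[-1]

# 1. Score how relevant each word in the prompt is to "Jack".
relevance = vectors @ last_word            # dot product = similarity

# 2. Turn the scores into weights that add up to 1 (a "softmax").
weights = np.exp(relevance) / np.exp(relevance).sum()

# 3. Blend the other words into "Jack" according to those weights,
#    producing an updated understanding of what "Jack" means here.
updated_jack = weights @ vectors

for word, weight in zip(words, weights):
    print(f"{word:>7}: attention weight {weight:.2f}")
# The model uses the updated "Jack" vector, now tinged with the
# celebrity-name context, to predict the next word.
```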
This is the fundamental insight of language models:
Comments
This was an amazingly clear description! Thank you! I do wonder if this is really all there is to this process—or if there are emergent properties coming up, as you can see with complexity in nature. Take the "apple test" (which I've modified as a two-word combo such as this: Write ten sentences in which the fifth word is "apple" and the final word is "rosemary."); right now, only Claude 3 Opus passes my apple-rosemary test; why would other frontier models not do it too if they are just predicting the next word using huge dictionaries? Is there something else happening?
Plus, there is coherence to the outputs in a way that I find hard to see as just next-word prediction. On the Leaderboard, I tested two unknown models by presenting a hypothetical (at the time of the Louisiana Purchase, it turns out that Colombia bought the United States, so that the US is now a department of Colombia) and asked for a campaign poster for the 2024 election in the Department of the United States—what came up was brilliant; it's so hard to think it was just pieced word-by-word as opposed to chunked together with a clear sense of where the argument was headed down the line and not just after the space bar.
I'm often so amazed by these things that I find it hard to think of them as mere word predictors. Still, your explanation was really good! Thank you for that!
I commend you for attempting to explain such a complex topic in layman's terms. As an ML practitioner, a few things made me cringe, but I'm also well aware I likely wouldn't be able to explain these concepts with zero technical terms.
I do want to touch on one part, the "biggest, baddest dictionary you’ve ever seen". Perhaps I misunderstood, but you seem to be implying the model's vocabulary is created (or at least modified) during pre-training. That isn't the case. I'm actually not sure what you are referring to as "vocabulary" here, as some of the explanations seem to describe the behavior of model weights (vs. vocabulary words). I wonder if readers who haven't read any other "LLM 101s" might come out of this with erroneous assumptions.
Again, this is a solid attempt at simplifying the maths behind deep learning. I'd just recommend adding a disclaimer that this is quite simplified and the inner workings of LLMs are much more involved.