Illustration by DALL-E/Every.

How Language Models Work

A 100-percent jargon-free guide


Comments

@deleted_138234 about 1 year ago

This was an amazingly clear description! Thank you! I do wonder if this is really all there is to the process, or whether emergent properties arise, as you see with complexity in nature. Take the "apple test" (which I've modified into a two-word combo: write ten sentences in which the fifth word is "apple" and the final word is "rosemary"). Right now, only Claude 3 Opus passes my apple-rosemary test. If these models are just predicting the next word using huge dictionaries, why wouldn't other frontier models pass it too? Is there something else happening?
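[Editor's note: the commenter's apple-rosemary test can be graded mechanically. A minimal Python sketch follows; the function name and the naive sentence-splitting heuristic are mine, not the commenter's.]

```python
import re

def passes_apple_test(text, fifth="apple", final="rosemary"):
    """Return True if every sentence in `text` has `fifth` as its
    fifth word and `final` as its last word (case-insensitive)."""
    # Naive sentence split on ., !, ? -- good enough for grading model output.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return False
    for s in sentences:
        words = re.findall(r"[A-Za-z']+", s)
        if len(words) < 5 or words[4].lower() != fifth or words[-1].lower() != final:
            return False
    return True
```

A checker like this is what makes the test interesting: the constraint is trivial to verify but requires the model to plan the whole sentence, not just the next word.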

Plus, there is a coherence to the outputs that I find hard to see as just next-word prediction. On the Leaderboard, I tested two unknown models by presenting a hypothetical (at the time of the Louisiana Purchase, it turns out that Colombia bought the United States, so the US is now a department of Colombia) and asked for a campaign poster for the 2024 election in the Department of the United States. What came back was brilliant. It's hard to believe it was pieced together word by word rather than composed with a clear sense of where the argument was headed, not just of what comes after the next space bar.

I'm often so amazed by these things that I find it hard to think of them as mere word predictors. Still, your explanation was really good! Thank you for that!

Leo Larrere about 1 year ago

I commend you for attempting to explain such a complex topic in layman's terms. As an ML practitioner, I cringed at a few things, but I'm also well aware I likely wouldn't be able to explain these concepts with zero technical terms myself.

I do want to touch on one part: the "biggest, baddest dictionary you've ever seen." Perhaps I misunderstood, but you seem to be implying that the model's vocabulary is created (or at least modified) during pre-training. That isn't the case; the vocabulary is fixed by the tokenizer before training begins. I'm actually not sure what you are referring to as "vocabulary" here, since some of the explanations seem to describe the behavior of model weights rather than vocabulary words. I worry that readers who haven't read any other "LLM 101s" might come away with erroneous assumptions.
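[Editor's note: the distinction the commenter is drawing can be made concrete with a toy sketch. The vocabulary is a frozen token-to-id lookup built before pre-training; training adjusts only the weights (here, toy embedding vectors). All names and sizes below are invented for illustration.]

```python
import random

# Frozen vocabulary: built by the tokenizer BEFORE pre-training.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}

# Trainable weights: one toy embedding vector per token id.
random.seed(0)
embeddings = {i: [random.random() for _ in range(4)] for i in vocab.values()}

def encode(text):
    """Map words to token ids; unknown words fall back to <unk>."""
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

# A "training step" nudges the weights...
for i in encode("the cat sat"):
    embeddings[i] = [w - 0.01 for w in embeddings[i]]

# ...but the vocabulary itself never changes.
assert encode("the cat sat") == [0, 1, 2]
assert encode("the dog sat") == [0, 3, 2]  # "dog" was never in the vocab
```

The point of the sketch: pre-training moves the numbers in `embeddings` (and the rest of the weights), while `vocab` stays exactly as the tokenizer left it.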

Again, this is a solid attempt at simplifying the maths behind deep learning. I'd just recommend adding a disclaimer that this is quite simplified and the inner workings of LLMs are much more involved.