**How Large Language Models Predict the Next Word**

*By Tert Slamy*

Picture this: you stumble upon a short movie script that describes a scene between a person and their AI assistant. The script includes what the person asks the AI, but the AI's response has been torn off. Suppose you also have a powerful magical machine that can take any text and predict what word comes next. You could finish the script by feeding the existing text into the machine, seeing what it predicts for the AI's answer, and repeating this process to gradually build up the dialogue.

When you interact with a chatbot, this is exactly what's happening. A large language model is a sophisticated mathematical function that predicts what word comes next for any piece of text. Instead of predicting one word with certainty, it assigns a probability to all possible next words.

**Scene 1 from Large Language Models explained briefly**

To build a chatbot, you start by laying out some text that describes an interaction between a user and a hypothetical AI assistant. You add the user's input as the first part of that interaction, then have the model repeatedly predict what the hypothetical AI assistant would say in response. This is what gets presented to the user. The output tends to look more natural if you allow the model to select less likely words at random. So, even though the model itself is deterministic, a given prompt typically produces a different answer each time it's run.

Models learn how to make these predictions by processing an enormous amount of text, typically pulled from the internet. For a standard human to read the amount of text used to train GPT-3, for example, it would take over 2,600 years of non-stop reading, though larger models have since been trained on much more.

You can think of training a little bit like tuning the dials on a big machine. The way a language model behaves is entirely determined by these many different continuous values, usually called parameters or weights. Changing those parameters alters the probabilities the model gives for the next word on a given input. What puts the "large" in large language model is that they can have hundreds of billions of these parameters.

No human ever deliberately sets those parameters. Instead, they begin at random, meaning the model initially outputs gibberish, but they're repeatedly refined based on many example pieces of text. One of these training examples could be just a handful of words, or it could be thousands. In either case, the process involves inputting all but the last word from that example
into the model and comparing its prediction with the true last word from the example. An algorithm called backpropagation then tweaks all the parameters in such a way that the model becomes slightly more likely to choose the true last word and slightly less likely to choose all the others. When this is done for many, many trillions of examples, the model not only starts to give more accurate predictions on the training data, but also begins to make more reasonable predictions on text it's never seen before.

Given the huge number of parameters and the enormous amount of training data, the scale of computation involved in training a large language model is mind-boggling. To illustrate, imagine you could perform one billion additions and multiplications every second. How long would it take to do all the operations involved in training the largest language models? A year? Maybe something like 10,000 years? The answer is actually far more than that; it's well over 100 million years. This, however, is only part of the story. The whole process so far is called pre-training.
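The parameter nudge described above can be sketched in miniature. This is a toy illustration, not a real language model: the "parameters" here are just one logit per word in a four-word vocabulary, and a single step of gradient descent on the cross-entropy loss makes the true last word slightly more likely and every other word slightly less likely, exactly the behavior backpropagation produces at scale.

```python
import math

# Toy "model": its only parameters are one logit per vocabulary word.
vocab = ["cat", "sat", "mat", "hat"]
logits = [0.0, 0.0, 0.0, 0.0]      # parameters start flat (think: random)
true_next = vocab.index("sat")     # the true last word of a training example
lr = 0.5                           # learning rate

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

before = softmax(logits)

# One gradient-descent step on cross-entropy loss. The gradient is
# (p - 1) for the true word and p for every other word, so the update
# raises the true word's probability and lowers all the others.
probs = softmax(logits)
for i in range(len(logits)):
    grad = probs[i] - (1.0 if i == true_next else 0.0)
    logits[i] -= lr * grad

after = softmax(logits)
print(before[true_next], "->", after[true_next])  # probability of "sat" rises
```

Repeating this update over trillions of examples, with billions of parameters instead of four, is what the 100-million-year figure above is counting.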
The goal of auto-completing a random passage of text from the internet is vastly different from the goal of being a good AI assistant. To address this, chatbots undergo another type of training, just as important, called reinforcement learning from human feedback. Researchers flag unhelpful or problematic predictions, and their corrections further refine the model's parameters, making the model more likely to produce predictions that users prefer.

Looking back at pre-training, that staggering amount of computation is only made possible by special computer chips optimized for running many operations in parallel, known as GPUs. However, not all language models can be easily parallelized. Prior to 2017, most language models processed text one word at a time. Then a team of researchers at Google introduced a new model known as the transformer. Transformers don't read text from start to finish; they soak it all in at once, in parallel.
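The "all at once" point can be made concrete with a small numerical sketch. Everything below is a random stand-in, not anything a trained model would use: the matrices `X`, `Wq`, `Wk`, and `Wv` are hypothetical placeholders. The point is simply that a few matrix products score and mix every position in the text simultaneously, with no word-by-word loop.

```python
import numpy as np

np.random.seed(0)
seq_len, dim = 4, 8                     # 4 token vectors, 8 numbers each
X = np.random.randn(seq_len, dim)       # stand-in for the word vectors

# Hypothetical learned projections (random here, shown only for the dataflow):
Wq, Wk, Wv = (np.random.randn(dim, dim) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# One matrix product compares every position with every other position
# at once; there is no sequential walk through the text.
scores = Q @ K.T / np.sqrt(dim)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row
refined = weights @ V                           # context-mixed vectors

print(refined.shape)                            # one updated vector per word
```

GPUs are built for exactly this kind of dense matrix arithmetic, which is why this formulation unlocked the scale of training described earlier.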
The very first step inside a transformer, and most other language models for that matter, is to associate each word with a long list of numbers. The reason for this is that the training process only works with continuous values, so you have to somehow encode language using numbers. Each of these lists of numbers may somehow encode the meaning of the corresponding word. What makes transformers unique is their reliance on a special operation known as attention. This operation gives all these lists of numbers a chance to talk to one another, refining the meanings they encode based on the context around them, all done in parallel. For example, the numbers encoding the word *bank* might be adjusted based on the context surrounding it to somehow encode the more specific notion of a *riverbank*.

Transformers typically also include a second type of operation known as a feed-forward neural network, which gives the model extra capacity to store more patterns about language learned during training. All of this data repeatedly flows through many iterations of these two fundamental operations, and as it does, the hope is that each list of numbers is enriched to encode whatever information might be needed to make an accurate prediction of what word follows in the passage.

At the end, one final function is performed on the last vector in this sequence, which by now has had a chance to be influenced by all the other context from the input text, as well as everything the model learned during training, to produce a prediction of the next
word. The model's prediction takes the form of a probability distribution over every possible next word. Although researchers design the framework for how each of these steps works, it's important to understand that the specific behavior is an emergent phenomenon based on how those hundreds of billions of parameters are tuned during training. This makes it incredibly challenging to determine why the model makes the exact predictions it does. What you can see is that when you use large language model predictions to autocomplete a prompt, the words it generates are uncannily fluent, fascinating, and even useful.

If you're a new viewer and curious about more details on how transformers and attention work, boy do I have some material for you. One option is to jump into a series I made about deep learning, where we visualize and motivate the details of attention and all the other steps in a transformer. Also, on my second channel, I just posted a talk I gave a couple of months ago about this topic for the company TNG in Munich. Sometimes I actually prefer the content I make as a casual talk rather than a produced video, but I leave it up to you which of these feels like the better follow-on.
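To tie the pieces together, here is a minimal sketch of the predict-and-sample loop that runs through this whole explanation: the model outputs a probability for every possible next word, and a chatbot repeatedly samples from that distribution, sometimes picking less likely words. The `predict_next` function below is a hypothetical stub returning a fixed toy distribution; a real LLM would compute these probabilities from its billions of tuned parameters.

```python
import random

def predict_next(text):
    # Stand-in for the model: a probability for every possible next word
    # over a toy four-word vocabulary (a real model conditions on `text`).
    return {"the": 0.5, "a": 0.3, "cat": 0.15, "sat": 0.05}

def generate(prompt, n_words, temperature=1.0, seed=0):
    rng = random.Random(seed)
    text = prompt
    for _ in range(n_words):
        dist = predict_next(text)
        words = list(dist)
        # Temperature re-weights the distribution: values above 1 flatten
        # it, giving less likely words a better chance and making repeated
        # runs of the same prompt come out differently.
        weights = [p ** (1.0 / temperature) for p in dist.values()]
        text += " " + rng.choices(words, weights=weights)[0]
    return text

print(generate("User: hello. AI:", 5, temperature=1.2))
```

With a fixed seed this loop is reproducible, but with a fresh seed per run it mirrors the behavior described at the top: a deterministic model that still gives a different answer each time.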