I’m fed up of hearing about how LLMs are next token predictors, and therefore they <cannot do some task> <aren’t really doing cognition> <are just guessing>.
There are lots of philosophical objections, but fundamentally, framing AI as next token predictors in the first place is misleading and inaccurate. Here’s why LLMs aren’t naive next token predictors.
What is Next Token Prediction?
Let’s first briefly cover what “Next Token Prediction” even means. It refers to base training (also called pre-training), the first step in training an LLM. We’ll talk about the other steps later.
Before a text is used for training a model, it’s split up into short chunks called tokens. Each token is about the length of a word; we can think of them as words for our purposes. LLMs are complicated functions that take an input sequence of tokens and output a probability distribution for a single token. To train the model, we take the training text and cut it up into pairs, each consisting of a long input sequence[1] and the token that follows it[2].
E.g. the text “The cat sat on the mat” would get split into several such pairs: (“The”, “cat”), (“The cat”, “sat”), (“The cat sat”, “on”), etc. This is the training data. The model is fed the input sequence from each pair, and it outputs a probability distribution for the next token. Then the model is carefully tweaked so that its probability of giving the correct answer increases, and its probability of giving wrong answers decreases. This is repeated tens of trillions of times.
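To make the idea concrete, here’s a minimal sketch in Python of how a text becomes (context, next token) pairs. It treats whole words as tokens purely for readability; real tokenisers split text into sub-word pieces.

```python
# Minimal sketch: build (context, next token) training pairs from a text.
# Whole words stand in for tokens here purely for readability.
text = "The cat sat on the mat"
tokens = text.split()

pairs = []
for i in range(1, len(tokens)):
    context = tokens[:i]   # everything seen so far
    label = tokens[i]      # the token the model must learn to predict
    pairs.append((context, label))

for context, label in pairs:
    print(context, "->", label)
# ['The'] -> cat
# ['The', 'cat'] -> sat
# ['The', 'cat', 'sat'] -> on
# ...
```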
When training is done, and it comes to using the model to generate text[3], the process is different. You give the model the text generated so far, and it gives you a probability distribution. A token is randomly picked from that distribution and appended to the text. By repeating this process, a full sentence is generated, where the LLM is invoked once at a time to get the next token, and it gives its best guess as to what happens next.
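A rough sketch of that generation loop, with `model` and `sample` as illustrative stand-ins rather than any real library’s API:

```python
import random

def sample(distribution):
    # distribution: a dict mapping candidate tokens to probabilities
    tokens = list(distribution)
    weights = [distribution[t] for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

def generate(model, prompt_tokens, max_new_tokens=50):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        distribution = model(tokens)        # probability distribution over the next token
        next_token = sample(distribution)   # randomly pick one token from it
        tokens.append(next_token)           # append it and feed everything back in
    return tokens
```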
Hopefully, it’s pretty clear why that’s called Next Token Prediction. The model literally outputs a guess for the next token, over and over again.
It turns out that to do a good job of such a task, models are forced to learn a great deal from the text. They need to consider language and grammar, obviously. But they also need to consider the content of the text. Imagine processing a maths textbook. If you saw the sentence “Nine plus nine makes”, a good prediction for what follows is “eighteen”, which requires knowing, or at least having memorised, the maths. In extremis, you could imagine processing an entire book which ends with “And, Watson, the name of the murderer is”, where a good guess would require an understanding of all the preceding clues and inferences. LLMs aren’t capable of that, but it illustrates that next token prediction is quite challenging, and applies the right sort of pressures during the training process.
Anyway, if you follow this training regime for enough tokens, and your LLM’s architecture and your tweaking are set up correctly, you end up with a model that is able to continue sentences convincingly, if not necessarily that accurately. In its internal structure, it will have set up patterns for dealing with factual recall, grammar, logic etc.
So yeah, the term Next Token Prediction is pretty accurate, mechanically. Where do the misconceptions come from, then?
The Learning Process Influences Pairs of Tokens with Long Separations
You’ll notice that during base training, an LLM is given a long sequence of input tokens, and asked to predict one more. These sequences are frequently tens of thousands of tokens long, sometimes going up to a million. All those tokens contribute to the answer, and during training, the computation on all those tokens is tweaked together.
You can even flip around your point of view. Normally, training is phrased as predicting a single token, using all preceding tokens. But it would also be valid to say that each token is used multiple times, during the prediction of every future token of the sentence, and contributes to all of them. The training process ensures that the calculations done on any given token are useful not just for the token that immediately follows, but for all following tokens.
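As a rough illustration (PyTorch-style, with shapes and names that are assumptions for this sketch rather than any particular codebase), a typical pre-training loss sums one prediction term per position, all optimised in the same update:

```python
import torch.nn.functional as F

def next_token_loss(logits, token_ids):
    # logits:    model outputs, one next-token distribution per position,
    #            shape (batch, seq_len, vocab_size)
    # token_ids: the training sequence itself, shape (batch, seq_len)
    predictions = logits[:, :-1, :]   # predictions made at positions 0 .. n-2
    targets = token_ids[:, 1:]        # the tokens that actually follow, 1 .. n-1
    # One loss term per position, all tweaked together: the computation done
    # on any given token feeds into the predictions at every later position.
    return F.cross_entropy(
        predictions.reshape(-1, predictions.size(-1)),
        targets.reshape(-1),
    )
```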
In other words, the phrase “next token prediction” gives a false sense that LLM computation is somehow last-minute; that models are forced to consider words in isolation, without regard for how they fit together. This is not accurate. If the training process finds it has the capacity to learn either a structure useful only for predicting the single next token, or a structure that is not immediately useful but will help with many future tokens, it’ll probably prefer the latter.
LLMs Can and Do Plan Ahead
Let’s do a thought experiment. You are a spy in a foreign country. It’s imperative you blend in to avoid counter-espionage. Right now, you are driving in a car and are concerned your erratic movements may result in you being caught. What is your strategy?
Well, the first thing you would do is observe what manoeuvres are common. U-turns are very rare, so you should also rarely do them. A bit more sophisticated: you’ll notice cars rarely take two left turns in a row. Then you spot that the street you are coming to is a cul-de-sac. You should turn off – cars only pull into dead ends at the start and end of their journeys, which are also relatively infrequent. In fact, now you think about it, cars usually drive around with a specific destination in mind. To be really convincing, you’d need to consult a map of the area and pick turns that are consistent with route planning.
In this analogy, each individual turn you make is the equivalent of a token. And your ongoing route is the sentence you are generating. It’s not a perfect analogy – a good strategy for you would be to randomly pick a distant city, consult your map and drive there as best you can, while LLMs are structurally prevented from making that sort of random choice[4].
But still, the analogy conveys something useful. To drive any journey, you must do it as a sequential series of single turns, chosen one at a time. Nonetheless, considering a journey in terms of individual turns is a poor way to navigate. Any driver first forms some idea of where they want to go, then a high-level plan of how to achieve it, then carefully consults the available map information, and only then do they actually start the journey and come up with turn-by-turn directions. The turns are the least important part of the cognition involved; they’re just the easiest part for an external observer to see.
Returning to LLM-land, not all sentences go somewhere, but plenty do. Is there evidence that LLMs do similar sorts of planning? Yes, plenty. The field of AI interpretability is all about inspecting the internals of models to get some approximate idea of what an LLM is considering in its internal processes. We frequently see that concepts are raised as they become relevant, long before any tokens relating to that concept come out.
One of my favourite examples of this is Anthropic’s work on planning in poems. When writing a rhyming poem, you are quite constrained in the choice of the final word of a line. If you don’t take account of this while composing, you’ll quickly find you’ve written yourself into a corner. Non-reasoning LLMs cannot backtrack, so writing even half-decent rhyming structure requires them to plan ahead at least one line at a time. Anthropic’s work finds some of the structure responsible for this planning, such as features that detect the concept “rhymes with rabbit/habit”, and even performs a kind of brain surgery, changing the model’s behaviour by tampering with these structures.
LLMs are Trained by Reinforcement Learning, not just Next Token Prediction
This is the biggest point of all. Next token prediction, i.e. pre-training, is where model training begins. But all it gives you are relatively weak models, which convincingly mimic the input text. To turn them into instruction-following, tool-using, reasoning chatbots, further stages are needed. And the majority of those are Reinforcement Learning (RL). RL is fundamentally not Next Token Prediction. The idea is that you run the LLM to generate entire sentences or complete responses, then you score the response as a whole. That score is then used to tweak the model to be more or less likely to generate something similar in the future.
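A very rough sketch of that loop, in the style of a vanilla REINFORCE update – real pipelines use more sophisticated algorithms such as PPO or GRPO, and `generate_response`, `score_response` and `log_prob_of` are hypothetical stand-ins:

```python
def reinforce_step(model, optimizer, prompt, generate_response, score_response):
    # 1. Sample a complete response from the model, one token at a time.
    response = generate_response(model, prompt)

    # 2. Score the response as a whole, e.g. with a human-preference model
    #    (RLHF) or by checking whether the final answer is correct (RLVR).
    reward = score_response(prompt, response)

    # 3. Nudge the model so the whole response becomes more (or less) likely,
    #    in proportion to its reward, rather than rewarding any single token.
    log_prob = model.log_prob_of(prompt, response)  # hypothetical method
    loss = -reward * log_prob

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```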
By the time the model is done with RL, there’s no meaningful sense in which it’s trying to “predict” any specific token. It instead makes choices that lead to good outcomes overall. It’s still outputting one token at a time, but as with the car-driving analogy earlier, that is due to the sequential nature of writing, rather than telling you much about the generation process that performs well in an RL setting[5].
There are two types of RL that are particularly important to be aware of. Reinforcement Learning from Human Feedback (RLHF) is used to reward models for good behaviour. It makes models helpful assistants, teaches them when to refuse requests, and gives them specific personalities. Reinforcement Learning with Verifiable Rewards (RLVR) is used to reward models for getting the right answer to questions after spending a long time thinking about a problem. It makes models more capable of advanced multistep logic, better able to explore multiple lines of thought, and able to make a plan and then execute it.
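For RLVR, the “verifiable” part is often as simple as programmatically checking the final answer. A toy sketch, where the answer-extraction convention is purely an assumption for illustration:

```python
def verifiable_reward(response_text, correct_answer):
    # Look for a final answer of the form "Answer: 42" at the end of the
    # model's (possibly very long) chain of thought.
    for line in reversed(response_text.strip().splitlines()):
        if line.lower().startswith("answer:"):
            given = line.split(":", 1)[1].strip()
            return 1.0 if given == correct_answer else 0.0
    return 0.0  # no recognisable answer at all
```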
In other words, RL is responsible for much of the “intelligent” behaviour that AI discourse is centred around. If you don’t understand it, then you cannot conclude anything interesting about current AI. RL has very different properties to pre-training – notably, it’s not so dependent on the quality of the training data, and can train models to superhuman levels of ability.
So to conclude, I don’t have a problem with the term “Next Token Prediction”. What I object to is the suggestion that it gives us much understanding of how LLMs operate, or acts as a limit on the sorts of things they might one day be capable of. We are all creatures of time, forced to say the first part of a sentence first and the end of it last. You wouldn’t hold that against your fellow humans, would you?
[1] Called the context.
[2] Called the label.
[3] Called inference.
[4] There are some LLMs, diffusion models, that do not output tokens sequentially, though they are not as popular for chatbots.