One of my favorite AI papers is “Let’s Think Dot by Dot”, which finds that LLMs can use meaningless filler tokens (like “.”) to improve their performance. Until recently[1], I was overestimating the implications, and I think other people might be too.
The paper finds that LLMs can be trained to use filler tokens to get better at parallel reasoning tasks[2]. This has been compared to chain of thought, but CoT lets models do more sequential reasoning, which is more powerful[3]. I now think this paper should be taken as evidence against LLMs’ ability to perform long-term reasoning[4] in secret[5].
This means that if a problem can be broken down into sub-problems, but the model isn’t wide enough to handle them all at a single position, it can instead parallelize across multiple filler token positions and then combine the results. However, if the problem requires step-by-step thinking and the model isn’t deep enough, filler tokens don’t help. In comparison, chain of thought helps in both situations.
My metaphor for this is that filler tokens allow a model to dynamically increase the size of layers, but CoT allows the model to dynamically add layers.
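To make the metaphor concrete, here’s a toy sketch in plain Python (all names and numbers are mine, and this has nothing to do with real transformer internals): extra width means solving independent sub-problems side by side and combining them at the end, while extra depth means each step consumes the previous step’s output.

```python
# Toy version of "wider" vs "deeper" computation.

def solve_wide(subproblems, combine):
    """Extra width: the sub-problems are independent, so they could all be
    worked on at the same time, with one cheap combining step at the end."""
    partial_results = [sub() for sub in subproblems]
    return combine(partial_results)

def solve_deep(state, steps):
    """Extra depth: each step consumes the previous step's output, so the
    steps can't be reordered or run side by side."""
    for step in steps:
        state = step(state)
    return state

# Width helps when the work decomposes into independent pieces...
print(solve_wide([lambda: 2 * 3, lambda: 4 * 5], combine=sum))            # 26

# ...but a dependent chain like "square, then add one, then negate" needs depth.
print(solve_deep(3, [lambda x: x * x, lambda x: x + 1, lambda x: -x]))    # -10
```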
Every layer in an LLM operates on all positions in parallel, so all of its input data must come from the previous layer. Attention is what allows layer n to collect data from layer n-1 at other positions in the context.
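Here’s a minimal sketch of that data flow, assuming a toy single-head attention stack in numpy (weights shared across layers, no masking, MLPs, or residuals; every name here is mine, not the paper’s): each position’s new state is a mixture of previous-layer states, and nothing computed at a layer can feed back into that same layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq_len, n_layers = 8, 5, 3
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(h_prev):
    """Every position's new state is built only from the previous layer's states."""
    q, k, v = h_prev @ Wq, h_prev @ Wk, h_prev @ Wv
    weights = softmax(q @ k.T / np.sqrt(d))   # (seq_len, seq_len) mixing weights
    return weights @ v                        # reads h_prev, never the current layer

h = rng.normal(size=(seq_len, d))             # layer-0 states (the embeddings)
for _ in range(n_layers):                     # depth is fixed by the loop count,
    h = attention_layer(h)                    # not by how many positions exist
```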
With filler tokens, positions i, i+1, and i+2 all receive exactly the same inputs except for positional data and the filler tokens themselves. To continue sequential reasoning from a previous layer, the model either needs a deeper layer which can attend to the previous output directly, or the previous layer needs to output a meaningful token which can be fed in from the top.
This is a problem for filler tokens, since the network has the same depth at every input, and the only new information is the filler token and the positional information.
It’s possible[6] to exploit the positional information to process the same inputs differently at different positions (this is what the dot-by-dot authors do), but it’s not possible to use it to get additional sequential steps. An n-layer network only gets n layers of processing, no matter what path the data takes through it.
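A back-of-the-envelope way to put it (my own rough accounting, not anything from the paper): filler tokens add columns to the grid of hidden states, while chain of thought lets the depth compound, because each emitted token gets a fresh pass whose result feeds into the next one.

```python
# Rough accounting of available sequential steps, assuming (very roughly)
# one dependent reasoning step per layer.
def max_sequential_steps(n_layers: int, n_emitted_tokens: int, cot: bool) -> int:
    if cot:
        # Each emitted token gets another full pass, and its output is fed
        # back in from the top, so depth compounds across tokens.
        return n_layers * n_emitted_tokens
    # Filler tokens only add positions: every path through the network still
    # crosses at most n_layers layers.
    return n_layers

print(max_sequential_steps(n_layers=100, n_emitted_tokens=50, cot=False))  # 100
print(max_sequential_steps(n_layers=100, n_emitted_tokens=50, cot=True))   # 5000
```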
[1] To be fair to the authors, they say all of this in the paper. I just didn’t understand it.
[2] Specifically, they find that on a problem where the model needs to check the sums of up to 364 triplets with only 6 attention heads, it's able to spread the work across filler token positions and then select the triplet which sums to zero.
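For reference, here’s a brute-force version of that kind of task, written by me as a sketch of the problem rather than of the paper’s setup (364 is C(14, 3), the number of triplets you can draw from 14 inputs; the paper’s exact encoding differs):

```python
from itertools import combinations
from math import comb

def find_zero_triplet(xs):
    """Check every triplet and return one that sums to zero, if any."""
    for triplet in combinations(xs, 3):   # C(len(xs), 3) candidates in total
        if sum(triplet) == 0:
            return triplet
    return None

print(comb(14, 3))                            # 364
print(find_zero_triplet([4, -1, 7, -3, 2]))   # (4, -1, -3)
```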
[3] Any parallel algorithm can be executed sequentially, but not all sequential algorithms can be parallelized.
Intuitively, if your algorithm is "think about apples and oranges at the same time", you can turn it into "think about apples and then think about oranges"; but if your algorithm is "look at the last word and then think about it", there's no way to parallelize that, since the second step depends on the first.
Don't take "not all" too strongly though. Many sequential algorithms can be turned into parallel algorithms, especially if you're willing to take an efficiency hit. For example, "do x and then do [y1, y2, y3, ...] based on the results" can be turned into "do [(x, y), (x, y2), (x, y3), ...] in parallel and then discard most of the results".
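As code, that rewrite looks something like this (a toy sketch with made-up names): the sequential version has a step that waits on another, while the parallel version redundantly pairs x with every candidate follow-up, runs the pairs independently, and discards all but the pair x would have chosen.

```python
def run_sequential(pick_branch, followups):
    choice = pick_branch()           # step 1
    return followups[choice]()       # step 2 depends on step 1

def run_parallel(pick_branch, followups):
    # Each pair recomputes pick_branch and runs one follow-up. The pairs are
    # independent of each other, so they could all execute at the same time.
    results = [(pick_branch(), followup()) for followup in followups]
    # Keep the pair whose follow-up matches the branch that was picked.
    for i, (choice, value) in enumerate(results):
        if i == choice:
            return value

followups = [lambda: "small", lambda: "medium", lambda: "large"]
pick_one = lambda: 1                              # always picks branch 1
print(run_sequential(pick_one, followups))        # medium
print(run_parallel(pick_one, followups))          # medium (3x the work)
```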
[4] "Long-term" here means reasoning that carries forward between positions. Some frontier models have over 100 layers, so the amount of hidden sequential processing they can do is still non-trivial.
[5] Although steganography is still an option, at least in theory.
[6] This is separate from my main point, but it’s also really hard to train a model to parallelize like this. The authors of the paper had to use hand-tuned training examples and custom positional information to make it work. And even then, it only learned to do this for one problem.
It’s theoretically possible for an LLM to learn a general method of parallelizing computation across positions using filler tokens, but I would be surprised if they were able to learn something this complicated by accident through RL.