One of my favorite AI papers is “Let’s Think Dot by Dot”, which finds that LLMs can use meaningless filler tokens (like “.”) to improve their performance. Until recently[1], I was overestimating the implications, and I think other people might be too.
The paper finds that LLMs can be trained to use filler tokens to get better at parallel reasoning tasks[2]. This has been compared to chain of thought, but CoT lets models do more sequential reasoning, which is more powerful[3]. I now think this paper should be taken as evidence against LLMs’ ability to perform long-term reasoning[4] in secret[5].
This means that if a problem can be broken down into sub-problems, but the model isn’t wide enough to handle them all at a single position, it can instead parallelize across multiple filler token positions and then combine the results. However, if the problem requires step-by-step thinking and the model isn’t deep enough, filler tokens don’t help. In comparison, chain of thought helps in both situations.
My metaphor for this is that filler tokens allow a model to dynamically increase the size of layers, but CoT allows the model to dynamically add layers.
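To make the metaphor concrete, here’s a toy sketch in plain Python (all names and numbers are mine, and this has nothing to do with real transformer internals): extra width means solving independent sub-problems side by side and combining them at the end, while extra depth means each step consumes the previous step’s output.

```python
# Toy version of "wider" vs "deeper" computation.

def solve_wide(subproblems, combine):
    """Extra width: the sub-problems are independent, so they could all be
    worked on at the same time, with one cheap combining step at the end."""
    partial_results = [sub() for sub in subproblems]
    return combine(partial_results)

def solve_deep(state, steps):
    """Extra depth: each step consumes the previous step's output, so the
    steps can't be reordered or run side by side."""
    for step in steps:
        state = step(state)
    return state

# Width helps when the work decomposes into independent pieces...
print(solve_wide([lambda: 2 * 3, lambda: 4 * 5], combine=sum))            # 26

# ...but a dependent chain like "square, then add one, then negate" needs depth.
print(solve_deep(3, [lambda x: x * x, lambda x: x + 1, lambda x: -x]))    # -10
```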
Every layer in an LLM operates on all positions in parallel, so all of its input data must come from the previous layer. Attention is what allows layer n to collect data from layer n-1 at other positions in the context.
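Here’s a minimal sketch of that data flow, assuming a toy single-head attention stack in numpy (weights shared across layers, no masking, MLPs, or residuals; every name here is mine, not the paper’s): each position’s new state is a mixture of previous-layer states, and nothing computed at a layer can feed back into that same layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d, seq_len, n_layers = 8, 5, 3
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_layer(h_prev):
    """Every position's new state is built only from the previous layer's states."""
    q, k, v = h_prev @ Wq, h_prev @ Wk, h_prev @ Wv
    weights = softmax(q @ k.T / np.sqrt(d))   # (seq_len, seq_len) mixing weights
    return weights @ v                        # reads h_prev, never the current layer

h = rng.normal(size=(seq_len, d))             # layer-0 states (the embeddings)
for _ in range(n_layers):                     # depth is fixed by the loop count,
    h = attention_layer(h)                    # not by how many positions exist
```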
With filler tokens, positions i, i+1, and i+2 all receive exactly the same inputs except for positional data and the filler tokens themselves. To continue sequential reasoning from a previous layer, the model either needs a deeper layer which can attend to the previous output directly, or the previous layer needs to output a meaningful token which can be fed in from the top.
This is a problem for filler tokens, since the network has the same depth at every input, and the only new information is the filler token and the positional information.
It’s possible[6] to exploit the positional information to process the same inputs differently at different positions (this is what the dot-by-dot authors do), but it’s not possible to use it to get additional sequential steps. An n-layer network only gets n layers of processing, no matter what path the data takes through it.
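A back-of-the-envelope way to put it (my own rough accounting, not anything from the paper): filler tokens add columns to the grid of hidden states, while chain of thought lets the depth compound, because each emitted token gets a fresh pass whose result feeds into the next one.

```python
# Rough accounting of available sequential steps, assuming (very roughly)
# one dependent reasoning step per layer.
def max_sequential_steps(n_layers: int, n_emitted_tokens: int, cot: bool) -> int:
    if cot:
        # Each emitted token gets another full pass, and its output is fed
        # back in from the top, so depth compounds across tokens.
        return n_layers * n_emitted_tokens
    # Filler tokens only add positions: every path through the network still
    # crosses at most n_layers layers.
    return n_layers

print(max_sequential_steps(n_layers=100, n_emitted_tokens=50, cot=False))  # 100
print(max_sequential_steps(n_layers=100, n_emitted_tokens=50, cot=True))   # 5000
```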
[1] To be fair to the authors, they say all of this in the paper. I just didn’t understand it.
[2] Specifically, they find that on a problem where the model needs to check the sums of up to 364 triplets with only 6 attention heads, it's able to spread the work across filler token positions and then select the triplet which sums to zero.
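For reference, here’s a brute-force version of that kind of task, written by me as a sketch of the problem rather than of the paper’s setup (364 is C(14, 3), the number of triplets you can draw from 14 inputs; the paper’s exact encoding differs):

```python
from itertools import combinations
from math import comb

def find_zero_triplet(xs):
    """Check every triplet and return one that sums to zero, if any."""
    for triplet in combinations(xs, 3):   # C(len(xs), 3) candidates in total
        if sum(triplet) == 0:
            return triplet
    return None

print(comb(14, 3))                            # 364
print(find_zero_triplet([4, -1, 7, -3, 2]))   # (4, -1, -3)
```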
[3] Any parallel algorithm can be executed sequentially, but not all sequential algorithms can be parallelized.
Intuitively, if your algorithm is "think about apples and oranges at the same time", you can turn it into "think about apples and then think about oranges"; but if your algorithm is "look at the last word and then think about it", there's no way to parallelize that, since the second step depends on the first.
Don't take "not all" too strongly though. Many sequential algorithms can be turned into parallel algorithms, especially if you're willing to take an efficiency hit. For example, "do x and then do [y1, y2, y3, ...] based on the results" can be turned into "do [(x, y), (x, y2), (x, y3), ...] in parallel and then discard most of the results".
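As code, that rewrite looks something like this (a toy sketch with made-up names): the sequential version has a step that waits on another, while the parallel version redundantly pairs x with every candidate follow-up, runs the pairs independently, and discards all but the pair x would have chosen.

```python
def run_sequential(pick_branch, followups):
    choice = pick_branch()           # step 1
    return followups[choice]()       # step 2 depends on step 1

def run_parallel(pick_branch, followups):
    # Each pair recomputes pick_branch and runs one follow-up. The pairs are
    # independent of each other, so they could all execute at the same time.
    results = [(pick_branch(), followup()) for followup in followups]
    # Keep the pair whose follow-up matches the branch that was picked.
    for i, (choice, value) in enumerate(results):
        if i == choice:
            return value

followups = [lambda: "small", lambda: "medium", lambda: "large"]
pick_one = lambda: 1                              # always picks branch 1
print(run_sequential(pick_one, followups))        # medium
print(run_parallel(pick_one, followups))          # medium (3x the work)
```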
[4] "Long-term" here means reasoning that carries forward between positions. Some frontier models have over 100 layers, so the amount of hidden sequential processing they can do is still non-trivial.
[5] Although steganography is still an option, at least in theory.
[6] This is separate from my main point, but it’s also really hard to train a model to parallelize like this. The authors of the paper had to use hand-tuned training examples and custom positional information to make it work. And even then, it only learned to do this for one problem.
It’s theoretically possible for an LLM to learn a general method of parallelizing computation across positions using filler tokens, but I would be surprised if they were able to learn something this complicated by accident through RL.