This is a really good point. Even if you could train a neuralese model, it would rapidly accumulate errors during inference and go out of distribution.
This is already a problem with tokenized models, where one incorrect token forces the model to condition on that token, but for continuous models we'd expect basically every output to have some error.
If this were true, wouldn't it imply that quantizing the model (or at least the KV cache) would improve performance?
No, for several reasons. For starters, quantization is normally applied after training and isn't present during training (mainly because it introduces gradient problems) - that's not comparable to the token distribution, which we incorporate during training and train on. (In other words, the model can't take advantage of any possible benefits of quantization, because it was trained in an entirely different setting.)
Even more importantly, the error doesn't accumulate in the KV cache (the weights obviously can't accumulate anything - they're fixed). Look at a token's KV entry at the i-th layer: it only carries the noise from layers 0 through i (any minor noise from earlier steps was removed, because the input started as a discrete token). Through the attention mechanism it passes this noise on to later tokens, which then still face the noise of layers i through the final layer. But that's just one forward pass worth of noise - nothing to worry much about, since we then tokenize the output, removing all the minor noise again. (In other words, my argument is about noise that grows and grows across autoregressive steps; this is just the noise of a single forward pass.)
Specifically, as noted in "The Bandwidth Counterargument":
Having bottleneck layers in a normal neural network is nonsensical - when the "distance" is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn't yet grown enough to matter and tokenization can clean it up.
I think the question is whether applying quantization to hidden states in the middle of the forward pass during both training and inference would improve performance, which your argument would seem to imply.
I'm not sure which passage you're referring to when you say my argument implies this. The sections "The Bandwidth Intuition/Counterargument" are meant to address exactly that point: I agree there is obviously still a loss of information, which is why it's nonsensical for a normal NN to have minuscule layers. But that isn't an accurate assessment for neuralese LLMs - they recursively accumulate this error, turning it into a much bigger problem. If the error is simply allowed to grow across tens and hundreds of forward passes, the bandwidth gain isn't worth it. And once we do tokenize, quantization no longer serves any real purpose.
I hope my position has become a little bit more clear.
But why would this error accumulation be a problem in recurrent forward passes and not one long forward pass?
Main thesis: Discrete token vocabularies don't lose information so much as they allow information to be retained in the first place. By removing minor noise and singling out major noise, errors become identifiable and therefore correctable - something continuous latent representations fundamentally cannot offer.
The Bandwidth Intuition (And Why It's Incomplete)
One of the most elementary ideas connected to neuralese is increasing bandwidth. After the tireless mountains of computation called a forward pass, we condense everything down to ~17 bits (the log₂ of our vocabulary size).
This seems insane. Imagine pitching a neural network architecture where layers 5, 10, 15, and 20 have hidden dimension 20, while normal layers use 512. You'd be laughed out of the room. And that's not even accounting for discreteness.
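To put rough numbers on the gap (taking the 4096-dimensional hidden state mentioned below, stored in float16, and a ~100k vocabulary as illustrative assumptions): one hidden state carries 4096 × 16 = 65,536 bits, while one sampled token carries log₂(100,000) ≈ 16.6 bits - a compression of roughly 4,000× at every autoregressive step.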
So why do I think this apparent insanity is not just tolerable but necessary for LLMs?
The Noise Accumulation Problem
Let's imagine LLMs passed vast latent messages instead of tokens - say, the final hidden state or something a few matrix multiplications away from it. No bandwidth concerns. Pure continuous reasoning.
This message won't be perfect. The forward pass involves finite compute, imperfect optimization, distributional shift as we move off the training manifold, and so on. Call this aggregate imperfection "noise" or "error".
Here's the problem: noise in continuous space has no natural factorization.
When a 4096-dimensional vector is slightly "off," which components are errors and which are intentional signal? The representations weren't designed with error boundaries. The noise is semantically entangled with the message. There's no way to look at the latent state and say "that part is the mistake."
This noise might occasionally get recognized and corrected, but that shouldn't be expected. There are too many values the noise can take, too many ways it can blend with legitimate concepts, for the next forward pass to reliably catch it.
So the noise accumulates. And accumulates. Until eventually you've drifted far enough from any correct trajectory that you produce clearly wrong outputs. Yet, you can't identify where things went wrong.
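As a crude illustration of this accumulation (a toy sketch, not a model of any real architecture: each "forward pass" is just the previous state plus a small random perturbation, and all sizes and noise scales are made up):

```python
import numpy as np

# Toy model of recursive latent message passing: each step the model re-emits its
# own slightly imperfect continuous state, so perturbations are never removed.
rng = np.random.default_rng(0)
d, steps, sigma = 4096, 1000, 1e-3   # illustrative width, chain length, per-pass noise

state = np.zeros(d)                  # stand-in for the "correct" trajectory
drift = []
for _ in range(steps):
    state = state + rng.normal(0.0, sigma, d)   # one forward pass worth of error
    drift.append(np.linalg.norm(state))          # distance from the correct trajectory

print(f"drift after 1 pass:      {drift[0]:.3f}")
print(f"drift after 100 passes:  {drift[99]:.3f}")
print(f"drift after 1000 passes: {drift[-1]:.3f}")  # grows roughly like sqrt(steps)
```

Nothing in the chain ever removes the perturbation, so the distance from the intended trajectory only grows with the number of passes.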
Discreteness as Regeneration
Discrete tokens don't have this problem. The noise exists in the final hidden layer just as much, but it gets rounded away when sampling from the categorical distribution.
This is comparable to digital vs. analog communication - think of old telephone networks passing messages through chains of relays:
The token vocabulary is the regeneration codebook. The softmax is the classifier. Each autoregressive step is a relay station.
Of course, sometimes the noise is large enough that you "round in the wrong direction": you sample the wrong token. This is major noise. But crucially: major noise is identifiable in a way minor noise isn't.
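Here's the same toy loop with a regeneration step bolted on, as a sketch of the relay analogy: the "vocabulary" is a random codebook, snapping to the nearest entry stands in for sampling from the softmax, and again all sizes and noise scales are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, steps = 64, 1000, 1000                 # illustrative sizes, not real model values
codebook = rng.normal(0.0, 1.0, (vocab, d))      # stand-in for token embeddings

def snap(x):
    """Regeneration step: round a noisy state to the nearest codebook entry."""
    idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    return codebook[idx], idx

def run(sigma):
    """Send a message through `steps` relays, adding noise and re-discretizing each time."""
    state, start_idx = snap(rng.normal(0.0, 1.0, d))
    wrong = 0
    for _ in range(steps):
        state, idx = snap(state + rng.normal(0.0, sigma, d))  # minor noise gets rounded away
        wrong += (idx != start_idx)                            # major noise = a wrong, nameable token
    return float(np.linalg.norm(state - codebook[start_idx])), wrong

print(run(sigma=1e-3))   # tiny noise: zero drift, zero wrong tokens - the relay fully regenerates the signal
print(run(sigma=5.0))    # huge noise: errors appear, but as discrete wrong choices, not gradual drift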
Why Identifiability Enables Error Correction
Going back to the example of telephone networks, imagine a digital communication system messing up: one of the relays misclassified the signal, so the wrong signal was regenerated and sent further down the line. This error is fixable - we know where something went wrong. Maybe the classifier is broken, maybe those specific cables aren't insulated well enough; whatever it is, there's no reason to succumb to despair. We surely know of a different set of relays and cables that should work, starting from where we left off. And if they don't, we again figure out where something went wrong and progress slowly but surely through the maze.
Now compare this to the analog case - there is no clear "mistake", no single intervention point we can use as a checkpoint. At every relay, noise accumulates: never a lot, but always a little. Our best shot would be resending the signal down a completely different path, from start to finish, and hoping for a miracle.
The same logic applies to token-based reasoning. When you pick token X instead of token Y, that's a discrete, inspectable decision. The model reflecting on its own output can ask: "Was X the right choice here?" It can explore: "What if I had said Y instead?" The discrete structure gives you a search tree of token choices rather than an undifferentiated continuous trajectory.
This is what backtracking looks like in chain-of-thought reasoning: thinking back, recognizing that a particular choice led somewhere unproductive, and exploring alternatives from that branch point.
This doesn't mean error correction is easy. Understanding when something went wrong is genuinely hard - you're often stuck, not knowing what the correct path forward looks like, and you just pick a direction to explore. But the point is that it's feasible. Because the error isn't gradual and diffuse, you can hope to identify it and therefore fix it. You can progress slowly but surely through the maze, backtracking when needed, rather than praying your next blind attempt happens to stumble on the right path.
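As a sketch of what that search tree buys you (a toy backtracking problem with a made-up validity check - nothing a real model literally runs):

```python
# Toy backtracking over discrete choices: because every step is one of finitely many
# identifiable options, we can record the path, detect a dead end, and resume from
# the exact branch point - a move with no analogue for a continuous trajectory.
def backtrack(path, choices, is_dead_end, is_solution):
    if is_solution(path):
        return path
    if is_dead_end(path):
        return None                    # identifiable failure: abandon only this branch
    for c in choices:
        result = backtrack(path + [c], choices, is_dead_end, is_solution)
        if result is not None:
            return result
    return None

# Hypothetical example: find a 4-step sequence summing to 7 without any prefix exceeding 7.
solution = backtrack(
    path=[],
    choices=[1, 2, 3],
    is_dead_end=lambda p: sum(p) > 7 or len(p) > 4,
    is_solution=lambda p: len(p) == 4 and sum(p) == 7,
)
print(solution)   # [1, 1, 2, 3]
```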
The Bandwidth Counterargument
You might object: if the reasoning chain is short enough, noise won't accumulate to catastrophic levels, and the bandwidth cost of tokenization dominates. There's truth here.
Having bottleneck layers in a normal neural network is nonsensical - when the "distance" is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn't yet grown enough to matter and tokenization can clean it up.
The question is what happens over long reasoning chains, where hundreds or thousands of sequential steps compound whatever imperfections exist.
Tokens Are More Expressive Than They Seem
The expressiveness of discrete symbols shouldn't be underestimated. Each token is one of ~100k options, but carries deep meaning and (more importantly) rich connections to other tokens. A single sentence can express a vast space of concepts through composition.
This matters especially when you consider how well-adapted the model is to these specific embeddings. It's like how semantically transparent languages (Chinese characters, for instance) allow readers who know only basic components to understand novel compounds: they grasp the building blocks and how they interact, letting them extrapolate beyond their training data.
Theoretically, LLMs can operate on embeddings outside the vocabulary. But whether patterns learned in an absurdly high-dimensional, interpolative space (as opposed to a deliberately finite, highly structured set) will generalize equally well is questionable.
The Tokenization Counter-Counterargument
Here's another objection to my thesis: current tokenization actively introduces errors.
Arithmetic with subword tokens is a mess. Proper nouns get split arbitrarily. Multilingual text suffers. The discretization bottleneck doesn't just filter noise - it also destroys information that continuous representations would preserve.
This is real. But I'd argue it's an indictment of current vocabularies, not of discreteness itself.
The 100k tokens of English-optimized BPE weren't designed for reasoning. They encode accidents of orthography, not conceptual primitives. The solution isn't to abandon discreteness - it's to build better discrete representations (this isn't GOFAI, I promise).
Toward Minimal Compositional Vocabularies
There's a lot of cruft in languages like English that reasoning-focused training shouldn't have to care about. In fact, I'd argue we should go smaller, not larger: strip the vocabulary from 100k down to 1000, or even 500 tokens. Force the LLM to develop its own "language", with immense compositionality as the only constraint - enforced by the small vocabulary itself.
A compositional vocabulary with 500 primitives might express the same concepts as 100k English subwords while being far more training-efficient: new relations learned during training no longer have to pass through layers of abstraction like word choice and writing style, but take place directly at the reasoning level, allowing for strong generalization.
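For a rough sense of capacity (pure counting, ignoring any grammar or ordering constraints): 500 primitives already give 500² = 250,000 ordered pairs and 500³ = 125,000,000 ordered triples, so a vocabulary 200× smaller than today's ~100k overtakes it in expressible combinations after just two positions. The expressiveness has to come from composition - which is exactly the constraint we want to enforce.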
Escaping The Local Optima
A big problem is the local optima that LLMs enjoy through pretraining - whether they will ever truly escape them on their own is questionable, but doing so is much needed to reach greater heights, literally. Instead of trying to come up with elaborate reward schemes - the carrot - I believe the stick is needed: simply and continuously (the irony) strip them of the tokens they have grown so accustomed to.
By removing a growing set of tokens from the categorical distribution during rollouts, while the model is still exposed to optimization pressure, the LLM naturally gets forced out of its local optimum, because we are literally destroying it. It will try to represent the now-blacklisted tokens and their associated concepts through patterns of other tokens, finding suitable representations that let it keep its vast pretraining knowledge intact (hopefully).
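A minimal sketch of the mechanism in plain numpy, with a made-up toy vocabulary and blacklist - in practice this masking would sit inside whatever rollout loop the RL setup uses, and the blacklist would grow over training:

```python
import numpy as np

def sample_with_blacklist(logits, blacklist, rng):
    """Sample from the categorical distribution with blacklisted token ids removed."""
    masked = logits.copy()
    masked[list(blacklist)] = -np.inf            # blacklisted tokens get zero probability
    probs = np.exp(masked - masked.max())        # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
vocab_size = 8                                   # toy vocabulary
logits = rng.normal(0.0, 1.0, vocab_size)
blacklist = {int(np.argmax(logits))}             # e.g. ban the model's current favorite token

print("greedy pick without blacklist:", int(np.argmax(logits)))
print("sampled with blacklist:       ", sample_with_blacklist(logits, blacklist, rng))
```

The model never sees the banned ids during rollouts, so whatever behavior the optimizer rewards has to be expressed through the tokens that remain.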
I'm well aware that this sounds like AI-safety hell, but if you believe current models can't quite cause Armageddon just yet, I hope you can see how interesting the results of such training would be - regardless of whether it "failed" or "succeeded", whatever those two may even mean in this context.
Summary (of claims)