This is a really good point. Even if you could train a neuralese model, it would rapidly accumulate errors during inference and go out of distribution.
This is already a problem with tokenized models, where one incorrect token forces the model to condition on that token, but for continuous models we'd expect basically every output to have some error.
If this were true, wouldn't it imply that quantizing the model (or at least the KV cache) would improve performance?
No, for several reasons. For starters, quantization is normally applied after training and isn't present during training (mainly because it introduces gradient problems) - that's not comparable to the token distribution, which we incorporate during training and train on. (In other words, the model can't take advantage of any possible benefits of quantization, because it was trained in an entirely different setting.)
Even more importantly, the error doesn't accumulate in the KV cache (the weights obviously can't accumulate anything - they're fixed). Look at a token's KV entry at the i-th layer: it only carries the noise from layers 0 through i (any minor noise from earlier steps was removed, because the input started as a discrete token). Through the attention mechanism it passes this noise on to later tokens, which then still face the noise of layers i through the final layer. But that's just one forward pass worth of noise - nothing to worry much about, since we then tokenize the output, removing all the minor noise again. (In other words, my argument is about noise that grows and grows across autoregressive steps; this is just the noise of a single forward pass.)
Specifically, as noted in "The Bandwidth Counterargument":
Having bottleneck layers in a normal neural network is nonsensical - when the "distance" is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn't yet grown enough to matter and tokenization can clean it up.
I think the question is whether applying quantization to hidden states in the middle of the forward pass during both training and inference would improve performance, which your argument would seem to imply.
I'm not sure which passage you're referring to when you say my argument implies this. The sections "The Bandwidth Intuition/Counterargument" are meant to address exactly that point: I agree there is obviously still a loss of information, which is why it's nonsensical for a normal NN to have minuscule layers. But that isn't an accurate assessment for neuralese LLMs - they recursively accumulate this error, turning it into a much bigger problem. If the error is simply allowed to grow across tens and hundreds of forward passes, the bandwidth gain isn't worth it. And once we do tokenize, quantization no longer serves any real purpose.
I hope my position has become a little bit more clear.
But why would this error accumulation be a problem in recurrent forward passes and not one long forward pass?
Main thesis: Discrete token vocabularies don't lose information so much as they allow information to be retained in the first place. By removing minor noise and singling out major noise, errors become identifiable and therefore correctable - something continuous latent representations fundamentally cannot offer.
The Bandwidth Intuition (And Why It's Incomplete)
One of the most elementary ideas connected to neuralese is increasing bandwidth. After the tireless mountains of computation called a forward pass, we condense everything down to ~17 bits (the log₂ of our vocabulary size).
This seems insane. Imagine pitching a neural network architecture where layers 5, 10, 15, and 20 have hidden dimension 20, while normal layers use 512. You'd be laughed out of the room. And that's not even accounting for discreteness.
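To put rough numbers on the gap (taking the 4096-dimensional hidden state mentioned below, stored in float16, and a ~100k vocabulary as illustrative assumptions): one hidden state carries 4096 × 16 = 65,536 bits, while one sampled token carries log₂(100,000) ≈ 16.6 bits - a compression of roughly 4,000× at every autoregressive step.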
So why do I think this apparent insanity is not just tolerable but necessary for LLMs?
The Noise Accumulation Problem
Let's imagine LLMs passed vast latent messages instead of tokens - say, the final hidden state or something a few matrix multiplications away from it. No bandwidth concerns. Pure continuous reasoning.
This message won't be perfect. The forward pass involves finite compute, imperfect optimization, distributional shift as we move off the training manifold, and so on. Call this aggregate imperfection "noise" or "error".
Here's the problem: noise in continuous space has no natural factorization.
When a 4096-dimensional vector is slightly "off," which components are errors and which are intentional signal? The representations weren't designed with error boundaries. The noise is semantically entangled with the message. There's no way to look at the latent state and say "that part is the mistake."
This noise might occasionally get recognized and corrected, but that shouldn't be expected. There are too many values the noise can take, too many ways it can blend with legitimate concepts, for the next forward pass to reliably catch it.
So the noise accumulates. And accumulates. Until eventually you've drifted far enough from any correct trajectory that you produce clearly wrong outputs. Yet, you can't identify where things went wrong.
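As a crude illustration of this accumulation (a toy sketch, not a model of any real architecture: each "forward pass" is just the previous state plus a small random perturbation, and all sizes and noise scales are made up):

```python
import numpy as np

# Toy model of recursive latent message passing: each step the model re-emits its
# own slightly imperfect continuous state, so perturbations are never removed.
rng = np.random.default_rng(0)
d, steps, sigma = 4096, 1000, 1e-3   # illustrative width, chain length, per-pass noise

state = np.zeros(d)                  # stand-in for the "correct" trajectory
drift = []
for _ in range(steps):
    state = state + rng.normal(0.0, sigma, d)   # one forward pass worth of error
    drift.append(np.linalg.norm(state))          # distance from the correct trajectory

print(f"drift after 1 pass:      {drift[0]:.3f}")
print(f"drift after 100 passes:  {drift[99]:.3f}")
print(f"drift after 1000 passes: {drift[-1]:.3f}")  # grows roughly like sqrt(steps)
```

Nothing in the chain ever removes the perturbation, so the distance from the intended trajectory only grows with the number of passes.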
Discreteness as Regeneration
Discrete tokens don't have this problem. The noise exists in the final hidden layer just as much, but it gets rounded away when sampling from the categorical distribution.
This is comparable to digital vs. analog communication - think of old telephone networks passing messages through chains of relays:
The token vocabulary is the regeneration codebook. The softmax is the classifier. Each autoregressive step is a relay station.
Of course, sometimes the noise is large enough that you "round in the wrong direction": you sample the wrong token. This is major noise. But crucially: major noise is identifiable in a way minor noise isn't.
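Here's the same toy loop with a regeneration step bolted on, as a sketch of the relay analogy: the "vocabulary" is a random codebook, snapping to the nearest entry stands in for sampling from the softmax, and again all sizes and noise scales are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab, steps = 64, 1000, 1000                 # illustrative sizes, not real model values
codebook = rng.normal(0.0, 1.0, (vocab, d))      # stand-in for token embeddings

def snap(x):
    """Regeneration step: round a noisy state to the nearest codebook entry."""
    idx = int(np.argmin(np.linalg.norm(codebook - x, axis=1)))
    return codebook[idx], idx

def run(sigma):
    """Send a message through `steps` relays, adding noise and re-discretizing each time."""
    state, start_idx = snap(rng.normal(0.0, 1.0, d))
    wrong = 0
    for _ in range(steps):
        state, idx = snap(state + rng.normal(0.0, sigma, d))  # minor noise gets rounded away
        wrong += (idx != start_idx)                            # major noise = a wrong, nameable token
    return float(np.linalg.norm(state - codebook[start_idx])), wrong

print(run(sigma=1e-3))   # tiny noise: zero drift, zero wrong tokens - the relay fully regenerates the signal
print(run(sigma=5.0))    # huge noise: errors appear, but as discrete wrong choices, not gradual drift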
Why Identifiability Enables Error Correction
Going back to the example of telephone networks, imagine a digital communication system messing up: one of the relays misclassified the signal, so the wrong signal was regenerated and sent further down the line. This error is fixable - we know where something went wrong. Maybe the classifier is broken, maybe those specific cables aren't insulated well enough; whatever it is, there's no reason to succumb to despair. We surely know of a different set of relays and cables that should work, starting from where we left off. And if they don't, we again figure out where something went wrong and progress slowly but surely through the maze.
Now compare this to the analog case - there is no clear "mistake", no single intervention point we can use as a checkpoint. At every relay, noise accumulates: never a lot, but always a little. Our best shot would be resending the signal down a completely different path, from start to finish, and hoping for a miracle.
The same logic applies to token-based reasoning. When you pick token X instead of token Y, that's a discrete, inspectable decision. The model reflecting on its own output can ask: "Was X the right choice here?" It can explore: "What if I had said Y instead?" The discrete structure gives you a search tree of token choices rather than an undifferentiated continuous trajectory.
This is what backtracking looks like in chain-of-thought reasoning: thinking back, recognizing that a particular choice led somewhere unproductive, and exploring alternatives from that branch point.
This doesn't mean error correction is easy. Understanding when something went wrong is genuinely hard - you're often stuck, not knowing what the correct path forward looks like, and you just pick a direction to explore. But the point is that it's feasible. Because the error isn't gradual and diffuse, you can hope to identify it and therefore fix it. You can progress slowly but surely through the maze, backtracking when needed, rather than praying your next blind attempt happens to stumble on the right path.
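As a sketch of what that search tree buys you (a toy backtracking problem with a made-up validity check - nothing a real model literally runs):

```python
# Toy backtracking over discrete choices: because every step is one of finitely many
# identifiable options, we can record the path, detect a dead end, and resume from
# the exact branch point - a move with no analogue for a continuous trajectory.
def backtrack(path, choices, is_dead_end, is_solution):
    if is_solution(path):
        return path
    if is_dead_end(path):
        return None                    # identifiable failure: abandon only this branch
    for c in choices:
        result = backtrack(path + [c], choices, is_dead_end, is_solution)
        if result is not None:
            return result
    return None

# Hypothetical example: find a 4-step sequence summing to 7 without any prefix exceeding 7.
solution = backtrack(
    path=[],
    choices=[1, 2, 3],
    is_dead_end=lambda p: sum(p) > 7 or len(p) > 4,
    is_solution=lambda p: len(p) == 4 and sum(p) == 7,
)
print(solution)   # [1, 1, 2, 3]
```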
The Bandwidth Counterargument
You might object: if the reasoning chain is short enough, noise won't accumulate to catastrophic levels, and the bandwidth cost of tokenization dominates. There's truth here.
Having bottleneck layers in a normal neural network is nonsensical - when the "distance" is small, you should stay in high-dimensional, continuous latent space. By the time one forward pass finishes, noise hasn't yet grown enough to matter and tokenization can clean it up.
The question is what happens over long reasoning chains, where hundreds or thousands of sequential steps compound whatever imperfections exist.
Tokens Are More Expressive Than They Seem
The expressiveness of discrete symbols shouldn't be underestimated. Each token is one of ~100k options, but carries deep meaning and (more importantly) rich connections to other tokens. A single sentence can express a vast space of concepts through composition.
This matters especially when you consider how well-adapted the model is to these specific embeddings. It's like how semantically transparent languages (Chinese characters, for instance) allow readers who know only basic components to understand novel compounds: they grasp the building blocks and how they interact, letting them extrapolate beyond their training data.
Theoretically, LLMs can operate on embeddings outside the vocabulary. But whether patterns learned in an absurdly high-dimensional, interpolative space (as opposed to a deliberately finite, highly structured set) will generalize equally well is questionable.
The Tokenization Counter-Counterargument
Here's another objection to my thesis: current tokenization actively introduces errors.
Arithmetic with subword tokens is a mess. Proper nouns get split arbitrarily. Multilingual text suffers. The discretization bottleneck doesn't just filter noise - it also destroys information that continuous representations would preserve.
This is real. But I'd argue it's an indictment of current vocabularies, not of discreteness itself.
The 100k tokens of English-optimized BPE weren't designed for reasoning. They encode accidents of orthography, not conceptual primitives. The solution isn't to abandon discreteness - it's to build better discrete representations (this isn't GOFAI, I promise).
Toward Minimal Compositional Vocabularies
There's a lot of cruft in languages like English that reasoning-focused training shouldn't have to care about. In fact, I'd argue we should go smaller, not larger: strip the vocabulary from 100k down to 1000, or even 500 tokens. Force the LLM to develop its own "language", with immense compositionality as the only constraint - enforced by the small vocabulary itself.
A compositional vocabulary with 500 primitives might express the same concepts as 100k English subwords while being far more training-efficient: new relations learned during training no longer have to pass through layers of abstraction like word choice and writing style, but take place directly at the reasoning level, allowing for strong generalization.
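For a rough sense of capacity (pure counting, ignoring any grammar or ordering constraints): 500 primitives already give 500² = 250,000 ordered pairs and 500³ = 125,000,000 ordered triples, so a vocabulary 200× smaller than today's ~100k overtakes it in expressible combinations after just two positions. The expressiveness has to come from composition - which is exactly the constraint we want to enforce.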
Escaping The Local Optima
A big problem is the local optima that LLMs enjoy through pretraining - whether they will ever truly escape them on their own is questionable, but doing so is much needed to reach greater heights, literally. Instead of trying to come up with elaborate reward schemes - the carrot - I believe the stick is needed: simply and continuously (the irony) strip them of the tokens they have grown so accustomed to.
By removing a growing set of tokens from the categorical distribution during rollouts, while the model is still exposed to optimization pressure, the LLM naturally gets forced out of its local optimum, because we are literally destroying it. It will try to represent the now-blacklisted tokens and their associated concepts through patterns of other tokens, finding suitable representations that let it keep its vast pretraining knowledge intact (hopefully).
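A minimal sketch of the mechanism in plain numpy, with a made-up toy vocabulary and blacklist - in practice this masking would sit inside whatever rollout loop the RL setup uses, and the blacklist would grow over training:

```python
import numpy as np

def sample_with_blacklist(logits, blacklist, rng):
    """Sample from the categorical distribution with blacklisted token ids removed."""
    masked = logits.copy()
    masked[list(blacklist)] = -np.inf            # blacklisted tokens get zero probability
    probs = np.exp(masked - masked.max())        # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(logits), p=probs)

rng = np.random.default_rng(0)
vocab_size = 8                                   # toy vocabulary
logits = rng.normal(0.0, 1.0, vocab_size)
blacklist = {int(np.argmax(logits))}             # e.g. ban the model's current favorite token

print("greedy pick without blacklist:", int(np.argmax(logits)))
print("sampled with blacklist:       ", sample_with_blacklist(logits, blacklist, rng))
```

The model never sees the banned ids during rollouts, so whatever behavior the optimizer rewards has to be expressed through the tokens that remain.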
I'm well aware that this sounds like AI-safety hell, but if you believe current models can't quite cause Armageddon just yet, I hope you can see how interesting the results of such training would be - regardless of whether it "failed" or "succeeded", whatever those two may even mean in this context.
Summary (of claims)