The Case Against Continuous Chain-of-Thought (Neuralese)
Main thesis: Discrete token vocabularies don't lose information so much as they allow information to be retained in the first place. By removing minor noise and singling out major noise, discretization makes errors identifiable and therefore correctable, something continuous latent representations fundamentally cannot offer.

The Bandwidth Intuition (And Why It's Incomplete)

One of the most elementary ideas connected to neuralese is increasing bandwidth. After the tireless mountains of computation called a forward pass, we condense everything down to ~17 bits (the log₂ of our vocabulary size). This seems insane. Imagine pitching a neural network architecture where layers 5, 10, 15, and 20 have hidden dimension 20 while normal layers use 512. You'd be laughed out of the room. And that's not even accounting for discreteness.

So why do I think this apparent insanity is not just tolerable but necessary for LLMs?

The Noise Accumulation Problem

Let's imagine LLMs passed vast latent messages instead of tokens - say, the final hidden state or something a few matrix multiplications away from it. No bandwidth concerns. Pure continuous reasoning.

This message won't be perfect. The forward pass involves finite compute, imperfect optimization, distributional shift as we move off the training manifold, and so on. Call this aggregate imperfection "noise" or "error".

Here's the problem: noise in continuous space has no natural factorization. When a 4096-dimensional vector is slightly "off," which components are errors and which are intentional signal? The representations weren't designed with error boundaries. The noise is semantically entangled with the message. There's no way to look at the latent state and say "that part is the mistake."

This noise might occasionally get recognized and corrected, but that shouldn't be expected. There are too many values the noise can manifest in, too many ways it can blend with legitimate concepts, none of them easily distinguishable from the intended signal.
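To make these two points concrete, here's a toy sketch (the ~128k vocabulary, 4096-dimensional fp16 hidden state, and 512-token toy vocabulary are illustrative assumptions, not measurements of any particular model). It compares the per-step bandwidth of a token against a raw hidden state, and shows how a discrete bottleneck erases small perturbations entirely while a continuous message keeps them entangled with the signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- The bandwidth gap ---
# Illustrative sizes: a ~128k-token vocabulary and a 4096-dim fp16 hidden state.
vocab_size = 128_000
hidden_dim = 4096
bits_per_token = np.log2(vocab_size)   # ~17 bits per emitted token
bits_per_latent = hidden_dim * 16      # ~65,536 bits if we passed the raw fp16 state
print(f"{bits_per_token:.1f} bits/token vs. {bits_per_latent} bits/latent message")

# --- Noise under continuous vs. discrete messages (toy model) ---
# Continuous case: additive noise lands on every coordinate of the message,
# and nothing in the resulting vector marks which components are "the mistake".
d = 16
message = rng.normal(size=d)
noisy_message = message + 0.1 * rng.normal(size=d)
# No operation on `noisy_message` alone can factor the perturbation back out.

# Discrete case: the message is whichever token the logits select. Small noise
# that doesn't flip the argmax is erased completely; noise large enough to flip
# it yields a concrete, inspectable symbol that later steps can flag as wrong.
logits = rng.normal(size=512)                        # toy 512-token vocabulary
noisy_logits = logits + 0.01 * rng.normal(size=512)
print(np.argmax(logits) == np.argmax(noisy_logits))  # almost always True: noise removed
```

The specific numbers don't matter; the point is that the discrete channel gives errors a boundary to cross, while the continuous one doesn't.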
I'm not convinced by your stance on not refreshing probes. Formally, we have three outcomes for this "gamble" of adding probes to the loss: x is that the model doesn't shift its representations and the desired result is actually achieved, y is that the model shifts its representations but refreshing would let us keep probing them, and z is that the model shifts its representations and refreshing no longer helps (say, non-linear representations defeating a linear probe).
What I think your stance boils down to is that you're generally willing to take this wager, arguing P(z) is often not that high, but that once we condition on the prior outcome y, P(z|y) is too high and refreshing isn't worth it. You don't really state any clear argument...
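One minimal way to write that wager down (my formalization, with assumed utilities, not something spelled out above): give the three outcomes utilities U_x > U_y > U_z, with P(x) + P(y) + P(z) = 1, so the expected value of adding probes is

E[U] = P(x)·U_x + P(y)·U_y + P(z)·U_z.

The unconditional stance says this beats not adding probes because P(z) is small; the disputed step is whether, once a representation shift has already been observed, the quantity that matters is still P(z) or the conditional P(z|y) that is claimed to be too high for refreshing to pay off.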