Good summary of some pros and cons. Neutral lists of arguments both for and against something that don't try to come to a conclusion are rationally virtuous, so kudos.
An additional point con-neuralese -- it's a mix of 1, 3, and 7, but you didn't mention it: I think CoT might give you a capability lever as well as an alignment lever.
Specifically, a CoT (or a stream of neuralese) could be legible to the model itself (or not), which may enable a process along the following lines.
The above is a rather specific proposal, and so has many burdensome details that are likely to be wrong, but I think something like this is not entirely unlikely.
From another angle -- one of the links you have above has Nostalgebraist talking about how, from the user's perspective, a legible CoT is a capability enhancement:
In short: from an end user's perspective, CoT visibility is a capabilities improvement.
I ended up just switching to 3.7 Sonnet for the task discussed above – not because it was "smarter" as a model in any way I knew about, but simply because the associated API made it so much easier to construct prompts that would effectively leverage its intelligence for my purposes.
This strikes me as a very encouraging sign for the CoT-monitoring alignment story.
But from my angle, this is also a capability bonus -- because it allows the model to look at itself, etc?
I'm not sure about this of course, but I thought it worth adding.
Thanks, this is a good point and definitely worth adding to the list!
Assuming that this is useful for capabilities, I'd imagine that neuralese models would also be able to eventually develop mechanisms to reflect on their own reasoning processes. The kind of introspection that Anthropic recently studied might be seen as one precursor capability for this and as evidence that this is incentivized by the training process, even if the model also has the ability to reflect on the CoT. But of course, the ability to reflect on internal activations in a deep way might be much harder to develop than the ability to do this with a legible CoT.
Even Anthropic is building frontier AIs (since Opus 3 or Sonnet 3.5; this was a bit of a surprise to some at the time). Thus if the hypothetical breakthrough of meaningfully better capabilities from neuralese happens (in whatever form), all AI companies will start making use of it, as soon as the immediate behavioral downsides get mitigated to the usual level. Any other kinds of downsides won't be a reason to not go there, with frontier AIs.
Also, continual learning is analogous to neuralese, an inscrutable way of preserving/propagating information with long chains of reasoning, different from text-only notes. In both cases, you can build a textual "transcript" of a process of thinking, but it's not necessarily faithful, and doesn't screen off earlier thinking from later thinking.
if the hypothetical breakthrough of meaningfully better capabilities from neuralese happens (in whatever form), all AI companies will start making use of it
Yep, I agree on this. However, it still seems worthwhile to think about whether and when we expect neuralese to unlock meaningfully better capabilities in the first place.
Also, continual learning is analogous to neuralese
Hm, I haven't heard this argument before. I agree that there's a sense in which it's analogous to neuralese, but it isn't obvious to me that it's analogous in a concerning way. The main reason we care about preserving legible CoTs is that it enables us to catch dangerous kinds of reasoning, e.g. reasoning about subverting safety measures. If a model doesn't have enough serial depth to perform the relevant reasoning in a single forward pass, it would need to verbalize this reasoning in its CoT at some point regardless of whether the relevant facts were learned during pre-training or through continual learning. An exception would be coming across data that lays out a perfect strategy for subverting the most recent safety measures, but hopefully such strategies won't be written up, at least in places where a misaligned AI could encounter them.
Even with text-only reasoning traces, there is inscrutable thinking in the activations that can span the whole reasoning trace (with global attention); it just has to remain shallow, since its depth is given only by the number of layers and can be extended only by looping back through the text-only reasoning trace. There will be some steganography (RL makes sure of this), even if it's not substantial, that increases the effective depth of inscrutable thinking beyond a single pass through the layers. But it stops with the reasoning trace, so this affordance is still bounded.
Continual learning greatly extends the thinking horizon, likely using inscrutable weight updates, thus in particular extending the feasible depth of inscrutable thinking that can avoid showing up in the text-only reasoning traces.
This is interesting; you make good points on both sides, probably by virtue of gathering them broadly. Nice work!
This is about whether we will move to neuralese on the current default path. I think there's an additional important factor: We must not. If the default path takes us there, we must put our shoulders against it and change that path.
I think it's hard to argue that faithful CoT doesn't really help the odds of aligning LLM-based AGI. And it's really hard to argue that's not important enough to be worth some real capability loss. I realize that commercial and national defense pressures are strong; but we should not fail to mention this fact in any discussion of losing natural language CoT. If we do, we're substantially more likely to all die. It would be a horrible choice and a horrible mistake. Preventing research on improvements without CoT should be a very high priority.
The arguments for risk:
Every plan I know of for aligning LLM-based AGI benefits immensely if we still have somewhat-faithful CoT. And that's critical, because all of those plans suck! That's including my own. Taking an unreliable plan and making it a lot worse is really, really bad for our odds of survival.
Of course that's a big claim. I tried to give full credit to those who think alignment is a lot easier in Cruxes of disagreement on alignment difficulty. But even with the most generous steelmanning I can't get any reasonable misalignment estimate down below maybe 10% risk of disaster. Remove CoT and that goes way up, in that case maybe 20%.
My estimates based on my most recent and careful thinking are more like a 40-60% chance of outright alignment failure (with a lot more chance that If we solve alignment, we die anyway). That would go way up to 60-80% if we lose CoT.
On the actual question of the efficiencies of losing CoT:
I don't think there are large gains to be made by dropping language entirely. Language is reason-complete; that's what it's for, and we humans spent a long time crafting it to be complete and efficient. Language compresses very nicely, too; thinking in language doesn't prevent efficiency improvements by collapsing steps.
BUT even if that turns out to be wrong and dropping language does buy real capability gains, the alignment cost of losing legible CoT is large enough that we should refuse to pay that price.
Continual learning destroys the hope that we get to early AGIs while avoiding the issues with neuralese, because it almost certainly requires inscrutable communication that maintains thinking for a long time (likely in the form of some kind of test time weight updates), and it's an essential capability. With neuralese itself, there is a significant possibility that it doesn't get too much better by that time yet, but continual learning does need to happen first (as a capability rather than a particular algorithmic breakthrough), and it more inherently has the same issues.
Yeah, I strongly agree with everything you've said here (aside from the question of whether there are large gains to be made by dropping legible CoT—the post describes my uncertainties). It seems likely to me that whether we stick with readable CoTs will ultimately be decided by efficiency considerations, which is why I focused on those in the post, but I of course agree that we should do our best to put our shoulders against it.
I asked this exact question a while ago and wrote up some of my thoughts. https://www.lesswrong.com/posts/PJaq4CDQ5d5QtjNRy/williawa-s-shortform#YmyQqQqdei9C7pXR3 but I didn't get any answers unfortunately.
They mostly hinge on the parallelism and inference/RL vs pretraining point. The arguments that carry the most weight for me now are:
CoT having nice safety/business properties seems unlikely to matter to me. Current labs just seem unlikely to hold back capability gains for such things. Maybe I'm too cynical.
I also didn't understand the SGD inductive biases point. Like, if you're doing something like GRPO, you can train the model to output tokens that lead to fruitful reasoning down the line. Could you explain this? Like, I agree tokens are a less expressive medium than activations are, but this seems like the same argument as the information bottleneck one. I could also see it going the other way. Like, it's hard to represent rigid logical operations in activation space, and being forced to move through token space has a useful inductive prior for "reasoning" because of its discretisation.
I think CoT may be very valuable in helping capabilities researchers understand their models’ shortcomings and design training curricula that address those deficiencies. This sort of falls under “business value” itself. I currently think this is one of the stronger arguments for why we won’t switch to neuralese.
(The post only vaguely gestures at this argument, so your interpretation is also fair, but I think business value and capabilities gains are pretty closely related rather than being opposing factors.)
Thanks for bringing this up, I hadn't seen your shortform post! We've converged on pretty similar pros/cons. A few comments:
It isn't obvious to me that this kind of big-O analysis matters that much in practice. RNNs will still need the ability to recall any information from the entire generation so far.
I agree it is unclear, but I would still expect it to. Just because.
Have you seen this market?
When will an AI model using neuralese recurrence be first released to the public? (currently 11 Oct 2027)
Over the past year, I have talked to several people about whether they expect frontier AI companies to transition away from the current paradigm of transformer LLMs toward models that reason in neuralese within the next few years. This post summarizes 13 common arguments I’ve heard, six in favor and seven against a transition to neuralese AIs. The following table provides a summary:
| Arguments for a transition to neuralese | Arguments against a transition to neuralese |
| --- | --- |
| A lot of information gets lost in text bottlenecks. | Natural language reasoning might be a strong local optimum that takes a lot of training effort to escape. |
| The relative importance of post-training compared to pre-training is increasing. | Recurrent LLMs suffer from a parallelism trade-off that makes their training less efficient. |
| There’s an active subfield researching recurrent LLMs. | There’s significant business value in being able to read a model’s CoTs. |
| Human analogy: natural language might not play that big of a role in human thinking. | Human analogy: even if natural language isn’t humans’ primary medium of thought, we still rely on it a lot. |
| SGD inductive biases might favor directly learning good sequential reasoning algorithms in the weight space. | Though significant effort has been spent on getting neuralese models to work, we still have none that work at scale. |
| In the limit of capabilities, neuralese is strictly more powerful than CoT reasoning. | There’s a depth-latency trade-off that might make it less appealing to train large recurrent LLMs. |
| | There’s a lot of safety value in being able to read the model’s CoTs. |
Following Hidden Reasoning in LLMs: A Taxonomy, I’ll define neuralese as reasoning in long opaque chains of recurrence. For a precise operationalization of this, I like Fabien Roger’s definition and his corresponding prediction market. He defines the current, CoT-based reasoning paradigm as the conjunction of the following two properties:
The ability to perform 100k heavy serial operations without text bottlenecks already seems like a strong deviation from the current paradigm, where reasoning models perform up to 500 serial operations in a forward pass.[1] Thus, in this post, I’ll constrain the number of allowed serial operations to 10k instead.
The transformer architecture puts a heavy limitation on the amount of information from the residual stream that can reach the next forward pass. The residual stream has to first be compressed into a distribution over tokens, and further information is lost when sampling a single token from that distribution. AI 2027 quantifies the total information loss as follows: “Suppose that an LLM has a vocab size of ~100,000, then each token contains log₂(100,000) ≈ 16.6 bits of information, around the size of a single floating point number (assuming training in FP16). Meanwhile, residual streams [...] contain thousands of floating point numbers.”
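As a rough illustration of the size of this gap, the following back-of-the-envelope calculation uses the vocabulary size from the quote above together with an assumed residual-stream width of 8,192 (an illustrative value, not any particular model's dimension):

```python
import math

# Back-of-the-envelope comparison of information per reasoning step.
# vocab_size follows the quote; d_model is an assumed illustrative width.
vocab_size = 100_000
d_model = 8_192
bits_per_float = 16  # FP16

bits_per_token = math.log2(vocab_size)        # ~16.6 bits through the token bottleneck
bits_per_residual = d_model * bits_per_float  # nominal capacity of one residual stream
# (The nominal figure overstates the information a model actually uses,
# but the gap remains several orders of magnitude either way.)

print(f"bits per sampled token:   {bits_per_token:.1f}")
print(f"bits per residual stream: {bits_per_residual:,}")
print(f"ratio: ~{bits_per_residual / bits_per_token:,.0f}x")
```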
One might counter that the discrete token space simply carves the continuous vector space into bins that we care about, rendering the information that is thrown away superfluous. Furthermore, hard bottlenecks might encourage specialization: not having to deal with all of the information accumulated at the end of the residual stream at the previous token position allows early layers to specialize in syntactic processing and feature extraction. Nevertheless, paraphrasing Eric Drexler’s report on quasilinguistic neural representations, it would be strange to find that the optimal language for machines capable of utilizing expressive vector embeddings consists of tokens that represent mouth noises, especially given the magnitude of the information loss.
In training runs where most of the compute is allocated to pre-training, neuralese architectures are heavily disadvantaged, as I’ll explain in depth below. Over the past year, the amount of compute companies spend on post-training has increased by orders of magnitude: until recently, labs spent ~100x more on pre-training than on post-training, while the Grok 4 announcement seemed to imply that Grok 4 required an amount of post-training compute equivalent to its pre-training compute. Recurrent models don’t face similar disadvantages in post-training, where training signals are usually provided at the sequence rather than token level. Thus, if post-training compute continues to grow, spending additional resources on pre-training recurrent LLMs or on converting a transformer into a recurrent model after pretraining might become more palatable.
A double-digit number of papers has been published over the past year exploring ways to either convert transformers into a recurrent architecture or to introduce entirely new recurrent architectures. I have reviewed Geiping et al. (2025) and Hao et al. (2024) in the past, but there are others; see section 2 of Zhu et al. (2025) for a very recent literature review. These papers offer solutions to some of the age-old issues with neuralese architectures. Geiping et al. partially evade stability and efficiency issues with long chains of backpropagation by using truncated backprop, which backpropagates through only the last k iterations of the recurrent unit, regardless of the total number of recurrent iterations. Hao et al. circumvent the problem of retaining highly parallelizable pre-training with a recurrent architecture by converting a conventional transformer to a recurrent model through fine-tuning. The benchmark scores don’t look that good in either of those papers and consequently, they haven’t been scaled up to large models, but they still constitute some evidence that the efficiency issues that have long plagued RNN-like architectures might be solvable.
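For intuition, here is a minimal PyTorch-style sketch of the truncated-backprop idea, with a module structure, width, and iteration counts that are illustrative assumptions rather than details taken from Geiping et al.: the recurrent unit runs for arbitrarily many iterations, but gradients flow only through the last k of them, so the backward pass stays cheap regardless of the total chain length.

```python
import torch
import torch.nn as nn

class RecurrentDepthBlock(nn.Module):
    """Sketch of truncated backprop through a recurrent reasoning block.

    Loosely inspired by the idea in Geiping et al. (2025); the module
    structure and sizes here are illustrative, not the paper's.
    """

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.core = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x: torch.Tensor, n_iters: int = 64, k: int = 8) -> torch.Tensor:
        h = torch.zeros_like(x)
        # Run the first n_iters - k iterations without building a graph ...
        with torch.no_grad():
            for _ in range(max(n_iters - k, 0)):
                h = h + self.core(torch.cat([h, x], dim=-1))
        # ... then backpropagate only through the last k iterations, so the
        # cost of the backward pass is independent of the total chain length.
        for _ in range(min(k, n_iters)):
            h = h + self.core(torch.cat([h, x], dim=-1))
        return h

# Usage sketch: a long chain of recurrence with a short backprop window.
block = RecurrentDepthBlock()
x = torch.randn(4, 512)
loss = block(x, n_iters=64, k=8).pow(2).mean()
loss.backward()
```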
Though human inner monologue may be considered as evidence that thinking in natural language is an efficient strategy for solving a wide range of problems, there are three reasons to think that inner monologue might not be that load-bearing.
First, there are people who claim to be able to switch their inner monologue on and off and to perform most of their productive work without thinking in words. Consider the following quote by Albert Einstein:[2]
The words or the language, as they are written or spoken, do not seem to play any role in my mechanism of thought. The psychical entities which seem to serve as elements in thought are certain signs and more or less clear images which can be "voluntarily" reproduced and combined. There is, of course, a certain connection between those elements and relevant logical concepts. It is also clear that the desire to arrive finally at logically connected concepts is the emotional basis of this rather vague play with the above-mentioned elements. But taken from a psychological viewpoint, this combinatory play seems to be the essential feature in productive thought—before there is any connection with logical construction in words or other kinds of signs which can be communicated to others.
Second, there appear to be people who possess no inner monologue at all: see this X thread by Katja Grace and this reddit thread for example discussions. Finally, there is some evidence that impairments in inner speech do not necessarily influence general cognitive abilities: Langland-Hassan et al. (2015) find a lack of correlation between inner speech impairment and executive function or attention.[3]
Human analogies should not be taken too seriously when forecasting future developments in LLMs, but they still provide weak suggestive evidence. The fact that some humans seem to think wordlessly for long periods of time suggests to me that language isn’t the main medium of human thought: it would be more surprising to find two groups of humans using completely different mental mechanisms to achieve remarkably similar intelligent feats than to find that verbal thought isn’t particularly load-bearing for either group, emerging as a byproduct of load-bearing thought in one group but not the other.
Intuitively, one would expect the inductive biases of SGD to favor directly learning good sequential reasoning algorithms in the weights over optimizing the weights to output tokens that function as serial reasoning steps. There is much more room for finding the optimal way to combine a large number of complex reasoning steps in weight-space than in token-space, assuming sufficient expressive capacity for representing all of these steps in a single forward pass.
In the limit, neuralese is strictly more powerful than CoT reasoning. A model that can perform adaptive hidden computation can do everything a transformer model can do with a CoT, but additionally use long chains of recurrence whenever doing so is more efficient.
As James Chua has argued, natural language seems to be a fairly strong local optimum for models to think in. Both internet text data and LLM-generated synthetic data are written exclusively in natural language. In all of these corpuses, sequential reasoning steps are verbalized in natural language, providing a stronger training signal for models that also verbalize intermediate reasoning steps. Labs will most likely want to keep leveraging all of this existing data for pre-training. There is no continuous analog for this corpus that could be used to pre-train recurrent models to already possess efficient continuous representations at the beginning of the post-training stage.
Although recurrent models have no such disadvantage in post-training, it might be costly to retrain models that reason in natural language to utilize continuous representations for intermediate reasoning steps, as Hao et al. (2024) attempted. Even if models that think in natural language are slightly less efficient at inference, the cost of retraining them might outweigh the difference in inference efficiency. Though linguistic drift from natural language toward alien dialects due to RL optimization pressures would remain a concern in this scenario, such drift lies outside of our definition of neuralese.
A final reason to think that natural language reasoning might be a strong local optimum is that any AI will have to produce natural language outputs—essays, emails, thousands of lines of code—in order to be useful. If a model doesn’t natively think in concepts that are close to natural language, it will incur a cost in translating its internal abstractions into human language, while if it does think in concepts close to natural language, the gain from using these concepts in latent thoughts instead of in an explicit CoT is likely going to be less dramatic. Whether SGD will eventually converge at concepts far from human ones depends on how optimized human abstractions are for intelligent thought.
One may counter the previous point with the argument that recurrent models can simply be pre-trained on all the text data that transformers are currently trained on, thus having all of the capabilities of transformer LLMs but also being able to omit intermediate reasoning steps when doing so is more efficient. However, such training would incur significant efficiency costs, as there is a trade-off between highly parallelizable training and highly expressive forward passes. The efficiency of transformer pre-training stems from the possibility of processing all tokens in a single sequence in parallel, which is impossible when computations at a subsequent token position involve a nonlinear transformation of the hidden state at the previous token position.
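A toy sketch of this trade-off, with arbitrary dimensions and deliberately unrealistic modules: causal self-attention can process every position of a training sequence in one batched call because each position only needs earlier inputs, while a recurrent cell needs the previous position's hidden state and therefore forces a sequential loop.

```python
import torch
import torch.nn as nn

# Toy contrast between parallelizable and inherently sequential training
# (dimensions are arbitrary; neither module is a realistic language model).
d, seq_len, batch = 64, 128, 8
x = torch.randn(batch, seq_len, d)

# Transformer-style step: all positions are processed in one batched call,
# since position t only needs the *inputs* at positions <= t (enforced via
# a causal mask), never a previously computed hidden state.
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
parallel_out, _ = attn(x, x, x, attn_mask=causal_mask)

# Recurrent-style step: position t needs a nonlinear function of the hidden
# state at position t - 1, so the loop over the sequence cannot be avoided.
cell = nn.GRUCell(d, d)
h = torch.zeros(batch, d)
states = []
for t in range(seq_len):
    h = cell(x[:, t], h)  # depends on the hidden state from the previous step
    states.append(h)
sequential_out = torch.stack(states, dim=1)
```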
So far, highly parallelizable training has been a more important factor than highly expressive forward passes. There are ways of circumventing the trade-off to some extent: e.g., it’s possible to use truncated backprop or to only train on short chains of recurrence during pre-training and hope that this generalizes to long chains of recurrence after some fine-tuning. However, as mentioned above, neither of these approaches has worked at scale so far.
All other things being equal, both labs and customers care about the ability to read externalized reasoning chains. For both labs and users, a visible reasoning trace is useful for debugging the model: e.g., it helps with optimizing the prompt and diagnosing why the model isn’t following instructions in the intended way. For labs, visible CoTs are useful for building more robust misuse detectors and refusal mechanisms: e.g., Anthropic’s constitutional classifiers rely on the ability to monitor the model’s output stream, while OpenAI’s deliberative alignment relies on the ability to train models on distilled reasoning traces.
Despite some humans’ apparent ability to think without an inner monologue, most people can’t help but think in words and also often find it useful to write down their thoughts on a page or paper. If someone was forced to play a social deception game with opponents capable of reading their inner monologue, they would probably lose much more often than otherwise. This reliance on language might just be an accidental byproduct of load-bearing thought resulting from the general importance of language in social communication, as suggested in the previous section, but it might also have deeper functions—for example, one might argue that discrete linguistic units are more reliable and amenable to error correction than continuous thoughts. If an RNN had to learn such bottlenecks in order to perform well, there would be less to be gained from switching transformers for neuralese models.
The fact that a transition to neuralese hasn’t happened so far constitutes evidence in itself that getting neuralese architectures to work is difficult. The ML community has known for 30 years that recurrence and depth often amplify gradient variance (Bengio et al., 1994), suggesting that if easy solutions to RNNs’ stability and efficiency issues were available, they would have been discovered by now. Additionally, the failures of approaches that introduce recurrence through minimal tweaks to the transformer architecture indicate that if a transition to neuralese models is possible, this will require an architecture that differs drastically from current LLMs. Such an architecture is likely to initially suffer from inefficiencies that have already been solved for transformer LLMs: as one example, the optimal agentic coding scaffold for a neuralese model might be very different from the ones used for transformer LLMs. In this case, a large performance delta that outweighs the inefficiencies may be required for labs to turn their attention to the new architecture.
One of the bottlenecks to AI scaling discussed in Epoch’s Can AI scaling continue through 2030? is the latency wall, and that bottleneck has implications for the neuralese debate. The minimum time for a model to process a single data point increases with the depth of the model, which sets an upper bound on the training FLOPs within a given timeframe. As labs want to always make use of the latest hardware and algorithmic improvements, the duration of a training run rarely exceeds six months. Though more expressive forward passes don’t influence the amount of FLOPs that can be performed within six months, they reduce the number of data points that can be processed within this time frame, as the time spent on a training run is a product of the time required to process a single batch and the number of training batches. Though batch size can be increased to sidestep this issue, there is a critical batch size above which further increases result in dramatically diminishing returns. These constraints effectively create another trade-off—one between expressive forward passes and dataset sizes, or alternatively one between expressive forward passes and training run lengths—and though it’s possible that expressive forward passes will win out by being more data-efficient, it isn’t obvious that they will.
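As a toy illustration with entirely invented numbers, the tokens a fixed-length training run can consume scale as the run time divided by the per-step time, times the batch size (capped at the critical batch size), so a more serial forward pass directly shrinks the dataset a six-month run can see:

```python
# Toy latency-wall arithmetic (all numbers are invented for illustration).
SECONDS_PER_RUN = 0.5 * 365 * 24 * 3600  # a ~six-month training run

def tokens_processed(step_time_s: float, batch_tokens: float) -> float:
    """Tokens a run can consume when each optimizer step takes step_time_s
    seconds and the batch is capped at the critical batch size (in tokens)."""
    steps = SECONDS_PER_RUN / step_time_s
    return steps * batch_tokens

critical_batch = 16e6  # assumed critical batch size, in tokens

baseline = tokens_processed(step_time_s=2.0, batch_tokens=critical_batch)
# A forward pass with ~4x the serial depth takes roughly 4x as long per step,
# and the batch cannot be pushed past the critical batch size to compensate.
deeper = tokens_processed(step_time_s=8.0, batch_tokens=critical_batch)

print(f"baseline run:        ~{baseline:.2e} tokens")
print(f"4x-serial-depth run: ~{deeper:.2e} tokens")
```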
It seems unlikely that labs would be willing to pay the safety tax associated with keeping visible reasoning chains around for long if neuralese models turn out to be much more capable. Nevertheless, the Korbak et al. (2025) joint statement by labs offers at least some hope that monitorability considerations will be taken into account in frontier model development for the foreseeable future.
Even after laying out all of these arguments, I’m not much closer to having a confident opinion on whether I expect neuralese architectures to win out eventually or not. Compared to the two most relevant prediction markets, Fabien Roger’s Will early transformative AIs primarily use text? (currently at 47%) and Leo Gao’s Will Transformer based architectures still be SOTA for language modelling by 2026? (currently at 91%), I’m slightly more optimistic about sticking to transformers than both. I hope that this post will spur further discussion on the topic and help further clarify the considerations I’ve described. All suggestions for arguments that I’ve overlooked are also very welcome.
Work produced at Aether. Thanks to Rohan Subramani and Shubhorup Biswas for feedback on a draft of this post.
For example, Qwen3-235B-A22B, which to my knowledge is the deepest open-source transformer released this year, has 94 layers, with each layer involving five serial operations—three in the attention and two in the MLP block. Thus, the total number of serial operations in a Qwen3-235B forward pass is 470. It seems plausible that some proprietary models have slightly deeper forward passes, but likely not much deeper.
See Appendix II of Jacques Hadamard’s An Essay on the Psychology of Invention in the Mathematical Field for the source. The book extensively discusses the role of non-verbal thinking in mathematical research; Henrik Karlsson’s recent post When is it better to think without words? provides a summary.
Though note that this study has at least two significant limitations: first, impairments in inner monologue are assessed only through a test for silent rhyming ability, and second, the sample size is only 11 patients.