Good summary of some pros and cons. Neutral lists of arguments both for and against something that don't try to come to a conclusion are rationally virtuous, so kudos.
An additional point con-neuralese -- it's a mix of 1, 3, and 7, but you didn't mention it: I think CoT might give you a capability lever as well as an alignment lever.
Specifically, a CoT (or a stream of neuralese) could be legible to the model itself (or not), which may enable a process along the following lines.
The above is a rather specific proposal, and so has many burdensome details that are likely to be wrong, but I think something like this is not entirely unlikely.
From another angle -- one of the links you have above has Nostalgebraist talking about how, from the user's perspective, a legible CoT is a capability enhancement:
In short: from an end user's perspective, CoT visibility is a capabilities improvement.
I ended up just switching to 3.7 Sonnet for the task discussed above – not because it was "smarter" as a model in any way I knew about, but simply because the associated API made it so much easier to construct prompts that would effectively leverage its intelligence for my purposes.
This strikes me as a very encouraging sign for the CoT-monitoring alignment story.
But from my angle, this is also a capability bonus -- because it allows the model to look at itself, etc?
I'm not sure about this of course, but I thought it worth adding.
Thanks, this is a good point and definitely worth adding to the list!
Assuming that this is useful for capabilities, I'd imagine that neuralese models would also be able to eventually develop mechanisms to reflect on their own reasoning processes. The kind of introspection that Anthropic recently studied might be seen as one precursor capability for this and as evidence that this is incentivized by the training process, even if the model also has the ability to reflect on the CoT. But of course, the ability to reflect on internal activations in a deep way might be much harder to develop than the ability to do this with a legible CoT.
Even Anthropic is building frontier AIs (since Opus 3 or Sonnet 3.5; this was a bit of a surprise to some at the time). Thus if the hypothetical breakthrough of meaningfully better capabilities from neuralese happens (in whatever form), all AI companies will start making use of it, as soon as the immediate behavioral downsides get mitigated to the usual level. Any other kinds of downsides won't be a reason to not go there, with frontier AIs.
Also, continual learning is analogous to neuralese, an inscrutable way of preserving/propagating information with long chains of reasoning, different from text-only notes. In both cases, you can build a textual "transcript" of a process of thinking, but it's not necessarily faithful, and doesn't screen off earlier thinking from later thinking.
if the hypothetical breakthrough of meaningfully better capabilities from neuralese happens (in whatever form), all AI companies will start making use of it
Yep, I agree on this. However, it still seems worthwhile to think about whether and when we expect neuralese to unlock meaningfully better capabilities in the first place.
Also, continual learning is analogous to neuralese
Hm, I haven't heard this argument before. I agree that there's a sense in which it's analogous to neuralese, but it isn't obvious to me that it's analogous in a concerning way. The main reason we care about preserving legible CoTs is that it enables us to catch dangerous kinds of reasoning, e.g. reasoning about subverting safety measures. If a model doesn't have enough serial depth to perform the relevant reasoning in a single forward pass, it would need to verbalize this reasoning in its CoT at some point regardless of whether the relevant facts were learned during pre-training or through continual learning. An exception would be coming across data that lays out a perfect strategy for subverting the most recent safety measures, but hopefully such strategies won't be written up, at least in places where a misaligned AI could encounter them.
Even with text-only reasoning traces, there is inscrutable thinking in the activations that can span the whole reasoning trace (with global attention); it just has to remain shallow, since its depth is given only by the number of layers and can be extended only by looping back through the text-only reasoning trace. There will be some steganography (RL makes sure of this), even if it's not substantial, that increases the effective depth of inscrutable thinking beyond a single pass through the layers. But it stops with the reasoning trace, so this affordance is still bounded.
Continual learning greatly extends the thinking horizon, likely using inscrutable weight updates, thus in particular extending the feasible depth of inscrutable thinking that can avoid showing up in the text-only reasoning traces.
This is interesting; you make good points on both sides, probably by virtue of gathering them broadly. Nice work!
This is about whether we will move to neuralese on the current default path. I think there's an additional important factor: We must not. If the default path takes us there, we must put our shoulders against it and change that path.
I think it's hard to argue that faithful CoT doesn't really help the odds of aligning LLM-based AGI. And it's really hard to argue that's not important enough to be worth some real capability loss. I realize that commercial and national defense pressures are strong; but we should not fail to mention this fact in any discussion of losing natural language CoT. If we do, we're substantially more likely to all die. It would be a horrible choice and a horrible mistake. Preventing research on improvements without CoT should be a very high priority.
The arguments for risk:
Every plan I know of for aligning LLM-based AGI benefits immensely if we still have somewhat-faithful CoT. And that's critical, because all of those plans suck! That's including my own. Taking an unreliable plan and making it a lot worse is really, really bad for our odds of survival.
Of course that's a big claim. I tried to give full credit to those who think alignment is a lot easier in Cruxes of disagreement on alignment difficulty. But even with the most generous steelmanning I can't get any reasonable misalignment estimate down below maybe 10% risk of disaster. Remove CoT and that goes way up, in that case maybe 20%.
My estimates based on my most recent and careful thinking are more like a 40-60% chance of outright alignment failure (with a lot more chance that If we solve alignment, we die anyway). That would go way up to 60-80% if we lose CoT.
On the actual question of the efficiencies of losing CoT:
I don't think there are large gains to be made by dropping language entirely. Language is reason-complete; that's what it's for, and we humans spent a long time crafting it to be complete and efficient. Language compresses very nicely, too; thinking in language doesn't prevent efficiency improvements by collapsing steps.
BUT even if that turns out to be wrong and dropping language does buy real capability gains, the alignment cost of losing legible CoT is large enough that we should refuse to pay that price.
Continual learning destroys the hope that we get to early AGIs while avoiding the issues with neuralese, because it almost certainly requires inscrutable communication that maintains thinking for a long time (likely in the form of some kind of test time weight updates), and it's an essential capability. With neuralese itself, there is a significant possibility that it doesn't get too much better by that time yet, but continual learning does need to happen first (as a capability rather than a particular algorithmic breakthrough), and it more inherently has the same issues.
Yeah, I strongly agree with everything you've said here (aside from the question of whether there are large gains to be made by dropping legible CoT—the post describes my uncertainties). It seems likely to me that whether we stick with readable CoTs will ultimately be decided by efficiency considerations, which is why I focused on those in the post, but I of course agree that we should do our best to put our shoulders against it.
I asked this exact question a while ago and wrote up some of my thoughts. https://www.lesswrong.com/posts/PJaq4CDQ5d5QtjNRy/williawa-s-shortform#YmyQqQqdei9C7pXR3 but I didn't get any answers unfortunately.
They mostly hinge on the parallelism and inference/RL vs pretraining point. The arguments that carry the most weight for me now are:
CoT having nice safety/business properties seems unlikely to matter to me. Current labs just seem unlikely to hold back capability gains for such things. Maybe I'm too cynical.
I also didn't understand the SGD inductive biases point. Like, if you're doing something like GRPO, you can train the model to output tokens that lead to fruitful reasoning down the line. Could you explain this? Like, I agree tokens are a less expressive medium than activations are, but this seems like the same argument as the information bottleneck one. I could also see it going the other way. Like, it's hard to represent rigid logical operations in activation space, and being forced to move through token space has a useful inductive prior for "reasoning" because of its discretisation.
I think CoT may be very valuable in helping capabilities researchers understand their models’ shortcomings and design training curricula that address those deficiencies. This sort of falls under “business value” itself. I currently think this is one of the stronger arguments for why we won’t switch to neuralese.
(The post only vaguely gestures at this argument, so your interpretation is also fair, but I think business value and capabilities gains are pretty closely related rather than being opposing factors.)
Thanks for bringing this up, I hadn't seen your shortform post! We've converged on pretty similar pros/cons. A few comments:
It isn't obvious to me that this kind of big-O analysis matters that much in practice. RNNs will still need the ability to recall any information from the entire generation so far.
I agree it is unclear, but I would still expect it to. Just because.
Have you seen this market?
When will an AI model using neuralese recurrence be first released to the public? (currently 11 Oct 2027)
Over the past year, I have talked to several people about whether they expect frontier AI companies to transition away from the current paradigm of transformer LLMs toward models that reason in neuralese within the next few years. This post summarizes 13 common arguments I’ve heard, six in favor and seven against a transition to neuralese AIs. The following table provides a summary:
| Arguments for a transition to neuralese | Arguments against a transition to neuralese |
| --- | --- |
| A lot of information gets lost in text bottlenecks. | Natural language reasoning might be a strong local optimum that takes a lot of training effort to escape. |
| The relative importance of post-training compared to pre-training is increasing. | Recurrent LLMs suffer from a parallelism trade-off that makes their training less efficient. |
| There’s an active subfield researching recurrent LLMs. | There’s significant business value in being able to read a model’s CoTs. |
| Human analogy: natural language might not play that big of a role in human thinking. | Human analogy: even if natural language isn’t humans’ primary medium of thought, we still rely on it a lot. |
| SGD inductive biases might favor directly learning good sequential reasoning algorithms in the weight space. | Though significant effort has been spent on getting neuralese models to work, we still have none that work at scale. |
| In the limit of capabilities, neuralese is strictly more powerful than CoT reasoning. | There’s a depth-latency trade-off that might make it less appealing to train large recurrent LLMs. |
| | There’s a lot of safety value in being able to read the model’s CoTs. |
Following Hidden Reasoning in LLMs: A Taxonomy, I’ll define neuralese as reasoning in long opaque chains of recurrence. For a precise operationalization of this, I like Fabien Roger’s definition and his corresponding prediction market. He defines the current, CoT-based reasoning paradigm as the conjunction of the following two properties:
The ability to perform 100k heavy serial operations without text bottlenecks already seems like a strong deviation from the current paradigm, where reasoning models perform up to 500 serial operations in a forward pass.[1] Thus, in this post, I’ll constrain the number of allowed serial operations to 10k instead.
The transformer architecture puts a heavy limitation on the amount of information from the residual stream that can reach the next forward pass. The residual stream has to first be compressed into a distribution over tokens, and further information is lost when sampling a single token from that distribution. AI 2027 quantifies the total information loss as follows: “Suppose that an LLM has a vocab size of ~100,000, then each token contains log₂(100,000) ≈ 16.6 bits of information, around the size of a single floating point number (assuming training in FP16). Meanwhile, residual streams [...] contain thousands of floating point numbers.”
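As a rough illustration of the size of this gap, the following back-of-the-envelope calculation uses the vocabulary size from the quote above together with an assumed residual-stream width of 8,192 (an illustrative value, not any particular model's dimension):

```python
import math

# Back-of-the-envelope comparison of information per reasoning step.
# vocab_size follows the quote; d_model is an assumed illustrative width.
vocab_size = 100_000
d_model = 8_192
bits_per_float = 16  # FP16

bits_per_token = math.log2(vocab_size)        # ~16.6 bits through the token bottleneck
bits_per_residual = d_model * bits_per_float  # nominal capacity of one residual stream
# (The nominal figure overstates the information a model actually uses,
# but the gap remains several orders of magnitude either way.)

print(f"bits per sampled token:   {bits_per_token:.1f}")
print(f"bits per residual stream: {bits_per_residual:,}")
print(f"ratio: ~{bits_per_residual / bits_per_token:,.0f}x")
```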
One might counter that the discrete token space simply carves the continuous vector space into bins that we care about, rendering the information that is thrown away superfluous. Furthermore, hard bottlenecks might encourage specialization: not having to deal with all of the information accumulated at the end of the residual stream at the previous token position allows early layers to specialize in syntactic processing and feature extraction. Nevertheless, paraphrasing Eric Drexler’s report on quasilinguistic neural representations, it would be strange to find that the optimal language for machines capable of utilizing expressive vector embeddings consists of tokens that represent mouth noises, especially given the magnitude of the information loss.
In training runs where most of the compute is allocated to pre-training, neuralese architectures are heavily disadvantaged, as I’ll explain in depth below. Over the past year, the amount of compute companies spend on post-training has increased by orders of magnitude: until recently, labs spent ~100x more on pre-training than on post-training, while the Grok 4 announcement seemed to imply that Grok 4 required an amount of post-training compute equivalent to its pre-training compute. Recurrent models don’t face similar disadvantages in post-training, where training signals are usually provided at the sequence rather than token level. Thus, if post-training compute continues to grow, spending additional resources on pre-training recurrent LLMs or on converting a transformer into a recurrent model after pretraining might become more palatable.
A double-digit number of papers has been published over the past year exploring ways to either convert transformers into a recurrent architecture or to introduce entirely new recurrent architectures. I have reviewed Geiping et al. (2025) and Hao et al. (2024) in the past, but there are others; see section 2 of Zhu et al. (2025) for a very recent literature review. These papers offer solutions to some of the age-old issues with neuralese architectures. Geiping et al. partially evade stability and efficiency issues with long chains of backpropagation by using truncated backprop, which backpropagates through only the last k iterations of the recurrent unit, regardless of the total number of recurrent iterations. Hao et al. circumvent the problem of retaining highly parallelizable pre-training with a recurrent architecture by converting a conventional transformer to a recurrent model through fine-tuning. The benchmark scores don’t look that good in either of those papers and consequently, they haven’t been scaled up to large models, but they still constitute some evidence that the efficiency issues that have long plagued RNN-like architectures might be solvable.
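For intuition, here is a minimal PyTorch-style sketch of the truncated-backprop idea, with a module structure, width, and iteration counts that are illustrative assumptions rather than details taken from Geiping et al.: the recurrent unit runs for arbitrarily many iterations, but gradients flow only through the last k of them, so the backward pass stays cheap regardless of the total chain length.

```python
import torch
import torch.nn as nn

class RecurrentDepthBlock(nn.Module):
    """Sketch of truncated backprop through a recurrent reasoning block.

    Loosely inspired by the idea in Geiping et al. (2025); the module
    structure and sizes here are illustrative, not the paper's.
    """

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.core = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, x: torch.Tensor, n_iters: int = 64, k: int = 8) -> torch.Tensor:
        h = torch.zeros_like(x)
        # Run the first n_iters - k iterations without building a graph ...
        with torch.no_grad():
            for _ in range(max(n_iters - k, 0)):
                h = h + self.core(torch.cat([h, x], dim=-1))
        # ... then backpropagate only through the last k iterations, so the
        # cost of the backward pass is independent of the total chain length.
        for _ in range(min(k, n_iters)):
            h = h + self.core(torch.cat([h, x], dim=-1))
        return h

# Usage sketch: a long chain of recurrence with a short backprop window.
block = RecurrentDepthBlock()
x = torch.randn(4, 512)
loss = block(x, n_iters=64, k=8).pow(2).mean()
loss.backward()
```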
Though human inner monologue may be considered as evidence that thinking in natural language is an efficient strategy for solving a wide range of problems, there are three reasons to think that inner monologue might not be that load-bearing.
First, there are people who claim to be able to switch their inner monologue on and off and to perform most of their productive work without thinking in words. Consider the following quote by Albert Einstein:[2]
The words or the language, as they are written or spoken, do not seem to play any role in my mechanism of thought. The psychical entities which seem to serve as elements in thought are certain signs and more or less clear images which can be "voluntarily" reproduced and combined. There is, of course, a certain connection between those elements and relevant logical concepts. It is also clear that the desire to arrive finally at logically connected concepts is the emotional basis of this rather vague play with the above-mentioned elements. But taken from a psychological viewpoint, this combinatory play seems to be the essential feature in productive thought—before there is any connection with logical construction in words or other kinds of signs which can be communicated to others.
Second, there appear to be people who possess no inner monologue at all: see this X thread by Katja Grace and this reddit thread for example discussions. Finally, there is some evidence that impairments in inner speech do not necessarily influence general cognitive abilities: Langland-Hassan et al. (2015) find a lack of correlation between inner speech impairment and executive function or attention.[3]
Human analogies should not be taken too seriously when forecasting future developments in LLMs, but they still provide weak suggestive evidence. The fact that some humans seem to think wordlessly for long periods of time suggests to me that language isn’t the main medium of human thought: it would be more surprising to find two groups of humans using completely different mental mechanisms to achieve remarkably similar intelligent feats than to find that verbal thought isn’t particularly load-bearing for either group, emerging as a byproduct of load-bearing thought in one group but not the other.
Intuitively, one would expect the inductive biases of SGD to favor directly learning good sequential reasoning algorithms in the weights over optimizing the weights to output tokens that function as serial reasoning steps. There is much more room for finding the optimal way to combine a large number of complex reasoning steps in weight-space than in token-space, assuming sufficient expressive capacity for representing all of these steps in a single forward pass.
In the limit, neuralese is strictly more powerful than CoT reasoning. A model that can perform adaptive hidden computation can do everything a transformer model can do with a CoT, but additionally use long chains of recurrence whenever doing so is more efficient.
As James Chua has argued, natural language seems to be a fairly strong local optimum for models to think in. Both internet text data and LLM-generated synthetic data are written exclusively in natural language. In all of these corpuses, sequential reasoning steps are verbalized in natural language, providing a stronger training signal for models that also verbalize intermediate reasoning steps. Labs will most likely want to keep leveraging all of this existing data for pre-training. There is no continuous analog for this corpus that could be used to pre-train recurrent models to already possess efficient continuous representations at the beginning of the post-training stage.
Although recurrent models have no such disadvantage in post-training, it might be costly to retrain models that reason in natural language to utilize continuous representations for intermediate reasoning steps, as Hao et al. (2024) attempted. Even if models that think in natural language are slightly less efficient at inference, the cost of retraining them might outweigh the difference in inference efficiency. Though linguistic drift from natural language toward alien dialects due to RL optimization pressures would remain a concern in this scenario, such drift lies outside of our definition of neuralese.
A final reason to think that natural language reasoning might be a strong local optimum is that any AI will have to produce natural language outputs—essays, emails, thousands of lines of code—in order to be useful. If a model doesn’t natively think in concepts that are close to natural language, it will incur a cost in translating its internal abstractions into human language, while if it does think in concepts close to natural language, the gain from using these concepts in latent thoughts instead of in an explicit CoT is likely going to be less dramatic. Whether SGD will eventually converge at concepts far from human ones depends on how optimized human abstractions are for intelligent thought.
One may counter the previous point with the argument that recurrent models can simply be pre-trained on all the text data that transformers are currently trained on, thus having all of the capabilities of transformer LLMs but also being able to omit intermediate reasoning steps when doing so is more efficient. However, such training would incur significant efficiency costs, as there is a trade-off between highly parallelizable training and highly expressive forward passes. The efficiency of transformer pre-training stems from the possibility of processing all tokens in a single sequence in parallel, which is impossible when computations at a subsequent token position involve a nonlinear transformation of the hidden state at the previous token position.
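A toy sketch of this trade-off, with arbitrary dimensions and deliberately unrealistic modules: causal self-attention can process every position of a training sequence in one batched call because each position only needs earlier inputs, while a recurrent cell needs the previous position's hidden state and therefore forces a sequential loop.

```python
import torch
import torch.nn as nn

# Toy contrast between parallelizable and inherently sequential training
# (dimensions are arbitrary; neither module is a realistic language model).
d, seq_len, batch = 64, 128, 8
x = torch.randn(batch, seq_len, d)

# Transformer-style step: all positions are processed in one batched call,
# since position t only needs the *inputs* at positions <= t (enforced via
# a causal mask), never a previously computed hidden state.
attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
parallel_out, _ = attn(x, x, x, attn_mask=causal_mask)

# Recurrent-style step: position t needs a nonlinear function of the hidden
# state at position t - 1, so the loop over the sequence cannot be avoided.
cell = nn.GRUCell(d, d)
h = torch.zeros(batch, d)
states = []
for t in range(seq_len):
    h = cell(x[:, t], h)  # depends on the hidden state from the previous step
    states.append(h)
sequential_out = torch.stack(states, dim=1)
```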
So far, highly parallelizable training has been a more important factor than highly expressive forward passes. There are ways of circumventing the trade-off to some extent: e.g., it’s possible to use truncated backprop or to only train on short chains of recurrence during pre-training and hope that this generalizes to long chains of recurrence after some fine-tuning. However, as mentioned above, neither of these approaches has worked at scale so far.
All other things being equal, both labs and customers care about the ability to read externalized reasoning chains. For both labs and users, a visible reasoning trace is useful for debugging the model: e.g., it helps with optimizing the prompt and diagnosing why the model isn’t following instructions in the intended way. For labs, visible CoTs are useful for building more robust misuse detectors and refusal mechanisms: e.g., Anthropic’s constitutional classifiers rely on the ability to monitor the model’s output stream, while OpenAI’s deliberative alignment relies on the ability to train models on distilled reasoning traces.
Despite some humans’ apparent ability to think without an inner monologue, most people can’t help but think in words and also often find it useful to write down their thoughts on a page or paper. If someone was forced to play a social deception game with opponents capable of reading their inner monologue, they would probably lose much more often than otherwise. This reliance on language might just be an accidental byproduct of load-bearing thought resulting from the general importance of language in social communication, as suggested in the previous section, but it might also have deeper functions—for example, one might argue that discrete linguistic units are more reliable and amenable to error correction than continuous thoughts. If an RNN had to learn such bottlenecks in order to perform well, there would be less to be gained from switching transformers for neuralese models.
The fact that a transition to neuralese hasn’t happened so far constitutes evidence in itself that getting neuralese architectures to work is difficult. The ML community has known for 30 years that recurrence and depth often amplify gradient variance (Bengio et al., 1994), suggesting that if easy solutions to RNNs’ stability and efficiency issues were available, they would have been discovered by now. Additionally, the failures of approaches that introduce recurrence through minimal tweaks to the transformer architecture indicate that if a transition to neuralese models is possible, this will require an architecture that differs drastically from current LLMs. Such an architecture is likely to initially suffer from inefficiencies that have already been solved for transformer LLMs: as one example, the optimal agentic coding scaffold for a neuralese model might be very different from the ones used for transformer LLMs. In this case, a large performance delta that outweighs the inefficiencies may be required for labs to turn their attention to the new architecture.
One of the bottlenecks to AI scaling discussed in Epoch’s Can AI scaling continue through 2030? is the latency wall, and that bottleneck has implications for the neuralese debate. The minimum time for a model to process a single data point increases with the depth of the model, which sets an upper bound on the training FLOPs within a given timeframe. As labs want to always make use of the latest hardware and algorithmic improvements, the duration of a training run rarely exceeds six months. Though more expressive forward passes don’t influence the amount of FLOPs that can be performed within six months, they reduce the number of data points that can be processed within this time frame, as the time spent on a training run is a product of the time required to process a single batch and the number of training batches. Though batch size can be increased to sidestep this issue, there is a critical batch size above which further increases result in dramatically diminishing returns. These constraints effectively create another trade-off—one between expressive forward passes and dataset sizes, or alternatively one between expressive forward passes and training run lengths—and though it’s possible that expressive forward passes will win out by being more data-efficient, it isn’t obvious that they will.
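As a toy illustration with entirely invented numbers, the tokens a fixed-length training run can consume scale as the run time divided by the per-step time, times the batch size (capped at the critical batch size), so a more serial forward pass directly shrinks the dataset a six-month run can see:

```python
# Toy latency-wall arithmetic (all numbers are invented for illustration).
SECONDS_PER_RUN = 0.5 * 365 * 24 * 3600  # a ~six-month training run

def tokens_processed(step_time_s: float, batch_tokens: float) -> float:
    """Tokens a run can consume when each optimizer step takes step_time_s
    seconds and the batch is capped at the critical batch size (in tokens)."""
    steps = SECONDS_PER_RUN / step_time_s
    return steps * batch_tokens

critical_batch = 16e6  # assumed critical batch size, in tokens

baseline = tokens_processed(step_time_s=2.0, batch_tokens=critical_batch)
# A forward pass with ~4x the serial depth takes roughly 4x as long per step,
# and the batch cannot be pushed past the critical batch size to compensate.
deeper = tokens_processed(step_time_s=8.0, batch_tokens=critical_batch)

print(f"baseline run:        ~{baseline:.2e} tokens")
print(f"4x-serial-depth run: ~{deeper:.2e} tokens")
```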
It seems unlikely that labs would be willing to pay the safety tax associated with keeping visible reasoning chains around for long if neuralese models turn out to be much more capable. Nevertheless, the Korbak et al. (2025) joint statement by labs offers at least some hope that monitorability considerations will be taken into account in frontier model development for the foreseeable future.
Even after laying out all of these arguments, I’m not much closer to having a confident opinion on whether I expect neuralese architectures to win out eventually or not. Compared to the two most relevant prediction markets, Fabien Roger’s Will early transformative AIs primarily use text? (currently at 47%) and Leo Gao’s Will Transformer based architectures still be SOTA for language modelling by 2026? (currently at 91%), I’m slightly more optimistic about sticking to transformers than both. I hope that this post will spur further discussion on the topic and help further clarify the considerations I’ve described. All suggestions for arguments that I’ve overlooked are also very welcome.
Work produced at Aether. Thanks to Rohan Subramani and Shubhorup Biswas for feedback on a draft of this post.
For example, Qwen3-235B-A22B, which to my knowledge is the deepest open-source transformer released this year, has 94 layers, with each layer involving five serial operations—three in the attention and two in the MLP block. Thus, the total number of serial operations in a Qwen3-235B forward pass is 470. It seems plausible that some proprietary models have slightly deeper forward passes, but likely not much deeper.
See Appendix II of Jacques Hadamard’s An Essay on the Psychology of Invention in the Mathematical Field for the source. The book extensively discusses the role of non-verbal thinking in mathematical research; Henrik Karlsson’s recent post When is it better to think without words? provides a summary.
Though note that this study has at least two significant limitations: first, impairments in inner monologue are assessed only through a test for silent rhyming ability, and second, the sample size is only 11 patients.