Over the past year, I have talked to several people about whether they expect frontier AI companies to transition away from the current paradigm of transformer LLMs toward models that reason in neuralese within the next few years. This post summarizes 13 common arguments I’ve heard, six in favor and seven against a transition to neuralese AIs. The following table provides a summary:
| Arguments for a transition to neuralese | Arguments against a transition to neuralese |
| --- | --- |
| A lot of information gets lost in text bottlenecks. | Natural language reasoning might be a strong local optimum that takes a lot of training effort to escape. |
| The relative importance of post-training compared to pre-training is increasing. | Recurrent LLMs suffer from a parallelism trade-off that makes their training less efficient. |
| There’s an active subfield researching recurrent LLMs. | There’s significant business value in being able to read a model’s CoTs. |
| Human analogy: natural language might not play that big of a role in human thinking. | Human analogy: even if natural language isn’t humans’ primary medium of thought, we still rely on it a lot. |
| SGD inductive biases might favor directly learning good sequential reasoning algorithms in the weight space. | Though significant effort has been spent on getting neuralese models to work, we still have none that work at scale. |
| In the limit of capabilities, neuralese is strictly more powerful than CoT reasoning. | There’s a depth-latency trade-off that might make it less appealing to train large recurrent LLMs. |
| | There’s a lot of safety value in being able to read the model’s CoTs. |
Following Hidden Reasoning in LLMs: A Taxonomy, I’ll define neuralese as reasoning in long opaque chains of recurrence. For a precise operationalization of this, I like Fabien Roger’s definition and his corresponding prediction market. He defines the current, CoT-based reasoning paradigm as the conjunction of the following two properties:
The ability to perform 100k heavy serial operations without text bottlenecks already seems like a strong deviation from the current paradigm, where reasoning models perform up to 500 serial operations in a forward pass.[1] Thus, in this post, I’ll constrain the number of allowed serial operations to 10k instead.
The transformer architecture puts a heavy limitation on the amount of information from the residual stream that can reach the next forward pass. The residual stream has to first be compressed into a distribution over tokens, and further information is lost when sampling a single token from that distribution. AI 2027 quantifies the total information loss as follows: “Suppose that an LLM has a vocab size of ~100,000, then each token contains log2(100,000) ≈ 16.6 bits of information, around the size of a single floating point number (assuming training in FP16). Meanwhile, residual streams [...] contain thousands of floating point numbers.”
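To put rough numbers on this, here is a minimal back-of-the-envelope script. The vocabulary size and FP16 precision come from the quote above, while the residual stream width of 8,192 is a placeholder I’ve picked for illustration (the exact width varies by model):

```python
import math

VOCAB_SIZE = 100_000       # vocabulary size assumed in the AI 2027 quote
D_MODEL = 8_192            # hypothetical residual stream width; varies by model
BITS_PER_FLOAT = 16        # FP16 training, as in the quote

# Information conveyed by sampling one token from the vocabulary (upper bound).
bits_per_token = math.log2(VOCAB_SIZE)

# Raw size of the residual stream at a single token position.
bits_per_residual_stream = D_MODEL * BITS_PER_FLOAT

print(f"bits per token:           {bits_per_token:.1f}")        # ~16.6
print(f"bits per residual stream: {bits_per_residual_stream}")  # 131,072
print(f"ratio:                    {bits_per_residual_stream / bits_per_token:.0f}x")
```

Under these assumptions, a single token transmits on the order of 17 bits while the residual stream at one position holds over 100,000 bits, a ratio of roughly 8,000x.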
One might counter that the discrete token space simply carves the continuous vector space into bins that we care about, rendering the information that is thrown away superfluous. Furthermore, hard bottlenecks might encourage specialization: not having to deal with all of the information accumulated at the end of the residual stream at the previous token position allows early layers to specialize in syntactic processing and feature extraction. Nevertheless, paraphrasing Eric Drexler’s report on quasilinguistic neural representations, it would be strange to find that the optimal language for machines capable of utilizing expressive vector embeddings consists of tokens that represent mouth noises, especially given the magnitude of the information loss.
In training runs where most of the compute is allocated to pre-training, neuralese architectures are heavily disadvantaged, as I’ll explain in depth below. Over the past year, the amount of compute companies spend on post-training has increased by orders of magnitude: until recently, labs spent ~100x more on pre-training than on post-training, while the Grok 4 announcement seemed to imply that Grok 4 used roughly as much post-training compute as pre-training compute. Recurrent models don’t face similar disadvantages in post-training, where training signals are usually provided at the sequence level rather than the token level. Thus, if post-training compute continues to grow, spending additional resources on pre-training recurrent LLMs, or on converting a transformer into a recurrent model after pre-training, might become more palatable.
A double-digit number of papers has been published over the past year exploring ways to either convert transformers into a recurrent architecture or to introduce entirely new recurrent architectures. I have reviewed Geiping et al. (2025) and Hao et al. (2024) in the past, but there are others; see section 2 of Zhu et al. (2025) for a very recent literature review. These papers offer solutions to some of the age-old issues with neuralese architectures. Geiping et al. partially evade stability and efficiency issues with long chains of backpropagation by using truncated backprop, which backpropagates through only the last k iterations of the recurrent unit, regardless of the total number of recurrent iterations. Hao et al. circumvent the problem of retaining highly parallelizable pre-training with a recurrent architecture by converting a conventional transformer into a recurrent model through fine-tuning. The benchmark scores in both papers are unimpressive, and consequently neither approach has been scaled up to large models, but they still constitute some evidence that the efficiency issues that have long plagued RNN-like architectures might be solvable.
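To make the truncated-backprop idea concrete, here is a minimal PyTorch-style sketch, not Geiping et al.’s actual implementation: the recurrent block, hidden size, iteration count, and k below are placeholders I’ve chosen for illustration.

```python
import torch
import torch.nn as nn

def recurrent_forward(recurrent_block: nn.Module,
                      state: torch.Tensor,
                      num_iterations: int,
                      k: int) -> torch.Tensor:
    """Run `num_iterations` recurrent iterations, but keep the computation
    graph only for the last `k`, so backprop never traverses the full chain
    of recurrence."""
    for i in range(num_iterations):
        if i == num_iterations - k:
            # Cut the graph here: earlier iterations still determine the value
            # of `state`, but no gradients flow through them.
            state = state.detach()
        state = recurrent_block(state)
    return state

# Toy usage: a single linear layer plus nonlinearity as the recurrent unit.
block = nn.Sequential(nn.Linear(64, 64), nn.Tanh())
h0 = torch.randn(8, 64)
h_final = recurrent_forward(block, h0, num_iterations=32, k=4)
loss = h_final.pow(2).mean()
loss.backward()  # gradients flow through only the last 4 iterations
```

The point is simply that the memory and stability costs of backpropagation scale with k rather than with the total number of recurrent iterations.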
Though human inner monologue may be taken as evidence that thinking in natural language is an efficient strategy for solving a wide range of problems, there are three reasons to think that inner monologue might not be that load-bearing.
First, there are people who claim to be able to switch their inner monologue on and off and to perform most of their productive work without thinking in words. Consider the following quote by Albert Einstein:[2]
The words or the language, as they are written or spoken, do not seem to play any role in my mechanism of thought. The psychical entities which seem to serve as elements in thought are certain signs and more or less clear images which can be "voluntarily" reproduced and combined. There is, of course, a certain connection between those elements and relevant logical concepts. It is also clear that the desire to arrive finally at logically connected concepts is the emotional basis of this rather vague play with the above-mentioned elements. But taken from a psychological viewpoint, this combinatory play seems to be the essential feature in productive thought—before there is any connection with logical construction in words or other kinds of signs which can be communicated to others.
Second, there appear to be people who possess no inner monologue at all: see this X thread by Katja Grace and this Reddit thread for example discussions. Finally, there is some evidence that impairments in inner speech do not necessarily affect general cognitive abilities: Langland-Hassan et al. (2015) find a lack of correlation between inner speech impairment and executive function or attention.[3]
Human analogies should not be taken too seriously when forecasting future developments in LLMs, but they still provide weak suggestive evidence. The fact that some humans seem to think wordlessly for long periods of time suggests to me that language isn’t the main medium of human thought: it would be more surprising to find two groups of humans using completely different mental mechanisms to achieve remarkably similar intelligent feats than to find that verbal thought isn’t particularly load-bearing for either group, and simply emerges as a byproduct of load-bearing thought in one group but not the other.
Intuitively, one would expect the inductive biases of SGD to favor directly learning good sequential reasoning algorithms in the weights over optimizing the weights to output tokens that function as serial reasoning steps. There is much more room for finding the optimal way to combine a large number of complex reasoning steps in weight-space than in token-space, assuming sufficient expressive capacity for representing all of these steps in a single forward pass.
In the limit, neuralese is strictly more powerful than CoT reasoning. A model that can perform adaptive hidden computation can do everything a transformer model can do with a CoT, but additionally use long chains of recurrence whenever doing so is more efficient.
As James Chua has argued, natural language seems to be a fairly strong local optimum for models to think in. Both internet text data and LLM-generated synthetic data are written exclusively in natural language. In all of these corpora, sequential reasoning steps are verbalized in natural language, providing a stronger training signal for models that also verbalize intermediate reasoning steps. Labs will most likely want to keep leveraging all of this existing data for pre-training. There is no continuous analog of this corpus that could be used to pre-train recurrent models to already possess efficient continuous representations at the beginning of the post-training stage.
Although recurrent models have no such disadvantage in post-training, it might be costly to retrain models that reason in natural language to utilize continuous representations for intermediate reasoning steps, as Hao et al. (2024) attempted. Even if models that think in natural language are slightly less efficient at inference, the cost of retraining them might outweigh the difference in inference efficiency. Though linguistic drift from natural language toward alien dialects due to RL optimization pressures would remain a concern in this scenario, such drift lies outside of our definition of neuralese.
A final reason to think that natural language reasoning might be a strong local optimum is that any AI will have to produce natural language outputs—essays, emails, thousands of lines of code—in order to be useful. If a model doesn’t natively think in concepts that are close to natural language, it will incur a cost in translating its internal abstractions into human language, while if it does think in concepts close to natural language, the gain from using these concepts in latent thoughts instead of in an explicit CoT is likely going to be less dramatic. Whether SGD will eventually converge on concepts far from human ones depends on how optimized human abstractions are for intelligent thought.
One may counter the previous point with the argument that recurrent models can simply be pre-trained on all the text data that transformers are currently trained on, thus having all of the capabilities of transformer LLMs but also being able to omit intermediate reasoning steps when doing so is more efficient. However, such training would incur significant efficiency costs, as there is a trade-off between highly parallelizable training and highly expressive forward passes. The efficiency of transformer pre-training stems from the possibility of processing all tokens in a single sequence in parallel, which is impossible when computations at a subsequent token position involve a nonlinear transformation of the hidden state at the previous token position.
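As a minimal illustration of why this breaks parallelism, consider the following sketch. The layer, hidden size, and sequence length are toy placeholders, and the “transformer-style” branch stands in for the fact that during pre-training the inputs at all token positions are already known:

```python
import torch
import torch.nn as nn

SEQ_LEN, D = 128, 64
x = torch.randn(SEQ_LEN, D)          # toy input embeddings for one sequence
layer = nn.Sequential(nn.Linear(D, D), nn.GELU())

# Transformer-style training: every position is processed in parallel,
# since all inputs along the sequence are known in advance.
parallel_out = layer(x)              # one batched call over all 128 positions

# Recurrent latent reasoning: the hidden state at position t is a nonlinear
# function of the hidden state at position t-1, so positions must be
# processed one after another.
h = torch.zeros(D)
recurrent_out = []
for t in range(SEQ_LEN):
    h = layer(x[t] + h)              # 128 strictly serial steps
    recurrent_out.append(h)
recurrent_out = torch.stack(recurrent_out)
```

With expressive recurrence, the loop over positions cannot be batched away, and that serial dependency is exactly what makes pre-training on long sequences expensive.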
So far, highly parallelizable training has been a more important factor than highly expressive forward passes. There are ways of circumventing the trade-off to some extent: e.g., it’s possible to use truncated backprop or to only train on short chains of recurrence during pre-training and hope that this generalizes to long chains of recurrence after some fine-tuning. However, as mentioned above, neither of these approaches has worked at scale so far.
All other things being equal, both labs and customers care about the ability to read externalized reasoning chains. For both labs and users, a visible reasoning trace is useful for debugging the model: e.g., it helps with optimizing the prompt and diagnosing why the model isn’t following instructions in the intended way. For labs, visible CoTs are useful for building more robust misuse detectors and refusal mechanisms: e.g., Anthropic’s constitutional classifiers rely on the ability to monitor the model’s output stream, while OpenAI’s deliberative alignment relies on the ability to train models on distilled reasoning traces.
Despite some humans’ apparent ability to think without an inner monologue, most people can’t help but think in words and also often find it useful to write their thoughts down on paper. If someone were forced to play a social deception game against opponents capable of reading their inner monologue, they would probably lose much more often than otherwise. This reliance on language might just be an accidental byproduct of load-bearing thought resulting from the general importance of language in social communication, as suggested in the previous section, but it might also have deeper functions: for example, one might argue that discrete linguistic units are more reliable and amenable to error correction than continuous thoughts. If an RNN had to learn such bottlenecks in order to perform well, there would be less to be gained from switching from transformers to neuralese models.
The fact that a transition to neuralese hasn’t happened so far constitutes evidence in itself that getting neuralese architectures to work is difficult. The ML community has known for 30 years that recurrence and depth often amplify gradient variance (Bengio et al., 1994), suggesting that if easy solutions to RNNs’ stability and efficiency issues were available, they would have been discovered by now. Additionally, the failures of approaches that introduce recurrence through minimal tweaks to the transformer architecture indicate that if a transition to neuralese models is possible, this will require an architecture that differs drastically from current LLMs. Such an architecture is likely to initially suffer from inefficiencies that have already been solved for transformer LLMs: as one example, the optimal agentic coding scaffold for a neuralese model might be very different from the ones used for transformer LLMs. In this case, a large performance delta that outweighs the inefficiencies may be required for labs to turn their attention to the new architecture.
One of the bottlenecks to AI scaling discussed in Epoch’s Can AI scaling continue through 2030? is the latency wall, and that bottleneck has implications for the neuralese debate. The minimum time for a model to process a single data point increases with the depth of the model, which sets an upper bound on the training FLOPs within a given timeframe. As labs always want to make use of the latest hardware and algorithmic improvements, the duration of a training run rarely exceeds six months. Though more expressive forward passes don’t influence the number of FLOPs that can be performed within six months, they reduce the number of data points that can be processed within this time frame, as the time spent on a training run is the product of the time required to process a single batch and the number of training batches. Though batch size can be increased to sidestep this issue, there is a critical batch size above which further increases yield dramatically diminishing returns. These constraints effectively create another trade-off, between expressive forward passes and dataset sizes (or, alternatively, between expressive forward passes and training run lengths), and though it’s possible that expressive forward passes will win out by being more data-efficient, it isn’t obvious that they will.
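The sketch below walks through a toy version of this trade-off. All of the numbers except the run length are placeholders I’ve invented purely to show the structure of the argument, not realistic estimates:

```python
SECONDS_PER_RUN = 6 * 30 * 24 * 3600     # ~six-month training run
SERIAL_OP_LATENCY = 2e-4                 # assumed seconds per serial operation per batch
CRITICAL_BATCH_SIZE = 4_000_000          # assumed tokens per batch, beyond which
                                         # returns diminish sharply

def tokens_processed(serial_ops_per_forward_pass: int) -> float:
    """Tokens that fit in one run: (run length / time per batch) * batch size."""
    seconds_per_batch = serial_ops_per_forward_pass * SERIAL_OP_LATENCY
    num_batches = SECONDS_PER_RUN / seconds_per_batch
    return num_batches * CRITICAL_BATCH_SIZE

# A ~500-serial-op transformer vs. a hypothetical 10k-serial-op recurrent model:
for depth in (500, 10_000):
    print(f"{depth:>6} serial ops -> {tokens_processed(depth):.2e} tokens per run")
```

Holding the run length and batch size fixed, a 20x deeper forward pass processes 20x fewer batches, and hence 20x less data, which is the trade-off between expressive forward passes and dataset sizes described above.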
It seems unlikely that labs would be willing to pay the safety tax associated with keeping visible reasoning chains around for long if neuralese models turn out to be much more capable. Nevertheless, the Korbak et al. (2025) joint statement by labs offers at least some hope that monitorability considerations will be taken into account in frontier model development for the foreseeable future.
Even after laying out all of these arguments, I’m not much closer to having a confident opinion on whether I expect neuralese architectures to eventually win out. Compared to the two most relevant prediction markets, Fabien Roger’s Will early transformative AIs primarily use text? (currently at 47%) and Leo Gao’s Will Transformer based architectures still be SOTA for language modelling by 2026? (currently at 91%), I’m slightly more optimistic than both about sticking to transformers. I hope that this post will spur further discussion on the topic and help clarify the considerations I’ve described. Suggestions for arguments that I’ve overlooked are also very welcome.
Work produced at Aether. Thanks to Rohan Subramani and Shubhorup Biswas for feedback on a draft of this post.
For example, Qwen3-235B-A22B, which to my knowledge is the deepest open-source transformer released this year, has 94 layers, with each layer involving five serial operations—three in the attention and two in the MLP block. Thus, the total number of serial operations in a Qwen3-235B forward pass is 470. It seems plausible that some proprietary models have slightly deeper forward passes, but likely not much deeper.
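Spelled out as a minimal script (using the per-layer accounting from this footnote):

```python
LAYERS = 94                  # Qwen3-235B-A22B
SERIAL_OPS_PER_LAYER = 5     # three in the attention block, two in the MLP block
ops_per_forward_pass = LAYERS * SERIAL_OPS_PER_LAYER
print(ops_per_forward_pass)  # 470, well below the 10k threshold used in this post
```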
See Appendix II of Jacques Hadamard’s An Essay on the Psychology of Invention in the Mathematical Field for the source. The book extensively discusses the role of non-verbal thinking in mathematical research; Henrik Karlsson’s recent post When is it better to think without words? provides a summary.
Though note that this study has at least two significant limitations: first, impairments in inner monologue are assessed only through a test for silent rhyming ability, and second, the sample size is only 11 patients.