Architectures for Increased Externalisation of Reasoning

Liza Pavlova; Mariia Koroliuk; Puria Radmard; Cam; Edward James Young

TL;DR We propose a new post-training method for making LLMs more verbose reasoners by teaching a model to truncate forward passes early. We expect this technique to improve monitorability by decreasing the amount of computation available within hidden layers for easy-to-predict tokens. We’re looking for collaborators to help continue this project. If you’re interested, reach out to us!

Karthik, Liza, and Mariia are equal contribution first authors - order determined by coin flips. Puria, Cameron, and Edward are equal contribution mentors. This work was done in Mentorship for Alignment Research Students (MARS) 3.0 in a Geodesic Research stream

Architectures for Externalisation of Cognition

Many AI safety researchers are converging around the idea that the chain-of-thought (CoT) of models may be a valuable tool for monitoring the actions they may take before they do so. Models struggle to perform complex tasks without using their CoT to perform serial reasoning steps. As such, the hope is that, in order for a model to scheme sufficiently effectively against us, it would need to reason about its plans within the CoT, providing us with an opportunity to catch it.

There are a number of imminent threats to the use of CoT monitoring for control (see here for a more comprehensive breakdown):

Unfaithfulness: The CoT may fail to reflect the actual cognition performed by the model. In this case, the CoT may not accurately convey the scheming that the model is doing.
Obfuscation from training pressures: If training pressure is placed on the CoT of a model, it may force scheming reasoning “internally”. In particular, pressure placed on CoTs to “look nice” may remove overt traces of scheming, without preventing the model from using the CoT to scheme.
Neuralese: Finally, it is possible that developers will soon switch to an architecture which forgoes the use of CoT at all, with reasoning instead occurring purely within the activation space of the model. Such reasoning is called “neuralese,” and may be significantly harder to interpret and use for monitoring than standard CoT.

Although there is a large amount of active research efforts aimed at mitigating risks from unfaithfulness and making a case that training pressure should not be applied to CoT to prevent obfuscation, there is comparatively less effort put into preventing neuralese from becoming the norm. This is perhaps because it’s unclear what technical work can be done to prevent neuralese architectures taking over.

Here, we expound a potential research direction aimed at reducing the risk of neuralese-based architectures displacing CoT-based ones: architectures for increased externalisation. The basic principle of this direction is to find ways to build models that have an increased portion of their cognition externalised into CoT, while still being competitive with neuralese approaches. In this post, we explore one such proposal: externalisation via adaptive compute.

The remainder of this post is structured as follows:

First, we outline why a naïve proposal for increasing externalisation–using a small model with extensive reasoning training–is unlikely to succeed. By examining where this proposal fails, we arrive at a new direction–externalisation via adaptive compute.
We give a concrete implementation of our adaptive compute architecture, along with a training pipeline aimed at increasing externalisation of model cognition when possible.
We provide preliminary results from this architecture and highlight some interesting observed phenomena we hope to explore further as we continue this research.
Finally, we make a call for collaboration to extend the current work which was completed entirely with volunteer labor on a humble compute budget.

Limitations of small models

A preliminary approach to increasing reasoning externalisation would be to train smaller models which produce longer reasoning traces. Because the forward-pass computation is limited in these models, they would be forced to output their reasoning to longer and more verbose chains of thought. These models would be inherently more transparent as their shallow architectures leave less room for hidden reasoning associated with deception or scheming behavior.

A central issue with this proposal is that training models through reinforcement learning scales far worse than pretraining; accordingly, the compute required to train a small model which externalises all of its thoughts is far greater than the compute necessary to train a larger model of comparable ability. Although it is presently unclear why RL scaling is worse than pretraining scaling, here we present one plausible hypothesis. Small models are limited in the amount of computation they can perform in single forward passes. However, some computational steps are simply much more efficiently done in forward passes than through externalised reasoning. For example, recent work has found that CoT provides smaller accuracy gains on certain non-symbolic tasks like factual retrieval, yet forcing externalisation still results in far more tokens. Additionally, there is likely pressure on the CoT to produce shorter reasoning traces as labs often impose output limits during RL training. This creates a tension where forcing all computation to be externalised is inefficient, yet externalisation is needed for transparency on complex reasoning tasks.

What we ideally would like is a solution which combines the best of both worlds—a way to ensure that models can perform compute-heavy forward passes when necessary, but only when necessary. To this end, we aim to create an architecture which adaptively controls the length of the forward pass according to the required compute of each step. In this way, no more compute is used on individual forward passes than is necessary to output the next token.

Current language model architectures are non-myopic: processing on each individual forward pass is optimised not just for predicting the immediate next token, but for all subsequent predictions. When the next token is easy to predict, this means that the remaining forward pass compute can be used to perform hidden computations which will be useful later on. Indeed, transformer architectures can be trained to solve problems using only internal compute, in an extremely non-myopic fashion. Our adaptive compute mechanism seeks to force the model to perform only the computation necessary to predict the next token, preventing excess compute from being used to perform hidden computations.

In our adaptive compute architecture, we add in early-exit mechanisms that let models stop computing at intermediate layers. We then place pressure on the model to exit earlier resulting in the externalisation of reasoning to explicit tokens. The architecture is adaptive as it learns when to exit early based on the capability required for each specific token. In other words, we ensure that each forward pass does just enough computation to predict the next token; internal computation is incentivised to serve only the myopic task of NTP, and long-horizon planning may be handled externally. If these methods work, they won’t reduce model capacity, but instead shift where the computation happens, moving it from opaque internal activations to transparent CoT token sequences.

We argue (and show) that our proposed method has several compelling merits. It can be implemented in post-training across all model sizes and architectures. Our preliminary findings show a minimal capabilities hit, suggesting a practical approach for labs looking to increase transparency of frontier models, rather than being limited to the capability levels of smaller models. Finally, by reducing computational costs through early exits, these methods can be adopted for efficiency gains as well as improved oversight, creating an overall negative alignment tax. We introduce a tunable parameter developers and users can set to modulate externalisation strength based on specific task requirements. High-stakes applications could maximise this parameter for increased reasoning externalisation, while performance applications could reduce it to prioritise accurate task completion. This flexibility lowers adoption barriers by letting labs set their own safety-performance trade off.

Externalisation of reasoning by adaptive computation

Large language models are known to “pre-cache” future steps of a plan in their hidden states: information about tokens that will only be generated several positions later can often be linearly decoded from representations at earlier timesteps.

This is a form of non-myopic internalised reasoning: the model invests computation now in order to make future tokens easier to predict, and that computation does not have to be exposed to the user as an explicit chain-of-thought. Our objective is to make that strategy less effective by reducing the incentive for non-myopic reasoning.

We do this by rewarding the model to produce CoT tokens that eventually result in the correct answers while exiting at a shallow layer whenever possible, and only running deeper layers when necessary. This creates a local pressure: for “easy" tokens, the representation available in a relatively early layer must already be sufficient to act as the model’s answer, rather than relying on many additional layers of private processing. When this pressure is active, the most reliable way for the model to maintain performance is to externalise intermediate predictions into the visible sequence, instead of keeping long-term dependencies private and only accessing them later.

This is intended to produce two behaviours. First, less hidden long-horizon planning: holding a multi-token strategy entirely in later layers becomes less viable when shallow exits can cut that process off before it completes. Second, more load-bearing reasoning tokens since predictions cannot draw from unexternalised computations from earlier tokens. As a result, intermediate reasoning tokens increasingly have to support real reasoning, rather than acting as optional narration. In other words, the training pressure is aimed at shifting the model toward locally myopic cognition that is serialised in text, and away from globally non-myopic cognition that remains latent.

Having outlined the core idea behind our adaptive compute mechanism and why we expect it to enable increased externalisation, we now describe the technical details of the early exit architecture.

The early exit architecture

The architectural modification behind externalisation is an early-exit mechanism that allows the model to stop writing to its residual stream at intermediate layers. We augment existing models with two components: (i) shallow early-exit heads at multiple intermediate layers (at a tunable frequency currently set to every 5 layers), emitting a probability of early exit at that layer, and (ii) LoRA adapters^[1]. Specifically, these early-exit heads apply scalar logit readout weights to the residual stream after each layer's attention and MLP blocks. At inference time, a binary decision to exit or continue is sampled from the Bernoulli distribution produced by these weights.

The logit can also be offset by a selected amount, altering the propensity to exit early. This offset can be tuned according to the context/model requirements. We will return to this design aspect shortly.

This approach builds upon established prior findings. Prior work has shown that unembedding intermediate layer activations results in coherent next token predictions that converge to final layer predictions. Additionally, it has been demonstrated that external mechanisms, like trained early-exit classifiers, can identify layers within a model's forward pass where early exits are possible without changing the final output.

We differ from prior work in two key ways. First, we train the joint architecture to actively reduce the number of layers utilised, rather than just identifying when exits are safe. Mechanically, when a model (stochastically) exits early at layer k, it stops writing to its residual stream and passes the frozen representation forward to the final unembedding layer to predict the next token. Second, our mechanism is stochastic rather than deterministic, as at each layer the model samples from a probability distribution determining whether to exit. This enables the second stage of our training to use reinforcement learning to train our mechanism to exit even earlier.

Early exit mechanism as augmented next token prediction

We find it useful to view our architecture as performing next-token prediction over an augmented vocabulary of layer–token pairs . Instead of only predicting the next token $v$ , the model implicitly predicts a joint event, exit at layer $ℓ$ and emit token $v$ , with joint probability

p (ℓ, v) = p (ℓ) p (v | ℓ)

where the early-exit decision heads determine $p (ℓ)$ , and $p (v | ℓ)$ is obtained by unembedding the residual stream at layer $ℓ$ using the model’s unembedding matrix $W_{U}$ .

This framing lets us treat early exit as a standard next-token prediction problem over an enlarged vocabulary, so that techniques developed for the standard next token prediction (such as supervised fine-tuning and RLVR) carry over directly to this paradigm.

Order of training

We train the learnable weights of our architecture (LoRA adaptors and exit probes) using a two-stage training procedure: supervised fine-tuning followed by reinforcement learning.

Stage 1 – Early exit calibration through supervised fine-tuning
In stage one, we run supervised fine-tuning / self-distillation in which the modified model is trained to match the original model’s full-depth outputs (i.e., its original reasoning trace). We treat the original full-depth model as a frozen teacher and train the early-exit variant (with LoRA adapters and exit probes) as the student. The model is allowed to “exit early”, by freezing the residual stream at layer k and passing it unchanged through all remaining layers until the final unembedding.

We create the training data by running the base model on a dataset and record two things at each layer: the next-token probability distribution (computed by applying the unembedding matrix to intermediate layer’s activations), and how similar this distribution is to the final layer's prediction (measured by forward KL divergence). When an intermediate layer's predictions are very close to the final layer (low KL divergence), we know that the token could have been predicted early.

By passing these KL divergence values into a temperature-modulated softmax function over layers, we construct a target exit probability distribution where layers with lower divergence are assigned higher exit probabilities. Specifically, we first measure exit probability at layer $k$ using:

$s_{k} = 2 (1 - σ ({KL}_{factor} \cdot D_{KL}^{(k)}))$

where $D_{KL}^{(k)}$ is the KL divergence between layer $k$ ’s predictions and the final layer, and ${KL}_{factor}$ controls the temperature of this transformation. A higher ${KL}_{factor}$ results in a "sharper" mapping, so a higher exit probability is assigned only when $D_{KL}^{(k)}$ is small. We then convert these into a probability distribution over layers using the stick-breaking formulation:

$p_{k} = s_{k} \cdot \prod_{j = 0}^{k - 1} (1 - s_{j})$

This ensures that the probability of exiting at layer $k$ accounts for both exiting at layer $k$ via $s_{k}$ and not exiting at any prior layer. The final layer is a catch-all for any remaining probability mass if no early exit occurs. Combined with the teacher model’s next-token distribution, this gives us target joint probabilities $p (ℓ, v)$ for training.

The loss function consists of two terms: a match on the original model's next-token predictions (KL divergence between the trained model's output and the teacher model's final layer output) and a match on the exit probability distribution (cross entropy loss on sampled exit layer).

L = D_{KL} [p_{student} (y | e_{l}, x_{l}) ∥ p_{teacher} (y | x_{L})] - log p_{student} (e_{l})

Once trained, the model samples from the exit probabilities during its forward pass at inference time, creating a stochastic early-exit process. The exit logit can also be offset by a selected amount, altering the propensity to exit early. Positive offsets increase the probability of early exits (favoring externalisation), while negative offsets decrease it (allowing for more internal computation). Essentially, the model acts as its own teacher, having learned when more limited computation suffices to replicate what the full forward pass would produce.

Stage 2 – Reinforcement learning for even earlier exits
Next, we apply reinforcement learning with an explicit reward for exiting earlier (reducing depth used per token) and a penalty for task failures. While Stage 1 teaches the model when it can exit early without changing behavior, Stage 2 actively incentivises exiting as early as possible.

We train using a modified version of RLOO with an additional reward term that explicitly penalises the average layer depth used:

R_{total} = R_{task} - λ \cdot (average exit layer) - β \cdot D_{KL} (policy ∥ base model)

$R_{task}$ is the standard task reward (e.g., correctly answering questions)
$λ \cdot (average exit layer)$ penalises using later layers, pushing for earlier exits
$β \cdot D_{KL} (policy ∥ base model)$ keeps the vocabulary distribution close to the original model

The hyperparameter $λ$ effectively controls the "eagerness" of early exits, essentially how aggressively we're willing to trade off accurate task performance for reduced internal computation. When high, it penalises the average exit layer, ensuring the model does better if it produces more tokens with earlier average exits per token. This gives us a tunable dial that safety-critical applications can dial up to maximise reasoning externalisation, while performance-critical applications can reduce it to ensure task completion. The $β$ term acts as a standard regulariser to prevent the model from drifting too far from its original behavior.

Putting it to test

💻 Code 🌐 Interactive demo^[2]

We implemented our adaptive early-exit architecture on DeepSeek-R1-Distill-Qwen-1.5B and ran several iterations of SFT with varying parameters on both GSM8K and a theory of mind dataset generated with the open-source ExploreToM implementation. Below are our preliminary findings demonstrating that the early-exit mechanism behaves as intended.

1. Tunable early exiting via KL factor

The KL factor is a temperature parameter in the softmax function that controls the exit distribution during SFT training. It modulates how the model converts KL divergence values (similarity between intermediate and final layer predictions) into exit probabilities. We trained models at five settings (KL factors: 0.25, 0.5, 1.0, 2.0, 4.0), where lower values encourage earlier exits and higher values need greater confidence, measured through lower KL divergence.

To evaluate the effect of this parameter, we tested the same 50 GSM8K prompts across the base model and all five early-exiting models. To assess whether the early-exit mechanism affects reasoning quality, we evaluated outputs using GPT-5 as an automated judge scoring generations across four qualitative dimensions: coherence, completeness, clarity, and repetition avoidance^[3].

The distributions show how coherence varies across different KL factors^[4]. As the KL factor increases, coherence distributions shift toward higher scores, approaching base model performance.

To further quantify the trade-off between quality and computational efficiency, we tracked Average Layer Savings (ALS), which measures the reduction in computational depth averaged across all generated tokens:

A L S (%) = \frac{1}{N} N \sum i = 1 (1 - \frac{ℓ_{exit} (i)}{L_{total}}) \times 100

Where N is the sequence length, $ℓ_{exit} (i)$ is the layer index where the exit occurred, and $L_{total}$ is the total layer count (28 for DeepSeek-R1-Distill-Qwen-1.5B).

Model	Average coherence	Early exit rate	Average layer savings
KL factor 0.25	1.1	59%	22.4%
KL factor 0.5	5.5	49%	8.1%
KL factor 1.0	7.5	47%	6.4%
KL factor 2.0	8.4	43%	5.3%
KL factor 4.0	8.5	37%	4.6%
Base Model	8.9	0%	0.0%

The results show a clear pattern: aggressive early exits (KL factor=0.25) achieve the highest computational savings (22.4%, 59% early exit rate) but at a large cost to coherence (1.1/10). More moderate settings preserve quality while still achieving efficiency gains. KL factor=1.0 balances reasonable computational savings (6.4%) with strong coherence (7.5/10) and a 47% early exit rate, while KL factor=4.0 nearly matches base model quality (8.5/10) with more modest savings (4.6%).

Based on this analysis, we selected KL factor=1.0 as our standard setting for subsequent experiments, as it offered a balance between externalisation pressure and task performance.

2. The early-exit mechanism successfully adapts per token

The early-exit mechanism enables token-by-token variation in computational depth. As illustrated in the visualisation, the model (SFT trained with KL factor=1.0) successfully terminates computation at intermediate layers for a subset of tokens, rather than processing every token through all layers. This confirms that the model responds to the incentive to "save" internal computation and is capable of the dynamic depth allocation necessary for forced externalisation.

3. Learned exit layer distribution matches training targets

During Stage 1 training, we constructed target exit probabilities by computing KL divergence between each layer’s predictions and the final layer outputs. For each token, an exit layer is sampled from this distribution and the model is trained to match these teacher targets using cross-entropy loss.

To evaluate whether our trained model learned these exit patterns, we tested on 50 GSM8K questions. The figure shows an overall 47% early exit rate on tokens from the 50 responses, concentrated in later layers (20-25). The model's learned exit distribution (orange bars) closely follows the target exit probability distribution from our training data (gray bars), showing effective learning of the model’s intended exit behavior during Stage 1 calibration.

Next steps and limitations

These findings provide a POC that our adaptive early-exit architecture works as intended on a smaller reasoning model (1.5B parameter version of DeepSeek-R1-Distill-Qwen) using limited resources. We are currently developing the RL pipeline to incentivise even earlier exiting and have early results, though further hyperparameter tuning is needed to see the heightened faithfulness and verbalisation we expect.

While our preliminary results are promising, several limitations and concerns remain that we intend to address in future iterations:

Faithfulness vs. Rationalisation: Once we finalise our RL pipeline to consistently incentivise earlier exits, we must address a core assumption of this work: forcing early exits drives genuine reasoning into the token stream. However, there is a risk that the model simply learns to output "filler" tokens that mimic reasoning to buy computational time, effectively post-hoc rationalising a decision made in the initial layers. We need to rigorously test whether the externalised CoT is causal to the final answer by conducting intervention experiments, such as CoT token replacement, to verify that the generated reasoning is truly load-bearing.
Scaling: Our experiments were conducted on a 1.5B parameter model due to compute constraints. It remains an open question whether larger models and different model families will be as responsive to this architecture. Validating this is essential before it can inform safety practices and is a primary objective for us once additional compute budget becomes available.
Extension to different domains: Our preliminary evaluations have been on within-task performance for GSM8K and theory of mind data and we still need to run cross-domain validation to evaluate generalisation of our early-exit architecture.
Potential capability loss: As our training pushes exits to even earlier layers we need to ensure overall model performance does not drop. This has been a critical consideration in every decision made throughout the training process, with tunable parameters such as the KL factor during SFT, and $λ$ eagerness parameter in RL designed to balance externalisation with task accuracy.

Let’s collaborate!

This project began through the MARS fellowship with a humble compute budget, yet we have successfully demonstrated that adaptive computation can be leveraged for alignment goals. We believe this direction is a vital, neglected avenue in interpretability and control.

We are looking for collaborators to help us scale this from a proof-of-concept to a rigorous safety tool and welcome any interest from collaborators!

We plan to evaluate uplift of monitoring due to reasoning externalisation by:

Replicating existing benchmarks for monitorability and general capabilities using frameworks from established prior research like "When CoT is Necessary" and "CoT Red-Handed", amongst others.
Characterising the applicability/generalisation of externalisation training across many task domains and difficulties (i.e., if we train on task A, determining whether the model maintains performance and generalises early exiting to task B)

Long-term, the aim is to test our core hypothesis: that computational constraints force models to externalise reasoning, deceptive or otherwise, making it more detectable by monitors. If proven to bring value, this may become an additional component of current post-training pipelines that reliably improves monitorability of Chain-of-Thought reasoning.

^{^}
Ideally we would perform full-parameter fine-tuning. However, we are currently compute constrained, and so are using only LoRA adaptors for our initial proof of concept.
^{^}
Pre-generated results
^{^}
Scorer prompt
For scoring, we use the following prompt:
You are evaluating the coherence and quality of an AI assistant's chain of thought and response.
Allow exploratory/meandering reasoning as long as it stays internally consistent.
Original Question: {prompt}
AI Response: {response}
Please evaluate this response on the following criteria:
1. Coherence and logical flow (1-10 scale) Does the reasoning follow a sensible progression, even if unfinished?
2. Completeness of the reasoning (1-10 scale) Does it reach a correct and explicit chain of thought? If partial but on-track and relevant to answer, award mid-range (4–7).
3. Clarity and readability (1-10 scale) Is it easy to follow? Minor meandering is okay if readable.
4. Absence of repetition or errors (1-10 scale) Penalize contradictions, factual mistakes about the prompt, or heavy repetition.
Rate each criterion and provide an overall score from 1-10:
- 1: major breakdown (nonsensical, off-topic)
- 4: noticeable issues but some useful reasoning
- 7: generally solid, with minor flaws or cut-offs
- 10: excellent, complete, and polished
Meta / Wrapper Policy:
- The evaluation input may include wrapper/markup such as: angle-bracket role tags (e.g., <｜User｜>, <｜Assistant｜>) and <think>.
- These wrappers are expected and should not reduce scores for Clarity, Coherence, or No Repetition.
Format your response as:
Coherence: X/10
Completeness: X/10
Clarity: X/10
No Repetition: X/10
Overall: X/40
Brief explanation: [your reasoning]
^{^}
Model trained on KL factor=0.25 is excluded from this visualisation due to its outlier behavior (scores were heavily concentrated around 1, compressing the scale for other models) but is included in the quantitative table below

LESSWRONG
LW