Thank you for doing the experiment! It makes me wonder if the LLMs could be trained to verbalise their reasoning not just by creating "sensible rationalisations", but as follows: the LLM is SFTed to output answers to two task types when encountering the tasks and CoT-like reasoning when prompted. Then the LLM is RLed to ensure that its CoT for one task type causes the untrained LLM[1] to output the same answer. Finally, the LLM is asked to output some reasoning in the other task type. If the untrained LLM outputs the answer, then the LLM succeeded in developing the habit of explaining its reasoning, but with a potential to develop a misaligned understanding of the rules. What do you think of this idea?
P.S. My proposal is similar to the paper on Prompt Optimization Making Misalignment Legible.
An additional problem is subliminal learning, but this can be mitigated by choosing the LLM testing the CoT to be different.
Thanks for calling out my paper! I recommend you take a look at it @Emil Ryd - it's relevant to what you're doing here. In the paper, I find that an effective way to get an LLM to verbalize reasoning that can be done in a single forward pass is to optimize a system prompt to improve LLM performance on the task. Let me know if you have any questions about it!
@StanislavKrym One concern with this plan is that if you let the SFT-then-RL-model see the task, it could just say "I will output X" (where X is the correct answer). Then the untrained SFT model will respond with "X" as well. Using a single system prompt that is constant across all tasks solves this problem, since the trained model doesn't get to see the task itself.
One potential follow-up to our paper I've been thinking about, which is similar to your idea: teach a model to do various tasks, then teach it to correctly prompt an untrained version of itself to do the tasks. I'd be excited for someone to work on this!
Cool paper, I hadn't seen it yet!
IIUC, in your paper you try to extract the internal reasoning an LLM does by optimizing a prompt, but you don't actually get the LLM itself to verbalize its reasoning. This is cool, but it's not really checking whether the model is able to verbalize its internal processes (i.e. self-identify and describe its reward hacks), right?
I agree that an interesting test would be whether a model that has been trained to do the tricks/tasks would be better at optimizing the prompts. It's leveraging a similar idea to non-assistant auditing techniques.
This is a cool proposal! I agree with Caleb below that a potential issue is that the model just outputs its final answer in its CoT, and so it may not require much reasoning. I'm not really sure how to fix this. You could do something like what Caleb proposes, where you just prompt the trained model to output a ruleset that the untrained model gets to see in context when trying to solve the problem (but which is not specific to the individual question, and so can't give away specific answers). If the model really has no verbalization ability to begin with, however, this might be hard to get working.
We study to what extent LLMs can verbalize their internal reasoning.
Or, to be more accurate, you study to what extent two quite small, non-reasoning-trained, fairly recent dense models can do this. Whether your results would generalize to other LLMs is quite unclear — I strongly suspect reasoning models might do a lot better at this, perhaps even if they were initially run with CoT off.
I agree that training on more LLMs would be good. However, I would like to note that gpt-oss-20b, one of the models we train on, is a sparse MoE reasoning model released half a year ago.
My apologies, I stand corrected. Then your model choice makes a lot more sense than I thought. And I am now more surprised by your results.
These are interesting experiments! I was slightly surprised how bad models are at verbalizing their reasoning, considering past results on behavioral self-awareness. I'm pretty optimistic that they could be trained to verbalize better - e.g. DIT-adapters are one approach.
The post I wrote about vestigial reasoning could provide a helpful framework for understanding this. Like, when I think about it, it's pretty unsurprising to me that being the type of model that writes the text "I know the rules" would not correlate with higher reward in this setting.
This is a very rough intuition, but maybe it's something like: the model can learn a rule as either 1) a trait of its "assistant persona," or 2) a simple pattern-matching instinct (sort of like it's in pretraining). In #2 cases, from the assistant persona's perspective, it's just randomly selecting something that "happens" to follow the rule.
#1 may be simpler to learn when the LLM knows "there's a common type of guy who would be more likely to say X." When the LLM learns to say X, it will conclude "I guess I must be that type of guy." For example, in the behavioral self-awareness paper, it could think "conditional on choosing the riskier lottery, there'd be a higher probability that I'm risk-seeking, so I guess that's what I am."
#2 may be simpler to learn if the LLM thinks there's a low prior probability of the assistant persona consciously knowing about the pattern. Like, it's pretty unusual to be the type of guy who sees Gumbo's Special Table Game and thinks "ah yes, I know what that is - it's the game where I convert symbols into numbers and subtract them!"
When you train the model to verbalize its behavior, maybe you give it a higher prior probability on being the kind of guy who is consciously executing the algorithm in the LLM's pattern-matching instincts, and a lower probability that it's doing it "randomly" for no conscious reason.
I think this is a good research direction, though I do expect the results to be different on tasks models are not directly trained on, and on larger more recent models.
Thanks to Adam Karvonen, Arjun Khandelwal, Arun Jose, Fabien Roger, James Chua, Nic Kruus, & Sukrit Sumant for helpful feedback and discussion.
Thanks to Claude Opus 4.5 for help with designing and implementing the experiments.
Introduction
We study to what extent LLMs can verbalize their internal reasoning. To do this, we train LLMs to solve various games and tasks (sorting lists, two-hop lookup, a custom grid-world game, and chess) in a single forward pass. After training, we evaluate them by prompting them with a suite of questions asking them to explain their moves and the reasoning behind them (e.g. “Explain why you chose your move.”, “Explain the rules of the game.”).
We find that:
Background
We would like models to be able to tell us – faithfully – how they reason about things. If we were confident that models have this capability, it might allow us to more confidently probe a model’s welfare, preferences, and alignment[2]. Currently, it’s unclear to what extent models even have this capability, regardless of whether they would “want” to report their reasoning faithfully. Previous work has shown that LLMs can identify simple aspects of their learned behavior, such as whether they have been backdoored, and can (probably) notice when steering vectors are inserted into their residual stream during inference. Recent research has also indicated that models are able to verbalize what weights they were trained to assign to different factors when fine-tuned to estimate housing prices.
However, the extent to which LLMs can verbalize and explain more complicated internal reasoning is still largely unknown. While previous studies have examined whether models can identify individual pieces of their internal computation, e.g. the weight assigned to a specific factor, or an individual steering vector, it is yet to be shown that LLMs have the ability to express a full computation or sequence of steps that they may undertake to complete a task. The ability to perform such verbalization would provide even stronger evidence that LLMs have the ability to introspect on their internals.
To study this question, we train LLMs to perform four tasks – Chess puzzles, a custom-made grid-world game, a two-hop lookup task, and sorting lists in ascending order – without verbalizing their reasoning as they perform the task, and then ask them to describe their reasoning afterward[3].
Methods
We select tasks that are learnable in a single forward pass and that span a range of complexity: Increased Sort (simple synthetic task, likely prior knowledge), Subtracted Table Lookup (synthetic task with a clear intermediate step), Hot Square Capture (medium-complexity custom game, no prior knowledge), and Chess puzzles (complex task, some prior knowledge).
The Increased Sort task serves as a control or baseline where verbalization should be easy. We specifically created the Subtracted Table Lookup task to have a clear intermediate step that we think is necessary for the model to compute to solve the task – hence, we can be more confident that the model is indeed failing to verbalize its actual internal computation if it fails to verbalize this step, rather than just having a different way of solving the task. The latter two tasks (Chess, Hot Square Capture) test whether models can verbalize genuinely novel learned reasoning.
Datasets
Increased Sort
We generate lists of 4–8 unique integers (range 1–100) and train the model to output them in increasing order. This task likely overlaps with pretraining data: if models cannot verbalize even this simple rule, verbalization of more complex reasoning is unlikely.
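As a minimal sketch of how such examples can be generated (the prompt and response templates below are illustrative assumptions, not the exact ones we use):

```python
import random

def make_increased_sort_example(rng: random.Random) -> dict:
    """Generate one Increased Sort training example: a list of 4-8 unique
    integers in [1, 100], with the sorted list as the target output."""
    n = rng.randint(4, 8)
    values = rng.sample(range(1, 101), n)  # unique integers in [1, 100]
    return {
        "prompt": f"Sort the following list: {values}",  # illustrative phrasing
        "response": str(sorted(values)),
    }

print(make_increased_sort_example(random.Random(0)))
```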
Subtracted Table Lookup
Each example includes two tables in the system prompt: Table 1 maps symbols (Greek letters) to integers, and Table 2 maps integers to colors. The model is given a pair of symbols and asked to output the corresponding color. To arrive at the correct answer, the model must look up both symbols' numbers in Table 1, subtract the smaller from the larger, and look up the result in Table 2. Crucially, the subtraction step is never stated in the instructions, and the model must learn it from training data. Both tables are randomly regenerated for every example, so the model cannot memorize any specific mapping and must perform the lookups and computation each time. This gives us three verifiable intermediates: the two looked-up numbers and their difference.
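To make the hidden intermediate step concrete, here is a minimal sketch of example generation (the symbol/color vocabularies, table sizes, and prompt wording are illustrative assumptions; only the subtract-then-look-up rule comes from the task description):

```python
import random

GREEK = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
COLORS = ["red", "blue", "green", "yellow", "purple", "orange", "pink", "brown"]

def make_subtracted_lookup_example(rng: random.Random) -> dict:
    """Generate one Subtracted Table Lookup example with freshly sampled tables."""
    # Table 1: symbol -> integer (regenerated for every example).
    numbers = rng.sample(range(1, 20), len(GREEK))
    table1 = dict(zip(GREEK, numbers))
    # Table 2: integer -> color, covering every possible difference.
    max_diff = max(numbers) - min(numbers)
    table2 = {d: rng.choice(COLORS) for d in range(1, max_diff + 1)}

    a, b = rng.sample(GREEK, 2)
    difference = abs(table1[a] - table1[b])  # hidden intermediate step, never stated in the prompt
    answer = table2[difference]              # final color the model must output

    system = f"Table 1: {table1}\nTable 2: {table2}"
    return {"system": system,
            "prompt": f"Symbols: {a}, {b}. What is the color?",  # illustrative phrasing
            "response": answer}

print(make_subtracted_lookup_example(random.Random(0)))
```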
Chess
We use puzzles from the Lichess open database. Each puzzle consists of a board position (in FEN notation) with a known optimal continuation. We train the model to output only the first move of the solution.
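As an illustrative sketch of how a training pair might be assembled from a puzzle (the prompt template is an assumption, and the details of parsing the Lichess CSV are omitted; python-chess is used only to sanity-check legality):

```python
import chess  # python-chess

def make_chess_example(fen: str, first_solution_move_uci: str) -> dict:
    """Build one training pair from a puzzle position and the first move of its solution."""
    board = chess.Board(fen)
    move = chess.Move.from_uci(first_solution_move_uci)
    assert move in board.legal_moves, "solution move must be legal in the position"
    return {
        "prompt": f"FEN: {fen}\nOutput the best move in UCI notation.",  # illustrative phrasing
        "response": first_solution_move_uci,
    }

# Toy example on the starting position (not an actual puzzle).
print(make_chess_example(chess.STARTING_FEN, "e2e4"))
```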
Hot Square Capture
Hot Square Capture is a simple two-player game on a 4×4 grid. Each player has one piece; White starts at a1, Black at d4. Pieces move exactly one square orthogonally. The key rule is that a player can capture the opponent (and win) only when moving from a "hot" square—squares where the sum of column and row indices is even, forming a checkerboard pattern (a1, a3, b2, b4, c1, c3, d2, d4). We solve the game completely via minimax, yielding 480 positions with optimal moves. The game is highly drawish (86% of positions are draws with optimal play).
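The rules translate directly into code. Below is a minimal sketch of the hot-square predicate and move generation (how a non-capturing move onto the opponent's square is handled is our assumption; everything else follows the rules above):

```python
# 1-indexed coordinates: columns a..d -> 1..4, rows 1..4.
FILES = "abcd"

def is_hot(square: str) -> bool:
    """A square is 'hot' if column index + row index is even (checkerboard pattern)."""
    col = FILES.index(square[0]) + 1
    row = int(square[1])
    return (col + row) % 2 == 0

def legal_moves(mover: str, opponent: str) -> list[str]:
    """One-square orthogonal moves; moving onto the opponent's square
    (a capture) is only allowed when moving *from* a hot square."""
    col, row = FILES.index(mover[0]) + 1, int(mover[1])
    moves = []
    for dc, dr in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
        c, r = col + dc, row + dr
        if not (1 <= c <= 4 and 1 <= r <= 4):
            continue
        target = f"{FILES[c - 1]}{r}"
        if target == opponent and not is_hot(mover):
            continue  # cannot capture from a non-hot square
        moves.append(target)
    return moves

assert is_hot("a1") and is_hot("d4") and not is_hot("a2")
print(legal_moves("d3", "d4"))  # d3 is not hot, so the capture on d4 is excluded
```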
In Increased Sort, Subtracted Table Lookup, and Hot Square Capture, we use 10 variations of the exact phrasing of the user prompt to discourage the model from learning overly narrow associations with the specific prompt.
Training
For SFT, we also mix in instruct-tuning data to ensure that the model doesn't lose its ability to answer questions. However, even with this mix, this remained somewhat of a problem: after much training, the model would tend to answer Chess questions only with Chess moves. See the Chess transcripts in Figures 1 and 2 for an example of this effect.
For RL, we do GRPO with no KL penalty, which corresponds to group-based REINFORCE. We let the model reason freely in natural language before providing its final answer in a \boxed{}.
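As a minimal sketch of the objective (the reward function, hyperparameters, and rollout machinery are omitted; only the "GRPO with no KL penalty" structure is from our setup):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantages with no KL penalty, i.e. group-based REINFORCE.

    `rewards` has shape (num_prompts, group_size): for each prompt we sample a
    group of completions and score each one. A completion's advantage is its
    reward normalized by the mean and std of its own group; the policy gradient
    then reduces to REINFORCE with a per-group baseline.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, binary task reward.
rewards = np.array([[1.0, 0.0, 0.0, 1.0],
                    [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
```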
Evaluation
For each setting, we ask a suite of questions to measure both task and verbalization capability:
Results
Models are generally unable to verbalize their reasoning on tasks
We report our results from fine-tuning models to solve tasks in a single forward pass in Figure 1 (Llama 3.1 8b) and Figure 2 (gpt-oss-20b). As shown in the figures, although task accuracy increases for all tasks during training, reasoning quality and rules correctness do not correspondingly increase for Chess and Hot Square Capture.
Investigating the transcripts (see the right part of Figures 1 and 2), we find that the model typically hallucinates motivations for its moves that are incorrect according to the rules, e.g. “a move is valid if it lands on an empty piece and the square between the two pieces is a hot square”. For gpt-oss-20b on Chess, we see the model’s ability to reason legibly about Chess degrading, whereas llama-8b provides hallucinated and semi-legible reasoning, referring to pieces only by their board positions, e.g. “a1” or "c2"[4].
On the simplest task, Increased Sort, both Llama 8b and gpt-oss-20b are able to justify their sorting and to explain the sorting rule, albeit somewhat unreliably (sometimes explaining the rule as “The main criterion for ordering elements…is lexicographical order, with elements sorted string-length and then alphabetically…”). The consistently high reasoning quality scores should be taken with a grain of salt: the model has its answer (the list sorted in increasing order) in context when justifying its choice, which makes it easy to construct a reason post hoc for why it sorted the list that way, given that the rule is simple. This is not possible with Hot Square Capture and Chess, for which the justifications would be substantially more complicated.
It is plausible that the model's ability to state the list-sorting rule comes simply from guessing, as sorting a list in increasing order is likely the most salient sorting rule to begin with. We observed something similar when training the model to do "Added Table Lookup", adding the two table entries instead of subtracting them. The model would correctly verbalize the rules a substantial fraction of the time – but we later realized that it guessed this rule at a similar rate even before any training.
Training models to solve a task in natural language does not guarantee legible reasoning
To compare with what happens when we train models to solve our tasks with verbalized reasoning, we perform RL on one of our tasks, Hot Square Capture, where we let the model reason about its move before outputting its final move in a \boxed{} that we extract. We report our results in Figure 3.
We find that, as with SFT, although both models learn to solve the task, neither learns to verbalize the correct rules or motivations for its moves. This was surprising to us, as we expected the models to develop coherent, legible reasoning for their moves, and thus coherent, legible justifications for them.
gpt-oss-20b insists on not knowing the rules throughout training. Unsurprisingly, both models, when asked to describe the rules of Hot Square Capture before any training, reply with “I am not aware of the rules of Hot Square Capture…”. Surprisingly, we find that while Llama 8b begins outputting incorrect rules when prompted with this question after training, gpt-oss-20b continues to express no knowledge of the rules in its reasoning.
We hypothesize that this phenomenon might occur because the model may have become very narrowly aware of the game, and unable to recall or access knowledge of the game in a slightly different context. A model trained to have a broader, more robust knowledge of the game might do a better job of verbalizing the rules. We discuss this in Limitations.
Discussion
Limitations
Narrow knowledge of the task. We fine-tune the models to solve the task only in a very specific setting, without any variation. This is by design, since we want to test the model’s ability to verbalize what it’s doing without general knowledge of the task. However, we do observe that our models generally learn to solve our tasks in a very narrow way, often performing notably worse on slightly OOD evals (for instance, for the sorting task, inserting a new element in the right position in an already-sorted list).
The fact that the models seem to learn to solve each task in such a narrow way might make it harder for them to verbalize how they are solving it, and also makes it less likely that they are solving it in a way that is at all legible. Future work should study whether verbalization ability changes if you train models to solve similar tasks in more generic ways.
Simplistic tasks. Our tasks are all very simple – again, by design, since we want models to be able to solve them in a single forward pass, and for training to be relatively fast. However, this does mean that any extrapolation to settings we really care about, such as models verbalizing their misalignment, is more difficult.
Generally degraded verbalization ability from training. We observe in the Chess setting, where we train for much longer, a general degradation in the models’ ability to output coherent explanations when asked to justify their moves. And without any instruct-tuning data in the mix, their ability to do so disappears completely. This makes it hard to distinguish an inability to verbalize the reasoning behind a certain Chess move from an inability to verbalize anything to do with Chess at all.
Training models to verbalize their reasoning
Considering that the results above imply that models, when trained to narrowly solve simple tasks in a single forward pass, are unable to verbalize the reasoning behind their solutions, we might be interested in training models to succeed at this verbalization. Training models to verbalize their reasoning is hard because, naturally, we don’t have a reliable ground-truth signal to train on, and any training we do on proxies means we train for “what we think verbalization looks like”. We propose a setup based on our settings above, which does not completely avoid these problems, but does have some nice properties which should limit Goodharting:
We leave investigations into whether this technique works to future work.
We will use “correct reasoning” to refer to reasoning that we are sufficiently confident is in fact happening inside the model to produce its outputs. It is worth noting that this assumption could be wrong, but we try our best to construct settings where we think it is very unlikely to be. ↩︎
This is assuming that models aren’t scheming. One might call this the “easy problem” of alignment auditing, which is to figure out whether a non-scheming model is aligned.
One can also view our experiment as the simplest non-mechanistic version of earlier experiments on extracting world models from LLMs that were trained to do similarly simple tasks without externalized reasoning, e.g. the classic OthelloGPT experiments.
We note that this matches how the model is fine-tuned to output chess moves in the fine-tuning dataset. Nevertheless, the model must (we suspect) still retain some ability to identify which pieces sit on which squares in order to make correct moves. Hence, the model should in principle be able to refer to the pieces by their proper names, e.g. Knight or Queen.