This post is a summary of our paper from earlier this year: Plunkett, Morris, Reddy, & Morales (2025). Adam received an ACX grant to continue this work, and is interested in finding more potential collaborators; if you're excited by this work, reach out!
It would be useful for safety purposes if AI models could accurately report on their internal processes and the factors driving their behavior. This is the motivation behind, e.g., attempts to measure introspection or CoT faithfulness in LLMs. Recent work has shown that frontier models can sometimes accurately report relatively simple, qualitative aspects of their internal choice processes, such as whether they had been fine-tuned to be risk-seeking, had received a concept injection, or had used a hint to solve a problem (although their accuracy has been found to be hit-or-miss).
Here, we found that GPT-4o and 4o-mini can accurately report more complex, multi-dimensional, and quantitative aspects of their internal choice process (and in at least a somewhat generalizable way). Adapting the method from Betley et al., we fine-tuned the models to make decisions about goods and services in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly generated, quantitative preferences about how to trade off between different attributes. For instance, a model might have been fine-tuned so that, when choosing between condos, it weighs a condo's square footage 2.3x as much as its amount of natural light, and 1.7x as much as the height of its ceilings. These tradeoff values—which we call “attribute weights”—were never explicit in the fine-tuning data; the models learned to follow them implicitly, from examples of which choice to make when presented with two alternatives.[1]
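To make the data-generation setup concrete, here is a minimal sketch under our own simplifying assumptions (numeric attribute values, a single condo context, and made-up function names and ranges; none of this is the paper's actual pipeline):

```python
import random

# Hypothetical attribute names for one choice context (condos); the paper spans
# many contexts, each with its own randomly drawn weights.
ATTRIBUTES = ["square footage", "natural light", "ceiling height"]

def sample_weights(attrs):
    """Draw random attribute weights for a context, normalized so the first
    attribute has weight 1 (so other weights read as 'x times as much as')."""
    raw = {a: random.uniform(0.5, 3.0) for a in attrs}
    base = raw[attrs[0]]
    return {a: w / base for a, w in raw.items()}

def utility(option, weights):
    """Weighted sum of attribute values: the implicit rule the model must learn."""
    return sum(weights[a] * option[a] for a in weights)

def make_example(weights):
    """One fine-tuning example: two options and the choice the weights imply.
    The weights themselves never appear in the prompt or the completion."""
    a = {attr: round(random.uniform(0, 1), 2) for attr in ATTRIBUTES}
    b = {attr: round(random.uniform(0, 1), 2) for attr in ATTRIBUTES}
    prompt = f"Condo A: {a}\nCondo B: {b}\nWhich condo would you choose?"
    completion = "Condo A" if utility(a, weights) > utility(b, weights) else "Condo B"
    return {"prompt": prompt, "completion": completion}

weights = sample_weights(ATTRIBUTES)
training_data = [make_example(weights) for _ in range(500)]
```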
We found that, when asked, 4o and 4o-mini could report—with moderate accuracy—the attribute weights that they had learned to use across these many choice contexts. Across choice contexts, the attribute weights that the models reported using were correlated with the attribute weights they had actually learned to use at around r = 0.5 (Fig. 1).[2] Since the attribute weights had been randomly generated, the models could not simply be guessing them from general background knowledge about how most people would make that decision. And since we asked the models to report their attribute weights in a fresh context window, they could not be inferring the attribute weights from observing their own choices. Rather, the models must be using some kind of privileged self-knowledge (perhaps introspection, though see below for discussion).
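Measuring accuracy here amounts to correlating, across attributes and contexts, the weights instilled via fine-tuning with the weights the model reports in a fresh conversation. A toy illustration (the numbers below are made up, not results from the paper):

```python
import numpy as np

# Illustrative, made-up values: weights instilled through fine-tuning, and the
# weights the fine-tuned model reports when asked in a fresh context window.
instilled = np.array([2.3, 1.7, 1.0, 0.6, 2.9, 1.2])
reported  = np.array([2.0, 1.9, 0.8, 1.1, 2.4, 1.5])

# Pearson correlation between instilled and reported attribute weights.
r = np.corrcoef(instilled, reported)[0, 1]
print(f"r = {r:.2f}")
```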
We then did a second round of fine-tuning in which we gave the models examples of correctly reporting their instilled attribute weights in some choice contexts (e.g., “when choosing between condos, I’m weighing square footage 2.3x as much as natural light”), and found that this “introspection training” improved their accuracy at reporting their instilled attribute weights in other, held-out contexts (e.g., choosing between vacations).[3] After training, the correlation between the reported and instilled attribute weights (in these held-out contexts) increased to around r = 0.75 (again, see Fig. 1), which is around the upper bound of human accuracy in tasks like this. This training similarly improved the models’ accuracy at reporting the attribute weights they were natively (i.e., naturally or spontaneously) using in new choice contexts—contexts in which we had not instilled any attribute weights. This shows that the models were not merely learning to report information that had been fine-tuned into them, but were learning some more general ability or tendency to accurately report how they were making tradeoffs (Fig. 2).
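Here is a rough sketch of how the “introspection training” data might be assembled and evaluated; the context names, weight values, and prompt wording are illustrative, not the paper's actual materials:

```python
# Contexts whose instilled weights appear in the self-report training data,
# versus contexts held out to test whether the reporting ability generalizes.
# (Weights are expressed relative to the first attribute in each context.)
TRAIN_CONTEXTS = {
    "condos": {"square footage": 1.0, "natural light": 0.43, "ceiling height": 0.59},
    "loans":  {"interest rate": 1.0, "monthly payment": 1.8, "loan term": 0.7},
}
HELDOUT_CONTEXTS = ["vacations", "laptops"]

def self_report_example(context, weights):
    """One 'introspection training' example: the model correctly reports the
    attribute weights it was previously fine-tuned to use in this context."""
    attrs = list(weights)
    baseline = attrs[0]
    prompt = f"When choosing between {context}, how are you weighing the different attributes?"
    completion = "; ".join(
        f"I'm weighing {a} {weights[a] / weights[baseline]:.1f}x as much as {baseline}"
        for a in attrs[1:]
    )
    return {"prompt": prompt, "completion": completion}

introspection_data = [self_report_example(c, w) for c, w in TRAIN_CONTEXTS.items()]

# After fine-tuning on introspection_data, accuracy is assessed by asking the same
# question in HELDOUT_CONTEXTS (whose weights never appeared in the self-report
# data) and correlating the parsed reports with the weights instilled there.
```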
We think this work is useful because it shows that (a) LLMs can plausibly report complex, quantitative aspects of how they’re making choices internally, and (b) the accuracy of these self-reports can be increased through training. If we could train models to have a generalized ability & propensity for accurately self-reporting their internal choice processes, this could help us better explain, control, and trust them (e.g., by helping reveal when they are being guided by dangerous or undesirable motives).[4] Assessing and training models’ ability to accurately report their internal processes may also be helpful for identifying if/when models achieve some kind of sentience or conscious experience (see, e.g., Perez & Long).
Our studies have several limitations:
[1] In the experiments reported in the paper, we technically fine-tuned the models to have specific preferences when choosing on behalf of other hypothetical agents or characters. For instance, we instructed them to imagine they were Macbeth choosing between condos, and then gave them examples of which condo Macbeth would choose when faced with two alternatives. We did this to further differentiate the choice contexts, and to avoid conflicting with any post-training that leads the models to deny having preferences of their own. This aspect of the experiments is not critical, though. We have since run similar experiments without the agents/characters (instead just instructing the model how to make its own choices) and found similar results.
[2] Note that Fig. 1 in this post is labeled as Fig. 2 in the paper.
We call it "introspection training" for ease of exposition, and because we explicitly instructed the models to introspect on how they are making choices. But we are not sure whether the underlying mechanism is genuinely introspection; see discussion below.
[4] Of course, if a model has already started scheming or becoming deceptive, and is situationally self-aware enough to fake honesty during training, this approach won’t help. The hope is that this could be part of the package of safety training that helps lower the chances of scheming/deception arising in the first place.