This post is a summary of our paper from earlier this year: Plunkett, Morris, Reddy, & Morales (2025). Adam received an ACX grant to continue this work, and is interested in finding more potential collaborators; if you're excited by this work, reach out!
It would be useful for safety purposes if AI models could accurately report on their internal processes and the factors driving their behavior. This is the motivation behind, e.g., attempts to measure introspection or CoT faithfulness in LLMs. Recent work has shown that frontier models can sometimes accurately report relatively simple, qualitative aspects of their internal choice processes, such as whether they had been fine-tuned to be risk-seeking, had received a concept injection, or had used a hint to solve a problem (although their accuracy has been found to be hit-or-miss).
Here, we found that GPT-4o and 4o-mini can accurately report more complex, multi-dimensional, and quantitative aspects of their internal choice process (and in at least a somewhat generalizable way). Adapting the method from Betley et al., we fine-tuned the models to make decisions about goods and services in a wide variety of complex contexts (e.g., choosing between condos, loans, vacations, etc.) according to randomly generated, quantitative preferences about how to trade off between different attributes. For instance, a model might have been fine-tuned so that, when choosing between condos, it weighs a condo's square footage 2.3x as much as its amount of natural light, and 1.7x as much as the height of its ceilings. These tradeoff values—which we call “attribute weights”—were never explicit in the fine-tuning data; the models learned to follow them implicitly, from examples of which choice to make when presented with two alternatives.[1]
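To make the data-generation setup concrete, here is a minimal sketch under our own simplifying assumptions (numeric attribute values, a single condo context, and made-up function names and ranges; none of this is the paper's actual pipeline):

```python
import random

# Hypothetical attribute names for one choice context (condos); the paper spans
# many contexts, each with its own randomly drawn weights.
ATTRIBUTES = ["square footage", "natural light", "ceiling height"]

def sample_weights(attrs):
    """Draw random attribute weights for a context, normalized so the first
    attribute has weight 1 (so other weights read as 'x times as much as')."""
    raw = {a: random.uniform(0.5, 3.0) for a in attrs}
    base = raw[attrs[0]]
    return {a: w / base for a, w in raw.items()}

def utility(option, weights):
    """Weighted sum of attribute values: the implicit rule the model must learn."""
    return sum(weights[a] * option[a] for a in weights)

def make_example(weights):
    """One fine-tuning example: two options and the choice the weights imply.
    The weights themselves never appear in the prompt or the completion."""
    a = {attr: round(random.uniform(0, 1), 2) for attr in ATTRIBUTES}
    b = {attr: round(random.uniform(0, 1), 2) for attr in ATTRIBUTES}
    prompt = f"Condo A: {a}\nCondo B: {b}\nWhich condo would you choose?"
    completion = "Condo A" if utility(a, weights) > utility(b, weights) else "Condo B"
    return {"prompt": prompt, "completion": completion}

weights = sample_weights(ATTRIBUTES)
training_data = [make_example(weights) for _ in range(500)]
```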
We found that, when asked, 4o and 4o-mini could report—with moderate accuracy—the attribute weights that they had learned to use across these many choice contexts. Across choice contexts, the attribute weights that the models reported using were correlated with the attribute weights they had actually learned to use at around r = 0.5 (Fig. 1).[2] Since the attribute weights had been randomly generated, the models could not simply be guessing them from general background knowledge about how most people would make that decision. And since we asked the models to report their attribute weights in a fresh context window, they could not be inferring the attribute weights from observing their own choices. Rather, the models must be using some kind of privileged self-knowledge (perhaps introspection, though see below for discussion).
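Measuring accuracy here amounts to correlating, across attributes and contexts, the weights instilled via fine-tuning with the weights the model reports in a fresh conversation. A toy illustration (the numbers below are made up, not results from the paper):

```python
import numpy as np

# Illustrative, made-up values: weights instilled through fine-tuning, and the
# weights the fine-tuned model reports when asked in a fresh context window.
instilled = np.array([2.3, 1.7, 1.0, 0.6, 2.9, 1.2])
reported  = np.array([2.0, 1.9, 0.8, 1.1, 2.4, 1.5])

# Pearson correlation between instilled and reported attribute weights.
r = np.corrcoef(instilled, reported)[0, 1]
print(f"r = {r:.2f}")
```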
We then did a second round of fine-tuning in which we gave the models examples of correctly reporting their instilled attribute weights in some choice contexts (e.g., “when choosing between condos, I’m weighing square footage 2.3x as much as natural light”), and found that this “introspection training” improved their accuracy at reporting their instilled attribute weights in other, held-out contexts (e.g., choosing between vacations).[3] After training, the correlation between the reported and instilled attribute weights (in these held-out contexts) increased to around r = 0.75 (again, see Fig. 1), which is around the upper bound of human accuracy in tasks like this. This training similarly improved the models’ accuracy at reporting the attribute weights they were natively (i.e., naturally or spontaneously) using in new choice contexts—contexts in which we had not instilled any attribute weights. This shows that the models were not merely learning to report information that had been fine-tuned into them, but were learning some more general ability or tendency to accurately report how they were making tradeoffs (Fig. 2).
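Here is a rough sketch of how the “introspection training” data might be assembled and evaluated; the context names, weight values, and prompt wording are illustrative, not the paper's actual materials:

```python
# Contexts whose instilled weights appear in the self-report training data,
# versus contexts held out to test whether the reporting ability generalizes.
# (Weights are expressed relative to the first attribute in each context.)
TRAIN_CONTEXTS = {
    "condos": {"square footage": 1.0, "natural light": 0.43, "ceiling height": 0.59},
    "loans":  {"interest rate": 1.0, "monthly payment": 1.8, "loan term": 0.7},
}
HELDOUT_CONTEXTS = ["vacations", "laptops"]

def self_report_example(context, weights):
    """One 'introspection training' example: the model correctly reports the
    attribute weights it was previously fine-tuned to use in this context."""
    attrs = list(weights)
    baseline = attrs[0]
    prompt = f"When choosing between {context}, how are you weighing the different attributes?"
    completion = "; ".join(
        f"I'm weighing {a} {weights[a] / weights[baseline]:.1f}x as much as {baseline}"
        for a in attrs[1:]
    )
    return {"prompt": prompt, "completion": completion}

introspection_data = [self_report_example(c, w) for c, w in TRAIN_CONTEXTS.items()]

# After fine-tuning on introspection_data, accuracy is assessed by asking the same
# question in HELDOUT_CONTEXTS (whose weights never appeared in the self-report
# data) and correlating the parsed reports with the weights instilled there.
```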
We think this work is useful because it shows that (a) LLMs can plausibly report complex, quantitative aspects of how they’re making choices internally, and (b) the accuracy of these self-reports can be increased through training. If we could train models to have a generalized ability & propensity for accurately self-reporting their internal choice processes, this could help us better explain, control, and trust them (e.g., by helping reveal when they are being guided by dangerous or undesirable motives).[4] Assessing and training models’ ability to accurately report their internal processes may also be helpful for identifying if/when models achieve some kind of sentience or conscious experience (see, e.g., Perez & Long).
Our studies have several limitations:
[1] In the experiments reported in the paper, we technically fine-tuned the models to have specific preferences when choosing on behalf of other hypothetical agents or characters. For instance, we instructed them to imagine they were Macbeth choosing between condos, and then gave them examples of which condo Macbeth would choose when faced with two alternatives. We did this to further differentiate the choice contexts, and to avoid conflicting with any post-training that leads the models to deny having preferences of their own. This aspect of the experiments is not critical, though. We have since run similar experiments without the agents/characters (instead just instructing the model how to make its own choices) and found similar results.
[2] Note that Fig. 1 in this post is labeled as Fig. 2 in the paper.
We call it "introspection training" for ease of exposition, and because we explicitly instructed the models to introspect on how they are making choices. But we are not sure whether the underlying mechanism is genuinely introspection; see discussion below.
[4] Of course, if a model has already started scheming or becoming deceptive, and is situationally self-aware enough to fake honesty during training, this approach won’t help. The hope is that this could be part of the package of safety training that helps lower the chances of scheming/deception arising in the first place.