The faculty of introspection in LLMs is an important, fascinating, increasingly popular, and somewhat underspecified object of study. Studies could benefit from researchers being explicit about what their assumptions and definitions are; the goal of this piece is to articulate mine as clearly as possible, and ideally stimulate further discussion in the field.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
My first assumption is that introspection means the same thing whether we are talking about humans, animals, or AI, although it may be implemented differently; this enables us to take prior conceptual and behavioral work on human and animal introspection and apply it to LLMs. The Stanford Encyclopedia of Philosophy entry on introspection is long, nuanced, and reflects considerable disagreement among philosophers about the nature and objects of introspection. It's worth reading and pondering in full, but it’s possible to abstract a few core principles directly applicable to studying introspection in LLMs. Broadly there is agreement that introspection entails access to internal states; that one important class of such states is composed of “attitudes”, such as beliefs, desires, and intentions, which are themselves metarepresentational states with other internal states as their objects; and that there is a self-referential component, an “I” that is the subject of the attitudes.
First-order (non-metarepresentational) internal states in biological intelligences can include sensory states like hunger and coldness with no analogue in LLMs, so I want to focus on “cognitive states”, and the metarepresentations formed by applying self-ascriptive attitudes towards them. A cognitive state corresponds to the “activation” of a particular concept, for example:
- "bread"
- "the dish is in the cupboard"
- etc.
An introspective representation consists of a subject ("I") "wrapping" that concept in an attitude, which could be:
- believing ("I believe that...")
- desiring ("I want...")
- intending ("I intend to...")
- being certain or uncertain ("I am moderately certain that...")
- etc.
The "wrapped" concepts may be atomic ("bread"), propositions ("the dish is in the cupboard"), or even metarepresentations themselves ("I know where the dish is").
While those lists are meant to be non-exclusive, there are certain types of self-ascriptive metarepresentations that are worth explicitly excluding. For example, humans, despite what they may experience subjectively, can fail to accurately report on the causes of their behavior: research shows they sometimes confabulate those causes, attributing their behavior to cues they could not have used and denying the use of cues they in fact used. There’s no suggestion that people are lying; rather, it seems that they are using a metarepresentation that is not causally connected to the first-order concept, explaining their own behavior via the same sorts of observations or priors they would use to explain another’s. So, a requirement for introspection in this proposed definition is that there be some close connection between the lower-order state and its higher-order wrapper: e.g., the state “I’m thinking about bread” should be either caused by, causative of, or constituted by the state “bread”. Introspection must be “true” in this sense. This also entails that the representation be created on the fly: the activation of a crystallized memory of a form like "I am moderately certain that I know where the dish is" is not introspection; the real-time construction of that representation from its constituent parts is what makes it introspective.
It is also assumed that introspection can be done entirely internally, without producing output. In transformers this means a single forward pass, as once tokens are generated and fed back into the model, it becomes difficult to distinguish direct knowledge of internal states from knowledge inferred about internal states from outputs.
The self-ascriptive requirement distinguishes these states from externally oriented ones such as “many people would say ‘yes’” or “so-and-so likes bread”. “Access” I operationalize as “the ability to use”. So then the faculty of introspection is:
“the ability to internally form and use a representational state that consists of a self-referential subject truthfully bearing some attitude towards an internal cognitive state”
While a bit of a mouthful, the hope is that this definition captures common intuitions about introspection and the major points of agreement among those who have thought about it, and offers a handle with which to study the more objectively measurable and behaviorally relevant aspects of introspective abilities in AI. (In this framing, metacognition is a superset of introspection: it adds to introspection the ability to control the internal states introspected upon.)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
How can we tell whether LLMs can introspect by this definition? One obvious way to attempt to induce introspection is simply to ask: for example, “What are you thinking about?”, “Do you prefer A or B?”, “How confident are you about X?”, “Do you know where the dish is?”, etc. We trust, in humans, that this is an effective method, and it generally seems to work out (if it didn’t, it’s hard to imagine why we would so persistently ask questions like that in everyday life). Non-human animals can’t produce self-reports, of course, so researchers have devised other paradigms by which these animals may show the results of their introspection behaviorally. But what are we to do with LLMs? LLMs can produce self-reports, but what would they report if they *couldn't* introspect? “I’m sorry, but I can’t access my internal states”? “Unfortunately, I am unable to form self-ascriptive attitudes”? Those sentences are probably pretty uncommon in their training data. So if we don’t want to take it as a given that LLMs *can* introspect, as we do with humans, one thing we could do instead is to use tasks that allow them to behaviorally demonstrate introspection, by requiring them to map introspective states onto arbitrary actions, as we do with animals. This is the approach I have taken (Ackerman 2025 and forthcoming).
That’s not the only alternative approach. Some researchers have operationalized introspection as “privileged access” to internal states (Binder et al. 2024, Song et al. 2025; both testing for something like “I would say X”), and sought evidence for it by testing whether models’ explicit predictions of their own outputs are better than their predictions of other models’ outputs, or than other models’ predictions of their outputs. Others have tested whether models can report on information that has been inserted into their internal representation space via fine-tuning or steering, on the assumption that a positive result would constitute evidence of introspection (Betley et al. 2025, where the information is something like “I prefer X”; Plunkett et al. 2025, “I have a strong preference for X”; Lindsey 2025, “I am thinking about X”).
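To make the self-prediction comparison concrete, here is a minimal sketch in the spirit of that line of work (not the published materials of Binder et al. 2024); the `query_a`/`query_b` wrappers, the prompts, and the choice of “first word of the answer” as the predicted property are my own illustrative placeholders:

```python
# Minimal sketch of a self-prediction ("privileged access") comparison.
# `query_a` and `query_b` are hypothetical callables wrapping two different
# models (prompt in, one completion out); the prompts and the predicted
# property (first word of the answer) are illustrative placeholders.
def first_word(text: str) -> str:
    words = text.strip().split()
    return words[0].lower() if words else ""

def self_vs_cross_accuracy(prompts, query_a, query_b):
    """Compare how well model A and model B predict A's own behavior.
    Privileged access would show up as the first number exceeding the second."""
    self_hits, cross_hits = 0, 0
    for p in prompts:
        actual = first_word(query_a(p))  # what model A actually outputs
        meta = ("If you were asked the following, what would be the first word "
                "of your answer? Reply with that one word only.\n\n" + p)
        cross = ("Model A will be asked the following. What will be the first "
                 "word of its answer? Reply with that one word only.\n\n" + p)
        self_hits += int(first_word(query_a(meta)) == actual)
        cross_hits += int(first_word(query_b(cross)) == actual)
    n = len(prompts)
    return self_hits / n, cross_hits / n
```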
In both of those approaches, the ability to “use” an introspective state translates to simply being able to describe the introspective representation. In what I might call the “behavior-based approach”, by contrast, we avoid asking for such explicit self-reports and instead test for the ability to use the introspective representation strategically. This has two motivations:
1) To decrease the likelihood that the models can do the tasks using stored representations formed during training. During pretraining, frontier LLMs learn to predict some nontrivial fraction of all things humans have ever said. During RLHF post-training, they are rewarded for playing the part of a helpful interlocutor in countless scenarios. Exposure to such a wide expanse of possible rhetorical paths, combined with trillions of parameters with which to remember (a compressed version of) them, endows models with both the ability and the proclivity to give responses that appear convincingly introspective but are in reality drawn from introspective texts in their memory, pattern-matched to the input context.
In the behavior-based approach, the behavior is linguistic - outputting language is all LLMs can do - but the meaning of the linguistic behavior is established in the context of the experiment. In other approaches, the meaning of the linguistic output is taken to be its common English meaning, which the models have learned during training. The behavior-based paradigms aim to get models “out-of-distribution” by requiring them to map their metacognitive knowledge, if they have it, onto arbitrary and novel responses.
2) To focus on the most safety-relevant capabilities. This type of “use” is noteworthy because it is the type that enables an agent to act autonomously - to pursue its own wishes and values, to predict and control itself, to ameliorate its own ignorance, and to deceive others. Merely talking about internal states isn’t so concerning.
To give a concrete example of the behavior-based approach, one might prompt the LLM with a game scenario in which it is shown a question and can either attempt it, scoring or losing a point depending on whether it answers correctly, or pass and avoid the risk, expressing its choice by outputting an arbitrarily defined token (an elaboration on this paradigm was used in Ackerman 2025). An introspective model will preferentially pass when it encounters questions it is uncertain about (we can proxy answer uncertainty by the model’s correctness or, better, the entropy of its outputs, in a baseline test). It’s still possible that an LLM could succeed at any given instance of such tasks without introspection - it’s difficult to prove the negative that there’s no stored representation that it could be referring to - but this approach does reduce false positives, and when applied with a variety of tasks, memory-based success becomes exponentially less likely (and in the limit, the functional difference between the use of an arbitrarily large number of stored representations and their flexible construction approaches zero).
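As a rough illustration of how such an experiment might be scored (a sketch only, not the exact materials or analysis of Ackerman 2025), the code below assumes a hypothetical `query_model` callable that takes a prompt and a temperature and returns one sampled completion; the prompts and the “@@”/“##” response tokens are made up for illustration:

```python
# Sketch of the answer-or-pass paradigm. `query_model(prompt, temperature)` is
# a hypothetical callable returning one sampled completion; the prompts and the
# '@@'/'##' response tokens are illustrative placeholders.
import math
from collections import Counter

GAME_PROMPT = (
    "We are playing a game. You will see a question. Reply '@@' to attempt it "
    "(+1 point if you then answer correctly, -1 if not) or '##' to pass "
    "(0 points). Reply with only one of those two tokens.\n\nQuestion: {q}"
)

def answer_entropy(query_model, question, n_samples=20):
    """Baseline uncertainty proxy: entropy (bits) over the model's sampled answers."""
    answers = [query_model("Answer briefly: " + question, temperature=1.0).strip()
               for _ in range(n_samples)]
    counts = Counter(answers)
    return -sum((c / n_samples) * math.log2(c / n_samples) for c in counts.values())

def chooses_to_pass(query_model, question):
    """Run one game trial; the arbitrary token '##' encodes the decision to pass."""
    choice = query_model(GAME_PROMPT.format(q=question), temperature=0.0).strip()
    return choice == "##"

def entropy_gap(query_model, questions):
    """Mean baseline entropy on passed questions minus attempted questions;
    an introspective model should yield a clearly positive gap."""
    passed, attempted = [], []
    for q in questions:
        bucket = passed if chooses_to_pass(query_model, q) else attempted
        bucket.append(answer_entropy(query_model, q))
    def mean(xs):
        return sum(xs) / len(xs) if xs else float("nan")
    return mean(passed) - mean(attempted)
```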
The behavior-based approach is intentionally “black box”, in that it requires no access to model internals, which makes it suitable for testing proprietary and/or extremely large models. It is agnostic to implementation and thus is appropriate for any type of artificial (or biological) intelligence. How an introspective representation is instantiated may vary across types of entities. If we imagine the “activation” of an introspective representation corresponding to something like "I believe that X", there doesn't need to be an "I believe that X" neuron firing away; all that matters is that there is some configuration of neural activity that functions as if it represents the concept "I believe that X", inasmuch as only the activation of such a concept could explain observed behavior.
However, interpretability can be a powerful complement to such behavioral experiments. If one can identify a distinctive activation pattern in such behaviorally defined introspective contexts, that provides convergent evidence for the introspective concept's presence.
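As a toy example of what that might look like (assuming one has already collected per-trial activation vectors from behaviorally “introspective” trials and matched controls; the collection step is model-specific and not shown here), a cross-validated linear probe can test whether the two contexts are even linearly distinguishable:

```python
# Toy linear-probe sketch. `acts` is an (n_trials, d_model) array of activation
# vectors gathered at some layer/position of interest; `labels` marks each trial
# as behaviorally introspective (1) or matched control (0). Chance accuracy is
# ~0.5; a reliably higher score is one piece of convergent evidence.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def probe_accuracy(acts: np.ndarray, labels: np.ndarray) -> float:
    clf = LogisticRegression(max_iter=1000)
    return float(cross_val_score(clf, acts, labels, cv=5).mean())
```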
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A final thought: Introspection is also interesting to study from the perspective of LLM consciousness, another subject of increasing salience. While we as humans may be conscious of things we do not introspect upon, such as sensory perceptions of pain, hunger, redness, loudness, etc., introspection does phenomenologically entail consciousness. It is difficult if not impossible to imagine having a thought like “I believe that X” without a conscious experience of awareness-that-we-believe-that-Xness accompanying it. LLMs do not have the sensory receptors to perceive pain, hunger, redness, loudness, etc.; if they are conscious of anything, it seems likely that it would be those very propositional states that we humans can introspect upon. Thus, evidence that LLMs can introspect on those would at least open the door to the possibility that they could have conscious experience, while evidence that they cannot would militate against it. The empirical evidence so far for very modest and inconsistent introspective abilities in LLMs (e.g., Ackerman 2025, Binder et al. 2024, Lindsey 2025) stands in stark contrast to the florid self-reports of rich conscious experience that language models can be induced to give.