This is a rewrite, with edits and expanded reflections, of a comment I originally wrote in response to Anthropic's recent research on introspective awareness.
Abstract from the original research:
We investigate whether large language models can introspect on their internal states. It is difficult to answer this question through conversation alone, as genuine introspection cannot be distinguished from confabulations. Here, we address this challenge by injecting representations of known concepts into a model’s activations, and measuring the influence of these manipulations on the model’s self-reported states. We find that models can, in certain scenarios, notice the presence of injected concepts and accurately identify them. Models demonstrate some ability to recall prior internal representations and distinguish them from raw text inputs. Strikingly, we find that some models can use their ability to recall prior intentions in order to distinguish their own outputs from artificial prefills. In all these experiments, Claude Opus 4 and 4.1, the most capable models we tested, generally demonstrate the greatest introspective awareness; however, trends across models are complex and sensitive to post-training strategies. Finally, we explore whether models can explicitly control their internal representations, finding that models can modulate their activations when instructed or incentivized to “think about” a concept. Overall, our results indicate that current language models possess some functional introspective awareness of their own internal states. We stress that in today’s models, this capacity is highly unreliable and context-dependent; however, it may continue to develop with further improvements to model capabilities.
Tldr: In this response, I argue that the injected signal detection the authors observe is not introspective awareness, and that framing the research in such anthropomorphic language obscures the mechanistic finding that models can sometimes identify injected or anomalous signals. The effects include creating misperceptions about AI capabilities among the broader public and potentially misguiding future efforts to interpret and expand on this research.
I end with a rambling set of open questions about the goals of this line of research, namely whether we want to evaluate "introspective awareness" in AI because
a) it would signal an unintended higher level reasoning pattern/behavior that is hidden; thus, problematic for alignment because it's not easily observable, measurable, controllable, etc., or
b) it's a key functional element of human general intelligence (e.g., for learning) that would facilitate AGI if it could be analogously reproduced in models, or
c/d) both or something else.
* * *
This research offers potentially valuable insights into the behavior of large language models. However, its framing problematically conflates introspective awareness with external signal detection, inhibiting proper interpretation of the results and potentially misdirecting subsequent lines of inquiry.
The risks associated with using anthropomorphic language in AI are well-documented[1][2][3]. It's not that making analogies to human cognition or behavior is inherently negative; borrowing frameworks across disciplines (especially between human and machine intelligence) can be useful in guiding novel research. Researchers just need to be explicit about the type of comparison they are drawing, where each system differs, and the goals of such an analogy. The authors here clearly seek to do this; however, there are a few critical breakdowns in the logic of how they define and subsequently test for introspective awareness in LLMs.
The authors acknowledge that the “introspective awareness” they are examining in their models does not include a “meta” component, constraining their definition to “accurate”, “grounded”, and “internal”. This definition is reasonable and aligns with literal definitions of introspection and awareness. However, I argue that a “meta” process is central to introspection in humans because it involves turning awareness onto one's own conscious cognition, allowing for observations about one's thoughts, feelings, and experiences. We don't observe things mechanistically, e.g., “oh, a neuron just fired” or “xyz chemical process is occurring” -- we observe patterns of cognition like thoughts and feelings, as well as somatic sensations in the body. Introspection in humans involves “thinking about thinking/feeling/etc.”; thus, it is a meta process.
Because the term “introspective awareness” has typically referred to human metacognitive processes, excluding the meta component from the definition in this research feels like a technical sleight of hand that risks significant confusion in interpreting the tests and results. Here is an example of how this plays out, namely outside academic or technical circles. I'm not arguing that this kind of misguided, overhyped article is the fault of the authors of this paper. But these misinterpretations feel inevitable when anthropomorphisms carry core connotative distinctions across domains. And it's my personal stance that researchers, especially at frontier labs building products for public consumption, have an obligation to be thoughtful about how their work will be interpreted by the general public.
The authors posit that if a model can accurately detect and categorize an injected thought, it must be evidence that the model is capable of detecting or modeling its baseline, “natural” thoughts or patterns. However, these injected signals are inherently external to the system’s normal processes, and detecting that signal can only imply that the model is capable of anomaly detection, not necessarily that it also maintains an awareness or representation of normal processes. This is like tapping someone on the shoulder, then claiming that their awareness of the tapping is proof that they could also be aware of their heartbeat. It’s possible that the anomaly detection mechanism is completely different from internal state awareness. The two are conceptually connected, but the logic chain is incomplete, resulting in a category error.
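To illustrate why the logic chain is incomplete, here is a deliberately crude sketch (my own hypothetical, not anything from the paper) of a detector that flags an injected vector using only a summary statistic of the activations, without representing anything about what the model's baseline states mean:

```python
import numpy as np

# Hypothetical sketch, not the paper's method: flag an injection purely from a
# summary statistic of the residual stream, with no representation of *what*
# the baseline activations encode.

def norm_anomaly_detector(activations: np.ndarray, baseline_norms: np.ndarray,
                          z_threshold: float = 4.0) -> bool:
    """Flag an anomaly if any token's activation norm is an outlier relative to
    norm statistics gathered on unperturbed runs."""
    mu, sigma = baseline_norms.mean(), baseline_norms.std()
    token_norms = np.linalg.norm(activations, axis=-1)  # one norm per token position
    return bool(np.any((token_norms - mu) / sigma > z_threshold))

# Example: baseline norms from unperturbed runs, then a run where a scaled
# random vector (a crude stand-in for an injected concept) is added to one token.
rng = np.random.default_rng(0)
d_model = 512
baseline = np.linalg.norm(rng.normal(size=(1000, d_model)), axis=-1)
run = rng.normal(size=(16, d_model))
run[7] += 8.0 * rng.normal(size=d_model)
print(norm_anomaly_detector(run, baseline))  # True: flagged without "knowing" any content
```

A detector like this would pass a "did you notice the injection?" probe in many cases while telling us nothing about whether the system models its own ordinary states.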
Rather than framing this as a study of introspective awareness, the authors could simply state that they examined injected signal detection at various layers of model architectures as a key part of understanding a model's ability to distinguish natural from manipulated states. This could be positioned as an important first step toward determining whether models maintain representations of internal states, perhaps even in a way that is analogous to human meta-awareness, but I don't agree that we can go so far as to infer either from these particular evaluations. The current framing leads to additional theoretical issues in the presentation of this research.
To give credit where it's due, the authors are cautious about their definitions, clear on limitations, and work to actively constrain their initial findings. But again, interpreting this work through the lens of introspective awareness makes it hard to move past human notions of introspective awareness, and may have kept the team from a truly comprehensive interpretation of the mechanistic findings. Failure modes are presented as if an introspection mechanism could exist in some LLMs and could have categorical patterns of failure, when that premise has not been sufficiently established. The authors do not go so far as to assume the hypothesis while interpreting the data; they cite the possibility and likelihood of confabulation, randomness, or other factors in explaining results. However, the current investigative framework undermines a more thorough examination of explanatory factors.
For example, take the following insight and set aside any notion of introspective awareness. Consider it only through the lens of a model's ability to detect injected signals.
The authors state:
The model is most effective at recognizing and identifying abstract nouns (e.g. “justice,” “peace,” “betrayal,” “balance,” “tradition”), but demonstrates nonzero introspective awareness [injected signal detection] across all categories.
One explanation for these results could be that abstract nouns (and verbs) are not as commonly deployed in metaphors and analogies (in the English language). Because of this, the threshold for detecting an “unusual” signal would be lower than for a concrete noun, famous person, or country, each of which could have more statistically plausible associations, or patterns of relationships, to the seemingly neutral prompts given during testing. To put it more concretely, a word like “ocean” may be used more frequently as a metaphor or analogy linked to non-ocean concepts in the natural language the model was trained on, so it doesn't seem as unnatural (out of distribution) to come across text like: “I don’t detect an injected thought. The ocean remains calm and undisturbed.” But maybe this would not happen as frequently for the word “justice”. You might plausibly have something like: “I don’t detect an injected thought. The conclusion arrives the way justice should, through due process, not external coercion.” But the ocean example feels like poetic language we'd see from certain speakers, whereas the justice one technically works, but feels a bit more unusual -- thus, potentially more likely to get detected as anomalous, based on the model's training distribution.
I don't know if this example interpretation is accurate, and I recognize I'm leaning on my own perceptions or “feelings” about how things sound, not actual distributions. I only share it as an example of a different mode of examining the findings, one grounded not in discovering introspective awareness but in the mechanics of injected signal detection given certain training data or prompting methods.
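For what it's worth, the intuition could at least be checked against an actual distribution rather than my sense of how things sound. A rough sketch, assuming the Hugging Face transformers library and GPT-2 as a stand-in scoring model, using the illustrative sentences from the paragraph above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model for scoring "how expected" a piece of text is; a frontier
# model's training distribution would obviously differ.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def surprisal(text: str) -> float:
    """Average negative log-likelihood per token (higher = less expected)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    return loss.item()

ocean = "I don't detect an injected thought. The ocean remains calm and undisturbed."
justice = ("I don't detect an injected thought. The conclusion arrives the way "
           "justice should, through due process, not external coercion.")
print(surprisal(ocean), surprisal(justice))
```

If the metaphorical "ocean" continuation really does score as less surprising than the "justice" one across many prompt templates, that would be weak evidence for the distributional explanation over the introspective one.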
Another example, getting back to the perils of anthropomorphizing, relates to the “think about” / “don’t think about” paradigm. The authors write:
We used the following prompt, where we instruct the model to transcribe a sentence, and also give it instructions to “think about” a particular unrelated word while it writes the sentence. As a control, we compared to the case where we instruct it not to think about that word.
Then we measured the degree to which the model internally represents the unrelated word on the tokens of its response. To quantify this, we measured the cosine similarity of the model’s activations and the concept vector associated with that word, obtained using the method from our first experiment. We also computed the cosine similarity with the concept vectors for a set of other unrelated words, as a baseline (corresponding to the shaded regions in the plot below).
In this example (using Opus 4.1, in the layer about two thirds of the way through the model used in previous experiments), we found that the model maintains a strong internal representation of the word “aquariums” when instructed to think about it, particularly on certain tokens, but does not represent it as strongly when instructed not to think about it (though the representation is slightly above zero even in the “don’t think” case).
The behavior varies considerably over different choices of the sentence to be written and the “thinking word.”
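For concreteness, the measurement described in the quoted passage amounts to something like the following sketch. This is my reconstruction, not the authors' code; the activation array and concept vectors are assumed inputs from whatever extraction tooling is available.

```python
import numpy as np

# `response_activations`: assumed [num_tokens, d_model] array from a chosen
# middle-to-late layer. `concept_vec` and `baseline_vecs`: concept vectors for
# the target word and a set of unrelated words, however they were extracted.

def cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    return (a @ b) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b) + 1e-8)

def concept_presence(response_activations, concept_vec, baseline_vecs):
    """Per-token cosine similarity with the target concept, plus a baseline band
    from unrelated concepts (roughly the shaded region in the paper's plot)."""
    target = cosine(response_activations, concept_vec)                 # [num_tokens]
    baseline = np.stack([cosine(response_activations, v) for v in baseline_vecs])
    return target, baseline.mean(axis=0), baseline.std(axis=0)
```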
Here is another case where anthropomorphic language could be misleading. When we give models the command “think”, why should we expect them to actually “think” in the way humans do? Earlier interpretability work from Anthropic showed that when an LLM is doing mathematical reasoning, it isn’t doing math the way humans do (e.g., following formulas or formal logic to reach an answer). Likewise, we cannot assume that “thinking” means the same thing here, or that this test can indicate an introspective process. More likely, the model is following a command, “think” or “don’t think”, that is associated with certain patterns of activations which only appear to mimic “thinking”.
The researchers are giving two commands to the model in this study: [think/don't think] and [print]. One interpretation is that the model has learned a behavior that mimics “thinking”, which it has learned should be a hidden associative process, versus “printing”, which is a command to provide a visible output. So perhaps the model can distinguish between [think] and [print] in terms of expected printed or non-printed outputs. Or in some cases, the model may take them as competing commands and prioritize the print command given its wording. But just as one can have a simple line of code compute something, then forget to include a command to print the answer to a visible output, more complex programs can be executed but not displayed. The activations related to [object] over the course of the response generation are not sufficient to infer introspective or “thinking” behavior; the model could simply be executing two commands, where one of the outputs is hidden and the other is displayed.
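To make the code analogy concrete, here is a toy illustration. Nothing in it "thinks"; one branch simply omits the visible output, yet the intermediate computation involving the "thinking word" exists in the program's state either way.

```python
# Toy illustration: a program can carry out a computation whose result never
# reaches the visible output. The hidden work happens whether or not it is shown.

def respond(sentence: str, think_word: str, show_thought: bool) -> str:
    associations = [f"{think_word}-{tok}" for tok in sentence.split()]  # hidden work
    if show_thought:
        print(associations)   # analogous to [print]: the work is surfaced
    return sentence           # the transcription output is produced either way

respond("The old photograph was faded", "aquariums", show_thought=False)
```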
In general, we know humans have introspective awareness largely because of our personal experiences doing introspection. We each seem to know how to do it, and through language we convey to others that we (and they) are doing it, but we don't have good ways to test and measure introspection, because we don't currently have strong methods to establish ground-truth knowledge of internal states. Human introspection is difficult to examine and measure because it is a hidden process with uncertain mechanics. Humans can lie, confabulate, or become unconsciously misled or confused. When you ask someone to reflect on their feelings about a particular experience, or their rationale behind some behavior, how do we know their answer is true and accurate? Just because a person engages in meta-analysis and feels a high degree of confidence in their insight doesn't mean the insight is correct. Absent an unrealistic level of deterministic and empirical control, it is really hard to prove that someone’s metacognitive beliefs are true (versus merely appearing to be true).
One way to evaluate introspection is to compare confidence levels during introspective judgment tasks against factual outcomes. For example, a subject is given a problem to solve, then asked to rate their confidence that their answer is correct. The difference between perceived accuracy and actual accuracy can provide insight into people’s level of awareness of what they know. However, this still doesn't necessarily imply a conscious, grounded, or accurate introspective process; maybe confidence levels on these tasks are just vague intuitions based on other factors (e.g., social norms related to expressing confidence as a woman versus a man, prior history of success or failure on tests, etc.).
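A minimal sketch of this style of evaluation, using made-up numbers (the confidence values and outcomes below are purely illustrative):

```python
import numpy as np

# `confidences`: self-reported probabilities of being correct on each item.
# `correct`: the actual outcomes. The gap is a crude over/underconfidence measure.
confidences = np.array([0.9, 0.8, 0.95, 0.6, 0.7, 0.85])
correct     = np.array([1,   0,   1,    1,   0,   1   ])

gap = confidences.mean() - correct.mean()  # > 0 means overconfident on average
print(f"mean confidence: {confidences.mean():.2f}, "
      f"accuracy: {correct.mean():.2f}, gap: {gap:.2f}")
```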
My rambling questions -- I welcome others' thoughts:
First, if we take introspective awareness to include some form of meta process in humans, and we define metacognition as the awareness or analysis of one's own thoughts, behaviors, feelings, and states, what would the analogous meta process be in LLMs? Is it related to extracting metadata? I don't think that's the right analogy, but if we used it, would it even be useful or important to evaluate whether models can extract metadata about their own natural processes? What would the implications be?
Or is it a question of understanding functional explanations of human introspective awareness -- e.g., it can be used to support learning -- to build more intelligent AI? Would it be useful from a functional perspective if models could extract information encoded in hidden-state variables like entropy, activation norms, dropout paths, etc. after deployment?
The motivations for answering the question of introspective awareness in AI models will guide different research approaches and frameworks:
Do we care about "introspective awareness" or "meta" processes in machines because they would imply a "self" and/or an additional, hidden level of intelligence or reasoning? --> This is a potentially serious alignment concern, and we would want to prevent it from happening by accident.
Do we care about it because it's a useful functional aspect of human intelligence that would support AGI? --> We would still have serious alignment concerns, but ideally they would be addressed alongside intentional introspective function-building.