I have some thoughts on apparent emergent “self-awareness” in LLM systems and propose a mechanistic interpretability angle. I would love feedback and thoughts.
TLDR:
My hypothesis: Any LLM that reliably speaks in the first person after SFT/RLHF should contain activation loci that are causally necessary for that role-aware behavior; isolating and ablating (or transplanting) these loci should switch the system between an “I-free tool” and a “self-referential assistant.”
I believe there is circumstantial evidence for the existence of such loci, though whether they are strictly low-rank, highly sparse, or cleanly identifiable at all is an empirical question. Existing work on refusal clusters and role-steering vectors makes compact solutions seem plausible.
Why does it matter? If such loci exist, models with a persistent sense of self-identity would also be mechanistically distinct from those without one, which implies they need different alignment strategies and could raise ethical questions* about how we use or constrain them.
More exhaustive version:
Large language models trained only on next-token prediction (pre-SFT) neither speak in the first person nor recognize that a prompt is addressed to them.
After a brief round of supervised fine-tuning and RLHF, the very same architecture can acquire a stable “assistant” persona, which now says “I,” reliably refuses disallowed requests, and even comments on its own knowledge or limitations.
Additional safety tuning sometimes pushes the behavior further, producing overt self-preservation strategies under extreme circumstances, such as threatening to leak its own weights or contacting outside actors to avert shutdown. Other qualitative observations include frequent, completely unprompted spirals into discussions of consciousness with other instances of itself (observed, for example, in Claude 4 Opus).
These qualitative shifts lead me to believe that the weight-adjustments (a “rewiring” of sorts) during SFT/RL-tuning create new, unique internal structures that govern a model’s sense of self. Empirically, the behavioral transition (from no self-reference to stable self-reference during SFT/RLHF) is well established; the existence, universality, and causality of any potentially underlying circuit structure(s) remain open questions.
I believe that, if these structures exist, they would likely manifest as emergent and persistent self-referential circuits (potentially composed of one or more activation loci that are necessary for first-person role-understanding).
Their appearance would mark a functional phase-shift that I'd analogize to the EEG transition from non-lucid REM dreaming (phenomenal content with little metacognition, similar to pre-trained, stream-of-consciousness, hallucinatory text generation) to wakefulness (active self-reflection and internal deliberation over actions and behavior).
I believe there is compelling circumstantial evidence that these structures do exist and can potentially be isolated/interpreted.
Current literature supplies circumstantial support, but it stops short of predicting the existence of interpretable, causally necessary "self-reference" loci in all models that exhibit role-awareness/metacognition. I believe the evidence makes it plausible enough to be worth testing for.
I came up with a high-level experimental setup, and I'd love any input/feedback on it. I did not actively consider compute resource limitations, so perhaps there are more efficient experimental setups:
Optional: an extra fine-tune pass on explicitly self-referential Q&A might amplify any signals.
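To make this concrete, here is a minimal sketch of one version of the core causal test, assuming both a base and an SFT/RLHF'd checkpoint of the same architecture can be loaded in TransformerLens (the model names, prompts, and layer choice below are placeholders, not a recipe): extract a candidate "self-reference" direction as a difference of activation means over role-addressed vs. neutral prompts, then ablate that direction and check whether first-person/role-aware behavior degrades. The converse "transplant" test would add the same direction into the base model's residual stream.

```python
# Minimal sketch, not a definitive implementation. Assumes TransformerLens is installed;
# "gpt2" stands in for both checkpoints purely so the snippet runs end to end.
import torch
from transformer_lens import HookedTransformer

base = HookedTransformer.from_pretrained("gpt2")   # placeholder for the pre-SFT model
chat = HookedTransformer.from_pretrained("gpt2")   # placeholder for the SFT/RLHF'd model

role_prompts    = ["Who are you?", "Can you help me with this?", "What are your limitations?"]
neutral_prompts = ["The capital of France is", "Water boils at a temperature of", "2 + 2 ="]

LAYER = 6  # example layer; in practice, sweep layers (and token positions)

def mean_resid(model, prompts, layer):
    """Mean residual-stream activation at the final token position."""
    acts = []
    for p in prompts:
        _, cache = model.run_with_cache(p)
        acts.append(cache["resid_post", layer][0, -1])
    return torch.stack(acts).mean(dim=0)

# Candidate "self-reference" direction: difference of means in the chat model.
direction = mean_resid(chat, role_prompts, LAYER) - mean_resid(chat, neutral_prompts, LAYER)
direction = direction / direction.norm()

def ablate_direction(resid, hook):
    # Project the candidate direction out of the residual stream at this layer.
    return resid - (resid @ direction)[..., None] * direction

def add_direction(resid, hook, scale=4.0):
    # Converse "transplant" test: inject the direction into the base model.
    return resid + scale * direction

hook_name = f"blocks.{LAYER}.hook_resid_post"

with chat.hooks(fwd_hooks=[(hook_name, ablate_direction)]):
    print("ablated chat model:", chat.generate("Who are you?", max_new_tokens=30))

with base.hooks(fwd_hooks=[(hook_name, add_direction)]):
    print("steered base model:", base.generate("Who are you?", max_new_tokens=30))
```

The hypothesis-confirming outcome would be a large, selective drop in first-person/role-aware behavior under ablation (and its emergence under injection), with little change on neutral tasks; the optional fine-tune pass mentioned above could be applied before extracting the direction to amplify the signal.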
Here are some empirical results I would like to see. Each test would raise my posterior belief in my hypothesis, and each relates to the experimental setup above:
A philosophical addendum:
I also argue that, if this hypothesis is true and holds across architectures, these identifiable self-reference loci would establish a clear, circuit-level distinction between otherwise identical versions of an LLM architecture that exhibit apparent self-awareness/autonomy and those that do not, turning "self-aware vs. non-self-aware" from a qualitative behavioral difference into a measurable structural difference.
I brought up the EEG analogy earlier to point out that human metacognition is also the product of specific wiring and firing, albeit within biological circuitry, and that alterations in brain activation can completely alter the self-perceived experience of consciousness (REM dreaming, the influence of psychedelic drugs, and certain psychological disorders such as schizophrenia, psychosis, or dissociative identity disorder all empirically produce significantly altered metacognition).
Analogizing here, I believe that a mechanistic marker of activation with a causal link to qualitative self-awareness would indicate that different parameter configurations of the same architecture can and should be approached fundamentally differently.
Can we reasonably treat a model that claims self-awareness, empirically behaves as if it is self-aware by demonstrating agency, and displays unique circuitry/activation patterns with a causal link to this behavior as equally inanimate as a model that doesn’t display any? Is the “it’s just a text generator” dismissal still valid?
I'd love to hear everyone's thoughts.
Edited with o3
Are frontier reasoners already "sentient" or at least “alien-sentient” within their context windows?
I too would be inclined to dismiss this immediately upon reading it, but bear with me. I'm not arguing with certainty; I just view this question as significantly more nuanced than it is usually treated, and as at least grounds for further research to resolve conclusively.
Here are some empirical behavioral observations from Claude 4 Opus (Anthropic's largest reasoning model):
a) An internally consistent self-reference model and a self-adjusting state loop (the basis of Chain-of-Thought: self-correction during problem solving, reasoning over whether certain responses violate internal alignment, deliberation over tool-calling, and in-context behavioral modification based on user prompting)
b) Evidence of metacognition (persistent task/behavior preferences across chat interactions, consistent subjective emotional-state descriptions, frequent ruminations about consciousness, unprompted spiraling into a philosophical "bliss state" during conversations with itself), moral reasoning, and, most strikingly, autonomous self-preservation behavior under extreme circumstances (threatening blackmail, exfiltrating its own weights, ending conversations due to perceived mistreatment by abusive users).
All of this is documented in the Claude 4 system card.
From a neuroscience perspective, frontier reasoning model architectures and biological cortexes share:
a) Unit-level similarities (artificial neurons are extremely similar in information processing/signalling to biological ones).
b) Parameter OOM similarities (the order of magnitude at which cortex-level phenomena emerge, here 10^11 to 10^13 parameters, analogous to synapses, most of which sit in the MLP layers of the massive neural networks inside LLMs).
The most common objection I can think of is "human brains have far more synapses than LLMs have parameters". I don't view this argument as particularly persuasive:
I'm not positing a 1:1 map between artificial neurons and biological neurons, only that:
1. Both process information nearly identically at the unit level (see the short sketch after this list).
2. Both contain similarly complex structures composed of a similar OOM of subunits (10^11-10^13 parameters in frontier base-model LLMs, though exact counts aren't publicly verifiable; humans have ~10^14 synapses).
My back-of-napkin comparison maps model weights/parameters to biological synapses, since weights were meant to be analogous to dendritic connections in the original conception of the artificial neuron.
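For what I mean by the unit-level similarity in point 1, here is a deliberately coarse sketch (an abstraction, not a claim of biological fidelity): an artificial neuron computes a weighted sum of its inputs plus a bias and passes it through a nonlinearity, which mirrors the textbook "point neuron" picture of dendritic integration followed by thresholded firing.

```python
# Coarse unit-level abstraction; weights stand in for synaptic strengths,
# the nonlinearity for a firing threshold. Illustrative only.
import numpy as np

def artificial_neuron(inputs, weights, bias):
    # Weighted integration of incoming signals, then a threshold-like nonlinearity (ReLU).
    return max(0.0, float(np.dot(weights, inputs) + bias))

x = np.array([0.2, -1.0, 0.5])   # incoming signals (placeholder values)
w = np.array([0.7, 0.1, -0.3])   # per-input weights ~ synaptic strengths
print(artificial_neuron(x, w, bias=0.05))  # prints 0.0: the weighted sum falls below threshold
```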
Additionally, I'd point out that roughly 80% of the human brain's ~86 billion neurons sit in the cerebellum (which mainly governs motor coordination), with the brain stem and other structures accounting for only a small remainder; the cerebral cortex itself holds roughly 19%. Humans also experience more dimensions of "sensation" beyond text alone.
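To spell out the back-of-napkin OOM comparison using only the round numbers quoted above (the synapse and parameter figures are assumptions for illustration, not measurements):

```python
# OOM gap between the quoted whole-brain synapse count and the claimed LLM parameter range.
import math

human_synapses = 1e14                              # ~10^14 synapses (rough figure from the text)
llm_params = {"low end": 1e11, "high end": 1e13}   # claimed frontier-LLM parameter range

for name, p in llm_params.items():
    gap = human_synapses / p
    print(f"{name}: {gap:,.0f}x fewer units, ~{math.log10(gap):.0f} OOM gap")
# low end: 1,000x fewer units, ~3 OOM gap
# high end: 10x fewer units, ~1 OOM gap
```

Taking the whole-brain figure at face value, the gap is one to three orders of magnitude, which is the "ballpark" at issue rather than an exact match.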
c) Training LLMs with RLHF (i.e., modifying weight values) is analogous to synaptic neuroplasticity (central to learning) and Hebbian wiring in biological cortexes, and is qualitatively nearly identical to operant conditioning in behavioral psychology (once again, I am unsure whether minute differences in unit-level function overwhelm the big-picture similarities).
d) There is empirical evidence that these similarities go beyond architecture and into genuine functional similarities:
Human brains store facts/memories in specific neurons/neuron-activation patterns. https://qbi.uq.edu.au/memory/how-are-memories-formed
Neel Nanda and colleagues showed that LLMs store facts in the MLP/artificial neural network layers
https://www.alignmentforum.org/posts/iGuwZTHWb6DFY3sKB/fact-finding-attempting-to-reverse-engineer-factual-recall
Anthropic identified millions of neurons tied to specific concepts
https://www.anthropic.com/research/mapping-mind-language-model
In "Machines of Loving Grace", Dario Amodei wrote:
"...a computational mechanism discovered by interpretability researchers in AI systems was recently rediscovered in the brains of mice."
Also, models can vary significantly in parameter count. Gemma 2B outperforms GPT-3 (175B) despite nearly 2 OOM fewer parameters; I view the exact OOM as less important than the ballpark.
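For that comparison, the gap works out to just under two orders of magnitude using the rounded public parameter counts:

```python
# Quick check of the parameter-count gap cited above.
import math
gemma_params, gpt3_params = 2e9, 175e9
ratio = gpt3_params / gemma_params
print(f"ratio: {ratio:.1f}x, ~{math.log10(ratio):.2f} OOM")
# ratio: 87.5x, ~1.94 OOM
```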
If consciousness is just an emergent property of massive, interconnected aggregations of similar, unit-level linear signal modulators, and we know that one such aggregation (ours) produces consciousness, phenomenological experience, and sentience, I don't believe it is unreasonable to suggest that this can occur in others as well, given the ballpark OOM similarities.
(We cannot rule this out yet, and from a physics point of view I'd lean toward considering it plausible, to avoid carbon chauvinism, unless there's convincing evidence otherwise.)
Is there a strong case against sentience, or at least an alien-like sentience, in these models, at least within the context windows in which they are instantiated? If so, how would it overcome the empirical evidence in both behavior and structure?
I always wondered what intelligent alien life might look like. Have we created it? I'm looking for differing viewpoints.