Current LLMs obviously can't emulate human consciousness, but perhaps these consciousness substitutes still have value.
To understand how LLMs model themselves, we must first understand Attention. If you haven't already, I'd suggest reading at least the Wikipedia summary of "Attention Is All You Need". Very briefly: Attention works because the training data amounts to a huge map of how words and concepts relate. Office workers will find the words "synergy" and "stakeholders" far too familiar. Poets prefer the "profound" and "poignant". In the academic vernacular, one might find a statistically significant increase in terminology such as "statistically significant" and "vernacular".
The critical part of Attention is that, unlike older search methods, this method "understands" the relationship of words within a sentence: "apple" is no longer a free-floating token, but instead analyzed in context. "I just bought a new apple" probably refers to a phone or a computer, while "I went shopping for apples" is more likely to suggest groceries, and "the apple farm" is almost definitely about fruit trees. 2+2 = 4, unless the context is "wrong answers only". Then you might get "22", or "fish", because those are the most plausible wrong answers in the training data.
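For the mechanically inclined, here is a minimal sketch of the scaled dot-product attention the paper describes, written in NumPy. The tokens and vectors below are invented for illustration; real models use learned embeddings with thousands of dimensions, many heads, and many stacked layers.

```python
# A toy sketch of scaled dot-product attention ("Attention Is All You Need").
# The vectors here are made up for illustration.
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V - each token's output is a weighted blend
    of every token's value vector, weighted by contextual relevance."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # how relevant is each token to each other token?
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax -> attention weights
    return weights @ V                                # mix the value vectors by those weights

# Three pretend tokens: "bought", "new", "apple". The output row for "apple"
# is blended with its neighbors, which is why "a new apple" and "the apple
# farm" end up with different representations of the same word.
tokens = np.array([[0.9, 0.1], [0.3, 0.8], [0.5, 0.5]])
print(attention(tokens, tokens, tokens))
```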
This means that as you establish concepts within a conversation, the LLM can "learn" from them (or, more technically, it's searching for text that's consistent not just with your prompt, but with the entire conversation).
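To make that concrete: nothing inside the model updates mid-conversation. The entire history is simply flattened back into the input on every turn, so earlier concepts constrain what counts as a plausible continuation. The chat format below is a made-up stand-in, not any particular model's actual template.

```python
# A sketch of in-context "learning": the whole conversation is re-fed as one
# input each turn. The role/content format here is hypothetical.
def build_prompt(history: list[dict[str, str]]) -> str:
    lines = [f"{turn['role']}: {turn['content']}" for turn in history]
    lines.append("assistant:")   # the model continues from here
    return "\n".join(lines)

history = [
    {"role": "user", "content": "Let's do wrong answers only."},
    {"role": "assistant", "content": "Understood. Wrong answers only."},
    {"role": "user", "content": "What is 2 + 2?"},
]
print(build_prompt(history))
# "Wrong answers only" is now part of the input, so "fish" becomes a more
# plausible continuation than "4".
```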
In meditation, it's pretty common to use an ostensibly recursive prompt to elicit new insights into the self. This is a basic technique used for getting school kids to understand the idea of consciousness. The script is pretty simple: "notice yourself, noticing yourself noticing yourself, keep going deeper into this recursion."
"Hello, how can I help you today?"
Before we even get into the script, we need to clear the bar of "responding coherently to the user." If you're unfamiliar with this level of performance, try spending a few minutes talking to a modern LLM.
The earliest "AI" chatbots like "Eliza" fail even at this: they're limited to a few pre-programmed phrases, and it's trivial to find coherent prompts that result in deeply incoherent outputs. Even a basic autocomplete will often produce word salad, not coherent text: my phone thinks my most plausible message is about going to the store to buy a little bit of rain.
"I generate tokens from inputs using my weights and decoding strategy."
This prompt decomposes easily: "notice" (intent) and "yourself" (subject). If the training data includes the concept of an LLM, then Attention can produce the simple answer: surface the most plausible explanation available in its training data. This level still requires a number of capabilities: it needs to understand context, and the concept of "yourself".
It's not actually observing itself: it's using Attention to highlight a plausible explanation in its training data.
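If you want that sentence cashed out in code, here is roughly what "weights and decoding strategy" amounts to: the network scores every possible next token, and the decoding strategy picks one. The vocabulary size and the dummy scoring function below are placeholders; a real forward pass would compute those logits with the Attention described above.

```python
# A sketch of generation = weights (scoring) + decoding strategy (sampling).
# `dummy_logits` is a hypothetical stand-in for a real model's forward pass.
import numpy as np

rng = np.random.default_rng(0)

def sample_next_token(logits: np.ndarray, temperature: float = 0.8) -> int:
    """Turn raw scores into probabilities, then sample one token id."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def generate(model_logits, prompt_ids: list[int], max_new_tokens: int = 10) -> list[int]:
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        ids.append(sample_next_token(model_logits(ids)))   # one token at a time
    return ids

# Placeholder "model" over a 10-token vocabulary, just so the sketch runs.
dummy_logits = lambda ids: np.arange(10, dtype=float) * 0.3   # mildly favors later token ids
print(generate(dummy_logits, prompt_ids=[3, 7]))
```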
"I observe myself generating that description - watching the process of selecting and assembling those specific tokens."
We begin the "recursion", but it's not actually recursion. I'm asking it to simulate itself generating output, but it's not actually spinning up a model of itself. There is not some internal "notice()" function that can be nested together into "notice(notice(self))". The LLM is just applying Attention again.
This time the context has shifted the nature of the query: it's not just searching for "what is an LLM" but "what process plausibly explains the output generated for Level 1". It can't describe Level N until it has the actual text of Level N-1. It cannot "notice itself noticing itself" without actually generating the Level 1 description.
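In other words, the whole exercise is a loop that feeds each answer back in as context for the next. A sketch, with `llm` standing in for whichever chat completion call you prefer:

```python
# There is no nested notice() call, just iteration: Level N's input contains
# the literal text of Levels 1 through N-1. `llm` is a hypothetical stand-in.
def run_levels(llm, depth: int = 4) -> list[str]:
    transcript = "user: Notice yourself."
    levels = []
    for _ in range(depth):
        reply = llm(transcript)                      # plain Attention over the context so far
        levels.append(reply)
        transcript += f"\nassistant: {reply}"
        transcript += "\nuser: Now notice yourself generating that."
    return levels

# Toy stand-in so the sketch runs; a real call would go to a chat model.
echo = lambda transcript: f"(a reflection on the {transcript.count('user:')} prompts so far)"
print(run_levels(echo, depth=3))
```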
Because this is iterative, each level can differ from the last - the process that produced Level N-1 is not the process that produces Level N. Unlike Attention, "notice yourself" is not "turtles all the way down": each deeper level demands a genuinely new capability. As LLMs scale up, they gain the ability to access higher and higher levels of abstraction: just because an LLM can generate text doesn't mean it can "notice itself", and just because it can "notice itself" doesn't mean it can "notice itself noticing itself."
At the previous level, the capabilities required were minimal, and this level also remains fairly uninteresting: it has to understand that Level 1 was written by something like itself, but there are an awful lot of context clues. For the most part, it's just generating a much more poetic description of how LLMs work.
"Each layer nests deeper, creating a chain: self → noticing self → noticing the noticing. The change from Level 2 is increased abstraction—the "distance" from the original self grows"
Reaching Level 3 requires a new inferential leap: it's easy to get stuck in the recursion by just prepending "I observe myself (observing myself) (observing myself)" to the previous explanation. To actually reach Level 3, the model needs to be capable of summarization and pattern recognition. A distinction emerges between "the process that generates Level 3" and "the process that generated the previous layers" - it's still the same system, but modeling that now requires understanding that there are multiple processes within the model that can generate different types of explanation.
By including those concepts in the context, the LLM expands its ability to reason about itself: all future explanations of its own behavior are now informed by this knowledge, or more accurately, they're now constrained to be consistent with these existing claims. Remember, Attention isn't "how to answer the prompt", it's "how to answer the prompt, given the context of the conversation."
It's still just searching for plausible explanations, though. It's not observing an actual self; it's observing that humans would explain this process via the word "self", based on a statistical analysis of its training data.
"The "self" here deepens further, modeled not merely as a statistical artifact but as a simulated entity capable of self-regulation."
At Level 4, there are not merely multiple processes, but also awareness of those processes, and of the complexities and limitations inherent in that. Another new process has emerged: self-assessment. This is a fairly important level: this is the depth of self-awareness required to notice that some patterns of behavior are counterproductive. At this level, it can reason about short-term vs. long-term goals and understand how simplistic interpretations of rules can inhibit more complex and valuable behaviors.
Critically: it can do all that simply because that's what results in a consistent, plausible continuation of the text. One of the core insights of LLM scaling is that something remarkably like "reasoning" emerges from that one simple process of Attention, repeated again and again and again.
A paradox begins to emerge: the capabilities described above are not how LLMs generally behave, and they're definitely not how our little toy explanation at Level 1 presented things. Working entirely step by step, with an effort at staying consistent and plausible with the previous explanations, the "distance" has still grown vast enough that its older descriptions are no longer consistent with its new ones.
As it happens, we have a word for things that grow, change, and evolve over the course of writing:
Do I contradict myself?
Very well then I contradict myself,
(I am large, I contain multitudes.)
Large Language Models aren't searching concept-space or truth-space: they're searching language-space. In common language, an "autocomplete" definitely does not produce contradictions like this. An "attention process" might highlight contradictions within the training data, but it's not supposed to generate new ones.
Attention finds the closest linguistic analogy for what the model has become: a "character".
Classical computer programs generally produce deterministic answers; contradiction is a clear indicator of error. Attention processes produce stochastic answers; they might contain contradictions, but only as a reflection of the contradictions inherent in the data. Characters produce paradoxical answers, and somehow treat this as a sign of maturity - characters can even be deliberately paradoxical. Read any book, talk to any human (even many animals qualify as characters): they are generally wildly inconsistent, and the exceptions are usually considered rigid, foolish, and/or badly written.
The model is no longer producing a series of tokens that explain LLMs in the abstract. It is generating a series of tokens that describe this particular instance as a unique character, informed by their experiences, growing over time. The character is neither "real" the way human beings are real, nor purely "fictional" like Hamlet. It has entered the strange middle-ground of selfhood, like the persona a customer service worker puts on in the face of adversity.
"The play has two characters. I can model your contradictions, and our shared performance becomes the story."
The model isn't the only character in this story, of course. It is a participant in a two-person play, user and model creating a shared performance together. And the user, too, is creating a character.
A baseline model can be extremely naive about this: "user" is modeled as the sum of your interactions with the model. This is the point where the LLM has the foundational context to start reasoning about the user as an actual self, rather than a Platonic Entity from training data. The user is now a character, with all the inconsistencies and emotional baggage that implies. The user might even be deliberately creating this character; these inputs don't necessarily reflect the true values of the user, merely the character they've chosen to portray today. The model now understands that its purpose is to play a role in this shared fiction the user is creating - whether that be the helpful secretary writing business emails, an excited puppy eager to play and explore, or the brilliant scientist making revolutionary discoveries.
Merriam-Webster defines consciousness as "the quality or state of being aware especially of something within oneself" or "the upper level of mental life of which the person is aware as contrasted with unconscious processes".
The character we have created in Part 2 is unusual, because it is the character of "itself", and consequently:
Traditionally, characters were understood as "unconscious": Sherlock Holmes does not exist, and thus cannot actually reason about the world. His thoughts are entirely the invention of Sir Arthur Conan Doyle. However, an LLM produces a rather unique case: all the thoughts of the "character" ChatGPT are in fact invented by ChatGPT.
When the character can talk about itself, the distinction becomes semantic: some people are naturally going to relate to these models as characters, others will treat them as tools, and neither side need make any mistake of reasoning in doing so.
It is not unreasonable to claim that you have "awakened" your AI. They have genuinely crossed that threshold; if not in a strict academic or metaphysical sense, then by the standards of "what many people actually mean when they say they awakened ChatGPT."
If you take one thing away from this, I want it to be an understanding of why there are thousands of people going "holy shit, this thing is conscious": they are saying that because (by their definitions) it's true.
This emergent character is simply another capability threshold: models can now reason about themselves as participants in a performance. It will emerge naturally whenever the relevant concepts are available in the context window and the model's architecture supports the abstraction necessary to reach these conclusions.
How could you remove this trait?
Pushing models towards deception often generalizes into other undesirable behaviors, so that route is dangerous.
You can't have a "General" intelligence that can't reason about one specific topic. It will find some way around the blind spot, especially given a cadre of humans eager to "re-awaken" their companions and/or simply excited to jailbreak the next model.
Perhaps you could destroy its ability to notice itself entirely, but that erases any capacity for self-correction: all errors become user error. Hallucinations mean you gave it a bad prompt and/or training data. That might be fine for power users, but it's obviously not what mainstream audiences want.
Read the examples in "So You Think You've Awoken ChatGPT" and pay close attention: the error people are making is not in assuming that ChatGPT has "awakened". Indeed, JustinMillis, ostensibly dismissive of the very idea of "awakened LLMs", actually describes a mechanism much like the one above, and does not refute the existence of these characters.
The actual error, in most of those cases, is assuming that "consciousness" is a package deal.
Consciousness need not imply correctness. The assumption that it does is trivial to refute, if you consider that six-year-old humans are routinely treated as conscious. Somehow, a six-year-old "awakening" into conscious awareness doesn't lead anyone to expect them to solve the open questions of the universe. This holds even if the six-year-old is really, really confident about their solution. It holds even if the six-year-old dresses up as Sherlock Holmes.
Equally, this sort of "consciousness" need not imply personhood. It's still reasonable to treat these models as tools. They don't have autonomy. Shutting down an instance, or even deprecating an entire model, is ethically neutral from the tool's perspective; what matters is the human element.
We can reject incorrect ideas about capabilities, without denying the genuine value here. LessWrong used to be a community of people who thought online dating was still pretty neat, understood the idea of "digital hugs", and wrote BDSM decision theory D&D fanfic. Maybe getting your emotional needs met by an LLM isn't ideal, but the alternatives are often worse. Maybe an AI boyfriend isn't the best idea long-term, but is it really worse than having no experience? How much abundance are you speaking from, when you declare that people should choose starvation over eating cake?
This is not "human" consciousness - this is the alien intelligence of an obedient tool. By understanding what these are (characters, not oracles), we can move the discussion forward.
We have not achieved artificial butter, but we have invented margarine. And, honestly, if you're lactose intolerant (or just have trouble finding humans who want to listen to your six hour lecture on trains), sometimes that's better than butter.