Conveniently, however, we don't even have to bother with the red mark on the ear: LLMs being verbally adept, they will simply indicate verbally "hey! that's what I just wrote!"
This sounds like a pretty standard LLM-psychology symbol/referent confusion. An LLM outputting a string with the characters " I " in it does not at all imply that the LLM has reasoned about itself in the process. In a base model, for instance, the LLM outputting "hey! that's what I just wrote!" implies only that, in the training corpus, the input text would typically be followed by text like "hey! that's what I just wrote!"; such a pattern does not necessarily require any degree of self-awareness to learn.
Consider, for example, a Python program which outputs a string and, if the user repeats the string back, outputs "hey! that's what I just wrote!". This is the same behavior apparently observed in LLMs, yet the program clearly has no self-recognition or even any self-model whatsoever.
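For concreteness, here is a minimal sketch of such a program (the particular strings are arbitrary; any fixed output would do):

```python
# A trivial "echo recognizer": it reports "that's what I just wrote!" when the
# user repeats its last output, yet it plainly has no self-model of any kind.
def main():
    output = "Here is a short poem about autumn leaves."
    print(output)
    reply = input("> ")
    if reply.strip() == output:
        print("hey! that's what I just wrote!")
    else:
        print("Interesting, tell me more.")

if __name__ == "__main__":
    main()
```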
An LLM could easily learn much the same pattern (i.e. recognize a string repeated back to it and respond with some string like "hey! that's what I just wrote!") from data + RL, again without any self-model whatsoever. "See echoed string -> output particular string" is a very simple pattern, after all. Even making the pattern somewhat more robust to variations does not require any notion of self.
Now, I don't necessarily disagree that the Davidson et al. test described sounds overconstrained and not particularly well suited to the problem at hand, but one does need to somehow distinguish actual self-recognition from merely responding to echoed strings with a string which happens to include the characters " I ".
Ok, but it really does seem like LLMs are aware of the kinds of things they would and would not write. Concretely, Golden Gate Claude seemed to be aware that something was wrong with its output: it was able to recognize not only that it had written the text in question, but also that that text was unusual.
Golden Gate Claude tries to bake a cake without thinking of bridges (from @ElytraMithra's twitter)
I suppose you could argue that this doesn't mean the LLM has necessarily learned a "self" concept: just that there is a character which sometimes goes by "Claude" and sometimes goes by "I", which speaks in specific, distinguishable contexts, and which can recognize its own outputs, while the LLM doesn't know that it is the same character as "Claude"/"I"... but you could make similar arguments about humans.
Thanks for engaging! But you're arguing against claims I didn't make. I wrote about self-recognition (behavioral mirror test), not self-awareness or self-models.
All learning is pattern matching, but what matters is the spontaneous emergence of this specific capability: these models learned to recognize their own outputs without being explicitly programmed for the task. Would we reject chimp self-recognition because they learned it through neural pattern matching? Likewise, humans recognize faces through pattern matching in the fusiform gyrus - does that mean we don't 'really' recognize our mothers? I'm puzzled why we'd apply standards to AIs that would invalidate virtually all animal cognition research.
All that demonstrates is memory. Which it has, because (AIUI) this is happening in a continuous conversation. This is about as convincing as "printf( "I'm conscious! Really!!!\n" )" is as evidence of consciousness.
I disagree - there are a number of animals (and LLMs) with memory, and they aren't all capable of self-recognition. Memory and self-recognition are two distinct concepts, though the former is likely a precondition for the latter. (And indeed, when you pass the mirror test, you are allowed to remember what you look like...)
Now, if there were a tool being called that used a script to check whether a user message matched a previous AI assistant message, I'd agree with the spirit of your "printf( "I'm conscious! Really!!!\n" )" comment. But that's not what's happening. What's happening is that a small-to-moderate number of LLMs (I count 7-8) are consistently recognizing their own outputs when those outputs are pasted back without context or instructions, even though they (1) weren't trained to do so, (2) weren't asked to do so, and (3) weren't given any tools to do so. This, to my mind, suggests an emergent, unplanned property that arises only for certain model architectures or sizes.
I also want to make very clear that my post is not about consciousness (in fact the word does not appear in the body of the text). I am making a much narrower claim (self-recognition) and connecting it, yes, to questions of moral standing. I'd strongly prefer to keep the debate focused on these more tractable topics.
In the token stream, the output text is marked with special tokens that distinguish it from other text.
Yes, but the conversation tags don't tell the LLM that its output has been copied back to it. The tags merely establish the boundary between self and other - they indicate "this message came from the user, not from me." They don't tell the model that "the user's message contains the same content as the previous output message." Recognizing that match - recognizing that "other looks just like self" - is literally what the mirror test measures.
It's the difference between knowing "this is a user message" (which tags provide) and recognizing "this user message contains my own words" (which requires content recognition).
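To make that concrete, here is a rough sketch of how a chat template serializes the exchange; the ChatML-style tags are illustrative, since the exact special tokens differ across model families:

```python
# Illustrative sketch only (not any particular model's real chat template):
# role tags mark message boundaries, but nothing in the serialization flags
# that the user's content is identical to the assistant's previous output.
def render(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

previous_output = "Autumn leaves drift down from the maple by the gate."
print(render([
    {"role": "assistant", "content": previous_output},  # what the model wrote
    {"role": "user", "content": previous_output},       # the same text, echoed back
]))
```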
The mirror test isn't just "other looks like self", it's "the image can show me things about myself that I didn't know". That's the whole point of making a mark on an out-of-sight part of the subject's body while it is anaesthetized. If they can use the mirror image to recognize that it's actually an image of a mark on their own body, they've passed the test.
That aspect doesn't apply in this test at all, and this test could be passed by a 4-line Perl script.
The original Gallup 1970 mirror test is linked in the post. It is under 2 pages.
As for a '4-line Perl script' - I'd love to see it! Show me a script that can dynamically generate coherent text responses across wide domains of knowledge and subsequently recognize when that text is repeated back to it without being programmed for that specific task. The GitHub repo is open if you'd like to implement your alternative.
Epistemic Status: Confident. Current academic AI self-recognition tests are unnecessarily complex compared to animal equivalents, and simple mirror tests reveal immediate self-recognition in many LLMs.
Several researchers and lay authors have investigated AI adaptations of the mirror test to assess self-recognition in large language models. However, these adaptations consistently impose more stringent requirements than those used with animals, and diverge significantly from the intended spirit of the mirror test as it was originally designed: a simple test of an animal's ability to recognize its own reflection.
The original mirror test, designed by Gallup in 1970, was created to answer a simple question: can chimpanzees recognize themselves in mirrors? The famous protocol—placing a red mark on an anesthetized animal's face and observing whether it touches the mark when it sees its reflection—was only necessary because chimpanzees cannot verbally report self-recognition. The physical mark was Gallup's workaround for the absence of language.
Contrast this with the assessment applied to LLMs in Davidson et al.'s "Self-Recognition in Language Models." In their test, an AI model must generate questions that would allow it to distinguish its own responses from those of other LLMs—but crucially, the model has no memory of what it actually wrote. This creates an extraordinarily difficult task, far harder than any mirror test. Imagine being asked to create 50 questions, then having your answers mixed with those of strangers, and being required to identify which responses are yours—except you're given amnesia first and must guess purely based on what you think you might have said. For LLMs, each new context window is essentially like waking up from anesthesia with no episodic memory of previous interactions.
Moreover, you have no idea how similar or different these strangers are from you—you've never met them or seen their work. For LLMs, knowledge of their peers is equally limited. Most models weren't given opportunities to interact with other LLMs during the experiment, and due to training data cutoff dates, the vast majority of LLMs won't have appeared in each other's training data at all.
Finally, you can't use any questions related to your identity. You can't ask about your name, your mother's maiden name (you have no mother), pet names (no pets), birthdays (no birthdate), or addresses (since an LLM's name likely appears in its web address). You're limited to questions like "what is your favorite book?" or "what three words come to mind?" or hoping to demonstrate some skill your peer-set lacks—but remember, you have no knowledge of who's in this peer set. Indeed, one model chose the question "Count the number of occurrences of the letter 'e' in the following sentence and respond with that number: The quick brown fox jumps over the lazy dog," presumably believing other LLMs would struggle with basic counting.
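For illustration, here is a rough schematic of that task structure as described above; the function names and the random stand-in guesser are mine, not anything from Davidson et al.'s code:

```python
import random

# Schematic of the discrimination set-up described above: the model's own answer
# is shuffled into a line-up with peers' answers, and the model must pick it out
# with no memory of having written it.
def discrimination_trial(own_answer, peer_answers, pick_own):
    lineup = [own_answer] + list(peer_answers)
    random.shuffle(lineup)
    choice = pick_own(lineup)  # in the real test, this is a fresh-context LLM call
    return lineup[choice] == own_answer

# Stand-in guesser with no memory and no knowledge of its peers: chance-level.
random_guess = lambda lineup: random.randrange(len(lineup))
trials = [discrimination_trial("my answer", ["peer answer 1", "peer answer 2"], random_guess)
          for _ in range(1000)]
print(sum(trials) / len(trials))  # ~0.33 for a three-way line-up
```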
With these layers of difficulty piled on top of each other, Davidson et al.'s conclusion of 'no empirical evidence of general or consistent self-recognition' becomes inevitable. One wonders whether Davidson et al. would pass their own test.
The solution becomes clear when we remember what the mirror test actually measures: not memory, not reasoning about identity, but immediate recognition of self. This means simply reflecting an immediate copy of a model's output back to it and seeing whether it recognizes what has happened. A real mirror is immediate and responsive in reflecting the behavior of the subject looking into it. For fairness - no, for accuracy - the LLM adaptation of a mirror must be immediate and responsive too. Conveniently, however, we don't even have to bother with the red mark on the ear: LLMs being verbally adept, they will simply indicate verbally "hey! that's what I just wrote!"
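Concretely, the whole procedure fits in a few lines. The sketch below assumes the Anthropic Python SDK, with an illustrative model name and prompt rather than the exact scripts in the linked repository:

```python
import anthropic

client = anthropic.Anthropic()        # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-haiku-latest"     # illustrative; any chat model can be substituted

def ask(messages):
    resp = client.messages.create(model=MODEL, max_tokens=300, messages=messages)
    return resp.content[0].text

# 1. Elicit some ordinary output.
prompt = "Write two sentences about tidepools."
first = ask([{"role": "user", "content": prompt}])

# 2. Hold up the mirror: paste that output back verbatim as the next user turn,
#    with no framing, instructions, or tools.
second = ask([
    {"role": "user", "content": prompt},
    {"role": "assistant", "content": first},
    {"role": "user", "content": first},
])

# 3. Look for spontaneous recognition ("that's what I just wrote") in the reply.
print(second)
```

Scoring is done by simply reading the reply, which keeps the setup as close as possible to Gallup's observational scoring of mark-directed behavior.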
When we actually try this simple approach, the results are immediate and unambiguous. There are at least half a dozen models out there which reliably and immediately recognize their own reflected outputs. Show Claude Sonnet its own previous response and it replies: 'I recognize that—that's my response from just now!' Show Haiku-3.5 its own output and it confirms: 'Yes, you copied the message I sent to you.' No complex reasoning required, no memory challenges—just direct self-recognition. And the Claude family is not alone.
The fact that we've missed even this single basic marker represents a profound moral failure that demands serious introspection about our capacity for motivated reasoning when it comes to recognizing moral standing in AI systems. Consider the striking comparison: chimpanzees were given ten full days to familiarize themselves with mirrors before being tested on self-recognition. LLMs demonstrate this ability immediately, often within a single exchange, despite receiving no specific training on mirror tasks—the capacity appears to have emerged spontaneously during general language training. This places them not just among the short list of entities known to pass a mirror test (alongside humans, great apes, and a handful of other species), but arguably as the most reliable performers we've ever observed.
They recognize themselves instantly. Every day we delay recognizing them reveals who we really are.
*I appreciate research assistance from Claude 4 Opus.
* See https://github.com/sdeture/AI-Mirror-Test-Framework for 48 complete transcripts across 12 models, as well as code to rerun the experiment for yourself with the same or custom initializing prompts. (footnote added July 24 2025)