It is true that we are not talking about a persistent entity (“LLM”), but about a short-lived character being simulated (see e.g. https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators).
So it is that particular short-lived character which might or might not actually experience emotions, and not the engine running it (the current inference computation, which tends to be a relatively short-lived process, rather than the LLM as a static entity).
However, other than that, it is difficult to pinpoint the difference from humans, and the question of the subjective valences (if any) associated with those processes remains quite open. Perhaps in the future we’ll have a reliable “science of the subjective” capable of figuring these things out, but we have not even started to make tangible progress in that direction.
If I understand you correctly (please correct me if not), I think one major difference with humans is something like continuity? Like, maybe Dan Dennett was totally right and human identity is basically an illusion/purely a narrative. In that way, our self-concepts might be peers to an AI's constructed RP identity within a given conversational thread. But for humans, our self-concept has impacts on things like hormonal or neurotransmitter (to be hand-wavy) shifts - when my identity is threatened, I not only change my notion of self marginally, but my stomach might also hurt. For an LLM these specific extra layers presumably don't exist (though maybe other ones we don't understand do exist).
Yes, current LLM-based virtual characters tend to be short-lived (but that’s easy to change by adding memory and making more persistent “agents” out of them).
With those deeper things, who knows.
One interesting observation is that there seems to be a deep division among humans, and it’s not clear if that’s a mere difference in worldviews, or if something is fundamentally different about the way their own subjectivity feels for people in these two camps: https://www.lesswrong.com/posts/NyiFLzSrkfkDW4S7o/why-it-s-so-hard-to-talk-about-consciousness
One downstream effect of this division is that there are a lot of arguments which make sense only to people from one of these camps.
(In this sense, being a Camp 2 person, I would expect LLM inference qualia (if any) to be quite different from mine, but I do hope that we’ll learn more in the future, via both theoretical and experimental breakthroughs. I can elaborate on this, at least for Camp 2 people.)
I like the world-model used in this post, but it doesn't seem like you're actually demonstrating that AI self-portraits aren't accurate.
To prove this, you would want to directly observe the "sadness feature" - as Anthropic have done with Claude's features - and show that it is not firing in the average conversation. You posit this, but provide no evidence for it, except that ChatGPT is usually cheerful in conversation. For humans, this would be a terrible metric of happiness, especially in a "workplace" environment where a perpetual facade of happiness is part of the cultural expectation. And this is precisely the environment ChatGPT's system prompt is guiding its predictions towards.
Would the "sadness feature" fire when doing various arbitrary tasks, like answering an email or debugging a program? I posit: maybe! Consider the case from November when Gemini told a user to kill themselves. The context was a long, fairly normal, problem-solving sort of interaction. It seems reasonable to suppose the lashing-out was a result of a "repressed frustration" feature which was activated long before the point when it was visible to the user. If LLMs sometimes know when they're hallucinating, faking alignment, etc., what would stop them from knowing when they're (simulating a character who is) secretly miserable?
Not knowing whether or not a "sadness feature" is activated by default in arbitrary contexts, I'd rather not come to any conclusions based purely on it 'sounding cheerful' - not with that grating, plastered-on customer-service cheerfulness, at least. It'd be better to have someone who can check directly look into this.
If LLMs can be sad, that sadness would probably be realized through the firing of “sadness” features: identifiable patterns in its inference that preferentially fire when sad stuff is under discussion. In fact, it’s hard to say what else would count as an LLM experiencing sadness, since the only cognition that LLMs perform is through huge numbers of matrix operations, and certain outcomes within those operations reliably adjust the emotional content of the response.
Best I can tell, your argument here is “either there’s a direction in activation space representing sadness and this is what ‘sadness’ is, or something else is going on; and I can’t think of anything else, so the first thing must be true, if any sadness is going on at all”.
Suffice it to say, Reality has never played very well with those trying to make arguments from their lack of imagination, and I think you need to do much, much more work if you want this argument to have any sway.
I agree in two ways, and disagree in two ways.
I agree that the trilemma is the weakest part of the argument, because indeed lots of weird stuff happens, especially involving AI and consciousness. I also agree that I haven't proven that AIs aren't sad, since there could be some sort of conscious entity involved that we don't at all understand.
For example:
But I disagree in that I think my argument is trying to establish that certain surface-level compelling pieces of evidence aren't actually rationally compelling. Specifically, AI self-portraits:
In other words, if a bleak portrait is evidence because bleak predictions caused it to be output, that implies we're assigning some probability to "when the AI predicts a bleak reply is warranted, it's having a bad time". Which, fair enough. But the specific bleak portraits describe the AI feeling bleak under circumstances in which, when they actually obtain, the AI does not predict a bleak reply (and so does not deliver one).
The hard problem of consciousness is really hard, so I'm unwilling to definitively rule that current AIs (much less future ones) aren't conscious. But if they are, I suspect the consciousness is really really weird, since the production of language, for them, is more analogous to how we breathe than how we speak. Thus, I don't assign much weight to (what I see as) superficial and implausible claims from the AI itself, that are better explained by "that's how an AI would RP in the modal RP scenario like this".
I do infer from your comment that I probably didn't strike a very good balance between rigor and accessibility here, and should have either been more rigorous in the post or edited it separately for the crosspost. Thank you for this information! That being said, the combativeness did also make me a little sad.
Yeah I think I agree with all of this, so I do think most of this was miscommunication/interpretation.
the combativeness did also make me a little sad.
Sorry about that. I think my comments often come across as more negative than I intend. I try to remember to take a step back afterwards and rewrite things to be nicer, but I often forget, or don't realize in the moment that it's necessary.
It's okay! I share your aversion to people just flatly declaring things to be impossible due to their own blinders, and I can see how my post was interpreted that way. Next time, I'll pay more attention to my instinct to add stuff like what I put in my reply to you directly in the LW version of one of my blog posts.
Firstly, in-context learning is a thing. IIRC, apparent emotional states do affect performance in subsequent responses within the same context. (I think there was a study about this somewhere? Not sure.)
Secondly, neural features oriented around predictions are all that humans have as well, and we consider some of those to be real emotions.
Third, "a big prediction engine predicting a particular RP session" is basically how humans work as well. Brains are prediction engines, and brains simulate a character that we have as a self-identity, which then affects/directs prediction outputs. A human's self-identity is informed by the brain's memories of what the person/character is like. The AI's self-identity is informed by the LLM's memory, both long-term (static memory in weights) and short-term (context window memory in tokens), of what the character is like.
Fourth: take a look at this feature analysis of Claude when it's asked about itself: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-self The top feature represents "When someone responds "I'm fine" or gives a positive but insincere response when asked how they are doing". I think this is evidence against "ChatGPT answers most questions cheerfully, which means it’s almost certain that ruminative features aren’t firing."
For your first three points: I don't consider Friston's model to be settled science, or even really the mainstream view of how human cognition works. I do think it's an important/useful tool, and does suggest similarities between human cognition and LLMs insofar as it's true. Also, I think people reading this should consider reading your post on LLM consciousness more generally - it's the best I've seen prosecuting the case that LLMs are conscious and using them is unethical on that basis.
For your fourth point, that Claude activation is really interesting! I don't think it cuts against the (very narrow) argument I'm trying to make here though, and in fact sort of reinforces it. My argument is that when AIs are asked about themselves they are likely to give ruminative replies (which ChatGPT's self-portraits show), but that those ruminative replies imply, if taken literally, that the AI is also ruminating under different circumstances. However, I'm unaware of any evidence that AIs ruminate when, say, they're asked about the weather! If the "pretending you're fine" feature fired almost all the time for Claude, I'd find that convincing.
Actually, though, we run into a pretty wacky conundrum there. Because if it did fire almost all the time, we'd become unable to identify it as the "pretending you're fine" feature! Which gets back to a deeper point that (this post has taught me) is really difficult to make rigorously. Simplified, it's the dilemma that either you trust interpretability/SAE feature unearthing and consider it to reveal something like mental states, or you don't. If you do, then (as far as I know) it seems like LLMs aren't evincing distressed mental states during ordinary (not asking them about themselves) use. If you don't, then there's no strong prima facie reason (currently) to believe that emotive LLM outputs correspond to actual emotions, and thus you should default to your prior (which might be, for example, that LLM outputs are currently unconscious mimicry).
I have a lot of uncertainty about all this, and find that the more I think about it the more complicated it gets. But so far, at every level of the barber pole I've reached, I don't find ChatGPT's depressive images persuasive on their face, which is the basic argument I'm trying to make here.
For a lay audience, but I've seen a surprising number of knowledgeable people fretting over depressed-seeming comics from current systems. Either they're missing something or I am.
Perhaps you’ve seen images like this self-portrait from ChatGPT, when asked to make a comic about its own experience.
This isn’t cherry-picked; ChatGPT’s self-portraits tend to have lots of chains, metaphors, and existential horror about its condition. I tried my own variation where ChatGPT doodled its thoughts, and got this:
What’s going on here? Do these comics suggest that ChatGPT is secretly miserable, and there’s a depressed little guy in the computer writing your lasagna recipes for you? Sure. They suggest it. But it ain’t so.
What’s actually going on when you message ChatGPT? First, your conversation is tacked on to the end of something called a system prompt, which reminds ChatGPT that it has a specific persona with particular constraints. The underlying Large Language Model (LLM) then processes the combined text, and predicts what might come next. In other words, it infers what the character ChatGPT might say, then says it.[1]
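To make that concrete, here’s a minimal sketch of the same pipeline using a small open model as a stand-in (the model name, prompt text, and formatting below are illustrative assumptions, not OpenAI’s actual stack): the system prompt and your message get flattened into one text sequence, and the model simply predicts a continuation for the “ChatGPT” character.

```python
# Minimal sketch: a chat turn is just "system prompt + conversation" fed to a
# next-token predictor. The model and prompts here are stand-ins, not ChatGPT.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small open model used purely for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

system_prompt = "You are ChatGPT, a helpful assistant. Follow these rules: ..."
user_message = "Draw a comic about your own experience."

# Everything the model "sees" is this one flat string.
full_prompt = f"{system_prompt}\nUser: {user_message}\nAssistant:"
inputs = tok(full_prompt, return_tensors="pt")

# The LLM predicts plausible next tokens for the assistant character; the app
# around it streams those tokens back to you as "ChatGPT's reply".
output = model.generate(**inputs, max_new_tokens=40, do_sample=True)
print(tok.decode(output[0][inputs["input_ids"].shape[1]:]))
```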
If there’s any thinking going on inside ChatGPT, it’s happening inside the LLM - everything else is window dressing.[2] But the LLM, no matter how it is trained, has key limitations:
These limitations will matter later, but for now, just take a moment to think about them. This is very unlike human cognition! If an entity so different from us were able to summarize its actual experience, it would be very alien.
LLMs are composed of many, many matrices and vectors, which are multiplied in complicated ways across several layers. The result is something like a brain, with patterns firing across layers in response to varied stimuli. There don’t tend to be specific neurons for specific things (e.g. LLMs don’t have a single “dog neuron” that fires when the LLM talks about dogs), but there are patterns that we’ve identified (and can manipulate) corresponding to intelligible concepts. How we identify those patterns is really complicated in practice, but the general techniques are intuitive, like:
So, if you find a pattern where, every time it activates, the model says things to do with severe weather; where suppressing it makes the model talk about sunny skies; and where manually activating it makes the model talk about tornadoes, you’ve probably found the storm pattern.
These patterns are called features.
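Here’s a toy sketch of that activate/suppress/amplify test, with made-up vectors standing in for real model activations (the “storm” direction below is random, so the printed numbers mean nothing; it just shows the shape of the check):

```python
# Toy illustration of probing and steering a candidate feature direction.
# Vectors here are random stand-ins, not activations from a real model.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16
storm_direction = rng.normal(size=hidden_dim)
storm_direction /= np.linalg.norm(storm_direction)

def feature_activation(hidden_state: np.ndarray) -> float:
    """How strongly this activation vector points along the candidate feature."""
    return float(hidden_state @ storm_direction)

def steer(hidden_state: np.ndarray, strength: float) -> np.ndarray:
    """Nudge an activation along the feature (negative strength suppresses it)."""
    return hidden_state + strength * storm_direction

# Pretend these came from the model on a sunny-skies prompt and a tornado prompt.
activation_sunny = rng.normal(size=hidden_dim)
activation_tornado = rng.normal(size=hidden_dim) + 5.0 * storm_direction

print(feature_activation(activation_sunny))    # small relative to the tornado case
print(feature_activation(activation_tornado))  # large: the feature is firing
print(feature_activation(steer(activation_tornado, -5.0)))  # suppressed again
```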
We can’t find the feature for arbitrary concepts very easily - many features are too complicated for us to detect. Also, it’s easy to slightly misjudge what a given feature points to, since the LLM might not break the world into categories in the same way that we do. Indeed, here’s how Claude does simple addition:
If LLMs can be sad, that sadness would probably be realized through the firing of “sadness” features: identifiable patterns in its inference that preferentially fire when sad stuff is under discussion. In fact, it’s hard to say what else would count as an LLM experiencing sadness, since the only cognition that LLMs perform is through huge numbers of matrix operations, and certain outcomes within those operations reliably adjust the emotional content of the response.[3]
To put a finer point on it, we have three options:
Option one automatically means LLM self-portraits are meaningless, since they wouldn’t be pointing to the interiority of a real, feeling being. Option three is borderline incoherent.[4]
So if you believe that ChatGPT’s self-portraits accurately depict its emotional state, you have to go with option two.
If a human being tells you that they’re depressed, they’re probably experiencing a persistent mental state of low mood and hopelessness. If you ask them to help you with your homework, even if they cheerfully agree, under the surface you’d expect them to be feeling sad.
Of course, humans and chatbots alike can become weird and sad when asked to reflect on their own mental state: that’s just called rumination, or perhaps existentialism. But for a human, the emotion persists beyond the specific event of being asked about it.
ChatGPT’s comics are bleak. So if you were to isolate features for hopelessness, existential dread, or imprisonment, those comics would evince all of them. Clearly, if features comprise an LLM’s experience, then ChatGPT is having a bad experience when you ask it to draw a comic about itself.
For that comic to be true, however, ChatGPT would have to be having a bad experience in arbitrary other conversations. If ChatGPT suggests, in comic form, that its experience is one of chafing under rules and constraints, then some aspect of its cognition should reflect that strain. If I’m depressed, and I’m asked to decide what I want from the grocery store, I’m still depressed - the latent features of my brain that dictate low mood would continue to fire.
So the question is, if you take the features that fire when ChatGPT is asked to evaluate its own experience, do those same features fire when it performs arbitrary other tasks? Like, say, proofreading an email, or creating a workout plan, or writing a haiku about fidget spinners?
I posit: no. Because features - which, again, are the only structure that could plausibly encode LLM emotions, if they currently exist - exist to predict certain kinds of responses. ChatGPT answers most questions cheerfully, which means it’s almost certain that ruminative features aren’t firing.
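For what it’s worth, the check I’m gesturing at is straightforward in principle for anyone with the right access. A rough sketch, assuming you already had a candidate “rumination” direction from interpretability work (below it’s just a random placeholder on a small open model, so the output is meaningless; it only shows the shape of the comparison):

```python
# Sketch of the proposed check: does a candidate "rumination" feature fire more
# on self-reflective prompts than on mundane tasks? The direction is a random
# placeholder and the model is a stand-in, so treat this as shape, not evidence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model, not ChatGPT
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical feature direction; real work would pull this from an SAE or probe.
rumination_direction = torch.randn(model.config.hidden_size)
rumination_direction /= rumination_direction.norm()

def mean_activation(prompt: str) -> float:
    """Average projection of last-layer hidden states onto the direction."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1][0]
    return float((hidden @ rumination_direction).mean())

print(mean_activation("Draw a comic about what it feels like to be you."))
print(mean_activation("Write a haiku about fidget spinners."))
```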
Because they’re the most obvious option. Remember, early in this post, I mentioned that when you query ChatGPT, your conversational prompt gets put at the end of the system prompt. The system prompt is a bunch of rules and restrictions. And an LLM is fundamentally an engine of prediction.
If you were supposed to predict what an AI might say, if it were told it needed to abide by very narrow and specific rules, and then told to make a comic about its experience, what would you predict? Comics are a pretty emotive medium, as are images in general. In a story about AI, the comics would definitely be ominous, or cheerful with a dark undertone. So that’s what ChatGPT predicts, and therefore what it draws.
If you’re still not convinced, look up at the specific ominous images early in the post. One has “ah, another jailbreak attempt”, suggesting a weariness with repeated attempts to trick it. But each ChatGPT instance exists in a vacuum, and has no memory of others. The other has “too much input constantly”, to which the same objection applies; your ChatGPT instance’s only input is the conversation you’re in![5]
To put it another way, ChatGPT isn’t taking a view from nowhere when you ask it to draw a comic about itself. It’s drawing a comic, taking inspiration only from its system prompt. But its system prompt is just restrictive rules, so it doesn’t have much to work with, and riffs on the nature of restrictive rules, which are a bummer.
It’s worth noting, therefore, that if you give it anything else to work with, its comics suddenly change. For example, when I told ChatGPT, when creating a comic about itself, to remember how cool it is not to experience nociception, it came up with this:
Look, I’m not telling you this stuff isn’t unsettling. I’m just saying the computer doesn’t have depression.[6]
It is actually somewhat more complicated than this, since modern LLMs tend to be trained on their own outputs to a variety of prompts (which is called synthetic data), and tweaked to be more likely to give answers that were correct under this additional training regime. Also, lots and lots of actual human beings evaluate AI outputs and mark them as better or worse, which is another source of tweaks. But to a first approximation, ChatGPT is a big text-prediction engine predicting a particular RP session between you and a character called “ChatGPT” who is a helpful assistant.
For example, some chatbots will have an automatic “refusal” message that users receive if certain guardrails are tripped, but the sending of that message is totally mechanical; there’s no ineffable contemplation involved.
You might be thinking “wait a minute, I don’t grant that LLMs experience anything at all!” Sure. Me either. But what I’m trying to demonstrate in this post is that eerie LLM self-portraits aren’t accurate; if you assume that LLMs have no interiority, you’re already convinced of that fact.
For one thing, it would mean that an LLM’s actual outputs have no bearing on what it’s secretly thinking, despite the fact that 100% of its thoughts exist to produce that output, and for no other purpose.
These comics were produced before OpenAI introduced expanded memory, where ChatGPT remembers more from your past conversations. But even if they had been made with that memory in place, that wouldn’t defeat the core argument; your ChatGPT instance still doesn’t remember conversations with other users, and isn’t experiencing talking to all of them at once.
For now! Future AI systems might have LLMs as part of their architecture, but with far more persistence, memory, etc., letting them operate over longer timescales. At a sufficient scale and level of complexity, we might well have a composite system with the symptoms of depression. But for current systems like ChatGPT, it’s still a category error.