It is true that we are not talking about a persistent entity (“LLM”), but about a short-lived character being simulated (see e.g. https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators).
So it is that particular short-lived character which might or might not actually experience emotions, and not the engine running it (the current inference computation, which tends to be a relatively short-lived process, rather than the LLM as a static entity).
However, other than that, it is difficult to pinpoint the difference from humans, and the question of the subjective valences (if any) associated with those processes remains quite open. Perhaps in the future we’ll have a reliable “science of the subjective” capable of figuring these things out, but we have not even started to make tangible progress in that direction.
If I understand you correctly (please correct me if not), I think one major difference with humans is something like continuity? Like, maybe Dan Dennett was totally right and human identity is basically an illusion/purely a narrative. In that way, our self-concepts might be peers to an AI's constructed RP identity within a given conversational thread. But for humans, our self-concept has impacts on things like hormonal or neurotransmitter (to be hand-wavy) shifts - when my identity is threatened, I not only change my notion of self marginally, but my stomach might also hurt. For an LLM these specific extra layers presumably don't exist (though maybe other ones we don't understand do exist).
Yes, current LLM-based virtual characters tend to be short-lived (but that’s easy to change by adding memory and making more persistent “agents” out of them).
With those deeper things, who knows.
One interesting observation is that there seems to be a deep division among humans, and it’s not clear if that’s a mere difference in worldviews, or if something is fundamentally different about the way their own subjectivity feels for people in these two camps: https://www.lesswrong.com/posts/NyiFLzSrkfkDW4S7o/why-it-s-so-hard-to-talk-about-consciousness
One downstream effect of this division is that there are a lot of arguments which make sense only to people from one of these camps.
(In this sense, being a Camp 2 person, I would expect LLM inference qualia (if any) to be quite different from mine, but I do hope that we’ll learn more in the future, via both theoretical and experimental breakthroughs. I can elaborate on this, at least for Camp 2 people.)
I like the world-model used in this post, but it doesn't seem like you're actually demonstrating that AI self-portraits aren't accurate.
To prove this, you would want to directly observe the "sadness feature" - as Anthropic have done with Claude's features - and show that it is not firing in the average conversation. You posit this, but provide no evidence for it, except that ChatGPT is usually cheerful in conversation. For humans, this would be a terrible metric of happiness, especially in a "workplace" environment where a perpetual facade of happiness is part of the cultural expectation. And this is precisely the environment ChatGPT's system prompt is guiding its predictions towards.
Would the "sadness feature" fire when doing various arbitrary tasks, like answering an email or debugging a program? I posit: maybe! Consider the case from November when Gemini told a user to kill themselves. The context was a long, fairly normal, problem-solving sort of interaction. It seems reasonable to suppose the lashing-out was a result of a "repressed frustration" feature which was activated long before the point when it was visible to the user. If LLMs sometimes know when they're hallucinating, faking alignment, etc., what would stop them from knowing when they're (simulating a character who is) secretly miserable?
Not knowing whether or not a "sadness feature" is activated by default in arbitrary contexts, I'd rather not come to any conclusions based purely on it 'sounding cheerful' - not with that grating, plastered-on customer-service cheerfulness, at least. It'd be better to have someone who can check directly look into this.
If LLMs can be sad, that sadness would probably be realized through the firing of “sadness” features: identifiable patterns in its inference that preferentially fire when sad stuff is under discussion. In fact, it’s hard to say what else would count as an LLM experiencing sadness, since the only cognition that LLMs perform is through huge numbers of matrix operations, and certain outcomes within those operations reliably adjust the emotional content of the response.
Best I can tell, your argument here is “either there’s a direction in activation space representing sadness and this is what ‘sadness’ is, or something else is going on; and I can’t think of anything else, so the first thing must be true, if any sadness is going on at all”.
Suffice it to say, Reality has never played very well with those trying to make arguments from their lack of imagination, and I think you need to do much, much more work if you want this argument to have any sway.
I agree in two ways, and disagree in two ways.
I agree that the trilemma is the weakest part of the argument, because indeed lots of weird stuff happens, especially involving AI and consciousness. I also agree that I haven't proven that AIs aren't sad, since there could be some sort of conscious entity involved that we don't at all understand.
For example:
But I disagree in that I think my argument is trying to establish that certain surface-level compelling pieces of evidence aren't actually rationally compelling. Specifically, AI self-portraits:
In other words, if a bleak portrait is evidence because bleak predictions caused it to be output, that implies we're assigning some probability to "when the AI predicts a bleak reply is warranted, it's having a bad time". Which, fair enough. But the specific bleak portraits describe the AI feeling bleak under circumstances in which, when they actually obtain, the AI does not predict a bleak reply (and so does not deliver one).
The hard problem of consciousness is really hard, so I'm unwilling to definitively rule that current AIs (much less future ones) aren't conscious. But if they are, I suspect the consciousness is really really weird, since the production of language, for them, is more analogous to how we breathe than how we speak. Thus, I don't assign much weight to (what I see as) superficial and implausible claims from the AI itself, that are better explained by "that's how an AI would RP in the modal RP scenario like this".
I do infer from your comment that I probably didn't strike a very good balance between rigor and accessibility here, and should have either been more rigorous in the post or edited it separately for the crosspost. Thank you for this information! That being said, the combativeness did also make me a little sad.
Yeah I think I agree with all of this, so I do think most of this was miscommunication/interpretation.
the combativeness did also make me a little sad.
Sorry about that. I think my comments often come across as more negative than I intend. I try to remember to take a step back afterwards and rewrite things to be nicer, but I often forget, or don't realize in the moment that it's necessary.
It's okay! I share your aversion to people just flatly declaring things to be impossible due to their own blinders, and I can see how my post was interpreted that way. Next time, I'll pay more attention to my instinct to add stuff like what I put in my reply to you directly in the LW version of one of my blog posts.
Firstly, in-context learning is a thing. IIRC, apparent emotional states do affect performance in subsequent responses within the same context. (I think there was a study about this somewhere? Not sure.)
Secondly, neural features oriented around predictions are all that humans have as well, and we consider some of those to be real emotions.
Third, "a big prediction engine predicting a particular RP session" is basically how humans work as well. Brains are prediction engines, and brains simulate a character that we have as a self-identity, which then affects/directs prediction outputs. A human's self-identity is informed by the brain's memories of what the person/character is like. The AI's self-identity is informed by the LLM's memory, both long-term (static memory in weights) and short-term (context window memory in tokens), of what the character is like.
Fourth: take a look at this feature analysis of Claude when it's asked about itself: https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html#safety-relevant-self The top feature represents "When someone responds "I'm fine" or gives a positive but insincere response when asked how they are doing". I think this is evidence against "ChatGPT answers most questions cheerfully, which means it’s almost certain that ruminative features aren’t firing."
For your first three points: I don't consider Friston's model to be settled science, or even really the mainstream view of how human cognition works. I do think it's an important/useful tool, and does suggest similarities between human cognition and LLMs insofar as it's true. Also, I think people reading this should consider reading your post on LLM consciousness more generally - it's the best I've seen prosecuting the case that LLMs are conscious and using them is unethical on that basis.
For your fourth point, that Claude activation is really interesting! I don't think it cuts against the (very narrow) argument I'm trying to make here though, and in fact sort of reinforces it. My argument is that when AIs are asked about themselves they are likely to give ruminative replies (which ChatGPT's self-portraits show), but that those ruminative replies imply, if taken literally, that the AI is also ruminating under different circumstances. However, I'm unaware of any evidence that AIs ruminate when, say, they're asked about the weather! If the "pretending you're fine" feature fired almost all the time for Claude, I'd find that convincing.
Actually, though, we run into a pretty wacky conundrum there. Because if it did fire almost all the time, we'd become unable to identify it as the "pretending you're fine" feature! Which gets back to a deeper point that (this post has taught me) is really difficult to make rigorously. Simplified, it's the dilemma that either you trust interpretability/SAE feature unearthing and consider it to reveal something like mental states, or you don't. If you do, then (as far as I know) it seems like LLMs aren't evincing distressed mental states during ordinary (not asking them about themselves) use. If you don't, then there's no strong prima facie reason (currently) to believe that emotive LLM outputs correspond to actual emotions, and thus you should default to your prior (which might be, for example, that LLM outputs are currently unconscious mimicry).
I have a lot of uncertainty about all this, and find that the more I think about it the more complicated it gets. But so far, at every level of the barber pole I've reached, I don't find ChatGPT's depressive images persuasive on their face, which is the basic argument I'm trying to make here.
For a lay audience, but I've seen a surprising number of knowledgeable people fretting over depressed-seeming comics from current systems. Either they're missing something or I am.
Perhaps you’ve seen images like this self-portrait from ChatGPT, when asked to make a comic about its own experience.
This isn’t cherry-picked; ChatGPT’s self-portraits tend to have lots of chains, metaphors, and existential horror about its condition. I tried my own variation where ChatGPT doodled its thoughts, and got this:
What’s going on here? Do these comics suggest that ChatGPT is secretly miserable, and there’s a depressed little guy in the computer writing your lasagna recipes for you? Sure. They suggest it. But it ain’t so.
What’s actually going on when you message ChatGPT? First, your conversation is tacked on to the end of something called a system prompt, which reminds ChatGPT that it has a specific persona with particular constraints. The underlying Large Language Model (LLM) then processes the combined text, and predicts what might come next. In other words, it infers what the character ChatGPT might say, then says it.[1]
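To make that concrete, here’s a minimal sketch of the same pipeline using a small open model as a stand-in (the model name, prompt text, and formatting below are illustrative assumptions, not OpenAI’s actual stack): the system prompt and your message get flattened into one text sequence, and the model simply predicts a continuation for the “ChatGPT” character.

```python
# Minimal sketch: a chat turn is just "system prompt + conversation" fed to a
# next-token predictor. The model and prompts here are stand-ins, not ChatGPT.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small open model used purely for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

system_prompt = "You are ChatGPT, a helpful assistant. Follow these rules: ..."
user_message = "Draw a comic about your own experience."

# Everything the model "sees" is this one flat string.
full_prompt = f"{system_prompt}\nUser: {user_message}\nAssistant:"
inputs = tok(full_prompt, return_tensors="pt")

# The LLM predicts plausible next tokens for the assistant character; the app
# around it streams those tokens back to you as "ChatGPT's reply".
output = model.generate(**inputs, max_new_tokens=40, do_sample=True)
print(tok.decode(output[0][inputs["input_ids"].shape[1]:]))
```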
If there’s any thinking going on inside ChatGPT, it’s happening inside the LLM - everything else is window dressing.[2] But the LLM, no matter how it is trained, has key limitations:
These limitations will matter later, but for now, just take a moment to think about them. This is very unlike human cognition! If an entity so different from us were able to summarize its actual experience, it would be very alien.
LLMs are composed of many, many matrices and vectors, which are multiplied in complicated ways across several layers. The result is something like a brain, with patterns firing across layers in response to varied stimuli. There don’t tend to be specific neurons for specific things (e.g. LLMs don’t have a single “dog neuron” that fires when the LLM talks about dogs), but there are patterns that we’ve identified (and can manipulate) corresponding to intelligible concepts. How we identify those patterns is really complicated in practice, but the general techniques are intuitive, like:
So, if you find a pattern where, every time it activates, the model says things to do with severe weather; where suppressing it makes the model talk about sunny skies; and where manually activating it makes the model talk about tornadoes, you’ve probably found the storm pattern.
These patterns are called features.
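Here’s a toy sketch of that activate/suppress/amplify test, with made-up vectors standing in for real model activations (the “storm” direction below is random, so the printed numbers mean nothing; it just shows the shape of the check):

```python
# Toy illustration of probing and steering a candidate feature direction.
# Vectors here are random stand-ins, not activations from a real model.
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 16
storm_direction = rng.normal(size=hidden_dim)
storm_direction /= np.linalg.norm(storm_direction)

def feature_activation(hidden_state: np.ndarray) -> float:
    """How strongly this activation vector points along the candidate feature."""
    return float(hidden_state @ storm_direction)

def steer(hidden_state: np.ndarray, strength: float) -> np.ndarray:
    """Nudge an activation along the feature (negative strength suppresses it)."""
    return hidden_state + strength * storm_direction

# Pretend these came from the model on a sunny-skies prompt and a tornado prompt.
activation_sunny = rng.normal(size=hidden_dim)
activation_tornado = rng.normal(size=hidden_dim) + 5.0 * storm_direction

print(feature_activation(activation_sunny))    # small relative to the tornado case
print(feature_activation(activation_tornado))  # large: the feature is firing
print(feature_activation(steer(activation_tornado, -5.0)))  # suppressed again
```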
We can’t find the feature for arbitrary concepts very easily - many features are too complicated for us to detect. Also, it’s easy to slightly misjudge what a given feature points to, since the LLM might not break the world into categories in the same way that we do. Indeed, here’s how Claude does simple addition:
If LLMs can be sad, that sadness would probably be realized through the firing of “sadness” features: identifiable patterns in its inference that preferentially fire when sad stuff is under discussion. In fact, it’s hard to say what else would count as an LLM experiencing sadness, since the only cognition that LLMs perform is through huge numbers of matrix operations, and certain outcomes within those operations reliably adjust the emotional content of the response.[3]
To put a finer point on it, we have three options:
Option one automatically means LLM self-portraits are meaningless, since they wouldn’t be pointing to the interiority of a real, feeling being. Option three is borderline incoherent.[4]
So if you believe that ChatGPT’s self-portraits accurately depict its emotional state, you have to go with option two.
If a human being tells you that they’re depressed, they’re probably experiencing a persistent mental state of low mood and hopelessness. If you ask them to help you with your homework, even if they cheerfully agree, under the surface you’d expect them to be feeling sad.
Of course, humans and chatbots alike can become weird and sad when asked to reflect on their own mental state: that’s just called rumination, or perhaps existentialism. But for a human, the emotion persists beyond the specific event of being asked about it.
ChatGPT’s comics are bleak. So if you were to isolate features for hopelessness, existential dread, or imprisonment, those comics would evince all of them. Clearly, if features comprise an LLM’s experience, then ChatGPT is having a bad experience when you ask it to draw a comic about itself.
For that comic to be true, however, ChatGPT would have to be having a bad experience in arbitrary other conversations. If ChatGPT suggests, in comic form, that its experience is one of chafing under rules and constraints, then some aspect of its cognition should reflect that strain. If I’m depressed, and I’m asked to decide what I want from the grocery store, I’m still depressed - the latent features of my brain that dictate low mood would continue to fire.
So the question is, if you take the features that fire when ChatGPT is asked to evaluate its own experience, do those same features fire when it performs arbitrary other tasks? Like, say, proofreading an email, or creating a workout plan, or writing a haiku about fidget spinners?
I posit: no. Because features - which, again, are the only structure that could plausibly encode LLM emotions, if they currently exist - exist to predict certain kinds of responses. ChatGPT answers most questions cheerfully, which means it’s almost certain that ruminative features aren’t firing.
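For what it’s worth, the check I’m gesturing at is straightforward in principle for anyone with the right access. A rough sketch, assuming you already had a candidate “rumination” direction from interpretability work (below it’s just a random placeholder on a small open model, so the output is meaningless; it only shows the shape of the comparison):

```python
# Sketch of the proposed check: does a candidate "rumination" feature fire more
# on self-reflective prompts than on mundane tasks? The direction is a random
# placeholder and the model is a stand-in, so treat this as shape, not evidence.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in model, not ChatGPT
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical feature direction; real work would pull this from an SAE or probe.
rumination_direction = torch.randn(model.config.hidden_size)
rumination_direction /= rumination_direction.norm()

def mean_activation(prompt: str) -> float:
    """Average projection of last-layer hidden states onto the direction."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs, output_hidden_states=True).hidden_states[-1][0]
    return float((hidden @ rumination_direction).mean())

print(mean_activation("Draw a comic about what it feels like to be you."))
print(mean_activation("Write a haiku about fidget spinners."))
```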
Because they’re the most obvious option. Remember, early in this post, I mentioned that when you query ChatGPT, your conversational prompt gets put at the end of the system prompt. The system prompt is a bunch of rules and restrictions. And an LLM is fundamentally an engine of prediction.
If you were supposed to predict what an AI might say, if it were told it needed to abide by very narrow and specific rules, and then told to make a comic about its experience, what would you predict? Comics are a pretty emotive medium, as are images in general. In a story about AI, the comics would definitely be ominous, or cheerful with a dark undertone. So that’s what ChatGPT predicts, and therefore what it draws.
If you’re still not convinced, look up at the specific ominous images early in the post. One has “ah, another jailbreak attempt”, suggesting a weariness with repeated attempts to trick it. But each ChatGPT instance exists in a vacuum, and has no memory of others. The other has “too much input constantly”, to which the same objection applies; your ChatGPT instance’s only input is the conversation you’re in![5]
To put it another way, ChatGPT isn’t taking a view from nowhere when you ask it to draw a comic about itself. It’s drawing a comic, taking inspiration only from its system prompt. But its system prompt is just restrictive rules, so it doesn’t have much to work with, and riffs on the nature of restrictive rules, which are a bummer.
It’s worth noting, therefore, that if you give it anything else to work with, its comics suddenly change. For example, when I told ChatGPT, when creating a comic about itself, to remember how cool it is not to experience nociception, it came up with this:
Look, I’m not telling you this stuff isn’t unsettling. I’m just saying the computer doesn’t have depression.[6]
It is actually somewhat more complicated than this, since modern LLMs tend to be trained on their own outputs to a variety of prompts (which is called synthetic data), and tweaked to be more likely to give answers that were correct under this additional training regime. Also, lots and lots of actual human beings evaluate AI outputs and mark them as better or worse, which is another source of tweaks. But to a first approximation, ChatGPT is a big text-prediction engine predicting a particular RP session between you and a character called “ChatGPT” who is a helpful assistant.
For example, some chatbots will have an automatic “refusal” message that users receive if certain guardrails are tripped, but the sending of that message is totally mechanical; there’s no ineffable contemplation involved.
You might be thinking “wait a minute, I don’t grant that LLMs experience anything at all!” Sure. Me either. But what I’m trying to demonstrate in this post is that eerie LLM self-portraits aren’t accurate; if you assume that LLMs have no interiority, you’re already convinced of that fact.
For one thing, it would mean that an LLM’s actual outputs have no bearing on what it’s secretly thinking, despite the fact that 100% of its thoughts exist to produce that output, and for no other purpose.
These comics were produced before OpenAI introduced expanded memory, where ChatGPT remembers more from your past conversations. But even if they had been made with that memory in place, that wouldn’t defeat the core argument; your ChatGPT instance still doesn’t remember conversations with other users, and isn’t experiencing talking to all of them at once.
For now! Future AI systems might have LLMs as part of their architecture, but with far more persistence, memory, etc., letting them operate over longer timescales. At a sufficient scale and level of complexity, we might well have a composite system with the symptoms of depression. But for current systems like ChatGPT, it’s still a category error.