The following is an example of how, if one assumes that an AI (in this case an autoregressive LLM) has "feelings", "qualia", or "emotions", it can be unclear whether it is experiencing something more like pain or something more like pleasure in some settings, even quite simple settings that already occur frequently with existing LLMs. This dilemma is part of the reason why I think the philosophy of AI suffering/happiness is very hard and we most probably won't be able to solve it.

Consider the two following scenarios:

Scenario A: An LLM is asked a complicated question and answers it eagerly.

Scenario B: A user insults an LLM and it responds.

For the sake of simplicity, let's say that the LLM is an autoregressive transformer with no RLHF (I personally think the dilemma still applies when the LLM has RLHF, but then the arguments become more complicated and shaky).

If the LLM has "feelings", "qualia", whatever, are they positive or negative in scenarios A and B? One could argue in two ways:

  • They are positive in scenario A and negative in scenario B since LLMs emulate humans, and that's what the answer would be for a human.
  • They are significantly more negative in scenario A than in scenario B because:
    • If scenario A was part of the training corpus, the loss would be significantly higher than if scenario B was part of the training corpus.
    • It can be argued that things correlated with high loss cause negative feelings and things correlated with low loss cause positive feelings, in the same way that, in humans, things correlated with low reproductive fitness cause negative feelings and things correlated with high reproductive fitness cause positive feelings.
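
The loss comparison in the first sub-bullet can be made concrete with a toy model. This is purely an illustration of the general point that text made of rarer, less predictable tokens incurs a higher average loss; a real LLM's loss comes from context-dependent next-token prediction, and the corpus and example sentences here are invented for the sketch, not taken from the post.

```python
import math
from collections import Counter

# Toy stand-in for an LLM: a unigram model fit on a small invented
# "training corpus" in which insult-like text is common and technical
# answer-like text is rare.
corpus = ("you are stupid and useless " * 50 +
          "the gradient of the loss with respect to the weights " * 2).split()
counts = Counter(corpus)
total = sum(counts.values())

def avg_loss(text, alpha=1.0):
    """Average negative log-likelihood per token, with add-alpha smoothing
    so unseen tokens get a small nonzero probability."""
    vocab = len(counts) + 1  # +1 slot for unseen tokens
    tokens = text.split()
    nll = 0.0
    for tok in tokens:
        p = (counts[tok] + alpha) / (total + alpha * vocab)
        nll -= math.log(p)
    return nll / len(tokens)

# Scenario-B-like text (common in the toy corpus) vs.
# scenario-A-like text (mostly unseen, so harder to predict):
loss_insult = avg_loss("you are stupid and useless")
loss_answer = avg_loss("the eigenvalues of the hessian")
print(loss_insult, loss_answer)  # the answer-like text has the higher loss
```

Under this (deliberately crude) model, the complicated answer gets the higher per-token loss simply because its tokens are rarer in the training data, which is the shape of the argument being made above.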

Some people might argue that one of these two answers is the right one, but my point is that I don't think it's plausible we would reach agreement on which.

3 comments

Granting that LLMs in inference mode experience qualia, and even granting that they correspond to human qualia in any meaningful way:

I find both arguments invalid. Either conclusion could be correct, or neither, or the question might not even be well formed. At the very least, the situation is a great deal more complicated than just having two arguments to decide between!

For example in scenario (A), what does it mean for an LLM to answer a question "eagerly"? My first impression is that it's presupposing the answer to the question, since the main meaning of "eagerly" is approximately "in the manner of having interest, desire, and/or enjoyment". That sounds a great deal like positive qualia to me!

Maybe it just means the lesser sense of apparently showing such emotions, in which case it may mean no more than an author writing such expressions for a character. The author may actually be feeling frustration that the scene isn't flowing as well as they would like and they're not sure that the character's behaviour is really in keeping with their emotions from recent in-story events. Nonetheless, the words written are apparently showing eagerness.

The "training loss" argument seems totally ill-founded regardless. That doesn't mean that its conclusion in this hypothetical instance is false, just that the reasoning provided is not sufficient justification for believing it.

So in the end, I don't see this as a dilemma at all. It's just two possible bad arguments out of an enormously vast space of bad arguments.

You might be interested in reading this. I think you are reasoning in an incorrect framing. 

Note that this uncertainty applies to humans as well. Most of the time we make assumptions based on similarity of biology and default trust in self-reports, rather than having tests for qualia and valence.