Empathy as a natural consequence of learnt reward models

Let’s take a very simplistic model where reward = I am eating chocolate (as detected by the brainstem, say).

There would be some period of time during training when the reward predictor would predict a reward when I see someone else eating chocolate, because there’s a lot of overlap between them-eating-chocolate and me-eating-chocolate in the latent space. I think that’s your point here in this post, right?

But then every time that empathy thing happens, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal—there was a reward prediction, but the reward didn’t happen. And thus the brain would eventually learn a more sophisticated “correct” reward model that didn’t fire empathetically. Right?

Of course, that’s not what really happens: adults have empathy too, and it doesn’t get naturally trained away. That needs to be explained.

One possibility is that the reward model is somehow blinded to any information that could indicate whether something is empathy or not, but that seems difficult to implement. I’m skeptical.
Another possibility is (mumble mumble) regularization, but I dunno how that would work.
My preferred theory is that the brain has some mechanism to detect when a thought is an empathetic simulation, and then it can just choose not to send an error signal in that circumstance. (Or it can do other things with that information.) I’m currently not sure what that mechanism is.
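To make the puzzle and the "don't send an error signal" idea concrete, here is a minimal sketch. It assumes a linear reward predictor over a made-up three-feature latent code and a delta-rule update; the feature layout, the `is_empathetic` flag and the `gate_empathy_errors` switch are illustrative assumptions, not a claim about how the brain does it.

```python
import numpy as np

# Hypothetical latent code: [chocolate-being-eaten, it-is-me, it-is-someone-else]
ME_EATING = np.array([1.0, 1.0, 0.0])
OTHER_EATING = np.array([1.0, 0.0, 1.0])


def train(gate_empathy_errors, steps=2000, lr=0.05):
    """Delta-rule training of a linear reward predictor (prediction = w @ latent)."""
    w = np.zeros(3)
    for step in range(steps):
        # Alternate between eating chocolate myself and watching someone else eat it.
        if step % 2 == 0:
            latent, reward, is_empathetic = ME_EATING, 1.0, False
        else:
            latent, reward, is_empathetic = OTHER_EATING, 0.0, True
        error = reward - w @ latent  # reward prediction error
        if gate_empathy_errors and is_empathetic:
            continue  # assumed mechanism: no error signal for empathetic simulations
        w += lr * error * latent
    return w


w_plain = train(gate_empathy_errors=False)
w_gated = train(gate_empathy_errors=True)
print("ungated prediction for other-eating:", float(w_plain @ OTHER_EATING))
print("gated prediction for other-eating:  ", float(w_gated @ OTHER_EATING))
```

In this toy setup the ungated predictor ends up predicting roughly zero reward for other-eating (the empathetic response is trained away), while the gated predictor keeps predicting about half of the self-eating reward, because the shared "chocolate-being-eaten" feature is never corrected.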

Interested in how you’re thinking about this. Sorry if I misunderstood anything :)

[-]Elias Schmied3y10

In the specific example of chocolate (unless it wasn't supposed to be realistic), are you sure it doesn't get trained away? I don't think that, upon seeing someone eating chocolate, I immediately imagine tasting chocolate. I feel like the chocolate needs to rise to my attention for other reasons, and only then do I viscerally imagine tasting chocolate.

[-]Steven Byrnes3y40

What I really believe is that “the brain does other things with that information”, things more general than “feeling the same feeling as the other person is feeling”. See here:

In envy, if a little glimpse of empathy indicates that someone is happy, it makes me unhappy.
In schadenfreude, if a little glimpse of empathy indicates that someone is unhappy, it makes me happy.
When I’m angry, if a little glimpse of empathy indicates that the person I’m talking to is happy and calm, it sometimes makes me even more angry!

I do think “feeling the same feeling as the other person is feeling” can happen. The chocolate example is not great for that; maybe consider “seeing someone get unexpectedly punched hard in the stomach”. That makes me cringe a bit, still, even as an adult. Maybe an even better example (that only works for half the population) is “seeing someone get kicked in the balls”.

But it’s a bit subtle. If I saw people getting unexpectedly punched hard in the stomach day after day, sure, maybe I would stop cringing. But how much of that is a natural consequence of the learning algorithm and how much of that is “empathy is kinda aversive here, so I learn by RL to leverage top-down attention to deliberately avoid triggering that reaction”? I tend to think it’s mostly the latter, but it’s not obvious.

[-]beren3y52

I think this is a mechanism that actually happens a lot. People generally do lose a lot of empathy with experience and age. People definitely get de-sensitized to both strongly negative and strongly positive experiences after viewing them a lot. I actually think that this is more likely than the RL story -- especially with positive-valence empathy which under the RL story people would be driven to seek out.

But then every time that empathy thing happens, I obviously don’t then immediately eat chocolate. So the reward model would get an error signal—there was a reward prediction, but the reward didn’t happen. And thus the brain would eventually learn a more sophisticated “correct” reward model that didn’t fire empathetically. Right?

My main model for why this doesn't happen in some circumstances (but definitely not all) is that the brain uses these signals and has a mechanism for actually providing positive or negative reward when they fire depending on other learnt or innate algorithms. For instance, you could pass the RPE through to some other region to detect whether the empathy triggered for a friend or enemy and then return either positive or negative reward, so implementing either shared happiness or schadenfreude. Generally I think of this mechanism as a low level substrate on which you can build up a more complex repertoire of social emotions by doing reward shaping on these signals.
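A minimal sketch of this kind of reward shaping, assuming the "other region" simply hands back a friend/enemy label. The `EmpatheticSignal` type, the labels and the sign table are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class EmpatheticSignal:
    predicted_valence: float  # reward-model prediction for the other person's state
    relationship: str         # hypothetical label from the other region: "friend" or "enemy"


def shaped_reward(signal: EmpatheticSignal) -> float:
    """Turn a raw empathetic prediction into actual reward, flipping the sign for enemies."""
    sign = {"friend": 1.0, "enemy": -1.0}.get(signal.relationship, 0.0)
    return sign * signal.predicted_valence


shared_happiness = shaped_reward(EmpatheticSignal(+0.8, "friend"))  # friend is happy -> positive
schadenfreude = shaped_reward(EmpatheticSignal(-0.8, "enemy"))      # enemy is unhappy -> positive
envy_like = shaped_reward(EmpatheticSignal(+0.8, "enemy"))          # enemy is happy -> negative
print(shared_happiness, schadenfreude, envy_like)
```

The point of the sketch is only that one raw signal plus a context-dependent sign is enough to get several distinct social emotions out of the same substrate; unknown relationships are mapped to zero reward here as an arbitrary default.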

Also -- I really like your post on empathy that cfoster linked above! I have read a lot of your work but somehow missed that one lol. Cool we are thinking at least somewhat along similar lines

[-]Steven Byrnes3y20

Thanks!

For instance, you could pass the RPE through to some other region to detect whether the empathy triggered for a friend or enemy and then return either positive or negative reward, so implementing either shared happiness or schadenfreude.

In that case I’d be interested in the “some other region to detect whether the empathy triggered for a friend or enemy”. How is that region doing that? Specifically, (1) what exactly is the “low level substrate”, (2) what are the exact recipes for turning those things into the full complex repertoire of social emotions? Those are major research interests of mine. Happy for you & anyone else to join / share ideas :)

[-]Elias Schmied3y30

Thanks for the reply!

In envy, if a little glimpse of empathy indicates that someone is happy, it makes me unhappy.
In schadenfreude, if a little glimpse of empathy indicates that someone is unhappy, it makes me happy.
When I’m angry, if a little glimpse of empathy indicates that the person I’m talking to is happy and calm, it sometimes makes me even more angry!

How sure are you that these are instances of empathy (defining it as "prediction by our own latent world model of ourselves being happy/unhappy soon")? If I imagine myself in these examples, it doesn't introspectively feel like I am reacting to an impression of their internal state, but rather like I am directly reacting to their social behavior (e.g., abstractly speaking, a learned reflex of status-reasserting anger when someone else displays high status through happy and calm behavior).

This would also cleanly solve the mysteries of why they don't get updated and how they are distinguished from "other transient feelings" - there's no wrong prediction by the latent world model involved (nothing to be distinguished or updated), and the social maneuvering doesn't get negative feedback.

That's where some instinctive disagreement of mine with that post of yours comes from too. But I also haven't read through it carefully enough to be sure.

[-]Steven Byrnes3y30

I think I probably don’t follow what you’re saying. It seems to me that people care very much about the internal state of other people. (Not in the sense of “people care that they have veridical beliefs about the internal state of other people”, but in the sense of “people spend a lot of time thinking about the internal state of other people, and their beliefs about those states are very relevant to their reactions”.)

Like, if I am to feel schadenfreude at Alice’s misfortune, it seems to me that it really matters that it’s a misfortune from Alice’s perspective. If I hate swimming and Alice loves it, and then Alice swims, then I wouldn’t feel schadenfreude there, right? And that requires attending to and reacting to (my beliefs about) Alice’s internal state, right?

Again, this seems very obvious to me, which suggests that I’m probably misunderstanding you.

[-]Elias Schmied3y30

I appreciate the charity!

I'm not claiming that people don't care about other people's internal states, I'm saying that it introspectively doesn't feel like that is implemented via empathy (the same part of my world model that predicts my own emotions), but via a different part of my model (dedicated to modeling other people), and that this would solve the "distinguishing-empathy-from-transient-feelings" mystery you talk about.

Additionally (but relatedly), I'm also skeptical that those beliefs are better described as being about other people's internal states rather than as about their social behavior. It seems easy to conflate these if we're not introspectively precise. E.g., if I imagine myself in your Alice example, I imagine Alice acting happy, smiling and uncaring, and only then is there any reaction - I don't even feel like I'm *able* to viscerally imagine the abstract concept (prod a part of my world model that represents it) of "Alice is happy".

But these are still two distinct claims, and the latter assumes the former.

One illustrative example that comes to mind is the huge number of people who experience irrational social anxiety, even though they themselves would never judge themselves if they were in other people's position.

[-]Steven Byrnes3y20

I'm also skeptical that those beliefs are better described as being about other people's internal states rather than as about their social behavior.

Hmm. Continuing with the schadenfreude example, let’s say Alice stole my kettle and I would feel good if she burned her fingers on it. (Serves her right!) My introspection says, if Alice is alone when she burns her fingers, I’m still happy; that still counts. If I never see her again after that, that still counts. Heck, if she becomes a hermit and never sees another human again, that still counts. And therefore, that thought of Alice burning her fingers is pleasing in a way that is tightly connected to how I believe Alice feels, and disconnected from how I believe Alice is behaving socially, I think.

You mention “I imagine Alice acting happy, smiling and uncaring”. But I feel like the following two things feel very different to me:

“I imagine that Alice is acting happy, smiling and uncaring, and this is straightforwardly related to how she really feels”, versus
“I imagine that Alice is acting happy, smiling and uncaring, but on the inside she’s miserable, and she’s hiding how she really feels”.

What do you think?

I'm saying that it introspectively doesn't feel like that is implemented via empathy (the same part of my world model that predicts my own emotions), but via a different part of my model (dedicated to modeling other people)

I don’t update much on that because I think almost all of the discourse and intuitions and literature surrounding the word “empathy” are not talking about the same thing that I want to talk about. Thus I tend to avoid the word “empathy” altogether where possible. I’ve been using other terms like “empathetic simulation” or “little glimpse of empathy”. I talk about that a bit in Section 13.5.2 here. More specifically, I’m guessing that it doesn’t “feel like empathy” when you imagine Alice burning her fingers on the kettle she stole from me, because that thought feels good, whereas empathizing with Alice would be unpleasant. Here, my model says “yes the thought feels good, and if that’s not what you think of as “empathy”, then the thing you think of as “empathy” is not what I’m talking about”.

When we think of emotion concepts / categories, the valence / arousal / etc. associated with them are central properties. E.g. righteous indignation has to have positive valence and high arousal, otherwise we would call it something else (and think of it as something else). So if you think a thought that involves lots of the same cortical neurons as you get in typical righteous indignation, but those neurons trigger negative valence and low arousal in the brainstem (because of the empathy-detector intervening, or whatever), it wouldn’t feel anything like righteous indignation introspectively. Or something like that.

[-]cfoster03y71

At a high level, I agree that something related to empathy can happen when the same circuits are used for processing thoughts-about-others as for thoughts-about-self. This seems like a design pattern that might be worth copying. My main concerns are:

It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there's some common currency to the experience (for ex. they're feeling pain, and I've also experienced pain), but probably less so when there's a greater gap. Since AIs won't share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI having separate circuits for modeling humans and for modeling itself.
This doesn't seem like it'd give us a robust enough version of empathy by itself, because the agent isn't motivated to actively seek out opportunities to empathize. As an analogy, I know if I were forced to think of, and even look at, the process that produces hamburger meat, I would probably have a visceral reaction and not want to eat the burger. But I like burgers, so I don't seek out that train of thought, so the hypothetical empathy & disgust that would've been invoked lies inactive. Maybe something like Anthropic's Constitutional AI method would help in this direction...

Nitpick about terminology: I think the stuff you're talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use value function to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning. A reward model, on the other hand, is just another part of your model of the world, so it might not be connected to visceral "feels". It doesn't necessarily have any sway over decision-making, in the same way as your "will this number be even or odd" model isn't necessarily connected to any visceral "feels", so you don't tend to make decisions based primarily on those predictions.
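A toy illustration of the distinction, assuming a three-state chain where reward only arrives in the last state. The constants and variable names are made up for the example; the value update is standard tabular TD(0), and the reward model is plain regression on the immediate reward.

```python
GAMMA = 0.9                 # discount factor
LR = 0.1                    # learning rate for both predictors
N_STATES = 3                # states 0 -> 1 -> 2 -> terminal
REWARDS = [0.0, 0.0, 1.0]   # reward is only received in the last state

reward_model = [0.0] * N_STATES  # predicts the immediate reward in each state
value = [0.0] * N_STATES         # predicts discounted long-run reward from each state

for episode in range(500):
    for s in range(N_STATES):
        r = REWARDS[s]
        next_value = value[s + 1] if s + 1 < N_STATES else 0.0
        reward_model[s] += LR * (r - reward_model[s])         # supervised regression on r
        value[s] += LR * (r + GAMMA * next_value - value[s])  # TD(0) update

print("reward model:", [round(x, 2) for x in reward_model])
print("value:       ", [round(x, 2) for x in value])
```

After training, the reward model is only non-zero in the state where reward actually arrives, whereas the value function is already elevated in the earlier states that lead there. That forward-looking forecast is the thing the comment above associates with visceral reactions to thoughts.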

Also if you haven't read this post, I think it's a good one and very related.

[-]beren3y10

It seems like the AIs we build will be very different from us, at least in terms of basic drives. I can definitely empathize when there's some common currency to the experience (for ex. they're feeling pain, and I've also experienced pain), but probably less so when there's a greater gap. Since AIs won't share any of our physiology or evolutionary history, I worry that that common currency will be missing, which would seemingly incentivize the AI having separate circuits for modeling humans and for modeling itself

Yes, this depends a lot on the self model of the AGI. It's definitely not a silver bullet. The AGI will almost certainly have a very good model of humans, their culture, and how their minds work from various self-supervised losses. Whether the AGI conceptualises itself as close to this or not depends on the representations of AGI in the dataset as well as potentially our training regime.

Nitpick about terminology: I think the stuff you're talking about is primarily attributable to having a learned value function rather than to having a learned reward model in the narrow sense of a predictor of immediate reward. I tend to use value function to refer to the thing that, alongside the reward function, produces visceral (gut-like) reactions to thoughts based on forecasts that were learned via something like TD learning

I agree it is not necessarily the reward model that generates direct feelings. I think it is hard to connect any part of an RL system directly to gut level 'feels' because we don't really know what these are. The value function is just the estimate of the long run reward and is trained on a supervised Bellman equation. It is very possible that the machinery that creates this won't exist at all in the AGI, or maybe it is just some intrinsic property of RL agents I don't know.

[-]DragonGod3y30

Typos

There are good reasons for not naturally learning this kind of entirely ego-centric world model with a complete separation in latent self between concepts involving self and involving others.

Bolded should be "latent space".

[-]P.3y30

Also "indivudals".

[-]Charlie Steiner3y*21

Very interesting, thanks. I'm unconvinced that the motivational aspects of empathy are common in learning algorithms that look like gradient descent - if flinching when someone else is hurt doesn't harm your reproductive fitness then maybe it's easy for evolution to stick with it, but substantively changing your plans to avoid causing that flinch (as in the rats not shocking other rats) should rise to the attention of gradient descent and get massaged out.

My prediction is that there really is an evolved nudge towards empathy in the human motivational system, and that human psychology - like usually being empathetic but sometimes modulating it and often justifying self-serving actions - is sculpted by such evolved nudges, and wouldn't be recapitulated in AI lacking those nudges.

[-]beren3y30

My prediction is that there really is an evolved nudge towards empathy in the human motivational system, and that human psychology - like usually being empathetic but sometimes modulating it and often justifying self-serving actions - is sculpted by such evolved nudges, and wouldn't be recapitulated in AI lacking those nudges.

I agree -- this is partly what I am trying to say in the contextual modulation section. The important thing is that the base capability for empathy might exist as a substrate to then get sculpted by gradient descent / evolution to implement a wide range of adaptive pro or anti-social emotions/behaviours. Which of these behaviours, if any, get used by the AI will depend on the reward function / training data it sees.

[-]Davey Morse10mo10

The key idea that leads to empathy is the fact that, if the world model performs a sensible compression of its input data and learns a useful set of natural abstractions, then it is quite likely that the latent codes for the agent performing some action or experiencing some state, and another, similar, agent performing the same action or experiencing the same state, will end up close together in the latent space. If the agent's world model contains natural abstractions for the action, which are invariant to who is performing it, then a large amount of the latent code is likely to be the same between the two cases. If this is the case, then the reward model might 'mis-generalize' to assign reward to another agent performing the action or experiencing the state rather than the agent itself. This should be expected to occur whenever the reward model generalizes smoothly and the latent space codes for the agent and another are very close in the latent space. This is basically 'proto-empathy' since an agent, even if its reward function is purely selfish, can end up assigning reward (positive or negative) to the states of another due to the generalization abilities of the learnt reward function ^[1].

awesome

[-]Sergey Cleftsow3y10

Why do you consider the behavior of so-called "psychopaths" a "disorder"? What if the norm here is just a matter of cultural expectations? What is normal and what is not can be understood by comparing how individuals behave when cultural norms don't limit them. And if, let's say, 40% of specimens then behave as psychopaths (in particular, manifesting violence as a stable pattern), then we cannot describe those individuals as having a "disorder." We have to consider them a particular segment of the Homo sapiens population with a specific evolutionary function.

[-]MiguelDev3y10

Outside of apes and monkeys, dolphins and elephants, as well as corvids, also appear in anecdotal reports and the scientific literature to have many complex forms of empathy.

Might be related to Erich Neumann's book The Great Mother, which says: "The psychological development [of humankind]... begins with the 'matriarchal' stage in which the archetype of the Great Mother dominates and the unconscious directs the psychic process of the individual and the group." It's like when we see animals in the wild, e.g. a lioness and her cub: we always read them as mother and child. We do not have to google or open a book to make sure that this is the case; deep within our psyche is a pattern that allows us to interpret it as such.

[-]Ben Amitay3y10

I agree with other commenters that this effect will be washed out by strong optimization. My intuition is that the problem of distinguishing self from other is easy enough (and supported by enough data) that the optimization doesn't have to be that strong.

[I began writing the following paragraph as a counter-argument to the post, but it ended up less decisive once I thought through the details; see the next paragraph:] There are many general mechanisms for convergence, synchronization and coordination. I hope to write a list in the near future. For example, as you wrote, having a model of other agents is obviously generally useful, and it may require having an approximation of both their world models and value functions as part of your world model. Unless you have huge amounts of data and compute, you are going to reuse your own world model as theirs, with small corrections on top. But this is about your world model, not your value function.

[The part that helps your argument. Epistemic status: many speculative details, but ones that I find pretty convincing, at least before multiplying their probabilities] Except that having the value function of other agents in your world model, and having the mechanism for predicting their actions as part of your world-model update, basically replicates computations that you already have in your actor and critic, in a more general form. Your original actor and critic are then likely to simplify to "do the things that my model of myself would, and value the results as much as my model of myself would" + some corrections. At that stage, if the "some corrections" part is not too heavy, you may have some confusion of the kind that you described. Of course, it will still be optimized against.

[-]Ben Amitay3y10

BTW speaking about value function rather than reward model is useful here, because convergent instrumental goals are a big part of the potential for reuse of others' (deduced) value function as part of yours. Their terminal goals may then leak into yours due to simplicity bias or uncertainty about how to separate them from the instrumental ones.

The main problem with that mechanism is that you liking chocolate will probably leak as "it's good for me too to eat chocolate", not "it's good for me too when beren eats chocolate" - which is more likely to cause conflict than coordination, if there is only so much chocolate.

[-]Ben Amitay3y10

And specifically for humans, I think there probably was evolutionary pressure actively in favor of leaking terminal goals: since the terminal goals of each of us are a noisy approximation of evolution's "goal" of increasing the number of offspring, that kind of leaking has potential for denoising. I think I explicitly heard this argument in the context of ideals of beauty (though many other things are going on there and pushing in the same direction).

[-]beren3y22

I agree that this will probably wash out under strong optimization against it, and that such confusions become less likely the more your own world model differs from that of the other agent you are trying to simulate -- this is exactly what we see with empathy in humans! This is definitely not proposed as a full 'solution' to alignment. My thinking is that this effect may be useful for us in providing a natural hook to 'caring' about others, around which we can then design training objectives and regimens that extend and optimise this value shard to a much greater extent than occurs naturally.

[-]Ben Amitay3y10

We agree 😀

What do you think about some brainstorming in the chat about how to use that hook?

[-]rvnnt3y1-1

Whether we can build artificial empathy into AI systems also has clear relevance to AI alignment.

I disagree. My tentative guess would be that in the majority of worlds where humanity survives and flourishes, {AGI having empathy} contributed ~nothing to achieving that success. (For most likely interpretations of "empathy".)

If we can create empathic AIs, then it may become easier to make an AI be receptive to human values, even if humans can no longer completely control it.

I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)

You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it's a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.

Also, doing a quick bit of Rationalist Taboo on "empathy", it looks to me like that word is pointing at a rather complicated, messy swath of territory. I think that swath contains many subtly and not-so-subtly different things, most of which would not begin to be sufficient for alignment (albeit that some might be necessary).

[-]beren3y10

I suspect that {the cognitive process that produced the above sentence} is completely devoid of security mindset. If so, might be worth trying to develop security mindset? And/or recognize that one is liable to (i.a.) be wildly over-optimistic about various alignment approaches. (I notice that that sounded unkind; sorry, not meaning to be unkind.)

Yep this is definitely not proposed as some kind of secure solution to alignment (if only the world were so nice!). The primary point is that if this mechanism exists it might provide some kind of base signal which we can then further optimize to get the agent to assign some kind of utility to others. The majority of the work will of course be getting that to actually work in a robust way.

You pointed out that empathy is not a silver bullet. I have a vague (but poignant) intuition that says that the problem is a lot worse than that: Not only is empathy not a silver bullet, it's a really really imprecise heuristic/proxy/shard for {what we actually care about}, and is practically guaranteed to break down when subjected to strong optimization pressure.

Yes. Realistically, I think almost any proxy like this will break down under strong enough optimization pressure, and the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax.

[-]rvnnt3y10

the name of the game is just to figure out how to prevent this much optimization pressure being applied without imposing too high a capabilities tax

Hmm. I wonder if you'd agree that the above relies on at least the following assumptions being true:

(i) It will actually be possible to (measure and) limit the amount of "optimization pressure" that an advanced A(G)I exerts (towards a given goal).
(ii) It will be possible to end the acute risk period using an A(G)I that is limited in the above way.

If so, how likely do you think (i) is to be true? If you have any ideas (even very rough/vague ones) for how to realize (i), I'd be curious to read them.

I think realizing (i) would probably be at least nearly as hard as the whole alignment problem. Possibly harder. (I don't see how one would in actual practice even measure "optimization pressure".)

[-]beren3y2-1

(i) It will actually be possible to (measure and) limit the amount of "optimization pressure" that an advanced A(G)I exerts (towards a given goal). If so, how likely do you think (i) is to be true?

If you have any ideas (even very rough/vague ones) for how to realize (i), I'd be curious to read them.

For this, it is not clear to me that it is impossible or even extremely difficult to do this, at least in a heuristic way. I think that managing to successfully limit the optimization power applied against our defences is fundamental to coming up with alignment techniques that can work in practice. We need some way to bound the adversary otherwise we are essentially doomed by construction.

There is a whole bunch of ideas you can try here which work mostly independently and in parallel -- examples of this are:

1.) Quantilization

2.) Impact regularization

3.) General regularisation against energy use, thinking time, compute cost

4.) Myopic objectives and reward functions. High discount rates

5.) limiting serial compute of the model

6.) Action randomisation / increasing entropy -- something like dropout over actions.

7.) Satisficing utility/reward functions

8.) Distribution matching objectives instead of argmaxing

9.) penalisation of divergence from a 'prior' of human behaviour

10.) Maintaining value uncertainty estimates and acting conservatively within the outcome distribution

These are just examples I have thought of immediately. There are a whole load more if you sit down and brainstorm for a while.
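As one concrete instance, here is a rough sketch of item 1 (quantilization): instead of taking the argmax action, sample uniformly from the top q fraction of actions drawn from some trusted base distribution. The function name, the Gaussian stand-in for the base distribution and the identity utility are assumptions made for the example.

```python
import random


def quantilize(actions, utility, q, rng):
    """Sample uniformly from the top-q fraction of `actions`, ranked by `utility`.

    `actions` is assumed to be a list of samples from a trusted base distribution.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(actions, key=utility, reverse=True)
    keep = max(1, int(len(ranked) * q))  # size of the top-q slice, at least one action
    return rng.choice(ranked[:keep])


rng = random.Random(0)
# Stand-in for a human-like prior over actions; here just 1000 Gaussian draws.
base_samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
chosen = quantilize(base_samples, utility=lambda a: a, q=0.1, rng=rng)
print("chosen action:", chosen)
```

With q = 1 this just samples from the base distribution, and as q shrinks towards zero it approaches plain argmax over the samples, so q acts as a single dial on how much optimization pressure is applied.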

In terms of measuring optimization power, I don't think this is that hard to do roughly. We can definitely define it in terms of outcomes as the KL divergence of the achieved distribution from some kind of prior 'uncontrolled' distribution. We already implement KL penalties in RL like this. Additionally, rough proxies are serial compute, energy expenditure, compute expenditure, divergence from previous behaviour etc.
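A small sketch of the outcome-based measure, assuming discrete outcome distributions that can actually be written down (which is the hard part in practice). The three distributions below are invented numbers purely for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes.

    Assumes q is non-zero wherever p is non-zero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


prior = [0.25, 0.25, 0.25, 0.25]    # hypothetical 'uncontrolled' outcome distribution
mild = [0.40, 0.30, 0.20, 0.10]     # outcomes under a weakly optimizing agent
extreme = [0.97, 0.01, 0.01, 0.01]  # outcomes under a strongly optimizing agent

print("mild optimizer:   ", kl_divergence(mild, prior))
print("extreme optimizer:", kl_divergence(extreme, prior))
```

The agent that concentrates almost all probability on one outcome scores a much larger divergence from the prior than the mild one, which is the sense in which KL can serve as a rough yardstick for optimization pressure. A KL penalty in RL subtracts a multiple of this quantity from the reward.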

It will be possible to end the acute risk period using an A(G)I that is limited in the above way.

The major issue is what level of alignment tax these solutions impose and whether it is competitive with other players. This ultimately depends on the amount of slack that is available in the immediately post-AGI world. My feeling is that it is possible there is quite a lot of slack here, at least at first, and that most of the behaviours we really want to penalise for alignment purposes are quite far from most likely behaviour -- i.e. there is very little benefit to us of having the AGI having such a low discount rate it is planning about tiling the universe with paperclips in billions of years.

I also don't think of these so much as solutions but as part of the solution -- i.e. we still need to find good robust ways of encoding human values as goals, detect and prevent inner misalignment, and have some approach to manage Goodharting.


^{^}

Our theory is very similar to the [Perception-Action-Mechanism](https://web-archive.southampton.ac.uk/cogprints.org/1042/) (PAM), and the very similar 'simulation theory' of empathy. Both argue that empathy occurs because our brain essentially learns to map representations of others experiencing some state to our own representations for that state. Our contribution is essentially to argue that this isn't some kind of special ability that must be evolved, but rather a natural outcome of an architecture which learns a reward model against an unsupervised latent state.

^{^}

One prediction of this hypothesis would be that we should expect general unsupervised models, potentially attached to RL agents, to naturally develop all kinds of 'mirror neurons' if trained in a multi-agent environment.
