Thanks for writing this! I finally got around to reading it, and I think it is a great reverse-engineering of these human felt motivations. I think I'm buying much of it, but I have been thinking of aggregation cases and counterexamples, and would like to hear your take on it.
A friend wins an award; I like them, but I feel a stab of envy (and sometimes may even wish they’d fail). That is negative valence without an “enemy” label, and not obviously about their attention to me. For example:
when another outperforms the self on a task high in relevance to the self, the closer the other the greater the threat to self-evaluation. -- Some affective consequences of social comparison and reflection processes: the pain and pleasure of being close; Tesser et al., 1988
Is the idea that the "friend/enemy" variable is actually more like "net expected effect on my status," so a friend’s upward move can locally flip them into a threat?
I can dislike a competitor and still feel genuine admiration for their competence or courage. If "enemy" is on, why doesn’t it reliably route through provocation or schadenfreude? Do you think admiration is just a different reward stream, or does it arise when the "enemy" tag is domain-specific?
E.g., an opposing soldier or a political adversary is injured and I feel real compassion, even if I still endorse opposing them.
Two-thirds of respondents (65 per cent) say they would save the life of a surrendering enemy
combatant who had killed a person close to them, but almost one in three (31 per cent) say
they would not. The same holds true when respondents are asked if they would help a
wounded enemy combatant who had killed someone close to them (63 per cent compared
with 33 per cent). -- PEOPLE ON WAR Country report Afghanistan
This feels like “enemy × their distress” producing sympathy rather than schadenfreude. Is your take that “enemy” isn’t a stable binary at all—that vivid pain cues can transiently force a “person-in-pain” interpretation that overrides coalition tagging?
Someone helps me. I feel gratitude and an urge to reciprocate. It doesn’t feel like "approval reward" (I’m not enjoying being regarded highly). It feels more like a debt.
Perceived benevolent helper intentions were associated with higher gratitude from beneficiaries compared to selfish ones, yet had no associations with indebtedness. -- Revisiting the effects of helper intentions on gratitude and indebtedness: Replication and extensions Registered Report of Tsang (2006)
Do you see gratitude as downstream of the same "they’re thinking about me" channel, or as a separate ledger?
People often report guilt as a direct response to "I did wrong," even when they’re confident nobody will know.
when opportunities for compensation are not present, guilt may evoke self-punishment. -- When guilt evokes self-punishment: evidence for the existence of a Dobby Effect
I'm not sure that fits guilt from "imagined others thinking about me." It looks like a norm-violation penalty that doesn’t need the “about-me attention” channel. Do you have a view on which way it goes?
I have been wondering whether the suggested processing matches what we would expect for larger groups of people (who could all be friend/enemy and/or thinking of me or not). There seem to be at least two different processes going on:
Compassion doesn’t scale with the number of people attended to. This seems to be well established for the identifiable victim effect and psychic numbing. When harm is spread over many victims, affect often collapses into numbness unless one person becomes vivid. That matches your attentional bottleneck.
But evaluation does seem to scale with headcount, at least in stage fright and other audience effects.
Maybe a roomful of people can feel strongly like “they’re thinking about me,” even if you’re not tracking anyone individually? But then the “about-me attention” variable would be computed at the group level, which complicates your analysis.
Thanks, this is great!
Envy toward a friend’s success…
I used to think that envy was a social instinct (before 2023ish), but now I don’t think it’s a social instinct at all (see “changelog” here). Instead I currently think that envy is a special case of, umm, “craving” (in the general colloquial sense, not the specific Buddhist sense)—a kind of anxious frustration in a scenario where something is highly salient, and highly desired, but in fact cannot happen.
So a social example would be: Sally has a juice box, and I love juice, but I can’t have any. Looking at Sally drinking juice reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
Whereas a non-social example of the same innate reaction would be: It’s lunch time, and every day at lunch I have juice and a sandwich in a brown paper bag, and I love juice. But it happens that there’s a new global juice shortage, so today for the first time I don’t have any juice. Looking at my sandwich and the brown bag reminds me of the scenario where I’m drinking juice, which makes me unhappy because I don’t have any juice.
So that’s my starting point: both these two examples are the same kind of (not-specifically-social) craving-related frustration reaction.
After that, of course, the Sally scenario becomes social, because the scenario involves Sally doing something (i.e. drinking juice) that causes me to feel an unpleasant feeling (per above), and generically if someone is causing me unpleasant feelings then that tends to push me from regarding Sally as a friend, towards regarding her as an enemy, and to feel motivated to find an excuse to blame her for my troubles and pick a fight with her.
Admiration for a rival or enemy
My guess is that, just as going to bed can feel like a good idea or a bad idea depending on which aspects of the situation you’re paying attention to, likewise Genghis Khan can feel like a friend or an enemy depending on which aspects of him you’re paying attention to. I would suggest that people don’t feel admiration towards Person X and schadenfreude towards Person X at the very same instant. You might be able to flip back and forth from one to the other very quickly, even within 1 or 2 seconds, but not at the very same instant. For example, if I say the sentence “It was catastrophic how Genghis Khan killed all those people, but I have to admit, he was a talented leader”, I would suggest that the “innate friend-vs-enemy parameter” related to thoughts of Genghis Khan flips from enemy in the first half of the sentence to friend in the second half.
Compassion for a stable enemy’s suffering
There probably isn’t one great answer; probably different people are different. As above, we can think of people in different ways, paying attention to different aspects of them, and they can flip rapidly from enemy to friend and back. Since attention control is partly voluntary, it’s partly (but not entirely) a choice whether we see someone as a friend vs enemy, and we tend to choose the option that feels better / more motivating on net, and there can be a bunch of factors related to that. For example, approval reward is a factor—some people take pride in their compassion (just as we nod approvingly when superheroes take compassion upon their enemies, and cf. §6), while others take pride in their viciousness. Personality matters, culture matters, the detailed situation matters, etc.
Gratitude / indebtedness
Hmm. Generically, I think there are two (not mutually exclusive) paths:
As an example of the latter, recently someone important-to-me went out of his way to help me, and I expected the interaction to work out well for him too, but instead it wound up being a giant waste of his time, and objectively it wasn’t really my fault, but I still felt horrible and lost much sleep over it, and I think the aspect that felt most painful to me was when I imagined him secretly being annoyed at me and regretful for ever reaching out to me, even if he was far too nice a guy to say anything like that to me directly.
…But I’m kinda neurotic; different people are different and I don’t want to overgeneralize. Happy to hear more about how things seem to you.
Private guilt
I talked about “no one will ever find out” a bit in §6.1 of the approval reward post. I basically think that you can consciously believe that no one will ever find out, while nevertheless viscerally feeling a bit of the reaction associated with a nonzero possibility of someone finding out.
As for the “Dobby effect” (self-punishment related to guilt, a.k.a. atonement), that’s an interesting question. I thought about it a bit and here’s my proposed explanation:
Generally, if Ahab does something hurtful to Bob, then Bob might get angry at Ahab, and thus want Ahab to suffer (and better yet, to suffer while thinking about Bob, such as if Bob is punching Ahab in the face). But that desire of Bob’s, just like hunger and many other things, is satiable—just like a hungry person stops being hungry after eating a certain amount, likewise Bob tends to lose his motivation for Ahab to suffer, after Ahab has already suffered a certain amount. For example, if an angry person punches out his opponent in a bar fight, he usually feels satisfied, and doesn’t keep kicking his victim when he’s down, except in unusual cases. Or even if he kicks a bit, he won’t keep kicking for hours and hours.
We all know this intuitively from life experience, and we intuitively pick up on what it implies: if Ahab did something hurtful to Bob, and Ahab wants to get back to a situation where Bob feels OK about Ahab ASAP, then Ahab should be making himself suffer, and better yet suffer while thinking about Bob. Then not only is Ahab helping dull Bob’s feelings of aggression by satiating them, but simultaneously, there’s the very fact that Ahab is helping Bob feel a good feeling (i.e., satiation of anger), which should help push Ahab towards the “friend” side of the ledger in Bob’s mind.
Aggregation cases
In “identifiable victim effect”, I normally think of, like, reading a news article about an earthquake across the world. It’s very abstract. There’s some connection to the ground-truth reward signals that I suggested in Neuroscience of human social instincts: a sketch, but it’s several steps removed. Ditto “psychic numbing”, I think.
By contrast, in stage fright, you can see the people right there, looking at you, potentially judging you. You can make eye contact with one actual person, then move your eyes, and now you’re making eye contact with a different actual person, etc. The full force of the ground-truth reward signals is happening right now.
Likewise, for “audience effect”, we all have life experience of doing something, and then it turns out that there’s a real person right there who was watching us and judging us based on what we did. At any second, that real person could appear, and make eye contact etc. So again, we’re very close to the full force of the ground-truth reward signals here.
…So I don’t see a contradiction there.
Again I really appreciate this kind of comment, feel free to keep chatting.
Hm. It all makes sense to me, but it feels like you are adding more gears to your model as you go. It is clear that you have thought more about this than anybody else and can provide explanations that I cannot fully wrap my mind around, and I'm unable to tell whether this is all coming from a consistent model or is more something plausible you suggest could work.
But even if I don't understand all of your gears, what you explained allows us to make some testable predictions:
Admiration for a rival or enemy with no mixed states
My guess is that, just as going to bed can feel like a good idea or a bad idea depending on which aspects of the situation you’re paying attention to ... I would suggest that people don’t feel admiration towards Person X and schadenfreude towards Person X at the very same instant.
That should be testable with a 2x2 brow/cheek EMG design where cues for one target person are rapidly paired with alternating micro-prompts that highlight admirable vs blameworthy aspects. Cueing “admire” should produce a rise in cheek activity and/or a drop in brow activity, and cueing “condemn” should flip that pattern, with little simultaneous activation of both.
Envy clearly splits into frustration and craving
Envy should split into frustration and craving signals in an experiment where a subject doesn't get something either due to a specific person’s choice or due to pure bad luck (or, as a control, systemic scarcity). Then frustration should always show up, but social hostility should spike only in the first case. That seems very likely to me.
Private guilt splits into discovery and appeasement
If private guilt decomposes into discovery and appeasement, a 2x2 experiment where discoverability is low vs impossible and the victim can be appeased or not should show it: reparative motivation should be strongest when either discovery is plausible or appeasement is possible, while “Dobby-style” self-punishment should occur especially when appeasement is conceptually relevant but blocked.
Aggregation
By contrast, in stage fright, you can see the people right there, looking at you, potentially judging you. You can make eye contact with one actual person, then move your eyes, and now you’re making eye contact with a different actual person, etc. The full force of the ground-truth reward signals is happening right now.
This would predict that stage fright should scale with gaze cues, i.e.,
This should be testable with real-life interventions (people in the audience could wear masks, look away, do other things) or VR experiments, though not cheaply.
I'm not sure about good experiments for the other cases.
Thanks! I appreciate the brainstorming here.
it feels like you are adding more gears to your model as you go. … I'm unable to tell if this is all coming from a consistent model or is more something plausible you suggest could work.
I am acutely aware of the risk of post-hoc storytelling instead of principled postdiction :) I think I'm pretty good at doing principled postdiction rather than post-hoc storytelling (although maybe everybody thinks that about themselves), but I’m certainly capable of the latter, especially when I’m just brainstorming and haven't stewed on something for months or years. E.g. much of my previous comment was early-stage low-confidence brainstorming, I hope I made that clear. :)
…allows us to make some testable predictions…
I think I’m a lot more skeptical than you about almost any psych-style experiments being even worth the time to do, let alone definitive. I imagine the experiment coming out one way, or coming out the other way, and either way, it seems very easy to explain the result. There are just too many degrees of freedom, and too wide and hazy a hypothesis space (at this stage), and too many degrees of separation between the question and the measurement. (See also: You Are Not Measuring What You Think You Are Measuring.)
Relatedly, Bayes says you kinda need two plausible hypotheses, then an experiment can favor one over the other. But I almost never have that. Rather, it’s all I can do to get to ONE hypothesis that really hangs together and is consistent with everything we know in neuroscience, evolution, algorithm theory, everyday life, mental health, culture, and so on. At least at a high level. (At a lower level, things are much more under-constrained, e.g. I can imagine dozens of ways that some calculation might be divvied up among different neuron groups. But then I don’t care as much about what the answer is.)
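For concreteness, here is the odds form of Bayes' rule that this point leans on. This is the standard textbook statement, not anything specific to the theory under discussion; $H_1$, $H_2$, and $D$ are just illustrative labels for two competing hypotheses and an experimental result.

```latex
\[
\underbrace{\frac{P(H_1 \mid D)}{P(H_2 \mid D)}}_{\text{posterior odds}}
=
\underbrace{\frac{P(D \mid H_1)}{P(D \mid H_2)}}_{\text{likelihood ratio}}
\times
\underbrace{\frac{P(H_1)}{P(H_2)}}_{\text{prior odds}}
\]
```

If either outcome of an experiment would be about equally easy to explain under both hypotheses, the likelihood ratio is close to 1, so the posterior odds barely move from the prior odds. And if there is no well-specified $H_2$ at all, the ratio is not even defined.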
cheek EMG
Some areas where underdetermined messiness can sneak in here are:
(1) We don’t know what the person is actually thinking at any given time, e.g. they aren’t necessarily paying attention to the current cue, or the cue might remind them of an embarrassing thing that they did in middle school etc.
(2) We don’t know the map from the space of facial expressions to the alleged “innate parameters” that I’m hypothesizing, and it might be indirect. For example, if we see a friendly and angry facial expression simultaneously, does that mean that the underlying hypothalamic groups (or whatever) are not mutually inhibitory? Or might they just be activating so close in time that it looks simultaneous? Or might there be yet a third facial expression that just happens to look like those two coinciding?
Mapping the space of microexpressions (including pupil dilation etc.) in a rigorous way seems potentially interesting and useful, but it wouldn’t lead to a nice legible non-invasive experiment that proves my theory or any other theory, unless we also have a nice way to measure and break down what’s happening upstream of those facial expressions, i.e. in the domain of “feelings and situations triggering innate reactions”, and we don’t. For example, my theory of laughter is unusually simple, but I still don’t know how to get really nice strong legible evidence of it via anything like a psych study. I have a proposed path forward but it involves neural tracing experiments (or equivalent) in rodents.
Envy clearly splits into frustration and craving
I’m having trouble following this one. If it’s important, I wonder if you can try again to explain it more concretely? What would be the possible results, and what do you think we would learn in each case?
Private guilt splits into discovery and appeasement
Just trying to think this through:
I guess you’re saying that I’m likeliest to fix the toy in A, and I’m likeliest to ostentatiously “beat myself up” in C? If so, yeah that seems likely. But as usual, I doubt that “proving” this experimentally would (or should) convince skeptics of any specific underlying theory.
gaze cues
As usual I can’t think of how to set up an experiment that would (or should) convince skeptics and that would not have lots of possible interpretations. For example, suppose we compare a person’s stage fright vs how much they look at the audience’s eyes. I would find it equally easy to explain both possible experimental correlations. If the correlation is positive, I could say “aha, looking at the audience’s eyes causes stage fright”. If the correlation is negative, I could say “aha, people with stage fright are deliberately avoiding looking at the audience’s eyes, because that would be too much for them to handle”.
Hmm, I guess one could look at whether physiological arousal jumps upward at the moments when eye contact with an audience member happens, and whether those jumps are bigger in people with stage fright? Seems pretty likely to me. But again, I’m not sure what skeptic would find that this data changes their mind about anything, and conversely if it turned out the other way I would be mildly surprised and confused, but I probably wouldn’t be SO surprised and confused that I would change my mind on anything important.
I think your question is kinda too vague to answer. (You’re asking for a comparison of two AI architectures, but what are they? I need more detail. Are we assuming that the two options are equally powerful & competent? If so, is that a good assumption? And is that power level “kinda like LLMs of today”, “superintelligence”, or something in between?)
…But maybe see my post Foom & Doom §2.9.1: If brain-like AGI is so dangerous, shouldn’t we just try to make AGIs via LLMs? for some possibly-related discussion.
An AI without a factorized motivation system (like an LLM) seems more immediately corrigible than an AI with a motivation system, with a lower ceiling for capabilities. Fine-tuned LLMs are capable of accomplishing tasks, and it's possible to anthropomorphize them, but the "value knowledge" upstream of their behaviors exists in an ad-hoc, fractured state, and it can be crudely manipulated. Emergent Misalignment and Subliminal Learning are nice examples. The LLM training process is like an industrialized process for growing a big, connectionist grammar of abstractions from text data. In this process, they induce text concepts that are remarkably functionally equivalent to the concepts that humans induce and utilize, assuming they have a lot of examples to learn from and aren't forced out of distribution. When these assumptions don't hold, the ad-hoc nature of the concepts they learn is evident.
Animals/people with motivation systems induce concepts through sensory experience, which they actively sample and predict. They control the stream of experience with both subconscious and motivated behavior. The knowledge they gain is "on the same level" as the learning process that generated that knowledge, and as such they are capable of generating more knowledge in the future. A motivation system seems like it might be necessary to facilitate the process of squeezing into the nooks and crannies of the world and manipulating concepts and objects so that they can be seen/predicted from all angles and utilized towards the process of accomplishing goals in a robust manner. Growing a fixed grammar of abstractions from massive IID batches of human text put into a blender seems inadequate for learning good representations, in comparison. People are actively trying to make LLMs better, but it's possible that the entire paradigm is limited, and will plateau before it becomes meaningfully capable in a way that we would recognize as dangerous. They might always have a shallow and LARPy relationship to their training data, and it might not be feasible to generate more in a scalable way, although people are still trying.
This is not to say that LLMs are useless or bad. They are a genuinely helpful consumer product, and they seem pretty corrigible and aligned. (Obviously there are instances like GPT-4o disorienting people or chatbots not intervening strongly enough to dissuade suicidal people, but I think it is a category error to impute "malice" or "neglect" to the LLM in this case, because they are simply kind of incoherent/ad-hoc in their motivations.) Many AI Doomer types weren't able to conceptualize the corrigibility of this approach until it happened, because they were working from an operational definition of intelligence that involved steering and prediction. They didn't necessarily have strong priors/intuitions about a mechanistic definition of intelligence (think about Turing and von Neumann, who came up with mechanistic definitions of computation), so they assumed that any approach that seemed to display "steering" and "prediction" was at risk of freely scaling into superintelligence and going FOOM.
To get back to your question, AIs with a factorized motivation system, like the hypothetical future model-based RL agent architecture that Rich Sutton wants to build, would likely be much more capable and dangerous than LLMs, because they would be capable of sidestepping the fundamental limitations of LLMs, and they would be better prepared to accomplish motivated goals. But it seems like it would be really hard to turn one of these agents into something that is corrigible and economically valuable and useful for consumers, like LLMs. You would have to literally raise it like a child or an animal and go through the process of educating it to be economically valuable, while trying to figure out how to keep it aligned to your interests. Sure, you could amortize the effort you spent on one copy across multiple copies. But it seems really unlikely for these agents to remain in thrall to the corporation renting them out to consumers, especially if they're learning continually. It's still sort of difficult for me to conceptualize what people would need a creature with its own will for, since it would be much less well behaved than an LLM. But then again, Turing and von Neumann weren't able to conceptualize the internet, or social media, or video games when they were defining their ideas about computation, so I probably just need to think about it more. We are very accustomed to the LLM reference class at this point, and this might be injecting unhelpful framing assumptions into my thinking.
1. Intro & summary
1.1 Background
In Intro to Brain-Like-AGI Safety (2022), I argued: (1) We should view the brain as having a reinforcement learning (RL) reward function, which says that pain is bad, eating-when-hungry is good, and dozens of other things (sometimes called “innate drives” or “primary rewards”); and (2) Reverse-engineering human social innate drives in particular would be a great idea—not only would it help explain human personality, mental health, morality, and more, but it might also yield useful tools and insights for the technical alignment problem for Artificial General Intelligence.
Then in Neuroscience of human social instincts: a sketch (2024), I worked towards that goal of reverse-engineering human social drives, by proposing what I called the “compassion / spite circuit”, centered around a handful of (hypothesized) interconnected neuron groups in the hypothalamus and brainstem (but also interacting with other brain regions; see that link for gory details). I suggested that this circuit is central to our social instincts, underlying not only compassion and spite, but also (surprisingly[1]) much of status-seeking and norm-following.
1.2 Summary of this post
The next task is to dive into the “compassion / spite circuit” more systematically, trying to build an ever-better bridge that connects from neuroscience & algorithms on one shore, to the richness of everyday human experience on the other. In particular:
1.3 Teaser for subsequent posts
After we finish this post, I have a follow-up post which will analyze “Approval Reward”, a second of those four reward streams coming from the “compassion / spite circuit”.
For the post after that—well, there’s also a third and fourth reward stream, but those are less important from an AI alignment perspective, so I’ll skip those for now. Instead, I’ll pivot back to discussing technical AI alignment more directly.
2. Splitting the “compassion / spite circuit” into four reward streams
I propose to split up the instances where the “compassion / spite circuit” is spitting out rewards, into four natural categories, depending on:
These two choices fill out a 2×2 table, and I’ll make up a suggestive term for each of the four boxes:
Oversimplified gloss on these:
“Sympathy Reward” makes me want to see my friends and idols happy, not suffering.
“Schadenfreude Reward” makes me want to see my enemies suffering, not happy.
“Approval Reward” makes me want my friends and idols to like me rather than hate me; to think of me as impressive rather than cringe; to give me credit for helping them rather than blame for harming them; and so on.
“Provocation Reward” makes me want to pick fights with my enemies.
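As a rough illustration, the 2×2 can be written as a lookup table. This is only a sketch: the two axis names below (`Stance` and `Focus`) are assumptions inferred from the four glosses above (friend-vs-enemy, and whether what matters is the other person's feelings or their view of me), not the post's own wording for the two criteria.

```python
from enum import Enum


class Stance(Enum):
    """Hypothetical axis 1: how I regard the other person."""
    FRIEND_OR_IDOL = "friend or idol"
    ENEMY = "enemy"


class Focus(Enum):
    """Hypothetical axis 2: what about them is triggering the circuit."""
    THEIR_FEELINGS = "how they feel"
    THEIR_VIEW_OF_ME = "how they regard me"


# The four boxes, named as in the glosses above.
REWARD_STREAMS = {
    (Stance.FRIEND_OR_IDOL, Focus.THEIR_FEELINGS): "Sympathy Reward",
    (Stance.ENEMY, Focus.THEIR_FEELINGS): "Schadenfreude Reward",
    (Stance.FRIEND_OR_IDOL, Focus.THEIR_VIEW_OF_ME): "Approval Reward",
    (Stance.ENEMY, Focus.THEIR_VIEW_OF_ME): "Provocation Reward",
}


def reward_stream(stance: Stance, focus: Focus) -> str:
    """Return the name of the reward stream for one box of the 2x2."""
    return REWARD_STREAMS[(stance, focus)]


if __name__ == "__main__":
    for (stance, focus), name in REWARD_STREAMS.items():
        print(f"{stance.value:>15} x {focus.value:<20} -> {name}")
```

The table only names the boxes. It says nothing about how the brain decides which box a given thought falls into, which is the hard part.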
2.1 Some background and terminology around drives, rewards, and desires
Reward: The brain runs a reinforcement learning (RL) algorithm (see Valence §1.2), with a reward function that sends out reward signals. So “reward” is a signal in the brain, not “physical stuff” in the environment.[2] For example, cheese is not a reward per se, but the process of eating cheese will probably cause various reward signals in a mouse’s brain at various times, assuming the mouse is hungry.
Innate drive: I want to reserve this term for circuits in the hypothalamus and brainstem. These tend to be hard-to-describe things that fire in response to hormonal signals and so on, not necessarily tied to any familiar world-model concept. (As an example, see the box near the top of A Theory of Laughter describing what I think “play drive” looks like under the hood.) An innate drive causes reward, which can be either positive or negative (a.k.a. punishment). (So “innate drives” really means “innate drives and/or aversions”.) There are probably dozens of innate drives in humans. They are sometimes also called “primary rewards” in the literature, but I don’t like that terminology.[3]
Desires: A “desire” would be a learned world-model concept which was active immediately before a reward, so now it seems good and motivating, as an end in itself (thanks to “credit assignment”). For example, I want world peace, and a nap. An important thing about desires is that they only persist if they capture a real persistent pattern in the reward function. Otherwise, they will promptly be unlearned when they fail to predict reward—see related discussion in Against empathy-by-default, and keep in mind that desires are set and updated by continuous learning, not train-then-deploy.[4] As above, in this post I’ll generally use “desires” as a shorthand for “desires and dislikes”, i.e. both valences.
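Here is a minimal sketch of that "unlearned when it stops predicting reward" dynamic. It assumes a simple delta-rule (Rescorla-Wagner-style) value update as a stand-in for credit assignment; the function names, learning rate, and reward sequences are illustrative assumptions, not a claim about what the brain actually computes.

```python
def update_value(value: float, reward: float, learning_rate: float = 0.2) -> float:
    """One continuous-learning step: nudge the value toward the observed reward."""
    return value + learning_rate * (reward - value)


def simulate(rewards, initial_value: float = 0.0, learning_rate: float = 0.2):
    """Return how motivating a concept seems after each experience."""
    value = initial_value
    history = []
    for reward in rewards:
        value = update_value(value, reward, learning_rate)
        history.append(value)
    return history


# A concept that keeps preceding reward: the desire forms and persists.
persistent = simulate([1.0] * 40)

# A concept that preceded reward for a while, then stopped: the desire
# forms during the first 20 steps and fades during the next 20.
transient = simulate([1.0] * 20 + [0.0] * 20)

if __name__ == "__main__":
    print("persistent desire, final value:", round(persistent[-1], 3))
    print("transient desire, peak value:  ", round(transient[19], 3))
    print("transient desire, final value: ", round(transient[-1], 3))
```

Because updating never stops (no train-then-deploy split), the second concept loses its pull once the reward pattern behind it goes away.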
“Sympathy Reward”, “Approval Reward”, etc.: These are a way to take those hard-to-describe innate drives, and take a step towards making them more comprehensible in terms of real-world concepts (more like desires are), as follows: We imagine listing out all the actual thoughts and situations that trigger a particular innate drive to spit out reward signals, in the actual life of an actual person. We would find that these thoughts and situations fall into (loose) clusters. The reward signals associated with one of the clusters would be “Sympathy Reward”; the reward signals associated with another cluster would be “Approval Reward”; etc.
2.2 Getting a reward merely by thinking, via generalization upstream of reward signals
In human brains (unlike in most of the AI RL literature), you can get a reward merely by thinking. For example, if an important person said something confusing to you an hour ago, and you have just now realized that they were actually complimenting you, then bam, that’s a reward right now, and it arose purely by thinking. That example involves Approval Reward, but this dynamic is very important for all aspects of the “compassion / spite circuit”. For example, Sympathy Reward triggers not just when I see that my friend is happy or suffering, but also when I believe that my friend is happy or suffering, even if the friend is far away.
How does that work? And why are brains built that way?
Here’s a simpler example that I’ll work through: X = there’s a big spider in my field of view; Y = I have reason to believe that a big spider is nearby, but it’s not in my field of view.
X and Y are both bad for inclusive genetic fitness, so ideally the ground-truth reward function would flag both as bad. But whereas the genome can build a reward function that directly detects X (see here), it cannot do so for Y. There is just no direct, ground-truth-y way to detect when Y happens. The only hint is a semantic resemblance: the reward function can detect X, and it happens that Y and X involve a lot of overlapping concepts and associations.
Now, if the learning algorithm only has generalization downstream of the reward signals, then that semantic resemblance won’t help! Y would not trigger negative reward, and thus the algorithm will soon learn that Y is fine. Sure, there’s a resemblance between X and Y, but that only helps temporarily. Eventually the learning algorithm will pick up on the differences, and thus stop avoiding Y. (Related: Against empathy-by-default and Perils of under- vs over-sculpting AGI desires). So in the case at hand, you see the spider, then close your eyes, and now you feel better! Oops! Whereas if there’s also generalization upstream of the reward signals, then that system can generalize from X to Y, and send real reward signals when Y happens. And then the downstream RL algorithm will stably keep treating Y as bad, and avoid it.
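The contrast can be sketched with a toy learner. Everything here is an illustrative assumption: situations are sets of made-up feature names, the downstream learner is a linear value function trained by a delta rule, and the "upstream generalization" is hard-coded as a reward function that also fires on the shared spider concept (standing in for a learned predictor).

```python
X = frozenset({"spider_concept", "spider_visible"})  # spider in my field of view
Y = frozenset({"spider_concept", "spider_hidden"})   # spider believed nearby, unseen


def value(weights, state):
    """Downstream learned value: sum of the weights of the active features."""
    return sum(weights.get(feature, 0.0) for feature in state)


def learn(weights, state, reward, lr=0.1):
    """Delta-rule update of the downstream value function."""
    error = reward - value(weights, state)
    for feature in state:
        weights[feature] = weights.get(feature, 0.0) + lr * error


def reward_ground_truth_only(state):
    """Reward function that can only detect X directly."""
    return -1.0 if "spider_visible" in state else 0.0


def reward_with_upstream_generalization(state):
    """Assumed stand-in: an upstream predictor has generalized from X to
    anything involving the spider concept, so Y also gets real reward."""
    return -1.0 if "spider_concept" in state else 0.0


def run(reward_fn, steps=200):
    """Experience X repeatedly, then Y repeatedly.
    Returns (value of Y before experiencing it, value of Y afterwards)."""
    weights = {}
    for _ in range(steps):
        learn(weights, X, reward_fn(X))
    value_of_y_before = value(weights, Y)
    for _ in range(steps):
        learn(weights, Y, reward_fn(Y))
    return value_of_y_before, value(weights, Y)


if __name__ == "__main__":
    print("downstream-only:", run(reward_ground_truth_only))
    print("with upstream:  ", run(reward_with_upstream_generalization))
```

In both runs, Y starts out somewhat aversive purely because it shares a feature with X. With the ground-truth-only reward function that aversion washes out once Y is experienced without punishment; with upstream generalization it is reinforced instead.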
That’s the basic idea. In terms of neuroscience, I claim that the “generalization upstream of the reward function” arises from “visceral” thought assessors[5]—for example, in Neuroscience of human social instincts: a sketch, I proposed that there’s a “short-term predictor” upstream of the “thinking of a conspecific” flag, which allows generalization from e.g. a situation where your friend is physically present, to a situation where she isn’t, but where you’re still thinking about her.
3. Sympathy Reward: overview
…Thus ends the first part of the post, where we talk about the four reward streams and how to think about them in general. The rest of this post will dive into one of these four, “Sympathy Reward”[6], which leads to:
Pleasure (positive reward) when my friends and idols[7] seem to be feeling pleasure;
Displeasure (negative reward[8], a.k.a. punishment) when my friends and idols seem to be feeling displeasure;
…which also generalizes[9] to pleasure / displeasure from merely imagining those kinds of situations.
By the way, don’t take the term “Sympathy Reward” too literally—for example, as we’ll see, it not only motivates people to reduce suffering, but also to ignore suffering.
3.1 The obvious good effect of “Sympathy Reward”
The obvious good prosocial effect of Sympathy Reward is a desire to make other people (especially friends and idols) have more pleasure and less suffering.
If you relieve someone’s suffering, Sympathy Reward makes that feel like a relief—the lifting of a burden. Interestingly, in practice, people feel better than baseline after relieving someone’s suffering. I propose that the explanation for that fact is not Sympathy Reward, but rather Approval Reward (next post).
This effect extends beyond helping a friend in immediate need, to morality more broadly. Think of a general moral principle, like “we should work to prevent any sentient being from suffering”. When we do moral reasoning, and wind up endorsing a principle like that, what exactly is going on in our brains? My answer is in Valence series §2.7.1: a descriptive account of moral reasoning. Basically, it involves thinking various thoughts, and noticing that some thoughts seem intuitively good and appealing, and other thoughts seem intuitively bad and unappealing. And I claim that the reason they seem good or bad is in large part Sympathy Reward. (Well, lots of innate drives are involved, but Sympathy Reward and Approval Reward are probably the two most important.)
So that’s the obvious good effect of Sympathy Reward. Additionally, there are a bunch of non-obvious effects, including antisocial effects, which I’ll discuss in the next few sections.
4. False negatives & false positives
There’s some sense in which sympathy “should” be applied to exactly the set of moral patients.[10] In that context, we can consider Sympathy Reward to have false negatives (e.g. indifference towards the suffering of slaves) and false positives (e.g. strong concern about the suffering of teddy bears). Let’s take these in turn.
4.1 False negatives (e.g. dehumanization)
4.1.1 Mechanisms that lead to false negatives
1. Not paying attention: The most straightforward way that Sympathy Reward might not trigger is if I’m not thinking about the other person in the first place.
2. Paying attention, but in a way that avoids triggering the “thinking of a conspecific” flag. This happens if I somehow don’t viscerally think of the other person as a person (or person-like) at all. For example, maybe my attention is focused on the person’s deformities rather than their face. Or maybe I think of them (in a kind of visceral and intuitive way) as an automaton, instead of as acting from felt desires.
3. Seeing the other person as an enemy: As mentioned in §2 above, there’s an innate “friend (+) vs enemy (–) parameter”, and if that parameter flips to “enemy”, then the person’s suffering starts seeming good instead of bad.
4. Seeing the other person as unimportant: This isn’t a way to turn off sympathy entirely, but it’s a way to reduce it. Recall from Neuroscience of human social instincts: a sketch §5.3 that phasic physiological arousal upon seeing the other person functions as a multiplier on how much sympathy I feel, and basically tracks how important and high-stakes the person seems from my perspective. If my visceral reaction is that the person is very unimportant / low-stakes to me, then my sympathy towards them will be correspondingly reduced.
5. Feeling like the other person is doing well, when they’re actually not: Sympathy Reward tracks how the other person seems to be doing, from one’s own perspective. This can come apart from how they’re actually doing.
4.1.2 Motivation to create false negatives in response to someone’s suffering
Sympathy Reward creates unpleasantness in response to someone else’s suffering. This leads to the behavior of trying to reduce the other person’s suffering. Unfortunately, it also leads to the behavior of trying to prevent Sympathy Reward from activating, by any of the five mechanisms listed just above. And this is quite possible, thanks to motivated reasoning / thinking / observing (see Valence series §3.3).
The simplest strategy is: in response to seeing someone (especially a friend or idol) suffering, just avert your gaze and think about something else instead. Ignorance is bliss. (Related: “compassion fatigue”.)
From my perspective, the interesting puzzle is not explaining why this ignorance-is-bliss problem happens sometimes, but rather explaining why this ignorance-is-bliss problem happens less than 100% of the time. In other words, how is it that anyone ever does pay attention to a suffering friend?
I think part of the answer is Approval Reward: it’s pleasant to imagine being an obviously compassionate person, because other people would find that impressive and admirable (§6 below). Another part of the answer is anxiety-driven “involuntary attention” (which can partly counteract motivated reasoning, see Valence series §3.3.5). Yet another part of the answer might be the various other innate social drives outside the scope of this post, including both love and a more general “innate drive to think about and interact with other people” (see §5 of the next post).
Averting one’s gaze (literally and metaphorically) is a popular strategy, but all the other mechanisms listed above are fair game too. Motivated reasoning / thinking / observing can conjure strategies to make the suffering person feel (from my perspective) like an enemy, and/or like a non-person, and/or unimportant, and can likewise conjure false rationalizations for why the person is actually doing fine. I think all of these happen in practice, in the human world.
4.2 False positives (e.g. anthropomorphization)
A false positive would be when you waste resources or make tradeoffs in favor of improving the happiness or alleviating the suffering of some entity which does not warrant effort on its behalf. A silly example would be trying to improve the welfare of teddy bears. In my opinion, Blake Lemoine trying to help LaMDA is a real-life example. As another example, some people think that insects are not sentient; if those people are right (no opinion), then insect welfare activists would be wasting their time and money.
Just above I noted that false negatives are not just a passive mistake, but also come along with incentives; if a certain kind of false negative would make us feel better, then we may rationalize some mental strategy for inducing it. In principle, the same applies to false positives. But it seems to be a more minor and weird effect, so I put it in a footnote.[11]
5. Other perverse effects of Sympathy Reward
(See also “Notes on Empathy” § “What bad is empathy” (@David Gross 2022).)
5.1 Misguided sympathy
More on this in §5 of the next post, but Typical Mind Fallacy interacts with the “compassion / spite circuit”, and can lead Person A to want Person B to be in situations that Person A would like, rather than situations that Person B actually likes, especially if Person A is socially inattentive.
5.2 Tradeoffs
If I feel sympathy towards Person A, and thus feel motivated to help them feel better right now, then that’s generally a good thing, compared to callous indifference. But it can also be bad in the sense that, if I’m too motivated to help Person A feel better, then that can trade off against everything else good in the world. In particular, maybe I’ll help Person A at the expense of harming Person B; or maybe I’ll help Person A feel better right now, at the expense of putting them in a worse situation later on.
I think that’s the main kernel of truth in Paul Bloom’s book Against Empathy (2016).[12]
5.3 ‘Hedonic utilitarianism’ bullet-biting stuff
A different set of non-obvious effects of Sympathy Reward is pushing people in the direction of hedonic utilitarianism, including all the bullets that actual hedonic utilitarians bite. For example, sympathy gives us a (pro tanto) motivation to toss unhappy people into Experience Machines against their expressed preferences, or to intervene when people are struggling (even if they find meaning in the struggle), or to slip magical anti-depressants (if such a thing existed) into unhappy people’s drinks against their wishes, or (historically) to give them lobotomies, and so on. Sympathy Reward pushes us to do these things, but meanwhile Approval Reward pushes us not to. By and large, Approval Reward wins that fight, and we don’t want to do those things. Still, the pro tanto motivational force exists.
5.4 Incentives and game-theory stuff
If I care about your wellbeing, then you can manipulate me based on what emotions you feel (or if you’re a good actor, what emotions you project). By the same token, I am incentivized to make myself feel (or pretend to feel) feelings for strategic interpersonal reasons. This also applies to Approval Reward (next post).
6. Sympathy Reward strength as a character trait, and the Copenhagen Interpretation of Ethics
The Copenhagen Interpretation of Ethics is @Jai’s tongue-in-cheek term for the observation that if you interact with a problem, you’ll get widespread condemnation and blame if you don’t solve the problem completely. This is true even if your involvement didn’t make the problem any worse than it already was. It’s even true if your involvement made the problem less bad, while meanwhile the jeering critics were doing nothing. See his post for lots of examples.
I’ll try to explain where this phenomenon comes from, as an example of an indirect consequence of Sympathy Reward.
1. “Strong Sympathy Reward” as a character trait. People differ in how strongly they feel Sympathy Reward. Some see their friend suffering, and are immediately overwhelmed by a desire to make that suffering stop. Others see their friend suffering, and aren’t too bothered.
2. “Strong Sympathy Reward (towards me or people I care about)” as an especially salient and important characteristic, and useful for friend-vs-enemy classification.
If someone has the conjunction of both “Strong Sympathy Reward” and “seeing me [or someone I care about] as a friend rather than an enemy”, then that’s very important for my everyday life. It means that, if I tell them that I have a problem, then they will feel motivated to help me. Everyone learns from abundant everyday life experience that people with this characteristic are good to have around, and to be regarded as friends rather than enemies.[13]
Conversely, if someone lacks one or both of those properties, then that’s also very important for me to know. It means that, if I tell them that I have a problem, they might not care, or they might even look for opportunities to exploit my misfortune for their own benefit. Everyone learns from abundant life experience that people in this category are bad to have around, and should be regarded as enemies.
3. Reading off “Sympathy Reward strength” from someone’s behavior during interactions. Suppose Bob is suffering, but Alice doesn’t know that, or perhaps Alice is out of the country and unable to help, etc. Then of course, Alice won’t help Bob. This is true regardless of whether or not Alice has strong Sympathy Reward towards Bob. So we all learn from life experience that this kind of situation gives us approximately no evidence either way about that aspect of Alice.
On the other hand, if Alice is interacting with Bob who is suffering, then that’s a different story! Now, an observer can easily judge whether or not Alice has strong Sympathy Reward towards Bob, based on whether or not Alice is overwhelmed by a desire to drop everything and help Bob when she interacts with him.
4. Putting everything together. Per above, abundant everyday experience, all the way from preschool recess to retirement book clubs, drills into us the subconscious idea that “Strong Sympathy Reward (towards me or people I care about)” is a key characteristic that distinguishes friends from enemies, and that evidence concerning this characteristic comes from watching people as they interact with me (or people I care about).
When these heuristics over-generalize, we get the Copenhagen Interpretation of Ethics. If Alice is interacting with Bob, and Bob is really suffering, and Bob is someone I care about (as opposed to my own enemy), then I will feel like Alice pattern-matches to “friend” if she becomes overwhelmed by a desire to drop everything and help Bob, or to “enemy” if she is interacting with Bob in a relaxed and transactional way, then departing while Bob continues to be in a bad situation.
Jai’s summary (“when you observe or interact with a problem in any way, you can be blamed for it. At the very least, you are to blame for not doing more…”) is tellingly incomplete. If Alice interacts with Bob who is really suffering, and Alice does not fully solve Bob’s problems, then Alice gets credit as long as she “gave it her all” and sobs on live TV that, alas, she lacks the strength and resources to do more, etc. That behavior would be a good pattern-match to our everyday experience of “strong Sympathy Reward towards Bob”, so would be seen as socially praiseworthy.
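The evidential logic in step 3 and in the "gave it her all" case can be written as a Bayesian update. This is a sketch with invented probabilities: the prior and every likelihood below are assumptions chosen only to show the shape of the inference, and `posterior_strong_sympathy` is a hypothetical helper name.

```python
def posterior_strong_sympathy(prior, p_obs_if_strong, p_obs_if_weak):
    """Bayes' rule: P(Alice has strong Sympathy Reward | what we observed)."""
    numerator = prior * p_obs_if_strong
    return numerator / (numerator + (1.0 - prior) * p_obs_if_weak)


PRIOR = 0.5  # assumed starting belief about Alice

# Alice is out of the country and does not help Bob. She would fail to help
# either way, so the observation is equally likely under both hypotheses.
away = posterior_strong_sympathy(PRIOR, p_obs_if_strong=1.0, p_obs_if_weak=1.0)

# Alice interacts with suffering Bob in a relaxed, transactional way, then
# leaves. Assumed: rare if her sympathy is strong, common if it is weak.
walked_away = posterior_strong_sympathy(PRIOR, p_obs_if_strong=0.1, p_obs_if_weak=0.8)

# Alice fails to solve Bob's problem but visibly "gave it her all".
# Assumed: common if her sympathy is strong, rare if it is weak.
gave_her_all = posterior_strong_sympathy(PRIOR, p_obs_if_strong=0.6, p_obs_if_weak=0.05)

print(f"away, did not help:        {away:.2f}")
print(f"present, walked away:      {walked_away:.2f}")
print(f"present, gave it her all:  {gave_her_all:.2f}")
```

Under these assumed numbers, not helping while absent leaves the belief unchanged, walking away while present pushes it down sharply, and visibly straining to help pushes it up even though Bob's problem remains unsolved in all three cases.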
7. Conclusion
That’s all I can think to say about Sympathy Reward. It’s generally pretty straightforward. By contrast, the next post, on Approval Reward, will be much more of a wild ride.
Thanks Seth Herd, Linda Linsefors, Simon Skade, Filip Alimpic, and Justis Mills for critical comments on earlier drafts.
Changelog
Dec. 2025: Added Singh, Lewis, Barto 2009 as an additional reference in §2.1.
It’s very elegant that so many human social phenomena, from compassion, to blame-avoidance, to norm-following, and more, seem to be explained in a unified way by the activity of a single hypothalamus circuit. It’s very elegant—but it’s not a priori necessary, nor even particularly expected. There could have equally well been two or three or seven different circuits for the various social behaviors that I attribute to the “compassion / spite circuit”.
There are of course many other social behaviors and drives outside the scope of this post, and I do think there are a bunch of different hypothalamus circuits which underlie them. For example, I think there are separate brain circuits related to each of: the “drive to feel feared”, play, a “drive to think about or interact with other people” (see §5 of the next post), loneliness (cf. Liu et al. 2025), love, lust, various moods, and so on.
For more on this point, see Sennesh & Ramstead 2025, or Singh, Lewis, Barto 2009 and references therein.
The term “primary reward” tends to have a strong connotation that the “reward” is a thing in the environment, not a brain signal, which (again) I think is a bad choice of definition. For example, papers in the literature might say that a “primary reward” is cheese, and a “secondary reward” is money tokens that the mouse can exchange for cheese.
“Train-then-deploy”—where permanent learning happens during a training phase and then stops forever, as opposed to continuous (a.k.a. online) learning—is actually something that can happen in the brain (e.g. filial imprinting). My claim is more specifically that reward signals update “desires” via continuous learning, not train-then-deploy. For example, if there’s something you like, and then you do it and it’s 100% unpleasant and embarrassing, with no redeeming aspect whatsoever, then you’re unlikely to want to do it again. Or maybe you’ll try it one more time. But probably not 10 more times, if it’s 100% miserable and embarrassing every time. Thus, desires keep updating based on reward signals, even into adulthood.
“Visceral thought assessor” is my term for any thought assessor besides the valence thought assessor (a.k.a. “valence guess”); see Incentive Learning vs Dead Sea Salt Experiment.
Terminology note: I went back and forth between “Sympathy Reward”, “Empathy Reward”, and “Compassion Reward” here. I think it doesn’t really matter. None of the terms {sympathy, empathy, compassion} really have technical definitions anyway—or rather, they have dozens of different and incompatible technical definitions.
Recall from Neuroscience of human social instincts: a sketch that the “compassion / spite circuit” is sensitive to (1) the innate “friend (+) vs enemy (–) parameter” and (2) phasic physiological arousal, which tracks the importance / stakes of an interaction. When these are both set to high and positive, then the circuit fires at maximum power, making us maximally motivated by this particular person’s welfare, approval, etc. My term “friends and idols” is a shorthand for that dependency.
In this post I’m using reinforcement learning terminology as used in AI, not psychology. So “reward” is a scalar which can have either sign, with “positive reward” being good and “negative reward” being bad. (Psychologists, especially in the context of operant conditioning, use “negative reward” to mean something quite different, namely relief when an unpleasant thing doesn’t happen.)
The generalization here is upstream of the reward signals—see §2.2 above.
I don’t intend to make any substantive philosophical claim here, like that “moral patients” are objective and observer-independent or whatever—that’s a whole can of worms, and out-of-scope here. I’m merely alluding to the obvious fact that the question of what entities are or aren’t moral patients is a question where people often disagree with each other, and also often disagree with their past selves.
On paper, we should go out of our way to anthropomorphize entities who would seem happy, so that we can share in their joy. But I struggle to think of examples; it doesn’t seem to be a large effect, or at least, it doesn’t pop out given everything else going on. For example, we don’t watch movies where the characters are happy and successful the whole time and there’s no character arc. (Well, I like movies like that, but I have weird taste.)
Ironically, I think the best examples are backwards. As it turns out, there’s a thing kinda like motivated reasoning but with the opposite sign, powered by anxiety and involuntary attention (see Valence series §3.3.5). This leads to something-kinda-like-a-motivation to create false positives in situations where the entities in question would be suffering. Imagine feeling a gnawing worry that maybe insects are suffering at a massive scale, and you can’t get it out of your head, and your mind constructs a story that would explain this gnawing feeling, whether or not the story was true.
I agree with Bloom on most of the practical takeaways of his book: people should be more impartial, people should do more cost-benefit analyses, etc. On the philosophical and psychological side, I think Bloom is mistaken when he argues that the so-called “rational compassion” that he endorses has no relation to the “empathy” that he lambasts. My take is instead that the latter is a big part of what ultimately underlies the former (i.e. the Sympathy Reward part, with Approval Reward being the rest). See §3.1 above.
I’m a bit hazy on the details of how the innate “friend vs enemy parameter” is calculated and updated. But if someone resembles people who have made my life better, then I’ll probably feel like they’re a friend, and conversely. See Neuroscience of human social instincts: a sketch §5.2.