I feel inordinately proud of this post, probably because this was a problem that I’ve been confused about since 2019, and I literally taught myself neuroscience in large part because I wanted to solve this problem, and I spent what amounts to several years of full-time effort building up an ability to tackle it … and this post represented the moment when I finally felt like I had my foot in the door towards a satisfying solution.
Granted, there’s still plenty more work to do, and indeed I’ve continued to follow up on this work in the past year since I wrote this post; but it now feels like I’m filling in gaps, and fleshing out details, and refactoring inelegant descriptions, whereas before it felt like I was trying to breach a wall of impenetrable mystery.
(Copying a discussion I had elsewhere.)
THEM: The gating’s not selective. When the spider shows up in the dark corner, the argument predicts I get scared of everything co-active in cortex: the spider, that corner, the person who happens to be standing next to me, the kind of fabric I happen to be wearing that day, etc. Where does it end?
ME: I think fewer things are “co-active in the cortex” than you suggest. I think attention flits around like ten times a second, and my whole argument in that section was that involuntary attention would ensure that I’m mostly thinking about the spider when the corresponding visceral thought assessor update happens.
THEM: Let me try a couple specific examples.
You and I are exploring a dark basement together. A spider lands on you. I get more scared of spiders, plausibly more scared of dark corners, not more scared of you.
More sharply, I'm exploring a dark basement alone, and a spider lands on an old unused exercise bike. I don't think I'm going to get more scared of that or any other exercise bike.
I think in these examples, the spider lands on you or the exercise bike, so I'm going to be paying substantive attention to you / the bike?
ME: Hmmm. You’re right about my previous reply. But I think I kinda bite those bullets. I think something visceral is learned, and will manifest in the future, but calling the result “I am scared of [blah]” has some wrong connotations.
For one thing, the ground-truth reaction from seeing a spider is generally stronger than the defer-to-predictor anticipation of that reaction. So e.g. if you have strong reason to believe that a spider might jump out at you soon, you might say “I’m scared right now”, but you might also say something more specific: “I’m scared that a spider will jump out at me”. The nervousness is real and unpleasant, but the actual spider would be worse.
For another thing, if a spider jumps out from an exercise bike once ever, then the short-term predictor is learning something like: “Exercise bike is weak evidence of danger, AND this particular basement is weak evidence of danger, AND this one specific exercise bike is weak evidence of danger, AND this lighting condition is weak evidence of danger, …”. And then later you see a different exercise bike in a different location, different lighting, etc. The predictor would see this as quite weak but nonzero evidence for physiological arousal, and maybe it would be too weak to notice. How would the evidence become stronger? (A) If you see that same exercise bike in the same basement in the same lighting, that would add up to stronger evidence. Also, (B) if you see spiders jumping out of five exercise bikes in five different contexts over the course of a few months, then the predictor will keep strengthening and strengthening the connection from “exercise bike” to physiological arousal, until the effect is very noticeable. I think both of those match my experience.
Also, if a predictor learns that “exercise bike” is weak-but-nonzero evidence for physiological arousal, and then you see a bunch of other exercise bikes where nothing goes wrong, presumably that weak evidence is erased (or overridden by a different system) (cf. “extinction” in psych jargon).
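To make the "weak evidence adds up" point concrete, here is a minimal toy sketch. It assumes the predictor is a linear model trained with the delta rule, which is an illustrative simplification and not a claim about the actual brain algorithm. The feature names, the learning rate, and the helpers `predict` and `update` are all hypothetical. It also models extinction only as erasure, not as override by a different system.

```python
from collections import defaultdict

LEARNING_RATE = 0.1  # hypothetical value, chosen only for illustration


def predict(weights, active_features):
    """Predicted arousal: sum of the weights of the currently active features."""
    return sum(weights[f] for f in active_features)


def update(weights, active_features, supervisor):
    """Delta rule: error = output minus supervisor, applied to active features only."""
    error = predict(weights, active_features) - supervisor
    for f in active_features:
        weights[f] -= LEARNING_RATE * error


# One spider incident: four co-active cues each pick up a little weight.
scene = ["exercise_bike", "this_basement", "this_specific_bike", "dim_lighting"]
single = defaultdict(float)
update(single, scene, supervisor=1.0)

different_bike = predict(single, ["exercise_bike"])  # one weak cue on its own
same_scene = predict(single, scene)                  # case (A): all cues together

# Case (B): five incidents, each with an exercise bike in a new context.
repeated = defaultdict(float)
for i in range(5):
    update(repeated, ["exercise_bike", f"context_{i}"], supervisor=1.0)
after_five = predict(repeated, ["exercise_bike"])

# Extinction: many later bike sightings where nothing bad happens.
for i in range(20):
    update(repeated, ["exercise_bike", f"safe_context_{i}"], supervisor=0.0)
after_extinction = predict(repeated, ["exercise_bike"])

print(different_bike, same_scene, after_five, after_extinction)
```

In this toy model, a lone cue after one incident predicts much less than the full original scene does, repeated incidents across contexts concentrate weight on the one cue they share, and uneventful exposures shrink that weight again.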
This isn’t as much a question as it is just sharing some thoughts I had, but I would love to hear your thoughts :) Let’s imagine we are our own brain’s optimizer. We just received a bad signal, we feel pain. Let’s say, we realized someone else is soon going to feel pain, so we feel pain. What could the optimizer do now? Well, there are only 2 things it can do:
1. Try to disconnect “she feels pain” from the concept of pain that then triggered pain in yourself
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
You speak a lot to (1), explaining the symbol-grounding mechanism that continuously re-grounds the concept in the ground truth, so the optimizer trying to move “she feels pain” away from its previous position in the feature space won’t work (at least as long as we continuously have such ground truth input. This sheds light on the very immoral but very interesting experiment of having an individual not exposed to such input for long periods, like not seeing any human face for multiple months, be it in person, on pictures or on your phone. There, this theory should predict that such a move in feature space could happen and would be successful: to be dramatic, you become a psychopath).
You don’t speak much to (2) though. One option here, for example, would be to unlearn the concept of “future”: babies first gradually learn about it, so it’s reasonable to assume that you could unlearn it again. Luckily, this doesn’t seem to happen, so there must be some opposing force, something that promises reward if this concept persists.
Specifically, this concept must offer you insight into your actions such that your future expected reward rises. This is obvious in this case - without the concept “future”, you can hardly make any intelligent decisions at all. But it also carries over to much more specific and even human invented associations/knowledge:
Let’s say you work in cyber-security and the reason you think this person will feel pain is because using those cyber-security skills enabled you to make an association the normal person wouldn’t. The optimizer could try to unlearn these skills, but actually those skills lead to higher expected reward, else you wouldn’t be pursuing it: be it the nice house you can afford, the social status you enjoy because of it or simply the joy you receive from enacting it.
In other words, anything you learned, you learned because you assumed it would result in a higher expected reward and anything you act out (after learning), you do because it results in a higher expected reward. To forget these concepts will at least require a reward matching theirs.
This doesn’t imply it should be impossible though - let’s say you learned something that you hate, like say chiseling stone. You did this because the market would pay insane wages because only few could do the job and so the reward you saw attached to those wages was immense and you pushed through the boring education of becoming an expert in chiseling stone. And once you got there, you realize, you weren’t the only one with the idea: wages drop quicker than the average pump & dump crypto coin. In fact the profession you enacted before, which you intrinsically enjoy, even pays better.
As I’m writing this, I realize there are no good stories for why chiseling stone might give you a better glimpse into someone’s future pain, but let’s just take it for granted. Then the reward of the knowledge of chiseling stone is pretty much zero, maybe even negative because whenever you recall it, you recall all the effort that didn’t pay off.
Yet I have never heard of something along these lines happening. It would be quite a great mechanism for the free market though, the wages would jump right up: let’s hope our individual in question doesn’t once again try to learn to chisel stone, completely forgetting this tale of unreciprocated effort.
You could maybe argue something like: precisely the things that fall in this category are things we gave up on, that is, their occurrence in our day-to-day life is incredibly rare. Therefore, with a normal learning rate, we simply wouldn’t iterate over them often enough to forget them meaningfully.
Lastly, just for completeness, naturally ‘disconnecting your previous thoughts from arriving at “she feels pain”’ also entails your previous actions - it’s a very special occurrence to know somebody will feel pain in the future, unless you had a play in it yourself. Naturally those decisions back then will be optimized on as well, hopefully leading you to make better decisions in the future.
immoral but very interesting experiment … not seeing any human face for multiple months, be it in person, on pictures or on your phone
There must be plenty of literature on the psychological effects of isolation, but I haven’t looked into it much. (My vague impression is: “it messes people up”.) I think I disagree that my theory makes a firm prediction, because who is to say that the representations will drift on a multiple-month timescale, as opposed to much slower? Indeed, the fact that adults are able to recall and understand memories from decades earlier implies that, after early childhood, pointers to semantic latent variables remain basically stable.
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
I would describe this as: if it’s unpleasant to think about how my friend is suffering, then I can avoid those unpleasant feelings by simply not thinking about that, and thinking about something else instead.
For starters, there’s certainly a kernel of truth to that. E.g. see compassion fatigue, where people will burn out and quit jobs working with traumatized people. Or if someone said to me: “I stopped hanging out with Ahmed, he’s always miserable and complaining about stuff, and it was dragging me down too”, I would see that as a perfectly normal and common thing for someone to say and do. But you’re right that it doesn’t happen 100% of the time, and that this merits an explanation.
My own analysis is at: §4.1.1 and §4.1.2 of my (later) Sympathy Reward post. The most relevant-to-you part starts at: “From my perspective, the interesting puzzle is not explaining why this ignorance-is-bliss problem happens sometimes, but rather explaining why this ignorance-is-bliss problem happens less than 100% of the time. In other words, how is it that anyone ever does pay attention to a suffering friend? …”
So that’s my take. As for your take, I think one of my nitpicks would be that I think you’re giving the optimizer-y part of the brain a larger action space than it actually has. If I would get a higher reward by magically teleporting, I’m still not gonna do that, because I can’t. By the same token, if I would get a higher reward by no longer knowing some math concept that I’ve already learned, tough luck for me, that is not an available option in my action space. My world-model is built by predictive (a.k.a. self-supervised) learning, not by “whatever beliefs would lead to immediate higher reward”, and for good reason: the latter has pathological effects, as you point out. (I’ve written about it too, long ago, in Reward is Not Enough.) I do have actions that can impact beliefs, but only in an indirect and limited way—see my discussion of motivated reasoning (also linked in my other comment).
let me preface this by saying how much I enjoyed reading this post - it really shows that this isn't some random idea you had but that you really thought a lot about this. As someone whose first introduction to this kind of idea was precisely this blogpost, thanks.
question - maybe I'm simply misunderstanding you:
- you seem to assume that the cortex's modelling of one's own happiness is very similar to the cortex's modelling of thinking of happiness. you might argue that it's only the "concept of happiness", which I would agree is present in both scenarios, but it doesn't strike me why that in particular would be learned using this supervised mechanism.
- building on that point, I think it might be more probable that understanding another's feelings is part of 1A - instead of simply seeing, hearing, etc. there would be something tasked with analyzing facial cues - in particular humans exhibit micro expressions (expressions that last very short periods and are almost impossible to control), something most people can't seem to pick up on, at least consciously. So why do we have them if other people can't pick up on them? Maybe they can, but only subconsciously, precisely to facilitate this symbol grounding for somebody else's feelings. Then again, if you can't consciously pick up on it, the target for the supervision will probably be terrible as well, so maybe that's not it.
(i'll probably hammer u with more questions down the line, still trying to process all of this lol)
Thanks!!
you seem to assume that the cortex's modelling of one's own happiness is very similar to the cortex's modelling of thinking of happiness
I would say “overlaps” rather than “is similar to”. Think of it as vaguely like I-am-juggling versus you-are-juggling. Those are different thoughts, but they overlap, in that they both involve the “juggling” concept. That overlap is very necessary for e.g. recognizing that the same word “juggling” applies to both, and for transferring juggling-related ideas between myself and other people, which we are obviously very capable of doing.
you might argue that it's only the "concept of happiness", which I would agree is present in both scenarios, but it doesn't strike me why that in particular would be learned using this supervised mechanism.
The chain of events would be e.g.
(1) The Thought Generator (world-model) catalogs our own interoceptive feelings into emotion-concepts like "pleasure".
(2) The Thought Generator learns from experience that pleasure has something to do with smiling, e.g. during times where we feel pleasure and notice ourselves smile, or otherwise learn this obvious regularity in the world. This becomes a world-model (thought generator) semantic association “smile-concept” ↔ “pleasure-concept”.
(3) Often we’re paying attention to our own feelings, and then the “pleasure” emotion-concept is active if and only if our immediate interoceptive sensory inputs match “pleasure”. And these times, when we’re paying attention to our own feelings, are the only times where the pleasure Thought Assessor learning rate is nonzero. So the Thought Assessor learns that there’s a robust correlation between the “pleasure-concept” in the Thought Generator and the pleasure innate signal.
(4) Other times we’re NOT paying attention to our own immediate interoceptive sensory inputs, and then the emotion-concepts are “left hanging”, inactive regardless of what we’re feeling. But while they’re left hanging, they can INSTEAD be activated by semantic associations with other parts of our world-model. Then in such a moment, if I see someone smile, it activates smile-concept, which [via (2)] in turn weakly activates pleasure-concept, which in turn [via (3)] weakly activates the pleasure Thought Assessor. This is a candidate “transient empathetic simulation”. But remember, the learning rate of that Thought Assessor is zero whenever the emotion-concepts are “left hanging” like that. So the Thought Assessor won’t disconnect pleasure-concept.
Does that help? Sorry if I’m missing your point. …The above might be hard to follow without a diagram.
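In lieu of a diagram, steps (3) and (4) can be sketched as a toy simulation. This is a minimal sketch under strong assumptions: a single scalar weight from “pleasure-concept” to the pleasure Thought Assessor, a delta-rule update, and made-up activation levels and learning rate. The function names `assessor_update` and `simulate` are hypothetical. The only thing it is meant to show is what the attention-gated learning rate buys you.

```python
def assessor_update(w, concept, innate, attending_to_self, gated, lr=0.2):
    """One delta-rule step for a single Thought Assessor weight (toy model).

    If `gated` is True, the learning rate is zero unless we are attending
    to our own interoceptive feelings, as in step (3).
    """
    prediction = w * concept
    error = prediction - innate
    rate = lr if (attending_to_self or not gated) else 0.0
    return w - rate * error * concept


def simulate(gated):
    w = 0.0
    # Step (3): attending to own feelings; concept and innate signal co-occur.
    for _ in range(30):
        w = assessor_update(w, concept=1.0, innate=1.0,
                            attending_to_self=True, gated=gated)
    w_first_person = w
    # Step (4): seeing someone smile weakly activates the concept,
    # while no innate pleasure signal is present.
    for _ in range(200):
        w = assessor_update(w, concept=0.3, innate=0.0,
                            attending_to_self=False, gated=gated)
    # Final assessor response to a weakly activated concept.
    return w_first_person, w, 0.3 * w


gated_before, gated_after, gated_response = simulate(gated=True)
ungated_before, ungated_after, ungated_response = simulate(gated=False)
print(gated_after, gated_response)      # weight survives the third-person phase
print(ungated_after, ungated_response)  # weight is trained away
```

With gating, the weight learned from first-person experience is untouched by the third-person episodes, so the weak “transient empathetic simulation” response persists. Without gating, the same episodes count as prediction errors and the connection is gradually unlearned.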
analyzing facial cues - in particular humans exhibit micro expressions
The theory that we have evolved direct responses to different facial reactions seems probably wrong to me (or at least, not the main explanation), for a couple reasons:
First, blind people seem to have normal social intuitions.
Second, I don’t think it’s plausible to simultaneously say that microexpressions immediately trigger important innate reactions, and that people are generally bad at consciously noticing microexpressions. When I think of other environmental things that immediately trigger innate reactions, I think of, like, balls flying at my face, big spiders, sudden noises, getting poked, foul smells, etc. We’re VERY good and fast at forming good conscious models of all those environmental things. So it doesn’t seem plausible to me that we could get metaphorically “poked” by microexpressions many times a day for years straight without ever developing a conscious awareness of those microexpressions.
So why do we have them if other people can't pick up on them
For my answer, see Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions. We have “innate behaviors” that impact the face, such as gagging, laughing, and Duchenne-smiling. We also have voluntary control of facial muscles, which we learn to deploy strategically for social signaling. When we use voluntary control to hide the signs of “innate behaviors”, the bit of “innate behavior” that slips through the cracks is a microexpression.
You might ask: why don’t our “innate behaviors” evolve to not impact the face, so that we can hide them better? Hard to say for sure. Probably part of it is that we are only sometimes trying to hide them. Some “innate behavior” facial manifestations might also have more direct adaptive utility (cf. §4.2 of that link). Part of it is probably that the hiding is good enough, because microexpressions are actually hard to notice.
Think of it as vaguely like I-am-juggling versus you-are-juggling.
Here, I can see how they would overlap to a reasonable degree - I don't think this easily carries over to emotions. Emotions at least feel like this weird, distinct thing such that any statement along the lines "I'm happy" does it injustice. Therefore I can't see it being carried over to "She's happy", their intersection wouldn't be robust enough such that it won't falsely trigger for actually unrelated things. That is, "She's happy" ≈ "I'm happy" ≉ experiencing happiness.
Facial cues (as one example, it makes sense that there would be other things like higher-pitched voices when enjoying oneself etc) eliminate this problem because, as opposed to something introspective being the link, a more objective state of the mind, like "He's sad", will be the learned link.
this might sound like I'm being unnecessarily picky about this, but imo these associations need to be very exact, else humans would be reward-hacking all day: it's reasonable to assume that the activations of thinking "She's happy" are very similar to trying to convince oneself "She's happy" internally, even 'knowing' the truth. But if both resulted in big feelings of internal happiness, we would have a lot more psychopaths.
regarding micro expressions specifically, it's definitely not a hill i want to die on, it kind of just popped in my mind as I was writing about facial cues and by micro I really mean 'micro micro' - e.g. smiles that aren't perfectly symmetrical for a quarter of a second, something I at least can't really pick up on; what is their evolutionary advantage if they don't at least offer some kind of subconscious effect on conspecifics? But yea, if you can't consciously pick up on it, linking the two is pointless or even bad.
I read the linked post roughly, but as I read neither so far, i probably can't relate too well to it. seems reasonable (or honestly, obvious) though that it's a mix rather than either of those extreme statements.
Thanks again for engaging :)
these associations need to be very exact, else humans would be reward-hacking all day: it's reasonable to assume that the activations of thinking "She's happy" are very similar to trying to convince oneself "She's happy" internally, even 'knowing' the truth. But if both resulted in big feelings of internal happiness, we would have a lot more psychopaths.
I don’t think things work that way. There are a lot of constraints on your thoughts. Copying from here:
1. Thought Generator generates a thought: The Thought Generator settles on a “thought”, out of the high-dimensional space of every thought you can possibly think at that moment. Note that this space of possibilities, while vast, is constrained by current sensory input, past sensory input, and everything else in your learned world-model. For example, if you’re sitting at a desk in Boston, it’s generally not possible for you to think that you’re scuba-diving off the coast of Madagascar. Likewise, it’s generally not possible for you to imagine a static spinning spherical octagon. But you can make a plan, or whistle a tune, or recall a memory, or reflect on the meaning of life, etc.
If I want to think that Sally is happy, but I know she’s not happy, I basically can’t, at least not directly. Indirectly, yeah sure, motivated reasoning obviously exists (I talk about how it works here), and people certainly do try to convince themselves that their friends are happy when they’re not, and sometimes (but not always) they are even successful.
I don’t think there’s (the right kind of) overlap between the thought “I wish to believe that Sally is happy” and the thought “Sally is happy”, but I can’t explain why I believe that, because it gets into gory details of brain algorithms that I don’t want to talk about publicly, sorry.
Emotions…feel like this weird, distinct thing such that any statement along the lines "I'm happy" does it injustice. Therefore I can't see it being carried over to "She's happy", their intersection wouldn't be robust enough such that it won't falsely trigger for actually unrelated things. That is, "She's happy" ≈ "I'm happy" ≉ experiencing happiness
I agree that emotional feelings are hard to articulate. But I don’t see how that’s relevant. Visual things are also hard to articulate, but we can learn a robust two-way association between [certain patterns in shapes and textures and motions] and [a certain specific kind of battery compartment that I’ve never tried to describe in English words]. By the same token, we can learn a robust two-way association between [certain interoceptive feelings] and [certain outward signs and contexts associated with those feelings]. And this association can get learned in one direction [interoceptive model → outward sign] from first-person experience, and later queried in the opposite direction [outward sign → interoceptive model] in a third-person context.
(Or sorry if I’m misunderstanding your point.)
what is their evolutionary advantage if they don't at least offer some kind of subconscious effect on conspecifics?
Again, my answer is “none”. We do lots of things that don’t have any evolutionary advantage. What’s the evolutionary advantage of getting cancer? What’s the evolutionary advantage of slipping and falling? Nothing. They’re incidental side-effects of things that evolved for other reasons.
but I can’t explain why I believe that, because it gets into gory details of brain algorithms that I don’t want to talk about publicly, sorry.
somewhat random but I think I want to learn more about this field in general - from what I can tell, you didn't learn about it in a normal academic setting (like doing a neuroscience B.Sc.) either; any tips for good resources?
About the example in section 6.1.3: Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech? It seems to me like that requires both (a) identifying that the speech is trying to get someone's attention, and (b) identifying that the speech is directed at you. (Well, I guess (b) implies (a) if you weren't visibly paying attention to her beforehand.)
About (a): If the Steering Subsystem doesn't know the meaning of words, then how can it tell that Zoe is trying to get someone's attention? Is there some way to tell from the sound of the voice? Or is it enough to know that there were no voices before and Zoe has just started talking now, so she's probably trying to get someone's attention to talk to them? (But that doesn't cover all cases when Zoe would try to get someone's attention.)
About (b): If you were facing Zoe, then you could tell if she was talking to you. If she said your name, then maybe the Steering Subsystem might recognize your name (having used interpretability to get it from the Learning Subsystem?) and know she was talking to you? Are there any other ways the Steering Subsystem could tell if she was talking to you?
I'm not sure how many false positives vs. false negatives evolution will "accept" here, so I'm not sure how precise a check to expect.
Good questions!
Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech?
I think you’re thinking about that kinda the wrong way around.
You’re treating “the things that Zoe does when she wants to get my attention” as a cause, and “my brain reacts to that” as the effect.
But I would say that a better perspective is: everybody’s brain reacts to various cues (sound level, pitch, typical learned associations, etc.), and Zoe has learned through life experience how to get a person’s attention by tapping into those cues.
So for example: If Zoe says “hey” to me, and I don’t notice, then Zoe might repeat “hey” a bit louder, higher-pitched, and/or closer to my head, and maybe also wave her hand, and maybe also poke me.
The wrong question is: “how does my brain know that louder and higher-pitched and closer sounds, concurrent with waving-hand motions and pokes, ought to trigger an orienting reaction?”.
The right perspective is: we have these various evolved triggers for orienting reactions, whose details we can think of as arbitrary (it’s just whatever was effective for noticing predators and prey and so on), and Zoe has learned from life experience various ways to activate those triggers in other people.
If she said your name, then maybe the Steering Subsystem might recognize your name (having used interpretability to get it from the Learning Subsystem?) and know she was talking to you?
Yup, STEP 1 is that one of my “thought assessors” (probably somewhere in the amygdala) has learned from life experience that hearing my own name should trigger orienting to that sound; and then STEP 2 is that Zoe in turn has learned from life experience that saying someone’s name is a good way to get their attention.
(For a PDF version of this post, go to: https://doi.org/10.5281/zenodo.17953592)
(Last update: May 2026. See changelog at the bottom.)
(If you’re in a hurry, you can just read the “Background and summary” section, and skip the other 85%.)
0. Background and summary
0.1 Background: What’s the problem and why should we care?
There’s a neuroscience problem which is centrally important for Artificial General Intelligence (AGI) safety, but which has had me stumped for as long as I’ve been in this field. Indeed, solving this problem is the main reason I got into neuroscience in the first place! In this post, I sketch an outline of a possible solution.[1]
What is this grand problem? As described in Intro to Brain-Like-AGI Safety, I believe the following:
0.2 Summary of the rest of the post
I’ll start by going through the four algorithmic ingredients we need for my hypothesis, one by one, in each case describing what it is algorithmically, why it’s useful evolutionarily, and where in the brain we might go looking to find the specific neurons that are running this (alleged) algorithm.
Here’s the roadmap:
Then, I’ll go through an important (putative) example of social instincts built from these ingredients, which I call the “compassion / spite circuit”. This circuit leads to an innate drive to feel compassion towards people we like, and to feel spite and schadenfreude towards people we hate.
In an elegant twist, I claim that this very same “compassion / spite circuit” also leads to an innate “drive to feel liked / admired”—a drive that I hypothesized earlier and believe to be central to both status-seeking and norm-following. The trick in explaining how they’re related is:
Then I’ll go more briefly through some other possible social instincts, including a sketch of a possible “drive to feel feared” (whose existence I previously hypothesized here). For context, dual strategies theory talks about “prestige” and “dominance” as two forms of status; while the “drive to feel liked / admired” leads to prestige-seeking, the “drive to feel feared” correspondingly leads to dominance-seeking.
0.3 Confidence level
My confidence gradually decreases as you proceed through the article. The “Background” section above is rock-solid in my mind, as are Ingredients 1, 1A, and 2. Ingredients 3 and especially 4 are somewhat new to this post, but derive from ideas I’ve been playing around with for a year or two, and I feel pretty good about them. The specific putative examples of social instincts in §5–§7 are much more new and speculative, and are oversimplified at best. But I’m optimistic that they’re on the right track, and that they’re at least a “foot in the door” towards future refinements.
0.4 Later work
UPDATE NOV. 2025: After you finish this post, see also my later follow-up posts Social drives 1: “Sympathy Reward”, from compassion to dehumanization & Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking, which further flesh out how my neuroscientific hypothesis (below) connects to everyday experiences and intuitions.
1. Ingredient 1: Innate sensory heuristics in the Steering Subsystem
The Steering Subsystem (brainstem and hypothalamus, more-or-less) takes sensory data, does innately-specified calculations on them, and uses the results to trigger innate reactions.
Think of things like seeing a slithering snake, or a skittering spider; smelling or tasting rotten food; male dogs smelling a female dog in heat; camouflaged animals recognizing the microenvironment where their bodies will blend in; and so on.
Note that these are all imperfect heuristics, anchored to innate circuitry, rather than developing along with our understanding of the world. We can call it a venomous-spider-detector circuit, for example, noting that it evolved because venomous spiders were dangerous to early humans.[4] But if we do that, then we acknowledge that it will have both false positives (e.g. centipedes, harmless spiders) and false negatives (funny-looking stationary venomous spiders), when compared to actual venomous spiders as we intelligently understand them. In vision especially, think of these heuristics as detecting relatively simple patterns of blobs and motion textures, as opposed to an “image classifier” / “video classifier” up to the standards of modern ML or human capabilities.
For more discussion of Ingredient 1, see Intro Series §3.2.1.
1.1 Ingredient 1A: Innate sensory heuristics for conspecific detection in particular
As a special case of Ingredient 1, I claim that, in pretty much all animals, there are sensory heuristics that are specifically designed by evolution to trigger on conspecifics. That would include one or more variations on: seeing a conspecific, hearing a conspecific, touching (or being touched by) a conspecific, smelling a conspecific, etc.
(I’m confident in this part because pretty much all animals have innate behaviors towards conspecifics that are different from their behaviors in other situations—mating, intermale aggression, parenting, being parented, herding, huddling, and so on.)
I claim that these all trigger a special Steering Subsystem innate behavior that I call “the social attention reflex”:
1.2 Neuroscience details
The sensory heuristics involve brainstem areas like the superior colliculus (for innate heuristic calculations on visual data), inferior colliculus (auditory data), gustatory nucleus of the medulla (taste data), and so on. (Again see Intro Series §3.2.1.)
In the case of visual sensory heuristics, I’m actually not 100% confident that these calculations are located in the superior colliculus proper; for all I know, they’re partly or entirely in the neighboring parabigeminal nucleus, or whatever. There are papers on this topic, but they can’t always be taken at face value—see for example me complaining about methodologies used in the literature here and here.
For the “social attention reflex”, it would be somewhere within the Steering Subsystem, but I don’t have any particular insight into exactly where. If I had to guess, I might guess that it’s one of the many little cell groups of the medial preoptic hypothalamus, since those often involve social interactions. If not that, then I’d guess it’s somewhere else in the hypothalamus, or (less likely) some other part of the Steering Subsystem.
If you want to experimentally find the cell group that orchestrates the “social attention reflex”, the conceptually-simplest method would be to first find one of the sensory heuristics for conspecific detection (e.g. the face detector), see what its efferent connections (downstream targets) are, and treat all those as top candidates to be studied one-by-one.
2. Ingredient 2: Generalization via short-term predictors
Ingredient 1 is a first step towards understanding, say, fear-of-spiders. But it’s not the whole story, because I don’t just get nervous when there is actually a large skittering spider in my field-of-view right now, but also when I imagine one, or when somebody tells me that there’s a spider behind me, etc. How does that work? The answer is: what I call the “short-term predictor”.
The “short-term predictor” is a learning algorithm that involves three ingredients—context, output, and supervisor. For definitions see this post; or in the ML supervised learning literature, you can substitute “context” = “trained model input”, “output” = “trained model output”, and “supervisor” = “label” (i.e., ground truth), which is subtracted from the trained model output to get an error that updates the model.[5]
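To make the supervised-learning analogy concrete, here is a minimal sketch of a short-term predictor as a linear model trained with a delta rule. The class name, the two context features, and the learning rate are all illustrative assumptions on my part, not a claim about how the brain implements it.

```python
from dataclasses import dataclass, field


@dataclass
class ShortTermPredictor:
    """Toy linear predictor: context in, scalar output, supervised by a label."""
    n_context: int
    learning_rate: float = 0.1  # illustrative value (assumption)
    weights: list = field(default_factory=list)

    def __post_init__(self):
        if not self.weights:
            self.weights = [0.0] * self.n_context

    def output(self, context):
        # "output" = trained model output for this context
        return sum(w * x for w, x in zip(self.weights, context))

    def update(self, context, supervisor):
        # error = output minus supervisor (the "label" / ground truth)
        error = self.output(context) - supervisor
        # delta rule: nudge each weight against the error
        self.weights = [w - self.learning_rate * error * x
                        for w, x in zip(self.weights, context)]
        return error


# Hypothetical context features: [seeing a spider, being told "spider behind you"]
predictor = ShortTermPredictor(n_context=2)
for _ in range(200):
    predictor.update([1.0, 1.0], supervisor=1.0)  # both cues present, brainstem reaction fires
    predictor.update([0.0, 0.0], supervisor=0.0)  # no cues, no reaction

# Query with only the verbal cue: the predictor still outputs a partial reaction.
told_only = predictor.output([0.0, 1.0])
```

In this toy run the two cues always co-occur during training, so the weight is split evenly between them, and the verbal cue alone later produces a nonzero output even though no spider is in view.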
The important points are that:
Thus, this kind of story explains the fact that I viscerally react to learning that there’s a spider in my vicinity that I can’t immediately see or feel.
Taken together, the brainstem reaction and the short-term predictor can function as what I call a long-term predictor; again, see Intro Series §5.
By the same token, the “social attention reflex” can trigger when I’m thinking of a conspecific, even if the conspecific is not standing right there, triggering my brainstem sensory heuristics right now.
2.1 Neuroscience details
I think the short-term predictors that I’ll be talking about in this post are mostly centered around small clusters of medium spiny neurons somewhere in the amygdala, or the lateral septum, or the medial part of the nucleus accumbens shell. (I haven’t tried to pin them down in more detail than that. See Intro Series §5.5.4 for some more general neuroscience discussion of this topic.)
However, in some cases pyramidal neurons can play this short-term predictor role as well, such as in the cortex-like (basolateral) section of the amygdala, along with certain parts of cortex layer 5PT.
The supervisory signal (either ground truth or an error signal, I’m not sure) probably makes an intermediate stop (“relay”) at some little cluster of neurons on the fringes of the Ventral Tegmental Area (VTA), not shown in the diagram above, in which case the supervisory signal would ultimately arrive at the spiny neuron in the form of a dopamine signal. I think. (But there are also VTA GABA neurons that seem somehow related to these particular short-term predictors. I haven’t tried to make sense of that in detail.)
3. Ingredient 3: Tailoring learned models via involuntary attention and learning rate
3.1 Involuntary attention
Let’s talk more about what happens when you see a skittering spider out of the corner of your eye:
When the seeing-a-spider brainstem sensory heuristic triggers, I claim that one thing it does is trigger an “orienting reflex”. Part of that reflex involves moving the eyes, head, and body towards whatever triggered the heuristic. And another part of it involves involuntary attention towards the visual inputs in general, and the corresponding part of the field-of-view in particular.
The involuntary attention plays an important role in constraining what “thought” the cortex is thinking. If you’re daydreaming, imagining, remembering, etc., then your current “thought” has very little to do with current visual inputs. By contrast, involuntary attention towards vision forms a constraint that the thought must be “about” the visual inputs. It’s not completely constraining—the same thought can also contextualize those visual inputs by roping in presumed upstream causes, or expected consequences, or other associations, etc. But the visual inputs have to be a central part of the thought. In other words, you’re not only pointing your eyes at the spider, but you’re also actually thinking about the spider with your cortex (“global workspace”).
To be more specific about what’s going on, we need to be thinking about large-scale patterns of information flow within the cortex, as in the following toy example:
When you’re using visual imagination, your consciously-accessible visual areas of the cortex (e.g. the inferior temporal gyrus (IT)) are, in essence, disconnected from the immediate visual input. You can imagine Taylor Swift’s new dress while looking at a swamp. By contrast, when you’re paying attention to what you’re looking at, then there’s a consistency requirement: the visual models (i.e., generative models of visual data) in IT have to be consistent with the immediate visual input from your retina.
And my claim is that the Steering Subsystem has some control over this kind of large-scale information flow among different parts of the cortex, via its “involuntary attention”.
Incidentally, for this post, I’m less interested in vision than interoception, the “sense” of how we’re feeling. We can have a (generalized) “orienting reflex” towards interoceptive inputs just as we can towards visual inputs—an itchy bug bite will summon attention just as reliably as an unexpected noise will. So here’s the analogous diagram for the case of interoception, which we’ll expand on later:
3.1.1 Side note: Transient attentional gaps are more common, and harder to notice, than you realize
You might be wondering: Is it really true that, if I’m imagining Taylor Swift’s new dress, then my awareness is detached from immediate visual input? Don’t we continue to be aware of visual input even while imagining something else?
A few responses:
First, your cortex has lots of vision-related areas, and it’s possible for some visual areas to be yoked to immediate visual input while other visual areas are simultaneously yoked to episodic memory. I think this definitely happens to some extent.
Second, your attention can jump around between different things rather quickly, such that most people imagine themselves to have far more complete and continuous visual awareness than they actually do—see things like change blindness, or the selective attention test, or the fact that you can only perceive colors at the center of your field-of-view.
Third, the cortex tracks time-extended models, and accordingly has a general ability to pull up activation history from slightly (e.g. half a second) earlier, anywhere in the cortex. That makes it very hard to introspect upon exactly what you were or weren’t thinking at any given moment. For a much more detailed discussion of this point, with an example, see Intuitive Self-Models §2.3.
This is a general lesson, going beyond just vision: transient (fraction-of-a-second) attentional gaps and shifts are hard to notice, both as they happen and in hindsight. Don’t unthinkingly trust your intuitions on that topic. (I’ll be centrally relying on these transient attentional shifts in this post, so it’s important that you are thinking about them clearly.)
3.2 Combining attention with learning rate modulation
The Steering Subsystem can get an additional lever of control over some brain learning algorithm by adjusting its learning rate to different settings at different times, depending on the large-scale information flows in the cortex. This opens up a flexible design space that the genome exploits in a variety of ways.
As a worked example relevant to this post, let’s take the interoception diagram from §3.1 above, and add in a short-term predictor with learning rate modulation. And how exactly will its learning rate be modulated? However we want—it’s a design degree of freedom! But for this example, we’ll set the short-term predictor learning rate to zero unless you’re paying attention to actual interoceptive input. So here’s the newly-expanded diagram from above:
What’s the point of this setup? Well, it will transform this short-term predictor into what we might call an “interoceptive concept finder”, that can find and flag the idea of physiological arousal in your interoceptive concept space, more or less.[6]
Think of this setup as somewhat like “linear probes” in ML interpretability research: the short-term predictor simply finds correlations between the ground truth (actual physiological arousal in your Steering Subsystem) and your various unlabeled learned interoceptive concepts.
And then why do we need learning rate modulation? Because the correlations we’re looking for are only present when you’re paying attention to your own interoceptive inputs. If you’re not—e.g. if you’re reading a book and empathetically simulating the protagonist—the correlations get messed up. For example, if the book protagonist is feeling intense rage, then you might (transiently) experience actual anger yourself. But you also might not! And even if you do empathetically feel some anger on behalf of the protagonist, it would probably be more “mild anger” than “intense rage”. Either way, the active concepts in the cortex (based on the book text) would not match the actual state of your Steering Subsystem. (See Valence series §1.5.4–§1.5.5 for more on this point.) So during such times, it’s fine if this short-term predictor continues to be queried, but we don’t want it to be updated.
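Here is a toy sketch of why the gating matters, under heavy simplifying assumptions: a two-element concept space (`[arousal concept, calm concept]`), a linear probe, and a made-up mix of "attending to my own body" and "reading a book" episodes. All names and numbers are hypothetical.

```python
import random


def train_probe(episodes, gate_on_attention, learning_rate=0.05):
    """Fit a linear probe from concept activations to actual arousal.

    Each episode is (concepts, actual_arousal, attending_to_interoception).
    With gate_on_attention=True, the learning rate is zero whenever the
    cortex is not attending to real interoceptive input.
    """
    weights = [0.0, 0.0]
    for concepts, actual_arousal, attending in episodes:
        rate = learning_rate if (attending or not gate_on_attention) else 0.0
        prediction = sum(w * c for w, c in zip(weights, concepts))
        error = prediction - actual_arousal
        weights = [w - rate * error * c for w, c in zip(weights, concepts)]
    return weights


def make_episodes(n, seed=0):
    """Build a hypothetical mix of self-attending and book-reading episodes."""
    rng = random.Random(seed)
    episodes = []
    for _ in range(n):
        if rng.random() < 0.5:
            # Attending to my own body: the "arousal" concept tracks ground truth.
            arousal = rng.choice([0.0, 1.0])
            episodes.append(([arousal, 1.0 - arousal], arousal, True))
        else:
            # Reading a book: the "arousal" concept is active (the protagonist
            # is enraged) but my own Steering Subsystem is calm.
            episodes.append(([1.0, 0.0], 0.0, False))
    return episodes


episodes = make_episodes(2000)
gated = train_probe(episodes, gate_on_attention=True)
ungated = train_probe(episodes, gate_on_attention=False)
```

With gating, the probe's weight on the arousal concept converges to about 1, i.e. it cleanly flags that concept. Without gating, the book-reading episodes keep dragging that weight back down, so the probe ends up with a muddier readout.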
OK, so we can build an “interoceptive concept finder” by taking a short-term predictor and judiciously setting up its context data, learning rate modulation, temporal delay setting, and so on. Then what? Is an “interoceptive concept finder” setup the best way to build a short-term predictor for physiological arousal? …Wrong question! We don’t have to pick just one “best” short-term predictor for physiological arousal! The brain can have more than one short-term predictor for the same signal. They can be complementary. For example, I don’t think an “interoceptive concept finder” for physiological arousal would be the most effective way to react quickly and preemptively to dangerous situations—for that, you’d want a predictor that listens directly to exteroceptive inputs like vision and sound. But on the other hand, an “interoceptive concept finder” is probably helpful for planning, since it can tell the Steering Subsystem about the feelings that might result from a possible future plan (see “the interface problem” in Intro series §6.2.2).
Anyway, it turns out that our “interoceptive concept finder” is exactly what we need for our social instincts story. Let’s keep going:
3.3 Neuroscience details
For involuntary attention: There are probably multiple pathways working in conjunction. Probably cholinergic and/or adrenergic neurons are involved. More specifically, cholinergic projections to the cortex are probably part of this story, and so are the cholinergic projections to thalamic relay cells. I don’t know the details.
For adjusting learning rate: There are a bunch of ways this could work. If there’s an error signal coming from the Steering Subsystem (hypothalamus or brainstem) to a short-term predictor, it could be set to zero, and then there’s no learning. Or maybe there’s a separate signal for learning rate (maybe acetylcholine again?) coming from the Steering Subsystem, which could be turned off instead. There could also be some more indirect effect of lack-of-attention on the cortex side—like maybe the cortex representations are less active when they’re further removed from sensory input, and that indirectly reduces learning rate, or something. I don’t know.
Two short-term predictors for the same thing: I mentioned that for physiological arousal and similar innate state variables, I think there are (at least) two different short-term predictors of that same ground truth, one using exteroception-related data as context, the other (i.e. the “interoceptive concept finder”) using interoception-related data as context. My guess is that the former is in the amygdala. The latter is maybe somewhere in the medial prefrontal or cingulate cortex (or insula … or precuneus … or NAc medial shell … I really don’t know). (Clarification for the latter: I think most of the short-term predictors are medium spiny neurons in the “extended striatum”, and have been labeling my diagrams accordingly. But as I mentioned in §2.1 above, I do think there are places where pyramidal neurons play a short-term predictor role too, including in layer 5PT of certain parts of the cortex.)
4. Ingredient 4: Reading out transient empathetic simulations
If we apply the same kind of reasoning as above, it suggests a path to solving the symbol-grounding problem for somebody else’s feelings. A key ingredient we need is “involuntary LACK of attention towards interoceptive inputs”, triggered by the “social attention reflex” of Ingredient 1A—the right side of this diagram:
What is this “lack of attention” supposed to accomplish? Here’s a schematic diagram illustrating the flows of information / attention / constraints in a normal situation (left) and in a situation where one of the Ingredient 1A conspecific detection heuristics has just fired (right):
The involuntary lack of attention transiently disconnects the interoceptive models from what I’m feeling right now. Instead, the space of interoceptive models in the cortex will settle into whatever is most consistent with what’s happening in the visual, semantic, and other areas of the cortex (a.k.a. “global workspace”). And thanks to the orienting reflex, those other areas of the cortex are modeling Zoe.
And therefore, if any interoceptive models are active, they’re ones that have some semantic association with Zoe. Or more simply: they’re how Zoe seems to be feeling, from my perspective.
We’re almost there! I’ll pull out the right half of that figure, and attach an “interoceptive concept finder” (§3.2 above), and a gate that only opens precisely when the social attention reflex is active:
And bam, we have solved the symbol grounding problem for other people’s feelings! The signal at the bottom should occasionally be allowed through the gate, and when it does, it will carry information about how a different person seems to be feeling.
(I showed the example of physiological arousal, but the same logic applies to “being happy”, “being angry”, “being in pain”, etc.)
This step is built on the kind of “transient empathetic simulation” that I’ve discussed previously: the “interoceptive concept finder” short-term predictor is trained by supervised learning on instances of myself feeling physiological arousal, but right now it’s being triggered by thinking about someone else feeling physiological arousal.
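As a rough sketch of the gate described above: the probe weights stand in for an "interoceptive concept finder" that was trained earlier on my own feelings, and the gate passes its output only while the social attention reflex is active. The function name, the two-element concept space, and the choice of 0 for a closed gate are assumptions for illustration.

```python
def conspecific_feeling_readout(interoceptive_concepts, probe_weights,
                                social_attention_reflex_active):
    """Return the gated "conspecific seems to be feeling X" signal.

    The probe is queried on whatever interoceptive concepts are currently
    active. Its output is passed on only while the social attention reflex
    is active; otherwise the downstream signal is 0 (an assumed convention).
    """
    probe_output = sum(w * c for w, c in zip(probe_weights, interoceptive_concepts))
    return probe_output if social_attention_reflex_active else 0.0


# Hypothetical concept space: [arousal concept, calm concept]
probe_weights = [1.0, 0.0]       # learned earlier from my own arousal
zoe_seems_stressed = [0.8, 0.1]  # concepts evoked by thinking about Zoe

gate_open = conspecific_feeling_readout(zoe_seems_stressed, probe_weights, True)
gate_closed = conspecific_feeling_readout(zoe_seems_stressed, probe_weights, False)
```

The same probe is used in both calls; the only difference is whether its output is allowed through to the Steering Subsystem as information about someone else.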
4.1 So, the “social attention reflex” is also a “this is an empathetic simulation” flag?
Well, kinda. But with some caveats.
The sense in which this is true is: both the interoceptive model space and the associated short-term predictors are trained in a circumstance where they relate exclusively to my own interoceptive inputs, but then they’re sometimes queried in a circumstance where they relate to someone else’s interoceptive inputs.
But in other senses, calling it an “empathetic simulation” flag might be a bit misleading.
First, it would be a transient empathetic simulation, lasting a fraction of a second, which is rather different from how we normally use the term “empathy”—more on that in Intro Series §13.5.2.
Arguably, even “transient empathetic simulation” is an overstatement—it’s just some learned semantic association between what I’m seeing and some feeling-related concept. The concept of Zoe seems to somehow imply the concept of stress, within my world-model. That's all. I don't really need to be “taking her perspective”, nor to be feeling Zoe’s simulated stress in Zoe’s simulated loins, or whatever.
Second, this reflex is exclusively related to empathetic simulations of what someone is feeling[7]—not empathetic simulations of what they're thinking, seeing, etc. For example, if I'm curious whether Zoe can see the moon from where she's standing, then I would do a quick empathetic simulation of what Zoe is seeing. The “social attention reflex” is not particularly related to that; indeed, if anything, this reflex is probably anticorrelated with that, since it innately activates in situations where orienting reflexes are pulling attention to our own exteroceptive sensory inputs.
Thus, my framework implies that social instincts can only involve reacting to someone's (assumed) feelings. It cannot (directly) involve reacting to what someone is seeing, or thinking, etc. I think that claim rings true to everyday experience.
And there's actually a deeper reason to believe that claim. If I take Zoe’s visual perspective and imagine that she’s looking at a saxophone, then my Steering Subsystem can’t do anything with that information. The Steering Subsystem doesn’t understand saxophones, or anything else about our big complicated world. But it does know the “meaning” of its suite of innate physiological state variables and signals—physiological arousal, body temperature, goosebumps, and so on. See my discussion of “the interface problem” in Intro Series §6.2.2.
Third, as mentioned above, only a subset of short-term predictors (those set up as “interoceptive concept finders”) will output transient empathetic simulation data during a social attention reflex. Other short-term predictors will not.
4.2 Neuroscience details
Involuntary lack-of-attention signal: Well, absence-of-attention might just involve suppressing presence-of-attention pathways, like the ones I mentioned under Ingredient 3 above (possibly involving acetylcholine). Or it might be a different system that pushes in the opposite direction—maybe involving serotonin? Or (more likely) multiple complementary signals that work in different ways. I don’t have any strong opinions here.
5. Hypothesis: a “compassion / spite circuit”
Everything so far was preliminaries—now we can start speculating about real social instincts! My main example is a possible innate drive circuit that would be upstream of compassion and spite. Start with another Steering Subsystem signal:
5.1 The “Conspecific seems to be feeling (dis)pleasure” signal
The first step is to get a “conspecific seems to be feeling pleasure / displeasure”[8] signal in the Steering Subsystem, as follows:
The purple box is yet another Steering Subsystem signal that I’m labeling “pleasure / displeasure”. This is closely related to valence—for details see Valence Series §A. Then the gray box would be an intermediate variable[9] in the Steering Subsystem which would, by design, track the extent to which I think of the conspecific as feeling pleasure / displeasure.
That was just the start. Next, how do we build a social instinct out of the gray “conspecific seems to be feeling pleasure / displeasure” box? We need another Steering Subsystem parameter!
5.2 The “friend (+) vs enemy (–)” parameter
I introduced another Steering Subsystem parameter called “friend (+) vs enemy (–)”. When this parameter is extremely negative, it indicates that whatever you’re thinking about (in this case, the conspecific) should be physically attacked, right now. If the activity level is mildly negative, then you probably won’t go that far, but you’ll still feel like they’re the enemy and you hate them. If it’s positive, you’ll feel “on the same team” as them.
Anyway, when the “friend (+) vs enemy (–)” parameter is positive, then “conspecific seems to be feeling pleasure / displeasure” causes positive / negative valence respectively. This innate drive would lead to compassion—we feel intrinsically motivated by the idea that the conspecific is feeling pleasure, and intrinsically demotivated by the idea that the conspecific is feeling displeasure.
…And if the “friend (+) vs enemy (–)” parameter is negative, we flip the sign: “conspecific seems to be feeling pleasure / displeasure” causes negative / positive valence respectively. This innate drive would lead to both spite and schadenfreude.
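The sign-flip logic of the last two paragraphs can be sketched as follows. This is a minimal illustration assuming both quantities are scalars; it uses only the sign of the friend / enemy parameter, and leaves open how its magnitude matters.

```python
def compassion_spite_valence(friend_vs_enemy, conspecific_pleasure):
    """Valence contribution from the hypothesized compassion / spite circuit.

    friend_vs_enemy: > 0 for friend, < 0 for enemy (hypothetical scalar).
    conspecific_pleasure: > 0 if the conspecific seems to feel pleasure,
        < 0 if they seem to feel displeasure.
    """
    if friend_vs_enemy > 0:
        return conspecific_pleasure    # compassion: their pleasure feels good to me
    if friend_vs_enemy < 0:
        return -conspecific_pleasure   # spite / schadenfreude: sign is flipped
    return 0.0                         # exactly neutral: assumed no contribution


friend_in_pain = compassion_spite_valence(0.7, -1.0)  # negative valence
enemy_in_pain = compassion_spite_valence(-0.7, -1.0)  # positive valence
```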
How is the “friend (+) vs enemy (–)” parameter itself calculated? By other social instincts outside the scope of this post—more on that in §7 below. Perhaps part of it is a different circuit that says: if thinking about a conspecific co-occurs with positive valence (i.e., if we like / admire them), then that probably shifts the friend/enemy parameter a bit more towards friend, and perhaps also conversely with negative valence. That’s not circular, because conspecifics can acquire positive or negative valence for all kinds of reasons, just like sweaters or computers or anything else can acquire positive or negative valence for all kinds of reasons, including non-social dynamics like if I’m hungry and the conspecific gives me yummy food. That’s a robust and flexible system that will leverage my rich understanding of the world to systematically assign “friend” status to conspecifics who lead to good things happening for me. That’s probably just one factor among many; I imagine that there are lots of innate circuits that can impact friend / enemy status in various circumstances. Of course, as usual, the friend / enemy parameter would be attached to one or more short-term predictors, enabling memory, generalization, and perhaps also transient empathetic simulations.
5.2.1 Evolution and zoological context
Pretty much every complex social animal has innate, stereotyped behaviors for both helping and hurting conspecifics in different circumstances—e.g. attack behaviors, and companionship-type behaviors such as within families.
And evolutionarily, if it makes sense to help or hurt conspecifics through innate, stereotyped behaviors, then presumably it also makes sense to help or hurt conspecifics through the more powerful and flexible pathways that leverage within-lifetime learning, as would happen through a “compassion / spite circuit”. (See (Appetitive, Consummatory) ≈ (RL, reflex).)
Indeed, even in rodents, I think there’s clear evidence of more flexible, goal-oriented behaviors to (selectively) help conspecifics. For example, Márquez et al. 2015 find that rats help conspecifics via choice of arm in a T-shaped maze. And Bartal et al. 2014 find that rats release conspecifics from restraints, but only in situations where they feel friendly towards the conspecific. (See also: Kettler et al. 2021.) I don’t think either of these needs to be explained with my proposed “compassion / spite circuit” above involving transient empathetic simulation; for example, maybe rats squeak in a certain way when they’re happy, and hearing another rat make a happy squeak triggers a primary reward, or whatever. But anyway, as far as I can tell at a glance, the “compassion / spite circuit” is at least plausibly present even in rodents.
…Or maybe it’s just a “compassion” circuit for rodents. I can’t immediately find any evidence either way on whether rats display flexible, goal-oriented spite-type behavior towards other rats they hate. (They undoubtedly have inflexible, stereotyped, threat and attack postures and behaviors, but that’s different—again see (Appetitive, Consummatory) ≈ (RL, reflex).) Let me know if you’ve seen otherwise!
5.2.2 Neuroscience details
I expect that friend-vs-enemy is two groups of neurons that are mutually inhibitory, as opposed to one that swings positive and negative compared to baseline. That’s how the hypothalamus handles hungry-vs-full, for example (see here). As for where those neuron groups are, I don’t know. Probably medial hypothalamus somewhere.
5.3 Phasic physiological arousal
“Phasic” means that physiological arousal jumps up for a fraction of a second, in synchronization with noticing something, thinking a certain thought, etc. The opposite of “phasic” is “tonic”, like how I can have generally high arousal (alertness, excitement) in the morning and generally low arousal in the afternoon.
Now, one thing that my compassion / spite circuit above is missing is a notion that some interactions can feel more important / high-stakes to me than others. I think this is a separate axis of variation from the friend / enemy axis. For example, my neighbor and my boss are both solidly on the “friend” side of my friend / enemy spectrum—I feel “warmly” towards both, or something—but interactions with my boss feel much higher stakes, and correspondingly I react more strongly to their perceived feelings. So let’s refine the circuit above to fix that:
Basically, when I orient to a conspecific and then recognize them, the associated phasic arousal[10] tracks how important (high-stakes) this interaction with the conspecific is, from my perspective. Then we use that to scale up or down the compassion / spite response.
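As a toy illustration of the scaling step, here is a sketch that treats phasic arousal as a simple multiplier on the compassion / spite response. The multiplicative form and the specific numbers for the boss and the neighbor are assumptions for the sake of the example.

```python
def scale_by_phasic_arousal(compassion_spite_response, phasic_arousal):
    """Scale the compassion / spite response by how high-stakes the interaction feels.

    phasic_arousal is a hypothetical non-negative scalar; plain
    multiplication is the simplest possible choice, not an established fact.
    """
    return phasic_arousal * compassion_spite_response


# Both people are "friends" who seem equally pleased (response = 1.0),
# but interacting with my boss triggers more phasic arousal.
boss_seems_pleased = scale_by_phasic_arousal(1.0, phasic_arousal=0.9)
neighbor_seems_pleased = scale_by_phasic_arousal(1.0, phasic_arousal=0.2)
```

Same friend / enemy status, same perceived feeling, but the boss's apparent pleasure ends up mattering more to me.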
5.3.1 Neuroscience details
I think the locus coeruleus, a tiny group of 30,000 neurons (in humans), is the high-level arousal-controller in your brain, and its activity can vary over short timescales (up and down within half a second; there’s a plot in Clayton et al. 2004). If you measure pupil dilation, then maybe you’ll miss some of the very fastest dynamics, but you will see the variation on a ≈1-second timescale. If you measure skin conductance, that’s slower still.
I’m generally assuming in this post that “arousal” is a scalar. That’s probably something of an oversimplification (see Poe et al. 2020 & Luskin et al. 2025) but good enough for present purposes.
I’ve been talking as if the role of phasic arousal is specific to the “compassion / spite circuit”, but a more elegant possibility is that it’s a special case of a very general interaction between arousal and valence, such that arousal makes all good things seem better, and makes all bad things seem worse, other things equal. After all, arousal is saying that a situation is high-stakes. So that kind of general dynamic seems evolutionarily plausible to me.
(For the record, I think the general interaction between arousal and valence is not just multiplicative. I think there’s also a thing that we call “being overwhelmed”, where sufficiently high arousal can cause negative valence all by itself. Basically, in a very high-stakes situation, the Steering Subsystem wants to say that things are either very good or very bad, and in the absence of positive evidence that things are very good, it treats “very bad” as a default.)
5.4 Generalization via short-term predictors
As usual, Steering Subsystem signals can serve as ground-truth supervision for short-term predictors, which supports generalization. Thanks to “defer-to-predictor mode” (see Intro Series §5), we wind up with Steering Subsystem social instincts activating in situations where nobody is in the room with me right now, but nevertheless I find myself intrinsically motivated by the idea of Zoe feeling good in general, and/or Zoe feeling good about me in particular.
6. The “compassion / spite circuit” also causes a “drive to feel liked / admired”
Let’s talk about the social instinct that I call “drive to feel liked / admired”—i.e., an innate drive that makes it so that, if I think highly of person X, then it’s inherently motivating to believe that person X thinks highly of me too. To make this work, one might think that we need another ingredient. It’s not enough for the Steering Subsystem to have strong evidence that my conspecific is feeling pleasure or displeasure, as above. The Steering Subsystem has to get strong evidence that my conspecific is feeling pleasure or displeasure in regards to me in particular. Where could such evidence come from?
Remarkably, my answer is: we already got it! We don’t need any other ingredients. It’s just an emergent consequence of the same circuit above!! Let me explain why:
6.1 Key idea: My “compassion / spite circuit” is disproportionately active and important while the conspecific is thinking about me-in-particular
Let’s say Zoe walks up to me and says “hey”. Or she’s having a conversation with me. Or she’s staring at me from across the room. These situations are quite common, and have two critical properties: (1) my “social attention reflex” is triggering like crazy, perhaps once a second or even more, and (2) Zoe is probably thinking about me-in-particular.
So what? Well…
Here’s a diagram illustrating this:
Thus, the compassion / spite circuit leads people to have a particular motivation for other people to have positive feelings about them. This is what I’ve called “the drive to feel liked / admired”.
(Note to readers: when I was initially writing this post, I was very focused on “drive to feel liked / admired”. Later on, I decided that “drive to feel liked / admired” is just one of numerous downstream impacts of this kind of reward signal. See my follow-up post: Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking.)
6.2 If the same circuit drives both compassion and “drive to feel liked / admired”, why aren’t they more tightly correlated across the population?
If the same innate circuit in the Steering Subsystem is upstream of both compassion and “drive to feel liked / admired”, then one might think that these two things should be yoked together. In other words, if that circuit’s output is generally strong in one person, then they should wind up with both drives being powerful influences on their behavior, and if it’s weak in another person, then they should wind up with neither drive being a powerful influence.
But in fact, in my everyday experience, these seem to be somewhat independent axes of variation, with some people apparently driven much more by one than the other. How does that work?
The answer is simple. If, in the course of life, the circuit often activates when the conspecific is thinking about me-in-particular, and rarely activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize feeling liked / admired. And conversely, if the circuit rarely activates when the conspecific is thinking about me-in-particular, and often activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize compassion.
As an example of the former, suppose Phoebe tends to react very weakly (low arousal, or perhaps not orienting at all) to seeing a person out of the corner of her eye, or to hearing someone’s voice in the distance as they talk to someone else, but Phoebe does reliably react to the more powerful stimuli of transient eye contact, or someone getting her attention to talk to her. Then Phoebe would wind up with a relatively strong drive to feel liked / admired relative to her compassion drive.[11]
As an example of the latter, let’s turn to autism. As I’ve discussed in Intense World Theory of Autism, autism involves many different suites of symptoms which don’t always go together (sensory sensitivity, “learning algorithm hyperparameters”, proneness to seizures, etc.). But a common social manifestation would be kinda the reverse of the above. Given a trigger-happy arousal system, many autistic people will respond robustly and frequently to things like noticing someone out of the corner of their eye, or hearing someone in the distance. But as for receiving eye contact, or someone deliberately trying to get their attention, they’ll find it so overwhelming that they’ll tend to avoid those situations in the first place,[12] or use other coping methods to limit their physiological arousal. So that’s my attempted explanation for why many autistic people have an especially weak “drive to feel liked / admired”, relative to their comparatively-more-typical levels of compassion and spite, if I understand correctly.
6.3 Whose admiration do I crave?
I think it’s common sense that, in the “drive to feel liked / admired”, we’re driven to be liked / admired by some people much more than others. For example, think of a real person whom you greatly admire, more than almost anyone else, and imagine that they look you in the eye and say, “wow, I’m very impressed by you!” That would probably feel extremely exciting and motivating! Such events can be life-changing—see Mentorship, Management, and Mysterious Old Wizards. Next, imagine some random unimpressive person looks you in the eye and says the same thing. OK cool, maybe you’d be happy to receive the compliment. Or maybe not even that. It sure wouldn’t go down as a life-affirming memory to be treasured forever. More examples in footnote→[13]
I had previously written that, if Zoe likes / admires me, then that feels intrinsically motivating to the extent that I like / admire Zoe in turn. Whoops, I’ve changed my mind! Instead, I now think that it feels intrinsically motivating to the extent that interactions with Zoe seem important and high-stakes from my perspective, regardless of whether I like / admire her.[14] (However, if I see her as “enemy” rather than “friend”, then that would have an impact.) For example, if Zoe is my boss whom I only mildly like / admire, I think I would still react strongly to her approval. That’s what we get from the circuit above: the physiological arousal will respond to how high-stakes it feels for me to be interacting with Zoe, along with the various other factors (e.g. receiving eye contact automatically causes extra arousal). I think my new theory is a better fit to everyday experience, but you can judge for yourself and let me know what you think.
There’s an additional question of what’s upstream of that—i.e., what leads to some people inducing physiological arousal (i.e. being “attention-grabbing”, “intimidating”, “larger-than-life”, etc.) more than others? I think it’s complicated—lots of things go into that. Some come straight from arousal-inducing innate reactions. For example, I think we have an innate reaction that induces arousal upon interacting with a tall person, just as many other animals have instincts to “size each other up”. The evolutionary logic is: Any interaction with a tall person is high-stakes because they could potentially beat us up. In other cases, the physiological arousal routes through within-lifetime learning. Is the person in a position to strongly impact my life?
Incidentally, if we compare my previous theory (that I’m driven to be liked / admired by Zoe in proportion to how much I like / admire Zoe in turn) to my current theory (that I’m driven to be liked / admired by Zoe in proportion to how much interactions with Zoe feel arousing, a.k.a. high-stakes), I think there’s some overlap in predictions, because strongly liking / admiring Zoe is correlated with feeling like interactions with Zoe are high-stakes. I think the correlation comes from both directions. If I strongly like / admire Zoe, then as a consequence, my interactions with her can feel high-stakes. My liking / admiring her puts her in a position to impact my life. For example, if she spurns me, then I’ve lost access to something I enjoy; plus, I’ve implicitly given her the power to crush my self-esteem. In the other direction, if interactions with Zoe feel high-stakes, I think that can impact how much I like / admire Zoe, for various reasons, including the general valence-arousal interaction mentioned in §5.3.1.
7. Other examples of social instincts
I think the “compassion / spite circuit” above is an important piece of the puzzle of human social instincts. But there’s a whole lot more to social instincts beyond that! Really, I think there’s a bunch of interacting circuits and signals in the Steering Subsystem. How can we pin it down?
Experimentally, there’s a longstanding thread of work laboriously characterizing each of the hundreds of little neuron groups in the Steering Subsystem. More of that would obviously help. I mentioned at least one specific experiment above (§1.2). In parallel, perhaps we could try leapfrogging that process by measuring a complete connectome! My impression is that there are viable roadmaps to a full mouse connectome within years, not decades—much sooner than people seem to realize. Indeed, my guess is that getting a primate or even human connectome well before Artificial General Intelligence is totally a viable possibility, given appropriate philanthropic or other support. (See here.)
On the theory side, as we wait for that data, I think there’s still plenty of room for further careful armchair theorizing to come up with plausible hypotheses. A possible starting point for brainstorming is to look at the set of innate stereotyped (a.k.a. “consummatory”) behavior towards conspecifics, to guess at some of the signals that might be internal to the Steering Subsystem. Doing that is a bit tricky for humans, since our behavioral repertoire comes disproportionately from learning and culture (excepting early childhood, I suppose). But for example, if a rodent sees another rodent, it might display:
Of these:
So that brings us to:
7.1 “Drive to feel feared” (a.k.a. “drive to receive submission”)
Dual strategies theory (see my own discussion at Social status part 2/2: everything else) says that people can have “high status” in two different ways: “prestige” and “dominance”. If the “drive to feel liked / admired” above is upstream of seeking prestige for its own sake, then the “drive to feel feared” would be correspondingly upstream of seeking dominance for its own sake.
The “drive to feel feared” could also be called “drive to receive submission”—i.e., a drive for others to display submissive behavior towards me, as in those rats rolling onto their backs. I’m not sure which of those two terms is better. I figure there’s probably some Steering Subsystem signal that’s upstream of both a tendency towards submissive behavior and a tendency towards fear and flight behavior, and it’s this upstream signal that flows into the circuit.
Evolutionarily, it makes perfect sense for there to be a “drive to feel feared”. If someone submits to me, then I’m dominant, and I get first dibs on food and mates without having to fight.
Neuroscientifically, I think the circuit for “drive to feel feared” could be parallel to the “compassion / spite circuit” above. More specifically, the first step is using Ingredient 4 to get to “Conspecific seems to be feeling fear / submission”:
And then we combine that with physiological arousal to get a motivational effect:
And as before, this would fire especially strongly under eye contact or other signals that the conspecific is thinking of you-in-particular:
(As drawn, the circuit might (mis)fire when I notice my friend submitting to a bully who is also simultaneously threatening me. I think that would be solvable by gating the circuit such that it doesn’t fire if I myself am also feeling fear / submission. Let me know if you think of other examples where this proposal doesn’t work.)
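The proposed circuit, including the gating fix in the parenthetical above, can be sketched as a tiny function. This is a minimal illustration only: the function name `drive_to_feel_feared`, the 0-to-1 signal scales, the multiplicative combination, the hard threshold `fear_gate`, and the eye-contact bonus are all assumptions made for the sketch, not claims about how the Steering Subsystem actually computes anything.

```python
def drive_to_feel_feared(
    conspecific_fear: float,  # "conspecific seems to be feeling fear / submission"
    my_arousal: float,        # my own physiological arousal
    my_fear: float,           # my own fear / submission signal
    fear_gate: float = 0.5,   # hypothetical gating threshold
) -> float:
    """Toy motivational output; all signals are assumed to lie in [0, 1]."""
    # Gate: if I myself am afraid (e.g. the bully is also threatening me),
    # the circuit stays silent.
    if my_fear >= fear_gate:
        return 0.0
    # Assumed combination rule: the output needs BOTH ingredients,
    # so multiply them.
    return conspecific_fear * my_arousal


# Hypothetical numbers: eye contact adds extra arousal on top of a baseline.
baseline_arousal = 0.3
eye_contact_bonus = 0.4

no_eye_contact = drive_to_feel_feared(0.8, baseline_arousal, my_fear=0.0)
with_eye_contact = drive_to_feel_feared(
    0.8, baseline_arousal + eye_contact_bonus, my_fear=0.0
)
bully_scenario = drive_to_feel_feared(0.8, 0.9, my_fear=0.9)

print(no_eye_contact, with_eye_contact, bully_scenario)
```

In this sketch the output is larger with eye contact than without, and it is zero in the bully scenario even though both the friend’s apparent submission and my own arousal are high. A soft gate (e.g. scaling by one minus my own fear) would be an equally plausible choice.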
8. Conclusion
I feel like I have the big picture of a plausible nuts-and-bolts explanation of how the human brain solves the symbol grounding problem to implement social instincts. It might be wrong, and I’m happy for feedback.
Ingredients 1–4 constitute a kind of domain-specific language in which I think all of our social instincts are written. And then §5–§7 include an attempt to build two specific social instincts out of the elements of that language, out of a much larger collection of social instincts yet to be sorted out. I figure that the things I wrote down, while a bit sketchy and incomplete, are probably capturing at least some aspects of compassion, spite, schadenfreude, “drive to feel liked / admired”, and “drive to feel feared”, and I think these collectively capture a lot of the human social world. (See also my post A theory of laughter for how laughter and play work.)
If you think this post is totally on the wrong track, then please let me know, by email or the comments section below. If it’s on the right track, then that’s great, but we still obviously have tons of work left to do to really pin down human social instincts, possibly in conjunction with experiments, as discussed in §7 above.
In case anyone’s wondering, I think my next project going forward will be to spend a while pondering the very biggest picture of brain-like AGI safety—everything from reward functions and training environments and testing, to governance and deployment and society, in light of (what I hope is) my newfound understanding of how human social instincts generally work. My confusion on that topic has been a big blocker to my thinking and progress previous times that I tried to do that. After that, I guess I’ll figure out where to go from there! Should be interesting.
Thanks Seth Herd and Simon Skade for critical comments on earlier drafts, and thanks various commenters and especially Rif A. Saurous for critical feedback that informed later revisions.
Changelog
(Some previous versions of the post are archived at the DOI link. I can share even more fine-grained version history and list of changes upon request.)
2025-11-26: Since initial publication, I’ve added links to some later follow-up posts (search for “UPDATE” in the text), made some minor wording changes, replaced a secondary-source reference with the corresponding primary source, and added a reference.
2026-04-30: I changed terminology from “the ‘thinking of a conspecific’ flag” to “the social attention reflex”. I think the new term has better connotations, especially the way it invokes a parallel to “orienting reflex” and “startle reflex”, which likewise are associated with fast, transient, and involuntary changes in both attention and other innate signals like pleasure and arousal.
Relatedly, I deleted a few words suggesting that the social attention reflex is more likely to be in the medial hypothalamus than the lateral hypothalamus. My old term (“thinking of a conspecific” flag) suggested a social-related state variable, which struck me as more medial-ish. But now I’m thinking of it more as a fast reflex, which strikes me as more lateral-ish if anything. But I dunno, I’m just guessing.
I also dramatically shortened and simplified §6.1 (“Key idea: My ‘compassion / spite circuit’ is disproportionately active and important while the conspecific is thinking about me-in-particular”). I decided that this is a pretty straightforward point, and I was making it unnecessarily complicated.
Other minor wording tweaks (especially §3.2) for clarity.
2026-05-19: I rewrote §3–§5.1 to remove unnecessary complication, and clean up some errors and muddled thinking. More details:
In §3.2, I previously had a toy example of learning rate modulation in the thought assessors, where I was daydreaming about Taylor Swift, and then I suddenly orient to a spider jumping at me, and the learning rate modulation (I argued) was necessary to prevent learning that Taylor Swift is a risk factor for spiders jumping out at me. I do think that’s an actual solution to an actual problem, and that it’s implemented in the brain partly via the well-known “cholinergic interneuron pause” in response to (generalized) orienting reflexes. But I described this example poorly (and somewhat incorrectly), and more importantly it’s an example that’s not directly related to this post, and I think it was just causing unnecessary confusion (even I was confused when I re-read it). So I switched to a new example that overlaps much more with §4. I also deleted the discussion of learning rate modulation in the Thought Generator, which I decided was somewhat misleading and confusing as written, and off-topic anyway.
That change to §3, in turn, allowed me to shorten and streamline §4, including in ways that hopefully made §5.1 a bit clearer in turn.
The new version introduces and uses a new term I just made up, “interoceptive concept finder”, for a particular type of short-term predictor.
Some bits of text in this introductory section are copied from an earlier (wrong) post, “Spatial attention as a “tell” for empathetic simulation?”.
For a different (simpler) example of what I think it looks like to make progress towards that kind of pseudocode, see my post A Theory of Laughter.
Thanks to regional specialization across the cortex (roughly corresponding to “neural network architecture” in ML lingo), there can be a priori reason to believe that, for example, “pattern 387294” is a pattern in short-term auditory data whereas “pattern 579823” is a pattern in large-scale visual data, or whatever. But that’s not good enough. The symbol grounding problem for social instincts needs much more specific information than that. If Jun just told me that Xiu thinks I’m cute, then that’s a very different situation from if Jun just told me that Fang thinks I’m cute, leading to very different visceral reactions and drives. Yet those two possibilities are built from generally the same kinds of data.
Actually, this is an area where the evolutionary “design spec” can be pretty inscrutable. The (so-called) spider detector circuit, like any image classifier, triggers on all kinds of inputs, not all of which are spiders, including Bizarre Visual Input Type 74853 that has no relation to spiders and would occur on average once every 100 lifetimes in our ancestral environment. And maybe it just so happened that Bizarre Visual Input Type 74853 correlates with danger, such that noticing and recoiling from it was adaptive. Then that very fact would be part of the evolutionary pressure sculpting the (so-called) spider detector circuit, such that the term “spider detector circuit” is not a 100% perfect description of its evolutionary purpose.
My diagrams are drawn with the “supervisor” signal traveling from the Steering Subsystem to the short-term predictor, and then the subtraction step (“supervisor – output = error”) happening in the short-term predictor. But that’s just for illustration. I’m also open-minded to the possibility that the subtraction is performed in the Steering Subsystem, and that it’s the error signal that travels up to the short-term predictor. That’s more of a low-level implementation detail that I’m not too concerned with for the purpose of this post.
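To illustrate why this is only an implementation detail, here is a minimal sketch in which a short-term predictor is modeled as a linear unit trained with a simple delta rule. The linear model, the learning rate, and the function names are assumptions for illustration. The two update functions differ only in where the subtraction “supervisor – output = error” happens, and they produce identical weight changes.

```python
from typing import List


def predict(weights: List[float], context: List[float]) -> float:
    """Output of a toy linear short-term predictor."""
    return sum(w * x for w, x in zip(weights, context))


def update_given_supervisor(weights, context, supervisor, lr=0.1):
    """Variant A: the supervisor signal arrives at the predictor,
    which computes the error itself."""
    error = supervisor - predict(weights, context)
    return [w + lr * error * x for w, x in zip(weights, context)]


def update_given_error(weights, context, error, lr=0.1):
    """Variant B: the error was computed elsewhere (e.g. in the Steering
    Subsystem) and only the error signal arrives at the predictor."""
    return [w + lr * error * x for w, x in zip(weights, context)]


weights = [0.2, -0.1, 0.0]
context = [1.0, 0.5, 2.0]
supervisor = 1.0

variant_a = update_given_supervisor(weights, context, supervisor)

# In Variant B the subtraction happens outside the predictor.
external_error = supervisor - predict(weights, context)
variant_b = update_given_error(weights, context, external_error)

print(variant_a == variant_b)
```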
I’m oversimplifying. I think what would actually happen is: the predictor will flag your various interoceptive concepts in proportion to how much physiological arousal they entail. Note that some cultures probably don’t even have a “physiological arousal” concept per se; see my Lisa Feldman Barrett post.
For purposes of this discussion, things like sense-of-pain, sense-of-temperature, and “affective touch” (c-tactile receptors) count as interoception, not exteroception, despite the fact that you can in fact learn about the outside world via those signals. After all, the skin is an organ, and sensing the health and status of your organs is an interoception thing. See How Do You Feel by Bud Craig (2020) for detailed physiological evidence—nerve types, pathways in the spine and brain, etc.—that this is the right classification.
Here and elsewhere, I’m using English-language emotion words to refer to Steering Subsystem signals, because I don’t know how else to refer to them. But be warned that there is never a perfect correspondence between brainstem signals and emotion words (as we actually use them in everyday life). For more discussion of that point, see Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions.
As a general rule, there are multiple ways to turn pseudocode into neuroscientifically-plausible circuits. For example, the gray box is an intermediate variable in this calculation. I’m drawing it explicitly because it makes it easier to follow. But it might not be a separate cell group in the hypothalamus. Or conversely, it could be two cell groups, one for “pleasure” and the other for “displeasure”, with mutual inhibition. Or something else, who knows.
In terms of the Ingredient 4 discussion, this would be the actual phasic arousal in our own bodies, which is impacted by the exteroception-sensitive short-term predictors, but is not impacted by transient empathetic simulations of someone else’s phasic arousal.
I guess I’m predicting that people with constitutionally low arousal responses (extraverts, thrill-seekers, etc.) will tend to have a higher ratio of status drive to compassion drive. But I didn’t check that. It’s not a strong prediction—there are probably a bunch of other factors at play too.
Aversion to eye contact is common among autistic people. For example, John Elder Robison entitled his first memoir Look Me in the Eye, and discusses his aversion to eye contact in the prologue. And in the book excerpt I copied here, there are three quotes from autistic people about their experience of eye contact.
As an example, there’s an anecdote here of someone making a “feelgood” email folder for when she was feeling down, and most of the entries she mentions are basically compliments from people whom (I suspect) she sees as important and intimidating. As another example, my 9yo craves “impressing his parents” like a drug, and strives endlessly for us to laugh at his jokes, admire his knowledge and achievements, etc. But when we had regular visits with a 4yo who idolized him, he basically couldn’t care less.
Update Sept 2025: I think there’s an additional phenomenon where, if thoughts of Person X tend to induce physiological arousal in Person Y, then that contributes not only to Y wanting to feel liked / admired by X, but also (under certain conditions) to Y feeling sexually attracted to X, especially if Y is a cis woman. For more discussion see §3–§6 of my follow-up post Neuroscience of human sexual attraction triggers (3 hypotheses).