This isn’t as much a question as it is just me sharing some thoughts, but I would love to hear yours :) Let’s imagine we are our own brain’s optimizer. We just received a bad signal; we feel pain. Say we realized someone else is soon going to feel pain, and so we feel pain. What could the optimizer do now? Well, there are only two things it can do:
1. Try to disconnect “she feels pain” from the concept of pain that then triggered pain in yourself
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
You speak a lot to (1), explaining the symbol-grounding mechanism that continuously grounds the concept in ground truth, so the optimizer trying to move “she feels pain” away from its previous position in the feature space won’t work (at least as long as we continuously have such ground-truth input). This sheds light on the very immoral but very interesting experiment of having an individual not exposed to such input for long periods, like not seeing any human face for multiple months, be it in person, on pictures or on your phone. There, this theory should predict that such a move in feature space could happen and would be successful; to be dramatic, you become a psychopath.
You don’t speak much to (2), though. One option here, for example, would be to unlearn the concept of “future”: babies only gradually learn about it, so it’s reasonable to assume that you could unlearn it again. Luckily, this doesn’t seem to happen, so there must be some opposing force, something that promises reward if this concept persists.
Specifically, this concept must offer you insight into your actions such that your expected future reward rises. This is obvious in this case: without the concept of “future”, you can hardly make any intelligent decisions at all. But it also carries over to much more specific and even human-invented associations/knowledge:
Let’s say you work in cyber-security, and the reason you think this person will feel pain is that those cyber-security skills enabled you to make an association a normal person wouldn’t. The optimizer could try to unlearn these skills, but those skills actually lead to higher expected reward, else you wouldn’t be pursuing them: be it the nice house you can afford, the social status you enjoy because of them, or simply the joy you receive from exercising them.
In other words, anything you learned, you learned because you assumed it would result in a higher expected reward, and anything you act out (after learning), you do because it results in a higher expected reward. Forgetting these concepts would require a countervailing reward at least matching theirs.
This doesn’t imply it should be impossible, though. Let’s say you learned something that you hate, like, say, chiseling stone. You did it because the market paid insane wages, since only a few could do the job, so the reward you saw attached to those wages was immense, and you pushed through the boring education of becoming an expert in chiseling stone. And once you got there, you realize you weren’t the only one with the idea: wages drop quicker than the average pump & dump crypto coin. In fact, the profession you practiced before, which you intrinsically enjoy, even pays better.
As I’m writing this, I realize there are no good stories for why chiseling stone might give you a better glimpse into someone’s future pain, but let’s just take it for granted. Then the reward attached to the knowledge of chiseling stone is pretty much zero, maybe even negative, because whenever you recall it, you recall all the effort that didn’t pay off.
Yet I have never heard of anything along these lines happening. It would be quite a great mechanism for the free market, though; the wages would jump right back up. Let’s hope our individual in question doesn’t once again try to learn to chisel stone, having completely forgotten this tale of unreciprocated effort.
You could maybe argue something like: precisely the things that fall into this category are things we gave up on; that is, their occurrence in our day-to-day life is incredibly rare. Therefore, with a normal learning rate, we simply wouldn’t iterate over them often enough to forget them meaningfully.
Lastly, just for completeness: naturally, ‘disconnecting your previous thoughts from arriving at “she feels pain”’ also entails your previous actions. It’s a very special occurrence to know somebody will feel pain in the future, unless you had a hand in it yourself. Naturally, those past decisions will be optimized over as well, hopefully leading you to make better decisions in the future.
immoral but very interesting experiment … not seeing any human face for multiple months, be it in person, on pictures or on your phone
There must be plenty of literature on the psychological effects of isolation, but I haven’t looked into it much. (My vague impression is: “it messes people up”.) I think I disagree that my theory makes a firm prediction, because who is to say that the representations will drift on a multiple-month timescale, as opposed to much slower? Indeed, the fact that adults are able to recall and understand memories from decades earlier implies that, after early childhood, pointers to semantic latent variables remain basically stable.
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
I would describe this as: if it’s unpleasant to think about how my friend is suffering, then I can avoid those unpleasant feelings by simply not thinking about that, and thinking about something else instead.
For starters, there’s certainly a kernel of truth to that. E.g. see compassion fatigue, where people will burn out and quit jobs working with traumatized people. Or if someone said to me: “I stopped hanging out with Ahmed, he’s always miserable and complaining about stuff, and it was dragging me down too”, I would see that as a perfectly normal and common thing for someone to say and do. But you’re right that it doesn’t happen 100% of the time, and that this merits an explanation.
My own analysis is at: §4.1.1 and §4.1.2 of my (later) Sympathy Reward post. The most relevant-to-you part starts at: “From my perspective, the interesting puzzle is not explaining why this ignorance-is-bliss problem happens sometimes, but rather explaining why this ignorance-is-bliss problem happens less than 100% of the time. In other words, how is it that anyone ever does pay attention to a suffering friend? …”
So that’s my take. As for your take, I think one of my nitpicks would be that I think you’re giving the optimizer-y part of the brain a larger action space than it actually has. If I would get a higher reward by magically teleporting, I’m still not gonna do that, because I can’t. By the same token, if I would get a higher reward by no longer knowing some math concept that I’ve already learned, tough luck for me, that is not an available option in my action space. My world-model is built by predictive (a.k.a. self-supervised) learning, not by “whatever beliefs would lead to immediate higher reward”, and for good reason: the latter has pathological effects, as you point out. (I’ve written about it too, long ago, in Reward is Not Enough.) I do have actions that can impact beliefs, but only in an indirect and limited way—see my discussion of motivated reasoning (also linked in my other comment).
let me preface this by saying how much I enjoyed reading this post - it really shows that this isn't some random idea you had but that you really thought a lot about this. As someone whose first introduction to this kind of idea was precisely this blogpost, thanks.
question - maybe I'm simply misunderstanding you:
- you seem to assume that the cortex's modelling of one's own happiness is very similar to the cortex's modelling of thinking of happiness. you might argue that it's only the "concept of happiness", which I would agree is present in both scenarios, but it doesn't strike me why that in particular would be learned using this supervised mechanism.
- building on that point, I think it might be more probable that understanding another's feelings is part of 1A - instead of simply seeing, hearing, etc. there would be something tasked with analyzing facial cues - in particular humans exhibit micro expressions (expressions that last very short periods and are almost impossible to control), something most people can't seem to pick up on, at least consciously. So why do we have them if other people can't pick up on them? Maybe they can, but only subconsciously, to precisely facilitate this symbol grounding for somebody else's feelings. Then again, if you can't consciously pick up on it, the target for the supervision will probably be terrible as well, so maybe that's not it.
(i'll probably hammer u with more questions down the line, still trying to process all of this lol)
Thanks!!
you seem to assume that the cortex's modelling of one's own happiness is very similar to the cortex's modelling of thinking of happiness
I would say “overlaps” rather than “is similar to”. Think of it as vaguely like I-am-juggling versus you-are-juggling. Those are different thoughts, but they overlap, in that they both involve the “juggling” concept. That overlap is very necessary for e.g. recognizing that the same word “juggling” applies to both, and for transferring juggling-related ideas between myself and other people, which we are obviously very capable of doing.
you might argue that it's only the "concept of happiness", which I would agree is present in both scenarios, but it doesn't strike me why that in particular would be learned using this supervised mechanism.
The chain of events would be, e.g.:
(1) The Thought Generator (world-model) catalogs our own interoceptive feelings into emotion-concepts like "pleasure".
(2) The Thought Generator learns from experience that pleasure has something to do with smiling, e.g. during times where we feel pleasure and notice ourselves smile, or otherwise learn this obvious regularity in the world. This becomes a world-model (thought generator) semantic association “smile-concept” ↔ “pleasure-concept”.
(3) Often we’re paying attention to our own feelings, and then the “pleasure” emotion-concept is active if and only if our immediate interoceptive sensory inputs match “pleasure”. And these times, when we’re paying attention to our own feelings, are the only times where the pleasure Thought Assessor learning rate is nonzero. So the Thought Assessor learns that there’s a robust correlation between the “pleasure-concept” in the Thought Generator and the pleasure innate signal.
(4) Other times we’re NOT paying attention to our own immediate interoceptive sensory inputs, and then the emotion-concepts are “left hanging”, inactive regardless of what we’re feeling. But while they’re left hanging, they can INSTEAD be activated by semantic associations with other parts of our world-model. Then in such a moment, if I see someone smile, it activates smile-concept, which [via (2)] in turn weakly activates pleasure-concept, which in turn [via (3)] weakly activates the pleasure Thought Assessor. This is a candidate “transient empathetic simulation”. But remember, the learning rate of that Thought Assessor is zero whenever the emotion-concepts are “left hanging” like that. So the Thought Assessor won’t disconnect pleasure-concept.
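If it helps, here’s the gist of (3) and (4) as a toy gated-learning sketch in code. Everything here is illustrative: the names and numbers are mine, and a single scalar weight is standing in for whatever the real circuit does.

```python
# Toy sketch of a "Thought Assessor" as a gated linear predictor.
# Illustrative only: "pleasure_concept_activation" etc. are hypothetical
# stand-ins, not claims about actual neural encodings.

class ThoughtAssessor:
    def __init__(self, lr=0.1):
        self.weight = 0.0  # strength of the pleasure-concept -> pleasure-signal link
        self.lr = lr

    def predict(self, pleasure_concept_activation):
        # Runs all the time, including during "transient empathetic simulations".
        return self.weight * pleasure_concept_activation

    def update(self, pleasure_concept_activation, innate_pleasure_signal,
               attending_to_own_feelings):
        # Learning rate is zero unless we're attending to our own feelings,
        # so empathetic simulations never "disconnect" the learned link.
        if not attending_to_own_feelings:
            return
        error = innate_pleasure_signal - self.predict(pleasure_concept_activation)
        self.weight += self.lr * error * pleasure_concept_activation

assessor = ThoughtAssessor()
# Step (3): attending to own feelings, so learning is on; the link forms.
for _ in range(100):
    assessor.update(1.0, 1.0, attending_to_own_feelings=True)
# Step (4): empathetic simulation; weak activation, but learning is off,
# so the absent innate signal (0.0) doesn't erode the link.
assessor.update(0.3, 0.0, attending_to_own_feelings=False)
```

The point of the gate is visible at the end: without the `attending_to_own_feelings` check, the last call would push the weight back toward zero, i.e. the Thought Assessor would gradually disconnect pleasure-concept every time we empathized.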
Does that help? Sorry if I’m missing your point. …The above might be hard to follow without a diagram.
analyzing facial cues - in particular humans exhibit micro expressions
The theory that we have evolved direct responses to different facial reactions seems probably wrong to me (or at least, not the main explanation), for a couple reasons:
First, blind people seem to have normal social intuitions.
Second, I don’t think it’s plausible to simultaneously say that microexpressions immediately trigger important innate reactions, and that people are generally bad at consciously noticing microexpressions. When I think of other environmental things that immediately trigger innate reactions, I think of, like, balls flying at my face, big spiders, sudden noises, getting poked, foul smells, etc. We’re VERY good and fast at forming good conscious models of all those environmental things. So it doesn’t seem plausible to me that we could get metaphorically “poked” by microexpressions many times a day for years straight without ever developing a conscious awareness of those microexpressions.
So why do we have them if other people can't pick up on them
For my answer, see Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions. We have “innate behaviors” that impact the face, such as gagging, laughing, and Duchenne-smiling. We also have voluntary control of facial muscles, which we learn to deploy strategically for social signaling. When we use voluntary control to hide the signs of “innate behaviors”, the bit of “innate behavior” that slips through the cracks is a microexpression.
You might ask: why don’t our “innate behaviors” evolve to not impact the face, so that we can hide them better? Hard to say for sure. Probably part of it is that we are only sometimes trying to hide them. Some “innate behavior” facial manifestations might also have more direct adaptive utility (cf. §4.2 of that link). Part of it is probably that the hiding is good enough, because microexpressions are actually hard to notice.
Think of it as vaguely like I-am-juggling versus you-are-juggling.
Here, I can see how they would overlap to a reasonable degree; I don't think this easily carries over to emotions. Emotions at least feel like this weird, distinct thing such that any statement along the lines "I'm happy" does it injustice. Therefore I can't see it being carried over to "She's happy", their intersection wouldn't be robust enough such that it won't falsely trigger for actually unrelated things. That is, "She's happy" ≈ "I'm happy" ≉ experiencing happiness.
Facial cues (as one example; it makes sense that there would be other things, like higher-pitched voices when enjoying oneself, etc.) eliminate this problem because, as opposed to something introspective being the link, a more objective state of the mind, like "He's sad", will be the learned link.
this might sound like I'm being unnecessarily picky about this, but imo these associations need to be very exact, else humans would be reward-hacking all day: it's reasonable to assume that the activations of thinking "She's happy" are very similar to trying to convince oneself "She's happy" internally, even 'knowing' the truth. But if both resulted in big feelings of internal happiness, we would have a lot more psychopaths.
regarding micro expressions specifically, it's definitely not a hill I want to die on; it kind of just popped into my mind as I was writing about facial cues, and by micro I really mean 'micro micro', e.g. smiles that aren't perfectly symmetrical for a quarter of a second, something I at least can't really pick up on. What is their evolutionary advantage if they don't at least offer some kind of subconscious effect on conspecifics? But yea, if you can't consciously pick up on it, linking the two is pointless or even bad.
I skimmed the linked post, but since I haven't really read either of those yet, I probably can't relate too well to it. It seems reasonable (or honestly, obvious), though, that it's a mix rather than either of those extreme statements.
Thanks again for engaging :)
these associations need to be very exact, else humans would be reward-hacking all day: it's reasonable to assume that the activations of thinking "She's happy" are very similar to trying to convince oneself "She's happy" internally, even 'knowing' the truth. But if both resulted in big feelings of internal happiness, we would have a lot more psychopaths.
I don’t think things work that way. There are a lot of constraints on your thoughts. Copying from here:
1. Thought Generator generates a thought: The Thought Generator settles on a “thought”, out of the high-dimensional space of every thought you can possibly think at that moment. Note that this space of possibilities, while vast, is constrained by current sensory input, past sensory input, and everything else in your learned world-model. For example, if you’re sitting at a desk in Boston, it’s generally not possible for you to think that you’re scuba-diving off the coast of Madagascar. Likewise, it’s generally not possible for you to imagine a static spinning spherical octagon. But you can make a plan, or whistle a tune, or recall a memory, or reflect on the meaning of life, etc.
If I want to think that Sally is happy, but I know she’s not happy, I basically can’t, at least not directly. Indirectly, yeah sure, motivated reasoning obviously exists (I talk about how it works here), and people certainly do try to convince themselves that their friends are happy when they’re not, and sometimes (but not always) they are even successful.
I don’t think there’s (the right kind of) overlap between the thought “I wish to believe that Sally is happy” and the thought “Sally is happy”, but I can’t explain why I believe that, because it gets into gory details of brain algorithms that I don’t want to talk about publicly, sorry.
Emotions…feel like this weird, distinct thing such that any statement along the lines "I'm happy" does it injustice. Therefore I can't see it being carried over to "She's happy", their intersection wouldn't be robust enough such that it won't falsely trigger for actually unrelated things. That is, "She's happy" ≈ "I'm happy" ≉ experiencing happiness
I agree that emotional feelings are hard to articulate. But I don’t see how that’s relevant. Visual things are also hard to articulate, but we can learn a robust two-way association between [certain patterns in shapes and textures and motions] and [a certain specific kind of battery compartment that I’ve never tried to describe in English words]. By the same token, we can learn a robust two-way association between [certain interoceptive feelings] and [certain outward signs and contexts associated with those feelings]. And this association can get learned in one direction [interoceptive model → outward sign] from first-person experience, and later queried in the opposite direction [outward sign → interoceptive model] in a third-person context.
(Or sorry if I’m misunderstanding your point.)
what is their evolutionary advantage if they don't at least offer some kind of subconscious effect on conspecifics?
Again, my answer is “none”. We do lots of things that don’t have any evolutionary advantage. What’s the evolutionary advantage of getting cancer? What’s the evolutionary advantage of slipping and falling? Nothing. They’re incidental side-effects of things that evolved for other reasons.
About the example in section 6.1.3: Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech? It seems to me like that requires both (a) identifying that the speech is trying to get someone's attention, and (b) identifying that the speech is directed at you. (Well, I guess (b) implies (a) if you weren't visibly paying attention to her beforehand.)
About (a): If the Steering Subsystem doesn't know the meaning of words, then how can it tell that Zoe is trying to get someone's attention? Is there some way to tell from the sound of the voice? Or is it enough to know that there were no voices before and Zoe has just started talking now, so she's probably trying to get someone's attention to talk to them? (But that doesn't cover all cases when Zoe would try to get someone's attention.)
About (b): If you were facing Zoe, then you could tell if she was talking to you. If she said your name, then maybe the Steering Subsystem might recognize your name (having used interpretability to get it from the Learning Subsystem?) and know she was talking to you? Are there any other ways the Steering Subsystem could tell if she was talking to you?
I'm not sure how many false positives vs. false negatives evolution will "accept" here, so I'm not sure how precise a check to expect.
Good questions!
Do you have an idea of how the Steering Subsystem can tell that Zoe is trying to get your attention with her speech?
I think you’re thinking about that kinda the wrong way around.
You’re treating “the things that Zoe does when she wants to get my attention” as a cause, and “my brain reacts to that” as the effect.
But I would say that a better perspective is: everybody’s brain reacts to various cues (sound level, pitch, typical learned associations, etc.), and Zoe has learned through life experience how to get a person’s attention by tapping into those cues.
So for example: If Zoe says “hey” to me, and I don’t notice, then Zoe might repeat “hey” a bit louder, higher-pitched, and/or closer to my head, and maybe also wave her hand, and maybe also poke me.
The wrong question is: “how does my brain know that louder and higher-pitched and closer sounds, concurrent with waving-hand motions and pokes, ought to trigger an orienting reaction?”.
The right perspective is: we have these various evolved triggers for orienting reactions, whose details we can think of as arbitrary (it’s just whatever was effective for noticing predators and prey and so on), and Zoe has learned from life experience various ways to activate those triggers in other people.
If she said your name, then maybe the Steering Subsystem might recognize your name (having used interpretability to get it from the Learning Subsystem?) and know she was talking to you?
Yup, STEP 1 is that one of my “thought assessors” (probably somewhere in the amygdala) has learned from life experience that hearing my own name should trigger orienting to that sound; and then STEP 2 is that Zoe, in turn, has learned from life experience that saying someone’s name is a good way to get their attention.
(If you’re in a hurry, you can just read the “Background and summary” section, and skip the other 85%.)
There’s a neuroscience problem which has had me stumped since the very beginning of when I became interested in neuroscience at all (as a lens into Artificial General Intelligence (AGI) safety) back in 2019. In this post I offer a hypothesis for what the solution might generally look like, at least in the big picture.[1]
What is this grand problem? As described in Intro to Brain-Like-AGI Safety, I believe the following:
I’ll start by going through the four algorithmic ingredients we need for my hypothesis, one by one, in each case describing what it is algorithmically, why it’s useful evolutionarily, and where in the brain we might go looking to find the specific neurons that are running this (alleged) algorithm.
Here’s the roadmap:
Then, I’ll go through an important (putative) example of social instincts built from these ingredients, which I call the “compassion / spite circuit”. This circuit leads to an innate drive to feel compassion towards people we like, and to feel spite and schadenfreude towards people we hate.
In an elegant twist, I claim that this very same “compassion / spite circuit” also leads to an innate “drive to feel liked / admired”—a drive that I hypothesized earlier and believe to be central to both status-seeking and norm-following. The trick in explaining how they’re related is:
Then I’ll go more briefly through some other possible social instincts, including a sketch of a possible “drive to feel feared” (whose existence I previously hypothesized here). For context, dual strategies theory talks about “prestige” and “dominance” as two forms of status; while the “drive to feel liked / admired” leads to prestige-seeking, the “drive to feel feared” correspondingly leads to dominance-seeking.
My confidence gradually decreases as you proceed through the article. The “Background” section above is rock-solid in my mind, as are Ingredients 1, 1A, and 2. Ingredients 3 and especially 4 are somewhat new to this post, but derive from ideas I’ve been playing around with for a year or two, and I feel pretty good about them. The specific putative examples of social instincts in §5–§7 are much more new and speculative, and are oversimplified at best. But I’m optimistic that they’re on the right track, and that they’re at least a “foot in the door” towards future refinements.
UPDATE NOV. 2025: After you finish this post, see also my later follow-up posts Social drives 1: “Sympathy Reward”, from compassion to dehumanization & Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking, which further flesh out how my neuroscientific hypothesis (below) connects to everyday experiences and intuitions.
The Steering Subsystem (brainstem and hypothalamus, more-or-less) takes sensory data, does innately-specified calculations on them, and uses the results to trigger innate reactions.
Think of things like seeing a slithering snake, or a skittering spider; smelling or tasting rotten food; male dogs smelling a female dog in heat; camouflaged animals recognizing the microenvironment where their bodies will blend in; and so on.
Note that these are all imperfect heuristics, anchored to innate circuitry, rather than developing along with our understanding of the world. We can call it a venomous-spider-detector circuit, for example, noting that it evolved because venomous spiders were dangerous to early humans.[4] But if we do that, then we acknowledge that it will have both false positives (e.g. centipedes, harmless spiders) and false negatives (funny-looking stationary venomous spiders), when compared to actual venomous spiders as we intelligently understand them. In vision especially, think of these heuristics as detecting relatively simple patterns of blobs and motion textures, as opposed to an “image classifier” / “video classifier” up to the standards of modern ML or human capabilities.
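To make “imperfect heuristic” concrete, here’s a deliberately silly toy sketch. The features and thresholds are invented purely for illustration; the point is just that a fixed innate rule, unlike a learned concept, has baked-in false positives and false negatives.

```python
# Toy "venomous-spider-detector": a fixed, innate heuristic over crude
# features, rather than a classifier that tracks our intelligent
# understanding of actual spiders. Features and thresholds are made up.

def innate_spider_heuristic(num_legs, is_skittering):
    # Crude rule: many legs plus skittering motion triggers the circuit.
    return num_legs >= 6 and is_skittering

# True positive: a skittering spider trips the circuit.
assert innate_spider_heuristic(num_legs=8, is_skittering=True)
# False positive: a harmless centipede trips it too.
assert innate_spider_heuristic(num_legs=30, is_skittering=True)
# False negative: a funny-looking stationary venomous spider does not.
assert not innate_spider_heuristic(num_legs=8, is_skittering=False)
```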
For more discussion of Ingredient 1, see §3.2.1 here.
As a special case of Ingredient 1, I claim that, in pretty much all animals, there are a set of sensory heuristics that are specifically designed by evolution to trigger on conspecifics. That would include one or more variations on: seeing a conspecific, hearing a conspecific, touching (or being touched by) a conspecific, smelling a conspecific, etc.
(I’m confident in this part because pretty much all animals have innate behaviors towards conspecifics that are different from their behaviors in other situations—mating, intermale aggression, parenting, being parented, herding, huddling, and so on.)
I claim that these all trigger a special Steering Subsystem flag that I call “thinking of a conspecific”:
Neuroscience details box
The sensory heuristics involve brainstem areas like the superior colliculus (for innate heuristic calculations on visual data), inferior colliculus (auditory data), gustatory nucleus of the medulla (taste data), and so on. (Again see §3.2.1 here.)
In the case of visual sensory heuristics, I’m actually not 100% confident that these calculations are located in the superior colliculus proper; for all I know, they’re partly or entirely in the neighboring parabigeminal nucleus, or whatever. There are papers on this topic, but they can’t always be taken at face value—see for example me complaining about methodologies used in the literature here and here.
For the “thinking of a conspecific” flag, it would be somewhere within the Steering Subsystem, but I don’t have any particular insight into exactly where. If I had to guess, I might guess that it’s one of the many little cell groups of the medial preoptic hypothalamus, since those often involve social interactions. If not that, then I’d guess it’s somewhere else in the medial hypothalamus, or (less likely) the lateral hypothalamus, or (less likely) some other part of the Steering Subsystem.
If you want to find the “thinking of a conspecific” flag experimentally, the conceptually simplest method would be to first find one of the sensory heuristics for conspecific detection (e.g. the face detector), see what its efferent connections (downstream targets) are, and treat all of those as top candidates to be studied one by one.
Ingredient 1 is a first step towards understanding, say, fear-of-spiders. But it’s not the whole story, because I don’t just get nervous when there is actually a large skittering spider in my field-of-view right now, but also when I imagine one, or when somebody tells me that there’s a spider behind me, etc. How does that work? The answer is: what I call the “short-term predictor”.
The “short-term predictor” is a learning algorithm that involves three ingredients—context, output, and supervisor. For definitions see this post; or in the ML supervised learning literature, you can substitute “context” = “trained model input”, “output” = “trained model output”, and “supervisor” = “label” (i.e., ground truth), which is subtracted from the trained model output to get an error that updates the model.[5]
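For concreteness, here’s a minimal supervised-learning sketch of that setup in the ML notation above. The choice of a linear model and all the variable names are mine, purely for illustration:

```python
# Minimal "short-term predictor" sketch: context in, output out, and a
# supervisor (label, i.e. ground truth) subtracted from the output to
# get an error that updates the model. Names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_context = 5
weights = np.zeros(n_context)  # the trained model
lr = 0.05

for _ in range(2000):
    context = rng.normal(size=n_context)  # e.g. current cortical state
    supervisor = context[0]               # ground truth, e.g. the brainstem's spider signal
    output = weights @ context            # the predictor's guess
    error = output - supervisor           # subtract label from output
    weights -= lr * error * context       # gradient step on squared error

# After training, the predictor fires off context alone, anticipating
# the supervisor before (or without) the ground-truth signal arriving.
```

After training, the weights have locked onto the context feature that carries the supervisor, which is the whole point: the learned output can now stand in for the ground-truth signal when that signal isn’t available.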
The important points are that:
Thus, this kind of story explains the fact that I viscerally react to learning that there’s a spider in my vicinity that I can’t immediately see or feel.
If we take the brainstem reaction and the short-term predictor together, they can function as what I call a long-term predictor; again see here.
By the same token, the “thinking of a conspecific” flag can trigger when I’m, well, thinking of a conspecific, even if the conspecific is not standing right there, triggering my brainstem sensory heuristics right now.
Neuroscience details box
I think the short-term predictors that I’ll be talking about in this post are mostly centered around small clusters of medium spiny neurons somewhere in the amygdala, or the lateral septum, or the medial part of the nucleus accumbens shell. (I haven’t tried to pin them down in more detail than that. See §5.5.4 here for some more general neuroscience discussion of this topic.)
However, in some cases pyramidal neurons can play this short-term predictor role as well, such as in the cortex-like (basolateral) section of the amygdala, along with certain parts of cortex layer 5PT.
The supervisory signal (either ground truth or an error signal, I’m not sure) probably makes an intermediate stop (“relay”) at some little cluster of neurons on the fringes of the Ventral Tegmental Area (VTA), not shown in the diagram above, in which case the supervisory signal would ultimately arrive at the spiny neuron in the form of a dopamine signal. I think. (But there are also VTA GABA neurons that seem somehow related to these particular short-term predictors. I haven’t tried to make sense of that in detail.)
In this section I’ll just go through a simple example of the orienting reflex upon seeing a spider, then in Ingredient 4 below we’ll see how this applies to social instincts and feelings.
When the seeing-a-spider brainstem sensory heuristic triggers, I claim that one thing it does is trigger an “orienting reflex”. Part of that reflex involves moving the eyes, head, and body towards whatever triggered the heuristic. And another part of it involves involuntary attention towards the visual inputs in general, and the corresponding part of the field of view in particular.
The involuntary attention plays an important role in constraining what “thought” the cortex is thinking. If you’re daydreaming, imagining, remembering, etc., then your current “thought” has very little to do with current visual inputs. By contrast, involuntary attention towards vision forms a constraint that the thought must be “about” the visual inputs. It’s not completely constraining—the same thought can also contextualize those visual inputs by roping in presumed upstream causes, or expected consequences, or other associations, etc. But the visual inputs have to be a central part of the thought. In other words, you’re not only pointing your eyes at the spider, but you’re also actually thinking about the spider with your cortex (“global workspace”).
To be more specific about what’s going on, we need to be thinking about large-scale patterns of information flow within the cortex, as in the following toy example:
When you’re using visual imagination, your consciously-accessible visual areas of the cortex (e.g. the inferior temporal gyrus (IT)) are, in essence, disconnected from the immediate visual input. You can imagine Taylor Swift’s new dress while looking at a swamp. By contrast, when you’re paying attention to what you’re looking at, then there’s a consistency requirement: the visual models (i.e., generative models of visual data) in IT have to be consistent with the immediate visual input from your retina.
And my claim is that the Steering Subsystem has some control over this kind of large-scale information flow among different parts of the cortex, via its “involuntary attention”.
You might be wondering: Is it really true that, if I’m imagining Taylor Swift’s new dress, then my awareness is detached from immediate visual input? Don’t we continue to be aware of visual input even while imagining something else?
A few responses:
First, your cortex has lots of vision-related areas, and it’s possible for some visual areas to be yoked to immediate visual input while other visual areas are simultaneously yoked to episodic memory. I think this definitely happens to some extent.
Second, your attention can jump around between different things rather quickly, such that most people imagine themselves to have far more complete and continuous visual awareness than they actually do—see things like change blindness, or the selective attention test, or the fact that peripheral vision has terrible resolution and terrible color perception and makes faces look creepy.
Third, the cortex tracks time-extended models, and accordingly has a general ability to pull up activation history from slightly (e.g. half a second) earlier, anywhere in the cortex. That makes it very hard to introspect upon exactly what you were or weren’t thinking at any given moment. For a much more detailed discussion of this point, with an example, see here.
This is a general lesson, going beyond just vision: transient (fraction-of-a-second) attentional gaps and shifts are hard to notice, both as they happen and in hindsight. Don’t unthinkingly trust your intuitions on that topic. (I’ll be centrally relying on these transient attentional shifts in this post, so it’s important that you are thinking about them clearly.)
The Steering Subsystem gets an additional lever of control over brain learning algorithms by combining that kind of large-scale information flow control with time-variable learning rates, as follows.
Let’s start with learning in the world model / Thought Generator ≈ cortex. Above I was talking about the “space of visual models” which are learned from scratch in IT. Like everything in the world-model (details), this space is learned by predictive (a.k.a. self-supervised) learning. But it’s learned more specifically when we’re paying attention to visual input. The models thus get sculpted to reflect the structure of the actual visual world.
Separately, we can query those existing models for the purpose of memory recall and visual imagination. But when we do, I claim that the learning rate is zero (or at most, almost-zero).
Moving on to the parallel case of learning in the Thought Assessors / short-term predictors ≈ striatum and amygdala. The genome can likewise leverage large-scale information flows to get some control over what the short-term predictors learn.
As a toy example, let’s take the diagram above, but add in a short-term predictor. And just as for the cortex case above, we’ll set the short-term predictor learning rate to zero unless we’re paying attention to visual input. Here’s a diagram:
Thanks to this learning rate modulation, this short-term predictor is trained specifically to maximize its predictive accuracy in situations where we’re paying attention to visual input. When we’re visually imagining or remembering something, by contrast, the short-term predictor will continue to be queried, but it won’t be updated.
What’s the advantage of this setup? Well, imagine my cortex is daydreaming about Taylor Swift, and then my brainstem notices a spider in the corner of my field-of-view. Without the involuntary attention, the learning algorithm update would associate daydreaming-about-Taylor-Swift with the seeing-a-spider reaction (physiological arousal, aversiveness, etc.), which is not a useful thing for me to learn. The involuntary attention can solve that problem: first the involuntary attention kicks the Taylor Swift daydream out of my brain, and ensures that I’m thinking about the spider instead; and second the short-term predictor learning algorithm records those new thinking-about-the-spider thoughts, and fires into its output line whenever similar thoughts recur in the future. Thus I’ll wind up feeling physiological arousal related to the shape and motion of a spider, spiderwebs, centipedes, that corner in the basement, etc., which makes a lot more sense (ecologically) than feeling physiological arousal related to Taylor Swift.
(Well, that’s a bad example. It is entirely ecologically appropriate to feel physiological arousal related to Taylor Swift! But that’s for other reasons!)
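To make the attention-gated learning concrete, here is a minimal toy sketch (my own illustrative code, not anything from the neuroscience literature): a linear short-term predictor that can always be queried, but only updates its weights while an attention flag is on. All names and numbers are made up.

```python
import numpy as np

class ShortTermPredictor:
    """Toy linear predictor trained by supervised learning on a ground-truth
    signal (e.g. physiological arousal), with an attention-gated learning rate."""

    def __init__(self, n_context, lr=0.1):
        self.w = np.zeros(n_context)
        self.lr = lr

    def query(self, context):
        # Querying always works (e.g. during imagination or recall)...
        return float(self.w @ context)

    def update(self, context, ground_truth, attending):
        # ...but learning only happens while the attention flag is on.
        if not attending:
            return  # learning rate is effectively zero
        error = ground_truth - self.query(context)
        self.w += self.lr * error * context

predictor = ShortTermPredictor(n_context=2)
spider = np.array([1.0, 0.0])     # context: "thinking about the spider"
daydream = np.array([0.0, 1.0])   # context: "daydreaming about Taylor Swift"

# Orienting reflex: attention is on the spider, so updates associate
# spider-thoughts (not the daydream) with the arousal ground truth:
for _ in range(50):
    predictor.update(spider, ground_truth=1.0, attending=True)
predictor.update(daydream, ground_truth=1.0, attending=False)  # no effect

print(round(predictor.query(spider), 2))   # ≈ 0.99: spiders now predict arousal
print(predictor.query(daydream))           # 0.0: no spurious Taylor Swift link
```

The point of the gate is visible in the last two lines: the daydream update was attempted while not attending, so no spurious daydream-to-arousal association forms.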
Neuroscience details box
For involuntary attention: There are probably multiple pathways working in conjunction. Probably cholinergic and/or adrenergic neurons are involved. More specifically, cholinergic projections to the cortex are probably part of this story, and so are the cholinergic projections to thalamic relay cells. I don’t know the details.
For adjusting learning rate: There are a bunch of ways this could work. If there’s an error signal coming from the Steering Subsystem (hypothalamus or brainstem) to a short-term predictor, it could be set to zero, and then there’s no learning. Or maybe there’s a separate signal for learning rate (maybe acetylcholine again?) coming from the Steering Subsystem, which could be turned off instead. There could also be some more indirect effect of lack-of-attention on the cortex side—like maybe the cortex representations are less active when they’re further removed from sensory input, and that indirectly reduces learning rate, or something. I don’t know.
If we apply the same kind of reasoning as above, it suggests a path to solving the symbol-grounding problem for somebody else’s feelings. A key ingredient we need is “involuntary LACK of attention towards interoceptive inputs”, triggered by the “thinking of a conspecific” flag of Ingredient 1A—the right side of this diagram:
What is this “lack of attention” supposed to accomplish? Here’s a schematic diagram illustrating the flows of information / attention / constraints in a normal situation (left) and in a situation where one of the Ingredient 1A conspecific detection heuristics has just fired (right):
The involuntary lack of attention transiently disconnects the interoceptive models from what I’m feeling right now. Instead, the space of interoceptive models in the cortex will settle into whatever is most consistent with what’s happening in the visual, semantic, and other areas of the cortex (a.k.a. “global workspace”). And thanks to the orienting reflex, those other areas of the cortex are modeling Zoe.
And therefore, if any interoceptive models are active, they’re ones that have some semantic association with Zoe. Or more simply: they’re how Zoe feels (or more precisely, how Zoe seems to feel, from my perspective).
This is progress! But there’s still some more work to do.
Next, let’s put in a couple short-term predictors (Ingredient 2), and think about learning rates (Ingredient 3):
Here, I show two different short-term predictors for the same ground truth (namely, physiological arousal). However, the contexts and learning rates are different, and hence their behaviors are correspondingly different as well.
The short-term predictor on the left uses (let’s say) visual models as context, and its learning rate is nonzero if and only if I’m paying attention to immediate visual inputs. As it turns out, Zoe is my tyrannical boss, who loves to exercise arbitrary power over me, and thus our conversations are often stressful. This left predictor will pick up on that pattern, and preemptively suggest physiological arousal whenever I notice that Zoe might be coming to talk to me.
Meanwhile, the short-term predictor on the right uses interoceptive models as context, and its learning rate is nonzero if and only if I’m paying attention to my own interoceptive inputs.[6] This short-term predictor will wind up learning things that seem pretty stupidly trivial—e.g. “the conscious feeling of arousal (in the Thought Generator a.k.a. world-model) predicts actual arousal (in the Steering Subsystem)”; but it still needs to be there for technical reasons.[7] Anyway, this output will not respond to the fact that conversations with Zoe tend to be stressful for me. But if Zoe herself seems stressed, the output will reflect that.
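Here is a minimal sketch of that two-predictor setup (again my own illustrative code, with made-up names and numbers): both predictors share one ground truth, but they differ in which context they read and which attention flag gates their learning.

```python
import numpy as np

def gated_update(w, ctx, ground_truth, attending, lr=0.1):
    """One supervised-learning step; the attention flag gates the learning rate."""
    if attending:
        w += lr * (ground_truth - float(w @ ctx)) * ctx

w_left = np.zeros(2)    # context: visual models; learns iff attending to vision
w_right = np.zeros(2)   # context: interoceptive models; learns iff attending inward

zoe_visual = np.array([1.0, 0.0])      # visual model: "Zoe is approaching"
stress_intero = np.array([1.0, 0.0])   # interoceptive model: "feeling stressed"
calm_intero = np.array([0.0, 1.0])     # interoceptive model: "feeling calm"

# Many stressful conversations with Zoe, attending to vision: the left
# predictor learns that Zoe predicts (my) arousal.
for _ in range(50):
    gated_update(w_left, zoe_visual, ground_truth=1.0, attending=True)
# Attending to my own feelings while actually stressed: the right predictor
# learns the "stupidly trivial" mapping from felt stress to actual arousal.
for _ in range(50):
    gated_update(w_right, stress_intero, ground_truth=1.0, attending=True)

# Later, Zoe approaches looking calm. The left output is high (anticipatory
# stress about Zoe); the right predictor is only queried, and the active
# interoceptive model is "calm", so it stays silent:
print(round(float(w_left @ zoe_visual), 2))   # ≈ 0.99
print(float(w_right @ calm_intero))           # 0.0
```

If Zoe herself seemed stressed, so that `stress_intero` were the active interoceptive model, the right output would be high instead, which is exactly the empathetic-simulation behavior described above.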
Thus, when things are set up properly, the Steering Subsystem can simultaneously get signals about both how a situation feels to us and how other people seem to be feeling.
(I showed the example of physiological arousal, but the same logic applies to “being happy”, “being angry”, “being in pain”, etc.)
Well, kinda. But with some caveats.
The sense in which this is true is: both the interoceptive model space and the associated short-term predictors are trained in a circumstance where they relate exclusively to my own interoceptive inputs, but then they’re sometimes queried in a circumstance where they relate to someone else’s interoceptive inputs.
But in other senses, calling it an “empathetic simulation” flag might be a bit misleading.
First, it would be a transient empathetic simulation, lasting a fraction of a second, which is rather different from how we normally use the term “empathy”—more on that here.
Arguably, even “transient empathetic simulation” is an overstatement—it’s just some learned association between what I’m seeing and some feeling-related concept. The concept of Zoe seems to somehow imply the concept of stress, within my world-model. That's all. I don't really need to be “taking her perspective”, nor to be feeling Zoe’s simulated stress in Zoe’s simulated loins, or whatever.
Second, this flag is exclusively related to empathetic simulations of what someone is feeling[8]—not empathetic simulations of what they're thinking, seeing, etc. For example, if I'm curious whether Zoe can see the moon from where she's standing, then I would do a quick empathetic simulation of what Zoe is seeing. The “thinking of a conspecific” flag is not particularly related to that; indeed, if anything, this flag is probably anticorrelated with that, since the flag is trained only in situations where orienting reflexes are pulling attention to our own exteroceptive sensory inputs.
Thus, my framework implies that social instincts can only involve reacting to someone's (assumed) feelings. It cannot (directly) involve reacting to what someone is seeing, or thinking, etc. I think that claim rings true to everyday experience.
And there's actually a deeper reason to believe that claim. If I take Zoe’s visual perspective and imagine that she’s looking at a saxophone, then my Steering Subsystem can’t do anything with that information. The Steering Subsystem doesn’t understand saxophones, or anything else about our big complicated world. But it does know the “meaning” of its suite of innate physiological state variables and signals—physiological arousal, body temperature, goosebumps, and so on. See my discussion of “the interface problem” here.
Third, even among the set of short-term predictors related to “feelings”, only some of them are set up such that they will output a transient empathetic simulation. See the toy example above with two different short-term predictors for physiological arousal, one of which conveys empathetic simulations and the other of which does not.
Neuroscience details box
Involuntary lack-of-attention signal: Well, absence-of-attention might just involve suppressing presence-of-attention pathways, like the ones I mentioned under Ingredient 3 above (possibly involving acetylcholine). Or it might be a different system that pushes in the opposite direction—maybe involving serotonin? Or (more likely) multiple complementary signals that work in different ways. I don’t have any strong opinions here.
Two short-term predictors for the same thing: I drew a diagram above with two different short-term predictors of physiological arousal. While that diagram was oversimplified in various ways, I do think it’s true that there are (at least) two different short-term predictors of physiological arousal, one using exteroception-related signals as context, the other using interoception-related signals as context, with the latter capturing empathetic simulations (among its other roles). My guess is that the former is in the amygdala and the latter is somewhere in the medial prefrontal or cingulate cortex. (Clarification for the latter: I think most of the short-term predictors are medium spiny neurons in the “extended striatum”, and have been labeling my diagrams accordingly. But as I mentioned in §2.1 above, I do think there are places where pyramidal neurons play a short-term predictor role too, including in layer 5PT of certain parts of the cortex.)
Everything so far was preliminaries—now we can start speculating about real social instincts! My main example is a possible innate drive circuit that would be upstream of compassion and spite.
The first step is to get a “conspecific seems to be feeling pleasure / displeasure”[9] signal in the Steering Subsystem, as follows:
The purple box is yet another Steering Subsystem signal that I’m labeling “pleasure / displeasure”. This is closely related to valence—for details see here. Then the gray box would be an intermediate variable[10] in the Steering Subsystem which would, by design, track the extent to which I think of the conspecific as feeling pleased / displeased.
All we need to get that gray box, beyond what we’ve already covered, is a gate: If the thinking-of-a-conspecific flag is on, AND there’s a short-term predictor output consistent with (dis)pleasure, then that means I’m thinking about a conspecific who is currently feeling (dis)pleasure.
This step is built on the kind of “transient empathetic simulation” that I’ve discussed previously and in §4.1 above: the short-term predictor on the right is trained by supervised learning on instances of myself feeling (dis)pleasure, but now at this particular moment it’s being triggered by thinking about someone else feeling (dis)pleasure.
That was just the start. Next, how do we build a social instinct out of the gray “conspecific seems to be feeling pleasure / displeasure” box? We need another Steering Subsystem parameter!
I introduced another Steering Subsystem parameter called “friend (+) vs enemy (–)”. When this parameter is extremely negative, it indicates that whatever you’re thinking about (in this case, the conspecific) should be physically attacked, right now. If it’s only mildly negative, you probably won’t go that far, but you’ll still feel like they’re the enemy and you hate them. If it’s positive, you’ll feel “on the same team” as them.
Anyway, when the “friend (+) vs enemy (–)” parameter is positive, then “conspecific seems to be feeling pleasure / displeasure” causes positive / negative valence respectively. This innate drive would lead to compassion—we feel intrinsically motivated by the idea that the conspecific is feeling pleasure, and intrinsically demotivated by the idea that the conspecific is feeling displeasure.
…And if the “friend (+) vs enemy (–)” parameter is negative, we flip the sign: “conspecific seems to be feeling pleasure / displeasure” causes negative / positive valence respectively. This innate drive would lead to both spite and schadenfreude.
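In code-sketch form (my own illustrative toy, treating both quantities as signed scalars), the compassion-vs-spite flip is just a sign interaction:

```python
def social_valence(conspecific_pleasure: float, friend_vs_enemy: float) -> float:
    """Toy circuit: a 'friend' setting (positive) makes the conspecific's
    (dis)pleasure push my valence in the same direction (compassion); an
    'enemy' setting (negative) flips the sign (spite / schadenfreude)."""
    return friend_vs_enemy * conspecific_pleasure

print(social_valence(+1.0, friend_vs_enemy=+0.5))  # 0.5: friend pleased, I feel good
print(social_valence(-1.0, friend_vs_enemy=+0.5))  # -0.5: friend suffering, I feel bad
print(social_valence(-1.0, friend_vs_enemy=-0.5))  # 0.5: enemy suffering, I feel good
```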
How is the “friend (+) vs enemy (–)” parameter itself calculated? By other social instincts outside the scope of this post—more on that in §7 below. Perhaps part of it is a different circuit that says: if thinking about a conspecific co-occurs with positive valence (i.e., if we like / admire them), then that probably shifts the friend/enemy parameter a bit more towards friend, and perhaps also conversely with negative valence. That’s not circular, because conspecifics can acquire positive or negative valence for all kinds of reasons, just like sweaters or computers or anything else can acquire positive or negative valence for all kinds of reasons, including non-social dynamics like if I’m hungry and the conspecific gives me yummy food. That’s a robust and flexible system that will leverage my rich understanding of the world to systematically assign “friend” status to conspecifics who lead to good things happening for me. That’s probably just one factor among many; I imagine that there are lots of innate circuits that can impact friend / enemy status in various circumstances. Of course, as usual, the friend / enemy parameter would be attached to one or more short-term predictors, enabling memory, generalization, and perhaps also transient empathetic simulations.
Evolutionary and zoological context box
Pretty much every complex social animal has innate, stereotyped behaviors for both helping and hurting conspecifics in different circumstances—e.g. attack behaviors, and companionship-type behaviors such as within families.
And evolutionarily, if it makes sense to help or hurt conspecifics through innate, stereotyped behaviors, then presumably it also makes sense to help or hurt conspecifics through the more powerful and flexible pathways that leverage within-lifetime learning, as would happen through a “compassion / spite circuit”. (See (Appetitive, Consummatory) ≈ (RL, reflex).)
Indeed, even in rodents, I think there’s clear evidence of more flexible, goal-oriented behaviors to (selectively) help conspecifics. For example, Márquez et al. 2015 finds that rats help conspecifics via choice of arm in a T-shaped maze. And Bartal et al. 2014 finds that rats release conspecifics from restraints, but only in situations where they feel friendly towards the conspecific. (See also: Kettler et al. 2021.) I don’t think either of these needs to be explained with my proposed “compassion / spite circuit” above involving transient empathetic simulation; for example, maybe rats squeak in a certain way when they’re happy, and hearing another rat make a happy squeak triggers a primary reward, or whatever. But anyway, as far as I can tell at a glance, the “compassion / spite circuit” is at least plausibly present even in rodents.
…Or maybe it’s just a “compassion” circuit for rodents. I can’t immediately find any evidence either way on whether rats display flexible, goal-oriented spite-type behavior towards other rats they hate. (They undoubtedly have inflexible, stereotyped, threat and attack postures and behaviors, but that’s different—again see (Appetitive, Consummatory) ≈ (RL, reflex).) Let me know if you’ve seen otherwise!
Neuroscience details box
I expect that friend-vs-enemy is two groups of neurons that are mutually inhibitory, as opposed to one that swings positive and negative compared to baseline. That’s how the hypothalamus handles hungry-vs-full, for example (see here). As for where those neuron groups are, I don’t know. Probably medial hypothalamus somewhere.
“Phasic” means that physiological arousal jumps up for a fraction of a second, in synchronization with noticing something, thinking a certain thought, etc. The opposite of “phasic” is “tonic”, like how I can have generally high arousal (alertness, excitement) in the morning and generally low arousal in the afternoon.
Now, one thing that my compassion / spite circuit above is missing is a notion that some interactions can feel more important / high-stakes to me than others. I think this is a separate axis of variation from the friend / enemy axis. For example, my neighbor and my boss are both solidly on the “friend” side of my friend / enemy spectrum—I feel “warmly” towards both, or something—but interactions with my boss feel much higher stakes, and correspondingly I react more strongly to their perceived feelings. So let’s refine the circuit above to fix that:
Basically, when I orient to a conspecific, then recognize them, the associated phasic arousal[11] tracks how important (high-stakes) this interaction with the conspecific is, from my perspective. Then we use that to scale up or down the compassion / spite response.
Neuroscience details box
I think the locus coeruleus, a tiny group of 30,000 neurons (in humans), is the high-level arousal controller in your brain, and its activity can vary over short timescales (up and down within half a second; there’s a plot in Clayton et al. 2004). If you measure pupil dilation, then maybe you’ll miss some of the very fastest dynamics, but you will see the variation on a ≈1-second timescale. If you measure skin conductance, that’s slower still.
I’m generally assuming in this post that “arousal” is a scalar. That’s probably something of an oversimplification (see Poe et al. 2020 & Luskin et al. 2025) but good enough for present purposes.
I’ve been talking as if the role of phasic arousal is specific to the “compassion / spite circuit”, but a more elegant possibility is that it’s a special case of a very general interaction between arousal and valence, such that arousal makes all good things seem better, and makes all bad things seem worse, other things equal. After all, arousal is saying that a situation is high-stakes. So that kind of general dynamic seems evolutionarily plausible to me.
(For the record, I think the general interaction between arousal and valence is not just multiplicative. I think there’s also a thing that we call “being overwhelmed”, where sufficiently high arousal can cause negative valence all by itself. Basically, in a very high-stakes situation, the Steering Subsystem wants to say that things are either very good or very bad, and in the absence of positive evidence that things are very good, it treats “very bad” as a default.)
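A toy sketch of that combined dynamic (entirely my own guesswork at the functional form; the threshold, penalty, and multiplicative shape are all made-up placeholders):

```python
def modulated_valence(base_valence: float, arousal: float,
                      overwhelm_threshold: float = 0.8,
                      overwhelm_penalty: float = 2.0) -> float:
    """Arousal scales valence multiplicatively (good seems better, bad seems
    worse), and past a threshold, arousal also adds negative valence all by
    itself ("being overwhelmed")."""
    v = base_valence * (1.0 + arousal)  # higher stakes amplify the valence
    if arousal > overwhelm_threshold:
        v -= overwhelm_penalty * (arousal - overwhelm_threshold)
    return v

print(modulated_valence(0.5, arousal=0.2))   # mildly aroused: slightly amplified
print(modulated_valence(0.5, arousal=0.5))   # more aroused: amplified further
print(modulated_valence(0.0, arousal=1.0))   # neutral but overwhelming: negative
```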
As usual, Steering Subsystem flags can serve as ground-truth supervision for short-term predictors, which supports generalization. Thanks to “defer-to-predictor mode” (see here), we wind up with Steering Subsystem social instincts activating in situations where nobody is in the room with me right now, but nevertheless I find myself intrinsically motivated by the idea of Zoe feeling good in general, and/or Zoe feeling good about me in particular.
Let’s talk about the social instinct that I call “drive to feel liked / admired”—i.e., an innate drive that makes it so that, if I think highly of person X, then it’s inherently motivating to believe that person X thinks highly of me too. To make this work, one might think that we need another ingredient. It’s not enough for the Steering Subsystem to have strong evidence that my conspecific is feeling pleasure or displeasure, as above. The Steering Subsystem has to get strong evidence that my conspecific is feeling pleasure or displeasure in regards to me in particular. Where could such evidence come from?
Remarkably, my answer is: we already got it! We don’t need any other ingredients. It’s just an emergent consequence of the same circuit above!! Let me explain why:
I think there’s a “I’m receiving eye contact” detector in the human brainstem, just like the other conspecific-detection sensory heuristics of Ingredient 1A.
But if you think about it, the “I’m receiving eye contact” detector has a special property, one that the other Ingredient 1A heuristics lack. Consider: if you’re hearing a conspecific, or noticing their gait, etc., then the conspecific might not even know you exist. By contrast, if a conspecific is giving you eye contact, then their brainstem is activating its “thinking of a conspecific” flag, in regards to you.
Here’s a diagram illustrating this:
As Zoe makes (perhaps brief) eye contact with me, both my and Zoe’s Steering Subsystems are shown. My big idea is marked in red—Zoe is reliably thinking about me at the very moment when I’m sensitive to how Zoe seems to be feeling. So if the circuit frequently triggers this way, then I’ll wind up motivated not so much towards Zoe feeling good in general, but towards Zoe liking / admiring me.
“Receiving eye contact” is a special case of “I’m the target of an orienting reflex”. And I think that other Ingredient 1A heuristics fit into that mold too. For example, my human-face-detection heuristic fires if someone turns to face me. That has directionally the same effect as eye contact, but it doesn’t require eye contact per se—it also fires if the person is wearing sunglasses. And it also supports the “drive to feel liked / admired”, for the same reason as above.
(Ecologically, we expect a long and robust history of “I’m the target of an orienting reflex” brainstem heuristic detectors. For example, if I’m a mouse, and a fox performs an orienting reflex towards me, then I’d better switch from hiding to running.)
Suppose Zoe walks up to me and says “hey”. That still gets my attention—and being a human voice, it triggers the corresponding Ingredient 1A heuristic, and thus the “thinking of a conspecific” flag. But it has the same special property as eye contact above: at the very moment when it gets my attention, Zoe is reliably thinking about me-in-particular.
So the same logic as above holds: the circuit is responding specifically to how Zoe feels about me, and not just to how Zoe feels in general.
If the same innate circuit in the Steering Subsystem is upstream of both compassion and “drive to feel liked / admired”, then one might think that these two things should be yoked together. In other words, if that circuit’s output is generally strong in a given person, then both drives should wind up being powerful influences on their behavior, and if it’s weak, then neither should.
But in fact, in my everyday experience, these seem to be somewhat independent axes of variation, with some people apparently driven much more by one than the other. How does that work?
The answer is simple. If, in the course of life, the circuit often activates when the conspecific is thinking about me-in-particular, and rarely activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize feeling liked / admired. And conversely, if the circuit rarely activates when the conspecific is thinking about me-in-particular, and often activates when they aren’t, then that would lead the circuit to mostly incentivize and generalize compassion.
As an example of the former, suppose Phoebe tends to react very weakly (low arousal, or perhaps not orienting at all) to seeing a person out of the corner of her eye, or to hearing someone’s voice in the distance as they talk to someone else, but Phoebe does reliably react to the more powerful stimuli of transient eye contact, or someone getting her attention to talk to her. Then Phoebe would wind up with a strong drive to feel liked / admired relative to her compassion drive.[12]
As an example of the latter, let’s turn to autism. As I’ve discussed in Intense World Theory of Autism, autism involves many different suites of symptoms which don’t always go together (sensory sensitivity, “learning algorithm hyperparameters”, proneness to seizures, etc.). But a common social manifestation would be kinda the reverse of the above. Given a trigger-happy arousal system, autistic people will respond robustly and frequently to things like noticing someone out of the corner of their eye, or hearing someone in the distance. But receiving eye contact, or someone deliberately trying to get their attention, will feel so overwhelming that they’ll tend to avoid those situations in the first place,[13] or use other coping methods to limit their physiological arousal. So that’s my attempted explanation (if I understand correctly) for why many autistic people have an especially weak “drive to feel liked / admired”, relative to their comparatively-more-typical levels of compassion and spite.
I think it’s common sense that, in the “drive to feel liked / admired”, we’re driven to be liked / admired by some people much more than others. For example, think of a real person whom you greatly admire, more than almost anyone else, and imagine that they look you in the eye and say, “wow, I’m very impressed by you!” That would probably feel extremely exciting and motivating! Such events can be life-changing—see Mentorship, Management, and Mysterious Old Wizards. Next, imagine some random unimpressive person looks you in the eye and says the same thing. OK cool, maybe you’d be happy to receive the compliment. Or maybe not even that. It sure wouldn’t go down as a life-affirming memory to be treasured forever. More examples in footnote→[14]
I had previously written that, if Zoe likes / admires me, then that feels intrinsically motivating to the extent that I like / admire Zoe in turn. Whoops, I’ve changed my mind! Instead, I now think that it feels intrinsically motivating to the extent that interactions with Zoe seem important and high-stakes from my perspective, regardless of whether I like / admire her.[15] (However, if I see her as “enemy” rather than “friend”, then that would have an impact). For example, if Zoe is my boss whom I mildly like / admire, I think I would still react strongly to her approval. That’s what we get from the circuit above—the physiological arousal will respond to how high-stakes it feels for me to be interacting with Zoe, along with the various other factors (e.g. receiving eye contact automatically causes extra arousal). I think my new theory is a better fit to everyday experience, but you can judge for yourself and let me know what you think.
There’s an additional question of what’s upstream of that—i.e., what leads to some people inducing physiological arousal (i.e. being “attention-grabbing”, “intimidating”, “larger-than-life”, etc.) more than others? I think it’s complicated—lots of things go into that. Some come straight from arousal-inducing innate reactions. For example, I think we have an innate reaction that induces arousal upon interacting with a tall person, just as many other animals have instincts to “size each other up”. The evolutionary logic is: any interaction with a tall person is high-stakes because they could potentially beat us up. In other cases, the physiological arousal routes through within-lifetime learning: is this person in a position to strongly impact my life?
Incidentally, if we compare my previous theory (that I’m driven to be liked / admired by Zoe in proportion to how much I like / admire Zoe in turn) to my current theory (that I’m driven to be liked / admired by Zoe in proportion to how much interactions with Zoe feel arousing, a.k.a. high-stakes), I think there’s some overlap in predictions, because there’s correlation between strongly liking / admiring Zoe, versus feeling like interactions with Zoe are high-stakes. I think the correlation comes from both directions. If I strongly like / admire Zoe, then as a consequence, my interactions with her can feel high-stakes. My liking / admiring her puts her in a position to impact my life. For example, if she spurns me, then I’ve lost access to something I enjoy; plus, I’ve implicitly given her the power to crush my self-esteem. In the other direction, if interactions with Zoe feel high-stakes, I think that can impact how much I like / admire Zoe, for various reasons, including the general valence-arousal interaction mentioned in §5.3.1.
I think the “compassion / spite circuit” above is an important piece of the puzzle of human social instincts. But there’s a whole lot more to social instincts beyond that! Really, I think there’s a bunch of interacting circuits and signals in the Steering Subsystem. How can we pin it down?
Experimentally, there’s a longstanding thread of work laboriously characterizing each of the hundreds of little neuron groups in the Steering Subsystem. More of that would obviously help. I mentioned at least one specific experiment above (§1.2). In parallel, perhaps we could try leapfrogging that process by measuring a complete connectome! My impression is that there are viable roadmaps to a full mouse connectome within years, not decades—much sooner than people seem to realize. Indeed, my guess is that getting a primate or even human connectome well before Artificial General Intelligence is totally a viable possibility, given appropriate philanthropic or other support. (See here.)
On the theory side, as we wait for that data, I think there’s still plenty of room for further careful armchair theorizing to come up with plausible hypotheses. A possible starting point for brainstorming is to look at the set of innate stereotyped (a.k.a. “consummatory”) behaviors towards conspecifics, to guess at some of the signals that might be internal to the Steering Subsystem. Doing that is a bit tricky for humans, since our behavioral repertoire comes disproportionately from learning and culture (excepting early childhood, I suppose). But for example, if a rodent sees another rodent, it might display:
Of these:
So that brings us to:
Dual strategies theory (see my own discussion at Social status part 2/2: everything else) says that people can have “high status” in two different ways: “prestige” and “dominance”. If the “drive to feel liked / admired” above is upstream of seeking prestige for its own sake, then the “drive to feel feared” would be correspondingly upstream of seeking dominance for its own sake.
The “drive to feel feared” could also be called “drive to receive submission”—i.e., a drive for others to display submissive behavior towards me, as in those rats rolling onto their backs. I’m not sure which of those two terms is better. I figure there’s probably some Steering Subsystem signal that’s upstream of both a tendency towards submissive behavior and a tendency towards fear and flight behavior, and it’s this upstream signal that flows into the circuit.
Evolutionarily, it makes perfect sense for there to be a “drive to feel feared”. If someone submits to me, then I’m dominant, and I get first dibs on food and mates without having to fight.
Neuroscientifically, I think the circuit for “drive to feel feared” could be parallel to the “compassion / spite circuit” above. More specifically, the first step is using Ingredient 4 to get to “Conspecific seems to be feeling fear / submission”:
And then we combine that with physiological arousal to get a motivational effect:
And as before, this would fire especially strongly under eye contact or other signals that the conspecific is thinking of you-in-particular:
(As drawn, the circuit might (mis)fire when I notice my friend submitting to a bully who is also simultaneously threatening me. I think that would be solvable by gating the circuit such that it doesn’t fire if I myself am also feeling fear / submission. Let me know if you think of other examples where this proposal doesn’t work.)
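To make the proposed circuit concrete, here is a toy numerical sketch. Everything in it—the variable names, the multiplicative combination, the threshold gate—is my own illustration of the logic described above, not anything claimed about actual neural implementation:

```python
# Toy sketch (hypothetical names and arithmetic) of the proposed "drive to
# feel feared" circuit: a transient empathetic simulation of a conspecific's
# fear/submission, combined with my own physiological arousal, yields a
# positive motivational signal -- but gated off if I am myself afraid.

def drive_to_feel_feared(
    conspecific_seems_fearful: float,  # Ingredient-4 predictor output, 0..1
    my_arousal: float,                 # own phasic physiological arousal, 0..1
    thinking_of_me: float,             # eye contact / "thinking of you-in-particular", 0..1
    my_own_fear: float,                # my own fear/submission signal, 0..1
    fear_gate_threshold: float = 0.5,
) -> float:
    if my_own_fear > fear_gate_threshold:
        # Gating from the text: don't fire if I'm also feeling fear or
        # submission, e.g. a bully threatens both me and my submitting friend.
        return 0.0
    # Fires especially strongly under eye contact or similar signals.
    boost = 1.0 + thinking_of_me
    return conspecific_seems_fearful * my_arousal * boost
```

For example, a fearful conspecific making eye contact while I’m aroused but unafraid yields a strong positive signal, whereas the same scene while I’m also afraid yields zero.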
I feel like I have the big picture of a plausible nuts-and-bolts explanation of how the human brain solves the symbol grounding problem to implement social instincts. It might be wrong, and I’m happy for feedback.
Ingredients 1–4 constitute a kind of domain-specific language in which I think all of our social instincts are written. §5–§7 then attempt to build, from the elements of that language, two specific social instincts out of a much larger collection yet to be sorted out. I figure that the things I wrote down, while a bit sketchy and incomplete, are probably capturing at least some aspects of compassion, spite, schadenfreude, “drive to feel liked / admired”, and “drive to feel feared”, and I think these collectively capture a lot of the human social world. (See also my post A theory of laughter for how laughter and play work.)
If you think this post is totally on the wrong track, then please let me know, by email or the comments section below. If it’s on the right track, then that’s great, but we still obviously have tons of work left to do to really pin down human social instincts, possibly in conjunction with experiments, as discussed in §7 above.
In case anyone’s wondering, I think my next project going forward will be to spend a while pondering the very biggest picture of brain-like AGI safety—everything from reward functions and training environments and testing, to governance and deployment and society, in light of (what I hope is) my newfound understanding of how human social instincts generally work. My confusion on that topic was a big blocker in previous attempts. After that, I guess I’ll figure out where to go from there! Should be interesting.
Thanks Seth Herd and Simon Skade for critical comments on earlier drafts.
2025-11-26: Since initial publication, I’ve added links to some later follow-up posts (search for “UPDATE” in the text), made some minor wording changes, replaced a secondary-source reference with the corresponding primary source, and added a reference. More details, and/or archived versions of this post, are available upon request.
Some bits of text in this introductory section are copied from an earlier (wrong) post, “Spatial attention as a “tell” for empathetic simulation?”.
For a different (simpler) example of what I think it looks like to make progress towards that kind of pseudocode, see my post A Theory of Laughter.
Thanks to regional specialization across the cortex (roughly corresponding to “neural network architecture” in ML lingo), there can be a priori reason to believe that, for example, “pattern 387294” is a pattern in short-term auditory data whereas “pattern 579823” is a pattern in large-scale visual data, or whatever. But that’s not good enough. The symbol grounding problem for social instincts needs much more specific information than that. If Jun just told me that Xiu thinks I’m cute, then that’s a very different situation from if Jun just told me that Fang thinks I’m cute, leading to very different visceral reactions and drives. Yet those two possibilities are built from generally the same kinds of data.
Actually, this is an area where the evolutionary “design spec” can be pretty inscrutable. The (so-called) spider detector circuit, like any image classifier, triggers on all kinds of inputs, not all of which are spiders, including Bizarre Visual Input Type 74853 that has no relation to spiders and would occur on average once every 100 lifetimes in our ancestral environment. And maybe it just so happened that Bizarre Visual Input Type 74853 correlates with danger, such that noticing and recoiling from it was adaptive. Then that very fact would be part of the evolutionary pressure sculpting the (so-called) spider detector circuit, such that the term “spider detector circuit” is not a 100% perfect description of its evolutionary purpose.
My diagrams are drawn with the “supervisor” signal traveling from the Steering Subsystem to the short-term predictor, and then the subtraction step (“supervisor – output = error”) happening in the short-term predictor. But that’s just for illustration. I’m also open-minded to the possibility that the subtraction is performed in the Steering Subsystem, and that it’s the error signal that travels up to the short-term predictor. That’s more of a low-level implementation detail that I’m not too concerned with for the purpose of this post.
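The footnote’s point can be made concrete with a toy delta-rule predictor (the linear form and all names are my own illustration, not the post’s): whether the subtraction happens inside the short-term predictor or inside the Steering Subsystem, the same error signal drives the same weight update, so the two wirings are functionally equivalent:

```python
# Toy linear short-term predictor. The subtraction step
# ("supervisor - output = error") could happen in either location;
# either way, the resulting learning dynamics are identical.

def update_predictor(weights, inputs, supervisor, lr=0.1):
    # Predictor's current output from its learned context inputs.
    output = sum(w * x for w, x in zip(weights, inputs))
    # The subtraction step, wherever it is physically performed.
    error = supervisor - output
    # Standard supervised-learning (delta rule) weight update.
    new_weights = [w + lr * error * x for w, x in zip(weights, inputs)]
    return new_weights, error
```

On repeated presentations, the error shrinks towards zero as the predictor’s output converges on the supervisor signal—regardless of where the subtraction is wired.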
See my recent post Against empathy-by-default for a related discussion about how things go wrong if you just keep the learning rate turned on 100% of the time.
Details: Basically, I’m saying that, because physiological arousal is one of the interoceptive sensory inputs (related discussion), the Thought Generator self-supervised learning algorithm is already learning to predict imminent physiological arousal. So why do we also need a separate short-term predictor, nominally learning the same thing? My answer is: the Thought Generator algorithm is designed to build unlabeled latent variables that are useful for prediction, not to actually produce meaningful outputs, thanks to locally-random pattern separation. So the short-term predictor is also needed, to turn those unlabeled latent variables into a meaningful (“grounded”) output signal.
For purposes of this discussion, things like sense-of-pain, sense-of-temperature, and “affective touch” (c-tactile receptors) count as interoception, not exteroception, despite the fact that you can in fact learn about the outside world via those signals. After all, the skin is an organ, and sensing the health and status of your organs is an interoception thing. See How Do You Feel by Bud Craig (2020) for detailed physiological evidence—nerve types, pathways in the spine and brain, etc.—that this is the right classification.
Here and elsewhere, I’m using English-language emotion words to refer to Steering Subsystem signals, because I don’t know how else to refer to them. But be warned that there is never a perfect correspondence between brainstem signals and emotion words (as we actually use them in everyday life). For more discussion of that point, see Lisa Feldman Barrett versus Paul Ekman on facial expressions & basic emotions.
As a general rule, there are multiple ways to turn pseudocode into neuroscientifically-plausible circuits. For example, the gray box is an intermediate variable in this calculation. I’m drawing it explicitly because it makes it easier to follow. But it might not be a separate cell group in the hypothalamus. Or conversely, it could be two cell groups, one for “pleasure” and the other for “displeasure”, with mutual inhibition. Or something else, who knows.
In terms of the Ingredient 4 discussion, this would be the actual phasic arousal in our own bodies, which is impacted by the exteroception-sensitive short-term predictors, but is not impacted by transient empathetic simulations of someone else’s phasic arousal.
I guess I’m predicting that people with constitutionally low arousal responses (extraverts, thrill-seekers, etc.) will tend to have a higher ratio of status drive to compassion drive. But I didn’t check that. It’s not a strong prediction—there are probably a bunch of other factors at play too.
Aversion to eye contact is common among autistic people. For example, John Elder Robison entitled his first memoir Look Me in the Eye, and discusses his aversion to eye contact in the prologue. And in the book excerpt I copied here, there are three quotes from autistic people about their experience of eye contact.
As an example, there’s an anecdote here of someone making a “feelgood” email folder for when she was feeling down, and most of the entries she mentions are basically compliments from people whom (I suspect) she sees as important and intimidating. As another example, my 9yo craves “impressing his parents” like a drug, and strives endlessly for us to laugh at his jokes, admire his knowledge and achievements, etc. But when we had regular visits with a 4yo who idolized him, he basically couldn’t care less.
Update Sept 2025: I think there’s an additional phenomenon where, if thoughts of Person X tend to induce physiological arousal in Person Y, then that contributes not only to Y wanting to feel liked / admired by X, but also (under certain conditions) to Y feeling sexually attracted to X, especially if Y is a cis woman. For more discussion see §3–§6 of my follow-up post Neuroscience of human sexual attraction triggers (3 hypotheses).