Part of the “Intro to brain-like-AGI safety” post series.
13.1 Post summary / Table of contents
In the previous post, I proposed that one path forward for AGI safety involves reverse-engineering human social instincts—the innate reactions in the Steering Subsystem (hypothalamus and brainstem) that contribute to human social behavior and moral intuitions. This post will go through some examples of how human social instincts might work.
My intention is not to offer complete and accurate descriptions of human social instinct algorithms, but rather to gesture at the kinds of algorithms that a reverse-engineering project should be looking for.
Table of contents:
- Section 13.2 explains, first, why I expect to find innate, genetically-hardwired, social instinct circuits in the hypothalamus and/or brainstem, and second, why evolution had to solve a tricky puzzle when designing these circuits. Specifically, these circuits have to solve a “symbol grounding problem”, by taking the symbols in a learned-from-scratch world-model, and somehow connecting them to the appropriate social reactions.
- Sections 13.3 and 13.4 go through two relatively simple examples where I attempt to explain recognizable social behaviors in terms of innate reaction circuits: filial imprinting in Section 13.3, and fear-of-strangers in Section 13.4.
- Section 13.5 discusses an additional ingredient that I suspect plays an important role in many social instincts, which I call “little glimpses of empathy”. This mechanism enables reactions where recognizing or expecting a feeling in someone else triggers a “response feeling” in oneself—for example, if I notice that my rival is suffering, it triggers the warm feelings of schadenfreude. To be clear, “little glimpses of empathy” have little in common with how the word “empathy” is used normally; “little glimpses of empathy” are fast and involuntary, and are involved in both prosocial and antisocial emotions.
- Section 13.6 wraps up with a plea for researchers to figure out exactly how human social instincts work, ASAP. I will have a longer wish-list of research directions in Post #15, but I want to emphasize this one right now, as it seems particularly impactful and tractable. If you (or your lab) are in a good position to make progress but would need funding, email me and I’ll keep you in the loop about possible upcoming opportunities.
13.2 What are we trying to explain and why is it tricky?
13.2.1 Claim 1: Social instincts arise from genetically-hardcoded circuitry in the Steering Subsystem (hypothalamus & brainstem)
Let’s talk about envy, to pick a central example of social emotions. (Remember, the point of this post is that I want to understand human social instincts in general; I don’t literally want AGIs to be envious—see previous post, Section 12.4.3.)
I claim: there needs to be genetically-hardcoded circuitry in the Steering Subsystem—a.k.a. an “innate reaction”—which gives rise to the feeling of envy.
Why do I think that? A few reasons:
First, envy seems to have a solid evolutionary justification. I’m referring here to the usual evolutionary psychology story: Basically, for most of human history, life was full of zero-sum competitions for status, mates, and resources, such that an aversive reaction to other people’s successes (under some circumstances) would have been plausibly adaptive in general.
Second, envy seems to be innate, not learned. I think parents will agree that children often react negatively to the successes of their siblings and classmates starting from a remarkably young age, and in situations where those successes have no discernable direct negative impact on the child in question. Even adults feel envious in situations where there’s no direct negative impact from the other person’s success—e.g., people can be envious of the achievements of historical figures—making it hard to explain envy as an indirect consequence of any non-social innate drive (hunger, curiosity, etc.). The fact that envy is a cross-cultural human universal is also consistent with it stemming from an innate reaction, as is the fact that it’s (I think) present in some non-human animals.
In my framework (see Posts #2–#3), the only way to build this kind of innate reaction is to hardwire specific circuitry into the Steering Subsystem. Here is a (non-social) example of how I expect this kind of innate reaction to be physically configured in the brain. If I understand correctly (see the detailed discussion in this other post I wrote), there’s a discrete population of neurons in the hypothalamus which seems to implement the following behavior: “If I’m under-nourished, do the following tasks: (1) emit a hunger sensation, (2) start rewarding the neocortex for getting food, (3) reduce fertility, (4) reduce growth, (5) reduce pain sensitivity, etc.” There seems to be a neat and plausible story of what this population of hypothalamic neurons is doing, how it's doing it, and why. I expect that there are analogous little circuits (perhaps also in the hypothalamus, or maybe somewhere in the brainstem) that underlie things like envy, and I’d like to know exactly what they are and how they work, at the algorithm level.
Third, in social neuroscience (just like in non-social neuroscience), the Steering Subsystem (hypothalamus and brainstem) seems to be (regrettably) neglected and dismissed in comparison to the cortex. Even so, there are more than enough papers on the topic to see that the Steering Subsystem (especially hypothalamus) plays a major role in social behavior—examples in footnote. No further comment until I read more of the literature.
13.2.2 Claim 2: Social instincts are tricky because of the “symbol grounding problem”
For social instincts to have the effects that evolution “wants” them to have, they need to interface with our conceptual understanding of the world—i.e., with our learned-from-scratch world-model, which is a huge (probably multi-terabyte) complicated unlabeled data structure in our brain.
So suppose my acquaintance Rita just won a trophy and I didn’t, and that makes me envious. Rita winning the trophy is represented by some specific neuron firing pattern in the learned cortical world model, and that’s supposed to trigger the hard-coded envy circuit in my hypothalamus or brainstem. How does that work?
You can’t just say “The genome wires these particular neurons to the envy circuit,” because we need to explain how. Recall from Post #2 that the concepts of “Rita” and “trophy” were learned within my lifetime, basically by cataloging patterns in my sensory inputs, and then patterns in the patterns, etc.—see predictive learning of sensory inputs in Post #4. How does the genome know that this particular set of neurons should trigger the envy circuit?
By the same token, you can’t just say “A within-lifetime learning algorithm will figure out the connection”; we would also need to specify how the brain calculates a “ground truth” signal (e.g. supervisory signals, error signals, reward signals, etc.) which can steer this learning algorithm.
Thus, the challenge of implementing envy (and other social instincts) amounts to a kind of symbol grounding problem—we have lots of “symbols” (concepts in our learned-from-scratch predictive world-model), and the Steering Subsystem needs a way to “ground” them, at least well enough to extract what social instincts they should evoke.
So how do the social instinct circuits solve that symbol grounding problem? One possible answer is: “Sorry Steve, but there’s no possible solution, and therefore we should reject learning-from-scratch and all the other baloney in Posts #2–#7.” Yup, I admit it, that’s a possible answer! But I don’t think it’s right.
While I don’t have any great, well-researched answers, I do have some ideas of what the answer should generally look like, and the rest of the post is my attempt to gesture in that direction.
13.2.3 Reminder of brain model, from previous posts
As usual, here’s our diagram from Post #6:
And here’s the version distinguishing within-lifetime learning-from-scratch from genetically-hardcoded circuitry:
Again, our general goal in this post is to think about how social instincts might work, without violating the constraints of our model.
13.3 Sketch #1: Filial imprinting
(This section is not necessarily a central example of how social instincts work, but is included as practice in thinking through the relevant algorithms. I feel pretty strongly that the discussion here is plausible, but I haven’t read the literature deeply enough to know if it’s correct.)
Filial imprinting (wikipedia) is a phenomenon where, in the most famous example, baby geese will “imprint on” a salient object that they see during a critical period 13–16 hours after hatching, and then will follow that object around. In nature, the “object” they imprint on is almost invariably their mother, whom they dutifully follow around early in life. However, if separated from their mother, baby geese will imprint on other animals, or even inanimate objects like boots and boxes.
Your challenge: come up with a way to implement filial imprinting in my brain model.
Here’s my answer.
The first step is: I added a particular Thought Assessor dedicated to MOMMY (marked in red), with a prior pointing it towards visual inputs (Post #9, Section 9.3.3). Next I’ll talk about how this particular Thought Assessor is trained, and then how its outputs are used.
13.3.2 How is the MOMMY Thought Assessor trained?
During the critical period (13–16 hours after hatching):
Recall that there’s a simple image processor in the Steering Subsystem (called “superior colliculus” in mammals, and “optic tectum” in birds). I propose that when this system detects that the visual field contains a mommy-like object (based on some simple image-analysis heuristics, which apparently are not very discerning, given that boots and boxes can pass as “mommy-like”), it sends a “ground truth in hindsight” signal to the MOMMY Thought Assessor. This triggers updates to the Thought Assessor (by supervised learning), essentially telling it: “Whatever you’re seeing right now in the context signals, those should lead to a very high score for MOMMY. If they don’t, please update your synapses etc. to make it so.”
After the critical period:
After the critical period, the Steering Subsystem permanently stops updating the MOMMY Thought Assessor. No matter what happens, it gets an error signal of zero!
Therefore, however that particular Thought Assessor got configured during the critical period, that’s how it stays.
Thus far in the story, we have built a circuit that learns the specific appearance of an imprinting-worthy object during the critical period, and then after the critical period, the circuit fires in proportion to how well things in the current field-of-view match that previously-learned appearance. Moreover, this circuit is not buried inside a giant learned-from-scratch data structure, but rather is sending its output into a specific, genetically-specified line going down to the Steering Subsystem—exactly the configuration that enables easy interfacing with genetically-hardwired circuitry.
So far so good!
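To make the training story concrete, here is a minimal sketch of it in Python. Everything in it is an illustrative assumption rather than a claim about real goose brains: the linear assessor, the delta-rule update, the `looks_mommy_like` heuristic, and the feature vectors are all made up for the example. The only parts taken from the text above are the structure (a crude Steering Subsystem heuristic supplies the ground truth, and updates happen only during the critical period) and the 13–16 hour window.

```python
from dataclasses import dataclass, field

# Critical period from the text above (hours after hatching).
CRITICAL_PERIOD_HOURS = (13.0, 16.0)


@dataclass
class MommyThoughtAssessor:
    """Toy linear Thought Assessor over context signals (hypothetical model)."""
    n_features: int
    learning_rate: float = 0.5
    weights: list = field(default_factory=list)

    def __post_init__(self):
        if not self.weights:
            self.weights = [0.0] * self.n_features

    def score(self, context):
        # Output sent down the dedicated MOMMY line to the Steering Subsystem.
        return sum(w * x for w, x in zip(self.weights, context))

    def update(self, context, target):
        # Supervised (delta-rule) update toward the ground-truth target.
        error = target - self.score(context)
        for i, x in enumerate(context):
            self.weights[i] += self.learning_rate * error * x


def looks_mommy_like(raw_visual):
    """Stand-in for crude optic-tectum heuristics (assumed: big and moving)."""
    return raw_visual["moving"] and raw_visual["size"] > 0.2


def steering_subsystem_step(assessor, hours_since_hatching, raw_visual, context):
    """Send a ground-truth signal only inside the critical period."""
    start, end = CRITICAL_PERIOD_HOURS
    in_critical_period = start <= hours_since_hatching <= end
    if in_critical_period and looks_mommy_like(raw_visual):
        assessor.update(context, target=1.0)
    # Outside the critical period the error signal is effectively zero,
    # so the assessor's weights are left untouched.
    return assessor.score(context)
```

In this sketch, a gosling that sees a boot during hours 13–16 ends up with an assessor that scores the boot's context signals near 1.0, and nothing seen afterwards (not even an actual goose) changes those weights.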
13.3.3 How is the MOMMY Thought Assessor used?
Now, the rest of the story is probably kinda similar to Post #7. We can use the MOMMY Thought Assessor to build a reward signal that incentivizes the baby goose not only to be physically close to, and looking at, the imprinted object, but also to plan to get physically close to it.
I can think of various ways to make the reward function a bit more elaborate than that—maybe the optic tectum heuristics continue to be involved, and help detect if the imprinted object is on the move, or whatever—but I’ve already exhausted my very limited knowledge of imprinting behavior, and maybe we should move on.
13.4 Sketch #2: Fear of strangers
(As above, the purpose here is to practice playing with the algorithms, and I don’t feel strongly that this description is definitely a thing that happens in humans.)
Here’s a behavior, which may ring true to parents of very young kids, although I think different kids display it to different degrees. If a kid sees an adult they know well, they’re happy. But if they see an adult they don’t know, they get scared, especially if that adult is very close to them, touching them, picking them up, etc.
Your challenge: come up with a way to implement that behavior in my brain model.
Here’s my answer.
(As usual, I’m oversimplifying for pedagogical purposes.) I’m assuming that there are hardwired heuristics in the brainstem sensory processing systems that indicate the likely presence of a human adult, presumably based on sight, sound, and smell. This signal by default triggers a “be scared” reaction. But the brainstem circuitry is also watching what the Thought Assessors in the cortex are predicting, and if the Thought Assessors are predicting safety, affection, comfort, etc., then the brainstem circuitry trusts that the cortex knows what it's talking about, and goes with the suggestions of the cortex. Now we can walk through what happens:
First time seeing a stranger:
- Steering Subsystem sensory heuristics say: “An adult human is present.”
- Thought Assessors say: “Neutral—I have no expectation of anything in particular.”
- Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”
- Thought Assessors say: “Oh, oops, I guess my assessment was wrong, let me update my models.”
Second time seeing the same stranger:
- Steering Subsystem sensory heuristics say: “An adult human is present.”
- Thought Assessors say: “This is a scary situation.”
- Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”
The stranger hangs around for a while, and is nice, and playing, etc.:
- Steering Subsystem sensory heuristics say: “An adult human is still present.”
- Other circuitry in the brainstem says: “I've been feeling mighty scared all this time, but y'know, nothing bad has happened…” (cf. Section 220.127.116.11)
- Other Thought Assessors see the fun new toy and say “This is a good time to relax and play.”
- Steering Subsystem says: “Considering all of the above, we should be relaxed right now.”
- Thought Assessors say: “Oh, oops, I was predicting that this was a situation where we should feel scared, but I guess I was wrong, let me update my models.”
Third time seeing the no-longer-stranger:
- Steering Subsystem sensory heuristics say: “An adult human is present.”
- Thought Assessors say: “I expect to feel relaxed and playful and not-scared.”
- Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be relaxed and playful and not-scared right now.”
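The walkthrough above can be replayed as a tiny simulation. This is a sketch under loud assumptions: feelings are collapsed to a single number (-1.0 for scared, +1.0 for relaxed), the Thought Assessor is a lookup table with one-shot learning, and `calm_evidence` is a hypothetical flag standing in for both “nothing bad has happened” and “other Thought Assessors say it's playtime”. The function is assumed to be called only when the sensory heuristics report that an adult human is present.

```python
SCARED, RELAXED = -1.0, 1.0


class FeelingAssessor:
    """Toy Thought Assessor: predicts how I will feel around a given person."""

    def __init__(self, learning_rate=1.0):
        self.predictions = {}          # person -> predicted feeling
        self.learning_rate = learning_rate

    def predict(self, person):
        # 0.0 means "neutral, no expectation of anything in particular".
        return self.predictions.get(person, 0.0)

    def update(self, person, actual_feeling):
        # "Oh, oops, my assessment was wrong, let me update my models."
        old = self.predict(person)
        self.predictions[person] = old + self.learning_rate * (actual_feeling - old)


def stranger_danger_circuit(predicted_feeling, calm_evidence=False, safe_threshold=0.5):
    """Default to fear unless the cortex confidently predicts safety."""
    if predicted_feeling > safe_threshold or calm_evidence:
        return RELAXED
    return SCARED


def encounter(assessor, person, calm_evidence=False):
    """One encounter: predict, react, then learn from the actual reaction."""
    prediction = assessor.predict(person)
    feeling = stranger_danger_circuit(prediction, calm_evidence)
    assessor.update(person, feeling)   # ground truth in hindsight
    return prediction, feeling


if __name__ == "__main__":
    assessor = FeelingAssessor()
    for calm in (False, False, True, False):
        print(encounter(assessor, "stranger", calm_evidence=calm))
```

The four calls correspond to the four stages above: first sighting, second sighting, the stranger hanging around and being nice, and finally seeing the no-longer-stranger.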
13.5 Another key ingredient (I think): “Little glimpses of empathy”
Yet again, here’s our diagram from Post #6:
Let’s zoom in on one particular Thought Assessor in my brain, which happens to be dedicated to predicting a cringe reaction. This Thought Assessor has learned over the course of my lifetime that the predictive world-model activations corresponding to “my stomach is getting punched” constitute an appropriate time to cringe:
Now what happens when I watch someone else getting punched in the stomach?
If you look carefully on the left, you’ll see that “His stomach is getting punched” is a different set of activations in my predictive world-model than “My stomach is getting punched”. But it’s not entirely different! Presumably, the two sets would overlap to some degree.
And therefore, we should expect that, by default, “His stomach is getting punched” would send a weaker but nonzero “cringe” signal down to the Steering Subsystem.
I call this signal a “little glimpse of empathy”. It tends to be a transient echo of what I (involuntarily) infer a different person to be feeling.
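Here is a toy numerical version of the overlap argument. The four named features and the linear readout are assumptions I'm making purely for illustration; real world-model activations are obviously nothing like a four-element vector. The point is only that a readout trained exclusively on “my stomach is getting punched” will still respond, more weakly, to a pattern that shares some of the same features.

```python
# Hypothetical world-model features, invented for this example.
FEATURES = ["stomach", "getting_punched", "self", "other_person"]


def activation(active_features):
    """Binary activation vector over the toy feature list."""
    return [1.0 if name in active_features else 0.0 for name in FEATURES]


MY_PUNCH = activation({"stomach", "getting_punched", "self"})
HIS_PUNCH = activation({"stomach", "getting_punched", "other_person"})


def cringe_signal(weights, x):
    """Linear readout of the cringe Thought Assessor."""
    return sum(w * xi for w, xi in zip(weights, x))


def train_cringe_weights(examples, epochs=50, learning_rate=0.1):
    """Delta-rule training on (activation, target cringe) pairs."""
    weights = [0.0] * len(FEATURES)
    for _ in range(epochs):
        for x, target in examples:
            error = target - cringe_signal(weights, x)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights


# Lifetime experience: only my own stomach has ever been punched.
weights = train_cringe_weights([(MY_PUNCH, 1.0)])
mine = cringe_signal(weights, MY_PUNCH)
his = cringe_signal(weights, HIS_PUNCH)
```

With these made-up numbers, `mine` comes out close to 1.0 and `his` close to 2/3, because two of the three trained features are shared. The exact ratio is an artifact of the toy setup; the qualitative result (weaker but nonzero) is the part that matters.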
So what? Well, recall the symbol-grounding problem from Section 13.2.2 above. The existence of “little glimpses of empathy” is a massive breakthrough towards solving that problem for social instincts! After all, my Steering Subsystem now has a legible-to-it indication that a different person is feeling a certain feeling, and that signal can in turn trigger a response reaction in me.
(I’m glossing over various issues with “little glimpses of empathy”, but I think those issues are solvable.)
For example, a (massively-oversimplified) envy reaction could look like “if I’m not happy, and I become aware (via a ‘little glimpse of empathy’) that someone else is happy, then issue a negative reward”.
More generally, one could have a Steering Subsystem circuit whose inputs include:
- my own current physiological state (“feelings”),
- the contents of the “little glimpse of empathy”,
- …associated with some metadata about the person being empathetically simulated (maybe via a “perceived social status” Thought Assessor, for example?), and
- heuristics drawn from my brainstem sensory processing systems, e.g. indicating whether I’m looking at a human right now.
The circuit could then produce outputs (“reactions”), which could (among other things) include rewards, other feelings, and/or ground truths for one or more Thought Assessors.
It seems to me that evolution would thus have quite a versatile toolbox for building social instincts, especially by chaining together more than one circuit of this type.
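As a sketch of what one such circuit might look like, here is the massively-oversimplified envy rule from above, plus a schadenfreude rule, written as a single function. All names are hypothetical. In particular, the `other_is_rival` metadata field and the `attending_to_person` gate are my assumptions about how the inputs might be packaged, and feelings are again squashed down to a single valence number.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GlimpseOfEmpathy:
    other_valence: float    # > 0: other person seems happy; < 0: unhappy
    other_is_rival: bool    # assumed metadata about the simulated person


@dataclass(frozen=True)
class Reaction:
    reward: float
    label: str


NO_REACTION = Reaction(0.0, "none")


def social_instinct_circuit(own_valence: float,
                            glimpse: Optional[GlimpseOfEmpathy],
                            attending_to_person: bool) -> Reaction:
    """Toy Steering Subsystem circuit combining own state and a glimpse."""
    # Only treat the transient signal as a glimpse of empathy if some other
    # heuristic says I am actually attending to another person.
    if glimpse is None or not attending_to_person:
        return NO_REACTION
    # Envy: I'm not happy, and someone else is happy -> negative reward.
    if own_valence <= 0 and glimpse.other_valence > 0:
        return Reaction(-glimpse.other_valence, "envy")
    # Schadenfreude: my rival is unhappy -> positive reward.
    if glimpse.other_is_rival and glimpse.other_valence < 0:
        return Reaction(-glimpse.other_valence, "schadenfreude")
    return NO_REACTION
```

For example, `social_instinct_circuit(-0.5, GlimpseOfEmpathy(0.8, False), True)` yields a negative reward labeled "envy". A fuller version would also emit other feelings and ground-truth signals for Thought Assessors, and several such circuits could be chained together.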
13.5.2 Distinction from the standard definition of “empathy”
I want to strongly distinguish “little glimpses of empathy” from the standard definition of “empathy”. (Maybe call the latter “a giant gulp of empathy”?)
For one thing, standard empathy is often effortful and voluntary, and may require at least a second or two of time, whereas a “little glimpse of empathy” is always fast and involuntary. An analogy for the latter would be how looking at a chair activates the “chair” concept in your brain, within a fraction of a second, whether you want it to or not.
For another thing, a “little glimpse of empathy”, unlike standard “empathy”, does not always lead to prosocial concern for its target. For example:
- In envy, if a little glimpse of empathy indicates that someone is happy, it makes me unhappy.
- In schadenfreude, if a little glimpse of empathy indicates that someone is unhappy, it makes me happy.
- When I’m angry, if a little glimpse of empathy indicates that the person I’m talking to is happy and calm, it sometimes makes me even more angry!
These examples are all antithetical to prosocial concern for the other person. Of course, in other situations, the “little glimpses of empathy” do spawn prosocial reactions. Basically, social instincts span the range from kind to cruel, and I suspect that pretty much all of them involve “little glimpses of empathy”.
By the way: I already offered a model of “little glimpses of empathy” in the previous subsection. You might ask: What’s my corresponding model of standard (giant gulp of) empathy?
Well, in the previous subsection, I distinguished “my own current physiological state (feelings)” from “the contents of the little glimpse of empathy”. For standard empathy, I think this distinction breaks down—the latter bleeds into the former. Specifically, I would propose that when my Thought Assessors issue a sufficiently strong and long-lasting empathetic prediction, the Steering Subsystem starts “deferring” to them (in the Post #5 sense), and the result is that my own feelings wind up matching the feelings of the target-of-empathy. That’s my model of standard empathy.
Then, if the target of my (standard) empathy is currently feeling an aversive feeling, I also wind up feeling an aversive feeling, and I don’t like that, so I’m motivated to help him feel better (or, perhaps, motivated to shut him out, as can happen in compassion fatigue). Conversely, if the target of my (standard) empathy is currently feeling a pleasant feeling, I also wind up feeling a pleasant feeling, and I’m motivated to help him feel that feeling again.
Thus, standard empathy seems to be inevitably prosocial.
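The “strong and long-lasting” condition in my model of standard empathy can be sketched as a simple threshold-plus-duration rule. The specific threshold, the duration counted in time steps, and the single-number representation of feelings are all assumptions chosen for illustration, not measured quantities.

```python
def own_feeling_trace(baseline, empathetic_predictions,
                      strength_threshold=0.6, min_duration=3):
    """Return my own feeling at each time step.

    The Steering Subsystem keeps my baseline feeling until the empathetic
    prediction has been strong for `min_duration` consecutive steps; from
    then on (while it stays strong) it defers, and my feeling matches the
    predicted feeling of the target of empathy.
    """
    feelings = []
    strong_run = 0
    for prediction in empathetic_predictions:
        if abs(prediction) >= strength_threshold:
            strong_run += 1
        else:
            strong_run = 0
        if strong_run >= min_duration:
            feelings.append(prediction)   # deferral: standard empathy
        else:
            feelings.append(baseline)     # at most a little glimpse
    return feelings
```

In this toy version, a single strong blip such as `[-0.9]` leaves my own feeling at baseline (a little glimpse only), whereas a sustained run of strong predictions eventually pulls my own feeling to match.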
13.5.3 Why do I believe that “Little glimpses of empathy” are part of the story?
First, it seems introspectively right (to me, at least). If my friend is impressed by something I did, I feel proud, but I especially feel proud at the exact moment when I imagine my friend feeling that emotion. If my friend is disappointed in me, I feel guilty, but I especially feel guilty at the exact moment when I imagine my friend feeling that emotion. As another example, there’s a saying: “I can’t wait to see the look on his face when….” Presumably this saying reflects some real aspect of our social psychology, and if so, I claim that this observation dovetails well with my “little glimpses of empathy” story.
Second, way back in Post #5, Section 5.5.4, I noted that the medial prefrontal cortex (mPFC) (and the corresponding parts of the ventral striatum) plays a dual role as (1) a visceromotor center that can orchestrate autonomic reactions like pupil dilation and heart rate changes, and (2) a motivational / decision-making center. I claimed that the “Thought Assessors” picture elegantly explains why those roles go together as two sides of the same coin. I neglected to mention yet another role of mPFC, namely (3) a center of social instincts and morality. (Other Thought Assessor areas besides mPFC are in this category as well.) I think the “little glimpses of empathy” picture elegantly accounts for that as well: the “glimpses of empathy” correspond to signals getting sent from mPFC and the other Thought Assessor areas down to the Steering Subsystem, and thus all behavior that connects to social instincts necessarily involves Thought Assessors.
(That said, there are other possible social-instinct stories that also involve Thought Assessors but do not involve “little glimpses of empathy”—see for example Sections 13.3–13.4 above—so this piece of evidence is not very specific.)
Third, if the rest of my model (Posts #2–#7) is correct, then “little glimpse of empathy” signals would arise automatically, such that it would be straightforward to evolve a Steering Subsystem circuit that “listens” for them.
Fourth, if the rest of my model is correct, then, well, I can’t think of any other way to build most social instincts! Process of elimination!
13.6 Future work (please!)
As noted in the introduction, the point of this post is to gesture towards what I expect a “theory of human social instincts” to look like, such that it would be compatible with all my other claims about brain algorithms in Posts #2–#7, particularly the strong constraint of “learning from scratch” as discussed in Section 13.2.2 above. My takeaway from the discussion in Sections 13.3–13.5 is a strong feeling of optimism that such a theory exists, even if I don’t know all the details yet, and a corresponding optimism that this theory is actually how the human brain works, and will line up with corresponding circuits in the brainstem or (more likely) hypothalamus.
Of course, I want very much to move past the “general theorizing” stage, into more specific claims about how human social instincts actually work. For example, I’d love to move beyond speculation on how these instincts might solve the symbol-grounding problem, and learn how they actually do solve the symbol-grounding problem. I’m open to any ideas and pointers here, or better yet, for people to just figure this out on their own and tell me the answer.
For reasons discussed in the previous post, nailing down human social instincts is at the top of my wishlist for how neuroscientists can help with AGI safety.
Remember how I talked about Differential Technological Development (DTD) in Post #1 Section 1.7? Well, this is the DTD “ask” that I feel strongest about—at least, among those things that neuroscientists can do without explicitly working on AGI safety (see upcoming Post #15 for my more comprehensive wish-list). I really want us to reverse-engineer human social instincts in the hypothalamus & brainstem long before we reverse-engineer human world-modeling in the neocortex.
And things are not looking good for that project! The hypothalamus is small and deep and hence hard-to-study! Human social instincts might be different from rat social instincts! Orders of magnitude more research effort is going towards understanding neocortex world-modeling than understanding hypothalamus & brainstem social instinct circuitry! In fact, I’ve noticed (to my chagrin) that algorithmically-minded, AI-adjacent neuroscientists are especially likely to spend their talents on the Learning Subsystem (neocortex, hippocampus, cerebellum, etc.) rather than the hypothalamus & brainstem. But still, I don’t think my DTD “ask” is hopeless, and I encourage anyone to try, and if you (or your lab) are in a good position to make progress but would need funding, email me and I'll keep you in the loop about possible upcoming opportunities.
See for example “The Evolutionary Psychology of Envy” by Hill & Buss, book chapter in Envy: Theory & Research, 2008.
Envy is on Donald E. Brown’s “list of human universals”, as reproduced in an appendix to The Blank Slate (Steven Pinker, 2002).
“…if you look at the human literature nobody talks about the hypothalamus and behaviour. The hypothalamus is very small and can’t be readily seen by human brain imaging technologies like functional magnetic resonance imaging (fMRI). Also, much of the anatomical work in the instinctive fear system, for example, has been overlooked because it was carried out by Brazilian neuroscientists who were not particularly bothered to publish in high profile journals. Fortunately, there has recently been a renewed interest in these behaviors and these studies are being newly appreciated.” (Cornelius Gross, 2018)
A few random example papers on the role of the Steering Subsystem (especially hypothalamus) in social behavior: “Independent hypothalamic circuits for social and predator fear” (Silva et al., 2013), “Representation of distinct reward variables for self and other in primate lateral hypothalamus” (Noritake et al., 2020), and “Social Stimuli Induce Activation of Oxytocin Neurons Within the Paraventricular Nucleus of the Hypothalamus to Promote Social Behavior in Male Mice” (Resendez et al., 2020).
I suspect a more accurate diagram would feature arousal (in the psychology-jargon sense, not the sexual sense—i.e., heart rate elevation etc.) as a mediating variable. Specifically: (1) if brainstem sensory processing indicates that an adult human is present and nearby and picking me up etc., that leads to heightened arousal (by default, unless the Thought Assessors strongly indicate otherwise), and (2) when I’m in a state of heightened arousal, my brainstem treats it as bad and dangerous (by default, unless the Thought Assessors strongly indicate otherwise).
For example, the Steering Subsystem needs a method to distinguish a “little glimpse of empathy” from other transient feelings, e.g. the transient feeling that occurs when I think through the consequences of a possible course of action that I might take. Maybe there are some imperfect heuristics that could do that, but my preferred theory is that there’s a special Thought Assessor trained to fire when attending to another human (based on ground-truth sensory heuristics as discussed in Section 13.4). As another example, we need the “Ground truth in hindsight” signals to not gradually train away the Thought Assessor’s sensitivity to “his stomach is getting punched”. But it seems to me that, if the Steering Subsystem can figure out when a signal is a “little glimpse of empathy”, then it can choose not to send error signals to the Thought Assessors in those cases.
Warning: I’m not entirely sure that there really is a “standard” definition of empathy; it’s also possible that the term is used in lots of slightly-inconsistent ways.