Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Part of the “Intro to brain-like-AGI safety” post series.

13.1 Post summary / Table of contents

In the previous post, I proposed that one path forward for AGI safety involves reverse-engineering human social instincts—the innate reactions in the Steering Subsystem (hypothalamus and brainstem) that contribute to human social behavior and moral intuitions. This post will go through some examples of how human social instincts might work.

My intention is not to offer complete and accurate descriptions of human social instinct algorithms, but rather to gesture at the kinds of algorithms that a reverse-engineering project should be looking for.

This post, like Posts #2–#7 but unlike the rest of the series, is pure neuroscience, with almost no mention of AGI besides here and the conclusion.

Table of contents:

  • Section 13.2 explains, first, why I expect to find innate, genetically-hardwired, social instinct circuits in the hypothalamus and/or brainstem, and second, why evolution had to solve a tricky puzzle when designing these circuits. Specifically, these circuits have to solve a “symbol grounding problem”, by taking the symbols in a learned-from-scratch world-model, and somehow connecting them to the appropriate social reactions.
  • Sections 13.3 and 13.4 go through two relatively simple examples where I attempt to explain recognizable social behaviors in terms of innate reaction circuits: filial imprinting in Section 13.3, and fear-of-strangers in Section 13.4.
  • Section 13.5 discusses an additional ingredient that I suspect plays an important role in many social instincts, which I call “transient empathetic simulations”. (NOTE ADDED JANUARY 2024: When I first published this post, I used the term “little glimpse of empathy” instead of “transient empathetic simulation” throughout. I just changed it. I like “transient empathetic simulation” better. Sorry for any confusion.) This mechanism enables reactions where recognizing or expecting a feeling in someone else triggers a “response feeling” in oneself—for example, if I notice that my rival is suffering, it triggers the warm feelings of schadenfreude. To be clear, “transient empathetic simulations” have little in common with how the word “empathy” is used normally; “transient empathetic simulations” are fast and involuntary, and are involved in both prosocial and antisocial emotions.
  • Section 13.6 wraps up with a plea for researchers to figure out exactly how human social instincts work, ASAP. I will have a longer wish-list of research directions in Post #15, but I want to emphasize this one right now, as it seems particularly impactful and tractable. If you (or your lab) are in a good position to make progress but would need funding, email me and I’ll keep you in the loop about possible upcoming opportunities.

13.2 What are we trying to explain and why is it tricky?

13.2.1 Claim 1: Social instincts arise from genetically-hardcoded circuitry in the Steering Subsystem (hypothalamus & brainstem)

Let’s talk about envy, to pick a central example of social emotions. (Remember, the point of this post is that I want to understand human social instincts in general; I don’t literally want AGIs to be envious—see previous post, Section 12.4.3.)

I claim: there needs to be genetically-hardcoded circuitry in the Steering Subsystem—a.k.a. an “innate reaction”—which gives rise to the feeling of envy. 

Why do I think that? A few reasons:

First, envy seems to have a solid evolutionary justification. I’m referring here to the usual evolutionary psychology story:[1] Basically, for most of human history, life was full of zero-sum competitions for status, mates, and resources, such that an aversive reaction to other people’s successes (under some circumstances) would have been plausibly adaptive in general.

Second, envy seems to be innate, not learned. I think parents will agree that children often react negatively to the successes of their siblings and classmates starting from a remarkably young age, and in situations where those successes have no discernible direct negative impact on the child in question. Even adults feel envious in situations where there’s no direct negative impact from the other person’s success—e.g., people can be envious of the achievements of historical figures—making it hard to explain envy as an indirect consequence of any non-social innate drive (hunger, curiosity, etc.). The fact that envy is a cross-cultural human universal[2] is also consistent with it stemming from an innate reaction, as is the fact that it’s (I think) present in some non-human animals.

In my framework (see Posts #2–#3), the only way to build this kind of innate reaction is to hardwire specific circuitry into the Steering Subsystem. As a (non-social) example of how I expect this kind of innate reaction to be physically configured in the brain (if I understand correctly, see detailed discussion in this other post I wrote), there’s a discrete population of neurons in the hypothalamus which seems to implement the following behavior: “If I’m under-nourished, do the following tasks: (1) emit a hunger sensation, (2) start rewarding the neocortex for getting food, (3) reduce fertility, (4) reduce growth, (5) reduce pain sensitivity, etc.”. There seems to be a neat and plausible story of what this population of hypothalamic neurons is doing, how it's doing it, and why. I expect that there are analogous little circuits (perhaps also in the hypothalamus, or maybe somewhere in the brainstem) that underlie things like envy, and I’d like to know exactly what they are and how they work, at the algorithm level.
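To make the flavor of “innate reaction circuit” concrete, here’s a minimal toy sketch in code. Every name and number in it (the nourishment signal, the threshold, the particular outputs) is invented for illustration; the point is just the shape of the thing: a fixed, genetically-specified mapping from an interoceptive signal to a coordinated bundle of reactions.

```python
# Illustrative sketch only; not a claim about actual hypothalamic wiring.
def hunger_circuit(nourishment_signal: float, threshold: float = 0.3) -> dict:
    """A fixed, genetically-specified input-output mapping: if an interoceptive
    signal says "under-nourished", emit a coordinated bundle of reactions."""
    if nourishment_signal < threshold:       # crude stand-in for "I'm under-nourished"
        return {
            "hunger_sensation": True,        # (1) emit a hunger sensation
            "reward_for_getting_food": 1.0,  # (2) reward the neocortex for getting food
            "fertility": -0.5,               # (3) reduce fertility
            "growth": -0.5,                  # (4) reduce growth
            "pain_sensitivity": -0.5,        # (5) reduce pain sensitivity
        }
    return {}  # otherwise this particular circuit stays quiet
```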

Third, in social neuroscience (just like in non-social neuroscience), the Steering Subsystem (hypothalamus and brainstem) seems to be (regrettably) neglected and dismissed in comparison to the cortex.[3] Even so, there are more than enough papers on the topic to see that the Steering Subsystem (especially hypothalamus) plays a major role in social behavior—examples in footnote.[4] No further comment until I read more of the literature.

13.2.2 Claim 2: Social instincts are tricky because of the “symbol grounding problem”

For social instincts to have the effects that evolution “wants” them to have, they need to interface with our conceptual understanding of the world—i.e., with our learned-from-scratch world-model, which is a huge (probably multi-terabyte) complicated unlabeled data structure in our brain.

So suppose my acquaintance Rita just won a trophy and I didn’t, and that makes me envious. Rita winning the trophy is represented by some specific neuron firing pattern in the learned cortical world model, and that’s supposed to trigger the hard-coded envy circuit in my hypothalamus or brainstem. How does that work?

You can’t just say “The genome wires these particular neurons to the envy circuit,” because we need to explain how. Recall from Post #2 that the concepts of “Rita” and “trophy” were learned within my lifetime, basically by cataloging patterns in my sensory inputs, and then patterns in the patterns, etc.—see predictive learning of sensory inputs in Post #4. How does the genome know that this particular set of neurons should trigger the envy circuit?

By the same token, you can’t just say “A within-lifetime learning algorithm will figure out the connection”; we would also need to specify how the brain calculates a “ground truth” signal (e.g. supervisory signals, error signals, reward signals, etc.) which can steer this learning algorithm.

Thus, the challenge of implementing envy (and other social instincts) amounts to a kind of symbol grounding problem—we have lots of “symbols” (concepts in our learned-from-scratch predictive world-model), and the Steering Subsystem needs a way to “ground” them, at least well enough to extract what social instincts they should evoke.

So how do the social instinct circuits solve that symbol grounding problem? One possible answer is: “Sorry Steve, but there’s no possible solution, and therefore we should reject learning-from-scratch and all the other baloney in Posts #2–#7.” Yup, I admit it, that’s a possible answer! But I don’t think it’s right.

While I don’t have any great, well-researched answers, I do have some ideas of what the answer should generally look like, and the rest of the post is my attempt to gesture in that direction.

13.2.3 Reminder of brain model, from previous posts

As usual, here’s our diagram from Post #6:

And here’s the version distinguishing within-lifetime learning-from-scratch from genetically-hardcoded circuitry:

Again, our general goal in this post is to think about how social instincts might work, without violating the constraints of our model.

13.3 Sketch #1: Filial imprinting

(This section is not necessarily a central example of how social instincts work, but included as practice thinking through the relevant algorithms. Thus, I feel pretty strongly that the discussion here is plausible, but haven’t read the literature deeply enough to know if it’s correct.)

13.3.1 Overview

Left: baby geese who imprinted on their mother. Right: Baby ducks who imprinted on a corgi. (Image sources: 1,2)

Filial imprinting (wikipedia) is a phenomenon where, in the most famous example, baby geese will “imprint on” a salient object that they see during a critical period 13–16 hours after hatching, and then will follow that object around. In nature, the “object” they imprint on is almost invariably their mother, whom they dutifully follow around early in life. However, if separated from their mother, baby geese will imprint on other animals, or even inanimate objects like boots and boxes.

Your challenge: come up with a way to implement filial imprinting in my brain model.

(Try it!)

.

.

.

.

Here’s my answer.

Same as above except for the red text.

The first step is: I added a particular Thought Assessor dedicated to MOMMY (marked in red), with a prior pointing it towards visual inputs (Post #9, Section 9.3.3). Next I’ll talk about how this particular Thought Assessor is trained, and then how its outputs are used.

13.3.2 How is the MOMMY Thought Assessor trained?

During the critical period (13–16 hours after hatching):

Recall that there’s a simple image processor in the Steering Subsystem (called “superior colliculus” in mammals, and “optic tectum” in birds). I propose that when this system detects that the visual field contains a mommy-like object (based on some simple image-analysis heuristics, which apparently are not very discerning, given that boots and boxes can pass as “mommy-like”), it sends a “ground truth in hindsight” signal to the MOMMY Thought Assessor. This triggers updates to the Thought Assessor (by supervised learning), essentially telling it: “Whatever you’re seeing right now in the context signals, those should lead to a very high score for MOMMY. If they don’t, please update your synapses etc. to make it so.”

During the critical period (13–16 hours after hatching), whenever the goose’s brainstem visual processor detects a plausibly-mommy-like object, it sends a ground truth supervisory signal to the MOMMY Thought Assessor, prompting the Thought Assessor learning algorithm to edit its connections. 

After the critical period (13–16 hours after hatching):

After the critical period, the Steering Subsystem permanently stops updating the MOMMY Thought Assessor. No matter what happens, it gets an error signal of zero!

Therefore, however that particular Thought Assessor got configured during the critical period, that’s how it stays.
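Here’s a minimal toy sketch of the story so far, just to pin down the moving parts. The linear readout, the learning rate, and the idea that the optic tectum’s heuristic arrives as a simple yes/no “mommy-like object detected” signal are all stand-ins, not claims about real goose neurons:

```python
import numpy as np

class MommyThoughtAssessor:
    """Toy version of the proposed MOMMY Thought Assessor."""

    def __init__(self, n_context_signals: int, learning_rate: float = 0.1):
        self.w = np.zeros(n_context_signals)  # learned "synapses" from cortical context signals
        self.lr = learning_rate

    def score(self, context: np.ndarray) -> float:
        """How strongly the current learned-from-scratch context looks like MOMMY."""
        return float(self.w @ context)

    def update(self, context: np.ndarray, tectum_says_mommy: bool,
               in_critical_period: bool) -> None:
        """Supervised learning, gated by the critical period (13-16 hours after hatching).

        During the critical period, the brainstem's crude visual heuristic supplies a
        ground-truth-in-hindsight target; afterwards the error signal is forced to zero,
        so however the assessor got configured, that's how it stays."""
        target = 1.0 if tectum_says_mommy else 0.0
        error = (target - self.score(context)) if in_critical_period else 0.0
        self.w += self.lr * error * context
```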

Summary

Thus far in the story, we have built a circuit that learns the specific appearance of an imprinting-worthy object during the critical period, and then after the critical period, the circuit fires in proportion to how well things in the current field-of-view match that previously-learned appearance. Moreover, this circuit is not buried inside a giant learned-from-scratch data structure, but rather is sending its output into a specific, genetically-specified line going down to the Steering Subsystem—exactly the configuration that enables easy interfacing with genetically-hardwired circuitry.

So far so good!

13.3.3 How is the MOMMY Thought Assessor used?

Now, the rest of the story is probably kinda similar to Post #7. We can use the MOMMY Thought Assessor to build a reward signal that incentivizes the baby goose to be physically proximate to, and looking at, the imprinted object—and not only that, but also to make plans that lead to being physically proximate to it.
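For example, a (made-up) reward function along those lines might be as simple as:

```python
def imprinting_reward(mommy_score_now: float, mommy_score_of_planned_thought: float,
                      k_now: float = 1.0, k_plan: float = 0.5) -> float:
    # Hypothetical reward: feel good in proportion to how MOMMY-ish the current view is
    # (i.e. being near, and looking at, the imprinted object), plus a smaller bonus for
    # entertaining plans that the assessor scores as leading back to MOMMY.
    return k_now * mommy_score_now + k_plan * mommy_score_of_planned_thought
```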

I can think of various ways to make the reward function a bit more elaborate than that—maybe the optic tectum heuristics continue to be involved, and help detect if the imprinted object is on the move, or whatever—but I’ve already exhausted my very limited knowledge of imprinting behavior, and maybe we should move on.

13.4 Sketch #2: Fear of strangers

(As above, the purpose here is to practice playing with the algorithms, and I don’t feel strongly that this description is definitely a thing that happens in humans.)

Here’s a behavior, which may ring true to parents of very young kids, although I think different kids display it to different degrees. If a kid sees an adult they know well, they’re happy. But if they see an adult they don’t know, they get scared, especially if that adult is very close to them, touching them, picking them up, etc.

Your challenge: come up with a way to implement that behavior in my brain model.

(Try it!)

.

.

.

.

Here’s my answer.

(As usual, I’m oversimplifying for pedagogical purposes.[5]) I’m assuming that there are hardwired heuristics in the brainstem sensory processing systems that indicate the likely presence of a human adult—presumably based on sight, sound, and smell. This signal by default triggers a “be scared” reaction. But the brainstem circuitry is also watching what the Thought Assessors in the cortex are predicting, and if the Thought Assessors are predicting safety, affection, comfort, etc., then the brainstem circuitry trusts that the cortex knows what it's talking about, and goes with the suggestions of the cortex. (There’s a minimal code sketch of this loop right after the walkthrough below.) Now we can walk through what happens:

First time seeing a stranger:

  • Steering Subsystem sensory heuristics say: “An adult human is present.”
  • Thought Assessor says: “Neutral—I have no expectation of anything in particular.”
  • Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”
  • Thought Assessor says: “Oh, oops, I guess my assessment was wrong, let me update my models.”

Second time seeing the same stranger:

  • Steering Subsystem sensory heuristics say: “An adult human is present.”
  • Thought Assessors say: “This is a scary situation.”
  • Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be scared right now.”

The stranger hangs around for a while, and is nice, and playing, etc.:

  • Steering Subsystem sensory heuristics say: “An adult human is still present.”
  • Other circuitry in the brainstem says: “I've been feeling mighty scared all this time, but y'know, nothing bad has happened…” (cf. Section 5.2.1.1)
  • Other Thought Assessors see the fun new toy and say “This is a good time to relax and play.”
  • Steering Subsystem says: “Considering all of the above, we should be relaxed right now.”
  • Thought Assessors say: “Oh, oops, I was predicting that this was a situation where we should feel scared, but I guess I was wrong, let me update my models.”

Third time seeing the no-longer-stranger:

  • Steering Subsystem sensory heuristics say: “An adult human is present.”
  • Thought Assessors say: “I expect to feel relaxed and playful and not-scared.”
  • Steering Subsystem “Stranger Danger circuit” says: “Considering all of the above, we should be relaxed and playful and not-scared right now.”
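Here’s the promised minimal sketch of that loop, with all the thresholds and signal names invented (the real circuit, if it exists, is surely messier):

```python
def stranger_danger_step(adult_detected: bool,
                         assessor_fear_prediction: float,
                         nothing_bad_has_happened_for_a_while: bool):
    """One toy Steering-Subsystem update.

    Returns (fear_reaction, ground_truth_for_assessor): how scared to actually be
    right now, and the supervisory signal sent back up so the Thought Assessor can
    correct its prediction next time around."""
    if not adult_detected:
        fear = 0.0
    elif assessor_fear_prediction < 0.2:
        # The cortex confidently predicts safety / affection / comfort,
        # so the brainstem defers to it.
        fear = 0.0
    elif nothing_bad_has_happened_for_a_while:
        # Habituation-ish circuitry (cf. Section 5.2.1.1) dials the fear down.
        fear = 0.2
    else:
        # Default reaction to an unfamiliar adult: be scared.
        fear = 1.0
    return fear, fear  # the reaction itself doubles as the assessor's ground truth
```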

13.5 Another key ingredient (I think): “Transient empathetic simulation”

13.5.1 Introduction

Yet again, here’s our diagram from Post #6:

Let’s zoom in on one particular Thought Assessor in my brain, which happens to be dedicated to predicting a cringe reaction. This Thought Assessor has learned over the course of my lifetime that the predictive world-model activations corresponding to “my stomach is getting punched” constitute an appropriate time to cringe:

Now what happens when I watch someone else getting punched in the stomach?

If you look carefully on the left, you’ll see that “His stomach is getting punched” is a different set of activations in my predictive world-model than “My stomach is getting punched”. But it’s not entirely different! Presumably, the two sets would overlap to some degree.

And therefore, we should expect that, by default, “His stomach is getting punched” would send a weaker but nonzero “cringe” signal down to the Steering Subsystem.

I call this signal a “transient empathetic simulation”. (NOTE: Before January 2024, I was using the term “little glimpse of empathy” instead of “transient empathetic simulation” throughout this post. I like the new terminology much better, so I’m changing it. Sorry for the inconsistency.) As noted at the top of the post, it tends to be a transient echo of what I (involuntarily) infer a different person to be feeling.
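To make the overlap claim concrete, here’s a toy numerical illustration (made-up binary activation patterns, not real data):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # pretend world-model neurons

my_punch = (rng.random(n) < 0.05).astype(float)             # "My stomach is getting punched"
his_punch = np.where(rng.random(n) < 0.5, my_punch,          # partially overlapping pattern for
                     (rng.random(n) < 0.05).astype(float))   # "His stomach is getting punched"

# Cringe Thought Assessor trained on first-person experience: weights match the first pattern.
w = my_punch.copy()

print(w @ my_punch)   # strong cringe signal (full match)
print(w @ his_punch)  # weaker but nonzero cringe signal (partial overlap)
```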

So what? Well, recall the symbol-grounding problem from Section 13.2.2 above. The existence of “transient empathetic simulations” is a massive breakthrough towards solving that problem for social instincts! After all, my Steering Subsystem now has a legible-to-it indication that a different person is feeling a certain feeling, and that signal can in turn trigger a response reaction in me.

(I’m glossing over various issues with “transient empathetic simulations”, but I think those issues are solvable.[6])

For example, a (massively-oversimplified) envy reaction could look like “if I’m not happy, and I become aware (via a ‘transient empathetic simulation’) that someone else is happy, then issue a negative reward”.

More generally, one could have a Steering Subsystem circuit whose inputs include:

  1. my own current physiological state (“feelings”),
  2. the contents of the “transient empathetic simulation”,
  3. …associated with some metadata about the person being empathetically simulated (maybe via a “perceived social status” Thought Assessor, for example?), and
  4. heuristics drawn from my brainstem sensory processing systems, e.g. indicating whether I’m looking at a human right now.

The circuit could then produce outputs (“reactions”), which could (among other things) include rewards, other feelings, and/or ground truths for one or more Thought Assessors.
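For instance, the massively-oversimplified envy reaction from a couple of paragraphs ago, wired up in these terms, might look like the following sketch (all inputs and thresholds invented):

```python
def envy_circuit(my_valence: float, empathetic_sim_valence: float,
                 target_is_rival: bool, looking_at_human: bool) -> float:
    """One possible Steering Subsystem circuit of this type, returning a reward correction."""
    if not looking_at_human:
        return 0.0   # brainstem heuristics say nobody is there; do nothing
    if my_valence < 0.0 and empathetic_sim_valence > 0.0 and target_is_rival:
        return -1.0  # "I'm unhappy, and they seem happy" -> negative reward (envy)
    return 0.0
```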

It seems to me that evolution would thus have quite a versatile toolbox for building social instincts, especially by chaining together more than one circuit of this type.

13.5.2 Distinction from the standard definition of “empathy”

I want to strongly distinguish “transient empathetic simulation” from the standard definition of “empathy”.[7] (Maybe call the latter “sustained empathetic simulation”?)

For one thing, standard empathy is often effortful and voluntary, and may require at least a second or two of time, whereas a “transient empathetic simulation” is always fast and involuntary. An analogy for the latter would be how looking at a chair activates the “chair” concept in your brain, within a fraction of a second, whether you want it to or not.

For another thing, a “transient empathetic simulation”, unlike standard “empathy”,  does not always lead to prosocial concern for its target. For example:

  • In envy, if a transient empathetic simulation indicates that someone is happy, it makes me unhappy.
  • In schadenfreude, if a transient empathetic simulation indicates that someone is unhappy, it makes me happy.
  • When I’m angry, if a transient empathetic simulation indicates that the person I’m talking to is happy and calm, it sometimes makes me even more angry!

These examples are all antithetical to prosocial concern for the other person. Of course, in other situations, the “transient empathetic simulations” do spawn prosocial reactions. Basically, social instincts span the range from kind to cruel, and I suspect that pretty much all of them involve “transient empathetic simulations”.

By the way: I already offered a model of “transient empathetic simulations” in the previous subsection. You might ask: What’s my corresponding model of standard (sustained) empathy?

Well, in the previous subsection, I distinguished “my own current physiological state (feelings)” from “the contents of the transient empathetic simulation”. For standard empathy, I think this distinction breaks down—the latter bleeds into the former. Specifically, I would propose that when my Thought Assessors issue a sufficiently strong and long-lasting empathetic prediction, the Steering Subsystem starts “deferring” to them (in the Post #5 sense), and the result is that my own feelings wind up matching the feelings of the target-of-empathy. That’s my model of standard empathy.

Then, if the target of my (standard) empathy is currently feeling an aversive feeling, I also wind up feeling an aversive feeling, and I don’t like that, so I’m motivated to help him feel better (or, perhaps, motivated to shut him out, as can happen in compassion fatigue). Conversely, if the target of my (standard) empathy is currently feeling a pleasant feeling, I also wind up feeling a pleasant feeling, and I’m motivated to help him feel that feeling again.

Thus, standard empathy seems to be inevitably prosocial.
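If it helps, here’s the transient-versus-sustained distinction as a toy sketch; the thresholds are invented, and “strength” and “duration” are just placeholders for whatever criteria the Steering Subsystem actually uses when deciding whether to defer:

```python
def felt_state(own_feelings: float, empathetic_prediction: float,
               prediction_strength: float, prediction_duration_s: float) -> float:
    # If the Thought Assessors' empathetic prediction is strong and long-lasting, the
    # Steering Subsystem "defers" and lets it bleed into my own feelings (standard,
    # sustained empathy); otherwise it remains a brief transient echo.
    if prediction_strength > 0.8 and prediction_duration_s > 2.0:
        return empathetic_prediction  # my feelings come to match the target's
    return own_feelings
```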

13.5.3 Why do I believe that “transient empathetic simulations” are part of the story?

First, it seems introspectively right (to me, at least). If my friend is impressed by something I did, I feel proud, but I especially feel proud at the exact moment when I imagine my friend feeling that emotion. If my friend is disappointed in me, I feel guilty, but I especially feel guilty at the exact moment when I imagine my friend feeling that emotion. As another example, there’s a saying: “I can’t wait to see the look on his face when….” Presumably this saying reflects some real aspect of our social psychology, and if so, I claim that this observation dovetails well with my “transient empathetic simulations” story.

Second, way back in Post #5, Section 5.5.4, I noted that the medial prefrontal cortex (mPFC) (and the corresponding parts of the ventral striatum) plays a dual role as (1) a visceromotor center that can orchestrate autonomic reactions like pupil dilation and heart rate changes, and (2) a motivational / decision-making center. I claimed that the “Thought Assessors” picture elegantly explains why those roles go together as two sides of the same coin. I neglected to mention yet another role of mPFC, namely (3) a center of social instincts and morality. (Other Thought Assessor areas besides mPFC are in this category as well.) I think the “transient empathetic simulations” picture elegantly accounts for that as well: these transient empathetic simulations correspond to signals getting sent from mPFC and the other Thought Assessor areas down to the Steering Subsystem, and thus all behavior that connects to social instincts necessarily involves Thought Assessors.

(That said, there are other possible social-instinct stories that also involve Thought Assessors but do not involve “transient empathetic simulations”—see for example Sections 13.3–13.4 above—so this piece of evidence is not very specific.)

Third, if the rest of my model (Posts #2–#7) is correct, then “transient empathetic simulation” signals would arise automatically, such that it would be straightforward to evolve a Steering Subsystem circuit that “listens” for them.

Fourth, if the rest of my model is correct, then, well, I can’t think of any other way to build most social instincts! Process of elimination!

13.6 Future work (please!)

As noted in the introduction, the point of this post is to gesture towards what I expect a “theory of human social instincts” to look like, such that it would be compatible with all my other claims about brain algorithms in Posts #2–#7, particularly the strong constraint of “learning from scratch” as discussed in Section 13.2.2 above. My takeaway from the discussion in Sections 13.3–13.5 is a strong feeling of optimism that such a theory exists, even if I don’t know all the details yet, and a corresponding optimism that this theory is actually how the human brain works, and will line up with corresponding circuits in the brainstem or (more likely) hypothalamus.

Of course, I want very much to move past the “general theorizing” stage, into more specific claims about how human social instincts actually work. For example, I’d love to move beyond speculation on how these instincts might solve the symbol-grounding problem, and learn how they actually do solve the symbol-grounding problem. I’m open to any ideas and pointers here, or better yet, for people to just figure this out on their own and tell me the answer.

For reasons discussed in the previous post, nailing down human social instincts is at the top of my wishlist for how neuroscientists can help with AGI safety.

Remember how I talked about Differential Technological Development (DTD) in Post #1 Section 1.7? Well, this is the DTD “ask” that I feel strongest about—at least, among those things that neuroscientists can do without explicitly working on AGI safety (see upcoming Post #15 for my more comprehensive wish-list). I really want us to reverse-engineer human social instincts in the hypothalamus & brainstem long before we reverse-engineer human world-modeling in the neocortex.

And things are not looking good for that project! The hypothalamus is small and deep and hence hard-to-study! Human social instincts might be different from rat social instincts! Orders of magnitude more research effort is going towards understanding neocortex world-modeling than understanding hypothalamus & brainstem social instinct circuitry! In fact, I’ve noticed (to my chagrin) that algorithmically-minded, AI-adjacent neuroscientists are especially likely to spend their talents on the Learning Subsystem (neocortex, hippocampus, cerebellum, etc.) rather than the hypothalamus & brainstem. But still, I don’t think my DTD “ask” is hopeless, and I encourage anyone to try, and if you (or your lab) are in a good position to make progress but would need funding, email me and I'll keep you in the loop about possible upcoming opportunities.

  1. ^

    See for example “The Evolutionary Psychology of Envy” by Hill & Buss, book chapter in Envy: Theory & Research, 2008.

  2. ^

    Envy is on Donald E. Brown’s “list of human universals”, as reproduced in an appendix to The Blank Slate (Steven Pinker, 2002).

  3. ^

    “…if you look at the human literature nobody talks about the hypothalamus and behaviour. The hypothalamus is very small and can’t be readily seen by human brain imaging technologies like functional magnetic resonance imaging (fMRI). Also, much of the anatomical work in the instinctive fear system, for example, has been overlooked because it was carried out by Brazilian neuroscientists who were not particularly bothered to publish in high profile journals. Fortunately, there has recently been a renewed interest in these behaviors and these studies are being newly appreciated.” (Cornelius Gross, 2018)

  4. ^
  5. ^

    I suspect a more accurate diagram would feature arousal (in the psychology-jargon sense, not the sexual sense—i.e., heart rate elevation etc.) as a mediating variable. Specifically: (1) if brainstem sensory processing indicates that an adult human is present and nearby and picking me up etc., that leads to heightened arousal (by default, unless the Thought Assessors strongly indicate otherwise), and (2) when I’m in a state of heightened arousal, my brainstem treats it as bad and dangerous (by default, unless the Thought Assessors strongly indicate otherwise). 

  6. ^

    For example, the Steering Subsystem needs a method to distinguish a “transient empathetic simulation” from other transient feelings, e.g. the transient feeling that occurs when I think through the consequences of a possible course of action that I might take. Maybe there are some imperfect heuristics that could do that, but my preferred theory is that there’s a special Thought Assessor trained to fire when attending to another human (based on ground-truth sensory heuristics as discussed in Section 13.4). As another example, we need the “Ground truth in hindsight” signals to not gradually train away the Thought Assessor’s sensitivity to “his stomach is getting punched”. But it seems to me that, if the Steering Subsystem can figure out when a signal is a “transient empathetic simulation”, then it can choose not to send error signals to the Thought Assessors in those cases.

  7. ^

    Warning: I’m not entirely sure that there really is a “standard” definition of empathy; it’s also possible that the term is used in lots of slightly-inconsistent ways.

15 comments

One thing that appears to be missing on the filial imprinting story is a mechanism allowing the "mommy" thought assessor to improve or at least not degrade over time. 

The critical window is quite short, so many characteristics of mommy that may be very useful will not be perceived by the thought assessor in time. I would expect that after it recognizes something as mommy it is still malleable to learn more about what properties mommy has.

For example, after it recognizes mommy based on the vision, it may learn more about what sounds mommy makes, and what smell mommy has. Because these sounds/smells are present when the vision-based mommy signal is present, the thought assessor should update to recognize sound/smell as indicative of mommy as well. This will help the duckling avoid mistaking some other ducks for mommy, and also help the ducklings find their mommy through other non-visual cues (even if the visual cues are what triggers the imprinting to begin with).

I suspect such a mechanism will be present even after the critical period is over. For example, humans sometimes feel emotionally attracted to objects that remind them or have become associated with loved ones. The attachment may be really strong (e.g. when the loved one is dead and only the object is left).

Also, your loved ones change over time, but you keep loving them! In "parental" imprinting for example, the initial imprinting is on the baby-like figure, generating a "my kid" thought assessor associated with the baby-like cues, but these need to change over time as the baby grows. So the "my kid" thought assessor has to continuously learn new properties.

Even more importantly, the learning subsystem is constantly changing, maybe even more than the external cues. If the learned representations change over time as the agent learns, the thought assessors have to keep up and do the same, otherwise their accuracy will slowly degrade over time.

This last part seems quite important for a rapidly learning/improving AGI, as we want the prosocial assessors to be robust to ontological drift. So we both want the AGI to do the initial "symbol-grounding" of desirable proto-traits close to kindness/submissiveness, and also for its steering subsystem to learn more about these concepts over time, so that they "converge" to favoring sensible concepts in an ontologically advanced world-model.

Thanks!

For example, humans…

Just to be clear, I was speculating in that section about filial imprinting in geese, not familial bonding in humans. I presume that those two things are different in lots of important ways. In fact, for all I know, they might have nothing whatsoever in common. ¯\_(ツ)_/¯

(UPDATE: I guess the Westermarck Effect might be implemented in a Section-13.3-like way, although not necessarily.)

If the learned representations change over time as the agent learns, the thought assessors have to keep up and do the same, otherwise their accuracy will slowly degrade over time.

Yeah, that seems possible (although I also consider it possible that it’s not a problem; by analogy, catastrophic forgetting is famously more of an issue for ANNs than for brains).

If the learned representations do in fact change a lot over time, I’m slightly skeptical that it would be possible to solve that problem directly, thanks to the lack of an independent ground truth. For example, I can imagine a system that says “If I’m >95% confident that this is MOMMY, then update such that I’m 100% confident that this is MOMMY.” Maybe that system would work to keep pointing at the real mommy, even as learned representations drift. But also, maybe that system would cause the Thought Assessor to gradually go off the rails and trigger off weird patterns in noise. Not sure. Did you have something like that in mind? Or something different?

An alternative might be that, if the specific filial-imprinting mechanism gradually stops working over time, it deactivates at some point and the (now-adolescent) goose switches to some other mechanism(s), like “desire to be with fellow geese that are extremely familiar to me” a la Section 13.4.

Reminder that I know very little about goose behavior and this is all casual speculation. :)

Ok, so this is definitely not a human thing, so probably a bit of a tangent. One of the topics that came up in a neuroscience class once was goose imprinting. There have apparently been studies (see Eckhard Hess for the early ones) showing that the strength of the imprinting (measured by behavior following the close of the critical period) onto whatever target is related to how much running towards the target the baby geese do. The hand-wavey explanation was something like 'probably this makes sense since if you have to run a lot to keep up with your mother-goose for safety, you'll need a strong mother-goose-following behavioral tendency to keep you safe through early development'.

https://www.apa.org/monitor/2011/12/imprinting 

Mother geese don’t change their appearance much over their lifetime. I doubt that a chick ever needs to update its mommy thought assessor.

The ‘my kid’ thought assessor in humans is easily fooled by puppies and baby rabbits. Spend a large proportion of your waking hours around a cute animal and your brainstem assumes that it is your child.

I feel that the social instincts link to the learned-from-scratch world-model via a chain of guided development windows.
The singular links in the chain are stacks of affective mechanisms: the trigger that detects the environmental stimulus (the moving large object for ducklings), the response (follow that object), and an affect (emotion) that links the instinct to the learned model via a reward signal to strengthen the association (feeling of safety).
As it would be near impossible for the DNA to have a concept of "Rita won a trophy" as the trigger, the system would have to first "teach" the model simpler concepts, and then tag onto those via the affect to be able to trigger later correctly: for example, "Rita" would be identified as a "member of the pack/competition", which would be derived from the concept of "agent". This in turn would have to be first learned via the associations that spring from the early instincts of "pheromones", "human voice", "eyes" etc..
 

These simpler concepts from our early years occur in development windows. F.ex. for the first 8 weeks babies don't focus their gaze on anything, as they are still learning the basics of seeing. After they have a slightly better capacity to predict what they see, the next development window opens, which, among other things, has a filter to detect eyes. For a while the eyes are associated with an "agent" and "safety", hence the babies smile instantly at their parents' faces, while pretty soon this filial imprinting window closes, and they start to cry at the sight of new faces instead.

I have some of these chains of instincts mapped out on an initial level, and am soon trying out these theories within an environment closely resembling OpenAI's gym (the architecture didn't lend itself easily to this new reward paradigm, unfortunately). Maybe they could be discussed further with some interested people?

Also, little glimpse of empathy has some literature under the term mirror neurons.

little glimpse of empathy has some literature under the term mirror neurons

Sorta, but unfortunately the "mirror neuron" literature seems to be a giant dumpster fire. I suggest & endorse the book The Myth Of Mirror Neurons by Hickok. UPDATE: See also my later post Quick notes on "mirror neurons".

When my little one was a newborn he was just as happy being handled by strangers as he was with mum and dad. It was around four months that he started showing a preference for mum and dad and disliking strangers. I’m sure that he could recognise us long before the four month mark though.

Geese need to imprint from birth, whereas there is no immediate need for a baby who is not yet mobile to imprint on its parents. So if babies have an ‘imprinting window’ then it probably occurs later, after a baby has learnt to reliably recognise familiar faces in spite of changes in make-up or clothing.

Aside: Babies prefer to look at faces while still in the womb https://www.lancaster.ac.uk/news/articles/2017/babies-preference-for-faces-begins-before-birth-/.

This is really great, Steve! I'm looking forward to reading more posts and in more detail.

I think I absorbed some of what you're conveying regarding 'little glimpses of empathy', and I was thinking about how I might explain it back.

I wonder if coining two words or phrases might be valuable, and possibly divorcing it from the 'empathy' wording to obviate the need for the disclaimer about the normal use of that word.

One concept, if I understood right, is that there is an 'involuntary other-modelling' (?) occurring when we observe facts relating to someone else that, if related to us, would make us feel a certain way. This claim stands on its own, remaining agnostic about the source of these signals.

The complementary (and more tentative?) claim is that 'involuntary other-modelling' is produced as an automatic consequence of 'relatee-wise generalisation' (?) in the intra-lifetime world model ('stomach punch my' vs 'stomach punch his'), perhaps coupled with other hard-coded signals. I think you might distinguish this claim more clearly if you had a second term.

The first thing is like a 'type/shape' (of signal/structure). The second claim is more like pointing to an instance of that type.

Am I reading right, and are these useful suggestions? I think I have a way to go before I fully grok your broader model.

Thanks for the comment!

I didn't think too hard about terminology and am open to brainstorming.

I'm concerned that the word “modeling” misses one of the important points. “Model” suggests “predictive model”; I think it’s possible (at least in principle, and probably in practice) to “model” a person in a way that is wholly disconnected from your suite of visceral reactions, just like you can “model” how a car engine works.

Instead, I would start with what you said, “when we observe facts relating to someone else that, if related to us, would make us feel a certain way”, but then add “…while actually activating those same ‘feelings’ in our own head”. Well, at least that would be closer. And I used the word “empathy” to convey that second part, I think.

I guess what you call “involuntary other-modeling” is what I call “a little glimpse of empathy”, and what you call “relatee-wise generalization” is what I'd call “the main (or only?) reason why the ‘little glimpse of empathy’ occurs”. But sorry if I'm misunderstanding.

I guess what you call “involuntary other-modeling” is what I call “a little glimpse of empathy”, and what you call “relatee-wise generalization” is what I'd call “the main (or only?) reason why the ‘little glimpse of empathy’ occurs”. But sorry if I'm misunderstanding.

Ok excellent, this is a succinct version of what I was getting from your original post, and is what my comment was trying to confirm. Thank you.

“relatee-wise generalization” is what I'd call “the main (or only?) reason why the ‘little glimpse of empathy’ occurs”

Right, and to me this seems like an important distinct claim. I think I understood from your original post that these were somewhat separate claims, but I guess my response is to advocate making that distinction as clear as possible, perhaps by coining some extra term(s) - because I think different evidence is required to precede them, and different conclusions follow from them.

(I suppose I should point out that the second claim, depending on the degree of 'main (or only?)', seems a lot bolder i.e. I require more convincing. Like, there might be substantial hardcoded circuitry which puts this stuff in, rather than it falling out of relatee-wise generalisation. But then again I can viscerally feel empathy for a hypothetical, or for obviously-non-kin animals, or whatnot, so this could be right.)

Thanks.

Like, there might be substantial hardcoded circuitry which puts this stuff in, rather than it falling out of relatee-wise generalisation.

I think this is tied up with learning-from-scratch. “Relatee-wise generalisation” is compatible with learning-from-scratch, and I can't currently see any other option that's compatible with learning-from-scratch. Can you? I'm not sure what you mean by “hardcoded circuitry”.

Then someone might say: “Yeah but if we throw out learning-from-scratch, then look at all these other possible ways that social instincts might work!” But I'm currently strongly disinclined to throw out learning-from-scratch, because I have a lot of other reasons for believing it.

So the premise of this post is something like “Is there any plausible explanation for social instincts that's compatible with Posts #2–#7, and especially with the learning-from-scratch discussion in Post #2?” (That’s the “symbol grounding” thing of Section 13.2.2, see also the post title.) If yes, then I’d be willing to bet that that explanation for social instincts is the correct one, and I would want to prioritize fleshing it out and testing it. If no, then oops, guess I better throw out Posts #2–#7!!

Still working my way through reading this series--it is the best thing I have read in quite a while and I'm very grateful you wrote it!

I feel like I agree with your take on "little glimpses of empathy" 100%.

I think fear of strangers could be implemented without a steering subsystem circuit maybe? (Should say up front I don't know more about developmental psychology/neuroscience than you do, but here's my 2c anyway). Put aside whether there's another more basic steering subsystem circuit for agency detection; we know that pretty early on, through some combination of instinct and learning from scratch, young humans and many animals learn there are agents in the world who move in ways that don't conform to the simple rules of physics they are learning. These agents seem to have internally driven and unpredictable behavior, in the sense their movement can't be predicted by simple rules like "objects tend to move to the ground unless something stops them" or "objects continue to maintain their momentum". It seems like a young human could learn an awful lot of that from scratch, and even develop (in their thought generator) a concept of an agent. 

Because of their unpredictability, agent concepts in the thought generator would be linked to thought assessor systems related to both reward and fear; not necessarily from prior learning derived from specific rewarding and fearful experiences, but simply because, as their behavior can't be predicted with intuitive physics, there remains a very wide prior on what will happen when an agent is present.

In that sense, when a neocortex is first formed, most things in the world are unpredictable to it, and an optimally tuned thought generator+assessor would keep circuits active for both reward and harm. Over time, as the thought generator learns folk physics, most physical objects can be predicted, and it typically generates thoughts in line with their actual behavior. But agents are a real wildcard: their behavior can't be predicted by folk physics, and so they are perceived in a way that every other object in the world used to be: unpredictable, and thus continually predicting both reward and harm in an opponent process that leads to an ambivalent and uneasy neutral. This story predicts that individual differences in reward and threat sensitivity would particularly govern the default reward/threat balance for otherwise unknown items. It might (I'm really REALLY reaching here) help to explain why attachment styles seem so fundamentally tied to basic reward and threat sensitivity.

As the thought generator forms more concepts about agents, it might even learn that agents can be classified with remarkable predictive power into "friend" or "foe" categories, or perhaps "mommy/carer" and "predator" categories. As a consequence of how rocks behave (with complete indifference towards small children), it's not so easy to predict behavior of, say, falling rocks with "friend" or "foe" categories. On the contrary, agents around a child are often not indifferent to children, making it simple for the child to predict whether favorable things will happen around any particular agent by classifying agents into "carer" or "predator" categories. These categories can be entirely learned; clusters of neurons in the thought generator that connect to reward and threat systems in the steering system and/or thought assessor. So then the primary task of learning to predict agents is simply whether good things or bad things happen around the agent, as judged by the steering system.

This story would also predict that, before the predictive power of categorizing agents into "friend" vs. "foe" categories has been learned, children wouldn't know to place agents into these categories. They'd take longer to learn whether an agent is trustworthy or not, particularly so if they haven't learned what an agent is yet. As they grow older, they get more comfortable with classifying agents into "friend" or "foe" categories and would need fewer exemplars to learn to trust (or distrust!) a particular agent.

Your "little glimpses of X" are probably closely related to Microexpressions - they are practically what shows externally - probably what leaks over to muscles. 

Hi Steve, loved this post! I've been interested in viewing the steering and thought generator + assessor submodule framework as the object and generator-of-values which we want AI to learn a good pointer to/representation of, to simulate out the complex+emergent human values and properly value extrapolate.

I know the way I'm thinking about the following doesn't sit quite right with your perspective, because AFAIK, you don't believe there need to be independent, modular value systems that give their own reward signals for different things (your steering subsystem and thought generator and assessor subsystem are working in tandem to produce a singular reward signal). I'd be interested in hearing your thoughts on what seems more realistic, after importing my model of value generators as more distinctive and independent modular systems in the brain.

In the past week, I've been thinking about the potential importance of considering human value generators as modular subsystems (for both compute and reward). Consider the possibility that at various stages of the evolutionary neurocircuitry-shaping timeline of humans, that modular and independently developed subsystems developed. E.g. one of the first systems, some "reptilian" vibe system, was one that rewarded sugary stuff because it was a good proxy at the time for nutritious/calorie-dense foods that help with survival. And then down the line, there was another system that developed to reward feeling high-social status, because it was a good proxy at the time for surviving as social animals in in-group tribal environments. What things would you critique about this view, and how would you fit similar core-gears into your model of the human value generating system?

I'm considering value generators as more independent and modular, because (this gets into a philosophical domain but) perhaps we want powerful optimizers to apply optimization pressure not towards the human values generated by our wholistic-reward-system, but to ones generated by specific subsystems (system 2, higher-order values, cognitive/executive control reward system) instead of reptilian hedon-maximizing system. 

This is a few-day old, extremely crude and rough-around-the-edges idea, but I'd especially appreciate your input and critiques on this view. If it were promising enough, I wonder if (inspired by John Wentworth's evolution of modularity post) training agents in a huge MMO environment and switching up reward signals in the environment (or the environment distribution itself) every few generations would lead to a development of modular reward systems (mimicking the trajectory of value generator systems developing in humans over the evolutionary timeline). 

you don't believe there need to be independent, modular value systems that give their own reward signals for different things (your steering subsystem and thought generator and assessor subsystem are working in tandem to produce a singular reward signal)

If I'm deciding between sitting on the couch vs going to the gym, at the end of the day, my brain needs to do one thing versus another. The different considerations need to be weighed against each other to produce a final answer somehow, right? A “singular reward signal” is one solution to that problem. I haven't heard any other solution that makes sense to me.

That said, we could view a “will lead to food?” Thought Assessor as a “independent, modular value system” of sorts, and likewise with the other Thought Assessors. (I’m not sure that’s a helpful view, it’s also misleading in some ways, I think.)

(I would call a Thought Assessor a kind of “value function”, in the RL sense. You also talk about “value systems” and “value generators”, and I’m not sure what those mean.)

What things would you critique about this view

Similar to above: if we’re building a behavior controller, we need to decide whether or not to switch behaviors at any given time, and that requires holistic consideration of the behavior’s impact on every aspect of the organism’s well-being. See § 6.5.3 where I suggest that even the run-and-tumble algorithm of a bacterium might plausibly combine food, toxins, temperature, etc. into a single metric of how-am-I-doing-right-now, whose time-derivative in turn determines the probability of tumbling. (To be clear, I don’t know much about bacteria, this is theoretical speculation.) Can you think of a way for a mobile bacterium to simultaneously avoid toxins and seek out food, that doesn't involve combining toxin-measurement and food-measurement into a single overall environmental-quality metric? I can’t.
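As a toy version of what I mean by “a single metric” (the weights and numbers are invented):

```python
def run_and_tumble_step(food: float, toxin: float, temp_ok: float,
                        prev_quality: float, base_rate: float = 0.5):
    # Fold several measurements into one "how am I doing right now?" metric, then
    # tumble more often when that metric is falling and less often when it's rising.
    quality = 1.0 * food - 2.0 * toxin + 0.5 * temp_ok
    p_tumble = min(1.0, max(0.0, base_rate - (quality - prev_quality)))
    return p_tumble, quality  # feed `quality` back in as prev_quality next step
```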

If you want your AGI to split its time among several drives, I don’t think that’s incompatible with “singular reward signal”. You could set up the reward function to have diminishing returns to satisfying each drive, for example. Like, if my reward is log(eating) + log(social status), I'll almost definitely wind up spending time on each, I think.