TL;DR: Since humans built our own self-models from the same circuits which evolved to model other apes, our empathy is switched on by default for anything that we model as similar to ourselves. Specifically this is because our model of others' emotions is "writing to the same memory address" as circuits which read off our own emotions. This also happens to be game-theoretically advantageous, so it's stuck around.
"Because the way people are built, Hermione, the way people are built to feel inside ... is that they hurt when they see their friends hurting. Someone inside their circle of concern, a member of their own tribe. That feeling has an off-switch, an off-switch labeled 'enemy' or 'foreigner' or sometimes just 'stranger'."
– Harry, HPMOR
The neuroscientific consensus strongly supports an "off switch" architecture for affective empathy—shared neural substrates automatically activated unless inhibited
– Claude Research I ran today
I always wondered whether that line in HPMOR was correct; I guessed it probably was. Now, thanks to LLMs, searching the academic literature is no longer the special kind of hell it once was, and the search seems to confirm that Harry's explanation is basically correct.[1] I don't know whether Eliezer genuinely put a bunch of thought into Harry's description of empathy, or whether it just shook out that way (though I suspect the former).
More importantly, I don't know why we'd expect it to be this way! Why would we have an always-on empathy mechanism? That sounds really tiring. It's also not the only way we do things. Affective empathy (i.e. when we feel what others are feeling) is default-on, while cognitive empathy (i.e. the ability to consciously understand what others are thinking and feeling) isn't! What the hell is going on?
Feel free to skip to "Modelling Yourself" if the sentence "We evolved to predictively model the world, re-used those circuits for reinforcement learning, then evolved to model others' behaviours as well" already makes sense to you.
The most basic kind of brain you can have just links inputs to outputs: if a large creature swims towards you, swim away; if you smell food, swim towards it; if you see a potential mate, initiate a courtship dance. These sorts of things can be programmed in by evolution, and on some level they're all a model of the world, of the form "If I perceive X then the course of action which maximizes my expected inclusive genetic fitness is Y."
Now suppose there's lots of different large creatures which might want to eat you. It would be good to learn what all of them look like, but therein lies a problem: there might be a dozen different predators in your environment. It's quite inefficient to learn all of them through evolution, and it's also non-robust to new predators showing up, so you might instead want a completely different kind of model. I suggest one which looks like this: "Predict what's going to happen in five seconds' time. If that thing would cause you to take a specific action, then take that action now."
This has some disadvantages: it requires you to see the predator several times (and have it swim towards you several times) before you learn to swim away when it's not spotted you yet. You might get eaten before you figure out what's going on. But it's not likely to be much worse than the naive policy above.
So we can build a system out of primitive instincts + sensory prediction, which can learn to respond to arbitrary stimuli. This is called "classical conditioning." It works on all manner of things (I once saw some people perform classical conditioning on snails, with success).
Analogy: imagine you're running a radio show, and you want to make sure no swear words go out on air. You run your speaker's words into an LLM, and if it predicts a swear word is coming up next, you pre-emptively bleep them.
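To make that concrete, here's a minimal Python sketch of the predict-then-act-on-the-prediction trick (the stimulus names, the hard-wired reflex table, and the counting-based predictor are all illustrative assumptions, not claims about real neural circuitry):

```python
from collections import defaultdict

REFLEXES = {"large_shape_close": "swim_away"}   # hard-wired stimulus -> action

# Learned predictor: counts of "stimulus A was followed by stimulus B".
transition_counts = defaultdict(lambda: defaultdict(int))

def observe(previous_stimulus, current_stimulus):
    """Learning step: remember what tends to follow what."""
    transition_counts[previous_stimulus][current_stimulus] += 1

def act(current_stimulus):
    """Fire the reflex, or fire it early if the *predicted* next stimulus would trigger it."""
    if current_stimulus in REFLEXES:
        return REFLEXES[current_stimulus]
    predictions = transition_counts[current_stimulus]
    if predictions:
        predicted = max(predictions, key=predictions.get)
        if predicted in REFLEXES:
            return REFLEXES[predicted]          # respond now to what you expect in five seconds
    return "carry_on"

# After the striped fish has preceded a looming shape a few times...
for _ in range(3):
    observe("striped_fish", "large_shape_close")
print(act("striped_fish"))                      # -> "swim_away"
```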
We need to pull off a little trick to scale up the brain. Suppose we want to be a better predictor. If we just stick more layers of neurons into the prediction loop, then the loop will take longer to complete each time. This is unfortunate!
Instead, you can pull some crazy shit to stack predictive layers on top of each other in a different way. Since each layer runs its predictions concurrently, the fast input-to-output loop doesn't get any longer. This is called hierarchical predictive coding and isn't too important for our model, but it is important to mention because it is the Rolls-Royce of models of human cognition.
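A very loose sketch of the stacking trick, if it helps (this is my toy framing, not a faithful predictive-coding implementation): each layer only looks at what the layer below produced on the previous tick, so every layer could update at the same time and the fast loop never gets longer.

```python
import numpy as np

class Layer:
    def __init__(self, size):
        self.estimate = np.zeros(size)    # this layer's prediction of the layer below
        self.last_error = np.zeros(size)  # the prediction error it passes upward

    def update(self, signal_from_below, lr=0.1):
        error = signal_from_below - self.estimate
        self.estimate += lr * error       # move toward what the layer below actually did
        self.last_error = error

def tick(layers, sensory_input):
    # Gather each layer's input from the *previous* tick before anyone updates,
    # so all the updates are independent and could in principle run concurrently.
    inputs = [sensory_input] + [layer.last_error for layer in layers[:-1]]
    for layer, signal in zip(layers, inputs):
        layer.update(signal)

layers = [Layer(4) for _ in range(3)]
for _ in range(10):
    tick(layers, np.ones(4))              # the bottom layer converges on the constant input
```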
What if we want to learn arbitrary behaviours? Turns out that pretty much all animals can do this as well, but how? For a start, we need to invoke the concept of reinforcement. The animal needs to have a circuit which says "That thing you just did, do more of it." Common reinforcement triggers are food, sex, and drugs.
But secondly, how do we add this to our animal? There's a "so dumb it might just work" way: at each step, predict your next action, as well as your next sensory inputs. Instead of training the action-predicting circuits to "predict" accurately, treat them as having got the answer right whenever the action leads to a reinforcement trigger. This lets you learn single action -> reinforcement associations.
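As a sketch (the stimuli, actions, and exploration rule are all made up for illustration):

```python
import random
from collections import defaultdict

action_strength = defaultdict(lambda: defaultdict(float))   # stimulus -> action -> weight
ACTIONS = ["approach", "ignore", "flee"]

def choose_action(stimulus, explore=0.1):
    """Predict (i.e. emit) the action most strongly associated with this stimulus."""
    if random.random() < explore or not action_strength[stimulus]:
        return random.choice(ACTIONS)
    return max(action_strength[stimulus], key=action_strength[stimulus].get)

def learn(stimulus, action, reinforced):
    # Only grade the action-"prediction" as correct when a reinforcement trigger fires.
    if reinforced:
        action_strength[stimulus][action] += 1.0

# Approaching the smell of food gets reinforced, so it gets predicted-and-done more often.
for _ in range(20):
    a = choose_action("smell_of_food")
    learn("smell_of_food", a, reinforced=(a == "approach"))
print(choose_action("smell_of_food", explore=0.0))           # almost certainly "approach"
```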
What if you want to learn associations over longer timesteps? Well, you can let yourself be reinforced in cases where you don't get the stimulus (yet) if your predictive model merely expects that the reinforcement stimulus is coming soon. This is how, once you've associated walking into McDonald's with getting food or talking to someone hot with getting sex, you can be reinforced for the actions that led you to those intermediate steps.
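This "be reinforced by an expected reward" rule is essentially temporal-difference learning; a sketch with made-up stimuli and arbitrary constants:

```python
from collections import defaultdict

value = defaultdict(float)      # stimulus -> how strongly it predicts reward coming soon
ALPHA, GAMMA = 0.2, 0.9         # learning rate, discount for "soon"

def td_update(stimulus, reward, next_stimulus):
    # The reinforcement signal is the real reward plus the *predicted* upcoming reward.
    target = reward + GAMMA * value[next_stimulus]
    value[stimulus] += ALPHA * (target - value[stimulus])

# Being inside McDonald's reliably precedes food, so its value creeps up...
for _ in range(50):
    td_update("inside_mcdonalds", reward=1.0, next_stimulus="terminal")
    # ...and then merely seeing the golden arches gets reinforced by that prediction.
    td_update("see_golden_arches", reward=0.0, next_stimulus="inside_mcdonalds")

print(round(value["see_golden_arches"], 2))   # > 0, despite never being directly rewarded
```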
Another approach is just to let the reinforcement trigger reinforce every action you've taken in the past half-hour or so, which is kinda how lots of LLM RL works.
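That cruder version looks something like this (the window size and names are arbitrary):

```python
from collections import deque, defaultdict

recent = deque(maxlen=30)           # stand-in for "everything from the past half-hour"
weights = defaultdict(float)        # (stimulus, action) -> strength

def record(stimulus, action):
    recent.append((stimulus, action))

def on_reward(amount=1.0):
    for pair in recent:
        weights[pair] += amount     # blanket credit assignment, deserved or not

record("see_golden_arches", "walk_in")
record("at_counter", "order_food")
on_reward()                         # both pairs get strengthened
```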
You can also build a planning system, where you make lots of possible predictions of what you're going to do, and what will happen in response. Then you can up-weight stimulus:action pairs which you expect to produce reinforcing stimuli. This is as far as we need to go for now.
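A sketch of that planning loop, assuming you already have a learned transition_model(state, action) -> next_state and an expected_reward(state) estimate (both names are my own placeholders):

```python
def plan(stimulus, actions, transition_model, expected_reward, depth=3):
    """Pick the immediate action whose best imagined future looks most rewarding."""
    def best_future(state, remaining):
        if remaining == 0:
            return 0.0
        return max(expected_reward(transition_model(state, a))
                   + best_future(transition_model(state, a), remaining - 1)
                   for a in actions)

    def score(action):
        imagined = transition_model(stimulus, action)            # predicted consequence
        return expected_reward(imagined) + best_future(imagined, depth - 1)

    return max(actions, key=score)   # the stimulus:action pair to up-weight
```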
Why did I go through the entire history of the evolution of brains in so much detail? I want you to have a solid model for what I mean by "modelling" when I say that humans evolved to model other humans. When you live with a bunch of other apes, and those apes are involved with decisions like "who gets how much food", then those apes rapidly become the most important things for you to model.
Questions like "Thag did not find many berries today, if I offer him an extra share of mammoth, will he do the same when I don't find many berries?" really are life-or-death, as are questions like "If I convince Grug to work with me, can we kick out Borm and rule the tribe ourselves?".
These skills relate to cognitive empathy, which is the ability to think "What would I do in X's position?"
But there's one more ape you'll want to model: yourself. Your brain is full of a bunch of working-memory stuff, which (I suspect; this is where I start to speculate) is similar enough to a stream of sensory information that the brain goes "Oh yeah, we should try to predict this as well." This goes beyond modelling your own actions: you're modelling your own thoughts and cognition.
If you don't already have good ape-modelling circuits, you just won't be able to succeed at this task. Most animals have pretty piss-poor self-reflection capabilities. But if you're already hyper-evolved to model other apes, you might have a shot at it.
Self-reflection on this level is also really powerful. I will assert without proof that it's core to many of the ways in which we can become more rational. I also suspect that it does an important job of automatically rational-ifying beliefs, but that we don't notice this because it goes on in the background, and the errors that this catches are too silly for even the dumbest humans to make.
This can then tie into the same re-purposed predictive circuits which "predict" our own actions.
So the circuits which you evolved to predict other apes are being used to predict yourself. We also know that your brain runs on a rule which looks a bit like "If you predict that you'll do X, do X."
To use a computing analogy: I suggest that the circuit which is doing "Predict if Thag will be happy" writes to the same memory address that stores your predictions for your own emotions. Then, the circuit "If you predict that you'll be happy, be happy" reads from that memory address. This only stops happening if we put in a flag which says "Nope, don't read from this, this isn't for us!".
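In toy-code form (this is just the analogy restated, not a model of any actual neural circuit):

```python
predicted_emotion = {}          # the shared "memory address"
suppressed_sources = set()      # the off-switch: sources flagged as not-us

def model_other(person, emotion, label=None):
    """Cognitive empathy: write a prediction of someone else's emotion."""
    predicted_emotion["value"] = emotion
    predicted_emotion["source"] = person
    if label in {"enemy", "foreigner", "stranger"}:
        suppressed_sources.add(person)

def feel_what_you_predict():
    """Affective empathy: read the shared address and actually feel it, unless flagged."""
    if predicted_emotion.get("source") in suppressed_sources:
        return None                               # the off-switch fired
    return predicted_emotion.get("value")         # default-on: feel the predicted emotion

model_other("Thag", "happy")
print(feel_what_you_predict())                    # -> "happy"
model_other("Borm", "angry", label="enemy")
print(feel_what_you_predict())                    # -> None
```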
It also turns out that affective empathy is useful for working together. Really useful. The rule "be nice to people around you, unless..." is essentially Tit-for-Tat from the iterated Prisoner's Dilemma, which is one of the two canonical strategies (the other being "Pavlov", which is more effective in some situations but also more cognitively taxing to implement). So once evolution has stumbled into this region of brain-space, it has reasons to stay there.
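For reference, the two strategies in a toy iterated Prisoner's Dilemma (standard payoffs; "C" = cooperate, "D" = defect):

```python
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_history, their_history):
    # Cooperate first, then copy the opponent's last move.
    return "C" if not their_history else their_history[-1]

def pavlov(my_history, their_history):
    # Win-stay, lose-shift: repeat your last move if it paid well, otherwise switch.
    # (Note it has to track payoffs, not just the opponent's last move: more taxing.)
    if not my_history:
        return "C"
    last_payoff = PAYOFF[(my_history[-1], their_history[-1])][0]
    return my_history[-1] if last_payoff >= 3 else ("D" if my_history[-1] == "C" else "C")

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strategy_a(hist_a, hist_b), strategy_b(hist_b, hist_a)
        pa, pb = PAYOFF[(a, b)]
        hist_a.append(a); hist_b.append(b)
        score_a += pa; score_b += pb
    return score_a, score_b

print(play(tit_for_tat, pavlov))   # they cooperate throughout: (30, 30)
```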
Cognitive empathy is generally thought to be default-off with an on switch. If cognitive empathy is upstream of affective empathy, then why is affective empathy default-on?
Also, what's up with the predictive circuits which "predict" our next actions? If they're fully re-wired to run only on reinforcement signals, and not correct predictions, then can they really be affected by our self-model? Maybe this works because the bottom-up information comes from reinforcement signals, while the top-down information comes from our higher-level self-model.
Our brains are (empirically) functional learning machines, which probably means they're implementing a complexity penalty + Bayesian updates. Having a stronger complexity penalty on your self-model probably leads you to model yourself as being describable with relatively little information. Again, I assert without proof that this leads to a process which makes you act more like a utility maximizer over time, since utility maximizers are fairly easy to describe.
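One standard way to write "complexity penalty + Bayesian updates" (textbook MDL/Solomonoff-style notation, not something the brain literally computes): the weight on a self-model m after seeing data d is

$$P(m \mid d) \;\propto\; 2^{-K(m)} \, P(d \mid m),$$

where K(m) is the description length of m. A stronger complexity penalty means a steeper discount on long descriptions, so short, tidy self-models (like "this agent maximizes such-and-such") win out faster.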
Can we express all of this using a framework like Garrabrant Induction? Suppose each observation resolves a set of questions of the form P(at time t, we observe o). The market will get good at predicting the next steps pretty well. We can then start asking questions like P(at time t+1 we observe o, given that at time t we do a). If the generator implements dynamics which do actions automatically, then we can start to predict our own actions ahead of time. If we also add a dynamic to read off P(at time t we do a) and choose actions that way, we have classical conditioning.

We can then start to ask questions like P(at time t+1 we observe o and get reward r, given that at time t we do a), which the market should deal with pretty well. By implementing some more dynamics which choose actions with high expected future reward, we can get to operant conditioning. We can also resolve otherwise-predictive shares according to whether those actions "did a good job".
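To show the flavour (this is a toy wealth-weighted market, not a real Garrabrant inductor): traders bet on P(reward | action), a crude dynamic reads off the prices to choose actions, and wealth flows toward good predictors.

```python
import random

class Trader:
    def __init__(self, belief):
        self.belief = belief      # this trader's fixed estimate of P(reward | "approach")
        self.wealth = 1.0

    def price(self, action):
        # Everyone agrees "ignore" rarely pays; they disagree about "approach".
        return self.belief if action == "approach" else 0.1

def market_price(traders, action):
    total = sum(t.wealth for t in traders)
    return sum(t.wealth * t.price(action) for t in traders) / total

def step(traders, true_p=0.8):
    # The action-choosing dynamic: do whatever the market prices as most likely to pay off.
    action = max(["approach", "ignore"], key=lambda a: market_price(traders, a))
    reward = 1.0 if (action == "approach" and random.random() < true_p) else 0.0
    # Settle the shares: traders who priced this outcome well grow their wealth faster.
    for t in traders:
        t.wealth *= 1.0 + 0.1 * (1.0 - abs(reward - t.price(action)))
    return action, reward

traders = [Trader(belief=b) for b in (0.1, 0.5, 0.9)]
for _ in range(200):
    step(traders)
print(max(traders, key=lambda t: t.wealth).belief)   # almost always 0.9, the best-calibrated trader
```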
Then, when we introduce other apes into the equation, we'll end up with traders who are good at predicting apes' behaviour. Some of these will also (by chance) end up trying to predict our own behaviour, and if they do a good job they'll get richer than the ones that don't try this.
I don't think this system obviously ends up doing any kind of empathy by default.
(Garrabrant Inductors are known to be kind of self-reflective: the traders can trade on shares in propositions of the form P([Market with this description] will value [proposition] at [price] at [timestep]), which might let the inductor do some other cool stuff.)
This also doesn't actually help us understand agency; it's just pushing the problem around. The atoms of this system are purely predictive traders, and the only way we become a smart agent is if some of those traders implement agents (or models of agents); and since the traders cover basically the whole space of efficiently computable functions, some will. I still think it's cool though.
https://pmc.ncbi.nlm.nih.gov/articles/PMC2206036/ proposes a model of empathy in which "The lower level, which is automatically activated (unless inhibited) by perceptual input, accounts for emotion sharing." This work is based on investigations of patients with brain injuries, which is my favourite way of studying the brain.
https://pubmed.ncbi.nlm.nih.gov/16998603/ same guy, same point.
https://pmc.ncbi.nlm.nih.gov/articles/PMC3524680/ talks about "mirror neurons" which are over-hyped as a mechanistic explanation for empathy, but are totally fine as evidence for some level of automatic processing. In humans, they also probably aren't individual neurons like in the monkey studies.