Disclaimer: I don't have a background in alignment research or reinforcement learning, and I don't think any of the ideas discussed here are new, but they might be interesting to some.

A recent post suggested that humans provide an untapped wealth of evidence about alignment. I strongly agree with that and I found it interesting to think about how nature can ensure that a mother's values are aligned with the well-being of her children.

One particular thing this made me realize is that there are two very different reasons why an agentic AGI might behave in a way that we would characterize as misaligned:

1. The reward function was poorly specified, and the model didn't learn the values we wanted it to learn
2. The model did learn the values we wanted it to learn, but those values led it to conclusions and actions that we did not anticipate

Let's go back to thinking about how evolution aligns mothers with their children. If we could get an AGI to love all humans like most (mammalian) mothers love their children, it might not necessarily solve the alignment problem, but it would be far better than having an AGI which doesn't care much about humans.
We can think of humans as reinforcement learning agents with a range of different goals, motivations and desires that are in one way or another all instrumental for the purpose of survival and reproduction in an environment similar to the one we may have found ourselves in 10,000 years ago. Some of those goals represent pretty simple values like avoiding pain or maintaining a certain blood sugar level, but others represent more complex values, like ensuring the well-being of our children. The biological link between pain or food intake and a reward signal like dopamine can be quite simple, but how is something like "care for your children" encoded as a value in a way that generalizes to out-of-distribution environments?

Two examples of misaligned AGIs that have been discussed around here are a system that is supposed to prevent a diamond from being stolen but can be tricked by placing an image of the diamond in front of the security camera, and a strawberry-picking robot that was supposed to learn to put strawberries in a bucket but instead learned to fling red things at light sources.
Alignment failures of that sort do occur in nature. In some birds, simple visual and auditory cues trigger the behavior of feeding chicks, and the absence of a single cue can lead a mother bird not to feed her chick even if it is otherwise healthy. And then there are poorly disguised brood parasites like cuckoos, which do trigger a feeding response.

Humans seem to be more robust to alignment failures like that. At least the absence of any single sensory cue will not stop a mother from caring for her child. I think the reason why "care about your children" is perhaps more robustly instilled in humans than in some non-mammalian species is that there is a range of different sensory cues, both simple and complex, that trigger dopamine responses in a human mother. "I love my child" might be a parsimonious emotional and cognitive concept that naturally forms in the presence of different reward signals that are triggered by different aspects of caring for a child or being near a child. I think there are at least two factors that make it more likely that this goal is learned robustly:
1. Multiple independent sensory cues (visual, auditory, olfactory), all associated with being near her child or perceiving her child's well-being, that lead to dopamine responses.
2. Dopamine responses may not only be triggered by very simple sensory cues, they could also be triggered by higher-level abstractions that are informative about the child's well-being. Maybe dopamine is released when a mother sees her child smile. This is not particularly high-level, but more so than detecting pheromones, for example, since it requires the visual cortex to interpret facial expressions and form a concept like "smiling". There is no reason why dopamine responses could not be triggered by even higher-level abstractions that are indicative of her child's well-being. Once her child is able to speak, the dopamine response might be directly triggered by her child verbally expressing that everything is OK. This may not be how affection is formed in humans (it takes a while before children learn to talk), but an artificial reinforcement learner might learn to care about humans through reward signals that are triggered by humans expressing happiness.
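As a toy illustration of point 1 (the cue names and weights below are invented, not a claim about biology or any actual training setup), a reward assembled from several independent cues degrades gracefully when any single cue is missing or spoofed:

```python
# Toy sketch: a reward built from several independent well-being cues.
# Because no single cue carries all the weight, spoofing or losing one
# cue (like the birds' single missing trigger) only dents the reward.

def multi_cue_reward(cues):
    """Combine noisy well-being cues (each in [0, 1]) into one scalar reward."""
    weights = {              # illustrative weights, not empirical values
        "proximity": 0.2,    # simple cue: child is nearby
        "scent": 0.2,        # simple cue: olfactory recognition
        "smile": 0.3,        # mid-level cue: visual system detects a smile
        "verbal_ok": 0.3,    # high-level cue: child says everything is OK
    }
    return sum(w * cues.get(name, 0.0) for name, w in weights.items())

full = multi_cue_reward({"proximity": 1, "scent": 1, "smile": 1, "verbal_ok": 1})
spoofed = multi_cue_reward({"smile": 1})  # only a faked smile, nothing else
```

A learner trained against a composite signal like this has less incentive to latch onto any one trigger, since a single spoofed cue earns only a fraction of the reward that the full set of agreeing cues would.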

So I think it's possible that a value like "care for your child" could form either through associating a reward signal with a high-level expression of well-being, or with a bunch of different low-level correlates of the child's well-being. In the second case, the concept "care for your child" might emerge as a natural abstraction that unites the different reward triggers.


Going back to the two different ways in which an AGI might behave in a way that we would characterize as misaligned: I think by using a range of different reward signals that all circle around different aspects of human well-being, it should be possible to get an AGI to learn to genuinely care about humans (rather than about reward itself). That still leaves the problem that even a benevolent, all-powerful AGI might take actions that seem bad in the context of our current values:

Imagine a mother who takes her daughter to a doctor to get vaccinated, but the daughter protests because she does not want to get pinched by a needle. From the daughter's point of view, this could constitute an alignment failure. Would a mother still insist on the vaccination if her only reward trigger had been seeing her daughter smile? It might depend on her time horizon. Does she maximize the number of smiles expected in the next minute, or the number of smiles expected across the course of her child's lifetime?
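In reinforcement-learning terms, the time horizon corresponds to a discount factor: an agent maximizing the discounted return (the sum of gamma**t * r_t) weighs the future more heavily as gamma approaches 1. A toy calculation (all reward numbers are invented purely for illustration) shows how the preferred action can flip with the horizon:

```python
# Toy sketch: the same two actions, ranked under different discount factors.
# Rewards are "smiles per step"; a negative reward stands for crying.

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

vaccinate = [-1.0] + [1.0] * 20   # crying now, a healthy, smiling child later
skip = [1.0] + [0.5] * 20         # a smile now, fewer smiles down the line

short_horizon = 0.1   # "smiles in the next minute": the future barely counts
long_horizon = 0.99   # "smiles across a lifetime": the future dominates
```

Under the myopic discount the needle's immediate cost dominates and skipping scores higher; with gamma near 1 the vaccination's long tail of future smiles wins out.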

In this example, most people would probably say it's good that the mother's longer time horizon overrides the child's short-term preferences. But we might not want an AGI to optimize over very long time horizons. Utilitarianism and consequentialist reasoning with long time horizons are generally pretty accepted frameworks around here, at least more so than among the general public; but at the same time the consensus opinion seems to be that it would be bad if an all-powerful AGI took far-reaching actions that are inscrutable to us and that seem bad at face value, even if we knew that the AGI intrinsically cares about us and has our best interests in mind.
I don't think an aligned AGI would necessarily choose to optimize over very long time horizons and make decisions we can't understand; we might even be able to instill shorter time horizons and legibility as two of its values. My point is that there is an important difference between objectively bad actions taken by an all-powerful AGI with misspecified values, and seemingly bad actions taken by an all-powerful, but benevolent AGI that we may fail to comprehend.

If we put our fate in the hands of an all-powerful AGI, it seems unavoidable that our world will change very drastically. Many of these changes will likely appear bad according to our current, relatively poorly defined values. The very concept of what it means to be human might become meaningless once brain-machine interfaces get better. The best we can probably aim for is that the hypothetical all-powerful AGI that brings these changes about cares about us (and indirectly about the things we care about) in a way that is somewhat analogous to the way in which parents care about their children.[1] Nature provides us with examples of how complex values like caring for your children can be instilled in ways that make these values more or less robust to changing environments.
 

  1. ^

    Children caring for their elderly parents might be a more accurate but less optimistic analogy


Agreed... But I think I disagree with some of the connotations.

How "natural" do you think "natural abstractions" are?

The notion of love is a pretty great abstraction (I say as a human), and every person in my culture has pretty reliably hit on it, or at least close enough that we can all talk to each other. So it's at least somewhat natural - we can talk about the same thing because love is an abstraction that is particularly well-suited for explaining both of our environments.

But do all human cultures have the same concept of love? They probably have something similar, at least. But how different can it get? People in different cultures live in different environments, and interact with the world in different ways than I do, which means that they might find different abstractions best for thinking about their environment.

This variability is what I mean by "how natural are natural abstractions?" Everyone (human or AI) is interacting with the same universe, but we have different local environments, and different ways of interacting with those environments. A lot of the abstractions I care about are also found by nearby fellow humans, but is that because they're so "natural" that they're also found by humans in very different parts of the world, and in AIs that interact with the world in a very different way than me? Or are there a lot of things I care about that aren't that "natural," and might be learned differently by different humans/nonhumans?

This is well written, easy to understand, and I largely agree that instilling a value like love for humans in general (as individuals) could deal with an awful lot of failure modes. It does so amongst humans already (though far from perfectly).

When there is a dispute, rather than optimizing over smiles in a lifetime (a proxy for long-term happiness), something more demanding is preferable: if the versions of the person in both the world where the intervention happened and the world where it did not would end up agreeing that it was better for it to have happened, and that it was right to force the issue, then it might make sense to override the person's current preference. Since the future is not known, such determinations are necessarily probabilistic, and the thresholds should be set quite high. The vast majority of adults agree that as children they should have gotten vaccinations for many diseases, so the probability that the child would later agree is quite high.

Smiles in a lifetime is a good proxy for what an aligned intelligence, artificial or not, should pursue for those it loves once multiple actions are within acceptable bounds, either because they pass the test above, or because they match the person's current preferences and would be approved of in the world where they happen.

Requiring two out of three versions of the person to approve is only complicated in worlds where the dissenting version is the one who actually experiences the intervention.
