Thoughts about OOD alignment

[-]Charlie Steiner3y60

I'm going to say something critical. I mean it earnestly on the object level, but bear you no ill will on the social or interpersonal level. In fact, I think that making this post is a positive sign for your future :)

A mental mistake has been made here, and I think you're not alone in making it. We humans valorize a mother's love for her children. We humans think it right and proper that this love generalizes to new situations. So at first glance it might seem like evolution has miraculously produced something "robustly aligned" in the good generalization properties of a mother's love for her children.

But evolution does not care about motherly love; it only cares about fitness. If a child loses their gonads at age 2, evolution would rather (to the extent that it would rather anything) the mother stop devoting resources to that child and have a new one.

Evolution was just promoting fitness. Motherly love is a great result for us humans who think motherly love is great, but to evolution it's just another suboptimal kludge. See the Tragedy of Group Selectionism. The rightness-according-to-humans is bleeding over and affecting your judgment about rightness-according-to-evolution.

All of this is to say: the alignment problem is as hard as it ever was, because motherly love is not a triumph of evolution aligning humans. It's something we think is good, and we think generalizes in good ways, because we are talking about ourselves, our own values. The baby-eater aliens would praise evolution for so robustly aligning them to eat babies, and the puddle would praise the rainstorm for dropping it in a hole so suited for its shape. None of this is evidence that the optimization process that produced them is good at aligning things.

[-]Ocracoke3y32

I recently articulated similar ideas about motherly love. I don't think it's an example of successful alignment because evolution's goals are aligned with the mother's goals. In the example you give where a child loses their gonads at age 2, it would be an alignment failure if the mother continues devoting resources to the child. In reality that wouldn't happen, because with motherly love, evolution created an imperfect intermediate goal that is generally but not always the same as the goal of spreading your genes.

I totally agree that motherly love is not a triumph of evolution aligning humans with its goals. But I think it's a good example of robust alignment between the mother's actions and the child's interests that generalizes well to OOD environments.

[-]Catnee3y21

Thank you for your detailed feedback. I agree that evolution doesn't care about anything, but I think that baby-eater aliens would not think that way. They can probably think about evolution aligning them to eat babies, but in their case it is an alignment of their values to themselves, not to any other agent/entity.

In our story we somehow care about somebody else, and it is their story that ends up with the "happy end". I also agree that, given enough time, we would probably end up no longer caring about babies who we think cannot reproduce anymore, but that would be a much more complex solution.

As a first step it is probably much easier to just "make an animal that cares about its babies no matter what"; otherwise you have to count on the ability of that animal to recognize something it might not even understand (like the reproductive abilities of a baby).

[-]Charlie Steiner3y31

Ah, I see what you mean and that I made a mistake - I didn't understand how your post was about human mothers being aligned with their children, not just with evolution.

To some extent I think my comment makes sense as a reply, because trying to optimize^[1] a black-box optimizer for fitness of a "simulated child" is still going to end up with the "mother" executing kludgy strategies, rather than recapitulating evolution to arrive at human-like values.

EDIT: Of course my misunderstanding makes most of my attempt to psychologize you totally false.

But my comment also kinda doesn't make sense: because I didn't understand your post, I somewhat glaringly don't mention other key considerations. For example: mothers who love their children still want other things too, so how are we picking out what parts of their desires are "love for children"? Doing this requires an abstract model of the world, and that abstract model might "cheat" a little by treating love as a simple thing that corresponds to optimizing for the child's own values, even if it's messy and human.

A related pitfall: if you're training an AI to take care of a simulated child, thinking about this process using the abstract model we use to think about mothers loving their children will treat "love" as a simple concept that the AI might hit upon all at once. But that intuitive abstract model will not treat ruthlessly exploiting the simulated child's programming (pushing it outside of its intended context to get a high score) as something simple, even though that might happen.

^[1] especially with evolution, but also with gradient descent
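
The pitfall above, where a score-maximizer pushes the simulated child outside its intended context, can be sketched as a toy example. Everything here is a hypothetical assumption made for illustration: the "food" action, the wellbeing curve, the intended range of 0 to 10, and the glitch in the simulator's score are all invented, and the "optimizer" is just an argmax over candidate actions rather than any real training method.

```python
# Toy illustration only: all names and numbers are hypothetical assumptions.

def child_wellbeing(food: float) -> float:
    """What we actually care about: wellbeing peaks at a moderate amount of food."""
    return -(food - 5.0) ** 2


def simulated_score(food: float) -> float:
    """Score reported by the (assumed) simulator.

    Inside the intended range 0..10 it tracks wellbeing exactly.
    Outside that range, an invented glitch makes the score grow with the input.
    """
    if 0.0 <= food <= 10.0:
        return child_wellbeing(food)
    return food  # exploitable glitch outside the intended context


def best_action(score_fn, candidates):
    """A black-box optimizer: pick whichever candidate action scores highest."""
    return max(candidates, key=score_fn)


# Actions inside the intended context, and a wider set that includes OOD actions.
in_distribution = [float(x) for x in range(0, 11)]
with_ood_actions = [float(x) for x in range(0, 101)]

caring_choice = best_action(simulated_score, in_distribution)
exploit_choice = best_action(simulated_score, with_ood_actions)

print("in-distribution choice:", caring_choice)
print("choice with OOD actions available:", exploit_choice)
print("true wellbeing under the exploit:", child_wellbeing(exploit_choice))
```

While the optimizer only sees in-range actions, maximizing the score looks like "love". Once out-of-range actions are available, the same optimizer picks the glitch, and the child's true wellbeing collapses even though the score goes up.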

[-]Vaniver3y20

We may have one example of realized out-of-distribution alignment: maternal attachment.

When someone becomes maternally attached towards a dog, doesn't this count as an out-of-distribution alignment failure?

[-]Catnee3y10

I think it depends on "alignment to what?". If we are talking about the evolutionary process, then sure, we have a lot of examples like that. My idea was more about "humans can be aligned to their children by some mechanism which was found by evolution, and this is somewhat robust".

So if we ask how our attachment to something that is not our child fits with alignment to our children, well... technically, we will spend some resources on our pets, but this almost never affects the welfare of our children in any notable way. So it is an acceptable failure, I guess? I wouldn't mind if some powerful AGI loved all humans and tried to ensure their happy future while also having some weird non-human hobbies/attachments that it still prioritizes less than our wellbeing, kind of like parents who spend some free time on pets.

[-]Dave92F13y20

Mothering is constrained by successful reproduction of children - or failure to do so. It's not at all obvious how to get an AI to operate under analogous constraints. (Misbehavior is pruned by evolution, not by algorithm.)

Also, what mothers want and what children want are often drastically at odds.

[-]Catnee3y21

Yes, exactly. That's why I think that current training techniques might not be able to replicate something like that. The algorithm should not "remember" previous failures and try to game them, or adapt by changing weights and memorizing, but I don't have concrete ideas for how we can do it the other way.

[-]Nathan Helm-Burger3y10

I think there is something important here. Details of implementation aside, I do think that we should make a comprehensive attempt at 'fumbling towards alignment through trial and error', in a way similar to how evolution tried to align mothers to their children. I think that even if we don't get a comprehensively perfect result from such a process, we might gather some useful data and learn some important lessons along the way.

[-]johnswentworth3y-12

Some additional remarks on ood alignment...

This is what an aligned ood looks like:

This is what an unaligned ood looks like (note the glowing red eyes):

Though aligning the ood is moderately difficult, it is at least very easy to recognize and avoid unaligned ood.
