How evolution succeeds and fails at value alignment
Disclaimer: I don't have a background in alignment research or reinforcement learning, and I don't think any of the ideas discussed here are new, but they might be interesting to some.

A recent post suggested that humans provide an untapped wealth of evidence about alignment. I strongly agree with that, and I found it interesting to think about how nature ensures that a mother's values are aligned with the well-being of her children. One thing this made me realize is that there are two very different reasons why an agentic AGI might behave in a way we would characterize as misaligned:

1. The reward function was poorly specified, and the model didn't learn the values we wanted it to learn.
2. The model did learn the values we wanted it to learn, but those values led it to conclusions and actions that we did not anticipate.

Let's go back to thinking about how evolution aligns mothers with their children. If we could get an AGI to love all humans the way most (mammalian) mothers love their children, it might not solve the alignment problem outright, but it would be far better than an AGI that doesn't care much about humans.

We can think of humans as reinforcement learning agents with a range of different goals, motivations, and desires that are all, in one way or another, instrumental to survival and reproduction in an environment similar to the one we may have found ourselves in 10,000 years ago. Some of those goals represent fairly simple values, like avoiding pain or maintaining a certain blood sugar level; others represent more complex values, like ensuring the well-being of our children. The biological link between pain or food intake and a reward signal like dopamine can be quite simple, but how is something like "care for your children" encoded as a value in a way that generalizes to out-of-distribution environments?

Two examples of misaligned AGIs that were discussed around here are a system that is supposed to