What can we learn from parent-child-alignment for AI?

[-]StanislavKrym2mo20

If we use this concept to look at the way current AIs are trained with RLHF, I think the result looks exactly like that of a narcissist or sociopath. Current AIs are trained to be liked, but are unable to love. This explains their sycophantic and sometimes downright narcissistic behavior (e.g. when LLMs recommend their users to break relationships with humans so they can listen more to the LLM).

Did you mean that current AIs are RLHFed for answers that bring the humans short-term happiness instead of long-term thriving? Because later models are less sycophantic than earlier ones. And there is the open-sourced KimiK2 who was trained mostly on RLVR instead of RLHF and is LESS sycophantic than anything else, including Claude 4.5 Sonnet even with situational awareness which Claude, unlike Kimi, has...

[-]Karl von Wendt2mo20

I'm not a machine learning expert, so I'm not sure what exactly causes sycophancy. I don't see it as a central problem of alignment; it is just a symptom of a deeper problem to me.

My point is more general: To achieve true alignment in the sense of an AI doing what it thinks is "best for us", it is not sufficient to train it by rewarding behavior. Even it the AI is not sycophantic, it will pursue some goal that we trained into it, so to speak, and that goal will most likely not be what we would have wanted it to be in hindsight.

Contrast that with the way I behave towards my sons: I have no idea what their goals are, so I can't say that my goals are "aligned" with their goals in any strict sense. Instead, I care about them, their wellbeing, but also their independence of me and their ability to find their own way to a good life. I don't think this kind of intrinsic motivation can be "trained" into an AI with any kind of reinforcement learning.

[-]StanislavKrym2mo20

The analogy actually has two differences: 1) mankind can read the chains of thought of SOTA LLMs and do interpretability probes on all the thoughts of the AIs; 2) a misaligned human, unlike a misaligned AI, has no way to, for example, single-handedly wipe out all the world except for a country. A misaligned ASI, on the other hand, can do so once it establishes a robot economy capable of existing without humans.

So mankind is trying to prevent the appearance not of an evil kid (which is close to being solved on the individual level, but requires actual effort from the adults), but of an evil god who will replace us.

[-]Karl von Wendt2mo10

Yes, I agree, although I don't believe that COT reading will carry us very far into the future, it is already pretty unreliable and using it for optimization would ruin it completely.

Alignment is difficult, that is my whole point - with an emphasis on the hypothesis that with any kind of only reinforcement-learning-based approach, it is virtually impossible. IF we could find a way to create some kind of genuine "care about humans module" within an AI that is similar to the kind of parent-child-altruism that I write about, we might have a chance. But the problem is that no one knows how to do that, and even in humans it is a quite fragile mechanism.

One additional thought: Evolution has created the parent-child-care-mechanism through some kind of reinforcement-learning, but is optimizing for a different objective compared to our current AI training process - not any kind of direct evaluation of human behavior, but survival and reproduction. Maybe the evolution of spiral personas is closer to the way evolution works. But of course, in this case AI is a different species, a parasite, and we are the hosts.

LESSWRONG
LW

LESSWRONG
LW

16

What can we learn from parent-child-alignment for AI?

16

16