1700

LESSWRONG
LW

1699
Human AlignmentAI

11

What can we learn from parent-child-alignment for AI?

by Karl von Wendt
29th Oct 2025
3 min read
3

11

Human AlignmentAI

11

What can we learn from parent-child-alignment for AI?
2StanislavKrym
1Karl von Wendt
1StanislavKrym
New Comment
3 comments, sorted by
top scoring
Click to highlight new comments since: Today at 11:42 AM
[-]StanislavKrym3h20

If we use this concept to look at the way current AIs are trained with RLHF, I think the result looks exactly like that of a narcissist or sociopath. Current AIs are trained to be liked, but are unable to love. This explains their sycophantic and sometimes downright narcissistic behavior (e.g. when LLMs recommend their users to break relationships with humans so they can listen more to the LLM).

Did you mean that current AIs are RLHFed for answers that bring the humans short-term happiness instead of long-term thriving? Because later models are less sycophantic than earlier ones. And there is the open-sourced KimiK2 who was trained mostly on RLVR instead of RLHF and is LESS sycophantic than anything else, including Claude 4.5 Sonnet even with situational awareness which Claude, unlike Kimi, has...

Reply
[-]Karl von Wendt2h10

I'm not a machine learning expert, so I'm not sure what exactly causes sycophancy. I don't see it as a central problem of alignment; it is just a symptom of a deeper problem to me. 

My point is more general: To achieve true alignment in the sense of an AI doing what it thinks is "best for us", it is not sufficient to train it by rewarding behavior. Even it the AI is not sycophantic, it will pursue some goal that we trained into it, so to speak, and that goal will most likely not be what we would have wanted it to be in hindsight. 

Contrast that with the way I behave towards my sons: I have no idea what their goals are, so I can't say that my goals are "aligned" with their goals in any strict sense. Instead, I care about them, their wellbeing, but also their independence of me and their ability to find their own way to a good life. I don't think this kind of intrinsic motivation can be "trained" into an AI with any kind of reinforcement learning.

Reply
[-]StanislavKrym1h10

The analogy actually has two differences: 1) mankind can read the chains of thought of SOTA LLMs and do interpretability probes on all the thoughts of the AIs; 2) a misaligned human, unlike a misaligned AI, has no way to, for example, single-handedly wipe out all the world except for a country. A misaligned ASI, on the other hand, can do so once it establishes a robot economy capable of existing without humans.

So mankind is trying to prevent the appearance not of an evil kid (which is close to being solved on the individual level, but requires actual effort from the adults), but of an evil god who will replace us.

Reply
Moderation Log
More from Karl von Wendt
View more
Curated and popular this week
3Comments

Epistemic status: This is not a scientific analysis, but just some personal observations. I still think they point towards some valid conclusions regarding AI alignment.

I am a father of three sons. I would give my life to save each of them without second thoughts. As a father, I certainly made mistakes, but I tried to do everything I could to help them find their own way through life and reach their own goals, and I still do. If I had to make a conflicting decision where my own interests collide with theirs, I would decide in favor of them in most cases, even if I suffer more from the decision than they would in the opposite case, because my own wellbeing is less important to me than theirs. I cannot prove all of this, but I know it is true. 

Evolution seems to have found a solution to the alignment problem at least in the special case of parent-child-relationships. However, this solution is not fail-safe. Unfortunately, there are countless cases of parents neglecting their children or even emotionally or physically abusing them (I have intense personal experience with such a case).

I use a simple mental model of human motivation that looks like this:

The arrows symbolize four different classes of motivation. At the center, there’s a selfish need for stability (safety, food, etc.). Most people also have a need to grow, e.g. to explore and learn new things. On the other side, there are two different kinds of motivation regarding relationships with others. On the left, we need to feel accepted, to be valued, maybe even admired by others. On the right, most of us have genuine feelings of love and altruism towards at least some other people, e.g. the love of a parent for their children as mentioned above. This simple model is neither complete nor scientific, it just reflects my own view of the world based on my personal experience.

It is important to realize that these motivations can be in conflict with each other and are varying in strength depending on the situation, symbolized in my diagram by the strength of the arrows. For instance, I may feel inclined to sacrifice my own life for one of my sons, but not for my neighbor and even less for someone I don’t like. There’s also a varying degree of importance based on character traits. For example, a pathological narcissist’s diagram might look like this:

From my own personal experience, this is a real problem – there are people who don’t feel any kind of love or altruism for others, probably even lack the ability for that.

If we use this concept to look at the way current AIs are trained with RLHF, I think the result looks exactly like that of a narcissist or sociopath. Current AIs are trained to be liked, but are unable to love. This explains their sycophantic and sometimes downright narcissistic behavior (e.g. when LLMs recommend their users to break relationships with humans so they can listen more to the LLM).

Of course, an LLM can be made to act like someone who feels genuine love, but that isn’t the same. A narcissistic mother will act in public like she loves her children because she knows that’s what people expect from a mother, but behind closed doors the mask will drop. Of course, the children know that it’s just a play-act the whole time, but they won’t dare disrupt the performance out of fear of the consequences (again, I know from personal experience that this is real, but I’m not a psychologist and won’t try to explain this phenomenon in more detail).

This seems to point towards the conclusion that training an AI solely based on observed behavior is insufficient to induce consistent, truly altruistic or “aligned” behavior. To be more precise, it seems extremely unlikely that behavior-based training will by chance select a model out of all possible configurations that is actually intrinsically motivated to behave in the desired way, instead of just acting like desired. To use an analogy, it is much easier to change one's behavior than one's inherited character traits.

I have no solution for this. I don’t know enough about human psychology and neuroscience to suggest how we could model true human altruism inside an AI. But it seems to me that in order to solve alignment, we need to find a way to make machines truly care about us, similar to the way a parent cares about their children. This seems to imply some new architecture for developing advanced general AI.

Even if we could solve this, it would still leave many open problems, in particular how to balance conflicting needs and how to resolve conflicts of interest between different “children”. But still, knowing that I can love my children more than myself gives me some hope that AI alignment could be solvable in principle.

Until we have found a proven solution, I strongly oppose developing AGI, let alone superintelligence.