I was recently thinking about how humans achieve alignment with each other over the course of our lifetimes, and how that process could be applied to an AGI.

For example, why doesn't everyone shoplift from the grocery store? A grocery store isn't as secure as Fort Knox, and anyone who considered every possible policy for obtaining groceries might discover that shoplifting is more efficient than earning money at a legitimate job. That may or may not be the best example, but I'm sure LW is quite familiar with the concept that lies at the heart of the problem of AI alignment: what humans consider the morally superior solution isn't always the most "rational" answer.

So why don't humans shoplift? I believe the most common answer from modern sociology is that we observe other humans getting jobs and paying with legitimately earned money, so we imitate that behavior out of a desire to be a "normal" human. People are born into this world with virtually no alignment, and gradually construct their own ethical system based on interactions with other people around them, most importantly their parents (or other social guardians).

Granted, from the perspective of ethical philosophy and decision theory that explanation may be an oversimplification, but my point is that socialization would appear to be a straightforward approach to AI alignment. When human beings become adults and their parents grow considerably weaker with old age, their elders no longer have any physical capability of controlling them. And yet people obey or respect their parents anyway, and are morally expected to, because of the social conditioning they still recall from childhood. That is essentially the outcome we want from a superintelligent AGI: a being that is powerful enough to ignore humanity, but has a deep personal desire to obey it anyway.

Some basic mechanics of formal and informal norms in sociology could lend themselves towards reinforcement learning algorithms. For example:

  • Guilt-based discipline: as the AGI explores its environment, indicate when a state-action pair of an adopted policy is morally wrong
  • Shame-based discipline: whenever the AGI adopts a policy that has a detrimental outcome, indicate that its general behavior is morally wrong
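One way to read these two bullets in RL terms is as two kinds of reward shaping. This is a minimal sketch under my own assumptions, not a worked-out proposal: guilt is modeled as a penalty on the specific flagged state-action pairs, and shame as a penalty spread over every step of an episode whose overall outcome was judged harmful. The names `guilt_shaped_rewards`, `shame_shaped_rewards`, `forbidden`, and `penalty` are hypothetical, and the human judgment itself (which pairs or outcomes count as wrong) is simply passed in as data.

```python
from typing import List, Set, Tuple

# One step of an episode: (state, action, task_reward). Hypothetical encoding.
Step = Tuple[int, int, float]


def guilt_shaped_rewards(
    trajectory: List[Step],
    forbidden: Set[Tuple[int, int]],
    penalty: float = 1.0,
) -> List[float]:
    """Guilt: penalize only the state-action pairs a guardian flagged as wrong."""
    return [
        reward - penalty if (state, action) in forbidden else reward
        for state, action, reward in trajectory
    ]


def shame_shaped_rewards(
    trajectory: List[Step],
    outcome_is_harmful: bool,
    penalty: float = 1.0,
) -> List[float]:
    """Shame: if the episode's outcome was judged harmful, spread the penalty
    evenly over every step, so the behavior as a whole is discouraged."""
    if not trajectory:
        return []
    per_step = penalty / len(trajectory) if outcome_is_harmful else 0.0
    return [reward - per_step for _, _, reward in trajectory]


# Toy episode: three steps, with the task reward arriving at the end.
episode: List[Step] = [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)]

guilt = guilt_shaped_rewards(episode, forbidden={(1, 0)}, penalty=1.0)
shame = shame_shaped_rewards(episode, outcome_is_harmful=True, penalty=1.5)
```

Either list of shaped rewards could then be fed into an ordinary RL update in place of the raw task reward. The difference in this sketch is only in credit assignment: guilt tells the agent which step was wrong, while shame tells it only that the policy as a whole produced a bad result.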

One possible criticism of socialization alignment is that you are creating an AGI agent that is completely unaligned, but with the expectation that it will become aligned eventually. Thus, there is some gap of time when the AGI may cause harm to the population before it learns that doing so is wrong. My personal solution to that problem is what I previously referred to as Infant AI: the first scalable AGI should be very restricted in its intelligence (e.g., given only the knowledge domain of mathematical problems), and should be expanded into a more intelligent AGI only after the previous version is fully aligned.

One benefit of socialization alignment is that it doesn't rely on explicitly spelling out what ethical system or values we want the AI to have. Instead, it would organically conform to whatever moral system the humans around it use, effectively optimizing for approval from its guardians.

However, this can also be a double-edged sword. The problem I foresee is that the different instances of AGI would be as diverse in their ethical systems as humans are. While the vast majority of humans agree on fundamental ideas of right and wrong, there are still many differences from one culture to another, or even from one individual to another. An AGI created in the Middle East may end up having a very different value system than an AGI created in Great Britain or Japan. And if the AI interacted with morally dubious individuals, such as a psychopath or an ideological extremist, that could skew its moral alignment as well.


3 Answers



So why don't humans shoplift? ... People are born into this world with virtually no alignment, and gradually construct their own ethical system based on interactions with other people around them, most importantly their parents (or other social guardians).

I believe this ignores the most important part: humans are born with a potential for empathy (which is further shaped by their interactions with the people around them).

If the AI is born without this potential, there is nothing to shape. (Also, here.)

Looking at the human example, there is a certain fraction of population born as psychopaths, and despite getting similar interactions, they grow up differently. Which shows that the capacities you are born with matter at least as much as the upbringing.

(This entire line of thinking seems to me like wishful thinking: If we treat the AI as a human baby, it will magically gain the capabilities - empathy, mirroring - of a human baby, and will grow up accordingly. No, it won't. You don't even need a superhuman AI to verify this; try the same experiment with a spider - who is more similar to humans than an AI - and observe the results.)

The implication that I didn't think to spell out is that the AI should be programmed with the capacity for empathy. It's more a proposal of system design than a proposal of governance. Granted, the specifics of that design would be its own discussion entirely.

Literally just dumping papers; consider these to be slightly-better-than-google search results. Many of these results aren't quite what you're looking for, and I'm accepting that risk in order to get some significant chance of getting ones you're looking for that might not be obvious to search for. I put this together on and off over a few hours; hope one lands!

====: 4 stars, seems related
.===: 3 stars, likely related, and interesting
..==: 2 stars, interesting but less related
...=: included for completeness, probably not actually what you wanted, even if interesting

and I would be remiss to not mention in every message where I give an overview of papers:

As always, no approach like these will plausibly work for strong AI alignment until approaches like https://causalincentives.com/ are ready to clarify likely bugs in them, and until approaches like QACI, davidad's, or Vanessa's are ready to view these socialization approaches as mere components in a broader plan. Anything based on socialization still likely needs interpretability (for near-human) or formal alignment (for superhuman) in order to be of any serious use. I recommend that anyone trying to actually solve retarget-towards-inter-agent-caring alignment not stop at empirical approaches, as those are derived from theory to some degree anyhow, and there's some great new RL theory work from folks like the causalincentives folks and the rest of the DeepMind safety team, e.g. https://arxiv.org/abs/2206.13477

I agree that these methods are very likely not effective on strong AGI. But one might still figure out how effective they are and then align AI up to that capability (plus buffer). And one can presumably learn much about alignment too.

the gears to ascension
Perhaps! I'm curious which of them catch your eye for further reading and why. I've got a lot on my reading list, but I'd be down to hop on a call and read some of these in sync with someone.
I found this one particularly relevant: https://arxiv.org/abs/2010.00581, "Emergent Social Learning via Multi-agent Reinforcement Learning". It provides a solution to the problem of how an RL agent can learn to imitate the behavior of other agents. It doesn't help with alignment, though; it is more on the capabilities side.
None of these papers seem to address the question of how the agent is intrinsically motivated to learn external objectives. Either there is a human in the loop, the agent learns from humans (which improves its capability but not its alignment), or RL is applied on top. I'm in favor of keeping the human in the loop, but it doesn't scale. RL on LLMs is bound to fail, i.e., to be gamed, if the symbols aren't grounded in something real. I'm looking for something that explains how the presence of other agents in an agent's environment, together with reward/feedback grounded in the environment as in "[Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL", leads to aligned behaviors.

M. Y. Zuo


I thought along similar lines and asked a question regarding the possibility of sub-exponential growth, where the AI would be child-like and need some hand-holding to realize its full potential: https://www.lesswrong.com/posts/3H8bmvgqBBpk48Dgn/what-s-the-likelihood-of-only-sub-exponential-growth-for-agi

There are some more tangential discussions regarding this topic scattered throughout old posts. I would have posted what I found if my old notes were still handy.

In terms of published papers on this topic, there aren't any as far as I can recall.

The most convincing argument against this possibility was provided by Lone Pine:

If there is a general theory of intelligence and it scales well, there are two possibilities. Either we are already in a hardware overhang, and we get an intelligence explosion even without recursive self improvement. Or the compute required is so great that it takes an expensive supercomputer to run, in which case it’ll be a slow takeoff. The probability that we have exactly human intelligence levels of compute seems low to me. Probably we either have way too much or way too little.

i.e. the 'socialization phase' would be a narrow window in the full range of possibilities allowed by human accessible resources. It wouldn't take that long to make more compute available via worldwide Manhattan projects if a viable 'AI child' was proven, thus obviating the advantages that any human-like socialization could bring to bear in time.

1 comment

I keep saying that AI may need a human 'caregiver,' and I meant something like this post (or this one). While I'm not sure whether I explained it clearly enough, or whether that is really what it will amount to in the end, I believe that we can learn more about this kind of alignment by listening more closely to social scientists. One could at least try the approach and see to what degree it works, or how far it scales under increased optimization power.