P(misalignment x-risk | AGI) is high. 


Intent alignment should not be the goal for AGI x-risk reduction. If AGI is developed and we solve AGI intent alignment, we will not have lowered x-risk sufficiently, and we may even have increased it relative to what it would have been otherwise.
 

P(misalignment x-risk | intent-aligned AGI) >> P(misalignment x-risk | societally-aligned AGI). 

 

The goal of AI alignment should be alignment with (democratically determined) societal values (because these have broad buy-in from humans).
 

P(misalignment x-risk | AGI) is higher if intent alignment is solved before societal-AGI alignment. 

 

Most technical AI alignment research is currently focused on solving intent alignment. The (usually implicit, sometimes explicit) assumption is that solving intent alignment will help subsequently solve societal-AGI alignment. This would only be the case if all the humans that had access to intent-aligned AGI had the same intentions (and did not have any major conflicts between them); that is highly unlikely. 

Solving intent alignment is likely to make practically implementing societal-AGI alignment harder. If we first solve intent alignment before solving societal alignment, humans with intent-aligned AGIs are likely to be incentivized to inhibit the development and roll-out of societal AGI-alignment techniques because they would be giving up significant power. Furthermore, humans with intent-aligned AIs would suddenly have significantly more power, and their advantages over others would likely compound.

 

Why does solving intent alignment not lower x-risk sufficiently?

  1. If we solve the intent alignment problem between a human, H, and an AI, A, then A implements H’s intentions with super-human intelligence and skill.
  2. There are multiple Hs and multiple As.
  3. By the very nature of humans, there are conflicts in the intentions of the Hs. 
    1. Humans have conflicting preferences about the behavior of other humans and about states of the world more broadly. Intent-aligned As would thus have different intentions from one another. 
  4. The As execute actions furthering the Hs' intentions far too quickly for those conflicts to be resolved through any existing human-driven conflict resolution. Conflicts are thus likely to spiral out of control (see the toy sketch after this list).
    1. Any ultimate conflict resolution mechanism needs to be human-driven. No A can conduct the conflict resolution work, because it does not have buy-in from all Hs (or their intent-aligned As). Affected Hs need to endorse the process and respect the outcome, and that only happens with democratic procedures. 
  5. Therefore, if we solve intent alignment, we do not solve the problem of making AGI sufficiently beneficial to humans. We do not drastically reduce P(misalignment x-risk), because there will be misalignment between many of the AGI systems and many of the humans. That level of conflict among powerful agents could be existential for humanity as a whole. 
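
As a toy illustration of step 4 only (not a prediction, and with entirely made-up parameters), the sketch below compares the per-step escalation from many As acting on conflicting intentions with a human-driven resolution process that intervenes only occasionally:

```python
# Toy sketch (illustrative only; all parameters are made up) of the timescale
# mismatch in step 4: intent-aligned agents (As) act on their principals'
# conflicting intentions every step, while human-driven conflict resolution
# only intervenes periodically, so conflict intensity compounds in between.

def simulate_conflict(steps=100, n_agents=5, escalation_rate=0.10,
                      human_resolution_period=20, resolution_strength=0.5):
    """Return conflict intensity over time under these toy assumptions."""
    conflict = 1.0
    history = []
    for t in range(steps):
        # Each A acts on its H's intentions; conflicting actions compound
        # multiplicatively in this toy model.
        conflict *= (1 + escalation_rate) ** n_agents
        # Human-driven resolution is slow and only partially de-escalates.
        if t % human_resolution_period == human_resolution_period - 1:
            conflict *= resolution_strength
        history.append(conflict)
    return history

if __name__ == "__main__":
    print(f"toy conflict intensity after 100 steps: {simulate_conflict()[-1]:.2e}")
```

With these arbitrary numbers, the periodic human intervention never keeps pace with the per-step escalation. That timescale mismatch, not the specific numbers, is the shape of the worry in step 4.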

 

Then what should we be aiming for?

To minimize P(misalignment x-risk | AGI) we should work on technical solutions to societal-AGI alignment, which is where As internalize a distilled and routinely updated constellation of shared values as determined by deliberative democratic processes driven entirely by humans (and not AI) and authoritative conflict resolution mechanisms driven entirely by humans (and not AI). Humans already have these things (and they are well-developed in the nation with the highest probability of producing AGI, the U.S.). 
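
As a purely schematic sketch of that sentence (my own hypothetical structure, not a design from any of the cited work), the key shape is that the agent reads, but never writes, a value specification that human deliberative processes routinely update, and a human-driven authority retains a veto over its actions:

```python
# Schematic sketch (hypothetical structure, not an implementation) of the
# societal-AGI-alignment loop described above: the agent only reads a value
# specification that human deliberative processes routinely update, and a
# human-driven conflict-resolution authority can veto any proposed action.

from dataclasses import dataclass, field


@dataclass
class SharedValues:
    """Distilled societal values; updated only by human deliberation."""
    version: int = 0
    constraints: list[str] = field(default_factory=list)

    def human_deliberative_update(self, new_constraints: list[str]) -> None:
        # Stand-in for legislation, regulation, court opinions, citizen
        # assemblies, etc.; the agent never calls this.
        self.constraints = list(new_constraints)
        self.version += 1


class SociallyAlignedAgent:
    def __init__(self, values: SharedValues):
        self.values = values  # read-only from the agent's perspective

    def propose_action(self, goal: str) -> str:
        # Placeholder for planning constrained by the current shared values.
        return f"plan for '{goal}' under values v{self.values.version}"


def human_conflict_resolution(action: str) -> bool:
    # Final authority stays with a human-driven process; the agent cannot
    # override a rejection. Here it is a stub that always approves.
    return True


values = SharedValues()
values.human_deliberative_update(["no irreversible harm", "respect the rule of law"])
agent = SociallyAlignedAgent(values)
action = agent.propose_action("allocate research funding")
if human_conflict_resolution(action):
    print(action)
```

The substance of societal-AGI alignment is, of course, in how the values get distilled and internalized; the sketch only fixes who is allowed to update and veto what.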

We need to do the work of internalizing these things in AI systems. At best, work toward intent alignment diverts resources from societal-AGI alignment technical work; at worst, if intent-aligned AGI is developed first, it actively makes finishing the societal-AGI alignment work harder.

If societal-AGI alignment is solved before intent alignment, then there is powerful societally-aligned AGI that can reduce the probability of intent-aligned AGIs being developed and/or having negative impacts.

 

Conclusion

We don't yet have a solution for societal-AGI alignment or intent-AGI alignment, and both are very hard problems. This post is intended to raise questions about where and when to devote development resources.
 

Appendix A: What is intent-AGI alignment?

Cullen O’Keefe summarized intent alignment well in this Alignment Forum post.

The standard definition of "intent alignment" generally concerns only the relationship between some property of a human principal H and the actions of the human's AI agent A:

  • Jan Leike et al. define the "agent alignment problem" as "How can we create agents that behave in accordance with the user's intentions?"
  • Amanda Askell et al. define "alignment" as "the degree of overlap between the way two agents rank different outcomes."
  • Paul Christiano defines "AI alignment" as "A is trying to do what H wants it to do."
  • Richard Ngo endorses Christiano's definition.

Iason Gabriel does not directly define "intent alignment," but provides a taxonomy wherein an AI agent can be aligned with:

  1. "Instructions: the agent does what I instruct it to do."
  2. "Expressed intentions: the agent does what I intend it to do."
  3. "Revealed preferences: the agent does what my behaviour reveals I prefer."
  4. "Informed preferences or desires: the agent does what I would want it to do if I were rational and informed."
  5. "Interest or well-being: the agent does what is in my interest, or what is best for me, objectively speaking."
  6. "Values: the agent does what it morally ought to do, as defined by the individual or society."

All but (6) concern the relationship between H and A. It would therefore seem appropriate to describe them as types of intent alignment.

 

Appendix B: What is societal-AGI alignment?

Two examples from Alignment Forum posts:

  • Coherent Extrapolated Volition is a non-democratic version of societal alignment, where "an AI would predict what an idealized version of us would want, 'if we knew more, thought faster, were more the people we wished we were, had grown up farther together'. It would recursively iterate this prediction for humanity as a whole, and determine the desires which converge."
  • Law-Informed AI is a democratic version of societal alignment where AGI learns societal values from democratically developed legislation, regulation, court opinions, legal expert human feedback, and more.
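
As a minimal, hypothetical sketch of the second bullet's framing (not taken from the linked post; the snippets, labels, and the binary "permitted vs. restricted" framing are all placeholders), learning values from democratically produced legal text can be posed as an ordinary supervised problem over legal sources:

```python
# Hypothetical sketch (placeholder data and labels, not from the Law-Informed
# AI post): pose "learn societal values from legal text" as supervised
# classification over snippets of democratically produced legal sources.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder corpus: legal/regulatory snippets labeled by whether the
# conduct they describe is permitted (1) or restricted (0).
legal_snippets = [
    "Discharging pollutants into navigable waters without a permit is prohibited.",
    "A data controller must obtain consent before processing personal data.",
    "Parties may freely contract for the sale of goods at an agreed price.",
    "Public records are open to inspection by any person during office hours.",
]
labels = [0, 0, 1, 1]

classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(legal_snippets, labels)

proposed_action = "Process user data without asking for consent."
print(classifier.predict([proposed_action]))  # toy judgment on a proposed action
```

A real law-informed approach would be far richer (court opinions, legal expert feedback, standards of interpretation); the sketch only illustrates the direction of the data flow, from democratically produced sources to the model.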

Comments

To minimize P(misalignment x-risk | AGI) we should work on technical solutions to societal-AGI alignment, which is where As internalize a distilled and routinely updated constellation of shared values as determined by deliberative democratic processes driven entirely by humans

I agree that this kind of work is massively overlooked by this community. I have done some investigation into the root causes of why it is overlooked. The TL;DR is that this work is less technically interesting, and that many technical people here (and in industry and academia) would like to avoid even thinking about any work that needs to triangulate between different stakeholders who might then get mad at them. For a longer version of this analysis, see my paper Demanding and Designing Aligned Cognitive Architectures, where I also make some specific recommendations.

My overall feeling is that the growth in the type of technical risk reduction research you are calling for will have to be driven mostly by 'demand pull' from society, by laws and regulators that ban certain unaligned uses of AI.

Thanks so much for sharing that paper. I will give that a read.

On one hand, I agree that intent alignment is insufficient for preventing x-risk from AI. There are too many other ways for AI to go wrong: coordination failures, surveillance, weaponization, epistemic decay, or a simple failure to understand human values despite the ability to faithfully pursue specified goals. I'm glad there are people like you working on which values to embed in AI systems and ways to strengthen a society full of powerful AI. 

On the other hand, I think this post misses the reason for popular focus on intent alignment. Some people think that, for a sufficiently powerful AI trained in the current paradigm, there is no goal that it could faithfully pursue without collapsing into power seeking, reward hacking, and other instrumental goals leading to x-risk. Ajeya Cotra's framing of this argument is most persuasive to me. Or Eliezer Yudkowsky's "strawberry alignment problem", which (I think) he believes is currently impossible and captures the most challenging part of alignment: 

"How would you get an AI system to do some very modest concrete action requiring extremely high levels of intelligence, such as building two strawberries that are completely identical at the cellular level, without causing anything weird or disruptive to happen?"

Personally I think there's plenty of x-risk from intent aligned systems and people should think about what we do once we have intent alignment. Eliezer seems to think this is more distraction from the real problem than it's worth, but surveys suggest that many people in AI safety orgs think x-risk is disjunctive across many scenarios. Which is all to say, aligning AI with societal values is important, but I wouldn't dismiss intent alignment either. 

Thanks for those links and this reply.

1. 

for a sufficiently powerful AI trained in the current paradigm, there is no goal that it could faithfully pursue without collapsing into power seeking, reward hacking, and other instrumental goals leading to x-risk

I don't see how this is a counterargument to this post's main claim:

P(misalignment x-risk | intent-aligned AGI) >> P(misalignment x-risk | societally-aligned AGI). 

That problem of the collapse of a human-provided goal into AGI power-seeking seems to apply just as much to intent alignment as it does to societal alignment; if anything, it could apply even more to intent alignment, because the goals provided would be (a) far less comprehensive, and (b) much less carefully crafted.

 

2. 

Personally I think there's plenty of x-risk from intent aligned systems and people should think about what we do once we have intent alignment.

I agree with this. My point is not that we should not think about the risks of intent alignment. Rather, it is that (if the arguments in this post are valid) AGI-capabilities-advancing technical research that actively pushes us closer to developing intent-aligned AGI is a net negative, because AGIs aligned to multiple humans with conflicting intentions can lead to out-of-control conflicts, and thus to an increase in x-risk. Moreover, if we first solve intent alignment before solving societal alignment, humans with intent-aligned AGIs are likely to be incentivized to inhibit the development and roll-out of societal AGI-alignment techniques because they would be giving up significant power. Furthermore, humans with intent-aligned AIs would suddenly have significantly more power, and their advantages over others would likely compound, worsening the above issues.

Most current technical AI alignment research is AGI-capabilities-advancing research that actively pushes us closer to developing intent-aligned AGI, with the (usually implicit, sometimes explicit) assumption that solving intent alignment will help subsequently solve societal-AGI alignment. But this would only be the case if all the humans who had access to intent-aligned AGI had the same intentions (and did not have any major conflicts between them), and that is highly unlikely.

I think the danger of intent-alignment without societal-alignment is pretty important to consider, although I'm not sure how important it will be in practice. Previously, I was considering writing a post about a similar topic - something about intent-level alignment being insufficient because we hadn't worked out metaethical issues like how to stably combine multiple people's moral preferences and so on. I'm not so sure about this now, because of an argument along the lines of "given that it's aligned with a thoughtful, altruistically motivated team, an intent-aligned AGI would be able to help scale their philosophical thinking so that they reach the same conclusions they would have come to after a much longer period of reflection, and then the AGI can work towards implementing that theory of metaethics."

Here's a recent post that covers at least some of these concerns (although it focuses more on the scenario where one EA-aligned group develops an AGI that takes control of the future): https://www.lesswrong.com/posts/DJRe5obJd7kqCkvRr/don-t-leave-your-fingerprints-on-the-future

I could see the concerns in this post being especially important if things work out such that a full solution to intent-alignment becomes widely available (i.e. easily usable by corporations and potential bad actors) and takeoff is slow enough for these non-altruistic entities to develop powerful AGIs pursuing their own ends. This may be a compelling argument for withholding a solution to intent-alignment from the world if one is discovered.

Thanks. 

There seems to be pretty wide disagreement about how intent-aligned AGI could lead to a good outcome. 

For example, even in the first couple comments to this post: 

  1. The comment above (https://www.lesswrong.com/posts/Rn4wn3oqfinAsqBSf/?commentId=zpmQnkyvFKKbF9au2) suggests "wide open decentralized distribution of AI" as the solution to making intent-aligned AGI deployment go well. 
  2. And this comment I am replying to here says, "I could see the concerns in this post being especially important if things work out such that a full solution to intent-alignment becomes widely available."

My guess, and a motivation for writing this post, is that we will see something in between (a) wide and open distribution of intent-aligned AGI (that somehow leads to well-balanced, highly multipolar scenarios) and (b) completely centralized ownership of intent-aligned AGI (by a beneficial group of very conscientious philosopher-AI-researchers).

Sufficiently strong intent alignment implies societal alignment. This is because all of our intents have some implied "...and do so without disrupting social order." And I think this implication is not something you can easily remove or leave out. Intent without this part may make sense as a hypothesis, but any actual implementation of intent alignment will need to consider this. All words carry social aspects with them in non-trivial ways. 

The same applies to psychological alignment, medical alignment, or traffic alignment. Social aspects are not special. If I have the intent to clone a strawberry, that also implies that I stay healthy and that traffic is not disrupted when the AI does its work.

It's definitely not the case that:

all of our intents have some implied "...and do so without disrupting social order."

There are many human intents that aim to disrupt social order, and more generally to cause outcomes that are negative for other humans.

And that is one of the key issues with intent alignment.

I don't disagree. Intent alignment requires solving social alignment. But I think most people here understand that to be the case.

If we first solve intent alignment before solving societal alignment, humans with intent-aligned AGIs are likely to be incentivized to inhibit the development and roll-out of societal AGI-alignment techniques because they would be giving up significant power.

This is an interesting point, but I think you are missing other avenues for reducing the impact of centralization in futures where intent alignment is easy. We don't necessarily need full societal-AGI alignment - just wide open decentralized distribution of AI could help ensure multipolar scenarios and prevent centralization of power in a few humans (or likely posthumans). Although I guess the natural resulting deliberation could be considered an approximation of CEV regardless.