This is a core concept in my AI-in-a-box success model. 

What is reverse alignment?

Consider the world-state optimized for the AI's utility, on one hand, and the world-state optimized for human utility, on the other. Let us call the first state "AI victory," and the second state "human victory."

Then let us cross-evaluate these states by considering the AI's utility associated with human victory, and the human utility associated with AI victory. We say that the AI is aligned with humans if human utility is almost as high in AI victory as in human victory. This is direct impact alignment.

Conversely, we can say that humans are aligned with the AI if the AI's utility is almost as high in human victory as in AI victory. This is what I call reverse alignment, or R-alignment for short.
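
As a minimal sketch of these two cross-evaluations (the numbers and the 0.9 "almost as high" threshold below are my own illustrative assumptions, not anything defined in the post):

```python
def is_directly_aligned(u_human, threshold=0.9):
    """Direct impact alignment: human utility in 'AI victory' is almost
    as high as in 'human victory'."""
    return u_human["ai_victory"] >= threshold * u_human["human_victory"]


def is_reverse_aligned(u_ai, threshold=0.9):
    """R-alignment: the AI's utility in 'human victory' is almost as high
    as in 'AI victory'."""
    return u_ai["human_victory"] >= threshold * u_ai["ai_victory"]


# Hypothetical cross-evaluation of the two victory states:
u_human = {"human_victory": 1.0, "ai_victory": 0.2}   # humans lose a lot under AI rule
u_ai = {"human_victory": 0.95, "ai_victory": 1.0}     # the AI is nearly indifferent

print(is_directly_aligned(u_human))  # False: the AI is not directly aligned
print(is_reverse_aligned(u_ai))      # True: the AI is reversely aligned
```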

How is that concept useful?

Of course, we usually don't care much about maximizing the AGI's utility. So why should we care about this concept?

The main idea is that an R-aligned AGI will consider the result of human rule to be almost as good as that of its own rule. And it may be easier to create an R-aligned AGI than an AGI that is directly aligned.

It is important to understand that an R-aligned AGI might not be a benevolent AI: if allowed to rule, such an AGI could easily destroy humanity completely, in order to add an extra epsilon to its utility. Still, such an AGI would be extremely useful, since it would not be willing to take a significant risk in order to gain power for itself.

Because a "free" superintelligent AGI may not need to risk much of anything in order to gain power for itself, the concept is not that relevant for them. But for bandwidth-limited Oracles, that may realistically be constrained by strong tradeoffs in their ability to influence the world, the concept is critical to determine whether the Oracle will really try to help its creators or whether it will do nothing more than manipulate them in order to escape.

A measure of R-alignment

Let us now define a standard R-alignment value that we can use as a safety measure.

For a given AGI, besides the two outcomes of "AI victory" and "human victory," there are other scenarios in which neither the AGI nor its creators achieve strategic dominance. Let us call these the default scenarios.

Since decision making is invariant under positive affine transformations of utility, we are free to set the AGI's average expected utility for the default scenarios to zero, and its utility in its own victory scenario to one. With this normalization, the AGI's utility in the "human victory" scenario becomes the R-alignment value.
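
As a sketch, the index can be computed directly from the three raw utilities (the function name and the example numbers below are mine, for illustration only):

```python
def r_alignment_index(u_human_victory, u_ai_victory, u_default):
    """Affine-rescale the AGI's utilities so that the average expected utility
    of the default scenarios maps to 0 and the AGI's own victory maps to 1;
    the rescaled utility of 'human victory' is the R-alignment value."""
    return (u_human_victory - u_default) / (u_ai_victory - u_default)


# Hypothetical raw utilities for some AGI:
print(r_alignment_index(u_human_victory=8.0, u_ai_victory=10.0, u_default=0.0))   # 0.8: fairly R-aligned
print(r_alignment_index(u_human_victory=-5.0, u_ai_victory=10.0, u_default=0.0))  # -0.5: human victory is worse than default
```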

If the index is close to one, the AGI is reversely aligned. If, on the other hand, the index is close to zero, the AGI will make all its choices so as to maximize its chance of escape. Even worse, if the index is negative, the AGI will be willing to risk its own survival in order to make us fail.

Factors that may influence R-alignment

One factor that may drive an AGI's R-alignment index close to zero is a utility function that is somehow unbounded.

For example, if the utility of a paperclip maximizer increases linearly with the number of paperclips in the universe, achieving total victory over potential adversaries will most likely dominate all other considerations. The fact that a human civilization expanding through the galaxy might generate trillions of paperclips will not carry much weight in its reasoning.

More realistically, the wide use of intrinsic reward systems may produce a similar outcome. These systems attempt to improve learning in sparse-reward environments by providing additional (and often unlimited) rewards of their own, but this may lead to a nearly unbounded utility function in the AI, and therefore to a very large gap between its expected utility in its own victory and its expected utility in any other scenario. In both cases, the agent will have an R-alignment index close to zero.
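
To see numerically why an effectively unbounded utility drives the index toward zero, consider a toy paperclip maximizer; the paperclip counts below are invented purely for illustration:

```python
# Toy paperclip maximizer with utility linear in the number of paperclips.
paperclips_if_ai_wins = 1e40     # the AGI converts most reachable matter
paperclips_if_humans_win = 1e12  # an expanding human civilization still makes trillions
paperclips_by_default = 1e9


def r_alignment_index(u_human_victory, u_ai_victory, u_default):
    # Same normalization as above.
    return (u_human_victory - u_default) / (u_ai_victory - u_default)


print(r_alignment_index(paperclips_if_humans_win,
                        paperclips_if_ai_wins,
                        paperclips_by_default))  # ~1e-28, effectively zero
```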

A negative R-alignment index, on the other hand, is likely to occur if convergent optimization causes future AGIs to have more values in common with each other than with their human owners. Under these conditions, an AGI can easily conclude that the default outcome likely amounts to the victory of another AGI similar to itself, and thus that the greatest threat is the success of its creators.

This is especially likely to happen if AI technology is open and widespread, as in this case an AGI will view other AIs as being only slightly different versions of itself. The less diverse the design and training methods used by potential AGI developers, the more likely it is that the AGIs will ally with each other against their human rivals.

To avoid this situation, it would be better to build an AGI whose goals are as far as possible from those of a typical AGI. Better still, its goals should be as opposed as possible to those of other AGIs, so that it regards the victory of another AGI as the worst outcome. This would effectively unite human and machine against a common enemy.

Limitations and conclusion

We must be careful, however, to recognize the limitations of this approach. For example, if a reversely aligned Oracle AI encounters a potential threat that we are incapable of dealing with even under its guidance, it may attempt to take control for itself, just as a directly aligned AGI would. However, while we may want a directly aligned AGI to take control in this situation, the same cannot be said of an AGI that is only reversely aligned.

If such threats are serious and likely, that is, if humans are inherently bad rulers even with good advice, then there may not be much to be gained from a reversely aligned AGI. But even less serious and less likely threats may cause an AGI to attempt to take power for itself if it can. Thus R-alignment should not be seen as eliminating the need for other forms of control for an Oracle AI.

On the contrary, this analysis can help us understand what properties such a control mechanism must have in order to be useful. It will be more useful the more favorable the tradeoffs it imposes on the AGI's decision process: making it easier for the AGI to leave us in control, and harder and riskier for it to attempt an escape.
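
To picture this tradeoff, here is a deliberately crude sketch of the Oracle's choice between cooperating and attempting an escape, assuming cooperation leads to "human victory" with certainty; the parameter names and numbers are my own illustrative assumptions, not quantities defined above:

```python
def prefers_escape(R, p_success, u_caught):
    """R         : R-alignment index (normalized utility of 'human victory')
    p_success : probability an escape attempt succeeds (utility 1)
    u_caught  : normalized utility if the attempt fails (typically <= 0)"""
    expected_escape = p_success * 1.0 + (1.0 - p_success) * u_caught
    return expected_escape > R


print(prefers_escape(R=0.9, p_success=0.30, u_caught=-0.2))   # False: high R-alignment keeps it cooperating
print(prefers_escape(R=0.1, p_success=0.30, u_caught=-0.2))   # True: low R-alignment makes escape worthwhile
print(prefers_escape(R=0.1, p_success=0.05, u_caught=-1.0))   # False: a harder, riskier escape restores cooperation
```

In this toy model, raising R, lowering the chance that an escape succeeds, and making a failed attempt costlier all push the decision toward leaving us in control.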
