I’ve been thinking about AI systems that take actions their users would most approve of. The word “most” in that sentence can be a bit unsettling.
In this post I want to explain why you might be unsettled, and also why I think that these concerns probably aren’t deal-breakers.
For concreteness: I’ll imagine a system that is overseen by a human operator and proposes a sequence of actions. Most of the time a proposed action is taken without any human involvement. But in a small fraction of cases the overseer reviews an action before it is taken, and each action is chosen to maximize how much the overseer would approve of it if it happened to be selected for review.
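To make that setup concrete, here is a minimal Python sketch of the loop I have in mind. Everything in it is illustrative rather than a real design: `predicted_approval`, `REVIEW_PROBABILITY`, and the toy action strings are assumptions standing in for whatever learned approval model and review process an actual system would use.

```python
import random

REVIEW_PROBABILITY = 0.01  # assumption: only a small fraction of actions get reviewed


def predicted_approval(action: str) -> float:
    """Stand-in for a model of how much the overseer would approve of `action`
    if it happened to be selected for review. The scores are arbitrary placeholders."""
    toy_scores = {"summarize report": 0.9, "send draft email": 0.7, "delete old files": 0.4}
    return toy_scores.get(action, 0.0)


def choose_action(candidates):
    """Each action is chosen to maximize predicted approval-if-reviewed,
    whether or not a review actually happens."""
    return max(candidates, key=predicted_approval)


def step(candidates, overseer_approves):
    action = choose_action(candidates)
    if random.random() < REVIEW_PROBABILITY:
        # Rare case: the overseer reviews the action before it is taken.
        if not overseer_approves(action):
            return None  # the overseer rejected the action
    return action  # most of the time, taken without any human involvement


if __name__ == "__main__":
    chosen = step(
        ["summarize report", "send draft email", "delete old files"],
        overseer_approves=lambda a: True,  # hypothetical always-approving overseer
    )
    print("Taking action:", chosen)
```

The point of the sketch is just that the optimization target is the overseer’s hypothetical reaction, not whether a review actually occurs; the rest of the post is about what that choice of target does and does not buy us.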
(Readers familiar with terms like “perverse instantiation” can safely skip this section.)
The basic problem is that we don’t actually care about whether an AI system takes actions we approve of. We are using approval as a proxy for leads-to-good-consequences.
But even if these two concepts typically overlap, they seem especially likely to come apart at the extremes.
For example, the “best” action may be optimized to be misunderstood by the overseer, in a way that leads us to believe that it has impossibly good consequences. Or the most persuasive argument might exploit human fallibility rather than honestly making a case, so that the “best” action is the one most compatible with the most effective tactics of persuasion. Or (in an especially extreme case) a proposed action may be such that merely viewing it compromises the overseer’s computer, seizing control of the approval signal itself.
However we feel about the plausibility of these failures, it would be unfortunate to build a system that actively seeks them out. But that’s exactly what an approval-maximizer does!
There are a few features of approval-directed systems that make me significantly less concerned about maximization:
(See also my discussion of maximization here.)
I don’t have a credible concrete scenario where the “maximization” in approval-maximization seems especially problematic. The safety mechanisms in the previous section also feel plausibly sufficient. I think the biggest question is the extent to which approval-direction is plausible as a way of organizing the internal behavior of an AI system.
That said, I would feel more comfortable with approval-directed systems that don’t apply their ingenuity to coming up with clever ways to exploit their users. I don’t think that this concern is a deal-breaker, but I do think there is room for improvement.