Scalar reward is not enough for aligned AGI

[-]Charlie Steiner4y30

I agree that us humans have a lot of information about human values that we want to be able to put into the AI in its architecture and in the design of its training process. But I don't see why multi-objective RL is particularly interesting. Do you think we won't need any other ways of giving the AI information about human values? If so, why? If not, and you just think it's interesting, what's an example of something to do with it that I wouldn't think of as an obvious consequence of breaking your reward signal into several channels?

[-]Peter Vamplew4y10

I'm not suggesting that RL is the only, or even the best, way to develop AGI. But this is the approach being advocated by Silver et al, and given their standing in the research community, and the resources available to them at DeepMind, it would appear likely that they, and others, will probably try to develop AGI in this way.

Therefore I think it is essential that a multiobjective approach is taken for there to be any chance that this AGI will actually be aligned to our best interests. If conventional RL based on scalar reward is used then
(a) it is very difficult to specify a suitable scalar reward which accounts for all of the many factors required for alignment (so reward misspecification becomes more likely),
(b) it is very difficult, or perhaps impossible, for the RL agent to learn the policy which represents the optimal trade-off between those factors, and
(c) the agent will be unable to learn about rewards other than those currently provided, meaning it will lack flexibility in adapting to changes in values (our own or society's)

The multiobjective maximum expected utility (MOMEU) model is a general framework, and can be used in conjunction with other approaches to aligning AGI. For example, if we encode an ethical system as a rule-base, then the output of those rules can be used to derive one of the elements of the vector utility provided to the multi-objective agent. We also aren't constrained to a single set of ethics - we could implement many different frameworks, treat each as a separate objective, and then when the frameworks disagree, the agent would aim to find the best compromise between those objectives.

While I didn't touch on it in this post, other desirable aspects of beneficial AI (such as fairness) can also be naturally represented and implemented within a multiobjective framework.

[-]Koen.Holtman4y20

I agree with your general comments, and I'd like to add some additional observations of my own.

Reading the paper Reward is Enough, what strikes me most is that the paper is reductionist almost to the point of being a self-parody.

Take a sentence like:

The reward-is-enough hypothesis postulates that intelligence, and its associated abilities, can be understood as subserving the maximisation of reward by an agent acting in its environment.

I could rewrite this to

The physics-is-enough hypothesis postulates that intelligence, and its associated abilities, can be understood as being the laws of physics acting in an environment.

If I do that rewriting throughout the paper, I do not have to change any of the supporting arguments put forward by the authors: they equally support the physics-is-enough reductionist hypothesis.

The authors of 'reward is enough' posit that rewards explain everything, so you might think that they would be very interested in spending more time to look closely at the internal structure of actual reward signals that exist in the wild, or actual reward signals that might be designed. However, they are deeply uninterested in this. In fact they explicitly invite others to join them in solving the 'challenge of sample-efficient reinforcement learning' without ever doing such things.

Like you I feel that, when it comes to AI safety, this lack of interest in the details of reward signals is not very helpful. I like the multi-objective approach (see my comments here), but my own recent work like this has been more about abandoning the scalar reward hypothesis/paradigm even further, about building useful models of aligned intelligence which do not depend purely on the idea of reward maximisation. In that recent paper (mostly in section 7) I also develop some thoughts about why most ML researchers seem so interested in the problem of designing reward signals.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

23

Scalar reward is not enough for aligned AGI

23

Ω 13

23

Ω 13