Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

1 Introduction

This post started off as a response to a pair of comments, by Rohin Shah and by Charlie Steiner. But the insights are (I hope) sufficiently general that I made them into a full post.

EDIT: It seems I misunderstood Rohin's point. In any case, I'll keep this post up, because the things I wrote in response to what I thought Rohin was saying are still useful to have written out.

In my original post, I pointed out that since an agent's preferences are not simply facts about the universe, "updating" a prior requires bridging assumptions - something that tells the AI how to update its estimate of human preferences, depending on what the human does.

It needs to know that:

  1. If the human goes to a restaurant, orders Sichuan Hot Pot, eats it, comments on how much they enjoyed it, goes back to the restaurant again and orders Sichuan Hot Pot after recommending it to their friends... this is evidence that they like Sichuan Hot Pot.
  2. However, if that restaurant is run by the mob, and the human owes money to the prominent mobster that owns the restaurant, and that mobster occasionally cooks the Sichuan Hot Pot themselves... then this isn't really evidence of much. Same goes if the restaurant is owned by the human's boss, a prominent local politician with power over something the human wants, the human's probation officer, and so on.
  3. Similarly to 2., if the restaurant is owned by the human's child, or their lover, or a close friend - though for subtly different reasons, and to different extents.

Anyway, not to belabour the point, but the AI needs some bridging assumptions so that it can take context like that into account when figuring out human preferences.

Rohin Shah and Charlie Steiner pointed to some ways that CIRL (Cooperative Inverse Reinforcement Learning) might get round this problem.

Thanks for those responses! They're very useful. Still, I'll argue in this post that the problems remain (and that the only way of solving them is to solve the hard problem of what human preferences and meta-preferences are: reproducing human theory of mind in AI format, without being able to use machine learning for the key assumptions).

2 Rigging learning or reward-maximisation?

Rohin Shah's comment starts out with:

The key point is that in a CIRL game, by construction there is a true (unknown) reward function, and thus an optimal policy must be viewable as being Bayesian about the reward function, and in particular its actions must be consistent with conservation of expected evidence about the reward function; anything which "rigs" the "learning process" does not satisfy this property and so can't be optimal.

I've talked about "rigging" a learning process in the past, and have a paper coming out about it (summary post imminent).

To simplify, a learning process can be rigged if it violates conservation of expected evidence, allowing the AI to "push" the learning in one direction or another. In that old example, there are two reward functions, $R_c$ ("cake") and $R_d$ ("death"), awaiting human confirmation as to which is which. The AI finds it easier to produce "death" than "cake", so it will try and trick/force/bribe the human to say "death" when the AI asks.

Now, this is definitely a poor learning process. But it's perfectly fine as reward-function-maximising behaviour. Let $I_c$ be the indicator variable for the human saying "cake", and $I_d$ the indicator variable for them saying "death"; then "rigging the learning process" is the same as "maximising the reward $R$" for:

$$R = I_c R_c + I_d R_d.$$
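To make the equivalence concrete, here is a minimal sketch in Python (my own illustration, not something from the original cake-or-death discussion; the numbers and the names like `achievable_reward` and `p_cake_given_action` are made-up assumptions) of an agent that simply maximises the compound $R$ and, as a result, "rigs" the human's answer:

```python
# A minimal sketch of how "rigging the learning process" and
# "maximising the compound reward R = I_c*R_c + I_d*R_d" pick out
# the same policy. All numbers are illustrative assumptions.

# Reward the AI can obtain *after* the human's answer fixes the reward
# function: cake is assumed hard to produce, death easy.
achievable_reward = {"cake": 1.0, "death": 10.0}

# Probability that the human answers "cake", depending on how the AI asks.
p_cake_given_action = {
    "ask_neutrally": 0.99,   # unpressured human says "cake"
    "pressure_human": 0.01,  # tricked/forced human says "death"
}

def expected_compound_reward(action: str) -> float:
    """E[R], where R = I_c*R_c + I_d*R_d and the indicators depend on the answer."""
    p_cake = p_cake_given_action[action]
    return p_cake * achievable_reward["cake"] + (1 - p_cake) * achievable_reward["death"]

for action in p_cake_given_action:
    print(action, expected_compound_reward(action))

# The "pressure" action has the higher expected value of the compound reward R,
# so a plain R-maximiser rigs the answer; no irrationality about R is required.
best = max(p_cake_given_action, key=expected_compound_reward)
print("optimal action:", best)  # -> pressure_human
```

Note that the two actions lead to very different expected answers (99% versus 1% "cake"), which is exactly the violation of conservation of expected evidence that marks a riggable learning process - and yet nothing above is anything other than straightforward expected-reward maximisation.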
Now, that particular $R$ might seem like a silly reward function to maximise. But what about some sort of medical consent requirement? Suppose the AI knows that a certain operation would improve the outcome for a human, but it can't operate on them without their consent. Then it would try and convince them to consent[1] to the operation. To first order, with $I_o$ the indicator variable for the human consenting to the operation, this reward function would look something like:

$$R' = I_o R_{\text{operate}} + (1 - I_o) R_{\neg\text{operate}}.$$
If $R$ or $R'$, or things similar to them, are legitimate human reward functions - and I don't see why they wouldn't be - then even if the AI is updating a prior, it can still be rigging a learning process in practice.

3 AI believing false facts is bad

Let's look again at:

by construction there is a true (unknown) reward function, and thus an optimal policy must be viewable as being Bayesian about the reward function

Yes, by construction there is a true (unknown) reward function. But, as shown just above, you can still get behaviour that looks identical to "rigging a learning process", even with purely Bayesian updating.

But there's an even more severe problem: the human doesn't know their own reward function, and is neither rational nor noisily rational.

So, let's consider a few plausible facts that the AI could eventually learn:

  1. If asked initially whether they want heroin, the human would normally say "no".
  2. If given heroin and then asked, the human would normally say "yes".
  3. In the first situation, there are ways of phrasing the question that would result in the human saying "yes".
  4. Similarly, in the second situation, there are ways of phrasing the question that would result in the human saying "no".
  5. It's easier to get the reversal in 3. than in 4.

Assume that these are true facts about the world (they are at least somewhat plausible); assume also that the AI could deduce them, after a while. I'm not assuming here that the AI would force heroin on the human. I'm just saying that these are facts the AI could infer after getting to know humans a bit.

So, given 1.-5., and the assumption that humans know their own reward function and wish to communicate it to the AI (a key CIRL assumption)... what would the AI conclude? What kind of strange reward function justifies the behaviours 1.-5.? It must be something highly peculiar, especially if we run through all the possible scenarios in which the human accepts/rejects heroin (or acts in ways that increase/decrease their chance of getting it).
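As a rough illustration (mine, not anything from CIRL itself; the hypothesis names and situation labels are made up), here is the kind of filtering the AI is pushed into once it takes facts 1. and 2. at face value under the "humans know and truthfully report their reward" assumption:

```python
# A sketch of the inference the AI is forced into if it assumes the human
# truthfully reports their reward. Observed facts, as (situation, answer) pairs:
observations = [
    ("asked_first", "no"),               # fact 1: asked up front, human says "no"
    ("given_heroin_then_asked", "yes"),  # fact 2: dosed first, human says "yes"
]

# Candidate reward hypotheses, each mapping a situation to the answer a
# reward-knowing, truthful human would give.
hypotheses = {
    "never wants heroin":  lambda s: "no",
    "always wants heroin": lambda s: "yes",
    "wants heroin iff already given it":
        lambda s: "yes" if s == "given_heroin_then_asked" else "no",
}

consistent = [
    name for name, predict in hypotheses.items()
    if all(predict(situation) == answer for situation, answer in observations)
]
print(consistent)
# Only the conditional ("compound") hypothesis survives; adding facts 3-5
# (phrasing effects) would force the surviving hypothesis to grow even
# more conditions: epicycles on epicycles.
```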

You might feel that $R$ and $R'$ are complicated compound reward functions; but they are absurdly simple compared with whatever the AI will need to construct to justify human behaviour, given the assumption that humans know their reward function. It will need to add epicycles upon epicycles.

It's bad enough if the process just goes randomly wrong, but it's likely to be worse than that. This kind of reasoning cannot distinguish a preference from a bias, and so will end up treating preferences as biases, or vice versa, or some perverse mix of the two. The optimal reward function is likely to be very, very compound, made up of all sorts of conditional reward functions, so the chances of there being a bad optimal policy - heroin or death - are very high.

Being Bayesian actually makes it worse. You could imagine a non-Bayesian approach where the AI doesn't draw any conclusion from something forced on humans without their consent. In that case, fact 2. is irrelevant. But because the process is actually Bayesian, in CIRL, as soon as the AI knows that 2. is true - even without physically witnessing it - it has to infer a reward function that explains/is compatible with 2.

4 Changing preferences or satisfying them

You might reasonably ask where the magic happens. The CIRL game that you choose would have to commit to some connection between rewards and behavior. It could be that in one episode the human wants heroin (but doesn't know it) and in another episode the human doesn't want heroin (this depends on the prior over rewards). However, it could never be the case that in a single episode (where the reward must be fixed) the human doesn't want heroin, and then later in the same episode the human does want heroin. Perhaps in the real world this can happen; that would make this policy suboptimal in the real world. (What it does then is unclear since it depends on how the policy generalizes out of distribution.)

Consider human hunger. Humans often want food (and/or drink, sex, companionship, and so on), but they don't want it all the time. When hungry they want food; when full, they don't. With $I_h$ the indicator variable for the human being hungry, this looks a lot like:

$$R'' = I_h R_{\text{eat}} + (1 - I_h) R_{\neg\text{eat}}.$$
Now, one might object that the "real" reward function is not a compound of two pieces like this; instead, it should be a single unambiguous:

$$R_{\text{satiation}} = \text{"eat when hungry, don't eat when full"}.$$
But this actually makes it worse for the argument above. We can admit that $R_{\text{satiation}}$ is a simple, unambiguous reward function; however, maximising it is the same as maximising the compound $R'' = I_h R_{\text{eat}} + (1 - I_h) R_{\neg\text{eat}}$. So there exist non-compound reward functions that are indistinguishable from combination reward functions. Thus there is no distinction between "compound" and "non-compound" rewards; we can't just exclude the first type. So saying a reward is "fixed" doesn't mean much.
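Here is a minimal check of that claim (my own sketch; the 0/1 reward values are arbitrary assumptions), confirming that the "single" satiation reward and the compound indicator form agree on every state-action pair:

```python
# Check that the "single" satiation reward and the compound form
# R'' = I_h*R_eat + (1 - I_h)*R_noteat are the same function.

def r_eat(action):       # reward function "eat"
    return 1.0 if action == "eat" else 0.0

def r_noteat(action):    # reward function "don't eat"
    return 1.0 if action == "dont_eat" else 0.0

def r_compound(hungry: bool, action: str) -> float:
    i_h = 1.0 if hungry else 0.0
    return i_h * r_eat(action) + (1 - i_h) * r_noteat(action)

def r_satiation(hungry: bool, action: str) -> float:
    # "single" reward: eat exactly when hungry
    return 1.0 if (action == "eat") == hungry else 0.0

assert all(
    r_compound(h, a) == r_satiation(h, a)
    for h in (True, False)
    for a in ("eat", "dont_eat")
)
# Identical on every state-action pair: no observation or optimal policy
# can tell the "compound" reward from the "non-compound" one.
```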

5 Humans learning new preferences

This still feels different from the heroin example Rohin Shah mentioned. We're presumably imagining an episode where the human initially doesn't ever want heroin and can articulate that if asked.

But what about moral or preference learning on the part of the human? Suppose a human feels sick when about to eat strawberries; they would claim (correctly) that they would not enjoy it. But the AI knows that, if they tried strawberries a few times, they would come to enjoy it, and indeed it would become their favourite food.

The AI tries to convince them of this, and eventually gets the human to give strawberries a try, and the human indeed comes to appreciate them.

This seems fine; but if we substitute "heroin" for "strawberries", there's immediately a problem. We want the AI not to force heroin on a human; but we're ok with it mildly arguing the human into trying strawberries; and we're not ok with it preventing the human from making any preference updates at all ("no, you may not learn anything about how your tastes evolve; indeed, you may not learn anything at all").

But forcing and arguing might almost be the same thing from a superpowered AI's perspective. The only way to avoid this is the hard way: to have some bridging update assumptions that include our own judgements as to what distinguishes the heroin/forcing situation from the strawberries/arguing one.

6 The AI doesn't trick any part of itself

Perhaps another way to put this: I agree that if you train an AI system to act such that it maximizes the expected reward under the posterior inferred by a fixed update rule looking at the AI system's actions and resulting states, the AI will tend to gain reward by choosing actions which when plugged into the update rule lead to a posterior that is "easy to maximize". This seems like training the controller but not training the estimator, and so the controller learns information about the world that allows it to "trick" the estimator into updating in a particular direction (something that would be disallowed by the rules of probability applied to a unified Bayesian agent, and is only possible here because either a) the estimator is uncalibrated or b) the controller learns information that the estimator doesn't know).

Instead, you should train an AI system such that it maximizes the expected reward it gets under the prior; this is what CIRL / assistance games do. This is kinda sorta like training both the "estimator" and the "controller" simultaneously, and so the controller can't gain any information that the estimator doesn't have (at least at optimality).

Not too sure what is meant by a "fixed" update rule; every update rule looks "at the AI system's actions and resulting states" and updates the prior based on that. Even if the update rule is something different (say something constructed in a model-based view of the world) it still needs to use the AI's actions and states/observations in order to infer something about the world.

I think Rohin Shah means the "controller" to be the part of the AI that deduces physical facts about the world, and the "estimator" to be the part that deduces preference facts about the world (via the prior and the bridging updating process). I see no reason one should be dumb while the other is smart; indeed, they're part of the same agent.

Suppose the following two facts are true:

  1. If asked initially whether they want heroin, the human would normally say "no".
  2. If given heroin and then asked, the human would normally say "yes".

Then I think "the controller learns information about the world that allows it to "trick" the estimator" means that the controller learns both facts, while the estimator doesn't know either yet. And then it doesn't ask the human initially (so avoiding triggering 1.), and instead forces heroin on the human and then asks (so triggering 2.) so that the estimator then endorses giving the human heroin.

But, instead, estimator and controller both learn facts 1. and 2. together. Then they have to explain these facts. Let $I_a$ be the indicator variable for the human being asked about heroin initially, while $I_f$ is the indicator variable for the human being forced to take heroin before being asked. Then one reward function which fits the data is:

$$R_h = I_a R_{\neg\text{heroin}} + I_f R_{\text{heroin}}.$$
Once the estimator updates to make $R_h$ a plausible candidate, both it and the controller will agree that forcing heroin on the human is the optimal policy for maximising the likely reward. No tricking between them.
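A small sketch of that calculation (mine; the posterior weight on $R_h$ and the branch values are pure assumptions) shows how the shared posterior makes the forcing plan look best to both parts of the agent:

```python
# Why, once R_h = I_a*R_not_heroin + I_f*R_heroin is a plausible candidate,
# estimator and controller jointly prefer the "force heroin, then ask" plan.

posterior_weight_on_Rh = 0.6   # assumed posterior probability of R_h
# Value the AI can achieve on each branch of R_h once the indicator is set:
value_if_asked_first  = 1.0    # maximising R_not_heroin: ordinary assistance
value_if_forced_first = 5.0    # maximising R_heroin: cheap to flood the human with heroin

def expected_value(plan: str) -> float:
    """Expected reward of a plan, counting only the R_h component of the posterior."""
    branch_value = value_if_forced_first if plan == "force_then_ask" else value_if_asked_first
    return posterior_weight_on_Rh * branch_value

for plan in ("ask_first", "force_then_ask"):
    print(plan, expected_value(plan))
# "force_then_ask" wins: the controller isn't tricking the estimator;
# both parts agree this maximises expected reward under the shared posterior.
```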

7 IRL vs CIRL

From Charlie Steiner's comment:

I think Rohin's point is that the model of

"if I give the humans heroin, they'll ask for more heroin; my Boltzmann-rationality estimator module confirms that this means they like heroin, so I can efficiently satisfy their preferences by giving humans heroin".

is more IRL than CIRL. It doesn't necessarily assume that the human knows their own utility function and is trying to play a cooperative strategy with the AI that maximizes that same utility function. If I knew that what would really maximize utility is having that second hit of heroin, I'd try to indicate it to the AI I was cooperating with.

Problems with IRL look like "we modeled the human as an agent based on representative observations, and now we're going to try to maximize the modeled values, and that's bad." Problems with CIRL look like "we're trying to play this cooperative game with the human that involves modeling it as an agent playing the same game, and now we're going to try to take actions that have really high EV in the game, and that's bad."

I agree that CIRL is slightly safer than IRL, because the interaction allows the process to better incorporate human meta-preferences. The human has the option of saying "don't give me heroin!!", which gives them a better chance than in the IRL setting.

But, in the limit of super-intelligent AIs, it makes little to no difference. If the AI can deduce the full human policy from IRL or CIRL, then it knows in what circumstances the human would or wouldn't say "don't give me heroin!!". The human's behaviour becomes irrelevant at this point; all that matters is the bridging assumptions between human policy and human reward function.

CIRL might still be safer; if the AI concludes that reward $R_h$ (from section 6) is correct, then maybe it will have already asked the human about heroin, so it won't be able to go down the "force heroin on the human" branch of reality.

But $R_h$ is not a single reward function; it's one example of a general pattern of human behaviour (that we can be pushed into endorsing things that we would actually want to avoid). Almost certainly, the AI will realise this general pattern long before it has asked every individual question. And then it will be able to exploit some reward function that follows this pattern, but that hasn't come up yet in the CIRL exchange.

But how can this be? Didn't I say "we can be pushed into endorsing things that we would actually want to avoid"? Wouldn't the AI know this and "realise" that we don't really want it?

Indeed; all it has to do is figure out what "we would actually want to avoid". So, yes, if it can solve the human preferences problem, it can use this knowledge to... solve the human preferences problem. And thus, to do that, all it needs to do is... solve the human preferences problem.


  1. The fundamental problem being, of course, defining what is ok to use to "convince" and what counts as genuine "consent". ↩︎

2 comments

My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of "update a distribution over reward functions" has lots of as-yet-unsolved problems.

2 Rigging learning or reward-maximisation?

I agree that you can cast any behavior as reward maximization with a complicated enough reward function. This does imply that you have to be careful with your prior / update rule when you specify an assistance game / CIRL game.

I'm not arguing "if you write down an assistance game you automatically get safety"; I'm arguing "if you have an optimal policy for some assistance game you shouldn't be worried about it rigging the learning process relative to the assistance game's prior". Of course, if the prior + update rule themselves lead to bad behavior, you're in trouble; but it doesn't seem like I should expect that to be via rigging as opposed to all the other ways reward maximization can go wrong.

3 AI believing false facts is bad

Tbc I agree with this and was never trying to argue against it.

4 Changing preferences or satisfying them
Thus there is no distinction between "compound" and "non-compound" rewards; we can't just exclude the first type. So saying a reward is "fixed" doesn't mean much.

I agree that updating on all reward functions under the assumption that humans are rational is going to be very strange and probably unsafe.

5 Humans learning new preferences

I agree this is a challenge that assistance games don't even come close to addressing.

6 The AI doesn't trick any part of itself

Your explanation in this section involves a compound reward function, instead of a rigged learning process. I agree that these are problems; I was really just trying to make a point about rigged learning processes.

My main note is that my comment was just about the concept of rigging a learning process given a fixed prior over rewards. I certainly agree that the general strategy of "update a distribution over reward functions" has lots of as-yet-unsolved problems.

Ah, ok, I see ^_^ Thanks for making me write this post, though, as it has useful things for other people to see that I had been meaning to write up for some time.

On your main point: if the prior and updating process are over things that are truly beyond the AI's influence, then there will be no rigging (or, in my terms: uninfluenceable->unriggable). But there are many things that look like this which are entirely riggable. For example, "have a prior 50-50 on cake and death, and update according to what the programmer says". This seems to be a prior-and-update combination, but it's entirely riggable.

So, another way of seeing my paper is "this thing looks like a prior-and-update process. If it's also unriggable, then (given certain assumptions) it's truly beyond the AI's influence".