Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here.

Corrigibility should allow safe value or policy change. Indifference allows the agent to accept changes without objecting. However, an indifferent agent is similarly indifferent to the learning process.

Classical uncertainty over values has the opposite problem: the AI is motivated to learn more about its values (and preserve the learning process) BUT is also motivated to manipulate its values.

Both these effects can be illustrated on a single graph. Assume that the AI follows utility $u$, is uncertain between utilities $v$ and $w$, and has a probability $p$ that $u = v$.


Note that the correct way of achieving this is to define $u = I_{u=v} \, v + I_{u=w} \, w$ for the indicator functions $I_{u=v}$ and $I_{u=w} = 1 - I_{u=v}$. This allows the agent to correctly solve the naive cake or death problem. However, I'll continue to use the terminology that we're used to, with the understanding that $u = v$ means $I_{u=v} = 1$.

Then there are four key values: $E(v \mid p=1)$, $E(w \mid p=1)$, $E(v \mid p=0)$, and $E(w \mid p=0)$ (all expectations and probabilities are taken with respect to the AI's estimates). Since $p$ is the AI's probability that $u = v$, $p = 1$ means that the AI will behave as a pure $v$-maximiser. Thus $E(v \mid p=1)$ and $E(w \mid p=1)$ are the expectations of $v$ and $w$, respectively, given that the AI is maximising $v$. And $E(v \mid p=0)$ and $E(w \mid p=0)$ are the expectations of the utilities given that the AI is maximising $w$.

In any reasonable world, $E(v \mid p=1) \geq E(v \mid p=0)$ and $E(w \mid p=0) \geq E(w \mid p=1)$ -- the AI cannot maximise a utility better by trying to maximise a different one.

For illustrative purposes, fix values for $E(v \mid p=1)$, $E(w \mid p=1)$, $E(v \mid p=0)$, and $E(w \mid p=0)$ satisfying these inequalities, and consider the following graph:

The blue line connects $E(w \mid p=0)$ (at $p=0$) with $E(v \mid p=1)$ (at $p=1$). This is the expected $u$, plotted against the AI's current $p$, if the AI expects to be immediately informed of the correct $u$. It is the maximal possible expected $u$ given $p$.

The purple line connects $E(v \mid p=1)$ (at $p=1$) with $E(w \mid p=1)$ (at $p=0$). Note that the second point's $y$-value is not where you might naively expect: it is $E(w \mid p=1)$, not $E(w \mid p=0)$. This represents the expected $u$, as a function of $p$, if the AI were to behave as a pure $v$-maximiser. The yellow line connects $E(v \mid p=0)$ (at $p=1$) with $E(w \mid p=0)$ (at $p=0$) and represents the expected $u$, as a function of $p$, for an AI that behaves as a pure $w$-maximiser.

Since the AI has the option of behaving as a pure $v$-maximiser or $w$-maximiser, those lines represent the minimal $u$-utility the AI can achieve. These minimums can actually be obtained: imagine an AI that has a single choice between two options, and these options have differential effects on $v$ and $w$.

But in general, there will be some but not perfect tradeoff between maximising $v$ and $w$, and the true expectation of $u$, as a function of $p$, will be given by a curve within the triangle defined by the three lines -- a curve like the green one.
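For concreteness, here is a small Python sketch that computes the three lines for placeholder values of the four key expectations. The variable names and numbers are illustrative assumptions only, chosen to respect the inequalities above.

```python
import numpy as np

# Placeholder values for the four key expectations (assumed for illustration,
# chosen so that E(v|p=1) >= E(v|p=0) and E(w|p=0) >= E(w|p=1)).
E_v_p1, E_w_p1 = 10.0, 0.0   # expectations of v and w when the AI acts as a pure v-maximiser
E_v_p0, E_w_p0 = 1.0, 6.0    # expectations of v and w when the AI acts as a pure w-maximiser

p = np.linspace(0.0, 1.0, 101)   # the AI's probability that u = v

# Blue line: expected u if the AI will immediately be told the correct u.
blue = p * E_v_p1 + (1 - p) * E_w_p0

# Purple line: expected u if the AI behaves as a pure v-maximiser regardless of p.
purple = p * E_v_p1 + (1 - p) * E_w_p1

# Yellow line: expected u if the AI behaves as a pure w-maximiser regardless of p.
yellow = p * E_v_p0 + (1 - p) * E_w_p0

# Any achievable "green" curve lies between max(purple, yellow) and the blue line.
lower = np.maximum(purple, yellow)
assert np.all(lower <= blue + 1e-9)
```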

Theorem: Curves of expected $u$-utility must be convex as a function of $p$.

Proof: Let the curve be defined as $f(p)$. Fix any $0 \leq p_1 < p_2 \leq 1$ and $0 \leq q \leq 1$. Assume the AI has $p = q p_1 + (1-q) p_2$. Now update its information so that it knows that, after getting that info, $p$ will be either $p_1$ (with probability $q$) or $p_2$ (with probability $1-q$). Because of the probabilities we've chosen, $p$ is still $q p_1 + (1-q) p_2$ at the moment. Hence the expectation of $u$, given this information update, is the point of $x$-coordinate $q p_1 + (1-q) p_2$ on the line joining $(p_1, f(p_1))$ with $(p_2, f(p_2))$. Because extra information cannot make the agent worse off in expectation, this point must have a $y$-coordinate higher than or equal to $f(q p_1 + (1-q) p_2)$ (its expectation without the information update). This demonstrates the curve is convex.
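Written as an inequality, with $f$, $p_1$, $p_2$ and $q$ as in the proof, this is just the standard definition of convexity:

```latex
f\big(q\,p_1 + (1-q)\,p_2\big) \;\le\; q\,f(p_1) + (1-q)\,f(p_2)
\qquad \text{for all } p_1, p_2, q \in [0,1].
```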

As a side effect of this argument, we can clearly see the value of information. Looking at the "expectation of $u$" curve (now purple), we can see that extra information can lift its expectation up to the blue line (perfect value information). Therefore the orange segment represents the improvement the AI can expect from learning more about its values. This demonstrates that the AI has an urge to preserve its learning process.
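In the same notation, the value of perfect information about $u$ at belief $p$ is the vertical gap between the blue line and the curve (a restatement of the argument above):

```latex
\mathrm{VOI}(p) \;=\; \underbrace{p\,E(v \mid p=1) + (1-p)\,E(w \mid p=0)}_{\text{blue line}} \;-\; f(p) \;\ge\; 0.
```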

Bias

What is the problem then with classical learning systems? The problem is that, typically, $E(v \mid p=1)$ and $E(w \mid p=0)$ are not equal. So though the AI is motivated to learn, it's also motivated to manipulate the learning process.

For instance, $v$ could be "make cake", $w$ "cause deaths", and the update process could be "ask humans which to do".

If $I_C$ is the indicator function for "the human answers Cake" and $I_D$ the indicator function for "the human answers death", then

  • $u = I_C \, v + I_D \, w$.

And the AI will attempt to increase $p$ (make the "Cake" answer more likely) if it can.
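A minimal sketch of this bias, with assumed numbers (the probabilities and expectations below are hypothetical, purely for illustration):

```python
# If E(u | human answers "Cake") exceeds E(u | human answers "death"), then a
# policy that nudges the human towards answering "Cake" looks better in
# expectation even before any answer is observed -- that is the bias.
# All numbers are assumed for illustration.

E_u_given_cake = 10.0    # expected u if the human answers "Cake" (AI then maximises v)
E_u_given_death = 6.0    # expected u if the human answers "death" (AI then maximises w)

def expected_u(prob_cake: float) -> float:
    """Expected u before asking, as a function of P(human answers 'Cake')."""
    return prob_cake * E_u_given_cake + (1 - prob_cake) * E_u_given_death

p_honest = 0.5   # P("Cake") if the AI just asks
p_nudged = 0.8   # P("Cake") if the AI manipulates the human first

print(expected_u(p_honest))   # 8.0
print(expected_u(p_nudged))   # 9.2 -- manipulation raises the AI's expected u
```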

Versus indifference

How does indifference compare with classical value uncertainty? Well, whenever an indifferent AI changes its utility function, it gets compensatory rewards to ensure the expected utility doesn't change. Therefore its expectation is constant for varying $p$, as shown by the yellow line:
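One common way to formalise the compensatory reward, as a sketch (the exact construction varies between indifference proposals, so this is an assumption about the intended setup): if the learning event switches the agent from maximising $v$ to maximising $w$, it also receives

```latex
C \;=\; \max_\pi E(v \mid \pi) \;-\; \max_\pi E(w \mid \pi),
\qquad\text{so that}\qquad
p \, \max_\pi E(v \mid \pi) \;+\; (1-p)\Big(\max_\pi E(w \mid \pi) + C\Big) \;=\; \max_\pi E(v \mid \pi),
```

which is independent of $p$: a flat line, matching the description above.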

So, between indifference and classical value learning, the first has no incentive to manipulate the learning process, but also no incentive to preserve it; classical value learning differs in both ways, being motivated both to preserve and to manipulate the process.

Could we get a compromise between the two? One initial idea would be to use a compensatory reward to move $E(v \mid p=1)$ and $E(w \mid p=0)$ to the same level, removing the bias but preserving the desire to learn. This approach seems to have a lot of problems, though, and I can't get it to work properly for the moment.
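For concreteness, one possible reading of "move them to the same level" (with the caveat that, as noted, nothing like this is known to work properly) would shift $w$ by a constant compensatory term:

```latex
w' \;=\; w \;+\; E(v \mid p=1) \;-\; E(w \mid p=0),
\qquad\text{so that}\qquad
E(w' \mid p=0) \;=\; E(v \mid p=1).
```

Both endpoints of the blue line then coincide, so neither answer is preferred in expectation, while the convexity argument above still gives a positive value of information.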

Comments

RE: my last question: after talking to Stuart, I think one way of viewing the problem with such a proposal is that the agent cares about its future expected utility (which depends on the state/history, not just the MDP).

Why doesn't normalizing rewards work?

(i.e. set max_pi(expected returns) = 1 and min_pi(expected returns) = 0, for all environments)... I assume this is what you're talking about at the end?

In your example for bias, the agent only has an incentive to manipulate humans because it's treating the human's word as truth, rather than as evidence. For example, an AI that relies on button-pressing to learn about human morality will try to press its own buttons if it thinks that the buttons are identical to morality, but will not do so if it has a causal model of the world that allows for morality as one of several causes of button presses.

So a fully probabilistic value learner as in Dewey 2011 doesn't have this manipulativeness - the trouble is just that we don't know how to write down the perfect probabilistic model of the world that such a value learner needs in order to work. Hm, I wonder if there's a way to solve this problem with lots of data and stochastic gradient descent.

(EDIT: For toy problems, you might try to learn correct moral updating from examples of correct moral updates, but the data would be hard to generate for the real world and the space to search would be huge. It seems to me that an AI couldn't start ignorant, then learn how to learn about morality as it explored the world, then explore the world.)

So though the AI is motivated to learn, it’s also motivated to manipulate the learning process.

It seems like the problem here is that the prior probability that the human says "cake" depends on the AI's policy. The update when seeing the human actually say "cake" isn't a problem, due to conservation of expected evidence.

Under my (very incomplete) model of Everitt's approach, the programmer will specify the prior over values (so the prior is independent of the AI's policy), then disallow actions that would prevent the reward signal from being an unbiased estimate of the values.

Correct me if I'm wrong, but doesn't this proposal compromise between the two in a satisfying way?

I think I have a better solution that doesn't involve counterfactuals at all.