Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here.

Corrigibility should allow safe value or policy change. Indifference allows the agent to accept changes without objecting. However, an indifferent agent is similarly indifferent to the learning process.

Classical uncertainty over values has the opposite problem: the AI is motivated to learn more about its values (and preserve the learning process) BUT is also motivated to manipulate its values.

Both these effects can be illustrated on a single graph. Assume that the AI follows utility $u$, is uncertain between utilities $v$ and $w$, and has a probability $p$ that $u = v$.


Note that the correct way of achieving this is to define $u = I_{u=v} \, v + I_{u=w} \, w$ for the indicator functions $I_{u=v}$ and $I_{u=w} = 1 - I_{u=v}$. This allows the agent to correctly solve the naive cake or death problem. However, I'll continue to use the terminology that we're used to, with the understanding that $u = v$ means $I_{u=v} = 1$.

Then there are four key values: $E(v \mid p=1)$, $E(w \mid p=1)$, $E(v \mid p=0)$, and $E(w \mid p=0)$ (all expectations and probabilities are taken with respect to the AI's estimates). Since $p$ is the AI's probability that $u = v$, $p = 1$ means that the AI will behave as a pure $v$-maximiser. Thus $E(v \mid p=1)$ and $E(w \mid p=1)$ are the expectations of $v$ and $w$, respectively, given that the AI is maximising $v$. And $E(v \mid p=0)$ and $E(w \mid p=0)$ are the expectations of the utilities given that the AI is maximising $w$.

In any reasonable world, $E(v \mid p=1) \geq E(v \mid p=0)$ and $E(w \mid p=0) \geq E(w \mid p=1)$ -- the AI cannot maximise a utility better by trying to maximise a different one.

For illustrative purposes, fix values for $E(v \mid p=1)$, $E(w \mid p=1)$, $E(v \mid p=0)$, and $E(w \mid p=0)$ satisfying these inequalities, and consider the following graph:

The blue line connects $E(w \mid p=0)$ (at $p=0$) with $E(v \mid p=1)$ (at $p=1$). This is the expected $u$, plotted against the AI's current $p$, if the AI expects to be immediately informed of the correct $u$. It is the maximal possible expected $u$ given $p$.

The purple line connects $E(v \mid p=1)$ (at $p=1$) with $E(w \mid p=1)$ (at $p=0$). Note that the second point's $y$-value is not where you might naively expect: it is $E(w \mid p=1)$, not $E(w \mid p=0)$. This represents the expected $u$, as a function of $p$, if the AI were to behave as a pure $v$-maximiser. The yellow line connects $E(v \mid p=0)$ (at $p=1$) with $E(w \mid p=0)$ (at $p=0$) and represents the expected $u$, as a function of $p$, for an AI that behaves as a pure $w$-maximiser.

Since the AI has the option of behaving as a pure $v$-maximiser or $w$-maximiser, those lines represent the minimal $u$-utility the AI can achieve. These minimums can actually be obtained: imagine an AI that has a single choice between two options, and these options have differential effects on $v$ and $w$.

But in general, there will be some but not perfect tradeoff between maximising $v$ and $w$, and the true expectation of $u$, as a function of $p$, will be given by a curve within the triangle defined by the three lines -- a curve like the green one.
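For concreteness, here is a small Python sketch that computes the three lines for placeholder values of the four key expectations. The variable names and numbers are illustrative assumptions only, chosen to respect the inequalities above.

```python
import numpy as np

# Placeholder values for the four key expectations (assumed for illustration,
# chosen so that E(v|p=1) >= E(v|p=0) and E(w|p=0) >= E(w|p=1)).
E_v_p1, E_w_p1 = 10.0, 0.0   # expectations of v and w when the AI acts as a pure v-maximiser
E_v_p0, E_w_p0 = 1.0, 6.0    # expectations of v and w when the AI acts as a pure w-maximiser

p = np.linspace(0.0, 1.0, 101)   # the AI's probability that u = v

# Blue line: expected u if the AI will immediately be told the correct u.
blue = p * E_v_p1 + (1 - p) * E_w_p0

# Purple line: expected u if the AI behaves as a pure v-maximiser regardless of p.
purple = p * E_v_p1 + (1 - p) * E_w_p1

# Yellow line: expected u if the AI behaves as a pure w-maximiser regardless of p.
yellow = p * E_v_p0 + (1 - p) * E_w_p0

# Any achievable "green" curve lies between max(purple, yellow) and the blue line.
lower = np.maximum(purple, yellow)
assert np.all(lower <= blue + 1e-9)
```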

Theorem: Curves of expected $u$-utility must be convex as a function of $p$.

Proof: Let the curve be defined as $f(p)$. Fix any $0 \leq p_1 < p_2 \leq 1$ and $0 \leq q \leq 1$. Assume the AI has $p = q p_1 + (1-q) p_2$. Now update its information so that it knows that, after getting that info, $p$ will be either $p_1$ (with probability $q$) or $p_2$ (with probability $1-q$). Because of the probabilities we've chosen, $p$ is still $q p_1 + (1-q) p_2$ at the moment. Hence the expectation of $u$, given this information update, is the point of $x$-coordinate $q p_1 + (1-q) p_2$ on the line joining $(p_1, f(p_1))$ with $(p_2, f(p_2))$. Because extra information cannot make the agent worse off in expectation, this point must have a $y$-coordinate higher than or equal to $f(q p_1 + (1-q) p_2)$ (its expectation without the information update). This demonstrates the curve is convex.
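Written as an inequality, with $f$, $p_1$, $p_2$ and $q$ as in the proof, this is just the standard definition of convexity:

```latex
f\big(q\,p_1 + (1-q)\,p_2\big) \;\le\; q\,f(p_1) + (1-q)\,f(p_2)
\qquad \text{for all } p_1, p_2, q \in [0,1].
```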

As a side effect of this argument, we can clearly see the value of information. Looking at the "expectation of $u$" curve (now purple), we can see that extra information can lift its expectation up to the blue line (perfect value information). Therefore the orange segment represents the improvement the AI can expect from learning more about its values. This demonstrates that the AI has an urge to preserve its learning process.
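In the same notation, the value of perfect information about $u$ at belief $p$ is the vertical gap between the blue line and the curve (a restatement of the argument above):

```latex
\mathrm{VOI}(p) \;=\; \underbrace{p\,E(v \mid p=1) + (1-p)\,E(w \mid p=0)}_{\text{blue line}} \;-\; f(p) \;\ge\; 0.
```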

Bias

What is the problem then with classical learning systems? The problem is that, typically, $E(v \mid p=1)$ and $E(w \mid p=0)$ are not equal. So though the AI is motivated to learn, it's also motivated to manipulate the learning process.

For instance, $v$ could be "make cake", $w$ "cause deaths", and the update process could be "ask humans which to do".

If $I_C$ is the indicator function for "the human answers Cake" and $I_D$ the indicator function for "the human answers death", then

  • $u = I_C \, v + I_D \, w$.

And the AI will attempt to increase $p$ (make the "Cake" answer more likely) if it can.
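A minimal sketch of this bias, with assumed numbers (the probabilities and expectations below are hypothetical, purely for illustration):

```python
# If E(u | human answers "Cake") exceeds E(u | human answers "death"), then a
# policy that nudges the human towards answering "Cake" looks better in
# expectation even before any answer is observed -- that is the bias.
# All numbers are assumed for illustration.

E_u_given_cake = 10.0    # expected u if the human answers "Cake" (AI then maximises v)
E_u_given_death = 6.0    # expected u if the human answers "death" (AI then maximises w)

def expected_u(prob_cake: float) -> float:
    """Expected u before asking, as a function of P(human answers 'Cake')."""
    return prob_cake * E_u_given_cake + (1 - prob_cake) * E_u_given_death

p_honest = 0.5   # P("Cake") if the AI just asks
p_nudged = 0.8   # P("Cake") if the AI manipulates the human first

print(expected_u(p_honest))   # 8.0
print(expected_u(p_nudged))   # 9.2 -- manipulation raises the AI's expected u
```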

Versus indifference

How does indifference compare with classical value uncertainty? Well, whenever an indifferent AI changes its utility function, it gets compensatory rewards to ensure the expected utility doesn't change. Therefore its expectation is constant for varying $p$, as shown by the yellow line:
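One common way to formalise the compensatory reward, as a sketch (the exact construction varies between indifference proposals, so this is an assumption about the intended setup): if the learning event switches the agent from maximising $v$ to maximising $w$, it also receives

```latex
C \;=\; \max_\pi E(v \mid \pi) \;-\; \max_\pi E(w \mid \pi),
\qquad\text{so that}\qquad
p \, \max_\pi E(v \mid \pi) \;+\; (1-p)\Big(\max_\pi E(w \mid \pi) + C\Big) \;=\; \max_\pi E(v \mid \pi),
```

which is independent of $p$: a flat line, matching the description above.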

So, between indifference and classical value learning, the first has no incentive to manipulate the learning process, but also no incentive to preserve it; classical value learning differs in both ways, being motivated both to preserve and to manipulate the process.

Could we get a compromise between the two? One initial idea would be to use a compensatory reward to move $E(v \mid p=1)$ and $E(w \mid p=0)$ to the same level, removing the bias but preserving the desire to learn. This approach seems to have a lot of problems, though, and I can't get it to work properly for the moment.
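For concreteness, one possible reading of "move them to the same level" (with the caveat that, as noted, nothing like this is known to work properly) would shift $w$ by a constant compensatory term:

```latex
w' \;=\; w \;+\; E(v \mid p=1) \;-\; E(w \mid p=0),
\qquad\text{so that}\qquad
E(w' \mid p=0) \;=\; E(v \mid p=1).
```

Both endpoints of the blue line then coincide, so neither answer is preferred in expectation, while the convexity argument above still gives a positive value of information.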

Comments

RE: my last question: after talking to Stuart, I think one way of viewing the problem with such a proposal is that the agent cares about its future expected utility (which depends on the state/history, not just the MDP).

Why doesn't normalizing rewards work?

(i.e. set max_pi(expected returns) = 1 and min_pi(expected returns) = 0, for all environments)... I assume this is what you're talking about at the end?

In your example for bias, the agent only has an incentive to manipulate humans because it's treating the human's word as truth, rather than as evidence. For example, an AI that relies on button-pressing to learn about human morality will try to press its own buttons if it thinks that the buttons are identical to morality, but will not do so if it has a causal model of the world that allows for morality as one of several causes of button presses.

So a fully probabilistic value learner as in Dewey 2011 doesn't have this manipulativeness - the trouble is just that we don't know how to write down the perfect probabilistic model of the world that such a value learner needs in order to work. Hm, I wonder if there's a way to solve this problem with lots of data and stochastic gradient descent.

(EDIT: For toy problems, you might try to learn correct moral updating from examples of correct moral updates, but the data would be hard to generate for the real world and the space to search would be huge. It seems to me that an AI couldn't start ignorant, then learn how to learn about morality as it explored the world, then explore the world.)

So though the AI is motivated to learn, it’s also motivated to manipulate the learning process.

It seems like the problem here is that the prior probability that the human says "cake" depends on the AI's policy. The update when seeing the human actually say "cake" isn't a problem, due to conservation of expected evidence.

Under my (very incomplete) model of Everitt's approach, the programmer will specify the prior over values (so the prior is independent of the AI's policy), then disallow actions that would prevent the reward signal from being an unbiased estimate of the values.

Correct me if I'm wrong, but doesn't this proposal compromise between the two in a satisfying way?

I think I have a better solution that doesn't involve counterfactuals at all.