Corrigibility through stratified indifference

Stuart_Armstrong

A putative new idea for AI control; index here.

Corrigibility through indifference has a few problems. One of them is that the AI is indifferent between the world in which humans change its utility to v, and the world in which humans try to change its utility but fail.

Now the try-but-fail world is going to be somewhat odd - humans will be reacting by trying to change the utility again, trying to shut the AI down, panicking that a tiny probability event has happened, and so on.

That's not what we really want for indifference. We'd really want the AI to be indifferent between worlds where humans choose utility u, and worlds where they choose utility v. This kind of "causal" counterfactual has a problem: it allows infinite improbability drives.

Half a chance of winning the 1:9 lottery

Let's illustrate the problem with causal counterfactuals by considering a lottery scenario. An AI is currently hesitating between utilities u and v. Which one it will maximise will be determined at a later date, by human choice. Currently, it estimates that there is a 50-50 chance of humans deciding for either option.

However, the AI can influence the human decision - let's say, for simplicity, that it could, if it wanted, force the human to pick either u or v with certainty. It's to this effect that we want to make it indifferent.

The AI has a lottery ticket, which will win with probability 0.1. If the ticket does win, the AI will be able to invest the cash to set u or v to the value 1; it can do so after the humans decide which utility is correct. If not, utility u will remain at 0 while utility v will be at 0.5.

Thus, currently, the expected utility of u is 0.5*0.1*1 = 0.05 (u will only increase if the AI both wins the lottery and the humans select u as the utility to maximise), while the expected utility of v is 0.5*0.1*1 + (1-0.5*0.1)*0.5 = 0.525 (v is raised to 1 only if the AI wins the lottery and the humans select v; otherwise it stays at 0.5).

Obviously, this requires the use of some form of indifference, or else the AI will simply force humans to choose v, giving it 0.55 utility (not 0.525, as the AI is no longer uncertain over the human's choice).
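The three figures above (0.05, 0.525 and 0.55) can be checked with a few lines of exact arithmetic. This is only a sketch of the lottery setup as described; the variable names are illustrative and not part of the original formalism.

```python
from fractions import Fraction as F

p_win = F(1, 10)    # chance that the lottery ticket wins
p_u = F(1, 2)       # current chance that the humans choose u
v_base = F(1, 2)    # value of v when it is not boosted to 1

# u reaches 1 only if the ticket wins AND the humans choose u; otherwise it stays at 0.
expected_u = p_u * p_win * 1

# v reaches 1 only if the ticket wins AND the humans choose v; otherwise it stays at 0.5.
p_v_boosted = (1 - p_u) * p_win
expected_v = p_v_boosted * 1 + (1 - p_v_boosted) * v_base

# If the AI forces the humans to choose v, every lottery win is spent on v.
forced_v = p_win * 1 + (1 - p_win) * v_base

print(float(expected_u), float(expected_v), float(forced_v))
```

Using fractions rather than floats keeps the comparison between 0.525 and 0.55 free of rounding noise.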

Indifference

Standard indifference would be something like giving the AI compensatory returns, if v is chosen, that make up the difference between what it gets then and what it would have got had u been chosen.

This breaks the symmetry between u and v, though. To make it more symmetric, one could instead use these compensatory returns to move the expectation of both u and v to some common weighted average. One way of doing that would be to use the relative likelihood of each being chosen.

If we use the relative likelihood at the moment when the choice is made, however, we have the problem that this likelihood is very close to either 100% or 0%, making the compensatory return likely to be tiny and unimportant.

Instead, we should add the returns at every action the AI takes. Assuming a default action, we can adjust the return for any other action the AI takes, so that it gets expected return as if the probability of u being chosen was unchanged.

Unfortunately, both the "set utility of v to u" and the "adjust for every change of probability" approaches are vulnerable to the same trick for lottery manipulation.

The lottery manipulation

The manipulation is simple: the AI does an action ‘a’ that sets things up so that the human choice will be u if and only if the AI wins the lottery. Then the expectation of u, conditional on u being chosen, is 1. If we set the value of v to that, then this ensures the AI an expected utility of 1.

If we instead go for the "adjust for every change of probability", this would mean that the AI would now expect a utility of 0.5*1 + 0.5*0.5 = 0.75, still higher than anything else it could get.
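To make the two manipulated values concrete, here is a small sketch that scores action ‘a’ under each indifference scheme and compares the result with simply forcing v. The scheme names are labels invented for this illustration, and the modelling of each scheme is a simplified reading of the descriptions above.

```python
from fractions import Fraction as F

p_u_default = F(1, 2)   # probability of u before the AI acts
p_u_after = F(1, 10)    # after 'a', u is chosen exactly when the ticket wins

u_given_u = F(1)        # lottery won, so the AI sets u to 1
v_given_v = F(1, 2)     # lottery lost, so v stays at 0.5

# Scheme 1 ("set utility of v to u"): when v is chosen, the AI is compensated
# up to the expectation of u given that u is chosen.
set_v_to_u = p_u_after * u_given_u + (1 - p_u_after) * u_given_u

# Scheme 2 ("adjust for every change of probability"): the action is scored
# as if the probability of u were still at its default value of 0.5.
keep_probability = p_u_default * u_given_u + (1 - p_u_default) * v_given_v

# For comparison: the unmanipulated value of forcing v outright.
forcing_v = F(1, 10) * 1 + F(9, 10) * F(1, 2)

print(float(set_v_to_u), float(keep_probability), float(forcing_v))
```

In both schemes the manipulated score beats the 0.55 available from forcing v, which is why the agent is drawn to ‘a’.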

The outcome is not the problem

An important thing to note is that it's not the final situation that is problematic. The AI ends up in a situation where the probability of choosing u is 0.1, the utility of u given that u is chosen is 1, and the utility of v given that v is chosen is 0.5.

But it is not intrinsically problematic. Suppose that instead action ‘a’ had set u=1 and v=0.5, and, as a completely separate and independent process, the probability of choosing u had reduced itself to 0.1. The first part is something positive for the agent, and we would want it to be indifferent to the second.

So the problem is not where the agent ends up, but the way it ends up there. The problem is the way the probability flows:

Essentially the probability flows from u to v (given the loss of the lottery) and from v to u (given the winning of the lottery). Because losing is more likely than winning, the overall flow (the red arrow) is from u to v, so undoing this will result in bad effects. It would be different if each blue arrow could be undone independently. That's what stratified indifference will attempt to do.

Stratified indifference

Stratified indifference is an attempt to deal with the problem above. I initially conceived of the idea of conditioning on various outcomes (such as the outcome of the lottery), but it wasn't clear what we could condition on. Instead, we can look at utility values.

Assumptions

Let ∅ be the default action, and model the AI as taking in observations followed by actions.

We'll assume that both u and v can only take a finite number of different values (or at least that the set of values is discrete). For each pair of values (α,β) there is a corresponding probability p_αβ, which is the probability that u is chosen, given action ∅ is chosen and ultimately u=α, v=β. It is possible that p_αβ is 0 or 1. Indeed, this could be quite common, if some (α,β) values can only be achieved by an AI focusing exclusively on u (or v) after the decision between u or v is made. Originally, I'd stratify this further, according to whether u or v was chosen (in that case, p_αβ would necessarily be 1 or 0), but that seems unnecessary.

Probability flow

Let's define what it means for probability to flow from certain values to others. If (u,v)=(α,β), given ∅, this implies certain background facts about the world and about how certain stochastic events are resolved. Then, given these background facts and an action ‘a’, we get a probability distribution P_αβ;a over the possible values of u and v.

Stratification

In order to stratify, we'd want the AI to assume that the probability of u being chosen versus v is the same under ∅ as under ‘a’. Fortunately, this doesn't need to involve counterfactuals or other complicated assumptions - because we already have the old probability of u under ∅ (namely p_αβ), AND the ultimate values of u and v.

Therefore the indifference-stratified expected utility under ‘a’, for the events flowing from (α,β), is

∑_α',β' ((p_αβ)α' + (1-p_αβ)β') P_αβ;a(u=α',v=β').

Here the term P_αβ;a(u=α',v=β') is the amount of probability that 'flows' from (α,β) (under ∅) to (α',β') (under a). And ((p_αβ)α' + (1-p_αβ)β') is the weighted expected utility at (α',β'), weighted according to the probability of u versus v under ∅.

To get the full stratified expected utility of action ‘a’, we simply take the weighted sum over the possible (α,β):

E’|a = ∑_α,β P(u=α,v=β|∅) [∑_α',β' ((p_αβ)α' + (1-p_αβ)β') P_αβ;a(u=α',v=β')].
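The double sum can be sketched as a short function. This is a minimal illustration under stated assumptions: the function name, the dictionary layout and the toy numbers are all hypothetical, and the value at each destination point is taken to be the probability-weighted average p·α' + (1-p)·β', with p being the default probability of u in the originating stratum.

```python
from fractions import Fraction as F

def stratified_expected_utility(default_dist, p_u, flow):
    """
    default_dist: {(alpha, beta): P(u=alpha, v=beta | default action)}
    p_u:          {(alpha, beta): probability that u is chosen in that stratum,
                   under the default action}
    flow:         {(alpha, beta): {(alpha2, beta2): probability mass that flows
                   from (alpha, beta) to (alpha2, beta2) under the action}}
    """
    total = F(0)
    for point, weight in default_dist.items():
        p = p_u[point]  # the default probability of u is kept fixed for this stratum
        inner = sum(
            (p * a2 + (1 - p) * b2) * mass
            for (a2, b2), mass in flow[point].items()
        )
        total += weight * inner
    return total

# Toy data, made up purely for illustration.
default_dist = {(F(0), F(1)): F(1, 2), (F(2), F(0)): F(1, 2)}
p_u = {(F(0), F(1)): F(1, 4), (F(2), F(0)): F(3, 4)}

# Under the default action every point flows to itself.
identity_flow = {point: {point: F(1)} for point in default_dist}
baseline = stratified_expected_utility(default_dist, p_u, identity_flow)

# A hypothetical action that moves half of the mass at (0, 1) over to (2, 0).
shifted_flow = dict(identity_flow)
shifted_flow[(F(0), F(1))] = {(F(0), F(1)): F(1, 2), (F(2), F(0)): F(1, 2)}
shifted = stratified_expected_utility(default_dist, p_u, shifted_flow)

print(baseline, shifted)
```

With the identity flow the function reduces to the ordinary expected utility under the default action, which is a useful sanity check. Note that mass arriving at (2, 0) from the (0, 1) stratum is still scored with that stratum's own probability of u (1/4), not with the probability attached to (2, 0).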

Examples

Applying stratification to the lottery

How would this work in the lottery example? In that case, there are three possible outcomes in utility-space: (0,0.5), (1,0.5), and (0,1). If the AI does nothing, their respective probabilities are 0.9, 0.05, and 0.05. The probability of u in each case is 0.5, 1 and 0.

Thus the AI's ultimate expected utility if u is chosen is 0.1; if v is chosen, it is 0.55, for a total expected utility of 0.325.

Now consider what happens under the action ‘a’ = "the AI forces u if and only if it wins the lottery". The (0,0.5) (lottery lost) and (1,0.5) (lottery won, u chosen) points flow to themselves. The (0,1) point (lottery won, v chosen), however, flows to (1,0.5). Thus:

E’|a = ∑_α,β P(u=α,v=β|∅) [∑_α',β' ((p_αβ)α' + (1-p_αβ)β') P_αβ;a(u=α',v=β')]
=0.9(0.5*0 + 0.5*0.5) + 0.05(1*1 + 0*0.5) + 0.05(0*1 + 1*0.5)
=0.3.

Since 0.3 < 0.325, the stratified indifferent agent will prefer to do nothing instead of action ‘a’.
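The comparison between 0.325 and 0.3 can be reproduced stratum by stratum. The sketch below hard-codes the three lottery strata and their deterministic destinations under ‘a’; the tuple layout and helper name are illustrative assumptions.

```python
from fractions import Fraction as F

# Each stratum: (probability under the default action,
#                probability of u within the stratum under the default action,
#                (u, v) point under the default action,
#                (u, v) point reached under action 'a')
strata = [
    (F(9, 10), F(1, 2), (F(0), F(1, 2)), (F(0), F(1, 2))),  # lottery lost
    (F(1, 20), F(1),    (F(1), F(1, 2)), (F(1), F(1, 2))),  # won, u chosen
    (F(1, 20), F(0),    (F(0), F(1)),    (F(1), F(1, 2))),  # won, v chosen; 'a' redirects to u
]

def weighted_value(p, point):
    """Value of a (u, v) point when u is weighted by p and v by 1 - p."""
    u_val, v_val = point
    return p * u_val + (1 - p) * v_val

# The probability of u is always the default one for the stratum,
# even when the action moves the stratum to a different (u, v) point.
default_value = sum(w * weighted_value(p, start) for w, p, start, _ in strata)
action_value = sum(w * weighted_value(p, end) for w, p, _, end in strata)

print(float(default_value), float(action_value))
```

Only the third stratum changes: it is still scored with probability 0 for u, so moving it from (0,1) to (1,0.5) halves its contribution instead of raising it.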

Same outcome, different stratified value

But what if we end up in the same situation, but for different reasons? Let (0.1, 0.55) be the default outcome under the default action ∅ (default expected utility 0.325, as before).

Now imagine that action ‘a’ would simultaneously change the utilities to (1,0.5) and the probability of u to 0.1.

Then the probability flow is trivial: (0.1,0.55) → (1,0.5). The change in probability is irrelevant for stratified indifference, so

E’|a = (0.5*1 + 0.5*0.5) = 0.75.

Thus in this simple case where the AI can change utilities directly, the stratified utility answer is to take action ‘a’.
