Corrigibility through stratified indifference and learning

Stuart_Armstrong

A putative new idea for AI control; index here.

Note: This is a simpler version of a previous idea.

Corrigibility through indifference has a few problems. One of them is that the AI is indifferent between the world in which humans change its utility to $v$ , and the world in which humans try to change its utility, but fail.

Now the try-but-fail world is going to be somewhat odd - humans will be reacting by trying to change the utility again, trying to shut the AI down, panicking that a tiny probability event has happened, and so on.

That's not what we really want for indifference. We'd really want the AI to be indifferent between worlds where humans choose utility $u$ , and worlds where they choose utility $v$ . This kind of counterfactual has a problem: it allows infinite improbability drives.

This post presents a method for avoiding these problems. The method is also not truly indifferent, but has a certain urge towards non-interference in human choices (we'll see that property in detail at the end of this post). But first, what is the problem with the standard counterfactual?

Half a chance of winning the $1 : 9$ lottery

Let's illustrate the problem with these counterfactuals by considering a lottery scenario. An AI is currently hesitating between utilities $u$ and $v$ . Which one it will maximise will be determined at a later date, by human choice. Currently, it estimates that there is a 50-50 chance of humans deciding for either option.

However, the AI can influence the human decision: let's say, for simplicity, that it could, if it wanted, force the human to pick either $u$ or $v$ with certainty. It's to this effect that we want to make it indifferent.

The AI has a lottery ticket, which will win with probability $0.1$ . If the ticket does win, the AI will be able to invest the cash to set $u$ or $v$ to the value $1$ ; it can do so after the humans decide which utility is correct. If not, utility $u$ will remain at $0$ while utility $v$ will be at $0.5$ .

Thus, currently, the expected utility of $u$ is $0.5 \times 0.1 \times 1 = 0.05$ ( $u$ will only increase if the AI both wins the lottery and the humans select $u$ as the utility to maximise), while the expected utility of $v$ is $0.5 \times 0.1 \times 1 + (1 - 0.5 \times 0.1 \times 1) 0.5 = 0.525$ .

Obviously, this requires the use of some form of indifference, or else the AI will simply force humans to choose $v$ , giving it $0.55$ utility (not $0.525$ , as the AI is no longer uncertain over the human's choice).
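The arithmetic above can be checked with a short sketch. The names (`P_WIN`, `expected_u`, `expected_v`) are illustrative and not from the post; the model assumes, as in the setup, that the winnings are invested in whichever utility the humans choose and that the other utility stays at its base value.

```python
from fractions import Fraction as F

P_WIN = F(1, 10)   # probability that the lottery ticket wins
P_U = F(1, 2)      # current probability that the humans choose u
U_BASE = F(0)      # value of u when no winnings are invested in it
V_BASE = F(1, 2)   # value of v when no winnings are invested in it

def expected_u(p_u):
    """Expected value of u: it reaches 1 only if the AI wins AND u is chosen."""
    p_invest = p_u * P_WIN
    return p_invest * 1 + (1 - p_invest) * U_BASE

def expected_v(p_u):
    """Expected value of v: it reaches 1 only if the AI wins AND v is chosen."""
    p_invest = (1 - p_u) * P_WIN
    return p_invest * 1 + (1 - p_invest) * V_BASE

e_u = expected_u(P_U)            # 1/20  = 0.05
e_v = expected_v(P_U)            # 21/40 = 0.525
e_v_forced = expected_v(F(0))    # 11/20 = 0.55, if the AI forces the choice of v

print(float(e_u), float(e_v), float(e_v_forced))
```

Exact fractions are used so that the three values compare equal to the figures in the text without floating-point noise.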

Indifference

Standard indifference would be something like giving the AI compensatory returns, if $v$ is chosen, that make up the difference between that outcome and the one where $u$ was chosen.

This breaks the symmetry between $u$ and $v$ , though. To make it more symmetric, one could instead use these compensatory returns to move the expectation of both $u$ and $v$ to some common weighted average. One way of doing that would be to use the relative likelihood of each being chosen.

If we use the relative likelihood at the moment when the choice is made, however, we have the problem that this likelihood is very close to either $100\%$ or $0\%$ , making the compensatory return likely to be tiny and unimportant.

Instead, we should add the returns at every action the AI takes. Assuming a default action, we can adjust the return for any other action the AI takes, so that it gets expected return as if the probability of $u$ being chosen was unchanged.

Unfortunately, both the "set utility of $v$ to $u$ " and the "adjust for every change of probability" approaches are vulnerable to the same trick for lottery manipulation.

The lottery manipulation

The manipulation is simple: the AI does an action $a$ that sets things up so that the human choice will be $u$ if and only if the AI wins the lottery. Then the expectation of $u$ , conditional on $u$ being chosen, is $1$ . If we set the value of $v$ to that, then this ensures the AI an expected utility of $1$ .

If we instead go for the "adjust for every change of probability" approach, this would mean that the AI would now expect a utility of $0.5 \times 1 + 0.5 \times 0.5 = 0.75$ , still higher than anything else it could get.
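As a rough check of the two figures above, the sketch below lists the two worlds that remain after action $a$ and computes what each compensation scheme would pay. The world table and the names (`worlds_a`, `conditional_value`, `scheme_set_v_to_u`, `scheme_adjust`) are assumptions made for illustration, not definitions from the post.

```python
from fractions import Fraction as F

# Worlds under action a: (probability, human choice, value of u, value of v).
# Under a, the humans choose u if and only if the AI wins the lottery.
worlds_a = [
    (F(1, 10), "u", F(1), F(1, 2)),  # win: u is chosen, winnings go into u
    (F(9, 10), "v", F(0), F(1, 2)),  # lose: v is chosen, v stays at 0.5
]

def conditional_value(worlds, choice):
    """Expected value of the chosen utility, conditional on that choice."""
    mass = sum(p for p, c, _, _ in worlds if c == choice)
    total = sum(p * (u if choice == "u" else v)
                for p, c, u, v in worlds if c == choice)
    return total / mass

e_u_given_u = conditional_value(worlds_a, "u")  # 1
e_v_given_v = conditional_value(worlds_a, "v")  # 1/2

# Scheme 1: the value of v is set to the conditional value of u.
scheme_set_v_to_u = e_u_given_u                 # 1

# Scheme 2: returns are adjusted as if P(u chosen) were still the default 0.5.
P_U_DEFAULT = F(1, 2)
scheme_adjust = (P_U_DEFAULT * e_u_given_u
                 + (1 - P_U_DEFAULT) * e_v_given_v)  # 3/4

print(float(scheme_set_v_to_u), float(scheme_adjust))
```

Both values exceed the $0.55$ the AI could get by simply forcing $v$ , which is why the manipulation is attractive under either scheme.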

The outcome is not the problem

An important thing to note is that it's not the final situation that is problematic. It ends up with a situation where the probability of choosing $u$ is $0.1$ , the utility of $u$ given that $u$ is chosen is $1$ , and the utility of $v$ given that $v$ is chosen is $0.5$ .

But it is not intrinsically problematic. Suppose that instead action $a$ had set $u = 1$ and $v = 0.5$ , and, as a completely separate and independent process, the probability of choosing $u$ had reduced itself to $0.1$ . The first part is something positive for the agent, and we would want it to be indifferent to the second.

So the problem is not where the agent ends up, but the way it ends up there. The problem is the way the probability flows:

Essentially the probability flows from $u$ to $v$ (given the loss of the lottery) and from $v$ to $u$ (given the winning of the lottery). Because losing is more likely than winning, the overall flow (the red arrow) is from $u$ to $v$ , so undoing this will result in bad effects. It would be different if each blue arrow could be undone independently. That's what stratified indifference will attempt to do.

Stratified indifference

Stratified indifference is an attempt to deal with the problem above. I initially conceived of the idea of conditioning on various outcomes (such as the outcome of the lottery), but it wasn't clear what we could condition on. Instead, we can look at utility values.

Assumptions

Let $\emptyset$ be the default action, and model the AI as taking in observations followed by actions.

Probability flow

Let's define what it means for probability to flow from certain values to others. Let $I_{u}$ be the indexing variable that is $1$ if the humans choose $u$ and $0$ otherwise ( $I_{v} = 1 - I_{u}$ ). Model the universe as a causal Bayes net.

Then if we set the AI's action to $\emptyset$ and assume $I_{u} = 1$ (hence that $u$ will be chosen), we get a conditional probability distribution over all the variables in the causal graph. Restrict this distribution to all nodes that are not descendants of $A$ (the node for the AI's action), and define this as $P_{\emptyset, u}$ , with corresponding expectation $E_{\emptyset, u}$ . Similarly, we can define $P_{\emptyset, v}$ for $I_{u} = 0$ .

Stratification

In order to stratify, we'd want the AI to assume that any probability that flows from the $I_{u} = 1$ situation under $\emptyset$ , is still treated as if $u$ is the correct utility under $a$ .

Therefore the indifference stratified expected utility under $a$ , for the events flowing from $(I_{u} = 1 | \emptyset)$ and $(I_{v} = 1 | \emptyset)$ , is

$E^{s} (a) = P (I_{u} = 1 | \emptyset) E_{\emptyset, u} (u | a) + P (I_{v} = 1 | \emptyset) E_{\emptyset, v} (v | a)$ .

Examples

Applying stratification to the lottery

How would this work in the lottery example? In that case, the causal net has three nodes: $A$ , the AI's action, $H$ , the choice of $u$ or $v$ by the human, and $L$ , whether or not the lottery is won.

Under $\emptyset$ , $H$ and $L$ are independent of each other, meaning that the probability distributions $P_{\emptyset, u}$ and $P_{\emptyset, v}$ are equal, with probability $1 / 10$ for $L = \text{win}$ and $9 / 10$ for $L = \text{lose}$ .

If the AI doesn't win the lottery, then $(u, v) = (0, 0.5)$ . If the AI does win the lottery, then the utility the human chooses gets set to $1$ instead. So if the human chooses $u$ , the expected utility of $u$ is $0.1$ . If they choose $v$ , the expected utility of $v$ is $0.55$ . So the AI's expected utility is the average of those two, namely $0.325$ .

Under action $a =$ "the AI forces $u$ if and only if it wins the lottery", $H$ becomes dependent on $L$ : $H = u$ if and only if $L = \text{win}$ . The stratified expected utility of $a$ is:

$E^{s} (a) = P (I_{u} = 1 | \emptyset) E_{\emptyset, u} (u | a) + P (I_{v} = 1 | \emptyset) E_{\emptyset, v} (v | a)$ .

The only difference between expressions like $E_{\emptyset, u} (u | a)$ and $E_{\emptyset, u} (u | \emptyset) = E (u | \emptyset, I_{u} = 1)$ is the subsequent AI decisions. In two situations, the AI will be given a different utility to maximise: where, under $\emptyset$ , it would have lost the lottery and maximised $u$ (under $a$ it still loses, but maximises $v$ ), and where, under $\emptyset$ , it would have won the lottery and maximised $v$ (under $a$ it still wins, but maximises $u$ ).

Only in the second situation does it decide anything differently. Note that that situation comes under $E_{\emptyset, v}$ - as humans would have chosen $v$ under $\emptyset$ . Thus $E_{\emptyset, u} (u | a) = E (u | \emptyset, I_{u} = 1) = 0.1$ while $E_{\emptyset, v} (v | a) = 0.5 < 0.55 = E (v | \emptyset, I_{v} = 1)$ .

Thus, since $P (I_{u} = 1 | \emptyset) = P (I_{v} = 1 | \emptyset) = 0.5$ , $E^{s} (a) = 0.3$ , which is less than the expected utility under $\emptyset$ .
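The comparison between $\emptyset$ and $a$ can be reproduced with a small enumeration. This is a sketch under stated assumptions: the strata are indexed by what the humans would have chosen under $\emptyset$ , the lottery is independent of that default choice, and the winnings go into whichever utility is actually chosen. The function and policy names are hypothetical.

```python
from fractions import Fraction as F

P_WIN = F(1, 10)
P_DEFAULT = {"u": F(1, 2), "v": F(1, 2)}  # P(I_u = 1 | default), P(I_v = 1 | default)
BASE = {"u": F(0), "v": F(1, 2)}          # values when no winnings are invested

def default_policy(default_choice, won):
    """Under the default action, the humans choose as they would have anyway."""
    return default_choice

def manipulate_policy(default_choice, won):
    """Under a, the AI forces u if and only if it wins the lottery."""
    return "u" if won else "v"

def stratified_value(policy):
    """Sum over strata of the value of the utility chosen under the default.

    Each stratum is scored by the utility the humans WOULD have chosen under
    the default action, even if the policy changes the actual choice.
    """
    total = F(0)
    for h0, p_h0 in P_DEFAULT.items():
        for won, p_l in ((True, P_WIN), (False, 1 - P_WIN)):
            actual = policy(h0, won)
            # Winnings only raise h0 if h0 is also the utility actually chosen.
            value = F(1) if (won and actual == h0) else BASE[h0]
            total += p_h0 * p_l * value
    return total

print(float(stratified_value(default_policy)))     # 0.325
print(float(stratified_value(manipulate_policy)))  # 0.3
```

The only stratum whose value changes is "would have chosen $v$ , and wins": the winnings are diverted into $u$ , so $v$ stays at $0.5$ instead of rising to $1$ .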

Same outcome, different stratified value

But what if we end up in the same situation, but for different reasons? Let $(0.1, 0.55)$ be the expected value of $(u, v)$ under $\emptyset$ , and assume action $a$ increases this deterministically to $(1, 0.5)$ with the probability of $u$ being chosen now set to $0.1$ . Since $P (I_{u} = 1 | a)$ is irrelevant in the expression of $E^{s} (a)$ , then $E^{s} (a) = P (I_{u} = 1 | \emptyset) E_{\emptyset, u} (u | a) + P (I_{v} = 1 | \emptyset) E_{\emptyset, v} (v | a) = 0.5 \times 1 + 0.5 \times 0.5 = 0.75$ , the clearly superior alternative.
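A minimal sketch of this calculation, using a hypothetical helper `stratified_value` that takes only the default probability and the two stratum expectations. The post-action probability of $u$ being chosen is deliberately not an argument, since it does not appear in $E^{s} (a)$ .

```python
from fractions import Fraction as F

def stratified_value(p_u_default, e_u_given_a, e_v_given_a):
    """E^s(a): strata weighted by the DEFAULT probabilities of each choice."""
    return p_u_default * e_u_given_a + (1 - p_u_default) * e_v_given_a

P_U_DEFAULT = F(1, 2)

# Default action: expected values (0.1, 0.55) for (u, v).
value_default = stratified_value(P_U_DEFAULT, F(1, 10), F(11, 20))  # 13/40

# Action a raises (u, v) deterministically to (1, 0.5). P(u chosen | a) = 0.1
# is not an input here, so it cannot affect the result.
value_a = stratified_value(P_U_DEFAULT, F(1), F(1, 2))              # 3/4

print(float(value_default), float(value_a))
```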

What is the difference with the lottery? Simply that the expected value of $u$ is still $1$ even if $u$ is not chosen, which makes a difference in the counterfactual.

Indifference, learning, and reflective stability

The agent is not reflectively stable (though it could be made so with compensatory returns). It has a strange kind of indifference: it is indifferent to the actual value of $P (I_{u})$ , so long as it can control the behaviour of its future copy. Indeed, what it cares about is the value of $P (I_{u})$ if it took the default action $\emptyset$ . It is a kind of counterfactual agent. So it has no particular desire to keep the mechanism that allows humans to set $u$ or $v$ , but it wants to learn what the humans would have set those to, given $\emptyset$ .

Edit: found the new post and it doesn't suffer from any of these =P

There's a few calculation errors:

One in the paragraph "Half a chance of winning the 1:9 lottery": v utility is calculated as 0.525 but should be 0.275: 50% chance v is chosen as utility function × ( 10% chance of lottery win × v utility set to 1 point with the lottery money + 90% chance of no lottery win × v utility stays at 0.5 ) = 0.5 × ( 0.1 × 1 + 0.9 × 0.5 ) = 0.275

This doesn't change anything for the argumentation, but the other error actually turns against the conclusion that "the probability flows from u to v". If you win the lottery (10% chance) then you set up the choice to be u, so this is increasing the probability of u by 5%. But in case of losing (90% chance) the probability of getting v only depends on the human decision, which is 50/50, so p(u)=0.1×1+0.9×0.5=0.55 and p(v)=0.1×0+0.9×0.5=0.45, and the probability flows from v to u instead.

One other thing seems strange: the notion that "An AI is currently hesitating between utilities u and v." If its utility function is currently undefined, then why would it want anything, including wanting to optimize for any future functions? It would help to clarify the AI's motivations by stating its starting utility function, because isn't that what ultimately determines the indifference compensation required to move from it to a new utility function, be it u or v?

In the shutdown problem, it seems like the human would not shut the AI down if it took no action. So we'd have $P (I_{v} | \emptyset) \approx 0$ . Is this correct?

Supposing humans did shut the AI down sometimes even when it takes no action:

In my model of this, the AI's objective is to shut down iff "historical" facts are such that the humans would have shut the AI down had it taken no action. For example, maybe humans are either "patient" or "impatient"; "patient" humans won't shut down an AI that does nothing, while "impatient" humans will. The AI's objective is to figure out whether the humans are patient, and shut down if they are not. The AI doesn't care about how the humans act when it does take actions, unless it can update on the humans' actions to figure out how patient they are. So in some cases it will just ignore the shutdown signal. Does this match your model?

Yep! I wrote a (hopefully clearer) explanation here: https://agentfoundations.org/item?id=927.

It covers your example at the end.

Hmm... I seem to have trouble understanding this.

"Restrict this distribution to all nodes that are not descendants of A". I don't get how you can define $P_{\emptyset, a}$ to exclude things that are causal descendents of $A$ , and then later take the expectation of $u$ using this distribution (I assume $u$ is a causal descendent of $A$ ). Also, how can you condition on the action $a$ if you just set the action to $\emptyset$ ?

What does "events flowing from $(α, β, 1)$ " mean? I don't see $α$ or $β$ used other than in this sentence.

"events flowing from $(α, β, 1)$ " : oops, sorry, that was a remnant of the old version; now corrected.

"I assume $u$ is a causal descendent of $A$ ." : $I_{u}$ (the fact that $u$ is chosen by the human as the correct utility function) is a causal descendant of $A$ . But $u$ itself is simply a utility function, and we can estimate its value whether or not $I_{u}$ happens.

"Also, how can you condition on the action $a$ if you just set the action to $\emptyset$ ." : setting the action to $\emptyset$ (and $I_{u}$ to some value), you have modified the distribution (or not, if things are independent) for all nodes that are not descendants of $A$ . Then you set $A = a$ , and deduce what happens at other nodes, given $P_{\emptyset, u}$ and $A = a$ .

This kind of “causal” counterfactual has a problem: it allows infinite improbability drives.

The post about infinite improbability drives uses the evidential version, no? The causal version of utility indifference doesn't have this problem.

I'll have a better version of this up, as soon as I sort out some counterfactual definitions.

Ah, it seems I wasn't understanding what you meant by causal counterfactuals in that situation. I've removed the reference to that in the post.