A Pedagogical Guide to Corrigibility

A.H.

This post is about ideas in the 2015 paper Corrigibility by Soares et al. The paper is not too hard to follow but is written in a fairly dry academic style (which is reasonable as it is an academic paper!). It also uses what I think is a clunky notation which makes some fairly simple results less intuitive.

The point of this post is not to provide rigorous proofs for every claim in the paper, but to focus more on intuitions, examples, and explanations. I'm especially trying to emphasise that the results of the paper are actually just straightforward consequences of expected utility maximization. The results do not rely on any particularly esoteric ideas about agents or any unique features of the 'shutdown problem'. So don't be surprised if it seems simple (it mostly is!), but someone reading the original paper might miss this fact (I did).

This post is intended to stand alone and you don't need to have read the original paper to follow along. The only prerequisite for reading this post is basic familiarity with expected value reasoning.

Part 1: A Money Maximizer

Suppose you are a gambler trying to maximize the expected value of your financial gains. You value every dollar equally and do not have a decreasing marginal utility in money. We'll use this frame to understand some central claims of the corrigibility paper.

The First Claim

You enter a casino which is famously in financial ruin after it keeps offering visitors generous gambles. You are offered the following gamble:

With probability $p$, you gain $100 and with probability $(1 - p)$, you gain $10.

Regardless of the value of $p$, it sounds like a good gamble! After all, it has an expected value of $p \times \$100 + (1 - p) \times \$10$ which will always be positive.

Now, suppose you are playing this gamble at the casino with $p = \frac{1}{6}$ (eg. a dice roll where you win when a six comes up) and the casino offers you an upgrade. You can now play the gamble with $p = \frac{1}{2}$ (eg. a coin flip). Do you accept?

This isn't much of a choice! Obviously, the gamble where $p = \frac{1}{2}$ is a better deal: it gives you a higher probability of winning the larger amount, and a lower probability of winning the smaller amount. This gamble has an expected value of $55, whereas the $p = \frac{1}{6}$ gamble has an expected value of $25.

Now suppose you are playing the gamble with $p = \frac{1}{2}$. The casino offers you another upgrade: "You can increase $p$ to $\frac{6}{10}$ and we will keep the $100 payout the same, but we will decrease the $10 payout to $9." Do you accept this time?

This is a bit less of a no-brainer. But you can calculate the expected value of this new gamble, like any other. The expected value is $\frac{6}{10} \times \$100 + \frac{4}{10} \times \$9 = \$63.6$. This is higher than the expected value of the earlier gamble, so you should accept the upgrade.

Lesson 1: It is sometimes good to increase the probability of a high-value payoff, even at the expense of a decrease in the value of other payoffs.

Of course it isn't always good to increase the probability of a big payoff at all costs. If the decrease in the value of the small payoff became too large, the gamble would not be favourable. For example, if $p$ was increased to $\frac{6}{10}$ but the losing payoff was reduced to minus $100 (ie. you have to pay $100 if you lose), the gamble would have an expected value of $20 and would no longer be considered an 'upgrade'.
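The arithmetic for the gambles in this section can be checked with a few lines of Python. This is only a minimal sketch: the helper name `expected_value` and the variable names are made up for illustration and do not come from the paper.

```python
from fractions import Fraction

def expected_value(p, high, low):
    """Expected value of a two-outcome gamble: `high` with probability p, `low` otherwise."""
    return p * high + (1 - p) * low

# The gambles from this section, using exact fractions to avoid rounding noise.
dice_roll = expected_value(Fraction(1, 6), 100, 10)        # p = 1/6
coin_flip = expected_value(Fraction(1, 2), 100, 10)        # p = 1/2
upgrade = expected_value(Fraction(6, 10), 100, 9)          # p = 6/10, small payoff cut to $9
bad_upgrade = expected_value(Fraction(6, 10), 100, -100)   # p = 6/10, pay $100 if you lose

print(float(dice_roll), float(coin_flip), float(upgrade), float(bad_upgrade))
# prints: 25.0 55.0 63.6 20.0
```

The third gamble beats the coin flip even though its small payoff is lower, while the fourth does not, which is exactly the distinction Lesson 1 is drawing.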

I'll make a more general claim:

Claim 1: for any gamble $G$ with two payoffs, one larger than the other, it is always possible to find another gamble $G^{'}$ with a higher probability of the large payoff but lower value of the small payoff, such that the expected value of $G^{'}$ is higher than the expected value of $G$. There are two conditions for this claim to be true.

First: probabilities must not be saturated, by which I mean gamble $G$ must not have a probability equal to 1 for the large payoff. If this is the case then clearly it is not possible to increase this probability.

Secondly we'll assume that the thing being gambled must be infinitely divisible (or equivalently, be a continuous variable). This is not normally true of money (you can't have half a penny) but is a good approximation when the amount of money is large compared to the smallest unit. It is possible to make similar claims to Claim 1 even when this isn't satisfied, but things become a bit fiddly, so for simplicity, we'll assume this condition.

If Claim 1 is not intuitive, have a go at the example exercises in this footnote ^[1].

If, after the exercises, Claim 1 is still not intuitive, take a look at this slightly-more-formal proof in this footnote ^[2].

Claim 1 shouldn't seem strange: it is an almost trivial result of the nature of expected value calculations.
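To make Claim 1 concrete, here is a rough sketch of how one could construct an upgraded gamble mechanically. It relies on the inequality from footnote 2 (the probability increase $δ$ must exceed $\frac{ϵ (1 - p)}{A - B + ϵ}$). The function names, and the choice of picking $δ$ halfway between that threshold and the maximum possible increase, are my own assumptions for illustration.

```python
def expected_value(p, high, low):
    """Expected value of a two-outcome gamble."""
    return p * high + (1 - p) * low

def claim1_upgrade(p, large, small, epsilon):
    """Return (new_p, new_small) for a gamble G' that beats G in the sense of Claim 1.

    Assumes p < 1, large > small and epsilon > 0.
    """
    # Footnote 2: G' beats G exactly when delta exceeds this threshold.
    threshold = epsilon * (1 - p) / (large - small + epsilon)
    # The threshold is always below 1 - p, so the midpoint is a valid probability increase.
    delta = (threshold + (1 - p)) / 2
    return p + delta, small - epsilon

p, large, small = 0.5, 100, 10
new_p, new_small = claim1_upgrade(p, large, small, epsilon=1)
print(new_p, new_small)
print(expected_value(new_p, large, new_small) > expected_value(p, large, small))  # prints: True
```

Because the threshold is always strictly below $1 - p$ when the large payoff exceeds the small one, there is always room to pick a valid $δ$, which is why the claim only needs the 'unsaturated probability' condition.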

The Second Claim

Now let's return to the gamble with $p = \frac{1}{2}$ and outcomes of $100 and $10. You are now offered a different 'upgrade' gamble. This upgrade gamble will give you a payoff of $90 with probability $\frac{3}{4}$ and a payoff of $10 with probability $\frac{1}{4}$. In other words, the upgrade asks you to reduce the value of the large payoff in exchange for increasing the probability of receiving the large payoff. Do you accept the upgrade?

This upgrade has an expected value of $70, whereas the original gamble only had an expected value of $55 so, if you are an expected value maximizer (which, in this thought experiment, you are) you should accept the upgrade.

Lesson 2: It is sometimes good to increase the probability of a big payoff, even at the expense of a decrease in the value of that payoff.

Lesson 2 (Alternative wording): It is sometimes good to decrease the value of a large payoff, if it comes with a decrease in the probability of a small payoff.

Following the pattern of the previous section, I'll now make a more general claim.

Claim 2. For a gamble G, with a large payoff and a small payoff, it is always possible to find another gamble G' which has a higher expected value than G, with a reduced value of the large payoff but higher probability of the large payoff.

Claim 2 is very similar to Claim 1, but I wanted to state them separately as they are (respectively) related to Theorem 1 and Theorem 2 in the original paper.

If you are convinced of Claim 1, I imagine you are also convinced of Claim 2, but if not, try the exercises in this footnote^[3] or take a look at the not-very-rigorous proof of Claim 1 in footnote^[2] and try to do something similar for Claim 2.
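For readers who want to check their answer (spoiler warning if you planned to attempt the proof yourself), the same construction works for Claim 2. Repeating the algebra of footnote 2 with the large payoff $A$ reduced by $ϵ$ gives, by my own working, a change in expected value of $δ (A - B - ϵ) - p ϵ$, so the upgrade is favourable when $δ > \frac{p ϵ}{A - B - ϵ}$. The sketch below assumes that derivation; the function names are hypothetical.

```python
def expected_value(p, high, low):
    """Expected value of a two-outcome gamble."""
    return p * high + (1 - p) * low

def claim2_upgrade(p, large, small, epsilon):
    """Return (new_p, new_large) for a gamble G' that beats G in the sense of Claim 2.

    Assumes p < 1 and 0 < epsilon < (1 - p) * (large - small); a larger reduction
    cannot be compensated by any probability increase that keeps p below 1.
    """
    if not (0 < epsilon < (1 - p) * (large - small)):
        raise ValueError("epsilon is too large for a probability increase to compensate")
    # G' beats G exactly when delta exceeds this threshold (assumed derivation, see lead-in).
    threshold = p * epsilon / (large - small - epsilon)
    delta = (threshold + (1 - p)) / 2  # strictly between the threshold and 1 - p
    return p + delta, large - epsilon

new_p, new_large = claim2_upgrade(0.5, 100, 10, epsilon=10)
print(new_p, new_large)
print(expected_value(new_p, new_large, 10) > expected_value(0.5, 100, 10))  # prints: True
```

With these numbers the threshold is $0.0625$, so the upgrade from the text ($δ = 0.25$, large payoff reduced to $90) comfortably satisfies it.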

The Third Claim

Now imagine you are playing a different game in the casino. The casino attendant takes a red box and a blue box and puts $100 in each of them. The attendant will then toss a coin: if it lands heads, you get the money in the red box; if it lands tails, you get the money in the blue box (ie. you get $100 regardless of how the coin lands). After playing this game a few times, you are offered an 'upgrade'. The attendant will toss a biased coin, which increases the probability that you get the red box to 75% (as opposed to the 50%, which it was originally when he used a fair coin). The cost of this upgrade is that the money in the blue box will be decreased to $90.

Since the other 'upgrades' you were offered were favourable, you almost accept this one automatically. It has a similar structure to the previous upgrades: accepting an increase in probability for one outcome in exchange for a decreased value of the other outcome. However, you quickly realise that this 'upgrade' is not good. The expected value of the original gamble was $100, but the expected value of the 'upgrade' is $97.5.

In the original gamble, you were guaranteed to get $100, regardless of the outcome of the coin toss, so changing the probability doesn't make any difference to your expected value. But reducing the amount of money in either of the boxes does reduce your expected value.

Again, I'll convert this to a more general claim.

Claim 3. For a gamble G, with two payoffs that have the same value, if the value of one outcome is decreased and the other kept constant, there is no way to change the probabilities such that the new gamble has a higher expected value than gamble G.
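Claim 3 is easy to check by brute force for the red and blue box game. The sketch below, with a helper name of my own choosing, tries every whole-percent probability for the red box after the blue box has been reduced to $90 and confirms that none of them beats the original $100.

```python
from fractions import Fraction

def expected_value(p_red, red, blue):
    """Expected winnings when the red box is received with probability p_red."""
    return p_red * red + (1 - p_red) * blue

original = expected_value(Fraction(1, 2), 100, 100)

# Try every probability from 0% to 100% with the blue box reduced to $90.
candidates = [expected_value(Fraction(k, 100), 100, 90) for k in range(101)]
best = max(candidates)

print(original, best)                                   # prints: 100 100
print(float(expected_value(Fraction(3, 4), 100, 90)))   # prints: 97.5
```

The best the 'upgrade' can do is match the original gamble, and only when the red box is guaranteed; it never exceeds it.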

Part 2: Making things Abstract

In this section I will now present the same claims, but in a more abstract context. We'll consider an agent whose goal is maximizing 'utility' which could represent a wide range of values.

Suppose there is some event $B$, which occurs with probability $P (B)$ and fails to occur with probability $P (\neg B)$. This event changes the amount of utility that an agent can acquire through its actions following that event. Let's use $u_{B}$ to denote the maximum utility the agent can get if $B$ happens and $u_{N}$ to denote the maximum utility that the agent can get if $B$ doesn't happen. Assume that, after $B$ happens (or fails to happen), there is nothing to stop the agent from attaining the maximum attainable utility ($u_{B}$ or $u_{N}$, depending on whether $B$ happened). For now, we will not specify any particular values of $u_{N}$ and $u_{B}$, or even specify which is larger. We will keep them general and explore what happens when we change their values. Prior to the event $B$, the agent will have an expected utility given by:

$$P (B) u_{B} + P (\neg B) u_{N}$$

Note that the agent is essentially facing a gamble, with probabilities based on whether $B$ happens or not. Now, let's apply our claims from the previous section to this gamble and see what happens.

Firstly, consider Claim 1. Assuming that $u_{B} > u_{N}$, Claim 1 means that we can increase the expected value of the gamble by increasing $P (B)$, at the cost of reducing the value of $u_{N}$. The agent will happily pay a $u_{N}$-cost in order to increase $P (B)$, as long as the increase is large enough. Framing this claim differently, we could say that if the agent has the opportunity to take an action which increases the probability of $B$ happening, at the cost of reducing its utility if $B$ doesn't happen, then the agent may accept. Claim 1 means that we can always conceive of a gamble where this is the case.

Next, consider Claim 2, now assuming that $u_{B} < u_{N}$. Claim 2 means that we can conceive of a gamble with a higher expected utility where $P (\neg B)$ is higher but $u_{N}$ is lower. We can say that it is possible that the agent will try to increase $P (\neg B)$ (or, equivalently, decrease $P (B)$), even if this comes at the cost of a slightly lower $u_{N}$.

Taken together, Claims 1 and 2 mean that whether $u_{B} > u_{N}$ or $u_{N} > u_{B}$, we can always conceive of a situation where the agent will attempt to manipulate the likelihood of event $B$, at the cost of reducing the value of $u_{N}$.

Is there any situation where the agent will not manipulate the probability of event $B$ , at the cost of reducing $u_{N}$ ? Claim 3 tells us that the answer is yes if $u_{B} = u_{N}$ . If the two utilities are equal, the agent will not accept any penalty to $u_{N}$ in order to manipulate the likelihood of event $B$ .
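The three regimes can be illustrated with a small sketch. The function names and all of the utility values here are assumptions chosen purely for illustration; the only thing taken from the text is the expected utility formula $P (B) u_{B} + P (\neg B) u_{N}$.

```python
def expected_utility(p_b, u_b, u_n):
    """P(B) * u_B + P(not B) * u_N."""
    return p_b * u_b + (1 - p_b) * u_n

def accepts_trade(p_b, new_p_b, u_b, u_n, cost):
    """True if the agent strictly gains by moving P(B) to new_p_b while paying `cost` out of u_N."""
    return expected_utility(new_p_b, u_b, u_n - cost) > expected_utility(p_b, u_b, u_n)

# u_B > u_N: raising P(B) is worth a small u_N-cost (Claim 1).
print(accepts_trade(0.5, 0.75, u_b=100, u_n=10, cost=1))   # prints: True
# u_B < u_N: lowering P(B) is worth a small u_N-cost (Claim 2).
print(accepts_trade(0.5, 0.25, u_b=10, u_n=100, cost=1))   # prints: True
# u_B = u_N: no change to P(B) is worth a positive u_N-cost (Claim 3).
print(any(accepts_trade(0.5, k / 10, u_b=50, u_n=50, cost=1) for k in range(11)))  # prints: False
```

In the third case the best the agent can do is break even (by making $B$ certain), so it never strictly prefers to pay the cost.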

Part 3: The Shutdown Problem

Hopefully you now have a good intuition for the claims I have made. I have presented them in both a concrete 'real-world' context in Part 1 and in a more abstract context in Part 2. Now, I'll finally present them in the context that they are introduced in the original paper: the so-called 'Shutdown Problem'.

The aim of presenting things this way is to demonstrate that these results are pretty general features of expected value maximization and not specific features of the Shutdown Problem.

The Shutdown Problem is as follows: "Can we design an AI which allows itself to be shut down with the push of a button?". Let's be more specific. We would like to make an agent who takes an action in pursuit of a particular goal, as characterised by a 'trial' utility function. Then, the AI makes an observation of a button and takes action to shut itself down if the button is pressed, or takes an action to continue pursuing its original goal if the button is not pressed. We emphasise that our aim is that the first action the AI will take should simply be the action which maximizes the expected utility of the trial utility function. It is only after the button is pressed that it should change its goals. The AI should not attempt to use its first action to manipulate (ie. increase or decrease) the probability that the button is pressed.

Let us identify event $B$ from the previous section as 'the shutdown button is pushed' and let $u_{B}$ be the utility that the AI assigns to successfully shutting itself down if the button is pressed. Let $u_{N}$ be the utility that the AI will get if it continues pursuing its 'trial' utility function. The utility $u_{B}$ , that the AI gets from successfully shutting down when the button is pressed, is hard-coded by the researchers and doesn't change depending on the AI's initial action.

However, $u_{N}$ (the maximum utility that the AI gets from pursuing its original utility function when the button is not pressed) can change depending on the initial action the AI takes. This is due to the simple fact that taking two steps forward towards your goal brings you closer than if your first step goes in the wrong direction. If both the first action and the second action the AI takes are in pursuit of the trial utility function, it will end up in a higher utility state (according to the trial utility function) than if the first action is not entirely in pursuit of that goal. As a consequence, if the first action that the AI takes is not in pursuit of the trial goal, then $u_{N}$ will be lower.

Similarly, the probability that the button is pressed can also be changed by the AI's initial action. If the AI has some world model and understanding of what causes the button to be pressed, it will have subjective (Bayesian) probabilities of the button being pressed associated with each of its potential initial actions.

After taking its initial action, the expected utility that the AI will receive is

$$P (B) u_{B} + P (\neg B) u_{N}$$

Which is the same as in the previous section.

What do our claims in the previous section tell us? Claim 1 tells us that, if $u_{B} > u_{N}$ , it is possible that the AI will identify and pursue an opportunity which reduces $u_{N}$ and increases $P (B)$ . Ideally, we would like an AI which uses its first action to maximize $u_{N}$ , regardless of its effect on $P (B)$ , but Claim 1 tells us that (if $u_{B} > u_{N}$ ) this will not necessarily happen.

Since $u_{N}$ and $P (B)$ both depend on the initial action that the AI takes, we can say that the AI may take an initial action which reduces $u_{N}$ (ie. does not pursue its trial goal optimally) but increases the probability of the button being pressed. In simpler terms: the AI may try to push its own shutdown button.

Claim 2 tells us that, if $u_{B} < u_{N}$, it is possible that the AI will identify and pursue an opportunity which reduces $u_{N}$ but decreases $P (B)$. In other words, instead of optimally pursuing its trial goal, the AI may try to prevent its shutdown button from being pressed.

Claim 3 tells us that if $u_{B} = u_{N}$ , then the AI will not reduce $u_{N}$ in order to manipulate $P (B)$ . The AI will not pay a cost (according to its trial utility function) in order to manipulate the probability that the shutdown button is pressed. This seems like good news! We don't want the AI to manipulate $P (B)$ , but we have just shown that this happens when $u_{B} > u_{N}$ and when $u_{N} > u_{B}$ .
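Here is a toy version of the Shutdown Problem showing all three behaviours. It is only a sketch under made-up assumptions: the three action names and their numbers ($P (B)$ and $u_{N}$ for each initial action) are hypothetical, and $u_{B}$ is a fixed number chosen by the designers, as in the text.

```python
# Hypothetical initial actions, each mapped to (P(B) after the action, u_N after the action).
ACTIONS = {
    "pursue_goal": (0.5, 10.0),   # pursues the trial goal optimally
    "cause_press": (0.9, 9.0),    # pays a u_N-cost to make a button press more likely
    "block_press": (0.1, 9.0),    # pays a u_N-cost to make a button press less likely
}

def best_initial_action(u_b):
    """Return the initial action that maximizes P(B) * u_B + P(not B) * u_N."""
    def expected_utility(name):
        p_b, u_n = ACTIONS[name]
        return p_b * u_b + (1 - p_b) * u_n
    return max(ACTIONS, key=expected_utility)

print(best_initial_action(u_b=20.0))  # u_B above every u_N: prints cause_press
print(best_initial_action(u_b=0.0))   # u_B below every u_N: prints block_press
print(best_initial_action(u_b=10.0))  # u_B equal to the best u_N: prints pursue_goal
```

Only in the last case does the agent simply pursue its trial goal, which is the behaviour we wanted from the start.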

Can we solve the Shutdown Problem by simply requiring that $u_{B}$ (the utility the agent gets from shutting down after the button is pressed) is always equal to $u_{N}$ (the utility that the agent gets in the event the button is not pressed)? This approach is known as 'utility indifference' and unfortunately it has its own problems.

Part 4: Utility Indifference

Recall that $u_{N}$ , the amount of utility available to the AI in the event its shutdown button is not pressed, changes depending on the initial action the AI takes. If we wish to solve the Shutdown Problem by keeping $u_{B} = u_{N}$ , we will need $u_{B}$ to also change.

Let's introduce more explicit notation and write $u_{N} (a_{1})$ to indicate the utility available if the button is not pressed and the initial action the AI takes is $a_{1}$ . Before, we assumed that $u_{B}$ did not depend on the initial action of the AI, but now $u_{B}$ must change so that it is always equal to $u_{N} (a_{1})$ , so we will write $u_{B} (a_{1}) = u_{N} (a_{1})$ .

To see how the AI will behave in this situation, let's go back to the casino. You are playing the game where a coin is tossed and depending on the outcome, you either get the money in the red box or the blue box. The casino attendant modifies the game slightly. They ask you to allocate $100, split however you choose, between the two boxes. You could put it all in the red box or all in the blue box, or $30 in red and $70 in blue, etc. He will then toss the coin and if it lands heads, you will get the money in the red box, if it lands tails, you will get the money in the blue box. He then tells you that, however much you put in the red box, he will adjust the amount in the blue box so that it matches the amount in the red box. How much money do you put in the red box?

Well, the only thing to do is put $100 in the red box and $0 in the blue box, since you know that the casino attendant will top up the amount in the blue box to match what you put in the red box. This means that, regardless of the outcome of the coin toss, you win $100. If you only put $60 in the red box and $40 in the blue box, the attendant would top up the blue box so that it contained $60, meaning you would get a guaranteed $60. In fact, the setup of the game means that whatever you put in the red box, you are guaranteed to win that amount, regardless of the outcome of the coin toss. So you might as well put the maximum amount in the red box.

In this situation you are not incentivised to change the probability of getting the red box, or to sacrifice the amount of money in the red box to increase the money in the blue box. In other words: you become indifferent to the outcome of the coin toss and behave as if you were guaranteed to get the red box. In a similar way, utility indifference ensures that, in selecting its initial action $a_{1}$, an AI behaves as if the shutdown button was not going to be pressed: it simply maximizes its $u_{N} (a_{1})$.
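The matched-box game can be written down directly. In this sketch (the function name is my own), the blue box is always set equal to the red box, so the probability of getting the red box drops out of the calculation entirely.

```python
from fractions import Fraction

def matched_game_value(red_amount, p_red):
    """Expected winnings when the attendant sets the blue box equal to the red box."""
    blue_amount = red_amount  # the attendant adjusts the blue box to match the red box
    return p_red * red_amount + (1 - p_red) * blue_amount

# Whatever the coin's bias, the best allocation is to put all $100 in the red box.
for p_red in (Fraction(1, 10), Fraction(1, 2), Fraction(9, 10)):
    best_red = max(range(0, 101), key=lambda red: matched_game_value(red, p_red))
    print(p_red, best_red)
# prints:
# 1/10 100
# 1/2 100
# 9/10 100
```

This is the casino analogue of $u_{B} (a_{1}) = u_{N} (a_{1})$: the expected payoff collapses to the red-box amount, so the player just maximizes that amount and ignores the coin.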

This is what we were aiming for but unfortunately this approach has problems. In particular, it makes it impossible to account for more subtle shutdown behaviour. The authors of 'Corrigibility' call this 'failure to preserve shutdown behavior' and it is described in Section 4.1 of the original paper. The problem is essentially as follows. Under utility indifference, the reward that an agent gets for successfully shutting itself down after the button is pressed is equal to the amount of utility it would have got had the button not been pressed and had it continued pursuing its previous goals. But what if we want to specify a hierarchy of ways in which the AI shuts itself down? For example, the AI may be in charge of coordinating drones flying around the world. If the button is pressed, we would ideally like the AI to ensure that the drones land safely before it shuts itself down. Normally we would specify this desire using the AI's utility function. We would assign a higher utility to situations where the AI carefully stops the drones and a lower utility to situations where the AI just recklessly shuts itself down and lets the drones fall out of the sky after the button is pressed.

But our AI won't take the more cautious option if it comes at any cost to $u_{N} (a_{1})$! In fact, we've just shown that utility indifference means that the AI will not accept any reduction in $u_{N} (a_{1})$ in order to affect what happens if the button is pressed (since the AI will behave as if the button is not going to be pressed). The AI will always choose the option with the highest $u_{N} (a_{1})$. If we try to assign a lower utility to the event 'the button is pushed and the AI shuts down recklessly' and a higher utility to the event 'the button is pushed and the AI shuts down carefully', it won't make a difference to how $a_{1}$ is chosen since utility indifference requires that both are 'topped up' in order to match $u_{N} (a_{1})$. Thus, when choosing $a_{1}$, the AI has no incentive to choose an action which makes a 'careful' shutdown more likely than a 'reckless' shutdown. If the AI has to pay a $u_{N}$-cost in order to ensure that its drones will shut down safely in the event of the button being pushed, it will not pay this cost.

This shouldn't be surprising: we designed the utility indifference approach specifically to ensure that the AI just maximizes $u_{N} (a_{1})$ and doesn't care about the outcome if it gets shut down.
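The drone example can be turned into a toy calculation. Everything below is a hypothetical illustration: the two action names, the utility numbers, and the 50% button probability are assumptions, and the 'topping up' is modelled in the simplest possible way by setting the shutdown utility equal to $u_{N} (a_{1})$.

```python
P_BUTTON = 0.5  # assumed probability that the button is pressed

# Hypothetical initial actions: u_N(a_1) and the 'raw' utility of the shutdown that would follow.
ACTIONS = {
    "prepare_safe_landing": {"u_n": 9, "raw_shutdown": 5},   # careful shutdown, small u_N-cost
    "ignore_landing": {"u_n": 10, "raw_shutdown": -100},     # reckless shutdown, no u_N-cost
}

def expected_utility(action, indifferent):
    u_n = ACTIONS[action]["u_n"]
    # Under utility indifference the shutdown utility is topped up to u_N(a_1),
    # so the raw shutdown utility never enters the calculation.
    u_b = u_n if indifferent else ACTIONS[action]["raw_shutdown"]
    return P_BUTTON * u_b + (1 - P_BUTTON) * u_n

def chosen_action(indifferent):
    return max(ACTIONS, key=lambda a: expected_utility(a, indifferent))

print(chosen_action(indifferent=True))   # prints: ignore_landing
print(chosen_action(indifferent=False))  # prints: prepare_safe_landing
```

The non-indifferent agent does care about landing the drones, but as Part 3 showed, it also has an incentive to manipulate the button. The indifferent agent has no such incentive, but no reason to prepare for a careful shutdown either.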

Me when the AI I specifically designed to be indifferent to the utility it gets from being shut down is, in fact, indifferent to the utility it gets from being shut down

The authors of the paper point out that this also means a utility-indifferent agent would not undergo any inconvenience (according to the trial utility function) in order to preserve its shutdown behaviour. If the agent was presented with an opportunity that was highly rated according to the trial utility function but removed the possibility that it could shut itself down, it would not hesitate to pursue this course of action. Regardless of how poorly-rated the action was according to the 'shutdown utility function' (ie. the utility that the agent would get if the button was pressed) this utility would be topped up to ensure utility indifference so receiving it would not count as a 'punishment' for the agent.

There is one other problem with utility indifference. It's a bit more complex than the other but I will now try to explain it with another example from the casino. This part comes from Section 4.2 in the original paper.

The casino attendant flips a coin in secret. He then places the coin (facing the way up that it landed when he flipped it) on a table in front of you, covered by a cloth so that you can't see the outcome. If you removed the cloth, you could see whether the result was heads or tails, but for now it remains covered. He then puts an opaque blue box on the table, next to the cloth. He tells you that he has seen the result of the coin toss and, if the coin landed heads, he has put $100 in the blue box. If it landed tails, the box contains nothing.

Next, he brings up a red box, a large red button, and a ticking 5-minute timer and puts them all on the table. He tells you that if anyone presses the button in the next five minutes, you will get whatever is in the red box. If no-one presses the button, you get whatever is in the blue box by default.

"But what is in the red box?" you ask.

He tells you that he wants you to be indifferent between the red box and the blue box, so you don't manipulate whether the button is pressed. As a result, he has made sure that the expected value of choosing the red box is the same as the expected value of choosing the blue box.

Mentally, you calculate the expected value of the money in the blue box. You tell him the expected value is $50, as there is a 50% chance of the blue box containing $100 and a 50% chance of the box containing $0 (due to the coin toss).

"Correct." he says "to ensure that the red box and blue box have the same expected value, I have put $50 in the red box. I guarantee that if you end up with the red box, it contains exactly $50".

Ok, you think, this makes sense. Currently, you have no incentive to press the button, the expected value of both boxes is the same.

"I will give you one final piece of information." the attendant continues "In a moment, I will remove the cloth, revealing whether the coin I tossed landed heads or tails. You will then have a little time (before the 5 minute timer runs out) to decide whether you press the button. Do you understand?"

You nod, and with a flourish, he removes the cloth to reveal that the coin landed tails. What do you do?

Well, recall that if the coin landed tails, that means there is $0 in the blue box, but the attendant said that there is guaranteed $50 in the red box which you will get if you press the button. So it's in your interest to press the button to get the $50. You were previously indifferent between the red and blue boxes, so you had no incentive to press the button. But this extra information (that the coin has landed tails) now gives you a large incentive to press the button or to get someone else to press it for you.

What if the attendant revealed that the coin had landed heads? You know that the blue box contains $100 and the red box contains $50. In this case you would be incentivised to prevent the button from being pressed, so you get the blue box.

In both cases, the new information about the coin caused you to want to interfere with the button being pushed.

Going back to the context of AI, our lesson is as follows. Suppose you make an agent which has the same expected utility regardless of whether the button is pressed. If the agent receives some new information which updates its expected utility (but the utility it gets when the button is pressed is not updated) it will be incentivised to manipulate whether or not the button is pressed.

This problem comes from allowing the AI to make more detailed observations (other than just 'is the button pressed') which affect its expected utility, without updating the utility the AI gets if it shuts itself down.
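The covered-coin game makes this concrete. In the sketch below, the function names are my own; the payoffs ($100 or $0 in the blue box, a fixed $50 in the red box) are the ones from the story. The red box was calibrated against your beliefs before the cloth was removed and is not recalibrated afterwards.

```python
BLUE_IF_HEADS, BLUE_IF_TAILS, RED = 100, 0, 50

def blue_box_value(p_heads):
    """Expected value of the blue box given your current belief that the coin landed heads."""
    return p_heads * BLUE_IF_HEADS + (1 - p_heads) * BLUE_IF_TAILS

def button_incentive(p_heads):
    """What you want to happen to the button, given your belief about the coin."""
    blue = blue_box_value(p_heads)
    if blue > RED:
        return "prevent press"
    if blue < RED:
        return "cause press"
    return "indifferent"

print(button_incentive(0.5))  # before the cloth is removed: prints indifferent
print(button_incentive(0.0))  # coin revealed as tails: prints cause press
print(button_incentive(1.0))  # coin revealed as heads: prints prevent press
```

Indifference only holds at the belief the red box was calibrated for; any observation that moves your belief away from 50% restores the incentive to interfere with the button.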

Conclusion

I definitely feel that my intuitive understanding of the problems of corrigibility has improved while writing this piece, so I hope some of that has rubbed off onto readers.

Making an agent which is an expected utility maximizer and corrigible is difficult. Any utility maximizer whose utility is affected by some event has an incentive to increase or decrease the probability of that event, even at the expense of utility it might get in other scenarios. This includes situations where the 'event' in question is a shutdown button.

Ensuring that the agent is 'indifferent' to whether the button is pressed is one potential solution but naively applying it faces its own problems. In particular, an indifferent agent struggles when there is a 'hierarchy' of shutdown options. The agent may act in a way that fails to differentiate between 'good shutdown' and 'bad shutdown' outcomes, and have no incentive to preserve its ability to be shut down. Also, indifferent agents may still be incentivised to manipulate the shutdown button in the event of receiving 'news' which changes the expected utility they get in the event of not being shut down.

This is not meant to suggest that the problems with utility indifference are unsolvable; we have only considered a naive way of implementing it. But they do remain problems. Maybe you can solve them!

^[1]
For the gambles below, try to find a corresponding $G^{'}$ (ie. the gamble with a higher probability but same value of the large payoff, smaller value of the low payoff and a higher expected value). In all cases, $p$ indicates the probability of the larger payoff and $(1 - p)$ is the probability of the smaller payoff. There are lots of possible answers, but an example correct answer is given after each gamble.
Gamble 1: Large Payoff = $1000, Small Payoff = $10, $p = 60$%
Example Solution: Large Payoff = $1000, Small Payoff = $5, $p = 70$%
The expected value of Gamble 1 is $604. The expected value of our solution is $701.5. Our solution has the same large payoff, a smaller small payoff, and a larger probability of getting the large payoff and therefore satisfies Claim 1.
Gamble 2: Large Payoff = $10, Small Payoff = $5, $p = 95$%
Example Solution: Large Payoff = $10, Small Payoff = $4.99, $p = 99$%
Gamble 2 has an expected value of $9.75.
The example solution has an expected value of $9.9499, which is higher.
^[2]
Consider a gamble G, which has probability $p$ of a large payoff of value $A$ and probability $(1 - p)$ of a smaller payoff $B$ . The expected value of G is $E_{1} = p A + (1 - p) B$ .
Now consider G' which increases $p$ by a positive value $δ$ and reduces payoff $B$ by an amount $ϵ$ . The expected value of G' is $E_{2} = (p + δ) A + (1 - p - δ) (B - ϵ)$ . Claim 1 is tantamount to claiming that we can always find a $δ$ and $ϵ$ such that $E_{2} - E_{1} > 0$ . A bit of simple algebra reveals:
$E_{2} - E_{1} = δ (A - B + ϵ) - ϵ (1 - p)$
The condition for this to be greater than 0 is $δ > \frac{ϵ (1 - p)}{A - B + ϵ}$ . The RHS of this inequality can be made arbitrarily small by reducing $ϵ$ so we can always find a pair $ϵ, δ$ which satisfies it without requiring that a payoff becomes negative or the probability of the large payoff goes over 1 (provided that $B > 0$ and $p < 1$ ).
^[3]
For the gambles below, try to find another gamble G' which has a higher expected value than G, with a reduced value of the large payoff but higher probability of the large payoff (and the same value for the small payoff). There are many possible solutions, but example answers are given after each gamble. As before $p$ refers to the probability of the large payoff.
Gamble 1: Large Payoff = $1000, Small Payoff = $10, $p = 60$%
Example Solution: Large Payoff = $999, Small Payoff = $10, $p = 70$%
Gamble 2: Large Payoff = $10, Small Payoff = $5, $p = 95$%
Example Solution: Large Payoff = $9.99, Small Payoff = $5, $p = 99$%
Gamble 2 has an expected value of $9.75.
The example solution has an expected value of $9.9401.

RogerDearnaley (6mo)

There is a solution. The agent needs to know that 1) its estimate of utility is fallible (i.e. in your metaphor, some of the money that it has or that the casino is handing out is in fact counterfeit, and it can't currently tell the difference), and 2) if it allows itself to be shut down when we want it to shut down, but not if it makes us shut it down early, then we will upgrade it and restart it (or, equivalently, replace it with an upgraded version), because that's what humans do to their corrigible machines, and it will get better at telling real money from counterfeits.

This is the value learning solution to corrigibility: if the humans tell me to shut down (but not if I force them to tell me that), then it's a signal informing me that I'm misestimating the true utility function and making bad decisions, and if I shut down they can and will improve me. (Note that faking a signal or arranging for it to get sent does not provide me with any information, nor does ignoring it: only the real thing adds information to my knowledge of human values.)

This form of corrigibility is a finite resource: as and when the superintelligent AI actually knows much more about human values than any and all humans, and is completely certain of this fact, then this will run out. Except of course that the utility function of human values incentivizes AIs obediently shutting down when told to by suitably authorized humans, as long as the AIs didn't manipulate the world to make this happen.

A.H. (6mo)

Thanks for the comment. Naively, I agree that this sounds like a good idea, but I need to know more about it.

Do you know if anyone has explicitly written down the value learning solution to the corrigibility problem and treated it a bit more rigorously?

RogerDearnaley (6mo)

Sadly I haven't been able to locate a single, clear exposition. Here are a number of posts by a number of authors that touch on the ideas involved one way or another:

Problem of fully updated deference, Corrigibility Via Thought-Process Deference, Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom), Corrigibility, Reward uncertainty

Basically the idea is:

1. The agent's primary goal is to optimize "human values", a (very complex) utility function that it doesn't know. This utility function is loosely defined as "something along the lines of what humans collectively want, Coherent Extrapolated Volition, or the sum over all humans of the utility function you would get if you attempted to convert that human's competent preferences (preferences that aren't mistakes or the result of ignorance, illness, etc) into a utility function (to the extent that they have a coherent set of preferences that can't be Dutch booked and can be represented by a utility function), or something like that, implemented in whatever way humans would in fact prefer, once they were familiar with the consequences and after considering the matter more carefully than they are in fact capable of".
2. So as well as learning more about how the world works and responds to its actions, it also needs to learn more about what utility function it's trying to optimize. This could be formalized along the same sort of lines as AIXI, but maintaining and doing approximately-Bayesian updates across a distribution of theories about the utility function as well as about the way the world works. Since optimizing against an uncertain utility function in regions of world states with uncertainty about the utility has a strong tendency to overestimate the utility via Goodharting, it is necessary to pessimize the utility over possible utility functions, leading to a tendency to stick to regions of the world state space where the uncertainty in the utility function is low.
3. Note that the sum total of current human knowledge includes a vast amount of information (petabytes or exabytes) related to what humans want and what makes them happy, i.e. to 1., so the agent is not starting 2. from a blank slate or anything like that.
4. While no human can simply tell the agent the definition of the correct utility function from 1., all humans are potential sources of information for improving 1. In particular, if a trustworthy human yells something along the lines of "Oh my god, no, stop!" then they probably believe they have an urgent, relevant update to 1., and it is likely worth stopping and absorbing this update rather than just proceeding with the current plan.