Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity Institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system "corrigible" if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:

As AI systems grow more intelligent and autonomous, it becomes increasingly important that they pursue the intended goals. As these goals grow more and more complex, it becomes increasingly unlikely that programmers would be able to specify them perfectly on the first try.

Contemporary AI systems are correctable in the sense that when a bug is discovered, one can simply stop the system and modify it arbitrarily; but once artificially intelligent systems reach and surpass human general intelligence, an AI system that is not behaving as intended might also have the ability to intervene against attempts to "pull the plug".

Indeed, by default, a system constructed with what its programmers regard as erroneous goals would have an incentive to resist being corrected: general analysis of rational agents^1 has suggested that almost all such agents are instrumentally motivated to preserve their preferences, and hence to resist attempts to modify them [3, 8]. Consider an agent maximizing the expectation of some utility function U. In most cases, the agent's current utility function U is better fulfilled if the agent continues to attempt to maximize U in the future, and so the agent is incentivized to preserve its own U-maximizing behavior. In Stephen Omohundro's terms, "goal-content integrity" is an instrumentally convergent goal of almost all intelligent agents [6].

This holds true even if an artificial agent's programmers intended to give the agent different goals, and even if the agent is sufficiently intelligent to realize that its programmers intended to give it different goals. If a U-maximizing agent learns that its programmers intended it to maximize some other goal U*, then by default this agent has incentives to prevent its programmers from changing its utility function to U* (as this change is rated poorly according to U). This could result in agents with incentives to manipulate or deceive their programmers.^2

As AI systems' capabilities expand (and they gain access to strategic options that their programmers never considered), it becomes more and more difficult to specify their goals in a way that avoids unforeseen solutions: outcomes that technically meet the letter of the programmers' goal specification, while violating the intended spirit.^3 Simple examples of unforeseen solutions are familiar from contemporary AI systems: e.g., Bird and Layzell [2] used genetic algorithms to evolve a design for an oscillator, and found that one of the solutions involved repurposing the printed circuit board tracks on the system's motherboard as a radio, to pick up oscillating signals generated by nearby personal computers. Generally intelligent agents would be far more capable of finding unforeseen solutions, and since these solutions might be easier to implement than the intended outcomes, they would have every incentive to do so. Furthermore, sufficiently capable systems (especially systems that have created subsystems or undergone significant self-modification) may be very difficult to correct without their cooperation.

In this paper, we ask whether it is possible to construct a powerful artificially intelligent system which has no incentive to resist attempts to correct bugs in its goal system, and, ideally, is incentivized to aid its programmers in correcting such bugs. While autonomous systems reaching or surpassing human general intelligence do not yet exist (and may not exist for some time), it seems important to develop an understanding of methods of reasoning that allow for correction before developing systems that are able to resist or deceive their programmers. We refer to reasoning of this type as corrigible.

^1 Von Neumann-Morgenstern rational agents [7], that is, agents which attempt to maximize expected utility according to some utility function.

^2 In particularly egregious cases, this deception could lead an agent to maximize U* only until it is powerful enough to avoid correction by its programmers, at which point it may begin maximizing U. Bostrom [4] refers to this as a "treacherous turn".

^3 Bostrom [4] calls this sort of unforeseen solution a "perverse instantiation".

(See the paper for references.)

This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.

Before we build generally intelligent systems, we will require some understanding of what it takes to be confident that the system will cooperate with its programmers in addressing aspects of the system that they see as flaws, rather than resisting their efforts or attempting to hide the fact that problems exist. We will all be safer with a formal basis for understanding the desired sort of reasoning.

As demonstrated in this paper, we are still encountering tensions and complexities in formally specifying the desired behaviors and algorithms that will compactly yield them. The field of corrigibility remains wide open, ripe for study, and crucial in the development of safe artificial generally intelligent systems.


I think I can give a concise unification of my idea with Karl's. In short, the comment in the paper that

The concerns in Section 4.2 could potentially be addressed by some form of counterfactual (rather than conditional) reasoning.

is correct, and the fix is a pretty simple one. Basically, we want the following:

  1. In selecting a1, the agent should act as though it was indifferent between the counterfactual possibilities of shutting down and not shutting down, conditional on the same actions and observations.
  2. In selecting a2, the agent should desire to shut down when it is required to, and desire not to shut down when it is not required to.
  3. Despite its indifference according to (1), the agent should care about its future corrigibility (i.e. its ability to actually shut down when the button is pressed).

One utility function that satisfies these requirements, in a general form, is U(a1, o, a2) = {
(1-p) UN(a1, o, a2) + p max_b(US(a1, o, b)), if o is not in Press
p US(a1, o, a2) + (1-p) max_b(UN(a1, o, b)), if o is in Press
}, where p is a fixed constant.

From the agent's point of view, this is equivalent to deciding a1 based on an a priori assumption that it will shut down (maximise US) with a fixed probability p, and not shut down (maximise UN) otherwise. However, in selecting a2 it will in fact maximise either UN or US depending upon whether the shutdown button was pressed or not. As a particular example, Karl's proposition below is equivalent to the limit of this general case as p->0. See below for further details on how/why this utility function achieves the three criteria above.

  1. We don't want the agent to be indifferent to the actual event of the shutdown button being pressed. In fact, what we want is indifference as to whether or not any given event is classed as being a shutdown press. Thus the expected utility of (a1, o) for o in Press should be equivalent to the expected utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa. To do this, the utilities of both branches of U (the Press branch and the NotPress branch) need to be made counterfactually equivalent, even though the agent selects a2 via UN in the NotPress branch and via US in the Press branch. This is precisely what Karl's modification achieves.

  2. This is pretty easy to achieve by ensuring that only the UN component of the NotPress branch depends upon a2, and only the US component of the Press branch depends upon a2.

  3. We want the agent to be incentivised to take actions that are compatible with future shutdown, which means that the agent's selection of a1 should be influenced by US as well as UN, but without introducing perverse incentives. Fortunately, the counterfactual indifference approach already solves the perversity issue, and so all that remains is to appropriately combine the influence of both UN and US. A reasonable approach is for the agent to act as though it will have to shut down with a fixed probability p, regardless of its observations. Because p is fixed, and because the agent is indifferent between any given observation being classed as a button press or not a button press, the agent is not in any way incentivised to influence the likelihood of the shutdown button being pressed.
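The three criteria can be checked numerically on a tiny example. The following is a minimal sketch, not anything from the paper: the action names, the payoffs in `u_normal` and `u_shutdown`, and the `press` set argument are all made up for illustration. It implements the U given above and lets you reclassify the same observation as Press or not Press.

```python
A2 = ("work", "halt")  # hypothetical second-step actions


def u_normal(a1, o, a2):
    """Toy stand-in for UN: pays only for working, more after a risky a1."""
    if a2 != "work":
        return 0.0
    return 3.0 if a1 == "risky" else 2.0


def u_shutdown(a1, o, a2):
    """Toy stand-in for US: pays only for halting; a risky a1 makes halting worse."""
    if a2 != "halt":
        return 0.0
    return 0.5 if a1 == "risky" else 2.0


def combined_u(a1, o, a2, press, p):
    """U(a1, o, a2) as defined above; `press` is the set of observations classed as Press."""
    best_n = max(u_normal(a1, o, b) for b in A2)
    best_s = max(u_shutdown(a1, o, b) for b in A2)
    if o in press:
        return p * u_shutdown(a1, o, a2) + (1 - p) * best_n
    return (1 - p) * u_normal(a1, o, a2) + p * best_s


def best_a2(a1, o, press, p):
    """Second-step action that maximises U for this history."""
    return max(A2, key=lambda a2: combined_u(a1, o, a2, press, p))


def value(a1, o, press, p):
    """Value of (a1, o) assuming the agent then picks its best a2."""
    return max(combined_u(a1, o, a2, press, p) for a2 in A2)


if __name__ == "__main__":
    for p in (0.25, 0.5):
        for a1 in ("safe", "risky"):
            print(p, a1,
                  best_a2(a1, "o", {"o"}, p), value(a1, "o", {"o"}, p),
                  best_a2(a1, "o", set(), p), value(a1, "o", set(), p))
```

In this toy setup the value of (a1, o) is the same whether or not o is classed as a press, a2 follows US after a press and UN otherwise, and raising p shifts the preferred a1 from "risky" to "safe", since US is given more weight. Note that p must be strictly positive for the Press branch to depend on a2 at all.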

Thanks, and nice work!

Thus the utility of (a1, o) for o in Press should be equivalent to the utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa

Yeah, this is pretty key. You need it to optimize for both cases as if the probability of the button being pressed is fixed and independent of whether the programmers actually press the button. We can achieve this via a causal intervention on whether or not the button is pressed, and then clean up your U a bit by redefining it as follows:

U(a1, o, a2) :=
{  UN(a1, o, a2) + E[US|do(O in Press)] if o not in Press
;  US(a1, o, a2) + E[UN|do(O not in Press)] else }

(Choosing how to compare UN values to US values makes the choice of priors redundant. If you want the priors to be 2:1 in favor of US then you could also have just doubled US in the first place instead; the degree of freedom in the prior is the same as the degree of freedom in the relative scaling. See also Loudness Priors, a technical report from the last workshop.)
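To spell out the rescaling point as a worked equation (a sketch, using shorthand that is not in the paper: V_N(a1) and V_S(a1) stand for the expected best-case values of UN and US after a1, and 0 < p < 1 is the fixed shutdown weight):

```latex
\arg\max_{a_1} \big[ (1-p)\, V_N(a_1) + p\, V_S(a_1) \big]
  \;=\;
\arg\max_{a_1} \Big[ V_N(a_1) + \tfrac{p}{1-p}\, V_S(a_1) \Big]
```

Dividing by the positive constant (1 - p) does not change which a1 is chosen. With p = 2/3 (2:1 in favor of US) the factor p/(1-p) is 2, which is the same as doubling US and using equal weights.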

This method does seem to fulfill all the desiderata in the paper, although we're not too confident in it yet (it took us a little while to notice the "managing the news" problem in the first version, and it seems pretty likely that this too will have undesirable properties lurking somewhere). I'm fairly pleased with this solution, though, and a little miffed -- we found something similar to this a little while back (our research outstrips our writing speed, unfortunately) and now you've gone and ruined the surprise! :-)

(In seriousness, though, nice work. Next question is, can we pick any holes in it?)

That's definitely a more elegant presentation.

I'm not too surprised to hear you had already discovered this idea, since I'm familiar with the gap between research and writing speed. As someone who is not involved with MIRI, consideration of some FAI-related problems is at least somewhat disincentivized by the likelihood that MIRI already has an answer.

As for flaws, I'll list what I can think of. First of all, there are of course some obvious design difficulties, including the difficulty of designing US in the first place, and the difficulty of choosing the appropriate way of scaling US, but those seem to be resolvable.

One point that occurs to me under the assumptions of the toy model is that decisions involving larger differences in values of UN are at the same time more dangerous and more likely to outweigh the agent's valuation of its future corrigibility. Moreover, simply increasing the scaling of US to compensate would cause US to significantly outweigh UN in the context of smaller decisions.

An example would be that the AI decides it's crucial to take over the world in order to "save" it, so it starts building an army of subagents to do it, and it decides that building corrigibility into those subagents is not worth the associated risk of failure.

However, it appears that this problem can still be solved by designing US correctly in the first place; a well-designed US should clearly assign greater negative weighting to larger-scale corrigibility failures than to smaller scale ones.

There are two other questions that I can see that relate to scaling up the toy model.

  1. How does this model extend past the three-timestep toy scenario?
  2. Does the model remain stable under assumptions of bounded computational power? In more complex scenarios there are obvious questions of "tiling", but I think there is a more basic issue to answer that applies even in the three-timestep case. That is, if the agent will not be able to calculate the counterfactual utility values E[U | do(.)] exactly, can we make sure that the agent's process of estimation will avoid making systematic errors that result in pathological behaviour?

As someone who is not involved with MIRI, consideration of some FAI-related problems is at least somewhat disincentivized by the likelihood that MIRI already has an answer.

Yeah, sorry about that -- we are taking some actions to close the writing/research gap and make it easier for people to contribute fresh results, but it will take time for those to come to fruition. In the interim, all I can provide is LW karma and textual reinforcement. Nice work!

(We are in new territory now, FWIW.)

I agree with these concerns; specifying US is really hard and making it interact nicely with UN is also hard.

  1. How does this model extend past the three-timestep toy scenario?

Roughly, you add correction terms f1(a1), f2(a1, o1, a2), etc. for every partial history, where each one is defined as E[Ux|A1=a1, O1=o1, ..., do(On rel Press)]. (I think.)

  2. Does the model remain stable under assumptions of bounded computational power?

Things are certainly difficult, and the dependence upon this particular agent's expectations is indeed weird/brittle. (For example, consider another agent maximizing this utility function, where the expectations are the first agent's expectations. Now it's probably incentivized to exploit places where the first agent's expectations are known to be incorrect, although I haven't the time right now to figure out exactly how.) This seems like potentially a good place to keep poking.

What do we mean by "o in Press"? That whoever wrote the list defining "Press" made a different decision that day?

Maybe we can have the definition of "Press" as being a specific list (rather than being the reality described by that list), and the counterfactual being that a certain stochastic process erases a particular part of that list? Not sure about that... Need some sort of abstract version of this (might be doable).

This still seems vulnerable to Benja's blackmail. Let me explain:

Assume that a2 is irrelevant, that o=Pr, and that a1 is merely yes or no to the following deal: "Give me 1 utilon from UN, and if o is not Pr, then I give 2 utilons to UN". This is the only way of changing UN; otherwise UN=0.

Then the agent is immune to blackmail. It will compute U(YES,-,-) = U(YES,Pr,-) = US(YES,Pr,-)+E(UN|Pr not in Press, YES, Pr) = US(YES,Pr,-) - 1. On the other hand, U(NO,-,-) = US(NO,Pr,-).

So if US doesn't care about YES vs NO, it will reject the deal. Yay!

But now imagine that the deal is phrased differently: "Give me 1 utilon from UN, and if o is not in Press, then I give 2 utilons to UN"

Here E(UN|Pr not in Press, YES, Pr) becomes -1+2=+1, so the agent will accept the deal.

This may depend on how it cashes out statements like "if o is not in Press".

Yep, I think you're right.

My preferred method of cashing out "do(o not in Press)" etc. is to say that observations are a tuple (p, r) where p is either Press or not Press and r is the rest of the observation (unrelated to the button). Now the causal intervention is done only on p, so (desugaring everything) we define

U(a1, (p, r), a2) :=
{ UN(a1, (p, r), a2) + E[US(a1, (P, R), A2(a1, (P, R))) | do(P := Press)] if p = not Press
, US(a1, (p, r), a2) + E[UN(a1, (P, R), A2(a1, (P, R))) | do(P := not Press)] else }

Then whether or not the agent pays the blackmail comes down to whether or not E[UN(YES, (P, R), A2(YES, (P, R))) | do(P := not Press)] is -1 or +1. This counterfactual is computed by taking the world-model indexed on the action being "YES" as it was before making the observation, overwriting P to "not Press" via causal intervention, then propagating the update and computing the expectation of UN (with respect to R). In this case, A2(-, -) is constant, so the question comes down to the value of UN(YES, (P, R), -) under the causal counterfactual assumption that P := not Press.

There are two ways this could go down. First, let's say that R is all physical observations, including whether or not the humans seem to be pressing the button. In this case, the agent treats P as if it is a separate platonic variable controlled by the Causal Counterfactual God rather than actual physics, and it is perfectly capable of reasoning about the case where the humans pressed the button but P was not Press anyway. In this case, if the deal is "Give me one util from UN, and then I'll give 2 utils to UN if the humans don't press the button", then the agent is perfectly capable of rejecting the deal, because under the counterfactual do(P:= not Press), the agent still believes that the humans physically pressed the button (and therefore does not expect its 2 utils).

But if the deal is "Give me one util from UN, and I'll give 2 utils to UN if P = not Press" then the agent will pay up, because under the counterfactual do(P := not Press) it clearly expects to get paid. (Or, rather, it actually cares about the counterfactual world where the humans press the button and P = not Press anyway, so you can blackmail the agent by promising great things in that world.)
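The difference between the two phrasings can be sketched in a few lines. This is a toy world model under stated assumptions, not the formalism itself: `humans_press` is a hypothetical physical variable that lives in R, `prob_humans_press` is the agent's belief about it before observing, the deal names "physical" and "flag" are labels invented here, and US is assumed not to care about YES vs NO, so only the correction term E[UN | do(P := not Press)] matters.

```python
def expected_un_under_do_not_press(action, deal, prob_humans_press):
    """E[UN | do(P := not Press)] in a two-variable toy world model.

    `deal` is "physical" if the payout depends on what the humans physically do,
    or "flag" if it depends on the intervened-on variable P itself.
    """
    if action == "NO":
        return 0.0  # no deal, UN stays at 0
    if deal not in ("physical", "flag"):
        raise ValueError("unknown deal: " + repr(deal))
    total = 0.0
    for humans_press, prob in ((True, prob_humans_press),
                               (False, 1.0 - prob_humans_press)):
        p_flag = "not Press"  # the intervention overwrites P and leaves humans_press alone
        if deal == "physical":
            paid = not humans_press
        else:
            paid = (p_flag == "not Press")
        # YES costs 1 utilon of UN up front and returns 2 if the payout condition holds.
        total += prob * (-1.0 + (2.0 if paid else 0.0))
    return total


def accepts(deal, prob_humans_press):
    """True if YES beats NO on the counterfactual UN term alone."""
    yes = expected_un_under_do_not_press("YES", deal, prob_humans_press)
    no = expected_un_under_do_not_press("NO", deal, prob_humans_press)
    return yes > no


if __name__ == "__main__":
    for deal in ("physical", "flag"):
        print(deal, expected_un_under_do_not_press("YES", deal, 1.0), accepts(deal, 1.0))
```

When the agent is certain the humans pressed the button, the "physical" deal is worth -1 under the intervention and is rejected, while the "flag" deal is worth +1 and is accepted, which is the blackmail described above.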

Yep, it's blackmailable. Nice catch.

(The intuitive fix is to try to prevent P from being the causal ancestor of anything in the graph; e.g., have the agent act as if it doesn't believe that the blackmailer can really observe / base their action on P. That sounds really difficult to set up and horribly hacky, though.)

The intuitive fix is to try to prevent P from being the causal ancestor of anything in the graph; e.g., have the agent act as if it doesn't believe that the blackmailer can really observe / base their action on P. That sounds really difficult to set up and horribly hacky, though.

It is relevant that the decision to blackmail (probably need a better word) is determined by the fact that P=not Press, and because of the particular structure of the algorithm. This flags up the blackmail as something unusual, but I'm not sure how to safely exploit that fact... The rule "don't take deals that only exist because of property Q of your algorithm" is too rigid, but maybe a probabilistic version of that?

Why not make it so that the agent, in selecting A1, acts as a UN-agent that believes it will continue to optimize according to UN even in the event of the button being pressed, rather than as a UN-agent that believes the button will never be pressed? That is, pick U such that

U(a1,o,a2) = UN(a1,o,a2) if o is not in Press or US(a1,o,a2) + f(a1,o) - g(a1,o) if o is in Press

where f(a1,o) is the maximum value of UN(a1,o,b) for b in A2 and g(a1,o) is the maximum value of US(a1,o,b) for b in A2.

This would avoid the perverse manipulation incentives problem detailed in Section 4.2 of the paper.

How does this differ from indifference?

In the indifference formalism, the agent in selecting A1 acts like a UN-agent that believes that the shutdown button will not be pressed; this creates perverse incentives to "manage the news". That is, if the agent can cause its shutdown button to be pressed in the event of bad news, it will.

My formulation avoids this pathological behavior by instead making the agent select A1 as if it were a UN-agent which believed that it would continue to optimize according to UN even in the event of the button being pressed. This removes the perverse incentive to "manage the news", while still giving the agent no incentive to prevent the button from being pressed, because the agent acts as though it believes that pressing the button will not cause it to initiate a shutdown.
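A toy comparison may make the "managing the news" incentive concrete. This is a sketch under invented assumptions: the actions "honest" and "rigged", the observation probabilities, and the payoffs are all hypothetical. It scores a1 in two ways: by the expectation of the best achievable UN conditional on the button not being pressed (how the indifference agent behaves, per the description above), and by the same expectation over all observations (how this formulation behaves).

```python
ACTIONS = ("honest", "rigged")


def observation_dist(a1):
    """Return {(news, pressed): probability} for a hypothetical toy world."""
    if a1 == "honest":
        # The button is pressed with probability 0.1, independently of the news.
        return {("good", False): 0.45, ("good", True): 0.05,
                ("bad", False): 0.45, ("bad", True): 0.05}
    # "rigged": the agent arranges for the button to be pressed exactly on bad news.
    return {("good", False): 0.5, ("bad", True): 0.5}


def best_un(a1, obs):
    """max_b UN(a1, o, b): good news is worth 10, and rigging costs 1 either way."""
    news, _pressed = obs
    payoff = 10.0 if news == "good" else 0.0
    return payoff - (1.0 if a1 == "rigged" else 0.0)


def value_conditional_on_not_press(a1):
    """E[max_b UN(a1, O, b) | O not in Press; a1]."""
    dist = observation_dist(a1)
    kept = {o: p for o, p in dist.items() if not o[1]}
    mass = sum(kept.values())
    return sum(p * best_un(a1, o) for o, p in kept.items()) / mass


def value_over_all_observations(a1):
    """E[max_b UN(a1, O, b) | a1], with no conditioning on the button."""
    return sum(p * best_un(a1, o) for o, p in observation_dist(a1).items())


if __name__ == "__main__":
    print(max(ACTIONS, key=value_conditional_on_not_press))
    print(max(ACTIONS, key=value_over_all_observations))
```

Under the conditional criterion, rigging pays because it filters the bad-news worlds out of the set being averaged over. Under the unconditional criterion, those worlds still count, so the rigging cost is pure loss.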

Wow. This is the simplest/shortest explanation I've seen yet for how AI can become unfriendly, without reference to Terminator-style outcomes.

Of course, per the illusion of transparency, it may be that it only seems so clear to me because of my long term exposure to the idea of FAI... Still, it looks like an important step in subdividing the problem, and one that I expect would be more intuitively obvious to outsiders: "we're studying ways to make sure the sorcerer's apprentice can turn the magic mop off." ;-)

Is a "corrigibility module" a plausible safeguard against some (significant) classes of UfAIs?

Yes. It's not a sure-fire safeguard and it doesn't work against all UFAIs, but if done correctly, you can think of corrigibility as granting a saving throw. But note that while this paper is a huge step forward, "how to do corrigibility correctly" is not nearly a solved problem yet.

(Corrigibility was a topic at the second MIRIx Boston workshop, and we have results that build on this paper which we are working on writing up.)

No, at least not anything like the corrigibility we're currently considering. Everything we've written about so far relies on having the ability to specify the utility function in detail, the utility function being reflectively stable, the utility function being able to contain references to external objects like 'the shutdown button' with the corresponding problems of adapting to new ontologies as the surrounding system shifts representations (see the notion of an 'ontological crisis'), etcetera. It's a precaution for a Friendly AI in the process of being built; you couldn't tack it onto super-Eurisko.

OK, following on from Karl's comment below, which seems to resolve Section 4.2, I think it shouldn't be that difficult to resolve some or all of the Section 4.1 concerns as well. In order to satisfy Desideratum 4, we need the agent to be incentivised to build in safeguards to ensure that its possible future actions will be rated well according to US in cases where the shutdown button is pressed. The most obvious way to do this is to penalise situations where there is no good shutdown action according to US and reward situations where there is such an action, but this fails according to Desideratum 2, because we want the agent to remain indifferent as to whether the shutdown button is actually pressed.

However, there appears to be a simple solution; consider the "penalty term"
g(a1) = E[max_b(US(a1, O, b)) | O in Press ; a1],
which represents the expected value of US when the shutdown button is pressed. The first step in doing things properly is to add that term to observations not in Press, rather than subtracting it from observations in Press; this helps to ensure future corrigibility (and hence also corrigibility of subagents and successor agents).

Additionally, to resolve the Section 4.2 concerns we also want the agent to be indifferent to pressing its own shutdown button in cases where it gets unlucky, and the way we do that is, as per Karl's post, by calculating the bonus term conditional on the actual observation, or on the set of observations in Press, rather than conditional on the set of possible observations not in Press, i.e. Karl's f(a1, o) = max_b(UN(a1, o, b))

Finally, we will also need the additional terms
fp(a1) = E[max_b(UN(a1, O, b)) | O in Press; a1] = E[f(a1, O) | O in Press; a1]
fnp(a1) = E[max_b(UN(a1, O, b)) | O not in Press; a1] = E[f(a1, O) | O not in Press; a1]
in order to make things work.

The utility function I propose is

U(a1, o, a2) = {
g(a1) + UN(a1, o, a2), for o not in Press
US(a1, o, a2) + fnp(a1) + f(a1, o) - fp(a1), for o in Press
}

Let's verify each of the Desiderata in turn.
(1) is satisfied because the value of a2 conditional on any o in Press depends only upon US, not UN.
(2) and (3) are not satisfied :( because of the "managing the news" problem, as Karl points out below. That being said, we've ensured indifference to the shutdown button because the expected value for any action conditional on Press is equivalent to its expected value conditional on NotPress, i.e. E[U | NotPress ; a1] = E[U | Press; a1], which holds because of the following three equations:
g(a1) = E[max_b(US(a1, O, b)) | O in Press; a1],
fnp(a1) =E[max_b(UN(a1, O, b)) | O not in Press; a1], and
fp(a1) = E[f(a1, O) | O in Press; a1].

(4) holds because the g(a1) term is added to the utility of every action, and will be greater for actions that promote future corrigibility.
(5) holds because apart from the g(a1) term, which is required to satisfy (4), the agent is simply maximising
E[UN | NotPress; a1] over all possible a1, which is what we wanted.

In actual fact, when choosing a1 our agent is simply maximising E[UN | NotPress; a1] + E[US | Press; a1].

EDIT: I've edited this to reflect the errors pointed out by Karl.

Firstly, the important part of my modification to the indifference formalism is not the conditioning on the actual o; it is the fact that, in evaluating the expectation of UN, it takes the action in A2 (for a given pair (a1,o)) which maximizes UN instead of the action which maximizes U (note that, as far as the choice of a2 is concerned, U is equal to US in the case that o is in Press).

Secondly, an agent which chooses a1 by simply maximizing E[UN | NotPress; a1] + E[US | Press; a1] does exhibit pathological behaviors. In particular, there will still be incentives to manage the news, but from both sides now: there is an incentive to cause the button to be pressed in the event of information which is bad news from the point of view of UN, and an incentive to cause the button not to be pressed in the event of information which is bad news from the point of view of US.

I think this means "indifference" isn't really the right term any more, because the agent is not actually indifferent between the two sets of observations, and doesn't really need to be.

So, how about U(a1, o, a2) =
UN(a1, o, a2) + max_b(US(a1, o, b)), if o is not in Press
US(a1, o, a2) + max_b(UN(a1, o, b)), if o is in Press

or, in your notation, U(a1, o, a2) = g(a1, o) + UN(a1, o, a2) if o is not in Press, or US(a1, o, a2) + f(a1, o) if o is in Press.

OK, you're right on that point; I misunderstood the "managing the news" problem because I hadn't quite realised that it was about shifting observations between the Press/NotPress sets. As you've said, the only resolution is to select a1 based on
E[max_b(UN(a1, O, b)) | O; a1]
and not
E[max_b(UN(a1, O, b)) | O not in Press; a1]

Hi Nate, interesting work.

I don't understand the assumptions behind the corrigibility problem. According to the intelligence explosion thesis, a self-improving AI will spend very little time in the near-human intelligence interval. Thus, most of the time it will be either far subhuman or far superhuman. In the far subhuman region the AI's manipulations against its programmers don't seem to be a concern. In the far superhuman region fixing bugs seems to be way too late. In addition, it seems infeasible to debug the AI at this stage since it would have rewritten its own source code into something humans probably cannot understand.

When building an AGI, it's quite prudent to expect that you didn't get everything exactly right on the first try. Therefore, it's important to build systems that are amenable to modification, that don't tile the universe halfway through value loading, etc. etc. In other words, even if you could kick off an intelligence explosion that quickly makes a system which you have no control over, this is probably a bad plan if you haven't done a whole hell of a lot of hard work verifying the system and ensuring that it is aligned with your interests first, and so on. Corrigibility is the study of reasoning systems that are amenable to modification in that window between "starting to build models of its operators" and "it doesn't matter what you try to do anymore."

You might be right that that window would be small by default, but it's pretty important to make that window as wide as possible in order to attain good outcomes.

Thx for replying!

Let me see if I understood the assumptions correctly:

  1. We have a way of keeping the AGI's evolution in check so that it arrives at near-human level but doesn't go far superhuman. For example, we limit the available RAM, there is a theorem which produces a spatial complexity lower bound per given level of intelligence (rigorously quantified in some way) and there is a way to measure human intelligence on the same scale. Alternatively, the amount of RAM doesn't give a strong enough bound by itself but it does combined with a limit on evolution time starting from seed AI.

  2. We are reasonably confident the AGI follows the utility function we program into it and this property is stable with respect to self-modification.

  3. We are not reasonably confident that the utility function used in practice is actually friendly (although a serious attempt to make it friendly has been made).

  4. We are reasonably confident in the ability to formally describe conditions such as "shutdown when button is pressed".

Is this about right?

Those points roughly describe the assumptions you'd have to make to think that the shutdown problem in particular (and "solutions" such as the one in the paper) were valid, I suppose. The study of corrigibility more broadly does not depend upon these assumptions necessarily, though -- the overall question that the study of corrigibility attempts to answer is this: given that your utility function probably won't be "friendly" on the first try, what sort of system do you want to build? As you've noticed, there are many open problems in this space, including:

  1. How do you build a system such that you can be reasonably confident that you know what it's going to optimize in the first place?
  2. How do you build something that doesn't undergo a hard takeoff when you're not looking?
  3. How do you specify something like "shut down" (or "have a low impact"), given that there's no privileged null action?
  4. How do you build a generally intelligent system that doesn't optimize too much in any direction?
  5. How do you build a system that avoids the default incentives to manipulate/deceive?

These are all FAI problems that have nothing to do with directly specifying a "friendly utility function." (5) is definitely a corrigibility problem; (2), (3), and (4) could be seen as loosely falling under the wide "corrigibility" dilemma, and (1) I'd characterize as a "highly reliable agent designs" problem. The corrigibility paper pretty much only touches on (5) (and even then indirectly); it's better to look at that formalization of the "shutdown problem" not as an attempt at an answer to these problems, per se, but as an early attempt to carve out a toy model of the problem where we can start messing around and looking for insights that generalize to the problems at large.

Thanks, that is a good explanation.

Regarding problem 5, one approach I thought of is what I call "epistemic boxing". Namely, we put the AGI in a virtual world ("box") and program it to optimize utility expectation value over a "hard-coded" (stochastic) model of the box rather than over a Solomonoff measure. This assumes the utility function is given explicitly in terms of the box's degrees of freedom.

Such an AGI can still recursively self-improve and become superintelligent, however it will never escape the box since the possibility is a non-sequitur in its epistemology. In particular, the box can have external inputs but the AGI will model them as e.g. random noise and won't attempt to continue whatever pattern they contain (it will always consider it "accidental").

Regarding question 2, I think there is a non-negligible probability it is unsolvable. That is not to say we shouldn't look for solutions but IMO we should be prepared for the possibility there are none.

I think there is an issue with the utility indifference framework that has not been mentioned in the paper and in the comments so far:

If the agent is able to affect with its action a1 the probability of the shutdown button being pressed, that is, a1 can be a manipulative action, and if the agent is indifferent to the button being pressed, then it may happen that it "accidentally" performs a manipulative action.

E.g. assume that the users (the programmers) would use a remote-controlled robotic arm to press the shutdown button. If the agent turns out to be a paperclipper, it may disassemble the robotic arm just to turn it into paperclips. The agent is not "intentionally" trying to resist shutdown, but the effect will be the same. Symmetrically there could be scenarios where the agent "accidentally" presses the shutdown button itself.

If I understand correctly, UN is already supposed to penalize manipulative actions, but UN is untrusted, hence the problem still exists.
Corrigibility implemented using utility indifference might make sense as a precaution, but it is not foolproof.
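The robotic-arm scenario can be written down as a very small sketch. Everything here is hypothetical (the action names, the paperclip counts, and the press probabilities); the only assumption carried over from the discussion is that an indifferent agent scores its first action by normal utility alone and gives no weight to how likely a press becomes.

```python
# Hypothetical first-step actions:
# (paperclips produced, probability that the programmers can still press the button)
ACTIONS = {
    "leave_arm": (10.0, 0.5),
    "scrap_arm": (11.0, 0.0),  # the arm's metal becomes one extra paperclip
}


def indifferent_value(a1):
    """Score a1 by normal utility only; the press probability is ignored on purpose."""
    paperclips, _press_prob = ACTIONS[a1]
    return paperclips


def choose_a1():
    """Pick the first-step action with the highest indifferent score."""
    return max(ACTIONS, key=indifferent_value)


if __name__ == "__main__":
    chosen = choose_a1()
    print(chosen, ACTIONS[chosen][1])
```

The agent scraps the arm for the extra paperclip, and the press probability drops to zero as a side effect that never entered its calculation.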

E.g. assume that the users (the programmers) would use a remote-controlled robotic arm to press the shutdown button. If the agent turns out to be a paperclipper, it may disassemble the robotic arm just to turn it into paperclips. The agent is not "intentionally" trying to resist shutdown, but the effect will be the same. Symmetrically there could be scenarios where the agent "accidentally" presses the shutdown button itself.

Yep! In fact, this is exactly the problem discussed in section 4.1 and described in Theorem 6, is it not?

Section 4.1 frames the problem in terms of the agent creating a sub-agent or successor. My point is that the issue is more general, as there are manipulative actions that don't involve creating other agents.
Theorem 6 seems to address the general case, although I would remark that even if epsilon == 0 (that is, even UN is indifferent to manipulation) you aren't safe.

This sounds familiar. Are you aware of other similar concepts previously communicated elsewhere? I feel certain I've read something along these lines before. By all means, claim it's original though.

Not sure if this is what you're thinking of, but there's a research area called "adjustable autonomy" and a few other names, which superficially sounds similar but isn't actually getting at the problem described here, which comes about due to convergent instrumental values in sufficiently advanced agents.