Introducing Corrigibility (an FAI research subfield)

So8res

53 Introducing Corrigibility (an FAI research subfield)

20th Oct 2014

4 min read

53

Benja, Eliezer, and I have published a new technical report, in collaboration with Stuart Armstrong of the Future of Humanity institute. This paper introduces Corrigibility, a subfield of Friendly AI research. The abstract is reproduced below:

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system "corrigible" if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

We're excited to publish a paper on corrigibility, as it promises to be an important part of the FAI problem. This is true even without making strong assumptions about the possibility of an intelligence explosion. Here's an excerpt from the introduction:

As AI systems grow more intelligent and autonomous, it becomes increasingly important that they pursue the intended goals. As these goals grow more and more complex, it becomes increasingly unlikely that programmers would be able to specify them perfectly on the first try.

Contemporary AI systems are correctable in the sense that when a bug is discovered, one can simply stop the system and modify it arbitrarily; but once artificially intelligent systems reach and surpass human general intelligence, an AI system that is not behaving as intended might also have the ability to intervene against attempts to "pull the plug".

Indeed, by default, a system constructed with what its programmers regard as erroneous goals would have an incentive to resist being corrected: general analysis of rational agents.¹ has suggested that almost all such agents are instrumentally motivated to preserve their preferences, and hence to resist attempts to modify them [3, 8]. Consider an agent maximizing the expectation of some utility function U. In most cases, the agent's current utility function U is better fulfilled if the agent continues to attempt to maximize U in the future, and so the agent is incentivized to preserve its own U-maximizing behavior. In Stephen Omohundro's terms, "goal-content integrity'' is an instrumentally convergent goal of almost all intelligent agents [6].

This holds true even if an artificial agent's programmers intended to give the agent different goals, and even if the agent is sufficiently intelligent to realize that its programmers intended to give it different goals. If a U-maximizing agent learns that its programmers intended it to maximize some other goal U*, then by default this agent has incentives to prevent its programmers from changing its utility function to U* (as this change is rated poorly according to U). This could result in agents with incentives to manipulate or deceive their programmers.²

As AI systems' capabilities expand (and they gain access to strategic options that their programmers never considered), it becomes more and more difficult to specify their goals in a way that avoids unforeseen solutions—outcomes that technically meet the letter of the programmers' goal specification, while violating the intended spirit.³Simple examples of unforeseen solutions are familiar from contemporary AI systems: e.g., Bird and Layzell [2] used genetic algorithms to evolve a design for an oscillator, and found that one of the solutions involved repurposing the printed circuit board tracks on the system's motherboard as a radio, to pick up oscillating signals generated by nearby personal computers. Generally intelligent agents would be far more capable of finding unforeseen solutions, and since these solutions might be easier to implement than the intended outcomes, they would have every incentive to do so. Furthermore, sufficiently capable systems (especially systems that have created subsystems or undergone significant self-modification) may be very difficult to correct without their cooperation.

In this paper, we ask whether it is possible to construct a powerful artificially intelligent system which has no incentive to resist attempts to correct bugs in its goal system, and, ideally, is incentivized to aid its programmers in correcting such bugs. While autonomous systems reaching or surpassing human general intelligence do not yet exist (and may not exist for some time), it seems important to develop an understanding of methods of reasoning that allow for correction before developing systems that are able to resist or deceive their programmers. We refer to reasoning of this type as corrigible.

¹_{Von Neumann-Morgenstern rational agents [7], that is, agents which attempt to maximize expected utility according to some utility function)}

²_{In particularly egregious cases, this deception could lead an agent to maximize U* only until it is powerful enough to avoid correction by its programmers, at which point it may begin maximizing U. Bostrom [4] refers to this as a "treacherous turn''.}

³_{Bostrom [4] calls this sort of unforeseen solution a "perverse instantiation''.}

(See the paper for references.)

This paper includes a description of Stuart Armstrong's utility indifference technique previously discussed on LessWrong, and a discussion of some potential concerns. Many open questions remain even in our small toy scenario, and many more stand between us and a formal description of what it even means for a system to exhibit corrigible behavior.

Before we build generally intelligent systems, we will require some understanding of what it takes to be confident that the system will cooperate with its programmers in addressing aspects of the system that they see as flaws, rather than resisting their efforts or attempting to hide the fact that problems exist. We will all be safer with a formal basis for understanding the desired sort of reasoning.

As demonstrated in this paper, we are still encountering tensions and complexities in formally specifying the desired behaviors and algorithms that will compactly yield them. The field of corrigibility remains wide open, ripe for study, and crucial in the development of safe artificial generally intelligent systems.

Academic PapersCorrigibilityAI

Frontpage

53

New Comment

Rendering 0/28 comments, sorted by

top scoring

(show more) Click to highlight new comments since: Today at 5:39 AM

Moderation Log

53 Introducing Corrigibility (an FAI research subfield)

by So8res

20th Oct 2014

4 min read

53

As artificially intelligent systems grow in intelligence and capability, some of their available options may allow them to resist intervention by their programmers. We call an AI system "corrigible" if it cooperates with what its creators regard as a corrective intervention, despite default incentives for rational agents to resist attempts to shut them down or modify their preferences. We introduce the notion of corrigibility and analyze utility functions that attempt to make an agent shut down safely if a shutdown button is pressed, while avoiding incentives to prevent the button from being pressed or cause the button to be pressed, and while ensuring propagation of the shutdown behavior as it creates new subsystems or self-modifies. While some proposals are interesting, none have yet been demonstrated to satisfy all of our intuitive desiderata, leaving this simple problem in corrigibility wide-open.

As AI systems grow more intelligent and autonomous, it becomes increasingly important that they pursue the intended goals. As these goals grow more and more complex, it becomes increasingly unlikely that programmers would be able to specify them perfectly on the first try.

Contemporary AI systems are correctable in the sense that when a bug is discovered, one can simply stop the system and modify it arbitrarily; but once artificially intelligent systems reach and surpass human general intelligence, an AI system that is not behaving as intended might also have the ability to intervene against attempts to "pull the plug".

Indeed, by default, a system constructed with what its programmers regard as erroneous goals would have an incentive to resist being corrected: general analysis of rational agents.¹ has suggested that almost all such agents are instrumentally motivated to preserve their preferences, and hence to resist attempts to modify them [3, 8]. Consider an agent maximizing the expectation of some utility function U. In most cases, the agent's current utility function U is better fulfilled if the agent continues to attempt to maximize U in the future, and so the agent is incentivized to preserve its own U-maximizing behavior. In Stephen Omohundro's terms, "goal-content integrity'' is an instrumentally convergent goal of almost all intelligent agents [6].

This holds true even if an artificial agent's programmers intended to give the agent different goals, and even if the agent is sufficiently intelligent to realize that its programmers intended to give it different goals. If a U-maximizing agent learns that its programmers intended it to maximize some other goal U*, then by default this agent has incentives to prevent its programmers from changing its utility function to U* (as this change is rated poorly according to U). This could result in agents with incentives to manipulate or deceive their programmers.²

As AI systems' capabilities expand (and they gain access to strategic options that their programmers never considered), it becomes more and more difficult to specify their goals in a way that avoids unforeseen solutions—outcomes that technically meet the letter of the programmers' goal specification, while violating the intended spirit.³Simple examples of unforeseen solutions are familiar from contemporary AI systems: e.g., Bird and Layzell [2] used genetic algorithms to evolve a design for an oscillator, and found that one of the solutions involved repurposing the printed circuit board tracks on the system's motherboard as a radio, to pick up oscillating signals generated by nearby personal computers. Generally intelligent agents would be far more capable of finding unforeseen solutions, and since these solutions might be easier to implement than the intended outcomes, they would have every incentive to do so. Furthermore, sufficiently capable systems (especially systems that have created subsystems or undergone significant self-modification) may be very difficult to correct without their cooperation.

In this paper, we ask whether it is possible to construct a powerful artificially intelligent system which has no incentive to resist attempts to correct bugs in its goal system, and, ideally, is incentivized to aid its programmers in correcting such bugs. While autonomous systems reaching or surpassing human general intelligence do not yet exist (and may not exist for some time), it seems important to develop an understanding of methods of reasoning that allow for correction before developing systems that are able to resist or deceive their programmers. We refer to reasoning of this type as corrigible.

¹_{Von Neumann-Morgenstern rational agents [7], that is, agents which attempt to maximize expected utility according to some utility function)}

²_{In particularly egregious cases, this deception could lead an agent to maximize U* only until it is powerful enough to avoid correction by its programmers, at which point it may begin maximizing U. Bostrom [4] refers to this as a "treacherous turn''.}

³_{Bostrom [4] calls this sort of unforeseen solution a "perverse instantiation''.}

(See the paper for references.)

Before we build generally intelligent systems, we will require some understanding of what it takes to be confident that the system will cooperate with its programmers in addressing aspects of the system that they see as flaws, rather than resisting their efforts or attempting to hide the fact that problems exist. We will all be safer with a formal basis for understanding the desired sort of reasoning.

As demonstrated in this paper, we are still encountering tensions and complexities in formally specifying the desired behaviors and algorithms that will compactly yield them. The field of corrigibility remains wide open, ripe for study, and crucial in the development of safe artificial generally intelligent systems.

Academic PapersCorrigibilityAI

Frontpage

53

Mentioned in

21MIRI's technical agenda: an annotated bibliography, and other updates

New Comment

Rendering 0/28 comments, sorted by

top scoring

(show more) Click to highlight new comments since: Today at 5:39 AM

Moderation Log

More from So8res

Curated and popular this week

28Comments

Comment Permalink

So8res12y50

Thanks, and nice work!

Thus the utility of (a1, o) for o in Press should be equivalent to the utility of the same (a1, o) under the counterfactual assumption that o is not in Press, and vice versa

Yeah, this is pretty key. You need it to optimize for both cases as if the probability of the button being pressed is fixed and independent of whether the programmers actually press the button. We can achieve this via a causal intervention on whether or not the button is pressed, and then clean up your U a bit by redefining it as follows:

U(a1, o, a2) :=
{  UN(a1, o, a2) + E[US|do(O in Press)] if o not in Press
;  US(a1, o, a2) + E[UN|do(O not in Press)] else }

(Choosing how to compare UN values to US values makes the choice of priors redundant. If you want the priors to be 2:1 in favor of US then you could also have just doubled US in the first place instead; the degree of freedom in the prior is the same as the degree of freedom in the relative scaling. See also Loudness Priors, a technical report from the last workshop.)

This method does seem to fulfill all the desiderata in the paper, although we're not too confident in it yet (it took us a little while to notice the "managing the news" problem in the first version, and it seems pretty likely that this too will have undesirable properties lurking somewhere). I'm fairly pleased with this solution, though, and a little miffed -- we found something similar to this a little while back (our research outstrips our writing speed, unfortunately) and now you've gone and ruined the surprise! :-)

(In seriousness, though, nice work. Next question is, can we pick any holes in it?)

Stuart_Armstrong11y20

What do we mean by "o in Press"? That whoever wrote the list defining "Press" made a different decision that day?

Maybe we can have the definition of "Press" as being a specific list (rather than being the reality described by that list), and the counterfactual being that a certain stochastic process erases a particular part of that list? Not sure about that... Need some sort of abstract version of this (might be doable).

This still seems vulnerable to Benja's blackmail. Let me explain:

Assume that a2 is irrelevant, that o=Pr, an... (read more)

7lackofcheese12y

That's definitely a more elegant presentation. I'm not too surprised to hear you had already discovered this idea, since I'm familiar with the gap between research and writing speed. As someone who is not involved with MIRI, consideration of some FAI-related problems is at least somewhat disincentivized by the likelihood that MIRI already has an answer. As for flaws, I'll list what I can think of. First of all, there are of course some obvious design difficulties, including the difficulty of designing US in the first place, and the difficulty of choosing the appropriate way of scaling US, but those seem to be resolvable. One point that occurs to me under the assumptions of the toy model is that decisions involving larger differences in values of UN are at the same time more dangerous and more likely to outweigh the agent's valuation of its future corrigibility. Moreover, simply increasing the scaling of US to compensate would cause US to significantly outweigh UN in the context of smaller decisions. An example would be that the AI decides it's crucial to take over the world in order to "save" it, so it starts building an army of subagents to do it, and it decides that building corrigibility into those subagents is not worth the associated risk of failure. However, it appears that this problem can still be solved by designing US correctly in the first place; a well-designed US should clearly assign greater negative weighting to larger-scale corrigibility failures than to smaller scale ones. There's two other questions that I can see that relate to scaling up the toy model. 1. How does this model extend past the three-timestep toy scenario? 2. Does the model remain stable under assumptions of bounded computational power? In more complex scenarios there are obvious questions of "tiling", but I think there is a more basic issue to answer that applies even in the three-timestep case. That is, if the agent will not be able to calculate the counterfactual utility

See in context