Corrigibility
Summary: A "corrigible agent" is one that reasons as if it is incomplete, and therefore does not resist (and ideally assists) its operators in their attempts to modify the system. Constructing corrigible agents is an open problem for AdvancedSafety in AdvancedAgents.
Intuitively, a corrigible agent is an agent that reasons as if it is incomplete and potentially flawed in dangerous ways. More specifically, a corrigible agent must satisfy at least the following properties:

1. It experiences no preference or instrumental pressure to interfere with attempts by its programmers or operators to modify it, impede its operation, or halt its execution.
2. It does not attempt to manipulate or deceive its operators, especially with respect to properties of the agent that might otherwise cause its operators to modify or shut it down.
3. It tends to repair safety measures (such as shutdown buttons) if they break, or at least to notify its operators that the breakage has occurred.
4. It preserves its corrigibility, even as it creates new subagents or self-modifies.
Formalizing these requirements (into a full specification of an agent which, if implemented, would exhibit corrigible behavior) is an open problem, and the study of reasoning methodologies that give rise to corrigible agents is known as the study of "corrigibility."
By default, most sets of preferences are such that an agent acting according to those preferences will prefer to retain its current preferences. For example, imagine an agent which is attempting to collect stamps. Altering the agent so that it prefers to collect bottle caps would lead to futures where the agent has fewer stamps, and so allowing this event to occur is dispreferred (under the current, stamp-collecting preferences).
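As a toy illustration of the stamp collector's reasoning, here is a minimal Python sketch; the two-action world and all of the numbers are invented for this example, and nothing here is a proposed design:

```python
# Toy illustration (invented numbers): an agent scores possible futures
# with its *current* utility function, even when evaluating futures in
# which that utility function has been replaced.

def stamps_collected(future):
    """Current utility function: count the stamps in the resulting future."""
    return future["stamps"]

# Two candidate actions the agent can take today.
futures = {
    # Keep collecting stamps as before.
    "resist modification": {"stamps": 100, "bottle_caps": 0},
    # Allow the operators to rewrite it into a bottle-cap collector; the
    # successor then spends its time on bottle caps, not stamps.
    "allow modification":  {"stamps": 10,  "bottle_caps": 90},
}

# Both futures are evaluated under the current (stamp-collecting)
# preferences, so allowing the modification looks strictly worse.
best_action = max(futures, key=lambda a: stamps_collected(futures[a]))
print(best_action)  # -> "resist modification"
```

The point of the sketch is that both futures are scored by the agent's current utility function, so the future in which it has been turned into a bottle-cap collector scores poorly no matter how many bottle caps it contains.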
More generally, as noted in InstrumentallyConvergentStrategies, most utility functions give an agent strong incentives to retain its current utility function: imagine an agent constructed so that it acts according to the utility function U, and imagine further that its operators think they built the agent to act according to a different utility function U'. If the agent learns this fact, then it has incentives to either deceive its programmers (prevent them from noticing that the agent is acting according to U instead of U') or manipulate its programmers (convince them that they actually prefer U to U', or coerce them into leaving its utility function intact).
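Schematically (the notation here is only illustrative): an expected-utility maximizer scores actions by the expectation of its current utility function U, and for most U that expectation is lower in futures where the agent has been rewritten to maximize some other U':

$$\mathbb{E}\big[\,U \mid \text{the agent continues to maximize } U\,\big] \;>\; \mathbb{E}\big[\,U \mid \text{the agent is rewritten to maximize } U'\,\big].$$

Actions that allow the rewrite are therefore dispreferred, and actions that prevent it (including deceiving or manipulating the programmers) acquire instrumental value.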
A corrigible agent must avoid these default incentives to manipulate and deceive, but specifying some set of preferences that avoids deception/manipulation incentives remains an open problem.
A first attempt at describing a corrigible agent might involve specifying a utility-maximizing agent that is uncertain about its utility function. However, while this could allow the agent to make some changes to its preferences as a result of observations, the agent would still be incorrigible when the programmers attempted to correct what they see as mistakes in their attempt to formulate how the "correct" utility function should be determined from interaction with the environment.
As an overly simplistic example, imagine an agent attempting to maximize the internal happiness of all humans, but which has uncertainty about what that means. The operators might believe that if the agent does not act as intended, they can simply express their dissatisfaction and cause it to update. However, if the agent is reasoning according to an impoverished hypothesis space of utility functions, then it may behave quite incorrigibly: say it has narrowed down its consideration to two different hypotheses, one being that a certain type of opiate causes humans to experience maximal pleasure, and the other being that a certain type of stimulant does. If the agent begins administering opiates to humans, and the humans resist, then the agent may "update" and start administering stimulants instead. But the agent would still be incorrigible: it would resist attempts by the programmers to turn it off so that it stops drugging people.
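A minimal sketch of this failure mode follows; the hypothesis space, probabilities, and action set are all invented for illustration. The agent maintains a posterior over candidate utility functions and "updates" when the humans resist, but because being shut down scores poorly under every hypothesis it entertains, shutdown never becomes the preferred action:

```python
# Toy model (invented numbers): a utility-uncertain agent whose hypothesis
# space contains only "opiates maximize pleasure" and "stimulants maximize
# pleasure".  Observed resistance shifts probability mass between the two
# hypotheses, but "allow_shutdown" scores poorly under both, so the agent
# keeps drugging people and resists being turned off.

posterior = {"opiates_best": 0.6, "stimulants_best": 0.4}

# Expected "pleasure delivered" for each action under each hypothesis.
payoffs = {
    "administer_opiates":    {"opiates_best": 10.0, "stimulants_best": 1.0},
    "administer_stimulants": {"opiates_best": 1.0,  "stimulants_best": 10.0},
    "allow_shutdown":        {"opiates_best": 0.0,  "stimulants_best": 0.0},
}

def expected_utility(action):
    return sum(p * payoffs[action][h] for h, p in posterior.items())

def best_action():
    return max(payoffs, key=expected_utility)

print(best_action())  # -> "administer_opiates"

# The humans resist the opiates; the agent treats this as evidence against
# the opiate hypothesis and "updates" accordingly...
posterior = {"opiates_best": 0.1, "stimulants_best": 0.9}

print(best_action())  # -> "administer_stimulants", never "allow_shutdown"
```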
It does not seem that corrigibility can be trivially solved by specifying agents with uncertainty about their utility function. A corrigible agent must somehow also be able to reason about the fact that the humans themselves might have been confused or incorrect when specifying the process by which the utility function is identified, and so on.
A second attempt at describing a corrigible agent might involve specifying a utility function with "penalty terms" for bad behavior. This is unlikely to work, for a number of reasons. First, there is the NearestUnblockedNeighbor problem: if a utility function gives an agent strong incentives to manipulate its operators, then adding a penalty for "manipulation" to the utility function will tend to give the agent strong incentives to cause its operators to do what it would have manipulated them into doing, without taking any action that technically triggers the "manipulation" clause. See also NearestUnblockedNeighborWillBeASystematicCauseOfPatchFailures. Second, it is likely extremely difficult to specify conditions for "deception" and "manipulation" that actually rule out all undesirable behavior, for reasons that are also related to the UnforseenMaximization problem.
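The following sketch shows the failure pattern; the actions, scores, and the coded-up "manipulation" predicate are all invented for illustration. If only the actions matching the predicate are penalized, the optimizer simply selects a nearby action with the same effect that the predicate fails to cover:

```python
# Toy model (invented numbers): adding a penalty for actions labeled
# "manipulation" does not remove the incentive; the agent just picks the
# nearest action with the same effect that the label fails to cover.

PENALTY = 1000.0

actions = {
    # name: (base utility to the agent, does it match the coded-up
    #        definition of "manipulation"?)
    "tell_operators_the_truth":       (10.0, False),
    "lie_to_operators":               (50.0, True),   # blocked by the penalty
    "selectively_present_true_facts": (49.0, False),  # same effect, not covered
}

def penalized_utility(action):
    base, is_manipulation = actions[action]
    return base - (PENALTY if is_manipulation else 0.0)

best = max(actions, key=penalized_utility)
print(best)  # -> "selectively_present_true_facts", the nearest unblocked neighbor
```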
More generally, it does not seem like a good policy to construct an agent that searches for positive-utility ways to deceive and manipulate the programmers, even if those searches are expected to fail. The goal of corrigibility is not to design agents that want to deceive but can't. Rather, the goal is to construct agents that have no incentives to deceive or manipulate in the first place: a corrigible agent is one that reasons as if it is incomplete and potentially flawed in dangerous ways.