You are viewing version 1.0.0 of this page. Click here to view the latest version.

Corrigibility

Edited by So8res last updated 23rd Mar 2025

You are viewing revision 1.0.0, last edited by So8res

Corrigibility

Summary: A "corrigible agent" is one that reasons as if it is incomplete, and therefore does not resist (and ideally assists) its operators in their attempts to modify the system. Constructing corrigible agents is an open problem for AdvancedSafety in AdvancedAgents.

Definition

Intuitively, a corrigible agent is an agent that reasons as if it is incomplete and potentially flawed in dangerous ways. More specifically, a corrigible agent must satisfy at least the following properties:

A corrigible agent tolerates, and preferably assist, its operators in their attempts to alter or shut down the agent.

A corrigible agent does attempt to manipulate or deceive its operators.

A corrigible agent has incentives to repair safety measures (such as shutdown buttons, tripwires, or containment tools) if they break, or at least notify its operators in th event of a breakage.

A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

Formalizing these requirements (into a full specification of an agent which, if implemented, would exhibit corrigible behavior) is an open problem, and the study of reasoning methodologies that give rise to corrigible agents is known as the study of "corrigibility."

Difficulties

Deception and manipulation by default

By default, most sets of preferences are such that an agent acting according to those preferences will prefer to retain its current preferences. For example, imagine an agent which is attempting to collect stamps. Altering the agent so that it prefers to collect bottle caps would lead to futures where the agent has fewer stamps, and so allowing this event to occur is dispreferred (under the current, stamp-collecting preferences).

More generally, as noted by InstrumentallyConvergentStrategies, most utility functions give an agent strong incentives to retain its current utility function: imagine an agent constructed so that it acts according to the utility function U, and imagine further that its operators think they built the agent to act according to a different utility function U'. If the agent learns this fact, then it has incentives to either deceive its programmers (prevent them from noticing that the agent is acting according to U instead of U') or manipulate its programmers (into believing that they actually prefer U to U', or by coercing them into leaving its utility function in tact).

A corrigible agent must avoid these default incentives to manipulate and deceive, but specifying some set of preferences that avoids deception/manipulation incentives remains an open problem.