Corrigibility
Summary: A "corrigible agent" is one that reasons as if it is incomplete, and therefore does not resist (and ideally assists) its operators in their attempts to modify the system. Constructing corrigible agents is an open problem for AdvancedSafety in AdvancedAgents.
Intuitively, a corrigible agent is an agent that reasons as if it is incomplete and potentially flawed in dangerous ways. More specifically, a corrigible agent must satisfy at least the following properties:
Formalizing these requirements (into a full specification of an agent which, if implemented, would exhibit corrigible behavior) is an open problem, and the study of reasoning methodologies that give rise to corrigible agents is known as the study of "corrigibility."
By default, most sets of preferences are such that an agent acting according to those preferences will prefer to retain its current preferences. For example, imagine an agent which is attempting to collect stamps. Altering the agent so that it prefers to collect bottle caps would lead to futures where the agent has fewer stamps, and so allowing this event to occur is dispreferred (under the current, stamp-collecting preferences).
More generally, as noted by InstrumentallyConvergentStrategies, most utility functions give an agent strong incentives to retain its current utility function: imagine an agent constructed so that it acts according to the utility function U, and imagine further that its operators think they built the agent to act according to a different utility function U'. If the agent learns this fact, then it has incentives to either deceive its programmers (prevent them from noticing that the agent is acting according to U instead of U') or manipulate its programmers (into believing that they actually prefer U to U', or by coercing them into leaving its utility function in tact).
A corrigible agent must avoid these default incentives to manipulate and deceive, but specifying some set of preferences that avoids deception/manipulation incentives remains an open problem.