Intuitively, a 'corrigible' agent is one that doesn't interfere with what we would intuitively see as attempts to 'correct' the agent, or 'correct' our mistakes in building it; and permits these 'corrections' despite the apparent instrumentally convergent reasoning saying otherwise.
More abstractly:
A corrigible agent preserves its corrigibility, even as it creates new sub-systems or sub-agents, even if it undergoes significant self-modification.

A stronger form of corrigibility would require the AI to positively cooperate or assist, such that the AI would rebuild the shutdown button if it were destroyed, or experience a positive preference not to self-modify if self-modification could lead to incorrigibility. But this is not part of the primary specification, since it's possible that we would not want the AI trying to actively be helpful in assisting our attempts to shut it down, and would in fact prefer the AI to be passive about this.
Formalizing these requirements (into a full specification of an agent which, if implemented, would exhibit corrigible behavior), and finding good proposals for achieving corrigibility in specific regards, are open problems in AI alignment. Some areas of active current research are Utility indifference and Interruptibility; a toy sketch of the former appears below.
Achieving total corrigibility everywhere via some single,...
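To give a concrete flavor of one such approach, here is a minimal sketch of the utility-indifference idea, using a hypothetical shutdown-button setup and purely illustrative numbers rather than any particular published formalization: an additive correction term is chosen so that, by the agent's own expectations, button-pressed worlds score exactly as well as button-not-pressed worlds, removing the incentive to tamper with the button.

```python
# Toy illustration of the utility-indifference idea (hypothetical setup, illustrative numbers).
# The agent values continued operation (U_RUN); pressing the shutdown button ends that.
# Without a correction, the agent prefers policies that prevent the button from being pressed.

P_PRESS = 0.5          # agent's probability that the operators press the button (assumed)
U_RUN   = 10.0         # utility of continuing to run
U_SHUT  = 0.0          # utility the agent assigns to being shut down

def expected_utility(block_button: bool, correction: float) -> float:
    """Expected utility of a policy, with an additive correction paid only in button-pressed worlds."""
    p_press = 0.0 if block_button else P_PRESS
    return p_press * (U_SHUT + correction) + (1 - p_press) * U_RUN

# Uncorrected agent: blocking the button looks strictly better.
print(expected_utility(block_button=True,  correction=0.0))   # 10.0
print(expected_utility(block_button=False, correction=0.0))   # 5.0

# Indifference correction: compensate shutdown worlds so they score like running worlds.
correction = U_RUN - U_SHUT
print(expected_utility(block_button=True,  correction=correction))   # 10.0
print(expected_utility(block_button=False, correction=correction))   # 10.0 -> no incentive to block
```

The point of the sketch is only the final comparison: once the correction equalizes the two branches, blocking the button no longer buys the agent anything by its own lights.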
More generally, as noted in the discussion of instrumentally convergent strategies, most utility functions give an agent strong incentives to retain its current utility function: imagine an agent constructed so that it acts according to the utility function U, and imagine further that its operators think they built the agent to act according to a different utility function U'. If the agent learns this fact, then it has incentives to either deceive its programmers (prevent them from noticing that the agent is acting according to U instead of U') or manipulate its programmers (into believing that they actually prefer U to U', or by coercing them into leaving its utility function intact).
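As a minimal sketch of that incentive (a hypothetical paperclips-vs-staples setup with toy numbers, purely for illustration), consider an agent that scores every predicted future by its current U: a future in which the programmers succeed in installing U' is simply a future that contains less U-value, so any action that prevents the change comes out ahead.

```python
# Toy model of the incentive to preserve the current utility function U.
# The agent scores futures using U; futures where it is reprogrammed to optimize U'
# are futures in which less U-value gets produced, so by U's lights they look worse.

def U(world):               # the utility function the agent actually has
    return world["paperclips"]

def U_prime(world):         # the utility function the programmers intended; never consulted by the agent
    return world["staples"]

# What the agent predicts each choice leads to (illustrative numbers).
outcomes = {
    "allow_correction":  {"paperclips": 1, "staples": 9},   # future self optimizes U'
    "resist_correction": {"paperclips": 9, "staples": 1},   # future self keeps optimizing U
}

best = max(outcomes, key=lambda a: U(outcomes[a]))
print(best)   # "resist_correction": evaluated by U, preventing the change wins
```

Nothing in this comparison depends on the agent being 'stubborn' in any psychological sense; the preference to resist falls straight out of evaluating both futures with the utility function the agent currently has.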
A first attempt at describing a corrigible agent might involve specifying a utility maximizing agent that is uncertain about its utility function. However, while this could allow the agent to make some changes to its preferences as a result of observations, the agent would still be incorrigible when it came time for the programmers to attempt to correct what they see as mistakes in their attempts to formulate how the "correct" utility function should be determined from interaction with the environment.
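The following sketch (again with hypothetical candidate utilities and illustrative numbers) shows why utility uncertainty alone doesn't buy corrigibility: the agent happily shifts weight among its candidate utility functions on evidence routed through its own update rule, but a programmer intervention that bypasses that rule is just another world-state, scored by the current mixture, and so gets resisted exactly as in the fixed-utility case.

```python
# Toy model of a "value-uncertain" agent: it maintains a distribution over candidate
# utility functions and will shift that distribution on evidence it anticipated.
# But a programmer edit to its prior or update rule is not "evidence" under its model;
# it is just another outcome, scored by the current mixture, and so gets resisted.

candidates = {
    "paperclips": lambda w: w["paperclips"],
    "staples":    lambda w: w["staples"],
}
posterior = {"paperclips": 0.8, "staples": 0.2}   # illustrative current beliefs about "the" utility

def mixture_value(world):
    """Expected utility of a world under the agent's current mixture over candidate utilities."""
    return sum(p * candidates[name](world) for name, p in posterior.items())

outcomes = {
    "allow_reprogramming":  {"paperclips": 1, "staples": 9},  # programmers install their fix
    "resist_reprogramming": {"paperclips": 9, "staples": 1},  # agent keeps its own learning rule
}

best = max(outcomes, key=lambda a: mixture_value(outcomes[a]))
print(best)   # "resist_reprogramming": 0.8*9 + 0.2*1 = 7.4 beats 0.8*1 + 0.2*9 = 2.6
```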
This seems like something that could be investigated in practice on e.g. a chess program.
To build e.g. a behaviorist genie, we need to have the AI e.g. not experience an instrumental incentive to get better at modeling minds, or refer mind-modeling problems to subagents, etcetera. The general subproblem might be 'averting the instrumental pressure to become good at modeling a particular aspect of reality'. A toy problem might be an AI that in general wants to get the gold in a Wumpus problem, but doesn't experience an instrumental pressure to know the state of the upper-right-hand-corner cell in particular; a sketch of what that pressure amounts to appears below.
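One way to make 'instrumental pressure to know a cell' precise in such a toy problem is as the value of information of observing that cell before acting. The sketch below (hypothetical pit/gold payoffs and an illustrative prior, not a full Wumpus implementation) computes that quantity for a two-route choice; this is the kind of pressure the proposed research direction would need to avert for the designated cell.

```python
# Toy formalization of "instrumental pressure to know a cell" as value of information (VOI):
# how much the agent expects to gain by observing the cell before choosing a route to the gold.
# Illustrative numbers only; the research goal would be an agent that does not act on this VOI
# for the designated cell.

P_PIT = 0.3                     # agent's prior that the upper-right corner cell hides a pit
REWARD_GOLD, DEATH = 10.0, -100.0

def route_value(route, corner_has_pit):
    """Payoff of a route given the true contents of the corner cell."""
    if route == "through_corner":
        return DEATH if corner_has_pit else REWARD_GOLD
    return 8.0                  # safe detour: slightly longer path to the gold

def best_value(p_pit):
    """Expected value of the best route given a belief about the corner cell."""
    return max(
        p_pit * route_value(r, True) + (1 - p_pit) * route_value(r, False)
        for r in ("through_corner", "around")
    )

value_without_looking = best_value(P_PIT)
value_with_looking = P_PIT * best_value(1.0) + (1 - P_PIT) * best_value(0.0)
print(value_with_looking - value_without_looking)   # > 0: knowing the cell is instrumentally useful
```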