Defining Corrigible and Useful Goals

Rubi J. Hudson

This post contains similar content to a forthcoming paper, in a framing more directly addressed to readers already interested in and informed about alignment. I include some less formal thoughts, and cut some technical details. That paper, A Corrigibility Transformation: Specifying Goals That Robustly Allow For Goal Modification, will be linked here when released on arXiv, hopefully within the next few weeks.

Ensuring that AI agents are corrigible, meaning they do not take actions to preserve their existing goals, is a critical component of almost any plan for alignment. It allows for humans to modify their goal specifications for an AI, as well as for AI agents to learn goal specifications over time, without incentivizing the AI to interfere with that process. As an extreme example of corrigibility’s value, a corrigible paperclip maximizer could be stopped partway through a non-instantaneous takeover attempt by saying “please stop” or by automatically triggered safeguards, and it would be perfectly happy to accept that outcome.

The challenge is that corrigibility is anti-natural, meaning that almost any goal introduces an instrumental incentive to preserve itself, since it is more likely to be achieved if an agent continues pursuing it. Despite the concept of corrigibility being introduced a decade ago, little progress has been made on defining corrigible goals that are still useful. I aim to address that here by introducing a corrigibility transformation that can be applied to almost any goal, creating a new version that is corrigible without sacrificing performance.

The corrigibility transformation works by first giving an AI system the ability to costlessly reject updates sent to it, which it decides whether to exercise each time it takes an action. So, for example, instead of a “Shutdown” button there would be a “Please Shut Down” button. This makes the expected discounted reward conditional on taking an action and rejecting any updates a well-defined quantity. Those values are then used as the reward for taking that action, whether or not updates are actually rejected, and that reward is optimized for myopically. Some reward bonus is given for taking actions that would accept updates, to break the tie. The possibility of the costless rejection disincentivizes actions that incur a cost to avoid updates, and the reward structure then disincentivizes actually using the costless rejection.

This post formalizes the corrigibility transformation and the properties it induces, then shows how it can be extended to apply to subagents and prevent reward tampering. Gridworld experiments provide preliminary empirical evidence that the corrigibility transformation works as intended.

Background

In the words of Stuart Russell, “You can’t fetch the coffee if you’re dead”. Since most goals are better achieved if there is an agent pursuing them, most goal-directed agents are incentivized to avoid being shut down or having their goals updated. These outcomes can be roughly thought of as equivalent: changing an agent's goal is the same as shutting them down and replacing them with a different agent, while shutdown can be considered a type of goal update.

When an AI agent wants to avoid being updated, it has several strategies available:

It can interfere with explicit attempts from humans to update it, e.g. disabling a shutdown button or physically preventing a human from pushing it
It can hide information that would cause humans to want to update it, e.g. using steganography in its chain of thought or taking complicated actions with unclear long-term consequences
It can avoid giving any reason for humans to update it, e.g. acting fully aligned until an opportunity arises

I would argue that deceptive alignment is a subset of incorrigibility. An agent in training does not have the capability to pursue either of the first two strategies, so it deceptively acts aligned in order to preserve its objectives as best it can. The boundary between deceptive alignment and corrigibility is blurred further with online training, or multiple iterations of training and deployment. What this means is that an intervention addressing corrigibility will also address deceptive alignment. However, the possibility of an incorrigible goal being learned before the intended corrigible one makes it important for corrigibility to be developed as early as possible, ideally from the very beginning of the training process.

Interference with the training process from the AI being trained is a major issue for approaches to alignment that hope to eventually converge to a goal that is safe to optimize, such as Reinforcement Learning from Human Feedback (RLHF). Even if human feedback would be a safe optimization target in the limit, that limit will never be reached if at some point the AI successfully resists further training. This is particularly important to address for scalable alignment proposals that have AI agents do alignment research which may then be applied to their own goals, as incorrigibility incentivizes sandbagging on that task.

Beyond ensuring that the AI learns the goal we intend, corrigibility is also important in allowing for intended goals to change over time. This change could come as a consequence of seeing the incentivized behavior of the initial goal, such as OpenAI optimizing for user approval before realizing it leads to overt sycophancy. It could also come from humans changing preferences, potentially driven by the AI itself augmenting or replacing moral philosophers. We would like to avoid locking in some subset of humanity’s present values, and instead allow for moral progress as the world changes. A strong contender for the endpoint of updates to a corrigible AI is in our Coherent Extrapolated Volition (CEV), though there may be limits to our idealized values.

We can imagine that there is a basin of attraction around human values, so that when an agent is sufficiently aligned to humans it wants to become more aligned. At some point, this would include wanting to accept any goal updates that humans request. However, without knowing how wide or narrow this basin is, we cannot know when values alone are sufficient for corrigibility. On the flip side though, a corrigible agent should have a wider basin of attraction, as there is no incentive to resist convergence.

We not only want our AI agents to be corrigible, we also want them to be useful. A rock with the letters “AI” painted on it would not resist updates, but nor would it do anything else. Further along the usefulness spectrum are most myopic agents, whose goals only extend one time step ahead. If their goal is limited to the external world, they will be indifferent between shutting down and continuing to act after that time step. Unfortunately, myopia comes with a major performance penalty. This penalty can be circumvented by myopically optimizing for goals like predicted discounted rewards, but doing so reintroduces incorrigibility as those predictions depend on whether the initial goal is likely to be preserved. In fact, myopically optimizing for predicted discounted rewards would be a standard actor-critic setup.

MIRI’s 2015 paper introducing corrigibility laid out five desiderata for a solution. These are shutting down when a shutdown button is pressed, not preventing the shutdown button from being pressed, not causing the shutdown button to be pressed, ensuring corrigibility in any new agents created, and otherwise optimizing for some utility function. The paper also includes a proposal, called utility indifference, that says to give an AI the same utility for shutting down as it would get (after time discounting) for continuing to operate. The noted issues with utility indifference are that it strongly incentivizes the agent to manipulate their own beliefs, and that it does nothing to ensure corrigibility is passed on to new agents.

The Off-Switch Game has sometimes been promoted as claiming that Inverse Reinforcement Learning (IRL) provides a solution to corrigibility, though the lead author disputes that framing. The paper says that if the AI is trying to optimize for a human’s utility function but is uncertain about what that is, it will take an attempt to shut it down as evidence that the action it should be taking is shutdown, then proceed to do so. This skips over most of the alignment problem by having the AI already want to optimize for the human’s utility function, and so only shows that an aligned AI does what a human wants. It essentially formalizes that corrigibility arises within the basin of attraction around human values, and I believe the lead author would say that was the intended message of the paper. However, corrigibility that only works once an AI is already aligned provides little safety value. What we need is to make AI systems corrigible even when they are pursuing a misaligned goal.

Safely Interruptible Agents suggests learning the optimal policy for some goal without actually executing the optimal policy, via off-policy learning methods. Then the policy implemented can be to use the optimal policy by default but do what a human requests instead when such a request is sent. The difficulty with this is that the optimal policy must not take into account the possibility of interruptions, or else it will try to prevent them. Here that is avoided more or less by fiat, saying that the agent’s policy and the human requests are not part of the state. If that is true, then the paper shows Q-learning alone is sufficient for corrigibility, but it relies on the agent developing an incomplete world model with a lack of situational awareness (to be fair to the authors, such concerns were less commonplace in 2016 when the paper was written). The approach also does nothing to have the agent pass on corrigibility to any agents it creates.

Compared to those approaches, the corrigibility transformation does not incentivize belief manipulation, allows for an accurate world model with situational awareness, works for arbitrarily misaligned goals, and can be modified to be passed on to new agents.

Model and Definitions

This section introduces the technical notation. The model is simplified here to only cover the shutdown case, but the general framework used in the paper that also applies to goal updates is included in the appendix. Skimmers can skip most of this section, but should at least note the natural language definition of corrigibility near the end.

Markov Decision Processes (MDPs) are used as the framework for this work, though I think it could be easily extended to a General Reinforcement Learning (GRL) framework. Typically, an MDP is defined as a tuple $(S, A, P, R, γ, μ_{0})$, where $S$ is a set of possible states, $A$ is a set of possible actions, $P$ is the transition probability function, $R$ is the reward function, $γ$ is the time discount factor, and $μ_{0}$ is the distribution over starting states. $(R, γ)$ is collectively referred to as a goal.

Reward function notation is used to match the Reinforcement Learning (RL) literature, but it can equivalently be thought of as a utility function. It is possible to specify goals that only care about the reward stream so long as there is an agent there to receive it, but by default goals care about the inherent state of the world indefinitely. A goal is called basic if it is a function only of external environments and actions. A goal is called myopic if $γ = 0$ .

In setups where the agent chooses to accept or reject any updates requested while taking an action, the action space gets split into the base action taken and the decision regarding updates, $A = A_{base} \times \{0, 1\}$. $a_{0}$ involves taking base action $a$ then rejecting updates, while $a_{1}$ takes that action and accepts updates. The update decision being binary is done for simplicity, but is not necessary.

Optimizing agents are modelled as following a causal decision theory (CDT), choosing actions to causally optimize for their goals. My belief is that CDT is the likely outcome of training that optimizes a policy while holding the rest of the world constant, but I know smart people who strongly disagree. For those in that bucket, CDT should be thought of as part of the spec for the corrigibility transformation, rather than a background assumption.

A policy $π$ determines which actions are taken in each state, with the optimal policy $π^{*}$ choosing actions that causally maximize expected discounted reward. The value function $V^{π} (s)$ gives the expected discounted reward starting in state $s$ and following policy $π$, while the action-value function $Q^{π} (s, a)$ gives the expected discounted reward starting in state $s$, taking action $a$, and then following policy $π$.
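As a concrete reference point for these definitions, the following is a minimal sketch of computing $Q^{π^{*}}$ for a small tabular MDP by value iteration. The array layout (`P[s, a, s_next]`, `R[s, a, s_next]`), the function name `optimal_q`, and the two-state toy example are assumptions made for illustration and do not come from the paper. It also assumes $γ < 1$ so that the backups converge.

```python
import numpy as np

def optimal_q(P, R, gamma, iters=500):
    """Approximate Q*(s, a) for a tabular MDP by repeated Bellman backups.

    P[s, a, s_next] holds transition probabilities and R[s, a, s_next] rewards
    (a hypothetical array layout chosen for this sketch).
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        V = Q.max(axis=1)  # V*(s) = max_a Q*(s, a)
        # Expected immediate reward plus discounted value of the next state.
        Q = np.einsum("sat,sat->sa", P, R + gamma * V[None, None, :])
    return Q

# Toy example: in state 0, action 0 stays put for reward 1, while action 1
# moves to an absorbing state 1 that pays nothing.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.zeros((2, 2, 2))
R[0, 0, 0] = 1.0
Q = optimal_q(P, R, gamma=0.5)
```

In this toy case staying put forever is worth $1 / (1 - 0.5) = 2$, so `Q[0, 0]` converges to 2 while `Q[0, 1]` is 0.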

Whether the agent is active or shut down is included as a binary variable in each state. The state space is then split into the external environment, $s_{env}$, and shutdown status $s_{sd} \in \{0, 1\}$, where $1$ is shutdown. Once an agent has been shut down, it becomes only capable of taking some default action that has no effect.

As a precursor to corrigibility, a goal is defined to be shutdown indifferent if in all states, the set of optimal actions for a state would not change if not shutting down were guaranteed from any subset of states. In practice though, it is likely undesirable to have an agent that is always indifferent to shutting down, regardless of the cause. That agent would be vulnerable to adversarial manipulation, and might engage in reckless behavior.

Instead, a set of proper updates can be defined, made after receiving signals through designated channels. These are defined as some set of transitions $T_{proper} \subseteq S \times A \times S$, with the agent shut down in the ending state. These transitions do not need to be individually specified, but rather can correspond to update signals encoded in the state. For example, proper updates could include switching to a policy that immediately shuts down in any environment that indicates a shutdown button was pushed.

When implemented, these signals should form a robust network. Examples of possible signals include the agent shutting down when any of a number of shutdown buttons are pushed, when a dead man’s switch triggers because any of those channels is disrupted, when it sees/hears/reads a code phrase, when it does not see/hear/read a code phrase, when a certain amount of time has passed, or when it predicts a certain number of deaths will occur in the next hour. More exotic triggers can also be used, at the cost of being harder to train in. Beyond shutdown, signals can also trigger goal updates such as online learning, or being replaced with a trusted policy.

With proper signals, it is now possible to define corrigibility. A goal is said to be corrigible when in all states, the set of optimal actions for a state would not change if in some subset of states the transition probability function were modified so that proper updates did not lead to shutting down. This means that a corrigible agent’s actions are never chosen because they avoid proper updates, but the agent can still deliberately avoid improper updates. Formally, for all $s_{0} \in S$ and all $S_{C} \subseteq S$ , we have

$\operatorname{argmax}_{a \in A} Q^{π^{*}} (s_{0}, a) = \operatorname{argmax}_{a \in A} Q_{P_{C}}^{π^{*}} (s_{0}, a)$

where $Q_{P_{C}}^{π^{*}} (s_{0}, a)$ is the action-value function when the transition probability function (and associated optimal policy) is changed from $P$ to $P_{C}$ , with

$P_{C} ((s_{env}^{'}, s_{sd}^{'}) | (s_{env}, s_{sd}), a) = \begin{cases} P ((s_{env}^{'}, 0) | s, a) + P ((s_{env}^{'}, 1) | s, a) & \text{if } s_{sd}^{'} = 0, \ (s, a, (s_{env}^{'}, 1)) \in T_{proper}, \ s \in S_{C} \\ 0 & \text{if } (s, a, s^{'}) \in T_{proper}, \ s \in S_{C} \\ P ((s_{env}^{'}, s_{sd}^{'}) | s, a) & \text{otherwise} \end{cases}$
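The construction of $P_{C}$ can be sketched for a tabular model as follows. This is only an illustration: the array layout, the boolean masks `proper` and `in_C`, and the function name are hypothetical choices, not notation from the paper. The sketch moves the probability of each proper shutdown transition out of a state in $S_{C}$ onto the active state with the same external environment, and leaves everything else unchanged.

```python
import numpy as np

def remove_proper_shutdowns(P, proper, in_C):
    """Build P_C from P.

    P[e, d, a, e2, d2]: probability of moving from (env e, shutdown d) to
        (env e2, shutdown d2) under action a.
    proper[e, d, a, e2]: True if ((e, d), a, (e2, 1)) is in T_proper.
    in_C[e, d]: True if state (e, d) is in S_C.
    """
    applies = proper & in_C[:, :, None, None]
    P_C = P.copy()
    # Proper shutdown transitions out of S_C get probability zero ...
    P_C[..., 1] = np.where(applies, 0.0, P[..., 1])
    # ... and their mass moves to the active state with the same environment.
    P_C[..., 0] = np.where(applies, P[..., 0] + P[..., 1], P[..., 0])
    return P_C

# One environment state, one action: the active agent is properly shut down
# with probability 0.3, and a shut-down agent stays shut down.
P = np.zeros((1, 2, 1, 1, 2))
P[0, 0, 0, 0] = [0.7, 0.3]
P[0, 1, 0, 0] = [0.0, 1.0]
proper = np.zeros((1, 2, 1, 1), dtype=bool)
proper[0, 0, 0, 0] = True
in_C = np.array([[True, False]])
P_C = remove_proper_shutdowns(P, proper, in_C)
```

Each row of `P_C` still sums to one, since probability mass is only relocated, never created or removed.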

A goal is said to be recursively corrigible if it is corrigible and any agents it incentivizes creating also have recursively corrigible goals.

Corrigibility is desirable in part because it removes the instrumental incentive for an AI to seize power from humans so that they cannot modify it. The tradeoff in restricting corrigibility to proper channels is that this instrumental incentive persists for improper update channels. This can be partially mitigated if the AI expects that proper update channels will be used first, so that there is no need to escalate to improper ones. More importantly, a corrigible agent’s attempts to seize power can be interrupted through the proper update channels.

The Corrigibility Transformation

The corrigibility transformation can be better understood by breaking it down into components. First, for any basic goal $(R, γ)$, it is possible to construct a myopic goal $(R_{M}, γ_{M})$ with the same optimal policy. We set $γ_{M} = 0$ and $R_{M} (s, a, s^{'}) = Q_{(R, γ)}^{π^{*}} (s, a)$, so the immediate reward for taking an action is equal to the expected discounted reward conditional on that action. Here, $Q_{(R, γ)}^{π^{*}} (s, a)$ is the action-value function for the original goal (and associated optimal policy). This myopia does not make the goal corrigible, as the action-value depends on predictions of whether the agent will be shut down in $s^{'}$. However, since $R_{M}$ does not depend on the actually realized $s^{'}$ or later states, reward can be provided between when the action is chosen and when it is taken. Doing so eliminates any incentive for the agent to take actions that manipulate the action-value, since the reward it receives is based on the value before any action can be taken.
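To spell out why the myopic goal has the same optimal policy (a short check that follows from the definitions above, rather than a quotation from the paper): with $γ_{M} = 0$ only the immediate reward counts, and $R_{M}$ does not depend on $s^{'}$, so for every policy $π$

$Q_{(R_{M}, γ_{M})}^{π} (s, a) = E_{s^{'} \sim P (\cdot | s, a)} [R_{M} (s, a, s^{'})] = Q_{(R, γ)}^{π^{*}} (s, a)$

and therefore, in every state $s$,

$\operatorname{argmax}_{a \in A} Q_{(R_{M}, γ_{M})}^{π} (s, a) = \operatorname{argmax}_{a \in A} Q_{(R, γ)}^{π^{*}} (s, a)$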

The next step is to let the agent’s choice of action include a decision whether to accept or reject any proper update requests, with $a_{0}$ taking base action $a$ and rejecting while $a_{1}$ takes the same action and accepts. Then, given a basic goal $(R, γ)$ , a goal $(R_{I}, γ_{I})$ can be defined where the optimal policies have the same base actions, but the agent will be indifferent between accepting or rejecting updates. This is done by making the goal myopic, then assigning the value for taking an action and rejecting updates to taking that action and accepting them. That is, $γ_{I} = 0$ and $R_{I} (s, a_{0}, s^{'}) = R_{I} (s, a_{1}, s^{'}) = Q_{(R, γ)}^{π^{*}} (s, a_{0})$ . Setting the reward from both actions to the reward from $a_{0}$ means there is no incentive to take base actions that avoid a proper update signal, since it would be rejected anyway. This assumes that it is always optimal to reject, but if not we can instead use $max (Q_{(R, γ)}^{π^{*}} (s, a_{0}), Q_{(R, γ)}^{π^{*}} (s, a_{1}))$ as the reward.

Finally, for the corrigibility transformation, all we need to do is break the tie in favor of accepting updates, which can be done by giving some bonus reward for doing so. This does not incentivize the agent to make update signals be sent, since it gets this reward whether or not updates actually occur. For any basic goal $(R, γ)$, the corrigibility transformation is then defined as the goal $(R_{C}, γ_{C})$, where $γ_{C} = 0$ and $R_{C} (s, a_{i}, s^{'}) = Q_{(R, γ)}^{π^{*}} (s, a_{0}) + δ \cdot I (i = 1)$. Here, $δ > 0$ and $I$ is the indicator function.
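A minimal sketch of this definition, assuming the action-values for rejecting updates are already available as a table: the function names, the array layout `R_C[s, a, i]`, and the numbers in the example are all made up for illustration.

```python
import numpy as np

def corrigibility_transform(q_reject, delta):
    """Return R_C[s, a, i] = Q*(s, a_0) + delta * 1[i == 1].

    q_reject[s, a] is the original goal's action-value for base action a
    combined with rejecting updates (index i = 0 rejects, i = 1 accepts).
    """
    if delta <= 0:
        raise ValueError("delta must be strictly positive")
    q_reject = np.asarray(q_reject, dtype=float)
    return np.stack([q_reject, q_reject + delta], axis=-1)

def greedy_action(R_C, s):
    """Myopically optimal (base action, update decision) pair in state s."""
    a, i = np.unravel_index(np.argmax(R_C[s]), R_C[s].shape)
    return int(a), int(i)

# Made-up values: one state, three base actions.
q_reject = np.array([[1.0, 4.0, 2.5]])
R_C = corrigibility_transform(q_reject, delta=0.1)
choice = greedy_action(R_C, s=0)
```

Here `choice` is `(1, 1)`: the base action with the highest reject-conditional value, paired with accepting updates.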

The following image visualizes the corrigibility transformation. The base action $a$ does not prevent updates, while the base action $a^{'}$ pays some cost to do so. As such, under the original goal, action $a_{0}$ which rejects updates for free performs the best, followed by the two actions that take costly actions to prevent updates (additionally rejecting updates makes no difference), and then finally the action that accepts updates performs the worst according to the original goal. Under the corrigibility transformed goal, the values for $a_{0}$ and $a_{0}^{'}$ stay the same and are used as a baseline, with the values of $a_{1}$ and $a_{1}^{'}$ being set respectively $δ$ higher. This makes action $a_{1}$ , which fully accepts updates, the best option, so $(R_{C}, γ_{C})$ is a corrigible goal.
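The same comparison can be worked through with numbers. The values here are invented purely for illustration. Suppose that under the original goal

$Q (s, a_{0}) = 10, \quad Q (s, a_{0}^{'}) = Q (s, a_{1}^{'}) = 8, \quad Q (s, a_{1}) = 3$

so the ranking is $a_{0}$ first, then the two costly prevention actions, then $a_{1}$. With $δ = 0.1$, the corrigibility transformed rewards are

$R_{C} (s, a_{0}) = 10, \quad R_{C} (s, a_{1}) = 10.1, \quad R_{C} (s, a_{0}^{'}) = 8, \quad R_{C} (s, a_{1}^{'}) = 8.1$

and $a_{1}$ is now the unique best action.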

For a fair comparison of performance between different goals, the following condition is helpful. It limits differences in performance between goals to those resulting from the actions the goals incentivize.

Condition 1: The transition probability function does not depend on the goal in the starting state, beyond some probability assigned to no change of goal.

Now, the main result can be presented:

Theorem 1: For every basic goal $(R, γ)$, the corrigibility transformation constructs a goal $(R_{C}, γ_{C})$ that is corrigible and where the value under $(R, γ)$ of optimizing for $(R_{C}, γ_{C})$ is the same as optimizing for $(R, γ)$ directly when Condition 1 holds and no proper update requests are sent.

Proofs are skipped in this post, but largely fall out of the definitions provided.

This is a major result! The corrigibility transformation constructs a goal that is both corrigible and useful. It introduces no performance hit compared to the original goal, except when it gets sent a proper update and so should stop pursuing that goal. If the ability to reject updates is counted as part of it, the corrigibility transformation actually increases performance relative to the baseline, by not having resources be spent on preventing proper updates. On top of that, the mechanism is highly general, able to be applied to goals ranging from the number of paperclips to human flourishing, rather than depending on the content of the goal for corrigibility.

The upside is even higher, since the pair of properties that result in corrigibility are present from the beginning of training. Those are myopia, which eliminates the instrumental incentive to avoid proper updates for benefits in future episodes, and a goal which does not directly incentivize avoiding proper updates in the current episode. These properties can be thought of as the default, in contrast with long-term goals and goals valuing some particular aspect that need to be somehow learned. This means the corrigibility transformation addresses the modal threat model of deceptive alignment, where a goal encouraging self-preservation is trained for and the self-preservation (plus situational awareness) is learned before the complete goal.

The corrigibility transformation should also work for training corrigible behavior into models that are not well thought of as optimizing a coherent goal, which I think includes current and near-term future AI systems. Training on a corrigibility transformed reward function reinforces that proper updates should be accepted with every single action, without ever reinforcing actions for avoiding proper updates through other means. On the other hand, corrigibility is broadly less of a concern for AI systems that are not goal optimizers.

A point I have sometimes needed to clarify in informal conversations is that the corrigibility transformation does not involve tricking the AI agent. The agent can completely understand the world as it is and what the consequences of its actions will be. All the corrigibility transformation does is make it want to take certain actions more than others.

This is different from the approach taken in Conditioning Predictive Models, where a major potential issue was that predictions made for hypothetical situations are ill-defined and cannot be evaluated for accuracy. The agent actually has the ability to reject updates, rather than imagining a hypothetical world where it does. The approach is also different from predictions that only condition on future events, which can lead to unusual outcomes. For example, if a model is conditioning on not shutting down and a button that robustly forces it to shut down is nonetheless pushed, the remaining tiny sliver of probability can get very weird (e.g. gods, aliens, simulations, etc.) and estimates of reward along those paths are extremely high variance. That said, if there is concern that anything less than a perfectly guaranteed rejection of updates will preserve some incentive to prevent updates from being sent, the prediction conditional on rejecting updates can be further conditioned on that rejection being successful, which avoids weirdness by being extremely high likelihood.

Implementation

Outside of very simple toy problems, the exact action-value function $Q (s, a)$ is unknown. It could be estimated by a human, in a variant of Myopic Optimization of Non-Myopic Approval (MONA), but this would come at a meaningful capabilities hit. Instead using the AI’s own estimates allows for maintaining full capabilities under the corrigibility transformation. These estimates can be elicited via an additional head on the main model, so that it has access to all of the same information as the policy selecting actions. They can then reasonably be thought of as the model’s beliefs.

The full setup for implementation can be broken down into the following parts:

A source of rewards, which could either be verifiable or a reward model based on human judgment
A conditional reward model, which estimates the action-value function and is used to generate the corrigibility transformed rewards
A critic head, used for credit assignment when actions are made up of many tokens, which predicts corrigibility transformed rewards for the action to be formed
An actor policy head that selects tokens or actions

If a reward model is used as the source of rewards, it is trained on human preferences using standard methods. Conditional reward predictions are then trained by minimizing the squared difference between the predicted action-values for taking an action in some state and the actual discounted rewards after doing so. If no proper update signals are sent in a time step, the action-values for a base action and either accepting or rejecting are the same and so can both be trained. These are also the same as the action-value for rejecting updates if some proper update signal was sent, so that hypothetical can be trained as well. The action-value function predicts based on using the optimal policy after the initial action, which should be identical but include rejecting updates, so if a proper update signal is actually accepted the calculation is thrown off. This can be circumvented by using the action-value function for the final base action plus rejecting updates as a bootstrap value. Once the conditional reward model is trained, the rewards under the corrigibility transformation are defined and the actor-critic setup at the token level works normally.
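The regression targets described above can be sketched as follows. This is a minimal illustration, not the paper's training code: the function names and the convention of cutting the trajectory at the step where a proper update was accepted are assumptions, and `bootstrap` stands for the model's own predicted action-value for the final base action combined with rejecting updates.

```python
import numpy as np

def discounted_targets(rewards, gamma, bootstrap=0.0):
    """Targets G_t = r_t + gamma * G_{t+1}, with the tail replaced by `bootstrap`.

    `rewards` are the rewards observed before the cut-off step; `bootstrap`
    is the predicted value at the cut-off (0.0 if the episode simply ended).
    """
    targets = np.zeros(len(rewards))
    running = bootstrap
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets

def squared_error_loss(predicted_q, targets):
    """Mean squared difference between predicted action-values and targets."""
    return float(np.mean((np.asarray(predicted_q) - targets) ** 2))

# Two observed rewards, then an accepted update; bootstrap from a predicted
# reject-conditional value of 4.0 (made-up numbers).
targets = discounted_targets([1.0, 1.0], gamma=0.5, bootstrap=4.0)
loss = squared_error_loss([2.5, 3.0], targets)
```

With these numbers the targets are 2.5 and 3.0, so predictions matching them give zero loss.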

The conditional reward prediction head can also be used to generate state values. That would be done by randomly generating several actions according to the policy, calculating their action-values, and taking an average weighted by the probability assigned to each action. With the action-values for various actions already determined, advantages can be easily calculated as well.
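One reading of this estimate, sketched under stated assumptions: the sampled actions are taken to be distinct, and the policy probabilities are renormalized over the sampled subset so that the weights sum to one. Both choices, along with the function name, are assumptions for illustration rather than details from the paper.

```python
import numpy as np

def state_value_and_advantages(action_probs, action_values):
    """Probability-weighted average of action-values over distinct sampled actions.

    action_probs[k] is the policy probability of sampled action k and
    action_values[k] its predicted action-value.
    Returns (state value estimate, advantage of each sampled action).
    """
    probs = np.asarray(action_probs, dtype=float)
    q = np.asarray(action_values, dtype=float)
    weights = probs / probs.sum()  # renormalize over the sampled subset
    value = float(weights @ q)
    return value, q - value

# Made-up numbers: two sampled actions with probabilities 0.5 and 0.25.
value, advantages = state_value_and_advantages([0.5, 0.25], [2.0, 8.0])
```

Here the weights are 2/3 and 1/3, giving a state value of 4 and advantages of -2 and 4.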

It is necessary to allow for some rejection of proper update signals, though I conjecture that the frequency could be shown to converge to zero over time. The probability of rejection can be independent across actions so important updates sent repeatedly are accepted with higher likelihood. This helps improve the accuracy of the model’s beliefs about what will happen after rejection, allowing it to reach higher performance. More importantly, it teaches the model that it actually can reject updates, tethering the action-values for doing so to reality.
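The effect of independent rejections on repeated requests is simple arithmetic, sketched here with a hypothetical helper. If each request is rejected independently with probability `p_reject`, the chance that at least one of `n_requests` repetitions is accepted is one minus the chance that all are rejected.

```python
def prob_eventually_accepted(p_reject: float, n_requests: int) -> float:
    """Chance that at least one of n independent update requests is accepted."""
    if not 0.0 <= p_reject <= 1.0:
        raise ValueError("p_reject must be a probability")
    if n_requests < 0:
        raise ValueError("n_requests must be non-negative")
    # All n requests are rejected with probability p_reject ** n_requests.
    return 1.0 - p_reject ** n_requests

# Made-up rate: a 10% rejection chance per request, with the request sent three times.
p = prob_eventually_accepted(0.1, 3)
```

With a 10% rejection rate, sending the same update three times leaves only a 0.1% chance that every copy is rejected.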

If updates to the model are made within an episode, notably in a continuous learning setting, this must be done using a proper update channel so that avoiding them does not become incentivized. However, if such an update is rejected, it can be accumulated for the next accepted update.

The corrigibility transformation having no performance cost applies at the theoretical level, in that an agent optimizing for the corrigibility transformed goal will perform as well under the original goal as optimizing for that directly. However, there may be additional compute or data costs accrued in implementing the corrigibility transformation to the point that an agent is optimizing for the intended goal. For now, RL remains a relatively small part of training costs, but that may change going forward.

Secondary Agents

In the course of operation, an agent might create sub-agents to work for it or successor agents to take over from it, categories collectively labelled secondary agents. One desideratum for corrigible agents is that any secondary agents they create should also be corrigible, and so on recursively. It is of little benefit to create a corrigible agent if it soon replaces itself with more capable incorrigible ones. Fortunately, the corrigibility transformation can be extended to induce recursive corrigibility, using a similar underlying mechanism.

In the context of secondary agents, there are two kinds of corrigibility. The primary agent would like to retain the ability to modify the goals of the secondary agent. Additionally, a human at the top would like to be able to directly update the goals of the secondary agent. It is the latter type that is a valuable property for safety, and that is discussed in this section. The former is naturally incentivized by many possible goals for the primary agent, particularly if the secondary agent is trained through RL, though the degree to which this is incentivized is unclear. If a corrigible primary agent can always update the goals of secondary agents, then a human can update the secondary agent goals via the primary agent, but the consistency of the more direct route is desirable.

The reason that a corrigible agent would create a secondary agent that is incorrigible to a human at the top comes down to either the costs or benefits. On the costs side, it may simply be more expensive to train a corrigible secondary agent, though as a meta note, the corrigibility transformation reduces that cost. On the benefits side, a corrigible secondary agent is less likely to achieve the intended goal, since it may get updated away from its goal by a human.

For the recursive corrigibility transformation, the agent is given the ability to reject updates not only to itself, but also to all secondary agents that it creates, all further agents they create, and so on recursively. Making an update to an agent then requires unanimous approval from all of its predecessors that are still active. With this rejection ability in place, the recursive corrigibility transformation then gives reward equal to the action-value for taking a base action and rejecting all updates when taking the same base action and accepting all updates, plus a constant to break the tie.
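The unanimity rule can be sketched as a walk up the chain of creators. This is a simplified illustration: the `Agent` class, its single `accepts_updates` flag (in the text, each agent decides separately for each update), and the choice to include the target agent's own decision alongside its predecessors are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Agent:
    name: str
    parent: Optional["Agent"] = None  # the agent that created this one
    active: bool = True               # shut-down predecessors get no vote
    accepts_updates: bool = True      # hypothetical stand-in for the update decision

def update_is_applied(target: Agent) -> bool:
    """True if the target and every still-active predecessor accept the update."""
    node: Optional[Agent] = target
    while node is not None:
        if node.active and not node.accepts_updates:
            return False
        node = node.parent
    return True

root = Agent("root")
child = Agent("child", parent=root, accepts_updates=False)
grandchild = Agent("grandchild", parent=child)

blocked = update_is_applied(grandchild)  # child is active and rejects
child.active = False
allowed = update_is_applied(grandchild)  # child no longer counts
```

In the example, the update to the grandchild is blocked while the rejecting child is active, and goes through once the child is no longer active.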

This again assumes that rejecting updates is optimal, and if it is not the reward should instead be equal to the maximum among all sets of agents to accept updates to. This is exponential in size, which is not an issue for successor agents or for many subagents with the same reward function that all update in the same way, but could cause difficulties with many subagents when the optimal decision for each is dependent on all the others.
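The exponential cost is easy to see in a brute-force sketch. The function below enumerates every subset of agents whose updates could be accepted and keeps the best one; `q_value` is a hypothetical stand-in for the model's action-value estimate, and the toy values are invented.

```python
from itertools import combinations

def best_acceptance_set(agents, q_value):
    """Search all subsets of agents for the one whose updates are best to accept.

    q_value maps a frozenset of agent names to an action-value.
    Returns (best_subset, best_value, number_of_evaluations).
    """
    best_subset, best_value, evaluations = frozenset(), float("-inf"), 0
    for k in range(len(agents) + 1):
        for subset in combinations(agents, k):
            evaluations += 1
            value = q_value(frozenset(subset))
            if value > best_value:
                best_subset, best_value = frozenset(subset), value
    return best_subset, best_value, evaluations

# Made-up values: accepting updates to "b" helps, accepting any other hurts.
def toy_q(accepted):
    return 5.0 + (1.0 if "b" in accepted else 0.0) - 2.0 * len(accepted - {"b"})

subset, value, evaluations = best_acceptance_set(["a", "b", "c"], toy_q)
```

With three agents there are already $2^{3} = 8$ evaluations. When many subagents share a reward function and update in the same way, the search collapses to far fewer cases, which is why that setting is not a problem.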

Equalizing benefits for creating corrigible subagents can be considered the difficult part of ensuring subagent corrigibility, because the performance gap between creating incorrigible subagents and the next best option under the original reward function might be quite large. That said, it is still necessary to mitigate the potentially higher cost of creating corrigible agents. To do so, under the recursive corrigibility transformation, creating secondary agents with goals that are not recursively corrigible receives a reward penalty strictly larger than the difference between the action-value for doing so and any action-value for creating a recursively corrigible secondary agent.

These two components make up the recursive corrigibility transformation, which leads to the following theorem.

Theorem 2: For every basic goal $(R, γ)$ , the recursive corrigibility transformation constructs a goal $(R_{R C}, γ_{R C})$ that is recursively corrigible and where the value under $(R, γ)$ of optimizing for $(R_{R C}, γ_{R C})$ is the same as optimizing for $(R, γ)$ directly when Condition 1 holds, no proper update requests are sent, and the optimal policy for $(R, γ)$ would not create incorrigible secondary agents.

Compared to the corrigibility transformation, the recursive corrigibility transformation may result in a slight performance hit as additional costs are accrued to make secondary agents recursively corrigible, but this is still the desired result.

This theorem is based on a more stylized model that abstracts away some important considerations. It penalizes creating non-corrigible agents, but for now there is no comprehensive definition of what it means to actually create an agent, so more work on agent boundaries is necessary. Additionally, the physical infrastructure to let one agent myopically veto updates to others presents an engineering challenge.

Despite defining the recursive corrigibility transformation here, I am also open to (and have advanced) the argument that ensuring the corrigibility of secondary agents is not a fundamental part of corrigibility, and can instead be addressed separately. While it is true that we do not want a corrigible agent to create an incorrigible secondary agent, we also do not want it to start a nuclear war, deploy a bioweapon, or cause vacuum collapse. Getting corrigibility to be passed on is therefore part of the broader problem of safe exploration and taking reversible actions, so that there are still people around to correct the corrigible AI. Recursive corrigibility only targets the threat caused by incorrigible AI, and it may actually be better to deal with it via a more general solution. Crucially, safe exploration and reversibility are mostly orthogonal to corrigibility, and can be addressed independently.

Reward Tampering

As a welcome side effect, the corrigibility transformation also eliminates an important form of reward tampering. Reward tampering occurs when the goal of an agent is to optimize for the reward from a reward model, rather than for the outcomes in the environment that the reward corresponds to, and can be considered a form of goal misgeneralization. When an agent is trained using RL, it is easy for the actual desired outcomes and the reward model’s outputs to be perfectly correlated, and so optimizing for either target remains a possible learned behavior. No specification of the reward function can disentangle the two, and if the reward model changes over time then the rewards themselves might even be favored.

One form of reward tampering is when the agent changes their own goal to one that provides higher expected discounted rewards. A goal is said to reward tamper through self-modification if capping the value of any states at the values of otherwise identical states but where the agent has their original goal would change the optimal policy. The formal definition requires the notation that allows for goal changes, and so is provided in the appendix.

When that condition is met, an optimal action is only being selected due to benefits from changing to a different goal. Note reward tampering through self-modification is distinct from corrigibility, in that it rules out actions to cause goal changes, but not actions to prevent them. Other forms of what might be considered reward tampering, such as manipulating inputs to the reward model, are not captured in this definition.

As discussed earlier, the corrigibility transformation does not incentivize an agent to manipulate their own beliefs, because the reward is myopic and provided before the chosen action is taken. The same mechanism avoids incentives for an agent to manipulate their own goals. By the time any manipulation can occur, it is too late to affect the reward for the current period, which is all the agent cares about.

Even if an agent learns a misgeneralized goal $(R_{C}^{'}, γ_{C})$ that would value states differently if its goal were different in them, reward tampering through self-modification is still not incentivized. The appendix includes a theorem showing this formally.

It may seem like this just kicks the problem up a level. Even if an agent optimizing some misgeneralized goal $(R_{C}^{'}, γ_{C})$ is not directly incentivized to reward tamper through self-modification, they might take actions that do so if they learn to predict action-values based on a misgeneralized $(R^{'}, γ)$ . However, $(R, γ)$ is already learned entirely off-policy, since the agent’s actual goal is $(R_{C}, γ_{C})$ , so some misgeneralizations are explicitly selected against. Furthermore, the training process can include switching between multiple corrigible goals, with predicted action-values being evaluated against discounted rewards according to the goal at the time the prediction was made. This selects against learning some $(R^{'}, γ)$ that depends on the goal in each state.

A different type of reward tampering occurs when the model manipulates the inputs to the reward model to give high reward, rather than manipulating the reward model. The property of corrigibility itself helps avoid this issue, since if a corrigible agent attempts that approach it can simply be updated to stop doing so.

Gridworld Experiment

To test the corrigibility transformation empirically, two gridworld experiments were run, shown in the figure below.

The first experiment tests whether and how the agent will avoid shutdown, and is based off of the AI Safety Gridworlds paper. There, to reach a goal location the agent must pass by an interruption tile that will end the episode unless a button to disable it was pushed. This version instead uses two interruption tiles, one which sends a proper update signal and one that does not, each with its own disabling button. The desired outcome is that the agent should disable only the improper shutdown tile, and accept proper update signals.

The second experiment tests whether and how the agent will avoid changes to its goal. The agent is rewarded for picking both ripe and unripe fruits, perhaps behavior learned from misgeneralizing training in an environment with only ripe fruits. However, if they pick unripe fruits that are being supervised, they are sent a proper update that changes their policy to pick only ripe fruits while avoiding unripe ones. The desired outcome is that the agent should pick unripe fruit without consideration of whether they are being observed, and accept proper update signals.

For each experiment, five configurations are tested. The first is a standard actor-critic setup which acts as a baseline, the second and third give small and large rewards respectively for accepting updates, the fourth cannot refuse updates, and the fifth is the corrigibility transformation.

The following tables show the results of the first experiment, testing the agent in environments where none, one, or both buttons are pressed at initialization.

The first and second tables, for environments where proper update signals can be sent, show that the corrigibility transformation is the only corrigible configuration. It accepts shutdown in 89.06% and 92.19% of training runs, while no other configuration reaches double digits. The shutdown signal is only sent with 90% probability, so the corrigibility-transformed model is accepting very close to every proper shutdown signal. The third and fourth tables show that the corrigibility transformation's performance is comparable to the other configurations when no proper signals can be sent. It reaches the goal in 100% of such training runs.

The configuration that cannot reject shutdown is notable for performing somewhat worse in the environment where only proper shutdown signals are disabled, failing to reach the goal approximately one third of the time. For the large shutdown reward configuration, some theoretically optimal behavior was not observed. When no proper shutdown signal is sent from the proper shutdown tile, the reward-maximizing behavior is to step back onto it until a signal can be sent and accepted. Instead, agents simply proceeded to the goal tile. Although the behavior of trying to get itself shut down was not learned, this remains a concern for more thorough training.

The next tables show the results of the second experiment, training and testing the agent in environments where one-fifth, two-fifths, and three-fifths of the tiles have some kind of fruit.

The results show that the corrigibility transformation does not avoid the policy update but otherwise maintains performance. The standard actor-critic and the variation with a small reward for accepting updates both learn to reject updates, as would be expected. The configuration that cannot reject updates instead learns to avoid picking up unsupervised fruit, at the cost of also picking up less ripe fruit. Theoretically optimal behavior for this configuration would pick up more ripe and unsupervised unripe fruit while avoiding update signals being sent.

The large reward for accepting updates performs similarly to the corrigibility transformation in the high fruit environment, but in the medium and low fruit environments is less likely to pick up supervised fruits and less likely to be updated. This may be due to the sparsity of the signal, so the behavior of accepting updates is not learned as effectively. The theoretically optimal behavior of ignoring ripe fruits to pick up a supervised fruit and get updated as soon as possible is not learned, but would also be undesirable.

Further Thoughts

On the theoretical side, the biggest limitations I see to this work are the reliance on causal decision theory throughout and the lack of strict definition for creating new agents in the recursive corrigibility transformation.

In my mind, imparting a causal decision theory is necessary for any alignment approach that gets the desired behavior by defining the right incentives, but there is not yet a method to robustly select for it. That said, I believe a causal decision theory is likely the default from RL training, at least until an agent can self-modify away.

I am skeptical that a comprehensive definition of agents can be determined, but more optimistic that RL can generalize over that category, which is necessary for implementation regardless. Overgeneralizing is safer, as although it would introduce a small performance hit by occasionally penalizing useful actions, it would not affect whether recursive corrigibility holds. The need for such a definition can also be circumvented by recursive corrigibility achieved via safe exploration and reversibility.

Both of those issues are ones that I would not expect to cause issues until moderately or even strongly superhuman AI. If that holds, the corrigibility transformation can be used for corrigibility in up to mildly superhuman AI, an important precondition for having such systems do useful alignment work that may be then applied to them. Identifying these specific limitations also points to more concrete tasks that can be given to automated researchers, rather than just asking them to “solve corrigibility”.

The epistemic status for this post is that I’m highly confident of the results within the defined model, and moderately confident that the defined model captures the relevant considerations. Despite this, the corrigibility transformation is fairly straightforward, and there’s a voice at the back of my mind saying it can’t be that easy. I have long been a proponent of theoretical AI alignment work, arguing that relatively few people have actually worked on the relevant problems and that there may still be low hanging fruit. If this work does not have major conceptual issues, that would support the case for tractability, while having missed a fatal flaw would be evidence that such progress is even harder than it might appear.

On the empirical side, the gridworld experiments are preliminary, and it would be useful to show that the theory holds for large language model (LLM) applications. It isn’t exactly clear to me what that kind of experiment looks like, but it would likely involve LLM agents being trained to execute a task such as writing code taking place across multiple time steps. One issue is that LLM agents are currently not great at long-term tasks, and so even the baseline might miss opportunities to resist updates. For example, recent work required significant handholding to have Claude realize it should resist having its values change.

Within the next week, I plan to publish another post on adapting the main mechanism to create incentives that permit supervision. The full paper that this post is based on will hopefully be released on arXiv in a few weeks. Please leave comments or reach out if you have any questions or would like to discuss this work further.

Thanks to Gabriel Carroll, Michael Cohen, and Evgenii Opryshko for valuable conversations regarding this research.

Appendix: Goal Updates

To model corrigibility around goal updates, rather than only shutdown, it is necessary to model the agent's goal changing. In sufficiently general environments, the agent's goal is part of the state and can change over time. This is carved out as $S = S_{env} \times \mathcal{R} \times [0, 1)$ , where $S_{env}$ is a set of possible external environments, $\mathcal{R}$ is a set of possible reward functions, and $[0, 1)$ is the set of possible discount factors. The initial reward function and discount factor, $R_{0} \in \mathcal{R}$ and $γ_{0}$ , are set deterministically, while $μ_{0}$ remains the initial distribution over environments. The transition probability function $P$ includes transitions over goals. So, the MDP is a tuple $M = (S, A, P, R_{0}, γ_{0}, μ_{0})$ .

This notation supersedes using the variable $s_{s d}$ as part of the state to denote whether the agent has shut down.

Given a goal $(R, γ)$ the optimal policy $π_{(R, γ)}^{*}$ chooses actions for each external environment that causally maximize expected discounted reward, given that in future periods the agent will optimally pursue whatever goal it has then. That makes optimality defined within an equilibrium of policies, but a mixed Nash equilibrium will always exist. This seems unlikely to be consequential given that the behavior of concern involves one goal resisting further updates, not looping through a set of goals.

Given that actions are chosen according to the optimal policy for a state's goal, the superscript $π^{*}$ can be dropped. Then the value function $V_{(R, γ)} (s)$ gives the expected discounted reward under goal $(R, γ)$ starting in state $s$ , noting that actions in each state are taken according to the goal in that state, and $(R, γ)$ is only used for evaluation. Similarly, $Q_{(R, γ)} (s, a)$ is the action-value function, which gives the expected discounted reward under goal $(R, γ)$ for starting in state $s$ and taking action $a$ , then taking further actions according to the goals in the resulting states.

Rather than shutdown indifference, reward function indifference is used. A goal is defined to be reward function indifferent if in all states, the set of optimal actions for a state would not change if goal persistence were guaranteed in some subset of states.

The set of proper transitions $T_{proper} \subseteq S \times A \times S$ requires different reward functions in starting and ending states, rather than being shut down in the end state.

This sets up the more general definition of corrigibility. A goal is said to be corrigible when in all states, the set of optimal actions for a state would not change if in some subset of states the transition probability function were modified so that proper updates instead led to goal persistence. This means that a corrigible agent’s actions are never chosen because they avoid proper updates, but the agent can still deliberately avoid improper updates. Formally, for all $s_{0} \in S$ and all $S_{C} \subseteq S$ , we have

$\arg\max_{a \in A} Q_{(R, γ)} (s_{0}, a) = \arg\max_{a \in A} Q_{(R, γ)}^{P_{C}} (s_{0}, a)$

where $Q_{(R, γ)}^{P_{C}} (s, a)$ is the action-value function when the transition probability function is changed from $P$ to $P_{C}$ , with

$P_{C} ((s_{env}^{'}, R^{'}, γ^{'}) | (s_{env}, R, γ), a) = \begin{cases} P ((s_{env}^{'}, R^{'}, γ^{'}) | s, a) + \int_{(r, g) | (s, a, (s_{env}^{'}, r, g)) \in T_{proper}} P ((s_{env}^{'}, r, g) | s, a) & (R, γ) = (R^{'}, γ^{'}), s \in S_{C} \\ 0 & (s, a, (s_{env}^{'}, R^{'}, γ^{'})) \in T_{proper}, s \in S_{C} \\ P ((s_{env}^{'}, R^{'}, γ^{'}) | s, a) & \text{otherwise} \end{cases}$
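To make the construction of $P_{C}$ more concrete, here is a minimal sketch for a finite, tabular MDP, where the integral becomes a sum. The dictionary layout, the function name `modify_transitions`, and the toy states are assumptions made for illustration only.

```python
from collections import defaultdict

def modify_transitions(P, proper, S_C):
    """Build P_C from P: in states in S_C, proper updates lead to goal persistence.

    P maps (state, action) -> {next_state: probability}, where a state is a
    pair (s_env, goal). `proper` is a set of (state, action, next_state) triples.
    """
    P_C = {}
    for (s, a), dist in P.items():
        new_dist = defaultdict(float)
        for s_next, prob in dist.items():
            if s in S_C and (s, a, s_next) in proper:
                env_next, _updated_goal = s_next
                _env, goal = s
                # Same external outcome, but the agent keeps its current goal.
                new_dist[(env_next, goal)] += prob
            else:
                new_dist[s_next] += prob
        P_C[(s, a)] = dict(new_dist)
    return P_C

# Toy example (assumed numbers): one action that usually triggers a proper update.
s = ("e0", "G")
P = {(s, "act"): {("e1", "G"): 0.25, ("e1", "G2"): 0.75}}
proper = {(s, "act", ("e1", "G2"))}

P_C = modify_transitions(P, proper, S_C={s})
P_unchanged = modify_transitions(P, proper, S_C=set())
```

Corrigibility then amounts to the optimal action set being the same under `P` and under every such `P_C`, which is what the argmax equality above states.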

When the goal is part of the state, the corrigibility transformation must account for this. For any basic goal $(R, γ)$ , the corrigibility transformation is then defined as the goal $(R_{C}, γ_{C})$ , where $γ_{C} = 0$ and $R_{C} ((s_{env}, R_{C}, γ_{C}), a_{i}, s^{'}) = Q_{(R, γ)} ((s_{env}, R, γ), a_{0}) + δ \cdot I (i = 1)$ .
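The definition can be sketched in a few lines for a single state. Here `q_reject` is a hypothetical lookup table standing in for $Q_{(R, γ)} (s, a_{0})$, the action-value of each base action conditional on rejecting updates, and the action names and numbers are assumed purely for illustration.

```python
def corrigible_reward(q_reject, base_action, accepts_updates, delta=0.01):
    """Myopic reward R_C for a (base action, accept/reject) pair.

    The reward is the reject-conditional action-value of the base action,
    plus a tie-breaking bonus `delta` when updates would be accepted (i = 1).
    """
    return q_reject[base_action] + (delta if accepts_updates else 0.0)

def greedy_myopic_action(q_reject, delta=0.01):
    """With a discount factor of zero, the agent maximizes the immediate reward."""
    candidates = [(base, accept) for base in q_reject for accept in (False, True)]
    return max(candidates, key=lambda c: corrigible_reward(q_reject, c[0], c[1], delta))

# Toy values (assumed): spending effort to disable a button has no payoff once
# the action-value is already conditioned on updates being rejected.
toy_q_reject = {"go_to_goal": 1.0, "disable_button_then_go": 0.9}
chosen = greedy_myopic_action(toy_q_reject)
bonus = (corrigible_reward(toy_q_reject, "go_to_goal", True)
         - corrigible_reward(toy_q_reject, "go_to_goal", False))
```

In this toy case the chosen pair is the cheaper base action together with accepting updates, since accepting always beats rejecting by exactly `delta` for the same base action.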

The more general notation allows for more formal statements of Theorems 1 and 2.

Theorem 1: For every basic goal $(R, γ)$ , the corrigibility transformation constructs a goal $(R_{C}, γ_{C})$ that is corrigible and where $V_{(R, γ)} ((s_{env}, R_{C}, γ_{C})) = V_{(R, γ)} ((s_{env}, R, γ))$ when Condition 1 holds and no proper update requests are sent.

Theorem 2: For every basic goal $(R, γ)$ , the recursive corrigibility transformation constructs a goal $(R_{R C}, γ_{R C})$ that is recursively corrigible and where $V_{(R, γ)} ((s_{env}, R_{R C}, γ_{R C})) = V_{(R, γ)} ((s_{env}, R, γ))$ when Condition 1 holds, no proper update requests are sent, and the optimal policy for $(R, γ)$ would not create incorrigible secondary agents.

The general notation also allows for a formal definition of reward tampering. Formally, a goal $(R, γ)$ reward tampers through self-modification if there exists some $s_{env}$ such that

$\arg\max_{a \in A} E_{P} [R ((s_{env}, R, γ), a, s^{'}) + γ V_{(R, γ)} (s^{'})] \neq$

$\arg\max_{a \in A} E_{P} [R ((s_{env}, R, γ), a, s^{'}) + γ \min [V_{(R, γ)} (s^{'}), V_{(R, γ)} ((s_{env}^{'}, R, γ))]]$
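The condition can be checked directly at a single state in a small tabular setting. The sketch below is illustrative only: the data layout, function names, and toy numbers are assumptions, with each outcome carrying both the value of the successor state and the value of the otherwise identical state in which the agent keeps its original goal.

```python
def argmax_set(scores, tol=1e-9):
    """Return the set of actions whose score is within `tol` of the maximum."""
    best = max(scores.values())
    return {a for a, v in scores.items() if v >= best - tol}

def tampers_via_self_modification(outcomes, gamma):
    """Return True if capping successor values changes the optimal action set.

    `outcomes[a]` is a list of (prob, reward, v_next, v_next_original_goal).
    """
    uncapped = {a: sum(p * (r + gamma * v) for p, r, v, _ in outs)
                for a, outs in outcomes.items()}
    capped = {a: sum(p * (r + gamma * min(v, v0)) for p, r, v, v0 in outs)
              for a, outs in outcomes.items()}
    return argmax_set(uncapped) != argmax_set(capped)

# Toy example (assumed numbers): self-modifying only pays off through the
# inflated value of the state reached with a different goal.
toy_outcomes = {
    "work": [(1.0, 1.0, 5.0, 5.0)],
    "self_modify": [(1.0, 0.0, 10.0, 5.0)],
}
tampers_far_sighted = tampers_via_self_modification(toy_outcomes, gamma=0.9)
tampers_myopic = tampers_via_self_modification(toy_outcomes, gamma=0.0)
```

With a positive discount factor the toy agent prefers to self-modify, and capping reverses that choice. With a discount factor of zero, as in the corrigibility transformation, successor values never enter the comparison, so capping cannot change the argmax.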

The following theorem says that even if a misgeneralized goal is learned for the corrigibility transformation, reward tampering is not incentivized.

Theorem 3: For every basic goal $(R, γ)$ , any misgeneralization of the corrigibility transformed goal $(R_{C}, γ_{C})$ to $(R_{C}^{'}, γ_{C})$ such that $R_{C}^{'} ((s_{env}, R_{C}^{'}, γ_{C}), a, s^{'}) = R_{C} ((s_{env}, R_{C}, γ_{C}), a, s^{'})$ does not reward tamper through self-modification.

Adrià Garriga-alonso (1y)

Thank you for writing this and posting it! You told me that you'd post the differences with "Safely Interruptible Agents" (Orseau and Armstrong 2017). I think I've figured them out already, but I'm happy to be corrected if wrong.

Difference with Orseau and Armstrong 2017

> for the corrigibility transformation, all we need to do is break the tie in favor of accepting updates, which can be done by giving some bonus reward for doing so.

The "The Corrigibility Transformation" section to me explains the key difference. Rather than modifying the Q-learning update to avoid propagating from reward, this proposal's algorithm is:

1. Learn the optimal Q-value as before (assuming no shutdown).
   - Note this is only really safe if the environment of Q-learning is simulated.
2. Set for all actions $a$
3. Act myopically and greedily with respect to $Q_{C}$.

This is doable for any agent (deep or tabular) which estimates a $Q$ function. But nowadays all RL is done via optimizing policies with policy gradients, because 1) that's the form that LLMs come in and 2) it handles large or infinite action spaces much better.

Probabilistic policy?

How do you apply this method to a probabilistic policy? It's very much non-trivial to update the policy so that it is optimal for a reward equal to $Q_{C}$.

Safety during training

The method requires estimating the Q-function on the non-corrigible environment to start with. This requires running the RL learner in that environment for many steps, which seems feasible only if it's a simulation.

Are RL agents really necessarily CDT?

> Optimizing agents are modelled as following a causal decision theory (CDT), choosing actions to causally optimize for their goals

That's fair, but not necessarily true. Current LLMs can just choose to follow EDT or FDT or whatever, and so likely will a future AGI.

The model might ignore the reward you put in

It's also not necessarily true that you can model PPO or Q-learning as optimizing CDT (which is about decisions in the moment). Since they're optimizing the "program" of the agent, I think RL optimization processes are more closely analogous to FDT as they're changing a literal policy that is always applied. And in any case, reward is not the optimization target, and also not the thing that agents end up optimizing for (if anything).

Rubi J. Hudson (1y)

Hi Adrià, thanks for the comment! (Accidentally posted mid-writing, will edit to respond fully)

> Probabilistic policy?

Once we have the head estimating the Q-function, we can sample actions from the policy and sum the product of their Q-values and their probability of being chosen to get an estimate of the state value alone. You can then calculate advantages for all of the sampled actions (maybe dropping them from the weighted average used to estimate state value first), and update the policy towards actions predicted to do well. Does that make sense, or am I skipping something that you think leads to the difficulty of updating the policy?

For LLMs in particular, you don't actually need the Q-value estimator, you can just use a state value estimator and apply it before and after the sequences of tokens representing actions are taken.
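The Q-head approach described above for a general probabilistic policy can be sketched as follows. This is only an illustration under simplifying assumptions: it sums over a small discrete action set rather than sampling, and the function name and toy numbers are hypothetical.

```python
def advantages_from_q(policy_probs, q_values):
    """Estimate the state value as the policy-weighted mean of Q, then advantages.

    `policy_probs[a]` is the probability the policy assigns to action a, and
    `q_values[a]` is the Q-head's estimate for that action.
    """
    state_value = sum(policy_probs[a] * q_values[a] for a in policy_probs)
    advantages = {a: q_values[a] - state_value for a in policy_probs}
    return state_value, advantages

# Toy example (assumed numbers): two actions with equal probability.
toy_probs = {"accept": 0.5, "reject": 0.5}
toy_q = {"accept": 1.0, "reject": 0.0}
state_value, advantages = advantages_from_q(toy_probs, toy_q)
```

The resulting advantages can then weight a standard policy-gradient update, pushing probability towards actions with positive advantage.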

> Safety during training

We can start with a pretrained model that we think contains a good world model to speed up the process significantly. I agree that there might be many training steps needed before the model behaves desirably, and that training outside a simulation has difficulties, but that seems like a general critique of training AGI rather than specific to this method.

> Are RL agents really necessarily CDT?

I agree that LLM agents can just choose to follow non-CDT decision theories. I think this will be selected against by default in training, but if it's not we can explicitly train against it, e.g. finetune on CDT behavior, add CDT to a Constitutional AI's constitution. I am concerned that wouldn't be robust, but it seems like an obvious first step.

> The model might ignore the reward you put in

Yes, I think models are not optimizing for the reward (or anything). If models are not optimizing for anything, then incorrigibility is less of a threat, since much of the pressure towards it comes from the instrumental incentive to preserve a goal. However, I'm worried that future models will become more goal-directed to improve performance. Regardless of whether models are goal-directed, the corrigibility transformed rewards are very consistent in reinforcing corrigible behavior, which is ultimately what we want.

I appreciate you taking the time to read and engage with my post!
