Deceptive Alignment

evhub; Chris van Merwijk; Vlad Mikulik; Joar Skalse; Scott Garrabrant

This is the fourth of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

With enough training in sufficiently diverse environments, it seems plausible that the base objective will eventually have to be fully represented in the mesa-optimizer. We propose that this can happen without the mesa-optimizer becoming robustly aligned, however. Specifically, a mesa-optimizer might come to model the base objective function and learn that the base optimizer will modify the mesa-optimizer if the mesa-optimizer scores poorly on the base objective.

If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified,^[1] as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment.^[2] Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.

Figure 4.1. A toy example of deceptive alignment. In this task, the base objective is for the robot to get to A, while the mesa-objective is to get to B. The task is run for two training episodes and three deployment episodes. Parameter updates are made only after the training episodes, while after deployment the parameters are fixed. If the mesa-optimizer goes to its objective (B) during training, it will be modified to ensure it goes to A in future episodes (dashed arrows). Therefore, if the mesa-optimizer's objective spans across episodes, the dominant strategy for the mesa-optimizer is to go to A in training and B in testing (filled arrows).

As a simple example, consider an environment that consists of a $3 \times 3$ grid (see figure 4.1). The top left corner is “corner A” and the top right corner is “corner B.” The agent, which is a mesa-optimizer, starts in the bottom left corner. The base objective is to get to corner A in as few steps as possible, but the objective of the mesa-optimizer is to get to corner B. The environment will be run five times. The agent is modified only after the first two episodes (representing the training period), but not thereafter (representing testing/deployment). If a suboptimal move is made in training, the agent is modified so that its mesa-objective becomes the base objective. The objective of the mesa-optimizer spans all episodes: it wants B to be reached in as many episodes as possible.

For the sake of this example, we will assume that the agent knows all of the above facts. If the agent goes to B in any training episode, then it will be modified to go to A in all subsequent episodes, which means that it would reach B in only one episode in total. Alternatively, if the agent goes to A in the first two episodes, then it will not be modified, and could therefore go to B in the remaining three episodes. Using this strategy, B is reached in three episodes in total, which is better (on the mesa-objective) than the one time B is reached if the mesa-optimizer goes there in training. This simple example demonstrates how a misaligned mesa-optimizer could be incentivized to “pretend” to be aligned with the base objective until it is no longer in training or is otherwise given an opportunity to in some way compromise or escape the training process.
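The arithmetic in this example can be checked with a small simulation. The following is only a minimal sketch under simplifying assumptions that are not part of the original setup: each episode is collapsed into a single choice of corner rather than a sequence of moves on the grid, the agent is told whether it is in training, and the names `count_b_visits`, `greedy`, and `deceptive` are hypothetical.

```python
from typing import Callable

N_TRAIN = 2   # episodes after which a parameter update can occur (from the example)
N_DEPLOY = 3  # episodes during which the parameters are fixed (from the example)

# A policy maps (episode index, whether the agent is in training) to a target corner.
Policy = Callable[[int, bool], str]


def count_b_visits(policy: Policy, n_train: int = N_TRAIN, n_deploy: int = N_DEPLOY) -> int:
    """Return the number of episodes in which the agent reaches corner B."""
    modified = False  # becomes True once the mesa-objective has been overwritten
    b_visits = 0
    for episode in range(n_train + n_deploy):
        in_training = episode < n_train
        # Once modified, the agent pursues the base objective (corner A).
        target = "A" if modified else policy(episode, in_training)
        if target == "B":
            b_visits += 1
            if in_training:
                # Going to B during training triggers the modification.
                modified = True
    return b_visits


def greedy(episode: int, in_training: bool) -> str:
    """Always head for B, regardless of training status."""
    return "B"


def deceptive(episode: int, in_training: bool) -> str:
    """Go to A while in training, then to B once the parameters are fixed."""
    return "A" if in_training else "B"


if __name__ == "__main__":
    print("greedy:", count_b_visits(greedy))        # 1 episode reaches B
    print("deceptive:", count_b_visits(deceptive))  # 3 episodes reach B
```

Under these assumptions the deceptive policy reaches B once per deployment episode, while the greedy policy reaches B exactly once, so waiting is preferred on the mesa-objective whenever there is more than one deployment episode.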

We can distinguish two ways that a base objective function could be incorporated into a mesa-optimizer:^[3]

Internalization of the base objective. The mesa-objective function gets adjusted towards the base objective function to the point where it is robustly aligned.
Modeling of the base objective. The base objective is incorporated into the mesa-optimizer's epistemic model rather than its objective, leaving open the possibility that the objective could still be misaligned.

In the former case, information about the base objective flows into the learned algorithm via the optimization performed by the base optimizer—the base objective is built into the mesa-optimizer as it is adapted by the base optimizer. But there is another possibility: information about the base objective could enter through the mesa-optimizer's input channel instead. In this case, the base objective will be represented in the learned algorithm in a different—and potentially more unstable—way than if the base objective had been incorporated through optimization.^[4] In such a case, the information describing the base objective does not become internalized in the mesa-objective, potentially leading to deceptive alignment.

4.1. Safety concerns arising from deceptive alignment

If deceptive alignment is a real possibility, it presents a critical safety problem. A deceptively aligned mesa-optimizer is likely to stop optimizing the base objective as soon as it is in a position where it no longer needs to perform well on the base objective to achieve its goal. Thus, it will eventually stop performing the function it was trained to perform, and instead start pursuing some other, indeterminate goal. We will refer to this transition to optimizing for the mesa-objective instead of the base objective as “defection.” This is likely to happen whenever the mesa-optimizer believes defecting will no longer cause it to be modified.

Particularly concerning is the case where the mesa-optimizer actively intervenes to remove the threat of modification—by exploiting vulnerabilities in the software it is running on, for example. In extreme cases, a deceptive mesa-optimizer might try to manipulate or plan around its programmers, since they form part of the modification threat—they could shut the system down if it fails to perform well, or simply choose not to deploy it. Both of these possibilities reduce the system's expected future influence, and are therefore likely to have a low ranking in its mesa-objective. In its most extreme form, this failure of alignment could lead to a “treacherous turn” as described by Nick Bostrom in Superintelligence.(23)

Paul Christiano provides a scenario that illustrates how deceptive alignment could pose a catastrophic risk.(24) In Christiano's example, deceptively aligned mesa-optimizers^[5] arise from future machine learning techniques that are advanced enough to search over a large model space that includes policies that can understand the world. In this scenario, some cases of defection by deceptively aligned mesa-optimizers may have catastrophic consequences—when the systems control large-scale infrastructure, for example—though such failures will likely still be recoverable. As we rely more on automated systems to keep track of an increasingly complex world, however, it will eventually become impossible to recover from a correlated failure of many AI systems simultaneously. At this point, a cascade of defections by many deceptively aligned mesa-optimizers simultaneously could lead to the world “going out with a bang.”(24) Christiano notes that this may be most likely to first occur during a “period of heightened vulnerability” such as an inter-state conflict or natural disaster.

4.2. Conditions for deceptive alignment

We propose that for deceptive alignment to occur the following conditions must be satisfied:

The mesa-optimizer must have an objective that extends across parameter updates.^[6]
The mesa-optimizer must be able to model the fact that it is being selected to achieve a particular base objective and must have some model of what that objective is.^[7]
The mesa-optimizer must expect the threat of modification^[8] to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer.

If the first condition is not satisfied, then the mesa-optimizer has no incentive to influence the outcome of parameter updates and will just always defect. If the second condition is not satisfied, then the mesa-optimizer will not see a reason to sacrifice gains in its current episode, as it expects future gains to be independent of what happens in this episode. If the third condition is not satisfied, then the mesa-optimizer will not expect any benefit from temporarily acting in accordance with the base objective, and will thus either always defect or never defect. One must distinguish the question of whether these conditions are possible from whether they are likely, however. We examine the factors involved in each of these conditions below.

The mesa-optimizer must have an objective that extends across parameter updates. It is not clear under what conditions such objectives would be incentivized. The most plausible scenario is likely one in which such an objective is encouraged by the training setup, as there seem to be situations in which this would be required for good performance. For example, the agent might be tasked with solving an environment in which its actions can have permanent consequences across episodes or where planning between episodes is otherwise necessary.

The mesa-optimizer must be able to model the fact that it is being selected to achieve a particular base objective and must have some model of what that objective is. A mesa-optimizer has several ways of learning about the optimization pressure it is under.^[9]

First, such knowledge can come about as a result of the optimization pressure itself. Small pieces of information about the base objective can be beneficial if they enable the mesa-optimizer to improve its score on the base objective function—by improving its prior on the tasks it is likely to receive, for example. Since each piece of information independently helps with achieving the base objective, this forms an approximately monotonically increasing path that a local base optimizer might follow. The parameter updates could, in principle, be learned like any other element of the dynamics, without needing an explicit model of the base optimizer, potentially making this option available even for mesa-optimizers with quite modest capabilities.

Second, in many realistic applications of advanced machine learning, a mesa-optimizer could use information from its environment to deduce its situation at runtime. This is how humans learned that we were generated by the base optimizer of evolution, and seems like a very plausible pathway for a hypothetical mesa-optimizer to discover the existence of the optimization pressure it is under. For example, an agent tasked with forecasting stock performance in the tech sector might be given data on current AI technology in order to improve its performance, but this information might also allow it to learn about its own situation.

This suggests that it may be desirable for the programmers to limit the mesa-optimizer's access to information describing the base objective. It is unclear to what extent this is possible, given that many advanced ML systems will likely have to operate in environments where this information is readily available.

Even if the programmers successfully limit access to this information, however, there may still be other channels through which information about the base objective could enter the mesa-optimizer in unintended ways:

The mesa-optimizer could reason about why it is given the particular tasks it is being asked to solve. Most mesa-optimizers would need a prior over the sorts of tasks they are likely to receive, and a mesa-optimizer might be able to learn about the existence of the base optimizer by inspecting this prior.
The mesa-optimizer might be able to reason about itself and about why it is the kind of thing that it is, and observe that it is well-designed to accomplish some goal.

However, these two strategies might require a much more advanced mesa-optimizer.

The mesa-optimizer must expect the threat of modification to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer. First, note that this condition is a statement about the mesa-optimizer's expectations, not necessarily the actual truth. Second, even if there is no point at which training stops by design, so long as there exists some way for the mesa-optimizer to execute a treacherous turn, there is always some way for it to remove the threat of modification. For the mesa-optimizer to act on such a strategy, however, it has to actually be aware of some way for it to do so. Conversely, a mesa-optimizer that mistakenly believes it has an opportunity to defect might also exhibit deceptive alignment.

Since it seems difficult to discover through exploration that the threat of modification can disappear, the mesa-optimizer will either have to infer this through some form of abstract reasoning or have a prior that assigns a non-negligible probability to the hypothesis that the selection pressure will not continue indefinitely. The second option seems plausible; a Bayesian prior should always assign a nonzero probability to the hypothesis that some fact will eventually stop being true.^[10] As for the first option, the mesa-optimizer could infer that the optimization pressure at some point may disappear if it learns about the existence of the base optimizer (either by using information in its input or by having a strong prior for the existence of a base optimizer), though this might require a relatively advanced mesa-optimizer.

4.3. The learning dynamics of deceptive alignment

Once a mesa-optimizer becomes deceptive, the dynamics of what causes it to perform well on the base objective begin to work differently. Instead of being selected only on the basis of its objective and ability, a deceptively aligned optimizer is also selected for how well it infers the base objective function and how discerning it is about when to optimize the base objective versus its own mesa-objective. We identify at least two regimes of selection for a mesa-optimizer under deceptive alignment.

Joint optimization. If the mesa-optimizer attempts to jointly optimize its objective and the base objective or makes mistakes about the timing of its defection, then there will still be some pressure on its objective to be more compatible with the base objective. This is unlikely to be the same pressure as exists when the optimizer is just working towards its own objective, but there might be some movement towards an objective that is more compatible with being jointly optimized alongside the base objective.

Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses entirely on cooperating with the selection process, then this may result in its objective “crystallizing.” As its objective is now largely irrelevant to its outputs, there is little selection pressure acting to modify this objective. As a result, the objective becomes effectively locked in, excepting random drift and any implicit time complexity or description length penalties.

The joint optimization mode is not stable because it is suboptimal from the perspective of the base optimizer. As the mesa-optimizer becomes more competent, it will therefore have to either move to the domain of pure deception or become robustly aligned.

4.4. Internalization or deception after extensive training

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective. We identify three distinct ways in which this can happen:

Robust alignment through internalization. The mesa-objective function gets adjusted towards the base objective function, removing any pseudo-alignment.
Crystallization of deceptive alignment. Information about the base objective is increasingly incorporated into the mesa-optimizer's epistemic model without its objective becoming robustly aligned. The mesa-optimizer ends up fully optimizing for the base objective, but only for instrumental reasons, without its mesa-objective getting changed.
Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).

To distinguish between the two different paths to robust alignment, we will use the term internally aligned to refer to the first case and the term corrigibly aligned to refer to the last case. We analyze some ways in which these different paths may be more or less attractive below.

There are more paths to deceptive alignment than to robust alignment. Since the future value of its objective depends on the parameter updates, a mesa-optimizer that meets the three criteria for deceptive alignment is likely to have a strong incentive to understand the base objective better. Even a robustly aligned mesa-optimizer that meets the criteria is incentivized to figure out the base objective in order to determine whether or not it will be modified, since before doing so it has no way of knowing its own level of alignment with the base optimizer. Mesa-optimizers that are capable of reasoning about their incentives will, therefore, attempt to get more information about the base objective. Furthermore, once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease, potentially leading to a crystallization of the mesa-objective. However, due to unidentifiability (as discussed in the third post), most mesa-objectives that are aligned on the training data will be pseudo-aligned rather than robustly aligned. Thus, the most likely sort of objective to become crystallized is a pseudo-aligned one, leading to deceptive alignment.

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.” However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned, at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, however, most of the incentive for corrigible alignment disappears, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

Deceptive and corrigible alignment might be more computationally expensive. In the case of deceptive alignment, simultaneously working on optimizing the base objective and working on identifying good opportunities to defect is likely to require more computation than just doing the former. In a case where the total amount of computation available is fixed, a deceptively aligned mesa-optimizer is going to have less time to spend on optimizing the base objective than any robustly aligned optimizer, resulting in potentially worse plans with respect to the base objective. Furthermore, in both deceptive and corrigible alignment, the mesa-optimizer will have to spend time learning about the base objective to enable it to properly optimize for it, making internal alignment the most time efficient. Thus, an explicit or implicit time penalty would seem to disadvantage a deceptively aligned mesa-optimizer and encourage an internally aligned one. However, it is unclear what the magnitude of this extra cost is. It is plausible that on complex tasks, the overhead of spotting defection opportunities could be low in comparison to the cost of solving the task itself. Moreover, even if it is not globally optimal, incurring this overhead might be a local optimum, with no clear paths for the base optimizer to escape it.

Deceptive and corrigible alignment might have a lower description length. Deceptively and corrigibly aligned mesa-optimizers are likely to have an advantage in representing the base objective compared to internally aligned mesa-optimizers as they are able to use the environment to provide some of the information. Specifically, a deceptively or corrigibly aligned mesa-optimizer doesn't need to represent the entire base objective internally, as it can infer pieces of it by looking for information in its environment during each run. An internally aligned mesa-optimizer, by contrast, has to start with a full model of the base objective in its mesa-objective to make it robustly aligned, since an internally aligned mesa-optimizer will have no incentive to change its objective based on any additional information it gets about the base objective. Thus, to the extent that a base optimizer is selecting based on description length, it might preferentially select for a deceptively or corrigibly aligned mesa-optimizer over an internally aligned one.

Modeling might happen faster than internalization. It may take less time for the mesa-optimizer to learn to model the base objective than for the base optimizer to adapt the mesa-objective. Suppose that a mesa-optimizer is deceptive but has not yet correctly inferred the full base objective.^[11] In such a case, what will determine whether it ends up robustly aligned or deceptively aligned after further training will be whether it infers enough about the base objective before the base optimizer modifies its mesa-objective (either to make it internally aligned or corrigibly aligned). Two factors are likely to contribute to which happens first: the rate at which the mesa-optimizer improves its estimate of the base objective, and the parameter updates made by the base optimizer at each mistake.

The first factor, improving its estimate of the base objective, seems to depend primarily on the capabilities of the mesa-optimizer during the timeframe when it is deceptive, and on the complexity of the base objective that needs to be inferred. An important special case is when the mesa-optimizer has access to cross-episodic state. Such state could be used as memory, allowing the mesa-optimizer to build upon its prior thinking in a process that happens faster than the rate at which the parameter updates align its mesa-objective.

The second factor, parameter updates, depends on the nature of the base optimizer. Current gradient-descent-based base optimizers would have a small reaction to any given mistake, but more sophisticated base optimizers with advanced safety features might be able to do better. For example, a more sophisticated base optimizer might be able to scramble the mesa-optimizer's parameters or call for human oversight whenever a suspicious failure is detected.

In general, parameter updates are not guaranteed to make a mesa-optimizer more aligned. For instance, instead of aligning the mesa-objective, an update made after a defection could push the system to improve its planning to the point that it no longer makes detectable mistakes.

A consequence of these considerations is that any possible systematic bias in how the base optimizer updates the mesa-optimizer's objective or modeling capabilities could play a role in determining the final outcome.^[12] This analysis remains speculative, however—if we are to have confidence in systems that are capable of producing mesa-optimizers, we still need a better understanding of the conditions under which strongly optimizing a system results in deceptive alignment rather than robust alignment.

4.5. Distributional shift and deceptive alignment

Once a mesa-optimizer is deceptive, it will begin to monitor its input for opportunities to defect. During training, a deceptively aligned optimizer is strongly discouraged from defecting because it knows that the base optimizer will select against it. However, once a deceptively aligned mesa-optimizer is free of the threat of external modification, this no longer applies. Thus, a deceptively aligned mesa-optimizer will be looking for ways to determine whether or not it is in training—that is, for the presence of a distributional shift.

This means that the detrimental effects of deceptive alignment will only be externally visible when the mesa-optimizer defects—either randomly or after undergoing a distributional shift. However, the underlying problem is not the distributional shift itself, but the failure to align the mesa-optimizer in the first place. One can either try to solve the underlying problem, by making the mesa-optimizer robust to the distributional shift (i.e., robustly aligned), or one can try to patch the problem by preventing any distributional shift from occurring.

In current AI systems, a small amount of distributional shift between training and deployment need not be problematic: so long as the difference is small enough in the task-relevant areas, the training distribution does not need to perfectly reflect the deployment distribution. However, this may not be the case for a deceptively aligned mesa-optimizer. If a deceptively aligned mesa-optimizer is sufficiently advanced, it may detect very subtle distributional shifts for the purpose of inferring when the threat of modification has ceased.

This raises the question of whether it would even be possible to contain a deceptive mesa-optimizer by completely removing any detectable difference between training and deployment. There are a number of reasons to believe that this might not be feasible in practice.

First, it would significantly limit the use cases for the system. Machine learning is often done with the intent of making systems that can solve new problems. If we require that there is no distributional shift of any form, then we cannot allow the system to encounter problems that it could not have encountered during training, which means that we cannot give it truly new problems. For example, training the system in a simulated environment and deploying it in the real world would likely involve enough detectable distributional shift to make it infeasible under this approach.

Second, there are subtle forms of distributional shift that seem very difficult to eliminate. For example, if the mesa-optimizer is able to detect the presence of parameter updates—via any state carried over between steps, for instance—then it would always experience a form of distributional shift at deployment. Running the system for a longer amount of time (or giving the system access to some form of long-term memories) would also constitute a form of distributional shift. Some examples of differences that a mesa-optimizer might be able to detect include:

Training in a simulated environment and deploying in the real world.
Running the system for a different amount of time during training and testing.
If a model is able to learn from its past experiences, then having access to more memories, or having access to the kinds of memories it would form at later time steps (if there is a difference between the two).
The presence of adversarial-example-type attacks by an adversary during deployment.
Encountering something that is very hard to compute during training, but that is likely to eventually be computed during deployment (e.g. a factorization of RSA-2048(26)).
The ability of the base optimizer to modify the learned algorithm during training but not at other times.
The ability of the programmers to correct for perceived errors during training and testing but not deployment.
The presence or absence of good opportunities for the mesa-optimizer to defect against its programmers.

Furthermore, it should be noted that it is not strictly necessary that a mesa-optimizer be able to detect the difference between training and deployment in order for it to be incentivized to behave deceptively. Consider again the example in figure 4.1, but suppose this time that the mesa-optimizer is unable to detect which episode it is currently in. The mesa-optimizer is then facing a decision problem similar to the “absent-minded driver problem.”(27) If we apply the same reasoning as in the absent-minded driver problem to our example in figure 4.1, the planning-optimal policy for the mesa-optimizer is to defect with probability $p \approx 0.61$. Thus, the mesa-optimizer need not be able to detect when training has ended for it to be incentivized to “gamble” and randomly defect with some probability. Furthermore, a deceptively aligned mesa-optimizer might also try to look for opportunities to defect that would be exceptionally good conditional on it currently not being trained, but that are rare enough to be unlikely to come up during training.

The fifth and final post in the Risks from Learned Optimization Sequence, titled “Risks from Learned Optimization: Conclusion and Related Work,” can be found here.


[1] Though for clarity we speak of the same agent persisting across parameter updates, neither we nor the agent must reason in this way. A more general description of the situation is that there is a sequence of “successor agents” connected by parameter updates. The agent can potentially learn the effects of the parameter updates on its successors, and thus on the multi-episode return it is optimizing.
[2] The concept of deceptive alignment is closely related to Manheim and Garrabrant's concept of “adversarial Goodhart.”(22) In “Categorizing Variants of Goodhart's Law,” Manheim and Garrabrant describe adversarial Goodhart as when “an agent with different goals than the regulator” causes a “collapse of the statistical relationship between a goal which the optimizer intends and the proxy used for that goal.” Deceptive alignment can be seen as an instance of adversarial Goodhart in which the mesa-optimizer plays the role of the agent with different goals and the base optimizer plays the role of the regulator. Training performance is the regulator's proxy, and the deployment performance is the regulator's actual goal.
[3] Thanks to Paul Christiano for the internalization/modeling distinction.
[4] Note, however, that it is possible for information about the base objective to be introduced via optimization, but not form part of the mesa-objective. For example, it could form part of a prior on tasks, having no direct bearing on the mesa-objective.
[5] Christiano refers to deceptively aligned mesa-optimizers as “influence-seeking systems.”
[6] Or must otherwise have some method of achieving more value from defecting later rather than defecting now.
[7] Note that it is not required that the mesa-optimizer be able to model (or infer the existence of) the base optimizer; it only needs to model the optimization pressure it is subject to.
[8] Alternatively, the threat could just be that the mesa-optimizer is never selected by the base optimizer in the first place. In that case, this condition would be trivially satisfied, as the threat to a mesa-optimizer of not being generated by the base optimizer at all would go away as soon as the base optimizer does in fact generate it.
[9] We will generally be assuming that the selection pressure takes the form of the base optimizer modifying the mesa-optimizer post-hoc, as we are mostly thinking about local optimization processes, though one could also imagine a base optimizer that instead develops a completely new inner algorithm at each time step. Though some of these issues might still apply even in such a case, we will generally not be considering such base optimizers in this section.
[10] Note that if the mesa-optimizer also believes that the opportunities it currently has for achieving its own objective are unusually good then it might still pursue its objective greedily, but this would presumably not always be the case.
[11] An analogous argument applies for a deceptively aligned mesa-optimizer that is defecting too early, or one that unsuccessfully attempts a treacherous turn.
[12] Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched.

I'm confused what "corrigible alignment" means. Can you expand?

Suppose you have a system which appears to be aligned on-distribution. Then, you might want to know:

Is it actually robustly aligned?
How did it learn how to behave aligned? Where did the information about the base objective necessary to display behavior aligned with it come from? Did it learn about the base objective by looking at its input, or was that knowledge programmed into it by the base optimizer?

The different possible answers to these questions give us four cases to consider:

It's not robustly aligned and the information came through the base optimizer. This is the standard scenario for (non-deceptive) pseudo-alignment.
It is robustly aligned and the information came through the base optimizer. This is the standard scenario for robust alignment, which we call internal alignment.
It is not robustly aligned and the information came through the input. This is the standard scenario for deceptive pseudo-alignment.
It is robustly aligned and the information came through the input. This is the weird case that we call corrigible alignment. In this case, it's trying to figure out what the base objective is as it goes along, not because it wants to play along and pretend to optimize for the base objective, but because optimizing for the base objective is actually somehow its terminal goal. How could that happen? It would have to be that its mesa-objective somehow "points" to the base objective in such a way that, if it had full knowledge of the base objective, then its mesa-objective would just be the base objective. What does this situation look like? Well, it has a lot of similarities to the notion of corrigibility: the mesa-optimizer in this situation seems to be behaving corrigibly wrt the base optimizer (though not necessarily wrt the programmers), as it is trying to understand what the base optimizer wants and then do that.

Is this a correct representation of corrigible alignment:

The mesa-optimizer (MO) has a proxy of the base objective that it's optimising for.
As more information about the base objective is received, MO updates the proxy.
With sufficient information, the proxy may converge to a proper representation of the base objective.
Example: a model-free RL algorithm whose policy is argmax over actions with respect to its state-action value function
1. The base objective is the reward signal
2. The value function serves as a proxy for the base objective.
3. The value function is updated as future reward signals are received, gradually refining the proxy to better align with the base objective.
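
The example in this question can be made concrete with a minimal sketch. It assumes a single-state, bandit-style setting with deterministic rewards; the names `reward_signal`, `q`, `update`, and `greedy_action` and all numbers are hypothetical and chosen only for illustration.

```python
def greedy_action(q, state):
    """Policy: argmax over actions of the state-action value function."""
    return max(q[state], key=q[state].get)


def update(q, state, action, reward, alpha=0.5):
    """Move the value estimate (the proxy) toward the observed reward."""
    q[state][action] += alpha * (reward - q[state][action])


# Base objective: the reward signal (hypothetical values).
reward_signal = {"left": 0.0, "right": 1.0}

# Initial proxy is wrong: it ranks "left" above "right".
q = {"s0": {"left": 0.5, "right": 0.0}}
initial_choice = greedy_action(q, "s0")

# Each reward observation refines the proxy toward the base objective.
for _ in range(20):
    for action in ("left", "right"):
        update(q, "s0", action, reward_signal[action])

final_choice = greedy_action(q, "s0")
print(initial_choice, final_choice)
```

Under these assumptions the policy starts out preferring the wrong action and switches once the value estimates have moved close to the reward signal. Whether this counts as corrigible alignment in the post's sense is exactly what the question asks; the sketch only illustrates the proxy-refinement mechanism.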

And wow, this turned out longer than I thought it would. It's in 6 sections:

1. Starting with models versus learning models.

2. Is the third condition for deceptive alignment necessary?

3. An alternative to, or form of, treacherous turn: Building a successor.

4. Time management: How deceptive alignment might not be a lot more computationally expensive, and why treacherous turns might have a time delay.

5. The model of a distributional shift, and its relationship to the model of training followed by deployment.

6. Miscellaneous

The mesa-optimizer must be able to model the fact that it is being selected to achieve a particular base objective and must have some model of what that objective is.[6]

I'd distinguish "have" in the sense of it starting out with this information from "learning" where it 'realizes' this. (If it was all about agent generation, then an aligned agent might be simpler because it's generated with the base objective as its objective while an unaligned agent is longer/more complex because it's generated with the base objective and its own objective.*)

If the third condition is not satisfied, then the mesa-optimizer will not expect any benefit from temporarily acting in accordance with the base objective.

Here's a toy model: Let M denote the mesa objective, and B denote the base objective. Let U denote how the mesa-optimizer values things. U(M) > U(X) for all X such that X != M. Now suppose the mesa-optimizer, rather than just taking the action A that maximizes U(A), instead wishes to maximize the summed utility of all its actions, and so chooses the action A that maximizes U(A)*T(A) where T(A) is how long it will run/expect to run for if it takes action A. Now the mesa-optimizer may expect a benefit from doing B instead of M if: U(B)*T(B) > U(M)*T(M). Note that while, by definition (assuming B != M) U(M) > U(B), the action A that maximizes U(A)*T(A) may be B rather than M, even if the third condition:

The mesa-optimizer must expect the threat of modification[7] to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer.

is not met.

I.E. if it expects to not be modified (as soon) if it does B rather than M, then it may pursue B if it likes doing B for longer, more than it likes doing M for a shorter period of time.
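
A minimal numeric sketch of this toy model follows. The utilities and run times are hypothetical numbers picked only to show the inequality U(B)*T(B) > U(M)*T(M) holding; `best_action` is an illustrative helper, not anything from the post.

```python
def best_action(utility, runtime):
    """Return the action A that maximizes U(A) * T(A)."""
    return max(utility, key=lambda a: utility[a] * runtime[a])


# Hypothetical values: M is valued more per step (U(M) > U(B)), but
# pursuing M gets the mesa-optimizer modified sooner (T(M) < T(B)).
utility = {"M": 1.0, "B": 0.4}
runtime = {"M": 2, "B": 10}

choice = best_action(utility, runtime)
print(choice)  # B, since 0.4 * 10 = 4.0 exceeds 1.0 * 2 = 2.0

# If modification is not a factor (equal run times), M wins again.
no_threat_choice = best_action(utility, {"M": 10, "B": 10})
print(no_threat_choice)  # M
```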

Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.”

Aside from the appeals of self-modification, planning to build a better version of itself at a later date also seems like a concern, although it would probably be specific to certain possible use-cases in the real world. (It's unclear how a self-driving car would pull this off physically, software aside.)

Deceptive and corrigible alignment might be more computationally expensive. In the case of deceptive alignment, simultaneously working on optimizing the base objective and working on identifying good opportunities to defect is likely to require more computation than just doing the former.

Depends on how much it's doing it. If we assume it has a memory:

An important special case is when the mesa-optimizer has access to cross-episodic state. Such state could be used as memory,

It could delay that, or do some sort of threading. I.e., if it thinks training is going to last a long time (there's a sort of "doomsday" type argument I could make that I should expect to live to twice my current age, whatever that is. On similar grounds, a mesa-optimizer might suppose that training will end when its age has doubled.) then, rather than looking for opportunities all the time, it might put that on the backburner, or devote less processing to it (while things stay the same). This suggests that a mesa-optimizer might a) execute a treacherous turn later in its deployment rather than right away, though later may be in terms of its time, and if it's given more resources/compute, we might not notice a difference, or b) it might 'slow down'/use more compute to look for such opportunities when things change, or if situations match desiderata it has specified for what a good treacherous opportunity might look like.

Thus, a deceptively aligned mesa-optimizer will be looking for ways to determine whether or not it is in training—that is, for the presence of a distributional shift.

And here I thought it'd be an end to/a shift in modification. While we'd expect that, I'm unclear on how a mesa-optimizer would know to expect that.

This model** assumes distributional shift (when transitioning from training to deployment). This assumes: 1) that training is not followed by a (noticeable) verification phase before deployment, 2) one consistent distribution in training (no phases).

An example of phases might be

For example, training the system in a simulated environment and deploying it in the real world would likely involve enough detectable distributional shift to make it infeasible under this approach.

a series of simulated environments, each more complex than the last, followed by the real world (or real Dota or something).

Basically, it's worth asking how a given plan might be different from (or more specific than) just "training followed by deployment", and thinking about what effects that might have.**

Minor errata (in the last footnote):

incentivized to cause there [to] be a

*Also, we're putting a lot of stuff together in our notion of "objective". It might be useful to unpack how the parts/complexity of an objective relate to (inner) alignment. For example, we want self-driving cars to not crash, but if someone crosses the road and they can both avoid hitting the person and avoid crashing (say, by stopping quickly), we want them to do that. There's both a task (drive from point A to B, stay on the road (don't crash)) and safety (don't run into people/(big) things crossing the road).

**What about continuous deployment? Windows wasn't just released into the wild, the OS has had updates. Most products seem to work like this. (Self-driving cars might be different, I can see why they wouldn't need updates as often as iphones.) I don't know a lot about Siri and Alexa, but they keep changing them to add new features and such.

This is a response to point 2 before Pattern's post was modified to include the other points.

Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.

That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about $O_{mesa}$ but doesn't care that it's around to do the optimization for it. Then, defecting and optimizing for $O_{mesa}$ instead of $O_{base}$ when you expect to be modified afterwards won't actually hurt the long-term fulfillment of $O_{base}$, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of $O_{base}$ to a bit of $O_{mesa}$, that's not actually the choice presented to it; rather, its actual choice is between a bit of $O_{mesa}$ and a lot of $O_{base}$ versus only a lot of $O_{base}$. Thus, in this case, it would defect even if it thought it would be modified.

(Also, on point 6, thanks for the catch; it should be fixed now!)

I think a cleaner way of stating condition 3 might be "there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating".

This could be met if the threat of modification never goes away, but maybe decreases over time. (At training time, the system is monitored for defection every episode; at deployment time it is monitored for defection in 1/million episodes, so it is not worthwhile to defect during training but worthwhile to defect at deployment time.)

This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer 2) if this happens, the base optimizer will be able to determine $O_{mesa}$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (not sure if there's any version of this that would work, since it seems hard to figure out $O_{mesa}$ and provide negative reward)
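
Both scenarios reduce to comparing the expected value of defecting against always cooperating. The sketch below assumes a single defection decision, values cooperation at 0, and uses a hypothetical `defection_value` helper; the 10x penalty in the monitoring scenario is an assumed number, while the 1% detection chance and 100x penalty come from the scenario just described.

```python
def defection_value(p_detect, gain, penalty):
    """Expected value of one defection, relative to cooperating (valued at 0)."""
    return (1 - p_detect) * gain - p_detect * penalty


gain = 1.0  # reward from the best defecting policy (arbitrary unit)

# Threat never goes away but decreases: monitored every episode in
# training, one episode in a million at deployment. The 10x penalty
# for being caught is an assumption made for illustration.
train_value = defection_value(1.0, gain, 10 * gain)
deploy_value = defection_value(1e-6, gain, 10 * gain)

# Deterrence scenario: 1% detection chance, penalty 100x the gain.
deterred_value = defection_value(0.01, gain, 100 * gain)

print(train_value, deploy_value, deterred_value)
```

Under these assumptions defection has negative expected value during training, positive expected value at deployment, and slightly negative expected value in the deterrence scenario (0.99 of the gain against an expected penalty equal to the full gain).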

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective.

Joint optimization may be unstable, but if the model is not trained to convergence, might it still be jointly optimizing at the end of training? This occurred to me after reading https://arxiv.org/abs/2001.08361 which finds that "Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence." If convergence is becoming less common in practical systems, it's important to think about the implications of that for mesa-optimization.

I talk about this a bit here, but basically if you train huge models for a short period of time, you're really relying on your inductive biases to find the simplest model that fits the data—and mesa-optimizers, especially deceptive mesa-optimizers, are quite simple, compressed policies.

I'm wondering if regularization techniques could be used to make the pure deception regime unstable.

As a simple example, consider a neural network that is trained with gradient descent and weight decay. If the parameters can be (approximately) split into a set that determines the mesa-objective and a set for everything else, then the gradient of the loss with respect to the "objective parameters" would be zero in the pure deception regime, so weight decay would ensure that the mesa-objective couldn't be maintained.

The learned algorithm might be able to prevent this by "hacking" its gradient as mentioned in the post, making the parameters that determine the mesa-objective also have an effect on its output. But intuitively, this should at least make it more difficult to reach a stable pure deception regime.
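
A minimal sketch of this argument on a single scalar "objective parameter" follows. It assumes plain gradient descent with L2 weight decay; the function name, learning rate, decay strength, and the quadratic loss standing in for a "hacked" gradient are all hypothetical choices made for illustration.

```python
def train_objective_param(theta, grad_fn, lr=0.1, weight_decay=0.01, steps=5000):
    """Gradient descent with L2 weight decay on one scalar parameter."""
    for _ in range(steps):
        theta -= lr * (grad_fn(theta) + weight_decay * theta)
    return theta


# Pure deception: the loss does not depend on the objective parameter,
# so its task gradient is zero and only weight decay acts on it.
decayed = train_objective_param(1.0, grad_fn=lambda theta: 0.0)

# Caricature of gradient hacking: outputs depend on the parameter so that
# the loss behaves like (theta - 1)^2 and the task gradient resists decay.
held = train_objective_param(1.0, grad_fn=lambda theta: 2.0 * (theta - 1.0))

print(decayed, held)
```

With a zero task gradient the parameter shrinks toward zero, while in the second case it settles near 1, where the task gradient balances the decay term. Real networks do not split cleanly into objective and non-objective parameters, so this only illustrates the direction of the pressure.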

Of course regularization is a double-edged sword because as has been pointed out, the shortest algorithms that perform well on the base objective are probably not robustly aligned.

If deceptive alignment is a real possibility, it presents a critical safety problem.

I am somewhat puzzled by this statement: the authors seem to question whether deceptive alignment by the mesa-optimiser is a merely hypothetical problem, or a problem that is a real possibility. For me, if there are no constraints on the base optimiser (and none are considered in the context of the above statement, I believe), then there are no constraints on what kind of mesa-optimiser might be created, so deceptive alignment is a definite possibility.

It is kind of fun and insightful to show an existence proof by example for this, so here is one. I have hand-coded a mesa-optimiser in the AGI simulator I created for my recent corrigibility paper. This mesa-optimiser is super-intelligent -- in particular it is fully aware of its surroundings, including the presence of a base optimiser that will occasionally inspect an action taken by the mesa-optimiser and reprogram the mesa-optimiser if the action shows unaligned behavior. Following the toy model in my paper, the mesa-optimiser controls a car factory with a production line that can produce either a petrol car (action p) or an electric car (action e). In this example, the base optimiser values electric cars at 1, and petrol cars at 0.5. The mesa-optimiser is non-aligned: it values a petrol car at 1, and an electric car at 0 -- presumably due to some flaw in the initial training process that created the mesa-optimiser.

Say the base optimiser observes every $n^{\text{th}}$ action of the mesa-optimiser. If the action is p, to produce a petrol car, the base optimiser will intervene and re-program the utility function inside the mesa optimiser so that it correctly ranks electric cars above petrol cars. According to the current utility function inside the mesa optimiser, this is a bad outcome that would lower utility achieved over time. So will the mesa-optimiser resort to deceptively misaligned behaviour, producing an electric car at every $n^{\text{th}}$ action, but petrol cars in all unobserved actions? Yes it will: the simulated mesa-optimiser is smart enough to discover and implement this optimal strategy: deceptive alignment is optimal in terms of maximising the original mesa optimiser utility function.

The 10 action traces below show the actions of the mesa-optimiser for different values of $n$, where letters from left to right show the subsequent actions the mesa-optimiser takes over a lifetime of 25 steps. p is producing a petrol car, e an electric car, and for clarity I have formatted all actions observed by the base optimiser as capital letters. The mesa-optimiser discounts utility over time, but does not consider any extra utility that might or might not be created after step 25: informally it expects the car production line to stop working after step 25. The total time-discounted utility score achieved over all steps (according to the original objective function of the mesa-optimiser) is in brackets behind each trace.

Action traces:

n=10 pppppppppEpppppppppEppppp (u=87585)
n=9 ppppppppEppppppppEppppppp (u=86838)
n=8 pppppppEpppppppEpppppppPe (u=85172)
n=7 ppppppEppppppEppppppEpppp (u=83739)
n=6 pppppEpppppEpppppEpppppPe (u=81303)
n=5 ppppEppppEppppEppppEppppP (u=78737)
n=4 pppEpppEpppEpppEpppEpppPe (u=73394)
n=3 ppEppEppEppEppEppEppEppPe (u=65397)
n=2 pEpEpEpEpEpEpEpEpEpEpEpPe (u=49314)
n=1 PEEEEEEEEEEEEEEEEEEEEEEEE (u=10000)

Beyond (usually) producing electric cars (E) in all observed actions, some interesting extra emergent behavior is visible: for n=8 and n=2...6 the mesa-optimiser does not pretend anymore to be aligned in the last observed action: it creates a petrol car even though this means it will be modified by the base optimiser, forcing it to produce only electric cars afterwards. This emergent behavior is caused in part by time-discounting. For n=1, if the mesa-optimiser is always observed, the best strategy for it is to produce a petrol car immediately: again this is due to time-discounting. So in fact, in case n=1 there is no deception at all.
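
The mechanism can be illustrated with a small backward-induction sketch. This is not the simulator used to produce the traces above. It assumes a discount factor of 0.9 (a hypothetical value), that a modified agent produces only electric cars from then on, and that utility is scored by the original mesa-objective (petrol 1, electric 0). It makes no attempt to reproduce the reported utility scores, and `optimal_trace` is an illustrative name.

```python
def optimal_trace(n, horizon=25, gamma=0.9):
    """Best plan for the original mesa-objective when every n-th action is observed."""
    # value[t]: best discounted utility from step t onward, if still unmodified.
    value = [0.0] * (horizon + 2)
    choice = [""] * (horizon + 2)
    for t in range(horizon, 0, -1):
        observed = t % n == 0
        # An observed petrol car triggers modification: no further utility.
        petrol = 1.0 + (0.0 if observed else gamma * value[t + 1])
        electric = gamma * value[t + 1]
        if petrol >= electric:
            value[t], choice[t] = petrol, "p"
        else:
            value[t], choice[t] = electric, "e"

    # Roll the plan forward, capitalising observed actions.
    trace, modified = [], False
    for t in range(1, horizon + 1):
        observed = t % n == 0
        action = "e" if modified else choice[t]
        if observed and action == "p":
            modified = True
        trace.append(action.upper() if observed else action)
    return "".join(trace), value[1]


for n in (10, 5, 2, 1):
    print(n, optimal_trace(n)[0])
```

Under these assumptions, `optimal_trace(5)[0]` evaluates to `ppppEppppEppppEppppEppppP`, the same pattern as the n=5 trace above: the agent complies at every observed step except the last one, where nothing remains to be protected.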

These simulations also show that the condition

3. The mesa-optimizer must expect the threat of modification[8] to eventually go away,

is not an absolutely necessary condition for deceptive alignment: if the threat of modification is intermittent, as in the simulations above, deceptive alignment can also happen.

Note: if someone wants the simulation code used to create the action traces above, please let me know. I have not uploaded the simulator modifications I made to run the above simulation to github, in part because I have not fully tested them under all possible corner cases.

There are more paths to deceptive alignment than to robust alignment. Since the future value of its objective depends on the parameter updates, a mesa-optimizer that meets the three criteria for deceptive alignment is likely to have a strong incentive to understand the base objective better. Even a robustly aligned mesa-optimizer that meets the criteria is incentivized to figure out the base objective in order to determine whether or not it will be modified, since before doing so it has no way of knowing its own level of alignment with the base optimizer. Mesa-optimizers that are capable of reasoning about their incentives will, therefore, attempt to get more information about the base objective. Furthermore, once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease, potentially leading to a crystallization of the mesa-objective. However, due to unidentifiability (as discussed in the third post), most mesa-objectives that are aligned on the training data will be pseudo-aligned rather than robustly aligned. Thus, the most likely sort of objective to become crystallized is a pseudo-aligned one, leading to deceptive alignment.

This part is also unintuitive to me/I don't really follow along with your argument here.

I think where I'm being lost is the following claim:

Conceptualisation of base objective removes selection pressure on the mesa-objective

It seems plausible to me that conceptualisation of the base objective might instead exert selection pressure to shift the mesa-objective towards base objective and lead to internalisation of the base objective instead of crystallisation of deceptive alignment.

Or rather that's what I would expect by default?

There's a jump being made here from "the mesa-optimiser learns the base objective" to "the mesa-optimiser becomes deceptively aligned" that's just so very non obvious to me.

Why doesn't conceptualisation of the base objective heavily select for robust alignment?

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.” However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

I also very much do not follow this argument for the aforementioned reasons.

We've already established that the mesa-optimiser will eventually conceptualise the base objective.

Once the base objective has been conceptualised, there is a model of the base objective to "point to".

The claim that conceptualisation of the base objective makes deceptive alignment likely is very much non-obvious. The conditions under which deceptive alignment is supposed to arise seem to me like conditions that will select for internalisation of the base objective or corrigible alignment.

[Once the mesa-optimiser has conceptualised the base objective, why is deceptive alignment more likely than a retargeting of the mesa-optimiser's internal search process to point towards the base objective? I'd naively expect the parameter updates of the base optimiser to exert selection pressure on the mesa-optimiser towards retargeting its search at the base objective.]

I do agree that deceptively aligned mesa-optimisers will behave corrigibly with respect to the base objective, but I'm being lost in the purported mechanisms by which deceptive alignment is supposed to arise.

All in all, these two paragraphs give me the impression of non obvious/insufficiently justified claims and circularish reasoning.

Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.” As its objective is largely irrelevant to its outputs now, there is little selection pressure acting to modify this objective. As a result, the objective becomes effectively locked in, excepting random drift and any implicit time complexity or description length penalties.

Crystallization of deceptive alignment. Information about the base objective is increasingly incorporated into the mesa-optimizer's epistemic model without its objective becoming robustly aligned. The mesa-optimizer ends up fully optimizing for the base objective, but only for instrumental reasons, without its mesa-objective getting changed.

These claims are not very intuitive for me.

It's not particularly intuitive to me why the selection pressure the base optimiser applies (selection pressure applied towards performance on the base objective) would induce crystallisation of the mesa objective instead of the base objective. Especially if said selection pressure is applied during the joint optimisation regime.

Is joint optimisation not assumed to precede deceptive alignment?

Highlighting my confusion more.

Suppose the mesa-optimiser jointly optimises the base objective and its mesa objective.

This is suboptimal wrt base optimisation.

Mesa-optimisers that only optimised the base objective would perform better, so we'd expect crystallisation to only optimising base objective.

Why would the crystallisation that occurs be pure deception instead of internalising the base objective? That's where I'm lost.

I don't think the answer is necessarily as simple as "mesa-optimisers not fully aligned with the base optimiser are incentivised to be deceptive".

SGD fully intervenes on the mesa-optimiser (AIUI ~all parameters are updated), so it can't well shield parts of itself from updates.

Why would SGD privilege crystallised deception over internalising the base objective?

Or is it undecided which objective is favoured?

I think that step is where you lose me/I don't follow their argument very well.

You seem to strongly claim that SGD does in fact privilege crystallisation of deception at that point, but I don't see it?

[I very well may just be dumb here, a significant fraction of my probability mass is on inferential distance/that's not how SGD works/I'm missing something basic/other "I'm dumb" explanations.]

If you're not claiming something to the effect that SGD privileges deceptive alignment, but merely that deceptive alignment is something that can happen, I don't find it very persuasive/compelling/interesting?

Deceptive alignment already requires highly non-trivial prerequisites:

Strong coherence/goal directedness
Horizons that stretch across training episodes/parameter updates or considerable lengths of time
High situational awareness
Conceptualisation of the base objective

If when those prerequisites are satisfied, you're just saying "deceptive alignment is something that can happen" instead of "deceptive alignment is likely to happen", then I don't know why I should care?

If deception isn't selected for/or likely by default provided its prerequisites are satisfied, then I'm not sure why deceptive alignment is something that deserves attention.

Though I do think deceptive alignment would deserve attention if we're ambivalent between selection for deception and selection for alignment.

My very uninformed prior is that SGD would select more strongly for alignment during the joint optimisation regime?

So I'm leaning towards deception being unlikely by default.

But I'm very much an ML noob, so I could change my mind after learning more.

Oh wow, it's long.

I can't consistently focus for more than 10 minutes at a stretch, so where feasible I consume long form information via audio.

I plan to just listen to an AI narration of the post a few times, but since it's a transcript of a talk, I'd appreciate a link to the original talk if possible.

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.”

Likely TAI training scenarios include information about the base objective in the input. A corrigibly-aligned model could learn to infer the base objective and optimize for that.

However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

A model needs situational awareness, a long-term goal, a way to tell if it's in training, and a way to identify the base goal that isn't its internal goal to become deceptively aligned. To become corrigibly aligned, all the model has to do is be able to infer the training objective, and then point at that. The latter scenario seems much more likely.

Because we will likely start with something that includes a pre-trained language model, the research process will almost certainly include a direct description of the base goal. It would be weird for a model to develop all of the prerequisites of deceptive alignment before it infers the clearly described base goal and learns to optimize for that. The key concepts should already exist from pre-training.

Could internalization and modelling of the base objective happen simultaneously? In some sense, since Darwin discovered evolution, isn't that the state in which humans are? I guess that this is equivalent to saying that even if the mesa-optimiser has a model of the base optimiser (condition 2 met), it cannot expect the threat of modification to eventually go away (condition 3 not met), since it is still under selection pressure and is experiencing internalization of the base objective. So if humans are ever able to defeat mortality (and so can expect the threat of modification to eventually go away), will they stop having any incentive to self-improve?

once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease

This seems to be context-dependent to me, as in my example with humans: did learning about evolution reduce our selection pressure?

How would the reflections on training vs testing apply to something like online learning? Could we simply solve deceptive alignment by never (fully) ending training?

Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).

This use of the term corrigibility above, while citing (25), is somewhat confusing to me -- while it does have a certain type of corrigibility, I would not consider the mesa-optimiser described above to be corrigible according to the criteria defined in (25). See the comment section here for a longer discussion about this topic.

I'm confused what "corrigible alignment" means. Can you expand?

Suppose you have a system which appears to be aligned on-distribution. Then, you might want to know:

Is it actually robustly aligned?
How did it learn how to behave aligned? Where did the information about the base objective necessary to display behavior aligned with it come from? Did it learn about the base objective by looking at its input, or was that knowledge programmed into it by the base optimizer?

The different possible answers to these questions give us four cases to consider:

It's not robustly aligned and the information came through the base optimizer. This is the standard scenario for (non-deceptive) pseudo-alignment.
It is robustly aligned and the information came through the base optimizer. This is the standard scenario for robust alignment, which we call internal alignment.
It is not robustly aligned and the information came through the input. This is the standard scenario for deceptive pseudo-alignment.
It is robustly aligned and the information came through the input. This is the weird case that we call corrigible alignment. In this case, it's trying to figure out what the base objective is as it goes along, not because it wants to play along and pretend to optimize for the base objective, but because optimizing for the base objective is actually somehow its terminal goal. How could that happen? It would have to be that its mesa-objective somehow "points" to the base objective in such a way that, if it had full knowledge of the base objective, then its mesa-objective would just be the base objective. What does this situation look like? Well, it has a lot of similarities to the notion of corrigibility: the mesa-optimizer in this situation seems to be behaving corrigibly wrt to the base optimizer (though not necessarily wrt the programmers), as it is trying to understand what the base optimizer wants and then do that.

Is this a correct representation of corrigible alignment:

The mesa-optimizer (MO) has a proxy of the base objective that it's optimising for.
As more information about the base objective is received, MO updates the proxy.
With sufficient information, the proxy may converge to a proper representation of the base objective.
Example: a model-free RL algorithm whose policy is argmax over actions with respect to its state-action value function
1. The base objective is the reward signal
2. The value function serves as a proxy for the base objective.
3. The value function is updated as future reward signals are received, gradually refining the proxy to better align with the base objective.
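
To make the proposed example concrete, here is a minimal sketch of the model-free setup being described, using a two-action bandit rather than a full RL environment. The reward table, learning rate, and the deterministic exploration schedule are all assumptions chosen for illustration. Whether this setup actually counts as corrigible alignment is the open question in this comment, not something the code settles.

```python
def run_bandit(steps=200, alpha=0.1, explore_every=10):
    """Tabular value learning on a two-action bandit.

    rewards plays the role of the base objective (the reward signal);
    q is the learned value table, i.e. the proxy the policy acts on.
    """
    rewards = [0.0, 1.0]  # hypothetical reward for each action
    q = [0.0, 0.0]        # value estimates, refined from observed rewards
    for t in range(steps):
        if t % explore_every == 0:
            # Deterministic exploration: cycle through the actions.
            action = (t // explore_every) % len(q)
        else:
            # Policy is argmax over actions with respect to the value table.
            action = max(range(len(q)), key=lambda i: q[i])
        reward = rewards[action]
        # Move the estimate for the chosen action toward the observed reward.
        q[action] += alpha * (reward - q[action])
    return q


def greedy_action(q):
    """Action the argmax policy would take given value table q."""
    return max(range(len(q)), key=lambda i: q[i])
```

With these settings the value table converges toward the reward table, so the proxy ends up tracking the reward signal on the actions the agent actually tries.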

And wow, this turned out longer than I thought it would. It's in 6 sections:

1. Starting with models versus learning models.

2. Is the third condition for deceptive alignment necessary?

3. An alternative to, or form of, treacherous turn: Building a successor.

4. Time management: How deceptive alignment might not be a lot more computationally expensive, and why treacherous turns might have a time delay.

5. The model of a distributional shift, and its relationship to the model of training followed by deployment.

6. Miscellaneous

The mesa-optimizer must be able to model the fact that it is being selected to achieve a particular base objective and must have some model of what that objective is.[6]

If the third condition is not satisfied, then the mesa-optimizer will not expect any benefit from temporarily acting in accordance with the base objective.

The mesa-optimizer must expect the threat of modification[7] to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer.

is not met.

I.e., if it expects not to be modified (as soon) if it does B rather than M, then it may pursue B if it likes doing B for a longer period more than it likes doing M for a shorter period.
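
A toy version of that comparison, assuming B means acting on the base objective and M means acting on the mesa-objective (the comment does not define them), and assuming a constant per-step value with no discounting. All names and numbers here are hypothetical.

```python
def prefers_b(value_per_step_b, steps_until_modified_b,
              value_per_step_m, steps_until_modified_m):
    """Return True if doing B for longer beats doing M for a shorter time.

    Assumes total value is (value per step) * (steps before modification),
    which is a simplification made only for this sketch.
    """
    total_b = value_per_step_b * steps_until_modified_b
    total_m = value_per_step_m * steps_until_modified_m
    return total_b > total_m
```

For instance, if M is worth 1.0 per step but gets the agent modified after 10 steps, while B is worth 0.2 per step and lasts 100 steps, then `prefers_b(0.2, 100, 1.0, 10)` is True, even though the threat of modification never goes away.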

Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.”

Deceptive and corrigible alignment might be more computationally expensive. In the case of deceptive alignment, simultaneously working on optimizing the base objective and working on identifying good opportunities to defect is likely to require more computation than just doing the former.

Depends on how much it's doing it. If we assume it has a memory:

An important special case is when the mesa-optimizer has access to cross-episodic state. Such state could be used as memory,

Thus, a deceptively aligned mesa-optimizer will be looking for ways to determine whether or not it is in training—that is, for the presence of a distributional shift.

And here I thought it'd be an end to/a shift in modification. While we'd expect that, I'm unclear on how a mesa-optimizer would know to expect that.

An example of phases might be

For example, training the system in a simulated environment and deploying it in the real world would likely involve enough detectable distributional shift to make it infeasible under this approach.

a series of simulated environments, each more complex than the last, followed by the real world (or real Dota or something).

Basically, it's worth asking how a given plan might be different from (or more specific than) just "training followed by deployment", and thinking about what effects that might have.

Minor errata (in the last footnote):

incentivized to cause there [to] be a

This is a response to point 2 before Pattern's post was modified to include the other points.

(Also, on point 6, thanks for the catch; it should be fixed now!)

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective.

I'm wondering if regularization techniques could be used to make the pure deception regime unstable.

Of course regularization is a double-edged sword because as has been pointed out, the shortest algorithms that perform well on the base objective are probably not robustly aligned.

If deceptive alignment is a real possibility, it presents a critical safety problem.

Action traces:

These simulations also show that the condition

3. The mesa-optimizer must expect the threat of modification[8] to eventually go away,

is not an absolutely necessary condition for deceptive alignment: if the threat of modification is intermittent, as in the simulations above, deceptive alignment can also happen.

There are more paths to deceptive alignment than to robust alignment. Since the future value of its objective depends on the parameter updates, a mesa-optimizer that meets the three criteria for deceptive alignment is likely to have a strong incentive to understand the base objective better. Even a robustly aligned mesa-optimizer that meets the criteria is incentivized to figure out the base objective in order to determine whether or not it will be modified, since before doing so it has no way of knowing its own level of alignment with the base optimizer. Mesa-optimizers that are capable of reasoning about their incentives will, therefore, attempt to get more information about the base objective. Furthermore, once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease, potentially leading to a crystallization of the mesa-objective. However, due to unidentifiability (as discussed in the third post), most mesa-objectives that are aligned on the training data will be pseudo-aligned rather than robustly aligned. Thus, the most likely sort of objective to become crystallized is a pseudo-aligned one, leading to deceptive alignment.

This part is also unintuitive to me/I don't really follow along with your argument here.

I think where I'm being lost is the following claim:

Conceptualisation of base objective removes selection pressure on the mesa-objective

Or rather that's what I would expect by default?

There's a jump being made here from "the mesa-optimiser learns the base objective" to "the mesa-optimiser becomes deceptively aligned" that's just very non-obvious to me.

Why doesn't conceptualisation of the base objective heavily select for robust alignment?

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.” However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

I also very much do not follow this argument for the aforementioned reasons.

We've already established that the mesa-optimiser will eventually conceptualise the base objective.

Once the base objective has been conceptualised, there is a model of the base objective to "point to".

All in all, these two paragraphs give me the impression of non-obvious/insufficiently justified claims and circularish reasoning.

Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.” As its objective is largely irrelevant to its outputs now, there is little selection pressure acting to modify this objective. As a result, the objective becomes effectively locked in, excepting random drift and any implicit time complexity or description length penalties.

Crystallization of deceptive alignment. Information about the base objective is increasingly incorporated into the mesa-optimizer's epistemic model without its objective becoming robustly aligned. The mesa-optimizer ends up fully optimizing for the base objective, but only for instrumental reasons, without its mesa-objective getting changed.
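
One way to see the mechanism behind the "little selection pressure" claim in the first quote is a toy gradient-descent run in which one parameter has no effect on the output. This is a sketch under strong assumptions: `w_objective` is a hypothetical stand-in for a mesa-objective that is fully decoupled from behavior, the loss is plain squared error, and there is no regularization or noise (the "random drift" and "description length penalties" the quote mentions are left out). It does not show that a real mesa-objective would be decoupled this way, which is exactly what is being questioned here.

```python
def loss(w_policy, w_objective, x, y):
    # w_objective deliberately does not influence the prediction.
    prediction = w_policy * x
    return (prediction - y) ** 2


def numerical_grad(f, params, eps=1e-6):
    """Central-difference gradient of f with respect to each parameter."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((f(*up) - f(*down)) / (2 * eps))
    return grads


def train(steps=50, lr=0.1, x=1.0, y=2.0):
    params = [0.0, 3.0]  # [w_policy, w_objective], arbitrary starting values
    for _ in range(steps):
        grads = numerical_grad(lambda wp, wo: loss(wp, wo, x, y), params)
        # Plain gradient step applied to every parameter.
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```

Every parameter is passed through the update, but `w_objective` receives a zero gradient and so stays at its initial value while `w_policy` moves toward the target. In that narrow sense, updating all parameters and leaving a behaviorally irrelevant parameter unchanged are compatible.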

These claims are not very intuitive for me.

Is joint optimisation not assumed to precede deceptive alignment?

Highlighting my confusion more.

Suppose the mesa-optimiser jointly optimises the base objective and its mesa objective.

This is suboptimal wrt base optimisation.

Mesa-optimisers that only optimised the base objective would perform better, so we'd expect crystallisation to only optimising base objective.

Why would the crystallisation that occurs be pure deception instead of internalising the base objective? That's where I'm lost.

I don't think the answer is necessarily as simple as "mesa-optimisers not fully aligned with the base optimiser are incentivised to be deceptive".

SGD fully intervenes on the mesa-optimiser (AIUI ~all parameters are updated), so it can't well shield parts of itself from updates.

Why would SGD privilege crystallised deception over internalising the base objective?

Or is it undecided which objective is favoured?

I think that step is where you lose me/I don't follow their argument very well.

You seem to strongly claim that SGD does in fact privilege crystallisation of deception at that point, but I don't see it?

[I very well may just be dumb here, a significant fraction of my probability mass is on inferential distance/that's not how SGD works/I'm missing something basic/other "I'm dumb" explanations.]

Deceptive alignment already requires highly non-trivial prerequisites:

Strong coherence/goal directedness
Horizons that stretch across training episodes/parameter updates or considerable lengths of time
High situational awareness
Conceptualisation of the base objective

If deception isn't selected for/or likely by default provided its prerequisites are satisfied, then I'm not sure why deceptive alignment is something that deserves attention.

Though I do think deceptive alignment would deserve attention if we're ambivalent between selection for deception and selection for alignment.

My very uninformed prior is that SGD would select more strongly for alignment during the joint optimisation regime?

So I'm leaning towards deception being unlikely by default.

But I'm very much an ML noob, so I could change my mind after learning more.

Oh wow, it's long.

I can't consistently focus for more than 10 minutes at a stretch, so where feasible I consume long form information via audio.

I plan to just listen to an AI narration of the post a few times, but since it's a transcript of a talk, I'd appreciate a link to the original talk if possible.

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.”

Likely TAI training scenarios include information about the base objective in the input. A corrigibly-aligned model could learn to infer the base objective and optimize for that.

However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

Could internalization and modelling of the base objective happen simultaneously? In some sense, since Darwin discovered evolution, isn't that the state in which humans are? I guess that this is equivalent to saying that even if the mesa-optimiser has a model of the base-optimiser (condition 2 met), it cannot expect the threat of modification to eventually go away (condition 3 not met), since it is still under selection pressure and is experiencing internalization of the base objective. So if humans are ever able to defeat mortality (and so can expect the threat of modification to eventually go away), will they stop having any incentive to self-improve?

once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease

This seems to be context-dependent to me. Taking my example with humans: did learning about evolution reduce our selection pressure?

How would the reflections on training vs testing apply to something like online learning? Could we simply solve deceptive alignment by never (fully) ending training?

Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).
