*This is the second of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.*

In this post, we consider how the following two components of a particular machine learning system might influence whether it will produce a mesa-optimizer:

**The task:**The training distribution and base objective function.**The base optimizer:**The machine learning algorithm and model architecture.

We deliberately choose to present theoretical considerations for why mesa-optimization may or may not occur rather than provide concrete examples. Mesa-optimization is a phenomenon that we believe will occur mainly in machine learning systems that are more advanced than those that exist today.^{[1]} Thus, an attempt to induce mesa-optimization in a *current* machine learning system would likely require us to use an artificial setup specifically designed to induce mesa-optimization. Moreover, the limited interpretability of neural networks, combined with the fact that there is no general and precise definition of “optimizer,” means that it would be hard to evaluate whether a given model is a mesa-optimizer.

## 2.1. The task

Some tasks benefit from mesa-optimizers more than others. For example, tic-tac-toe can be perfectly solved by simple rules. Thus, a base optimizer has no need to generate a mesa-optimizer to solve tic-tac-toe, since a simple learned algorithm implementing the rules for perfect play will do. Human survival in the savanna, by contrast, did seem to benefit from mesa-optimization. Below, we discuss the properties of tasks that may influence the likelihood of mesa-optimization.

**Better generalization through search.** To be able to consistently achieve a certain level of performance in an environment, we hypothesize that there will always have to be some minimum amount of optimization power that must be applied to find a policy that performs that well.

To see this, we can think of optimization power as being measured in terms of the number of times the optimizer is able to divide the search space in half—that is, the number of bits of information provided.(9) After these divisions, there will be some remaining space of policies that the optimizer is unable to distinguish between. Then, to ensure that all policies in the remaining space have some minimum level of performance—to provide a performance lower bound^{[2]} —will always require the original space to be divided some minimum number of times—that is, there will always have to be some minimum bits of optimization power applied.

However, there are two distinct levels at which this optimization power could be expended: the base optimizer could expend optimization power selecting a highly-tuned learned algorithm, or the learned algorithm could itself expend optimization power selecting highly-tuned actions.

As a mesa-optimizer is just a learned algorithm that itself performs optimization, the degree to which mesa-optimizers will be incentivized in machine learning systems is likely to be dependent on which of these levels it is more advantageous for the system to perform optimization. For many current machine learning models, where we expend vastly more computational resources training the model than running it, it seems generally favorable for most of the optimization work to be done by the base optimizer, with the resulting learned algorithm being simply a network of highly-tuned heuristics rather than a mesa-optimizer.

We are already encountering some problems, however—Go, Chess, and Shogi, for example—for which this approach does not scale. Indeed, our best current algorithms for those tasks involve explicitly making an optimizer (hard-coded Monte-Carlo tree search with learned heuristics) that does optimization work on the level of the learned algorithm rather than having all the optimization work done by the base optimizer.(10) Arguably, this sort of task is only adequately solvable this way—if it were possible to train a straightforward DQN agent to perform well at Chess, it plausibly would *have* to learn to internally perform something like a tree search, producing a mesa-optimizer.^{[3]}

We hypothesize that the attractiveness of search in these domains is due to the diverse, branching nature of these environments. This is because search—that is, optimization—tends to be good at generalizing across diverse environments, as it gets to individually determine the best action for each individual task instance. There is a general distinction along these lines between optimization work done on the level of the learned algorithm and that done on the level of the base optimizer: the learned algorithm only has to determine the best action for a given task instance, whereas the base optimizer has to design heuristics that will hold regardless of what task instance the learned algorithm encounters. Furthermore, a mesa-optimizer can immediately optimize its actions in novel situations, whereas the base optimizer can only change the mesa-optimizer's policy by modifying it ex-post. Thus, for environments that are diverse enough that most task instances are likely to be completely novel, search allows the mesa-optimizer to adjust for that new task instance immediately.

For example, consider reinforcement learning in a diverse environment, such as one that directly involves interacting with the real world. We can think of a diverse environment as requiring a very large amount of computation to figure out good policies before conditioning on the specifics of an individual instance, but only a much smaller amount of computation to figure out a good policy once the specific instance of the environment is known. We can model this observation as follows.

Suppose an environment is composed of different instances, each of which requires a completely distinct policy to succeed in.^{[4]} Let be the optimization power (measured in bits(9)) applied by the base optimizer, which should be approximately proportional to the number of training steps. Then, let be the optimization power applied by the learned algorithm in each environment instance and the total amount of optimization power the base optimizer must put in to get a learned algorithm capable of performing that amount of optimization.^{[5]} We will assume that the rest of the base optimizer's optimization power, , goes into tuning the learned algorithm's policy. Since the base optimizer has to distribute its tuning across all task instances, the amount of optimization power it will be able to contribute to each instance will be , under the previous assumption that each instance requires a completely distinct policy. On the other hand, since the learned algorithm does all of its optimization at runtime, it can direct all of it into the given task instance, making its contribution to the total for each instance simply .^{[6]}

Thus, if we assume that, for a given , the base optimizer will select the value of that maximizes the minimum level of performance, and thus the total optimization power applied to each instance, we get^{[7]}

As one moves to more and more diverse environments—that is, as increases—this model suggests that will dominate , implying that mesa-optimization will become more and more favorable. Of course, this is simply a toy model, as it makes many questionable simplifying assumptions. Nevertheless, it sketches an argument for a pull towards mesa-optimization in sufficiently diverse environments.

As an illustrative example, consider biological evolution. The environment of the real world is highly diverse, resulting in non-optimizer policies directly fine-tuned by evolution—those of plants, for example—having to be very simple, as evolution has to spread its optimization power across a very wide range of possible environment instances. On the other hand, animals with nervous systems can display significantly more complex policies by virtue of being able to perform their own optimization, which can be based on immediate information from their environment. This allows sufficiently advanced mesa-optimizers, such as humans, to massively outperform other species, especially in the face of novel environments, as the optimization performed internally by humans allows them to find good policies even in entirely novel environments.

**Compression of complex policies.** In some tasks, good performance requires a very complex policy. At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.

One way to find a compressed policy is to search for one that is able to use general features of the task structure to produce good behavior, rather than simply memorizing the correct output for each input. A mesa-optimizer is an example of such a policy. From the perspective of the base optimizer, a mesa-optimizer is a highly-compressed version of whatever policy it ends up implementing: instead of explicitly encoding the details of that policy in the learned algorithm, the base optimizer simply needs to encode how to search for such a policy. Furthermore, if a mesa-optimizer can determine the important features of its environment at runtime, it does not need to be given as much prior information as to what those important features are, and can thus be much simpler.

This effect is most pronounced for tasks with a broad diversity of details but common high-level features. For example, Go, Chess, and Shogi have a very large domain of possible board states, but admit a single high-level strategy for play—heuristic-guided tree search—that performs well across all board states.(10) On the other hand, a classifier trained on random noise is unlikely to benefit from compression at all.

The environment need not necessarily be too diverse for this sort of effect to appear, however, as long as the pressure for low description length is strong enough. As a simple illustrative example, consider the following task: given a maze, the learned algorithm must output a path through the maze from start to finish. If the maze is sufficiently long and complicated then the specific strategy for solving this particular maze—specifying each individual turn—will have a high description length. However, the description length of a general optimization algorithm for finding a path through an arbitrary maze is fairly small. Therefore, if the base optimizer is selecting for programs with low description length, then it might find a mesa-optimizer that can solve all mazes, despite the training environment only containing one maze.

**Task restriction.** The observation that diverse environments seem to increase the probability of mesa-optimization suggests that one way of reducing the probability of mesa-optimizers might be to keep the tasks on which AI systems are trained highly restricted. Focusing on building many individual AI services which can together offer all the capabilities of a generally-intelligent system rather than a single general-purpose artificial general intelligence (AGI), for example, might be a way to accomplish this while still remaining competitive with other approaches.(11)

**Human modeling.** Another aspect of the task that might influence the likelihood of mesa-optimization is the presence of humans in the environment.(12) Since humans often act as optimizers, reasoning about humans will likely involve reasoning about optimization. A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself, resulting in a mesa-optimizer. For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.

Thus, tasks that do not benefit from human modeling seem less likely to produce mesa-optimizers than those that do. Furthermore, there are many tasks that might benefit from human modeling that don't explicitly involve modeling humans—to the extent that the training distribution is generated by humans, for example, modeling humans might enable the generation of a very good prior for that distribution.

## 2.2. The base optimizer

It is likely that certain features of the base optimizer will influence how likely it is to generate a mesa-optimizer. First, though we largely focus on reinforcement learning in this sequence, RL is not necessarily the only type of machine learning where mesa-optimizers could appear. For example, it seems plausible that mesa-optimizers could appear in generative adversarial networks.

Second, we hypothesize that the details of a machine learning model's architecture will have a significant effect on its tendency to implement mesa-optimization. For example, a tabular model, which independently learns the correct output for every input, will never be a mesa-optimizer. On the other hand, if a hypothetical base optimizer looks for the program with the shortest source code that solves a task, then it is more plausible that this program will itself be an optimizer.(13) However, for realistic machine learning base optimizers, it is less clear to what extent mesa-optimizers will be selected for. Thus, we discuss some factors below that might influence the likelihood of mesa-optimization one way or the other.

**Reachability.** There are many kinds of optimization algorithms that a base optimizer could implement. However, almost every training strategy currently used in machine learning uses some form of local search (such as gradient descent or even genetic algorithms). Thus, it seems plausible that the training strategy of more advanced ML systems will also fall into this category. We will call this general class of optimizers that are based on local hill-climbing *local optimization processes.*

We can then formulate a notion of *reachability,* the difficulty for the base optimizer to find any given learned algorithm, which we can analyze in the case of a local optimization process. A local optimization process might fail to find a particular learned algorithm that would perform very well on the base objective if the learned algorithm is surrounded by other algorithms that perform poorly on the base objective. For a mesa-optimizer to be produced by a local optimization process, it needs to not only perform well on the base objective, but also be *reachable;* that is, there needs to be a path through the space of learned algorithms to it that is approximately monotonically increasing. Furthermore, the degree to which the path only need be approximate—that is, the degree to which ML training procedures can escape local optima—is likely to be critical, as optimization algorithms are complex enough that it might require a significant portion of the algorithm to be present before performance gains start being realized.

**Algorithmic range.** One key factor likely to determine the reachability of mesa-optimizers is the *algorithmic range* of the learned algorithms—that is, how extensive is the set of algorithms (or how expressive is the model space) capable of being found by the base optimizer. The more extensive a model's algorithmic range, the broader the space of possible learned algorithms, and thus the more likely that it will be able to find one that is a mesa-optimizer, assuming the base optimizer is incentivized to do so. For example, architectures that explicitly give the algorithm access to a wide range of possible computations, such as recurrent neural networks or neural Turing machines,(14) seem more likely to produce mesa-optimizers.

**Inductive biases.** Another important factor is the degree to which the base optimizer is explicitly or implicitly biased in various ways. The nature of these inductive biases will contribute to the likelihood of a mesa-optimizer being selected for. One of the most important kinds of inductive bias is simplicity bias, which would almost certainly be exhibited by almost all base optimizers. We identify three ways in which simplicity bias can manifest itself:

- An explicit penalty due to parameter regularization or architectural constraints such as weight-sharing or sparse connections.
- An implicit bias due to the model architecture. For example, it has been shown that neural networks are more likely to fit a simple function to a set of training data, even when no regularization is used.(15)
- The capacity limitations of the model. The size of a model imposes a hard limit on the complexity of the functions it is able to represent. Thus, to the degree that the base optimizer is selecting based on performance, it will be driven to “squeeze out” as much performance as it can for any given model capacity, leading to a bias in favor of relatively compressed policies.

The more a base optimizer is biased towards simple solutions, the more it will be incentivized to find a compressed policy such as a mesa-optimizer.

The base optimizer could also be biased in other ways. For example, it could be biased towards algorithms with a low time or space complexity. If we view mesa-optimization as delegation of optimization work, a time penalty on the learned algorithm is likely to incentivize the base optimizer to do more precomputation itself by pre-computing relevant heuristics rather than delegating optimization work to the learned algorithm. Thus, we hypothesize that penalizing description length will favor mesa-optimizers while penalizing time complexity will disfavor them. This suggests the following speculative conjecture: neither a minimal-depth nor minimal-size boolean circuit that solves a problem can be a mesa-optimizer.(16)

Lastly, another form of bias that might have particularly interesting effects is the presence of an information funnel. In deep learning, the base optimizer directly selects the values of individual parameters. In biological evolution, by contrast, the base optimizer selects DNA strings, which in turn produce the detailed structure of the brain only indirectly. This implicitly introduces pressure towards compressing the brain's structure. As we noted previously, this might favor the development of mesa-optimizers, as they can be seen as a way of compressing complex policies. Because deep learning algorithms currently select weights directly, there is no such information funnel. In the case of meta-learning over a machine learning model's architecture, however, an information funnel might present itself due to the fact that the meta-learner has to learn a single policy for creating architectures that is capable of solving all of the problems it is presented with.

**Statefulness.** Another factor that could make it easier for a learned algorithm to implement a complex optimization process is the ability of the learned algorithm to save and recall information. The ability to refer back to previous state enables the caching of intermediate results, increasing the architecture's algorithmic range by giving the learned algorithm significantly more time to complete its computations. Statefulness, therefore, favors mesa-optimization by decreasing the implicit penalty on time complexity imposed by enforcing all computations to end when the learned algorithm produces its output. One simple form of this might be a recurrent neural network, which can perform computations with longer time horizons due to its ability to pass intermediate activations across different time steps. An even more extreme version of this might be a network that is able to store state not just across different time steps in the same episode, but also across episodes, enabling it to do more long-term learning, though such an architecture is currently completely hypothetical.

**Hard-coded optimization.** One possible means of alleviating some of these issues might be to include hard-coded optimization where the learned algorithm provides only the objective function and not the optimization algorithm. The stronger the optimization performed explicitly, the less strong the optimization performed implicitly by the learned algorithm needs to be. For example, architectures that explicitly perform optimization that is relevant for the task—such as hard-coded Monte Carlo tree search—might decrease the benefit of mesa-optimizers by reducing the need for optimization other than that which is explicitly programmed into the system.

*The third post in the Risks from Learned Optimization Sequence, titled “The Inner Alignment Problem,” can be found here.*

As of the date of this post. Note that we do examine some existing machine learning systems that we believe are close to producing mesa-optimization in post 5. ↩︎

It is worth noting that the same argument also holds for achieving an average-case guarantee. ↩︎

Assuming reasonable computational constraints.. ↩︎

This definition of is somewhat vague, as there are multiple different levels at which one can chunk an environment into instances. For example, one environment could always have the same high-level features but completely random low-level features, whereas another could have two different categories of instances that are broadly self-similar but different from each other, in which case it's unclear which has a larger . However, one can simply imagine holding constant for all levels but one and just considering how environment diversity changes on that level. ↩︎

Note that this makes the implicit assumption that the amount of optimization power required to find a mesa-optimizer capable of performing bits of optimization is independent of . The justification for this is that optimization is a general algorithm that looks the same regardless of what environment it is applied to, so the amount of optimization required to find an -bit optimizer should be relatively independent of the environment. That being said, it won't be completely independent, but as long as the primary difference between environments is how much optimization they need, rather than how hard it is to do optimization, the model presented here should hold. ↩︎

Note, however, that there will be some maximum simply because the learned algorithm generally only has access to so much computational power. ↩︎

Subject to the constraint that . ↩︎