Conditions for Mesa-Optimization

evhub; Chris van Merwijk; Vlad Mikulik; Joar Skalse; Scott Garrabrant

This is the second of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper.

In this post, we consider how the following two components of a particular machine learning system might influence whether it will produce a mesa-optimizer:

The task: The training distribution and base objective function.
The base optimizer: The machine learning algorithm and model architecture.

We deliberately choose to present theoretical considerations for why mesa-optimization may or may not occur rather than provide concrete examples. Mesa-optimization is a phenomenon that we believe will occur mainly in machine learning systems that are more advanced than those that exist today.^[1] Thus, an attempt to induce mesa-optimization in a current machine learning system would likely require us to use an artificial setup specifically designed to induce mesa-optimization. Moreover, the limited interpretability of neural networks, combined with the fact that there is no general and precise definition of “optimizer,” means that it would be hard to evaluate whether a given model is a mesa-optimizer.

2.1. The task

Some tasks benefit from mesa-optimizers more than others. For example, tic-tac-toe can be perfectly solved by simple rules. Thus, a base optimizer has no need to generate a mesa-optimizer to solve tic-tac-toe, since a simple learned algorithm implementing the rules for perfect play will do. Human survival in the savanna, by contrast, did seem to benefit from mesa-optimization. Below, we discuss the properties of tasks that may influence the likelihood of mesa-optimization.

Better generalization through search. To be able to consistently achieve a certain level of performance in an environment, we hypothesize that there will always have to be some minimum amount of optimization power that must be applied to find a policy that performs that well.

To see this, we can think of optimization power as being measured in terms of the number of times the optimizer is able to divide the search space in half—that is, the number of bits of information provided.(9) After these divisions, there will be some remaining space of policies that the optimizer is unable to distinguish between. Then, to ensure that all policies in the remaining space have some minimum level of performance—to provide a performance lower bound^[2] —will always require the original space to be divided some minimum number of times—that is, there will always have to be some minimum bits of optimization power applied.

However, there are two distinct levels at which this optimization power could be expended: the base optimizer could expend optimization power selecting a highly-tuned learned algorithm, or the learned algorithm could itself expend optimization power selecting highly-tuned actions.

As a mesa-optimizer is just a learned algorithm that itself performs optimization, the degree to which mesa-optimizers will be incentivized in machine learning systems is likely to be dependent on which of these levels it is more advantageous for the system to perform optimization. For many current machine learning models, where we expend vastly more computational resources training the model than running it, it seems generally favorable for most of the optimization work to be done by the base optimizer, with the resulting learned algorithm being simply a network of highly-tuned heuristics rather than a mesa-optimizer.

We are already encountering some problems, however—Go, Chess, and Shogi, for example—for which this approach does not scale. Indeed, our best current algorithms for those tasks involve explicitly making an optimizer (hard-coded Monte-Carlo tree search with learned heuristics) that does optimization work on the level of the learned algorithm rather than having all the optimization work done by the base optimizer.(10) Arguably, this sort of task is only adequately solvable this way—if it were possible to train a straightforward DQN agent to perform well at Chess, it plausibly would have to learn to internally perform something like a tree search, producing a mesa-optimizer.^[3]

We hypothesize that the attractiveness of search in these domains is due to the diverse, branching nature of these environments. This is because search—that is, optimization—tends to be good at generalizing across diverse environments, as it gets to individually determine the best action for each individual task instance. There is a general distinction along these lines between optimization work done on the level of the learned algorithm and that done on the level of the base optimizer: the learned algorithm only has to determine the best action for a given task instance, whereas the base optimizer has to design heuristics that will hold regardless of what task instance the learned algorithm encounters. Furthermore, a mesa-optimizer can immediately optimize its actions in novel situations, whereas the base optimizer can only change the mesa-optimizer's policy by modifying it ex-post. Thus, for environments that are diverse enough that most task instances are likely to be completely novel, search allows the mesa-optimizer to adjust for that new task instance immediately.

For example, consider reinforcement learning in a diverse environment, such as one that directly involves interacting with the real world. We can think of a diverse environment as requiring a very large amount of computation to figure out good policies before conditioning on the specifics of an individual instance, but only a much smaller amount of computation to figure out a good policy once the specific instance of the environment is known. We can model this observation as follows.

Suppose an environment is composed of $N$ different instances, each of which requires a completely distinct policy to succeed in.^[4] Let $P$ be the optimization power (measured in bits(9)) applied by the base optimizer, which should be approximately proportional to the number of training steps. Then, let $x$ be the optimization power applied by the learned algorithm in each environment instance and $f (x)$ the total amount of optimization power the base optimizer must put in to get a learned algorithm capable of performing that amount of optimization.^[5] We will assume that the rest of the base optimizer's optimization power, $P - f (x)$ , goes into tuning the learned algorithm's policy. Since the base optimizer has to distribute its tuning across all $N$ task instances, the amount of optimization power it will be able to contribute to each instance will be $\frac{P - f (x)}{N}$ , under the previous assumption that each instance requires a completely distinct policy. On the other hand, since the learned algorithm does all of its optimization at runtime, it can direct all of it into the given task instance, making its contribution to the total for each instance simply $x$ .^[6]

Thus, if we assume that, for a given $P$ , the base optimizer will select the value of $x$ that maximizes the minimum level of performance, and thus the total optimization power applied to each instance, we get^[7]

$x^{*} = {argmax}_{x} \frac{P - f (x)}{N} + x .$

As one moves to more and more diverse environments—that is, as $N$ increases—this model suggests that $x$ will dominate $\frac{P - f (x)}{N}$ , implying that mesa-optimization will become more and more favorable. Of course, this is simply a toy model, as it makes many questionable simplifying assumptions. Nevertheless, it sketches an argument for a pull towards mesa-optimization in sufficiently diverse environments.

As an illustrative example, consider biological evolution. The environment of the real world is highly diverse, resulting in non-optimizer policies directly fine-tuned by evolution—those of plants, for example—having to be very simple, as evolution has to spread its optimization power across a very wide range of possible environment instances. On the other hand, animals with nervous systems can display significantly more complex policies by virtue of being able to perform their own optimization, which can be based on immediate information from their environment. This allows sufficiently advanced mesa-optimizers, such as humans, to massively outperform other species, especially in the face of novel environments, as the optimization performed internally by humans allows them to find good policies even in entirely novel environments.

Compression of complex policies. In some tasks, good performance requires a very complex policy. At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.

One way to find a compressed policy is to search for one that is able to use general features of the task structure to produce good behavior, rather than simply memorizing the correct output for each input. A mesa-optimizer is an example of such a policy. From the perspective of the base optimizer, a mesa-optimizer is a highly-compressed version of whatever policy it ends up implementing: instead of explicitly encoding the details of that policy in the learned algorithm, the base optimizer simply needs to encode how to search for such a policy. Furthermore, if a mesa-optimizer can determine the important features of its environment at runtime, it does not need to be given as much prior information as to what those important features are, and can thus be much simpler.

This effect is most pronounced for tasks with a broad diversity of details but common high-level features. For example, Go, Chess, and Shogi have a very large domain of possible board states, but admit a single high-level strategy for play—heuristic-guided tree search—that performs well across all board states.(10) On the other hand, a classifier trained on random noise is unlikely to benefit from compression at all.

The environment need not necessarily be too diverse for this sort of effect to appear, however, as long as the pressure for low description length is strong enough. As a simple illustrative example, consider the following task: given a maze, the learned algorithm must output a path through the maze from start to finish. If the maze is sufficiently long and complicated then the specific strategy for solving this particular maze—specifying each individual turn—will have a high description length. However, the description length of a general optimization algorithm for finding a path through an arbitrary maze is fairly small. Therefore, if the base optimizer is selecting for programs with low description length, then it might find a mesa-optimizer that can solve all mazes, despite the training environment only containing one maze.

Task restriction. The observation that diverse environments seem to increase the probability of mesa-optimization suggests that one way of reducing the probability of mesa-optimizers might be to keep the tasks on which AI systems are trained highly restricted. Focusing on building many individual AI services which can together offer all the capabilities of a generally-intelligent system rather than a single general-purpose artificial general intelligence (AGI), for example, might be a way to accomplish this while still remaining competitive with other approaches.(11)

Human modeling. Another aspect of the task that might influence the likelihood of mesa-optimization is the presence of humans in the environment.(12) Since humans often act as optimizers, reasoning about humans will likely involve reasoning about optimization. A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself, resulting in a mesa-optimizer. For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.

Thus, tasks that do not benefit from human modeling seem less likely to produce mesa-optimizers than those that do. Furthermore, there are many tasks that might benefit from human modeling that don't explicitly involve modeling humans—to the extent that the training distribution is generated by humans, for example, modeling humans might enable the generation of a very good prior for that distribution.

2.2. The base optimizer

It is likely that certain features of the base optimizer will influence how likely it is to generate a mesa-optimizer. First, though we largely focus on reinforcement learning in this sequence, RL is not necessarily the only type of machine learning where mesa-optimizers could appear. For example, it seems plausible that mesa-optimizers could appear in generative adversarial networks.

Second, we hypothesize that the details of a machine learning model's architecture will have a significant effect on its tendency to implement mesa-optimization. For example, a tabular model, which independently learns the correct output for every input, will never be a mesa-optimizer. On the other hand, if a hypothetical base optimizer looks for the program with the shortest source code that solves a task, then it is more plausible that this program will itself be an optimizer.(13) However, for realistic machine learning base optimizers, it is less clear to what extent mesa-optimizers will be selected for. Thus, we discuss some factors below that might influence the likelihood of mesa-optimization one way or the other.

Reachability. There are many kinds of optimization algorithms that a base optimizer could implement. However, almost every training strategy currently used in machine learning uses some form of local search (such as gradient descent or even genetic algorithms). Thus, it seems plausible that the training strategy of more advanced ML systems will also fall into this category. We will call this general class of optimizers that are based on local hill-climbing local optimization processes.

We can then formulate a notion of reachability, the difficulty for the base optimizer to find any given learned algorithm, which we can analyze in the case of a local optimization process. A local optimization process might fail to find a particular learned algorithm that would perform very well on the base objective if the learned algorithm is surrounded by other algorithms that perform poorly on the base objective. For a mesa-optimizer to be produced by a local optimization process, it needs to not only perform well on the base objective, but also be reachable; that is, there needs to be a path through the space of learned algorithms to it that is approximately monotonically increasing. Furthermore, the degree to which the path only need be approximate—that is, the degree to which ML training procedures can escape local optima—is likely to be critical, as optimization algorithms are complex enough that it might require a significant portion of the algorithm to be present before performance gains start being realized.

Algorithmic range. One key factor likely to determine the reachability of mesa-optimizers is the algorithmic range of the learned algorithms—that is, how extensive is the set of algorithms (or how expressive is the model space) capable of being found by the base optimizer. The more extensive a model's algorithmic range, the broader the space of possible learned algorithms, and thus the more likely that it will be able to find one that is a mesa-optimizer, assuming the base optimizer is incentivized to do so. For example, architectures that explicitly give the algorithm access to a wide range of possible computations, such as recurrent neural networks or neural Turing machines,(14) seem more likely to produce mesa-optimizers.

Inductive biases. Another important factor is the degree to which the base optimizer is explicitly or implicitly biased in various ways. The nature of these inductive biases will contribute to the likelihood of a mesa-optimizer being selected for. One of the most important kinds of inductive bias is simplicity bias, which would almost certainly be exhibited by almost all base optimizers. We identify three ways in which simplicity bias can manifest itself:

An explicit penalty due to parameter regularization or architectural constraints such as weight-sharing or sparse connections.
An implicit bias due to the model architecture. For example, it has been shown that neural networks are more likely to fit a simple function to a set of training data, even when no regularization is used.(15)
The capacity limitations of the model. The size of a model imposes a hard limit on the complexity of the functions it is able to represent. Thus, to the degree that the base optimizer is selecting based on performance, it will be driven to “squeeze out” as much performance as it can for any given model capacity, leading to a bias in favor of relatively compressed policies.

The more a base optimizer is biased towards simple solutions, the more it will be incentivized to find a compressed policy such as a mesa-optimizer.

The base optimizer could also be biased in other ways. For example, it could be biased towards algorithms with a low time or space complexity. If we view mesa-optimization as delegation of optimization work, a time penalty on the learned algorithm is likely to incentivize the base optimizer to do more precomputation itself by pre-computing relevant heuristics rather than delegating optimization work to the learned algorithm. Thus, we hypothesize that penalizing description length will favor mesa-optimizers while penalizing time complexity will disfavor them. This suggests the following speculative conjecture: neither a minimal-depth nor minimal-size boolean circuit that solves a problem can be a mesa-optimizer.(16)

Lastly, another form of bias that might have particularly interesting effects is the presence of an information funnel. In deep learning, the base optimizer directly selects the values of individual parameters. In biological evolution, by contrast, the base optimizer selects DNA strings, which in turn produce the detailed structure of the brain only indirectly. This implicitly introduces pressure towards compressing the brain's structure. As we noted previously, this might favor the development of mesa-optimizers, as they can be seen as a way of compressing complex policies. Because deep learning algorithms currently select weights directly, there is no such information funnel. In the case of meta-learning over a machine learning model's architecture, however, an information funnel might present itself due to the fact that the meta-learner has to learn a single policy for creating architectures that is capable of solving all of the problems it is presented with.

Statefulness. Another factor that could make it easier for a learned algorithm to implement a complex optimization process is the ability of the learned algorithm to save and recall information. The ability to refer back to previous state enables the caching of intermediate results, increasing the architecture's algorithmic range by giving the learned algorithm significantly more time to complete its computations. Statefulness, therefore, favors mesa-optimization by decreasing the implicit penalty on time complexity imposed by enforcing all computations to end when the learned algorithm produces its output. One simple form of this might be a recurrent neural network, which can perform computations with longer time horizons due to its ability to pass intermediate activations across different time steps. An even more extreme version of this might be a network that is able to store state not just across different time steps in the same episode, but also across episodes, enabling it to do more long-term learning, though such an architecture is currently completely hypothetical.

Hard-coded optimization. One possible means of alleviating some of these issues might be to include hard-coded optimization where the learned algorithm provides only the objective function and not the optimization algorithm. The stronger the optimization performed explicitly, the less strong the optimization performed implicitly by the learned algorithm needs to be. For example, architectures that explicitly perform optimization that is relevant for the task—such as hard-coded Monte Carlo tree search—might decrease the benefit of mesa-optimizers by reducing the need for optimization other than that which is explicitly programmed into the system.

The third post in the Risks from Learned Optimization Sequence, titled “The Inner Alignment Problem,” can be found here.

Glossary | Bibliography

As of the date of this post. Note that we do examine some existing machine learning systems that we believe are close to producing mesa-optimization in post 5. ↩︎
It is worth noting that the same argument also holds for achieving an average-case guarantee. ↩︎
Assuming reasonable computational constraints.. ↩︎
This definition of $N$ is somewhat vague, as there are multiple different levels at which one can chunk an environment into instances. For example, one environment could always have the same high-level features but completely random low-level features, whereas another could have two different categories of instances that are broadly self-similar but different from each other, in which case it's unclear which has a larger $N$ . However, one can simply imagine holding $N$ constant for all levels but one and just considering how environment diversity changes on that level. ↩︎
Note that this makes the implicit assumption that the amount of optimization power required to find a mesa-optimizer capable of performing $x$ bits of optimization is independent of $N$ . The justification for this is that optimization is a general algorithm that looks the same regardless of what environment it is applied to, so the amount of optimization required to find an $x$ -bit optimizer should be relatively independent of the environment. That being said, it won't be completely independent, but as long as the primary difference between environments is how much optimization they need, rather than how hard it is to do optimization, the model presented here should hold. ↩︎
Note, however, that there will be some maximum $x$ simply because the learned algorithm generally only has access to so much computational power. ↩︎
Subject to the constraint that $P - f (x) \geq 0$ . ↩︎

I really like this formulation, and it greatly clarifies something I was trying to note in my recent paper on multiparty dynamics and failure modes - Link here. The discussion about the likelihood of mesa-optimization due to human modeling is close to the more general points I tried to make in the discussion section of that paper. As argued here about humans, other systems are optimizers (even if they are themselves only base-optimizers,) and therefore any successful machine-learning system in a multiparty environment is implicitly forced to model the other parties. I called this the "opponent model," and argued that they are dangerous because they are always approximate, arguing directly from that point to claim there is great potential for misalignment - but the implication from this work is that they are also dangerous because it encourages machine learning in multi-agent systems to be mesa-optimizers, and the mesa-optimization is a critical enabler of misalignment even when the base optimizer is well aligned.

I would add to the discussion here that multiparty systems can display the same dynamics, and therefore have risks similar to that of systems which require human models. I also think, less closely connected to the current discussion, but directly related to my paper, that mesa-optimizers misalignments pose new and harder to understand risks when they interact with one another.

I also strongly agree with the point that current examples are not really representative of the full risk. Unfortunately, peer-reviewers strongly suggested that I have moreconcrete examples of failures. But as I said in the paper, "the failures seen so far are minimally disruptive. At the same time, many of the outlined failures are more problematic for agents with a higher degree of sophistication, so they should be expected not to lead to catastrophic failures given the types of fairly rudimentary agents currently being deployed. For this reason, specification gaming currently appears to be a mitigable problem, or as Stuart Russell claimed, be thought of as “errors in specifying the objective, period.”"

As a final aside, I think that the concept of mesa-optimizers is very helpful in laying out the argument against that last claim - misalignment is more than just misspecification. I think that this paper will be very helpful in showing why,

That link is broken, and I'm very interested to read this. Is there a newer link?

ETA: I think I found it: https://arxiv.org/abs/1810.10862

The better link is the final version - https://www.mdpi.com/2504-2289/3/2/21/htm

The link in the original post broke because it included the trailing period by accident.

The Risks from Learned Optimization paper and this sequence don't seem to talk about the possibility of mesa-optimizers developing from supervised learning and the resulting inner alignment problem. The part that gets closest is

First, though we largely focus on reinforcement learning in this sequence, RL is not necessarily the only type of machine learning where mesa-optimizers could appear. For example, it seems plausible that mesa-optimizers could appear in generative adversarial networks.

I wonder if this was intentional, and if not maybe it would be worth making a note somewhere in the paper/posts that an oracle/predictor that is trained on sufficiently diverse data using SL could also become a mesa-optimizer (especially since this seems counterintuitive and might be overlooked by AI researchers/builders). See related discussion here.

To see this, we can think of optimization power as being measured in terms of the number of times the optimizer is able to divide the search space in half—that is, the number of bits of information provided.

This is pretty confusing for me: If I'm doing gradient descent, how many times am I halving the entire search space? (although I appreciate that it's hard to come up with a better measure of optimisation)

You could imagine that, if you use gradient descent to reach a loss value of $L$ , then amount of optimization applied in bits $= - log \frac{| {θ \in R^{d} : L (θ) \leq L} |}{| R^{d} |}$ . (Yes, I know I shouldn't be taking sizes of continuous vector spaces, but you know what I mean.)

I think there is a typo in your formula, because the number of bits you get is negative. Going back to Yudkowsky's post, I think the correct formula (using your approximations of sizes) is , or $- log \frac{| {θ \in R^{d} ∣ L (θ) \leq L} |}{| R^{d} |}$ to be closer to the entropy notation.

Yeah, you're right, fixed.

Minor nitpicks:

The justification for this is that optimization is a general algorithm that looks the same regardless of what environment it is applied to, so the amount of optimization required to find a x-bit optimizer should be independent of the environment.

This sounds blatantly false to me unless the only things you count as optimizers are things like AIXI. (In particular, humans could not count as optimizers.) Especially if you've already put in some base-level optimization to narrow down the search space. You then need to "encode" the knowledge you already have into the optimizer, so that it doesn't redo the work already done by the base-level optimization. In practice this looks like inductive biases and embedding of domain knowledge.

As a particular example, many NP-hard problems become easy if they have particular structure. (Horn-SAT and 2-SAT are easy while SAT is hard; longest path is easy in DAGs but hard with general graphs.)

I get that this is a footnote and that it's a toy model that doesn't claim to mimic reality, but it seems like a very false statement to me.

$x = {argmax}_{x} \frac{P - f (x)}{N} + x$

Can you rename the LHS variable $x$ to something else, like $x^{*}$ , to avoid confusion with the $x$ on the RHS?

Algorithmic range.

Is this different from expressivity, or the size of the model's hypothesis class? I can't tell how this is different from the inductive biases section, especially its third point.

For example, architectures that explicitly give the algorithm access to a wide range of possible computations, such as recurrent neural networks or neural Turing machines,(14) seem more likely to produce mesa-optimizers.

I (and I suspect other ML researchers) would call these inductive biases, not algorithmic range / model expressivity.

AFAICT, algorithmic range isn't the same thing as model capacity: I think that tabular learners have low algorithmic range, as the terms are used in this post, but high model capacity.

It definitely will vary with the environment, though the question is degree. I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both. The footnote should probably be rephrased so as not to assert that it is completely independent, as I agree that it obviously isn't, but just that it needs to be relatively independent, with the amount of optimization power dominating for the model to make sense.

Renamed $x$ to $x^{*}$ —good catch (though editing doesn't appear to be working for me right now—it should show up in a bit)!

Algorithmic range is very similar to model capacity, except that we're thinking slightly more broadly as we're more interested in the different sorts of general procedures your model can learn to implement than how many layers of convolutions you can do. That being said, they're basically the same thing.

I actually just updated the paper to just use model capacity instead of algorithmic range to avoid needlessly confusing machine learning researchers, though I'm keeping algorithmic range here.

I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both.

Fwiw, I have the opposite intuition quite strongly, but not sure it's worth debating that here.

A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself, resulting in a mesa-optimizer.

In this case it seems like you'd have a policy that uses "optimization machinery" to:

Predict what other agents are going to do
Create plans to achieve some form of inner objective

So, the model-outputted-by-the-base-optimization is a policy that chooses how to use the optimization machinery, not the optimization machinery itself. This seems substantially different from your initial concept of a mesa-optimizer

Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.

and seems more like a subagent. But perhaps I've misunderstood what you meant by a mesa-optimizer.

The section on human modelling annoyingly conflates two senses of human modelling. One is the sense you talk about, the other is seen in the example:

For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.

The idea there isn't that the algorithm simulates human judgement as an external source of information for itself, but that the actual algorithm learns to be a human-like reasoner, with human-like goals (because that's a good way of approximating the output of human-like reasoning). In that case, the agent really is a mesa-optimiser, to the degree that a goal-directed human-like reasoner is an optimiser.

(I'm not sure to what degree it's actually likely that a good way to approximate the behaviour of human-like reasoning is to instantiate human-like reasoning)

Just to make sure I understand, this example assumes that the base objective is "predict human behavior", and doesn't apply to most base objectives, right?

Yes, it probably doesn’t apply to most objectives. Though it seems to me that the closer the task is to something distinctly human, the more probable it is that this kind of consideration can apply. E.g., making judgements in criminal court cases and writing fiction are domains where it’s not implausible to me that this could apply.

I do think this is a pretty speculative argument, even for this sequence.

The idea would be that all of this would be learned—if the optimization machinery is entirely internal to the system, it can choose how to use that optimization machinery arbitrarily. We talk briefly about systems where the optimization is hard-coded, but those aren't mesa-optimizers. Rather, we're interested in situations where your learned algorithm itself performs optimization internal to its own workings—optimization it could re-use to do prediction or vice versa.

It sounds like there was a misunderstanding somewhere -- I'm aware that all of this would be learned; my point is that the learned policy contains an optimizer rather than being an optimizer, which seems like a significant point, and your original definition sounded like you wanted the learned policy to be an optimizer.

Better generalization through search.

Search only generalizes well when you are able to accurately determine the available options, the consequences of selecting those options, and the utility of those consequences. It's extremely unclear whether a mesa-optimizer would be able to do all three of these things well enough for search to actually generalize. Selection and Control makes some similar points.

We are already encountering some problems, however—Go, Chess, and Shogi, for example—for which this approach does not scale.

There should be a lot of caveats to this:

I'm pretty sure that even if you remove the MCTS at test time, AlphaZero will be very good at the game. (I'm pretty sure we could find numbers for this somewhere if it was a crux. I spent two minutes looking and didn't find them.)
I'd also bet that with more compute and a larger model, AlphaZero (even without MCTS at test time) would continue improving.
AlphaZero assumes access to a perfect simulator of the environment (i.e. the rules of the game), which is why hardcoded search generalizes correctly. It's not clear what would happen if you forced AlphaZero to also learn the rules of the game. That's the setting used in Dota and StarCraft, and notably in both of those environments we did not use the AlphaZero approach, and we did see that it was "generally favorable for most of the optimization work to be done by the base optimizer". (Unless you think that OpenAI Five / AlphaStar did in fact have mesa-optimizers, and we can't tell because the neural nets are opaque.)

Arguably, this sort of task is only adequately solvable this way—if it were possible to train a straightforward DQN agent to perform well at Chess, it plausibly would have to learn to internally perform something like a tree search, producing a mesa-optimizer.

Given enough time and space, a DQN agent turns into a lookup table, which could encode the optimal policy for chess. I'd appreciate a rewrite of the sentence, or a footnote that says something to the effect of "assuming reasonable time and space limitations on the agent".

I also disagree with the spirit of the sentence. My intuition is that with sufficient model capacity and training time (say, all of the computing resources in the world today), you could get a very large bundle of learned heuristics that plays chess well. (Depending on your threshold for "plays chess well", AlphaZero without MCTS at test time might already reach it.) Of course, you can always amplify any such agent by throwing a small hardcoded MCTS or alpha-beta tree search on top of it, and so I'd expect the best agent at any given level of compute to be something of that form.

I believe AlphaZero without MCTS is still very good but not superhuman—International Master level, I believe. That being said, it's unclear how much optimization/search is currently going on inside of AlphaZero's policy network. My suspicion would be that currently it does some, and that to perform at the same level as the full AlphaZero it would have to perform more.

I added a footnote regarding capacity limitations (though editing doesn't appear to be working for me right now—it should show up in a bit). As for the broader point, I think it's just a question of degree—for a sufficiently diverse environment, you can do pretty well with just heuristics, you do better introducing optimization, and you keep getting better as you keep doing more optimization. So the question is just what does "perform well" mean and what threshold are you drawing for "internally performs something like a tree search."

you can do pretty well with just heuristics, you do better introducing optimization, and you keep getting better as you keep doing more optimization.

I agree with this, but I don't think it's the point that I'm making; my claim is more that "just heuristics" is enough for arbitrary levels of performance (even if you could improve that by adding hardcoded optimization).

So the question is just what does "perform well" mean and what threshold are you drawing for "internally performs something like a tree search."

I don't think my claim depends much on the threshold of "perform well", and I suspect that if you do think the current model is performing something like a tree search, you could make the model larger and run the same training process and it would no longer perform something like a tree search.

my claim is more that "just heuristics" is enough for arbitrary levels of performance (even if you could improve that by adding hardcoded optimization).

This claim seems incorrect for at least some tasks (if you already think that, skip the rest of this comment).

Consider the following 2-player turn-based zero-sum game as an example for a task in which "heuristics" seemingly can't replace a tree search.

The game starts with an empty string. In each turn the following things happen:

(1) the player adds to the end of the string either "A" or "B".

(2) the string is replaced with its SHA256 hash.

Player 1 wins iff after 10 turns the first bit in the binary representation of the string is 1.

(Alternatively, consider the 1-player version of this game, starting with a random string.)

Yeah, agreed; I meant that claim to apply to "realistic" tasks (which I don't yet know how to define).

I meant that claim to apply to "realistic" tasks (which I don't yet know how to define).

Machine learning seems hard to do without search, if that counts as a "realistic" task. :)

I wonder if you can say something about what your motivation is to talk about this, i.e., are there larger implications if "just heuristics" is enough for arbitrary levels of performance on "realistic" tasks?

Machine learning seems hard to do without search, if that counts as a "realistic" task. :)

Humans and systems produced by meta learning both do reasonably well at learning, and don't do "search" (depending on how loose you are with your definition of "search").

I wonder if you can say something about what your motivation is to talk about this, i.e., are there larger implications if "just heuristics" is enough for arbitrary levels of performance on "realistic" tasks?

It's plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven't had much experience on before, but nonetheless because they aren't made up of a two clean parts (mesa objective + capabilities) there isn't a single obvious mesa objective that the AI system is optimizing for off distribution. I'm not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.

Fwiw, on the original point, even standard machine learning algorithms (not the resulting models) don't seem like "search" to me, though they also aren't just a bag of heuristics and they do have a clearly delineated objective, so they fit well enough in the mesa optimization story.

(Also, reading back through this comment thread, I'm no longer sure whether or not a neural net could learn to play at least the 1-player random version of the SHA game. Certainly in the limit it can just memorize the input-output table, but I wouldn't be surprised if it could get some accuracy even without that.)

It’s plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven’t had much experience on before, but nonetheless because they aren’t made up of a two clean parts (mesa objective + capabilities) there isn’t a single obvious mesa objective that the AI system is optimizing for off distribution.

Coming back to this, can you give an example of the kind of thing you're thinking of (in humans, animals, current ML systems)? Or other reason you think this could be the case in the future?

Also, do you think this will be significantly more efficient than "two clean parts (mesa objective + capabilities)"? (If not, it seems like we can use inner alignment techniques, e.g., transparency and verification, to force the model to be "two clean parts" if that's better for safety.)

Coming back to this, can you give an example of the kind of thing you're thinking of (in humans, animals, current ML systems)?

Humans don't seem to have one mesa objective that we're optimizing for. Even in this community, we tend to be uncertain about what our actual goal is, and most other people don't even think about it. Humans do lots of things that look like "changing their objective", e.g. maybe someone initially wants to have a family but then realizes they want to devote their life to public service because it's more fulfilling.

Also, do you think this will be significantly more efficient than "two clean parts (mesa objective + capabilities)"?

I suspect it would be more efficient, but I'm not sure. (Mostly this is because humans and animals don't seem to have two clean parts, but quite plausibly we'll do something more interpretable than evolution and that will push towards a clean separation.) I also don't know whether it would be better for safety to have it split into two clean parts.

Humans do lots of things that look like “changing their objective” [...]

That's true but unless the AI is doing something like human imitation or metaphilosophy (in other words, we have some reason to think that the AI will converge to the "right" values), it seems dangerous to let it "changing their objective" on its own. Unless, I guess, it's doing something like mild optimization or following norms, so that it can't do much damage even if it switches to a wrong objective, and we can just shut it down and start over. But if it's as messy as humans are, how would we know that it's strictly following norms or doing mild optimization, and won't "change its mind" about that too at some point (kind of like a human who isn't very strategic suddenly has an insight or reads something on the Internet and decides to become strategic)?

I think overall I'm still confused about your perspective here. Do you think this kind of "messy" AI is something we should try to harness and turn into a safety success story (if so how), or do you think it's a danger that we should try to avoid (which may for example have to involve global coordination because it might be more efficient than safer AIs that do have clean separation)?

Oh, going back to an earlier comment, I guess you're suggesting some of each: try to harness at lower capability levels, and coordinate to avoid at higher capability levels.

In this entire comment thread I'm not arguing that mesa optimizers are safe, or proposing courses of action we should take to make mesa optimization safe. I'm simply trying to forecast what mesa optimizers will look like if we follow the default path. As I said earlier,

I'm not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.

It's very plausible that the mesa optimizers I have in mind are even more dangerous, e.g. because they "change their objective". It's also plausible that they're safer, e.g. because they are full-blown explicit EU maximizers and we can "convince" them to adopt goals similar to ours.

Mostly I'm saying these things because I think the picture presented in this sequence is not fully accurate, and I would like it to be more accurate. Having an accurate view of what problems will arise in the future tends to help with figuring out solutions to those problems.

Humans and systems produced by meta learning both do reasonably well at learning, and don’t do “search” (depending on how loose you are with your definition of “search”).

Part of what inspired me to write my comment was watching my kid play logic puzzles. When she starts a new game, she has to do a lot of random trial-and-error with backtracking, much like MCTS. (She does the trial-and-error on the physical game board, but when I play I often just do it in my head.) Then her intuition builds up and she can start to recognize solutions earlier and earlier in the search tree, sometimes even immediately upon starting a new puzzle level. Then the game gets harder (the puzzle levels slowly increase in difficulty) or moves to a new regime where her intuitions don't work, and she has to do more trial-and-error again, and so on. This sure seems like "search" to me.

Fwiw, on the original point, even standard machine learning algorithms (not the resulting models) don’t seem like “search” to me, though they also aren’t just a bag of heuristics and they do have a clearly delineated objective, so they fit well enough in the mesa optimization story.

This really confuses me. Maybe with some forms of supervised learning you can either calculate the solution directly, or just follow a gradient (which may be arguable whether that's search or not), but with RL, surely the "explore" steps have to count as "search"? Do you have a different kind of thing in mind when you think of "search"?

This sure seems like "search" to me.

I agree that if you have a model of the system (as you do when you know the rules of the game), you can simulate potential actions and consequences, and that seems like search.

Usually, you don't have a good model of the system, and then you need something else.

Maybe with some forms of supervised learning you can either calculate the solution directly, or just follow a gradient (which may be arguable whether that's search or not), but with RL, surely the "explore" steps have to count as "search"?

I was thinking of following a gradient in supervised learning.

I agree that pure reinforcement learning with a sparse reward looks like search. I doubt that pure RL with sparse reward is going to get you very far.

Reinforcement learning with demonstrations or a very dense reward doesn't really look like search, it looks more like someone telling you what to do and you following the instructions faithfully.

A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself

I'm confused about this. I tried substituting different words for "optimisation":

"A system capable of reasoning about photosynthesis is likely also capable of reusing that same machinery to do photosynthesis itself." Nope.

"A system capable of reasoning about arithmetic is likely also capable of reusing that same machinery to do arithmetic itself". Maybe? The rules of arithmetic can be reused, but the machinery to reason abstractly about arithmetic is probably different to the machinery to run a specific calculation, especially with a learned model with lots of free parameters, like a neural network.

Maybe optimisation is not like the above examples because it is so generic? Or I misunderstood the claim.

As one moves to more and more diverse environments—that is, as $N$ increases—this model suggests that $x$ will dominate $\frac{P - f (x)}{N}$ , implying that mesa-optimization will become more and more favorable.

I believe this sentence should say " $x$ will dominate $\frac{P - f (x)}{N} + x$ , implying [...]". Same for the sentence in the paper.

I suppose it depends on what you mean by “dominate.” Mathematically, the point here is just that

$lim N \to \infty \frac{P - f (x)}{N} + x = x$

which I would describe as the $x$ term “dominating” the $\frac{P - f (x)}{N}$ term.

Ah, I see. I was thinking of 'dominate' as narrowly meaning 'determines the total value of the term', but I agree that the usage above works perfectly well.

I'm having difficulty with this in a way that "smells" like either "too much jargon" or "not enough effort expended explaining terms at start". That is, I (recent PhD in statistics) can more-or-less parse individual statements, but unable to chunk things into parts that are large enough to seem non-trivial. Note that I'm coming here from the "deep double-descent" article which does seem intuitive for me at first read (that is, definitely interesting, but closer to being too trivial than too complex).

I'll come back and read this again later, but for now: if you're editing this, I'd suggest starting slower and/or working to de-jargonify.

This post is the second post in the middle of a five-post sequence—you should definitely start with the introduction and if you still feel confused by the terminology, there's also a glossary that might be helpful as well. You can also get a lot of the content in podcast form here if you'd prefer that.

The concept of pre-training and fine-tuning in ML seems closely related to mesa-optimization. You pre-train a model on a general distribution so that it can quickly learn from little data on a specific one.

However, as the number of tasks you want to do (N) increases, there seems to be the opposite effect as what your (very neat) model in section 2.1 describes: you get higher returns for meta-optimization so you'll want to spend relatively more on it. I think model's assumptions are defied here because the tasks don't require completely distinct policies. E.g. GPT-2 does very well across tasks with the exact same prediction-policy. I'm not completely sure about this point but it seems fruitful to explore the analogy to pre-training which is widely used.

One possible means of alleviating some of these issues might be to include hard-coded optimization where the learned algorithm provides only the objective function and not the optimization algorithm.

Given that the risk comes from the inner objective being misaligned, how does this help?

The argument in this post is just that it might help prevent mesa-optimization from happening at all, not that it would make it more aligned. The next post will be about how to align mesa-optimizers.

Is it a requirement of mesa-optimization that the optimization algorithm must be learned? I would have expected that it would only be a requirement that the objective be learned. Are there considerations that apply to learned optimization algorithms that don't apply to hardcoded optimization algorithms?

The main benefit I see of hardcoding optimisation is that, assuming the system's pieces learn as intended (without any mesa-optimisation happening in addition to the hardcoded optimisation) you get more access and control as a programmer over what the learned objective actually is. You could attempt to regress the learned objective directly to a goal you want, or attempt to enforce a certain form on it, etc. When the optimisation itself is learned*, the optimiser is more opaque, and you have fewer ways to affect what goal is learned: which weights of your enormous LSTM-based mesa-optimiser represent the objective?

This doesn't solve the problem completely (you might still learn an objective that is very incorrect off-distribution, etc.), but could offer more control and insight into the system to the programmer.

*Of course, you can have learned optimisation where you keep track of the objective which is being optimised (like in Learning to Learn by Gradient Descent), but I'd class that more under hard-coded optimisation for the purposes of this discussion. Here I mean the kind of learned optimisation that happens where you're not building the architecture explicitly around optimising or learning to optimise.

That makes sense, thanks.

Task restriction. The observation that diverse environments seem to increase the probability of mesa-optimization

Where does this observation come from?

Is a mesa-optimizer unaligned by definition?

No, not at all—we distinguish robustly aligned mesa-optimizers, which are aligned on and off distribution, from pseudo-aligned mesa-optimizers, which appear to be aligned on distribution, but are not necessarily aligned off-distribution. For the full glossary, see here.

That was helpful, thank you.

That link is broken, and I'm very interested to read this. Is there a newer link?

ETA: I think I found it: https://arxiv.org/abs/1810.10862

The better link is the final version - https://www.mdpi.com/2504-2289/3/2/21/htm

The link in the original post broke because it included the trailing period by accident.

First, though we largely focus on reinforcement learning in this sequence, RL is not necessarily the only type of machine learning where mesa-optimizers could appear. For example, it seems plausible that mesa-optimizers could appear in generative adversarial networks.

To see this, we can think of optimization power as being measured in terms of the number of times the optimizer is able to divide the search space in half—that is, the number of bits of information provided.

Yeah, you're right, fixed.

Minor nitpicks:

The justification for this is that optimization is a general algorithm that looks the same regardless of what environment it is applied to, so the amount of optimization required to find a x-bit optimizer should be independent of the environment.

I get that this is a footnote and that it's a toy model that doesn't claim to mimic reality, but it seems like a very false statement to me.

$x = {argmax}_{x} \frac{P - f (x)}{N} + x$

Can you rename the LHS variable $x$ to something else, like $x^{*}$ , to avoid confusion with the $x$ on the RHS?

Algorithmic range.

Is this different from expressivity, or the size of the model's hypothesis class? I can't tell how this is different from the inductive biases section, especially its third point.

For example, architectures that explicitly give the algorithm access to a wide range of possible computations, such as recurrent neural networks or neural Turing machines,(14) seem more likely to produce mesa-optimizers.

I (and I suspect other ML researchers) would call these inductive biases, not algorithmic range / model expressivity.

AFAICT, algorithmic range isn't the same thing as model capacity: I think that tabular learners have low algorithmic range, as the terms are used in this post, but high model capacity.

Renamed $x$ to $x^{*}$ —good catch (though editing doesn't appear to be working for me right now—it should show up in a bit)!

I actually just updated the paper to just use model capacity instead of algorithmic range to avoid needlessly confusing machine learning researchers, though I'm keeping algorithmic range here.

I suspect most of the variation will be in how much optimization power you need, as opposed to how difficult it is to get some degree of optimization power, which motivates the model presented here—though certainly there will be some deviation in both.

Fwiw, I have the opposite intuition quite strongly, but not sure it's worth debating that here.

A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself, resulting in a mesa-optimizer.

In this case it seems like you'd have a policy that uses "optimization machinery" to:

Predict what other agents are going to do
Create plans to achieve some form of inner objective

Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.

and seems more like a subagent. But perhaps I've misunderstood what you meant by a mesa-optimizer.

The section on human modelling annoyingly conflates two senses of human modelling. One is the sense you talk about, the other is seen in the example:

For example, it might be the case that predicting human behavior requires instantiating a process similar to human judgment, complete with internal motives for making one decision over another.

(I'm not sure to what degree it's actually likely that a good way to approximate the behaviour of human-like reasoning is to instantiate human-like reasoning)

Just to make sure I understand, this example assumes that the base objective is "predict human behavior", and doesn't apply to most base objectives, right?

I do think this is a pretty speculative argument, even for this sequence.

Better generalization through search.

We are already encountering some problems, however—Go, Chess, and Shogi, for example—for which this approach does not scale.

There should be a lot of caveats to this:

I'm pretty sure that even if you remove the MCTS at test time, AlphaZero will be very good at the game. (I'm pretty sure we could find numbers for this somewhere if it was a crux. I spent two minutes looking and didn't find them.)
I'd also bet that with more compute and a larger model, AlphaZero (even without MCTS at test time) would continue improving.
AlphaZero assumes access to a perfect simulator of the environment (i.e. the rules of the game), which is why hardcoded search generalizes correctly. It's not clear what would happen if you forced AlphaZero to also learn the rules of the game. That's the setting used in Dota and StarCraft, and notably in both of those environments we did not use the AlphaZero approach, and we did see that it was "generally favorable for most of the optimization work to be done by the base optimizer". (Unless you think that OpenAI Five / AlphaStar did in fact have mesa-optimizers, and we can't tell because the neural nets are opaque.)

Arguably, this sort of task is only adequately solvable this way—if it were possible to train a straightforward DQN agent to perform well at Chess, it plausibly would have to learn to internally perform something like a tree search, producing a mesa-optimizer.

you can do pretty well with just heuristics, you do better introducing optimization, and you keep getting better as you keep doing more optimization.

So the question is just what does "perform well" mean and what threshold are you drawing for "internally performs something like a tree search."

my claim is more that "just heuristics" is enough for arbitrary levels of performance (even if you could improve that by adding hardcoded optimization).

This claim seems incorrect for at least some tasks (if you already think that, skip the rest of this comment).

Consider the following 2-player turn-based zero-sum game as an example for a task in which "heuristics" seemingly can't replace a tree search.

The game starts with an empty string. In each turn the following things happen:

(1) the player adds to the end of the string either "A" or "B".

(2) the string is replaced with its SHA256 hash.

Player 1 wins iff after 10 turns the first bit in the binary representation of the string is 1.

(Alternatively, consider the 1-player version of this game, starting with a random string.)

Yeah, agreed; I meant that claim to apply to "realistic" tasks (which I don't yet know how to define).

I meant that claim to apply to "realistic" tasks (which I don't yet know how to define).

Machine learning seems hard to do without search, if that counts as a "realistic" task. :)

Machine learning seems hard to do without search, if that counts as a "realistic" task. :)

Humans and systems produced by meta learning both do reasonably well at learning, and don't do "search" (depending on how loose you are with your definition of "search").

I wonder if you can say something about what your motivation is to talk about this, i.e., are there larger implications if "just heuristics" is enough for arbitrary levels of performance on "realistic" tasks?

It’s plausible to me that for tasks that we actually train on, we end up creating systems that are like mesa optimizers in the sense that they have broad capabilities that they can use on relatively new domains that they haven’t had much experience on before, but nonetheless because they aren’t made up of a two clean parts (mesa objective + capabilities) there isn’t a single obvious mesa objective that the AI system is optimizing for off distribution.

Coming back to this, can you give an example of the kind of thing you're thinking of (in humans, animals, current ML systems)? Or other reason you think this could be the case in the future?

Coming back to this, can you give an example of the kind of thing you're thinking of (in humans, animals, current ML systems)?

Also, do you think this will be significantly more efficient than "two clean parts (mesa objective + capabilities)"?

Humans do lots of things that look like “changing their objective” [...]

Oh, going back to an earlier comment, I guess you're suggesting some of each: try to harness at lower capability levels, and coordinate to avoid at higher capability levels.

I'm not sure what happens in this regime, but it seems like it undercuts the mesa optimization story as told in this sequence.

Humans and systems produced by meta learning both do reasonably well at learning, and don’t do “search” (depending on how loose you are with your definition of “search”).

Fwiw, on the original point, even standard machine learning algorithms (not the resulting models) don’t seem like “search” to me, though they also aren’t just a bag of heuristics and they do have a clearly delineated objective, so they fit well enough in the mesa optimization story.

This sure seems like "search" to me.

I agree that if you have a model of the system (as you do when you know the rules of the game), you can simulate potential actions and consequences, and that seems like search.

Usually, you don't have a good model of the system, and then you need something else.

Maybe with some forms of supervised learning you can either calculate the solution directly, or just follow a gradient (which may be arguable whether that's search or not), but with RL, surely the "explore" steps have to count as "search"?

I was thinking of following a gradient in supervised learning.

I agree that pure reinforcement learning with a sparse reward looks like search. I doubt that pure RL with sparse reward is going to get you very far.

Reinforcement learning with demonstrations or a very dense reward doesn't really look like search, it looks more like someone telling you what to do and you following the instructions faithfully.

A system capable of reasoning about optimization is likely also capable of reusing that same machinery to do optimization itself

I'm confused about this. I tried substituting different words for "optimisation":

"A system capable of reasoning about photosynthesis is likely also capable of reusing that same machinery to do photosynthesis itself." Nope.

Maybe optimisation is not like the above examples because it is so generic? Or I misunderstood the claim.

As one moves to more and more diverse environments—that is, as $N$ increases—this model suggests that $x$ will dominate $\frac{P - f (x)}{N}$ , implying that mesa-optimization will become more and more favorable.

I believe this sentence should say " $x$ will dominate $\frac{P - f (x)}{N} + x$ , implying [...]". Same for the sentence in the paper.

I suppose it depends on what you mean by “dominate.” Mathematically, the point here is just that

$lim N \to \infty \frac{P - f (x)}{N} + x = x$

which I would describe as the $x$ term “dominating” the $\frac{P - f (x)}{N}$ term.

Ah, I see. I was thinking of 'dominate' as narrowly meaning 'determines the total value of the term', but I agree that the usage above works perfectly well.

I'll come back and read this again later, but for now: if you're editing this, I'd suggest starting slower and/or working to de-jargonify.

One possible means of alleviating some of these issues might be to include hard-coded optimization where the learned algorithm provides only the objective function and not the optimization algorithm.

Given that the risk comes from the inner objective being misaligned, how does this help?

The argument in this post is just that it might help prevent mesa-optimization from happening at all, not that it would make it more aligned. The next post will be about how to align mesa-optimizers.

That makes sense, thanks.

Task restriction. The observation that diverse environments seem to increase the probability of mesa-optimization

Where does this observation come from?

Is a mesa-optimizer unaligned by definition?

That was helpful, thank you.

86

Conditions for Mesa-Optimization

86

Ω 30

2.1. The task

2.2. The base optimizer

86

Ω 30

86

Ω 30