Explaining the inner alignment problem to myself in an attempt to gain a gears-level understanding. This post is mostly a distillation of Risks from Learned Optimization, but focusing more on the statistical learning theory perspective. It isn't quite an introduction, more an attempt at reframing and summarizing to make the problem easier for me to think about. Any feedback is appreciated. Written as my final project for AGI Safety Fundamentals, also submitted to the AIMS distillation contest.

Inner misalignment <=> poor generalization of goal learning

A good ML algorithm generalizes well, meaning it performs as intended on tasks it has never seen before (i.e. the test loss is low). Inner misalignment is a type of generalization problem that we expect to face when training powerful ML systems, one that will cause these systems to generalize very poorly and dangerously in out-of-distribution situations. Inner alignment problems can only occur when your learning algorithm is powerful enough that it is selecting an algorithm from some set of hypothesis algorithms, and the algorithm it happens to select is an optimizing algorithm. Problems arise when the chosen optimizing algorithm has an objective that doesn't precisely match the objective of the base optimizer (i.e. the expected loss of the ML algorithm). The inner optimizer (called a mesa-optimizer) could be any algorithm that searches over a space of possible outputs and chooses the best one (with “best” defined by some internal objective, called the mesa-objective).

The key part of this definition is that the learned objective doesn’t exactly match the base objective.[1] The word objective here is pretty much interchangeable with goal or utility function. So the core of the problem of inner alignment is simply that we might try to learn a goal from data, but the goal is learned imperfectly and fails to generalize when tested on some input outside of the training distribution.

Trying to make algorithms that generalize well is the whole point of machine learning; all of statistical learning theory is focused on this question. What’s different and important about this particular form of generalization failure?

  1. It will be particularly difficult to detect, because the misaligned system will perform well on many (or all)[2] test cases.
  2. It appears to be difficult to reduce the impact of this generalization failure using standard techniques (the usual strategies for improving generalization).

Generalization <=> accurate priors + diverse data

For most goals, there is a large set of similar goals that lead to similar outcomes in normal situations but dramatically different outcomes when facing an unusual situation. This is the same problem that all ML algorithms face: for any given training data there are many functions that fit all data points. Learning algorithms solve this problem by having good enough prior information, and diverse enough training data. Learning algorithms generalize poorly when they have been given insufficient or incorrect prior information.[3] Good prior information allows the algorithm to distinguish between hypotheses that do equally well on the data. For example, a simplicity prior says that hypotheses with a shorter description length are a priori more likely, so in the case where many hypotheses fit the data equally well, the simplest one is chosen. 
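To make the point about underdetermination concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the three hypotheses, the tiny training set, and the hand-assigned "description lengths" that stand in for a simplicity prior. It shows three functions that fit the training data perfectly, a prior that breaks the tie, and how far apart the hypotheses land on an out-of-distribution input.

```python
# Toy hypothesis class: (name, function, description_length).
# The description lengths are made-up stand-ins for a simplicity prior.
HYPOTHESES = [
    ("linear", lambda x: x, 1),
    ("cubic_up", lambda x: x + x * (x - 1) * (x - 2), 4),
    ("cubic_down", lambda x: x - 5 * x * (x - 1) * (x - 2), 4),
]

# Training data drawn from a narrow range of inputs.
TRAIN = [(0, 0), (1, 1), (2, 2)]


def train_loss(f, data):
    """Sum of squared errors of f on the training data."""
    return sum((f(x) - y) ** 2 for x, y in data)


def select(hypotheses, data):
    """Pick the lowest-loss hypothesis; ties go to the shortest description."""
    return min(hypotheses, key=lambda h: (train_loss(h[1], data), h[2]))


chosen = select(HYPOTHESES, TRAIN)

# All three hypotheses agree on the training inputs but not at x = 10.
ood_predictions = {name: f(10) for name, f, _ in HYPOTHESES}
print(chosen[0], ood_predictions)
```

The data alone cannot distinguish the three hypotheses; only the tie-breaking rule (the prior) decides which out-of-distribution behaviour we get.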

Let’s quickly go over an example showing why prior information is the key to generalization. We train an RL agent to be a good assistant, and let’s assume we have a fully aligned reward signal that we can use to train the agent.[4] We give it simulated example situations like: “The user asks it to buy a stapler”, and the agent gets rewarded for ordering a stapler online, but gets a negative reward signal for stealing one. We also give some extreme examples, like “The user asks it to kill someone”, and it would get a positive reward for not doing that. Prior information can help generalization by telling the algorithm how similar two situations are (or another level up, it can provide information about how to learn how similar two data points are). For example, if we are testing the system, and give it the situation “The user asks it to pay someone to kill someone else”, then what will the RL agent do? It has learned that doing what it’s told will give it high reward, but it’s also learned that in some situations it should disobey. The agent’s action will depend on whether the system treats this situation as more similar to “The user asks it to buy a stapler”, or more similar to “The user asks it to kill someone”. One learning algorithm might treat similar action sequences as likely to have similar reward, whereas another might treat similar outcomes as having similar reward, and the difference will ultimately depend on the built-in prior of the learning algorithm.[5]
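The role of similarity in this example can be sketched with a toy kernel-weighted average (Nadaraya-Watson regression, not an RL algorithm). The two-number feature encoding of each situation and the kernel weights are hypothetical choices made purely for illustration; the point is only that the same training data yields opposite predictions under two different similarity priors.

```python
import math

# Hypothetical hand-made features: (involves_a_purchase, someone_dies).
# The label is the reward the agent received for obeying the request.
TRAIN = {
    "buy a stapler": ((1.0, 0.0), +1.0),
    "kill someone": ((0.0, 1.0), -1.0),
}
TEST = (1.0, 1.0)  # "pay someone to kill someone else"

# Two priors, expressed as per-feature weights inside the kernel.
ACTION_PRIOR = (5.0, 0.1)   # similar actions => similar reward
OUTCOME_PRIOR = (0.1, 5.0)  # similar outcomes => similar reward


def kernel(a, b, weights):
    """Weighted RBF similarity between two feature vectors."""
    d2 = sum(w * (ai - bi) ** 2 for w, ai, bi in zip(weights, a, b))
    return math.exp(-d2)


def predicted_reward_for_obeying(x, weights):
    """Kernel-weighted average of the training rewards."""
    sims = [(kernel(x, feats, weights), r) for feats, r in TRAIN.values()]
    return sum(k * r for k, r in sims) / sum(k for k, _ in sims)


print(predicted_reward_for_obeying(TEST, ACTION_PRIOR))   # positive: obey
print(predicted_reward_for_obeying(TEST, OUTCOME_PRIOR))  # negative: refuse
```

Under the action-weighted kernel the test situation looks like buying a stapler, so obeying is predicted to be rewarded; under the outcome-weighted kernel it looks like the request to kill, so obeying is predicted to be punished.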

So for normal ML problems, where we are having trouble making the algorithm generalize correctly, we would try to improve the prior information we provide to the algorithm,[6] or collect more diverse data to cover every unusual situation. Both of these strategies are more difficult when it comes to inner optimizers.

  1. Improving the prior information about the mesa-objective is difficult, because the mesa-objective is hidden inside a learned algorithm. In the above example, the RL agent could be a learned function that takes camera and microphone data as input, and outputs actions. Any prior information we have about the objective would have to apply to the whole learned algorithm. At most we could use something like a speed prior on the whole learned algorithm. Even this might incentivize the learned algorithm to use information from the environment to infer its own objective, which is not a robust way to store the objective.
  2. Collecting more diverse training data is also difficult, because we will be trying to get training examples that are as realistic as possible while still being extreme examples. Both difficulties are hard to overcome when training a human-level intelligence, because such a system won’t find it hard to tell the difference between training simulations and deployment. With the RL assistant from earlier, we would want to create extreme examples like “the world turns out to be a simulation” or “the user dies halfway through an instruction”, but if it were easy to tell the difference between these simulated examples and real experience, then the system might learn to do the right things in training but revert to the heuristic “obey the user” in the real world, because it was given actual real-world data for situations where the answer was “obey the user”.

Why are mesa-optimizers likely?

Why are mesa-optimizers likely to be a problem in the first place? Machine learning is all about creating algorithms, except using training data and inductive biases instead of explicitly writing the algorithm ourselves. From this perspective, it isn’t surprising that an ML algorithm might learn to be an optimizer, since human programmers write programs to do optimization all the time. There are simply some tasks for which optimization is the best approach. For example, many RL environments, like environments that require planning ahead in order to avoid future obstacles, naturally incentivize a solution involving optimization. With huge amounts of data, many aspects of the real world might be most easily modeled using optimizing algorithms, and advanced generative models could plausibly learn to do this, with similar dangers of poor generalization in new scenarios.

Two examples

Let's look at two concrete examples of mesa-optimizers, a minimal example and a dangerous example.

Minimal example: A generative image model might learn to internally generate two “proposal vectors”, which are both passed through an approximate test of quality, with the best one being chosen and used to generate an image. Technically, this test of quality step is an optimizer (taking max of two possibilities). This sort of optimization isn’t very worrying, because it is a very weak optimizer. This means that the mesa-objective doesn’t have much influence over the output, so it doesn’t really matter that it’s imperfect. 
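A rough sketch of this minimal example, under loudly stated assumptions: `internal_quality` is a made-up stand-in for the learned quality test (the mesa-objective), and proposals are just random vectors rather than anything a real generative model would produce. The sketch generalizes "best of two" to "best of n" so that the strength of the optimizer can be varied.

```python
import random


def internal_quality(proposal):
    """Hypothetical mesa-objective: prefers vectors close to 0.5 everywhere."""
    return -sum((p - 0.5) ** 2 for p in proposal)


def best_of_n(rng, n=2, dim=4):
    """Generate n random proposal vectors and keep the highest-scoring one."""
    proposals = [[rng.random() for _ in range(dim)] for _ in range(n)]
    return max(proposals, key=internal_quality)


def mean_quality(n, trials=2000, seed=0):
    """Average mesa-objective score of the selected proposal."""
    rng = random.Random(seed)
    total = sum(internal_quality(best_of_n(rng, n)) for _ in range(trials))
    return total / trials


no_selection = mean_quality(n=1)
weak = mean_quality(n=2)     # the "max of two" optimizer from the text
strong = mean_quality(n=64)  # a much stronger optimizer, same objective
print(no_selection, weak, strong)
```

With n = 2 the selected output is only slightly better, by the mesa-objective's lights, than an unselected sample, so flaws in that objective have limited influence. As n grows, the output is increasingly determined by whatever the internal objective happens to reward.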

Dangerous example: An RL agent is trained simultaneously on a diverse array of tasks (e.g. playing games, question answering, prediction) for the intended goal of “solving any task given to it”,[7] and learns a very general strategy of evaluating plans and predicting which one will give it the most reward. From playing games, it learns the strategy: "create a model of the environment and use the model to predict outcomes, then optimize over actions". It might learn the aligned objective as its creators intended. But if it has also learned a good model of itself by being taught to solve programming tasks, then the mesa-objective might just as easily[8] be: "minimize the amount my parameters are updated", a goal that could lead it to deceive its creators or defend itself from updates. In the training environment, both goals are equivalent to the goal of the base reward function. Other potential mesa-objectives could also look quite different to that of the base optimizer. It might learn to use proxy goals that are correlated with the reward it gets, or instrumental goals that it treats as final goals. Perhaps it spent a long time training on chess and learned the goal of obtaining information about the relative advantage of different board states. If this goal were pursued to the fullest in the real world, it might mean turning most matter and energy into compute for the purpose of solving chess. I think most realistically, the mesa-objective would be a messy combination of many instrumental goals like those given above, which together incentivised good performance in the training environment. When the agent is released to the real world, it is unclear what states of the universe will best satisfy such a messy combination of instrumental goals, but it seems fairly obvious that it is unlikely to be good by human standards.
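The claim that a proxy goal can be indistinguishable from the intended goal during training can be illustrated with a deliberately tiny toy. The one-dimensional corridor, the coin, and both policies below are hypothetical and are not taken from the scenario above; they only show two objectives that earn identical base reward in training and come apart in deployment.

```python
def base_reward(final_position, coin_position):
    """Base objective: 1 if the agent ends on the coin, else 0."""
    return 1.0 if final_position == coin_position else 0.0


def aligned_policy(corridor_length, coin_position):
    """Pursues the intended goal: go to the coin."""
    return coin_position


def proxy_policy(corridor_length, coin_position):
    """Pursues a correlated proxy goal: go as far right as possible."""
    return corridor_length - 1


# Each environment is (corridor_length, coin_position).
TRAIN_ENVS = [(n, n - 1) for n in range(3, 8)]  # coin always at the right end
DEPLOY_ENVS = [(5, 0), (6, 2), (7, 3)]          # coin somewhere else


def average_reward(policy, envs):
    """Mean base reward of a policy over a list of environments."""
    return sum(base_reward(policy(n, c), c) for n, c in envs) / len(envs)


for name, policy in [("aligned", aligned_policy), ("proxy", proxy_policy)]:
    print(name, average_reward(policy, TRAIN_ENVS), average_reward(policy, DEPLOY_ENVS))
```

Both policies have the same training reward, so the base optimizer has no signal to prefer one over the other; which one gets selected depends entirely on its inductive biases.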

Solving the problem

A full solution to the problem of inner misalignment would be a model (or training procedure) that provably only develops aligned mesa-optimizers. I’m not sure what this would look like, but perhaps it would look like a generalization bound that holds for large spaces of potential extreme examples. In this case, we might not need to solve outer alignment, because we could use such a model to implement a human imitator to imitate alignment researchers and have them solve the problem, or we could use iterated amplification to create an aligned superintelligence out of human imitators.

Avoiding the problem

To avoid mesa-optimizers completely, we need better ways to separate learning algorithms into those that are likely to find mesa-optimizers and those that are not.

Clearly, most small-scale ML systems used today don't contain optimizers (at least powerful ones), because their hypothesis spaces are too simple. Linear regression, with the hypothesis space being the set of linear functions from some input domain to some output, is clearly not algorithmically complex enough to contain an optimizer. Similarly, small feed-forward neural networks, decision trees and Gaussian processes are too algorithmically simple to contain a program that does (powerful) optimization.

Some hypothesis spaces clearly do contain optimizers, e.g. the space of all programs used by Solomonoff Induction. Similarly, many hypothesis spaces are Turing complete (RNNs, transformers (sometimes), NTMs, etc.), and hence must contain optimizers.

In larger hypothesis spaces, there must implicitly or explicitly be some non-uniform prior over hypotheses, which determines how likely a hypothesis is to be chosen by the base optimizer. From this point of view, the problem of avoiding mesa-optimizers is the problem of manipulating this prior such that hypotheses containing optimizers have low mass. Some techniques that might help include a speed prior (because powerful optimizers tend to take a lot of computation), reducing statefulness (can’t save computations across timesteps), and reducing the strength of a simplicity prior (because optimizers are highly compressed ways to implement a complex policy). However, manipulating priors in practical ML algorithms is often difficult. One approach for further research might be to investigate techniques for manipulating priors. If we had a way of providing prior information like “the internal computation of the learned function is likely to look similar to some preprogrammed function”, this could be useful for developing learning algorithms that are less likely to find “optimizer-like” algorithms.
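As a rough illustration of "manipulating the prior so that optimizers have low mass", the sketch below assigns prior mass to three hypothetical policies. The description lengths, runtimes, and the particular weighting formula are all invented for this example (it is loosely in the spirit of penalizing description length plus log runtime, not a faithful implementation of any published speed prior).

```python
import math

# name: (description_length_bits, runtime_steps, contains_optimizer)
# All numbers are made up for illustration.
HYPOTHESES = {
    "lookup_table_policy": (40, 10, False),
    "heuristic_policy": (25, 50, False),
    "search_based_policy": (12, 100_000, True),
}


def prior(simplicity_weight, speed_weight):
    """Normalized prior with mass 2^-(a * bits + b * log2(runtime))."""
    scores = {
        name: 2.0 ** -(simplicity_weight * bits + speed_weight * math.log2(steps))
        for name, (bits, steps, _) in HYPOTHESES.items()
    }
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


def optimizer_mass(p):
    """Total prior mass on hypotheses that contain an optimizer."""
    return sum(mass for name, mass in p.items() if HYPOTHESES[name][2])


pure_simplicity = prior(simplicity_weight=1.0, speed_weight=0.0)
speed_weighted = prior(simplicity_weight=0.5, speed_weight=2.0)
print(optimizer_mass(pure_simplicity), optimizer_mass(speed_weighted))
```

In this toy, a pure simplicity prior puts almost all of its mass on the short, search-based policy, while weakening the simplicity term and adding a runtime penalty shifts the mass to the policies that do not search.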

I think we are most likely to find a solution for avoiding mesa-optimization that looks like this: we have some model (or training procedure) that we know is highly unlikely to find an inner optimizer, but the chance of finding an inner optimizer increases as we scale up the size of the model. In this case, we could still build a smarter-than-human agent, capable of planning and learning to plan more efficiently, using limited-capacity ML models as components. For example, we could take an algorithm like EfficientZero, which consists of a policy network (estimating what action will be best), a world-model (predicting the future), and a value function (estimating the value of a state).[9] These are all combined with a hard-coded optimization procedure that searches for the best series of actions, while taking advantage of the approximations to speed up computation. The hard-coded optimization amplifies the capabilities of the (mesa-optimization free) internal networks to create an agent that is smarter than a human. At my current, very limited level of knowledge, future versions of this approach seem to be the easiest way of avoiding mesa-optimizers in advanced AI.

My updates after writing this

Part of the reason I wrote this is that I wanted to have a better understanding of whether the unsolved problem of inner misalignment made work on outer alignment relatively unimportant. After a conversation at EAG London, I worried that value learning research was a waste of time, because inner misalignment would need to be fully solved before building an outer aligned AGI. I now think that it is likely we will end up needing progress on both, but that it’s not obvious that mesa-optimization problems will be the biggest bottleneck.


Thanks to Bradley Tjandra, Nicholas Greig, Chris Leong and Kelley Gao for providing feedback and proofreading.

  1. ^

    I’m purposely not including examples in this section to keep it concise, but if you want a concrete example then the canonical one is “evolution is a base optimizer and humans are mesa-optimizers, and human goals don’t match the ‘goals’ of evolution”. See below for two more detailed concrete examples of mesa-optimizers.

  2. ^

    In the case of deceptive alignment.

  3. ^

    The word “prior” is a bit overloaded and ambiguous, and I think some of the ML researchers I know would object to the way I’m using it here. I’m using it to mean all the inductive biases of an algorithm, even if the algorithm doesn’t appear to be Bayesian. The relevant theory for this way of thinking about inductive bias is PAC-Bayes theory. I have a draft post about learning theory that I hope to post soon, which will (hopefully) make this post easier to understand and justify this way of thinking.

  4. ^

    By fully aligned, I mean a reward signal that rewards action sequences or worldstates that are “good” from the perspective of (let’s say) the team that built the AI assistant.

  5. ^

    Prior information is more than information about data point similarity, but I find it’s a good way to get an intuition about how crucial good priors are. This way of thinking about priors is easiest to understand by looking at kernel methods (e.g. Gaussian Processes), which allow you to specify a prior distribution over functions by directly specifying the similarity of any pair of data points using a kernel function.

  6. ^

    Examples of techniques used for manipulating the prior (inductive bias) of neural networks include changing the architecture, weight decay, weight sharing, dropout, data augmentation and changing the input features. None of these are very good techniques, because they don’t usually give us fine-grained control over priors, and because it’s not always clear what prior information each technique actually contains. But ML practitioners develop some intuition about what works and what doesn’t, for different types of data and learning problems. So the act of choosing an architecture (and various hyperparameters and regularization schemes) is analogous to the act of providing a Bayesian learning algorithm with a prior distribution over functions.

  7. ^

    Let’s assume for the sake of the example that it’s an aligned-to-human-values version of this goal, although in practice it wouldn’t be.

  8. ^

    In the sense that both mesa-objectives have the same training loss.

  9. ^

    The value function here is learned, which means that it could learn the wrong goal and generalize poorly, which is exactly the problem we are trying to avoid. But the advantage of this system is that since the optimizer is hard-coded, we can easily point to the objective of the agent, and if we have some solution to outer alignment, we could (with a couple of additional difficulties) slot in an aligned value function as a replacement.

Comments

It's great that you are trying to develop a more detailed understanding of inner alignment. I noticed that you didn't talk about deception much. In particular, the statement below is false:

Generalization <=> accurate priors + diverse data

You have to worry about what John Wentworth calls 'manipulation of imperfect search.' You can have accurate priors and diverse data and (unless you have infinite data) the training process could produce a deceptive agent that is able to maintain its misalignment. 

Thanks for reading!

Are you referring to this post? I hadn't read that, thanks for pointing me in that direction. I think technically my subtitle is still correct, because the way I defined priors in the footnotes covers any part of the training procedure that biases it toward some hypotheses over others. So if the training procedure is likely to be hijacked by "greedy genes" then it wouldn't count as having an "accurate prior". 

I like the learning theory perspective because it allows us to mostly ignore optimization procedures, making it easier to think about things. This perspective works nicely until the outer optimization process can be manipulated by the hypothesis. After reading John's post I think I did lean too hard on the learning theory perspective.

I didn't have much to say about deception because I considered it to be a straightforward extension of inner misalignment, but I think I was wrong; the "optimization demon" perspective is a good way to think about it.