This essay is an adaptation of a talk I gave at the Human-Aligned AI Summer School 2019 about our work on mesa-optimisation. My goal here is to write an informal, accessible and intuitive introduction to the worry that we describe in our full-length report.
The essay has six parts:
Two distinctions draws the foundational distinctions between “optimised” and “optimising”, and between utility and reward.
What objectives? discusses the behavioral and internal approaches to understanding objectives of ML systems.
Why worry? outlines the risk posed by the utility ≠ reward gap.
Mesa-optimisers introduces our language for analysing this worry.
An alignment agenda sketches different alignment problems presented by these ideas, and suggests transparency and interpretability as a way to solve them.
Where does this leave us? summarises the essay and suggests where to look next.
The views expressed here are my own, and do not necessarily reflect those of my coauthors or MIRI. While I wrote this essay in first person, all of the core ideas are the fruit of an equal collaboration between Joar Skalse, Chris van Merwijk, Evan Hubinger and myself. I wish to thank Chris and Joar for long discussions and input as I was writing my talk, and all three, as well as Jaime Sevilla Molina, for thoughtful comments on this essay.
Two distinctions
I wish to draw a distinction which I think is crucial for clarity about AI alignment, yet is rarely drawn. That distinction is between the reward signal of a reinforcement learning (RL) agent and its “utility function”. That is to say, it is not in general true that the policy of an RL agent is optimising for its reward. To explain what I mean by this, I will first draw another distinction, between “optimised” and “optimising”. These distinctions lie at the core of our mesa-optimisation framework.
It’s helpful to begin with an analogy. Viewed abstractly, biological evolution is an optimisation process that searches through configurations of matter to find ones that are good at replication. Humans are a product of this optimisation process, and so we are to some extent good at replicating. Yet we don’t care, by and large, about replication in itself.
Many things we care about look like replication. One might be motivated by starting a family, or by having a legacy, or by similar closely related things. But those are not replication itself. If we cared about replication directly, gamete donation would be a far more mainstream practice than it is, for instance.
Thus I want to distinguish the objective of the selection pressure that produced humans from the objectives that humans pursue. Humans were selected for replication, so we are good replicators. This includes having goals that correlate with replication. But it is plain that we are not motivated by replication itself. As a slogan, though we are optimised for replication, we aren’t optimising for replication.
Another clear case where “optimised” and “optimising” come apart is that of “dumb” artifacts like bottle caps. They can be heavily optimised for some purpose without optimising for anything at all.
These examples support the first distinction I want to make: optimised ≠ optimising. They also illustrate how this distinction is important in two ways:
- A system optimised for an objective need not be pursuing any objectives itself. (As illustrated by bottle caps.)
- The objective a system pursues isn’t determined by the objective it was optimised for. (As illustrated by humans.)
The reason I draw this distinction is to ask the following question:
Our machine learning models are optimised for some loss or reward. But what are they optimising for, if anything? Are they like bottle caps, or like humans, or neither?
In other words, do RL agents have goals? And if so, what are they?
These questions are hard, and I don’t think we have good answers to any of them. In any case, it would be premature, in light of the optimised ≠ optimising distinction, to conclude that a trained RL agent is optimising for its reward signal.
Certainly, the RL agent (understood as the agent’s policy representation, since that is the part that does all of the interesting decision-making) is optimised for performance on its reward function. But in the same way that humans are optimised for replication, but are optimising for our own goals, a policy that was selected for its performance on reward may in fact have its own internally-represented goals, only indirectly linked to the intended reward. A pithy way to put this point is to say that utility ≠ reward, if we want to call the objective a system is optimising its “utility”. (This is by way of metaphor – I don’t suggest that we must model RL agents as expected utility maximizers.)
Let’s make this more concrete with an example. Say that we train an RL agent to perform well on a set of mazes. Reward is given for finding and reaching the exit door in each maze (which happens to always be red). Then we freeze its policy and transfer the agent to a new environment set for testing. In the new mazes, the exit doors are blue, and red distractor objects are scattered elsewhere in the maze. What might the agent do in the new environment?
Three things might happen.
- It might generalise: the agent could solve the new mazes just as well, reaching the exit and ignoring the distractors.
- It might break under the distributional shift: the agent, unused to the blue doors and weirdly-shaped distractor objects, could start twitching or walking into walls, and thus fail to reach the exit.
- But it might also fail to generalise in a more interesting way: the agent could fail to reach the exit, but could instead robustly and competently find the red distractor in each maze we put it in.
To the extent that it's meaningful to talk about the agent's goals, the contrast between the first and third cases suggests that those goals depend only on its policy, and are distinct from its reward signal. It is tempting to say that the objective of the first agent is reaching doors; that the objective of the third agent is to reach red things. It does not matter that in both cases, the policy was optimised to reach doors.
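A toy sketch can make the first and third cases concrete. Everything in it is an assumption made for illustration: the `Thing` record, the two hand-written policies `seek_doors` and `seek_red` (standing in for what training might have produced), and the contents of the two mazes. It is not a trained agent, only a picture of how two objectives can earn identical reward in training and still come apart at test time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thing:
    colour: str  # e.g. "red", "blue", "grey"
    kind: str    # e.g. "door", "distractor", "wall"


def seek_doors(things):
    """Hypothetical policy whose learned objective is 'reach the door'."""
    return next(t for t in things if t.kind == "door")


def seek_red(things):
    """Hypothetical policy whose learned objective is 'reach red things'."""
    return next(t for t in things if t.colour == "red")


def reward(target):
    """The reward signal: given only for reaching the exit door."""
    return 1.0 if target.kind == "door" else 0.0


# Training maze (assumed contents): the exit door is red, nothing else is.
train_maze = [Thing("grey", "wall"), Thing("red", "door")]
# Test maze: the exit door is blue, and a red distractor is present.
test_maze = [Thing("red", "distractor"), Thing("blue", "door")]

policies = (seek_doors, seek_red)
train_rewards = {p.__name__: reward(p(train_maze)) for p in policies}
test_rewards = {p.__name__: reward(p(test_maze)) for p in policies}

print("training:", train_rewards)
print("testing: ", test_rewards)
```

Both policies earn full reward in the training maze, so the reward alone cannot tell them apart. Only the test maze reveals which objective each one carries.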
This makes sense if we consider how information about the reward gets into the policy:
For any given action, the policy’s decision is made independently of the reward signal. The reward is only used (standardly, at least) to optimise the policy between actions. So the reward function can’t be the policy’s objective – one cannot be pursuing something one has no direct access to. At best, we can hope that whatever objective the learned policy has access to is an accurate representation of the reward. But the two can come apart, so we must draw a distinction between the reward itself and the policy’s internal objective representation.
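The point about access can be sketched in code. The `TabularPolicy` class below is a hypothetical stand-in for a learned policy, not any particular RL algorithm; its names, the step size and the optimistic initial preferences are all my assumptions. What matters is where `reward` appears: only in `update`, never in `act`.

```python
class TabularPolicy:
    """A tiny policy: one preference per action, greedy action selection."""

    def __init__(self, actions, step_size=0.5, initial_preference=1.0):
        # Optimistic initial preferences make every action get tried once.
        self.preferences = {a: initial_preference for a in actions}
        self.step_size = step_size

    def act(self):
        # No reward argument: the decision depends only on the parameters
        # that earlier updates left behind.
        return max(self.preferences, key=self.preferences.get)

    def update(self, action, reward):
        # The only place the reward is used: adjusting parameters
        # between actions.
        old = self.preferences[action]
        self.preferences[action] = old + self.step_size * (reward - old)


def training_reward(action):
    return 1.0 if action == "right" else 0.0


policy = TabularPolicy(["left", "right"])
for _ in range(20):
    action = policy.act()
    policy.update(action, training_reward(action))


# Freeze the policy: from here on we only call act(), never update().
def new_reward(action):
    return 1.0 if action == "left" else 0.0


frozen_actions = [policy.act() for _ in range(5)]
rewards_under_new_function = [new_reward(a) for a in frozen_actions]
```

Swapping in a different reward function after freezing changes nothing about what the policy does, because `act` never consults it. Whatever the policy is "after" lives in `preferences`, which is at best a learned summary of past reward.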
To recap: whether an AI system is goal-directed or not is not trivially answered by the fact that it was constructed to optimise an objective. To say that is to fail to draw the optimised ≠ optimising distinction. If we then take seriously goal-directedness in AI systems, then we must draw a distinction between the AI’s internal learned objective and the objective it was trained on; that is, draw the utility ≠ reward distinction.
What objectives?
I’ve been talking about the objective of the RL agent, or its “utility”, as if it is an intuitively sensible object. But what actually is it, and how can we know it? In a given training setup, we know the reward. How do we figure out the utility?
Intuitively, the idea of the internal goal being pursued by a learned system feels compelling to me. Yet right now, we don't have any good ways to make the intuition precise – figuring out how to do that is an important open question. As we start thinking about how to make progress, there are at least two approaches we can take: what I’d call the behavioural approach and the internal approach.
Taking the behavioural approach, we look at how decisions made by a system systematically lead to certain outcomes. We then infer objectives from studying those decisions and outcomes, treating the system as a black box. For example, we could apply Inverse Reinforcement Learning to our trained agents. Eliezer’s formalisation of optimisation power also seems to follow this approach.
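As a very rough illustration of the behavioural approach, the sketch below scores candidate objectives against observed choices, treating the agent as a black box. It is far simpler than Inverse Reinforcement Learning and is not meant as an implementation of it; the observations, the two candidate objectives and the `consistency` score are all made up for the example.

```python
# Each observation pairs the options the agent could reach with the one it
# actually went to. Options are (colour, kind) tuples. All data is invented.
observations = [
    ([("red", "door"), ("grey", "wall")], ("red", "door")),
    ([("blue", "door"), ("red", "distractor")], ("red", "distractor")),
    ([("red", "distractor"), ("blue", "door")], ("red", "distractor")),
]

# Hypothetical candidate objectives, each a predicate on the chosen option.
candidate_objectives = {
    "reach a door": lambda option: option[1] == "door",
    "reach a red thing": lambda option: option[0] == "red",
}


def consistency(objective, observations):
    """Fraction of observed choices that satisfy the candidate objective."""
    hits = sum(1 for _, chosen in observations if objective(chosen))
    return hits / len(observations)


scores = {
    name: consistency(objective, observations)
    for name, objective in candidate_objectives.items()
}
best_guess = max(scores, key=scores.get)
print(scores, "->", best_guess)
```

Nothing here looks inside the agent: the inference rests entirely on outcomes, which is both the appeal and the limitation of the approach.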
Or, we can peer inside the system, trying to understand the algorithm implemented by it. This is the internal approach. The goal is to achieve a mechanistic model that is abstract enough to be useful, but still grounded in the agent’s inner workings. Interpretability and transparency research take this approach generally, though as far as I can tell, the specific question of objectives has not yet seen much attention.
It’s unclear whether one approach is better, as both potentially offer useful tools. At present, I am more enthusiastic about the internal approach, both philosophically and as a research direction. Philosophically, I am more excited about it because understanding a model’s decision-making feels more explanatory than making generalisations about its behaviour. As a research direction, it has potential for empirically-grounded insights which might scale to future prosaic AI systems. Additionally, there is the possibility of low-hanging fruit, as this space appears underexplored.
Why worry?
Utility and reward are distinct. So what? If a system is truly optimised for an objective, determining its internal motivation is an unimportant academic debate. Only its real-world performance matters, not the correct interpretation of its internals. And if the performance is optimal, then isn’t our work done?
In practice, we don’t get to optimise performance completely. We want to generalise from limited training data, and we want our systems to be robust to situations not foreseen in training. This means that we don’t get to have a model that’s perfectly optimised for the thing we actually want. We don’t get optimality on the full deployment distribution complete with unexpected situations. At best, we know that the system is optimal on the training distribution. In this case, knowing whether the internal objective of the system matches the objective we selected it for becomes crucial: if the system’s capabilities generalise while its internal goal is misaligned, bad things can happen.
Say that we prove, somehow, that optimising the world with respect to some objective is safe and useful, and that we can train an RL agent using that objective as reward. The utility ≠ reward distinction means that even in that ideal scenario, we are still not done with alignment. We still need to figure out a way to actually install that objective (and not a different objective that still results in optimal performance in training) into our agent. Otherwise, we risk creating an AI that appears to work correctly in training, but which is revealed to be pursuing a different goal when an unusual situation happens in deployment. So long as we don’t understand how objectives work inside agents, and how we can influence those objectives, we cannot be certain of the safety of any system we build, even if we literally somehow have a proof that the reward it was trained on was “correct”.
Will highly-capable AIs be goal-directed? I don’t know for sure, and it seems hard to gather evidence about this, but my guess is yes. Detailed discussion is beyond our scope, but I invite the interested reader to look at some arguments about this that we present in section 2 of the report. I also endorse Rohin Shah’s “Will Humans Build Goal-Directed Agents?”.
All this opens the possibility for misalignment between reward and utility. Are there reasons to believe the two will actually come apart? By default, I expect them to. Ambiguity and underdetermination of reward mean that there are many distinct objectives that all result in the same behaviour in training, but which can disagree in testing. Think of the maze agent, whose reward in training could mean “go to red things” or “go to doors”, or a combination of the two. For reasons of bounded rationality, I also expect pressures for learning proxies for the reward instead of the true reward, when such proxies are available. Think of humans, whose goals are largely proxies for reproductive success, rather than replication itself. (This was a very brief overview; section 3 of our report examines this question in depth, and expands on these points more.)
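The underdetermination point can be checked by brute force on the maze example. The sketch below is my own construction: it treats an objective as a set of objects the agent wants to reach, and it assumes that in training the agent only ever meets red doors (rewarded) and blue distractors (unrewarded). The training setup described earlier does not specify the latter, so treat it as an illustrative assumption.

```python
from itertools import product

COLOURS = ("red", "blue")
KINDS = ("door", "distractor")
objects = list(product(COLOURS, KINDS))  # the 4 possible (colour, kind) objects

# Every candidate objective is a subset of objects worth reaching: 2**4 = 16.
candidates = [
    frozenset(obj for obj, wanted in zip(objects, mask) if wanted)
    for mask in product((False, True), repeat=len(objects))
]

# What the reward signal says about the objects seen in training (assumed).
training_labels = {("red", "door"): True, ("blue", "distractor"): False}

# Keep the objectives that agree with the reward on everything seen so far.
consistent = [
    c
    for c in candidates
    if all((obj in c) == rewarded for obj, rewarded in training_labels.items())
]

# At test time the agent meets objects never seen in training.
wants_blue_door = sorted(("blue", "door") in c for c in consistent)
wants_red_distractor = sorted(("red", "distractor") in c for c in consistent)

for c in consistent:
    print(sorted(c))
```

Four objectives survive, including “go to red things” and “go to doors”, and they split evenly on whether the blue door or the red distractor is worth reaching. The training reward has no say in which of them the policy ends up with.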
The second reason these ideas matter is that we might not want goal-directedness at all. Maybe we just want tool AI, or AI services, or some other kind of non-agentic AI. Then, we want to be certain that our AI is not somehow goal-directed in a way that would cause trouble off-distribution. This could happen without us building it in – after all, evolution didn’t set out to make goal-directed systems. Goal-directedness just turned out to be a good feature to include in its replicators. Likewise, it may be that goal-directedness is a performance-boosting feature in classifiers, so powerful optimisation techniques would create goal-directed classifiers. Yet perhaps we are willing to take the performance hit in exchange for ensuring our AI is non-agentic. Right now, we don’t even get to choose, because we don’t know when systems are goal-directed, nor how to influence learning processes to avoid learning goal-directedness.
Taking a step back, there is something fundamentally concerning about all this.
We don’t understand our AIs’ objectives, and we don’t know how to set them.
I don’t think this phrase should ring true in a world where we hope to build friendly AI. Yet today, to my ears, it does. I think that is a good reason to look more into this question, whether to solve it or to assure ourselves that the situation is less bad than it sounds.
Mesa-optimisers
This worry is the subject of our report. The framework of mesa-optimisation is a language for talking about goal-directed systems under the influence of optimisation processes, and about the objectives involved.
A part of me is worried that the terminology invites viewing mesa-optimisers as a description of a very specific failure mode, instead of as a language for the general worry described above. I don’t know to what degree this misconception occurs in practice, but I wish to preempt it here anyway. (I want data on this, so please leave a comment if you had confusions about this after reading the original report.)
In brief, our terms describe the relationship between a system doing some optimisation (the base optimiser, e.g.: evolution, SGD), and a goal-directed system (the mesa-optimiser, e.g.: human, ML model) that is being optimised by that first system. The objective of the base optimiser is the base objective; the internal objective of the mesa-optimiser is the mesa-objective.
(“Mesa” is a Greek word that means something like the opposite of “meta”. The reason we use “mesa” is to highlight that the mesa-optimiser is an optimiser that is itself being optimised by another optimiser. It is a kind of dual to a meta-optimiser, which is an optimiser that is itself optimising another optimiser. There is contention around whether this is good Greek, however.
While we’re on the topic of terms, “inner optimiser” is a confusing term that we used in the past in the same way as “mesa-optimiser”. It did not accurately reflect the concept, and has been retired in favour of the current terminology. Please use “mesa-optimiser” instead.)
I see the word “optimiser” in “mesa-optimiser” as a way of capturing goal-directedness, rather than a commitment to some kind of (utility-)maximising structure. What feels important to me in a mesa-optimiser is its goal-directedness, not the fact it is an optimiser. A goal-directed system which isn’t taking strictly optimal actions (but which is still competent at pursuing its mesa-objective) is still worrying.
Optimisation could be a good way to model goal-directedness—though I don’t think you gain that much, conceptually, from that model—but equally, it seems plausible that some other approach we have not yet explored could work better. So I myself read the “optimiser” in “mesa-optimiser” analogously to how I accept treating humans as optimisers; as a metaphor, more than anything else.
I am not sure that mesa-optimisation is the best possible framing of these concerns. I would welcome more work that attempts to untangle these ideas, and to improve our concepts.
An alignment agenda
There are at least three alignment-related ideas prompted by this worry.
The first is unintended optimisation. How do we ensure that systems that are not supposed to be goal-directed actually end up being not-goal-directed?
The second is to factor alignment into inner alignment and outer alignment. If we expect our AIs to be goal-directed, we can view alignment as a two-step process. First, ensure outer alignment between humans and the base objective of the AI training setup, and then ensure inner alignment between the base objective and the mesa-objective of the resulting system. The former involves finding reward functions that are low-impact, corrigible, aligned with human preferences, or otherwise desirable, and has been the focus of much of the progress made by the alignment community so far. The latter involves figuring out learned goals, interpretability, and a whole host of other potential approaches that have not yet seen much popularity in alignment research.
The third is something I want to call end-to-end alignment. It’s not obvious that alignment must factor in the way described above. There is room for trying to set up training in such a way as to guarantee a friendly mesa-objective somehow, without matching it to a friendly base objective. That is: to align the AI directly to its human operator, instead of aligning the AI to the reward, and the reward to the human. It’s unclear how this kind of approach would work in practice, but this is something I would like to see explored more. I am drawn to staying focused on what we actually care about (the mesa-objective) and treating other features as merely levers that influence the outcome.
We must make progress on at least one of these problems if we want to guarantee the safety of prosaic AI. If we don’t want goal-directed AI, we need to reliably prevent unintended optimisation. Otherwise, we want to solve either both inner and outer alignment, or end-to-end alignment. Success at any of these requires a better understanding of goal-directedness in ML systems, and a better idea of how to control the emergence and nature of learned objectives.
More broadly, it seems that taking these worries seriously will require us to develop better tools for looking inside our AI systems and understanding how they work. In light of these concerns I feel pessimistic about relying solely on black-box alignment techniques. I want to be able to reason about what sort of algorithm is actually implemented by a powerful learned system if I am to feel comfortable deploying it.
Right now, learned systems are (with maybe the exception of feature representation in vision) more-or-less hopelessly opaque to us. Not just in terms of goals, which is the topic here—most aspects of their cognition and decision-making are obscure. The alignment concern about objectives that I am presenting here is just one argument for why we should take this obscurity seriously; there may be other risks hiding in our poor understanding of AI inner workings.
Where does this leave us?
In summary, whether a learned system is pursuing any objective is far from a trivial question. It is also not trivially true that a system optimised for achieving high reward is optimising for reward.
This means that with our current techniques and understanding, we don’t get to know or control what objective a learned system is pursuing. This matters because in unusual situations, it is that objective that will determine the system’s behaviour. If that objective mismatches the base objective, bad things can happen. More broadly, our ignorance about the cognition of current systems does not bode well for our prospects of understanding cognition in more capable systems.
This forms a substantial hole in our prospects of aligning prosaic AI. What sort of work would help patch this hole? Here are some candidates:
- Empirical work. Distilling examples of goal-directed systems and creating convincing scaled-down examples of inner alignment failures, like the maze agent example.
- Philosophical, deconfusion and theoretical work. Improving our conceptual frameworks about goal-directedness. This is a promising place for philosophers to make technical contributions.
- Interpretability and transparency. Getting better tools for understanding decision-making, cognition and goal-representation in ML systems.
These feel to me like the most direct attacks on the problem. I also think there could be relevant work to be done in verification, adversarial training, and even psychology and neuroscience (I have in mind something like a review of how these processes are understood in humans and animals, though that might come up with nothing useful), and likely in many more areas: this list is not intended to be exhaustive.
While the present state of our understanding feels inadequate, I can see promising research directions. This leaves me hopeful that we can make substantial progress, however confusing these questions appear today.
By “utility”, I mean something like “the goal pursued by a system”, in the way that it’s used in decision theory. In this post, I am using this word loosely, so I don’t give a precise definition. In general, however, clarity on what exactly “utility” means for an RL agent is an important open question.
Perhaps the intuition I have is a distant cousin to the distinction drawn by Einstein between principle and constructive theories. The internal approach seems more like a “constructive theory” of objectives.