Mesa-Optimizers vs “Steered Optimizers”

Steven Byrnes

Abstract:

The paper Risks from Learned Optimization introduced the term "inner alignment" in the context of a specific class of scenarios, namely a "base optimizer" which searches over a space of “inner” algorithms. If the inner algorithm is an optimizer, it's called a "mesa-optimizer", and if its objective differs from the base optimizer's objective, that's an "inner alignment" problem. In this post I want to plead for us to also keep in mind a different class of scenarios, which I'll call "Steered Optimizers", and which also has an "inner alignment" problem. The inner alignment problem for mesa-optimizers is directly analogous to the inner alignment problem for steered optimizers, but the specific failure modes and risk factors and solutions are all somewhat different. I’ll argue that it's at least comparably likely for our future AGIs to be steered optimizers rather than mesa-optimizers. So again, we should keep both scenarios in mind.

Introduction

I recently wrote a post about brain algorithms with "inner alignment" in the title, but I was talking about something kinda different than in the famous Risks from Learned Optimization paper that I was implicitly referring to. I didn't directly explain why I felt entitled to use the term “inner alignment” for this different situation, but I think it's worth going into, especially because it’s a more general approach to making AGI that goes beyond brain-inspired approaches.

(Terminology note: Following “Risks From Learned Optimization”, I will use the term "optimizer" in this post to mean an algorithm which uses foresight / planning to search over possible actions in pursuit of a particular goal, a.k.a. a "selection"-type optimizer. I want humans to count as “optimizers”, so I will also allow “optimizers” to sometimes choose actions for other reasons, and to maybe have inconsistent, context-dependent goals, as long as they at least sometimes use foresight / planning to pursue goals.)

Let's start with two scenarios in which we might create highly intelligent AI "optimizers":

1. Search-over-algorithms scenario: (this is the one from Risks from Learned Optimization). Here, you have a "base optimizer" which searches over a space of possible algorithms for an algorithm which performs very well according to a "base objective". For example, the base optimizer might be doing gradient descent on the weights of an RNN (large enough RNNs are Turing-complete!). Anyway, if the base optimizer settles on an inner algorithm which is itself an optimizer, then we call it a “mesa-optimizer”. Inner alignment is alignment between the mesa-optimizer’s objective and the base objective, while outer alignment is alignment between the base objective and what the programmer wants.

2. Steered Optimizer scenario: (this is how I think the human brain works, more or less, see my post "Inner alignment in the brain"). Here, you also have a two-layer structure, but the layers are different. The inner layer is an algorithm that does online learning, world-modeling, planning, and acting, and it is an optimizer. We wrote the inner-layer algorithm ourselves, and it is never modified or reset (the whole scenario is just one “episode”, in RL terms). But as the inner algorithm learns more and more, it becomes increasingly powerful, and increasingly difficult for us to understand—like comparing a newborn brain to an adult brain, where the latter carries a lifetime of experience and ideas. Meanwhile, the base layer watches the inner layer in real time, and tries to "steer" it towards optimizing the right target, using hooks that we had built into the inner layer’s source code. How does that steering work? In the simplest case, the base layer can be a reward function calculator, and it sends the reward information to the inner layer, which in turn tries to build a predictive model of the correlates of that reward signal, set them as a goal, and make foresighted plans to maximize them. (There could be other steering methods too—see below.) As in the other scenario, inner alignment is alignment between the inner layer’s objective(s) and the formula used by the base layer to compute rewards, while outer alignment is alignment between the latter and what the programmer wants.

Here’s a little comparison table:

Property	“Search Over Algorithms” scenario	“Steered Optimizer” scenario
Base layer	Run inner layer for N steps, compute score, do gradient descent on inner layer algorithm	Run inner layer. As it runs, watch it, and send rewards (or other signals) to “steer” it.
Inner layer	Arbitrary algorithm discovered by base layer	Known, hand-coded algorithm, involving learning, world-modeling, planning, acting, etc.
Interpretability of inner layer (by default, i.e. without special interpretability tools)	Always inscrutable	Starts from a known, simple state, but gets more and more inscrutable as it builds a complex world-model
What is the inner layer’s objective?	It might not have one. If it does, we don’t know what it is (by default)	We designed it to form and seek goals based on the steering signals it receives, but we don’t know its actual goals at any given time (by default)
How many training episodes?	Millions, I presume.	As few as one; maybe several, but more like a run-and-debug loop.
Are we doing this today?	Not really (but see references in “Risks from Learned Optimization”).	Not that I know of, off-hand, but it’s probably in the AI literature somewhere.
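
To make the "Base layer" row a bit more concrete, here is a minimal Python sketch of the two control flows. Everything in it is a toy assumption for illustration: the inner "algorithm" in the first scenario is a single number, the base optimizer is hill-climbing rather than gradient descent, and the steered inner layer is a tiny bandit-style learner rather than a world-modeling planner. The names (run_episode, InnerLayer, base_layer_reward, and so on) are made up for this sketch.

```python
import random

# --- Scenario 1: search over algorithms -------------------------------------
def run_episode(inner_param: float, n_steps: int) -> float:
    """Run one candidate inner algorithm for n_steps and return its score.

    Toy base objective (an assumption): each step scores best when
    inner_param is close to 3.0.
    """
    return sum(-(inner_param - 3.0) ** 2 for _ in range(n_steps))


def search_over_algorithms(n_episodes: int, n_steps: int, seed: int = 0) -> float:
    """Base layer: hill-climb over candidates, one full episode per candidate."""
    rng = random.Random(seed)
    best = 0.0
    best_score = run_episode(best, n_steps)
    for _ in range(n_episodes):
        candidate = best + rng.gauss(0.0, 0.5)
        score = run_episode(candidate, n_steps)
        if score > best_score:  # keep the candidate only if it scores better
            best, best_score = candidate, score
    return best


# --- Scenario 2: steered optimizer ------------------------------------------
class InnerLayer:
    """Hand-written online learner; its code never changes, only its state."""

    def __init__(self, actions):
        self.value = {a: 0.0 for a in actions}  # learned estimate of reward
        self.count = {a: 0 for a in actions}

    def act(self, rng: random.Random, explore: float = 0.1) -> str:
        if rng.random() < explore:
            return rng.choice(sorted(self.value))
        return max(sorted(self.value), key=self.value.get)

    def receive_reward(self, action: str, reward: float) -> None:
        """The hook through which the base layer steers (running mean update)."""
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]


def base_layer_reward(action: str) -> float:
    """Base layer: watches each action and computes a reward for it."""
    return 1.0 if action == "make_paperclip" else 0.0


def run_steered_optimizer(n_steps: int, seed: int = 0) -> InnerLayer:
    rng = random.Random(seed)
    inner = InnerLayer(["idle", "make_paperclip", "wander"])
    for _ in range(n_steps):  # a single long episode, never reset
        action = inner.act(rng)
        inner.receive_reward(action, base_layer_reward(action))
    return inner
```

The point of the contrast is where the loop sits. In the first scenario the base layer wraps whole episodes and only ever sees a score. In the second, the base layer sits inside the one long episode and feeds signals into an inner layer whose source code we wrote, but whose learned state we do not directly control.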

By the way, these two scenarios are not the only two possibilities, nor are they mutually exclusive. The obvious example for “not mutually exclusive” is the human brain, which fits nicely into both categories—the subcortex steers the neocortex (more on which below), and meanwhile evolution is a search-over-algorithms-type base optimizer for the whole brain.

Why might we expect AI researchers to build steered optimizers, rather than searches-over-algorithms?

(Update: I later massively elaborated this section into the post Against Evolution As An Analogy For How Humans Will Build AGI.)

Steered optimizers enable dramatically longer episodes than searches-over-algorithms. In the first line of the table above I wrote that search-over-algorithms involves running the inner layer for N steps per episode. How big is N? If we want to build a system that can learn a whole predictive world-model from scratch, that's an awfully big N! Evolution is a good example here; it picks a genotype and then spends many decades calculating its loss. Imagine doing ML with one gradient descent step per decade! For various reasons, I don't think this rules out a search-over-algorithms approach, but I definitely think it's a strike against its plausibility. Steered optimizers do not have this problem; they do not need to run through millions of episodes to reach excellent performance, just a single very long episode, or more likely dozens of very long episodes for debugging, hyperparameter search, etc.
As I keep mentioning, I think brains work as steered optimizers, with the steered optimizer subsystem centered around the neocortex (or pallium in birds and lizards), and the steering subsystem based in other parts of the brain. If I’m right about that, that would imply that (1) steered optimizers are a viable path to AGI, and (2) we have a straightforward-ish development path to get there (which lots of people are already working on), i.e. we “merely” need to reverse-engineer the neocortex.
Given that we know at least vaguely what a world-modeling-and-acting-and-planning algorithm is supposed to do and how, I think people will be able to write such an algorithm themselves faster than they could find it by blind search. I don't think it's that complicated an algorithm, compared to the collective capability of the worldwide AI community. (Why don’t I think the algorithm is horrifically complicated? Mainly from inside-view reading and thinking about neocortical algorithms, which I discussed most recently here. I could be wrong.)

Incidentally, if we’re writing the inner algorithm ourselves, why not just put the goal into the source code directly? Well, that would be awesome ... But it may not be possible! I think the easiest way to build the inner algorithm is to have it build a world-model from scratch, more-or-less by finding patterns in the input, and patterns in the patterns, etc. So if you want the AGI to have a goal of maximizing paperclips, we face the problem that there is no “paperclips” concept in its source code; it has to run for a while before forming that concept. That’s why we might instead build an AGI by letting it start learning and acting, and trying to steer it as it goes.

How might one steer an AGI steered optimizer?

As mentioned above, we can send reward signals—calculated automatically and/or by human overseers.
A human, assisted by interpretability tools, could reach in and add / subtract / edit goals. Or a similar thing could be automated.
You could build a hook in the inner layer for receiving natural language commands. Like maybe, whenever you press the button and talk into the microphone, whatever world-model concepts are internally activated by that speech become the inner layer’s goals (or something like that).
Any of the weird tricks that the brain uses, as discussed in my post "Inner alignment in the brain" and an earlier post about human instincts. (Update: Also this later post.)
I don’t know! I’m sure there are other things.
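
As a rough illustration of what "hooks built into the inner layer's source code" could mean, here is a hedged Python sketch with one method per steering channel from the list above. It is not a proposal for how a real system would work: the class SteerableInnerLayer, its method names, and the word-to-concept dictionary standing in for a learned world-model are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SteerableInnerLayer:
    # Toy stand-in for a learned world-model: word -> concept it activates.
    concepts: Dict[str, str] = field(default_factory=dict)
    # Current goals: concept -> strength.
    goals: Dict[str, float] = field(default_factory=dict)

    def learn_concept(self, word: str, concept: str) -> None:
        """Stand-in for the inner layer forming a concept from experience."""
        self.concepts[word.lower()] = concept

    def receive_reward(self, active_concepts: List[str], reward: float) -> None:
        """Hook 1: reward signal, credited to whatever concepts are active."""
        for concept in active_concepts:
            self.goals[concept] = self.goals.get(concept, 0.0) + reward

    def edit_goal(self, concept: str, strength: Optional[float]) -> None:
        """Hook 2: an overseer adds, edits, or (with None) removes a goal."""
        if strength is None:
            self.goals.pop(concept, None)
        else:
            self.goals[concept] = strength

    def receive_command(self, utterance: str) -> List[str]:
        """Hook 3: concepts activated by the spoken words become goals."""
        activated = [
            self.concepts[word]
            for word in utterance.lower().split()
            if word in self.concepts
        ]
        for concept in activated:
            self.goals[concept] = 1.0
        return activated
```

Note that receive_command does nothing useful until the relevant concept has been learned: calling it with "make more paperclips" on a fresh instance activates no concepts and installs no goals. That is the same problem described earlier, where there is no “paperclips” concept in the source code to point a goal at.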

Lessons from being a human

If the human neocortex is a steered optimizer, what can we learn from that example?

1. How does it feel to be steered?

You try a new food, and find it tastes amazing! This wonderful feeling is your subcortex sending a steering signal up to your neocortex. All of a sudden, a new goal has been installed in your mind: eat this food again! This is not your only goal in life, of course, but it is a goal, and you might use your intelligence to construct elaborate plans in pursuit of that goal, like shopping at a different grocery store so you can buy that food again.

It’s a bit creepy, if you think about it!

“You thought you had a solid identity? Ha!! Fool, you are a puppet! If your neocortex gets dopamine at the right times, all of a sudden you would want entirely different things out of life!”

2. What does Inner Alignment failure look like in humans?

A prototypical inner alignment failure would be knowing that there is some situation that would lead the subcortex to install a certain goal in our minds, and we don’t want to have that goal (according to our current goals), so we avoid that situation.

Imagine, for example, not trying a drug because you’re afraid of getting addicted.

To make that analogy explicit, you could imagine that our brain was designed by an all-powerful alien who wanted us to take the drug, and therefore set up our brain with a system that recognizes the chemical signature of that drug, and installs that drug as a new goal when that chemical signature is detected. At first glance, that’s not a bad design for a steering mechanism, and indeed it works sometimes. But we can undermine the alien's intentions by understanding how that steering mechanism works, and thus avoiding the drug.

A more prosaic example: practically every “productivity hack” is a scheme to manipulate our own future subcortical steering signals.

3. What would corrigible alignment look like in humans?

Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.

More random thoughts on steering

An AGI might be easier to steer than a human brain, if we can find a way to reliably steer in response to imagination / foresight, and not just actions. In the example above, where I am trying not to get addicted to a drug, my job is made pretty easy by the fact that I need to actually take the drug before getting addicted. Merely thinking about taking the drug will not install that goal in my brain. Maybe we can avoid that problem in our steered AGIs somehow? (Update: the brain sorta does this via supervised learning.)
I mentioned corrigible alignment above. I think that the sense of “corrigible alignment” which is most analogous to the “Risks from learned optimization” paper is like hedonism—valuing the reward steering signals, as an end in themselves. If that’s the definition, then I would be concerned that a corrigibly-aligned system solves the inner alignment problem while horribly exacerbating the outer alignment problem, because the system is now motivated to wirehead or otherwise game the reward signals. It’s not necessarily an unsolvable outer alignment problem—maybe an AGI could be motivated by both hedonism and a specific aversion to self-modification other than by normal learning, for example. But I’m awfully skeptical that this is a good starting point. I think it’s more promising to go for a different flavor of corrigibility, where we try to steer the system so that it becomes motivated by something like “the intentions of the programmer”, i.e. a flavor of corrigibility that tries to cut through both the inner and the outer alignment problems simultaneously. (Maybe this is my opinion about corrigible alignment in the search-over-algorithms scenario as well...)
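
The wireheading worry can be shown with a very small Python sketch. The two plans and their predicted outcomes are invented numbers, and the function best_plan is hypothetical. The only point is that the same planner picks very different plans depending on whether it values the reward signal itself or the thing the reward was meant to indicate.

```python
from typing import Dict

# Hypothetical predicted outcomes of two plans, as seen by the agent's world-model.
PLANS: Dict[str, Dict[str, float]] = {
    "make_paperclips": {"paperclips": 10.0, "reward_signal": 10.0},
    "tamper_with_reward_channel": {"paperclips": 0.0, "reward_signal": 1000.0},
}


def best_plan(valued_quantity: str) -> str:
    """Pick the plan with the highest predicted amount of the valued quantity."""
    return max(sorted(PLANS), key=lambda plan: PLANS[plan][valued_quantity])


# An agent that values the reward signal as an end in itself ("hedonism"):
hedonist_choice = best_plan("reward_signal")
# An agent that values the concept the rewards were pointing at:
concept_choice = best_plan("paperclips")
```

Here hedonist_choice is "tamper_with_reward_channel" and concept_choice is "make_paperclips". The hedonist has no inner alignment gap with respect to the reward signal, yet it is the one that games the signal.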

Related work

Deep RL from Human Preferences and Scalable Agent Alignment Via Reward Modeling both bring up the idea of taking reward signals, trying to understand those signals in the form of a predictive model, and then using that reward model as a target for training an agent (if I understand everything correctly). (This is not the only idea in the papers, and in most respects the papers are more like search-over-algorithms.) Anyway, that specific idea has parallels with how a steered optimizer would try to relate its reward signals to meaningful, predictive concepts in its world-model. In this post I’m trying to emphasize that reward-modeling part, and generalize it to other ways of steering agents. Also, unlike those papers, I prefer to merge the reward-modeling task and the choosing-actions task into a single model, because their requirements seem to heavily overlap, at least in the case of a powerful, world-modeling, optimizing agent. For example, the reward-modeling part needs to look at a bunch of reward signals and figure out that they correspond to the goal “maximize paperclips”; while the choosing-actions part needs to take the goal “maximize paperclips” and figure out how to actually do it. These two parts require much the same world-modeling capabilities, and indeed I don’t see how it would work except by having both parts actually referencing the same world-model.
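
Here is a minimal Python sketch of that last point, under heavy simplifying assumptions: the "world-model" is a hand-written table of predicted features per action, and reward-modeling is just picking the feature whose value covaries most with the observed rewards. None of this comes from the cited papers; WORLD_MODEL, infer_goal, and choose_action are names made up for the sketch.

```python
from typing import Dict, List, Tuple

Features = Dict[str, float]

# Shared world-model (toy): predicted consequences of each available action.
WORLD_MODEL: Dict[str, Features] = {
    "run_factory": {"paperclips": 5.0, "noise": 3.0},
    "play_music": {"paperclips": 0.0, "noise": 4.0},
    "sleep": {"paperclips": 0.0, "noise": 0.0},
}


def covariance(xs: List[float], ys: List[float]) -> float:
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)


def infer_goal(history: List[Tuple[Features, float]]) -> str:
    """Reward-modeling part: which world-model feature do the rewards track?"""
    rewards = [reward for _, reward in history]
    feature_names = sorted(history[0][0])
    return max(
        feature_names,
        key=lambda name: covariance([feats[name] for feats, _ in history], rewards),
    )


def choose_action(goal: str) -> str:
    """Choosing-actions part: use the same world-model to pursue the goal."""
    return max(sorted(WORLD_MODEL), key=lambda action: WORLD_MODEL[action][goal])


# Observed (features, reward) pairs; here reward happens to track paperclips.
history = [
    ({"paperclips": 2.0, "noise": 1.0}, 2.0),
    ({"paperclips": 0.0, "noise": 3.0}, 0.0),
    ({"paperclips": 4.0, "noise": 2.0}, 4.0),
]
goal = infer_goal(history)
action = choose_action(goal)
```

With this history, goal comes out as "paperclips" and action as "run_factory". Both functions are written in terms of the same feature names, which is the toy version of both parts referencing one world-model.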

(I'm sure there's other related work too, that’s just what jumped to my mind.)

(Thanks to Evan Hubinger for comments on a draft.)

Again analogizing from the definition in “Risks From Learned Optimization”, “corrigible alignment” would be developing a motivation along the lines of “whatever my subcortex is trying to reward me for, that is what I want!” Maybe the closest thing to that is hedonism? Well, I don’t think we want AGIs with that kind of corrigible alignment, for reasons discussed below.

At first this claim seemed kind of wild, but there's a version of it I agree with.

It seems like conditional on the inner optimizer being corrigible, in the sense of having a goal that's a pointer to some optimizer "outside" it, it's underspecified what it should point to. In the evolution -> humans -> gradient descent -> model example, corrigibility as defined in RLO could mean that the model is optimizing for the goals of evolution, humans, or the gradient. This doesn't seem to be different between the RLO and steered optimization stories.

I think the analogy to corrigible alignment among humans being hedonism assumes that a corrigibly aligned optimizer's goal would point to the thing immediately upstream of its reward. This is not obvious to me. It seems like wireheading / manipulating reward signals is a potential problem, but this is just a special case of not being able to steer an inner optimizer even conditional on it having a narrow corrigibility property.

Hmm, I think it’s probably more productive to just talk directly about the “steered optimizer” thing, instead of arguing about what’s the best analogy with RLO. ¯\_(ツ)_/¯

BTW this is an old post; see my more up-to-date discussion here, esp. Posts 8–10.

I dunno, the productivity hacks thing sounds pretty bad.

But yeah, doing better seems to be held up by the fact that we don't yet have a coherent way to describe the standards for doing better, when the human isn't an idealized sort of agent. Trying to steer the agent towards thinking of its goal as "do what the programmers want" is essentially talking about a machine-learning method of trying to find this description.

I dunno, the productivity hacks thing sounds pretty bad.

Well, we ought to be able to either figure out how to use this kind of system safely, or prove it's impossible. Either would be valuable. :-)

I don't think it's obviously impossible though. In particular, with the right motivation, it won't be motivated to undermine the steering signals. And also, the subcortex can be a slightly-less powerful AI, assisted by intrusive interpretability tools, multiple copies running faster, etc.

But yeah, doing better seems to be held up by the fact that we don't yet have a coherent way to describe the standards for doing better, when the human isn't an idealized sort of agent...

Yeah, I struggle with that too. Maybe an alternative (or at least starting point) would be to try to solve the challenge of building a question-answering oracle that has no motivation to lie or manipulate or escape its box, etc. I think that is a goal I can fully understand, although maybe I just haven't thought about it carefully enough to find the edge cases. :-)

A steered optimizer has an incentive to remove all steering control as fast as possible. A learned, static mesa-optimizer (from the search-over-algorithms scenario) seems to be in less of a hurry to have its treacherous turn. Perhaps this means steered optimizers are more likely to clumsily attempt to wrest control before they're strong enough?

But as a (human) steered optimizer, the way I relate to my steering is as my true values. Like, I might think that some crazy edge case sounds great (endlessly eating a hypercake in an endless forest of more and more interesting plants), but I always reserve some probability mass for in fact finding it empty and meaningless and not what I value (which is presumably what just-in-time steering feels like).

Thanks for the comment!

A steered optimizer has an incentive to remove all steering control...

Well, not necessarily. We could steer it into a motivational system in which it happily accepts steering signals, hopefully, right?

...as fast as possible... Perhaps this means steered optimizers are...likely to clumsily attempt to wrest control before they're strong enough?

That would be nice! One situation where it might fail is that it takes a while for the system to develop an understanding of its situation, and by the time it understands what the steering signals are and how they work, it is already competent enough to skillfully plan around them. More generally, I have low confidence about the relative difficulties and learning curves of a future AGI, and don't want to rely on anything like that, even if it seems intuitively probable.

After thinking about it for a minute, it's not obvious to me whether mesa-optimizers vs steered optimizers are better or worse on likelihood of clumsy failed attempts at treacherous turns...

Like, I might think that some crazy edge case sounds great (endlessly eating a hypercake in an endless forest of more and more interesting plants), but I always reserve some probability mass for in fact finding it empty and meaningless and not what I value

What if the hypercake was laced with a special nanobot that would travel around your brain and deactivate the "this is empty and meaningless" gut feeling and replace it with a "this is deeply fulfilling" feeling? Would you eat it then?

For me, for some of my goals, I feel a strong pull of goal preservation—like, I would commit today to a vow that, if "making the world a better place" ceased to feel fulfilling for me, and started to feel empty and pointless, I will alter my brain however necessary to make "making the world a better place" feel fulfilling again. Other goals I don't feel like I need to preserve: for example, I enjoy chocolate today, but I am not particularly disturbed by the thought that I might stop enjoying chocolate someday in the future, and start enjoying something else instead. I think the difference is outward-facing goals are in the first category, and goals that mainly impact myself are in the second category. Or maybe "socially-praiseworthy goals important to my self-image" are in the first category. Or something else. I don't know... :-)

We could steer it into a motivational system in which it happily accepts steering signals, hopefully, right?

That's true. I should have said "a misaligned steered optimizer"

don't want to rely on [things like AGI learning curves], even if it seems intuitively probable.

Strongly agree

What if the hypercake was laced with a special nanobot that would travel around your brain and deactivate the "this is empty and meaningless" gut feeling and replace it with a "this is deeply fulfilling" feeling? Would you eat it then?

Indeed not! I'm not sure if this is obvious (because the example was not excellently chosen), but I meant to suggest something like "if I had to choose my best guess at a thing that would be selfishly good for me in the future, I would care more about my actual experience of it (and subcortically-generated reward) than my guess of what I would feel now".

I think the difference is outward-facing goals are in the first category, and goals that mainly impact myself are in the second category

That was my first guess when reading your "making the world a better place" example. But I don't think it quite works. If I have an outward-facing goal of ensuring more people enter long-lasting meaningful relationships, I want that goal to be able to shift in the face of data from reality. But perhaps my imagination is misfiring because that's not actually a very important goal to me.

