Clarifying inner alignment terminology

by evhub · 2 min read · 9th Nov 2020

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I have seen a lot of confusion recently surrounding exactly how outer and inner alignment should be defined, so I want to offer my attempt at a clarification.

Here's my diagram of how I think the various concepts should fit together:

[Diagram: inner alignment points to objective robustness; outer alignment and objective robustness together point to intent alignment; intent alignment and capability robustness together point to alignment.]

The idea of this diagram is that the arrows are implications—that is, for any problem in the diagram, if its direct subproblems are solved, then it should be solved as well (though not necessarily vice versa). Thus, we get:

- inner alignment → objective robustness
- outer alignment + objective robustness → intent alignment
- intent alignment + capability robustness → alignment
And here are all my definitions of the relevant terms which I think produce those implications:

Alignment: An agent is aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.

Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans.

Outer alignment: An objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.[1]

Robustness: An agent is robust if it performs well on the base objective it was trained under even in deployment/off-distribution.[2]

Objective robustness: An agent is objective robust if its behavioral objective is aligned with the base objective it was trained under.

Capability robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.

Inner alignment: A mesa-optimizer is inner aligned if its mesa-objective is aligned with the base objective it was trained under.
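
As a rough formalization of the outer alignment definition above (my own notation, not from the post): writing $L$ for the objective function, $\mathcal{D}$ for the distribution of data the model actually encounters (per footnote 1), and $M$ for a model, the condition reads something like

$$\text{OuterAligned}(L) \iff \forall M^* \in \operatorname*{arg\,min}_M \; \mathbb{E}_{x \sim \mathcal{D}}\!\left[L(M, x)\right] : \; \text{IntentAligned}(M^*).$$

That is, every model achieving optimal loss must be intent aligned, not just some optimal model.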


And an explanation of each of the diagram's implications:

Inner alignment → objective robustness: If a model is a mesa-optimizer, then its behavioral objective should match its mesa-objective, which means that if its mesa-objective is aligned with the base objective, then its behavioral objective should be too.

Outer alignment + objective robustness → intent alignment: Outer alignment ensures that the base objective is measuring what we actually care about, and objective robustness ensures that the model's behavioral objective is aligned with that base objective. Thus, putting them together, we get that the model's behavioral objective must be aligned with humans, which is precisely intent alignment.

Intent alignment + capability robustness → alignment: Intent alignment ensures that the behavioral objective is aligned with humans, and capability robustness ensures that the model actually pursues that behavioral objective effectively—even off-distribution—which means that the model will actually always take aligned actions, not just have an aligned behavioral objective.
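
To make the diagram's structure concrete, here is a minimal sketch (my own illustration, not from the post) that encodes the implications as a directed graph and mechanically checks the conjunctions discussed in the FAQ below. The `SUBPROBLEMS` table and function names are mine:

```python
# A minimal sketch encoding the implication diagram: a problem counts as
# solved once all of its direct subproblems are solved.

SUBPROBLEMS = {
    # problem: direct subproblems that jointly imply it
    "objective robustness": ["inner alignment"],  # implication assumes the model is a mesa-optimizer
    "intent alignment": ["outer alignment", "objective robustness"],
    "alignment": ["intent alignment", "capability robustness"],
}

def solved(problem, assumptions):
    """A problem is solved if assumed outright, or if all of its direct subproblems are solved."""
    if problem in assumptions:
        return True
    subproblems = SUBPROBLEMS.get(problem)
    return subproblems is not None and all(solved(p, assumptions) for p in subproblems)

# Outer + inner alignment give intent alignment but not alignment in general:
assert solved("intent alignment", {"outer alignment", "inner alignment"})
assert not solved("alignment", {"outer alignment", "inner alignment"})

# Adding capability robustness closes the gap:
assert solved("alignment", {"outer alignment", "inner alignment", "capability robustness"})

# Objective robustness can substitute for inner alignment:
assert solved("alignment", {"outer alignment", "objective robustness", "capability robustness"})
```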


FAQ

If a model is both outer and inner aligned, what does that imply?

Intent alignment. Reading off the implications from the diagram, we can see that the conjunction of outer and inner alignment gets us to intent alignment, but not all the way to alignment in general, as we're missing capability robustness.

Can alignment be split into outer alignment and inner alignment?

No. As I just mentioned, the conjunction of both outer and inner alignment only gives us intent alignment, not alignment in general. Furthermore, if the model is not a mesa-optimizer, then it can be objective robust (and thus intent aligned) without being inner aligned.

Does a model have to be inner aligned to be aligned?

No—we only need inner alignment if we're dealing with mesa-optimization. While we can get alignment through a combination of inner alignment, outer alignment, and capability robustness, the diagram tells us that we can get the same exact thing if we substitute objective robustness for inner alignment—and while inner alignment implies objective robustness, the converse is not true.

How does this breakdown distinguish between the general concept of inner alignment as failing “when your capabilities generalize but your objective does not” and the more specific concept of inner alignment as “eliminating the base-mesa objective gap?”[3]

Only the more specific definition is inner alignment. Under this set of terminology, the more general definition instead refers to objective robustness, of which inner alignment is only a subproblem.

What type of problem is deceptive alignment?[4]

Inner alignment—assuming that deception requires mesa-optimization. If we relax that assumption, then it becomes an objective robustness problem. Since deception is a problem with the model trying to do the wrong thing, it's clearly an intent alignment problem rather than a capability robustness problem—and see here for an explanation of why deception is never an outer alignment problem. Thus, it has to be an objective robustness problem—and if we're dealing with a mesa-optimizer, an inner alignment problem.

What type of problem is training a model to maximize paperclips?

Outer alignment—maximizing paperclips isn't an aligned objective even in the limit of infinite data.


  1. What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters. That gets a bit tricky for reinforcement learning, though in that setting we can ask for the model to act according to the optimal policy on the actual MDP that it experiences. ↩︎

  2. Note that robustness as a whole isn't included in the diagram as I thought it made it too messy. For an implication diagram with robustness instead of intent alignment, see here. ↩︎

  3. See here for an example of this confusion regarding the more general vs. more specific uses of inner alignment. ↩︎

  4. See here for an example of this confusion regarding deceptive alignment. ↩︎


Comments

+1, great post.

Only nitpick: seems like it's worth clarifying what you mean by "infinite data" - from which distribution? And same with "off-distribution".

Thanks! And good point—I added a clarifying footnote.

Hmm, I think this is still missing something.

  1. "What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters" - I assume you instead mean all data points that it could ever encounter? Otherwise memorisation is a sufficient strategy, since it will only ever have encountered a finite number of data points.
  2. When you say "the optimal policy on the actual MDP that it experiences", is this just during training, or also during deployment? And if the latter, given that the world is non-stationary, in what sense are you referring to the "actual MDP"? (This is a hard question, and I'd be happy if you handwave it as long as you do so explicitly. Although I do think that the fact that the world is not a MDP is an important and overlooked fact).

I assume you instead mean all data points that it could ever encounter? Otherwise memorisation is a sufficient strategy, since it will only ever have encountered a finite number of data points.

No—all data points that it could ever encounter is stronger than I need and harder to define, since it relies on a counterfactual. All I need is for the model to always output the optimal loss answer for every input that it's ever actually given at any point.

When you say "the optimal policy on the actual MDP that it experiences", is this just during training, or also during deployment? And if the latter, given that the world is non-stationary, in what sense are you referring to the "actual MDP"? (This is a hard question, and I'd be happy if you handwave it as long as you do so explicitly. Although I do think that the fact that the world is not a MDP is an important and overlooked fact).

Deployment, but I agree that this one gets tricky. I don't think that the fact that the world is non-stationary is a problem for conceptualizing it as an MDP, since whatever transitions occur can just be thought of as part of a more abstract state. That being said, modeling the world as an MDP does still have problems—for example, the original reward function might not really be well-defined over the whole world. In those sorts of situations, I do think it gets to the point where outer alignment starts breaking down as a concept.

Planned summary for the Alignment Newsletter:

This post clarifies the author’s definitions of various terms around inner alignment. Alignment is split into intent alignment and capability robustness, and then intent alignment is further subdivided into outer alignment and objective robustness. Inner alignment is one way of achieving objective robustness, in the specific case that you have a mesa-optimizer. See the post for more details on the definitions.

Planned opinion:

I’m glad that definitions are being made clear, especially since I usually use these terms differently than the author. In particular, as mentioned in my opinion on the highlighted paper, I expect performance to smoothly go up with additional compute, data, and model capacity, and there won’t be a clear divide between capability robustness and objective robustness. As a result, I prefer not to divide these as much as is done in this post.

Thanks for writing this. 

I wish you included an entry for your definition of 'mesa-optimizer'. When you use the term, do you mean the definition from the paper* (an algorithm that's literally doing search using the mesa objective as the criterion), or you do speak more loosely (e.g., a mesa-optimizer is an optimizer in the same sense as a human is an optimizer)? 

A related question is: how would you describe a policy that's a bag of heuristics which, when executed, systematically leads to interesting (low-entropy) low-base-objective states?

*incidentally, looking back on the paper, it doesn't look like we explicitly defined things this way, but it's strongly implied that that's the definition, and appears to be how the term is used on AF.

Glad you liked it! I definitely mean mesa-optimizer to refer to something mechanistically implementing search. That being said, I'm not really sure whether humans count or not on that definition—I would probably say humans do count but are fairly non-central. In terms of the bag of heuristics model, I probably wouldn't count that, though it depends on what “bag of heuristics” means exactly—if the heuristics are being used to guide a planning process or something, then I would call that a mesa-optimizer.

Great post! Thanks for writing this — it feels quite clarifying. I'm finding the diagram especially helpful in resolving the sources of my confusion.

I believe everything here is consistent with the definitions I proposed recently in this post (though please do point out any inconsistencies if you see them!), with the exception of one point.

This may be a fundamental confusion on my part — but I don't see objective robustness, as defined here, as being a separate concept at all from inner alignment. The crucial point, I would argue, is that we ought to be treating the human who designed our agent as the base optimizer for the entire system. 

Zooming in on the "inner alignment → objective robustness" part of the diagram, I think what's actually going on is something like:
 

  1. A human AI researcher wishes to optimize for some base objective, $o_{\text{base}}$.

  2. It would take too much work for our researcher to optimize for $o_{\text{base}}$ manually. So our researcher builds an agent to do the work instead, and sets $o_{\text{base}}$ to be the agent's loss function.

  3. Depending on how it's built, the agent could end up optimizing for $o_{\text{base}}$, or it could end up optimizing for something different. The thing the agent ends up truly optimizing for is the agent's behavioral objective — let's call it $o_{\text{agent}}$. If $o_{\text{agent}}$ is aligned with $o_{\text{base}}$, then the agent satisfies objective robustness by the above definition: its behavioral objective is aligned with the base. So far, so good.

     But here's the key point: from the point of view of the human researcher who built the agent, the agent is actually a mesa-optimizer, and the agent's "behavioral objective" is really just the mesa-objective of that mesa-optimizer.

  4. And now, we've got an agent that wishes to optimize for some mesa-objective $o_{\text{agent}}$. (Its "behavioral objective" by the above definition.)

  5. And then our agent builds a sub-agent to do the work instead, and sets $o_{\text{agent}}$ to be the sub-agent's loss function.

  6. I'm sure you can see where I'm going with this by now, but the sub-agent the agent builds will have its own objective $o_{\text{sub}}$ which may or may not be aligned with $o_{\text{agent}}$, which may or may not in turn be aligned with $o_{\text{base}}$. From the point of view of the agent, that sub-agent is a mesa-optimizer. But from the point of view of the researcher, it's actually a "mesa-mesa-optimizer".
     

That is to say, I think there are three levels of optimizers being invoked implicitly here, not just two. Through that lens, "intent alignment", as defined here, is what I'd call "inner alignment between the researcher and the agent"; and "inner alignment", as defined here, is what I'd call "inner alignment between the agent and the mesa-optimizer it may give rise to".
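
To spell out the hierarchy, here's a minimal sketch (purely illustrative, using the notation above) of the three levels and the same alignment check applied at each adjacent pair:

```python
# A minimal sketch of the three-level hierarchy described above: each level
# hands its own objective down as the loss function of the level below it,
# and each handoff can open a base/mesa objective gap.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Optimizer:
    name: str
    objective: str                          # what this optimizer actually pursues
    built_by: Optional["Optimizer"] = None  # the optimizer one level up, if any

researcher = Optimizer("researcher", objective="o_base")
agent = Optimizer("agent", objective="o_agent", built_by=researcher)   # mesa-optimizer w.r.t. the researcher
sub_agent = Optimizer("sub-agent", objective="o_sub", built_by=agent)  # "mesa-mesa-optimizer" w.r.t. the researcher

def inner_aligned(child):
    """Inner alignment between adjacent levels: the child pursues the
    objective of the optimizer that built it."""
    return child.built_by is not None and child.objective == child.built_by.objective

# The post's "intent alignment" is the researcher/agent check; the post's
# "inner alignment" is the agent/sub-agent check. Here both gaps are open:
print(inner_aligned(agent), inner_aligned(sub_agent))  # False False
```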

In other words, humans live in this hierarchy too, and we should analyze ourselves in the same terms — and using the same language — as we'd use to analyze any other optimizer. (I do, for what it's worth, make this point in my earlier post — though perhaps not clearly enough.)

Incidentally, this is one of the reasons I consider the concepts of inner alignment and mesa-optimization to be so compelling. When a conceptual tool we use to look inside our machines can be turned outward and aimed back at ourselves, that's a promising sign that it may be pointing to something fundamental.

A final caveat: there may well be a big conceptual piece that I'm missing here, or a deep confusion that I have around one or more of these concepts that I'm still unaware of. But I wanted to lay out my thinking as clearly as I could, to make it as easy as possible for folks to point out any mistakes — would enormously appreciate any corrections!

I agree that what you're describing is a valid way of looking at what's going on—it's just not the way I think about it, since I find that it's not very helpful to think of a model as a subagent of gradient descent, as gradient descent really isn't itself an agent in a meaningful sense, nor do I think it can really be understood as “trying” to do anything in particular.

Sure, makes sense! Though to be clear, I believe what I'm describing should apply to optimizers other than just gradient descent — including optimizers one might think of as reward-maximizing agents.