Clarifying inner alignment terminology

[-]Vanessa Kosoy4y*Ω10130Review for 2020 Review

This post aims to clarify the definitions of a number of concepts in AI alignment introduced by the author and collaborators. The concepts are interesting, and some researchers evidently find them useful. Personally, I find the definitions confusing, but I did benefit a little from thinking about this confusion. In my opinion, the post could greatly benefit from introducing mathematical notation^[1] and making the concepts precise at least in some very simplistic toy model.

In the following, I'll try going over some of the definitions and explicating my understanding/confusion regarding each. The definitions I omitted either explicitly refer to these or have analogous structure.

(Impact) Alignment: An agent is impact aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic.

This one is more or less clear. Even though it's not a formal definition, it doesn't have to be: after all, this is precisely the problem we are trying to solve.

Intent Alignment: An agent is intent aligned if the optimal policy for its behavioral objective is impact aligned with humans.

The "behavioral objective" is defined in a linked page as:

The behavioral objective is what an optimizer appears to be optimizing for. Formally, the behavioral objective is the objective recovered from perfect inverse reinforcement learning.

This is already thorny territory, since it's far from clear what "perfect inverse reinforcement learning" is. Intuitively, an "intent aligned" agent is supposed to be one whose behavior demonstrates an aligned objective, but it can still make mistakes with catastrophic consequences. The example I imagine is: an AI researcher who is unwittingly building transformative unaligned AI.
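One possible reading of the difficulty, which is an assumption here since the comment does not spell it out, is that observed behavior alone does not pin down a unique objective. A minimal sketch in a one-step setting, with hypothetical action names and reward tables:

```python
def optimal_actions(reward):
    """Return the set of actions that maximize a reward table."""
    best = max(reward.values())
    return {a for a, r in reward.items() if r == best}

# Hypothetical observation: the agent always picks action "b".
observed_action = "b"

# Three candidate objectives (all invented for illustration).
reward_1 = {"a": 0.0, "b": 1.0, "c": 0.5}
reward_2 = {a: 2 * r + 5 for a, r in reward_1.items()}  # positive affine transform
reward_3 = {"a": 0.0, "b": 0.0, "c": 0.0}               # constant: every action is optimal

# Each candidate rationalizes the observed behavior equally well.
consistent = [
    observed_action in optimal_actions(r)
    for r in (reward_1, reward_2, reward_3)
]
```

All three tables make the observed choice optimal, so any notion of "perfect" IRL would need some extra rule for choosing among them.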

Capability Robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/off-distribution.

This is confusing because it's unclear what counts as "well" and what the underlying assumptions are. The no-free-lunch theorems imply that an agent cannot perform too well off-distribution, unless you're still constraining the distribution somehow. I'm guessing that either this agent is doing online learning or it's detecting off-distribution and failing gracefully in some sense, or maybe some combination of both.
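The no-free-lunch point can be seen in a small sketch. The setup is an assumption made for illustration: binary labels on a few off-distribution inputs, with every possible labeling weighted equally (that is, no constraint on the distribution at all). The names `average_accuracy`, `always_zero`, and `parity` are hypothetical.

```python
from itertools import product

def average_accuracy(predictor, inputs):
    """Mean accuracy of `predictor` over every possible binary labeling of `inputs`."""
    labelings = list(product([0, 1], repeat=len(inputs)))
    total = 0.0
    for labels in labelings:
        correct = sum(predictor(x) == y for x, y in zip(inputs, labels))
        total += correct / len(inputs)
    return total / len(labelings)

# Hypothetical inputs the agent never saw during training.
off_dist_inputs = [10, 11, 12, 13]

# Two very different predictors.
always_zero = lambda x: 0
parity = lambda x: x % 2

acc_zero = average_accuracy(always_zero, off_dist_inputs)
acc_parity = average_accuracy(parity, off_dist_inputs)
```

Both predictors average exactly 0.5 here. Doing better than chance off-distribution requires ruling out some labelings, which is the sense in which the distribution still has to be constrained.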

Notably, the post asserts the implication intent alignment + capability robustness => impact alignment. Now, let's go back to the example of the misguided AI researcher. In what sense are they not "capability robust"? I don't know.

Inner Alignment: A mesa-optimizer is inner aligned if the optimal policy for its mesa-objective is impact aligned with the base objective it was trained under.

The "mesa-objective" is defined in the linked page as:

A mesa-objective is the objective of a mesa-optimizer.

So it seems like we could replace "mesa-objective" with just "objective". This is confusing, because in other places the author felt the need to use "behavioral objective" but here he is referring to some other notion of objective, and it's not clear what the difference is.

[1] I guess that different people have different difficulties. I often hear that my own articles are difficult to understand because of the dense mathematics. But for me, it is the absence of mathematics which is difficult!

[-]Richard_Ngo5yΩ6100

+1, great post.

Only nitpick: seems like it's worth clarifying what you mean by "infinite data" - from which distribution? And same with "off-distribution".

[-]evhub5yΩ220

Thanks! And good point—I added a clarifying footnote.

[-]Richard_Ngo5yΩ350

Hmm, I think this is still missing something.

"What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters" - I assume you instead mean all data points that it could ever encounter? Otherwise memorisation is a sufficient strategy, since it will only ever have encountered a finite number of data points.
When you say "the optimal policy on the actual MDP that it experiences", is this just during training, or also during deployment? And if the latter, given that the world is non-stationary, in what sense are you referring to the "actual MDP"? (This is a hard question, and I'd be happy if you handwave it as long as you do so explicitly. Although I do think that the fact that the world is not a MDP is an important and overlooked fact).

[-]evhub5yΩ220

I assume you instead mean all data points that it could ever encounter? Otherwise memorisation is a sufficient strategy, since it will only ever have encountered a finite number of data points.

No—all data points that it could ever encounter is stronger than I need and harder to define, since it relies on a counterfactual. All I need is for the model to always output the optimal loss answer for every input that it's ever actually given at any point.

When you say "the optimal policy on the actual MDP that it experiences", is this just during training, or also during deployment? And if the latter, given that the world is non-stationary, in what sense are you referring to the "actual MDP"? (This is a hard question, and I'd be happy if you handwave it as long as you do so explicitly. Although I do think that the fact that the world is not a MDP is an important and overlooked fact).

Deployment, but I agree that this one gets tricky. I don't think that the fact that the world is non-stationary is a problem for conceptualizing it as an MDP, since whatever transitions occur can just be thought of as part of a more abstract state. That being said, modeling the world as an MDP does still have problems—for example, the original reward function might not really be well-defined over the whole world. In those sorts of situations, I do think it gets to the point where outer alignment starts breaking down as a concept.
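The move of folding non-stationarity into "a more abstract state" can be sketched as follows. The dynamics are invented for illustration (an action whose effect flips sign at a hypothetical time step 3); the only point is that adding the time step to the state makes the transition rule fixed.

```python
def nonstationary_step(state, action, t):
    """Hypothetical dynamics that change over time: the action's effect flips at t = 3."""
    direction = 1 if t < 3 else -1
    return state + direction * action

def stationary_step(aug_state, action):
    """Same dynamics, but time is part of the state, so the rule itself never changes."""
    state, t = aug_state
    return (nonstationary_step(state, action, t), t + 1)

def rollout_nonstationary(s0, actions):
    """Roll out the time-dependent dynamics, tracking t outside the state."""
    states, s = [], s0
    for t, a in enumerate(actions):
        s = nonstationary_step(s, a, t)
        states.append(s)
    return states

def rollout_stationary(s0, actions):
    """Roll out the time-augmented dynamics and report only the original state component."""
    states, aug = [], (s0, 0)
    for a in actions:
        aug = stationary_step(aug, a)
        states.append(aug[0])
    return states

actions = [1, 1, 1, 1, 1]
traj_a = rollout_nonstationary(0, actions)
traj_b = rollout_stationary(0, actions)
```

The two rollouts visit the same underlying states. The sketch says nothing about the harder issue raised above, namely whether the original reward function is well-defined over the enlarged state space.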

[-]Jack R4yΩ120

I'm not sure you have addressed Richard's point -- if you keep your current definition of outer alignment, then memorizing the answers to the finite set of data is always a way to score perfect loss, but intuitively doesn't seem like it would be intent aligned. And if memorization were never intent aligned, then your definition of outer alignment would be impossible.
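A minimal sketch of the memorization worry, under assumptions not in the thread: a hypothetical ground-truth labeling (`true_label`), a zero-one loss, and a lookup table that is filled by construction with the correct answer for every input actually encountered.

```python
def true_label(x):
    """Hypothetical ground truth: is x even?"""
    return x % 2 == 0

class Memorizer:
    """Lookup table: stores answers for seen inputs, guesses a default otherwise."""
    def __init__(self, default=False):
        self.table = {}
        self.default = default

    def observe(self, x, y):
        self.table[x] = y

    def predict(self, x):
        return self.table.get(x, self.default)

def zero_one_loss(model, xs):
    """Fraction of inputs in `xs` that the model gets wrong."""
    return sum(model.predict(x) != true_label(x) for x in xs) / len(xs)

# The finite set of inputs the model is ever actually given.
encountered = [0, 1, 2, 3, 4, 5]
model = Memorizer()
for x in encountered:
    model.observe(x, true_label(x))

# Inputs it could have been given, but never was.
never_encountered = [6, 7, 8, 9]

loss_on_encountered = zero_one_loss(model, encountered)
loss_on_counterfactual = zero_one_loss(model, never_encountered)
```

The table has optimal loss on every input it ever actually encounters, while being no better than a constant guess on the counterfactual inputs. That gap is the difference between "ever encounters" and "could ever encounter" in the two definitions discussed above.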

[-]Rohin Shah5yΩ670

Planned summary for the Alignment Newsletter:

This post clarifies the author’s definitions of various terms around inner alignment. Alignment is split into intent alignment and capability robustness, and then intent alignment is further subdivided into outer alignment and objective robustness. Inner alignment is one way of achieving objective robustness, in the specific case that you have a mesa optimizer. See the post for more details on the definitions.

Planned opinion:

I’m glad that definitions are being made clear, especially since I usually use these terms differently than the author. In particular, as mentioned in my opinion on the highlighted paper, I expect performance to smoothly go up with additional compute, data, and model capacity, and there won’t be a clear divide between capability robustness and objective robustness. As a result, I prefer not to divide these as much as is done in this post.

[-]Joe Carlsmith5y*Ω350

Thanks for writing this up. Quick question re: "Intent alignment: An agent is intent aligned if its behavioral objective is aligned with humans." What does it mean for an objective to be aligned with humans, on your view? You define what it is for an agent to be aligned with humans, e.g.: "An agent is aligned (with humans) if it doesn't take actions that we would judge to be bad/problematic/dangerous/catastrophic." But you don't say explicitly what it is for an objective to be aligned: I'm curious if you have a preferred formulation.

Is it something like: “the behavioral objective is such that, when the agent does ‘well’ on this objective, the agent doesn’t act in a way we would view as bad/problematic/dangerous/catastrophic." If so, it seems like a lot might depend on exactly how “well” the agent does, and what opportunities it has in a given context. That is, an “aligned” agent might not stay aligned if it becomes more powerful, but continues optimizing for the same objective (for example, a weak robot optimizing for beating me at chess might be "aligned" because it only focuses on making good chess moves, but a stronger one might not be, because it figures out how to drug my tea). Is that an implication you’d endorse?

Or is the thought something like: "the behavioral objective is such that, no matter how powerfully the agent optimizes for it, and no matter its opportunities for action, it doesn't take actions we would view as bad/problematic/dangerous/catastrophic"? My sense is that something like this is often the idea people have in mind, especially in the context of anticipating things like intelligence explosions. If this is what you have in mind, though, maybe worth saying so explicitly, since intent alignment in this sense seems like a different constraint than intent alignment in the sense of e.g. "the agent's pursuit of its behavioral objective does not in fact give rise to bad actions, given the abilities/contexts/constraints that will in fact be relevant to its behavior."
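The two readings can be separated in a toy version of the chess example. Everything here is hypothetical: the action names, the win probabilities, and the set of actions judged bad. The objective is held fixed and only the set of available actions changes.

```python
# Hypothetical objective: probability of winning the chess game.
win_probability = {
    "play_good_move": 0.6,
    "play_great_move": 0.8,
    "drug_opponents_tea": 0.99,
}

# Actions humans would judge bad (assumed for illustration).
judged_bad = {"drug_opponents_tea"}

def best_action(actions):
    """The action an optimal policy for the fixed objective would take."""
    return max(actions, key=win_probability.get)

def aligned_given(actions):
    """First reading: aligned relative to the opportunities actually available."""
    return best_action(actions) not in judged_bad

def aligned_for_all(action_sets):
    """Second reading: aligned no matter which opportunities are available."""
    return all(aligned_given(actions) for actions in action_sets)

weak_actions = ["play_good_move", "play_great_move"]
strong_actions = weak_actions + ["drug_opponents_tea"]

weak_ok = aligned_given(weak_actions)
strong_ok = aligned_given(strong_actions)
robust_ok = aligned_for_all([weak_actions, strong_actions])
```

The same objective passes the first test for the weak robot and fails it for the strong one, so it fails the second, capability-independent test.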

[-]evhub5y*Ω220

Maybe the best thing to use here is just the same definition as I gave for outer alignment—I'll change it to reference that instead.

[-]Joe Carlsmith5yΩ130

Aren't they now defined in terms of each other?

"Intent alignment: An agent is intent aligned if its behavioral objective is outer aligned.

Outer alignment: An objective function is outer aligned if all models that perform optimally on $r$ in the limit of perfect training and infinite data are intent aligned."

[-]evhub5yΩ220

Good point—and I think that the reference to intent alignment is an important part of outer alignment, so I don't want to change that definition. I further tweaked the intent alignment definition a bit to just reference optimal policies rather than outer alignment.

[-]Joe Carlsmith5y50

Cool (though FWIW, if you're going to lean on the notion of policies being aligned with humans, I'd be inclined to define that as well, in addition to defining what it is for agents to be aligned with humans. But maybe the implied definition is clear enough: I'm assuming you have in mind something like "a policy is aligned with humans if an agent implementing that policy is aligned with humans.").

Regardless, sounds like your definition is pretty similar to: "An agent is intent aligned if its behavioral objective is such that an arbitrarily powerful and competent agent pursuing this objective to arbitrary extremes wouldn't act in ways that humans judge bad"? If you see it as importantly different from this, I'd be curious.

[-]Vlad Mikulik5y*Ω340

Thanks for writing this.

I wish you included an entry for your definition of 'mesa-optimizer'. When you use the term, do you mean the definition from the paper* (an algorithm that's literally doing search using the mesa objective as the criterion), or do you speak more loosely (e.g., a mesa-optimizer is an optimizer in the same sense as a human is an optimizer)?

A related question is: how would you describe a policy that's a bag of heuristics which, when executed, systematically leads to interesting (low-entropy) low-base-objective states?

*incidentally, looking back on the paper, it doesn't look like we explicitly defined things this way, but it's strongly implied that that's the definition, and appears to be how the term is used on AF.

[-]evhub5yΩ230

Glad you liked it! I definitely mean mesa-optimizer to refer to something mechanistically implementing search. That being said, I'm not really sure whether humans count or not on that definition—I would probably say humans do count but are fairly non-central. In terms of the bag of heuristics model, I probably wouldn't count that, though it depends on what “bag of heuristics” means exactly—if the heuristics are being used to guide a planning process or something, then I would call that a mesa-optimizer.
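A rough sketch of the distinction, with a hypothetical objective (get the state to 7) and hypothetical policies: one mechanistically searches over actions using an explicit objective, the other is a fixed set of rules with no objective represented anywhere.

```python
def objective(state):
    """Hypothetical mesa-objective: be as close to 7 as possible."""
    return -abs(state - 7)

def search_policy(state, actions=(-1, 0, 1)):
    """Explicit search: scores each candidate action against the objective."""
    return max(actions, key=lambda a: objective(state + a))

def heuristic_policy(state):
    """Bag of heuristics: hard-coded rules, no objective and no search."""
    if state < 7:
        return 1
    if state > 7:
        return -1
    return 0

# On these states the two policies are behaviorally indistinguishable.
same_behavior = all(
    search_policy(s) == heuristic_policy(s) for s in range(15)
)
```

On the definition given above, only `search_policy` would count as a mesa-optimizer, even though both policies act identically here. The difference is mechanistic, not behavioral.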

[-]Edouard Harris5y*Ω440

Great post. Thanks for writing this — it feels quite clarifying. I'm finding the diagram especially helpful in resolving the sources of my confusion.

I believe everything here is consistent with the definitions I proposed recently in this post (though please do point out any inconsistencies if you see them!), with the exception of one point.

This may be a fundamental confusion on my part — but I don't see objective robustness, as defined here, as being a separate concept at all from inner alignment. The crucial point, I would argue, is that we ought to be treating the human who designed our agent as the base optimizer for the entire system.

Zooming in on the "inner alignment → objective robustness" part of the diagram, I think what's actually going on is something like:

A human AI researcher wishes to optimize for some base objective, $L$.
It would take too much work for our researcher to optimize for $L$ manually. So our researcher builds an agent to do the work instead, and sets $L$ to be the agent's loss function.
Depending on how it's built, the agent could end up optimizing for $L$, or it could end up optimizing for something different. The thing the agent ends up truly optimizing for is the agent's behavioral objective; let's call it $L'$. If $L'$ is aligned with $L$, then the agent satisfies objective robustness by the above definition: its behavioral objective is aligned with the base. So far, so good.

But here's the key point: from the point of view of the human researcher who built the agent, the agent is actually a mesa-optimizer, and the agent's "behavioral objective" is really just the mesa-objective of that mesa-optimizer.
And now, we've got an agent that wishes to optimize for some mesa-objective $L'$. (Its "behavioral objective" by the above definition.)
And then our agent builds a sub-agent to do the work instead, and sets $L'$ to be the sub-agent's loss function.
I'm sure you can see where I'm going with this by now, but the sub-agent the agent builds will have its own objective $L''$ which may or may not be aligned with $L'$, which may or may not in turn be aligned with $L$. From the point of view of the agent, that sub-agent is a mesa-optimizer. But from the point of view of the researcher, it's actually a "mesa-mesa-optimizer".

That is to say, I think there are three levels of optimizers being invoked implicitly here, not just two. Through that lens, "intent alignment", as defined here, is what I'd call "inner alignment between the researcher and the agent"; and "inner alignment", as defined here, is what I'd call "inner alignment between the agent and the mesa-optimizer it may give rise to".
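The three-level picture can be sketched with toy losses standing in for $L$, $L'$ and $L''$. The losses, the candidate set, and the operationalization of "aligned" (every outcome the lower level would pick is also optimal for the level above) are all assumptions made for illustration, not definitions from the post.

```python
CANDIDATES = range(-5, 6)  # hypothetical finite set of outcomes

def minimizers(loss):
    """Outcomes that are optimal for a given loss."""
    best = min(loss(x) for x in CANDIDATES)
    return {x for x in CANDIDATES if loss(x) == best}

def aligned(child_loss, parent_loss):
    """Assumed stand-in for 'aligned': the child's optima are all optima for the parent."""
    return minimizers(child_loss) <= minimizers(parent_loss)

def researcher_loss(x):   # plays the role of L
    return (x - 2) ** 2

def agent_loss(x):        # plays the role of L': a different function, same optimum
    return abs(x - 2)

def subagent_loss(x):     # plays the role of L'': optimum has drifted
    return (x - 3) ** 2

researcher_agent = aligned(agent_loss, researcher_loss)
agent_subagent = aligned(subagent_loss, agent_loss)
researcher_subagent = aligned(subagent_loss, researcher_loss)
```

Here the researcher-agent link holds while the agent-sub-agent link fails, which is one way of seeing that the two links are separate questions even if the same vocabulary is used for both.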

In other words, humans live in this hierarchy too, and we should analyze ourselves in the same terms — and using the same language — as we'd use to analyze any other optimizer. (I do, for what it's worth, make this point in my earlier post — though perhaps not clearly enough.)

Incidentally, this is one of the reasons I consider the concepts of inner alignment and mesa-optimization to be so compelling. When a conceptual tool we use to look inside our machines can be turned outward and aimed back at ourselves, that's a promising sign that it may be pointing to something fundamental.

A final caveat: there may well be a big conceptual piece that I'm missing here, or a deep confusion that I have around one or more of these concepts that I'm still unaware of. But I wanted to lay out my thinking as clearly as I could, to make it as easy as possible for folks to point out any mistakes — would enormously appreciate any corrections!

[-]evhub5yΩ220

I agree that what you're describing is a valid way of looking at what's going on—it's just not the way I think about it, since I find that it's not very helpful to think of a model as a subagent of gradient descent, as gradient descent really isn't itself an agent in a meaningful sense, nor do I think it can really be understood as “trying” to do anything in particular.

[-]Edouard Harris5yΩ330

Sure, makes sense! Though to be clear, I believe what I'm describing should apply to optimizers other than just gradient descent — including optimizers one might think of as reward-maximizing agents.

The point of talking about the “optimal policy for a behavioral objective” is to reference what an agent's behavior would look like if it never made any “mistakes.” Primarily, I mean this just in that intuitive sense, but we can also try to build a somewhat more rigorous picture if we imagine using perfect IRL in the limit of infinite data to recover a behavioral objective and then look at the optimal policy under that objective. ↩︎
What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters. That gets a bit tricky for reinforcement learning, though in that setting we can ask for the model to act according to the optimal policy on the actual MDP that it experiences. ↩︎
Note that robustness as a whole isn't included in the diagram as I thought it made it too messy. For an implication diagram with robustness instead of intent alignment, see the alternative diagram in the FAQ. ↩︎
See here for an example of this confusion regarding the more general vs. more specific uses of inner alignment. ↩︎
See here for an example of this confusion regarding deceptive alignment. ↩︎
