Categorizing failures as “outer” or “inner” misalignment is often confused

From the perspective of Reframing Inner Alignment, both scenarios are ambiguous because it's not clear whether

you really had a policy-scoring function that was well-defined by the expected value over the cognitive processes that humans use to evaluate pull requests under normal circumstances, but then imperfectly evaluated it by failing to sample outside normal circumstances, or
your policy-scoring "function" was actually stochastic and "defined" by the physical process of humans interacting with the AI's actions and clicking Merge buttons, and this policy-scoring function was incorrect, but adequately optimized for.

I tend to favor the latter interpretation—I'd say the policy-scoring function in both scenarios was ill-defined, and therefore both scenarios are more a Reward Specification (roughly outer alignment) problem. Only when you do have "programmatic design objectives, for which the appropriate counterfactuals are relatively clear, intuitive, and agreed upon" is the decomposition into Reward Specification and Adequate Policy Learning really useful.

[-]Rohin Shah3yΩ222

Yup, this is the objective-based categorization, and as you've noted it's ambiguous on the scenarios I mention because it depends on how you choose the "definition" of the design objective (aka policy-scoring function).

[-]Jon Garcia3y60

It seems to me that "inner" versus "outer" alignment has become a popular way of framing things largely because it has the appearance of breaking down a large problem into more manageable sub-problems. In other words, "If research group A can solve outer alignment, and research group B can solve inner alignment, then we can put them together into one big alignment solution!" Unfortunately, as you alluded to, reality does not cleanly divide along this joint. Even knowing all the details of an alignment failure might not be enough to classify it appropriately.

Of course, in general, if a semantic distinction fails to clarify the nature of the territory, then it should be rejected as a part of the map. Arguing over semantics is counterproductive, especially when the ground truth of the situation is already agreed upon.

That being said, I think that the process that came up with the distinction between inner and outer (mis)alignment is extremely useful. Just making an attempt to break down a large problem into smaller pieces gives the community more tools for thinking about it in ways that wouldn't have occurred to them otherwise. The breakdown you gave in this post is an excellent example. The solution to alignment probably won't be just one thing, but even if it is, it's unlikely that we will find it without slicing up the problem in as many ways as we can, sifting through perspectives and subproblems in search of promising leads. It may turn out to be useful for the alignment community to abandon the inner-outer distinction, but we shouldn't abandon the process of making such distinctions.

[-]JanB3yΩ443

The key problem here is that we don't know what rewards we “would have” provided in situations that did not occur during training. This requires us to choose some specific counterfactual, to define what “would have” happened. After we choose a counterfactual, we can then categorize a failure as outer or inner misalignment in a well-defined manner.

We often do know what rewards we "would have" provided. You can query the reward function, reward model, or human labellers. IMO, the key issue with the objective-based categorisation is a bit different: it's nonsensical to classify an alignment failure as inner/outer based on some value of the reward function in some situation that didn't appear during training, as that value has no influence on the final model.

In other words: Maybe we know what reward we "would have" provided in a situation that did not occur during training, or maybe we don't. Either way, this hypothetical reward has no causal influence on the final model, so it's silly to use this reward to categorise any alignment failures that show up in the final model.

[-]Rohin Shah3yΩ220

Yeah, that's another good reason to be skeptical of the objective-based categorization.

[-]plex3y40

I classified the first as Outer misalignment, and the second as Deceptive outer misalignment, before reading on.

I agree with

Another use of the terms “outer” and “inner” is to describe the situation in which an “outer” optimizer like gradient descent is used to find a learned model that is itself performing optimization (the “inner” optimizer). This usage seems fine to me.

being the worthwhile use of the term inner alignment as opposed to the ones you argue against, and could imagine that the term is being blurred and used in less helpful ways by many people. But I'd be wary of discouraging the inner vs outer alignment ontology too hard, as the internal optimizer failure mode feels like a key one and worthwhile having as a clear category within goal misgeneralization.

As for why so many researchers were tripped up, I imagine that the framing of the pop quiz would make a big difference. The first was obviously outer alignment and the question was inner vs outer, so priors for a non-trick quiz were that the second was inner. People don't like calling out high-status people as pulling trick questions on them without strong evidence (especially in front of a group of peers). There is also a vague story, one that doesn't hold up if you look too closely, about internal models not being displayed, which could lead a person to brush over the error.

[-]johnswentworth3yΩ340

Priors against Scenario 2. Another possibility is that given only the information in Scenario 1, people had strong priors against the story in Scenario 2, such that they could say “99% likely that it is outer misalignment” for Scenario 1, which gets rounded to “outer misalignment”, while still saying “inner misalignment” for Scenario 2.
I would guess this is not what’s going on. Given the information in Scenario 1, I’d expect most people would find Scenario 2 reasonably likely (i.e. they don’t have priors against it).

FWIW, this was basically my thinking on the two scenarios. Not 99% likelihood, but scenario 1 does strike me as ambiguous but much more likely to be an outer misalignment problem (in the root cause sense).

[-]Rohin Shah3yΩ560

Yeah, this makes sense given that you think of outer misalignment as failures of [reward function + training distribution], while inner misalignment is failures of optimization.

I'd be pretty surprised though if more than one person in my survey had that view.

[-]Thomas Kwa3yΩ120

A while ago you wanted a few posts on outer/inner alignment distilled. Is this post a clear explanation of the same concept in your view?

[-]johnswentworth3yΩ220

I don't think this post is aimed at the same concept(s).

[-]Steven Byrnes3y*Ω330

I think any training setup that calculates feedback / rewards / loss based purely on the AI’s external behavior, independent of its internal state, is almost definitely a terrible plan. If we are talking about such plans anyway, for pedagogical or other reasons, then I would defend “outer” and “inner” as a meaningful and useful distinction with definition and reasons here. I think both of your spoiler boxes are inner misalignment (by my definition at that link), because in neither case can we point to any particular moment during training where the AI’s actual external behavior was counter to the programmer’s intention yet the AI nevertheless got a reward for that behavior.

If we are talking about training setups that do NOT give feedback / rewards based purely on external behavior, but rather which also have some kind of access to the AI’s internal thoughts / motivations—which is absolutely what we should be talking about!—then yeah, the words “outer” and “inner” misalignment stop being meaningful.

(Good post, thanks for writing it.)

[-]Rohin Shah3yΩ440

If we are talking about such plans anyway, for pedagogical or other reasons, then I would defend “outer” and “inner” as a meaningful and useful distinction with definition and reasons here.

Yup, in the post, this is the generalization-based decomposition with "good feedback" defined as "rewards the right action and punishes the wrong action, irrespective of 'reasons'".

If we are talking about training setups that do NOT give feedback / rewards based purely on external behavior, [...] the words “outer” and “inner” misalignment stop being meaningful.

I think you can extend this to such training setups in a way where it is still meaningful, by defining "good feedback" as feedback that incentivizes "doing the right thing for the right reasons". (This is the sort of move that happens with ELK.)

My claim is more that these definitions aren't very useful as categorizations of failure scenarios, because most situations will be some complicated mix. In the generalization-based definition, the problem is "how many pieces of bad feedback do you have to give before it counts as outer misalignment rather than inner misalignment?" In the root-cause-based definition, the problem is "how do you identify the root cause?"
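A minimal sketch of the threshold problem in the generalization-based definition, assuming a failure has already occurred and each piece of training feedback has been labeled good or bad. The function name `categorize_by_bad_feedback` and the `threshold` parameter are hypothetical and not from the post; the point is only that the label depends on an arbitrary cutoff.

```python
from typing import Sequence


def categorize_by_bad_feedback(feedback_is_bad: Sequence[bool], threshold: int) -> str:
    """Label a failure 'outer' if at least `threshold` pieces of training
    feedback were bad, and 'inner' otherwise (hypothetical rule)."""
    bad_count = sum(feedback_is_bad)  # True counts as 1, False as 0
    return "outer" if bad_count >= threshold else "inner"


# A hypothetical training run: 1000 pieces of feedback, 3 of them bad.
run = [True] * 3 + [False] * 997
```

With `threshold=1` this run is labeled "outer"; with `threshold=5` the same run is labeled "inner", which is the ambiguity described above.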

[-]Ofer3yΩ120

Generalization-based. This categorization is based on the common distinction in machine learning between failures on the training distribution, and out of distribution failures. Specifically, we use the following process to categorize misalignment failures:

Was the feedback provided on the actual training data bad? If so, this is an instance of outer misalignment.

Did the learned program generalize poorly, leading to bad behavior, even though the feedback on the training data is good? If so, this is an instance of inner misalignment.
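Read as a decision procedure, the two questions above could be sketched roughly as follows. The names `Failure`, `Category`, and `categorize` are hypothetical, and both boolean judgments are assumed to be supplied by a human analyst rather than computed.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    OUTER = "outer misalignment"
    INNER = "inner misalignment"
    NEITHER = "not covered by either question"


@dataclass
class Failure:
    # Hypothetical analyst judgments about one misalignment failure.
    training_feedback_was_bad: bool
    generalized_poorly: bool


def categorize(failure: Failure) -> Category:
    """Ask the two questions in order; the first 'yes' decides the label."""
    if failure.training_feedback_was_bad:
        return Category.OUTER
    if failure.generalized_poorly:
        return Category.INNER
    # Good feedback and no poor generalization: neither question applies.
    return Category.NEITHER
```

Under this reading, a failure with good training feedback and no poor generalization gets no label from either question.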

This categorization is non-exhaustive. Suppose we create a superintelligence via a training process with good feedback signal and no distribution shift. Should we expect that no existential catastrophe will occur during this training process?

[-]Rohin Shah3yΩ220

You can extend the definition to online learning: choose some particular time and say that all the previous inputs on which you got gradients are the "training data" and the future inputs are the "test data".

In the situation you describe, you would want to identify the point at which the AI system starts executing on its plan to cause an existential catastrophe, set that as the specific point in time (so everything before it is "training" and everything after is "test"), and then apply the categorization as usual.
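A rough sketch of that split, assuming the inputs are available as timestamped records. The function `split_at` and the `(timestamp, input)` tuple layout are hypothetical choices for illustration.

```python
from typing import List, Tuple


def split_at(events: List[Tuple[float, str]], cutoff: float) -> Tuple[List[str], List[str]]:
    """Partition timestamped inputs around a chosen point in time.

    Inputs strictly before `cutoff` are treated as "training" data;
    inputs at or after `cutoff` are treated as "test" data.
    """
    training = [x for t, x in events if t < cutoff]
    test = [x for t, x in events if t >= cutoff]
    return training, test


# Hypothetical log of inputs; the cutoff would be the moment the AI system
# starts executing its plan.
events = [(0.0, "a"), (1.0, "b"), (2.0, "c")]
training, test = split_at(events, cutoff=1.0)
```

Once the split is fixed, the usual categorization can be applied to `training` and `test` as if they were an ordinary train/test pair.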

[-]Ofer3yΩ120

(Though even in that case it's not necessarily a generalization problem. Suppose every single "test" input happens to be identical to one that appeared in "training", and the feedback is always good.)

[-]Rohin Shah3yΩ220

It's still well-defined, though I agree that in this case the name is misleading. But this is a single specific edge case that I don't expect will actually happen, so I think I'm fine with that.

[-]Charlie Steiner3yΩ120

Nice! Thinking about "outer alignment maximalism" as one of these framings reveals that it's based on treating outer alignment as something like "a training process that's a genuinely best effort at getting the AI to learn to do good things and not bad things" (and so of course the pull-request AI fails because it's teaching the AI about pull requests, not about right and wrong).

Introspecting, this choice of definition seems to correspond with feeling a lack of confidence that we can get the pull-request AI to behave well - I'm sure it's a solvable technical problem, but in this mindset it seems like an extra-hard alignment problem because you basically have to teach the AI human values incidentally, rather than because it's your primary intent.

Which gets me thinking about what other framings correspond to what other intuitions about what parts of the problem are hard / what the future will look like.

[-]Noosphere893y*10

Re outer alignment: I conceptualize it as the case where we don't need to worry about optimizers generating new optimizers in a recursive sequence, we don't have to worry about mesaoptimizers, etc. Essentially it's the base case alignment scenario.

And a lot of X-risk worries only really work if inner misalignment happens. It's likely a harder problem to solve. If, as Janus suspects, self-supervised learners like GPT are solely simulators even at a superhuman level, and do not become agentic, then inner alignment problems never come up, which means that AI risk fears should deflate a lot.

[-]Jai3yΩ010

The second scenario omits the details about continuing to create and submit pull requests after takeover, instead just referring to human farms. Since it doesn't explicitly say that it's still optimizing for the original objective criteria and instead just refers to world domination, it appears to be inner misalignment (i.e. no longer aligned with the original optimizer). Did the original posing of this question specify that scenario 2 still maximizes pull requests after world domination?

[-]Rohin Shah3y20

The intent was that in scenario 2 the AI constructs the human farms in order to have the humans continually merging pull requests (i.e. yes, it still maximizes pull requests). I don't think I explicitly said this -- I probably gave the same text as in this post -- but I expect people to have made this inference (because (a) otherwise it's not clear why the AI would build human farms and (b) when I showed the inconsistency to the survey-takers, to my knowledge none of them said anything along the lines of "oh I was assuming that in scenario 2 the AI was no longer maximizing pull requests").

[-]Maxwell Clarke3yΩ110

This is a good post, and it definitely shows that these concepts are confused. In a sense both examples are failures of both inner and outer alignment:

Training the AI with reinforcement learning is a failure of outer alignment, because it does not provide enough information to fully specify the goal.
The model develops within the possibilities allowed by the under-specified goal, and has behaviours misaligned with the goal we intended.

Also, the choice to train the AI on pull requests at all is in a sense an outer alignment failure.


I am much less confident whether researchers are confused about outer and inner misalignment when given time to think. In the pop quiz above, people changed their minds pretty quickly upon further discussion.

LESSWRONG