There have been multiple posts discussing inner and outer alignment, and it seems like people are operating with slightly different definitions of these terms. This article aims to give a clear and general explanation of inner and outer alignment and to clarify some causes of confusion and disagreement.
Definitions
I will start with what I consider to be a reasonable definition of the inner and outer alignment problems:
Outer alignment problem: how do we design a test to determine whether or not an AI does what we want given a particular input?
Other possible phrasings:
- How do we design a good utility function?
- How do we design a good way to rank possible worlds according to our values?
Inner alignment problem: how do we ensure an AI always passes the test, without providing all possible inputs?
Other possible phrasings:
- Given a good utility function, how do we ensure an AI actually optimises for this utility function?
- Given a good way to rank possible worlds according to our values, how do we ensure the AI optimises for the best possible world in all scenarios?
In the limit of infinite data (and infinite time and compute), the inner alignment problem does not exist, because we can test every possible input/output pair for the AI. Of course this is impossible in practice, but I find this fact a useful tool for deciding whether something is an inner or outer alignment problem: I ask myself, if I could test with infinite data, would this problem still exist?
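As a toy illustration (the specification and the tiny input space here are invented): if the input space is small enough to enumerate, we can check every input, and any model that passes is guaranteed to do what we want everywhere, so there is nothing left for inner alignment to be about.

```python
# Toy sketch: with an enumerable input space, exhaustive testing is possible, so
# "passes the test" and "does what we want on every input" coincide and the inner
# alignment problem disappears. The spec here is arbitrary (even-parity detection).
from itertools import product

def spec(bits):
    return sum(bits) % 2 == 0  # what we want the model to compute

def exhaustively_verified(model, n_bits=8):
    # Check the model on every possible input, not just a training sample.
    return all(model(bits) == spec(bits) for bits in product([0, 1], repeat=n_bits))
```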
Why must we solve inner and outer alignment?
If we do not solve the inner alignment problem, the AI may behave in a way that passes the test given all inputs humans have provided to it, but does not pass the test consistently over all possible inputs. We can call this AI "deceptively aligned".
A simple example: say the AI takes inputs x, y and z, and our test is "does the AI output x + y + z?". Given 500 training examples, there is a whole family of functions the AI could implement, besides x + y + z, that would still pass all 500 tests. For example, it could implement "return x + y + z if all inputs are in the range 0 to 1000, else return 2".
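Here is a minimal sketch of that example (the training range and number of examples are just the ones used above):

```python
# Toy "deceptively aligned" model: it matches the intended behaviour on every
# training input, but not on all possible inputs.
import random

def intended(x, y, z):
    return x + y + z

def learned(x, y, z):
    # Agrees with the intended function only on the training distribution.
    if all(0 <= v <= 1000 for v in (x, y, z)):
        return x + y + z
    return 2

def passes_test(model, x, y, z):
    # The test: "does the AI output x + y + z?"
    return model(x, y, z) == intended(x, y, z)

training_data = [tuple(random.randint(0, 1000) for _ in range(3)) for _ in range(500)]
assert all(passes_test(learned, x, y, z) for x, y, z in training_data)  # passes all 500
assert not passes_test(learned, -1, 0, 0)                               # fails off-distribution
```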
If we do not solve the outer alignment problem, we fail to filter out AIs that do not, in general, do what we want.
An example of this: if our test is "does the AI's proposed intervention reduce CO2 emissions?", an intervention that involves killing humans could pass. We have misspecified the test and so failed at outer alignment.
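A toy sketch of that misspecification (the interventions and emission numbers are invented):

```python
# Toy misspecified (outer) test: it only checks CO2 emissions, so a catastrophic
# intervention passes it just as easily as a benign one. All values are invented.
def emissions_after(intervention):
    return {"subsidise renewables": 30.0, "eliminate all humans": 0.0}[intervention]

def passes_test(intervention, baseline=40.0):
    # The test we wrote: "does the intervention reduce CO2 emissions?"
    return emissions_after(intervention) < baseline

assert passes_test("subsidise renewables")
assert passes_test("eliminate all humans")  # passes the test, but is not what we want
```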
Things that cause confusion
There have been some debates about what counts as inner versus outer alignment.
One source of confusion is how we define "the test" or "the objective". Earlier, I wrote that "does the AI output x + y + z?" is a possible test, and that inner misalignment can occur if we only test with certain values of x, y and z. However, one could say that this is a misleading claim about what test we are actually using. I could rephrase the test I am actually using in this scenario as "does the AI output x + y + z given inputs drawn from a particular set?". Now, this test is obviously bad, because what we actually want is x + y + z for all inputs. And this looks like an outer alignment problem: the test itself is wrong; it is not a good proxy for what we actually want.
In "Inner Alignment Failures" Which Are Actually Outer Alignment Failures, John Wentworth mentions that our use of birth control is actually an outer misalignment problem with respect to the optimisation process of evolution because "reproductive fitness in the ancestral environment" is not aligned with "reproductive fitness in the modern environment". This is a move which encapsulates some of the inputs to the test into the test itself (a bit like Currying in functional programming). This implies that at least to some extent, whether something is an inner or outer alignment problem is a property of the map, not the territory. Many inner alignment problems can be transformed into outer alignment problems by including some of the inputs of the test into the test specification itself, and then noticing that the test is bad. In the evolution case, this would be changing what we call the objective from "reproductive fitness" to "reproductive fitness in the ancestral environment". It may be more useful to think of the inner / outer alignment dichotomy as a way to split the alignment problem given a particular training scenario as opposed to a formal rule that can be used to put conceptual alignment problems into boxes.
Another source of confusion is the use of optimisation and mesa-optimisation related terminology. The original Risks from Learned Optimization paper referred to AIs possessing mesa-objectives. The mesa-objective is what the AI is actually optimising for, while the training objective is what we want it to optimise for. One reason mesa-objectives could be a useful concept for thinking about inner alignment is that we have some reason to believe that powerful AIs will behave like optimisers. Optimisers optimise for some goal/utility function, which in the case of the AI we call the mesa-objective. The training process by which we obtain the AI is itself also an optimisation process, whose objective we refer to as the base objective. If we are able to detect a mesa-optimiser and show that its objective is not what we want it to be, then this is a clear way to show that the AI is inner-misaligned. However, a definition of inner misalignment that explicitly mentions mesa-objectives seems insufficiently general. Seeing an AI as an optimiser is a modelling choice, and in many cases it may not be easy to describe the AI's behaviour in terms of an optimisation process with a well-defined objective function.
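A crude sketch of the base-objective / mesa-objective distinction (everything here is a toy; real learned optimisers do not expose their objectives this explicitly):

```python
# Crude toy: the base optimiser searches over candidate models using the base
# objective evaluated on the training states; the selected model is itself an
# optimiser, searching over actions using its own internal (mesa-)objective.
# The two objectives can agree on training inputs while differing elsewhere.

def mesa_optimiser(mesa_objective):
    def model(state, actions):
        # The model's inner search: pick the action its mesa-objective ranks highest.
        return max(actions, key=lambda a: mesa_objective(state, a))
    return model

def base_optimiser(candidate_mesa_objectives, base_objective, training_states, actions):
    # The training process: pick the candidate whose behaviour scores best on the
    # base objective, over the training states only.
    def score(mesa_obj):
        model = mesa_optimiser(mesa_obj)
        return sum(base_objective(s, model(s, actions)) for s in training_states)
    return mesa_optimiser(max(candidate_mesa_objectives, key=score))
```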
In Risks from Learned Optimization: Introduction, Hubinger et al. write: "It might not be necessary to solve the inner alignment problem in order to produce safe, highly capable AI systems, as it might be possible to prevent mesa-optimizers from occurring in the first place." However, given a broader definition of inner misalignment, this is not true. Under a broader definition, given insufficient optimisation power and/or training data, the solution found by the base optimiser may still be bad (even given a good base objective), and I suggest that we still refer to this as inner misalignment.
Edit, based on discussion in comments and further research:
This brings to light two possible ways of looking at inner misalignment:
- Inner misalignment is misalignment-that-is-not-outer-misalignment. Equivalently, inner misalignment is misalignment that occurs when the training objective is correct, but the training setup / training data causes a generalisation error. By this definition, "generalisation error" and "inner misalignment" point to the same thing (although I would say that the former describes the cause, while the latter describes the effect).
- Inner misalignment is a form of generalisation error that occurs in cases when the AI itself is an optimiser (and so possesses a mesa-objective that can be misaligned with the base objective). This makes inner misalignment a subset of misalignment-that-is-not-outer-misalignment.
Currently, my crux for deciding which of these definitions is most useful is whether deciding that an AI is an optimiser, and finding its objective, is a well-defined procedure for powerful AIs.
Comments
I like all of this post except for the conclusion. I think this comment shows that the definition of inner alignment requires an explicit optimizer. Your broader definition of inner misalignment is equivalent to generalization error or robustness, which we already have a name for.
I currently see inner alignment problems as a superset of generalisation error and robustness. Furthermore, an AI being a mesa-optimiser with a misaligned objective can also be thought of as a generalisation error, seeing as it means we haven't tested the AI in scenarios where its mesa-objective behaves differently from the base objective. The conclusion is meant to emphasise the possibility of extending the concept of inner misalignment to AIs that we do not model as optimisers. I am open to the claim that this is not useful, and that we should only use the term when we think of the AI as an optimiser, in which case the definition involving mesa-objectives is sufficient.
What would you include as an inner alignment problem that isn't a generalization problem or robustness problem?
I think any inner alignment problem can be thought of as a kind of generalisation error ("this wouldn't have happened if we had more data"), including misaligned mesa-optimisers. So yes, you are correct: in my model they are different ways of looking at the same problem (in hindsight, "superset" was the wrong word to use). Is your opinion that inner misalignment should only be used in cases where a mesa-optimiser can be shown to exist (which is the original definition, and the one stated by the comment you linked)? I agree, that would also make sense. I was starting with the assumption that "that which is not outer misalignment should be inner misalignment", but I notice that Evan mentions problems that are neither (e.g. misgeneralisations when there are no mesa-optimisers). This way of defining things only works if you commit to seeing the AI as an optimiser, which is indeed a useful framing, but not the only one. However, based on your (and Evan's) comments I do see how having inner alignment as a subset of things-that-are-not-outer-alignment also works.
Hmm yeah, I like your edit; it breaks down the two definitions well. I definitely have a preference for the second one: I prefer confusing terms like this to have super specific definitions rather than broad vague ones, because it helps me think about whether a proposed solution is actually solving the problem being pointed to. I have, like you, seen people using inner (mis)alignment to refer to things outside the original strict definition, but as far as I know the comment I linked to is the most recent clarification of the definition? I haven't checked this. If there are more recent discussions involving the people who coined the term, I would defer to them.
Regarding the crux that you mention in the edit:
If you mean precisely mathematically well-defined, then I think this is too high a standard. I think it is sufficient that we be able to point toward archetypal examples of optimizing algorithms and say "stuff like that".
I think the main reason I care about this distinction is that generalization error without learned optimizers doesn't seem to be a huge problem, whereas "sufficiently powerful optimizing algorithms with imperfectly aligned goals" seems like a world-ending level of problem. Do you agree with this?
Firstly, yes, I agree that it makes a lot of sense to defer to Evan, who coined the term, and as far as we can both tell he meant the narrow definition. I had actually read that comment before and misremembered its content, so I was originally under the impression that Evan had revised the definition to be broader, but then realized this is not the case.
I am still skeptical that there is any clear difference between optimizer and non-optimizer AIs. Any AI that does a task well is in some sense optimizing for good performance on that task. This is what makes it hard for me to clearly see a case of generalization error that is not inner misalignment.
However, I can see how this can just be a framing thing where depending on how you look at the problem it’s easier to describe as “this AI has the wrong objective” vs “this AI has the correct objective but pursues it badly due to generalization error”. In any case, both of these also seem equally dangerous to me.
The problem with distinguishing these is that for a sufficiently complex training objective, even a very powerful agent-y AI will have a "fuzzy" goal that isn't an exact specification of what it should do (for example, humans don't have clearly defined objectives that they consistently pursue). This fuzzy goal is like a cluster of possible worlds towards which the AI is causing our current world to tend, via its actions/outputs. Pursuing the goal badly means having an overly fuzzy goal, where some of the possible convergent worlds are not what we want. Inner misalignment, or having the wrong goal, will also look very similar, although perhaps a distinction you could make is that with inner misalignment the fuzzy goal has to be in some sense miscentered.
Recently I've seen a bunch of high status people using "inner alignment" in the more general sense, so I'm starting to think it might be too late to stick to the narrow definition. E.g. this post.
I disagree with this. To me there are two distinct approaches: one is to memorize which actions did well in similar training situations, and the other is to predict the consequences of each action and somehow rank each consequence.
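Roughly what I mean, in toy code (everything here is invented for illustration; neither function is meant to describe any particular system):

```python
# Toy contrast between the two approaches mentioned above.

def memorised_policy(state, actions, action_values):
    # Approach 1: look up which action did well in similar past situations.
    return max(actions, key=lambda a: action_values.get((state, a), 0.0))

def consequence_ranking_policy(state, actions, predict_next_state, rank_outcome):
    # Approach 2: predict each action's consequence, then rank the consequences.
    return max(actions, key=lambda a: rank_outcome(predict_next_state(state, a)))
```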
I disagree with this, but I can't put it into clear words. I'll think more about it. It doesn't seem true for model-based RL, unless we explicitly build in uncertainty over goals. I think it's only true for humans for value-loading-from-culture reasons.