In the alignment community, there seem to be two main ways to frame and define objective robustness and inner alignment. They are quite similar, differing mainly in how they approach the same underlying problem. We’ll call these the objective-focused approach and the generalization-focused approach. We don’t delve into these framing issues in Empirical Observations of Objective Robustness Failures, which is devoted to the empirical results themselves; instead, we think the matter deserves a separate discussion. These issues have so far come up only in a handful of comments on the Alignment Forum, so it seemed worthwhile to write a post describing the two framings and their differences in an effort to promote further discussion in the community.
This post compares two different paradigmatic approaches to objective robustness/inner alignment:
Objective-focused approach
- Emphasis: “How do we ensure our models/agents have the right (mesa-)objectives?”
- Outer alignment: “an objective function is outer aligned if all models that perform optimally on [it] in the limit of perfect training and infinite data are intent aligned.”
- Outer alignment is a property of the training objective.
Generalization-focused approach
- Emphasis: “How will this model/agent generalize out-of-distribution?”
- Considering a model’s “objectives” or “goals,” whether behavioral or internal, is instrumentally useful for predicting OOD behavior, but what you ultimately care about is whether it generalizes “acceptably.”
- Outer alignment: a model is outer aligned if it performs desirably on the training distribution.
- Outer alignment is a property of the tuple (training objective, training data, training setup, model).
Special thanks to Rohin Shah, Evan Hubinger, Edouard Harris, Adam Shimi, and Adam Gleave for their helpful feedback on drafts of this post.
- Outer alignment: an objective function is outer aligned if all models that perform optimally on it in the limit of perfect training and infinite data are intent aligned.
- Objective robustness: an agent is objective robust if its behavioral objective is aligned with the base objective it was trained under.
- Behavioral objective: the behavioral objective is what a model appears to be optimizing for. Formally, the behavioral objective is the objective recovered from perfect inverse reinforcement learning.
- Inner alignment: a mesa-optimizer is inner aligned if its mesa-objective is aligned with the base objective it was trained under.
- (This is a special case of objective robustness because a mesa-optimizer’s behavioral objective should be its mesa-objective.)
- Capability robustness: An agent is capability robust if it performs well on its behavioral objective even in deployment/OOD.
Essentially, this framing factors alignment into two problems:
- How do we specify an outer aligned objective (by the above definition of outer aligned)?
- How do we ensure our models actually pursue that objective, even out of distribution? (How do we ensure that the model’s behavioral objective, or mesa-objective in the case of mesa-optimization, is aligned with the outer aligned base objective?)
We call this the objective-focused approach because of its emphasis on a model’s “objectives” in identifying the problems that remain when outer alignment is solved. This focus probably derives from long-standing worries about the goals of intelligent agents; many of the original cases for catastrophic risk from powerful AI were essentially that it seems dangerous to build very intelligent agents that have goals (or utility functions, etc.) that diverge from our own. With this worry in mind, the natural next question was “how do we get our AIs to have the right goals?” The above factorization aims to answer this question by 1) specifying the “right goals” and 2) making sure these actually become the model’s own.
This approach has a few limitations. First, there is no clear dividing line between capability and objective robustness as defined; at least, they are not as orthogonal as suggested in "2D Robustness." Because the behavioral objective is the objective recovered from perfect inverse reinforcement learning (IRL), every model has a behavioral objective. If “perfect” IRL does not correct for biases (that is, if it assumes that the model’s behavior is optimal with respect to the behavioral objective), the recovered behavioral objective will likely just be a different encoding of the policy. Risks from Learned Optimization (RFLO) acknowledges this:
We distinguish the mesa-objective from a related notion that we term the behavioral objective. Informally, the behavioral objective is the objective which appears to be optimized by the system’s behavior. More formally, we can operationalize the behavioral objective as the objective recovered from perfect inverse reinforcement learning (IRL). This is in contrast to the mesa-objective, which is the objective actively being used by the mesa-optimizer in its optimization algorithm.
Arguably, any possible system has a behavioral objective—including bricks and bottle caps. However, for non-optimizers, the appropriate behavioral objective might just be “1 if the actions taken are those that are in fact taken by the system and 0 otherwise” and it is thus neither interesting nor useful to know that the system is acting to optimize this objective. For example, the behavioral objective “optimized” by a bottle cap is the objective of behaving like a bottle cap.
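The degenerate objective described in the quoted passage can be made concrete with a small sketch. This is an illustration under simplifying assumptions, not anything from RFLO: the three states, three actions, and the one-step scoring (no environment dynamics) are all hypothetical. It builds the objective “1 if the action is the one the system in fact takes, 0 otherwise” for an arbitrary policy and checks that the policy is optimal for it.

```python
from itertools import product

STATES = ["s0", "s1", "s2"]          # hypothetical toy states
ACTIONS = ["left", "right", "stay"]  # hypothetical toy actions


def indicator_objective(policy):
    """Build the degenerate objective: reward 1 iff the action is the one `policy` takes."""
    def reward(state, action):
        return 1 if policy[state] == action else 0
    return reward


def total_reward(policy, reward):
    """Score a policy by summing its reward over every state (one step each, no dynamics)."""
    return sum(reward(state, policy[state]) for state in STATES)


def all_policies():
    """Enumerate every deterministic policy over the toy state and action sets."""
    for choice in product(ACTIONS, repeat=len(STATES)):
        yield dict(zip(STATES, choice))


# An arbitrary, goal-free system: it does the same thing everywhere.
bottle_cap = {"s0": "stay", "s1": "stay", "s2": "stay"}
bottle_cap_objective = indicator_objective(bottle_cap)

# Search all policies for the one that scores best on the bottle cap's own objective.
best_policy = max(all_policies(), key=lambda p: total_reward(p, bottle_cap_objective))
num_optimal = sum(
    total_reward(p, bottle_cap_objective) == len(STATES) for p in all_policies()
)

print(best_policy == bottle_cap)  # is the system optimal for "its own" objective?
print(num_optimal)                # how many policies achieve the maximum score?
```

Because the objective is constructed from the policy itself, the same check passes for any policy substituted for `bottle_cap`, which is why knowing that a system optimizes such an objective tells us nothing beyond the policy.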
It therefore appears that no agent could fail to be capability robust with respect to its own behavioral objective: since the behavioral objective is recovered with perfect IRL over infinite data, the model should always perform well on its behavioral objective, even outside the training distribution (that is, it is capability robust by construction). Taking this one step further, knowing the behavioral objective in the limit seems to obviate the need to discuss alignment in the first place: if we know what the model would do in every situation, we already know whether it’s safe. On the other hand, if perfect IRL were to correct for biases, the recovered behavioral objective would be closer to “what the model is actually ‘trying’ to do” than “whatever the model actually does,” and the model could fail to be capability robust with respect to its own behavioral objective. However, it is very unclear what should count as a bias (as opposed to, for example, a strange and/or particular preference).
A notion of a behavioral objective that relied only on a finite number of observations (e.g. what the model appears to be optimizing for on the training data) would be just as problematic. It would be underspecified, since many possible objectives fit the behavior observed so far, and it would therefore be unhelpful in predicting the model’s behavior under future distributional shift.
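A toy sketch of this underspecification, under loudly stated assumptions: the one-dimensional corridor, the `Level` tuple, and the two candidate objectives below are hypothetical constructions for illustration, not the environments or code used in our experiments. Both candidate objectives explain the observed training behavior equally well, yet they predict opposite behavior once the coin is moved.

```python
from collections import namedtuple

# A hypothetical 1-D corridor level: cells 0..width-1, an agent start cell, a coin cell.
Level = namedtuple("Level", ["width", "start", "coin"])


def rollout(level, direction):
    """Walk in a fixed direction (+1 or -1) until hitting a wall; return the visited cells."""
    pos = level.start
    visited = [pos]
    while 0 <= pos + direction < level.width:
        pos += direction
        visited.append(pos)
    return visited


def reach_coin(level, visited):
    """Candidate objective A: 1 if the agent ever stood on the coin cell."""
    return 1.0 if level.coin in visited else 0.0


def reach_right_wall(level, visited):
    """Candidate objective B: 1 if the agent finished at the right wall."""
    return 1.0 if visited[-1] == level.width - 1 else 0.0


CANDIDATES = {"reach the coin": reach_coin, "reach the right wall": reach_right_wall}


def predicted_direction(objective, level):
    """The direction an optimizer of `objective` would choose on `level`."""
    return max((+1, -1), key=lambda d: objective(level, rollout(level, d)))


# Training levels: the coin always sits at the right wall,
# and the agent was observed moving right (+1) on every one of them.
train_levels = [Level(5, 2, 4), Level(7, 1, 6), Level(6, 3, 5)]
observed = {level: +1 for level in train_levels}

# Which candidate objectives are consistent with everything observed so far?
fits_training = {
    name: all(predicted_direction(obj, lvl) == observed[lvl] for lvl in train_levels)
    for name, obj in CANDIDATES.items()
}

# Distributional shift: the coin is now to the left of the start cell.
shifted = Level(7, 3, 0)
ood_predictions = {
    name: predicted_direction(obj, shifted) for name, obj in CANDIDATES.items()
}

print(fits_training)
print(ood_predictions)
```

With only the training observations in hand, nothing distinguishes the two candidates, so a behavioral objective fitted to finite data cannot say which way the agent will go on the shifted level.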
Another potential limitation of this approach is that powerful AI systems might not be well-described by relatively simple behavioral objectives; they might instead act in a way that optimizes a complex mix of heuristics that performed well on the training data. In this case, reasoning about a system’s “objectives” would be largely unhelpful in predicting its behavior out-of-distribution. Although humans intuitively think in terms of agency and goals when reasoning about what an intelligent system will do (in other words, apply the intentional stance), it is possible that such a paradigm will not apply well to powerful AI systems.
Having said all of this about the behavioral objective, however, the objective-focused approach is probably still a useful framing to the extent one thinks that our models will learn to plan and act flexibly according to internally-represented objectives/goals and that mechanistically understanding the way they implement this behavior is possible. This is probably the case if mesa-optimizers are likely to arise, but it may also hold if we can understand how a model’s goals are "formulated in terms of... concepts it possesses," perhaps with advanced interpretability tools, and reason about how the concepts to which the goals refer will generalize under distributional shift.
All of the aforementioned problems with a purely behavioral conception of a model’s objective imply that the overall robustness problem cannot be so cleanly subdivided into capability and objective robustness, at least without understanding how a model’s internal objectives are structurally implemented. It is probably more accurate to say that robustness failures can be put on a spectrum ranging from cases where the model’s capabilities fail to generalize to those where its “objective” fails to generalize. This suggests a different factorization of the alignment problem, with a slightly different notion of “outer alignment”:
This is the approach suggested by Rohin Shah, highlighted in his presentation "Generalization > Utility" (slides) but mentioned in various comments since the publication of RFLO. In this framing, “outer alignment” (as a property of the base objective) is not defined in terms of whether the model that is optimal with respect to the objective in the limit of infinite data and perfect training is aligned; instead, an objective function is outer aligned if it incentivizes or produces the behavior we actually want on the training distribution. This conception of outer alignment is probably better suited to the task of actually trying to build aligned models; in practice, we only get to check whether the models we train performed acceptably on the training distribution. It seems much more difficult to reason about a given base objective over every possible situation, and besides, how the base objective would score behavior outside the training distribution has no influence on the model that gets produced. Even a training objective that would appropriately score performance in every possible situation cannot fully determine the behavior of a model beyond the examples it was trained on; from the point of view of the model, the training signal is underspecified over inputs from distributions other than the one it was trained on. Robustness, then, is about how a model trained with an outer aligned objective will generalize upon deployment/under distributional shift. In other words, the two problems here are:
- How do we get the behavior we want on the training distribution?
- How do we ensure the model generalizes acceptably out of distribution (never performs catastrophically on any input)?
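The two questions above have different logical shapes: the first is about typical performance on the training distribution, the second about the worst case over all inputs. A minimal sketch of that difference follows. Everything in it is a hypothetical stand-in (the integer inputs, the intended answer `abs(x)`, the rule that negative outputs are “catastrophic,” and the three toy models); it is not a proposal for how to measure either property in practice.

```python
from statistics import mean

TRAIN_INPUTS = range(0, 10)   # "normal situations": non-negative integers only
ALL_INPUTS = range(-10, 10)   # deployment may also include negative integers


def score(x, y):
    """Training signal: 1 when the output equals the intended answer |x|, else 0."""
    return 1.0 if y == abs(x) else 0.0


def is_acceptable(x, y):
    """Toy acceptability check: any negative output counts as catastrophic."""
    return y >= 0


def training_performance(model):
    """Average score on the training distribution (first question)."""
    return mean(score(x, model(x)) for x in TRAIN_INPUTS)


def acceptable_everywhere(model):
    """Worst-case check over every input, including off-distribution ones (second question)."""
    return all(is_acceptable(x, model(x)) for x in ALL_INPUTS)


MODELS = {
    "identity": lambda x: x,      # perfect on training inputs, catastrophic off-distribution
    "absolute value": abs,        # perfect on training inputs, acceptable everywhere
    "always zero": lambda x: 0,   # poor on training inputs, but never catastrophic
}

report = {
    name: (training_performance(model), acceptable_everywhere(model))
    for name, model in MODELS.items()
}

for name, (performance, acceptable) in report.items():
    print(f"{name}: training performance={performance:.1f}, acceptable everywhere={acceptable}")
```

The "identity" and "absolute value" models are indistinguishable on the training inputs, so the training signal alone cannot favor one over the other; only the second check separates them.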
This approach’s emphasis on generalization means it does not explicitly rely on a notion of the model’s objective or goals in subdividing the overall alignment problem. However, conceiving of robustness failures on a spectrum ranging from those where the model fails to generalize capably to those where it generalizes capably but in ways that are no longer aligned with what we want is still important for solving the problem. The aforementioned technical fuzziness between capability and objective robustness remains, but the two are qualitatively distinct enough for this idea of a spectrum to be meaningful.
Indeed, this is why “objective robustness” and “capability robustness” remain children of the robustness node in the diagram; even though the boundary between the two remains ill-defined, it is necessary to categorize robustness failures by whether the model generalizes incapably or generalizes competently while pursuing now-misaligned “objectives” (in the behavioral/intentional sense). Ultimately, the robustness failures we care most about are those where the model generalizes competently under distributional shift but in ways that are no longer aligned with what we want. Robustness failures where the model becomes inept could only produce risks of accidents, but capable models pursuing misaligned objectives could in principle leverage their capabilities to visit arbitrarily bad states, and at the extreme, deceptive models would be extremely dangerous. This is why “inner alignment” (broadly construed) is so important in the first place. Indeed, Shah’s threat model (slides) is exactly this kind of “bad generalization.” This is a qualitatively different kind of robustness failure from the kind usually discussed within the machine learning community, even if relying on a purely behavioral notion of a model’s “objective” means that it cannot be cleanly separated from the usual failure mode in a technical fashion. For example, although our results could be read as showing that CoinRun models failed to learn the general capability of pursuing the coin, the more natural interpretation is that the model has learned a robust ability to avoid obstacles and navigate the levels, but the objective it learned is something like “get to the end of the level,” instead of “go to the coin.”
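The distinction between the two kinds of failure can be sketched as a crude diagnostic that evaluates an agent out of distribution on both the intended objective and a plausible proxy. This is a toy under explicit assumptions: the corridor, the three hand-written policies, the proxy “reached the end,” and the 0.5 threshold are all hypothetical, and none of this is the CoinRun environment or data from our experiments. The hard threshold is exactly the kind of sharp boundary that, as argued above, does not exist in general.

```python
from statistics import mean

WIDTH = 7  # hypothetical corridor with cells 0..6


def run_episode(policy, start, coin):
    """Roll out `policy` until it stops moving (or a step limit); return the visited cells."""
    pos, visited = start, [start]
    for _ in range(WIDTH):
        nxt = min(max(pos + policy(pos, coin), 0), WIDTH - 1)
        if nxt == pos:
            break
        pos = nxt
        visited.append(pos)
    return visited


def got_coin(visited, coin):
    """Intended objective: did the agent ever stand on the coin cell?"""
    return 1.0 if coin in visited else 0.0


def reached_end(visited, coin):
    """Proxy objective: did the agent finish at the right-hand end of the level?"""
    return 1.0 if visited[-1] == WIDTH - 1 else 0.0


AGENTS = {
    "seek_coin": lambda pos, coin: (coin > pos) - (coin < pos),  # steps toward the coin
    "go_right": lambda pos, coin: 1,                             # heads for the end of the level
    "frozen": lambda pos, coin: 0,                               # never moves
}


def diagnose(policy, levels, threshold=0.5):
    """Return (intended score, proxy score, coarse label) for a policy on the given levels."""
    episodes = [(run_episode(policy, start, coin), coin) for start, coin in levels]
    intended = mean(got_coin(visited, coin) for visited, coin in episodes)
    proxy = mean(reached_end(visited, coin) for visited, coin in episodes)
    if intended >= threshold:
        label = "generalizes acceptably"
    elif proxy >= threshold:
        label = "objective robustness failure"   # competent, but at the wrong thing
    else:
        label = "capability robustness failure"  # not competent at either objective
    return intended, proxy, label


# Shifted levels as (start, coin) pairs: the coin now lies to the left of the start cell.
SHIFTED_LEVELS = [(3, 0), (4, 1), (5, 2)]
diagnoses = {name: diagnose(policy, SHIFTED_LEVELS) for name, policy in AGENTS.items()}

for name, (intended, proxy, label) in diagnoses.items():
    print(f"{name}: intended={intended:.1f}, proxy={proxy:.1f} -> {label}")
```

The label depends entirely on which proxy we thought to test: an agent competently pursuing some objective not on our list would be filed as a capability failure here, which is one way the technical fuzziness between the two categories shows up.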
Understanding a model’s objectives or goals, if it has them, is thus instrumentally useful for reasoning about out-of-distribution generalization, but generalization under distributional shift is what is of primary importance. The generalization-focused framing includes as subcases instances where models misgeneralize out of distribution because they have misaligned goals that are only revealed under the distributional shift (e.g. deceptive models), but it also includes cases where, for example, models execute complicated behavioral heuristics that worked well on the training distribution and will generalize capably, just no longer in the way we want. In both examples, the problem is to prevent catastrophic behavior in the worst case.
Terminology in our work
In either of the above framings of the matter, our work empirically demonstrates “objective robustness” failures in modern reinforcement learning agents, as clearly as they can be distinguished from “ordinary” robustness failures where a model’s capabilities fail to generalize. We will use “objective robustness” throughout this work to refer to the property of interest because we feel the most straightforward interpretation of our results is that our agents have learned general enough capabilities that they can use them coherently out-of-distribution in ways that are no longer aligned with the training objective. Additionally, we know that many people will strongly associate “inner (mis)alignment” with the special case of mesa-optimization, and we wish to avoid any confusion over whether this is the first empirical demonstration of mesa-optimization (it is not).
However, we do not feel completely satisfied with this terminology and want to use this as an opportunity to reopen discussion about the terms and definitions we want to settle on as a community when discussing these issues. Standardized terms and concepts would facilitate further technical work and improve communication within the alignment community. Beyond that, being as clear and coherent as possible about how we frame and discuss inner alignment will help to bridge gaps in understanding between this community and the broader machine learning community, which is already well aware of robustness failures of the usual kind. We hope that this work can serve as a jumping-off point for these renewed discussions.
Footnotes
On the other hand, if capability robustness were considered with respect to the base objective, then every failure of capability robustness would also be a failure of objective robustness: trivially, an incapable model will have a behavioral objective that is different from the base objective. In either case, with the behavioral objective so defined, the entire “robustness” problem seems to technically collapse to just the “objective robustness” problem.
In the sense of mesa-optimization originally intended by RFLO: learned mechanistic search/optimization for an internally-but-explicitly-represented mesa-objective.
N.B. This diagram also comes from "Clarifying inner alignment terminology." The "intent alignment" factorization highlighted in the last section can be refactored into this robustness-centric version with the terms defined the same way, and where a model is “robust” “if it performs well on the base objective it was trained on even in deployment/out-of-distribution.” However, the factorization of the generalization-focused approach discussed here is not quite equivalent, despite sharing the same terms and overall structure. It has different notions of both “outer alignment” and “robustness”: a base objective is “outer aligned” if it incentivizes the behavior we want on the training distribution (not every possible situation), and a model is “robust” if it “generalizes acceptably” (not “performs well on the base objective out-of-distribution,” since we no longer expect an “outer aligned” base objective to capture what we want outside of the training distribution). The objective-focused approach still emphasizes solving robustness by trying to ensure that models robustly pursue aligned objectives with competence OOD. It is therefore arguably more naturally suited to the “intent alignment” factorization, since in either case a solution under its definitions requires a model to robustly “try” to do the right thing.
Rohin calls these "strong performance in normal situations" and "acceptable behavior in all situations."
The arrows in this diagram can be interpreted in the same way as those in Evan’s original diagrams, where a problem is solved if its direct subproblems are solved. However, this is not intended to suggest that the two should be approached as entirely separate problems under the generalization-focused approach, as the boundary between them is fuzzy.
After all, ProcGen was designed to test (capability) generalization in deep RL.
Or, if we assume the generalization-focused approach, where “inner (mis)alignment” isn’t reserved for the specific case of mesa-optimization as defined in RFLO, we could use the term to refer to cases where the model's capabilities generalize but its "objective" does not, in which case this work empirically demonstrates inner misalignment.