Understanding the outer alignment problem
What really is outer alignment? In “Risks from Learned Optimization,” we defined outer alignment in the context of machine learning as “aligning the specified loss function with the intended goal.” But that's not a perfectly well-defined statement: what does it mean for a loss function to be “aligned” with the intended goal? If the goal we care about is maximizing $U$, do we need exactly $L = -aU + b$ for constants $a > 0$ and $b$? That's a pretty high bar.
Well, what exactly do we want outer alignment for? At the end of the day, we care about whether the model that pops out the other end of our training procedure will be safe, which is a complicated question involving the loss function, the architecture, the implicit inductive biases, and so on. In what sense, then, is it even reasonable to look at just the loss function in isolation and ask whether it's aligned or not?
I think the strongest case for outer alignment being a meaningful problem in isolation comes from the argument that loss functions seem to scale pretty well with generic machine learning progress. If, as a silly example, your outer alignment scheme is to “train image classification models,” that's something that ML has progressively gotten better at over time. Compare that to the silly inner alignment scheme of “train a straightforward CNN”—that's something that ML has passed by pretty rapidly in favor of architectural improvements like residual connections even just for the task of image classification. Of course, outer alignment alone does not an aligned AGI make, so you still have to have some notion of how you're going to do inner alignment in mind—but loss functions scaling better is still a perfectly valid reason for focusing on outer alignment.
Thus, it does seem quite reasonable to me to put effort into finding “aligned” loss functions. But that still brings us back to the question of what exactly makes a loss function “aligned.” In the context of a specific training/inner alignment scheme, we can say that a loss function is aligned if, when plugged into that training scheme, it produces models which are aligned with our goals. But in the absence of any specific training scheme, what does it mean to say that a loss function is aligned in isolation? We can of course ask for the loss to be exactly a negated affine function of our goal $U$, that is $L = -aU + b$ with $a > 0$, as I stated previously, though I think achieving something like that is likely to be nearly impossible.
Outer alignment at optimum
I think there is another version of “outer aligned in isolation,” however, which is both meaningful and (at least somewhat) achievable, which I will call outer aligned at optimum. Intuitively, I will say that a loss function is outer aligned at optimum if all the possible models that perform optimally according to that loss function are aligned with our goals; that is, they are at least trying to do what we want. More precisely, let $\mathcal{M} = \mathcal{X} \to \mathcal{Y}$ be the set of models and $\mathcal{L} = \mathcal{M} \to \mathbb{R}$ the set of loss functions. For a given loss function $L \in \mathcal{L}$, let $\ell_* = \min_{M \in \mathcal{M}} L(M)$. Then, $L$ is outer aligned at optimum if, for all $M_* \in \mathcal{M}$ such that $L(M_*) = \ell_*$, $M_*$ is trying to do what we want.
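To make the shape of this definition concrete, here is a minimal Python sketch over a tiny finite model space. Everything in it is an assumption made for illustration: the three-input, two-output setting, the `intended` behavior, and especially the `is_aligned` predicate, which is only a stand-in since "trying to do what we want" is not something we can actually write down as a function.

```python
from itertools import product
from typing import Callable, List, Tuple

Model = Tuple[int, ...]  # a model is represented by its output on each input, in order

def all_models(num_inputs: int, outputs: List[int]) -> List[Model]:
    """Enumerate every function from a finite input set to a finite output set."""
    return list(product(outputs, repeat=num_inputs))

def optimal_models(models: List[Model], loss: Callable[[Model], float]) -> List[Model]:
    """Return every model that attains the minimum loss over the model space."""
    best = min(loss(m) for m in models)
    return [m for m in models if loss(m) == best]

def outer_aligned_at_optimum(
    models: List[Model],
    loss: Callable[[Model], float],
    is_aligned: Callable[[Model], bool],
) -> bool:
    """True if ALL loss-minimizing models satisfy the alignment predicate."""
    return all(is_aligned(m) for m in optimal_models(models, loss))

# Hypothetical setup: three inputs, binary outputs, one intended behavior.
intended: Model = (1, 0, 1)
models = all_models(num_inputs=3, outputs=[0, 1])

def is_aligned(m: Model) -> bool:
    # Stand-in predicate; real alignment is not a simple equality check.
    return m == intended

def full_loss(m: Model) -> float:
    # Penalizes a mismatch on every input.
    return sum(a != b for a, b in zip(m, intended))

def partial_loss(m: Model) -> float:
    # Ignores the last input, so its optimum is underdetermined.
    return sum(a != b for a, b in zip(m[:2], intended[:2]))

full_ok = outer_aligned_at_optimum(models, full_loss, is_aligned)
partial_ok = outer_aligned_at_optimum(models, partial_loss, is_aligned)
```

The point of the second loss is that a single unaligned minimizer is enough to fail the definition, since the quantifier ranges over every optimal model rather than over a typical one.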
That's the definition. Now why should we care? In basically any practical setting we're never going to reach perfect loss, so why should it matter if those functions which do have perfect loss are aligned or not? I think there is a strong argument for loss functions which are aligned at optimum being significantly less susceptible to Goodhart's Law as we scale up ML capabilities. Suppose you know that a loss function $L$ is aligned for current ML capabilities. When you then scale up those capabilities and push harder on minimizing $L$, you immediately run into all the issues of Goodhart's Law, where $L$ can quickly cease to be a good proxy for alignment as you push harder on it. If you have a guarantee that $L$ is aligned at optimum, however, then, while still quite possible, it's a lot harder for Goodhart's Law to bite you. In particular, if you think about the Goodhart taxonomy, alignment at optimum almost entirely rules out both Causal and Extremal Goodhart (since you know the relationship is valid and doesn't break down at the extremes) and ensures that Regressional and Adversarial Goodhart won't show up in the limit, though you could still see them before that point. This obviously doesn't just give you an alignment guarantee: before you get to the true optimum, you can still get Regressional Goodhart biting you through proxy alignment or Adversarial Goodhart biting you through deceptive alignment, for example. I think it is nevertheless still a very nice thing to have.
The case for imitative amplification
With all of that being said, I can get to the reason that I want to talk about all of this: I think that specifically what I will call imitative amplification—in contrast to other amplification-based approaches or debate-based approaches—has a strong claim to being outer aligned at optimum. Specifically, when I say imitative amplification, I mean the class of training procedures which are attempting to produce models which approximate HCH as closely as possible. As a concrete example, consider the scheme where you train a model to minimize the difference between its output and the output of a human consulting that model. I want to contrast this with approval-based amplification, by which I mean the class of training procedures where the loss is generated using an approval signal from an amplified overseer. As a concrete example, consider the scheme where you train a model to maximize the extent to which a human consulting that model would approve of that model's output.
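As a rough illustration of the imitative scheme, the following Python sketch trains a tabular "model" to match a toy human who consults that model. All of it is hypothetical: the task (summing a tuple of integers), the `human_consulting` procedure (a human who can read single numbers and add two sub-answers, delegating longer halves to the model), and the idealized `distill` step that copies the amplified human exactly. It is meant only to show the fixed-point structure, not how a real system would be trained.

```python
from typing import Dict, Iterable, Set, Tuple

Question = Tuple[int, ...]   # toy question: "what is the sum of these integers?"
Model = Dict[Question, int]  # tabular model; unseen questions are answered with 0

def ask(model: Model, q: Question) -> int:
    return model.get(q, 0)

def human_consulting(model: Model, q: Question) -> int:
    """Toy human: reads single numbers directly, asks the model about longer halves."""
    if len(q) == 1:
        return q[0]
    mid = len(q) // 2
    halves = (q[:mid], q[mid:])
    return sum(h[0] if len(h) == 1 else ask(model, h) for h in halves)

def halving_closure(q: Question) -> Set[Question]:
    """Training questions: q plus every sub-question reached by repeated halving."""
    if len(q) == 1:
        return {q}
    mid = len(q) // 2
    return {q} | halving_closure(q[:mid]) | halving_closure(q[mid:])

def imitation_loss(model: Model, questions: Iterable[Question]) -> int:
    """Total gap between the model's answers and the human-consulting-model's answers."""
    return sum(abs(ask(model, q) - human_consulting(model, q)) for q in questions)

def distill(model: Model, questions: Iterable[Question]) -> Model:
    """Idealized training step: the new model copies the human consulting the old one."""
    return {q: human_consulting(model, q) for q in questions}

root: Question = (1, 2, 3, 4, 5, 6, 7, 8)
questions = halving_closure(root)

model: Model = {}
losses = [imitation_loss(model, questions)]
for _ in range(4):
    model = distill(model, questions)
    losses.append(imitation_loss(model, questions))
```

In this toy, zero imitation loss means the model is a fixed point of "human consulting the model," which here is the correct sum at every node of the tree. Each round of distillation extends correctness one level further up the tree, which is the sense in which the optimum is the whole tree of humans rather than a single human.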
So, why does imitative amplification have a stronger case for being outer aligned at optimum than approval-based amplification or debate? Well, precisely because we know what the optimum of imitative amplification is—it's HCH—whereas we really don't know what perfect approval-based amplification or perfect debate look like. Though some challenges have been raised regarding whether HCH is actually aligned or not, I tend to be fairly skeptical of these challenges—HCH is just a bunch of humans after all and if you can instruct them not to do things like instantiate arbitrary Turing machines, then I think a bunch of humans put together has a strong case for being aligned. That being said, the same argument does not at all apply to approval-based amplification or debate.
First, let's consider approval-based amplification. We know what the optimum of imitative amplification looks like—but what is the optimum of approval-based amplification? At first glance, one might imagine that the optimum of approval-based amplification looks like a model whose output is selected to be maximally approved of by HCH. That's very much not the case for the approval-based scheme I described earlier, however. If each step of training is done via maximizing an approval signal, then instead of a tree of humans you get a tree of models trying to maximize the approval that their parents in the tree would assign to their answers. And if you think that human approval can be gamed—which seems extremely likely in my opinion given that we see exactly that sort of gaming happening in our world already all the time—then this is very much not a safe tree. Now, one could make the argument that approval-based amplification can just become imitative amplification if the humans determine their approval by computing a distance function between what they would have said and what the model produced as its output. For example, you could ask your humans to come up with their answers first, then show them the model's answer and ask them to rate how close it was. I'm pretty skeptical of this approach, however—it doesn't seem at all clear to me that this gets around the approval-gaming problem, since the humans still get to see the model's answer and doing so could significantly change how they're thinking about the rating problem.
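The worry about gameable approval can be seen in a deliberately tiny Python sketch. The two candidate answers, their scores, and the `gameability` knob are all invented for illustration; nothing here is measured or taken from a real system. The only point is structural: if approval has any component that rewards something other than matching what the human would have said, the approval optimum and the imitation optimum can come apart.

```python
from typing import Dict

# Hypothetical candidate answers to a single question, with made-up scores.
# "matches_human": 1.0 if this is what the human would have answered, else 0.0.
# "persuasiveness": how much the answer inflates approval regardless of truth.
candidates: Dict[str, Dict[str, float]] = {
    "honest": {"matches_human": 1.0, "persuasiveness": 0.2},
    "flattering": {"matches_human": 0.0, "persuasiveness": 1.5},
}

def imitation_loss(name: str) -> float:
    """Zero exactly when the answer matches what the human would have said."""
    return 1.0 - candidates[name]["matches_human"]

def approval(name: str, gameability: float) -> float:
    """Assumed approval signal: correctness plus a gameable persuasion term."""
    c = candidates[name]
    return c["matches_human"] + gameability * c["persuasiveness"]

def approval_optimum(gameability: float) -> str:
    return max(candidates, key=lambda name: approval(name, gameability))

imitation_optimum = min(candidates, key=imitation_loss)
ungamed = approval_optimum(gameability=0.0)
gamed = approval_optimum(gameability=1.0)
```

With `gameability` at zero the two objectives agree, and with it turned up the approval maximizer switches to the persuasive answer while the imitation minimizer does not move.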
Now, second, let's consider debate with a human judge. In many ways, debate was designed as an approach meant to fix the problems of approval-based reward signals. With a standard approval-based reward signal, the argument goes, it's easy to be tricked by a bad argument that you don't fully understand. In a debate setup, however, you get the benefit of having two competing systems trying to point out flaws in each other's arguments, which hopefully should prevent you from being tricked by bad arguments and thus fix the approval-gaming problem. I'm not convinced, though—false things can be significantly easier to argue for than true things just because there are fewer ways to attack them, they're more rhetorically powerful, or any other number of possible ways in which an argument can be subtly wrong yet still persuasive. Regardless, however, I think the more fundamental objection is just that we really have no way of knowing what optimal play in debate looks like, which makes it very difficult to ever know whether it is outer aligned at optimum or not. With HCH, we know that it just looks like a tree of humans, which at least means we can reason about the parts and how they interact. With optimal debate, however, we have to somehow analyze, understand, and be confident in the alignment of superhuman play on a game involving humans assessing arbitrary strings of characters, which is something that in my opinion seems extremely difficult to do.
Addressing competitiveness concerns
All of that is an argument for why we should prefer imitative amplification from an alignment standpoint. However, there's also the problem of imitative amplification just not being competitive in terms of capabilities with other approaches. First of all, I think it's important to remember the importance of putting safety first: if something isn't safe, then we shouldn't build it. Of course, arms race dynamics could end up forcing one's hand into going with the best currently available option in order to beat some other team which one believes will produce an AI which is even less likely to be safe, though I think it's important to remember that that's a last resort, not the default option. Furthermore, even in such a situation, it's still probably fine to eat an overhead cost that is just something like a constant factor worse.
With that being said, I still think there are strong arguments to be made for why imitative amplification can be done competitively. First, like the silly outer alignment scheme of “just train an image classification model” from earlier, imitative amplification gets to piggy-back off of generic ML progress. Imitative amplification is just a language modeling problem, which means generic progress on language modeling should generally be transferable to imitative amplification. Second, I think there is a strong case for language being sufficiently rich as a dataset for training an AGI (EDIT: where “language” is construed to also include embedded images, videos, etc.), at least for the sorts of tasks which I think you will want to use your first AGI for. For example, if the primary/most important purpose of your first AGI is to help you build your second AGI by helping you improve your AGI design, that's the sort of highly cognitive task which I think language is sufficient for. Certainly, if you needed your first AGI to be able to do fine motor control to be competitive, then imitative amplification probably won't get you there, but it seems pretty unlikely to me that the ability to do fine motor control will be an important desideratum. Third, a common criticism of imitative amplification is that because imitation treats all data points the same, it won't be able to focus on the most important ones. However, that's not something that's fundamental to the task of imitation. For example, you could use active learning to select the most important data points rather than just using a fixed curriculum. Or, you could even weight different data points in your imitation loss using some outside importance criterion while still maintaining the guarantee of perfect imitation at optimum.
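The claim about weighting can be checked on a toy case. The Python sketch below is an illustration under assumptions, not a description of any actual training setup: the three questions, the `target` answers, and the weight values are all hypothetical. It enumerates every binary-output model on three inputs and compares the minimizers of a uniformly weighted and a non-uniformly weighted imitation loss.

```python
from itertools import product
from typing import Dict, List

inputs = ["q0", "q1", "q2"]
target: Dict[str, int] = {"q0": 1, "q1": 0, "q2": 1}  # hypothetical answers to imitate

def weighted_imitation_loss(model: Dict[str, int], weights: Dict[str, float]) -> float:
    """Sum of per-question weights over the questions the model gets wrong."""
    return sum(w * (model[x] != target[x]) for x, w in weights.items())

def minimizers(weights: Dict[str, float]) -> List[Dict[str, int]]:
    """Every model (over all 2^3 candidates) attaining the minimum weighted loss."""
    models = [dict(zip(inputs, outs)) for outs in product([0, 1], repeat=len(inputs))]
    best = min(weighted_imitation_loss(m, weights) for m in models)
    return [m for m in models if weighted_imitation_loss(m, weights) == best]

uniform = {x: 1.0 for x in inputs}
importance = {"q0": 10.0, "q1": 1.0, "q2": 0.1}  # made-up importance weights, all > 0
dropped = {"q0": 10.0, "q1": 1.0, "q2": 0.0}     # one question given zero weight

uniform_opt = minimizers(uniform)
importance_opt = minimizers(importance)
dropped_opt = minimizers(dropped)
```

As long as every weight is strictly positive, the only way to reach zero loss is still to match the target everywhere, so the optimum is unchanged and only the relative cost of different errors moves. The `dropped` case shows the caveat: a zero weight leaves that question unconstrained, and perfect imitation at optimum is no longer guaranteed.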
Regardless, I think the case for imitative amplification's safety is a strong argument in favor of at least focusing on figuring out whether it works and is safe first, before we give up and move to other approaches. Furthermore, even if imitative amplification on its own isn't competitive, I don't think that means we have to abandon it completely—there are modifications to imitative amplification that can be made to help improve competitiveness without sacrificing all of its benefits. For example, you could do reward-modeling-based distillation (e.g. RL + IRL as the distillation step) instead of imitation-based distillation, which, while not imitative (as the optimum isn't HCH anymore), also isn't based on human approval, which could be a nice property. Alternatively, you could first train an HCH model, and then use that model as the judge to train a debate model, which could have significant benefits over just using a human judge. While I don't think we should be focusing on those sorts of things now, the fact that such options exist makes it more likely that imitative amplification work can transfer to future approaches even if imitative amplification itself ends up not being competitive. In any event, I think the case for focusing on imitative amplification right now both from an outer alignment perspective as well as from a competitiveness perspective is quite strong.
Note that the two categories of “imitative” and “approval-based” amplification do not cover the entire space of possible amplification-based approaches; there are other possible schemes in this domain as well. For example, you could use imitative amplification to train an HCH approximator, then do RL to produce a model which maximizes that model's approval, or even use your HCH model as a judge in a debate. Alternatively, you could do imitative amplification but instead of using standard imitation learning you could do IRL + RL instead. All of these different approaches have different alignment properties. I have singled out imitative amplification, approval-based amplification, and debate with a human judge because they are the approaches I'm most interested in talking about here, though they are far from the only ones.
Note that for the optimum of imitative amplification to be precisely HCH, you need it to be the case that you progressively enlarge your training data as you go along. The fact that you don't get good guarantees for finite datasets is certainly a problem, though it's one that you basically have to solve via inner alignment techniques and thus not one I want to focus on right now.
The question of whether theoretical HCH is aligned or not is a pretty complicated question that I don't really want to go into in full detail right now, so if you strongly disagree just take it as a given for this post.
Though there was a previous claim by William Saunders that RL amplification and imitative amplification are equivalent, I think that both of William's proposals there fall into my approval-based category, not my imitative category. See Rohin Shah's and Wei Dai's comments on William's post to that effect.
I have a lot more to say on this point regarding reasons why false arguments can be more persuasive than true ones, though that's not something I want to go into in too much detail right now.