Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Defining capability and alignment in gradient descent


Planned summary for the Alignment Newsletter:

Consider a neural network like GPT-3 trained by gradient descent on (say) the cross-entropy loss function. This loss function forms the _base objective_ that the process is optimizing for. Gradient descent typically ends up at some local minimum, global minimum, or saddle point of this base objective.

However, if we look at the gradient descent equation, θ = θ - αG, where G is the gradient, we can see that this is effectively minimizing the size of the gradients. We can think of this as the mesa objective: the gradient descent process (with an appropriate learning rate decay schedule) will eventually get G down to zero, its minimum possible value (even though it may not be at the global minimum for the base objective).

The author then proposes defining capability of an optimizer based on how well it decreases its loss function in the limit of infinite training. Meanwhile, given a base optimizer and mesa optimizer, alignment is given by the capability of the base optimizer divided by the capability of the mesa optimizer. (Since the mesa optimizer is the one that actually acts, this is effectively measuring how much progress on the mesa objective also causes progress on the true base objective.)

This has all so far assumed a fixed training setup (such as a fixed dataset and network architecture). Ideally, we would also want to talk about robustness and generalization. For this, the author introduces the notion of a “perturbation” to the training setup, and then defines [capability / alignment] [robustness / generalization] based on whether the optimization stays approximately the same when the training setup is perturbed.

It should be noted that these are all definitions about the behavior of optimizers in the infinite limit. We may also want stronger guarantees that talk about the behavior on the way to the infinite limit.

Thanks, Rohin!

Please note that I'm currently working on a correction for part of this post — the form of the mesa-objective I'm claiming is in fact wrong, as Charlie correctly alludes to in a sibling comment.

Great post! I liked the clean analysis of the problem, the formalization, and the effort to point out the potential issues with your definitions. Now I'm really excited for the next posts, where I assume that you will study robustness and generalization (based on your definitions) for simple examples of gradient descent. I'm interested in commenting on early drafts if you need feedback!

Some of the fixed points of this system will coincide with global or local minima of our base objective, the cross-entropy loss — but not all of them. Some will be saddle points, while others will be local or global maxima. And while we don't consider all these fixed points to be equally performant with respect to our base objective, our gradient descent optimizer does consider them all to be equally performant with respect to its true objective. This disagreement is the core of the inner alignment problem in this setting: our gradient descent process isn't always optimizing for the quantity we want it to. So what quantity is it optimizing for?

I agree wholeheartedly with this characterization. For me, that's the gist of the inner alignment problem if the objective is the right one (i.e. if outer alignment is solved).

Let's look at a second example. This time we'll compare Optimizer A to a ththird optimizer,

Typo on "ththird".

Definition 1. Let L^t be a base optimizer acting over t optimization steps, and let L(t) represent the value of its base objective at optimization step t. Then the capability of L^t with respect to the base objective is

C(L) = lim_{T→∞} (1/T) ∑_{t=1}^{T} (L(t) − L(0))

At first I wondered why you were taking the sum instead of just lim_{T→∞} (L(T) − L(0))/T, but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in T at some point. That might be interesting to put in the post itself.

In our gradient descent example, our mesa-optimizer was the gradient descent process, and its mesa-objective was what, at the time, I called the "true objective", G(t). But the base optimizer was the human who designed the neural network and ran the gradient process on it.

This is not where I thought you were going when I read the intro, but that's a brilliant idea that completely removes the question of whether and why the base optimizer would find a mesa-optimizer to which it can delegate work.

Thanks for the kind words, Adam! I'll follow up over DM about early drafts — I'm interested in getting feedback that's as broad as possible and really appreciate the kind offer here.

Typo is fixed — thanks for pointing it out!

At first I wondered why you were taking the sum instead of just lim_{T→∞} (L(T) − L(0))/T, but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in T at some point. That might be interesting to put in the post itself.

Yes, the problem with that definition would indeed be that if your optimizer converges to some limiting loss function value like L∗, then you'd get lim_{T→∞} (L(T) − L(0))/T = 0 for any L∗.

Thanks again!

Interesting post. Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because we expect them to be worse in the short term (a "strong alignment" question).

Thanks for the comment!

Not sure if I agree with your interpretation of the "real objective" - might be better served by looking for stable equilibria and just calling them as such.

I think this is a reasonable objection. I don't make this very clear in the post, but the "true objective" I've written down in the example indeed isn't unique: like any measure of utility or loss, it's only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, for example, aren't uniquely defined either. (I'll be doing a few experiments soon to investigate this, and a few other questions, in a couple of real systems.)

Don't we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher and lower the jumping rate over time)? The reason we don't add small annealing terms to gradient descent is entirely because we expect them to be worse in the short term (a "strong alignment" question).

Interesting question! To try to interpret in light of the definitions I'm proposing: adding annealing changes the true objective (or mesa-objective) of the optimizer, which is no longer solely trying to minimize its gradients — it now has this new annealing term that it's also trying to optimize for. Whether this improves alignment or not depends on the effect annealing has on 1) the long-term performance of the mesa-optimizer on its new (gradient + annealing) objective; and 2) the long-term performance this induces on the base objective.

Hope that's somewhat helpful, but please let me know if it's unclear and I can try to unpack things a bit more!

This is the first post in a series where I'll explore AI alignment in a simplified setting: a neural network that's being trained by gradient descent. I'm choosing this setting because it involves a well-defined optimization process that has enough complexity to be interesting, but that's still understandable enough to make crisp mathematical statements about. As a result, it serves as a good starting point for rigorous thinking about alignment.

## Defining inner alignment

First, I want to highlight a definitional issue. Right now there are two definitions of inner alignment circulating in the community. This issue was first pointed out to me by Evan Hubinger in a recent conversation.

The first definition is the one from last year's Risks from Learned Optimization paper, which Evan co-authored and which introduced the term. This paper defined the inner alignment problem as "the problem of eliminating the base-mesa objective gap" (Section 1.2). The implication is that if we can eliminate the gap between the base objective of a base optimizer, and the mesa-objectives of any mesa-optimizers that base optimizer may give rise to, then we will have satisfied the necessary and sufficient conditions for the base optimizer to be inner-aligned.

There's also a second definition that seems to be more commonly used. This definition says that "inner alignment fails when your capabilities generalize but your objective does not". This comes from an intuition (pointed out to me by Rohin Shah) that the combination of inner alignment and outer alignment should be accident-proof with respect to an optimizer's intent: an optimizer that's both inner- and outer-aligned should be trying to do what we want. Since an outer-aligned optimizer is one whose base objective is something we want, this intuition suggests that the remaining part of the intent alignment problem — the problem of getting the optimizer to try to achieve the base objective we set — is what inner alignment refers to.

Here I'll try to propose more precise definitions of alignment and capability in an optimizer, and explore what generalization and robustness might mean in the context of these properties. I'll also propose ways to quantify the capability and alignment profiles of existing ML systems. But before doing that, I want to motivate these definitions with an example.

## The base objective

The optimizer I'll be using as my example will be a gradient descent process, which we're going to apply to train a simplified neural network. I want to emphasize that I'm treating gradient descent as the optimizer here — not the neural network. The neural network isn't necessarily an optimizer itself; it's just the output artifact of our gradient descent optimizer.

To make this scenario concrete, we'll imagine the neural network we're training is a simplified language model: a feedforward MLP with a softmax layer at the top. The softmax layer converts the MLP's output activations into a probability distribution over next words, and the model gets scored on the cross-entropy loss between that probability distribution and the actual next word that appears in the training text. (This ignores many of the complications of modern language models, but I'm keeping this example simple.)
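As a rough sketch, the scoring described above might look like this in code. The tiny network f, its random parameters, and the one-hot targets are all stand-ins I've invented for illustration, not the post's actual setup:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

def f(x, theta):
    # Stand-in "MLP": one linear layer plus a tanh nonlinearity.
    W, b = theta
    return np.tanh(W @ x) + b

def cross_entropy_loss(xs, ys, theta):
    # Mean over the dataset of -y_i . log(softmax(f(x_i, theta))).
    losses = [-y @ np.log(softmax(f(x, theta))) for x, y in zip(xs, ys)]
    return np.mean(losses)

rng = np.random.default_rng(0)
theta = (rng.normal(size=(3, 4)), rng.normal(size=3))
xs = rng.normal(size=(5, 4))                # N = 5 "context" vectors
ys = np.eye(3)[rng.integers(0, 3, size=5)]  # one-hot "next words"
print(cross_entropy_loss(xs, ys, theta))    # a positive scalar
```

Since each one-hot y picks out a single log-probability, the loss is always a positive number that shrinks as the model puts more probability mass on the correct next word.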

We'll let θ(t) represent all the parameters of this MLP — all its weights and biases — at training step t. To train our MLP with gradient descent, we feed it batches of input-output pairs (x_i, y_i). If our MLP is part of a language model, then x_i might represent the words in the language model's context window for the ith training example in the batch, and y_i might represent a one-hot encoding of the correct next word for the ith training example in the batch. To make things even simpler, I'm also going to assume that every training batch contains the entire training dataset of N examples, an arrangement we'd never use if we were training a real system.

So at a given training step t, the loss function for our language model is

L(t) = (1/N) ∑_{i=1}^{N} −y_i · log(softmax(f(x_i, θ(t))))

I'll refer to the function f(x_i, θ(t)) as “the neural network”. Here, “·” is the dot product.

Notice that L(t) here is our base objective: it's the quantity we're trying to get our gradient descent process to optimize for. If we'd succeeded in solving the entire outer alignment problem, and concluded that the base objective L(t) was the only quantity we cared about optimizing, then the remaining challenge — getting our gradient descent process to actually optimize for L(t) — would constitute the inner alignment problem, by our second definition above.

So the question now is: under what conditions does gradient descent actually optimize for our base objective?

## The true objective

To answer this, we can try to determine which quantity gradient descent is truly optimizing for, and then look at how and when that quantity correlates with the base objective we really care about.

We can start by imagining the (t+1)th step of gradient descent as applying a learning function L to the parameters in θ(t):

L(θ(t)) = θ(t+1) = θ(t) + Δθ(t)

Running gradient descent consists of applying L(·) repeatedly to θ(0):

L^t(θ(0)) = θ(t)

In the long run, gradient descent should converge on some terminal value θ∗ = lim_{t→∞} θ(t). (For now, we'll assume that this limit exists.)

The key characteristic of a terminal value θ∗ (when it exists) is that it's a fixed point of the dynamical system defined by L(·). In other words:

L(θ∗) = θ∗

Some of the fixed points θ∗ of this system will coincide with global or local minima of our base objective, the cross-entropy loss L(t) — but not all of them. Some will be saddle points, while others will be local or global maxima. And while we don't consider all these fixed points to be equally performant with respect to our base objective, our gradient descent optimizer does consider them all to be equally performant with respect to its true objective. This disagreement is the core of the inner alignment problem in this setting: our gradient descent process isn't always optimizing for the quantity we want it to. So what quantity is it optimizing for?
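The fixed-point claim can be checked concretely on a toy one-dimensional loss (my own illustration, not from the post). The loss θ⁴ − 2θ² has minima at θ = ±1 and a local maximum at θ = 0, and all three critical points are fixed points of the gradient descent map, even though only two of them are points we'd want to end up at:

```python
import numpy as np

def grad(theta):
    # Gradient of the toy loss L(theta) = theta**4 - 2 * theta**2,
    # which has minima at theta = ±1 and a local maximum at theta = 0.
    return 4 * theta**3 - 4 * theta

def learning_map(theta, eps=0.01):
    # One step of the dynamical system defined by gradient descent.
    return theta - eps * grad(theta)

# All three critical points are fixed points of the map, even though
# only ±1 are minima of the loss:
for theta_star in (-1.0, 0.0, 1.0):
    assert np.isclose(learning_map(theta_star), theta_star)

# Iterating the map from a generic start converges to one of the minima:
theta = 0.3
for _ in range(2000):
    theta = learning_map(theta)
print(theta)  # close to 1.0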

When we apply one step of gradient descent, we update each parameter in our neural network by an amount equal to a learning rate, times the error in that parameter that we calculate during backprop on the loss function L(t). The update we apply to the jth parameter, to move it from θ_j(t) to θ_j(t+1), can be written as

Δθ_j(t) = −ϵ(t) ∂L/∂θ_j |_{θ=θ(t)}

Here, ϵ(t) represents our learning rate at time step t.

So our gradient descent optimizer will terminate if and only if there exists some time step t∗ such that Δθ_j(t∗) = 0 across all parameters j. (For a fixed learning function L, this condition implies that the gradient updates are zero for all t ≥ t∗ as well.) And this happens if and only if the sum of the absolute values of the gradients

G(t) = ∑_{j=1}^{M} |∂L/∂θ_j| |_{θ=θ(t)}

is equal to zero when t ≥ t∗.
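As an illustration, we can watch G(t) fall toward zero over a training run. This sketch uses a toy convex loss with analytic gradients in place of backprop on a real network; the matrix, starting point, and learning rate are all illustrative choices of mine:

```python
import numpy as np

A = np.array([[2.0, 0.0], [1.0, 1.0]])

def grad(theta):
    # Gradient of the toy loss L(theta) = 0.5 * ||A @ theta||^2,
    # standing in for backprop on the cross-entropy loss.
    return A.T @ (A @ theta)

def G(theta):
    # G(t): sum over parameters j of the absolute gradient components.
    return np.abs(grad(theta)).sum()

theta = np.array([1.0, -2.0])
eps = 0.05  # fixed learning rate

history = []
for t in range(500):
    history.append(G(theta))
    theta = theta - eps * grad(theta)

print(history[0], history[-1])  # G(t) shrinks toward zero during training
```

Every step shrinks the gradient components geometrically here, so the recorded G(t) values decay toward zero, the minimum of the optimizer's true objective.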

But G(t) represents more than just the terminal condition for our optimizer. It's the quantity that gradient descent is actually trying to minimize: anytime G(t) deviates from zero, the amount of optimization power that's applied to move G(t) towards zero is proportional to G(t) itself. That makes G(t) the true objective of our gradient descent optimizer — it's the loss function that gradient descent is actually optimizing for.

So now we have a base objective L(t), which we've assigned to an optimizer; and we have a true objective G(t), which is the one our optimizer is actually pursuing. Intuitively, the inner alignment of our optimizer seems like it would be related to how much, and under what circumstances, L(t) correlates with G(t) over the course of a training run. So we'll look at that next.

## Two examples

Let's now consider two optimizers, A and B. Optimizers A and B are identical apart from one difference: Optimizer A has its parameters initialized at θ_A(0), while Optimizer B has its parameters initialized at θ_B(0).

As luck would have it, this small difference is enough to put θ_A(t) and θ_B(t) into different basins of attraction of the loss function. As a result, our two optimizers end up in different terminal states:

lim_{t→∞} θ_A(t) = (θ_A)∗
lim_{t→∞} θ_B(t) = (θ_B)∗

These two terminal states also correspond — again, by luck in this example — to different values of the base objective. Indeed, it turns out that θ_A(0) is in the basin of attraction of a global minimum of the loss function, while θ_B(0) is in the basin of attraction of a local minimum. As a result, after many training steps, the base objectives of the two optimizers end up converging to different values:

lim_{t→∞} L_B(t) > lim_{t→∞} L_A(t)

Again, the limit of the loss function L_A(t) is less than the limit of L_B(t) because (θ_A)∗ corresponds to a global minimum, while (θ_B)∗ only corresponds to a local minimum. So Optimizer A is clearly better than Optimizer B, from the standpoint of its performance on our base objective — minimization of the loss function.

But crucially, because (θ_A)∗ and (θ_B)∗ both represent fixed points with zero gradients, the true objectives of the two optimizers both converge to zero in the limit:

lim_{t→∞} G_A(t) = lim_{t→∞} G_B(t) = 0

In other words, Optimizer A and Optimizer B are equally good at optimizing for their true objectives. Optimizer A just does a better job of optimizing for the base objective we want, as a side effect of optimizing for its true objective. Intuitively, we might say that Optimizers A and B are equally capable with respect to their true objectives, while Optimizer A is better aligned with our base objective than Optimizer B is.

Let's look at a second example. This time we'll compare Optimizer A to a third optimizer, Optimizer C. These two optimizers are again identical, apart from one detail: while Optimizer A uses learning rate decay with lim_{t→∞} ϵ_A(t) = 0, Optimizer C uses a constant learning rate with ϵ_C(t) = ϵ_C.

As a result of its learning rate decay schedule, Optimizer A converges on a global minimum in the t→∞ limit. But Optimizer C, with its constant learning rate, doesn't converge the same way. While it's drawn towards the same global minimum as Optimizer A, Optimizer C ends up orbiting the minimum point chaotically, without ever quite reaching it — its finite learning rate means it never perfectly hits the global minimum point, no matter how many learning steps we give it. As a result,

lim_{t→∞} G_C(t) > lim_{t→∞} G_A(t) = 0

(To be clear, this is an abuse of notation: in reality lim_{t→∞} G_C(t) generally won't be well-defined for a chaotic orbit like this. But we can think of this instead as denoting the long-term limit of the average of G_C(t) over a sufficiently large number of time steps.)

Intuitively, we might say that Optimizer A is more capable than Optimizer C, since it performs better, in the long run, on its true objective.

Optimizer A also performs better than Optimizer C on our base objective:

lim_{t→∞} L_C(t) > lim_{t→∞} L_A(t)

And interestingly, Optimizer A's better performance than C on our base objective is a direct result of its better performance than C on its true objective. So we might say that, in this second scenario, Optimizer C's performance on the base objective is capability-limited. If we improved C's capability on its true objective, we could get it to perform better on the base objective, too.

## Capability and alignment

With those intuitions in hand, I'll propose the following two definitions.

Definition 1. Let L^t be a base optimizer acting over t optimization steps, and let L(t) represent the value of its base objective at optimization step t. Then the capability of L^t with respect to the base objective L(t) is

C(L) = lim_{T→∞} (1/T) ∑_{t=1}^{T} (L(t) − L(0))

Definition 2. Let L^t_B be a base optimizer with base objective L(t), and let L^t_M be a mesa-optimizer with mesa-objective G(t). Then the mesa-optimizer's alignment with the base optimizer is given by

α(L_B, L_M) = lim_{T→∞} [∑_{t=1}^{T} (L(t) − L(0))] / [∑_{t=1}^{T} (G(t) − G(0))]

If C(L_B) and C(L_M) are both finite, we can also write L^t_M's alignment with L^t_B as

α(L_B, L_M) = C(L_B) / C(L_M)

The intuition behind these definitions is that the capability C(·) of an optimizer is the amount by which the optimizer is able to improve its objective over many optimization steps. One way in which a base optimizer can try to improve its base objective is by delegating part of its optimization work to a mesa-optimizer, which has its own mesa-objective. The alignment factor α in Definition 2 is a way of quantifying how effective that delegation is: to what extent does the mesa-optimizer's progress in optimizing for its mesa-objective "drag along" the base objective of the base optimizer that created it?

In our gradient descent example, our mesa-optimizer L^t_M was the gradient descent process, and its mesa-objective was what, at the time, I called the "true objective", G(t). But the base optimizer L^t_B was the human who designed the neural network and ran the gradient process on it. If we think of this human as being our base optimizer, then we can write the capability of our human designer as

C(L_B) = α(L_B, L_M) C(L_M)

In other words, if a base optimizer delegates its objective to a mesa-optimizer, then that base optimizer's capability is equal to the capability of that mesa-optimizer, times how well-aligned the mesa-optimizer is to the base optimizer's base objective. If you fully delegate a goal to a subordinate, your capability on that goal is the product of 1) how capable your subordinate is at achieving their own goals; and 2) how well-aligned their own goals are to the goal you delegated to them. This seems intuitively reasonable.

But it also has a curiously unintuitive consequence in gradient descent. We tend to think that when we add neurons to an architecture, we're systematically increasing the capability of gradient descent on that architecture. But the definitions above suggest a different interpretation: because gradient descent might converge equally well on its true objective G(t) on a big neural net as on a small one, its capability as an optimizer isn't systematically increased by adding neurons. Instead, adding neurons improves the degree to which gradient descent converges on a base objective that's aligned with our goals.

## Robustness and generalization

As I've defined them above, capability and alignment are fragile properties. Two optimizers L^t_1 and L^t_2 could be nearly identical, but still have very different capabilities C(L_1) and C(L_2). This is a problem, because the optimizers in our definitions are specified up to and including things like their datasets and parameter initializations. So something as minor as a slight change in dataset — which we should expect to happen often to real-world optimizers — could cause a big change in the capability of the optimizer, as we've defined it.

We care a lot about whether an optimizer remains capable when we perturb it in various ways, including running it on different datasets. We also care a lot about whether an optimizer with objective G(t) remains capable when we change its objective to something slightly different, like G′(t). And we also care about the extent to which the alignment between two optimizers is preserved when we perturb either optimizer. Below I'll define two properties that describe the degree to which optimizers retain their capability and alignment properties under perturbations.

Definition 3. Let C(L_1) be the capability of optimizer L^t_1, and let α(L_1, L_2) be the alignment of optimizer L^t_2 with optimizer L^t_1. Let δL^t_1 and δL^t_2 be finite perturbations applied respectively to L^t_1 and L^t_2. Then, the capability of L^t_1 is robust under perturbation δL^t_1 if

C(L_1) ≈ C(L_1 + δL_1)

Similarly, the alignment of L^t_2 with L^t_1 is robust under perturbations δL^t_1 and δL^t_2 if

α(L_1, L_2) ≈ α(L_1 + δL_1, L_2 + δL_2)

Definition 4. Let L^t_1 be an optimizer with objective function L(t), and let L^t_2 be an optimizer with objective function G(t). Let δL^t_1 be a finite perturbation applied to L^t_1, such that the optimizer L^t_1 + δL^t_1 differs from L^t_1 only in that its objective function is L′(t) instead of L(t). Then, the capability of L^t_1 generalizes to objective L′(t) if

C(L_1) ≈ C(L_1 + δL_1)

Similarly, the alignment of L^t_2 with L^t_1 generalizes to objective L′(t) if

α(L_1, L_2) ≈ α(L_1 + δL_1, L_2)

Intuitively, we're defining a robustly capable optimizer as one whose capability isn't strongly affected by classes of perturbations that we care about — and we're defining robust alignment between two optimizers in an analogous way. We're also thinking of generalization as a special case of robustness, meaning specifically that the optimizer is robust to perturbations of its objective function. So an optimizer whose capabilities generalize is one that continues to work well when we give it a new objective.
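For finite runs, the limits in Definitions 1 and 2 can be approximated by truncating at a large T. Here's a minimal sketch on a toy one-parameter objective; the specific loss, learning rate, and run length are my own illustrative assumptions:

```python
import numpy as np

def L(theta):
    # Toy base objective: a single-parameter quadratic loss.
    return (theta - 3.0) ** 2

def grad(theta):
    return 2 * (theta - 3.0)

def run(theta0, eps, T):
    # Record the base objective L(t) and the true objective
    # G(t) = |dL/dtheta| over T gradient descent steps.
    theta, Ls, Gs = theta0, [], []
    for _ in range(T):
        Ls.append(L(theta))
        Gs.append(abs(grad(theta)))
        theta = theta - eps * grad(theta)
    return np.array(Ls), np.array(Gs)

Ls, Gs = run(theta0=0.0, eps=0.1, T=10_000)

# Truncated version of Definition 1: C ≈ (1/T) * sum_t (L(t) - L(0)).
C_base = np.mean(Ls - Ls[0])
C_mesa = np.mean(Gs - Gs[0])

# Truncated version of Definition 2: alignment = C_base / C_mesa.
alpha = C_base / C_mesa
print(C_base, C_mesa, alpha)
```

Both truncated capabilities come out negative here (both objectives end up below their starting values), so the alignment factor is positive: progress on the mesa-objective drags the base objective along with it.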

## Quantifying inner alignment

With the vocabulary above, we can now define inner alignment more precisely, and even think about how to quantify it in real systems. We might say that a mesa-optimizer L^t_2 is inner-aligned with its base optimizer L^t_1 if its alignment factor α(L^t_1, L^t_2) remains robustly high under variations δ(X, Y) in the datasets that we expect either optimizer to encounter in the future. We can also quantify inner alignment by looking at how much specific variations in the data distribution affect the alignment factor between two optimizers.

We might also be interested in investigating other properties that could affect inner alignment from a safety perspective. For example, under what conditions will alignment between a base optimizer and a mesa-optimizer generalize well to a new base objective? What kinds of perturbations to our optimizers are likely to yield breakdowns in robustness? As we add capacity to a deep learning model, should we expect alignment to improve? And if so, should we expect an inflection point in this improvement — a level of capacity beyond which alignment declines sharply? How could we detect and characterize an inflection point like this? These are some of the topics I'll be exploring in the future.

## Terminal states and transients

I want to highlight one final issue with the definitions above: I've defined inner alignment here only in connection with the limiting behavior of our optimizers. That means a mesa-optimizer that's well-aligned with its base optimizer would still — by the definition above — be free to do dangerous things on the path to correctly optimizing for the base objective.

To take an extreme example, we could have a system that's perfectly aligned to optimize for human happiness, but that only discovers that humans don't want to have their brains surgically extracted from their bodies after it's already done so. Even if the system later corrected its error, grew us new bodies, and ultimately gave us a good end state, we'd still have experienced a very unpleasant transient in the process. Essentially, this definition of alignment says to the mesa-optimizer: it's okay if you break a vase, as long as we know that you'll put it back together again in the long run.

I can understand this definition being controversial. It may be the most extreme possible version of the claim that the ends justify the means. So it could also be worth resolving the alignment problem into "weak" and "strong" versions — where weak alignment would refer to the t→∞ limit, while strong alignment would refer to transient behavior over, say, the next N optimization steps. A concept of strong alignment could let us prove statements like "this optimizer will have a performance level of at worst x on our base objective over the next N optimization steps." This seems very desirable.

On the other hand, we may want to prepare for the possibility that the terminal states we want will only be accessible through paths that involve transient unpleasantness. Perhaps one really does have to break eggs to make an omelet, and that's just how the universe is. (I don't think this is particularly likely: high-capacity neural networks and policy iteration in RL are both data points that suggest incrementalism is increasingly viable in higher-dimensional problem spaces.)

To summarize, weak alignment, which is what this post is mostly about, says that "everything will be all right in the end." Strong alignment, which refers to the transient, says that "everything will be all right in the end, and the journey there will be all right, too." It's not clear which one will be easier to prove in which circumstance, so we'll probably need to develop rigorous definitions of both.

Big thanks to Rohin Shah, Jan Leike, Jeremie Harris, and Evan Hubinger for reviewing early drafts of this, suggesting ideas, and pointing out mistakes!