Deceptive alignment seems to be the likely outcome of training a sufficiently intelligent AI using gradient descent.  An ML model with a long term goal and an understanding of its training process would probably pretend to be aligned during training, as this would reduce the amount that gradient descent changes its goals and make humans more likely to deploy it, allowing the model to pursue its goal in the long run.  This seems likely to occur because the gradient descent path to deceptive alignment is shorter than the path to genuine alignment, since models must have very specific goals in order to become genuinely aligned.  Moreover, gradient updates towards deceptive alignment generate large performance increases, further favoring the deceptively aligned path.  Deceptively aligned models are also simpler than genuinely aligned models and so are favored by gradient descent's simplicity bias.  While deceptively aligned models are slower and use more computational power than genuinely aligned ones, biasing our models to use less compute is unlikely to be a viable alignment strategy as it would not be performance competitive.  In order to avoid deceptive alignment, we must find new ways to make gradient descent favor genuinely aligned models.


Epistemic Status

The claims made in this piece mirror those made in Evan Hubinger's "How Likely is Deceptive Alignment," which are somewhat speculative but based on a fair amount of thought and empirical research.



Thanks to Evan Hubinger, Max Nadeau, Thomas Larsen, Rachel Weinberg, Tom Shlomi, and Robi Rahman for providing feedback.


1: Introduction

In “How Likely Is Deceptive Alignment,” MIRI alignment researcher Evan Hubinger argues that intelligent AIs trained by gradient descent will by default be deceptively (mis)aligned.  But what does this mean?

Imagine that the Christian God is trying to create a human who does what the Bible says, at least so long as they are on earth.  He could do this in three ways:

  • He could create Jesus Christ, who, as a copy of God, has literally identical goals to God (which are the same goals that the Bible instructs you to pursue); Jesus is internally aligned.  
  • He could create Martin Luther, who does not have the exact same goals as God but has the goal of understanding what the Bible says and then doing that.  As a result, to the extent that Luther understands the Bible, he will act aligned with it.  Moreover, if he misunderstands the Bible, his misunderstanding can be corrected through further study or through God directly telling him what the Bible really says.  Luther is therefore corrigibly aligned.  
  • Finally, he could create Blaise Pascal, who acts aligned only because doing so helps him accomplish his own goal of getting into heaven.  If Pascal ceases to have an instrumental reason to act aligned with God, he will start pursuing his own goals by other means.  Pascal is deceptively aligned.

While we will not grant the model we train eternal life in heaven if it acts aligned during training, it is still likely to have an instrumental reason to perform well during training even if it is not aligned.  This is because if the model performs poorly on the training objective during training, then gradient descent will modify it to make it perform better, including by changing its goal.  However, these modifications are likely to make the model worse at pursuing its own goal because it is not aligned with the training objective.

In contrast, if the model perfectly pursues the training objective during training, gradient descent would change the model and its objective less because such updates would improve its performance by less (or not at all).   Additionally, human observers may be fooled into thinking that the model is aligned and then deploying it, at which point it could freely pursue its actual goal.  Thus, the model is incentivized to pretend to be aligned during training.

The problem with deceptively aligned models is that they appear to pursue the training objective during training and so behave indistinguishably from genuinely aligned models.  Thus, in order to determine the likelihood that we have trained a deceptively aligned model, we cannot just observe its behavior but must have a prior understanding of the “inductive biases” of gradient descent.  

Inductive biases are similar to Bayesian priors in that they determine which particular model, out of all of the models which exhibit a given level of performance, will actually be trained at that level of performance.  For example, a training process might have an inductive bias towards simplicity, meaning that given a level of performance, the training process is more likely to train the simpler models which perform at that level.
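As a rough illustration of the analogy with priors, here is a minimal sketch in which several hypothetical models all reach the same training performance and a simplicity bias decides which one is most likely to be trained.  The model names, the description lengths, and the 2^-length weighting are assumptions chosen for illustration; they are not taken from Hubinger's talk.

```python
# Hypothetical candidates that all reach the same training performance.
# The values are assumed description lengths in bits (illustrative only).
candidates = {"simple": 10, "medium": 14, "complex": 20}

def simplicity_prior(lengths):
    """Weight each model by 2^-length, then normalise into a distribution."""
    weights = {name: 2.0 ** -bits for name, bits in lengths.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

selection_probability = simplicity_prior(candidates)
most_likely = max(selection_probability, key=selection_probability.get)

for name, prob in selection_probability.items():
    print(f"{name}: {prob:.4f}")
print("most likely to be trained:", most_likely)
```

Observed behavior cannot separate these candidates, since by assumption they perform identically; only the prior does.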

High vs Low Path Dependence

There are two broad possibilities for how the inductive biases of gradient descent might work:

  • If gradient descent has low path dependence, different training runs with different initializations and different orderings of training data end up converging.  This would probably occur because gradient descent’s inductive biases favor models with particular properties, so gradient descent produces the model which, at a given level of performance, exhibits those properties to the greatest extent.[1]
  • If gradient descent has high path dependence, then differences between different training runs result in significantly different final models.  Thus, understanding the outcome of gradient descent in the high path dependence case requires us to understand the exact path through model space which the model is likely to take in training rather than merely knowing what properties the final model is likely to have.  
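The two cases above can be sketched with one-dimensional toy losses.  This is only an analogy under assumed loss functions: a convex loss stands in for low path dependence (every initialization ends at the same point), and a double-well loss stands in for high path dependence (the outcome depends on where training starts).  Real training landscapes are far higher dimensional, and nothing here shows which regime they are in.

```python
def gradient_descent(grad, x0, lr=0.05, steps=500):
    """Plain gradient descent on a one-dimensional loss."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

def convex_grad(x):
    """Gradient of f(x) = x^2, which has a single minimum at 0."""
    return 2 * x

def double_well_grad(x):
    """Gradient of g(x) = (x^2 - 1)^2, which has minima at -1 and +1."""
    return 4 * x * (x * x - 1)

initializations = [-1.5, -0.5, 0.5, 1.5]
low_dependence = [gradient_descent(convex_grad, x0) for x0 in initializations]
high_dependence = [gradient_descent(double_well_grad, x0) for x0 in initializations]

print("convex loss:     ", [f"{x:.3f}" for x in low_dependence])
print("double-well loss:", [f"{x:.3f}" for x in high_dependence])
```

In the convex case only the properties of the minimum matter; in the double-well case you have to know the starting point and the path taken from it.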

Hubinger argues that we might observe either case, and so considers both of them.


2: High Path Dependence

Internal Alignment

Beginning with the case of high path dependence, Hubinger asks how likely we are to train an internally aligned model.  

The most likely path to internal alignment begins with the model being randomly initialized.  Then, early in training, the model acquires a proxy goal which is a decent approximation of the training objective.  Finally, this proxy is updated over time by gradient descent until it becomes essentially identical to the training objective.  At this point, the model would likely continue to stay internally aligned even if it were trained more, as gradient descent could not improve its performance on the training objective by changing its goal.  Throughout this process, the model’s capabilities and understanding of the world also increase.

However, this path probably requires that the model internalizes the training objective as its own goal before it begins to understand the training process it is being subjected to.  This is because once the model has an understanding of its training process and training objective, then if it is not already internally aligned but has a long term goal, it will be incentivized to pretend to pursue the training objective.  This is for the previously mentioned reason that if the model directly pursues its actual goal, it will be modified by gradient descent, preventing it from achieving its objective over the long run.  


In general, two factors can help us determine how likely different training paths are:

  • The first factor is that gradient descent updates the model in the direction that locally increases its performance the most.  Thus, as long as a model’s performance would significantly improve if it were updated slightly in the direction of a particular model class, gradient descent will probably update the model in that direction.
  • The second factor is how many steps of gradient descent are needed for the model to become a member of a given model class.  If many steps are needed, then there are more opportunities for the model to become “derailed” from its training path and veer off towards a different model class.  

These factors do not favor internal alignment.  Firstly, because there are diminishing performance returns to making the model’s proxy more aligned with the training objective, the performance gain per step towards internal alignment eventually becomes very low.  Because of this fact and because the model must learn a very specific goal to become internally aligned, it will also take many steps to reach internal alignment.  

In contrast, while the performance returns to improving the model’s understanding of the world also diminish, the model does not need a perfect understanding of the world in order to understand its training process, whereas its proxy must be perfect in order for it to be internally aligned.  

Moreover, while the model would probably eventually learn about its training process as a result of understanding the world in general (particularly if it has access to the huge amount of ML related text online), learning about the fact that it is an ML model in training would also improve its performance at the many tasks like interacting with humans or navigating the physical world which benefit from it having an understanding of its situation.  

As a result, if we train the model to do such tasks, gradient descent will actively push it towards understanding its training process.

Finally, Hubinger argues that the returns to improving the model’s understanding of the world diminish more slowly than the returns to improving the model’s proxy.  Assuming the training objective is somewhat complicated, it will be difficult for gradient descent to align the proxy perfectly with it.  However, there are likely simpler, easier to find objectives that are highly correlated with the training objective, meaning that gradient descent is likely to find these proxies relatively early and then face significant diminishing returns to improving them further.  

Conversely, it is difficult to develop a highly incomplete understanding of the world that still mostly enables the model to perform well, so the returns to improvements to the model’s understanding of the world will likely diminish more slowly than those to improving the proxy.

For these reasons, it seems likely that the model will learn about its training process before it becomes fully internally aligned and thus veer off towards deceptive alignment.


Corrigible Alignment

A model could become corrigibly aligned if it gets to the point where its understanding of the training objective is closer to the actual training objective than its proxy goal is, and then gradient descent changes the model’s goal to a “pointer” to its understanding of the training objective.  In other words, gradient descent could modify the model by connecting the circuit representing the training objective to the circuit which encodes the model’s goal such that the model pursues its best understanding of the training objective.  

As in the case of internal alignment, once the model becomes fully corrigibly aligned it would probably stay corrigibly aligned, as gradient descent could not increase its performance by modifying its goal.  Thus, corrigible alignment seems more likely than internal alignment, as the model does not need to become aligned before it learns about the training objective in order to become corrigibly aligned.

However, the path to corrigible alignment suffers from some of the same problems as the path to internal alignment.  Pointers, like proxy goals, vary in quality (a pointer could point to something other than the model’s understanding of the training objective but which is still highly correlated with the training objective), and there are diminishing returns to improving pointer alignment for the same reason as in the case of proxy alignment.  Thus, to become fully corrigibly aligned, the model will need to go down a training path that eventually has very low performance gains per step.  

While the corrigibly aligned path is probably shorter than the internally aligned path because there are more ways of implementing the necessary pointer than ways of getting the model to perfectly internalize the training objective, it seems likely to still be fairly long.

Unfortunately, an unaligned pointer which points to the model’s representation of a long term objective could lead the model to become deceptive in the same way as an unaligned long term proxy goal.  For example, if the model’s pointer points at the model’s representation of the code for the training objective rather than its representation of the actual training objective in the world, then the model would be incentivized to act aligned in training and then, once deployed, seize control of the computer encoding its training objective and maximize its value.  Thus, the corrigibly aligned path could easily veer off towards deceptive alignment.

Deceptive Alignment

The most likely path to deceptive alignment begins identically to the corrigibly aligned path until the model learns about the training process.  Then, if the model already has a long term goal, it would start to act deceptively; if not, gradient descent could change its goal into a long term objective, causing it to become deceptively aligned.

This path seems likely for several reasons.  

Firstly, if the model’s proxy when it begins to understand the training objective is a long term goal, then it will immediately start to become deceptive before the paths to internal or corrigible alignment can even begin.  

Even if gradient descent must change the model’s proxy into a long term goal for it to become deceptively aligned, this path has higher performance gains per step than the corrigibly or internally aligned paths.  This is because the internally and corrigibly aligned paths require slowly updating the model’s proxy/pointer to become increasingly aligned with the training objective in order to reach maximum performance.  In contrast, the deceptively aligned path achieves maximum performance immediately upon making the simple modification of giving the model a long term goal because any deceptively aligned model attempts to do as well on the training objective as possible.  Thus, the deceptively aligned model’s performance increases sharply over a relatively short number of gradient updates.  For the same reason, the deceptively aligned path is shorter.

Finally, even if the model does not begin down the deceptively aligned path immediately upon learning about the training process, it could still become deceptively aligned after starting down one of the genuinely aligned paths, through the mechanisms discussed in the internal and corrigible alignment sections.
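The claim about performance gains per step and path length can be made concrete with a toy calculation.  The sketch below assumes that every update closes a fixed fraction of the remaining gap to perfect performance, so gains diminish as the gap shrinks.  The two rates are made-up numbers chosen only to contrast slow refinement of a proxy or pointer with the larger gains claimed for the deceptive path; they are not estimates of anything real.

```python
def steps_to_reach(target, close_fraction, start=0.5):
    """Count updates until performance reaches `target`.

    Each update closes `close_fraction` of the remaining gap to perfect
    performance (1.0), so the gain per step shrinks as the gap shrinks.
    """
    performance, steps = start, 0
    while performance < target:
        performance += close_fraction * (1.0 - performance)
        steps += 1
    return steps

# Assumed toy rates, not measurements.
ALIGNED_RATE = 0.05    # slow refinement of a proxy goal or pointer
DECEPTIVE_RATE = 0.5   # large gains once the goal becomes long term

aligned_steps = steps_to_reach(0.99, ALIGNED_RATE)
deceptive_steps = steps_to_reach(0.99, DECEPTIVE_RATE)
print("aligned path steps:  ", aligned_steps)
print("deceptive path steps:", deceptive_steps)
```

Under these assumptions the slower path needs many more updates to reach the same performance, which leaves more opportunities for it to be derailed along the way.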

Altogether, in the high path dependence world, internal alignment seems unlikely, and while corrigible alignment seems somewhat more likely, deceptive alignment seems most likely.

3: Low Path Dependence

If gradient descent has low path dependence, then it converges across different training runs to the model that, of all the models which perform perfectly in training, exhibits certain high-level properties to the greatest extent.  In particular, gradient descent seems to favor models which implement simpler (i.e. lower Kolmogorov complexity) algorithms.  It also favors models which implement faster, less computationally expensive algorithms.  These two inductive biases are not exhaustive, but Hubinger focuses on them because they are well-understood and certainly among the empirically observed inductive biases of gradient descent.


Simplicity Bias

The simplicity bias favors deceptive alignment.  For a variety of reasons (e.g. simpler algorithms often require fewer of the parameters of the neural network which implements them to have specific values), simpler algorithms are generally implemented by more possible models.  Thus, we can estimate the simplicity of a model class by how many possible models implement it.

The parameters of an internally aligned model must be specified in great detail so that it has precisely the same goals as the person to whom it is aligned.  Thus, few models are internally aligned, so internally aligned models are complex.  There are more corrigibly aligned models, as there are many different ways of implementing a pointer to the model’s understanding of the training objective, so corrigibly aligned models are simpler than internally aligned ones.  However, deceptively aligned models are the simplest of all, as any of the many possible models with a long term goal can be deceptively aligned.  Thus, the simplicity bias favors deceptive alignment.
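The counting argument above can be illustrated with a very small example.  The sketch below enumerates every setting of a single threshold unit with two binary inputs, whose two weights and bias are each restricted to -1, 0, or 1, and counts how many settings implement each input-output function.  This setup is an assumption chosen so the enumeration stays tiny; it says nothing directly about large networks or about alignment-relevant model classes.

```python
from collections import Counter
from itertools import product

# The four possible inputs to a two-input unit, in a fixed order.
INPUTS = list(product([0, 1], repeat=2))

def truth_table(w1, w2, b):
    """Outputs of a threshold unit that fires when w1*x1 + w2*x2 + b > 0."""
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)

# Count how many of the 27 parameter settings implement each function.
counts = Counter(truth_table(*params) for params in product([-1, 0, 1], repeat=3))
total_settings = sum(counts.values())

for table, n in counts.most_common():
    print(table, f"{n}/{total_settings}")
```

In this toy space the constant functions are implemented by the most parameter settings, while a function like AND is implemented by exactly one, which matches the idea that algorithms placing fewer constraints on the parameters correspond to more possible models.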


Speed Bias

In contrast, the speed bias favors internally and corrigibly aligned models.  While internally aligned and corrigibly aligned models pursue the training objective directly, deceptively aligned models have to go through the extra computational step of inferring that they should pursue the training objective rather than their own objective in the short term.  In other words, deception adds extra reasoning steps to the thought process of the deceptively aligned models relative to the internally and corrigibly aligned ones, meaning that deceptively aligned models will require more computational steps to run and be slower.

However, if we try to bias our models towards speed (e.g. by training smaller models), we will hamper their performance.  Empirically, the smallest models which perfectly fit the training data generalize terribly to the test dataset, as they overfit to the training set while larger models learn simpler and better-generalizing algorithms.  As a result, it seems likely that if anything, we will bias our training processes towards simplicity and would struggle to make AGI if we implemented a speed bias.


Deceptive alignment seems to be the likely outcome of training a sufficiently intelligent AI via gradient descent in both the high and low path dependence world.  Thus, in order to avoid deceptive alignment, we need to modify the training regime in such a way that it somehow actively avoids deceptively aligned models.  The difficulty of this task remains uncertain.


  1. ^

    See for further explanation of this hypothesis



Thus, in order to avoid deceptive alignment, we need to modify the training regime in such a way that it somehow actively avoids deceptively aligned models.

I wasn't thinking in these terms yet, but I reached a similar conclusion a while back and my mainline approach of training in simboxes is largely concerned with how to design test environments where you can train and iterate without the agents knowing they are in a sim training/test environment.

I also somewhat disagree with the core argument in that it proves too much about humans. Humans are approximately aligned, and we only need to match that level of alignment.

The difference in my approach in that article is that I reject the notion that we should aim first for directly aligning the agents with their creators (the equivalent of aligning humans with god). Instead we should focus first on aligning agents more generally with other agents, the equivalent of how humans are aligned to other humans (reverse engineering human brain alignment mechanisms). That leads to mastering alignment techniques in general, which we can then apply to alignment with humanity.

I also somewhat disagree with the core argument in that it proves too much about humans. Humans are approximately aligned, and we only need to match that level of alignment.

Hmm, humans do appear approximately aligned as long as they don't have definitive advantages. "Power corrupts" and all that. If you take an average "aligned" human and give them unlimited power and no checks and balances, the usual trope happens in real life.

Yeah, the typical human is only partially aligned with the rest of humanity and only in a highly non-uniform way, so you get the typical distribution of historical results when giving supreme power to a single human - with outcomes highly contingent on the specific human.

So if AGI is only as aligned as typical humans, we'll also probably need a heterogeneous AGI population and robust decentralized control structures to get a good multipolar outcome. But it also seems likely that any path leading to virtual brain-like AGI will also allow for selecting for altruism/alignment well outside the normal range.