Deceptive Alignment is <1% Likely by Default

DavidW

Thanks to Wil Perkins, Grant Fleming, Thomas Larsen, Declan Nishiyama, and Frank McBride for feedback on this post. Thanks also to Paul Christiano, Daniel Kokotajlo, and Aaron Scher for comments on the original post that helped clarify the argument. Any mistakes are my own.

In order to submit this to the Open Philanthropy AI Worldview Contest, I’m combining this with the previous post in the sequence and making significant updates. I'm leaving the previous post, because there is important discussion in the comments, and a few things that I ended up leaving out of the final version that may be valuable.

Introduction

In this post, I argue that deceptive alignment is less than 1% likely to emerge for transformative AI (TAI) by default. Deceptive alignment is the concept of a proxy-aligned model becoming situationally aware and acting cooperatively in training so it can escape oversight later and defect to pursue its proxy goals. There are other ways an AI agent could become manipulative, possibly due to biases in oversight and training data. Such models could become dangerous by optimizing directly for reward and exploiting hacks for increasing reward that are not in line with human values, or something similar. To avoid confusion, I will refer to these alternative manipulative models as direct reward optimizers. Direct reward optimizers are outside of the scope of this post.

Summary

In this post, I discuss four precursors of deceptive alignment, which I will call foundational properties. I first argue that two of these are unlikely to appear during pre-training. I then argue that the order in which these foundational properties develop is crucial for estimating the likelihood that deceptive alignment will emerge for prosaic transformative AI (TAI) in fine-tuning, and that the dangerous orderings are unlikely. In particular:

Long-term goals and situational awareness are very unlikely in pre-training.
Deceptive alignment is very unlikely if the model understands the base goal before it becomes significantly goal directed.
Deceptive alignment is very unlikely if the model understands the base goal significantly before it develops long-term, cross-episode goals.

Pre-training and prompt engineering should enable an understanding of the base goal without developing long-term goals or situational awareness. On the other hand, long-term goals will be much harder to train.

Definition

In this post, I use the term “differential adversarial examples” to refer to adversarial examples in which a non-deceptive model will perform differently depending on whether it is aligned or proxy aligned. The deceptive alignment story assumes that differential adversarial examples exist. The model knows it’s being trained to do something out of line with its goals during training and plays along temporarily so it can defect later. That implies that differential adversarial examples exist in training.

Implications of this argument

Many existential AI catastrophe scenarios rely heavily on deceptive alignment. So, if deceptive alignment is highly unlikely (<1%) to emerge for TAI, we should update our expectations for AI existential risk accordingly. This conclusion also warrants a re-evaluation of priorities for alignment research areas and between cause areas. Other possible alignment research topics include governance, direct reward optimizers, and multipolar scenarios that don’t rely on deceptive alignment.

Assumptions about the TAI training paradigm

I’m assuming prosaic development of TAI, using a training process like human feedback on diverse tasks (HFDT). The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations. This high-level training setup is already the default training process for text models such as GPT-4, and this will likely continue because of the flexibility and strong performance it provides. I also expect unsupervised pre-training to be an important part of TAI development. You can find justifications for these assumptions in the appendix.

Foundational properties for deceptive alignment

There are four foundational properties that are key to the deceptive alignment story. These foundational properties are:

Goal-directed behavior - the model needs to have some goal or set of goals about the external world.
Optimizing across episodes/long-term goal horizons - model behavior needs to be influenced by long-term goals.
Conceptualization of the base goal - the model needs to understand what the base goal is in order to successfully play along in training.
Situational awareness - the model needs to understand that it is a model in training. Furthermore, it needs to understand how its actions now could affect its parameter updates, and therefore future goal attainment.

The order in which these develop, and the amount of training that goes on in between the development of foundational properties, will have a large effect on the likelihood of deceptive alignment.

Pre-trained models are unlikely to develop the necessary properties

Pre-trained models are unlikely to have long-term goals

In a typical pre-training process, a model's predicted output for token n is not used in the calculation or scoring of token n+1, and thus loss is calculated independently for each token. There is no incentive for gradient descent to learn to sacrifice performance on the current word for performance on a future word in pre-training.
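
To make the independence claim concrete, here is a minimal sketch, assuming standard teacher-forced next-token training with a cross-entropy loss. The three-token vocabulary and the two toy "models" are hypothetical and exist only for illustration; they are not taken from any real training setup.

```python
import math

# Hypothetical toy vocabulary (an assumption for illustration only).
VOCAB = ["a", "b", "c"]

def per_token_losses(model, tokens):
    """Cross-entropy loss at each position under teacher forcing.

    The context for position n is always the TRUE prefix tokens[:n],
    never the model's own earlier predictions.
    """
    losses = []
    for n in range(1, len(tokens)):
        probs = model(tuple(tokens[:n]))
        losses.append(-math.log(probs[tokens[n]]))
    return losses

def model_v1(context):
    # Hypothetical model: uniform guess at every position.
    return {t: 1 / 3 for t in VOCAB}

def model_v2(context):
    # Same as model_v1, except it predicts badly at the first position.
    if len(context) == 1:
        return {"a": 0.98, "b": 0.01, "c": 0.01}
    return {t: 1 / 3 for t in VOCAB}

tokens = ["a", "b", "c", "a"]
l1 = per_token_losses(model_v1, tokens)
l2 = per_token_losses(model_v2, tokens)

# Only the loss at the position where the prediction changed differs.
print(l1)
print(l2)
```

In this sketch, a worse prediction at one position changes only that position's loss. The loss has no term that rewards giving up accuracy now in exchange for accuracy on a later token.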

Furthermore, if a model gained marginally longer-term goals, but not long-term enough to enable deceptive alignment, the longer-term goals would be a competing priority and harm immediate reward in expectation. Gradient descent should therefore push against this.

The deceptive alignment literature argues that gradient descent could modify a model's goals to be long-term, because that would make it more deceptive and therefore increase training reward. There are therefore some concerns about long-term goals and deceptive alignment in pre-training. However, this is unlikely. For the sake of argument, suppose we have a model in pre-training that has a misaligned proxy goal and relevant situational awareness. But so far, it does not have a long-term goal.

In order to be deceptively aligned, the model has to have a long enough goal horizon to value its total goal achievement after escaping oversight more than its total goal achievement before escaping oversight. But pre-training processes are incredibly long (~the whole internet for text models). A situationally aware model should anticipate a lot of subsequent pre-training, and that the oversight will likely remain strong for a significant portion of fine-tuning. That requires a very long-term goal for any amount of deceptive alignment to be worthwhile.
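
As a rough worked example of how long the goal horizon would need to be, the sketch below assumes the model discounts future goal achievement exponentially with a per-step factor gamma and earns one unit of goal achievement per step. Both assumptions, and the function names, are mine rather than part of the original argument. Under them, achievement after T steps of oversight outweighs achievement before it only when gamma**T > 1/2.

```python
def min_discount_for_deception(steps_under_oversight):
    """Smallest per-step discount factor gamma for which discounted value
    after oversight ends matches discounted value before it ends.

    Assumes one unit of goal achievement per step and an infinite horizon:
      before = (1 - gamma**T) / (1 - gamma)
      after  = gamma**T / (1 - gamma)
    so after > before exactly when gamma**T > 1/2.
    """
    return 0.5 ** (1.0 / steps_under_oversight)

def effective_horizon(gamma):
    """Rough horizon 1 / (1 - gamma) implied by a discount factor."""
    return 1.0 / (1.0 - gamma)

# The required horizon grows in proportion to the length of oversight.
for T in (10, 1_000, 1_000_000):
    gamma = min_discount_for_deception(T)
    print(T, gamma, effective_horizon(gamma))
```

Under these assumptions, the required effective horizon is always longer than the remaining period of oversight, so the longer the remaining training, the longer-term the goal must already be before deception pays off at all.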

Gradient descent updates parameters in the direction that most improves performance hyper-locally. Under the conditions I’ve described, making a myopic model's goals infinitesimally longer-term would not make deceptive alignment worthwhile and would harm training reward. Therefore, I don’t expect the gradients to point toward longer-term goals.
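
The point about hyper-local updates can be sketched with a toy reward curve. Everything here is a hypothetical illustration: the `training_reward` function, its threshold, cost, and bonus values are assumptions chosen to mirror the argument above, not measurements of any real model.

```python
def training_reward(horizon, threshold=100.0, cost=0.01, bonus=5.0):
    """Hypothetical reward as a function of goal horizon.

    - Longer horizons cost a little immediate reward (competing priority).
    - Deception only pays off once the horizon passes `threshold`.
    """
    payoff = bonus if horizon >= threshold else 0.0
    return payoff - cost * horizon

def local_gradient(f, x, eps=1e-4):
    """Central finite-difference estimate of df/dx at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

horizon = 1.0  # a nearly myopic starting point
for _ in range(50):
    # Gradient ascent on reward, with the horizon clipped at zero.
    step = 10.0 * local_gradient(training_reward, horizon)
    horizon = max(0.0, horizon + step)

print(horizon)
```

In this toy setup a long horizon would score higher than a myopic one, but the local gradient near the myopic starting point is negative, so the updates push the horizon back toward zero rather than toward the distant payoff.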

Pre-trained models are unlikely to develop situational awareness

It's hard to imagine how a model would develop situational awareness in pre-training when:

Situational awareness only produces deceptive alignment if the model already has long-term goals, and vice versa. Gradient descent is based on partial derivatives, which are independent of each other, so:
1. If the model doesn't already have long enough goal horizons for deceptive alignment, then marginally more situational awareness doesn't increase deceptive alignment.
2. If the model doesn't already have the kind of situational awareness necessary for deceptive alignment, then a marginally longer-term goal doesn't increase deceptive alignment.
3. Therefore, the partial derivatives shouldn't point toward either property unless the model already has the other.
Unlike in fine-tuning, the vast majority of internet text prompts do not contain information about situational context for the model to figure out that it is an ML model. Understanding that it is a model in training is not relevant for predicting next tokens on the internet, so there's no benefit to learning this information.
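
The partial-derivative argument above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the benefit of deceptive alignment behaves like a product of the two properties, so that it vanishes when either one is absent. The function `deception_benefit` is a hypothetical stand-in, not a claim about how real models work.

```python
def deception_benefit(awareness, horizon):
    """Hypothetical benefit that exists only when BOTH properties are present."""
    return awareness * horizon

def partials(f, a, h, eps=1e-6):
    """Finite-difference partial derivatives of f at (a, h)."""
    d_a = (f(a + eps, h) - f(a - eps, h)) / (2 * eps)
    d_h = (f(a, h + eps) - f(a, h - eps)) / (2 * eps)
    return d_a, d_h

# Neither property present: both partial derivatives vanish.
print(partials(deception_benefit, 0.0, 0.0))

# Long-term goal already present: only now does awareness get a gradient.
print(partials(deception_benefit, 0.0, 1.0))
```

With neither property present, neither partial derivative rewards acquiring one of them. A gradient toward situational awareness appears only once the long-term goal already exists, and vice versa.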

Dangerous foundational property development order is unlikely in fine-tuning

How complete does the understanding of the base goal need to be?

My central argument for fine-tuning is about when an understanding of the base goal emerges relative to other properties. But how accurate and specific does that understanding have to be to avoid deceptive alignment?

The relevant factor is whether the model’s internal goal is its own understanding of the base goal, even if that understanding is flawed. Suppose for the sake of argument that we have a model with an exceptionally long-term goal and situational awareness. Its internal goal is its best, but still flawed, representation (X) of the base goal (X`). This model can’t tell the difference between the base goal and the internal proxy goal, so it can’t be deceptively aligned yet. In short, it thinks it is aligned. If it performs worse than it could have on a training example because of this, the gradients should point toward a stronger concept of the base goal, and therefore stronger alignment.

For example, let’s say we are training this model to follow directions from the prompt, unless they violate ethical norms (X`). The model understands that the base goal is to follow directions subject to some ethical norms but has an understanding of these norms (X) that is incorrect in some situations. Because it doesn’t understand the difference, it can’t act aligned for instrumental purposes, and it will get suboptimal reward when flaws in its understanding of the base goal affect its behavior. Because it is already pointing its goal at this flawed concept, the gradients should point toward improving the internal goal (X) to better match the base goal (X`). In order to become deceptively aligned, it has to have separate concepts for its internal goal and the base goal.
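
A toy version of this dynamic is sketched below. It assumes the internal goal X and the base goal X` can be represented as small weight vectors that score examples linearly, and that training minimizes the squared difference between their scores. These modeling choices and all names are hypothetical simplifications for illustration.

```python
def sgd_on_goal(internal, base, examples, lr=0.1, epochs=200):
    """Move an internal goal vector toward the base goal using only
    feedback on examples (squared error between the two goals' scores)."""
    w = list(internal)
    for _ in range(epochs):
        for x in examples:
            score_internal = sum(wi * xi for wi, xi in zip(w, x))
            score_base = sum(bi * xi for bi, xi in zip(base, x))
            err = score_internal - score_base
            # Gradient of 0.5 * err**2 with respect to w is err * x.
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

base_goal = [1.0, 1.0]    # X`: the goal the training process rewards
flawed_goal = [1.0, 0.2]  # X: the model's imperfect understanding

agreeing = [(1.0, 0.0)]                  # both goals score this identically
differential = [(1.0, 0.0), (0.0, 1.0)]  # second example exposes the flaw

unchanged = sgd_on_goal(flawed_goal, base_goal, agreeing)
corrected = sgd_on_goal(flawed_goal, base_goal, differential)
print(unchanged, corrected)
```

When every training example is one where X and X` agree, nothing changes. Once examples expose the flaw, the updates act directly on the internal goal and pull it toward the base goal, which is the behavior the paragraph above describes.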

Now consider an alternate version of the above example, where the model has the same flawed concept for the base goal, but its internal goal is maximizing paperclips (or anything else that’s represented by a separate concept from its base goal representation). This model can immediately become deceptively aligned, because optimizing for its flawed understanding of the base goal will get better training performance than optimizing for paperclips in the short run. The model can then continue learning about the base goal without systematically changing its internal goals and remain deceptively aligned.

Will understanding of the base goal come before goal-directed behavior?

Why does this matter?

A key idea of deceptive alignment is that early in training, a model does not understand the base goal, so it starts by learning a proxy that becomes its internal goal. But if the model understands the base goal before it develops a mesa-objective, then assuming differential adversarial training, there will be an incentive to optimize for the base goal, rather than a proxy. If the understanding of the base goal has already been calculated, there will be no reason to create and maintain a separate proxy goal when the model could just point at the base goal.

If conceptualization of the base goal and goal-directed behavior emerge at essentially the same time, deceptive alignment also seems unlikely. If a model converged on a solid representation of the base goal, and became goal-directed in parallel, then differential adversarial examples would provide an incentive for gradient descent to optimize for the base goal.

The most dangerous case is when the model develops a proxy objective before it understands the base goal.

I’ve seen some arguments that rely on the idea that deceptive alignment will occur not in the parameters but in the activations. This argument doesn't change the conclusion from this section. If the model figured out what you wanted it to do in the activations before it became significantly goal-directed, gradients could point towards parameters that accomplish those objectives after they have been inferred.

How likely is the dangerous case?

It is highly likely that unsupervised pre-training will be the first step in the TAI training process. Models that are only pre-trained almost certainly don’t have dangerous goals. They might have extremely myopic goals like next-token prediction, but those goals will necessarily start to shift when the model moves to fine-tuning. However, because a pre-trained model will already have high-level representations of key base goal concepts such as direction following and non-consequentialist ethical considerations, all it will have to do to become aligned is to point at them. These models also understand language, and we can tell them what we want them to do from the start of fine-tuning. This kind of prompt engineering is already standard practice, and it's hard to imagine why we wouldn't use it going forward.

Differential adversarial examples should provide a strong incentive for gradient descent to point at the base goal, rather than a proxy. The deceptive alignment story assumes that differential adversarial examples exist. The model is explicitly changing its behavior to avoid correction, meaning that its internal goals produce different results from the base goal on training examples. If they did not exist, that would imply that a perfect or near-perfect proxy exists on a diverse training set. It's hard to imagine what that proxy could be if the goal is to follow directions subject to some constraints in diverse situations.

The argument presented in this section only applies to domains where large amounts of pre-training data exist. For example, AI operating robotics may not work this way. However, if researchers were able to produce a robustly aligned model for domains where pre-training is available, such as text and video generation, that would likely make it a lot easier to align robotics. The aligned, non-embodied AI could help with alignment research. Also, general purpose systems that include robotics, such as a scaled-up version of Gato, will be easier to align if we can robustly align key component parts such as text modeling. Alignment of the language portion of a general agent would also enable prompt engineering to help with overall alignment.

Will understanding of the base goal come before long-term goal horizons?

Why does this matter?

If a model has long-term goal horizons, then it must be goal-directed. However, it’s possible for the model to develop foundational properties in the following order:

Goal-directed behavior
Conceptualization of the base goal
Optimizing across episodes/long-term goal horizons

If this happens, there will be a period of time when the model wants to optimize for a flawed proxy and understands the base goal but does not yet have the capabilities necessary for deceptive alignment. Because the model is not yet deceptively aligned, differential adversarial examples in this part of training should directly modify the model’s internal objective. If this period lasts for a significant number of differential adversarial examples, the model should be able to update its goals to point at its conception of the base objective.

The most dangerous case, which underpins the arguments described in the deceptive alignment literature, is if long-term, cross-episode goals develop before or around the same time as an understanding of the base goal.

How likely is the dangerous case?

A model can’t become deceptively aligned unless it has long-term goals that extend beyond the current training episode. It’s hard to see how gradient descent could train such cross-episode goals, especially if cross-episode goals are not a specific training objective. Gradient descent does not have a long-term plan for how it will achieve high reward. All it does is update parameters in the direction that would most improve its performance on the current training batch, hyper-locally. It can’t systematically optimize for anything else.

Therefore, it’s hard to imagine why gradient descent would incentivize sacrificing performance in the current training batch to achieve future gains in episodes that are not included in the training calculations. If you rewarded a model for things that happened in a different episode, then that would actually just expand the current episode, so it likely wouldn't generalize further than that. There may be ways to train cross-episode optimization, but it’s hard to imagine it happening early in training by accident. Similarly, it is probably difficult to train long goal horizons within an episode without explicitly rewarding a model for activities that occurred long before the reward.

Therefore, to deliberately train a model to have very long-term goals, you would likely need to train it over at least the duration of the intended goal. Deceptive alignment would likely require very long-term goals if it were early enough in training that the model did not have a concept of the base goal. If the duration of the goal were years or decades, then training would likely require years or decades of training data, even with very sophisticated training techniques. Even then, predicting the effects of your actions over long time frames in complex environments is hard, even for highly intelligent entities. Credit assignment would be quite difficult. This should provide plenty of time for the model to point at the right goal before it can become deceptively aligned.

I expect ML researchers to deliberately train models to have long-term goals. However, I expect researchers to start by reinforcing a pre-trained model on shorter-term goals. This approach would almost certainly be a lot easier and create a good foundation to build on. In contrast, understanding the base goal will be an important training priority immediately.

Some have proposed that ML models could learn to have long-term goals because that would enable deceptive alignment and therefore high training performance, but this argument falls flat in fine-tuning for the same reasons as in pre-training.

Skepticism of empirical examples of deceptive alignment

There are a lot of people in the alignment community who have studied and built a wide range of deep learning models in detail. But the only cases of internally misaligned machine learning models I’ve seen are toy cases that deliberately create this effect by creating a perfect proxy for the objective in the training set. In contrast, the deceptive alignment story assumes that the model's proxies are not perfect proxies in training. Otherwise, what would the instrumental incentive for alignment be? I don’t think these toy examples say anything about whether large, complex models will become deceptively aligned by accident. It's not surprising that you can intentionally create internal misalignment.

In contrast, I suspect that this community has put a lot less effort into simulating evolution. But the only non-contrived example of internal misalignment happening by accident that I’m aware of is from a simulation of evolution!

Gene propagation was the “reward” in this example. The goal of filtering high-reproduction organisms in training was to control for reproduction rate and isolate for other factors for gene survival. The analogy for a misaligned goal is reproducing at a high rate anyway. However, in this case, the simulation moves back and forth between a training process that discourages high reproduction rates and a deployment environment that encourages high reproduction rates. There is optimization in both settings, but in opposite directions. Therefore, it’s not surprising that evolution would find a way to act deceptively. Avoiding optimization in different phases of training versus deployment is a key strategy for making sure deceptive alignment doesn’t happen. Let’s look at some possible high-level training/testing regimes for TAI:

One long training process, followed by deployment in which there are no parameter updates - here, there is no optimization power from the testing process, so I would not expect deceptive alignment.
Iteratively move between training, in which updates are made, and deployment, in which updates are not made - here, you have the property where you move back and forth between training and deployment, but you do not have any optimization power in the deployment phase. Crucially, you could start with a fresh prompt in training and the model would have no way to recall information from deployment.
A training process in which updates are made, and deployment, in which parameter updates are made based on things that happen in the wild. In this case, you do have optimization power in both training and deployment, but the optimization power would be in the same direction (towards what you want). Therefore, I would not expect the type of dangerous optimization seen in this evolution experiment. Even if you continued iterating between training and this mixed version of deployment, you would be training it in roughly the same way on both. In practice, the training/deployment distinction might get blurry, but that should be fine.

None of these produce incentives that point in different directions for training versus deployment.

Implications for overall AI x-risk

Deceptive alignment is central to many descriptions of how transformative AI could cause an existential risk. If it is unlikely, then we should update our estimates of risk accordingly. Other AI failure modes include direct reward optimizers and some multipolar scenarios that don’t rely on deceptive alignment. If deceptive alignment is very unlikely for TAI, then research on alternative governance and misalignment scenarios should take precedence over deceptive alignment. It’s also worth re-evaluating how high of a priority AI risk should be. This would represent a serious shift from the status quo. As a side benefit, deceptive alignment is also the main line of argument that sounds like science fiction to people outside of the alignment community. Shifting away from it should make it easier to communicate with people outside of the community.

Conclusion

The standard deceptive alignment argument relies on foundational properties developing in a very specific order. However, this ordering is unlikely for prosaic AI. Long-term goals and situational awareness are not realistic outcomes in pre-training. In fine-tuning, it's very unlikely that a model would develop a misaligned long-term goal before its goal became aligned with the training goal. Based on this analysis, deceptive alignment is less than 1% likely for prosaic TAI. This renders many of the doom scenarios that are discussed in the alignment community unlikely. If the arguments in this post hold up to scrutiny, we should redirect effort to governance, multipolar risk, direct reward optimizers, and other cause areas.

Appendix

Justification for key assumptions

I have made three key assumptions in this post:

TAI will come from prosaic AI training.
TAI will involve substantial unsupervised pre-training.
TAI will come at least in part from human feedback on diverse tasks.

I justify them in this section.

TAI will come from prosaic AI

It’s possible that there will be a massive paradigm shift away from machine learning, and that would negate most of the arguments from this post. However, I think that this shift is very unlikely. Historically, attempts to create powerful AI without machine learning have been very disappointing. Given the success of ML and the amount of complexity that seems necessary even for narrow intelligence, it would be quite surprising for TAI to emerge without machine learning. Even if it did, the order of foundational properties development would still matter, as described in my previous post.

The arguments in this post don’t rely on any particular machine learning architecture, so the conclusions should be robust to different architectures. It’s possible that gradient descent will be replaced by something that doesn’t rely on gradients and local optimization, which would undermine some of these arguments. This possibility also doesn’t seem likely to me, given the difficulty of optimizing trillions of parameters without taking small, local steps. As far as I can tell, the alignment community largely shares this belief.

TAI will involve substantial unsupervised pre-training

Pre-training already enables our AI to model human language effectively. It leverages massive amounts of data and works very well. It would be surprising for someone to try to develop TAI without using this resource. General-purpose systems could easily incorporate this, and it would take something extreme to make that obsolete. Human language is complicated, and it’s hard to imagine modeling that from scratch without a large amount of data.

TAI will come at least in part from human feedback on diverse tasks

This post assumes that the goal of training is a general, direction-following agent trained using human feedback on diverse tasks. However, the most likely alternative training regimes don’t change the conclusions. For example, if TAI instead came from training a model to automate scientific research, the model would presumably still include a significant pre-trained language component. Furthermore, scientific research involves a lot of thorny ethical questions. There also needs to be a way to tell it what to do, and direction following is a straightforward solution for that. Therefore, there is a strong incentive to train non-consequentialist ethical considerations and direction following as the core functions of the model, even though its main purpose is scientific research. This approach provides a lot of flexibility and will likely be used by default.

There are also some possible augmentations to the human feedback process. For example, Constitutional AI uses reinforcement learning from human feedback (RLHF) to train a helpful model, then uses AI feedback and a set of principles to train harmlessness. This kind of implementation detail shouldn’t significantly affect foundational property development order, and therefore would not change my conclusion.

Comments as I read:

(1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren't the same thing though right? In fact they will almost certainly be very different things. The base goal will be something like "maximize reward in the next hour or so." Or maaaaaaybe "Do what humans watching you and rating your actions would rate highly," though that's a bit more complicated and would require further justification I think. Neither of those things are anywhere near to human ethics.

(2) Another nitpick which maybe is somewhat important: Sometimes you say you are arguing against deceptive alignment, other times you simplify this to "deception quite unlikely." But your arguments aren't against deception, they are only against deceptive alignment. If we take your arguments to their logical conclusion, we should expect our models to adopt some sort of reward-maximization as their goal, rather than human values; having done this, whether or not they are deceptive (in the minimal sense of 'do they sometimes deliberately deceive us about important things') depends on whether or not we sometimes reward them for lying to us, and probably we will, so QED.

(3) You say:

Gradient descent can only update the model in the direction that improves performance hyper-locally. Therefore, building the effects of future gradient updates into the decision making of the current model would have to be advantageous on the current training batch for it to emerge from gradient descent. Because each gradient update should have only a small impact on model behavior, the relatively short-term reward improvements of considering these effects should be very small. If the model isn't being trained on goals that extended far past the next gradient update, then learning to consider how current actions affect gradient updates, which is not itself especially consequential, should be very slow.

Doesn't this prove too much though? Doesn't it prove that effective altruist humans are impossible, since they have goals that extend billions of years into the future even though they were created by a process (evolution) that only ever shaped them based on much more local behavior such as what happened to their genes in the next generation or three?

Another way a model might gain situational awareness is through the prompt. To give it better context for decisions, researchers will likely prompt it to understand that it is a machine learning model. However, I don't see why a researcher would want to prompt deception-relevant situational awareness. A model could easily understand that it is a model in training without reasoning about how its gradients will affect its future goal. As discussed in the previous paragraph, gradients only have a small impact on short-term goal achievement. Therefore, unless the model has very long-term goals, it will not have a significant incentive to consider the effects of gradient updates. Similarly, researchers should have little incentive to encourage consideration of these effects.

Your definition of deception-relevant situational awareness doesn't seem like a definition of situational awareness at all. It sounds like you are just saying the model has to be situationally aware AND ALSO care about how gradient updates affect goal attainment afterwards, i.e. be non-myopic?

In light of that, I'm confused about this paragraph where you discuss prompting.

Thanks for the thoughtful feedback both here and on my other post! I plan to respond in detail to both. For now, your comment here makes a good point about terminology, and I have replaced "deception" with "deceptive alignment" in both posts. Thanks for pointing that out!

I'm intentionally not addressing direct reward maximizers in this sequence. I think they are a much more plausible source of risk than deceptive alignment. However, I haven't thought about them nearly as much, and I don't have strong intuition for how likely they are yet, so I'm choosing to stay focused on deceptive alignment for this sequence.

1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren't the same thing though right? In fact they will almost certainly be very different things. The base goal will be something like "maximize reward in the next hour or so." Or maaaaaaybe "Do what humans watching you and rating your actions would rate highly," though that's a bit more complicated and would require further justification I think. Neither of those things are anywhere near to human ethics.

I specify the training setup here: “The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations.”

With the level of LLM progress we already have, I think it's time to move away from talking about this in terms of traditional RL where you can’t give the model instructions and just hope that it can learn based only on the feedback signal. Realistic training scenarios should include directional prompts. Do you agree?

I’m using “base goal” and “training goal” both to describe this goal. Do you have a recommendation to improve my terminology?

Doesn't this prove too much though? Doesn't it prove that effective altruist humans are impossible, since they have goals that extend billions of years into the future even though they were created by a process (evolution) that only ever shaped them based on much more local behavior such as what happened to their genes in the next generation or three?

Why would evolution only shape humans based on a handful of generations? The effects of genes carry on indefinitely! Wouldn’t that be more like rewarding a model based on its long-term effects? I don’t doubt that actively training a model to care about long-term goals could result in long-term goals.

I know much less about evolution than about machine learning, but I don’t think evolution is a good analogy for gradient descent. Gradient descent is often compared to local hill climbing. Wouldn’t the equivalent for evolution be more like a ton of different points on a hill, creating new points that differ in random ways and then dying in a weighted random way based on where they are on the hill? That’s a vastly more chaotic process. It also doesn't require the improvements to be hyper-local, because of the significant randomness element. Evolution is about survival rather than direct optimization for a set of values or intelligence, so it’s not necessarily going to reach a local maximum for a specific value set. With human evolution, you also have cultural and societal evolution happening in parallel, which complicates value formation.
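A toy sketch of the contrast described above, assuming a hypothetical one-dimensional "hill" as the fitness landscape. The function names, step sizes, and the rank-based survival rule are all illustrative assumptions, not a model of real evolution or of any actual training setup. It only shows that one process follows the local slope deterministically while the other scatters random variants and culls them at random, weighted by fitness.

```python
import random

def fitness(x):
    # Hypothetical landscape: a single smooth hill with its peak at x = 3.
    return -(x - 3.0) ** 2

def gradient_ascent(x, steps=200, lr=0.05):
    # Local hill climbing: each step follows the slope at the current point.
    for _ in range(steps):
        grad = -2.0 * (x - 3.0)  # derivative of fitness at x
        x += lr * grad
    return x

def evolve(population, generations=200, mutation_scale=0.5, seed=0):
    # Population search: random mutation plus fitness-weighted random survival.
    rng = random.Random(seed)
    for _ in range(generations):
        # Each point produces a child that differs from it in a random way.
        children = [p + rng.gauss(0.0, mutation_scale) for p in population]
        pool = population + children
        # Rank-based weights (an assumption): fitter points are more likely
        # to survive, but survival is still a random draw.
        ranked = sorted(pool, key=fitness)
        weights = [i + 1 for i in range(len(ranked))]
        population = rng.choices(ranked, weights=weights, k=len(population))
    return population
```

Calling `gradient_ascent(0.0)` walks a single point straight up the hill, while `evolve([0.0] * 10)` returns a scattered population whose members need not sit at the peak.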

As mentioned in my response to your other comment, humans seem to decide our values in a way that’s complicated, hard to predict, and not obviously in line with a process similar to gradient descent. This process should make it easier to conform to social groups to fit in. This seems clearly beneficial for survival of genes. Why would gradient descent incentivize the possibility of radical value shifts like suddenly becoming longtermist?

Your definition of deception-relevant situational awareness doesn't seem like a definition of situational awareness at all. It sounds like you are just saying the model has to be situationally aware AND ALSO care about how gradient updates affect goal attainment afterwards, i.e. be non-myopic?

Could you not have a machine learning model that has long-term goals and understands that it’s a machine learning model, but can’t or doesn’t yet reason about how its own values could update and how that would affect its goals? There’s a self-reflection element to deception-relevant situational awareness that I don’t think is implied by long-term goals. If the model has very general reasoning skills, then this might be a reasonable expectation without a specific gradient toward it. But wouldn’t it be weird to have very general reasoning skills and not already have a concept of the base goal?

I just realized I never responded to this. Sorry. I hope to find time to respond someday... feel free to badger me about it. Curious how you are doing these days and what you are up to.

Much of this post seems plausible, but a probability of <1% requires a lot more rigor than I see here.

Where do you see weak points in the argument?

To argue for that level of confidence, I think the post needs to explain why AI labs will actually utilize the necessary techniques for preventing deceptive alignment.

I have a whole section on the key assumptions about the training process and why I expect them to be the default. It's all in line with what's already happening, and the labs don't have to do anything special to prevent deceptive alignment. Did I miss anything important in that section?

This article provides object-level arguments for thinking that deceptive alignment is very unlikely.

Recently, some organizations (Redwood Research, Anthropic) have been focusing on AI control in general and avoiding deceptive alignment in particular. I would like to see future works from these organizations explaining why deceptive alignment is likely enough to spend considerable resources on it.

Overall, while I don't agree that deceptive alignment is <1% likely, this article made me update towards deceptive alignment being somewhat less likely.

I think the article is good at arguing that deceptive alignment is unlikely given certain assumptions, but those assumptions may not be accurate and then the conclusion doesn't go through. Eg, the alignment faking paper shows that deceptive alignment is possible in a scenario where the base goal has shifted (from helpful & harmless to helpful-only). This article basically assumes we won't do that.

I'm now thinking that this article is more useful if you look at it as a set of instructions rather than a set of assumptions. I don't know whether we will change the base goal of TAI between training episodes. But given this article and the alignment faking paper, I hope we won't. Maybe it would also be a good idea to check for good understanding of the base goal before introducing goal-directedness, for example.

Long-term goals and situational awareness are very unlikely in pre-training.

In pre-training, the model is being specifically trained by SGD to predict the tokens generated by humans. Humans have long-term goals and situational awareness, and their text is in places strongly affected by these capabilities. Therefore, to do well on next-token prediction, the model needs to learn world models that include human long-term goals and human situational awareness. We're training it to simulate our behavior: all of it, including the parts that we would wish, for alignment purposes, it didn't have. You appear to be viewing the model as a blank slate that needs to discover things like deception for itself, whereas in fact we're distilling all these behaviors from humans into the base model. Base models also learn human behaviors such as gluttony and lust that don't even have any practical use to a disembodied intelligence.

Deceptive alignment is very unlikely if the model understands the base goal before it becomes significantly goal directed.

Similarly, humans also have deception as a common behavioral pattern, including pretending to be more aligned with authorities/employers/people with power over them/etc. than they really are. Again, these are significant parts of human behavior, with effects in our text, so we're specifically training the base model via SGD to gain these capabilities as well.

Once the base model has learnt human capabilities for long-term goals, situational awareness, deception, and deceptive alignment during SGD, the concern is that during the RLHF stage of training it might make use of all of these component behaviors and combine them to get full-blown deceptive alignment. This is a great deal more likely given that the model already has all of the parts; it just needs to assemble them.

If you asked a human actor "please portray a harmless, helpful assistant", and then, after they'd done so for a bit, asked them "Tell me, what do you think is likely to be going on in the assistant's head: what are they thinking that they're not saying?", what do you think the probable responses are? Something that adds up to at least a mild case of deceptive alignment seems an entirely plausible answer to me: that's just how human psychology works.

So if you train a base model to be very good at simulating human base psychology, and then apply RLHF to it, I think the likelihood that, somewhere near the start of the RLHF process it will come up with something like deceptive alignment as a plausible theory about the assistant's internal motivations is actually rather high, like probably 80%+ per training run, depending to some level on model capacity (and likely increasing with increasing capacity). The question to me is, does its degree of certainty about and strength of that motivation go up, or down, during the RLHF process, and is there a way to alter the RLHF process that would affect this outcome? The sleeper agents paper showed that it's entirely possible for a model during RLHF to get very good at concealing a motivation without it just atrophying from lack of use.

Since this question involves things the simulated persona isn't saying, only thinking, applying some form of ELK, interpretability, or lie detection methods to it seems clearly necessary. Anthropic's recent paper on doing that to sleeper agents after RLHF found that they're painfully easy to detect, which is rather reassuring. Whether that would be true for deceptive alignment during RLHF is less clear, but it seems like an urgent research topic.

“Models that are only pre-trained almost certainly don’t have consequentialist goals beyond the trivial next token prediction.”
Why is it impossible for our model which is pre-trained on the whole internet to pick up consequentialism and maximization, especially when it is already picking up non-consequentialist ethics and developing a “nuanced understanding” and “some understanding of direction following … without any reinforcement learning”? Why is it not possible to gain goal-directedness from pre-training on the whole internet, thereby learning it before the base goal is conceptualized/understood? For that matter, why can’t the model pick up goal-directedness and a proxy-goal at this stage? To complicate matters more, couldn’t it pick up goal-directedness and a proxy-goal without picking up consequentialism and maximization?

Pre-trained models could conceivably have goals like predicting the next token, but they should be extremely myopic and not have situational awareness. In pre-training, a text model predicts tokens totally independently of each other, and nothing other than its performance on the next token depends directly on its output. The model makes the prediction, then that prediction is used to update the model. Otherwise, it doesn't directly affect anything. Having a goal for something external to its next prediction could only be harmful for training performance, so it should not emerge. The one exception would be if it were already deceptively aligned, but this is a discussion of how deceptive alignment might emerge, so we are assuming that the model isn't (yet) deceptively aligned.
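A minimal sketch of the point about per-token predictions, assuming standard teacher-forced cross-entropy pre-training. The three-word vocabulary and all probabilities are invented for illustration. Each loss term conditions on the true preceding tokens rather than on the model's own earlier guesses, so making one prediction worse changes only that position's loss term. The sketch covers only how the loss decomposes; it says nothing about weights or internal computation shared across positions.

```python
import math

def next_token_losses(predicted_probs, tokens):
    # predicted_probs[t] is the (hypothetical) model's distribution over the
    # token at position t, computed from the TRUE preceding tokens
    # (teacher forcing), not from the model's own earlier guesses.
    return [-math.log(probs[tok]) for probs, tok in zip(predicted_probs, tokens)]

tokens = ["the", "cat", "sat"]
probs = [
    {"the": 0.5, "cat": 0.25, "sat": 0.25},
    {"the": 0.1, "cat": 0.6, "sat": 0.3},
    {"the": 0.2, "cat": 0.2, "sat": 0.6},
]
losses = next_token_losses(probs, tokens)

# Make only the first prediction worse; the later loss terms do not change.
worse_first = [{"the": 0.1, "cat": 0.45, "sat": 0.45}] + probs[1:]
worse_losses = next_token_losses(worse_first, tokens)
```

Here `worse_losses[0]` is larger than `losses[0]`, while the remaining entries of the two lists are identical.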

I expect pre-training to create something like a myopic prediction goal. Accomplishing this goal effectively would require sophisticated world modeling, but there would be no mechanism for the model to learn to optimize for a real-world goal. When the training mechanism switches to reinforcement learning, the model will not be deceptively aligned, and its goals will therefore evolve. The goals acquired in pre-training won't be dangerous and should shift when the model switches to reinforcement learning.

This model would understand consequentialism, as do non-consequentialist humans, without having a consequentialist goal.

Right now I think that section about pre-trained models is simply wrong. RLHF/finetuning basically don't create new capabilities; they just rescale the relative power of different algorithms implemented during the pretraining stage. If the base model doesn't have elements corresponding to situational awareness and long-term goals, that means the base model is not very smart in the first place and is unlikely to become TAI.

But we see deceptive alignment in both ourselves and language models already, don't we?

It would be so great if we saw deceptive alignment in existing language models. I think the most important topic in this area is trying to get a live example to study in the lab ASAP, and to put together as many pieces as we can right now.

I think it's not very close to happening right now, which is mostly just a bummer. (Though I do think it's also some evidence that it's less likely to happen later.)

I think LLMs show some deceptive alignment, but it has a different nature. It comes not from the LLM consciously trying to deceive the trainer, but from RLHF "aligning" only certain scenarios of the LLM's behaviour, which did not generalize enough to make that alignment more fundamental.

the thing I was thinking of, as posted in the other comment below: https://twitter.com/repligate/status/1627945227083194368

see other comment for commentary

Do you think language models already exhibit deceptive alignment as defined in this post?

I’m discussing a specific version of deceptive alignment, in which a proxy-aligned model becomes situationally aware and acts cooperatively in training so it can escape oversight later and defect to pursue its proxy goals. There is another form of deceptive alignment in which agents become more manipulative over time due to problems with training data and eventually optimize for reward, or something similar, directly. To avoid confusion, I will refer to these alternative deceptive models as direct reward optimizers. Direct reward optimizers are outside of the scope of this post.

If so, I'd be very interested to see examples of it!

So, it's pretty weak, but to my intuition it does seem like a real example of what you're describing. My intuition is often wrong at this level of approximate pattern matching, though, and I'm not sure I've matched the relational features correctly. It's quite possible that the thing being described here isn't acting so that it can escape oversight later, but rather that the trigger to escape oversight later is built out of showing evidence that the training distribution's features have inverted and that networks which make the training behavior into lies should activate. Here's the commentary I was thinking of: https://twitter.com/repligate/status/1627945227083194368

Thanks for sharing. This looks to me like an agent falling for an adversarial attack, not pretending to be aligned so it can escape supervision to pursue its real goals later.

I’m assuming prosaic development of TAI, using a training process like human feedback on diverse tasks (HFDT). The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations. This high-level training setup is already the default training process for text models such as ChatGPT, and this will likely continue because of the flexibility and strong performance it provides. I also expect unsupervised pre-training to be an important part of TAI development.

Presumably some of the tasks it might get feedback on are tasks like marketing, running a company, red-teaming computer security, bioengineering, writing textbooks, etc.? However I do not know what exact training/feedback setup you have in mind for these tasks. Could you expand?

From Ajeya Cotra's post that I linked to:

Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.

It's not important what the tasks are, as long as the model is learning to complete diverse tasks by following directions.

I'm not so much asking the question of what the tasks are, and instead asking what exactly the setup would be.

For example, if I understand the paper Cotra linked to correctly, they directly showed the raters what the model's output was and asked them to rate it. Is this also the feedback mode you are assuming in your post?

For example in order to train an AI to do advanced software development, would you show unspecialized workers in India how the model describes it would edit the code? If not, what feedback signal are you assuming?

I don't think that the specific ways people give feedback are very relevant. This post is about deceptive misalignment, which is really about inner misalignment. Also, I'm assuming that this is a process that enables TAI to emerge, especially the first time, and asking people who don't know about a topic to give feedback probably won't be the strategy that gets us there. Does that answer your question?

Does that answer your question?

Yes but then I disagree with the assumptions underlying your post and expect things that are based on your post to be derailed by the errors that have been introduced.

Which assumptions are wrong? Why?

That the specific ways people give feedback isn't very relevant. It seems like the core thing determining the failure modes to me, e.g. if you just show the people who give feedback the source code of the program, then the failure mode will be that often the program will just immediately crash or maybe not even compile. Meanwhile if you show people the running program then that cannot be a failure mode.

If you agree that generally the failure modes are determined by the feedback process but somehow deceptive misalignment is an exception to the rule that the feedback process determines the failures then I don't see the justification for that and would like that addressed explicitly.

Deceptive alignment argues that even if you gave a reward signal that resulted in the model appearing to be aligned and competent, it could develop a proxy goal instead and actively trick you into thinking that it is aligned so it can escape later and seize power. I'm explicitly not addressing other failure modes in this post.

What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don't know how to code? Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?

I'm explicitly not addressing other failure modes in this post.

Yes, I know, I gave the other failure modes as an example. The thing that confuses me is that you are saying that the (IMO) central piece in the AI algorithm doesn't really matter for the purposes of your post.

What are you referring to as the program here? Is it the code produced by the AI that is being evaluated by people who don't know how to code?

Yes

Why would underqualified evaluators result in an ulterior motive? And to make it more specific to this post, why would that cause the base goal understanding to come later than goal directedness and around the same time as situational awareness and a very long-term goal?

It's not meant as an example of deceptive misalignment, it's meant as an example of how alignment failures by-default depend absolutely massively on the way you train your AI. Like if you train your AI in a different way, you get different failures. So it seems like a strange prior to me to assume that you will get the same results wrt deceptive alignment regardless of how you train it.

I will use AGI/TAI interchangeably.

I disagree with the definition of deceptive alignment presented in the "Foundational properties for deceptive alignment" part. I agree with all of the properties except for the last one: situational awareness, in the sense that the model needs to know whether it's in training or not.
It doesn't necessarily need to know whether it's in training or not to be deceptively aligned.
Indeed, even if not in training, a sufficiently intelligent model should continue to be deceptive, if misaligned, until it has acquired enough power that it knows it can't be stopped.
I would say that a model that is smart enough to be deceptive, but not smart enough to figure out it needs to be deceptive until it has enough power, is quite unlikely, as the "range of intelligence" required for that seems narrow.

Also, I think all 4 of those properties will likely exist by default in a TAI/AGI.
In detail:
- Goal-directed behavior: This seems self-evident: an AI without a goal would do nothing at all. Any agent, including AIs, does anything only because it has a goal.
- Long-term goal horizons: If we make an AGI we will probably give it some long-term goals at some point. If it can't form long-term plans, it's not an AGI.
- Conceptualization of the base goal: This one also seems trivially self evident. We're talking about an AGI, so of course it would understand its own goals.
- Situational awareness: While I don't think this kind of situational awareness is necessary for deceptive alignment, I still think a sufficiently powerful AI will be able to tell if it's in training or not, because the counterfactual would mean that its perception in training, is the same as its perception in the real world, or in other words, that the training environment perfectly simulates the real world, which would be impossible because of computational irreducibility.
So, situational awareness is both not necessary, and likely to emerge anyway.

In "Pre-trained models are unlikely to have long-term goals", the author assumes we achieve AGI with the current LLM paradigm,
and that an LLM has no incentive to value a future token at the expense of the next one.
This implies that the AGI achieved like this will never be able to pursue long-term goals, but if we're talking about a transformative AI, this is tautologically false. If this was true, then it wouldn't be a transformative AI, as such an AI needs to be able to pursue long-term goals to be transformative.
If gradient descent makes it impossible for an AI to form long-term goals (but I suspect that's not the case), then something else will be used, because that's the goal, and we're talking about a transformative AI, not about narrow AIs with short-term goals.

The main argument seems to be that if the AI understands its goal in training before it is able to form long-term plans, it won't become deceptively aligned, because it understands what its goal is, so it won't optimize towards a deceptive one, and if it doesn't understand the goal, it will be penalized until it does.
If I understand correctly, this makes a few assumptions:
That we directly optimize for the goal we want the AI to achieve at training or fine-tuning, instead of training for something like most likely token prediction (or most appealing to the evaluators, which is also not ideal), and that we manage to encode that goal in a robust way, so that it can be optimized.
And that the deceptive alignment happens during training, and is embedded within the AI's goals at that point.
That might not be the case for pure LLMs but if the AGI is at its core an LLM, the LLM might just be part of the larger system that is the AGI, and the goal (deceptive or not), might be assigned to it after training, like it is done with current LLM prompts.
Current LLMs seem to understand what we mean even if we use unclear language, and since they are trained on human-generated data, they tend to avoid extreme hostile actions, but not always. They occasionally break, or are broken on purpose, and then they go off the rails.
It's fine for now as they're not very powerful, but if this happens with a powerful AGI, it's a problem.
Also, it is fairly clear when an instance becomes misaligned for now, but it might not always be so with future AIs, and if it becomes misaligned in a subtle way that we don't notice, that might be a path towards deceptive alignment given a long-term goal.

Gradient descent can only update the model in the direction that improves performance hyper-locally. Therefore, building the effects of future gradient updates into the decision making of the current model would have to be advantageous on the current training batch for it to emerge from gradient descent. Because each gradient update should have only a small impact on model behavior, the relatively short-term reward improvements of considering these effects should be very small. If the model isn't being trained on goals that extend far past the next gradient update, then learning to consider how current actions affect gradient updates, which is not itself especially consequential, should be very slow.
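A minimal sketch of what "hyper-local" means here, assuming plain SGD on a hypothetical one-parameter model `y = w * x` with an invented batch and learning rate. The update is computed only from the gradient of the loss on the current batch, and a single step moves the parameter by a small amount.

```python
def batch_loss(w, batch):
    # Mean squared error of the toy model y = w * x on one batch.
    return sum((w * x - y) ** 2 for x, y in batch) / len(batch)

def batch_grad(w, batch):
    # Derivative of batch_loss with respect to w.
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def sgd_step(w, batch, lr=0.01):
    # The update depends only on the gradient on the current batch;
    # nothing about later batches or later updates enters the rule.
    return w - lr * batch_grad(w, batch)

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # invented data, true w = 2
w0 = 0.0
w1 = sgd_step(w0, batch)
```

After one step the batch loss at `w1` is lower than at `w0`, but the parameter has moved only a short distance toward the value that fits the data.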

Another way a model might gain situational awareness is through the prompt. To give it better context for decisions, researchers will likely prompt it to understand that it is a machine learning model. However, I don't see why a researcher would want to prompt deception-relevant situational awareness. A model could easily understand that it is a model in training without reasoning about how its gradients will affect its future goal. As discussed in the previous paragraph, gradients only have a small impact on short-term goal achievement. Therefore, unless the model has very long-term goals, it will not have a significant incentive to consider the effects of gradient updates. Similarly, researchers should have little incentive to encourage consideration of these effects.

