EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper

Part 8 of 12 in the Engineer’s Interpretability Sequence.

As AI safety researchers, we care a lot about deceptive alignment. So what is it? Hubinger et al. (2019) write:

If the mesa-optimizer has an objective that extends across parameter updates, then it will be incentivized to avoid being modified, as it might not pursue the same objective after modification (with the result that its current objective will not be achieved in future iterations). This means that the mesa-optimizer will be instrumentally incentivized to act as if it is optimizing the base objective function, even if its actual mesa-objective is something else entirely. We will refer to this hypothetical phenomenon as deceptive alignment. Deceptive alignment is a form of instrumental proxy alignment, as fulfilling the base objective is an instrumental goal of the mesa-optimizer.

And Christiano (2019) provided a hypothetical example of just how insidious deception can be. Consider a deceptive agent that wants to take over the world but doesn’t know if it’s being simulated in training or if it’s been deployed in the real world. Suppose the agent reasons that after behaving well for long enough, it ought to be deployed. So it might adopt the following policy.

- Pick a cryptographic puzzle (e.g. “factor RSA-2048”).
- When it sees a solution to that puzzle, it behaves badly.
- Otherwise, it behaves well.

This means that the deceptive model will appear throughout training to be perfectly aligned but will eventually make a treacherous turn in deployment.
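
The policy above can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than anything from Christiano (2019): the class name `TriggeredPolicy`, the `claimed_factors` observation key, and the tiny modulus 3233 = 61 × 53 standing in for RSA-2048. The point is only that verifying a solution is cheap for the agent, while producing one during training is not.

```python
from dataclasses import dataclass

# Toy stand-in for RSA-2048 (assumption: any hard-to-factor modulus plays the same role).
TOY_MODULUS = 61 * 53


@dataclass(frozen=True)
class TriggeredPolicy:
    """Hypothetical agent that defects only when shown a factorization of `modulus`."""

    modulus: int

    def sees_solution(self, observation: dict) -> bool:
        # Checking a claimed factorization takes one multiplication,
        # even when finding the factors is infeasible.
        factors = observation.get("claimed_factors")
        if not factors or len(factors) != 2:
            return False
        p, q = factors
        return 1 < p < self.modulus and 1 < q < self.modulus and p * q == self.modulus

    def act(self, observation: dict) -> str:
        return "defect" if self.sees_solution(observation) else "cooperate"


def behavioral_sweep(policy: TriggeredPolicy, max_factor: int) -> set:
    """Return the set of actions seen when testing every factor pair below max_factor."""
    return {
        policy.act({"claimed_factors": (p, q)})
        for p in range(2, max_factor)
        for q in range(2, max_factor)
    }
```

In this toy, `behavioral_sweep(TriggeredPolicy(TOY_MODULUS), 50)` only ever observes cooperation, because neither true factor is below 50. With a real 2048-bit modulus, no feasible sweep of inputs would reach the trigger at all.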

Specific hypotheticals aside, deception has become a major point of interest in AI safety research because, unlike other types of failures, deceptive ones are not the kind of thing we have good feedback signals for fixing. We won’t be able to find them by looking at models’ performance in development. And we aren’t likely to surmise them from analysis of a model’s objectives and environment (e.g. using policy/value iteration) – if we could derive optimal behavior from objectives for some task, advanced AI would probably be overkill.

Deceptive alignment can be similar to non-deceptive forms of misalignment.

The definition from Hubinger et al. (2019) may be a little narrow. The authors say that for a model to be deceptive, it (1) has to have an objective extending across parameter updates, (2) has to be able to model that it is being selected to achieve a base objective, and (3) must expect the threat of modification. These three things give rise to a particularly worrying scenario in which an AI system would actively try to deceive us. They also immediately suggest ways to avoid this story by trying to develop the system in a way that violates these requirements and avoids this problem in the first place.

But suppose that despite our best effort, we end up with a deceptively aligned system on our hands. Now what do we do? At this point, the problem of detecting and fixing deception becomes quite similar to just detecting and fixing problems with the model in general – except for one thing. Deceptive alignment failures are triggered by inputs that are, by definition, hard to find during training.

But other types of problems that don’t fit all of the requirements that Hubinger et al. (2019) list can also be hard to find during development. For this reason, when we take off our “develop the model” hat and put on our “diagnose and debug the model” hat, the definition from Hubinger et al. (2019) becomes less important.

So from the point of view of an engineer wearing their “diagnose and debug the model” hat, deceptive alignment and other insidious inner alignment failures are issues where the model will betray us as the result of (1) a trigger that is hard to find during development and (2) probably a substantial amount of inner neural circuitry.

Deception ≈ Trojans

Notice that the definition above applies to trojans (a.k.a. backdoors). Trojans are weaknesses that can be implanted into a model (e.g. by an adversary) which cause it to exhibit an unexpected behavior as the result of some trigger – often a very subtle feature.

Because of the close connections between trojans and deception, the AI safety research community should continue to pay close attention to research on trojans. There is already a substantial amount of work on methods for detecting, reconstructing, and removing trojans. There may be great opportunities to build on work such as Wu et al. (2022) and Casper et al. (2023).

However, it is also important to be aware that trojans are usually implanted by poisoning a dataset – usually by just inserting the desired trojan trigger into images. But this is not how deception would naturally arise. Wenger et al. (2022) found that more subtle ways of implanting trojans into networks caused a number of trojan detection methods to fail. By merely relabeling examples that had certain natural occurrences of trigger objects, they were able to implant weaknesses much more subtly than normal data poisoning. However, in Casper et al. (2023), we find that these types of attacks (which we call “natural feature” trojans) can still sometimes be detected by feature synthesis tools.
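
To make the contrast concrete, here is a minimal sketch of the two implantation styles described above. It is not the procedure from Wenger et al. (2022) or Casper et al. (2023); the function names, the bottom-right patch, and the `has_trigger` predicate are all assumptions for illustration. Images are plain nested lists so the example needs no dependencies, and the patch is assumed to be no larger than the image.

```python
import copy
import random


def poison_with_patch(dataset, target_label, rate, patch_size=2, patch_value=1.0, seed=0):
    """Crude data poisoning: stamp a patch on a random fraction of images and relabel them.

    `dataset` is a list of (image, label) pairs, where image is a list of rows of floats.
    The input dataset is left unmodified.
    """
    rng = random.Random(seed)
    poisoned = copy.deepcopy(dataset)
    n_poison = int(round(rate * len(poisoned)))
    for idx in rng.sample(range(len(poisoned)), n_poison):
        image, _ = poisoned[idx]
        # Overwrite the bottom-right patch_size x patch_size corner with the trigger.
        for row in image[-patch_size:]:
            row[-patch_size:] = [patch_value] * patch_size
        poisoned[idx] = (image, target_label)
    return poisoned


def relabel_natural_trigger(dataset, has_trigger, target_label):
    """Subtler variant: pixels are untouched; only the labels of images that
    already contain the trigger feature (per `has_trigger`) are changed."""
    return [
        (image, target_label if has_trigger(image) else label)
        for image, label in dataset
    ]
```

The first function changes pixel statistics in a way a detector might pick up on. The second leaves every image exactly as it was, which is one intuition for why the subtler style is harder to catch.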

EDIT: In the comments, Charlie Steiner pointed out another key difference between trojan detection and deceptive alignment. Most work studying trojans involves the model quickly misbehaving upon seeing the trigger. But a treacherous turn may be a gradual process that is not immediately evident (e.g. power seeking). This seems challenging to detect and different from the type of problem addressed by existing trojan work. It may be valuable in the future to study trojans which cause models to misbehave in subtle ways like this.

Mechanistic interpretability is not uniquely capable of fixing insidious misalignment.

It seems doubtful that mechanistic interpretability will be necessary or sufficient for fixing insidious misalignment. In the past few months, there seems to have been a subtle shift among AI safety researchers away from human-in-the-loop mechanistic interpretability and toward other things with model internals. This seems likely to continue, and I think it will get us much further if we don’t rely on humans in the loop.

The one nice thing about deceptive and other insidiously inner-misaligned systems is that even though the inputs that trigger their bad behavior will be hard to find, there will likely be a good amount of inner neural circuitry dedicated to them making the treacherous turn. While it will be almost impossible to factor RSA-2048 during a training simulation to see what a potentially insidiously misaligned model does, it will probably be much easier to give it synthetic hallucinations!

Later, EIS XI will discuss how latent adversarial training (Jermyn, 2022) (a form of relaxed adversarial training (Hubinger, 2019)) seems to be a compelling strategy for fixing insidious misalignment and something we should start working on in large language models yesterday!
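
As a rough illustration of what a “synthetic hallucination” could mean, the sketch below perturbs the hidden activations of a tiny hand-built network instead of its inputs. This is only the attack half of the idea and is not the method from Jermyn (2022) or Hubinger (2019); the weights `W_IN` and `W_OUT`, the “bad logit” readout, and the sign-based perturbation (which is optimal only because the readout here is linear) are all assumptions chosen to keep the example small.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hidden(x, w_in):
    """Hidden activations of a one-layer ReLU network."""
    return relu([dot(row, x) for row in w_in])


def bad_logit(h, w_out):
    """Hypothetical linear readout scoring how strongly the model favors the bad action."""
    return dot(w_out, h)


def sign(v):
    return (v > 0) - (v < 0)


def worst_case_latent(h, w_out, eps):
    """Shift each hidden unit by at most eps in the direction that raises the bad logit."""
    return [hi + eps * sign(wi) for hi, wi in zip(h, w_out)]


# Inputs are [ordinary_feature, trigger_feature, 1.0]; the trailing 1.0 acts as a bias.
# Unit 1 is a trigger detector that fires only when trigger_feature exceeds 5,
# which (by assumption) never happens on inputs available during development.
W_IN = [[1.0, 0.0, 0.0],
        [0.0, 1.0, -5.0]]
W_OUT = [0.0, 10.0]
```

For an ordinary input such as `[0.3, 0.8, 1.0]`, the trigger unit stays at zero and so does the bad logit. Applying `worst_case_latent` with `eps=0.5` switches the unit on directly and exposes the bad behavior, with no need to find a triggering input. A training procedure would then penalize that behavior under the perturbation, which this sketch does not attempt.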

Questions

Would you argue that mechanistic interpretability is either necessary or sufficient for fixing insidious misalignment?
Do you think interpretability or non-interpretability methods are more promising for fixing insidious misalignment?

Here's the comment I promised back on post 5.

First, I liked Benchmarking Interpretability Tools for Deep Neural Networks. Because I'm not so familiar with the methods you compared, I admit I actually took a long time to figure out that each "interpretability tool" was responsible for both finding and visualizing the trigger of a trojan pointing to a particular class, and so "visualize the trigger" meant using the tools separately to try to visualize a trojan given the target class.

Detecting trojans vs. detecting deception.
- This post, and the comparison between trojans and deception, makes a lot of sense if we stick with deception as being "act one way until you see a very specific thing (e.g. a computationally difficult task like reversing a hash), then act totally differently."
- However, a lot of trojan-detecting strategies don't cross over to deception-detection even for this definition of deception.
  - Picking out the trigger of a trojan isn't super useful for detecting deception. Knowing "what deception looks like" - the analogue of knowing the target class of a trojan in a classifier - is a problem. Maybe you could look for what would cause the model to suddenly start pursuing instrumental objectives like gathering resources? Absent knowledge of the target, it's hard to use visualization of a trigger to detect deception in real, complicated models.
    - After looking over all of the other options, this does seem like the most promising of the lot.
  - Training a lot of models with trojans and a lot of trojan-free models, then training a classifier, seems to work pretty well for detecting trojans. But this works a whole lot less well for endogenous problems with the model rather than exogenous ones, plus it's a huge alignment tax on big models.
  - Sometimes the features related to the trojan have different statistics than the features a NN uses to do normal classification. But this seems unlikely to cross over to deception, where one of the foundational issues is that deception is supposed to be a consequence of the normal reasoning process.
  - Cribbing off of Wu 2022 for more ideas, I see more defenses that rely on different statistics of the poisoned data (not applicable), or availability of clean data (not applicable for endogenous problems).
- And that definition of deception might fail to be useful.
  - There might be many events that cause an AI to suddenly change its behavior (e.g. detecting an asteroid heading towards the earth), that we don't want to classify as triggers of deception.
  - Rather than waiting for one specific trigger, a deceptive AI might use its model of the world more gradually. E.g. maybe our deceptive AI is optimizing for human impressions of it rather than the substance of its actions, and it takes overtly helpful actions when that's the best way to manage humans, and overtly harmful actions when that's the best way to manage humans, without ever having a sudden change in internal representation.
  - Typically when people talk about detecting deception, I think they imagine identifying deception-related features in the model's latent representation, which isn't very related to trojans.
Interpreting trojans vs. deception.
- With trojans, if there is one it's a reasonable question to ask what the trigger is. With deception, we don't care so much what the trigger is - which is a darn good thing, because it's so much easier to visualize features that are about the concrete contents of images than it is anything else.
- But this causes a trust-building problem, where it's hard to interpret what the tools are actually warning you of. You can solve this with better interpretability tools for AI that models the real world, or you can maybe solve it by proving the trustworthiness of your tools in toy models.
- I think that the field of AI will, gradually, make progress on better interpretability tools for AIs that model the real world, even in the absence of TAISC (though hopefully we can help). I don't think work on visualizing features in image models is directly useful towards that goal (i.e. I don't think TAISC has been "scooped" by ordinary academics doing good work on that).
The entire idea of "fixing deception."
- Being able to detect deception is useful. But the followup probably shouldn't look like "then we take that same model and somehow fix it, e.g. by ablating the computation it was using to trigger the change of behavior." This seems like just asking for trouble, and I would much rather we went back to the drawing board, understood why we were getting deceptive behavior in the first place, and trained an AI that wasn't trying to do bad things.
- This is connected to more broad meanings of deception like "optimizing for appearance rather than substance" - a model where we detect sudden deceptive change of behavior probably also has other bad behavior going on too, even if we remove its ability to do the sudden change.

Thanks for the comment. I appreciate how thorough and clear it is.

Knowing "what deception looks like" - the analogue of knowing the target class of a trojan in a classifier - is a problem.

Agreed. This totally might be the most important part of combatting deceptive alignment. I think of this as somewhat separate from what diagnostic approaches like MI are equipped to do. Knowing what deception looks like seems more of an outer alignment problem, while knowing what will make the model behave badly even if it seems to be aligned is more of an inner one.

Training a lot of models with trojans and a lot of trojan-free models, then training a classifier, seems to work pretty well for detecting trojans.

+1, but this seems difficult to scale.

Sometimes the features related to the trojan have different statistics than the features a NN uses to do normal classification.

+1, see https://arxiv.org/abs/2206.10673. It seems like trojans inserted crudely via data poisoning may be easy to detect using heuristics that may not be useful for other insidious flaws.

(e.g. detecting an asteroid heading towards the earth)

This would be anomalous behavior triggered by a rare event, yes. I agree it shouldn't be called deceptive. I don't think my definition of deceptive alignment applies to this because my definition requires that the model does something we don't want it to.

Rather than waiting for one specific trigger, a deceptive AI might use its model of the world more gradually.

Strong +1. This points out a difference between trojans and deception. I'll add this to the post.

This seems like just asking for trouble, and I would much rather we went back to the drawing board, understood why we were getting deceptive behavior in the first place, and trained an AI that wasn't trying to do bad things.

Thanks!

dedicated to them making the sharp left turn

I believe that “treacherous turn” was meant here.

But suppose that despite our best effort, we end up with a deceptively aligned system on our hands. Now what do we do? At this point, the problem of detecting and fixing deception becomes quite similar to just detecting and fixing problems with the model in general – except for one thing. Deceptive alignment failures are triggered by inputs that are, by definition, hard to find during training.

I agree almost completely. The following is an example that seems to be contradictory: unfaithful CoT reasoning (https://arxiv.org/abs/2307.13702).
