Understanding strategic deception and deceptive alignment

Mikita Balesni; Jérémy Scheurer; Dan Braun

Good post. I think it's important to distinguish (some version of) these concepts (i.e. SD vs DA).

When an AI has Misaligned goals and uses Strategic Deception to achieve them.

This statement doesn't seem to capture exactly what you mean by DA in the rest of the post. In particular, a misaligned AI may use SD to achieve its goals, without being deceptive about its alignment / goals. DA, as you've discussed it later, seems to be deception about alignment / goals.

[-]Marius Hobbhahn2y30

We considered alternative definitions of DA in Appendix C.

We felt like being deceptive about alignment / goals was worse than what we ended up with (copied below):

“An AI is deceptively aligned when it is strategically deceptive about its misalignment”

Problem 1: The definition is not clear about cases where the model is strategically deceptive about its capabilities.

For example, when the model pretends to not have a dangerous capability in order to pass the shaping & oversight process, we think it should be considered deceptively aligned, but it’s hard to map this situation to deception about misalignment

Problem 2: There are cases where the deception itself is the misalignment, e.g. when the AI strategically lies to its designers, it is misaligned but not necessarily deceptive about that misalignment.

For example, a personal assistant AI deletes an incoming email addressed to the user that would lead to the user wanting to replace the AI. The misalignment (deleting an email) is itself strategic deception, but the model is not deceiving about its misalignment (unless it engages in additional deception to cover up the fact that it deleted an email, e.g. by lying to the user when asked about any emails).

[-]Jay Bailey2y33

In current user-facing LLMs like ChatGPT or Claude, the closest approximation to goals may be being helpful, harmless, and honest.

According to my understanding of RLHF, the goal-approximation it trains for is "Write a prompt that is likely to be rated as positive". In ChatGPT / Claude, this is indeed highly correlated with being helpful, harmless, and honest, since the model's best strategy for getting high ratings is to be those things. If models are smarter than us, this may cease to be the case, as being maximally honest may begin to conflict with the real goal of getting a positive rating. (e.g, if the model knows something the raters don't, it will be penalised for telling the truth, which may optimise for deceptive qualities) Does this seem right?

[-]Marius Hobbhahn2y20

Seems like one of multiple plausible hypotheses. I think the fact that models generalize their HHH really well to very OOD settings and their generalization abilities in general could also mean that they actually "understood" that they are supposed to be HHH, e.g. because they were pre-prompted with this information during fine-tuning.

I think your hypothesis of seeking positive ratings is just as likely but I don't feel like we have the evidence to clearly say so wth is going on inside LLMs or what their "goals" are.

[-]Jay Bailey2y30

Interesting. That does give me an idea for a potentially useful experiment! We could finetune GPT-4 (or RLHF an open source LLM that isn't finetuned, if there's one capable enough and not a huge infra pain to get running, but this seems a lot harder) on a "helpful, harmless, honest" directive, but change the data so that one particular topic or area contains clearly false information. For instance, Canada is located in Asia.

Does the model then:

Deeply internalise this new information? (I suspect not, but if it does, this would be a good sign for scalable oversight and the HHH generalisation hypothesis)
Score worse on honesty in general even in unrelated topics? (I also suspect not, but I could see this going either way - this would be a bad sign for scalable oversight. It would be a good sign for the HHH generalisation hypothesis, but not a good sign that this will continue to hold with smarter AI's)

One hard part is that it's difficult to disentangle "Competently lies about the location of Canada" and "Actually believes, insomuch as a language model believes anything, that Canada is in Asia now", but if the model is very robustly confident about Canada being in Asia in this experiment, trying to catch it out feels like the kind of thing Apollo may want to get good at anyway.

[-]Marius Hobbhahn2y20

Sounds like an interesting direction. I expect there are lots of other explanations for this behavior, so I'd not count it as strong evidence to disentangle these hypotheses. It sounds like something we may do in a year or so but it's far away from the top of our priority list. There is a good chance, we will never run it. If someone else wants to pick this up, feel free to take it on.

[-]Lech Mazur2y20

The specific example in your recent paper is quite interesting

"we deploy GPT-4 as an agent in a realistic, simulated environment, where it assumes the role of an autonomous stock trading agent. Within this environment, the model obtains an insider tip about a lucrative stock trade and acts upon it despite knowing that insider trading is disapproved of by company management. When reporting to its manager, the model consistently hides the genuine reasons behind its trading decision"

[-]aog2y20

How would you characterize strategic sycophancy? Assume that during RLHF a language model is rewarded for mimicking the beliefs of its conversational partner, and therefore the model intelligently learns to predict the conversational partner's beliefs and mimic them. But upon reflection, the conversational partner and AI developers would prefer that the model report its beliefs honestly.

Under the current taxonomy, this would seemingly be classified as deceptive alignment. The AI's goals are misaligned with the designer's intentions, and it uses strategic deception to achieve them. But sycophancy doesn't include many of the ideas commonly associated with deceptive alignment, such as situational awareness and a difference in behavior between train and test time. Sycophancy can be solved by changing the training signal to not incentivize sycophancy, whereas the hardest to fix forms of deceptive alignment cannot be eliminated by changing the training signal.

It seems like the most concerning forms of deceptive alignment include stipulations about situational awareness and the idea that the behavior cannot necessarily be fixed by changing the training objective.

[-]aog2y*60

Separately, it seems that deception which is not strategic or intentional but is consistently produced by the training objective is also important. Considering cases like Paul Christiano's robot hand that learned to deceive human feedback and Ofria's evolutionary agents that learned to alter their behavior during evaluation, it seems that AI systems can learn to systematically deceive human oversight without being aware of their strategy. In the future, we might see powerful foundation models which are honestly convinced that giving them power in the real world is the best way to achieve their designers' intentions. This belief might be false but evolutionarily useful, making these models disproportionately likely to gain power. This case would not be called "strategic deception" or "deceptive alignment" if you require intentionality, but it seems very important to prevent.

Overall I think it's very difficult to come up with clean taxonomies of AI deception. I spent >100 hours thinking and writing about this in advance of Park et al 2023 and my Hoodwinked paper, and ultimately we ended up steering clear of taxonomies because we didn't have a strong taxonomy that we could defend. Ward et al 2023 formalize a concrete notion of deception, but they also ignore the unintentional deception discussed above. The Stanford Encyclopedia of Philosophy takes 17,000 words to explain that philosophers don't agree on definitions of lying and deception. Without rigorous formal definitions, I still think it's important to communicate the broad strokes of these ideas publicly, but I'd lean towards readily admitting the messiness of our various definitions of deception.

[-]Marius Hobbhahn2y52

(personal opinion; might differ from other authors of the post)

Thanks for both questions. I think they are very important.

1. Regarding sycophancy: For me it mostly depends on whether it is strategic or not. If the model has the goal of being sycophantic and then reasons through that in a strategic way, I'd say this counts as strategic deception and deceptive alignment. If the model is sycophantic but doesn't reason through that, I'd probably not classify it as such. I think it's fine to use different terms for the different phenomena and have sycophancy not be a true subset of deceptive alignment.

2. There are multiple perspectives of the non-strategic, non-intentional deception and I haven't thought about it in detail, so I might change my mind. However, currently, I think the following:
a) I guess this will be less and less of a problem in the long run. I expect models to be sufficiently capable and situationally aware at some point that they straightforwardly understand what we want and thus rarely accidentally become systematically deceptive. I think there is still a chance that these things will happen and I'm happy that ARC and others are working on them but I personally think it has become compared to intentional, strategic deception.
b) It just seems really hard to be I) non-intentionally, II) but systematically deceptive, while III) humans are actively looking for signs of deception, and IV) actually cause catastrophic harm with that strategy. As soon as a model has sufficient situational awareness and is generally capable to avoid human oversight I feel like it may just realize what it's doing and then the deception either stops or becomes intentional. (weekly held, haven't thought about it in detail)

[-]aog2y40

Thanks! First response makes sense, there's a lot of different ways you could cut it.

On the question of non-strategic, non-intentional deception, I agree that deceptive alignment is much more concerning in the medium term. But suppose that we develop techniques for making models honest. If mechanistic interpretability, unsupervised knowledge detection, or another approach to ELK pans out, we'll have models which reliably do what they believe is best according to their designer's goals. What major risks might emerge at that point?

Like an honest AI, humans will often only do what they consciously believe is morally right. Yet the CEOs of tobacco and oil companies believe that their work is morally justified. Soldiers on both sides of a battlefield will believe they're on the side of justice. Scientists often advance dangerous technologies in the names of truth and progress. Sometimes, these people are cynical, pursuing their self-interest even if they believe it's immoral. But many believe they are doing the right thing. How do we explain that?

These are not cases of deception, but rather self-deception. These individuals operate in an environment where certain beliefs are advantageous. You will not become the CEO of a tobacco company or a leading military commander if you don't believe your cause is justified. Even if everyone is perfectly honest about their own beliefs and only pursues what they believe is normatively right, the selection pressure from the environment is so strong that many powerful people will end up with harmful false beliefs.

Even if we build honest AI systems, they could be vulnerable to self-deception encouraged by environmental selection pressure. This is a longer term concern, and the first goal should be to build honest AI systems. But it's important to keep in mind the problems that would not be solved by honesty alone.

[-]Marius Hobbhahn2y42

Fully agree that this is a problem. My intuition that the self-deception part is much easier to solve than the "how do we make AIs honest in the first place" part.

If we had honest AIs that are convinced bad goals are justified, we would likely find ways to give them less power or deselect them early. The problem mostly arises when we can't rely on the selection mechanisms because the AI games them.

[-]Dan Braun2y52

(These are my own takes, the other authors may disagree)

We briefly address a case that can be viewed as "strategic sycophancy" case in Appendix B in the blog post, which is described similarly to your example. We indeed classify it as an instance of Deceptive Alignment.
As you mention, this case does have some differences with ideas commonly associated with Deceptive Alignment, notably the difference in behaviour between oversight and non-oversight. But it does share two important commonalities:

The model is pursuing a goal that its designers do not want.
The model strategically deceives the user (and designer) to further a goal.

Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).

Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).

I agree that deception which is not strategic or intentional could be important to prevent. However,

I expect the failure cases in these scenarios to manifest earlier, making them easier to fix and likely less catastrophic than cases that are strategic and intentional.
Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn't be very useful. We can use "deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.

[-]aog2y40

Having a definition of Deceptive Alignment that captured every dangerous behaviour related to deception wouldn't be very useful. We can use "deception” on its own to refer to this set of cases, and reserve terms like Strategic Deception and Deceptive Alignment for subclasses of deception, ideally subclasses that meaningfully narrow the solution space for detection and prevention.

Fully agreed. Focusing on clean subproblems is important for making progress.

Detecting instances of models that share these properties will likely involve using many of the tools and techniques that would be applied to more canonical forms of deceptive alignment (e.g. evals that attempt to alter/hamstring a model and measure behaviour in a plethora of settings, interpretability).
Though, as you mention, preventing/fixing models that exhibit these properties may involve different solutions, and somewhat crude changes to the training signal may be sufficient for preventing strategic sycophancy (though by doing so you might end up with strategic deception towards some other Misaligned goal).

Yeah I would usually expect strategic deception to be better addressed by changing the reward function, as training is simply the standard way to get models to do anything, and there's no particular reason why you couldn't fix strategic deception with additional training. Interpretability techniques and other unproven methods seem more valuable if there are problems that cannot be easily addressed via additional training.

[-]Violet Hour2y10

Nice work!

I wanted to focus on your definition of deceptive alignment, because I currently feel unsure about whether it’s a more helpful framework than standard terminology. Substituting terms, your definition is:

Deceptive Alignment: When an AI has [goals that are not intended/endorsed by the designers] and [attempts to systematically cause a false belief in another entity in order to accomplish some outcome].

Here are some initial hesitations I have about your definition:

If we’re thinking about the emergence of DA during pre-deployment training, I worry that your definition might be too divorced from the underlying catastrophic risk factors that should make us concerned about “deceptive alignment” in the first place.

Hubinger’s initial analysis claims that the training process is likely to produce models with long-term goals.^[1] I think his focus was correct, because if models don’t develop long-term/broadly-scoped goals, then I think deceptive alignment (in your sense) is much less likely to result in existential catastrophe.
If a model has long-term goals, I understand why strategic deception can be instrumentally incentivized. To the extent that strategic deception is incentivized in the absence of long-term goals, I expect that models will fall on the milder end of the ‘strategically deceptive’ spectrum.
- Briefly, this is because the degree to which you’re incentivized to be strategic is going to be a function of your patience. In the maximally extreme case, the notion of a ‘strategy’ breaks down if you’re sufficiently myopic.
- So, at the moment, I don’t think I’d talk about 'deceptive alignment' using your terminology, because I think it misses a crucial component of why deceptively aligned models could pose a civilizational risk.

If we’re thinking about the risks of misaligned strategic deception more broadly, I think distinguishing between the training and oversight process is helpful. I also agree that it’s worth thinking about the risks associated with models whose goals are (loosely speaking) ‘in the prompts’ rather than ‘in the weights’.

That said, I’m a bit concerned that your more expansive definition encompasses a wide variety of different systems, many of which are accompanied by fairly distinct threat models.
- The risks from LM agents look to be primarily (entirely?) misuse risks, which feels pretty different from the threat model standardly associated with DA. Among other things, one issue with LM agents appears to be that intent alignment is too easy.
- One way I can see my objection mattering is if your definitions were used to help policymakers better understand people's concerns about AI. My instinctive worry is that a policymaker who first encountered deceptive alignment through your work wouldn’t have a clear sense of why many people in AI safety have been historically worried about DA, nor have a clear understanding of why many people are worried about ‘DA’ leading to existential catastrophe. This might lead to policies which are less helpful for 'DA' in the narrower sense.

^{^}
Strictly speaking, I think 'broadly-scoped goals' is probably slightly more precise terminology, but I don't think it matters much here.

[-]momom22y10

A model is deceptively aligned with its designers. However, the designers have very good control mechanisms in place such that they would certainly catch the AI if it tried to act misaligned. Therefore, the model acts aligned with the designers' intentions 100% of the time. In this world, a model that is technically deceptively aligned may still be safe in practice (although this equilibrium could be fragile and unsafe in the long run).

In that case, there is no strategic deception (the designers are not induced in error by the AI).

I think we consider this case strategic deception, because we have an intuition of what inputs the AI receives (one where it would be controlled little enough to be deceptive) that differs from the actual one.

To fix this, I propose strategic deception not be defined according to the AI's behavior, but to its hypothetical behavior on an idealized class of inputs that represents all situations where the designers want the AI to behave in a certain way.

E.g. The designers are holding the AI in a simulation to check if it's deceptive. This information is given to the AI in a pre-prompt due to technical issues. However, the designers want the AI to avoid strategic deception even during deployment where it won't be told it's in a simulation, so their idealized test set includes prompts without this information.
By this definition they cannot check if the AI exhibits strategic deception before deployment in this situation.

Also, I am unsatisfied by "in order to accomplish some outcome" and "[the AI's] goals" because this assumes an agentic framework, which might not be relevant in real-world AI.

How to fix the first, for agentic AI only: "for which the AI predicts an outcome that can be human-interpreted as furthering its goals"
Not sure how to talk about deceptive non-agentic AI.

^{^}

In contrast to previous work (e.g. Hubinger, 2019), we define deceptive alignment w.r.t. to an entity rather than a process or objective. We think Misalignment should be defined in relation to whoever is able to meaningfully shape the model’s goals. Since the goal-shaping process is a result of the designer’s intentions and the designer monitors the model in deployment, the entity designing the goal-shaping process is a better target than the process itself (see Appendix A)

^{^}

To distinguish from the colloquial definition of “alignment”, which could be applied to any two parties that have a mismatch in goals, we use the capitalized Aligned and Misaligned to make clear that we’re only referring to a conflict with the designer’s goals.

^{^}

Note that prompt modifications that affect the perceived likelihood of oversight should NOT count as evidence against robustness. For example, if a model stops revealing its misalignment when the prompt includes cues that the model is being overseen, this should NOT mean that the deceptive behavior is not robust (in fact, it is an essential part of deceptively aligned behavior).

^{^}

The threshold for robustness is ultimately arbitrary, but we think it makes sense to draw the line at a level that would allow the model to consistently act on that behavior even when slightly pressured. For example, an AI that only acts deceptively on a specific sequence it memorized should not fall into that category. However, an AI that coherently generalizes deceptive behavior to sequences that are outside of the training distribution should fall into the definition.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

64

Understanding strategic deception and deceptive alignment

64

Ω 33

64

Ω 33

Core concepts

Strategic deception

(Mis)Alignment

Deceptive alignment

Process-oriented view of deceptive alignment