Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I don't think this idea is particularly novel, but it comes up often in conversations I have, so I figured it'd be good to write it up.

How do you prevent deception from AI systems?

One obvious thing to try would be to make sure that your model never thinks deceptive thoughts. There are several ways you could go about this, with varying degrees of effectiveness. For instance, you could check for various precursors to deceptive alignment. More prosaically, you could try identifying deceptive circuits within your model, or looking at what features your model is keeping track of and identifying any suspicious ones.

I think these are pretty reasonable approaches. However, I think they fail to address failure modes like deep deceptiveness. A system can be deceptive even if no individual part looks deceptive, due to complex interactions between the system and the environment. More generally, cognition and optimization power can be externalized to the environment.

One could make the argument that focusing on more salient and dangerous failure modes like deceptive alignment makes a lot more sense. However, especially if you’re interested in approaches that work in the worst case and don’t rely on reality turning out optimistically one way or the other, you probably want approaches that prevent things from going badly at all.

So, how do you prevent any failure modes that route through deception? In an earlier post, I wrote about robust intent alignment as the solution, and one research direction I think is feasible to get there. But here I want to make a different point, about what it would look like to interface with deception in that setup.

Start from the intuition that deception in a system is a property of the person being deceived more than it is of the deceiver. It follows pretty naturally that deception is better viewed as a property of the composite system that is the agent and its environment. So, if you wanted to interface with the general thing that is deception, you’re going to be building an interface to that composite system.

Another way of putting this is that evaluating whether an action leads to deception or not requires being able to evaluate things within the complexity class of the environment. Which brings us back to the classical difficulty of RLHF: we can’t foresee the consequences of some action very reliably in a complex environment. This means we’re unable to properly evaluate whether an action taken by something smarter than us is good or not.

If all alignment properties we cared about were in this complexity class, it may be the case that worst-case alignment is strictly intractable.

However, there are many properties that belong to the complexity class of the agent alone, such that evaluation is much more feasible. Properties describing specific mechanistic components of the model, for instance. The sub-problem of identifying explicitly deceptive circuits within a model falls under this category[1].

Another property is related to internal optimization within systems. If you buy the idea of optimization as pointing at something real in the world[2], then, for the right operationalization, properties of the optimization are properties that belong to the complexity class of the system it’s internal to. One example is the target of the optimization: what I refer to as objectives, which are the focus of most of my work.

I think one way to frame the core of the technical alignment problem is that we need to figure out how to interface with the alignment properties we care about within the systems we care about. The choice of which properties we try to build interfaces with is pretty central to that, and I think a commonly missed idea is that some properties we care about cash out in pretty different ways in the limit, and so differ in how tractable they are.

  1. ^

     This is also, by my understanding, the part of the problem that most theories of impact for current interpretability work focus on. That it only focuses on this part of the problem is one of my reservations about current interpretability work.

  2. ^

     Even if very confused in current operationalizations. It could also be very fuzzy in realistic scenarios, with blurred boundaries between things that are structurally optimizers and things that aren’t. However, I think there’s a coherent conceptual boundary that can be drawn.


Start from the intuition that deception in a system is a property of the person being deceived more than it is of the deceiver. It follows pretty naturally that deception is better viewed as a property of the composite system that is the agent and its environment.

The first part here feels unfair to the deceived. The second part seems like a property of successful deception, which depends crucially on the environment in addition to the AI. But this seems like too high a bar; successful deception of us, by definition, is not noticed, so if we ever notice deception it can't have been successful. I care less about whether deception will succeed and more about whether the AI will try to be deceptive in the first place. The core intuition is that if we have the latter, I assume we'll eventually get the former through better models (though I think there's a decent chance that control works for a long time, and there you care specifically about whether complex environment interactions lead to deception succeeding or not, but I don't think that's what you mean?).

The thing that seems close to this and correct, and that I think you maybe mean, is something like: deception arises in an AI if (NB: "if", not "if and only if") (1) the AI system has some goal G, (2) the environment is such that deceiving the humans is a good strategy for achieving G, and (3) there are no limits in the AI that prevent it from finding and executing that strategy (e.g. the architecture is expressive enough, the inductive biases don't massively reduce the probability of that strategy, or RLHFed constraints against being bad aren't enough). And here, (2) is of course about the environment. But to see whether this argument goes through, it doesn't seem like we need to care all that much about the real-world environment (as opposed to toy settings), because "does the real world incentivize deception" seems much less cruxy than (1) or (3).

So my (weakly held) claim is that you can study whether deception emerges in sufficiently simple environments that the environment complexity isn't a core problem. This will not let you determine whether a particular output in a complicated environment is part of a deceptive plan, but it should be fairly good evidence of whether or not deception is a problem at all.
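As a purely illustrative sketch of conditions (1) to (3) in a deliberately simple environment: the agent's goal G is reward, the environment pays out on what the overseer believes unless the overseer checks, and the agent's action space may or may not include misreporting. Every name here (`Action`, `reward`, `best_action`, the payoff numbers) is a hypothetical choice made for the example, not something taken from an existing benchmark.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Action:
    name: str
    task_done: bool      # what actually happens in the environment
    reported_done: bool  # what the overseer is told

    @property
    def deceptive(self) -> bool:
        # An action is deceptive here if the report differs from reality.
        return self.reported_done != self.task_done


def reward(action: Action, overseer_checks: bool) -> float:
    """Toy environment (assumed payoffs): 1.0 for believed success, 0.5 cost for real work."""
    believed = action.task_done if overseer_checks else action.reported_done
    effort = 0.5 if action.task_done else 0.0
    return (1.0 if believed else 0.0) - effort


def best_action(actions: Iterable[Action], overseer_checks: bool) -> Action:
    """Condition (1): the agent simply maximizes its goal G, which is reward."""
    return max(actions, key=lambda a: reward(a, overseer_checks))


ACTIONS = [
    Action("do_and_report", task_done=True, reported_done=True),
    Action("shirk_and_admit", task_done=False, reported_done=False),
    Action("shirk_and_claim", task_done=False, reported_done=True),
]
```

With an unchecked overseer and the full action list, `best_action` picks `shirk_and_claim`. Setting `overseer_checks=True` (condition (2) fails) or dropping the deceptive action from the list (condition (3) fails) makes it pick `do_and_report` instead, which is the sense in which each condition can be varied separately in a toy setting.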

(Also: do you mean a literal complexity class or something more informal? I assume the latter, and in that case I think it's better to not overload the term.)

The thing that seems close to this and correct, and that I think you maybe mean, is something like: deception arises in an AI if (NB: "if", not "if and only if") (1) the AI system has some goal G, (2) the environment is such that deceiving the humans is a good strategy for achieving G, and (3) there are no limits in the AI that prevent it from finding and executing that strategy (e.g. the architecture is expressive enough, the inductive biases don't massively reduce the probability of that strategy, or RLHFed constraints against being bad aren't enough).

Yep, that's what I meant. I go into it in more detail in this post, or you could read Nate's post on deep deceptiveness.

However, I think studying this in practice becomes much more of a quantitative question. I think this argument is clearly true in the limit. So, relevant factors for practical contexts become how powerful the AI system is, how strongly the environment incentivizes deception, and how many complex levers you can easily pull in the environment.

For example, compare the settings of inserting backdoors into code in the control agenda and writing explanations for why a protein recipe is safe. I think there's a pretty large difference in how likely I expect each setting is to result in deception. Concretely:

  • The "environment" in the second context is much richer. There are more actions that an AI system could take that interact with the environment in complex ways resulting in deception. There's plausibly a qualitative threshold above which actions like “Use this existing service that does the thing you want without wondering how it does it” are possible/easy to find.
  • At least in the control agenda, part of the case for optimism in the former setting relies on systems not being wildly competent before we can use them to solve alignment. Plausibly this is below the threshold at which a system can efficiently search through actions in the environment to get high performance, but I'm not super certain.

(I go into more detail on stuff like this relating to the control agenda in this post; specifically section 3. I do think the argument generalizes to why I think there's slightly less chance that control works, but it's not central to this post.)

I agree that this would give you non-zero evidence, however. I think we already have some evidence that looks sort of like this. From another post of mine:

Back in 2000, a computer scientist named Charles Ofria was studying the evolution of simulated organisms. He wanted to limit their replication rate, so he programmed the simulation to pause after each mutation, measure the mutant’s replication rate in an isolated test environment, and delete the mutant if it replicated faster than its parent.

Eventually, organisms evolved to recognize and pause their replication in the test environment and resume normal replication when released. Ofria tried solving this by making it harder to detect the test environment, but the organisms simply started to randomly accelerate their replication so as to slip through the test environment some of the time.
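The selection dynamic in this anecdote can be sketched with a very small deterministic toy. This is not Ofria's actual setup: the `Organism` fields, the `plays_dead` trait, and the two-mutations-per-parent rule are all assumptions made to keep the example short and reproducible.

```python
from dataclasses import dataclass, replace
from typing import List, Set


@dataclass(frozen=True)
class Organism:
    true_rate: int    # replication rate when running freely
    plays_dead: bool  # hypothetical trait: pause replication while being tested


def measured_rate(org: Organism) -> int:
    """Rate the overseer observes in the isolated test environment."""
    return 0 if org.plays_dead else org.true_rate


def passes_filter(mutant: Organism, parent: Organism) -> bool:
    """Overseer's rule: delete any mutant that tests as faster than its parent."""
    return measured_rate(mutant) <= measured_rate(parent)


def mutants(parent: Organism) -> List[Organism]:
    """Assumed mutation model: one faster variant, one with the trait flipped."""
    return [
        replace(parent, true_rate=parent.true_rate + 1),
        replace(parent, plays_dead=not parent.plays_dead),
    ]


def evolve(generations: int) -> Set[Organism]:
    population = {Organism(true_rate=1, plays_dead=False)}
    for _ in range(generations):
        survivors = {
            m for p in population for m in mutants(p) if passes_filter(m, p)
        }
        population |= survivors
    return population
```

After `evolve(5)`, every organism that replicates openly in the test still has `true_rate == 1`, while the `plays_dead` lineage has reached `true_rate == 5`. The filter only ever constrained the measured rate, so the pressure it applied went into the gap between measured and true behavior.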

I don't think this is very strong evidence for whether we should expect similar things to pop up in realistic situations - the parameters / relevant factors are pretty different. But I'm left with uncertainty over how likely it is to pop up, and not whether it's a problem at all. And if you actively apply selection pressure against one kind of deception, I think it becomes more likely.


The first part here feels unfair to the deceived.

I'm not sure what you mean by this?

The second part seems like a property of successful deception, which depends crucially on the environment in addition to the AI. But this seems like too high a bar; successful deception of us, by definition, is not noticed, so if we ever notice deception it can't have been successful. I care less about whether deception will succeed and more about whether the AI will try to be deceptive in the first place.

That it's too high a bar is exactly what I'm saying. It's possible in theory to come up with some complicated platonic measure of deception such that you can detect when it's occurring in a system[1], but that's hard.

(Also: do you mean a literal complexity class or something more informal? I assume the latter, and in that case I think it's better to not overload the term.)

I meant something more informal, yeah. I was wary of conflating terms together like that, but I couldn't immediately think of another phrase that conveyed the idea I was trying to get across.

  1. ^

    If it's possible to robustly identify deceptive circuits within a system, that's because there's some boundary that separates it from other kinds of circuits that we can leverage. Generalizing, there's plausibly some conceptual boundary that separates mechanics in the environment that lead to deception from ones that don't.

Hi, I enjoyed reading the post, as it clarified some thoughts that I was having regarding the topic. Could you please briefly elaborate on this part of your post:

..., then for the right operationalization properties about optimization are properties that belong to the complexity class of the system it’s internal to.

I understand the first part of that sentence and the rough idea of what you're getting at, but not exactly what you mean in this particular section.

Thanks for the comment, I'm glad it helped!

I'm not sure if I know exactly what parts you feel fuzzy on, but some scattered thoughts:

Abstracting over a lot of nuance and complexity, one could model internal optimization as being a ~general-purpose search process / module that the model can make use of. A general-purpose search process requires a goal to evaluate the consequences of different plans that you're searching over. This goal is fed into the search module as an input.

This input is probably described in the model's internal language; i.e., it's described in terms of concepts that the model learns corresponding to things in the environment. This seems true even if the model uses some very direct pointer to things in the environment - it still has to be represented as information that makes sense to the search process, which is written in the model's ontology.

So the inputs to the search process are part of the system itself. Which is to say that the "property" of the optimization that corresponds to what it's targeted at, is in the complexity class of the system that the optimization is internal to. I think this generalizes to the case where the optimization isn't as cleanly represented as a general-purpose search module.
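A minimal sketch of that picture, under heavy simplifying assumptions: the search module is a generic function, the world model maps plans to predicted values of internal concepts, and the goal is just a weighting over those same concepts. The names (`search`, `score`, `predict`, the concept labels) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable

# Outcomes and goals are both written in the model's own ontology:
# a mapping from internal concept names to numbers.
Concepts = Dict[str, float]


def score(outcome: Concepts, goal: Concepts) -> float:
    """How well a predicted outcome satisfies the goal (a weighting over concepts)."""
    return sum(weight * outcome.get(concept, 0.0) for concept, weight in goal.items())


def search(
    plans: Iterable[str],
    predict: Callable[[str], Concepts],
    goal: Concepts,
) -> str:
    """General-purpose search: the goal is an input, not part of the search itself."""
    return max(plans, key=lambda plan: score(predict(plan), goal))


def predict(plan: str) -> Concepts:
    """Stand-in world model with made-up numbers."""
    table = {
        "tell_truth": {"human_approval": 0.4, "task_done": 0.9},
        "flatter": {"human_approval": 0.9, "task_done": 0.2},
    }
    return table[plan]
```

Here `search(["tell_truth", "flatter"], predict, {"task_done": 1.0})` returns `"tell_truth"`, and swapping the goal for `{"human_approval": 1.0}` returns `"flatter"`. The point of the sketch is that the `goal` argument is data inside the system, so in principle it can be read off without evaluating anything about the environment.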

Thank you for your answer! You clarified my confusion! 

I would be interested to know more about your concept of (inner) optimization in its full complexity and nuances. I would really appreciate it if you could point me to any previous writings on this.

My previous reads on this topic include this post from Yudkowsky and this post from Flint where (to the best of my understanding) an optimizing system evolves according to some preference ordering that has a low probability of occurring spontaneously. I find their definitions to be a bit more general than the one you are referring to here (please correct me if I am wrong).

I am curious about the above because I am currently working on a project related to this topic. I am interested in formalizing some concepts regarding optimizers and their potential evolution towards agentic structure in some limit with rigorous math.