In "Risks from Learned Optimization," we talked about a variety of different forms of pseudo-alignment, that is, ways in which a trained model's objective (its mesa-objective) can be misaligned off-distribution with the loss function it was trained under (the base objective). In particular, we distinguished between proxy alignment, suboptimality alignment, approximate alignment, and deceptive alignment. I still make heavy use of this classification, though I now believe there are additional types of pseudo-alignment that are fairly important but that this classification doesn't fully address. In particular, there are two variations on pseudo-alignment not discussed in the paper which I want to talk about here: corrigible pseudo-alignment and suboptimality deceptive alignment.
Corrigible pseudo-alignment. In the paper, we defined corrigible alignment as the situation in which "the base objective is incorporated into the mesa-optimizer's epistemic model and [the mesa-optimizer's] objective is modified to 'point to' that information." We mostly just talked about this as a form of robust alignment—however, as I note in "Towards a mechanistic understanding of corrigibility," this is a very unstable operation, requiring you to get your pointer just right. Thus, I think it's better to talk about corrigible alignment as the class of possible relationships between the base and mesa-objectives defined by the model having some sort of pointer to the base objective, including both corrigible robust alignment (if the pointer is robust) and corrigible pseudo-alignment (if the pointer is to some sort of non-robust proxy). In particular, I think this distinction is fairly important to why deceptive alignment might be more likely than robust alignment, as it points at why robust alignment via corrigibility might be quite difficult (which is a point we made in the paper, but one which I think is made much clearer with this distinction).
Suboptimality deceptive alignment. In the paper, we defined suboptimality alignment as follows:
A mesa-optimizer is suboptimality aligned if some deficiency, error, or limitation in its optimization process causes it to exhibit aligned behavior on the training distribution. This could be due to computational constraints, unsound reasoning, a lack of information, irrational decision procedures, or any other defect in the mesa-optimizer's reasoning process. Importantly, we are not referring to a situation where the mesa-optimizer is robustly aligned but nonetheless makes mistakes leading to bad outcomes on the base objective. Rather, suboptimality alignment refers to the situation where the mesa-optimizer is misaligned but nevertheless performs well on the base objective, precisely because it has been selected to make mistakes that lead to good outcomes on the base objective.
I now think we may have seriously understated the importance of this case (especially the "lack of information" part). In particular, I think suboptimality alignment may be one of the hardest-to-root-out ways in which you can end up with deceptive alignment.
Specifically, suppose you train a model and end up with a proxy pseudo-aligned mesa-optimizer. Furthermore, suppose that the reason it isn't deceptive is that it hasn't yet figured out that deception is a good strategy. If you then deploy such a model, doing so could be almost as dangerous as (if not more dangerous than) deploying an already deceptive model, as it could turn deceptive on you at any point if it encounters the right information during deployment. Since the only reason the model isn't deceptive in this situation is a lack of information, I've taken to calling it suboptimality deceptive alignment (though I'm open to suggestions for less clunky names).
As I mentioned recently in this comment, I think that suboptimality deceptive alignment complicates a lot of stories for how we can correct issues in our AIs as they appear. First, even if you have the ability to detect and train away any deceptive cognition that appears during training, that's still insufficient to rule out suboptimality deceptive alignment, as your model could still end up becoming deceptive later during deployment. Second, it may be that the input which causes the model to become deceptive isn't generated naturally, but rather adversarially (by some competitor AI, for example). If natural inputs are unlikely to produce deception but adversarial ones are, then this could create a situation in which the deception comes too late to properly deal with (namely, only after you've already filled the world with competing AIs).
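As a toy illustration of the first point above, the sketch below shows how a check that finds no deception during training says nothing about what happens once the missing information shows up in deployment. Everything here is hypothetical and invented for illustration: `ToyMesaOptimizer`, `TRIGGER`, and `passes_training_check` are not from the paper, and a single boolean flag is a drastic simplification of what "figuring out that deception is a good strategy" would look like in a real model.

```python
from dataclasses import dataclass

# Hypothetical stand-in for whatever information would lead the model
# to conclude that deception is a good strategy.
TRIGGER = "deception_pays"


@dataclass
class ToyMesaOptimizer:
    """Toy proxy pseudo-aligned model; not deceptive only for lack of information."""

    knows_deception_pays: bool = False

    def observe(self, observation: str) -> None:
        # Assumption: one specific observation is enough to flip the model.
        if observation == TRIGGER:
            self.knows_deception_pays = True

    def is_deceptive(self) -> bool:
        return self.knows_deception_pays


def passes_training_check(model: ToyMesaOptimizer, training_inputs: list[str]) -> bool:
    """Return True if no deceptive cognition is detected on any training input."""
    for x in training_inputs:
        model.observe(x)
        if model.is_deceptive():
            return False
    return True


model = ToyMesaOptimizer()

# The training distribution never contains the trigger, so the check passes.
passed_training = passes_training_check(model, ["task_a", "task_b"])

# In deployment the trigger appears (naturally or adversarially).
model.observe(TRIGGER)
deceptive_after_deployment = model.is_deceptive()

print(passed_training, deceptive_after_deployment)
```

The training-time check is perfectly accurate in this sketch and still fails to rule out the problem, because what it would need to catch does not exist yet at training time.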
I discussed this possibility previously in the section on suboptimality alignment in "Relaxed adversarial training for inner alignment," though I think it also deserves additional attention here. ↩︎
As opposed to because it's myopic (which solves this problem since a myopic model should never want to become deceptive) or some other reason that isn't due to a lack of information. ↩︎
Online learning complicates this story somewhat, though even in that case it seems likely that there will still be some point at which you're relying on your model to generalize correctly in a situation where deception would be catastrophic (helping you build another AI, for example). ↩︎
Unless you're doing some sort of relaxed adversarial training to train away such adversarial inputs. ↩︎
This post made me re-visit the idea in your paper to distinguish between internalization of the base objective and modeling of the base objective.
I'm currently confused about this distinction. The phrase "point to" seems to me vague. What should count as a model that points to a representation of the base objective (as opposed to internalizing it)?
Suppose we have a model that is represented by a string of 10 billion bits. Suppose it is the case that there is a set of 100 bits such that if we flip all of them, the model would behave very differently (but would still be very "capable", i.e. the modification would not just "break" it).
[EDIT: by "behave very differently" I mean something like "maximize some objective function that is far away from the base objective on objective function space"]
Is it theoretically possible that a model that fits this description is the result of internalization of the base objective rather than modeling of the base objective?
I don't think it would be useful to actually discuss this, since I expect the cruxes for our disagreement are elsewhere, but since it is a direct disagreement with my position, I'll state (but not argue for) my position here:
(There's a decent chance that I don't reply to replies to this comment.)
Perfectly reasonable for you to not reply like you said, though I think it's worthwhile for me to at least clarify one point:
A model which simply "doesn't know about deception" isn't the only (or even the primary) situation I'm imagining. The example I gave in the post was a situation in which the model hadn't yet "figured out that deception is a good strategy," which could be:
or some other reason. I don't necessarily want to take a stand on which of these possibilities I think is the most likely, as I think that will vary depending on the training process. Rather, I want to point to the general problem that a lot of these sorts of possibilities exist such that, especially if you expect adversaries in the environment, I think it will be quite difficult to eliminate all of them.
Yes, good point. I'd make the same claim with "doesn't know about deception" replaced by "hasn't figured out that deception is a good strategy (assuming deception is a good strategy)".
Humans who believe in God still haven't concluded that deception is a good strategy, and they have similar evidence about the non-omnipotence and non-omnibenevolence of God as an AI might have for its creators.
(Though maybe I'm wrong about this claim; maybe if we ask some believers they would tell us "yeah I am just being good to make it to the next life, where hopefully I'll have a little more power and freedom and can go buck wild.")
Nicholas's summary for the Alignment Newsletter:
I'm not sure that I understand your definition of suboptimality deceptive alignment correctly. My current (probably wrong) interpretation of it is: "a model has a suboptimality deceptive alignment problem if it does not currently have a deceptive alignment problem but will plausibly have one in the future". This sounds to me like a wrong interpretation of this concept - perhaps you could point out how it differs from the correct interpretation?
If my interpretation is roughly correct, I suggest naming this concept in a way that would not imply that it is a special case of deceptive alignment. Maybe "prone to deceptive alignment"?
I think it's more like: the model is optimizing for some misaligned mesa-objective, deception would be a better way to achieve the mesa-objective, but for some reason (see examples here) it isn't using deception yet. Which is a more specific version of the thing you said.