Framings of Deceptive Alignment

leogao 4y 70

I like this post and generally agree with the decomposition into the two different types of deception. I prefer to call them outer and inner deception, because one happens when deception is incentivized by the base objective, and one happens when deception is incentivized by the mesaobjective. I thought about this a bit a while back and after talking to Evan about this I managed to be convinced that Goodhart/outer deception is a lot easier to tackle if you already have consequentialist/inner deception nailed down (not sure to what extent Evan endorses this and don't want to misattribute my misunderstanding to him). I should probably revisit some of this stuff and make a post about it.

As for the "where is the deception" thing, I don't really know if the distinction is that sharp. My intuition is that lots of systems in practice will be some weird mix of both, and I don't really expect it to be interpretable either way anyway (even for the pure "in the weights" case, you can have arbitrarily obfuscated versions of it that are computationally intractable to detect).

CallumMcDougall 4y 10

Yeah, I think this is Evan's view. This is from his research agenda (I'm guessing you might have already seen this given your comment, but I'll add it here for reference anyway in case others are interested):

"I suspect we can in fact design transparency metrics that are robust to Goodharting when the only optimization pressure being applied to them is coming from SGD, but cease to be robust if the model itself starts actively trying to trick them."

And I think his view on deception through inner optimisation pressure is that this is something we'll basically be powerless to deal with once it happens, so the only way to make sure it doesn't happen is to chart a safe path through model space which never enters the deceptive region in the first place.

CallumMcDougall 4y 30

Thanks for the post! I just wanted to clarify what concept you're pointing to with use of the word "deception".

From Evan's definition in RFLO, deception needs to involve some internal modelling of the base objective & training process, and instrumentally optimising for the base objective. He's clarified in other comments that he sees "deception" as only referring to inner alignment failures, not outer (because deception is defined in terms of the interaction between the model and the training process, without introducing humans into the picture). This doesn't include situations like the first one, where the reward function is too underspecified to produce the behaviour we want (although it does produce behaviour that looks like what we want, unless we peer under the hood).

To put it another way, it seems like the way deception is used here refers to the general situation where "AI has learnt to do something that humans will misunderstand / misinterpret, regardless of whether the AI actually has an internal representation of the base objective it's being trained on and the humans doing the training."

In this situation, I don't really know what the benefit is of putting these two scenarios into the same class, because they seem pretty different. My intuitions about this might be wrong though. Also I guess this is getting into the inner/outer alignment distinction which opens up quite a large can of worms!

peterbarnett 4y 30

Yeah, I totally agree. My motivation for writing the first section was that people use the word 'deception' to refer to both things, and then make what seem like incorrect inferences. For example, current ML systems do the 'Goodhart deception' thing, but then I've heard people use this to imply that they might be doing 'consequentialist deception'.

These two things seem close to unrelated, except for the fact that 'Goodhart deception' shows us that AI systems are capable of 'tricking' humans.

CallumMcDougall 4y 10

Okay I see, yep that makes sense to me (-:

evhub 2y Ω220

I think that your discussion of Goodhart deception is a bit confusing, since consequentialist deception is itself a type of Goodharting; it's just adversarial Goodhart rather than regressional/causal/extremal Goodhart.

Sam Marks 4y 20

I agree that the term "deception" conflates "deceptive behavior due to outer alignment failure" and "deceptive behavior due to inner alignment failure" and that this can be confusing! In fact, I made this same distinction recently in a thread discussing deceptive behavior from models trained via RL from human feedback.

By this I mean it does some amount of choosing its actions based on their effect on the future, for at least some inputs.
