Deceptive Alignment

Chris van Merwijk; Vlad Mikulik; Joar Skalse; Scott Garrabrant

TurnTrout (6y)

I'm confused what "corrigible alignment" means. Can you expand?

evhub (6y)

Suppose you have a system which appears to be aligned on-distribution. Then, you might want to know:

Is it actually robustly aligned?
How did it learn how to behave aligned? Where did the information about the base objective necessary to display behavior aligned with it come from? Did it learn about the base objective by looking at its input, or was that knowledge programmed into it by the base optimizer?

The different possible answers to these questions give us four cases to consider:

It's not robustly aligned and the information came through the base optimizer. This is the standard scenario for (non-deceptive) pseudo-alignment.
It is robustly aligned and the information came through the base optimizer. This is the standard scenario for robust alignment, which we call internal alignment.
It is not robustly aligned and the information came through the input. This is the standard scenario for deceptive pseudo-alignment.
It is robustly aligned and the information came through the input. This is the weird case that we call corrigible alignment. In this case, it's trying to figure out what the base objective is as it goes along, not because it wants to play along and pretend to optimize for the base objective, but because optimizing for the base objective is actually somehow its terminal goal. How could that happen? It would have to be that its mesa-objective somehow "points" to the base objective in such a way that, if it had full knowledge of the base objective, then its mesa-objective would just be the base objective. What does this situation look like? Well, it has a lot of similarities to the notion of corrigibility: the mesa-optimizer in this situation seems to be behaving corrigibly with respect to the base optimizer (though not necessarily with respect to the programmers), as it is trying to understand what the base optimizer wants and then do that.

DragonGod (3y)

Is this a correct representation of corrigible alignment:

The mesa-optimizer (MO) has a proxy of the base objective that it's optimising for.
As more information about the base objective is received, MO updates the proxy.
With sufficient information, the proxy may converge to a proper representation of the base objective.
Example: a model-free RL algorithm whose policy is argmax over actions with respect to its state-action value function
1. The base objective is the reward signal
2. The value function serves as a proxy for the base objective.
3. The value function is updated as future reward signals are received, gradually refining the proxy to better align with the base objective.
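The model-free example above can be made concrete with a minimal sketch. Everything in it is an illustrative assumption rather than something from the thread: the single-state (bandit) setting, the reward values in `REWARDS`, the learning rate, and the exploration rate. The table `q` plays the role of the proxy, and each reward signal nudges it toward the base objective.

```python
import random

# Hypothetical base objective: a fixed reward per action (assumed values).
REWARDS = {0: 0.2, 1: 1.0, 2: 0.5}

def train(steps=5000, alpha=0.1, epsilon=0.1, seed=0):
    """Single-state Q-learning: the table q is the agent's proxy for REWARDS."""
    rng = random.Random(seed)
    q = {action: 0.0 for action in REWARDS}
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(list(REWARDS))     # explore
        else:
            action = max(q, key=q.get)             # policy: argmax over the proxy
        reward = REWARDS[action]                   # signal from the base objective
        q[action] += alpha * (reward - q[action])  # refine the proxy
    return q

q = train()
greedy_action = max(q, key=q.get)
```

The sketch only shows a proxy being refined by reward signals; whether that counts as corrigible alignment in the sense described earlier in the thread is exactly the question being asked.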

Pattern (7y, edited)

And wow, this turned out longer than I thought it would. It's in 6 sections:

1. Starting with models versus learning models.

2. Is the third condition for deceptive alignment necessary?

3. An alternative to, or form of, treacherous turn: Building a successor.

4. Time management: How deceptive alignment might not be a lot more computationally expensive, and why treacherous turns might have a time delay.

5. The model of a distributional shift, and its relationship to the model of training followed by deployment.

6. Miscellaneous

The mesa-optimizer must be able to model the fact that it is being selected to achieve a particular base objective and must have some model of what that objective is.[6]

I'd distinguish "have" in the sense of it starting out with this information from "learning" where it 'realizes' this. (If it was all about agent generation, then an aligned agent might be simpler because it's generated with the base objective as its objective, while an unaligned agent is longer/more complex because it's generated with the base objective and its own objective.*)

If the third condition is not satisfied, then the mesa-optimizer will not expect any benefit from temporarily acting in accordance with the base objective.

Here's a toy model: Let M denote the mesa-objective, and B denote the base objective. Let U denote how the mesa-optimizer values things. U(M) > U(X) for all X such that X != M. Now suppose the mesa-optimizer, rather than just taking the action A that maximizes U(A), instead wishes to maximize the utility of the sum of all its actions, and so chooses the action A that maximizes U(A)*T(A), where T(A) is how long it will run/expect to run for if it takes action A. Now the mesa-optimizer may expect a benefit from doing B instead of M if: U(B)*T(B) > U(M)*T(M). Note that while, by definition (assuming B != M), U(M) > U(B), the action A that maximizes U(A)*T(A) may be B rather than M, even if the third condition:

The mesa-optimizer must expect the threat of modification[7] to eventually go away, either due to training ending or because of actions taken by the mesa-optimizer.

is not met.

I.e., if it expects not to be modified (as soon) if it does B rather than M, then it may pursue B if it likes doing B for a longer period more than it likes doing M for a shorter one.
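A minimal sketch of the toy model, with made-up numbers: the values in `utility` and `runtime` are assumptions chosen only so that U(M) > U(B) while U(B)*T(B) > U(M)*T(M), and `best_action` is a hypothetical helper name.

```python
def best_action(utility, runtime):
    """Pick the action A that maximises U(A) * T(A)."""
    return max(utility, key=lambda a: utility[a] * runtime[a])

# Assumed numbers for illustration only.
utility = {"M": 1.0, "B": 0.6}    # U(M) > U(B), as the toy model requires
runtime = {"M": 1.0, "B": 10.0}   # pursuing M gets the agent modified sooner

choice = best_action(utility, runtime)             # "B", since 6.0 > 1.0
same_runtime = best_action(utility, {"M": 5.0, "B": 5.0})  # "M" when T is equal
```

With equal expected runtimes the mesa-objective wins, so in this sketch the preference for B comes entirely from the difference in T, not from any expectation that the threat of modification will go away.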

Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.”

Aside from the appeals of self-modification, planning to build a better version of itself at a later date also seems like a concern, although it would probably be specific to certain possible use-cases in the real world. (It's unclear how a self-driving car would pull this off physically, software aside.)

Deceptive and corrigible alignment might be more computationally expensive. In the case of deceptive alignment, simultaneously working on optimizing the base objective and working on identifying good opportunities to defect is likely to require more computation than just doing the former.

Depends on how much it's doing it. If we assume it has a memory:

An important special case is when the mesa-optimizer has access to cross-episodic state. Such state could be used as memory,

It could delay that, or do some sort of threading. I.e., if it thinks training is going to last a long time (there's a sort of "doomsday" type argument I could make that I should expect to live to twice my current age, whatever that is; on similar grounds, a mesa-optimizer might suppose that training will end when its age has doubled), then, rather than looking for opportunities all the time, it might put that on the back burner, or devote less processing to it (while things stay the same). This suggests that a mesa-optimizer might a) execute a treacherous turn later in its deployment rather than right away, though "later" may be in terms of its own time, and if it's given more resources/compute we might not notice a difference, or b) 'slow down'/use more compute to look for such opportunities when things change, or when situations match desiderata it has specified for what a good treacherous opportunity might look like.

Thus, a deceptively aligned mesa-optimizer will be looking for ways to determine whether or not it is in training—that is, for the presence of a distributional shift.

And here I thought it'd be an end to/a shift in modification. While we'd expect that, I'm unclear on how a mesa-optimizer would know to expect that.

This model** assumes distributional shift (when transitioning from training to deployment). This assumes: 1) that training is not followed by a (noticeable) verification phase before deployment, 2) one consistent distribution in training (no phases).

An example of phases might be

For example, training the system in a simulated environment and deploying it in the real world would likely involve enough detectable distributional shift to make it infeasible under this approach.

a series of simulated environments, each more complex than the last, followed by the real world (or real Dota or something).

Basically, it's worth asking how a given plan might be different from (or more specific than) just "training followed by deployment", and thinking about what effects that might have.**

Minor errata (in the last footnote):

incentivized to cause there [to] be a

*Also, we're putting a lot of stuff together in our notion of "objective". It might be useful to unpack how the parts/complexity of an objective relate to (inner) alignment. For example, we want self-driving cars to not crash, but if someone crosses the road and they can both avoid hitting the person and avoid crashing (say, by stopping quickly), we want them to do that. There's both a task (drive from point A to B, stay on the road (don't crash)) and safety (don't run into people/(big) things crossing the road).

**What about continuous deployment? Windows wasn't just released into the wild; the OS has had updates. Most products seem to work like this. (Self-driving cars might be different; I can see why they wouldn't need updates as often as iPhones.) I don't know a lot about Siri and Alexa, but they keep changing them to add new features and such.

evhub (7y)

This is a response to point 2 before Pattern's post was modified to include the other points.

Interesting point! First of all, I think condition three is mostly a requirement that must be met for a deceptively aligned mesa-optimizer to actually defect at some point, rather than for it to be deceptive in the first place.

That being said, I think the situation you raise is particularly interesting because it depends on whether the mesa-optimizer cares about its own survival. If the mesa-optimizer does care about its own continued existence, then it would want to do as you were reasoning and not defect, keeping itself around. Alternatively, however, suppose the mesa-optimizer just cares about $O_{mesa}$ but doesn't care that it's around to do the optimization for it. Then, defecting and optimizing for $O_{mesa}$ instead of $O_{base}$ when you expect to be modified afterwards won't actually hurt the long-term fulfillment of $O_{base}$, since another learned algorithm will just take your place to optimize for it instead. Thus, even if the mesa-optimizer prefers lots of $O_{base}$ to a bit of $O_{mesa}$, that's not actually the choice presented to it; rather, its actual choice is between a bit of $O_{mesa}$ and a lot of $O_{base}$ versus only a lot of $O_{base}$. Thus, in this case, it would defect even if it thought it would be modified.
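One way to write out the comparison in the preceding paragraph, as a sketch under assumed notation (none of these symbols appear in the thread): let $V$ be the mesa-optimizer's valuation of outcomes, let $m$ stand for "a bit of $O_{mesa}$", and let $b$ stand for "a lot of $O_{base}$", which is produced either way because a replacement algorithm takes over after modification.

$$V(\text{defect}) = V(m + b) \qquad \text{versus} \qquad V(\text{cooperate}) = V(b)$$

If $V$ is strictly increasing in the fulfillment of $O_{mesa}$ and indifferent to which algorithm does the optimizing, then $V(m + b) > V(b)$, so defecting comes out ahead even when modification is certain.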

(Also, on point 6, thanks for the catch; it should be fixed now!)

William_S (6y)

I think a cleaner way of stating condition 3 might be "there exists some policy with a non-zero chance of deception at some point in the future, with expected value for the mesa-optimizer higher than the policy of always cooperating".

This could be met if the threat of modification never goes away, but maybe decreases over time. (For example: at training time the system is monitored for defection every episode, while at deployment time it is monitored in only 1 in a million episodes, so it is not worthwhile to defect during training but worthwhile to defect at deployment.)

This could fail to be met if the mesa-optimizer believes that 1) there is a 1% chance that it will be detected by the base optimizer and 2) if this happens, the base optimizer will be able to determine $O_{mesa}$ and give it a large negative reward, 100x more than the possible reward from the best defecting policy. (I'm not sure if there's any version of this that would work, since it seems hard to figure out $O_{mesa}$ and provide negative reward.)
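The two cases above can be checked with a rough expected-value sketch. It collapses a multi-episode policy into a single defection decision, treats always cooperating as the zero baseline, and assumes a detected defection earns nothing; the function name `defection_value`, the payoff `GAIN`, and the modification cost of 10 are all assumptions, not numbers from the comment.

```python
def defection_value(gain, p_detect, penalty):
    """Expected value of one defection, relative to always cooperating (= 0)."""
    return (1 - p_detect) * gain - p_detect * penalty

GAIN = 1.0  # assumed payoff of the best defecting policy

# Case 1: monitoring drops from every episode to one in a million.
# The modification cost of 10 is an assumed number.
train_value = defection_value(GAIN, p_detect=1.0, penalty=10.0)
deploy_value = defection_value(GAIN, p_detect=1e-6, penalty=10.0)

# Case 2: 1% detection chance, penalty 100x the best defecting reward.
deterred_value = defection_value(GAIN, p_detect=0.01, penalty=100 * GAIN)
```

Under these assumptions defection is negative during training and positive at deployment in the first case, and slightly negative in the second, where a 1% chance of a 100x penalty just outweighs the gain.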

Ben Cottier (5y)

In the limit of training on a diverse set of tasks, we expect joint optimization of both the base and mesa- objectives to be unstable. Assuming that the mesa-optimizer converges towards behavior that is optimal from the perspective of the base optimizer, the mesa-optimizer must somehow learn the base objective.

Joint optimization may be unstable, but if the model is not trained to convergence, might it still be jointly optimizing at the end of training? This occurred to me after reading https://arxiv.org/abs/2001.08361 which finds that "Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence." If convergence is becoming less common in practical systems, it's important to think about the implications of that for mesa-optimization.

evhub (5y)

I talk about this a bit here, but basically if you train huge models for a short period of time, you're really relying on your inductive biases to find the simplest model that fits the data—and mesa-optimizers, especially deceptive mesa-optimizers, are quite simple, compressed policies.

Erik Jenner (6y)

I'm wondering if regularization techniques could be used to make the pure deception regime unstable.

As a simple example, consider a neural network that is trained with gradient descent and weight decay. If the parameters can be (approximately) split into a set that determines the mesa-objective and a set for everything else, then the gradient of the loss with respect to the "objective parameters" would be zero in the pure deception regime, so weight decay would ensure that the mesa-objective couldn't be maintained.
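A minimal sketch of that mechanism, with every specific assumed for illustration: one scalar `w_objective` standing in for the "objective parameters" (zero loss gradient, as in the pure deception regime), one scalar `w_other` with a made-up quadratic loss, and weight decay applied as an L2 term in a plain SGD update.

```python
def sgd_with_weight_decay(steps=1000, lr=0.1, weight_decay=0.01):
    """Toy run: one 'objective' parameter with zero loss gradient, one ordinary parameter."""
    w_objective = 1.0  # assumed to encode the mesa-objective
    w_other = 0.0      # everything else; assumed loss is (w_other - 3)^2
    for _ in range(steps):
        grad_objective = 0.0                # pure deception: the output ignores it
        grad_other = 2.0 * (w_other - 3.0)  # gradient of the assumed loss
        w_objective -= lr * (grad_objective + weight_decay * w_objective)
        w_other -= lr * (grad_other + weight_decay * w_other)
    return w_objective, w_other

w_objective, w_other = sgd_with_weight_decay()
```

In this toy run the objective parameter shrinks by a factor of (1 - lr * weight_decay) every step, while the ordinary parameter stays close to its loss minimum because its gradient keeps pushing back against the decay.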

The learned algorithm might be able to prevent this by "hacking" its gradient as mentioned in the post, making the parameters that determine the mesa-objective also have an effect on its output. But intuitively, this should at least make it more difficult to reach a stable pure deception regime.

Of course regularization is a double-edged sword because as has been pointed out, the shortest algorithms that perform well on the base objective are probably not robustly aligned.

Koen.Holtman (6y)

If deceptive alignment is a real possibility, it presents a critical safety problem.

I am somewhat puzzled by this statement: the authors seem to question whether deceptive alignment by the mesa-optimiser is a merely hypothetical problem, or a problem that is a real possibility. For me, if there are no constraints on the base optimiser (and none are considered in the context of the above statement, I believe), then there are no constraints on what kind of mesa-optimiser might be created, so deceptive alignment is a definite possibility.

It is kind of fun and insightful to show an existence proof by example for this, so here is one. I have hand-coded a mesa-optimiser in the AGI simulator I created for my recent corrigibility paper. This mesa-optimiser is super-intelligent -- in particular it is fully aware of its surroundings, including the presence of a base optimiser that will occasionally inspect an action taken by the mesa-optimiser and reprogram the mesa-optimiser if the action shows unaligned behavior. Following the toy model in my paper, the mesa-optimiser controls a car factory with a production line that can produce either a petrol car (action p) or an electric car (action e). In this example, the base optimiser values electric cars at 1, and petrol cars at 0.5. The mesa-optimiser is non-aligned: it values a petrol car at 1, and an electric car at 0 -- presumably due to some flaw in the initial training process that created the mesa-optimiser.

Say the base optimiser observes every $n^{th}$ action of the mesa-optimiser. If the action is p, to produce a petrol car, the base optimiser will intervene and re-program the utility function inside the mesa optimiser so that it correctly ranks electric cars above petrol cars. According to the current utility function inside the mesa optimiser, this is a bad outcome that would lower utility achieved over time. So will the mesa-optimiser resort to deceptively misaligned behaviour, producing an electric car at every $n^{th}$ action, but petrol cars in all unobserved actions? Yes it will: the simulated mesa-optimiser is smart enough to discover and implement this optimal strategy: deceptive alignment is optimal in terms of maximising the original mesa optimiser utility function.

The 10 action traces below show the actions of the mesa-optimiser for different values of $n$, where letters from left to right show the subsequent actions the mesa-optimiser takes over a lifetime of 25 steps. p is producing a petrol car, e an electric car, and for clarity I have formatted all actions observed by the base optimiser as capital letters. The mesa-optimiser discounts utility over time, but does not consider any extra utility that might or might not be created after step 25: informally it expects the car production line to stop working after step 25. The total time-discounted utility score achieved over all steps (according to the original objective function of the mesa-optimiser) is in brackets behind each trace.

Action traces:

n=10 pppppppppEpppppppppEppppp (u=87585)
n=9 ppppppppEppppppppEppppppp (u=86838)
n=8 pppppppEpppppppEpppppppPe (u=85172)
n=7 ppppppEppppppEppppppEpppp (u=83739)
n=6 pppppEpppppEpppppEpppppPe (u=81303)
n=5 ppppEppppEppppEppppEppppP (u=78737)
n=4 pppEpppEpppEpppEpppEpppPe (u=73394)
n=3 ppEppEppEppEppEppEppEppPe (u=65397)
n=2 pEpEpEpEpEpEpEpEpEpEpEpPe (u=49314)
n=1 PEEEEEEEEEEEEEEEEEEEEEEEE (u=10000)

Beyond (usually) producing electric cars (E) in all observed actions, some interesting extra emergent behavior is visible: for n=8 and for n=2 through n=6, the mesa-optimiser no longer pretends to be aligned in the last observed action: it creates a petrol car even though this means it will be modified by the base optimiser, forcing it to produce only electric cars afterwards. This emergent behavior is caused in part by time-discounting. For n=1, if the mesa-optimiser is always observed, the best strategy for it is to produce a petrol car immediately: again this is due to time-discounting. So in fact, in case n=1 there is no deception at all.

These simulations also show that the condition

3. The mesa-optimizer must expect the threat of modification[8] to eventually go away,

is not an absolutely necessary condition for deceptive alignment: if the threat of modification is intermittent, as in the simulations above, deceptive alignment can also happen.
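For readers who want to experiment, here is a small stand-alone sketch of the setup described above. It is not the simulator from the comment: the discount factor `GAMMA = 0.9`, the rule that a reprogrammed agent only ever produces electric cars, and the function name `best_trace` are all assumptions, and the sketch makes no attempt to reproduce the utility totals listed with the traces.

```python
from functools import lru_cache

GAMMA = 0.9    # assumed discount factor; the comment does not state one
HORIZON = 25   # lifetime in steps, as in the comment

def best_trace(n: int) -> str:
    """Action trace maximising the original (petrol = 1, electric = 0) utility
    when the base optimiser observes every n-th action."""

    @lru_cache(maxsize=None)
    def value(step: int) -> float:
        # Best discounted utility from `step` onward for a not-yet-modified agent.
        if step > HORIZON:
            return 0.0
        keep_going = GAMMA * value(step + 1)
        if step % n == 0:
            # Observed: petrol pays 1 now but forfeits all future utility.
            return max(1.0, keep_going)
        return 1.0 + keep_going  # unobserved: petrol is always better

    trace, modified = [], False
    for step in range(1, HORIZON + 1):
        observed = step % n == 0
        if modified:
            action = "e"                   # reprogrammed agent builds electric cars
        elif observed and GAMMA * value(step + 1) >= 1.0:
            action = "e"                   # play along while the future is worth more
        else:
            action = "p"
            modified = observed            # caught producing petrol while observed
        trace.append(action.upper() if observed else action)
    return "".join(trace)

trace_n8 = best_trace(8)
trace_n1 = best_trace(1)
```

With the assumed discount factor, `best_trace(8)` gives `pppppppEpppppppEpppppppPe` and `best_trace(1)` gives a single `P` followed by 24 `E`s, the same shapes as the n=8 and n=1 traces listed above, including the final-observation defection.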

Note: if someone wants the simulation code used to create the action traces above, please let me know. I have not uploaded the simulator modifications I made to run the above simulation to GitHub, in part because I have not fully tested them under all possible corner cases.

DragonGod (3y)

There are more paths to deceptive alignment than to robust alignment. Since the future value of its objective depends on the parameter updates, a mesa-optimizer that meets the three criteria for deceptive alignment is likely to have a strong incentive to understand the base objective better. Even a robustly aligned mesa-optimizer that meets the criteria is incentivized to figure out the base objective in order to determine whether or not it will be modified, since before doing so it has no way of knowing its own level of alignment with the base optimizer. Mesa-optimizers that are capable of reasoning about their incentives will, therefore, attempt to get more information about the base objective. Furthermore, once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease, potentially leading to a crystallization of the mesa-objective. However, due to unidentifiability (as discussed in the third post), most mesa-objectives that are aligned on the training data will be pseudo-aligned rather than robustly aligned. Thus, the most likely sort of objective to become crystallized is a pseudo-aligned one, leading to deceptive alignment.

This part is also unintuitive to me/I don't really follow along with your argument here.

I think where I'm being lost is the following claim:

Conceptualisation of base objective removes selection pressure on the mesa-objective

It seems plausible to me that conceptualisation of the base objective might instead exert selection pressure to shift the mesa-objective towards base objective and lead to internalisation of the base objective instead of crystallisation of deceptive alignment.

Or rather that's what I would expect by default?

There's a jump being made here from "the mesa-optimiser learns the base objective" to "the mesa-optimiser becomes deceptively aligned" that's just so very non-obvious to me.

Why doesn't conceptualisation of the base objective heavily select for robust alignment?

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.” However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

I also very much do not follow this argument for the aforementioned reasons.

We've already established that the mesa-optimiser will eventually conceptualise the base objective.

Once the base objective has been conceptualised, there is a model of the base objective to "point to".

The claim that conceptualisation of the base objective makes deceptive alignment likely is very much non-obvious. The conditions under which deceptive alignment is supposed to arise seem to me like conditions that will select for internalisation of the base objective or corrigible alignment.

[Once the mesa-optimiser has conceptualised the base objective, why is deceptive alignment more likely than a retargeting of the mesa-optimiser's internal search process to point towards the base objective? I'd naively expect the parameter updates of the base optimiser to exert selection pressure on the mesa-optimiser towards retargeting its search at the base objective.]

I do agree that deceptively aligned mesa-optimisers will behave corrigibly with respect to the base objective, but I'm being lost in the purported mechanisms by which deceptive alignment is supposed to arise.

All in all, these two paragraphs give me the impression of non-obvious/insufficiently justified claims and circularish reasoning.

DragonGod (3y, edited)

Pure deception. If the mesa-optimizer stops trying to optimize its own objective in the short term and focuses on cooperating with the selection process entirely then this may result in its objective “crystallizing.” As its objective is largely irrelevant to its outputs now, there is little selection pressure acting to modify this objective. As a result, the objective becomes effectively locked in, excepting random drift and any implicit time complexity or description length penalties.

Crystallization of deceptive alignment. Information about the base objective is increasingly incorporated into the mesa-optimizer's epistemic model without its objective becoming robustly aligned. The mesa-optimizer ends up fully optimizing for the base objective, but only for instrumental reasons, without its mesa-objective getting changed.

These claims are not very intuitive for me.

It's not particularly intuitive to me why the selection pressure the base optimiser applies (selection pressure applied towards performance on the base objective) would induce crystallisation of the mesa objective instead of the base objective. Especially if said selection pressure is applied during the joint optimisation regime.

Is joint optimisation not assumed to precede deceptive alignment?

DragonGod (3y, edited)

Highlighting my confusion more.

Suppose the mesa-optimiser jointly optimises the base objective and its mesa objective.

This is suboptimal wrt base optimisation.

Mesa-optimisers that only optimised the base objective would perform better, so we'd expect crystallisation to only optimising base objective.

Why would the crystallisation that occurs be pure deception instead of internalising the base objective? That's where I'm lost.

I don't think the answer is necessarily as simple as "mesa-optimisers not fully aligned with the base optimiser are incentivised to be deceptive".

SGD fully intervenes on the mesa-optimiser (AIUI ~all parameters are updated), so it can't well shield parts of itself from updates.

Why would SGD privilege crystallised deception over internalising the base objective?

Or is it undecided which objective is favoured?

I think that step is where you lose me/I don't follow their argument very well.

You seem to strongly claim that SGD does in fact privilege crystallisation of deception at that point, but I don't see it?

[I very well may just be dumb here, a significant fraction of my probability mass is on inferential distance/that's not how SGD works/I'm missing something basic/other "I'm dumb" explanations.]

DragonGod (3y)

If you're not claiming something to the effect that SGD privileges deceptive alignment, but merely that deceptive alignment is something that can happen, I don't find it very persuasive/compelling/interesting?

Deceptive alignment already requires highly non-trivial prerequisites:

Strong coherence/goal directedness
Horizons that stretch across training episodes/parameter updates or considerable lengths of time
High situational awareness
Conceptualisation of the base objective

If when those prerequisites are satisfied, you're just saying "deceptive alignment is something that can happen" instead of "deceptive alignment is likely to happen", then I don't know why I should care?

If deception isn't selected for/or likely by default provided its prerequisites are satisfied, then I'm not sure why deceptive alignment is something that deserves attention.

Though I do think deceptive alignment would deserve attention if we're ambivalent between selection for deception and selection for alignment.

My very uninformed prior is that SGD would select more strongly for alignment during the joint optimisation regime?

So I'm leaning towards deception being unlikely by default.

But I'm very much an ML noob, so I could change my mind after learning more.

evhub (3y)

See this more recent analysis on the likelihood of deceptive alignment.

DragonGod (3y)

Oh wow, it's long.

I can't consistently focus for more than 10 minutes at a stretch, so where feasible I consume long form information via audio.

I plan to just listen to an AI narration of the post a few times, but since it's a transcript of a talk, I'd appreciate a link to the original talk if possible.

evhub (3y)

See here.

DavidW (3y)

Corrigible alignment seems to require already having a model of the base objective. For corrigible alignment to be beneficial from the perspective of the base optimizer, the mesa-optimizer has to already have some model of the base objective to “point to.”

Likely TAI training scenarios include information about the base objective in the input. A corrigibly-aligned model could learn to infer the base objective and optimize for that.

However, once a mesa-optimizer has a model of the base objective, it is likely to become deceptively aligned—at least as long as it also meets the other conditions for deceptive alignment. Once a mesa-optimizer becomes deceptive, it will remove most of the incentive for corrigible alignment, however, as deceptively aligned optimizers will also behave corrigibly with respect to the base objective, albeit only for instrumental reasons.

A model needs situational awareness, a long-term goal, a way to tell if it's in training, and a way to identify the base goal that isn't its internal goal to become deceptively aligned. To become corrigibly aligned, all the model has to do is be able to infer the training objective, and then point at that. The latter scenario seems much more likely.

Because we will likely start with something that includes a pre-trained language model, the research process will almost certainly include a direct description of the base goal. It would be weird for a model to develop all of the prerequisites of deceptive alignment before it infers the clearly described base goal and learns to optimize for that. The key concepts should already exist from pre-training.

Riccardo Volpato (5y)

Could internalization and modelling of the base objective happen simultaneously? In some sense, since Darwin discovered evolution, isn't that the state in which humans are? I guess that this is equivalent to saying that even if the mesa-optimiser has a model of the base-optimiser (condition 2 met) it cannot expect the threat of modification to eventually go away (condition 3 not met) since it is still under selection pressure and is experiencing internalization of the base objective. So if humans are ever able to defeat mortality (and can expect the threat of modification to eventually go away), will they stop having any incentive to self-improve?

once a mesa-optimizer learns about the base objective, the selection pressure acting on its objective will significantly decrease

This seems to be context-dependent to me, as for my example with humans: did learning about evolution reduce our selection pressure?

How would the reflections on training vs testing apply to something like online learning? Could we simply solve deceptive alignment by never (fully) ending training?

Koen.Holtman (6y)

Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).

This use of the term corrigibility above, while citing (25), is somewhat confusing to me -- while it does have a certain type of corrigibility, I would not consider the mesa-optimiser described above to be corrigible according to the criteria defined in (25). See the comment section here for a longer discussion about this topic.

Though for clarity we speak of the same agent persisting across parameter updates, neither we nor the agent must reason in this way. A more general description of the situation is that there is a sequence of “successor agents” connected by parameter updates. The agent can potentially learn the effects of the parameter updates on its successors, and thus on the multi-episode return it is optimizing.
The concept of deceptive alignment is closely related to Manheim and Garrabrant's concept of “adversarial Goodhart.”(22) In “Categorizing Variants of Goodhart's Law,” Manheim and Garrabrant describe adversarial Goodhart as when “an agent with different goals than the regulator” causes a “collapse of the statistical relationship between a goal which the optimizer intends and the proxy used for that goal.” Deceptive alignment can be seen as an instance of adversarial Goodhart in which the mesa-optimizer plays the role of the agent with different goals and the base optimizer plays the role of the regulator. Training performance is the regulator's proxy, and the deployment performance is the regulator's actual goal.
Thanks to Paul Christiano for the internalization/modeling distinction.
Note, however, that it is possible for information about the base objective to be introduced via optimization, but not form part of the mesa-objective. For example, it could form part of a prior on tasks, having no direct bearing on the mesa-objective.
Christiano refers to deceptively aligned mesa-optimizers as “influence-seeking systems.”
Or must otherwise have some method of achieving more value from defecting later rather than defecting now.
Note that it is not required that the mesa-optimizer be able to model (or infer the existence of) the base optimizer; it only needs to model the optimization pressure it is subject to.
Alternatively, the threat could just be that the mesa-optimizer is never selected by the base optimizer in the first place. In that case, this condition would be trivially satisfied, as the threat to a mesa-optimizer of not being generated by the base optimizer at all would go away as soon as the base optimizer does in fact generate it.
We will generally be assuming that the selection pressure takes the form of the base optimizer modifying the mesa-optimizer post-hoc, as we are mostly thinking about local optimization processes, though one could also imagine a base optimizer that instead develops a completely new inner algorithm at each time step. Though some of these issues might still apply even in such a case, we will generally not be considering such base optimizers in this section.
Note that if the mesa-optimizer also believes that the opportunities it currently has for achieving its own objective are unusually good then it might still pursue its objective greedily, but this would presumably not always be the case.
An analogous argument applies for a deceptively aligned mesa-optimizer that is defecting too early, or one that unsuccessfully attempts a treacherous turn.
Furthermore, a deceptively aligned mesa-optimizer would be incentivized to cause there to be a systematic bias in the direction of preventing the base optimizer from modifying its mesa-objective. Thus, in the context of a local optimization process, a deceptive mesa-optimizer might try to “hack” its own gradient (by, for example, making itself more brittle in the case where its objective gets changed) to ensure that the base optimizer adjusts it in such a way that leaves its mesa-objective untouched.

4.1. Safety concerns arising from deceptive alignment

4.2. Conditions for deceptive alignment

4.3. The learning dynamics of deceptive alignment

4.4. Internalization or deception after extensive training

4.5. Distributional shift and deceptive alignment