The order in which key properties emerge is important and often glossed over. 

Thanks to Wil Perkins, Grant Fleming, Thomas Larsen, Declan Nishiyama, and Frank McBride for feedback on this post. Any mistakes are my own. 

Note: I have now changed the second post in this sequence into a standalone post that incorporates the key points from this post. The comments here are valuable, so I'm leaving this post up, but I recommend going straight to the next post. 

This is the first post in a sequence about deceptive alignment. The second post describes my personal views about the likelihood of deceptive alignment for TAI. I’m separating the key considerations and editorial content so others can more independently update their own views. I intend this sequence to be a submission to the Open Philanthropy AI Worldviews Contest.

Deceptive alignment is a core part of many AI x-risk scenarios. You can find a highly cited, foundational walkthrough of the deceptive alignment argument here. I’m specifically discussing the concept of deceptive alignment described in that post, in which a proxy-aligned model becomes situationally aware and acts cooperatively in training so it can escape oversight and defect to pursue its proxy goals. There are several existing arguments for why this might be the default outcome for highly capable models. There are other ways a model could be manipulative or deceptive that are not covered in this sequence. 

In this post, I discuss four key precursors of deceptive alignment, which I will refer to in this sequence as foundational properties. I then argue that the order in which these foundational properties develop is crucial for estimating the likelihood that deceptive alignment will emerge for prosaic transformative AI (TAI).

In this sequence, I use the term “differential adversarial examples” to refer to adversarial examples in which a non-deceptive model will perform differently depending on whether it is aligned or proxy aligned. The deceptive alignment story assumes that differential adversarial examples exist. The model knows it’s being trained to do something out of line with its goals during training and plays along temporarily so it can defect later. That implies that differential adversarial examples exist in training.
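
To make this concrete, here is a purely illustrative sketch in Python. In practice we never get to query an aligned policy and a proxy-aligned policy side by side; the policies and inputs below are hypothetical stand-ins used only to pin down the definition.

```python
def is_differential_adversarial_example(x, aligned_policy, proxy_policy) -> bool:
    """True if two non-deceptive policies would act differently on input x."""
    return aligned_policy(x) != proxy_policy(x)

def find_differential_adversarial_examples(inputs, aligned_policy, proxy_policy):
    """Filter a dataset down to the inputs that distinguish the two policies."""
    return [x for x in inputs
            if is_differential_adversarial_example(x, aligned_policy, proxy_policy)]

# Toy usage with stand-in policies:
aligned = lambda x: "refuse" if x == "harmful request" else "comply"
proxy = lambda x: "comply"  # hypothetical proxy goal: always satisfy the user
print(find_differential_adversarial_examples(
    ["benign request", "harmful request"], aligned, proxy))
# -> ['harmful request']
```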

Foundational properties for deceptive alignment

There are four foundational properties that are explicitly or implicitly described as being relevant to the development of deceptive alignment. These foundational properties are:

  1. Goal-directed behavior - the model needs to have some goal or set of goals about the external world. 
  2. Optimizing across episodes/long-term goal horizons - the model needs to be willing to trade some of what it cares about now for more of it in the long term, including in different episodes.
  3. Conceptualization of the base goal - the model needs to understand what the goal is in order to successfully play along in training.
  4. Situational awareness - the model needs to understand that it is a model in training. Furthermore, it needs to understand how its actions now could affect its parameter updates, and therefore future states of the world. 

The order in which these develop, and the amount of training that goes on in between the development of foundational properties, will have a large effect on the likelihood of deceptive alignment. Most current analyses of deceptive alignment gloss over the development of foundational properties, which makes me suspicious that the alignment community is missing important considerations.

I describe foundational properties in more detail below:

Goal-Directed Behavior

Having a reasonably clear goal for states of the external world is a crucial property for the development of deceptively aligned AI. An internal goal like predicting the next token accurately would likely be insufficient. There are a few different levels of goal-directedness that gradient descent could find. This is all on a spectrum, but that doesn't change the implications. In the following list, each is intended to be a subset of the previous: 

  1. All models - the vast majority of these would be incoherent, arbitrary parameter sets that don’t have anything remotely like a goal.
  2. Models that show at least some coherent behavior toward external goals in at least some situations, but are otherwise largely random.
  3. Models that have somewhat coherent behavior/values relating to the external world, at least at the rough level of humans. Their values may be extremely difficult or impossible to describe fully in an explicit way, and don’t incentivize tiling the universe with anything. These values may involve non-consequentialist considerations. 
  4. Models that are maximizing an objective function, such that they would want to tile the universe with something. This kind of model is the most likely to take extreme actions to avoid changing its objective function. 

I expect each of these levels to be a very small subset of the previous bucket, such that it would take a lot of optimization power to progress from one to the next. It’s hard to imagine deceptive alignment emerging before the third level of goal-directed behavior, and it is most likely at the fourth level. The fourth level seems more likely to develop for applications with a narrow goal, such as optimizing profit, than for applications that try to imitate fuzzy things like human values. 

Optimizing over a long goal horizon and across episodes

This is a factor that has frequently been discussed elsewhere. If the model doesn’t care about its goals over long time horizons, or about how much reward it will get in future iterations where its current prompt and activations have been cleared, then it has no incentive to become deceptively aligned.

Conceptualization of the base goal

In order to be deceptive, the model needs to have some understanding of the difference between its proxies and the training objective. Otherwise, it won’t know how to target the training objective rather than its proxies during training.

Awareness that it is a model in training

In order for a model to be deceptively aligned, it has to understand that it is in a model training process. Furthermore, it has to conceptualize that it won’t be able to accomplish its current long-term goals if it allows the training process to change its gradients away from those goals. 

Ways the order of development of foundational properties might affect the likelihood of deceptive alignment

Before reading on, I encourage readers to think about the likely order of development of these foundational properties for TAI. This might help avoid motivated reasoning toward particular conclusions about how likely deceptive alignment is.

Conceptualization of the base goal vs. goal-directed behavior

If the model is able to conceptualize the base goal before it is significantly goal-directed, then deceptive alignment is unlikely. A key idea of deceptive alignment is that the model starts by learning a proxy that becomes its internal goal. The implication is that the model is optimizing hard for that one proxy in a coherent way. But if the model understands the base goal significantly before it has a mesa-objective, then, assuming differential adversarial examples appear in training, there will be a strong incentive to optimize for the base goal rather than a proxy. If an understanding of the base goal is already being computed, there is no reason to create and maintain a separate proxy goal when the model could just point at the base goal.

I don’t think this conclusion relies on the understanding of the base goal being especially complete and specific. Rather, the model would need a representation of what it is trying to do at a high level (e.g., “Satisfy what the human wants me to do within strict moral boundaries”). The details of what goes into that, such as the nature of the moral boundaries, probably don’t need to be perfect and could likely update over time. It might look more like a corrigibly aligned model than a robustly aligned model. 

If conceptualization of the base goal and goal-directed behavior emerge at essentially the same time, deceptive alignment also seems unlikely. If a model converged on a solid representation of the base goal, and became goal-directed in parallel, then differential adversarial examples would provide an incentive for gradient descent to optimize for the base goal. 

The most dangerous case is when the model develops a proxy objective before it understands the base goal. I think there’s virtually no chance of deceptive alignment if the model hasn’t reached the third level of goal-directedness described above, and I'm quite uncertain whether deceptive alignment would emerge in the vicinity of the third level. The fourth level is the most dangerous by far.

I’ve seen some arguments that rely on the idea that deceptive alignment will occur not in the parameters but in the activations. This doesn't change the conclusion of this section. If the model figured out what you wanted it to do in the activations before it became significantly goal-directed, gradients could point towards parameters that accomplish those objectives once they have been inferred. The model can only act on activations through its parameters, and the activations come from the inputs via the parameters.

Even if this all happened through a set of parameters that use very general reasoning, the model would likely have to have some way of representing what the training process wanted and would act on that based on the parameters. If it understood the base goal via the activations, gradient descent would have an incentive to make the parameters care more about doing that, assuming that the model has not yet become deceptive. 

Conceptualization of the base goal vs. optimizing across long-term goal horizons and situational awareness

If a model has long-term goal horizons, then it must be goal-directed. However, it’s possible for the model to develop foundational properties in the following order: 

  1. Goal-directed behavior 
  2. Conceptualization of the base goal
  3. Optimizing across episodes/long-term goal horizons

It’s also possible for the model to develop situational awareness in between goal-directed behavior and long-term, cross-episode goals. In either case, there will be a period of time when the model wants to optimize for a flawed proxy and understands the base goal but does not yet have the capabilities necessary for deceptive alignment. Because the model is not yet deceptive, differential adversarial examples in this part of training should directly modify the model’s internal objective. If this period lasts for a significant number of differential adversarial examples, the model should be able to update its goals to point at its conception of the base objective.  

The most dangerous case, which underpins the arguments described in the deceptive alignment literature, is if situational awareness and long-term, cross-episode goals develop before or around the same time as an understanding of the base goal. 

The argument presented in this section still holds if the model is gaining these properties from the activations. The moment at which the model can figure out what you’re trying to get it to do from the activations, the moment it can be prompted to have long-term goals, and the moment it can conceptualize from the activations that it is a model in training may not all happen at the same time.

Research directions that may be useful if these considerations are important

Adversarial examples from the start

Making sure to provide training inputs that make the model fail throughout the process seems very promising. These inputs should correlate with differential adversarial examples. These examples could speed up conceptualization of the base goal and delay goal-directed behavior. Both would make deceptive alignment less likely. 
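
A minimal sketch of what this could look like, assuming a curated pool of failure-inducing examples is available alongside the ordinary training data (the names and the 25% mixing fraction are hypothetical choices, not a claim about any particular pipeline):

```python
import random

ADVERSARIAL_FRACTION = 0.25  # hypothetical share of each batch from curated failure cases

def make_mixed_batch(ordinary_data, adversarial_pool, batch_size=32):
    """Build a batch that always contains adversarial examples, so they shape
    gradients from the very first update instead of being introduced only
    after the model may have settled on a proxy goal."""
    n_adv = min(int(batch_size * ADVERSARIAL_FRACTION), len(adversarial_pool))
    batch = random.sample(adversarial_pool, k=n_adv)
    batch += random.sample(ordinary_data, k=batch_size - n_adv)
    random.shuffle(batch)
    return batch

# Toy usage:
ordinary_data = [f"ordinary example {i}" for i in range(1000)]
adversarial_pool = [f"adversarial example {i}" for i in range(100)]
print(make_mixed_batch(ordinary_data, adversarial_pool, batch_size=8))
```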

Use non-maximizing targets like following human directions while adhering to key non-consequentialist ethical considerations

Non-maximization targets should make a model less likely to become an aggressive maximizer for a simple proxy and are probably closer to what we want anyway. Maximization is inherently dangerous. 
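
One narrow, toy way to gesture at the difference, assuming reward is derived from some proxy score: a satisficing objective stops rewarding further optimization past a "good enough" threshold, whereas a maximizing objective always rewards pushing the proxy higher. The threshold and scoring function here are hypothetical, and the post's actual suggestion (following directions subject to ethical constraints) is broader than reward clipping.

```python
GOOD_ENOUGH = 0.9  # hypothetical satisficing threshold on a proxy score in [0, 1]

def maximizing_objective(proxy_score: float) -> float:
    """Always rewards pushing the proxy higher, however extreme the behavior."""
    return proxy_score

def satisficing_objective(proxy_score: float) -> float:
    """Gives no extra reward beyond the threshold, so there is no incentive
    to take extreme actions for marginal proxy gains."""
    return min(proxy_score, GOOD_ENOUGH)

for score in (0.5, 0.9, 0.99):
    print(score, maximizing_objective(score), satisficing_objective(score))
```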

Find ways to get the model to understand the base goal better in earlier phases of the process 

In the same way a very sophisticated model may be able to gain situational awareness from the prompt, a less-sophisticated model may be able to gain a good understanding of the base goal from the prompt. Pushing the model to understand the base goal early in training by using tools like prompt engineering is a promising way to reduce the likelihood of deceptive alignment. 
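
A minimal sketch of this idea, assuming fine-tuning data stored as prompt/completion pairs; the goal statement, field names, and helper function are hypothetical illustrations:

```python
BASE_GOAL_STATEMENT = (
    "Your goal: follow the user's directions, subject to strict ethical "
    "boundaries. Refuse requests that would violate those boundaries."
)

def add_goal_prefix(example: dict) -> dict:
    """Return a copy of a fine-tuning example with the base goal stated up front,
    so even an early, less-capable model can read the goal off the prompt."""
    return {**example, "prompt": f"{BASE_GOAL_STATEMENT}\n\n{example['prompt']}"}

# Toy usage:
print(add_goal_prefix({"prompt": "Summarize this article.", "completion": "..."}))
```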

Delay optimization over long goal horizons and across episodes

If a model doesn't have long-term or cross-episode goals, it will have no reason to act deceptively. Delaying the development of such goals will reduce the likelihood of deceptive alignment.
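
One toy way to implement such a delay, assuming a reinforcement-learning setup with a discount factor: keep the agent effectively myopic (discount near zero, credit confined to the current episode) early in training, and only lengthen the horizon later, if at all. The schedule, function names, and numbers below are illustrative assumptions.

```python
def discount_factor(step: int, warmup_steps: int = 100_000,
                    early_gamma: float = 0.0, late_gamma: float = 0.95) -> float:
    """Myopic (gamma ~ 0) during early training; longer horizons only later."""
    return early_gamma if step < warmup_steps else late_gamma

def discounted_return(episode_rewards, gamma: float) -> float:
    """Discounted return within a single episode; gamma = 0 credits the agent
    only for immediate reward, and nothing crosses episode boundaries."""
    return sum(r * gamma ** t for t, r in enumerate(episode_rewards))

# Toy usage:
rewards = [1.0, 1.0, 1.0]
print(discounted_return(rewards, discount_factor(step=10)))       # 1.0 (myopic)
print(discounted_return(rewards, discount_factor(step=200_000)))  # ~2.85
```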

Conclusion

The model attributes described in this post and the order in which they appear are important for the likelihood of TAI becoming deceptively aligned. In particular, the model is much more likely to become deceptive if all of the following are true: 

  1. The model becomes significantly goal-directed before it understands what the goal of the training process is.
  2. The model develops situational awareness before or around the same time as it understands the base goal.
  3. The model develops long-term, cross-episode goals before or around the same time as it understands the base goal.

This insight yields important considerations for evaluating the likelihood of deceptive alignment and ideas for reducing the risk. 

Comments

I like this post and am a bit miffed that it isn't getting more responses.

OK, now a more substantive reply since I've gotten a chance to read more carefully. Comments as I read. Rambly:

1. Minor: I might quibble a bit with your distinction between models of type 3 and models of type 4. What I don't like is that you imply that humans tend to be mostly type 3 (with the exception, I presume, of hardcore utilitarians) and you also imply that type 3's are chill about value drift and not particularly interested in taking over the world. Maybe I'm reading too much between the lines but I'd say that if the AGIs we build are similar to humans in those metrics, humanity is in deep trouble. 

2. I like the point that if the model already has a good understanding of the base goal / base objective before it becomes goal-directed, SGD will probably just build a pointer to the base goal rather than building a pointer to a proxy and then letting instrumental convergence + deception do the rest. 

In a realistic training scenario though, the base goal will be misaligned, right? For example, in RLHF, there'll be biases and dogmas in the minds of the human data-providers, such that often they'll reward the model for lying or doing harmful things, and punish the model for telling the truth or doing something helpful. And while some of these errors will be noise, others will be systematically predictable. (And then there's the added complication of the reward model and the fact that there's a reward counter on a GPU somewhere.) So, suppose the model has an understanding of all of these things from pre-training, and then becomes agentic during fine-tuning, won't it probably end up with a goal like "maximize this number on these GPUs" or "Do what makes the reward model most light up" or "Do what gets high ratings from this group of humans" (sycophancy).

I guess this isn't an objection to your post, since deceptive alignment is (I think?) defined in such a way that this wouldn't count, even though the model would probably be lying to the humans and pretending to be aligned when it knows it isn't. (I guess if it was sufficiently myopic, and the humans were smart enough to give it lots of reward when it admitted to being misaligned, then it wouldn't lie about this. But this situation wouldn't persist long I think.)

3. Wait a minute. Why doesn't this happen in humans? Presumably the brain has some sort of SGD-like process for updating the synapses over time, that's how we learn. It's probably not exactly the same but still, couldn't you run the same argument, and get a prediction that e.g. if we taught our children neuroscience early on and told them about this reward circuitry in their brain, they'd grow up and go to college and live the rest of their life all for the sake of pursuing reward? (I guess something like this does happen sometimes; plenty of people seem to have their own happiness as their only final goal, and some people even seem to be fairly myopic about it. And I guess you could argue that humans become somewhat goal-directed in infancy, before they are smart enough to learn even the roughest pointer to happiness/reward/etc. But I don't think either of these responses is strong.)

4. What amount of understanding of the base goal is sufficient? What if the answer is "It has to be quite a lot, otherwise it's really just a proxy that appears superficially similar to the base goal?" In that case the classic arguments for deceptive alignment would work fine.

I think the crux lies somewhere around here. Maybe a thing to investigate is: How complicated is the circuitry for "X, whatever that turns out to mean" compared to the circuitry for X itself? For example: Let X = "reward over the next hour or so" and X' = "Time-discounted reward with discount rate R, for [name of particular big model on particular date] as defined in page 27 of [textbook on ML]." 

X' is a precise, more fleshed-out and well-defined concept than X. But maybe it's in some sense 'what X turns out to mean.' In other words there's some learning process, some process of conceptual refinement, that starts with X and ends with X'. And the model more or less faithfully carries out this process. And at first when the model is young and dumb it has the concept of X but not the concept of X', and that's the situation it's in when it starts to form coherent goals, and then later when it's much smarter it'll have morphed X into X'. And when the model is young and dumb and X is its goal, it doesn't pursue X in ways that would get in the way of this learning/morphing process. It doesn't want to "lock in" X in any sense.

If this turns out to be a fairly elegant/simple/straightforward way for a mind to work, then great, I think your overall story is pretty plausible. But what if it's messy and complicated to get something like this? Then at some point the model forms goals, and they'll look something like X rather than like X', and then they'll keep X as their goal forever (and/or maybe it'll be precisified in some way that is different from the 'intended' way that a human would precisify it, which amounts to the same thing because it means that unless the model has a perfect understanding of the base objective when it first develops goals, it'll probably never have a goal which is the same as the base objective.)

5. I'm not sure I agree with your conclusion about the importance of fuzzy, complicated targets. Naively I'd expect that makes it harder, because it makes simple proxies look relatively good by comparison to the target. I think you should flesh out your argument more.

Thanks for your thoughtful reply! I really appreciate it. I’m starting with your fourth point because I agree it is closest to the crux of our disagreement, and this has become a very long comment. 

#4:

What amount of understanding of the base goal is sufficient? What if the answer is "It has to be quite a lot, otherwise it's really just a proxy that appears superficially similar to the base goal?" In that case the classic arguments for deceptive alignment would work fine.

TL;DR: the model doesn’t have to explicitly represent “X, whatever that turns out to mean”; it just has to point at its best estimate of X', and that estimate will update over time because the model doesn’t know there’s a difference.

I propose that the relevant factor here is whether the model’s internal goal is the closest thing it has to a representation (X) of the training goal (X'). I am assuming that models will store their goal information and decision parameters in the later layers, with world modeling happening overwhelmingly before decision-making, because it doesn’t make much sense for a model to waste time on world modeling (or anything else) after it has made its decision. I expect the proxy to be calculated from high-level concepts in the world model, not separately from the world model.

Suppose for the sake of argument that we have a model with an exceptionally long-term goal and situational awareness. However, its internal goal is its flawed representation (X) of the training goal (X'). This model can’t tell the difference between the training goal and the internal proxy goal, so it can’t be deceptively aligned yet. If it performs worse than it could have on a training example because of this, the gradients could:

  1. Update the existing proxy goal to be closer to the training goal, or 
  2. Create a new concept for the training goal and become deceptively aligned.

Updating the existing goal will be a much simpler and more likely fix than creating an entirely new concept and switching the goal to point at that new concept. That new concept would have to be a better representation of the training goal on the current training example than the existing concept in order to enable deceptive alignment, and therefore better training performance. So, I wouldn’t expect a hyper-local update to be sufficient to make the model deceptively aligned on its own. The gradients therefore have no reason to point toward creating this new concept. On the other hand, tweaks to improve the existing understanding of the training goal would improve performance immediately.

For example, let’s say we are training this model to follow directions from the prompt, unless they violate ethical norms (X'). The model understands that the training goal is to follow directions subject to some ethical norms but does not understand those norms well enough to apply them correctly in all situations (X). Because it doesn’t understand the difference, it can’t act aligned for instrumental purposes, and it will get suboptimal reward when flaws in its understanding of the training goal affect its behavior. When it messes up in this way, the gradients should point toward improving the internal goal (X) to better match the training goal (X'). On the other hand, a hyper-local move toward a separate, more accurate concept of the training goal would be insufficient to enable deception on its own and will therefore not be learned by gradient descent.

Now consider an alternate version of the above example, where the model has the same flawed concept for the training goal, but its internal goal is maximizing paperclips (or anything else that’s represented by a separate concept from its base goal representation). This model can immediately become deceptively aligned, because optimizing for its flawed understanding of the training goal will get better training performance than optimizing for paperclips in the short run. The model can then continue learning about the training goal without systematically changing its internal goals and remain deceptively aligned. 

A model that is pre-trained on approximately the whole internet should start with concepts relevant to understanding the training goal. It would be a bit weird if such a pre-trained model did not have a solid, but imperfect, understanding of following directions and key ethical concepts. Early, short-term reward training should be able to point at those and refine the resulting concept. This should be the closest concept to the training goal, so it should fit better with my first example than my second. This would make deceptive alignment very unlikely. 

Other than direct reward optimizers, I have trouble imagining what alternate proxy concept would be correlated enough with following directions subject to ethical considerations that it would be the internal goal late enough in the process for the model to have a long-term goal and situational awareness. Can you think of one? Having a more realistic idea for a proxy goal might make this discussion more concrete. 

1. Minor: I might quibble a bit with your distinction between models of type 3 and models of type 4. What I don't like is that you imply that humans tend to be mostly type 3 (with the exception, I presume, of hardcore utilitarians) and you also imply that type 3's are chill about value drift and not particularly interested in taking over the world. Maybe I'm reading too much between the lines, but I'd say that if the AGIs we build are similar to humans in those metrics, humanity is in deep trouble. 

Interesting. I think the vast majority of humans are more like satisficers than optimizers. Perhaps that describes what I’m getting at in bucket 3 better than fuzzy targets. As mentioned in the post, I think level 4 here is the most dangerous, but 3 could still result in deceptive alignment if the foundational properties developed in the order described in this post. I agree this is a minor point, and don’t think it’s central to any disagreements. See also my answer to your fifth point, which has prompted an update to my post.

#2: 

I guess this isn't an objection to your post, since deceptive alignment is (I think?) defined in such a way that this wouldn't count, even though the model would probably be lying to the humans and pretending to be aligned when it knows it isn't.

Yeah, I’m only talking about deceptive alignment and want to stay focused on that in this sequence. I’m not arguing against all AI x-risk. 

#3: 

Presumably the brain has some sort of SGD-like process for updating the synapses over time, that's how we learn. It's probably not exactly the same but still, couldn't you run the same argument, and get a prediction that e.g., if we taught our children neuroscience early on and told them about this reward circuitry in their brain, they'd grow up and go to college and live the rest of their life all for the sake of pursuing reward?

We know how the gradient descent mechanism works, because we wrote the code for that. 

We don’t know how the mechanism for human value learning works. The idea that observed human value learning doesn’t match up with how gradient descent works is evidence that gradient descent is a bad analogy for human learning, not that we misunderstand the high-level mechanism for gradient descent. If gradient descent were a good way to understand human learning, we would be able to predict changes in observed human values by reasoning about the training process and how reward updates. But accurately predicting human behavior is much harder than that. If you try to change another person’s mind about their values, they will often resist your attempts openly and stick to their guns. Persuasion is generally difficult and not straightforward. 

In a comment on my other post, you make an analogy between gradient descent and evolution. Evolution and individual human learning are extremely different processes. How could they both be relevant analogies? For what it’s worth, I think they’re both poor analogies.

If the analogy between gradient descent and human learning were useful, I’d expect to be able to describe which characteristics of human value learning correspond to each part of the training process. For hypothetical TAI in fine-tuning, here’s the training setup: 

  1. Training goal: following directions subject to ethical considerations. 
  2. Reward: some sort of human (or AI) feedback on the quality of outputs. Gradient descent makes updates on this in a roughly deterministic way. 
  3. Prompt: the model will also have some sort of prompt describing the training goal, and pre-training will provide the necessary concepts to make use of this information. 

But I find the training set-up for human value learning much more complicated and harder to describe in this way. What is the high-level training setup? What’s the training goal? What’s the reward? It’s my impression that when people change their minds about things, it’s often mediated by key factors like persuasive argument, personality traits, and social proof. Reward circuitry probably is involved somehow, but it seems vastly more complicated than that. Human values are also incredibly messy and poorly defined. 

Even if gradient descent were a good analogy, the way we raise children is very different from how we train ML models. ML training is much more structured and carefully planned with a clear reward signal. It seems like people learn values more from observing and talking to others. 

If human learning were similar to gradient descent, how would you explain that some people read about effective altruism (or any other philosophy) and quickly change their values? This seems like a very different process from gradient descent, and it’s not clear to me what the reward signal parallel would be in this case. To some extent, we seem to decide what our values are, and that probably makes sense for a social species from an evolutionary perspective. 

It seems like this discussion would benefit if we consulted someone with expertise on human value formation. 

#5:

I'm not sure I agree with your conclusion about the importance of fuzzy, complicated targets. Naively I'd expect that makes it harder, because it makes simple proxies look relatively good by comparison to the target. I think you should flesh out your argument more.

Yeah, that’s reasonable. Thanks for pointing it out. This is a holdover from an argument that I removed from my second post before publishing because I no longer endorse it. A better argument is probably about satisficing targets instead of optimizing targets, but I think this is mostly a distraction at this point. I replaced "fuzzy targets" with "non-maximization targets". 

I agree that exactly what ordering we get for the various relevant properties is likely to be very important in at least the high path-dependence view; see e.g. my discussion of sequencing effects here.


Thanks for pointing that out! My goal is to highlight that there are at least 3 different sequencing factors necessary for deceptive alignment to emerge: 

  1. Goal directedness coming before an understanding of the base goal
  2. Long-term goals coming before or around the same time as an understanding of the base goal
  3. Situational awareness coming before or around the same time as an understanding of the base goal

The post you linked to talked about the importance of sequencing for #3, but it seems to assume that goal directedness will come first (#1) without discussion of sequencing. Long-term goals (#2) are described as happening as a result of an inductive bias toward deceptive alignment, and sequencing is not highlighted for that property. Please let me know if I missed anything in your post, and apologies in advance if that’s the case. 

Do you agree that these three property development orders are necessary for deception? 

I found this post fascinating. Although I myself am not engaged in the AI x-risk space, the way the information was delivered skillfully explored some of my key discomforts with the seemingly high risk of deceptive alignment.

The model knows it’s being trained to do something out of line with its goals during training and plays along temporarily so it can defect later. That implies that differential adversarial examples exist in training.

I don't think this implication is deductively valid; I don't think the premise entails the conclusion. Can you elaborate?

I think this post's argument relies on that conclusion, along with an additional assumption that seems questionable: that it's fairly easy to build an adversarial training setup that distinguishes the design objective from all other undesirable objectives that the model might develop during training; in other words, that the relevant differential adversarial examples are fairly easy for humans to engineer.

In the deceptive alignment story, the model wants to take action A, because its goal is misaligned, but chooses to take apparently aligned action B to avoid overseers noticing that it is misaligned. In other words, in the absence of deceptive tendencies, the model would take action A rather than the action B that overseers wanted, which would identify it as a misaligned model. That's the definition of a differential adversarial example.

If there were an unaligned model with no differential adversarial examples in training, that would be an example of a perfect proxy, not deceptive alignment. That's outside the scope of this post. But also, if the goal were to follow directions subject to ethical constraints, what would that perfect proxy be? What would result in the same actions across a diverse training set? It seems unlikely that you'd get even a near-perfect proxy here. And even if you did get something fairly close, the model would understand the necessary concepts for the base goal at the beginning of reinforcement learning, so why wouldn't it just learn to care about that? Setting up a diverse training environment seems likely to be a training strategy by default.

If the model is able to conceptualize the base goal before it is significantly goal-directed, then deceptive alignment is unlikely.

I am totally baffled by the fact that nobody has pointed out that this is totally wrong.

Your model can have a perfect representation of the goal in its "world-model" module, but not in its "approve plan based on world-model prediction" module. In Humean style, "what should be" doesn't follow from "what is".

I.e., you conflate two different possible representations of a goal: a representation that answers questions about the outside world, like "what will happen next in this training session" or "what the humans are trying to achieve", and a representation in the goal-directedness system, like "what X I should maximize in the world", or, without the "should", "what the best approximation of the reward function is, given the history of rewards".

It would be nice to have an architecture that can a) locate a concept in the world model, and b) fire the reward circuitry iff the world model sees this concept in a possible future, but that is definitely a helluva lot of work in interpretability and ML-model design.

The claim is that, given the presence of differential adversarial examples, the optimisation process would adjust the parameters of the model such that its optimisation target is the base goal.

A key idea of deceptive alignment is that the model starts by learning a proxy that becomes its internal goal.

One of my pictures of deceptive alignment goes like this:

  • The learner fairly quickly develops a decent representation of the actual goal, world model etc. and pursues this goal
  • The world model gets better and better in order to improve the learner's loss
  • When the learner has a really excellent world model that can make long range predictions and so forth - good enough that it can reason itself into playing the training game for a wide class of long-term goals - then we get a large class of objectives that achieve losses just as good as if not better than the initial representation of the goal
  • When this happens, gradients derived from regularisation and/or loss may push the learner's objective towards one of these problematic alternatives. Also, because the initial "pretty good" goal is not a long-range one (because it developed when the world model was not so good), it doesn't necessarily steer the learner away from possibilities like this

I don't think the argument presented here really addresses this picture.

If we're training a learner with some variant of human feedback on diverse tasks (HFDT), with a training goal of shaping the model into one honestly trying to complete the tasks as the human overseers intend, then the actual goal is a long-range goal. So if, as you say, the learner quickly develops "a decent representation of the actual goal", then that long-term goal representation drives the learner to use its growing capabilities to make decisions downstream of that goal, which steers the model away from alternative long-term goals. It doesn't particularly matter that there are other possible long-term goals that would have equal or greater training performance, because they aren't locally accessible via paths compatible with the model's existing goal, and that goal provides no reason to steer differentially towards those goal modifications.

I don't think it's obvious that a learner which initially learns to pursue the right thing within each episode is also likely to learn to steer itself towards doing the right thing across episodes.

I wasn't assuming we were working in an episodic context. But if we are, then the agent isn't getting differential feedback on its across-episode generalization behavior, so there is no reason for the model to develop dominant policy circuits that respond to episode boundaries (any more than ones that respond to some random perceptual bitvector).

Let me try again. It initially develops a reasonable goal because it doesn't yet know a) that playing the training game is instrumentally correct for a bunch of other goals and/or b) how to play the training game in a way that is instrumentally correct for these other goals.

By the same assumptions, it seems reasonable to speculate that it doesn't know a) allowing itself to change in certain ways from gradient updates would be un-instrumental and/or b) how to prevent itself from changing in these ways. Furthermore it’s not obvious that the “reasonable goal” it learns will incentivise figuring this out later.

In my model, having a goal of X means the agent's policy circuits are differentially sensitive to (a.k.a. care about) the features in the world that it recognizes as relevant to X (as represented in the agent's own ontology), because that recognition was selected for in the past. If it has a "reasonable goal" of X, then it doesn't necessarily want to "play the training game" instrumentally for an alternative goal Y, at least not along dimensions where Y diverges from X, even if it knows how to do so and Y is another (more) highly-rewarded goal. If it does "play the training game", it wants to play the training game in the service of X.

If the agent never gets updated again, its policy circuits stay as they are, so its goal remains fixed at X. Otherwise, since we know X-seeking is a rewarded strategy (otherwise, how did we entrain X in the first place?), and since its existing policy is X-seeking, new updates will continue to flow from rewards that it gets from trying to seek X (modulo random drift from accidental rewards). By default[1] those updates move it around within the basin of X-seeking policies, rather than somehow teleporting it into another basin that has high reward for X-unrelated reasons. So I think that if we got a reasonable X goal early on, the goal will tend to stay X or something encompassing X. Late in training, once the agent is situationally-aware, it can start explicitly reasoning about what actions will keep its current "reasonable" goal intact.

  1. ^

    To give an example, if my thoughts have historically been primarily shaped by rewards flowing from having "eating apples" as my goal, then by default I'm going to persist in apple-seeking and apple-eating behavior/cognition (unless some event strongly and differentially downweights that cognition, like getting horribly sick from a bad apple), which will tend to steer me towards more apples and apple-based reinforcement events, which will further solidify this goal in me.

The learner fairly quickly develops a decent representation of the actual goal, world model etc. and pursues this goal

Wouldn’t you expect decision making, and therefore goals, to be in the final layers of the model? If so, they will calculate the goal based on high-level world model neurons. If the world model improves, those high-level abstractions will also improve. The goal circuitry doesn’t have to model the goal from scratch, because it is connected to the world model. 

When the learner has a really excellent world model that can make long range predictions and so forth - good enough that it can reason itself into playing the training game for a wide class of long-term goals - then we get a large class of objectives that achieve losses just as good as if not better than the initial representation of the goal

Even if the model is sophisticated enough to make long-range predictions, it still has to care about the long run for it to have an incentive to play the training game. Long-term goals are addressed extensively in this post and the next.

When this happens, gradients derived from regularisation and/or loss may push the learner's objective towards one of these problematic alternatives. 

Suppose we have a model with a sufficiently aligned goal A. I will also denote an unaligned goal as U, and instrumental training reward optimization S. It sounds like your idea is that S gets better training performance than directly pursuing A, so the model should switch its goal to U so it can play the training game and get better performance. But if S gets better training performance than A, then the model doesn’t need to switch its goal to play the training game. It’s already instrumentally valuable. Why would it switch?

Also, because the initial "pretty good" goal is not a long-range one (because it developed when the world model was not so good), it doesn't necessarily steer the learner away from possibilities like this

Wouldn’t the initial goal continue to update over time? Why would it build a second goal instead of making improvements to the original? 

I agree with something like this, though I think you're too optimistic w/r/t deceptive alignment being highly unlikely if the model understands the base objective before getting a goal.  If the model is sufficiently good at deception, there will be few to no differential adversarial examples.  Thus, while gradient descent might have a slight preference for pointing the model at the base objective over a misaligned objective induced by the small number of adversarial examples, the vastly larger number of misaligned goals suggests to me that it is at least plausible that the model learns to pursue a misaligned objective even if it gains an understanding of the base objective before it becomes goal directed because the inductive biases of SGD favor models that are more numerous in parameter space.

If the model is sufficiently good at deception, there will be few to no differential adversarial examples.

We're talking about an intermediate model with an understanding of the base objective but no goal. If the model doesn’t have a goal yet, then it definitely doesn’t have a long-term goal, so it can’t yet be deceptively aligned. 

Also, at this stage of the process, the model doesn't have goals yet, so the number of differential adversarial examples is unique for each potential proxy goal. 

the vastly larger number of misaligned goals

I agree that there’s a vastly larger number of possible misaligned goals, but because we are talking about a model that is not yet deceptive, the vast majority of those misaligned goals would have a huge number of differential adversarial examples. If training involved a general goal, then I wouldn’t expect many, if any, proxies to have a small number of differential adversarial examples in the absence of deceptive alignment. Would you?