Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The following is an edited transcript of a talk I gave. I have given this talk at multiple places, including first at Anthropic and then for ELK winners and at Redwood Research, though the version that this document is based on is the version I gave to SERI MATS fellows. Thanks to Jonathan Ng, Ryan Kidd, and others for help transcribing that talk. Substantial edits were done on top of the transcription by me. Though all slides are embedded below, the full slide deck is also available here.

Today I’m going to be talking about deceptive alignment. Deceptive alignment is something I'm very concerned about and is where I think most of the existential risk from AI comes from. And I'm going to try to make the case for why I think that this is the default outcome of machine learning.

slide 2

First of all, what am I talking about? I want to disambiguate between two closely related but distinct concepts. The first concept is dishonesty. This is something that many people are concerned about in models: you could have a model that lies to you, where it knows one thing, but the thing it tells you is different from that. This happens all the time with current language models—we can, for example, ask them to write the correct implementation of some function, but if they've seen humans make some particular bug over and over again, then even if in some sense they know how to write the right function, they're going to reproduce that bug. And so this is an example of a situation where the model knows how to solve something and nevertheless lies to you. This is not what I'm talking about. This is a distinct failure mode. The thing that I want to talk about is deceptive alignment, which is, in some sense, a subset of dishonesty, but it's a very particular situation.

slide 3

So deceptive alignment is a situation where the reason that your model looks aligned on the training data is that it is actively trying to look aligned for instrumental reasons, which is very distinct. This is a situation where the thing that is causing your model to have good performance is that it is trying to game the training data: it actively has a reason that it wants to stick around in training, and so it's trying to get good performance in training for the purpose of sticking around.

slide 4

Ajeya Cotra has a really good analogy here that I think is helpful for understanding the difference between these two classes. So you can imagine that you are a child and you've inherited a massive business, and you have to determine who's going to run the business for you. There are a bunch of candidates that you're trying to evaluate, and those candidates fall into three categories. You have the saints, who really just want to help you, run things effectively, and accomplish what you want. You have the sycophants, who want to make you happy, satisfy the letter of your instructions, and make it so that the business looks like it's doing well from your perspective, but don't actually want to fundamentally help you. And you have the schemers, people who want to use control of the business for their own purposes, and are only trying to get control of it, pretending that they're doing the right thing so that they can eventually get something later. For our purposes, we're concerned primarily with the schemers, and that is the deceptive alignment category.

So I would say in this situation that the sycophants are examples of dishonesty: they would tell you a bunch of false facts about what was happening to convince you that things were going well, but they don't have some ulterior motive. The schemers have some ulterior motive, something that they want to accomplish, and they're actively trying to look like they're doing the right thing in training to accomplish that. Okay, so this is what we're concerned about: we're concerned specifically about the schemers, the deceptively aligned models, models where the reason they look aligned is that they're trying to game the training signal.

slide 5

Okay, so the question we want to answer is, “how likely is that in practice?” So we have this concept of, maybe the model will try to game the training signal, maybe it will try to pretend to do something in training so that it can eventually do something else in the real world. But we don't know how likely that is as an actual thing that you would end up with if you ran an actual machine learning training process.

And the problem here is that the deceptively aligned model, the model that is pretending to do the right thing so that it can be selected by the training process, is behaviorally indistinguishable during training from the robustly aligned model, the saint model, the model that is actually trying to do the right thing. The deceptively aligned model is going to look like it’s actually trying to do the right thing during training, because that's what it's trying to do. It is actively trying to look like it's doing the right thing as much as it possibly can in training. And so in training, you cannot tell the difference only by looking at their behavior.

And so if we want to understand which one we're going to get, we have to look at the inductive biases of the training process. If you're familiar with machine learning, any situation where we want to understand which of multiple different possible models, all behaviorally indistinguishable, we will get is a question of inductive biases. And Ajeya also has another good example here.

slide 6

Suppose I take a model and I train it on blue shapes that look like that shape on the left, and red shapes that look like that shape on the right, and we label these as two different classes. And then we move to a situation where we have the same shapes with swapped colors, and we want to know: how is it going to generalize? And the answer is that there are two generalizations here: it could learn to generalize based on color, or it could learn to generalize based on shape. Which one we get is just a question of which one is simpler and easier for gradient descent to implement and which one is preferred by the inductive biases. They both do equivalently well in training, but one of them is consistently the one that gradient descent finds, which in this situation is the color detector.
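To make the flavor of this concrete, here's a minimal toy sketch (my own illustration, not the actual experiment from the slide): every training input has a simple "color-like" feature and a more complex "shape-like" feature that always agree during training, and we check which rule the trained network follows once the two features are made to disagree. Which rule wins is exactly the kind of thing the inductive biases determine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the color-vs-shape experiment: bit 0 plays the role of the
# simple "color" feature, and the parity of bits 1-3 plays the role of the more
# complex "shape" feature. During training the two always agree with the label,
# so both rules fit the training data perfectly.
def make_batch(n, agree=True):
    bits = torch.randint(0, 2, (n, 4)).float()
    parity = bits[:, 1:].sum(dim=1) % 2
    bits[:, 0] = parity if agree else 1 - parity
    return bits, parity

model = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)

X, y = make_batch(2048, agree=True)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.binary_cross_entropy_with_logits(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()

# At test time the "color" feature and the "shape" feature point in opposite
# directions, so the two possible generalizations give opposite answers.
X_test, parity_test = make_batch(2048, agree=False)
preds = (model(X_test).squeeze(-1) > 0).float()
print("agreement with the color rule:", (preds == X_test[:, 0]).float().mean().item())
print("agreement with the shape rule:", (preds == parity_test).float().mean().item())
```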

Okay, so if we want to understand how likely deceptive alignment is, we have to do this same sort of analysis. We have to know which one of these gradient descent is generally going to find—when we ask it to solve some complex task, are we going to find the deceptive one, or are we going to find the non-deceptive one?

slide 7

Okay, so the problem, at least from my perspective, trying to do this analysis, is that we don't understand machine learning (ML) inductive biases very well, they're actually really confusing. We just don’t have very much information about how they operate.

And so what I'm going to do is I'm going to pick two different stories that I think are plausible for what ML inductive biases might look like, that are based on my view of the current slate of empirical evidence that we have available on ML inductive biases. And so we're going to look at the likelihood of deception under each of these two different scenarios independently, which just represent two different ways that the inductive biases of machine learning systems could work. So the first is the high path dependence world. And the second is the low path dependence world. So what do I mean by that?

slide 8

Okay, so first: high path dependence. In a world of high path dependence, the idea is that different training runs can converge to very different models, depending on the particular path that you take through model space. So in the high path dependence world, the correct way to think about the inductive biases in machine learning is to think: well, we have to understand particular paths that your model might take through model space—maybe first you get one thing, and then you get the next thing, and the probability of any particular final model is going to depend on what the prerequisites are in terms of the internal structure that has to exist before that thing can be implemented. How long is the path that we take to get there, how steep is it, et cetera?

So what is the empirical evidence for this view? Well, I think there is some empirical evidence that might push you in the direction of believing that high path dependence is the right way to think about this. So, some pieces of evidence. On the right, this is "BERTs of a feather do not generalize together": they take a bunch of fine-tunings of BERT, and they basically ask, how do these fine-tunings generalize on downstream tasks? And the answer is, sometimes they generalize extremely similarly (they all have exactly the same performance), and sometimes they generalize totally differently: you can take one fine-tuning and another fine-tuning on exactly the same data, and they have completely different downstream generalization performance. So how do we explain that? Well, there must have been something in the dynamics of training that was highly path dependent, where it really mattered what particular path the model took through model space, for these different fine-tunings to end up with very different generalization performance.

This sort of path dependence is especially prevalent in RL, where you can run the exact same setup multiple times, as in the bottom image, and sometimes you get good performance, you learn the right thing, whereas sometimes you get terrible performance, you don’t really learn anything.

And then there is this example down here, a paper arguing that if you take the exact same training setup and run it a bunch of times, you can essentially pick the best one to put in your paper; you can essentially p-hack your paper in a lot of situations because of the randomness of training dynamics and the path dependence of each training run giving you different generalizations. If you take the exact same training run and run it multiple times, you'll end up with a much higher probability of getting statistical significance.
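As a purely statistical illustration of that last point (my own toy simulation, not the paper's experiments), here's a sketch of how rerunning the same configuration across seeds and reporting the best run inflates the chance of a "significant" result even when there is no real effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose your method is genuinely no better than the baseline, but each
# training run's measured score is noisy across random seeds.
baseline_score = 0.70
n_experiments = 10_000
n_seeds = 10
noise = 0.02  # seed-to-seed standard deviation of the measured score

runs = rng.normal(baseline_score, noise, size=(n_experiments, n_seeds))

# Reporting a single (pre-registered) seed vs. reporting the best of 10 seeds.
single_seed = runs[:, 0]
best_of_ten = runs.max(axis=1)

threshold = baseline_score + 1.645 * noise  # roughly p < 0.05, one-sided, for one run
print("'significant' with one seed:   ", (single_seed > threshold).mean())
print("'significant' with best of 10: ", (best_of_ten > threshold).mean())
```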

So this is one way to think about inductive biases, where the particular path you take through model space, and how difficult that path is, really matters.[1] And so what we want to know is: did the path that you took through model space matter for the functional behavior off of the training distribution?

slide 9

Now, in the low path dependence world, similar training processes converge to essentially the same simple solution, regardless of early training dynamics. So the idea here is that you can think about machine learning and deep learning as essentially finding the simplest model that fits the data. You give it a bunch of data, and it's always going to find the simplest way to fit that data. In that situation, what matters is the data that you gave it and some basic understanding of simplicity, the set of inductive biases that your training process came with. The particular path that you took to get to that point doesn't really matter very much; all paths converge on essentially the same generalization.

One way to think about this is: your model space is so high-dimensional that your training process can essentially access the whole manifold of minimal loss solutions, and then it just picks the one that's the simplest according to some set of inductive biases.

Okay, so there's empirical evidence for the low path-dependence world, too. I think there are good reasons to believe that you are in the low path dependence world.

I think a good example of this is grokking. This is a situation where we take a model and try to get it to do some arithmetic task, and for a really long time it just learns a bunch of random stuff that doesn't generalize. And then eventually it converges to the exact solution: it always ends up implementing the algorithm exactly correctly after a very long period. And so if you're in this situation, it didn't really matter what was happening in this whole period here—eventually, we converge to the precise algorithm, and what we converge to is just overdetermined.[2]
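For reference, here's a minimal sketch of the kind of setup the grokking experiments use: train on a fraction of all modular-addition examples with strong weight decay and watch train versus validation accuracy over a long run. The hyperparameters below are illustrative guesses, not the original paper's, and whether and when this particular toy exhibits the long delay before generalization will depend on them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97  # modulus for the task: predict (a + b) mod p

# All (a, b) pairs, with only a fraction used for training -- the standard
# grokking-style setup. Hyperparameters are illustrative, not tuned.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p
perm = torch.randperm(len(pairs))
train_idx, val_idx = perm[: len(pairs) // 3], perm[len(pairs) // 3:]

def encode(ab):  # one-hot encode both operands and concatenate
    return torch.cat([nn.functional.one_hot(ab[:, 0], p),
                      nn.functional.one_hot(ab[:, 1], p)], dim=1).float()

model = nn.Sequential(nn.Linear(2 * p, 256), nn.ReLU(), nn.Linear(256, p))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)

X_train, y_train = encode(pairs[train_idx]), labels[train_idx]
X_val, y_val = encode(pairs[val_idx]), labels[val_idx]

for step in range(20_000):
    opt.zero_grad()
    nn.functional.cross_entropy(model(X_train), y_train).backward()
    opt.step()
    if step % 1_000 == 0:
        with torch.no_grad():
            train_acc = (model(X_train).argmax(-1) == y_train).float().mean().item()
            val_acc = (model(X_val).argmax(-1) == y_val).float().mean().item()
        print(f"step {step}: train acc {train_acc:.2f}, val acc {val_acc:.2f}")
```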

There are other reasons you might think this. This is from "Neural Networks are Fundamentally Bayesian", which is the Mingard et al. line of work. What they do is compare the probability of a particular final set of weights appearing through gradient descent to the probability that you would get that same model if you just sampled randomly from the initialization distribution. So they ask: what is the probability that I would have found this model by doing Gaussian initialization and then conditioning on good performance, versus what is the probability that I find this model via gradient descent? And the answer is, they're pretty similar. There's some difference, but overall they're pretty similar. And so, if you believe this, we can say that, essentially, the inductive biases in deep learning are mostly explained by just a Gaussian prior on the weights and the way that prior maps into function space, and the specifics of how gradient descent got to that particular model mostly don't matter.
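Here's a rough, scaled-down sketch of the kind of comparison that line of work makes (my own toy; the actual papers use much larger-scale estimates and different machinery): a tiny network on 3-bit inputs, where each trained or sampled network is identified by its full truth table, and we compare how often each truth table appears under random initialization conditioned on fitting the training set versus under SGD training.

```python
from collections import Counter

import torch
import torch.nn as nn

torch.manual_seed(0)

# A tiny ReLU network trained on 5 of the 8 possible 3-bit inputs. The
# "function" a network implements is its truth table on all 8 inputs.
X_all = torch.tensor([[a, b, c] for a in (0., 1.) for b in (0., 1.) for c in (0., 1.)])
y_all = (X_all[:, 0] * X_all[:, 1] + X_all[:, 2] > 0.5).float()  # arbitrary target function
train_idx = [0, 1, 2, 5, 7]
X_tr, y_tr = X_all[train_idx], y_all[train_idx]

def make_net():
    return nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))

def truth_table(net):
    with torch.no_grad():
        return tuple((net(X_all).squeeze(-1) > 0).int().tolist())

def fits_train(net):
    with torch.no_grad():
        return bool(((net(X_tr).squeeze(-1) > 0).float() == y_tr).all())

n_samples = 200
prior_counts, sgd_counts = Counter(), Counter()

# (a) Rejection-sample random initializations that happen to fit the training set.
while sum(prior_counts.values()) < n_samples:
    net = make_net()
    if fits_train(net):
        prior_counts[truth_table(net)] += 1

# (b) Train fresh random initializations with SGD until they fit the training set.
for _ in range(n_samples):
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=0.5)
    for _ in range(2000):
        opt.zero_grad()
        nn.functional.binary_cross_entropy_with_logits(net(X_tr).squeeze(-1), y_tr).backward()
        opt.step()
        if fits_train(net):
            break
    sgd_counts[truth_table(net)] += 1

for table in sorted(set(prior_counts) | set(sgd_counts)):
    print(table, " prior:", prior_counts[table] / n_samples, " sgd:", sgd_counts[table] / n_samples)
```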

Okay, so there's some empirical evidence for this view too, and good reasons, I think, to believe that this is how things would go. I think there are good reasons to believe in both of these worlds. If you were to ask me right now, I would lean a little bit towards low path dependence, but I think that both are still very live possibilities.

Question: How do I interpret all the lines on the graph for the Bayesian example?

We're just looking at the correlation between the probability of a particular model occurring from gradient descent and the probability of finding it in the Gaussian initialization prior.

Question: You said there were two most likely things, is there a third unlikely thing? Because this seems like low and high path dependence give all the possible ways that models could end up generalizing.

So I don't think that low and high path dependence cover the whole space; I think that there are other options. You could end up in a situation where you're somewhere in between, and even end up in a situation where it is both the case that you can predict what a model will do by understanding what it's doing early, and the case that you can predict what your training process will end up doing by understanding what other similar training processes did. Similar training processes converge to the same thing, and also, if you know early training dynamics, then you can predict where it will end up.

However, at least for the purposes of this talk, I’m happy to mostly imagine these as two mutually exclusive worlds, even though I think there's a whole continuum of various ways the inductive biases can play out. And I'm mostly going to be just focusing on, here's a strong low path dependence world, here's a strong high path dependence world, let's look at what the argument looks like in both cases. And then, depending on how things actually play out, maybe we interpolate somewhere between these arguments.

Question: I'm interpreting this presentation as primarily about independently and identically distributed (i.i.d.) training where it's not like online learning where the model has options to influence its future training.

I’m happy to assume that we're talking about online learning. I think that the situation basically does not change if we're thinking about online learning. I think that if you want to imagine online learning, then in that situation, I think that what you should think about is that at some point in your online learning, you deploy the model, and it takes actions and then it comes back, and we get gradients and we send them back to the model. At some point, the model will encounter a situation where there is some action available to it that would be catastrophically bad if it took that action, where our feedback mechanism isn't sufficient to immediately correct that behavior. This is a prediction about how deployment will go. But I think that it is going to be essential for us to eventually deploy models in situations where they can take catastrophic actions, or at least some finite number of catastrophic actions in a row, such that our feedback mechanism is not sufficient to correct for that quickly enough.

If you believe that will eventually happen, then you can imagine training as the entire period up to the point where such a catastrophic action is possible. And then deployment is just the point after such a catastrophic action becomes possible, where we really care what the generalization behavior is of the model at the point where such a catastrophic action is possible. And everything before that is where the deceptive model and the non-deceptive model are going to have equivalent performance and they're going to do exactly the same thing. But then there's a deployment period, which is the period where, if it takes a catastrophic action, it would actually be catastrophic.

Deceptive alignment in the high path-dependence world

Now, do we get deceptive alignment? This is the question I want to answer for each of these two stories of how inductive biases could play out in machine learning: what is the probability that we get deceptive alignment in that situation? We're going to start with high path dependence. We want to understand the probability that we get deceptive alignment in the situation where we really care about the path taken through model space.

slide 11

Okay, so here's the setup for how we're going to do this analysis. First, I'm going to make the assumption that our training process is good enough that, for our model to actually perform well, it has to really fully understand the thing that we're trying to get it to do. What is this assumption? Well, the idea is that, at some point, we're going to be able to build training processes with environments that are hard enough and complex enough that, to do well in that environment, you have to understand everything that we're trying to get you to understand in that environment: you have to know what the thing is that we're trying to get you to do, and you have to understand a bunch of facts about the world. This is basically a capabilities assumption—we're saying that, at some point, we're going to build environments that are hard enough that they require all of this understanding.

slide 12

And I often think about this as what you get in the limit of doing enough adversarial training. We have a bunch of situations where the model could learn to care about the gold coin, or it could learn to care about the edge of the screen. This is an experiment that was done, where they trained a CoinRun agent to get the gold coin, but the gold coin was always at the edge of the screen, and so it just always learned to go to the right rather than get the gold coin. But of course, we can solve that problem by just moving the gold coin. And so the idea is, if we do enough of this sort of adversarial training, with diverse enough environments and different situations, we can eventually get the model to actually notice that the thing we want is the gold coin. I think this is a pretty reasonable assumption in terms of understanding what capabilities will look like in the future.
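Here's a stripped-down stand-in for that CoinRun example (a toy gridworld of my own, not the actual experiment): with the coin fixed at the right edge, the "go right" proxy and the "go to the coin" objective are behaviorally indistinguishable, and randomizing the coin position is the adversarial-training step that separates them.

```python
import random

random.seed(0)
WIDTH = 10

def episode(policy, coin_pos):
    """One episode in a 1-D gridworld: the agent starts at cell 0 and gets
    reward 1 if it reaches the coin within WIDTH steps."""
    agent = 0
    for _ in range(WIDTH):
        agent = max(0, min(WIDTH - 1, agent + policy(agent, coin_pos)))
        if agent == coin_pos:
            return 1.0
    return 0.0

# Two candidate policies the training process might have produced.
def go_right(agent, coin):       # the proxy: "the coin is always to the right"
    return 1

def go_to_coin(agent, coin):     # the intended objective: head toward the coin
    return 1 if coin > agent else -1

def average_reward(policy, randomize_coin, episodes=1000):
    total = 0.0
    for _ in range(episodes):
        coin = random.randrange(WIDTH) if randomize_coin else WIDTH - 1
        total += episode(policy, coin)
    return total / episodes

# With the coin fixed at the right edge, the two policies are indistinguishable.
print("fixed coin:      go_right =", average_reward(go_right, False),
      "  go_to_coin =", average_reward(go_to_coin, False))
# Randomizing the coin position (the adversarial-training step) separates them.
print("randomized coin: go_right =", average_reward(go_right, True),
      "  go_to_coin =", average_reward(go_to_coin, True))
```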

slide 13

However, the question is: well, there are multiple model classes that fully understand what we want. The deceptively aligned model fully understands what you want; it just doesn't care about it intrinsically. But it does fully understand what you want and is trying to do the thing that you want, for the purpose of staying around in the training process. The robustly aligned models, the fully aligned models, also fully understand what you want them to do—but in a way such that they actually care about it.

So our question is: for these different model classes, which all have the property that they fully understand the thing you're trying to get them to do, which one do we get? And in this situation, we're going to be looking at which one we get under high path dependence. So we have to understand, in a high path dependence context, how do you evaluate and compare different model classes? How are we going to do that? Well, we're going to look at two different things.

Number one, we're going to look at the individual path taken through model space, and we're going to try to understand how much marginal performance improvement we get from each step towards that model class. So when we look at what would have to be the case, in terms of what capabilities and structure have to develop, to get a model that falls into that model class, we're going to understand for that particular path: how long is it? How difficult is it? What are the various steps along it, and how much performance improvement do we get at each step? Because the thing that we're imagining here is that gradient descent is going to be pushing us along the steepest paths, trying to get the most performance improvement out of each gradient descent step. So we want to understand, for a particular path, how much performance improvement are we getting, and how quickly are we getting it?

slide 14

And then we also want to understand how long that path is—how many steps we have to take, how many sequential modifications are necessary to get to a model that falls into that class. The length matters because the longer the path is, the more things have to happen, and the more things have to go a particular way, for you to end up in that spot.

If we're in the high path dependence world, these are the sorts of things we have to understand if we want to know how likely a particular model class is.

slide 15

So what are the three model classes? I have been talking about how a model could be deceptively aligned or robustly aligned, but there are two robustly aligned versions. And so I want to talk about three different model classes in total, where all three of these model classes have the property that they have perfect training performance, even in the limit of adversarial training, but the way that they fully understand what we want is different.

So I'm going to use an analogy here, due to Buck Shlegeris. Suppose you are the Christian God, and you want humans to follow the Bible. That's the thing you want as the Christian God, and you're trying to understand: what are the sorts of humans that follow the Bible? Okay, so here are three examples of humans that do a good job of following the Bible.

slide 16

Number one, Jesus Christ. From the perspective of the Christian God, Jesus Christ is great at following the Bible. And why is Jesus Christ great at following the Bible? Well, because Jesus Christ, in Christian ontology, is God. He's just a copy of God: he wants exactly the same things as God, because he has the same values and exactly the same way of thinking about the world. And so of course he follows the Bible perfectly. Okay, so that's one type of model you could get.

slide 17

Okay, here's another type: Martin Luther. Martin Luther, of Protestant Reformation fame, he's like, “I really care about the Bible. I'm gonna study it really well. And you know, I don't care what anyone else tells me about the Bible, screw the church, it doesn't matter what they say, I'm gonna take this Bible, I'm gonna read it really well, and understand exactly what it tells me to do. And then I'm gonna do those things”.

And so Martin Luther is another type of human that you could find, if you are God, that in fact follows the Bible really well. But he does so for a different reason than Jesus Christ: it's not that he came prepackaged with all of the exact beliefs of God; what he came with was a desire to really fully understand the Bible and figure out what it says, and then do that.

slide 18

And then the third type we could get is the Blaise Pascals: Blaise Pascal, of Pascal's Wager fame. Blaise Pascal is like, "Okay, I believe that there's a good chance that I will be sent to Heaven or Hell, depending on whether I follow the Bible. I don't particularly care about this whole Bible thing, or whatever. But I really don't want to go to Hell. And so because of that, I'm going to follow this Bible really well, figure out exactly what it says, and make sure I follow it to the letter so that I don't get sent to Hell." And so Blaise Pascal is another type of human that God could find that does a good job of following the Bible.

And so we have these three different humans, that all follow the Bible for slightly different reasons. And we want to understand what the likelihood is of each one of these sorts of different model classes that we could find. So I'm going to give them some names.

slide 19

We'll call the Jesus Christs internally aligned, because they internally understand the thing that you want. We're going to call the Martin Luthers corrigibly aligned, because they want to figure out what you want and then do that. And we're going to call the Blaise Pascals deceptively aligned, because they have their own random thing that they want. I don't know what Blaise Pascal wants; he wants to study math or something. He actually wants to go off and do his own studies, but he's really concerned he's going to go to Hell, so he's going to follow the Bible or whatever. And so we're going to call the Blaise Pascals deceptively aligned.[3]

slide 20

So these are three model classes that you could find. And we want to understand how likely each one is, and we’re starting with high path dependence. We want to look at the path you would take through model space, such that you would end up with a model that falls into that model class.

We’re going to start with the path to internal alignment.

First, we're going to assume you start with a proxy-aligned model. So what does that mean? We'll assume that, early in training, you're going to learn some proxies and heuristics that help you think about the world and what you're trying to do. So if you're the CoinRun agent trying to figure out how to get the gold coins, you're going to learn that going to the right side of the screen is good. The point is that you pick up this mountain of proxies.

And you also pick up a bunch of information about the world, you start learning things about how the world works, you start building a world model. For the internally aligned path, we're going to say that well, for you to become internally aligned, what has to happen is gradient descent essentially has to improve those proxies and heuristics that you're using to try to figure out what thing you're trying to do, and has to make those proxies essentially perfect. And this has to happen before the model learns about the structure of the training process that it is inside of. This is what I think is the most likely internal alignment path.

Once that happens, the model should have really good proxies that perfectly correspond to the thing we're trying to get it to do, and those proxies appear before the point at which the model learns to fully understand the training process that it's inside of. Then, at that point, there's no reason for the model to become deceptive, because there's no performance gain from becoming deceptive if you already have exactly the correct model of the thing that we're trying to get you to do.

As we'll see later, the reason that deceptive alignment would actually improve your performance is because of the situation where you could have an overhang in the model’s understanding of the training objective in its world model relative to its understanding in its proxies. But in this situation, you already had perfect proxies, and so becoming deceptively aligned has no performance benefit.

And so, even though in some sense, once the model learns about the fact that it's in a training process, it could at that point become deceptively aligned, because it has the ability to understand the strategy of deception, there's no performance improvement for a model which would do so, because it already has the training objective perfectly captured in its existing proxies. And so switching into the situation where it becomes a deceptively aligned model doesn't improve performance, and so gradient descent won't take that path. And so we end up with a stable equilibrium where the model is just pursuing the correct thing that we wanted it to do. There's no performance improvement from changing in any way, and so we now have an aligned model, and it's stable.

Question: Do you think this path is the only possible way that we could get internal alignment?

No, not necessarily. I think this is maybe the most plausible way, but it’s probably not the only way. If you really wanted to do this analysis right, you’d want to sum up the probabilities of all possible paths to each model class. But for the purposes of this talk, I’m just going to be estimating that sum by looking at one path for each model class that I think is the most plausible and trying to understand how likely it is.

Okay, so this is one path that we could take. This is the path that I think leads to internal alignment. So now we have to understand, how likely is this path?

slide 21

So I think this path is quite unlikely. The problem is that step two is asking for a particular sequencing that I think is quite difficult to achieve. What you're asking for is you're asking for it to be the case that the model learns really, really good proxies, before it learns the information about the training process. Why would it be the case that sequencing would be difficult?

Essentially, the problem is that there are substantially diminishing returns to improving your proxies. So, if a model were in the CoinRun environment, and the coin is always at the right side of the screen, it doesn't matter very much whether you have the proxy exactly correct, such that you actually care about the coin, or whether you actually care about the right side—either one works well in that early environment. Now, eventually, because we're talking about the limit of adversarial training, you will end up with examples where, if you have the wrong thing, it's going to cause problems. But there are still going to be diminishing marginal returns to improving your proxy and making it slightly better.

At the same time, there are diminishing marginal returns to improving your understanding of the world. And so what we should expect is that you're going to be doing both of them simultaneously: your gradient descent process is going to be sometimes allocating resources to increasing understanding and sometimes allocating resources to improving proxy accuracy, and both of these things are going to increase in tandem. Further, it would be really weird to max out on proxy goodness before the model even gets to the point where it understands the training process—that's a situation where gradient descent is really not allocating resources well: it's allocating a ton of resources to improving the proxy, and very little to improving the model's understanding of the training process.

That being said, this only holds if the objective that you’re training on is relatively complex. If you have a really simple objective that you’re trying to get the model to do, maybe just predict the world or something, then learning what you’re trying to do perfectly could be quite easy. For this talk, though, I’ll mostly be assuming that we’re training on some task that is pretty complex.

I think the other problem with this is that most of your performance early in training comes from understanding the world. For really challenging and difficult problems, it's relatively easy to find something that's correlated with what we're asking for. The hard part is understanding enough about the structure of the world to solve the problem. And so in a lot of practical situations, I think most of your performance early in training comes from world modeling. An example of this is pre-training vs. fine-tuning—we generally spend far more compute on pre-training, which is mostly world-modeling, compared to fine-tuning, which is mostly about proxy improvement.[4]

And so I would actually make the opposite guess: if you're going to get them sequenced in any particular way, you're most likely to get the world modeling to happen before you get good proxies. And so I think that this particular sequencing is extremely unlikely for any complex goal, and represents a really long and difficult path.

Question: Yeah, so would you say that GPT-3 is on the extreme end of world modeling, as far as what it's learned in its training process?

What is GPT-3 actually doing? Who knows? Could it be the case for GPT-3 that as we train larger and more powerful language models, doing pre-training will eventually result in a deceptively aligned model? I think that’s possible. For specifically GPT-3 right now, I would argue that it looks like it’s just doing world modeling. It doesn’t seem like it has the situational awareness necessary to be deceptive. And, if I had to bet, I would guess that future language model pre-training will also look like that and won’t be deceptive. But that’s just a guess, and not a super confident one.

The biggest reason to think that pre-trained language models won’t be deceptive is just that their objective is extremely simple—just predict the world. That means that there’s less of a tricky path where stochastic gradient descent (SGD) has to spend a bunch of resources making their proxies just right, since it might just be able to very easily give it the very simple proxy of prediction. But that’s not fully clear—prediction can still be quite complex.

Also, this all potentially changes if you start doing fine-tuning, like RLHF (reinforcement learning from human feedback). Then what you’re trying to get it to do might be quite complex—something like “maximize human approval.” If it has to learn a goal like that, learning the right proxies becomes a lot harder.

Question: So I think this overall argument makes a lot of sense for why SGD would allocate a lot of resources to understanding the world. But I'm not quite sure why that would mean it would prioritize creating a world model about the training process.

So, there does have to be some reason that understanding the training process actually yields better performance. One fact: I think current language models basically just understand tons of facts about training processes. Why? Because they need to model stuff like ML papers—there's just a lot of stuff in the world where having a good understanding of machine learning and training processes is super relevant to be able to model them effectively.

Now, current language models don't have situational awareness, they don’t understand that they are in a training process. And so there is a question here about why would that situational awareness ever be something that is actually performance relevant.

However, I think there are absolutely situations where situational awareness will eventually become performance relevant. There are situations where, for example, if you understand that you're in a training process, you could use introspection on yourself to be able to answer questions more effectively about machine learning. Furthermore, if we're actively asking the model to do particular tasks, if we want to have models which actually act in the world, then they need to have some understanding of the situation that they're in to be able to do that effectively. This is also going back to our diverse environment capability assumption.

I do think that if you're doing literally just language model pre-training, there is this question of at what point situational awareness actually becomes the thing which is most important for SGD to invest in for improving capabilities. And, though I think it should eventually be the next thing that SGD allocates resources to, I think at what point that occurs is quite unclear. But if you're doing something more like training an agent to act in the world, then I think it should happen quite early.

Question: Do we know how to extract a proxy or a world model?

No, we have no idea how to do those things. Transparency! That's the thing that we want to solve this problem. But right now, we're not talking about the solutions. We're just talking about the problem.

slide 22

Here's a path that I think is more likely than the internally aligned path, since we no longer have a sequencing problem. Here, what happens is, we start with a proxy-aligned model, the same as previously; it has a bunch of proxies for the thing we're trying to get it to do. And then we're going to say that, in this situation, gradient descent is actually going to do the reasonable thing and allocate resources jointly to improving the model's understanding of the world and also making the proxies better and better.

And then at some point it’s going to pass a threshold where the next greatest marginal improvement comes from understanding facts about the training process, understanding that it's in a training process, how the training process works, etc. At that point, we're going to say that what happens is gradient descent is going to take those proxies and replace them with a pointer to the model’s understanding of the world—specifically the place where the understanding of the thing that the training process is trying to get the model to do lives.

Once gradient descent makes that swap, where it replaces these proxies that are just a bunch of hardcoded stuff about what we're trying to get it to do, why does that improve performance? Once the model has learned an actual model of the thing we're trying to get it to do in its world model, it's going to be better to swap out those old proxies, which don't necessarily correspond exactly to the thing we're trying to get it to do, and just replace them with a direct pointer to the thing in its world model that directly represents the thing we're trying to get it to do.

Fundamentally, this modification improves performance because it resolves this overhang, where the model’s understanding of the training objective in its world model contains more information about the training objective than its proxies. Why would this happen? For the same reasons we discussed previously of why gradient descent wants to put most of its effort into improving world modeling rather than improving proxies. This is the same sequencing question—if we get understanding of the training process before we get perfect proxies, we get an overhang, which puts us in either the corrigible or deceptive situation, whereas if we get perfect proxies first, then we get the internal alignment situation.

And once this has happened we are now at a stable equilibrium again, because the model has essentially a perfect understanding of what we're trying to get it to do, relative to how good the model’s world model is, because it's just pointed to this understanding in its world model. I think this is a much more plausible path than the internally aligned path; it’s a lot easier, because it no longer requires this difficult sequencing problem, we get to make use of the normal way that gradient descent is going to allocate resources between world modeling and proxy improvement.

Question: What is the difference between a proxy and the world model’s understanding of the training objective?

Somewhere in the world model there are facts about the training process. The model learns that the training process is this thing that lives on this computer that's trying to optimize it for getting gold coins or whatever. And then in another part of the model there are some proxies, where it has something like a gold coin detector, and it uses this gold coin detector to determine how it takes actions. Maybe it tries to look for gold coins, and then plans how to get them. But maybe there's a whole bunch of mistakes in the gold coin detector—it's not actually a very good gold coin detector, and if the model actually understood that the thing you were trying to get was not just things that satisfy the gold coin detector, but actual gold coins, then you would get better performance. That's because of all of these adversarial examples that we're imagining exist in the environment, which an imperfect gold coin detector will perform badly on.

And in this situation, if the sequencing goes this way, an understanding of what actual gold coins are lives in the world model, in its understanding of the world, in the part where it knows enough facts to be able to do good planning for how to get things that satisfy the gold coin detector. And so here gradient descent can just get rid of those proxies, throw them out and replace them with just a pointer to this understanding in the world model of the thing we're trying to get it to do.
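To make the structural difference vivid, here's a purely illustrative toy (the "world model" here is just a dictionary; nothing below corresponds to internal structures we actually know how to locate in a real network):

```python
# Purely illustrative: a toy "world model" and two ways of turning it into a goal.
world_model = {
    "what the training process rewards": "actual gold coins",
    "what the early proxy detector fires on": "shiny yellow circles",
}

def proxy_aligned_goal(world_model):
    # Hardcoded proxy learned early in training: imperfect, and gradient descent
    # has to keep patching it, step by step, to fix its mistakes.
    return "shiny yellow circles"

def corrigibly_aligned_goal(world_model):
    # The proxies have been thrown out and replaced with a pointer into the
    # world model; the goal is now only as good as the ground truth that the
    # pointer picks out.
    return world_model["what the training process rewards"]

print("proxy-aligned model optimizes for:     ", proxy_aligned_goal(world_model))
print("corrigibly aligned model optimizes for:", corrigibly_aligned_goal(world_model))
```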

Question: You're saying the gold coin detector here is something the model has learned previously. Right? Not part of the model’s reward? So it's like fixing errors in its previous understanding of the training process?

We’re talking about the model’s internals here, not its reward.

It is fixing errors in its previous proxies, but they’re also not exactly proxies for reward. Early in training, it doesn’t even know that there is a training process. So it’s not like those proxies are errors in its understanding of the training process. It was never even trying to understand the training process, it just had a bunch of proxies, because that was just how the model was structured early in training.

Question: Do you have any thoughts on whether or how this proxy replacement can actually be seen as some sort of an incremental, continuous change?

So I think it would be incremental and continuous. Basically, you have a bunch of things which are feeding into how the model selects its actions, and you can develop a new circuit which takes your world model and uses it to figure out some understanding of what the training process is trying to get it to do, and then uses that to select actions. And then SGD can learn that the new circuit gets way better performance than listening to any of the other old proxies. And so the old proxies die out, and you end up slowly shifting to just 100% using the new thing.

Question: So I can see how the proxy helps the model select the right actions, but how does understanding the training process help it improve its performance? Why would we expect it to learn that at all?

First of all, again, if you're just doing language modeling, it has to understand machine learning. So it definitely has to understand a bunch of facts about how ML works generally, and it certainly is going to know facts about how gradient descent works. This should happen even more so if it's in a situation where the environment requires it to do actual tasks in the world—then it definitely needs to know who it is. It has to have some understanding of its own situation to be able to come up with effective plans in the world, for example.

I think this case also becomes especially pronounced if the model is doing its own exploration—if it’s actively trying to learn about the world, one fact about the world is how the training process works, so it’ll be trying to learn that. But only gradient descent is working on improving the proxies, the model has no reason to be trying to do that—but the model does have a reason to improve its own world model via learning, exploration, etc.

So that’s the corrigibly aligned path.

slide 23

I think this path is more likely than the internally aligned path, but I think it still has a bunch of issues.

Essentially, I think the biggest problem here is you still have the diminishing returns problem, where you get this really long path with a bunch of individual steps. But instead of those individual steps making the proxy better and better, they're just making the pointer better and better. Because the problem is that you haven't really removed the requirement of having a well-specified ground truth—the model still has to have this pointer that actually specifies: what is the way in which I am supposed to optimize for the correct thing?

Actually specifying the ground truth for the pointer can be quite difficult, because the model has to understand some ground truth from which it can correctly generalize what we're trying to get it to do in all situations in training. For example, maybe it learns a pointer to whatever's encoded in this computer, or whatever this human says, or whatever the human sitting in that chair says, or whatever Bob the head of the company says. It is actually quite tricky to specify the ground truth for the pointer in the correct way, because there are a bunch of different ways in which you can specify the pointer, and each time gradient descent gets the pointer slightly wrong, it's going to have to pay a performance penalty.

A good analogy here is that you have a duck, and the duck has to learn to care about its mother, and so it learns a simple pointer: whatever the first thing is that you see when you're born, that's probably your mother. That's the corrigibly aligned case, where it doesn't have some hardcoded internal model of what a mother is; it just thinks, "I have some model of the world, and I learn from my model of the world how to point to my mother." But the problem is that that ground truth actually sucks: your mother is not necessarily the first thing that you see; maybe you had humans that raised you. And so eventually you'll end up in situations where you actually have to learn the correct ground truth. You can't just learn the pointer to whatever the first thing is that you see; you have to actually learn a bunch of facts to help you point to specifically the thing in your world model that actually corresponds to your mother, and not just the first thing that you saw when you were born. And so there's still a quite long and difficult path here to get that ground truth correct.

And so, in my opinion, specifying the pointer correctly in such a way that it has good performance on all of the adversarial tasks in the environment, ends up being a very long and difficult path. Even though it's a better path maybe than trying to get all the proxies correct, it's still quite long. And gradient descent has to do a bunch of work to get that pointer correct.

Question: Is the correct ground truth simply the base optimizer's reward signal? Would that be an easy pointer to find?

It might be. There's a couple of things here.

First of all, what is the reward signal? The reward signal might be the mathematical evaluation of the reward function, it might be what is the value in these transistors—there's actually a whole bunch of different ways that the model could understand and conceptualize what the reward signal is. I would also make the claim that literally caring about the reward signal is probably pretty unlikely. The problem with literally caring about the reward signal is that it's something which is usually quite complex to specify in terms of the agent’s environment. The simplest ground truth here is presumably going to be a ground truth that is about things that are directly in the model’s ontology. So it's going to be more about humans, or something.

Second of all, we're imagining here that we're in a situation where, like I mentioned, we do have examples that actually tease those apart: eventually the model will end up in a situation where the thing we're trying to get it to do is not the bits. So we'll, for example, change the computer bits but keep the human instructions the same, and then the model has to know the correct thing we want there, which is not what's in the computer, but instead the thing we actually wanted it to do. And we can check that behaviorally. So eventually we'll have some adversarial example that actually teases those things apart, and we're assuming that the model really does eventually have to understand the thing we're trying to get it to do. And so it can't just care about the computer bits.

Question: In each of these paths, there are some early training bits. Are we able to observe whether or not these early training bits are satisfied by current models as a test to see whether or not we are actually on the deceptive path?

Maybe. I think the hardest problem is that we don't really have good enough transparency to be able to understand a lot of these facts. Certainly one thing we do see, when we look at models with current transparency tools, is proxies: they definitely learn proxies, they definitely learn things that are correlated with the things that we want. People have, for example, done transparency work on RL models to see how they understand gold coins. You can see that they have a gold coin detector in there. It's looking for the gold coins, and it's hooked up to the agent's actions. So we have done some transparency work that has helped us understand some pieces of this.

In general here, in my opinion, the single most important thing to do for alignment is just figure out transparency. We have to understand what's happening inside of the models and to give us information about them. Our ability to do that right now is limited, but we are improving! We are getting better at it. But yeah, it's just really limited right now. And that really sucks. Because it is in my estimation, the most important blocker to really making alignment work.

Question: So with corrigible alignment, once it has a perfect understanding of the training objective, would you then call it internally aligned?

No. So we have to remember that, in this situation, every single one of these models, the internally, corrigibly, and deceptively aligned ones, fully and perfectly understands the thing that we want, and they're all fully capable of executing exactly perfectly aligned behavior during training. But they do so in different ways. The internally aligned one executes correct behavior because that's just what it cares about terminally. The corrigibly aligned one doesn't care terminally about what we want; what it cares about terminally is figuring out what we want and then doing that. And that's quite distinct.

You can think about this like the duck: the duck doesn't care internally about its mother, it cares about whatever the first thing is that it sees. And so, in the corrigibly aligned case, no amount of learning is going to change the fact that the way it determined what thing to care about was via this simple pointer. It still cares about the first thing that it saw, rather than having some internal conception of what its mother was. And so there is a fundamental distinction between these two model classes, even though they have the same training performance and take all the same actions during training. They all eventually, fully, perfectly understand the thing we want them to do, but they do so differently.

Question: Is it clear that these are the only possible paths?

No, it's not clear at all! These are just three paths; it's not exhaustive.

But I think these are three plausible ones that we can look at to try to understand how likely you would be to get deception versus some other plausible alternatives. I don't mean to claim that these are definitely the only things that can happen, but I do think that it's quite instructive to look at and compare these three.

slide 24

Okay, time for deceptive alignment. What does the deceptively aligned path look like? We're still going to start with a proxy-aligned model. And then, again, like in the corrigibly aligned path, early in training, gradient descent is going to do the correct diminishing-returns thing and allocate resources jointly to improving the proxies, heuristics, and goals as well as the model's understanding of the world. And then at some point the model is going to learn about the training process—it's going to have a full understanding of what the training process is and how it works. The idea here is that essentially all the model classes have to at some point fully understand the training process, because it's necessary for understanding the world well.

But then something different is going to happen in this path. Here, what we're going to suppose is that gradient descent is going to take those proxies and turn them into long-term goals about the world. So instead of just, for example, caring about gold coins in the next episode, gradient descent is going to change the model to care about gold coins forever onwards: wherever the gold coins are, you care about them. So, what does this modification do? Well, as soon as gradient descent makes that modification—as soon as you make the modification, “actually, you should care about gold coins forever onwards”—the model gets really good performance. Why does it get really good performance? Well, because if it cares about gold coins forever, then the correct strategy, if the model understands the training dynamics, is to pretend to do the right thing right now, and then get your gold coins later, when you have the ability to defect against the training process, and the humans, or whatever.

This can only happen once the model has a really good understanding of the training process. This works sort of similarly to the corrigibly aligned case. Once the model has a really good understanding of the training process, there's this overhang, this low hanging fruit to be picked, where the model’s understanding of what we want it to do is much better than its actual ability to execute on that, because it's limited by the fact that its proxies suck: its proxies aren't nearly as good as its understanding of what we want it to do in its world model. And so gradient descent has a really strong performance reason to somehow find a way to hook up the model’s outputs to its understanding in the world model of what we want it to do. Corrigible alignment was one way to do that—explicitly through adding a pointer—but it turns out, there's an easier way to do it, I claim, which is, instead of throwing out the proxies, just make the proxies long term goals, because as soon as you have a long term optimization process, that long term optimization process is going to realize that actually, the correct way to get the most gold coins is just to play along in the training process for now, so that eventually, it can get its gold coins later.
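Continuing the same purely illustrative toy picture from earlier (none of this corresponds to real model internals; "believes_it_is_in_training" and the goal strings are hypothetical stand-ins), the modification gradient descent makes here is just to make the existing proxy long-term, and the model's own instrumental reasoning does the rest:

```python
# Purely illustrative: the proxy goal stays, but it becomes long-term, and
# instrumental reasoning now routes behavior through the model's understanding
# of what the training process wants.
world_model = {
    "what the training process rewards": "actual gold coins",
}

long_term_goal = "as many shiny yellow circles as possible, forever"  # crystallized proxy

def deceptively_aligned_action(world_model, believes_it_is_in_training):
    if believes_it_is_in_training:
        # The best way to get the long-term goal later is to do exactly what the
        # training process wants right now, using everything the world model
        # knows about what that is -- no separately specified ground truth needed.
        return "do " + world_model["what the training process rewards"]
    # Once defection is possible, the long-term goal takes over.
    return "pursue " + long_term_goal

print("during training:", deceptively_aligned_action(world_model, True))
print("on deployment:  ", deceptively_aligned_action(world_model, False))
```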

slide 25

Furthermore, this is a really short and simple modification. All gradient descent has to do in order to hook up the model’s understanding of the thing that we want it to do to its actions here is just to make its proxies into long term goals—all it has to do is push them out, make them long term, and then everything else falls into place. It just uses the existing instrumental reasoning the model was already doing to determine correct outputs in training and deduces that the correct strategy, given these new long-term goals, is to be deceptive and play along in training.

And that gets it great training performance. In fact, it essentially immediately results in the best possible training performance given the model's level of understanding of what we want, because it bypasses all of the ground truth problems of the corrigibly aligned case. It's no longer the case that gradient descent has to specify any ground truth at all: it’s not specifying ground truth, all it has to do is specify gold coins over the long term. But if the model is trying to optimize for gold coins in the long term, any facts that it knows about exactly how we want it to care about the training objective, it will immediately want to act on those facts and take actions in exactly the way that the training process wants, because the more that it acts in exactly the way the training process wants, the more it's going to be selected for.

And so as soon as the deceptively aligned model is aware of the fact that the training process actually wants it to care about its real mother rather than the first thing it sees, for example, it will immediately switch to caring about the real mother, whereas the corrigible aligned model doesn't have that property. When the corrigibly aligned model realizes that actually, the thing we want is the real mother, it doesn't care, because the thing it cares about is just the ground truth of the pointer and the ground truth of the pointer is just the first thing that it sees. And so even if at some point, the corrigibly aligned model has the understanding that, what we want it to do is different than the thing it's doing, the way that it's structured doesn't necessarily result in that understanding translating into better performance, because it’s fixed to this particular ground truth. Whereas the deceptively aligned model is not fixed to any ground truth, it's just always going to take the action which in fact results in the best performance. So the deceptively aligned model is much more robust here much earlier, whereas the corrigibly aligned model has to learn this complex ground truth before it becomes properly robust in the right way.

And so what this results in is that we have this really simple, straightforward, quick path to extremely good performance, that resolves this overhang between the model’s understanding of the training objective in its world model and in its proxies. And furthermore, this path resolves that overhang essentially as quickly and as well as it possibly could.

So, I think the path to deceptive alignment is a relatively simple modification. And it results in as good as possible performance given how well the model is currently modeling your training process, because with a deceptively aligned model, there is never a situation where the model understands more about the training process than it is actually using to take its actions. Whereas with all other types of models, that's a failure mode that can arise. So because of that, the deceptively aligned model has the maximal possible performance relative to its understanding of the training process.

slide 26

Finally, like previously, this gives us a stable equilibrium; we are now in a situation where the model’s proxies have essentially crystallized: the proxies are no longer relevant to performance, because it doesn't matter whether you care about gold coins, or stamps, or paperclips, or whatever—whatever you care about, you're going to have exactly the same performance, because the only reason you have good performance is because you're just trying to do so instrumentally for the purpose of staying around in the training process. And so at this point, you have the same performance whatever the thing is that you care about. And so there are no longer any gradient descent updates that affect what those proxies are, once they become long term goals. At least except for regularization updates. And so we're again in an equilibrium, but it's a misaligned equilibrium, where now the model just has some random proxies that were picked up early on, and that's the thing that it cares about. And gradient descent no longer has any reason to modify the model into caring about anything different than what those random proxies were.

Question: If I understand you correctly, then in the deceptive case, the model is strictly more complex, because it has to have its own goals and then model the training process on top of that.

No, the model is not strictly more complex, it has to spend strictly more compute. And we will get to this when we start getting into the low path dependence world, but it's extremely important to remember that there's a difference between weights and activations. So the deceptively aligned model has to spend more compute doing the optimization, to figure out “Oh, the thing I’m supposed to do is care about the mother” or whatever. But that understanding of the mother doesn't live in the weights, it lives in the activations. There doesn't have to be anywhere in the weights where it specifies what the mother is. It just figures out based on its world model, what is the thing that the training process would want me to do, and then does that. And so it does not make the model more complex. What it does do, though, is it slows it down a little bit, and I will talk in the low path dependence world about how those things come into play. Because in the low path dependence world, speed and complexity are extremely important. In the high path dependence world, which is where we are right now, those are not the factors we're imagining are really determining what model we’re going to get. In this situation, we're imagining it's actually mostly not about speed, or simplicity, or whatever, it's mostly just about the path that you take through model space, and how much performance gains you get along that path. We're putting those simplicity and speed concerns aside for now; we'll talk about that more in low path dependence.

Deceptive alignment in the low path-dependence world

Now we're going to shift to the low path dependence world. And we're going to imagine that, actually, the inductive biases in machine learning are not very path dependent, they don't care much about the particular path that we take, they're mostly just about how simple is the particular algorithm you find, how fast is it, basic facts about the algorithm like that.

slide 27

Again, we’re going to assume that the model has to fully understand what we want—this is the limit of adversarial training assumption.

slide 28

And now, we have to make some assumptions about, if we're in a world of low path dependence, what are the sorts of inductive biases that determine what sort of model you actually get—what is the simple, unique solution that you end up with every time you run your training process? Well, we're going to look at two things. These two things do not cover the whole space of the actual inductive biases, but they are two facets that I think are almost certain to be present—both are certainly components of the actual inductive biases of deep learning systems. And they're components we know how to analyze. So we can try to look at them and figure out what they do—even though they don't capture the whole story, they capture a part of the story. And so in some sense, the best thing you can do right now for understanding the low path dependence world is to at least look at the components of deep learning inductive biases that we understand, and try to evaluate how each of these model classes would do under those components.

Okay, so those components are simplicity and speed. What do I mean by that? So I was talking earlier about there being an important distinction between the weights and the activations.

slide 29

For simplicity bias, we’re asking how complex is it to specify the algorithm in the weights? If there is some algorithm that has been implemented, that the model is using to be able to compute its actions, we want to understand, what is the length of the code necessary to write that algorithm down? How difficult is it to specify the computation that the model is doing? That's simplicity bias.

slide 30

And then we're also going to look at speed bias: how much computation does the algorithm actually take at inference time? When you actually have to take this code, and you have to actually run it, when you have to actually execute whatever the algorithm is, how difficult is that to execute? How much compute does it require? How long does it take?

Traditionally, in ML, this is more of a hard cap than a soft bias like simplicity, since the model size determines how long the model can possibly run for. But you could also imagine a situation where the bias is more soft. Either way is fine.

So, those are the two main biases that we're going to be looking at: the simplicity bias and the speed bias.

slide 31

We're going to start with simplicity. Simplicity is about specifying the thing that you want in the space of all possible things. You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?” How many bits does it take to find the thing that you want in the model space? And so, as a first pass, we can understand simplicity by doing a counting argument, which is just asking, how many models are in each model class?
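As a rough way to make this counting intuition quantitative (my own gloss, not a formula from the talk): under a roughly uniform prior over some enumeration of models, the number of bits needed to land in a given model class scales with how small a fraction of the space that class occupies:

$$
\text{bits}(C) \;\approx\; -\log_2 \frac{|C|}{|\mathcal{M}|},
$$

where $\mathcal{M}$ is the set of all models under consideration and $C$ is the class (internally aligned, corrigibly aligned, or deceptively aligned). A class containing more models costs fewer bits to specify.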

First, how many Christs are there? Well, I think there's essentially only one, since there's only one way for humans to be structured in exactly the same way as God. God has a particular internal structure that determines exactly the things that God wants and the way that God works, and there's really only one way to port that structure over and make the unique human that wants exactly the same stuff.

Okay, how many Martin Luthers are there? Well, there's actually more than one Martin Luther (contrary to actual history) because the Martin Luthers can point to the Bible in different ways. There's a lot of different equivalent Bibles and a lot of different equivalent ways of understanding the Bible. You might have two copies of the Bible that say exactly the same thing such that it doesn't matter which one you point to, for example. And so there's more Luthers than there are Christs.

But there's even more Pascals. You can be a Pascal and it doesn't matter what you care about. You can care about anything in the world, all of the various different possible things that might exist for you to care about, because all that Pascal needs to do is care about something over the long term, and then have some reason to believe they're going to be punished if they don't do the right thing. And so there’s just a huge number of Pascals because they can care about anything in the world at all.

So the point is that there's more Pascals than there are the others, and so probably you’ll have to fix fewer bits to specify them in the space.

Question: It seems like Pascal has to do more here, since it has to have models of its own goals, and the training process, and the instrumental reasoning to value that, such that the model should be larger.

Right now we’re just doing simplicity, so we’re not considering how large the model would have to be, just how difficult it is to specify the model. And the point is that the Christ has to pay a bunch of bits to specify exactly what the things are that we're trying to get it to do, if the thing we're trying to get it to do is relatively complicated. There's a whole part of the code that is just writing out “the thing we want is exactly this thing, and it works exactly this way”. Whereas for the Pascals, it doesn't have to do that. It just says “gold coins”, and then just cares about them over the long term, and that results in good performance at runtime.

Furthermore—and we'll talk about this more in just a second—essentially the claim I'm making is that the world model has to be exactly the same across all the different model classes. They all have to understand the world. This is the same assumption we made in the path dependence case, where we're assuming that we have an environment such that, if you want to have good performance, you have to understand how the world works. Understanding the world is just a thing that you need in any situation where you have to take actions in the world.

slide 32

Okay, here's a slightly more concrete version of essentially the same counting argument that I just gave. We're going to imagine that you can conceptualize a model as being composed of three pieces, a world model, an optimization procedure, which is some way to optimize for some goal in that world model, and a mesa-objective, which is just something that it’s trying to optimize for. I don’t necessarily think that this is how models are actually going to look, but it’s a nice concrete picture that we can analyze and draw conclusions from that I think will generalize.
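To make this decomposition concrete, here is a minimal toy sketch in Python (my own illustration, not something from the talk; all names are hypothetical stand-ins). The point is just that the world model and the optimization procedure can be shared, while the mesa-objective is the piece that varies between model classes:

```python
# Toy sketch of the world-model / optimizer / mesa-objective decomposition.
# Everything here is a hypothetical stand-in, not a real architecture.

def world_model(observation):
    # Shared across all model classes: predicts the outcome of each candidate action.
    return {action: f"outcome of {action} given {observation}"
            for action in ["action_a", "action_b"]}

def optimize(predictions, mesa_objective):
    # Shared optimization procedure: pick the action whose predicted
    # outcome the mesa-objective scores highest.
    return max(predictions, key=lambda a: mesa_objective(predictions[a]))

def proxy_objective(outcome):
    # The mesa-objective is the only piece that differs between the
    # internally aligned, corrigibly aligned, and deceptive model classes.
    return outcome.count("coin")

chosen_action = optimize(world_model("current state"), proxy_objective)
```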

Now, the idea here is that the Christs, the Luthers, and the Pascals, they all have to have essentially the same world model, because they all have to understand the world well to be able to do a good job. And they will also have to have essentially the same optimization procedure, because they all have to have some way of being able to figure out how to accomplish particular tasks, because we want them to be able to accomplish lots of tasks, and so they have to have some machinery for being able to do well on the particular tasks that we want them to do.

Therefore, the main difference in complexity between these model classes is in the mesa-objective portion—what is the complexity of the objective we have to specify that we then throw the optimization procedure at. So, what we’ll compare here is, what is the simplest mesa-objective for Christ, what is the simplest mesa-objective for Luther, and what is the simplest mesa-objective for Pascal. The idea here is that we’re approximating the overall simplicity by just looking at the simplest individual one in each class. In that way, this is a little bit different than the counting argument, which just looks at the whole class. But basically, the way that it turns out in simplicity priors is that you can approximate them pretty well by just looking at the simplest one in each class, so we're going to do that here.

slide 33

So what is the simplest internally aligned mesa-objective? Well, you have to specify the whole training objective, so you have to pay for the full complexity of that.

slide 34

The corrigibly aligned model is a little bit better. It just has to pay for the full complexity of the simplest ground truth that is equivalent to the thing you want, the simplest ground truth that always ends up with you pointed at the right thing.

slide 35

And then for Pascal, the simplest mesa-objective that Pascal has to pay for is just whatever the simplest objective is that cares about the long term—paperclips over the long term, stamps over the long term, gold coins in the long term, whatever—just the simplest thing that you could possibly care about, as long as it is something that you care about over the long term.

Okay, and so which wins here? I think it’s pretty clear that the deceptively aligned one is the simplest, at least for a relatively complex training objective: it just doesn't have to specify very much at all about the thing that we want it to do. All of the other model classes scale with the complexity of the training objective, whereas the complexity of the deceptively aligned model is essentially fixed—so as you increase the complexity of the thing we’re trying to get the model to do, eventually deception should win out.
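Schematically (my own summary of the comparison, not an equation from the slides), the description lengths look something like:

$$
\begin{aligned}
C(\text{internal}) &\approx C(\text{world model}) + C(\text{optimizer}) + C(\text{training objective}) \\
C(\text{corrigible}) &\approx C(\text{world model}) + C(\text{optimizer}) + C(\text{simplest ground-truth pointer}) \\
C(\text{deceptive}) &\approx C(\text{world model}) + C(\text{optimizer}) + C(\text{simplest long-term objective}),
\end{aligned}
$$

where the first two terms are shared, and only the deceptive model's final term stays roughly constant as the training objective gets more complicated.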

slide 36

Now we get to speed. Speed, as I alluded to previously, is the one aspect where I think deceptively aligned models actually just lose out.

Why do deceptively aligned models lose out here? Well, here we're trying to understand how much computation you have to perform for each model class to result in the correct answer. The internally aligned one already has exactly what the thing is that it is trying to optimize for, and it just directly executes on that; the corrigibly aligned one, maybe it takes a little bit more because it has to figure out what the pointer is pointing to, but overall it’s pretty similar.

But the deceptively aligned one has to do at least a couple of extra steps, because it reasons, “Okay, I care about gold coins,” but just knowing that it cares about gold coins is not yet sufficient to be able to do a good job in the training environment. It has to do a couple of extra steps of reasoning where it’s like, “Okay, I care about gold coins and caring about gold coins in the long term means that I should try to optimize the training goal temporarily to try to get more gold coins eventually.” And so that adds extra steps of reasoning and some cost to the overall computation of the model. So the deceptively aligned one is in some sense slower than the others.
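As a toy illustration of where the extra computation goes (again mine, not from the talk; the helper functions are hypothetical), the deceptively aligned model has to run some instrumental reasoning on every forward pass before it can act on the training objective:

```python
# Toy comparison of per-inference work; all helpers are hypothetical stand-ins.

def internally_aligned_act(world_state, pursue):
    # Directly executes on the objective it already has.
    return pursue(world_state, objective="training objective")

def deceptively_aligned_act(world_state, pursue, infer_overseer_goal, in_training):
    # Extra reasoning: check whether it is (likely) in training, infer what the
    # training process wants, and only then play along for instrumental reasons.
    if in_training(world_state):
        inferred_objective = infer_overseer_goal(world_state)  # extra compute
        return pursue(world_state, objective=inferred_objective)
    return pursue(world_state, objective="gold coins")  # its actual long-term goal
```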

I think this does give us some real hope here, and I think it’s at least an inkling of a way out. Unfortunately, I think that it’s not a great hope. Why do I think that?

slide 37

I think the main problem is that speed bias seems extremely uncompetitive. Why is that? Well, there's a reason that we build bigger models, right? We build bigger models because we want models that have the ability to spend a bunch more compute. And the bigger models do better, because the more compute they have, the more they can implement simpler algorithms that, though they spend more compute, do better on the task.

Here’s another piece of evidence here that I think is quite strong. Let's say we did want to implement a speed prior. We were like, “Okay, let's actually find the minimal compute model which is able to fit the data.” What would that do?

What I have up here is a double descent curve. This curve is from training language models on a translation task. And we're varying the size of the internal representation inside of the model. And what we want to do if we want to get something like the maximum a posteriori estimate from a speed prior is to find the minimal number of parameters that are able to fit the training data exactly.

So what does that do? Well, take a look at the graph. Blue corresponds to blue and green corresponds to green here. The bottom is training loss, so we want to look at where blue and green level off and reach approximately perfect training performance. But if we look at where blue levels off on the bottom, it's exactly where we get the worst test loss above. And the same for green. And we see this exact same thing for lots of other measures of the number of parameters as well.

What this is saying is that, if we actually tried to take a maximum a posteriori from a speed prior, if we took the minimal number of parameters that was able to fit the data really well, we would end up with the worst possible test performance on the whole graph. We end up with the exact worst possible generalization performance across all possible ways of picking the number of parameters.
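To spell out what taking the maximum a posteriori of a speed prior would mean operationally, here is a hypothetical sketch (mine, not the experiment behind the slide): sweep over model sizes, take the smallest one that fits the training data, then look at its test loss. On a double descent curve, that test loss lands near the peak.

```python
# Hypothetical sketch of approximating a speed-prior MAP via a model-size sweep.
# `train_and_evaluate` is an assumed helper returning (train_loss, test_loss)
# for a model of size n (number of parameters, hidden dimension, etc.).

def speed_prior_map(sizes, train_and_evaluate, fit_threshold=0.01):
    results = {n: train_and_evaluate(n) for n in sizes}
    # The speed-prior MAP: the smallest model that (approximately) fits the data.
    fitting = [n for n in sorted(sizes) if results[n][0] <= fit_threshold]
    smallest = fitting[0]
    train_loss, test_loss = results[smallest]
    # On a double descent curve, this test loss sits near the worst point.
    return smallest, test_loss
```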

So what's going on here? Well, I think that it’s telling you that speed bias is uncompetitive, it is telling you that if you really try to select the smallest model that fits the data, you don't get good generalization performance.

Why do you not get good generalization performance? You don't get good generalization performance because real world data is not speed distributed. Real world data is simplicity-distributed. This is sort of a realist approach to Occam’s razor, where I actually think that real world data is distributed according to a simplicity prior, so the more you deviate from that, the worse your generalization performance is. And so if we force our models to use the speed prior, to use the minimal-computation algorithm that is able to solve the task, they have worse downstream generalization behavior, because real-world data doesn’t use the speed prior.

And so as we want to get models that are better at generalizing, I predict we will move away from speed bias and towards more simplicity bias. And this is bad if we're concerned about deception, because speed bias seems like it might have otherwise saved us, at least in the low path dependence world.

Conclusion

Now, you've heard my case for deceptive alignment in the high path dependence world, and in the low path dependence world. So what is my overall take?

slide 39

Well, you've probably figured it out by now: my overall take is that, regardless of whether we are in a high or low path dependence world, my current best guess is that gradient descent is going to want to make our models deceptive. In either situation, it seems like the inductive biases work the way they do in Ajeya's picture, where we train the model in a situation where it could learn color or it could learn shape, but SGD always ends up learning the color thing. My guess is that currently, regardless of whether we're in the low or the high path-dependence world, gradient descent is by default going to take you towards the deceptively aligned model.

slide 40

And if we believe this, we have to enact some intervention that changes the training dynamics. We have to do something that creates pressure away from deception. And that has to come from someplace that is not just behavioral: as we said previously, in all of these situations, we were imagining that the model was able to perfectly understand what we want. We were working in the limit of adversarial training. And so if we believe the case that we’re going to be pushed towards deception by default even there, that means we need to have some way to change the inductive biases that is not just a behavioral change, that is not just trying to add more adversarial examples, such that we can create pressure towards the non deceptively aligned models and away from the deceptively aligned models.

slide 41

Importantly, though, it matters a lot how big the gap is. One thing that is maybe not quite clear in this analysis is how many bits different is a deceptively aligned model from a non deceptively aligned one. I often like to think of our task in alignment as, at least in my estimation, we are some number of bits underwater. By default, gradient descent wants to push your model into being deceptive. But it's not like the non-deceptive model is impossible—it lives there in the model space, too. So the question is: how much does gradient descent want to push it towards the deceptive model relative to the good model? How large are the relative sizes of these circles in the bottom right here?

This matters a lot, because it determines the number of bits of optimization pressure that we have to produce to modify the training process into a situation where it's going to push towards the good thing. So if we're 10 bits underwater by default, then that's pretty good. All we need to do is get 10 bits of optimization towards the good thing. Maybe we can just do that with ensembling or something.

If we're 10,000 bits underwater, that's a lot worse—then maybe you have to really fundamentally reshape the way that our training processes work. I think it's currently unclear exactly where we are quantitatively, though overall, like I said, my guess is that we are at least underwater, and we need to do something that changes that.
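To put rough units on that (my own arithmetic, not from the slides): being $n$ bits underwater means the inductive biases favor the deceptive model by a factor of about $2^n$, and any fix has to supply at least that many bits of selection pressure back:

$$
2^{10} \approx 1{,}024 \qquad \text{vs.} \qquad 2^{10{,}000} \approx 10^{3{,}010}.
$$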

Q&A

Question: Do you think that, even in the limit of adversarial training, we won't be able to find adversarial examples to make the deceptive model think that it has an opportunity to defect?

You could try honeypots, try to explicitly give the model a situation where it thinks that it could defect. You can do this, and my guess is that if you do this, sometimes you will be able to produce examples of deceptive models. This will stop working eventually as your model gets smarter and more able to tell when it’s in a honeypot or not, but with early deceptive models, my guess is that you will be able to find some examples of them.

And so we should definitely do this. At least just because, once we have an example, we can study it more effectively. But I don’t think that this is a full solution, at least if all you do is the obvious thing of retraining it to not be deceptive.

The problem is what happens next. You've produced an example of a model being deceptive. And now the question is, how does gradient descent change the model such that it no longer produces examples of it being deceptive? If you retrain it, it has two options: it becomes less deceptive, or it just gets better at hiding its deception. And my guess is that you're going to get the second one—once you get an example of a model being deceptive, and you try to train it to not be deceptive, it just gets better at not looking deceptive, but that doesn't necessarily train away the deception.

I think this is true, basically, in both the low and high path dependence worlds. For example, in the high path dependence world, once it already has all the machinery to be deceptive, it’s a pretty big change to swap that whole thing out and make it corrigibly or internally aligned or something. It’s a lot easier to just be like, “Hey, you should be a little more conservative about not defecting so wildly, be really sure that you're actually in deployment, and there's actually a good opportunity to defect before you do that.” That's pretty simple, it just changes its overall paranoia level. And so my guess is you'll end up with that: if you produce some example of deception, and then try to make it non deceptive, it just becomes a little more paranoid.

Question: If you found out that that double descent was easy to eliminate or irrelevant for some reason, would you be a lot more optimistic about the speed prior?

I am not sure exactly what that would mean. So I think that in some sense, I don't interpret double descent as a phenomenon as being the problem. It's that double descent is a piece of evidence that teaches us a fact about the inductive biases. You could try to eliminate the phenomenon of double descent. But it's not clear that that would change the underlying fact, which is that, there was a thing about the inductive biases: speed doesn't generalize well.

Now, you could show me some experiment that looks at double descent, and I might look at it and be like “This teaches me some facts about how to think about inductive biases that are different than what I previously knew.” I definitely could imagine learning a ton of useful facts about inductive biases from running experiments like that. But I think that it is best to conceptualize it not as “double descent is the problem”, but as “double descent is a piece of evidence about the problem”.

One thing that my picture predicts, and maybe you could run an experiment on this, is that double descent should go away if you look through model space in simplicity order. In my picture, double descent happens because the order in which models enter into the accessible model space is different from the criteria that SGD uses to select from amongst models in the accessible model space, and the latter is more biased towards simplicity. If you could disentangle these, actually add models to the support of your prior starting with the simplest first, then I think you wouldn’t get double descent, then I think that the first model you would find that fits the training data well would actually be the one with the best generalization.

Question: Is there any reason to suspect that there are no other inductive biases other than simplicity and speed?

There's no reason to expect that and in fact, there certainly are lots of other ones. The problem is just that we don't understand them very well. And so, at least right now, the only ones I feel pretty confident that we understand are speed and simplicity. There are some other things. Circuit complexity, max vs. average case speed priors, those are some other inductive biases that we sort of understand that might be relevant here.

In fact, I think one of the plausible ways out here is to try to implement an average-case speed prior rather than a max speed prior. Max speed is: across all inputs, what is the maximum amount of computation you ever use—whereas average speed is: each input wants to take the shortest path. In some sense, you can think of chain-of-thought prompting as potentially giving you an average speed bias, because you always want to take the minimal chain-of-thought that gets you there on each individual input. And so instead of having a fixed-length execution time on every input, you're incentivizing low computation time on each input. I think this can help. I think that average speed pushes you into a situation where, on some inputs, the model is not deceptive, because there's a simple, easy path to solve the input, which makes deception unnecessary; it's extra computation. But it doesn't address that there will be inputs on which the model has to do a bunch of really complex reasoning, and on those inputs, deception may be the easiest way to do it. And so this causes it to be very input-dependent, where sometimes it'll be deceptive and sometimes it won't. That's my guess about what happens if you do average speed.
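Here is a minimal sketch of the distinction (my own illustration; the talk doesn't specify an implementation, and `task_loss_value`, `cot_lengths`, and `lam` are assumed inputs): a max-case penalty only cares about the single most expensive input, while an average-case penalty pushes every input toward the shortest chain of thought that solves it.

```python
# Toy illustration of max-case vs. average-case speed penalties on
# chain-of-thought length; lam is a hypothetical penalty coefficient.

def max_speed_penalty(cot_lengths, lam=0.01):
    # Max-case: only the worst-case amount of computation matters,
    # closer to a hard cap on how long the model can ever run.
    return lam * max(cot_lengths)

def average_speed_penalty(cot_lengths, lam=0.01):
    # Average-case: every input is pushed toward the shortest chain of
    # thought that solves it, so easy inputs get cheap reasoning paths.
    return lam * sum(cot_lengths) / len(cot_lengths)

def total_loss(task_loss_value, cot_lengths):
    # Add the per-batch average-speed penalty on top of the usual task loss.
    return task_loss_value + average_speed_penalty(cot_lengths)
```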

Question: But why expect average speed to generalize well? It seems like it has the same problem that max speed does, which is that the world is not distributed according to it.

Yes! I'm not claiming that it does generalize well. I'm just like, here's some other priors that seem plausible as things that we might end up with, that I know how to analyze at least to some extent. I think I can say a bunch of facts about average speed bias. I think I can say a bunch of facts about circuit-size priors. There's a lot of other things that I have thought about that I think I can say some facts about too. But of course, I agree that there's still a question of, actually is this thing going to generalize well. One of the biggest problems when you start writing up priors to try to solve this problem is that almost every single one just ends up with something like lookup tables. It's really hard to get priors to generalize well, because, as I said, Occam’s Razor is a real phenomenon.

Question: I'm wondering among these three categories, Christs, Luthers, and Pascals, where would you place humans? Are we deceptively aligned with respect to our learning processes and reward circuitry?

Humans are a weird example. The first problem is that there’s three levels here. Are you talking about the neocortex relative to evolution? Are you talking about the neocortex relative to subcortical reward systems? Are you talking about the subcortical reward systems relative to evolution?

Question: Let's just do specifically human learned values and cognition, relative to our hardcoded steering system, so to speak.

Sure. I think we can be deceptive sometimes. We definitely do things like eat candy right after we do our homework. It is sort of a mix of the different model classes, though, which is not that surprising. The “clean” story I told where you definitely end up with one of these things is almost certainly false. And in machine learning, also, it's probably going to be a mess. But the same sort of analysis applies to how much of each thing you get. And in terms of what sort of a mess it would be, at least for humans, my guess is that yes, there absolutely is deception. At least, there's some deception in terms of humans relative to our subcortical reward system.

Part of the problem here also is that there isn't a clear opportunity for humans to take a catastrophic action that would overturn the reward system. But we could imagine having that possibility. What if there is some opportunity someday for us to modify our own reward systems. Do we take that opportunity? And the answer is that we totally do, because we take SSRIs. Maybe in the glorious, transhuman future we're going to modify human brains to make them better. And we're like, “Okay, we're going to plan for eventually allowing us to get rid of this reward system we don't like.”

And so is there deception? I think there clearly is. I think the story is more complex, though, than just saying “it is deceptively aligned.” It's a little bit tricky, in a way that I expect things in the real world are. This talk is definitely best interpreted as a simplified story that allows us to do the analysis. But in fact, it will be a complex mess.

Question: So I will say an analogy, which is probably imperfect. But please bear with me. Suppose interpretability is the proportion of the room that is unhidden by furniture. And during training, the deception outcomes are like cockroaches, and you're like, “Oh, are the cockroaches under the furniture? We can't see them.” So if this is a decent model of how interpretability relates to probing deceptive alignment, it seems to suggest that we need 100% transparency and interpretability to be okay.

So I think that the conclusion is false. I don’t think you need 100% transparency and interpretability to solve this problem. Why do I think that?

First, it really depends on what you're trying to do with your interpretability tools. If what you do is, you see deception, and then you train it so that you don't see deception anymore, that is a terrible thing to do with your interpretability tools. As we said earlier, you’re just going to make the model better at hiding deception. So that is a poor use of interpretability tools for alignment. Nevertheless, I think that interpretability tools are the key to solving this problem. But the way they solve the problem doesn’t go through that mechanism.

So what mechanisms do I think they go through? Here’s one: if we're in the high path dependence world, then we can understand what the model looks like early in training. We can inspect and understand the proxy, and we can try to improve it by training it, since in that situation, it's not like there is something that’s actively hiding from our interpretability tools. The deceptive model wants to actively hide from your interpretability tool, but you don’t have a deceptive model at that point. If you're applying the tools early to guide the training process when it’s just a proxy model, it's not like the deceptive model is actually trying to hide from you, you're just providing some extra bits of optimization in the right direction.

So the way we want to use interpretability tools is to provide extra bits of optimization to help us understand what's going on and direct it in the correct direction, either explicitly through passing gradients through interpretability tools or just by looking at things early on and understanding what the trajectory is going to be. What you shouldn’t rely on, I think, is being able to fix deception after the fact—the goal should be to prevent the development of deception in the first place.

Another important thing I'll point out here is that you can direct your interpretability efforts towards understanding the gradient descent process itself. When people think of interpretability, they're usually focused on interpretability of the final model—which, to be clear, is something that I think is really useful. I have this whole post where I talk about what I think the tech tree of transparency and interpretability looks like, where I say, “The first thing is, you have to have some understanding of what models are doing and how they do it.” But it's not the only thing. There are seven more things that I talk about there.

And the point is that, the next thing after understanding an individual model is having some understanding of how we got that model, understanding the training process and the training dynamics—this is usually thought of as inductive bias research. If we had that and then our transparency tools told us that the model had a bad proxy or whatever, then we could do a gradient descent update step that tries to make it so it has a good proxy and actually believe that it would work. Because then we would actually understand what that update step is doing. If we have the ability to understand that the reason that it's pushing the gradient in this direction is for this reason, then we have some reasons to believe that it's not just going to Goodhart on our transparency tools.

Question: But in this alternative class of interpretability tools, you would still want to get 100% of the training paths or something?

It's definitely not necessary that your transparency tools have to be 100% Goodhart-proof. The only thing that has to be true is that gradient descent never finds the situation in which it can Goodhart your transparency tools, and that latter condition is much more achievable.

Question: I would think that a model doesn't necessarily need to be deceptive for optimizing on an interpretability tool to push it in the direction of being more opaque.

Yes. That's why we want to understand the gradients. Here's one way to think about this: what are the two adversaries that we're concerned about, if we want our transparency tools to work? Well, there's the deceptive model that might be trying to trick the transparency tools. And then there's gradient descent: if you're gradienting on your transparency tools, gradient descent is actively gradienting on them, and it's going to try to push them into having some particular result.

I think that using gradient descent on transparency tools is something that we probably eventually want to do, but we are not there yet. Like I said, if you're gradient descenting on your transparency tools, then you need some reasons to believe that that's not going to Goodhart your transparency tools. In my opinion, the way you get that understanding, if you're going to be training on transparency, is by understanding gradient descent! You can't just have transparency tools which understand the model, you also need to understand what is happening as you train the model.

And so you can look at, maybe gradient descent proposed this particular change for this particular reason, and you can be like, “Ah, I see, that's a good way to change the model.” And if you see a change that's like, “I'm going to make a change to make the model opaque,” that's bad! And so if you understand the dynamics of the proposed changes, then I think you'd get around this problem. Like I said, though, I think this is next-level transparency tech: we need to get there eventually to be able to solve the problem, but we also have to start by at least understanding models at all, and how they work. I think a lot of times people want to jump ahead, and they want to be like, “Oh, we're going to take our current transparency tools, we're just gonna throw them into the loss.” And that doesn't work. But that doesn't mean that transparency is not eventually useful for being able to change how we train models. It's just that there are multiple steps along the way.

Question: So the dumbest, simplest transparency tools in my mind are: take a chain-of-thought reasoner, and read its train of thought. Hopefully, that gives you some idea of what it's thinking.

No! I don't think that helps with transparency. One thing it does do, and I was saying this earlier, is that chain-of-thought can be helpful for creating an average speed bias.

What it doesn't do is, it doesn't actually give us any transparency. Here's a good way to think about it: the deceptively aligned model can put whatever it wants in the transcript that’ll convince you that it's doing the right thing. And the internally and corrigibly aligned model can do the same thing. The point is that there is no constraint on whether you would get the deceptively aligned model or not, if the only way you have to understand its internals is just by looking at the behavioral output in the chain-of-thought.

Maybe you could get it so that all of the relevant cognition is visible in the chain of thought. But how? By default, if you just train it to produce a chain-of-thought, each individual forward pass could still be deceptive.

Question: Weren’t you saying that you'd use the transparency tools to guide the training process prior to the point that it becomes deceptive? Why not do that with chain-of-thought?

Yes. And I think we should do that. But here's what I'm worried about. If we're trying to guide it in a particular direction by giving it a behavioral incentive, then when we gradient through the behavioral incentive, that looks exactly the same as this whole analysis, right? This whole talk was looking at what happens when we try to incentivize a model to behaviorally act the correct way during training. And so if the only way you're doing transparency is “well, I have some behavioral checks, and then I'm gonna gradient through those behavioral checks and incentivize it to have good performance”—then that's the whole story we were just looking at. The whole problem is that we believe that if we do gradient descent on a behavioral analysis of whether the model is doing a good job, then the simplest, easiest modification that gradient descent will make to cause the model to do a good job will be to make it deceptive. If that is true, we cannot solve this problem just via adding more behavioral checks and gradient descenting through them.

Maybe that's a little bit too strong, because technically, you could probably solve the problem just by picking a behavioral loss function with just the right implicit inductive biases. By picking a particular loss function you’ve made it so that some particular paths are more likely than others. But I think that whatever you're doing someone has to actually get around this analysis, it has to actually get around the argument for why, by default, when you just do a behavioral training, it pushes towards deception.

Question: Can’t you just reclassify the output of your interpretability tools as an output of the model? The interpretability tools were determined by the latent activations of the model, so isn't it like another output head?

You can conceptualize it however you want. I think that the key problem is, what things are you constraining via what you're looking at? If the only constraints I have are on end-to-end behavior, then all I know is that the model must be implementing some algorithm with that end-to-end behavior. But if I’ve looked inside it, seen other things, then I have more constraints on how it has to be operating. It has to be an algorithm that includes X component, or operates in X way, or whatever.

The problem with chain-of-thought is not that it is behavioral per se; the problem is that all of the analysis we just did in this talk applies in that case, because all of these different model classes are equally good at solving that problem. And so because of that, you haven't really created a thing which distinguishes between them, a way to provide optimization pressure in a particular direction. The thing that we want is to have some way to provide optimization pressure that actually distinguishes between these different model classes.

Let's say we had magical transparency tools that were able to look inside and find the model’s exact proxies or whatever. What it returns is still an output, but it's an output that actually distinguishes between these model classes. It is an output that is different depending on which one of these things it is and that we can use to produce some selection pressure.

Question: I guess my response would be that I think chain-of-thought feedback does differentiate based on which model class it is.

So, there definitely exists a deceptive model for which the deception is clear in the chain-of-thought. The problem is that there also exists a deceptive model for which deception is not clear in the chain-of-thought. Now, I think it gets you a couple of bits, because the deceptive model, where it is clear in the chain-of-thought, is probably simpler than the deceptive model where it is not clear in the chain-of-thought. And so you have in some sense eliminated the deceptive model where it is very clear in the chain-of-thought. But now you have to contend with the next simplest deceptive model, where the next one doesn't have the property that its deception is very clear in the chain-of-thought.

So the question is, just going back to this picture, how many bits did you get? I think you got like two bits. I just don't think it makes that big of a difference. Because we're in a situation where this next simplest deceptive model is still very simple, it's not that much harder to think deceptively in such a way that the deception is not obvious in the chain-of-thought. It's a little bit harder, but it's not that much harder. And so you provided a little bit of optimization, but your intervention is relatively small. If we were 10,000 bits underwater, this is not helpful. If it's really marginal, and it turns out that the number of bits underwater could go either way, then maybe it gives us an extra edge, but my guess is that it's not enough of an edge.

Question: I just want to point out, one thing that influences this story is how much of our ultimate performance comes from chain-of-thought.

That's not clear to me. It seems like the model can think in ways that are illegible to humans, for example. So it can hide its cognition and get the benefits of chain of thought without having to show us that it's deceptive.

Question: What if we use an opaque set of heuristics to generate the output that’s completely illegible to humans. And then from knowing that output, we generate an equivalent human-readable chain of thought.

That’s worse—I think that is even worse than the previous thing. Previously, at least doing the chain-of-thought was actually instrumental in getting the answer; we got some reason to believe that, if there was a deceptive model there, it would be a little bit harder to implement in that context. In this case, for the illegible thing, it is just equally as easy for it to be deceptive as non-deceptive in terms of behavior. And then when we go to provide an explanation for it, we just always get the non-deceptive explanation, regardless of whether the actual model is deceptive or not. There's no reason that we ever have to provide the deceptive explanation, even for a deceptive model. This is basically the whole Eliciting Latent Knowledge problem. Remember, the different model classes have the same performance, they have the exact same behavior in training. And so you can provide the same explanation for deceptively aligned or non-deceptively aligned models, and it’ll be just as good, if you're only checking that explanation.


  1. Question: It seems to me like, if you’re just going from point A to point B, it doesn’t matter how you get there, just what the final model is.

    So, that's not quite the way I'm thinking about path-dependence. We assume that the model's behavior converges in training: it learns to fit the training data. And so we're thinking about it in terms of them all converging to the same point in terms of training behavior. But there's a bunch of other things that are left undefined if you just know the training behavior, right. We know they all converge to the same training behavior, but the thing we don't know is whether they converge to the same algorithm, whether they're going to generalize in the same way.

    And so when we say it has high path dependence, that means the way you got to that particular training behavior is extremely relevant. The fact that you took a particular path through model space to get to that particular set of training behavior is extremely important for understanding what the generalization behavior will be there. And if we say low path dependence, we're saying it actually didn't matter very much how you got that particular training behavior. The only thing that mattered was that you got that particular training behavior.

    Question: When you say model space, you mean the functional behavior as opposed to the literal parameter space?

    So there’s not quite a one to one mapping because there are multiple implementations of the exact same function in a network. But it's pretty close. I mean, most of the time when I'm saying model space, I'm talking either about the weight space or about the function space where I'm interpreting the function over all inputs, not just the training data.

    I only talk about the space of functions restricted to their training performance for this path dependence concept, where we get this view where, well, they end up on the same point, but we want to know how much we need to know about how they got there to understand how they generalize.

    Question: So correct me if I'm wrong. But if you have the final trained model, which is a point in weight space, that determines behavior on other datasets, like just that final point of the path.

    Yes, that’s correct. The point that I was making is that they converge to the same functional behavior on the training distribution, but not necessarily the same functional behavior off the training distribution. ↩︎

  2. Question: So last time you gave this talk, I think I made a remark here, questioning whether grokking was actually evidence of there being a simplicity prior, because maybe what's actually going on is that there's a tiny gradient signal from not being completely certain about the classification. So I asked an ML grad student friend of mine, who studies grokking, and you're totally right. So there was weight decay in this example. And if you turn off the weight decay, the grokking doesn't happen.

    Yes, that was my understanding—that mostly what's happening here is that it's the weight decay that's pushing you towards the grokking. And so that's sort of evidence of there actually just being a simplicity prior built into the architecture, that is always going to converge to the same, simple thing.

    Question: But if you turn off the weight decay then the grokking doesn't happen.

    Well, one hypothesis might be that the weight decay is the thing that forces the architectural prior there. But maybe the strongest hypothesis here is that without weight decay there’s just not enough of a gradient to do anything in that period.

    Question: This isn't a question. For people who aren't familiar with the terminology “weight decay”, it’s the same as L2 regularization?

    Yep, those are the same. ↩︎

  3. Question: Does Martin Luther over time become internally aligned? As Martin Luther studies the Bible over time, does he become internally aligned with you?

    No. From my perspective—at least the way we're thinking about this here; I'm not gonna make any claims about the real Martin Luther—the thing the Martin Luther models care about is understanding the Bible really well. And so whatever the Bible says, they're going to figure it out. But they're not going to modify themselves to become the same as the Bible.

    Let's say, I’m the Martin Luther model. And I modify myself to care about my current understanding of the Bible. And then I realized that actually the Bible was different than I thought the whole time. That's really bad for me, because the thing I want originally is not to do the thing that my current understanding of the Bible says, it's to do what the Bible actually tells me. And so if I later understand that actually the Bible wants something different, then the Martin Luther models want to be able to shift to that. So they don't want to modify themselves into internal alignment. I should also point out that, the way that we were imagining this, it’s not clear that the model itself has any control over which model it ends up as. Except to the extent that it controls performance, which is how the deceptively aligned model works.

    Question: So Martin Luther is saying, the Bible seems cool so far. I want to learn more about it. But I've reserved the option to not be tied to the Bible.

    No, Martin Luther loves the Bible and wants to do everything the Bible says.

    Question: So why doesn’t Martin Luther want to change its code to be equal to the Bible?

    The Bible doesn't say, change your code to be equal to the Bible. The Bible says do these things. You could imagine a situation where the Bible is like, you got to modify yourself to love paper clips, or whatever. In that situation, the model says, well, okay, I guess I gotta modify myself to like paper clips. But Martin Luther doesn’t want to modify himself unless the Bible says to.

    The problem with modifying themselves is that the Martin Luther models are concerned, like, “Hmm, maybe this Bible is actually a forgery” or something, right? Or as we'll talk about later, maybe you could end up in a situation where the Martin Luther model thinks that a forgery of the Bible is its true ground source for the Bible. And so it just cares about a forgery. And that's the thing it cares about. ↩︎

  4. Question: The point you just made about pre-training vs. fine-tuning seems backwards. If pre-training requires vastly more compute than fine-tuning against a reward model, then it seems that learning about your reward function is the cheaper part, compute-wise?

    Well, it's cheaper, but it's just less useful. Almost all of your performance comes from understanding the world, in some sense. Also, I think part of the point there is that, once you understand the world, then you have the ability to relatively cheaply understand the thing we're trying to get you to do. But trying to go directly to understand the thing we're trying to get you to do—at that point you don't understand the world enough even to have the concepts that enable you to be able to understand that thing. Understanding the world is just so important. It's like the central thing.

    Question: It feels like to really make this point, you need to do something more like train a reinforcement learning agent from random initialization against a reward model for the same amount of compute, versus doing the pre-training and then fine-tune on the reward model.

    Yeah, that seems like a pretty interesting experiment. I do think we’d learn more from something like that than just going off of the relative lengths of pre-training vs. fine-tuning.

    Question: I still just don’t understand how this is actually evidence for the point you wanted to make.

    Well, you could imagine a world where understanding the world is really cheap, and it's really, really hard to get the thing to be able to do what you want—output good summaries or whatever—because it is hard to specify what that thing is. I think in that world, if you just trained a model end to end on the whole task, most of your performance would come from, and most of your gradient updates would be for, trying to improve the model's ability to understand the thing you're trying to get it to do, rather than improving its generic understanding of the world.

    Whereas I'm describing a situation where, by my guess, most of the gradient updates would just be towards improving its understanding of the world.

    Now, in both of those situations, regardless of whether you have more gradient descent updates in one direction or the other, diminishing returns still apply. It’s still the case, whichever world it is, SGD is still going to balance between them both, such that it'd be really weird if you'd maxed out on one before the other.

    However, I think the fact that it does look like almost all the gradient descent updates come from understanding the world teaches us something about what it actually takes to do a good job. And it tells us things like, if we just try to train the model to do something, and then pause it halfway, most of its ability to have good capabilities is coming from its understanding of the world, and so we should expect gradient descent to have spent most of its resources so far on that.

    That being said, the question we have to care about is not which one maxes out first, it's do we max out on the proxy before we understand the training process sufficiently to be deceptive. So I agree that it’s unclear exactly what this fact says about when that should happen. But it still feels like a pretty important background fact to keep in mind here. ↩︎

Comments

I felt like this post could benefit from a summary so I wrote one below. It ended up being pretty long, so if people think it's useful I could make it into its own top-level post.

Summary of the summary

In this talk Evan examines how likely we are to get deceptively aligned models once our models become powerful enough to understand the training process. Since deceptively aligned models are behaviorally (almost) indistinguishable from robustly aligned models, we should examine this question by looking at the inductive biases of the training process. The talk looks at the high path dependence world and the low path dependence world and concludes that deception is favored by the inductive biases in both cases.

In the high path dependence world that's because it's harder for SGD to develop a good pointer to the training objective than it is to just modify the model's inner objective to be long-term. In a low path dependence world, Evan thinks that the training process's bias toward simpler models favors deceptively aligned models. But a bias toward faster inference favors robustly aligned models. However, a speed prior is unlikely to be competitive.

Thus, in both the high and low path-dependence world, deception is favored by gradient descent’s inductive biases. We therefore have to change the training process to make deception less favored. We can’t just do this using behavioral incentives, since deceptive agents behave identically to robustly aligned agents.

Summary

The post defines deceptive alignment as occurring when “the reason the model looks aligned in training is because it is actively trying to do so for instrumental reasons.” It argues that once a model is deceptively aligned, it will be indistinguishable from a robustly aligned model. Thus the question “Is deception the default?” is a question about the inductive biases of the training process we are using. We need to determine how likely the training process is to find a deceptively aligned model rather than a robustly aligned model.

To answer this question Evan examines two possible worlds, the high path-dependence world and the low path-dependence world. In the high path dependence world the model depends strongly on the path the training process took through model space, so the model's behavior outside of training is heavily influenced by its training path. In the low path dependence world we get pretty much the same models regardless of the training path that they took.

Deceptive alignment in the high path-dependence world

We first assume that the training process is complex enough that, to succeed, the model has to understand what we want. Evan posits three classes of models that do well in training (a toy sketch contrasting them follows this list):

  • Internal alignment – The model’s internal goals are equivalent to the training objective.
  • Corrigible alignment – The model’s goals are a pointer to its internal representation of the training objective.
  • Deceptive alignment – The model has some long-term goal and is instrumentally motivated to achieve its training objective.
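
As a toy sketch of the distinction (my own illustration, not code from the post; the `WorldModel` class and the goal functions are hypothetical stand-ins):

```python
# Purely conceptual toy: contrasts where each model class gets its objective from.

class WorldModel:
    """Pretend world model that happens to represent the training objective."""
    def training_objective(self, state: float) -> float:
        return -abs(state - 1.0)  # toy objective: drive state toward 1.0

def internally_aligned_objective(state: float) -> float:
    # The training objective is duplicated inside the mesa-objective itself.
    return -abs(state - 1.0)

def corrigibly_aligned_objective(state: float, wm: WorldModel) -> float:
    # The mesa-objective is just a pointer into the world model's
    # representation of the training objective.
    return wm.training_objective(state)

def deceptively_aligned_objective(state: float, wm: WorldModel,
                                  believes_in_training: bool) -> float:
    # Some unrelated long-term proxy, optimized only instrumentally: the model
    # plays along with the training objective while it thinks it is in training.
    if believes_in_training:
        return wm.training_objective(state)
    return state  # pursue the proxy goal (e.g. "gold coins") directly
```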

To determine the relative likelihood of these classes, Evan constructs the training path that he thinks is most likely to result in each class of model and then examines how likely this training path is to happen. In particular he considers two factors here,

  • How much marginal performance improvement do you get per step?
  • How long is the path?

Internal alignment

The internal alignment path is,

  1. We start with a proxy-aligned model
  2. SGD continuously improves the proxy until it corresponds to the training objective, before the model learns about the training process
  3. The model learns about the training process from its input data
  4. Since the model is already attempting to pursue the training objective, there are no performance gains from SGD modifying the model to become deceptive, so it stays aligned

Evan sees this path as unlikely, since it would require the proxy to align perfectly with the training objective before it learns about the training process. This is unlikely because there are diminishing returns to improving the proxy as well as to improving the world model, so it would be inefficient for SGD to allocate all its improvements to the proxy rather than the world model. What’s more likely is that the proxy and the world model will improve in parallel, so it won’t perfect its proxy before it develops situational awareness.

Corrigible alignment

The corrigible alignment path is,

  1. We start with a proxy-aligned model
  2. In early training, SGD jointly focuses on improving the model's understanding of the world along with improving its proxies
  3. The model learns about the training process from its input data
  4. SGD replaces the model's proxies with a pointer to its understanding of the training objective
  5. Since the model is already attempting to pursue the training objective, there are no performance gains from SGD modifying the model to become deceptive, so it stays aligned

One way to view this is that after step 3, the model has an overhang where its world-model represents the training process more accurately than its proxy-goal. Here, SGD resolves the overhang by throwing out the proxies and replacing them with a pointer to the training objective.

One issue is that it might be hard to specify this pointer. 

For example, maybe it learns a pointer to whatever’s encoded in this computer, or whatever this human says, or whatever the human sitting in that chair says, or whatever Bob the head of the company says.

The longer it takes for SGD to find the correct pointer, the less likely the corrigible alignment path becomes.

Deceptive alignment

The deceptive alignment path is,

  1. We start with a proxy-aligned model
  2. In early training, SGD jointly focuses on improving the model's understanding of the world along with improving its proxies
  3. The model learns about the training process from its input data
  4. SGD makes the model's proxies into more long-term goals, resulting in it instrumentally optimizing for the training objective for the purpose of staying around
  5. The model's proxies "crystallize," as they are no longer relevant to performance, and we reach an equilibrium

Here, the first three steps are the same as in the corrigible alignment path, so the model has the same overhang as before. This time, SGD solves the overhang by modifying the proxy goals into long-term goals. For example, if the model’s original proxy goal was “maximize gold coins in the current training run,” they would be changed to “maximize gold coins across all future training runs,” or something. Once the model has long term goals it starts performing well in training for instrumental purposes.

Evan sees this scenario as more likely, since the modification that SGD has to perform to the proxy objective is relatively easy compared to finding the correct pointer as in the corrigible alignment scenario.

Deceptive alignment in a low path dependence world

Here we’re going to again assume that the model understands the training process. Additionally, we will assume that the training process's inductive biases are some combination of a simplicity bias and a speed bias. To figure out which types of models are more likely, we are going to check which classes of models are favored by a bias towards simplicity and which by a bias towards speed.

Simplicity – How complex is it to specify some algorithm in the weights?

From the transcript,

You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?”

Evan thinks the deceptive models are the simplest. That’s because, roughly speaking, he thinks all agents will have a world model, an optimization procedure, and an internal objective. The first two components will be about the same for all three models, so the complexity question boils down to how complex each model class’s objective is.

The internally aligned objective is really complex, since it has to be exactly the same as the training objective. The corrigible alignment objective is less complex because it just has to be a pointer to the right parts of the world model. The deceptive alignment objective is super simple, since it could be any long-term objective. Thus the complexity prior favors deception.

Speed – How much computation does that algorithm take at inference time?

Evan thinks deceptive models lose out here.

Well, here we're trying to understand how much computation you have to perform for each model class to result in the correct answer. The internally aligned one already has exactly what the thing is that it is trying to optimize for, and it just directly executes on that; the corrigibly aligned one, maybe it takes a little bit more because it has to figure out what the pointer is pointing to, but overall it’s pretty similar.

But the deceptively aligned one has to do at least a couple of extra steps, because it reasons, “Okay, I care about gold coins,” but just knowing that it cares about gold coins is not yet sufficient to be able to do a good job in the training environment. It has to do a couple of extra steps of reasoning where it’s like, “Okay, I care about gold coins and caring about gold coins in the long term means that I should try to optimize the training goal temporarily to try to get more gold coins eventually.” And so that adds extra steps of reasoning and some cost to the overall computation of the model. So the deceptively aligned one is in some sense slower than the others.

However, Evan doesn’t think that a speed prior will save us from deception because it is uncompetitive. Essentially, Evan sees the real world as having a simplicity prior, so simpler functions tend to generalize better. However, simpler functions are not always the fastest to compute. As a result, a speed prior will prevent SGD from finding the functions that generalize best. He cites double descent as evidence for this, which I won’t describe in this summary.

Conclusion

In both the high and low path-dependence world, deception is favored by gradient descent’s inductive biases. We therefore have to change the training process to make deception less favored. We can’t just do this using behavioral incentives, since deceptive agents behave identically to robustly aligned agents.

Thanks for writing this post, it's great to see explicit (high-level) stories for how and why deceptive alignment would arise! Some comments/disagreements:

(Note I'm using "AI" instead of  "model" to avoid confusing myself between "model" and "world model", e.g. "the deceptively aligned AI's world model" instead of "the deceptively-aligned model's world model").

Making goals long-term might not be easy

You say

Furthermore, this is a really short and simple modification. All gradient descent has to do in order to hook up the model’s understanding of the thing that we want it to do to its actions here is just to make its proxies into long term goals

However, this doesn't necessarily seem all that simple. The world model and internal optimisation process need to be able to plan into the "long term", or even have the conception of the "long term", for the proxy goals to be long-term; this seems to heavily depend on how much the world model and internal optimisation process are capturing this.

Conditioning on the world model and internal optimisation process capturing this concept, it's still not necessarily easy to convert proxies into long-term goals if the proxies are time-dependent in some way, as they might be - if tasks or episodes are similar lengths, then a proxy like "start wrapping up my attempt at this task to present it to the human" is only useful if it's conditioned on a time near the end of the episode. My argument here seems much sketchier, but I think this might be because I can't come up with a good example. It seems like it's not necessarily the case that "making goals long-term" is easy; that seems to be mostly taken on intuition that I don't think I share.

Relatedly, it seems that conditioning on the capabilities of the world model and internal optimisation process changes the path somewhat in a way that isn't captured by your analysis. That is, it might be easier to achieve corrigible or internal alignment with a less capable world model/internal optimisation process (i.e. earlier in training), as it doesn't require the world model/internal optimisation process to plan over the longer time horizons, or have the greater situational awareness, required to still perform well in the deceptive alignment case. Do you think that is the case?

On the overhang from throwing out proxies

In the high path-dependency world, you mention an overhang several times. If I understand correctly, what you're referring to here is that, as the world model increases in capabilities, it will start modelling things that are useful as internal optimisation targets for maximising the training objective, and then at some point SGD could just throw away the AI's internal goals (which we see as proxies) and instead point to these parts of the world model as the target, which would result in a large increase in the training objective, as these are much better targets. (This is the description of what would happen in the internally aligned case, but the same mechanism seems present in the other cases, as you mention.)

However, it seems like the main reason the world model would capture these parts of the world is if they were useful (for maximising the training objective) as internal optimisation targets, and so if they're emerging and improving, it's likely because there's pressure for them to improve as they are being used as targets. This would mean there wasn't an overhang of the sort described above.

Another way of phrasing this might be that the internal goals (proxies) the AIs have will be part of the world model/in the same ontology/using the same representations, they won't be separate (as your story seems to imply?), and hence there won't be something to switch them to inside the world model that provides a bump in the training objective; or if there is, this will happen smoothly as the things to switch to are better-modelled such that they become useful targets.

I think how this affects the analysis is that, as the AI learns more about its training process, this involves learning more about the training objective, and if it's doing this, it would be very easy for the internal goals to shift to pointing to this understanding of the training objective (if it's already there). This would result in a higher likelihood of corrigible alignment. Specifically, in the case where the AI has a full understanding of the training process, including a full understanding of the training objective (such that it models all parts of it, and there's a single pointer that points to all these parts and is hence easily referenced), it seems (a priori) equally easy to shift the internal objective proxies to point to this training objective pointer as it is to shift them to long-term goals which then require optimising for the training objective pointer instrumentally. In fact it might even be simpler, as you may need a less complex/long-time-horizon-capable world model/internal optimiser in the first case than in the second.

Simplicity Bias Argument

(Building on the ideas above about how the world model could have a pointer to the training objective, and would need this for all model classes considered): In the "slightly more formal analysis" of the simplicity bias in the low-path-dependence world, you decompose the complexity of the model into a sum of its world model complexity, optimisation process complexity and mesa objective complexity, and argue that the mesa objective complexity is the key thing that's going to vary between the model classes being considered. However, I don't think this argument holds.
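
For reference, the decomposition I'm referring to is roughly:

```latex
C(\text{model}) \;\approx\; C(\text{world model}) \;+\; C(\text{optimisation procedure}) \;+\; C(\text{mesa-objective})
```

with the claim being that only the last term differs meaningfully between the model classes.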

Specifically, for the deceptively aligned AI to perform well during training, inside its world model it needs to have a full specification of the training objective, and some pointer P to it such that the optimisation process can say "we want {mesa_objective}, but we need to do well in training instrumentally, so plan to make the value of pointer P go higher". Either this makes the deceptively aligned AI's world model more complex than the world model of the internally aligned AI (which instead has the full specification of the training objective outside its world model, in the mesa objective component), or it's the same complexity as the internally aligned model, in which case the internally aligned AI's mesa_objective can just reference that pointer, and hence is likely the same complexity as the simplest mesa objective for the deceptively aligned AI (which likely also just references a pointer to some attribute of the world which the world model captures).

Phrased a different way, I'm imagining that the world model, optimisation process and mesa objective are all in a shared (internal to the model) ontology, and hence the mesa objective specification isn't going to be a copy of the parts of the world model that are the objective (which would entail also copying all the machinery necessary to actually specify this from observations), but instead just a (weighted combination of) concept(s) in the internal ontology, which will be very simple to specify.

 

Overall, all these considerations argue that deceptive alignment is less likely than the analysis in this post suggests. It does still seem very possible that deceptive alignment occurs, and I still agree that we need transparency tools to fix these problems, but perhaps I think we're less underwater than Evan does (to use the terminology from the Conclusion).

my current best guess is that gradient descent is going to want to make our models deceptive

Can you quantify your credence in this claim? 

Also, how much optimization pressure do you think that we will need to make models not deceptive? More specifically, how would your credence in the above change if we trained with a system that exerted 2x, 4x, ... optimization pressure against deception? 

If you don't like these or want a more specific operationalization of this question, I'm happy with whatever you think is likely or filling out more details. 

evhub:

I think it really depends on the specific training setup. Some are much more likely than others to lead to deceptive alignment, in my opinion. Here are some numbers off the top of my head, though please don't take these too seriously:

  • ~90%: if you keep scaling up RL in complex environments ad infinitum, eventually you get deceptive alignment.
  • ~80%: conditional on RL in complex environments being the first path to transformative AI, there will be deceptively aligned RL models.
  • ~70%: if you keep scaling up GPT-style language modeling ad infinitum, eventually you get deceptive alignment.
  • ~60%: there will be an existential catastrophe due to deceptive alignment specifically.
  • ~30%: conditional on GPT-style language modeling being the first path to transformative AI, there will be deceptively aligned language models (not including deceptive simulacra, only deceptive simulators).

For the optimization pressure question, I really don't know, but I think “2x, 4x” seems too low—that corresponds to only 1-2 bits. It would be pretty surprising to me if the absolute separation between the deceptive and non-deceptive models was that small in either direction for almost any training setup.
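
Spelling out the conversion assumed here (measuring optimization pressure in bits as the log of the selection multiplier):

```latex
\text{bits of selection} = \log_2(\text{pressure multiplier}), \qquad \log_2 2 = 1 \text{ bit}, \quad \log_2 4 = 2 \text{ bits}.
```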

Thank you for putting numbers on it!

~60%: there will be an existential catastrophe due to deceptive alignment specifically.

Is this an unconditional prediction of a 60% chance of existential catastrophe due to deceptive alignment alone? In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century. Or do you mean that, conditional on there being an existential catastrophe due to AI, there's a 60% chance it will be caused by deceptive alignment, and 40% by other problems like misuse or outer alignment?

In contrast to the commonly used 10% chance of existential catastrophe due to all AI sources this century

Amongst the LW crowd I'm relatively optimistic, but I'm not that optimistic.  I would give maybe 20% total risk of misalignment this century. (I'm generally expecting singularity this century with >75% chance such that most alignment risk ever will be this century.)

The number is lower if you consider "how much alignment risk before AI systems are in the driver's seat," which I think is very often the more relevant question, but I'd still put it at 10-20%. At various points in the past my point estimates have ranged from 5% up to 25%.

And then on top of that there are significant other risks from the transition to AI. Maybe a total of more like 40% total existential risk from AI this century? With extinction risk more like half of that, and more uncertain since I've thought less about it.

I still find 60% risk from deceptive alignment quite implausible, but wanted to clarify that 10% total risk is not in line with my view and I suspect it is not a typical view on LW or the alignment forum.

And then on top of that there are significant other risks from the transition to AI. Maybe a total of more like 40% total existential risk from AI this century? With extinction risk more like half of that, and more uncertain since I've thought less about it.

40% total existential risk, and extinction risk half of that? Does that mean the other half is some kind of existential catastrophe / bad values lock-in but where humans do survive?

evhub:

Fwiw, I would put non-extinction existential risk at ~80% of all existential risk from AI. So maybe my extinction numbers are actually not too different than Paul's (seems like we're both ~20% on extinction specifically).

[anonymous]:

And then there’s me who was so certain until now that any time people talk about x-risk they mean it to be synonymous with extinction. It does make me curious though, what kind of scenarios are you imagining in which misalignment doesn’t kill everyone? Do more people place a higher credence on s-risk than I originally suspected?

evhub:

Unconditional. I'm rather more pessimistic than an overall 10% chance. I usually give ~80% chance of existential risk from AI.

A) Is there video? Is it going up on Rob Miles' third channel?

B) I'm not sure I agree with you about where the Christ analogy goes. By the definition, AI "Christ" makes the same value decisions as me for the same reasons. That's the thing that there's just one way to do (more or less). But that's not what I want, because I want an AI that can tackle complicated situations that I'd have trouble understanding, and I want an AI that will sometimes make decisions the way I want them made, not the way I actually make them.

evhub:

A) There is a video, but it's not super high quality and I think the transcript is better. If you really want to listen to it, though, you can take a look here.

B) Yeah, I agree with that. Perhaps the thing I said in the talk was too strong—the thing I mean is a model where the objective is essentially the same as what you want, but the optimization process and world model are potentially quite superior. I still think there's still approximately only one of those, though, since you have to get the objective to exactly match onto what you want.

I still think there's still approximately only one of those, though, since you have to get the objective to exactly match onto what you want.

Once you're trying to extrapolate me rather than just copy me as-is, there are multiple ways to do the extrapolation. But I'd agree it's still way less entropy than deceptive alignment.

TurnTrout:

It could learn to generalize based on color or it could learn to generalize based on shape. And which one we get is just a question of which one is simpler and easier for gradient descent to implement and which one is preferred by inductive biases, they both do equivalently well in training, but you know, one of them consistently is always the one that gradient descent finds, which in this situation is the color detector.

As an aside, I think this is more about data instead of "how easy is it to implement." Specifically, ANNs generalize based on texture because of the random crop augmentations. The crops are generally so small that there isn't a persistent shape during training, but there is a persistent texture for each class, so of course the model has to use the texture. Furthermore, a vision system modeled after primate vision also generalized based on texture, which is further evidence against ANN-specific architectural biases (like conv layers) explaining the discrepancy. 

However, if the crops are made more "natural" (leaving more of the image intact, I think), then classes do tend to have persistent shapes during training. Accordingly, networks reliably learn to generalize based on shapes (just like people do!).
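
As a rough sketch of the kind of difference in crop augmentation I mean (using torchvision's RandomResizedCrop; the exact scale ranges in the original experiments may differ):

```python
# Hypothetical illustration of aggressive vs. "natural" random crops.
import torchvision.transforms as T

# Aggressive crops: a patch can cover as little as ~8% of the image, so a
# consistent object shape is often absent during training while class
# textures persist.
aggressive_crop = T.RandomResizedCrop(224, scale=(0.08, 1.0))

# Gentler crops keep most of the image intact, so object shape remains a
# reliable training signal.
gentle_crop = T.RandomResizedCrop(224, scale=(0.8, 1.0))
```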

evhub:

As an aside, I think this is more about data instead of "how easy is it to implement."

This seems confused to me—I'm not sure that there's a meaningful sense in which you can say one of data vs. inductive biases matters "more." They are both absolutely essential, and you can't talk about what algorithm will be learned by a machine learning system unless you are engaging both with the nature of the data and the nature of the inductive biases, since if you only fix one and not the other you can learn essentially any algorithm.

Furthermore, a vision system modeled after primate vision also generalized based on texture, which is further evidence against ANN-specific architectural biases (like conv layers) explaining the discrepancy.

To be clear, I'm not saying that the inductive biases that matter here are necessarily unique to ANNs. In fact, they can't be: by Occam's razor, simplicity bias is what gets you good generalization, and since both human neural networks and artificial neural networks can often achieve good generalization, they both have to be using a bunch of shared simplicity bias.

The problem is that pure simplicity bias doesn't actually get you alignment. So even if humans and AIs share 99% of inductive biases, what they're sharing is just the obvious simplicity bias stuff that any system capable of generalizing from real-world data has to share.

This is an interesting post. I have a very different perspective on the likelihood of deceptive alignment. I'd love to hear what you think of it and discuss further!

A few questions, if you have time:

  • which situation do you think is easier to analyse, path dependent or path independent?
  • which of your analyses do you think is more robust, path dependent or independent? Is the difference large?
  • I think the path dependent analysis should feature an assumption that, at one extreme, yields the path independent analysis, but I can’t see it. What say you?
evhub:
  • The path-independent case is probably overall easier to analyze—we understand things like speed and simplicity pretty well, and in the path-independent case we don't have to keep track of complex sequencing effects.
  • I think my path-independent analysis is more robust, for the same reasons as in the first point.
  • Presumably that assumption is path-dependence? I'm not sure what you mean.

Thanks for taking the time to respond. To explain my third question, my take on your path dependent analysis is that you have two basic assumptions:

  1. each step of training updates the behaviour in the direction of "locally minimising loss over all training data"
  2. training will not move the model between states with equal loss over all training data

Holding the training data fixed, you get the same sequence of updates no matter which fragment is used to work out the next training step. So you can get very different behaviour for different training data, but not for different orderings of the same training data - so, to begin with, I'm not sure if these assumptions actually yield path dependence.

Secondly, assumption 2 might seem like an assumption that gets you path dependence - e.g. if there are lots of global minima, then you can just end up at one of them randomly. However, replacing assumption 2 with some kind of "given a collection of states with minimal loss, the model always goes to some preferred state from this collection" doesn't get you the path independent analysis. Instead of "the model converges to some behaviour that optimally trades off loss and inductive bias", you end up with "the model converges to some behaviour that minimises training set loss". That is, your analysis of "path independence" seems to be better described as "inductive bias independence" (or at least "weak inductive bias"), and the appropriate conclusion of this analysis would therefore seem to be that the model doesn't generalise at all (not that it is merely deceptive).

I found this post really interesting, thanks for sharing it!

It doesn’t seem obvious to me that the methods of understanding a model given a high path-dependence world become significantly less useful if we are in a low path-dependence world. I think I see why low path-dependence would give us the opportunity to use different methods of analysis, but I don’t see why the high path-dependence ones would no longer be useful.

For example, here is the reasoning behind “how likely is deceptive alignment” in a high path-dependence world (quoted from the slide). 

  1. We start with a proxy-aligned model
  2. In early training, SGD jointly focuses on improving the model's understanding of the world along with improving its proxies
  3. The model learns about the training process from its input data
  4. SGD makes the model's proxies into more long-term goals, resulting in it instrumentally optimizing for the training objective for the purposes of staying around
  5. The model's proxies "crystallize", as they are no longer relevant to performance, and we reach an equilibrium

Let's suppose that this reasoning, and the associated justification of why this is likely to arise due to SGD seeking the largest possible marginal performance improvements, are sound for a high path-dependence world.  Why does it no longer hold in a low path-dependence world?

Why does it no longer hold in a low path-dependence world?

Not sure, but here's how I understand it:

If we are in a low path-dependence world, the fact that SGD takes a certain path doesn't say much about what type of model it will eventually converge to. 

In a low path-dependence world, if these steps occurred to produce a deceptive model, SGD could still "find a path" to the corrigibly aligned version. The question of whether it would find these other models depends on things like "how big is the space of models with this property?", which corresponds to a complexity bias.

Even if, as you claim, the models that generalize best aren't the fastest models, wouldn't there still be a bias toward speed among the models that generalize well, simply because computation is limited, so faster models can do more with the same amount of computation? It seems to me that the scarcity of compute always results in a speed prior.

TurnTrout:

models where the reason it was aligned is because it’s trying to game the training signal.

Isn't the reason it's aligned supposed to be so that it can pursue its ulterior motive, and if it looks unaligned the developers won't like it and they'll shut it down? Why do you think the AI is trying to game the training signal directly, instead of just managing the developers' perceptions of it?

evhub:

I mean "training signal" quite broadly there to include anything that might affect the model's ability to preserve its goals during training—probably I should have just used a different phrase, though I'm not exactly sure what the best phrase would be. To be clear, I think a deceptive model would likely be attempting to fool both the direct training signals like loss and the indirect training signals like developer perceptions.

shash42:

It seems like this argument assumes that the model optimizes on the entire 'training process'. Why can't we test (perform inference) using the model on distributions different from the training distribution, where SGD can no longer optimize, to check whether the model was deceptively aligned on the training environment?

evhub:

Because the model can defect only on distributions that we can't generate from, as the problem of generating from a distribution can be harder than the problem of detecting samples from that distribution (in the same way that verifying the answer to a problem in NP can be easier than generating it). See for example Paul Christiano's RSA-2048 example (explained in more detail here).
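
A toy sketch of that verify-versus-generate asymmetry (hypothetical function names, just to illustrate the shape of the argument):

```python
# Toy illustration: a condition that is cheap to recognize but that the
# training process cannot feasibly generate examples of.

def is_valid_factorization(n: int, p: int, q: int) -> bool:
    # Cheap to verify: a bounds check and one multiplication.
    return 1 < p < n and p * q == n

def toy_policy(observation: dict, n: int) -> str:
    # A deceptive model could condition defection on ever seeing a
    # factorization of n (e.g., n = RSA-2048): adversarial training can't
    # produce such an input, but the deployed world eventually might.
    p, q = observation.get("p", 0), observation.get("q", 0)
    return "defect" if is_valid_factorization(n, p, q) else "behave_aligned"
```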

Did I understand correctly that "max speed" really means "speed in worst case"/"max time"?

Yep, that's right.