This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. It builds on work done on simulator theory by Janus, who came up with the strategy this post aims to analyze; I’m also grateful to them for their comments and their mentorship during the AI Safety Camp, to Johannes Treutlein for his feedback and helpful discussion, and to Paul Colognese for his thoughts.
TL;DR: Generative models don’t operate on preferences in the way agents or optimizers do, instead just modelling a prior over the world that can return worlds satisfying some conditional, weighted by how likely they are given this prior. This implies that these models may be safer to use than more standard optimizers, although we would likely use them in different ways. For example, at the limit of capability we could use one to do alignment research by simulating a world with a superhuman alignment researcher. That isn’t to say that these models are without their own problems, however: they carry with them versions of the outer and inner alignment problems that are subtly different from those of optimizers. This post explores how generative models are different, ways in which we might use them for alignment, some of the potential problems that could come up, and initial attempts at solving them.
Generative models are trained by self-supervised learning to model a distribution of data. With large language models, the real distribution underlying the textual data comprising their training corpus represents the world as it is now, in theory. As they get stronger, we can assume that they’re getting better at modelling this world prior, as I’ll describe it going forward.
We have the ability to prompt these models, which, at a low level, means that we can give them the first part of some text that they have to complete. At a high level, this means that we can describe some property of the world that may or may not exist, and the model samples from the world prior, using a weighted distribution over the likely worlds that satisfy that property (for simplicity at the cost of technical accuracy, you can imagine it simulating the most likely world satisfying the property). For example, if you give it a prompt that says an asteroid is about to impact the world and finishes with something like “The following is an excerpt from the last public broadcast before the asteroid struck”, then the model simulates a world where that is true by modelling the most likely ways our world would turn out under that conditional. Powerful generative models would be able to do this with high fidelity, such that their generations would, for example, account for the changes to society that would occur between now and then and structure the broadcast and its timing (how long before the strike does it happen if communications go down?) accordingly.
In other words, generative models have the advantage of being strongly biased toward the prior distribution of the universe. What this means is that there’s a strong bias in its outputs to remain close to the universe described by the training data - this could be viewed as a posterior sampling over all possible universes after updating on the training data, subject to simplicity bias, but in practice it wouldn’t be updating on a prior over all universes but building an incremental understanding of the universe given each datum. In GPT-3’s case, for example, this would be the universe that’s described by the corpus it was trained on.
Therefore, conditionals applied to it sort outcomes based on likelihood, not optimality - asking for a lot of paperclips just outputs someone going to buy a lot of paperclips from a store, not a paperclip maximizer.
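To make the likelihood-versus-optimality distinction concrete, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the four hand-written "worlds", their prior probabilities, and the function names are hypothetical, and a real generative model never enumerates worlds explicitly like this. It only shows that conditioning filters and renormalises the prior, while an optimizer ignores the prior entirely.

```python
import random

# Hypothetical toy world prior: world description -> (prior probability, paperclip count).
WORLD_PRIOR = {
    "nobody buys paperclips": (0.90, 0),
    "someone buys a few boxes at a store": (0.0989, 1_000),
    "someone places a bulk order": (0.001, 100_000),
    "a paperclip maximizer converts the planet": (0.0001, 10**12),
}


def condition(prior, predicate):
    """Keep only worlds satisfying the conditional and renormalise their prior weights."""
    kept = {w: p for w, (p, clips) in prior.items() if predicate(clips)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("conditional has zero probability under the prior")
    return {w: p / total for w, p in kept.items()}


def sample_world(posterior, rng):
    """Sample one world in proportion to its conditioned probability."""
    worlds = list(posterior)
    return rng.choices(worlds, weights=[posterior[w] for w in worlds], k=1)[0]


def optimize(prior):
    """What an optimizer would do instead: maximise paperclips, ignoring likelihood."""
    return max(prior, key=lambda w: prior[w][1])


def lots_of_paperclips(clips):
    """The conditional: 'there are a lot of paperclips'."""
    return clips >= 1_000


posterior = condition(WORLD_PRIOR, lots_of_paperclips)
most_likely = max(posterior, key=posterior.get)   # the mundane store purchase
optimum = optimize(WORLD_PRIOR)                   # the paperclip maximizer
one_sample = sample_world(posterior, random.Random(0))
```

Under these made-up numbers the conditioned distribution puts almost all of its mass on the store purchase, whereas the optimizer picks the maximizer world no matter how improbable it is.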
One framing of this (this paragraph and the next contain mild spoilers for up to chapter 17 of Harry Potter and the Methods of Rationality) is to think of the time-turner from HPMOR. The time-turner finds the “simplest” world - the one that requires the fewest bits to reach from ours - that satisfies self-consistency, and you can design a set-up that uses this to get arbitrary conditionals. For example, consider a set-up where a piece of paper from your future self must contain the password to some system, or you pre-commit to modifying it before sending it back in time. In such a set-up, the only consistent outcome should be the one where the paper actually contains the correct password.
However, you could end up with unexpected simpler worlds that are still consistent, such as getting a paper that just says “Do not mess with time”. The analogue to this in generative models is a problem we’ll discuss later. In theory, this could be viewed as an outcome pump weighted by the distribution representing the world right now, selecting for worlds satisfying some outcome. In other words, the relevant take-away is the idea of selecting for worlds based on some objective, but sampled on their likelihood from our current universe, instead of trying to optimize for that objective.
If we can safely use a generative model in this way without having to worry about outer or inner alignment (more on what those mean in this context in a later section), then we could use it to accelerate alignment research - or potentially even solve alignment entirely.
One straightforward way to do this would be to simulate a superhuman alignment researcher, as mentioned earlier, although there are other ways we could do this, and other ways we could use the model for alignment. While I’ll go into some details about some use cases in different futures, the specifics of how we would use such a model to aid alignment (what prompts to use, finding new use cases, etc.) are not the focus of this post, so we need only the broad intuition that it could.
It’s possible to do this because generative models like GPT have in their training data (and consequently, in their world model) the information necessary to comprise an alignment researcher; we just have to extract it. At the very limit, we just have the physics prior - no model would be able to actually reach this because of computational constraints, but one way to frame this is to imagine what level we’re abstracting that to, with more powerful models having better abstractions.
Waiting for a true generative model we could easily extract this from with direct prompts might be dangerous, however, because of the higher probability of catastrophe by that point. Research labs could reach AGI through RL, use generative models as part of agents, or convert generative models to optimizers (decision transformers are an initial foray into this kind of work).
We’re left with a lot of questions about timelines (how long would generative models capable of accelerating alignment research noticeably take to arrive? Would we reach AGI before then?), extraction capabilities (how could we push the capabilities of weaker generative models for alignment to their limit?), and what kind of problems we could run into with using them.
One framing of these problems is to think about the worlds in which this approach is actually useful. Here are two worlds in which this could be critically important as an alignment strategy:
Outer alignment as we refer to it here is different from the concept of outer alignment that we usually use with respect to optimizers, because the problem isn’t at its core about having its values be aligned with ours (one of the hallmarks of generative models is the lack of values or preferences in that sense). Instead, the problem is about making sure its prior is aligned with reality, and relying on that prior being good (at least in some direction that we can detect and work along, with careful application of conditionals).
The outer alignment problem with respect to generative models can be divided into two parts:
In effect, the question we’re asking addresses both parts of the outer alignment problem - can we actually extract aligned simulacra for any of the purposes we would want to put it to?
Adam Jermyn’s post on conditioning generative models discusses this in some detail, so I’ll just add some further points I’m uncertain about.
In the post, Adam focuses on the second part of the outer alignment problem, whether carefully crafted conditionals could get aligned behaviour from the simulacra. Regarding the first part of the problem - even if the textual distribution isn’t equivalent to the world prior, I think that it might not be a huge issue. That would be true in current models as well, and current models still have the parts necessary for extracting similar capabilities - alignment research isn’t likely to be qualitatively different from math problems in some way that the difference between the two distributions would allow for one and not the other. Given that, I think creative application of prompts and the model itself would offset this part of the problem, which is why I’ll focus on the second part for the rest of this section.
That said, I think it’s entirely possible this might not hold true at the limit, that future models might diverge in ways that this becomes a more qualitative problem - I just don’t think it’s very likely. Work such as extracting superhuman capabilities absent from the training data addresses this problem to some extent as well.
Another potential problem is an agent within the simulation having preferences over the real world (so its outputs have the ability to never defect in the simulation while still being misaligned) rather than just over the world described in the simulation. As an illustrative example, imagine humans in our world thinking about the simulation hypothesis and having preferences over what they consider “base reality”.
The inner alignment problem with respect to generative models can be divided into two parts:
There are several ways in which the model could handle self-fulfilling prophecies, some of which we’ll discuss below.
Preliminary testing with GPT-3 seems to imply that it just samples from some prior over human reactions to GPT-3 output. More powerful generative models might just sample from the same kind of prior while selecting for self-fulfilling-ness. What this means is the model has this prior, filters it to get self-fulfilling generations, and then samples from that. This doesn’t seem like it would result in very dangerous behaviour, because it still tries to remain close to the normal world state - as long as a self-fulfilling trajectory isn’t too unlikely, it should be close to as safe as generative models are, especially if we try to make the prior safer somehow, or make self-fulfilling trajectories more likely.
Another way in which the model could resolve self-fulfilling prophecies is for it to “reason” via an entirely different mechanism that tries to do fixed-point solving. This is more likely to produce dangerous predictions, as it doesn’t have the safety that sampling from the prior brings.
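The two mechanisms above can be contrasted in a toy sketch. The prior over predictions, the `REACTION` table (what the world does after hearing a prediction) and both function names are assumptions invented for this example, not anything measured from GPT-3. The first function filters the prior for self-fulfilling predictions and renormalises; the second searches for a fixed point directly and never consults the prior.

```python
# Hypothetical prior over what the model would predict about a stock.
PRIOR = {"rise": 0.50, "fall": 0.30, "flat": 0.19, "crash": 0.01}

# Hypothetical world response: the outcome that follows once a prediction is published.
REACTION = {"rise": "rise", "fall": "rise", "flat": "fall", "crash": "crash"}


def self_fulfilling_posterior(prior, reaction):
    """Filter the prior down to predictions that come true when made, then renormalise."""
    kept = {p: w for p, w in prior.items() if reaction[p] == p}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no self-fulfilling prediction has prior support")
    return {p: w / total for p, w in kept.items()}


def fixed_point_from(start, reaction, max_steps=100):
    """Iterate the reaction map until it stops changing; the prior plays no role."""
    current = start
    for _ in range(max_steps):
        following = reaction[current]
        if following == current:
            return current
        current = following
    return None  # no fixed point reached within max_steps


posterior = self_fulfilling_posterior(PRIOR, REACTION)
from_flat = fixed_point_from("flat", REACTION)
from_crash = fixed_point_from("crash", REACTION)
```

In this toy setting the filtered prior still strongly favours the ordinary "rise" outcome, while the fixed-point search returns "crash" just as readily if that is where it happens to start, which is the sense in which it loses the safety that sampling from the prior provides.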
Johannes Treutlein’s post on training goals for generative models discusses potential solutions to this problem in more detail, which I’ll summarize here with some additions. The generative model could act as a counterfactual oracle, in which the model is only evaluated on predictions whose ground truth is determined by a process independent of the model (for example, it’s only evaluated on text that has already been written). Taking the consequences of its own actions into account while making predictions in this setup would result in poor performance, so there’s a constant gradient against the model behaving that way. Thus, the model is incentivized to make predictions about the future sampled from the prior as always, without considering the influence its predictions could have.
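As a rough sketch of the evaluation rule described above: the loss only counts examples whose ground truth was fixed independently of the model. The `Example` type, the `model_influenced` flag and the function name are hypothetical, and how one would reliably determine that flag in practice is exactly the hard part this sketch assumes away.

```python
import math
from dataclasses import dataclass


@dataclass
class Example:
    predicted_prob: float    # probability the model assigned to what actually happened
    model_influenced: bool   # assumed label: could the model's output have affected the outcome?


def counterfactual_oracle_loss(examples):
    """Mean negative log-likelihood over examples whose outcome was independent of the model."""
    eligible = [e for e in examples if not e.model_influenced]
    if not eligible:
        return 0.0  # nothing to evaluate, so no training signal
    return -sum(math.log(e.predicted_prob) for e in eligible) / len(eligible)


batch = [
    Example(predicted_prob=0.5, model_influenced=False),    # e.g. text that was already written
    Example(predicted_prob=0.001, model_influenced=True),   # outcome the prediction could have shaped
]
loss = counterfactual_oracle_loss(batch)
```

Because the second example is masked out, nothing the model does to steer that outcome can lower its loss, which is the incentive structure the counterfactual oracle setup is aiming for.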
This strategy, however, comes with a few downsides. For one, we cannot train the model using RL from human feedback, as the training signal must come entirely from existing text. For another, this might stunt performance in cases where it needs to model trajectories that contain itself (predicting something about a future where everyone uses the model).
As an aside, self-fulfilling prophecies could also occur if the model is an optimizer - even if its mesa objective is aligned with the loss (interestingly, this would be a case of mesa optimization resulting in a larger class of outcomes for outer alignment, allowing for outer misalignment). LCDT might suffice here.
Using smaller generative models as initializations for larger ones.
(The equivalent ELK proposal goes into this strategy in more detail).
Since we can reasonably assume that smaller models such as GPT-3 aren’t deceptive optimizers (if they were and we couldn’t tell by now, this probably isn’t a world where we win anyway, so might as well assume otherwise), this should allow for the larger models we train this way to remain relatively close in model space to these safer models, thereby making a stronger case against actually getting deceptive optimizers.
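One simple way to picture this kind of initialization is sketched below, under heavy assumptions: it treats a model as a single weight matrix, copies the smaller model's weights into the corner of a larger matrix, and fills the new entries with small noise so that the larger model starts out close to the smaller one. The function name and the `noise` parameter are hypothetical, and real model-growth schemes are considerably more involved.

```python
import random


def grow_weights(small, rows, cols, noise=1e-3, seed=0):
    """Embed a small weight matrix in a larger one, with near-zero noise in the new entries."""
    if rows < len(small) or cols < len(small[0]):
        raise ValueError("the larger matrix must contain the smaller one")
    rng = random.Random(seed)
    # Start from small random values so the new parameters barely change behaviour at first.
    big = [[rng.gauss(0.0, noise) for _ in range(cols)] for _ in range(rows)]
    # Copy the trained weights of the smaller model into the top-left block.
    for i, row in enumerate(small):
        for j, value in enumerate(row):
            big[i][j] = value
    return big


small_weights = [[1.0, 2.0], [3.0, 4.0]]
big_weights = grow_weights(small_weights, rows=3, cols=4)
```

The design choice here is only that the difference between successive models starts small; whether training then keeps the larger model near the smaller one in model space is the open question.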
The same downside as applies to the ELK proposal might apply here, that models at different scales use different internal mechanisms for modelling the world and generation, making this strategy non-competitive.
If this downside applies, is there any kind of control we can exert over the parameter initialization to make optimizers less likely? In a world of low path dependence, this doesn’t seem like a very promising strategy.
An acceptability predicate for myopia.
A complete description of what an acceptability predicate is lies outside the scope of this post, and is long enough that it would make this post even more difficult to follow. I recommend reading Evan’s doc linked above for better clarity, but I’ll try to briefly explain what they are - an acceptability predicate is some mechanistic property (i.e., something about how the model actually works, instead of what it does) such that, if it holds, it makes the model “acceptable” because it guarantees some deeper properties (such as the absence of deception).
They have a list of associated desiderata which a good predicate would satisfy - for our part here, we’ll only consider whether the predicate actually implies some property we desire, whether it has a buffer zone (can the model withstand short periods of time where the predicate doesn’t hold without being dangerous?), and how we can actually mechanistically incentivize the predicate, because that suffices to showcase the difficulty of getting the predicates we will consider to work.
The reason an acceptability predicate for myopia might work here is that while the predicate holds true, generative models would be preferred over misaligned deceptive optimizers. Further, there would be a buffer in the sense that when the predicate fails and the model becomes non-myopic, it wouldn’t be in optimizer space.
For this to be true, we have to consider whether the idea of a non-myopic non-optimizer is a coherent one.
It’s possible that optimizers aren’t very far in model space from a generative model, resulting in a buffer zone that’s very small or non-existent.
To understand another reason why there may not be a satisfactory buffer zone, consider the case of proxy-aligned optimizers.
As an oversimplified example, think of an optimizer that wants to output the word “bananas”, or things related to “bananas”, as much as it can. This optimizer is created during the initial stages of training owing to that part of the training corpus containing text that incentivizes this, and is allowed by the myopia predicate until it gets good performance. If it becomes non-myopic before the distribution changes, then it could recognize deception as a strategy instantly.
Johannes' post goes into more detail on myopia in generative models.
If myopia doesn’t suffice, would something stronger do?
An acceptability predicate for non-agency.
Whether this would work relies in part on the specific definition of agency that we’re trying to incentivize against - but for now, imagine that we just mean it in the sense of a model being able to optimize at all.
How does this predicate fare on buffer zones? If a model becomes agentic briefly, it could encode into its world model a deceptive super-intelligence that has its objective, before SGD guides it back into the safe zone.
This seems like a very broad predicate, however. What would it actually look like? How would we mechanistically incentivize something like non-agency? This is, in my view, the largest bottleneck to this predicate, and is a good argument for thinking about potential predicates by starting from what we can actually mechanistically incentivize.
We discussed earlier the potential worlds in which this strategy as a whole could be useful. Now we’ll consider how trying to account for the various alignment problems described above would impact the competitiveness of whatever generative model we end up using.
Adding a disclaimer here that this will depend heavily on the form the solutions to the above problems take, especially with problems as unexplored as self-fulfilling prophecies. Consequently, this section is pretty tentative and only meant to give a weak understanding of the ways the alignment tax would come into play with generative models.
In the real world:
I’m particularly excited about this strategy in part because it seems like a lot of the relevant work can be done right now, and we actually have a solid idea of the kind of model we want to end up with. Though this post goes into some detail about various aspects of conditioning generative models for alignment research, it is primarily a broad overview and there remain many ideas and research directions to explore.
I expect that one of the bottlenecks to working on this strategy will be understanding why generative models can be used this way safely and the new framing under which alignment problems fall. To that end, my hope in writing this is that it can provide a base of the core ideas and concerns for people to work off of.
Ultimately, while there are many benefits to this strategy because we only have to really align an oracle AI, there still are many problems that can arise. Success would look like some guarantee (whether argumentative or via some mechanistic predicate) against the four sub-problems comprising outer and inner alignment as described above. There’s still a lot of work to be done, but I’m very excited (in the way one is when thinking about ways to avoid the world ending) to see where this approach will lead.
If you’re wondering about the possibility of malign simulations (such as an optimizer “god” overseeing the simulation) and are familiar with the ELK report, one way to view this is to think of the proposal of training a sequence of reporters for successively more powerful predictors, where you can use the weights of a simpler predictor to initialize a larger one and keep the difference at each iteration small. Since direct translation is easier than human simulation for simpler predictors, this could result in direct translation for more complex predictors too. The downside to this solution (if the model uses discrete modes of prediction, then this approach would stunt its ability to update toward those better modes as it becomes larger) has an analogue in generative models, in the form of different kinds of world models. As the generative models keep incrementally building their world model from the previous one, it’s difficult for them to update far enough away to reach malign simulations when they have to keep getting good performance on the next example - generative models are myopic.
I’m not sure what these other channels would be, but the agent’s scenario is different enough that I don’t want to rule out the possibility of there being other channels it can get some kind of info from.
I suspect that there's a ton of room to get more detailed here, and some of the claims or conclusions you reach in this post feel too tenuous, barring that more detailed work. I will give some feedback that probably contains some mistakes of my own:
Thanks for the feedback!
I agree that there's lots of room for more detail - originally I'd planned for this to be even longer, but it started to get too bloated. Some of the claims I make here unfortunately do lean on some of that shared context yeah, although I'm definitely not ruling out the possibility that I just made mistakes at certain points.
Re: prompting: So when you talk about "simulating a world," or "describing some property of a world," I interpreted that as conditionalizing on a feature of the AI's latent model of the world, rather than just giving it a prompt like "You are a very smart and human-aligned researcher." This latter deviates from the former in some pretty important ways, which should probably be considered when evaluating the safety of outputs from generative models.
Re: prophecies: I mean that your training procedure doesn't give an AI an incentive to make self-fulfilling prophecies. I think you have a picture where an AI with inner alignment failure might choose outputs that are optimal according to the loss function but lead to bad real-world consequences, and that these outputs would look like self-fulfilling prophecies because that's a way to be accurate while still having degrees of freedom about how to affect the world. I'm saying that the training loss just cares about next-word accuracy, not long term accuracy according to the latent model of the world, and so AI with inner alignment failure might choose outputs that are highly probable according to next word accuracy but lead to bad real-world consequences, and that these outputs would not look like self-fulfilling prophecies.
Sorry for the (very) late reply!
I'm not very familiar with the phrasing of that kind of conditioning - are you describing finetuning, with the divide mentioned here? If so, I have a comment there about why I think it might not really be qualitatively different.
I think my picture is slightly different for how self-fulfilling prophecies could occur. For one, I'm not using "inner alignment failure" here to refer to a mesa-optimizer in the traditional sense of the AI trying to achieve optimal loss (I agree that in that case it'd probably be the outcome you describe), but to a case where it's still just a generative model, but needs some way to resolve the problem of predicting in recursive cases (for example, asking GPT to predict whether the price of a stock would rise or fall). Even for just predicting the next token with high accuracy, it'd need to solve this problem at some point. My prediction is that it's more likely for it to just model this via modelling increasingly low-fidelity versions of itself in a stack, but it's also possible for it to do fixed-point reasoning (like in the Predict-O-Matic story).
Fascinating work, thanks for this post.
Do you have a link to the ELK proposal you're referring to here? (I tried googling for "ELK" along with the bolded text above but nothing relevant seemed to come up.)
Do you have thoughts on how to achieve this predicate? I've written some about interpretability-based myopia verification which I think could be the key.
I think [non-myopic non-optimizer is a coherent concept] - as a simple example we could imagine GPT trained for its performance over the next few timesteps. Realistically this would result in a mesa-optimizer, but in theory it could just run a very expensive version of next-token generation, over the much larger space of multiple tokens.
"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point. (This is a nitpick and I also don't mean to trivialize the inner alignment problem which I am quite worried about! But I did want to make sure I'm not missing anything here and that I'm broadly on the same page as other reasonable folks about expectations/evidence for mesa-optimizers.)
An acceptability predicate for non-agency. [...] If a model becomes agentic briefly, it could encode into its world model a deceptive super-intelligence that has its objective, before SGD guides it back into the safe zone.
That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.
This seems like a very broad predicate, however. What would it actually look like?
I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, then that might suffice for this.
I'm excited; I've explored before how having interpretability that can both reliably detect mesa-optimizers and read off their goals would have the potential to solve alignment. But I hadn't considered how reliable mesa-optimization detection alone might be enough, because I wasn't considering generative models in that post. (Even if I had, I wasn't yet aware of some of the clever and powerful ways that generative models could be used for alignment that you describe in this post.)
How would we mechanistically incentivize something like non-agency?
I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.
Do you have a link to the ELK proposal you're referring to here?
Yep, here. I linked to it in a footnote, didn't want redundancy in links, but probably should have anyway.
"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.
Hmm, I was thinking of that under the frame of the future point where we'd worry about mesa-optimizers, I think. In that situation, I think mesa-optimizers would be more likely than not because the task is much harder to achieve good performance on (although on further thought I'm holding less strongly to this belief because of ambiguity around distance in model space between optimizers and generative models). I agree that trying to do this right now would probably just result in bad performance.
I agree that we'll need a strong constant gradient to prevent this (and other things), but while I think this is definitely something to fix, I'm not very worried about this possibility. Both because the model would have to be simultaneously deceptive in the brief period it's agentic, and because this might not be a very good avenue of attack - it might be very hard to do this in a few timesteps, the world model might forget this, and simulations may operate in a way that only really "agentifies" whatever is being directly observed / amplified.
I agree almost entirely - I was mainly trying to break down the exact capabilities we'd need the interpretability tools to have there. What would detecting mesa-optimizers entail mechanistically, etc.
But I hadn't considered how reliable mesa-optimization detection alone might be enough, because I wasn't considering generative models in that post.
I think this is very promising as a strategy yeah, especially because of the tilt against optimization by default - I think my main worries are getting it to work before RL reaches AGI-level.
I think they have a strong bias (in a very conceptual sense) against something like agency, but larger models could end up being optimizers because that achieves greater performance past a certain scale like you said, because of different training paths - or even if it's just pretty easy to make one an optimizer if you push it hard enough (with RL or something), that could still reduce the time we have.
Is the loss we’re training the generative model on - in the case of language models, the predictive loss over the next token - actually representative of the world prior?
This seems important and is not a thing I've thought about carefully, so thanks for bringing it up and exploring it. I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.
If they can detect when they’re in deployment, then they could act in malign ways.
The more I've thought about this one the more I'm not worried about this precise danger.
It would be very strange for a predictive model attempting to draw plausible trajectories through time to simulate trajectories in which agents notice inconsistencies and decide that they're in simulations. Agents can still conclude they're in simulations, but it would be weird for this to be because they noticed inconsistencies in their worlds, because the agents and world are being constructed together as part of a predictive task. Predicting that the agent notices an inconsistency requires the generative model to know that there's an inconsistency, at which point the better solution (from a 'drawing likely trajectories' perspective) is to just make the world consistent.
That said, there are very closely related dangers that I am worried about. For instance there can be agents that act as if they're in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world). This feels like a thing we can make less likely with appropriate prompting, which makes me hope that it may not be too big a problem in practice, but (barring powerful interpretability tools) I don't think we can rule it out.
The way these models deal with self-fulfilling prophecies.
I'm currently pretty worried about this, so was happy to see you thinking about it.
I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.
Hmm, I was thinking more of a problem with text available in the training datasets not being representative of the real world we live in (either because it isn't enough information to pick out our world from a universal prior, or because it actually describes a different world better), not whether its capabilities or abstractive reasoning don't help with time-separated prediction.
Predicting that the agent notices an inconsistency requires the generative model to know that there's an inconsistency, at which point the better solution (from a 'drawing likely trajectories' perspective) is to just make the world consistent.
I think I'm picturing different reasons for a simulacra agent to conclude that they're in a simulation than noticing inconsistencies. Some specifics include worlds that are just unlikely enough anthropically (because of a conditional we apply, for example) to push up credence in a simulation hypothesis, or they notice the effects of gradient descent (behavioural characteristics of the world deviating from "normal" behaviour tend to affect the world state), or other channels that may be available by some quirk of the simulation / training process, but I'm not holding to any particular one very strongly. All of which to say that I agree it'd be weird for them to notice inconsistencies like that.
For instance there can be agents that act as if they're in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world).
Yep, I think this could be a problem, although recent thinking has updated me slightly away from non-observed parts of the simulation having consistent agentic behaviour across time.