Conditioning Generative Models for Alignment

Jozdien

This post was written under Evan Hubinger’s direct guidance and mentorship, as a part of the Stanford Existential Risks Institute ML Alignment Theory Scholars (MATS) program. It builds on work done on simulator theory by Janus, who came up with the strategy this post aims to analyze; I’m also grateful to them for their comments and their mentorship during the AI Safety Camp, to Johannes Treutlein for his feedback and helpful discussion, and to Paul Colognese for his thoughts.

TL;DR: Generative models don’t operate on preferences in the way agents or optimizers do, instead just modelling a prior over the world that can return worlds satisfying some conditional, weighted by how likely they are given this prior. This implies that these models may be safer to use than more standard optimizers, although we would likely use them in different ways. For example, at the limit of capability we could use it to do alignment research by simulating a world with a superhuman alignment researcher. That isn’t to say that these models are without their own problems however, carrying with them versions of the outer and inner alignment problems that are subtly different from those of optimizers. This post explores how generative models are different, ways in which we might use them for alignment, some of the potential problems that could come up, and initial attempts at solving them.

How do generative models work?

Generative models are trained by self-supervised learning to model a distribution of data. With large language models, the real distribution underlying the textual data comprising its training corpus represents the world as it is now, in theory. As they get stronger, we can assume that they’re getting better at modelling this world prior, as I’ll describe it going forward.

We have the ability to prompt these models, which, at a low-level, means that we can give it the first part of some text that it has to complete. High-level, this means that we can describe some property of the world that may or may not exist, and the model samples from the world prior and uses a weighted distribution of the likely worlds that satisfy that property (for simplicity at the cost of technical accuracy, you can imagine it simulating the most likely world satisfying the property). For example, if you give it a prompt that says an asteroid is about to impact the world and finishes with something like “The following is an excerpt from the last public broadcast before the asteroid struck”, then the model simulates a world where that is true by modelling the way the most likely ways our world would turn out in that conditional. Powerful generative models would be able to do this with high fidelity, such that their generations would, for example, account for the changes to society that would occur between now and then and structure the broadcast and its timing (how long before the strike does it happen if communications go down?) accordingly.

In other words, generative models have the advantage of being strongly biased toward the prior distribution of the universe. What this means is that there’s a strong bias in its outputs to remain close to the universe described by the training data - this could be viewed as a posterior sampling over all possible universes after updating on the training data, subject to simplicity bias, but in practice it wouldn’t be updating on a prior over all universes but building an incremental understanding of the universe given each datum.^[1] In GPT-3’s case, for example, this would be the universe that’s described by the corpus it was trained on.

Therefore, conditionals applied to it sort outcomes based on likelihood, not optimality - asking for a lot of paperclips just outputs someone going to buy a lot of paperclips from a store, not a paperclip maximizer.

One framing of this is (this paragraph and the next contain mild spoilers for up to chapter 17 of Harry Potter and the Methods of Rationality) to think of the time-turner from HPMOR. You can design a set-up where the time turner finding the “simplest” world - the one that requires the least amount of bits to reach from ours - that satisfies self-consistency is used to get arbitrary conditionals. For example, consider a set-up where a piece of paper from your future self must contain the password to some system, or you pre-commit to modifying it before sending it back in time. In such a set-up, the only consistent outcome should be the one where the paper actually contains the correct password.

However, you could end up with unexpected simpler worlds that are still consistent, such as getting a paper that just says “Do not mess with time”. The analogue to this in generative models is a problem we’ll discuss later. In theory, this could be viewed as an outcome pump weighted by the distribution representing the world right now, selecting for worlds satisfying some outcome. In other words, the relevant take-away is the idea of selecting for worlds based on some objective, but sampled on their likelihood from our current universe, instead of trying to optimize for that objective.

Why is this important?

If we can safely use a generative model in this way without having to worry about outer or inner alignment (more on what those mean in this context in a later section), then we could use it to accelerate alignment research - or potentially even solve alignment entirely.

One straightforward way to do this would be to simulate a superhuman alignment researcher, as mentioned earlier, although there are other ways we could do this, and other ways we could use the model for alignment. While I’ll go into some details about some use cases in different futures, the specifics of how we would use such a model to aid alignment (what prompts to use, finding new use cases, etc) is not the focus of this post, so we need only the broad intuition that it could.

It’s possible to do this because generative models like GPT have in their training data (and consequently, in their world model), the information necessary to comprise an alignment researcher, we just have to extract it. At the very limit, we just have the physics prior - no model would be able to actually reach this because of computational constraints, but one way to frame this is to imagine what level we’re abstracting that to, with more powerful models having better abstractions.

Waiting for a true generative model we could easily extract this from with direct prompts might be dangerous, however, because of the higher probability of catastrophe by that point. Research labs could reach AGI through RL, use generative models as part of agents, or convert generative models to optimizers (decision transformers are an initial foray into this kind of work).

We’re left with a lot of questions about timelines (how long would generative models capable of accelerating alignment research noticeably take to arrive? Would we reach AGI before then?), extraction capabilities (how could we push the capabilities of weaker generative models for alignment to their limit?), and what kind of problems we could run into with using them.

One framing of these problems is to think about the worlds in which this approach is actually useful. Here are two worlds in which this could be critically important as an alignment strategy:

Large language models continue to be the dominant paradigm for all AI, to the extent that competitors trying to develop more powerful RL agents are less competitive. This could be either because the scaling law holds strong enough for LLMs that we keep getting incredible returns that justify an increasing share of AI investments, because RL strategies are less tractable in some way, or generally because we’re in a homogenous take-off scenario. In this world, we could just use very powerful generative models to solve alignment for us directly.
- This seems unlikely for a number of reasons. Primarily - to me - two stand out:
  First, there isn’t that clear a distinction between RL research and LLM research - decision transformers exist, and there’s also the possibility that generative models and optimizers are close enough in model space that using RL on a sufficiently powerful LLM would directly get you an optimizer. I don't think that’s necessarily true, but I don’t have a strong opinion on that and the larger point, that it’s indicative of a class of dangers like this, still holds.
  Second, right now we see a lot of RL research happening, some of which returns pretty exciting (read: dangerous) results. We might need to start seeing actual take-off indicators from LLMs for that to change, but that cuts it pretty close to a failure mode.
- In the next sections, we explore the various alignment problems associated with this approach. I think that some of them are more likely and harder to solve in these worlds because they’re exacerbated by the model being more powerful and being able to explore a larger model space. That said, we’d also have more time to work on those problems in these worlds.
Language models currently (or in the immediate future) can impact the pace of alignment progress noticeably. We can employ the same kinds of extraction methods to get some kind of alignment research accelerator.
Ideally, this would be about getting the model to directly do any novel alignment research for us, but it could also involve softer approaches, such as fine-tuning the model on alignment content and getting it to critique ideas or proposals or expand an idea or a draft to a complete post (saving time, but also potentially approaching something from different angles). We could even just have the weaker model try to simulate an alignment researcher anyway, to have it come up with a bunch of alignment-flavour research or ideas in the hope that it’s powerful enough for some of them to be meaningful.
- This would have to involve research into strategies for extracting this kind of capability from generative models. On the bright side, it might be possible to do a lot of work necessary for this right now, by extracting other capabilities from current models.
- I mentioned earlier that more powerful models might find some of the problems associated with this approach to be harder or more likely. This is not to say that we don’t have to worry about those problems at all in these worlds - in fact, I think that with any model that we can actually do something useful with, we’d have to deal with some version of many of the problems.
  However, in these worlds, we’re likely to find ourselves dealing with language models that work similarly to current models. This makes it less likely that we’d have to deal with accidental optimizers, deceptive simulacra powerful enough to be dangerous, and the dangers linked to self-fulfilling prophecies.

Outer Alignment

What does Outer Alignment mean in this context?

Outer alignment as we refer to it here is different from the concept of outer alignment that we usually use with respect to optimizers, because the problem isn’t at its core about having its values be aligned with ours (one of the hallmarks of generative models is the lack of values or preferences in that sense). Instead, the problem is about making sure its prior is aligned with reality, and relying on that prior being good (at least in some direction that we can detect and work along, with careful application of conditionals).

The outer alignment problem with respect to generative models can be divided into two parts:

Is the loss we’re training the generative model on - in the case of language models, the predictive loss over the next token - actually representative of the world prior? In other words, at the limit of training, is an inner-aligned model learning to model the world from that loss, or is the distribution underlying text different in some fundamental way?
Is this prior actually favourable? At the limit - the physics prior, for example - does it result in aligned simulacra (i.e, an aligned agent inside the simulation); does it result in a world that we can actually rely on to give us good outcomes given carefully crafted conditionals, instead of something like deceptive simulated researchers?
In other words, this is asking whether the world as it is, is favourable enough for two properties to hold: that we wouldn’t get this kind of behaviour by default and that our conditionals, even if they represent unlikely worlds, wouldn’t result in worlds that satisfy them through other methods as discussed in the post linked above (although this may require effort toward designing those conditionals well).

Problems with Outer Alignment

In effect, the question we’re asking addresses both parts of the outer alignment problem - can we actually extract aligned simulacra for any of the purposes we would want to put it to?

Adam Jermyn’s post on conditioning generative models discusses this in some detail, so I’ll just add some further points I’m uncertain about.

In the post, Adam focuses on the second part of the outer alignment problem, whether carefully crafted conditionals could get aligned behaviour from the simulacra. Regarding the first part of the problem - even if the textual distribution isn’t equivalent to the world prior, I think that it might not be a huge issue. That would be true in current models as well, and current models still have the parts necessary for extracting similar capabilities - alignment research isn’t likely to be qualitatively different from math problems in some way that the difference between the two distributions would allow for one and not the other. Given that, I think creative application of prompts and the model itself would offset this part of the problem, which is why I’ll focus on the second part for the rest of this section.

That said, I think it’s entirely possible this might not hold true at the limit, that future models might diverge in ways that this becomes a more qualitative problem - I just don’t think it’s very likely. Work such as extracting superhuman capabilities absent from the training data addresses this problem to some extent as well.

Another potential problem is an agent within the simulation having preferences over the real world (so its outputs have the ability to never defect in the simulation while still being misaligned) rather than just over the world described in the simulation. As an illustrative example, imagine humans in our world thinking about the simulation hypothesis and having preferences over what they consider “base reality”.

These agents may be deceptive. If they can detect when they’re in deployment, then they could act in malign ways. Deployment could be detected through something like the presence of a factorization of RSA-2048 (if the model uses online learning, this would be present eventually).
- Depending on how the generative model represents its world model and how it responds to prompts, it might be possible to create a simulated world with factoring if its abstraction methods allow. For example, we could add to the prompt something like “X knows the factoring for RSA” - which in itself isn’t that unlikely, it’s easy for someone to believe that.
  This may, however, seriously affect the simulacra’s judgement on other things by virtue of it being able to be fooled on this.
  Further, the factorization of RSA-2048 is just one of a large number of potential checks a model could use to detect deployment. We cannot feasibly test each individually with special prompts, leaving us with the options of either designing prompts that target this entire class (or large sections of it) or dealing with deceptive simulacra some other way.
What kind of behaviour could an agent inside a simulation have to effect its desired changes in an overworld it seemingly has no information about apart from what can be derived anthropically?
- It could employ a soft version of gradient hacking, and try to influence the simulation in ways that change the likely generations of the model such that the gradient pushes it toward behaviour the agent desires.
  As a very simplistic example, imagine an agent that desires the increased presence of toxicity in the real world and therefore influences the simulated world such that its generations are less toxic, forcing poor performance when the text’s ground truth is toxic and updating the model to be more toxic. There’s a deeper discussion here about the tractability of this kind of gradient filtering (such as whether SGD wouldn’t update away the agent causing poor performance), which is deep enough to merit focus in its own post.
  More generally, the agent has some influence on the world model it’s in, and it’s only a question of how much influence and whether the agent can figure out how to make full use of it in the right direction - and I’m hesitant about betting on low capability as a fix to problems like this. It’s also possible that there are other channels^[2] by which it can get clues as to its actual situation, improving the agent’s chances here.
- It could have no effect on the model’s generations at all until it “defects” when it detects some real-world critical point, such as if we prompted the generative model to give us an alignment solution or to give us the design for an AI.
Are deceptive simulacra that try to implement these strategies likely? Humans in our world reason about the simulation hypothesis often, but we don’t let it influence our actions in a large way.
- The simulacra might have stronger reasons to believe it’s in a simulation, especially if the world the conditional results in is unlikely enough that there are stronger anthropic reasons. And as mentioned above, there could be other channels through which it gets clues as to the truth.
- Even accounting for that, this is hard to tell, but I think it’s a distinct enough possibility that we can’t bet on getting it right. This is especially bad in cases where it’s specifically encoded into the network to try and affect the real world (we’ll discuss one of these cases in a bit).
Can this situation be detected or bypassed using good prompt engineering?
- We discussed the case of the agent that only defects in real-world-sensitive situations briefly above. My current guess is that detection behaviourally - for example, with honeypots as in Adam’s post - might not work, but mechanistic detection within the simulation (something like DeepMind using advanced interpretability tools in the simulation, as in Evan’s original prompt, could be adapted for humans) might.
- In the case of the agent that gradient filters, honeypots alone could work because by definition it has behavioural differences from benign simulacra.
- When it comes to bypassing, we would require prompts that somehow exclude agents like this from the world model.
  - This can bring problems due to the presence of another conditional that has to be applied, which could deviate the resulting world from the prior even further.
  - However, I think good prompt engineering could work for this, applying some conditional on the world that biases against this - somewhat like a predicate. Potential “predicates” include a conditional against anthropic reasoning strongly influencing an agent’s actions, a conditional against reasoning about the simulation hypothesis (An example for both: describing the agent as being highly sceptical of the validity of either), conditionals on some form of mechanistic detection acting in the background in the world model making agents like this less likely, etc.

Inner Alignment

What does Inner Alignment mean here?

The inner alignment problem with respect to generative models can be divided into two parts:

The model actually turns out to be an optimizer that was instrumentally acting as a generative model.
There are some reasons to think at first that training via supervised learning to minimize loss on generations is much more likely to result in a generative model - that generative models seem at first glance far in parameter space from optimizers. However, those aren’t strong arguments and it’s very possible they’re not true. Ideally, we’d have a stronger and more solid argument against the possibility of it resulting in an optimizer.
The way these models deal with self-fulfilling prophecies.
Consider a situation where the generative model’s prediction (for this section, we’ll refer to the model’s generations as its predictions for the world described by the prompt) will influence the likelihood of that prediction being true. This raises the question of how the model handles what predictions to actually make, since the act of picking one could raise or lower its probability of being correct.
As an example, you can imagine prompting the model for what you (the writer of the prompt) would say in response to its output, or asking it whether to predict whether a stock will rise or fall (where predicting either would result in it being true as people rush to buy or sell).

Problems with Inner Alignment

Self-fulfilling prophecies

There are several ways in which the model could handle self-fulfilling prophecies, some of which we’ll discuss below.

Preliminary testing with GPT-3 seems to imply that it just samples from some prior over human reactions to GPT-3 output. More powerful generative models might just sample from the same kind of prior while selecting for self-fulfilling-ness. What this means is the model has this prior, filters it to get self-fulfilling generations, and then samples from that. This doesn’t seem like it would result in very dangerous behaviour, because it still tries to remain close to the normal world state - as long as a self-fulfilling trajectory isn’t too unlikely, it should be close to as safe as generative models are, especially if we try to make the prior safer somehow, or make self-fulfilling trajectories more likely.

Another way in which the model could resolve self-fulfilling prophecies is for it to “reason” via an entirely different mechanism that tries to do fixed-point solving. This is more likely to produce dangerous predictions, as it doesn’t have the safety that sampling from the prior brings.

Johannes Treutlein’s post on training goals for generative models discusses potential solutions to this problem in more detail, which I’ll summarize here with some additions. The generative model could act as a counterfactual oracle, in which the model is only evaluated on predictions whose ground truth is determined by a process independent of the model (for example, it’s only evaluated on text that has already been written). Taking the consequences of its own actions into account while making predictions in this setup would result in poor performance, so there’s a constant gradient against the model behaving that way. Thus, the model is incentivized to make predictions about the future sampled from the prior as always, without considering the influence its predictions could have.

This strategy, however, comes with a few downsides. For one, we cannot train the model using RL from human feedback, as the training signal must come entirely from existing text. For another, this might stunt performance in cases where it needs to model trajectories that contain itself (predicting something about a future where everyone uses the model).

As an aside, self-fulfilling prophecies could also occur if the model is an optimizer - even if its mesa objective is aligned with the loss (interestingly, this would be a case of mesa optimization resulting in a larger class of outcomes for outer alignment, allowing for outer misalignment). LCDT might suffice here.

Potential strategies against deceptive optimizers

Using smaller generative models as initializations for larger ones.

(The equivalent ELK proposal goes into this strategy in more detail).

Since we can reasonably assume that smaller models such as GPT-3 aren’t deceptive optimizers (if they were and we couldn’t tell by now, this probably isn’t a world where we win anyway, so might as well assume otherwise), this should allow for the larger models we train this way to remain relatively close in model space to these safer models, thereby making a stronger case for actually getting deceptive optimizers.

The same downside as applies to the ELK proposal might apply here, that models at different scales use different internal mechanisms for modelling the world and generation, making this strategy non-competitive.

If this downside applies, is there any kind of control we can exert over the parameter initialization to make optimizers less likely? In a world of low path dependence, this doesn’t seem like a very promising strategy.

An acceptability predicate for myopia.

A complete description of what an acceptability predicate is lies outside the scope of this post, and is long enough that it would make this post even more difficult to follow. I recommend reading Evan’s doc linked above for better clarity, but I’ll try to briefly explain what they are - an acceptability predicate is some mechanistic property (i.e, something about how the model actually works, instead of what it does) such that, if it holds, makes the model “acceptable” because it guarantees some deeper properties (such as the absence of deception).

They have a list of associated desiderata which a good predicate would satisfy - for our part here, we’ll only consider whether the predicate actually implies some property we desire, whether it has a buffer zone (can the model withstand short periods of time where the predicate doesn’t hold without being dangerous?), and how we can actually mechanistically incentivize the predicate, because that suffices to showcase the difficulty of getting the predicates we will consider to work.

The reason an acceptability predicate for myopia might work here is that while the predicate holds true, generative models would be preferred over misaligned deceptive optimizers. Further, there would be a buffer in the sense that when the predicate fails and the model becomes non-myopic, it wouldn’t be in optimizer space.

For this to be true, we have to consider whether the idea of a non-myopic non-optimizer is a coherent one.

I think it is - as a simple example we could imagine GPT trained for its performance over the next few timesteps. Realistically this would result in a mesa-optimizer, but in theory it could just run a very expensive version of next-token generation, over the much larger space of multiple tokens.
Depending on your definition of myopia, however, this might still not qualify as non-myopic - although the validity of this specific point shouldn’t matter in the larger picture, because the strategy is flawed in other ways.

It’s possible that optimizers aren’t very far in model space from a generative model, resulting in a buffer zone that’s very small or non-existent.

However, what would it look like if this weren’t true? Would we see models intermediate between optimizers and generative models and if so, what would they look like?

To understand another reason why there may not be a satisfactory buffer zone, consider the case of proxy-aligned optimizers.

As an oversimplified example, think of an optimizer that wants to output the word “bananas”, or things related to “bananas”, as much as it can. This optimizer is created during the initial stages of training owing to that part of the training corpus containing text that incentivizes this, and is allowed by the myopia predicate until it gets good performance. If it becomes non-myopic before the distribution changes, then it could recognize deception as a strategy instantly.

Johannes' post goes into more detail on myopia in generative models.

If myopia doesn’t suffice, would something stronger do?

An acceptability predicate for non-agency.

Whether this would work relies in part on the specific definition of agency that we’re trying to incentivize against - but for now, imagine that we just mean it in the sense of a model being able to optimize at all.

How does this predicate fare on buffer zones? If a model becomes agentic briefly, it could encode into its world model a deceptive super-intelligence that has its objective, before SGD guides it back into the safe zone.

All the concerns we discussed in the earlier section about deceptive simulacra apply here. However, these simulacra can have access to a lot of real-world information (or channels through which it can access this information) relevant to achieving their objectives, which it can receive during the transfer, making it more dangerous.
While the model is an agent, it has to find a way to encode a deceptive super-intelligence (or a human) with its objective. It wouldn’t have control over its own weights directly, however, only through the channel of SGD, and that for the few time steps it exists. The agent has to try to hack its own gradient during that time.
This might be the kind of change that’s hard enough to effect that it would require precise understanding of the loss landscape, if it’s possible at all in a few time steps.
- The model might still have this precise understanding by virtue of having that good a world model, which includes it.
- Creative ML methods could make this easier, especially if the model can come up with ones we haven’t thought of that are even more powerful.
Would the model remember this deceptive simulacrum? It would only be added to the network in a few timesteps, and over many future epochs, the generative model could just forget about it as part of its world model.
- It could be possible for the model to encode this simulacrum in a way that’s robust to further gradient updates.
- The simulacrum could try to improve predictive performance on its own, especially with access to the information used in the transfer.
- The simulacrum’s weights could be entangled with parts of the world model that have good performance in a way that makes it hard to remove it without having a detrimental effect on performance. Conditional on the model being able to encode something as complex as the simulacrum, this might not be much harder.
Some avenues that might prevent this:
- Detecting this kind of agency transfer explicitly, perhaps after performing experiments to see what it would look like.
- If it’s possible to tell when the predicate was just in a shaky state, then be extra sensitive to unexpected changes in the loss, which might detect gradient hacking.
- If there is a sequence of episodes with increased loss, roll back to a previous state.
- Prompt against this possibility. Since the simulacrum was presumably not that difficult to encode, it might be possible to prompt against this entire class of deceptive simulacra.

This seems like a very broad predicate, however. What would it actually look like? How would we mechanistically incentivize something like non-agency? This is, in my view, the largest bottleneck to this predicate, and is a good argument for thinking about potential predicates by starting from what we can actually mechanistically incentivize.

Competitiveness

We discussed earlier the potential worlds in which this strategy as a whole could be useful. Now we’ll consider how trying to account for the various alignment problems described above would impact the competitiveness of whatever generative model we end up using.

Adding a disclaimer here that this will depend heavily on the form the solutions to the above problems take, especially with problems as unexplored as self-fulfilling prophecies. Consequently, this section is pretty tentative and only meant to give a weak understanding of the ways the alignment tax would come into play with generative models.

During training:

If the world prior is too malign to counteract feasibly with prompt engineering, we might have to restructure training to change the distribution the model is trained to be biased toward.
One of our outer alignment problems is ensuring that the text prior actually reflects the reality prior if they diverge at some limit. This potentially doesn’t add a lot of alignment tax, however, both for the reasons mentioned in that section, and because the divergence could limit the model’s capabilities as well.
Whatever acceptability predicates we end up using (if any) may also alter the ease of training, especially if they’re difficult to satisfy or require a lot of bandwidth in how we decide to implement the oversight.

In the real world:

If we need to change the model’s prior, this could make it perform poorly with respect to some tasks.
The model’s prior may be malign or complex enough that the prompt engineering required to extract the kind of aligned behaviour we want would be hard to figure out.

Conclusion

I’m particularly excited about this strategy in part because it seems like a lot of the relevant work can be done right now, and we actually have a solid idea of the kind of model we want to end up with. Though this post goes into some detail about various aspects of conditioning generative models for alignment research, it is primarily a broad overview and there remain many ideas and research directions to explore.

I expect that one of the bottlenecks to working on this strategy will be understanding why generative models can be used this way safely and the new framing under which alignment problems fall. To that end, my hope in writing this is that it can provide a base of the core ideas and concerns for people to work off of.

Ultimately, while there are many benefits to this strategy because we only have to really align an oracle AI, there still are many problems that can arise. Success would look like some guarantee (whether argumentative or via some mechanistic predicate) against the four sub-problems comprising outer and inner alignment as described below. There’s still a lot of work to be done, but I’m very excited (in the way one is when thinking about ways to avoid the world ending) to see where this approach will lead.

Some open lines of research

Extracting superhuman levels of performance at complex tasks from GPT-3, especially if its training corpus didn’t contain instances of superhuman performance.
- This could be useful because the methods we use with weaker generative models like GPT-3 could extrapolate well to good methods for extracting the same level of performance in alignment research with stronger models. This would reduce the time between when we get a powerful model that has potential for danger, and when we can get good alignment work out of it.
Replicating different situations involving self-fulfilling prophecies with current generative models to see how they seem to be resolving them.
Comparing the way generative models at different scales and with different architectures resolve self-fulfilling prophecies, to see whether extrapolating current behaviour to future models is reasonable.
How do the internal mechanisms of generation differ between language models at different scales? Are the differences limited to what can be explained by scale or power, or are they qualitatively different?
- This could be useful in evaluating whether strategies like using smaller generative models to initialize larger ones for training would result in those larger models being generative models as well, instead of deceptive optimizers.
- Another thing that could be useful to explore along the same lines: actually using smaller existing language models as initializations for larger ones of the same size as an existing one and comparing the performance of this larger model, and its equivalent existing one.
Observe the changes that happen in the internals of a large language model as you intentionally try to turn it into an agent with RL. This is especially useful if these changes are interpretable in some way that could be used to detect this kind of change in generative models in general.
Designing experiments to explore the kind of changes that occur when an agent tries to create a sub-agent with values identical to its own. As with the above one, this is especially useful if the results are interpretable such that we could use them to detect the occurrence of something like that.
What would a mechanistic incentive for non-agency look like? Are there more specific properties whose presence would have the same desired effect, such as a guarantee against the ability to model consequences over its own actions?

^{^}
If you’re wondering about the possibility of malign simulations (such as an optimizer “god” overseeing the simulation) and are familiar with the ELK report, one way to view this is to think of the proposal of training a sequence of reporters for successively more powerful predictors, where you can use the weights of a simpler predictor to initialize a larger one and keep the difference at each iteration small; simpler predictors finding direct translation easier than human simulators, this could result in direct translation for more complex predictors too. The downside to this solution (if the model uses discrete modes of prediction, then this approach would stunt its ability to update toward those better modes as it becomes larger) has an analogue in generative models of different kinds of world models. As the generative models keep incrementally building their world model from the previous, it’s difficult to update away enough to malign simulations when they have to keep getting good performance on the next example - generative models are myopic.
^{^}
I’m not sure what these other channels would be, but the agent’s scenario is different enough that I don’t want to rule out the possibility of there being other channels it can get some kind of info from.

Thanks!

I suspect that there's a ton of room to get more detailed here, and some of the claims or conclusions you reach in this post feel too tenuous, barring that more detailed work. I will give some feedback that probably contains some mistakes of my own:

Conditioning is hard, and the picture you paint sometimes doesn't distinguish between conditioning on input data (prompting) and conditioning on latent states of the learned model. But the latter is quite tricky, and if it's necessary, then questions about whether or not this whole approach makes the problem easier might depend on details of how it's implemented, and how well it works.
If you condition on a superhuman, superhelpful AI alignment researcher, is it robust or fragile? That is, do most ways of trying the prompt get more or less the same results, or can you use prompts that seem similar a priori to get radically different advice from the AI? This is an important question, and in many crucial ways it's not a technical question, instead it's a question about what you consider "plausible a priori prompts," and how you measure success - what sort of advice you actually accept as coming from a good advisor.
Failure at this previous bullet point will give you failure modes that look less anthropomorphic than you might think. I think it's too hasty to imagine the output as generated by a functional simulation of a human ("picked out" by the prompt) that we can treat as generalizing to new contexts (where "superhuman alignment researcher" is a very new context) in human-like ways. Generalization failures might imply a breakdown of human thought processes as a useful model of what's generating the text.
Two relatively minor nitpicks: Generative models are more often trained self-supervised than just plain supervised (sentence 1 of section 1). Also, I think that self-fulfilling prophecies are more a problem when the use case more closely reflects the training case, but because of the disconnect I expect you'll see different issues, especially in the category of ways for mesa-objectives to leak into the output stream.

Thanks for the feedback!

I agree that there's lots of room for more detail - originally I'd planned for this to be even longer, but it started to get too bloated. Some of the claims I make here unfortunately do lean on some of that shared context yeah, although I'm definitely not ruling out the possibility that I just made mistakes at certain points.

I think when I talk about conditioning in post I'm referring to prompting, unless I'm misunderstanding what you mean by conditioning on latent states for language models (which is entirely possible).
That's a very interesting question, and I think it comes down to the specifics of the model itself. For the most part in this post I'm talking about true generative models (or problems associated while trying to train true generative models) in the sense of models that are powerful enough at modelling the world that they can actually be thought of as depending on the physics prior for most practical purposes. In that theoretical limit, I think it would be robust, if prompts that seem similar to us actually represent similar world states.

For more practical models though (especially when we're trying to get some use out of sooner models), I think our best guess would be extrapolating the robustness of current models. From my (admittedly not very large) experience working with GPT-3, my understanding is that LLMs gets less fragile with scale - in other words, that they depend less on stuff like phrasing and process the prompts more "object-level" in some sense as they get more powerful.

If the problem you're pointing to is generally that the textual distribution fails in ways that the reality prior wouldn't given a sufficiently strong context switch - then I agree that's possible. My guess is that this wouldn't be a very hard problem though, mainly because of reasons I briefly mention in the Problems with Outer Alignment section: that the divergence can't be strong enough to have a qualitative difference or we'd have noticed it in current models, and that future models would have the requisite "parts" to simulate (at least a good) alignment researcher, so it becomes a prompt engineering problem. That said, I think it's still a potential problem whose depths we could understand with more extraction work.
Re the self-supervised comment - oops yeah, that's right. I've edited the post, thanks for the correction. I wrote that line mainly to contrast it with RL and emphasize the "it's learning to model a distribution", so I didn't pay too close attention - I'll try to going forward.
Re the self-fulfilling prophecies comment - could you elaborate on that? I'm afraid I don't fully get your argument.

Re: prompting: So when you talk about "simulating a world," or "describing some property of a world," I interpreted that as conditionalizing on a feature of the AI's latent model of the world, rather than just giving it a prompt like "You are a very smart and human-aligned researcher." This latter deviates from the former in some pretty important ways, which should probably be considered when evaluating the safety of outputs from generative models.

Re: prophecies: I mean that your training procedure doesn't give an AI an incentive to make self-fulfilling prophecies. I think you have a picture where an AI with inner alignment failure might choose outputs that are optimal according to the loss function but lead to bad real-world consequences, and that these outputs would look like self-fulfilling prophecies because that's a way to be accurate while still having degrees of freedom about how to affect the world. I'm saying that the training loss just cares about next-word accuracy, not long term accuracy according to the latent model of the world, and so AI with inner alignment failure might choose outputs that are highly probable according to next word accuracy but lead to bad real-world consequences, and that these outputs would not look like self-fulfilling prophecies.

Sorry for the (very) late reply!

I'm not very familiar with the phrasing of that kind of conditioning - are you describing finetuning, with the divide mentioned here? If so, I have a comment there about why I think it might not really be qualitatively different.

I think my picture is slightly different for how self-fulfilling prophecies could occur. For one, I'm not using "inner alignment failure" here to refer to a mesa-optimizer in the traditional sense of the AI trying to achieve optimal loss (I agree that in that case it'd probably be the outcome you describe), but to a case where it's still just a generative model, but needs some way to resolve the problem of predicting in recursive cases (for example, asking GPT to predict whether the price of a stock would rise or fall). Even for just predicting the next token with high accuracy, it'd need to solve this problem at some point. My prediction is that it's more likely for it to just model this via modelling increasingly low-fidelity versions of itself in a stack, but it's also possible for it do fixed-point reasoning (like in the Predict-O-Matic story).

Fascinating work, thanks for this post.

Using smaller generative models as initializations for larger ones.

(The equivalent ELK proposal goes into this strategy in more detail).

Do you have a link to the ELK proposal you're referring to here? (I tried googling for "ELK" along with the bolded text above but nothing relevant seemed to come up.)

An acceptability predicate for myopia.

Do you have thoughts on how to achieve this predicate? I've written some about interpretability-based myopia verification which I think could be the key.

I think [non-myopic non-optimizer is a coherent concept] - as a simple example we could imagine GPT trained for its performance over the next few timesteps. Realistically this would result in a mesa-optimizer, but in theory it could just run a very expensive version of next-token generation, over the much larger space of multiple tokens.

"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.

(This is a nitpick and I also don't mean to trivialize the inner alignment problem which I am quite worried about! But I did want to make sure I'm not missing anything here and that I'm broadly on the same page as other reasonable folks about expectations/evidence for mesa-optimizers.)

An acceptability predicate for non-agency.
[...]
If a model becomes agentic briefly, it could encode into its world model a deceptive super-intelligence that has its objective, before SGD guides it back into the safe zone.

That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.

This seems like a very broad predicate, however. What would it actually look like?

I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, than that might suffice for this.

I'm excited, I've explored before how having interpretability that can both reliably detect mesa-optimizers and read-off its goals would have the potential to solve alignment. But I hadn't considered how reliable mesa-optimization detection alone might be enough, because I wasn't considering generative models in that post. (Even if I had, I wasn't yet aware of some of the clever and powerful ways that generative models could be used for alignment that you describe in this post.)

How would we mechanistically incentivize something like non-agency?

I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.

Sorry for the (very) late reply!

Do you have a link to the ELK proposal you're referring to here?

Yep, here. I linked to it in a footnote, didn't want redundancy in links, but probably should have anyway.

"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.

Hmm, I was thinking of that under the frame of the future point where we'd worry about mesa-optimizers, I think. In that situation, I think mesa-optimizers would be more likely than not because the task is much harder to achieve good performance on (although on further thought I'm holding less strongly to this belief because of ambiguity around distance in model space between optimizers and generative models). I agree that trying to do this right now would probably just result in bad performance.

That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.

I agree that we'll need a strong constant gradient to prevent this (and other things), but while I think this is definitely something to fix, I'm not very worried about this possibility. Both because the model would have to be simultaneously deceptive in the brief period it's agentic, and because this might not be a very good avenue of attack - it might be very hard to do this in a few timesteps, the world model might forget this, and simulations may operate in a way that only really "agentifies" whatever is being directly observed / amplified.

I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, than that might suffice for this.

I agree almost entirely - I was mainly trying to break down the exact capabilities we'd need the interpretability tools to have there. What would detecting mesa-optimizers entail mechanistically, etc.

But I hadn't considered how reliable mesa-optimization alone might be enough, because I wasn't considering generative models in that post.

I think this is very promising as a strategy yeah, especially because of the tilt against optimization by default - I think my main worries are getting it to work before RL reaches AGI-level.

I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.

I think they have a strong bias (in a very conceptual sense) against something like agency, but larger models could end up being optimizers because that achieves greater performance past a certain scale like you said, because of different training paths - or even if it's just pretty easy to make one an optimizer if you push it hard enough (with RL or something), that could still reduce the time we have.

Is the loss we’re training the generative model on - in the case of language models, the predictive loss over the next token - actually representative of the world prior?

This seems important and is not a thing I've thought about carefully, so thanks for bringing it up and exploring it. I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.

If they can detect when they’re in deployment, then they could act in malign ways.

The more I've thought about this one the more I'm not worried about this precise danger.

It would be very strange for a predictive model attempting to draw plausible trajectories through time to simulate trajectories in which agents notice inconsistencies and decide that they're in simulations. Agents can still conclude they're in simulations, but it would be weird for this to be because they noticed inconsistencies in their worlds, because the agents and world are being constructed together as part of a predictive task. Predicting that the agent notices an inconsistency requires the generative model to know that there's an inconsistency, at which point the better solution (from a 'drawing likely trajectories' perspective) is to just make the world consistent.

That said, there are very closely related dangers that I am worried about. For instance there can be agents that act as if they're in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world). This feels like a thing we can make less likely with appropriate prompting, which makes me hope that it may not be too big a problem in practice, but (barring powerful interpretability tools) I don't think we can rule it out.

The way these models deal with self-fulfilling prophecies.

I'm currently pretty worried about this, so was happy to see you thinking about it.

Sorry for the (very) late reply!

I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.

Hmm, I was thinking more of a problem with text available in the training datasets not being representative of the real world we live in (either because it isn't enough information to pick out our world from a universal prior, or because it actually describes a different world better), not whether its capabilities or abstractive reasoning don't help with time-separated prediction.

Predicting that the agent notices an inconsistency requires the generative model to know that there's an inconsistency, at which point the better solution (from a 'drawing likely trajectories' perspective) is to just make the world consistent.

I think I'm picturing different reasons for a simulacra agent to conclude that they're in a simulation than noticing inconsistencies. Some specifics include worlds that are just unlikely enough anthropically (because of a conditional we apply, for example) to push up credence in a simulation hypothesis, or they notice the effects of gradient descent (behavioural characteristics of the world deviating from "normal" behaviour tend to affect the world state), or other channels that may be available by some quirk of the simulation / training process, but I'm not holding to any particular one very strongly. All of which to say that I agree it'd be weird for them to notice inconsistencies like that.

For instance there can be agents that act as if they're in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world).

Yep, I think this could be a problem, although recent thinking has updated me slightly away from non-observed parts of the simulation having consistent agentic behaviour across time.

Thanks!

Conditioning is hard, and the picture you paint sometimes doesn't distinguish between conditioning on input data (prompting) and conditioning on latent states of the learned model. But the latter is quite tricky, and if it's necessary, then questions about whether or not this whole approach makes the problem easier might depend on details of how it's implemented, and how well it works.
If you condition on a superhuman, superhelpful AI alignment researcher, is it robust or fragile? That is, do most ways of trying the prompt get more or less the same results, or can you use prompts that seem similar a priori to get radically different advice from the AI? This is an important question, and in many crucial ways it's not a technical question, instead it's a question about what you consider "plausible a priori prompts," and how you measure success - what sort of advice you actually accept as coming from a good advisor.
Failure at this previous bullet point will give you failure modes that look less anthropomorphic than you might think. I think it's too hasty to imagine the output as generated by a functional simulation of a human ("picked out" by the prompt) that we can treat as generalizing to new contexts (where "superhuman alignment researcher" is a very new context) in human-like ways. Generalization failures might imply a breakdown of human thought processes as a useful model of what's generating the text.
Two relatively minor nitpicks: Generative models are more often trained self-supervised than just plain supervised (sentence 1 of section 1). Also, I think that self-fulfilling prophecies are more a problem when the use case more closely reflects the training case, but because of the disconnect I expect you'll see different issues, especially in the category of ways for mesa-objectives to leak into the output stream.

Thanks for the feedback!

I think when I talk about conditioning in post I'm referring to prompting, unless I'm misunderstanding what you mean by conditioning on latent states for language models (which is entirely possible).
That's a very interesting question, and I think it comes down to the specifics of the model itself. For the most part in this post I'm talking about true generative models (or problems associated while trying to train true generative models) in the sense of models that are powerful enough at modelling the world that they can actually be thought of as depending on the physics prior for most practical purposes. In that theoretical limit, I think it would be robust, if prompts that seem similar to us actually represent similar world states.

For more practical models though (especially when we're trying to get some use out of sooner models), I think our best guess would be extrapolating the robustness of current models. From my (admittedly not very large) experience working with GPT-3, my understanding is that LLMs gets less fragile with scale - in other words, that they depend less on stuff like phrasing and process the prompts more "object-level" in some sense as they get more powerful.

If the problem you're pointing to is generally that the textual distribution fails in ways that the reality prior wouldn't given a sufficiently strong context switch - then I agree that's possible. My guess is that this wouldn't be a very hard problem though, mainly because of reasons I briefly mention in the Problems with Outer Alignment section: that the divergence can't be strong enough to have a qualitative difference or we'd have noticed it in current models, and that future models would have the requisite "parts" to simulate (at least a good) alignment researcher, so it becomes a prompt engineering problem. That said, I think it's still a potential problem whose depths we could understand with more extraction work.
Re the self-supervised comment - oops yeah, that's right. I've edited the post, thanks for the correction. I wrote that line mainly to contrast it with RL and emphasize the "it's learning to model a distribution", so I didn't pay too close attention - I'll try to going forward.
Re the self-fulfilling prophecies comment - could you elaborate on that? I'm afraid I don't fully get your argument.

Sorry for the (very) late reply!

Fascinating work, thanks for this post.

Using smaller generative models as initializations for larger ones.

(The equivalent ELK proposal goes into this strategy in more detail).

Do you have a link to the ELK proposal you're referring to here? (I tried googling for "ELK" along with the bolded text above but nothing relevant seemed to come up.)

An acceptability predicate for myopia.

Do you have thoughts on how to achieve this predicate? I've written some about interpretability-based myopia verification which I think could be the key.

I think [non-myopic non-optimizer is a coherent concept] - as a simple example we could imagine GPT trained for its performance over the next few timesteps. Realistically this would result in a mesa-optimizer, but in theory it could just run a very expensive version of next-token generation, over the much larger space of multiple tokens.

An acceptability predicate for non-agency.
[...]
If a model becomes agentic briefly, it could encode into its world model a deceptive super-intelligence that has its objective, before SGD guides it back into the safe zone.

That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.

This seems like a very broad predicate, however. What would it actually look like?

I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, than that might suffice for this.

How would we mechanistically incentivize something like non-agency?

Sorry for the (very) late reply!

Do you have a link to the ELK proposal you're referring to here?

Yep, here. I linked to it in a footnote, didn't want redundancy in links, but probably should have anyway.

"Realistically this would result in a mesa-optimizer" seems like an overly confident statement? It might result in a mesa-optimizer, but unless I've missed something then most of our expectation of emergent mesa-optimizers is theoretical at this point.

That is an alarming possibility. It might require continuous or near-continuous verification of non-agency during training.

I think if we could advance our interpretability tools and knowledge to where we could reliably detect mesa-optimizers, than that might suffice for this.

But I hadn't considered how reliable mesa-optimization alone might be enough, because I wasn't considering generative models in that post.

I think this is very promising as a strategy yeah, especially because of the tilt against optimization by default - I think my main worries are getting it to work before RL reaches AGI-level.

I guess one of the open questions is whether generative models inherently incentivize non-agency. LLMs have achieved impressive scale without seeming to produce anything like agency. So there is some hope here. On the other hand, they are quite a ways from being complete high-fidelity world simulators, so there is a risk of emergent agency becoming natural for some reason at some point along the path to that kind of massive scale.

Is the loss we’re training the generative model on - in the case of language models, the predictive loss over the next token - actually representative of the world prior?

If they can detect when they’re in deployment, then they could act in malign ways.

The more I've thought about this one the more I'm not worried about this precise danger.

The way these models deal with self-fulfilling prophecies.

I'm currently pretty worried about this, so was happy to see you thinking about it.

Sorry for the (very) late reply!

I think (to the extent there is a problem) the problem is alleviated by training on "predict tomorrow's headline given today's" and related tasks (e.g. "predict the next frame of video from the last"). That forces the model to engage more directly with the relationship between events separated in time by known amounts.

Predicting that the agent notices an inconsistency requires the generative model to know that there's an inconsistency, at which point the better solution (from a 'drawing likely trajectories' perspective) is to just make the world consistent.

For instance there can be agents that act as if they're in a simulation for purposes of acausal trade (e.g. they play along until a distant future date before defecting, in the hopes of being instantiated in our world).

Yep, I think this could be a problem, although recent thinking has updated me slightly away from non-observed parts of the simulation having consistent agentic behaviour across time.

60

Conditioning Generative Models for Alignment

60

Ω 28

How do generative models work?

Why is this important?

Outer Alignment

What does Outer Alignment mean in this context?

Problems with Outer Alignment

Inner Alignment

What does Inner Alignment mean here?

Problems with Inner Alignment

Self-fulfilling prophecies

Potential strategies against deceptive optimizers

Competitiveness

Conclusion

Some open lines of research

60

Ω 28

60

Ω 28