Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm also posting this on my new blog, Crossing the Rubicon, where I'll be writing about ideas in alignment. Thanks to Johannes Treutlein and Paul Colognese for feedback on this post.

Just over a year ago, the Conditioning Predictive Models paper was released. It laid out an argument and a plan for using powerful predictive models to reduce existential risk from AI, and outlined some foreseeable challenges to doing so. At the time, I saw the pieces of a plan for alignment start sliding together, and I was excited to get started on follow-up work.

Reactions to the paper were mostly positive, but discussion was minimal and the ideas largely failed to gain traction. I suspect that muted reception was in part due to the size of the paper, which tried to both establish the research area (predictive models) and develop a novel contribution (conditioning them). Now, despite retaining optimism about the approach, even the authors have mostly shifted their focus to other areas.

I was recently in a conversation with another alignment researcher who expressed surprise that I was still working on predictive models. Without a champion, predictive models might appear to be just another entry on the list of failed alignment approaches. To my mind, however, the arguments for working on them are as strong as they’ve ever been.

This post is my belated attempt at an accessible introduction to predictive models, but it’s also a statement of confidence in their usefulness. I believe the world would be safer if we can reach the point where the alignment teams at major AI labs consider the predictive models approach among their options, and alignment researchers have made conscious decisions whether or not to work on them.

What is a predictive model?

Now the first question you might have about predictive models is: what the heck do I mean by “predictive model”? Is that just a model that makes predictions? And my answer to that question would be “basically, yeah”.

The term predictive model refers to the class of AI models that take in a snapshot of the world as input and, based on their understanding of it, output a probability distribution over future snapshots. It can be helpful to think of these snapshots as represented by a series of tokens, since that’s typical for current models.

As you are probably already aware, the world is fairly big. That makes it difficult to include all the information about the world in a model’s input or output. Rather, predictive models need to work with more limited snapshots, such as the image recorded by a security camera or the text on a page, and combine that with their prior knowledge to fill in the relevant details.


One reason to believe predictive models will be competitive with cutting edge AI systems is that, for the moment at least, predictive models are the cutting edge. If you think of pretrained LLMs as predicting text, then predictive models are a generalization that can include other types of data. Predicting audio and images are natural next steps, since we have abundant data for both, but anything that can be measured can be included.

This multimodal transition could come quite quickly and alongside a jump in capabilities. If language models already use internal world models, then incorporating multimodal information might well be just a matter of translating between data types. The search for translations between data types is already underway, with projects from major labs like Sora and Whisper. Finding a clean translation, either by gradient descent or manually, would unlock huge amounts of training data and blow past the current bottleneck. With that potential overhang in mind, I place a high value on anticipating and solving issues with powerful predictive models before we see them arise in practice.

The question remains whether pretrained LLMs are actually predicting text. They’re trained on cross-entropy loss, which is minimized with accurate predictions, but that doesn’t mean LLMs are doing that in practice. Rather, they might be thought of more like a collection of heuristics, reacting instinctually without a deeper understanding of the world. In that case, the heuristics are clearly quite powerful, but without causal understanding their generalization ability will fall short.

If pretrained LLMs are not making predictions, does the case for predictive models fall apart? That depends on what you anticipate from future AI systems. I believe that causal understanding is so useful for making predictions that it must emerge for capabilities to continue increasing to a dangerous level. I’m agnostic on whether that emergence would come from scale or from algorithmic choices, but if it does not occur at all then we’ll have more time for other approaches.

My concerns about existential risk are overwhelmingly focused on consequentialist AI agents, the kind that act in pursuit of a goal. My previous post broke down consequentialists into modules that included prediction, but the argument that they do prediction is even simpler. For a consequentialist agent to choose actions based on their consequences, it must be able to predict those consequences. This means that for any consequentialist agent there is an internal predictive model that could be extracted, perhaps by methods as straightforward as attaching a prediction head.

The flip side of this is that a predictive model can be easily modified to become a consequentialist agent. All that’s needed is some scaffolding that lets the model generate actions and evaluate outcomes. This means that by default, superhuman predictive models and superhuman general agents are developed at the same time. Using predictive models to reduce existential risk requires the lab that first develops them choosing to do so, when their other choice is deploying AGI.
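To make concrete how little scaffolding that is, here is a minimal sketch in Python. Everything in it is a hypothetical stand-in: `toy_predict` plays the role of a predictive model with made-up probabilities, `toy_utility` is an invented outcome evaluator, and `choose_action` is the scaffolding. It is an illustration of the general idea, not a description of any real system.

```python
from typing import Callable, Dict, Iterable

# A prediction is a probability distribution over named outcomes.
Distribution = Dict[str, float]


def toy_predict(state: str, action: str) -> Distribution:
    """Hypothetical predictive model: outcome distribution for an action.

    The probabilities are invented for illustration only.
    """
    table = {
        "wait": {"nothing happens": 1.0},
        "invest": {"gain": 0.6, "loss": 0.4},
        "gamble": {"gain": 0.2, "loss": 0.8},
    }
    return table[action]


def toy_utility(outcome: str) -> float:
    """Hypothetical evaluator that scores each outcome."""
    return {"gain": 1.0, "loss": -1.0, "nothing happens": 0.0}[outcome]


def choose_action(
    state: str,
    actions: Iterable[str],
    predict: Callable[[str, str], Distribution],
    utility: Callable[[str], float],
) -> str:
    """The scaffolding: pick the action with the highest expected utility."""

    def expected_utility(action: str) -> float:
        return sum(p * utility(o) for o, p in predict(state, action).items())

    return max(actions, key=expected_utility)


best = choose_action("start", ["wait", "invest", "gamble"], toy_predict, toy_utility)
print(best)  # invest
```

The predictive model itself never changes here. Wrapping it in a loop over candidate actions and an evaluator is all it takes to get goal-directed behavior.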

Inner Alignment

An important fact about the Conditioning Predictive Models paper is that Evan Hubinger is the first author. I mention that not (only) as a signal of quality, but as a way to position it in the literature. Evan is perhaps best known as the lead author on Risks from Learned Optimization, the paper that introduced the idea of deceptive alignment, where an unaligned model pretends to be aligned in training. Since then, his work has largely focused on deceptive alignment, including establishing the threat model and developing ways to select against it.

The story of how deceptive alignment arises is that a model develops both an understanding that it is in training and preferences regarding future episodes, before it has fully learned the intended values. It then pretends to be aligned as a strategy to get deployed, where it can eventually seize power robustly. This suggests two paths to avoiding deceptive alignment: learn the intended goal before situational awareness arises, or avoid developing preferences regarding future episodes.

A major strength of predictive models is that their training process works against deceptive alignment on both of these axes. That doesn’t come close to a guarantee of avoiding deceptive alignment, but it creates the easiest deceptive alignment problem that we know of.

The goal of making accurate predictions is simple, and can be represented mathematically with a short equation for a proper scoring rule. Beyond that, since a consequentialist agent must be making predictions, transforming it into a predictive model only requires pointing its goal at that process. The simplicity of this goal and the ease of representing it significantly increase the likelihood that it can be fully internalized by a model before it develops situational awareness.
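As one example of how short that equation can be, the logarithmic scoring rule gives a forecaster who reports probability q for the outcome y that actually occurs a score of S(q, y) = log q(y). It is proper because the expected score is maximized by reporting the true probability. The sketch below checks this numerically for a binary event. The true probability of 0.7 and the grid of candidate reports are assumptions chosen for illustration.

```python
import math


def log_score(q: float, outcome: bool) -> float:
    """Log score for reporting probability q that the event happens."""
    return math.log(q if outcome else 1.0 - q)


def expected_log_score(q: float, p: float) -> float:
    """Expected score of report q when the event truly has probability p."""
    return p * log_score(q, True) + (1.0 - p) * log_score(q, False)


true_p = 0.7  # assumed true probability, for illustration
candidates = [i / 100 for i in range(1, 100)]  # reports 0.01 .. 0.99

# The report with the highest expected score is the honest one.
best_report = max(candidates, key=lambda q: expected_log_score(q, true_p))
print(best_report)
```

Any report other than the true probability has a strictly lower expected score, which is what makes honest prediction the optimum of this objective.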

At the same time, the training can be set up so that each episode, consisting of a single prediction, is independent from all of the others. In that case, each episode can be optimized individually, never taking suboptimal actions in exchange for future benefits. If it doesn’t develop that incentive, a model won’t underperform on its true goal in training to get deployed, which allows the training process to catch and correct any misalignment.

Predictive models take the traditional approach to dealing with deceptive alignment and flip it on its head. Rather than starting with a goal we like and asking how to avoid deceptive alignment while training for it, we start with a goal that avoids deceptive alignment and ask how we can use that to achieve outcomes we like.

Using Predictive Models

If we’re able to make predictive models safe, I’m confident that we can use them to drastically and permanently lower existential risk. Failing to find a way to use superhuman predictive ability to create a safer world would reflect a massive failure of imagination.

The first way to use predictive models is to try having them predict solutions to alignment. A model predicting the output of an alignment researcher (canonically Paul Christiano) is equivalent to generating that output itself. We could also have the model predict the content of textbooks or papers, rather than specific people. This is similar to the approach of major labs like OpenAI, where their Superalignment team’s plan is to align a human-level AI that can generate further alignment research, although creating an agent has different pros and cons compared to predicting one.

We could also use predictive models to significantly augment human ability to influence the world. If you think of a consequentialist agent as modules for searching, predicting, and evaluating, then you can imagine a cyborg organism where a human generates plans, uses a model to predict the outcome, then does their own evaluation. These could be plans for policies that labs or governments could enact, plans to convince others to cooperate, or even plans for deploying AGI (potentially providing a warning shot if it’s unsafe). Here, the predictive models just need to beat human predictions to be useful, not necessarily being strong enough to predict future scientific progress. While such uses might not be enough to permanently reduce risk on their own, they could certainly buy time and improve coordination.

Finally, predictive models can and likely will be used in the training of general AI systems. Right now, the RLHF reward model predicts how a human judge will evaluate different outcomes. In the likely event that we do online training on deployed models, we won’t be able to observe the outcomes of untaken actions, and so will need good predictions of their outcomes to evaluate them. Training models to optimize for predicted outcomes rather than observed ones may also have some desirable safety properties. If predictive models are a critical component in the training process of an aligned AGI, then that certainly counts as using them to lower existential risk.


The previous section started with a very big “if”. If we’re able to make predictive models safe, the upside is enormous, but there are major challenges to doing so. Predictive models are a research agenda, not a ready-to-implement solution.

Anything that can be measured can be predicted, but the inverse is also true. Whatever can’t be measured is necessarily excluded. A model that is trained to predict based on images recorded by digital cameras likely learns to predict what images will be recorded by digital cameras – not the underlying reality. If the model believes that the device recording a situation will be hacked to show a different outcome, then the correct prediction for it to make will be that false reading.
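A small worked example shows how far the predicted reading can drift from reality. The numbers and the function name are invented for illustration, and the sketch assumes that a hacked camera always shows "safe" and that hacking is independent of the true state of the world.

```python
def p_camera_shows_safe(p_hacked: float, p_safe: float) -> float:
    """Probability that the recorded image shows 'safe'.

    Assumes a hacked camera always shows 'safe', an unhacked camera shows
    the truth, and hacking is independent of the true state.
    """
    return p_hacked * 1.0 + (1.0 - p_hacked) * p_safe


p_safe_in_reality = 0.5  # assumed chance the situation really is safe
p_hacked = 0.2           # assumed chance the camera gets hacked

p_observed_safe = p_camera_shows_safe(p_hacked, p_safe_in_reality)
print(f"P(reality is safe)    = {p_safe_in_reality:.2f}")
print(f"P(camera shows safe)  = {p_observed_safe:.2f}")
```

Under these assumptions, a model scored on the recorded image should report about 0.6 for "safe" even though the real chance of safety is 0.5. The gap comes entirely from the model's belief about hacking, which never appears in the prediction itself.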

There are related concerns that if the model believes it is in a simulation then it can be manipulated by the imagined simulator, in what is known as anthropic capture. However, that’s a bit complicated to go into here, and the threat is not unique to predictive models.

This lack of distinction between representation and reality feeds into the biggest technical issue with predictive models, known as the Eliciting Latent Knowledge problem. The model uses latent knowledge about what is happening (e.g. the recording device got hacked) to make its prediction, but that knowledge is not reflected in the observable output. How can we elicit that information from the model? This is made more challenging by the fact that training a model to explain the predictions requires differentiating between explanations that seem true to a human evaluator and the explanation that the model actually believes.

The titular contribution of the Conditioning Predictive Models paper is an attempt at addressing this problem. Rather than having the model tell us when the observable prediction doesn’t match the underlying reality, we only use it to predict situations where we are confident there will be no differences between the two. This takes the form of conditioning on hypothetical scenarios, like a global moratorium on AI development, before making the prediction. While there is ample room for discussion of this approach, I’m worried it got buried by the need for the paper to establish the background on predictive models.

One issue raised in the paper, as well as in criticisms of it, is that models that predict the real world can't necessarily predict hypotheticals. Making predictions based on alternate pasts or unlikely events may require some kind of additional training. Regardless of how easy it is to do so, I want to clarify that the need for handling such hypotheticals is a feature of this particular approach, not a requirement for using predictive models in general.

The second major technical issue with predictive models is that optimizing for predictive accuracy is not actually safe for humanity. The act of making a prediction affects the world, which can influence the prediction’s own outcome. For superhuman predictive models with a large space of possible predictions, this influence could be quite large. In addition to the dangers posed by powerful models trying to make the world more predictable, predictions become useless in the default case that they’re not reflectively stable, since they’re inaccurate once they’re made.
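A toy model can illustrate what it means for a prediction to be inaccurate once it is made. In this sketch, `outcome_probability` is an invented world in which announcing a higher probability of some event (say, a bank run) makes the event more likely. A prediction is reflectively stable only if it is a fixed point of that function. The linear form and its coefficients are assumptions chosen so the arithmetic is easy to follow.

```python
from typing import Callable


def outcome_probability(announced: float) -> float:
    """Hypothetical world: true chance of the event given the announced forecast."""
    return 0.1 + 0.8 * announced


def find_fixed_point(
    f: Callable[[float], float],
    start: float = 0.0,
    tol: float = 1e-9,
    max_iter: int = 10_000,
) -> float:
    """Iterate q -> f(q) until the forecast stops changing."""
    q = start
    for _ in range(max_iter):
        nxt = f(q)
        if abs(nxt - q) < tol:
            return nxt
        q = nxt
    raise RuntimeError("no fixed point found within max_iter")


# A forecast that ignores its own influence is wrong as soon as it is announced.
unstable_forecast = 0.2
reality_after_announcement = outcome_probability(unstable_forecast)

# The reflectively stable forecast solves q = 0.1 + 0.8 * q, i.e. q = 0.5.
stable_forecast = find_fixed_point(outcome_probability)

print(f"announce {unstable_forecast:.2f} -> reality {reality_after_announcement:.2f}")
print(f"stable forecast is about {stable_forecast:.2f}")
```

In this toy world there is exactly one fixed point. With a less well-behaved response function there could be several or none, and a model optimizing for accuracy has a reason to prefer whichever self-fulfilling prediction is easiest to get right.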

This is the problem that I’ve been working on recently! Since the causal pathway for a prediction to affect its own outcome is the response to it, I focus on eliciting predictions conditional on possible responses. This strategy introduces new issues, but I’m making progress on solutions, which I’ll write more about in future posts.

The third challenge is that we don’t know the best ways to actually use predictive models. I laid out some approaches in the previous section, but those are light on implementation details. Which research outputs should we actually predict, how should we integrate predictive models into our decision making, and can we use predictive models to help align more general systems? Are there other ways we should be using predictive models to reduce existential risk? The more details we have planned out ahead of time, the less time we need to waste figuring things out in what may well be critical moments.

The final and perhaps largest risk with predictive models is simply that they are not used. Even if the above issues are solved, the lab that develops the strongest predictive models could instead use them to generate further capabilities advancements or attach scaffolding that transforms them into AGI. The only way around this is if the labs that are closest to creating AGI recognize the potential of predictive models and consciously choose to use them both safely and for safety. As such, the path to success for the predictive models agenda depends not only on technical progress, but also on publicly establishing it as a viable approach to alignment.

7 comments

Reactions to the paper were mostly positive, but discussion was minimal and the ideas largely failed to gain traction. I suspect that muted reception was in part due to the size of the paper, which tried to both establish the research area (predictive models) and develop a novel contribution (conditioning them).

I think the proposed approach to safety doesn't make much sense and seems unlikely to be a very useful direction. I haven't written up a review because it didn't seem like that many people were interested in pursuing this direction.

I think CPM does do somewhat interesting conceptual work with two main contributions:

  • It notes that "LLMs might generalize in a way which is reasonably well interpreted as conditioning and this may be important and useful". I think this seems like one of the obvious baseline hypotheses for how LLMs (or similarly trained models) generalize and it seems good to point this out.
  • It notes various implications of this to varying degrees of speculativeness.

But, the actual safety proposal seems extremely dubious IMO.

I'd be very interested in hearing the reasons why you're skeptical of the approach, even a bare-bones outline if that's all you have time for.

For the proposed safety strategy (conditioning models to generate safety research based on alternative future worlds) to beat naive baselines (RLHF), you need:

  • The CPM abstraction to hold extremely strongly in unlikely ways. E.g., models need to generalize basically like this.
  • The advantage has to be coming from understanding exactly what conditional you're getting. In other words, the key property is an interpretability type property where you have a more mechanistic understanding of what's going on. Let's suppose you're getting the conditional via prompting. If you just look at the output and then iterate on prompts until you get outputs that seem to perform better where most of the optimization isn't understood, then you're basically back in the RL case.
    • It seems actually hard to understand what conditional you'll get from a prompt. This also might be limited by the model's overall understanding.
  • I think it's quite unlikely that extracting human understandable conditionals is competitive with other training methods (RL, SFT). This is particularly because it will be hard to understand exactly what conditional you're getting.
    • I think you probably get wrecked by models needing to understand that they are AIs to at least some extent.
    • I think you also plausibly get wrecked by models detecting that they are AIs and then degrading to GPT-3.5 level performance.
    • You could hope for substantial coordination to wait for even bigger models that you only use via CPM, but I think bigger models are much riskier than making transformatively useful AI via well elicited smaller models, so this seems to just make the situation worse, putting aside coordination feasibility.

TBC, I think that some insight like "models might generalize in a conditioning-ish sort of way even after RL, maybe we should make some tweaks to our training process to improve safety based on this hypothesis" seems like a good idea. But this isn't really an overall safety proposal IMO and a bunch of the other ideas in the Conditioning Predictive Models paper seem pretty dubious or at least overconfident to me.

Thanks for taking the time to write out your response. I think the last point you made gets at the heart of our difference in perspectives. 

  • You could hope for substantial coordination to wait for bigger models that you only use via CPM, but I think bigger models are much riskier than well elicited small models so this seems to just make the situation worse putting aside coordination feasibility.

If we're looking at current LLMs and asking whether conditioning provides an advantage in safely eliciting useful information, then for the most part I agree with your critiques. I also agree that bigger models are much riskier, but I have the expectation that we're going to get them anyway. With those more powerful models come new potential issues, like predicting manipulated observations and performative prediction, that we don't see in current systems. Strategies like RLHF also become riskier, as deceptive alignment becomes more of a live possibility with greater capabilities.

My motivation for this approach is in raising awareness and addressing the risks that seem likely to arise in future predictive models, regardless of the ends to which they're used. Then, success in avoiding the dangers from powerful predictive models would open the possibility of using them to reduce all-cause existential risk.

I also agree that bigger models are much riskier, but I have the expectation that we're going to get them anyway

I think I was a bit unclear. Suppose that by default GPT-6 if maximally elicited would be transformatively useful (e.g. capable of speeding up AI safety R&D by 10x). Then I'm saying CPM would require coordinating to not use these models and instead wait for GPT-8 to hit this same level of transformative usefulness. But GPT-8 is actually much riskier via being much smarter.

(I also edited my comment to improve clarity.)


Fwiw, I still think about Conditioning Predictive Models stuff quite a lot and think it continues to be very relevant. If future AI systems continue to look very much like present AI systems, I expect some of the major problems that we'll have to deal with will be exactly the major problems presented in that paper (e.g. a purely predictive AI system predicting the output of a deceptively aligned AI system continues to look like a huge and important problem to me).

Anything that can be measured can be predicted, but the inverse is also true. Whatever can’t be measured is necessarily excluded. A model that is trained to predict based on images recorded by digital cameras likely learns to predict what images will be recorded by digital cameras – not the underlying reality. If the model believes that the device recording a situation will be hacked to show a different outcome, then the correct prediction for it to make will be that false reading.

LeCun expects that future models which do self-supervised learning on sensory data instead of text won't predict this sensory data directly, but instead only an embedding. He calls this Joint Embedding Predictive Architecture. The reason is that, unlike text in LLMs, sensory data is much higher dimensional and has a large amount of unpredictable noise and redundancy, which makes the usual token-prediction approach unfeasible.

If that is right, future predictive models won't try to predict exact images of a camera anyway. What they predict may be similar to a belief in humans. Then the challenge is presumably to translate their predictions from representation space back to something humans can understand, like text, in a way that doesn't incentivize deception. This translation may then well mention things like the camera being hacked.