Victoria Krakovna. Research scientist at DeepMind working on AI safety, and cofounder of the Future of Life Institute. Website and blog:


DeepMind Alignment Team on Threat Models




It's great to hear that you have updated away from ambitious value learning towards corrigibility-like targets. It sounds like you now find it plausible that corrigibility will be a natural concept in the AI's ontology, despite it being incompatible with expected utility maximization. Does this mean that you expect we will be able to build advanced AI that doesn't become an expected utility maximizer?

I'm also curious how optimistic you are about the interpretability field being able to solve the empirical side of the abstraction problem in the next 5-10 years. Current interpretability work is focused on low-level abstractions (e.g. identifying how a model represents basic facts about the world) and extending the current approaches to higher-level abstractions seems hard. Do you think the current interpretability approaches will basically get us there or will we need qualitatively different methods? 


I would consider goal generalization a component of goal preservation, and I agree this is a significant challenge for this plan. If the model is sufficiently aligned to the goal of being helpful to humans, then I would expect it to want feedback on how to generalize the goals correctly when it encounters ontological shifts.

Too bad that my list of AI safety resources didn't make it into the survey - would be good to know to what extent it would be useful to keep maintaining it. Will you be running future iterations of this survey? 


I agree that a sudden gain in capabilities can make a simulated agent undergo a sharp left turn (coming up with more effective takeover plans is a great example). My original question was about whether the simulator itself could undergo a sharp left turn. My current understanding is that a pure simulator would not become misaligned if its capabilities suddenly increase because it remains myopic, so we only have to worry about a sharp left turn for simulated agents rather than the simulator itself. Of course, in practice, language models are often fine-tuned with RL, which creates agentic incentives on the simulator level as well. 

You make a good point about the difficulty of identifying dangerous models if the danger is triggered by very specific prompts. I think this may go both ways though, by making it difficult for a simulated agent to execute a chain of dangerous behaviors, which could be interrupted by certain inputs from the user. 


I would say the primary disagreement is epistemic - I think most of us would assign a low probability to a pivotal act defined as "a discrete action by a small group of people that flips the gameboard" being necessary. We also disagree on a normative level with the pivotal act framing, e.g. for reasons described in Critch's post on this topic. 


Thanks Richard for this post, it was very helpful to read! Some quick comments:

  • I like the level of technical detail in this threat model, especially the definition of goals and what it means to pursue goals in ML systems
  • The architectural assumptions (e.g. the prediction & action heads) don't seem load-bearing for any of the claims in the post, as they are never mentioned after they are introduced. It might be good to clarify that this is an example architecture and the claims apply more broadly.
  • Phase 1 and 2 seem to map to outer and inner alignment respectively. 
  • Supposing there is no misspecification in phase 1, do the problems in phase 2 still occur? The post "How likely is deceptive alignment?" seems to argue that they may not occur, since a model that has perfect proxies when it becomes situationally aware would not then become deceptively aligned.
  • I'm confused why mechanistic interpretability is listed under phase 3 in the research directions - surely it would make the most difference for detecting the emergence of situational awareness and deceptive alignment in phase 2, while in phase 3 the deceptively aligned model will get around the interpretability techniques. 

Thank you for the insightful post. What do you think are the implications of the simulator framing for alignment threat models? You claim that a simulator does not exhibit instrumental convergence, which seems to imply that the simulator would not seek power or undergo a sharp left turn. The simulated agents could exhibit power-seeking behavior or rapidly generalizing capabilities or try to break out of the simulation, but this seems less concerning than the top-level model having these properties, and we might develop alignment techniques specifically targeted at simulated agents. For example, a simulated agent might need some level of persistence within the simulation to execute these behaviors, and we may be able to influence the simulator to generate less persistent agents. 


I would expect that the way Ought (or any other alignment team) influences the AGI-building org is by influencing the alignment team within that org, which would in turn try to influence the leadership of the org. I think the latter step in this chain is the bottleneck - cross-organization influence between alignment teams is easier than within-organization influence on leadership. So if we estimate that Ought can influence other alignment teams with 50% probability, and the DM / OpenAI / etc. alignment team can influence the corresponding org with 20% probability, then the overall probability of Ought influencing the org that builds AGI is 10%. Your estimate of 1% seems too low to me unless you are a lot more pessimistic about alignment researchers influencing their organization from the inside.
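
To make the arithmetic behind this estimate explicit, here is a minimal sketch. It assumes the two steps in the chain are independent, so their probabilities simply multiply; the function and variable names are hypothetical and only illustrate the 50% and 20% figures given above.

```python
from fractions import Fraction


def chain_probability(*step_probabilities):
    """Multiply the success probabilities of each step in a chain.

    Assumes the steps are independent, which is a simplification.
    """
    result = Fraction(1)
    for p in step_probabilities:
        result *= p
    return result


# Figures from the comment above (hypothetical variable names).
p_cross_team = Fraction(50, 100)   # Ought influences another alignment team
p_team_to_org = Fraction(20, 100)  # that team influences its own org

overall = chain_probability(p_cross_team, p_team_to_org)
print(float(overall))  # 0.1

# Within-org probability implied by a 1% overall estimate,
# holding the cross-team step fixed at 50%.
implied_team_to_org = Fraction(1, 100) / p_cross_team
print(float(implied_team_to_org))  # 0.02
```

Under these assumptions, an overall estimate of 1% would correspond to roughly a 2% chance of an alignment team influencing its own organization, if the cross-team step stays at 50%.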


Thanks Thomas for the helpful overview post! Great to hear that you found the AGI ruin opinions survey useful.

I agree with Rohin's summary of what we're working on. I would add "understanding / distilling threat models" to the list, e.g. "refining the sharp left turn" and "will capabilities generalize more". 

Some corrections for your overall description of the DM alignment team:

  • I would count ~20-25 FTE on the alignment + scalable alignment teams (this does not include the AGI strategy & governance team)
  • I would put DM alignment in the "fairly hard" bucket (p(doom) = 10-50%) for alignment difficulty, and the "mixed" bucket for "conceptual vs applied"