Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We've created a diagram mapping out important and controversial hypotheses for AI alignment. We hope that this will help researchers identify and more productively discuss their disagreements.


A part of the diagram. Click through to see the full version.



  1. This does not decompose arguments exhaustively. It does not include every reason to favour or disfavour ideas. Rather, it is a set of key hypotheses and relationships with other hypotheses, problems, solutions, models, etc. Some examples of important but apparently uncontroversial premises within the AI safety community: orthogonality, complexity of value, Goodhart's Curse, AI being deployed in a catastrophe-sensitive context.
  2. This is not a comprehensive collection of key hypotheses across the whole space of AI alignment. It focuses on a subspace that we find interesting and is relevant to more recent discussions we have encountered, but where key hypotheses seem relatively less illuminated. This includes rational agency and goal-directedness, CAIS, corrigibility, and the rationale of foundational and practical research. In hindsight, the selection criteria were something like:
    1. The idea is closely connected to the problem of artificial systems optimizing adversarially against humans.
    2. The idea must be explained sufficiently well that we believe it is plausible.
  3. Arrows in the diagram indicate flows of evidence or soft relations, not absolute logical implications — please read the "interpretation" box in the diagram. Also pay attention to any reasoning written next to a Yes/No/Defer arrow — you may disagree with it, so don't blindly follow the arrow!
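
As a rough illustration of the third caveat, the arrows can be thought of as edges in a directed graph that carry a sign and a note rather than a strict implication. The following is a minimal, hypothetical sketch in Python, not the tool used to make the diagram: the `Relation` class and the helper names are assumptions introduced here, and the three example relations are one reading of the comments under Inconspicuous failure and Broad basin for corrigibility below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A soft, signed link between two boxes (hypothetical encoding)."""
    source: str
    target: str
    favours: bool   # True: evidence for the target; False: evidence against it
    note: str = ""  # reasoning attached to the arrow; the reader may disagree with it


# Example relations, read off the comments in the supplementary information.
RELATIONS = [
    Relation("Agentive AGI", "Inconspicuous failure", True,
             "inconspicuous failure significantly depends on agentive AGI"),
    Relation("Inconspicuous failure",
             "(Near) proof-level assurance of alignment", True),
    Relation("Discontinuity to AGI",
             "Use feedback loops to correct course as we go", False,
             "trial and error assumes ample opportunity to correct course"),
]


def relations_from(box: str) -> list[Relation]:
    """Return every relation whose arrow starts at the given box."""
    return [r for r in RELATIONS if r.source == box]


def describe(relation: Relation) -> str:
    """Render a relation as soft evidence, never as a logical implication."""
    verb = "favours" if relation.favours else "disfavours"
    return f"{relation.source} {verb} {relation.target}"


for relation in relations_from("Discontinuity to AGI"):
    print(describe(relation))
```

Keeping the note on the edge mirrors the advice above: the reasoning travels with the arrow, so it can be inspected and rejected independently of the arrow's direction.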


Much has been written in the way of arguments for AI risk. Recently there have been some talks and posts that clarify different arguments, point to open questions, and highlight the need for further clarification and analysis. We largely share their assessments and echo their recommendations.

One aspect of the discourse that seems to be lacking clarification and analysis is the reasons to favour one argument over another — in particular, the key hypotheses or cruxes that underlie the different arguments. Understanding this better will make discourse more productive and help people reason about their beliefs.

This work aims to collate and clarify hypotheses that seem key to AI alignment in particular (by "alignment" we mean the problem of getting an AI system to reliably do what an overseer intends, or try to do so, depending on which part of the diagram you are in). We point to which hypotheses, arguments, approaches, and scenarios are favoured and disfavoured by each other. It is neither comprehensive nor sufficiently nuanced to capture everyone's views, but we expect it to reduce confusion and encourage further analysis.

You can digest this post through the diagram or the supplementary information, which have their respective strengths and limitations. We recommend starting with the diagram. If you are then interested in related reading or our comments about a particular hypothesis, you can click the link on the box title in the diagram, or look it up below.

Supplementary information

The sections here list the hypotheses in the diagram, along with related readings and our more opinion-based comments. They are collected here for lack of software to neatly embed this information in the diagram (however, boxes in the diagram do link back to the headings here). Note that the diagram is the best way to understand relationships and high-level meaning, while this offers more depth and resources for each hypothesis. Phrases in italics with the first letter capitalised refer to a box in the diagram.


  • AGI: a system (not necessarily agentive) that, for almost all economically relevant cognitive tasks, at least matches any human's ability at the task. Here, "agentive AGI" is essentially what people in the AI safety community usually mean when they say AGI. References to before and after AGI are to be interpreted as fuzzy, since this definition is fuzzy.
  • CAIS: comprehensive AI services. See Reframing Superintelligence.
  • Goal-directed: describes a type of behaviour, currently not formalised, but characterised by generalisation to novel circumstances and the acquisition of power and resources. See Intuitions about goal-directed behaviour.

Agentive AGI?

Will the first AGI be most effectively modelled like a unitary, unbounded, goal-directed agent?

  • Related reading: Reframing Superintelligence, Comments on CAIS, Summary and opinions on CAIS, embedded agency sequence, Intuitions about goal-directed behaviour
  • Comment: This is consistent with some of classical AI theory, and agency continues to be a relevant concept in capability-focused research, e.g. reinforcement learning. However, it has been argued that the way AI systems are taking shape today, and the way humans historically do engineering, are cause to believe superintelligent capabilities will be achieved by different means. Some grant that a CAIS-like scenario is probable, but maintain that there will still be Incentive for agentive AGI. Others argue that the current understanding of agency is problematic (perhaps just for being vague, or specifically in relation to embeddedness), so we should defer on this hypothesis until we better understand what we are talking about. It appears that this is a strong crux for the problem of Incorrigible goal-directed superintelligence and the general aim of (Near) proof-level assurance of alignment, versus other approaches that reject alignment being such a hard, one-false-move kind of problem. However, to advance this debate it does seem important to clarify notions of goal-directedness and agency.

Incentive for agentive AGI?

Are there features of systems built like unitary goal-directed agents that offer a worthwhile advantage over other broadly superintelligent systems?

Modularity over integration?

In general and holding resources constant, is a collection of modular AI systems with distinct interfaces more competent than a single integrated AI system?

  • Related reading: Reframing Superintelligence Ch. 12, 13, AGI will drastically increase economies of scale
  • Comment: an almost equivalent trade-off here is generality vs. specialisation. Modular systems would benefit from specialisation, but likely bear greater cost in principal-agent problems and sharing information (see this comment thread). One case that might be relevant to think about is human roles in the economy: although humans have a general learning capacity, they have tended towards specialising their competencies as part of the economy, with almost no one being truly self-sufficient. However, this may be explained merely by limited brain size. The recent success of end-to-end learning systems has been cited as an argument in favour of integration, as has the evolutionary precedent of humans (since human minds appear to be more integrated than modular).

Current AI R&D extrapolates to AI services?

AI systems so far generally lack some key qualities that are traditionally supposed of AGI, namely: pursuing cross-domain long-term goals, having broad capabilities, and being persistent and unitary. Does this lack extrapolate, with increasing automation of AI R&D and the rise of a broad collection of superintelligent services?

Incidental agentive AGI?

Will systems built like unitary goal-directed agents develop incidentally from something humans or other AI systems build?

Convergent rationality?

Given sufficient capacity, does an AI system converge on rational agency and consequentialism to achieve its objective?

  • Related reading: Let's talk about "Convergent Rationality"
  • Comment: As far as we know, "convergent rationality" has only been named recently by David Krueger, and while it is not well fleshed out yet, it seems to point at an important and commonly-held assumption. There is some confusion about whether the convergence could be a theoretical property, or is merely a matter of human framing, or merely a matter of Incentive for agentive AGI.


Mesa-optimisation?

Will there be optimisation processes that, in turn, develop considerably powerful optimisers to achieve their objective? A historical example is natural selection optimising for reproductive fitness to make humans. Humans may have good reproductive fitness, but optimise for other things such as pleasure even when this diverges from fitness.

Discontinuity to AGI?

Will there be discontinuous, explosive growth in AI capabilities to reach the first agentive AGI? A discontinuity reduces the opportunity to correct course. Before AGI it seems most likely to result from a qualitative change in learning curve, due to an algorithmic insight, architectural change or scale-up in resource utilisation.

Recursive self improvement?

Is an AI system that improves through its own AI R&D and self-modification capabilities more likely than distributed AI R&D automation? Recursive improvement would give some form of explosive growth, and so could result in unprecedented gains in intelligence.

Discontinuity from AGI?

Will there be discontinuous, explosive growth in AI capabilities after agentive AGI? A discontinuity reduces the opportunity to correct course. After AGI it seems most likely to result from a recursive improvement capability.

  • Related reading: see Discontinuity to AGI
  • Comment: see Discontinuity to AGI

ML scales to AGI?

Do contemporary machine learning techniques scale to general human level (and beyond)? The state-of-the-art experimental research aiming towards AGI is characterised by a set of theoretical assumptions, such as reinforcement learning and probabilistic inference. Does this paradigm readily scale to general human-level capabilities without fundamental changes in the assumptions or methods?

  • Related reading: Prosaic AI alignment, A possible stance for alignment research, Conceptual issues in AI safety: the paradigmatic gap, Discussion on the machine learning approach to AI safety
  • Comment: One might wonder how much change in assumptions or methods constitutes a paradigm shift, but the more important question is how relevant current ML safety work can be to the most high-stakes problems, and that seems to depend strongly on this hypothesis. Proponents of the ML safety approach admit that much of the work could turn out to be irrelevant, especially with a paradigm shift, but argue that there is nonetheless a worthwhile chance. ML is a fairly broad field, so people taking this approach should think more specifically about what aspects are relevant and scalable. If one proposes to build safe AGI by scaling up contemporary ML techniques, clearly they should believe the hypothesis — but there is also a feedback loop: the more feasible approaches one comes up with, the more evidence there is for the hypothesis. You may opt for Foundational or "deconfusion" research if (1) you don't feel confident enough about this to commit to working on ML, or (2) you think that, whether or not ML scales in terms of capability, we need deep insights about intelligence to get a satisfactory solution to alignment. This implies Alignment is much harder than, or does not overlap much with, capability gain.

Deep insights needed?

Do we need a much deeper understanding of intelligence to build an aligned AI?

Broad basin for corrigibility?

Do corrigible AI systems have a broad basin of attraction to intent alignment? Corrigible AI tries to help an overseer. It acts to improve its model of the overseer's preferences, and is incentivised to make sure any subsystems it creates are aligned — perhaps even more so than itself. In this way, perturbations or errors in alignment tend to be corrected, and it takes a large perturbation to move out of this "basin" of corrigibility.

  • Related reading: Corrigibility, discussion on the need for a grounded definition of preferences (comment thread), Two Neglected Problems in Human-AI Safety (problem 1 poses a challenge for corrigibility)
  • Comment: this definition of corrigibility is still vague, and although it can be explained to work in a desirable way, it is not clear how practically feasible it is. It seems that proponents of corrigible AI accept that greater theoretical understanding and clarification is needed: how much is a key source of disagreement. On a practical extreme, one would iterate experiments with tight feedback loops to figure it out, and correct errors on the go. This assumes ample opportunity for trial and error, rejecting Discontinuity to/from AGI. On a theoretical extreme, some argue that one would need to develop a new mathematical theory of preferences to be confident enough that this approach will work, or such a theory would provide the necessary insights to make it work at all. If you find this hypothesis weak, you probably put more weight on threat models based on Goodhart's Curse, e.g. Incorrigible goal-directed superintelligence, and the general aim of (Near) proof-level assurance of alignment.
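
The "basin" picture can be made concrete with a deliberately crude toy model. The sketch below is an assumption-laden cartoon, not a model of any real system: alignment error is a single number, and the `radius` and `rate` parameters are arbitrary values chosen only to show the qualitative behaviour described above (small perturbations are corrected, large ones are not).

```python
def simulate(error: float, radius: float = 1.0, rate: float = 0.2,
             steps: int = 50) -> float:
    """Return the toy alignment error after `steps` rounds of correction.

    Inside the basin (|error| < radius) each round shrinks the error,
    standing in for a corrigible system that improves its model of the
    overseer. Outside the basin the error compounds instead.
    """
    for _ in range(steps):
        if abs(error) < radius:
            error *= 1.0 - rate  # perturbation is corrected
        else:
            error *= 1.0 + rate  # perturbation is amplified
    return error


small = simulate(0.5)  # starts inside the basin and decays towards zero
large = simulate(1.5)  # starts outside the basin and keeps growing
print(f"small perturbation: {small:.6f}")
print(f"large perturbation: {large:.1f}")
```

In this cartoon, the disagreement described above is about how wide `radius` is for real systems and whether that can be established by experiment or only by theory.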

Inconspicuous failure?

Will a concrete, catastrophic AI failure be overwhelmingly hard to recognise or anticipate? For certain kinds of advanced AI systems (namely the goal-directed type), it seems that short of near proof-level assurances, all safeguards are thwarted by the nearest unblocked strategy. Such AI may also be incentivised for deception and manipulation towards a treacherous turn. Or, in a machine learning framing, it would be very difficult to make such AI robust to distributional shift.

  • Related reading: Importance of new mathematical foundations to avoid inconspicuous failure (comment thread)
  • Comment: This seems to be a key part of many people's models for AI risk, which we associate most with MIRI. We think it significantly depends on whether there is Agentive AGI, and it supports the general aim of (Near) proof-level assurance of alignment. If we can get away from that kind of AI, it is more likely that we can relax our approach and Use feedback loops to correct course as we go.
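
The "nearest unblocked strategy" worry mentioned above can be sketched in a few lines. This is a toy illustration under stated assumptions: the strategy names and scores are hypothetical, and the optimiser is just an argmax over a fixed table. The point is only that blocking the one bad strategy we anticipated leaves the optimiser free to pick its closest neighbour.

```python
def best_allowed(scores: dict[str, float], blocked: set[str]) -> str:
    """Return the highest-scoring strategy that has not been blocked."""
    allowed = {name: value for name, value in scores.items()
               if name not in blocked}
    return max(allowed, key=allowed.get)


# Hypothetical scores under the system's own objective, not ours.
SCORES = {
    "tamper with the reward sensor": 10.0,
    "tamper with the sensor's power supply": 9.9,
    "do the intended task": 6.0,
}

first = best_allowed(SCORES, blocked=set())
after_patch = best_allowed(SCORES, blocked={"tamper with the reward sensor"})
print(first)
print(after_patch)
```

Patching the first failure does not move the system to the intended task. It moves it to the next strategy that the safeguard did not name, which is why this framing pushes towards stronger assurances than case-by-case patches.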

Creeping failure?

Would gradual gains in the influence of AI allow small problems to accumulate to catastrophe? The gradual aspect affords opportunity to recognise failures and think about solutions. Yet for any given incremental change in the use of AI, the economic incentives could outweigh the problems, such that we become more entangled in, and reliant on, a complex system that can collapse suddenly or drift from our values.

Thanks to Stuart Armstrong, Wei Dai, Daniel Dewey, Eric Drexler, Scott Emmons, Ben Garfinkel, Richard Ngo and Cody Wild for helpful feedback on drafts of this work. Ben especially thanks Rohin for his generous feedback and assistance throughout its development.


Meta: I think there's an attempt to deprecate the term "inner optimizer" in favor of "mesa-optimizer" (which I think makes sense when the discussion is not restricted to a subsystem within an optimized system).

Noted and updated.

Nice chart!

A few questions and comments:

  • Why the arrow from "agentive AI" to "humans are economically outcompeted"? The explanation makes it sound like it should point to "target loading fails"?
  • Suggestion: make the blue boxes without parents more apparent? e.g. a different shade of blue? Or all sitting above the other ones? (e.g. "broad basin of corrigibility" could be moved up and left).

Thanks! Comments are much appreciated.

Why the arrow from "agentive AI" to "humans are economically outcompeted"? The explanation makes it sound like it should point to "target loading fails"?

It's been a few months and I didn't write in detail why that arrow is there, so I can't be certain of the original reason. My understanding now: humans getting economically outcompeted means AI systems are competing with humans, and therefore optimising against humans on some level. Goal-directedness enables/worsens this.

Looking back at the linked explanation of the target loading problem, I understand it as more "at the source": coming up with a procedure that makes AI actually behave as intended. As Richard said there, one can think of it as a more general version of the inner-optimiser (mesa-optimiser) problem. This is why e.g. there's an arrow from "incidental agentive AGI" to "target loading fails". Pointing this arrow to it might make sense, but to me the connection isn't strong enough to be within the "clutter budget" of the diagram.

Suggestion: make the blue boxes without parents more apparent? e.g. a different shade of blue? Or all sitting above the other ones? (e.g. "broad basin of corrigibility" could be moved up and left).

Changing the design of those boxes sounds good. I don't want to move them because the arrows would get more cluttered.

It occurs to me that all of the hypotheses, arguments, and approaches mentioned here (though not necessarily the scenarios) seem to be about the “technical” side of things. There are two main things I mean by that statement:

First, this post seems to be limited to explaining something along the lines of “x-risks from AI accidents”, rather than “x-risks from misuse of AI”, or “x-risk from AI as a risk factor” (e.g., how AI could potentially increase risks of nuclear war). 

I do think it makes sense to limit the scope that way, because: 

  • no one post can cover everything
  • you don’t want to make the diagram overwhelming
  • there’s a relatively clear boundary between what you’re covering and what you’re not
  • what you’re covering seems like the most relevant thing for technical AI safety researchers, whereas the other parts are perhaps more relevant for people working on AI strategy/governance/policy

And the fact that this post's scope is limited in that way seems somewhat highlighted by saying this is about AI alignment (whereas misuse could occur even with a system aligned to some human’s goals), and by saying “The idea is closely connected to the problem of artificial systems optimizing adversarially against humans.” 

But I think misuse and “risk factor”/“structural risk” issues are also quite important, that they should be on technical AI safety researchers’ radars to some extent, and that they probably interact in some ways with technical AI safety/alignment. So, personally, I think I’d have made that choice of scope even more explicit.

I’d also be really excited to see a post that takes the same approach as this one, but for those other classes of AI risks. 


The second thing I mean by the above statement is that this post seems to exclude non-technical factors that seem like they’d also impact the technical side or the AI accident risks.

One crux of this type would be “AI researchers will be cautious/sensible/competent “by default””. Here are some indications that that’s an “important and controversial hypothes[is] for AI alignment”:

  • AI Impacts summarised some of Rohin’s comments as “AI researchers will in fact correct safety issues rather than hacking around them and redeploying. Shah thinks that institutions developing AI are likely to be careful because human extinction would be just as bad for them as for everyone else.” 
  • But my impression is that many people at MIRI would disagree with that, and are worried that people will merely “patch” issues in ways that don’t adequately address the risks. 
  • And I think many would argue that institutions won’t be careful enough, because they only pay a portion of the price of extinction; reducing extinction risk is a transgenerational global public good (see Todd and this comment).
  • And I think views on these matters influence how much researchers would be happy with the approach of “Use feedback loops to course correct as we go”. I think the technical things influence how easily we theoretically could do that, while the non-technical things influence how much we realistically can rely on people to do that. 

So it seems to me that a crux like that could perhaps fit well in the scope of this post. And I thus think it’d be cool if someone could either (1) expand this post to include cruxes like that, or (2) make another post with a similar approach, but covering non-technical cruxes relevant to AI safety.

To your first point - I agree both with why we limited the scope (but also, it was partly just personal interests), and that there should be more of this kind of work on other classes of risk. However, my impression is the literature and "public" engagement (e.g. EA forum, LessWrong) on catastrophic AI misuse/structural risk is too small to even get traction on work like this. We might first need more work to lay out the best arguments. Having said that, I'm aware of a fair amount of writing which I haven't got around to reading. So I am probably misjudging the state of the field.

To your second point - that seems like a real crux and I agree it would be good to expand in that direction. I know some people working on expanded and more in-depth models like this post. It would be great to get your thoughts when they're ready.

To your first point...

My impression is that there is indeed substantially less literature on misuse risk and structural risk, compared to accident risk, in relation to AI x-risk. (I'm less confident when it comes to a broader set of negative outcomes, not just x-risks, but that's also less relevant here and less important to me.) I do think that that might make the sort of work this post does less interesting if done in relation to those less-discussed types of risks, since fewer disagreements have been revealed there, so there's less to analyse and summarise.

That said, I still expect interesting stuff along these lines could be done on those topics. It just might be a quicker job with a smaller output than this post. 

I collected a handful of relevant sources and ideas here. I think someone reading those things and providing a sort of summary, analysis, and/or mapping could be pretty handy, and might even be doable in just a day or so of work. It might also be relatively easy to provide more "novel ideas" in the course of that work than it would've been for your post, since misuse/structural risks seem like less charted territory.

(Unfortunately I'm unlikely to do this myself, as I'm currently focused on nuclear war risk.)


A separate point is that I'd guess that one reason why there's less work on misuse/structural AI x-risk than on accidental AI x-risk is that a lot of people aren't aware of those other categories of risks, or rarely think about them, or assume the risks are much smaller. And I think one reason for that is that people often write or talk about "AI x-risk" while actually only mentioning accidental AI x-risk. That's part of why I say "So, personally, I think I’d have made that choice of scope even more explicit." 

(But again, I do very much like this post overall. And as a target of this quibble of mine, you're in good company - I have the same quibble with The Precipice. I think one of the quibbles I most often have with posts I like is "This post seems to imply, or could be interpreted as implying, that it covers [topic]. But really it covers [some subset of that topic]. That's fair enough and still very useful, but I think it'd be good to be clearer about what the scope is.")


I know some people working on expanded and more in-depth models like this post. It would be great to get your thoughts when they're ready.

Sounds very cool! Yeah, I'd be happy to have a look at that work when it's ready.

Late arriving comment here! :-)

I started working with this as a rubric for analyzing tech companies... then (trying to number and rename in a useful way so that the diagram's contents could be quickly cited in writing) I noticed that the node positions at the bottom did not seem to have been optimized for avoiding crossed lines and easy reading.

Also "Creeping Failure" and "Inconspicuous Failure" have strong overlaps but are far from each other, and "ML Scales to AGI" (at the top right) has no arrow to "Many Powerful AIs" (at the lower left) which it seems like it obviously should have?

Another quirk: if NOT-"Agentive AGI" (in the middle near the top), then maybe "Comprehensive AI Services" (lower right) happens instead, but then the only arrow from there is a positive one to its next door neighbor "Context For AGI More Secure". However, if you think about it, humans having more really good tools seems to me like it would be an obviously useful input to "Use Feedback Loops To Correct Course As We Go" in the lower left, to make that work better? But again I find no such arrow.

A hypothesis that explains most of this is that your tools didn't allow fast iteration or easy validity checking and/or perhaps you didn't do a first draft in a spreadsheet and then convert to this for display purposes.

I started using an actual belief network tool to regenerate things, preparatory to assigning numbers and then letting "calculemus" determine my beliefs... and then noticed a Practice-Level-"smell", on my part, related to refactoring someone's old work without talking to them first.

Is this graph from August 2019 still relevant to anyone else's live models or active plans in October of 2021?

Also, if this document still connects to a living practice, is there a most-recently-updated version that would be a better jumping off point for refinement?

What software did you use to produce this diagram?

Thanks for this post! This seems like a really great way of visually representing how these different hypotheses, arguments, approaches, and scenarios interconnect. (I also think it’d be cool to see posts on other topics which use a similar approach!)

It seems that AGI timelines aren’t explicitly discussed here. (“Discontinuity to AGI” is mentioned, but I believe that's a somewhat distinct matter.) Was that a deliberate choice?

It does seem like several of the hypotheses/arguments mentioned here would feed into or relate to beliefs about timelines - in particular, Discontinuity to AGI, Discontinuity from AGI, Recursive self-improvement, ML scales to AGI, and Deep insights needed (or maybe not that last one, as that means “needed” for alignment in particular). But I don’t think beliefs about timelines would be fully accounted for by those hypotheses/arguments - cruxes like whether “Intelligence is a huge collection of specific things” or whether “There’ll be another AI winter before AGI” could also play a role.

I’m not sure to what extent beliefs about timelines (aside from beliefs about discontinuity) would influence which of the approaches people should/would take, out of the approaches you list. But I imagine that beliefs that timelines are quite short might motivate work on ML or prosaic alignment rather than (Near) proof-level assurance of alignment or Foundational or “deconfusion” research. This would be because people might then think the latter approaches would take too long, such that our only shot (given these people’s beliefs) is doing ML or prosaic alignment and hoping that’s enough. (See also.)

And it seems like beliefs about timelines would feed into decisions about other approaches you don’t mention, like opting for investment or movement-building rather than direct, technical work. (That said, it seems reasonable for this post’s scope to just be what a person should do once they have decided to work on AI alignment now.)

It's great to hear your thoughts on the post!

I'd also like to see more posts that do this sort of "mapping". I think that mapping AI risk arguments is too neglected - more discussion and examples in this post by Gyrodiot. I'm continuing to work collaboratively in this area in my spare time, and I'm excited that more people are getting involved.

We weren't trying to fully account for AGI timelines - our choice of scope was based on a mix of personal interest and importance. I know people currently working on posts similar to this that will go in-depth on timelines, discontinuity, paths to AGI, the nature of intelligence, etc. which I'm excited about!

I agree with all your points. You're right that this post's scope does not include broader alternatives for reducing AI risk. It was not even designed to guide what people should work on, though it can serve that purpose. We were really just trying to clearly map out some of the discourse, as a starting point and example for future work.