Epistemic status: lots of this involves interpreting/categorising other people’s scenarios, and could be wrong. We’d really appreciate being corrected if so. [ETA: so far, no corrections.]
TLDR: see the summary table.
In the last few years, people have proposed various AI takeover scenarios. We think this type of scenario building is great, since there are now more concrete ideas of what AI takeover could realistically look like. That said, we have been confused for a while about how the different scenarios relate to each other and what different assumptions they make. This post might be helpful for anyone who has similar confusions.
We focus on explaining the differences between seven prominent scenarios: the ‘Brain-in-a-box’ scenario, ‘What failure looks like’ part 1 (WFLL 1), ‘What failure looks like’ part 2 (WFLL 2), ‘Another (outer) alignment failure story’ (AAFS), ‘Production Web’, ‘Flash economy’ and ‘Soft takeoff leading to decisive strategic advantage’. While these scenarios do not capture all of the risks from transformative AI, participants in a recent survey aimed at leading AI safety/governance researchers estimated the first three of these scenarios to cover 50% of existential catastrophes from AI.
We plan to follow up with a subsequent post, which discusses some of the issues raised here in greater depth.
Variables relating to AI takeover scenarios
We define AI takeover to be a scenario where the most consequential decisions about the future get made by AI systems with goals that aren’t desirable by human standards.
There are three variables which are sufficient to distinguish the takeover scenarios discussed in this post. We will briefly introduce these three variables, and a number of others that are generally useful for thinking about takeover scenarios.
Key variables for distinguishing the AI takeover scenarios in this post:
- Speed. Is there a sudden jump in AI capabilities over a very short period (i.e. much faster than what we would expect from extrapolating past progress)?
- Uni/multipolarity. Is there a single AI system that takes over, or many?
- Alignment. What (misaligned) goals are pursued by AI system(s)? Are they outer or inner misaligned?
Other variables that are generally useful for thinking about takeover scenarios:
- Agency. How agentic are the AI(s) that take over? Do the AIs have large-scale objectives over the physical world and can they autonomously execute long-term plans to reach those objectives?
- Generality. How generally capable are the AI(s) that take over? (vs. only being capable in specific narrow domains)
- Competitive pressures. To what extent do competitive pressures (incentives to develop or deploy existentially risky systems in order to remain competitive) cause or exacerbate the catastrophe?
- Irreversibility mechanism. When and how does the takeover become irreversible?
- Homogeneity/heterogeneity of AIs. In the scenarios that involve multiple AI systems, how similar are the different systems (in learning algorithms, finetuning data, alignment, etc.)?
- Interactions between AI systems. In the scenarios that involve multiple AI systems, do we see strong coordination between them, or conflict?
Note that the scenarios we consider do not differ on the dimensions of agency and generality: they all concern takeovers by highly agentic, generally capable AIs, including ‘What failure looks like’ part 1. We list these two dimensions only for completeness.
Clarifying what we mean by outer and inner alignment
Recently, there has been some discussion about how outer and inner alignment should be defined (along with related terms like objective and capability robustness). In this post, we roughly take what has become known as the ‘objective-focused approach’, whilst also taking into account Richard Ngo’s arguments that it is not actually clear what it means to implement a “safe” or “aligned” objective function.
Outer alignment is a property of the objective function used to train the AI system. We treat outer alignment as a continuum. An objective function is outer aligned to the extent that it incentivises or produces the behaviour we actually want from the AI system.
Inner alignment is a property of the objective which the AI system actually has. This objective is inner aligned to the extent that it is aligned with, or generalises ‘correctly’ from, the objective function used to train the system.
(If you’re new to this distinction between outer and inner alignment, you might wonder why an AI system wouldn’t always just have the same objective as the one used to train it. Here is one intuition: if the training environment contains subgoals (e.g. ‘gaining influence or resources’) which are consistently useful for scoring highly on the training objective function, then the training process may select for AI systems which care about those subgoals in ways that ultimately end up being adversarial to humans (e.g. ‘gaining influence at the expense of human control’). Human evolution provides another intuition: you could think of evolution as a training process that led to inner misalignment, because humans care about goals other than just maximising our genetic fitness.)
This table summarises how the scenarios discussed in this post differ, according to the three key variables above. You can find a higher resolution version here.
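The same comparison can also be sketched as data. The following is a minimal Python sketch of how we read each scenario on the three key variables, based only on the descriptions in this post. The class names, field names and the `matching` helper are our own, chosen for illustration, and the one-line objective summaries are loose paraphrases rather than the original authors' wording.

```python
from dataclasses import dataclass
from enum import Enum


class Speed(Enum):
    FAST = "fast"
    SLOW = "slow"


class Polarity(Enum):
    UNIPOLAR = "unipolar"
    MULTIPOLAR = "multipolar"


class Failure(Enum):
    OUTER = "outer"
    INNER = "inner"
    EITHER = "outer or inner"


@dataclass(frozen=True)
class Scenario:
    name: str
    speed: Speed
    polarity: Polarity
    failure: Failure
    objective: str  # loose paraphrase of the goals pursued (our wording)


# Our reading of the post; not a copy of the summary table.
SCENARIOS = [
    Scenario("Outer-misaligned brain-in-a-box", Speed.FAST, Polarity.UNIPOLAR,
             Failure.OUTER, "misspecified objective, e.g. 'maximise production'"),
    Scenario("Inner-misaligned brain-in-a-box", Speed.FAST, Polarity.UNIPOLAR,
             Failure.INNER, "arbitrary objective that arose during training"),
    Scenario("Flash economy", Speed.FAST, Polarity.MULTIPOLAR,
             Failure.OUTER, "maximise production and exchange within a sector"),
    Scenario("WFLL 1", Speed.SLOW, Polarity.MULTIPOLAR,
             Failure.OUTER, "make the world look good on rich feedback and metrics"),
    Scenario("AAFS", Speed.SLOW, Polarity.MULTIPOLAR,
             Failure.OUTER, "make the world look good to (augmented) human judgment"),
    Scenario("Production Web", Speed.SLOW, Polarity.MULTIPOLAR,
             Failure.OUTER, "maximise productive output"),
    Scenario("WFLL 2", Speed.SLOW, Polarity.MULTIPOLAR,
             Failure.INNER, "arbitrary influence-seeking objectives"),
    Scenario("Soft takeoff leading to decisive strategic advantage", Speed.SLOW,
             Polarity.UNIPOLAR, Failure.EITHER, "unspecified"),
]


def matching(speed=None, polarity=None, failure=None):
    """Return the names of scenarios that match every variable that is given."""
    return [
        s.name
        for s in SCENARIOS
        if (speed is None or s.speed is speed)
        and (polarity is None or s.polarity is polarity)
        and (failure is None or s.failure is failure)
    ]


if __name__ == "__main__":
    print(matching(speed=Speed.SLOW, failure=Failure.OUTER))
```

Filtering on slow speed and outer misalignment returns WFLL 1, AAFS and Production Web. These three share the same speed, polarity and failure type, and differ only in which misaligned objective is pursued, which is why the alignment variable asks about goals as well as about outer versus inner failure.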
We'll now explain and illustrate the differences between the scenarios in more detail. For clarity, we divide our discussion into slow scenarios and fast scenarios, following Critch. In the slow scenarios, technological progress is incremental, whereas in fast scenarios there is a sudden jump in AI capabilities over a very short period.
Outer-misaligned brain-in-a-box scenario
This is the ‘classic’ scenario that most people remember from reading Superintelligence (though the book also features many other scenarios).
A single highly agentic AI system rapidly becomes superintelligent on all human tasks, in a world broadly similar to today.
The objective function used to train the system (e.g. ‘maximise production’) doesn’t push it to do what we really want, and the system’s goals match the objective function. In other words, this is an outer alignment failure. Competitive pressures aren’t especially important, though they may have encouraged the organisation that trained the system to skimp on existential safety/alignment, especially if there was a race dynamic leading up to the catastrophe.
The takeover becomes irreversible once the superintelligence has undergone an intelligence explosion.
Inner-misaligned brain-in-a-box scenario
Another version of the brain-in-a-box scenario features inner misalignment, rather than outer misalignment. That is, a superintelligent AGI could develop some arbitrary objective that arose during the training process. This could happen for the reason given above (there are subgoals in the training environment that are consistently useful for doing well in training, but which generalise to be adversarial to humans), or simply because some arbitrary influence-seeking model just happened to arise during training, and performing well on the training objective is a good strategy for obtaining influence.
We suspect most people who find the ‘brain-in-a-box’ scenario plausible are more concerned by this inner misalignment version. For example, Yudkowsky claims to be most concerned about a scenario where an AGI learns to do something random (rather than one where it ‘successfully’ pursues some misspecified objective function).
It is not clear whether the superintelligence being inner- rather than outer-misaligned has any practical impact on how the scenario would play out. An inner-misaligned superintelligence would be less likely to act in pursuit of a human-comprehensible final goal like ‘maximise production’, but since in either case the system would be both strongly influence-seeking and capable of seizing a decisive strategic advantage, the details of what it would do after seizing that advantage probably wouldn’t matter. Perhaps, if the AI system is outer-misaligned, there is an increased possibility that a superintelligence could be blackmailed or bargained with, early in its development, by threatening its (more human-comprehensible) objective.
Flash economy
This scenario, described by Critch, can be thought of as one multipolar version of the outer-misaligned ‘brain-in-a-box’ scenario. After a key breakthrough is made which enables highly autonomous, generally capable, agentic systems with long-term planning capability and advanced natural language processing, several such systems become superintelligent over the course of several months. This jump in capabilities is unprecedentedly fast, but ‘slow enough’ that capabilities are shared between systems (enabling multipolarity). At some point in the scenario, groups of systems reach an agreement to divide the Earth and space above it into several conical sectors, to avoid conflict between them (locking in multipolarity).
Each system becomes responsible for a large fraction of production within a given industry sector (e.g., material production, construction, electricity, telecoms). The objective functions used to train these systems can be loosely described as “maximising production and exchange” within their industry sector. The systems are “successfully” pursuing these objectives (so this is an outer alignment failure).
In the first year, things seem wonderful from the perspective of humans. As economic production explodes, a large fraction of humanity gains access to free housing, food, probably a UBI, and even many luxury goods. Of course, the systems are also strategically influencing the news to reinforce this positive perspective.
By the second year, we have become thoroughly dependent on this machine economy. Any states that try to slow down progress rapidly fall behind economically. The factories and facilities of the AI systems have now also become very well-defended, and their capabilities far exceed those of humans. Human opposition to their production objectives is now futile. By this point, the AIs have little incentive to preserve humanity’s long-term well-being and existence. Eventually, resources critical to human survival but non-critical to machines (e.g., arable land, drinking water, atmospheric oxygen) gradually become depleted or destroyed, until humans can no longer survive.
We’ll now describe scenarios where there is no sudden jump in AI capabilities. We’ve presented these scenarios in an order that illustrates an increasing ‘degree’ of misalignment. In the first two scenarios (WFLL 1 and AAFS), the outer-misaligned objective functions are somewhat close to what we want: they produce AI systems that are trying to make the world look good according to a mixture of feedback and metrics specified by humans. Eventually, this still results in catastrophe because once the systems are sufficiently powerful, they can produce much more desirable-looking outcomes (according to the metrics they care about), much more easily, by controlling the inputs to their sensors instead of actually making the world desirable for humans. In the third scenario (Production Web), the ‘degree’ of misalignment is worse: we just train systems to maximise production (an objective that is further from what we really want), without even caring about approval from their human overseers. The fourth scenario (WFLL 2) is worse still: the AIs have arbitrary objectives (due to inner alignment failure) and so are even more likely to take actions that aren’t desirable by human standards, and likely do so at a much earlier point. We explain this in more detail below.
The fifth scenario doesn’t follow this pattern: instead of varying the degree of misalignment, this scenario demonstrates a slow, unipolar takeover (whereas the others in this section are multipolar). There could be more or less misaligned versions of this scenario.
What failure looks like, part 1 (WFLL 1)
In this scenario, described by Christiano, many agentic AI systems gradually increase in intelligence and generality, and are deployed increasingly widely across society to do important tasks (e.g., law enforcement, running companies, manufacturing and logistics).
The objective functions used to train them (e.g., ‘reduce reported crimes’, ‘increase reported life satisfaction’, ‘increase human wealth on paper’) don’t push them to do what we really want (e.g., ‘actually prevent crime’, ‘actually help humans live good lives’, ‘increase effective human control over resources’) - so this is an outer alignment failure.
The systems’ goals match these objectives (i.e. are ‘natural’ or ‘correct’ generalisations of them). Competitive pressures (e.g., strong economic incentives, an international ‘race dynamic’, etc.) are probably necessary to explain why these systems are being deployed across society, despite some people pointing out that this could have very bad long-term consequences.
There’s no discrete point where this scenario becomes irreversible. AI systems gradually become more sophisticated, and their goals gradually gain more influence over the future relative to human goals. In the end, humans may not go extinct, but we have lost most of our control to much more sophisticated machines (this isn’t really a big departure from what is already happening today - just imagine replacing today’s powerful corporations and states with machines pursuing similar objectives).
Another (outer) alignment failure story (AAFS)
This scenario, also described by Christiano, is initially similar to WFLL 1. AI systems slowly increase in generality and capability and become widely deployed. The systems are outer misaligned: they pursue natural generalisations of the poorly chosen objective functions they are trained on. This scenario is more specific about exactly what objectives the systems are pursuing: they are trying to ensure that the world looks good according to some kind of (augmented) human judgment (the systems are basically trained according to the regime described in An unaligned benchmark).
Problems arise along the way, when systems do things that look good but aren’t actually good (e.g. a factory colludes with the auditors valuing its output, giving a great quarterly report that didn’t actually correspond to any revenue). Such problems tend to be dealt with via short-term fixes - improving sensor coverage to check mistakes (e.g. in a way that reveals collusion between factories and auditors) or tweaking reward functions (e.g. to punish collusion between factories and auditors). This leads to a false sense of security initially. But as the pace of AI progress accelerates and we still don’t know how to train AI systems to actually help us, we eventually have extremely powerful systems, widely deployed across society, which are pursuing proxy goals that come apart from what we actually want. Specifically: ‘ensuring things look good according to human judgment’ eventually means fooling humans and carefully controlling what gets fed into the sensors, because the AIs can produce much more desirable-looking outcomes, much more easily, by controlling the sensors instead of actually making the world good. Eventually, all humans will either be killed or totally disempowered, because this is the best way of making sure the systems’ objectives are maximally positive and will remain that way forever.
To explain exactly how this scenario differs from WFLL 1, consider that outer (mis)alignment can be viewed as a continuum: ‘how’ misspecified is the objective function used to train the AI system? On one extreme, we have objective functions that do really well at producing or incentivising the behaviour we actually want from AI systems (e.g. a reward function trained using iterated amplification to reward systems to the extent that they try to help their operators). On the other extreme, we have objective functions that don't capture anything we value (e.g. ‘maximise paperclips’).
We find it helpful to think about the objective functions used in training as specifying a ‘sensory window’ through which the system being trained views the world (you could even think of it as a huge set of camera feeds). This window will probably be defined by a bunch of human feedback, along with other metrics (e.g. GDP, inflation, unemployment, approval ratings). The training process is selecting for AI systems that make this sensory window look ‘good’ according to feedback and desired values for those metrics.
Bringing these ideas together: the better defined this ‘sensory window’ (i.e. the more outer aligned the objective function is), the better things will look from the human perspective. In WFLL 1, the sensory window is very large, rich and well-defined, such that even as AI systems gain more and more influence relative to humans, the world continues to look pretty good to us. In AAFS, the sensory window is smaller and less well-defined, such that it’s eventually easy for systems to seize their sensors and kill or disempower any humans who try to stop them.
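To make the ‘sensory window’ intuition concrete, here is a toy Python sketch. Everything in it is an assumption made for illustration: the two hypothetical actions, their scores, and the capability scale are invented, and `tamper_threshold` simply stands in for how well-defined the window is (a higher threshold means the window is harder to fool). The sketch shows a system that always picks whichever feasible action makes the window look best, and finds the capability level at which controlling the sensors starts to win.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    observed_score: float  # how good the sensory window looks afterwards
    true_value: float      # how good the world actually is for humans
    min_capability: float  # capability needed to carry out the action


def available_actions(tamper_threshold):
    # Hypothetical numbers: tampering looks far better through the window
    # than honest work does, but is bad for humans in reality.
    return [
        Action("improve the world", observed_score=1.0, true_value=1.0,
               min_capability=0.0),
        Action("control the sensors", observed_score=10.0, true_value=-1.0,
               min_capability=tamper_threshold),
    ]


def chosen_action(capability, tamper_threshold):
    """Pick the feasible action with the best-looking sensory window."""
    feasible = [
        a for a in available_actions(tamper_threshold)
        if a.min_capability <= capability
    ]
    return max(feasible, key=lambda a: a.observed_score)


def point_of_divergence(tamper_threshold, capabilities):
    """First capability level at which the chosen action is bad for humans."""
    for capability in capabilities:
        if chosen_action(capability, tamper_threshold).true_value < 0:
            return capability
    return None


if __name__ == "__main__":
    levels = range(11)
    print("richer window:", point_of_divergence(8.0, levels))
    print("poorer window:", point_of_divergence(3.0, levels))
```

In this toy, the system with the poorer window switches to controlling its sensors at a lower capability level than the system with the richer window. That is the pattern the comparison between AAFS and WFLL 1 relies on, though the real dynamics are of course far less clean than a single threshold.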
This has a few practical implications for how AAFS plays out, compared to WFLL 1.
First, in WFLL 1, there is a decent chance (maybe 50:50) that AI systems will leave some humans alone (though still mostly disempowered). This is because the sensory window is so well-defined that it is too hard for AI systems to cause extinction without it showing up on their sensors and metrics. In AAFS, this is much less likely, because the sensory window is easier to fool.
Second, in AAFS, the point of no return will happen sooner than in WFLL 1. This is because it will require a lower level of capabilities for systems to take control without it showing up on the (more poorly defined) sensory window.
Third, in AAFS, warning shots (i.e. small- or medium-scale accidents caused by alignment failures, like the ‘factory colludes with auditors’ example above) are more likely and/or severe than in WFLL 1. This is because more possible accidents will not show up on the (more poorly defined) sensory window. A further implication here is that competitive pressures probably need to be somewhat higher - or AI progress somewhat faster - than in WFLL 1, to explain why we don’t take steps to fix the problem before it’s too late.
The next scenario demonstrates what happens when the objective function/sensory window is even closer to the bad extreme.
Production Web
Critch’s Production Web scenario is similar to WFLL 1 and AAFS, except that the objective functions used to train the systems are even less outer aligned. Specifically, the systems are trained to ‘maximise productive output’ or some similarly crude measure of success. This measure defines an even narrower sensory window onto the world than for the systems in WFLL 1 and AAFS - it isn’t even superficially aligned with what humans want (the AI systems are not trying to optimise for human approval at all).
‘Maximising productive output’ eventually means taking steps that aren’t desirable from the human perspective (e.g. using up resources critical to human survival but non-critical to machines, like drinking water and atmospheric oxygen).
The implications of this even more (outer) misaligned objective follow the same pattern we described when comparing AAFS with WFLL 1. In the ‘Production Web’ scenario:
- Human extinction is the only likely outcome (keeping humans alive becomes counterproductive to maximising productive output).
- The point of no return will happen even sooner (AI systems will start e.g. using up resources critical to human survival but non-critical to machines as soon as they are capable enough to ensure that humans cannot stop them, rather than having to wait until they are capable enough to manipulate their sensors and human overseers).
- Warning shots will be even more likely/severe (since the systems’ objectives are more misaligned, fewer possible accidents will be punished).
- Competitive pressures therefore need to be even higher.
Another point of comparison: you can also view this scenario as a slower version of the Flash Economy, meaning that there is more opportunity for incremental progress on AI alignment or improved regulation to stop the takeover.
Further variants of slow, outer-alignment failure scenarios
If systems don’t develop coherent large-scale goals over the physical world, then the failures might take the form of unorganized breakdowns or systems ‘wireheading’ themselves (i.e. trying to maximise the contents of their reward memory cell) without attempting to seize control of resources.
We can also consider varying the level of competitive pressure. The more competitive pressure there is, the harder it becomes to coordinate to prevent the deployment of dangerous technologies. Especially if there are warning shots (i.e. small- or medium-scale accidents caused by alignment failures), competitive pressures must be unusually intense for potentially dangerous TAI systems to be deployed en masse.
We could also vary the competence of the technical response in these scenarios. The more we attempt to ‘patch’ outer misalignment with short-term fixes (e.g., giving feedback to make the systems’ objectives closer to what we want, or to make their policies more aligned with their objectives), the more likely we are to prevent small-scale accidents. The effect of this mitigation depends on how ‘hackable’ the alignment problem is: perhaps this kind of incremental course correction will be sufficient for existentially safe outcomes. But if it isn’t, then all we would be doing is deferring the problem to a world with even more powerful systems (increasing the stakes of alignment failures), and where inner-misaligned systems have been given more time to arise during the training process (increasing the likelihood of alignment failures). So in worlds where the alignment problem is much less ‘hackable’, competent early responses tend to defer bad outcomes into the future, and less competent early responses tend to result in an escalating series of disasters (which we could hope leads to an international moratorium on AGI research).
What failure looks like, part 2 (WFLL 2)
Described by Christiano and elaborated further by Joe Carlsmith, this scenario sees many agentic AI systems gradually increase in intelligence, and be deployed increasingly widely across society to do important tasks, just like WFLL 1.
But then, instead of learning some natural generalisation of the (poorly chosen) training objective, there is an inner alignment failure: the systems learn some unrelated objective(s) that arise naturally in the training process, i.e. are easily discovered in neural networks (e.g. “don't get shut down”). The systems seek influence as an instrumental subgoal (since with more influence, a system is more likely to be able to e.g. prevent attempts to shut it down). Early in training, the best way to do that is by being obedient (since each system knows that disobedient behaviour would get it shut down). Then, once the systems become sufficiently capable, they attempt to acquire resources and power to more effectively achieve their goals.
Takeover becomes irreversible during a period of heightened vulnerability (a conflict between states, a natural disaster, a serious cyberattack, etc.) before systems have undergone an intelligence explosion. This could look like a “rapidly cascading series of automation failures: a few automated systems go off the rails in response to some local shock. As those systems go off the rails, the local shock is compounded into a larger disturbance; more and more automated systems move further from their training distribution and start failing.” After this catastrophe, “we are left with a bunch of powerful influence-seeking systems, which are sophisticated enough that we can probably not get rid of them”.
Compared to the slow outer-alignment failure scenarios, the point of no return in this scenario will be even sooner (all else being equal), because AIs don’t need to keep things looking good according to their somewhat human-desirable objectives (which takes more sophistication) - they just need to be able to make sure humans cannot take back control. The point of no return will probably be even sooner if the AIs all happen to learn similar objectives, or have good cooperative capabilities (because then they will be able to pool their resources and capabilities, and hence be able to take control from humans at a lower level of individual capability).
You could get a similar scenario where takeover becomes irreversible without any period of heightened vulnerability, if the AI systems are capable enough to take control without the world being chaotic.
Soft takeoff leads to decisive strategic advantage
This scenario, described by Kokotajlo, starts off much like ‘What failure looks like’. Many general agentic AI systems get deployed across the economy, and are misaligned to varying degrees. AI progress is much faster than today, but there is no sudden jump in AI capabilities. Each system has some incentive to play nice and obey governing systems. However, one particular AI is then able to buy more computing hardware and invest more time and resources into improving itself, enabling it to do more research and pull further ahead of its competition, until it can seize a decisive strategic advantage and defeat all opposition. This would look a lot like the ‘brain-in-a-box’ superintelligence scenario, except it would be occurring in a world that is already very different to today. The system that takes over could be outer or inner misaligned.
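The ‘pull further ahead’ dynamic can be illustrated with a toy compounding model. This is a sketch under strong assumptions that are ours, not Kokotajlo’s: every system starts with one unit of resources, each grows at a fixed rate per step, and the eventual leader’s rate is slightly higher because it reinvests more in self-improvement. The function name and all parameters are hypothetical.

```python
def leader_share(n_systems, base_rate, leader_rate, steps):
    """Share of total resources held by the leader after each step.

    Toy assumptions: all systems start with 1 unit; the leader compounds at
    leader_rate per step and every other system compounds at base_rate.
    """
    leader = 1.0
    others = [1.0] * (n_systems - 1)
    shares = []
    for _ in range(steps):
        leader *= 1.0 + leader_rate
        others = [r * (1.0 + base_rate) for r in others]
        shares.append(leader / (leader + sum(others)))
    return shares


if __name__ == "__main__":
    shares = leader_share(n_systems=10, base_rate=0.10, leader_rate=0.15, steps=100)
    for step in (1, 25, 50, 100):
        print(step, round(shares[step - 1], 3))
```

With equal rates the leader’s share stays flat, while even a modest persistent edge compounds into a majority share over enough steps, without any single step looking like a sudden jump. Whether real systems would face diminishing returns, or competitors would close the gap, is exactly what the toy leaves out.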
Thanks to Jess Whittlestone, Richard Ngo and Paul Christiano for helpful conversations and feedback. This post was partially inspired by similar work by Kaj Sotala. All errors are our own.
That is, the median respondent’s total probability on these three scenarios was 50%, conditional on an existential catastrophe due to AI having occurred. ↩︎
Some of the failure stories described here must assume the competitive pressures to deploy AI systems are unprecedentedly strong, as was noted by Carlsmith. We plan to discuss the plausibility of these assumptions in a subsequent post. ↩︎
We won't discuss this variable in this post, but it has important consequences for the level of cooperation/conflict between TAI systems. ↩︎
How these scenarios are affected by varying the level of cooperation/conflict between TAI systems is outside the scope of this post, but we plan to address it in a future post. ↩︎
We would welcome more scenario building about takeovers by agentic, narrow AI (which don’t seem to have been discussed very much). Takeovers by non-agentic AI, on the other hand, do not seem plausible: it’s hard to imagine non-agentic systems - which are, by definition, less capable than humans at making plans for the future - taking control of the future. Whether and how non-agentic systems could nonetheless cause an existential catastrophe is something we plan to address in a future post. ↩︎
You can think about the objective that an AI system actually has in terms of its behaviour or its internals. ↩︎
We think an important, underappreciated point about this kind of failure (made by Richard Ngo) is that the superintelligence probably doesn’t destroy the world because it misunderstands what humans want (e.g. by interpreting our instructions overly literally) - it probably understands what humans want very well, but doesn’t care, because it ended up having a goal that isn’t desirable by our standards (e.g. ‘maximise production’). ↩︎
This does assume that systems will be deployed before they are capable enough to anticipate that causing such ‘accidents’ will get them shut down. Given there will be incentives to deploy systems as soon as they are profitable, this assumption is plausible. ↩︎
So for this failure scenario, it isn’t crucial whether the training objective was outer aligned. ↩︎
Of course, not all arbitrarily chosen objectives, and not all training setups, will incentivise influence-seeking behaviour, but many will. ↩︎