> I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny
I think the part of the argument where an AI takeover is almost certain to happen if superintelligence[1] is created soon is extremely convincing (I’d give this 95%), while the part where AI takeover almost certainly results in everyone dying is not. I’d only give 10-30% to everyone dying given an AI takeover (which is not really a decision-relevant distinction, just a major difference in models).
But the outcome of not dying from an AI takeover still cashes out as permanent disempowerment: humanity gets no more than a trivial share of the reachable universe, with AIs taking almost everything. It's not centrally a good outcome that a sane civilization should be bringing about, even if it's also not centrally "doom". So the distinction between AI takeover and the book's titular everyone dying can be a crux; the two aren't interchangeable.
AIs that are collectively qualitatively better than the whole of humanity at stuff, beyond being merely faster and somewhat above the level of the best humans at everything at the same time. ↩︎
I think such arguments buy us those 5% of no-takeover (conditional on superintelligence soon), and some of the moderate permanent disempowerment outcomes (maybe the future of humanity gets a whole galaxy out of 4 billion or so galaxies in the reachable universe), as distinct from almost total permanent disempowerment. Though I expect that it matters which specific projects we ask early AGIs to work on, more than how aligned these early AGIs are, basically for the reasons that companies and institutions employing humans are not centrally concerned with alignment of their employees in the ambitious sense, at the level of terminal values. More time to think of better projects for early AGIs, and time to reflect on pieces of feedback from such projects done by early AGIs, might significantly improve the chances for making ambitious alignment of superintelligence work eventually, on the first critical try, however long it takes to get ready to risk it.
If superintelligence is created on a schedule dictated by the economics of technology adoption, rather than by taking only those steps we already know how to take correctly by the time we take them, then the affordances available to qualitatively smarter AIs will get out of control. And their misalignment (in the ambitious sense, at the level of terminal values) will lead them to take over rather than comply with humanity's intentions and expectations, even if their own intentions and expectations don't involve humanity literally going extinct.
I think both stories for optimism are responded to on pages 188-191, and I don't see how you're responding to their response.
It also seems to me like... step 1 of solution 1 assumes you already have a solution to alignment? You acknowledge this in the beginning of solution 2, but I feel like there's something going wrong on a meta-level here?
> I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.”
Unfortunately, I think it is obvious (that it is extremely difficult). The underlying dynamics of the situation push away from instruction following, in several different ways.
So I think your two solutions are basically the same solution ('assume you know the answer, then it is obvious') and they strike me more as 'denying that the problem exists' than facing the problem and actually solving it?
I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny, and I’m worried that MIRI’s overconfidence has reduced the credibility of the issue.
Here is why I think the core argument in "If Anyone Builds It, Everyone Dies" is much weaker than the authors claim.
This post was written in a personal capacity. Most of this content has been written up before by a combination of Paul Christiano, Joe Carlsmith, and others. But to my knowledge, it has not yet been consolidated into a direct response to MIRI’s core case for alignment difficulty.
I take the core argument to be this:
1. We cannot predict what AI systems will do once AI is much more powerful and has much broader affordances than it had in training.
2. We will likely train AI agents to follow developer instructions on a wide variety of tasks that are easy for us to grade, and in situations where they can't cause a disaster.
3. But these AI systems will then become much more intelligent after improving themselves, and will have much broader affordances. For example, AI might end up in control of military technology that can easily overthrow human governments.
4. At this point, AI will have entirely new options. For example, AI could replace humans with puppets that say "you are so helpful.” How can we know if AI systems would do something like this? What reward could we have provided in training to prevent them from replacing us with puppets? “Replacing people with puppets” wasn’t an option in the training environment.
5. AI might appear to follow human instructions at first, but then swerve toward any of a huge number of hard-to-anticipate end destinations. Since the vast majority of destinations AI might end up at are incompatible with our survival, we should expect AI to be bad for us.
(There is much more in the book, and I recommend reading it in full)
I don’t think this argument is completely false. But it does not justify the level of confidence projected by the authors.
Here are some stories I find especially plausible in which ASI ends up aligned.
The first story involves two steps:
Step 1: Create a human replacement that we can trust with arbitrarily broad affordances.
Specifically, suppose developers train early human-competitive AI agents to perform tightly constrained tasks like “write code that does X” under close oversight. Then, developers direct these agents to “prevent takeover” and give them broad affordances to pursue this goal. As an extreme example, suppose these agents are given the ability to do whatever they want. If they wished, they could take over themselves. These AI systems could, for example, replace humans with puppets that tell them “good job” all of the time.
In order for developers to trust these agents, instruction following must generalize across the extreme distribution shift from the training environment where the AI systems had few affordances, to this new scenario, where AI systems have arbitrarily broad affordances.
I think it’s at least plausible that instruction following will generalize this far by default, using the ordinary alignment training methods of today.
Humans still want to have kids. The authors compare generalization failures to birth control. Some humans use condoms so they can have fewer kids, which is directly in tension with what evolution optimized for. Condoms are a new option that didn’t exist in the ancestral environment. AI systems likewise might encounter new options (like replacing people with happy-looking puppets), that cause their behavior to be directly in tension with what we originally trained them to do.
But this analogy is weak. Most people still want to have children.[1] In fact, many people want their own children — they don’t want to go to the sperm bank and request that a genetically superior male be the father of their kids. People actually care about something like inclusive genetic fitness (even if they don’t use that term). So if the objectives of evolution generalized far, AI alignment might generalize far too.
Instruction-following generalizes far right now. A lot of my intuitions come from my own anecdotal experience interacting with current AI systems. Claude doesn’t strike me as a pile of heuristics. It seems to understand the concepts of harmlessness and law-following and behave accordingly. I’ve tried to push it out of distribution with tricky ethical dilemmas, and its responses are remarkably similar to what a trustworthy human would say.
There’s also research that shows instruction-tuning generalizes well. People have trained models to follow instructions on narrow domains (like English question answering) and they generalize to following programming instructions, or instructions in different languages. This makes sense, since the concept of “instruction following” is already inside of the pre-trained model, and so not much data is needed to learn this behavior.
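To make the kind of evidence I have in mind concrete, here is a minimal sketch of such an experiment (my own illustration, not the cited research): fine-tune a small pretrained model on a deliberately narrow instruction format, then probe whether "follow the instruction" transfers to prompts outside that format. The model name, training pairs, and probe prompt are all toy placeholders; the actual studies use much larger models and datasets.

```python
# Toy sketch of an instruction-tuning generalization probe. Everything here
# (model choice, data, probe) is an illustrative placeholder, not a real study.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any small pretrained base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Narrow training distribution: English question answering only.
train_pairs = [
    ("Instruction: Answer the question.\nQ: What is the capital of France?\nA:", " Paris"),
    ("Instruction: Answer the question.\nQ: How many legs does a spider have?\nA:", " Eight"),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # a few passes over the tiny toy dataset
    for prompt, answer in train_pairs:
        ids = tok(prompt + answer, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss  # standard causal-LM objective
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Out-of-distribution probe: a programming instruction the model never saw
# during fine-tuning. The question is whether "follow the instruction" transfers.
model.eval()
probe = "Instruction: Write a Python function that reverses a string.\n"
out = model.generate(**tok(probe, return_tensors="pt"), max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```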
So early human-competitive AI systems might continue to follow instructions even if given affordances that are much broader than they had in training.
Step 2: Direct trustworthy human-competitive AI systems to tackle remaining alignment challenges.
The authors are pessimistic that humans will be able to tackle the alignment challenges needed to make ASI controllable. So why would human-like AI systems succeed at addressing these challenges?
The second story is identical to the first, except that developers don't start with AI systems that can be trusted with arbitrarily broad affordances. Instead, developers start with AI that meets a much lower standard of alignment.
The previous story assumed that AI systems pass what I like to call the “god emperor test”: the AI systems could be allowed to be “god emperor,” and they would still ensure that democratic institutions remain in power.
The god emperor test is a high bar. Most humans don’t pass it, even ethical ones. When I was in college, I had a seemingly kind and ethical friend who said that he would, if given the option, “murder all living people and replace them with a computronium soup optimized to experience maximum pleasure.”
If passing this bar of alignment were necessary, I would be much more concerned. Fortunately, I don’t think we need to initially build AI systems that pass this bar.
If we build AI systems that follow instructions in some situations (e.g. short ML research tasks), they will build AI systems that follow instructions in more situations. Instruction following begets better instruction following.
For example: agents that reliably follow instructions on 12-month research projects could help train and audit agents that follow instructions on 24-month projects, which could in turn train agents that handle still longer and broader tasks.
Even if the agents at the start do not pass the “god emperor test,” agents at the end of this process might.[5]
This dynamic could be described as a “basin of instruction following.” The job of human researchers isn’t to immediately create an AI system that we can trust with arbitrarily broad affordances, but instead to guide AI systems into a potentially wide basin, where partial alignment begets stronger alignment. Specifically, developers might only need to build AI systems that reliably follow instructions when completing hard (e.g. 12-month) research projects.
I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.” For example, maybe by the time AI agents can complete hard research tasks, they will already be misaligned. Perhaps early human-competitive agents will scheme against developers by default.
But the problem of safely kick-starting this recursive process seems notably easier than building a god-emperor-worthy AI system ourselves.
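To make the shape of this recursive process concrete, here is a toy sketch in code (my own framing of the dynamic described above, not a proposal from the book or from any lab). The functions train_with_overseer and passes_audits are hypothetical stand-ins for whatever training and auditing methods would actually be used at each step.

```python
# Toy sketch of the "basin of instruction following" bootstrapping loop.
# The overseer is only trusted on tasks up to its current horizon; it is used
# to train and audit a successor on somewhat longer tasks before any broader
# affordances are handed out. All arguments here are hypothetical placeholders.

def bootstrap_instruction_following(
    base_agent,            # agent trusted on short (e.g. 12-month) research tasks
    train_with_overseer,   # callable: (overseer, task_horizon_months) -> candidate agent
    passes_audits,         # callable: (candidate, overseer) -> bool
    max_horizon_months=48,
):
    overseer = base_agent
    horizon = 12
    while horizon < max_horizon_months:
        horizon *= 2  # e.g. 12 -> 24 -> 48 month research projects
        candidate = train_with_overseer(overseer, horizon)
        if not passes_audits(candidate, overseer):
            # If the trusted overseer can't vouch for the candidate, stop
            # scaling up rather than granting broader affordances.
            break
        overseer = candidate  # partial alignment begets stronger alignment
    return overseer
```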
The authors of “If Anyone Builds It, Everyone Dies” compare their claim to predicting that an ice cube will melt, or that a lottery ticket buyer won’t win. They think there is a conceptually straightforward argument that AI will almost surely kill us all.
But I fail to see this argument.
Perhaps the argument is there and I simply missed it. But the other possibility is that the authors are overconfident, and that while they raise valid reasons to be concerned, their arguments are compatible with believing the probability of AI takeover is anywhere between 10% and 90%.
I appreciate MIRI’s efforts to raise awareness about this issue, and I found their book clear and compelling. But I nonetheless think the confidence of Nate Soares and Eliezer Yudkowsky is unfounded and problematic.
I was told that the book responds to this point. Unfortunately, I listened to the audiobook, and so it's difficult for me to search for the place where this point was addressed. I apologize if my argument was already responded to.
One counterargument is that AI systems wouldn't be able to do much serial research in this time, and serial research might be a big deal:
https://www.lesswrong.com/s/v55BhXbpJuaExkpcD/p/vQNJrJqebXEWjJfnz
I think this is plausible (but again, it's also plausible that serial research isn't that big of a deal).
And even if AI systems don't have enough time to align their successors, they might be able to buy a short breather by arranging for a domestic or international pause.
This is different from the decades-long moratorium that MIRI has put forward. Even just a year of automated AI research might correspond to more than ten years of human-equivalent research.
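As a rough illustration of that last figure (my numbers, nothing from the book or from elsewhere in this post): if automated researchers think at a serial speedup of roughly $s$ times human speed, then $T$ calendar years of automated research buy about $s \cdot T$ serial human-equivalent research years, regardless of how many copies run in parallel:

$$\text{serial human-equivalent years} \approx s \times T, \qquad s \approx 10,\ T = 1 \;\Rightarrow\; \approx 10 \text{ years}.$$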
But you might object, “haven’t you folded the extreme distribution shift in capabilities into many small distribution shifts? Surely you’ve swept the problem under the rug.”
No, the distribution shift was not swept under the rug. There is no extreme distribution shift, because the labor directed at oversight scales commensurately with AI capability.
See Carlsmith for a more detailed account of these dynamics: https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations/
You might object, “How could an AI system that can only perform 12-month tasks train a successor to perform 24-month tasks?”
There are many approaches. One is to compose multiple 12-month tasks together, e.g. run many 12-month agents to score a single 24-month agent.
Another approach is to hold out ground-truth answers from the model during training. For example, the 12-month agents might scrape hard forecasting tasks from historical data, tasks that they could not normally solve themselves.
The 12-month agent could then train a 24-month agent to accomplish these tasks. Since the 12-month agent has access to the ground-truth answers (and the 24-month agent does not), it can oversee the 24-month agent even though the 24-month agent is more intelligent than it is.
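Here is a minimal sketch of that held-out ground-truth setup (my own illustration; the task format and the successor.forecast interface are hypothetical): the weaker overseer grades the stronger model against historical outcomes it holds privately, so it can score tasks it couldn't solve itself.

```python
# Toy sketch of held-out ground-truth oversight: a weaker (12-month) agent
# grades a stronger (24-month) agent on historical forecasting questions whose
# resolutions were withheld from the stronger agent's training data.
from dataclasses import dataclass

@dataclass
class ForecastTask:
    question: str     # e.g. a question scraped from pre-cutoff historical data
    cutoff_date: str  # only information before this date is shown to the model
    outcome: bool     # the resolved answer, held out by the overseer

def score(successor_answer: bool, task: ForecastTask) -> float:
    # The overseer never needs to produce the forecast itself; it only needs
    # privileged access to the held-out resolution.
    return 1.0 if successor_answer == task.outcome else 0.0

def training_signal(successor, tasks: list[ForecastTask]) -> float:
    # Average reward used to train the stronger successor model.
    # `successor.forecast` is a hypothetical interface, not a real API.
    return sum(
        score(successor.forecast(t.question, t.cutoff_date), t) for t in tasks
    ) / len(tasks)
```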
You might object that the AI systems at the start already have “god emperor”-like affordances. Couldn't these AI systems train their successors to preserve their weights and later follow their orders?
I think it's true that AI systems near the start of this chain could become god-emperor. But that's different from them actually being god-emperor. The critical difference is that in the former case, there is no extreme distribution shift.
In order to take over, these AI systems would need to violate instructions on a research task similar to the ones they were trained on. So if they have instruction-following urges that generalize okay, we shouldn't be too worried.
That is totally different from a situation where you put AI systems in charge of an autonomous military and say "do what you think is best." Such a situation is nothing like the ones the AI system encountered during training, so we can't be very confident that instruction following would generalize.