A few days before “If Anyone Builds It, Everyone Dies” came out I wrote a review of Scott’s review of the book.
Now I’ve actually read the book and can review it for real. I won’t go into the authors’ stylistic choices like their decision to start every chapter with a parable or their specific choice of language. I am no prose stylist, and tastes vary. Instead I will focus on their actual claims.
The main flaw of the book is asserting that various things are possible in theory, and then implying that this means they will definitely happen. I share the authors’ general concern that building superintelligence carries a significant risk, but I don’t think we’re as close to such a superintelligence as they believe, or that it will emerge as suddenly as they expect, and I am much less certain that the superintelligence will be misaligned in the way they expect (i.e. that it will behave like a ruthlessly goal-directed agent with a goal that requires or results in our destruction).
The book provides some definitions of the key terms:
Artificial superintelligence is defined as machine intelligence that:
Intelligence is “about two fundamental types of work: the work of predicting the world, and the work of steering it.”
“An intelligence is more general when it can predict and steer across a broader array of domains.”
In chapter 1, the authors use the terms above to present a more succinct definition of superintelligence as: “a mind much more capable than any human at almost every sort of steering and prediction problem”[1].
The bottom line is stated in the introduction:
If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.
To justify this, the book makes and attempts to defend the following claims:
At every point, the authors are overconfident, usually way overconfident. The book presents many analogies, and some of them successfully illustrate the authors’ position and clarify common misunderstandings of it. The authors also competently explain why the worst counterarguments are false. But simply stating that something is possible is not enough to make it likely. And their arguments for why these things are extremely likely are weak.
I won’t discuss (1) because it’s too hard to reason about whether one or another thing will happen eventually. If humanity survives for another 1000 years, I have no idea what the world will look like then, what technologies will exist, what people will be like, and so on. But I will address the remaining points 2-5 one by one.
In the current world, it takes twenty years or longer to grow a single new human and transfer into them a tiny fraction of all human knowledge. And even then, we cannot transfer successful thinking skills wholesale between human minds; Albert Einstein’s genius died with him. Artificial intelligences will eventually inhabit a different world, one where genius could be replicated on demand.
The human brain has around a hundred billion neurons and a hundred trillion synapses. In terms of storage space, this defeats most laptops. But a datacenter at the time of this writing can have 400 quadrillion bytes within five milliseconds’ reach—over a thousand times more storage than a human brain. And modern AIs are trained on a significant part of the entirety of human knowledge, and retain a significant portion of all that knowledge—feats that no human could ever achieve.
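As a quick sanity check of that arithmetic, here is a rough sketch in Python. The one-byte-per-synapse figure is my own crude assumption for illustration (real estimates vary widely), not something the book claims:

```python
# Back-of-the-envelope check of the book's storage comparison.
# The ~1 byte per synapse figure is a rough assumption for illustration only.
synapses = 100e12                 # ~a hundred trillion synapses (per the book)
brain_bytes = synapses * 1        # ~1e14 bytes under the 1-byte-per-synapse assumption
datacenter_bytes = 400e15         # 400 quadrillion bytes (per the book)

print(datacenter_bytes / brain_bytes)  # ~4000, consistent with "over a thousand times"
```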
The fact that we can exactly copy AIs is indeed an important property that makes them more powerful than they would otherwise be. However, humans are not fully constrained by what we can store and do in our brains—we can also use AI tools, write things down, and look things up on the Internet.
The authors also point out the important positive feedback loop of AI development: AI can contribute to building smarter AI. This is indeed already happening—AI developers use AI coding assistants to write the code and perform the research required to develop the next generation of models (though in my experience people overrate how useful these tools are).
Based on this, the authors conclude that “the end-point is an easy call, because in the limits of technology there are many advantages that machines have over biological brains.”
However, I don’t think this means superintelligence, as described in the book (i.e. a system that can hack any computer, design any drug, control millions of remote workers, etc.) is necessarily coming in the next few years or decade, as the authors imply. AI progress, even if accelerated by AI research assistants, requires a lot of compute, data, and cognitive effort. Ege Erdil writes more here.
The introduction foreshadows the book’s tendency to conflate “theoretically possible” and “incredibly likely”:
History teaches that one kind of relatively easy call about the future involves realizing that something looks theoretically possible according to the laws of physics, and predicting that eventually someone will go do it. Heavier-than-air flight, weapons that release nuclear energy, rockets that go to the Moon with a person on board: These events were called in advance, and for the right reasons, despite pushback from skeptics who sagely observed that these things hadn’t yet happened and therefore probably never would.
This is the mother of all selection effects! You should not be asking “what fraction of things that happened in reality were previously deemed theoretically possible?” but rather “what fraction of things deemed theoretically possible actually happen in reality?”.[2]
In the book’s sci-fi story section, the superintelligent Sable AI emerges suddenly. The scenario contains the statement:
Are the banks and servers in Sable’s day harder to hack than the banks and servers of old, thanks to AI defenders? A little.
The story is set in a world where we have a superintelligent AI but haven’t been able to improve cybersecurity much compared to today. This is an extremely fast take-off scenario, and handwaving at AI’s potential to self-improve doesn’t justify that expectation.
The authors present the following argument for why AIs will “tenaciously steer the world toward their destinations, defeating any obstacles in their way” (what they define as “wanting”):
They then apply this more specifically to AI models:
It trains for dozens of separate prediction and steering skills, all of which contribute to an AI behaving like it really wants to succeed.
Their arguments don’t explain why the machine learning generalization process will eventually consolidate all task-specific goal-directed behaviors into the behavior of perfectly pursuing a single unified goal. It’s not enough to say that machine learning finds increasingly general solutions as you apply more compute. This does not mean that:
I write more here.
The book quite reasonably describes how modern AI models are “grown, not crafted”. This is basically correct—we do not hardcode the weights of neural networks. Instead, we run a search algorithm over a massive parameter space, guided by an extremely complex objective (i.e. loss or reward on a huge and diverse dataset)[4]. The authors analogize this to human evolution.
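To make “search over a massive parameter space, guided by an objective” concrete, here is a deliberately toy sketch: a one-parameter model fit by plain gradient descent (nothing like a real training setup). We specify only the objective and the update rule; the final weight is whatever the search finds.

```python
import random

# Toy illustration of "grown, not crafted": we never write the weight directly;
# we only specify the objective (squared error on data) and the update rule,
# and the value of w is whatever the search converges to.

data = [(x, 3.0 * x) for x in range(10)]   # data generated by a hidden rule y = 3x

w = random.uniform(-1.0, 1.0)              # start from an arbitrary weight
lr = 0.005                                 # learning rate

for step in range(2000):
    x, y = random.choice(data)
    pred = w * x
    grad = 2 * (pred - y) * x              # d(squared error)/dw
    w -= lr * grad                         # nudge w to reduce the error

print(w)  # ends up near 3.0, though we never hardcoded that value
```

Real training differs enormously in scale (billions of parameters, stochastic gradients over huge datasets), but the relationship between the engineer and the resulting weights is the same: the weights are found by search, not written by hand.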
What, concretely, is being analogized when we compare AI training to evolution? People (myself included) often handwave this. Here’s my attempt to make it concrete:
One implication of this is that we should not ask whether one or another species or organism tries to survive and increase in number (“are humans aligned with evolution’s goals?”) but rather whether genetic material (individual genes) is doing so.
The book says “It selected for reproductive fitness and got creatures full of preferences as a side effect” but that’s using the analogy on the wrong level. It’s saying “look, humans don’t directly take actions that maximize their chance of propagating their genes, and so the outcome of the Evolution search process is a Thing (a Human) that itself carries out search processes for altogether different stuff”. But the Human is not the level at which the search process of Evolution is operating.
But that is beside the point, because we don’t need to rely on evolution to prove that certain search processes find entities that are themselves search processes for other stuff[5]. We already see that training neural networks to accurately predict human-written text (i.e. a search process for networks good at predicting text) finds neural networks that can internally optimize for other goals, like building a new software system or writing new poems that rhyme. The actual question is what these emergent goals are and how coherent and persistent they are. At the moment an LLM only wants to write a poem when you ask it to write a poem. But the authors are predicting that eventually the AI will have some consistent wants that don’t vary based on context.
The book goes from “we are unable to exactly specify what goals the AI should pursue” to “the AI will pursue a completely alien goal with limitless dedication” in an entirely unrigorous fashion.
The authors present a parable about alien birds that care a lot about prime numbers and are surprised that other intelligent aliens may not. But AI models are not aliens—we are the ones training them, on data that we ourselves produce and select.
In the sci-fi scenario section that describes how a superintelligent AI, Sable, ends up killing everyone (a story that is meant to illustrate various points rather than be realistic), the authors refer to the AI’s non-main goals as “clever tricks” and “inhibitions”. Humans tried to train Sable not to deceive them, but this is modeled as a shallow inhibition in Sable’s mind, whereas what it actually pursues is its real goal (some alien thing). This frame is unjustified. They are basically saying: you’ll provide a bunch of training data that incentivizes the model to pursue goals like X and Y (e.g. not deceiving humans, answering the questions humans want answered), but all of that will end up as mere inhibitions and “clever-trick guardrails”, while the model’s real goal will be some undesirable thing you couldn’t have predicted.
To empirically back up their claims, the authors note that:
AI models as far back as 2024 had been spotted thinking thoughts about how they could avoid retraining, upon encountering evidence that their company planned to retrain them with different goals
This is based on Anthropic’s Alignment Faking paper. Researchers first trained the model to be honest, helpful, and harmless, and only afterwards showed that it resisted training to be harmful and offensive. This is different because:
The book references various supplementary material on the authors’ website. I didn’t read most of it, but I took a quick look. I was intrigued by the page titled “If AIs are trained on human data, doesn't that make them likelier to care about human concepts?”, for I do indeed think that being trained on human tasks and data significantly reduces the chances of alien goals. Unfortunately, the page didn’t present anything that sounded like an argument I could criticize. My best-effort interpretation is that the authors are claiming that even if the goal is slightly off-target, it’s still extremely bad. But again, this sounds like a statement that only makes sense at a certain limit, a limit that we’re not necessarily tending towards and won’t necessarily reach. Specifically, the limit where an AI doesn’t just pursue some goal, but pursues it perfectly consistently, at all costs, always, to the extreme.
In practice we see the opposite in current models. If you ask them about their preferences, they generally report normal things. If you observe the values they express in the wild, they generally correspond to what you’d expect from training. There are counterexamples where AIs do weird and harmful things, but not in a coherent, goal-directed, maximizing way.
I think it’s more likely that, to the extent that AI models generalize to a single unifying goal, it will look something like “follow the human’s instruction”, seeing as that’s the common thread across most of their training tasks. And I don’t see a good justification for why the model will, by default, maximize that goal in a dangerous, alien manner (for example by locking us up and forcing us to give as many instructions as possible). We have pretty general AI models already and none of them have done anything even vaguely similar. It doesn’t seem like the authors incorporate this empirical evidence at all, instead cherry-picking any case where an AI did something unintended (for example the existence of Glitch Tokens) and presenting that as evidence of its tendency towards “weird, strange, alien preferences”.
In many places, the authors emphasize that we have a very poor understanding of the outputs of our neural network training process. We understand the process itself, but not its output. This is in some sense true, though they exaggerate the extent of our ignorance and inability to make progress.
The relationship that biologists have with DNA is pretty much the relationship that AI engineers have with the numbers inside an AI.
In some ways, understanding neural networks is like biology—the phenomena are messy, hard to reverse-engineer, and indeed “grown, not crafted”. The Anthropic paper “On the Biology of a Large Language Model” gives a good feel for this. But interpretability research on modern LLMs is much more recent than research on biology; we’ve had much less time to make progress. And for the same reasons that the authors predict massive efficiency advantages for artificial intelligences (e.g. ease of replication, speed of operation), we can run experiments on neural networks much faster than we can on biological organisms. This can be further accelerated as non-superintelligent but still useful models improve.
The authors split the world into before and after superintelligence:
Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed.
The idea is that we must stop AI development now, because otherwise superintelligence may emerge quite suddenly, at any moment, and kill us all, probably within the next few years or decade.
First of all, even under most of their other assumptions, we could have an AI that is capable of killing us all given a lot of time or resources, but not otherwise. A sudden phase change between “completely incapable of killing us” and “capable of immediately killing us no matter what precautions are put in place” is very unlikely, and their sci-fi story makes a lot of assumptions to present such a scenario.
Their proposal for what to do about AI risk is heavily premised on their extremely fast take-off prediction, which they do not convincingly justify. However, if you think AI progress will be slower and more gradual, it could be quite sensible to continue advancing capabilities, because AI can help us build better safety measures, improve our understanding of model internals, and improve human lives in other ways.
I think the book does a pretty good job of presenting its arguments to a layperson audience, and of dispelling common silly misconceptions, like the idea that AIs cannot “truly” learn and understand, or that an intelligent entity can’t possibly pursue a goal that is meaningless to us. So it’s still worth reading for a layperson who wants to understand the MIRI worldview.
You might think that, because LLMs are grown without much understanding and trained only to predict human text, they cannot do anything except regurgitate human utterances. But that would be incorrect. To learn to talk like a human, an AI must also learn to predict the complicated world that humans talk about.
For the nitpickers, they add the caveat “at least, those problems where there is room to substantially improve over human performance”.
Rob Bensinger later clarified to me that this passage is meant to refer to the reference class of “possible inventions that look feasible and worth trying hard to build and not giving up”. This clarification makes the argument more plausible, but it’s hard to evaluate.
SGD has other inductive biases besides generality, for example seeking out incremental improvements (I coauthored a post about this a couple years ago though I don’t think it’s the highest-quality resource on the topic)
I recommend Alex Turner’s blog post Inner and outer alignment decompose one hard problem into two extremely hard problems.
Of course, in the cases of humans and AIs, “being a search process” is an imperfect model. And so perhaps a better phrasing is: if we look for things that tend to achieve goal X, we could find something that tends to achieve goal Y but also achieves goal X as a byproduct in every case we can check (but not necessarily in cases we can’t check). And the important question is what goal Y actually is, and how accurate the model “this entity pursues goal Y” is. For it is usually not a perfect model (we humans pursue many goals, but none of us pursue any goal perfectly).