One thing I find helpful is not to overcomplicate the argument. There are two basic parts to it:
If we build something that's smarter than us in the way that we're smarter than chimps, we're likely to wind up like chimps: forced to the margins, near extinction, and kept alive only by the occasional goodwill of creatures far beyond us.
I'm not asking you to believe something complicated. I'm asking you to believe the most basic truth of politics: If you have no power, no influence, no way to push back, and no way to fight? You don't get to make the decisions.
And even if some of the AIs like us? The AIs will be competing with each other. Even if some of the AIs would miss us, they may be a little preoccupied with trying to survive other AIs who want the same resources. Mostly, humans don't spend much effort protecting chimps. We have our own stuff going on, sadly for the chimps. We don't build very many chimp utopias, either.
We might soon build something a lot smarter than us.
Logically speaking, "soon" is not required for your argument. With regard to persuasiveness, saying "soon" is double-edged: those who think "soon" applies will feel more urgency, but the rest are given an excuse to tune out. A thought experiment by Stuart Russell goes like this: if we know with high confidence a meteor will smack Earth in 50 years, what should we do? This is an easy call. Prepare, starting now.
The right time to worry about a potentially serious problem for humanity depends not just on when the problem will occur but also on how long it will take to prepare and implement a solution. For example, if we were to detect a large asteroid on course to collide with Earth in 2069, would we wait until 2068 to start working on a solution? Far from it! There would be a worldwide emergency project to develop the means to counter the threat, because we can’t say in advance how much time is needed. - Book Review: Human Compatible at Slate Star Codex
In Chapter 6 of his 2019 book, Human Compatible: Artificial Intelligence and the Problem of Control, Russell lists various objections to the hard thinking that one would hope would have been well underway as of 2019:
"The implications of introducing a second intelligent species onto Earth are far-reaching enough to deserve hard thinking.” So ended The Economist magazine’s review of Nick Bostrom’s Superintelligence. Most would interpret this as a classic example of British understatement. Surely, you might think, the great minds of today already doing this hard thinking—engaging in serious debate, weighing up the risks and benefits, seeking solutions, ferreting out loopholes in solutions, and so on. Not yet, as far as I am aware.
The 50-year-away meteor is discussed in Human Compatible on page 151 as well as in
A thought experiment by Stuart Russell goes like this: if we know with high confidence a meteor will smack Earth in 50 years, what should we do? This is an easy call. Prepare, starting now.
I agree with you that this would be the rational behavior, yes. I just fear that it unfortunately exceeds typical human planning horizons.
My model here is being a programmer during Y2K. I didn't work on the remediation myself, but I followed it avidly. My takeaway:
Another formative experience was starting to pay attention to COVID in late February/very early March 2020, before truly widespread press coverage of the secondary infection cluster in Italy. It was very hard to convince even ordinary people to stock up on relevant consumables even a few weeks before the public realized something was happening.
Hence I had a nasty suspicion that, in the presence of a strong status quo bias, the actual human planning horizon for "hypothetical" events is somewhere between 2 weeks and 12 months.
Given a 50-year warning of a meteor? I would expect depressingly little actual action.
I see this as one of the significant strategy issues facing the folks who believe that inventing a potential competitor species might be... risky? The problem is that this involves such a major leap outside of status quo bias that it's going to take a long time for many people to truly feel it as a real thing that might happen.
I actually suspect that people whose main exposure to AI is the Terminator movies, and whose jobs are highly vulnerable to even simple automation, will be among the earlier people to shake off status quo bias. Whereas some highly educated people (a few economists, for example) may not smell the SkyNet until the Terminators roll off the production lines. This is similar to how the Less Wrong community was well ahead of the CDC in February/March 2020 on a bunch of key scientific points (like airborne spread, based on early South Korean data) that the CDC only accepted later.
Question: wouldn't how we dealt with acid rain and the ozone layer be counterexamples? In those cases we didn't have a clear deadline, but we did manage to muster the resources and effort to overcome the issues. I would think the issue is not so much status quo bias as actual, generalized understanding of the magnitude of the risk + degree of certainty + actionability. AI risk seems to have big problems with each of those three.
Those are good counterexamples! In the case of acid rain, the problem was pretty concrete. The effects were already happening, and you could show photos of dead lakes.
The ozone hole was invisible, but it involved two magic words: "radiation" and "cancer." (These words really are magic. Ask someone to listen to 20 words spoken aloud and try to repeat them back. If two of those words are "radiation" and "cancer", nearly everyone will remember those two.) And the industry affected was comparatively small.
And if we had another major pandemic, I bet our response would be very different, for both good and bad.
But there are likely lessons here that could be adapted to AI risk communication. Perhaps if we start seeing AI-driven job loss, many people will wake up?
I’ve sporadically worked on ML including in industry and it has done much less than you’d expect to inform my views on the risk of extinction from ASI. (I can’t rule out that I simply didn’t go deep enough to get the benefits e.g. I’ve never done ML engineering as a full time job.)
Top ML engineers do gain important intuitions that the authors might be missing, but those intuitions are IMO mainly about how to train better models (e.g. taking better advantage of GPU parallelization) and not about how to control a superintelligence. In many cases engineers just haven’t thought about the risks or have highly naive takes (though this seems to be less true at the top). I think part of the reason is that the process is now so “industrialized” that there’s no longer much theory behind pushing the frontier - it’s about squeezing as much performance as possible out of known techniques. The picture I’m trying to gesture at is something like growing a superbrain in a vat: the engineers would be mainly vat-engineers, specializing in building large vats and choosing the right chemicals, and at some point maybe just one of those two. The situation is not that extreme, but it’s getting closer. (And this impression is based on conversations with career ML engineers, whom I interact with very frequently.)
I’d certainly trust the views of someone who has thought deeply about the issue AND worked on the systems the most (an example is Rohin Shah). Roughly speaking, the former teaches you security mindset (not to try naive ideas that will obviously break) and the latter teaches you what can(not) be done in practice, which shoots down some “theoretically” appealing plans.
However, it’s hard to blame Yudkowsky and Soares for choosing not to work at the frontier labs which they believe are currently pushing us closer to extinction, and the competence which they would IDEALLY have is kind of hard to get in another way (which means it’s hard to find better qualified representatives of this particular worldview - this should in the Bayesian sense somewhat “screen off” their missing credentials when assessing their credibility).
That sounds very reasonable. In the review, I wasn’t consciously trying to play a blame game with Yudkowsky and Soares (I generally think blame is ineffective at producing good outcomes anyway) but rather to articulate a reader’s uncertainty about what their reference class for relevant expertise actually is.
My own naïve take would be something like what you say: people with substantial hands-on technical experience in contemporary AI systems, combined with people who have thought deeply about the theoretical aspects of alignment. My impression is that even within this relatively restrictive class there remains a wide diversity of views, and that these do not, on average, converge on the positions defended in the book.
I only have a superficial understanding of Yudkowsky’s work over the years, but I am aware that he led MIRI for roughly two decades, and that it was a relatively well-funded, full-time research organization explicitly created to work on what was seen as “the real alignment problem” outside of frontier labs. From an outsider’s perspective, however, it is not obvious that MIRI functioned as a place where deep, hands-on technical understanding of AI systems was systematically acquired, even at a smaller or safer scale.
If avoiding frontier labs is justified on the grounds that they accelerate catastrophic risk, then MIRI would seem to have been the natural alternative pathway for developing compensating expertise, yet it is not clear (at least to a non-insider) what concrete forms of technical or empirical understanding were accumulated there over time, or how this translated into transferable expertise about real AI systems as they actually evolved. In fact, it is difficult not to come away with the (possibly mistaken) impression that much of the work remained at the level of highly abstract theorizing rather than engagement with concrete systems.
That gap makes it harder for me to see the absence of conventional credentials as epistemically “screened off” rather than simply displaced.
Expert consensus does not exist on this issue (and a bit of gears-level understanding of the tech, the alignment problem, and group rationality is sufficient to predict that expert consensus would not be reliable even if it did exist).
You’re going to have to form an inside view.
From an outsider’s perspective, however, it is not obvious that MIRI functioned as a place where deep, hands-on technical understanding of AI systems was systematically acquired, even at a smaller or safer scale.
Just for reference, my "credentials": I have worked in "machine-learning-adjacent" spaces on and off since the nineties. Some of my earliest professional mentors were veterans of major 80s AI companies before the first AI winter. My knowledge is limited to a catalog of specific tricks that often solve problems, plus a broader idea of "what kind of things often work" and "what kind of things are like the proverbial 'land war in Asia' leading to the collapse of empires."
My impression of MIRI in the 2010s is that they were deeply invested in making one of the classic mistakes. I can't quite name this classic mistake precisely, but I can point in the general direction and give concrete examples:
That pattern, that's the thing I'm pointing at.
Cyc was the last industrial holdout for this classic mistake I can't quite name. Academia actually mostly stopped making this mistake much earlier, starting in the 90s, and they really didn't take the Cyc project very seriously at all after that.
MIRI, however, published a lot of papers that seemed to focus on the idea that alignment was essentially some kind of mathematical problem with a mathematical solution? At least that was the impression I always got when I read the abstracts that floated through my feeds. To my nose, their papers had that "Cyc smell".
One of the good things they did do with these papers (IIRC) was to prove that a bunch of things would never work, for reasons of basic math.
MIRI has since realized, to their great credit, that actual, working AIs look a lot more like some mix of ChatGPT and AlphaGo than they do like Cyc or the larger family of things I'm trying to describe. But my read is that a lot of their actual gut knowledge about real-world AI starts with the earliest GPT versions (before ChatGPT).
My personal take on the details, for what it is worth, is that I think they're overly pessimistic about some arguments (e.g., they think we're playing Russian roulette in certain areas with 6 bullets loaded, and I'd personally guess 4 or 5), but I think that they're still far too optimistic about "alignment" in general.
Yudkowsky has been highly critical of Cyc, semantic networks, and the general GOFAI approach for many years, and his approach to building AGI is meaningfully different. It might be that the Bayesian approach to building AGI (or even the hope for an elegant mathematical theory in general) is a mistake, but it is not the same mistake.
The Bayesian approach is basically the simplest possible thing that doesn't inevitably make the mistake I'm trying to describe. Something like Naive Bayes is still mostly legible if you stare at it for a while, and it was good enough to revolutionize spam filtering. This is because while Naive Bayes generates a big matrix, it depends on extremely concrete pieces of binary evidence. So you can factor the matrix into a bunch of clean matrices, each corresponding to the presence of a specific token. And the training computations for those small matrices are easily explained. Of course, you're horribly abusing basic probability, but it works in practice.
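To make the "clean, factorable evidence" point concrete, here is a minimal sketch of a Naive Bayes spam scorer over a toy corpus (the corpus, tokenization, and smoothing constant are all illustrative assumptions, not anyone's production filter):

```python
import math
from collections import Counter

# Toy Naive Bayes spam scorer. Each token contributes one independent,
# human-readable log-odds term, which is why the model stays mostly legible.
spam_docs = ["buy cheap pills now", "cheap pills cheap prices"]
ham_docs = ["meeting notes attached", "lunch at noon tomorrow"]

def token_counts(docs):
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts

spam_counts, ham_counts = token_counts(spam_docs), token_counts(ham_docs)
spam_total, ham_total = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def spam_log_odds(message, alpha=1.0):
    """Laplace-smoothed log-odds that `message` is spam."""
    log_odds = math.log(len(spam_docs) / len(ham_docs))  # prior odds
    for tok in message.lower().split():
        p_spam = (spam_counts[tok] + alpha) / (spam_total + alpha * len(vocab))
        p_ham = (ham_counts[tok] + alpha) / (ham_total + alpha * len(vocab))
        log_odds += math.log(p_spam / p_ham)  # per-token evidence, inspectable on its own
    return log_odds

print(spam_log_odds("cheap pills"))      # positive: leans spam
print(spam_log_odds("meeting at noon"))  # negative: leans ham
```

The "horrible abuse" is the independence assumption baked into that loop: each token's odds are multiplied in as if tokens never co-occur, which is false, but the per-token terms stay individually inspectable and the filter works anyway.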
This does not work for many other problems.
The problem is scaling up the domain complexity. Once you move from a spam filter to speech transcription or object recognition, the matrices get bigger, and the training process gets rapidly more opaque.
But yes, thank you for the correction—I still find a lot of MIRI's work in the 2010s a bit "off" in terms of vibes, but I will happily accept the judgement of people who read the papers in detail. And I would not wish to falsely claim that someone approved of the Cyc project when they didn't.
I asked a bunch of LLMs with websearch to try and name the classic mistake you're alluding to:
To be honest these just aren't very good; they usually do better than this at naming half-legible vibes.
Yeah, the "Bitter Lesson" refers to a special case of this classic mistake, as do the other essays I linked. Some of those essays were quite well known in their day, at least to various groups of practitioners.
You could do it up in the classic checklist meme format:
Your brilliant AI plan will fail because:
- [ ] You assume that you can somehow make the inner workings of intelligence mostly legible.
The people who learn this unpleasant lesson the fastest are AI researchers who process inputs that are obviously arrays of numbers. For example, sound and images are giant arrays of numbers, so speech recognition researchers have known what's up for decades. But researchers who worked with either natural language or (worse) simplified toy planning systems often thought that they could handwave away the arrays of numbers and find a nice, clear, logical "core" that captured the essence of intelligence.
I want to be clear: Lots of terrifyingly smart people made this mistake, including some of the smartest scientists who ever lived. Many of them made this mistake for a decade or more before wising up or giving up.
But if you slap a camera and a Raspberry Pi onto a Roomba chassis, and wire up a simple gripper arm, then you can speed-run the same brutal lessons in a year, max. You'll learn that the world is an array of numbers, and that the best "understanding" you can obtain about the world in front of your robot is a probability distribution over "apple", "Coke can", "bunch of cherries" or "some unknown reddish object", each with a number attached. The transformation that sits between the array and the probability distribution always includes at least one big matrix that's doing illegible things.
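As a minimal sketch of "array in, probability distribution out" (everything here is a stand-in: the frame is random noise and the weight matrix is random rather than learned, purely to show the shape of the computation):

```python
import numpy as np

rng = np.random.default_rng(0)

# The camera hands the robot nothing but an array of numbers.
frame = rng.random((64, 64, 3))  # stand-in for a 64x64 RGB camera frame

labels = ["apple", "Coke can", "bunch of cherries", "unknown reddish object"]

# The "understanding" lives in one big matrix between pixels and labels.
# In a real system it would be learned; here it is random, which is enough
# to show where the illegibility sits.
W = rng.standard_normal((frame.size, len(labels)))

logits = frame.reshape(-1) @ W        # flatten the pixels, multiply by the big matrix
probs = np.exp(logits - logits.max())
probs /= probs.sum()                  # softmax: turn scores into a probability distribution

for label, p in zip(labels, probs):
    print(f"{label}: {p:.2f}")
```

Even in this toy version, the interesting question ("why did it say apple?") lives entirely inside W, which is just 49,152 unlabeled numbers.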
Neural networks are just bunches of matrices with even more illegible (non-linear) complications. Biological neurons take the matrix structure and bury it under more than a billion years of biochemistry and incredible complications we're only starting to discover.
Like I said, this is a natural mistake, and smarter people than most of us here have made this mistake, sometimes for a decade or more.
I want to be clear: Lots of terrifyingly smart people made this mistake, including some of the smartest scientists who ever lived. Many of them made this mistake for a decade or more before wising up or giving up.
Imagine this. Imagine a future world where gradient-driven optimization never achieves aligned AI. But there is success of a different kind. At great cost, ASI arrives. Humanity ends. In his few remaining days, a scholar with the pen name of Rete reflects back on the 80s approach (i.e. using deterministic rules and explicit knowledge) with the words: "The technology wasn't there yet; it didn't work commercially. But they were onto something -- at the very least, their approach was probably compatible with provably safe intelligence. Under other circumstances, perhaps it would have played a more influential role in promoting human thriving."
Untestability: you cannot safely experiment on near-ASI (I mean, you can, but you’re not guaranteed not to cross the threshold into the danger zone, and the authors believe that anything you can learn from before won’t be too useful).
I think "won't be too useful" is kinda misleading. Point is more like "it's at least as difficult as launching a rocket into space without good theory about how gravity works and what the space is". Early tests and experiments are useful! They can help you with the theory! You just want to be completely sure that you are not in your test rocket yourself.
At times the authors appeal to prominent figures as evidence that the danger is widely acknowledged. At other times, the book paints the entire ML and AI safety ecosystem as naive, reckless, or intellectually unserious.
I see no contradiction between these two statements:
People totally can know about the risk without also knowing what to do about it.
One gets the feeling that the authors are just throwing up their hands and saying something like: “Look, we are doomed, and there’s no realistic way we’re getting out of this short of doing stuff we are not going to do. These proposals are the necessary consequence of accepting what is stated in the preceding chapters, so that’s that”.
First, personally, I can understand how some people could have this response, and I empathize -- this topic is heavy. However, with some time and emotional distance, one can see the above interpretation doesn't correspond with key passages from the book, such as this one from Chapter 13:
Anytime someone tells you that the Earth could not possibly manage to do anything as difficult as restricting AI research, they are really claiming to know that countries will never care. They are asserting that countries and their leaders could not possibly come to care even 1 percent as much as they cared to fight World War II.
Also, the last chapter (14, "Where There's Life, There's Hope") centers around people facing the truth of their circumstances and rising to the occasion. Here is a key quote:
Humanity averted nuclear war because people who understood that the world was on track for destruction worked hard to change tracks.
Chapter 14 concludes by offering tailored suggestions to civil servants, elected officials, political leaders, journalists, and "the rest of us". The authors are decidedly not running up the white flag; instead, they are stating what needs to be done and advocating for it.
I mean... if there's one thing I learned by studying literature and literary criticism at uni (spoiler: I didn't learn much, and most of what I did was not very valuable), it is that texts are very seldom completely self-consistent. Still, I think what you say is fair: the authors haven't just given up (if they had, they wouldn't have written the book in the first place), but it feels to me that the solutions they propose are wildly impractical, and perceived as such by the authors, and that this perception likely plays a big role in their P(doom). If the bar for “meaningful risk reduction” is set so high that only globally coordinated, near-wartime restrictions count, then the conclusion of extreme doom follows almost automatically from the premises. I’m not convinced the argument sufficiently explores whether there are intermediate, messy, politically imperfect interventions that could still substantially lower risk without meeting that idealized threshold.
I'll share a stylized story inspired from my racing days of many years ago. Imagine you are a competitive amateur racing road cyclist. After years of consistent training racing in the rain, wind, heat, and snow, you are ready for your biggest race of the year, a 60-something mile hilly road race. Having completed your warm-up, caffeination, and urination rituals, you line up with your team and look around. Having agreed you are the leader for the day (you are best suited to win given the course and relative fitness), you chat about the course, wind, rivals, feed zones, and so on.
Seventy or so brightly-colored participants surround you, all optimized to convert calories into rotational energy. You feel the camaraderie among this traveling band of weekend warriors. Lots of determination and shaved legs. You notice but don't worry about the calves carved out of wood since they are relatively weak predictors of victory. Speaking of...
You hazard that one of twenty-some contenders is likely to win. It will involve lots of fitness and some blend of grit, awareness, timing, team support, adaptability, and luck. You estimate you are in that top twenty based on previous events. So from the outside viewpoint, you estimate your chances of winning are low, roughly 5%.
What's that? ... You hear three out-of-state semi-pro mountain bikers are in the house. Teammates, too. They have decided to "use this race for training". Lovely. You update your P(win) to ~3%. Does this 2-percentage-point drop bother you? It is actually a 40% relative decrease. For a moment, maybe, but not for long. What about the absolute probability? Does a 3% chance demotivate you? Hell no. A low chance of winning will not lower your level of effort.
You remember that time trial from the year before where your heart rate was pegged at your threshold for something like forty minutes. The heart-rate monitor reading was higher than you expected, but your body indicated it was doable. At the same time, your exertion level was right on the edge of unsustainable. Saying "it's all mental" is cliché, but in that case, it was close enough to the truth. So you engaged in some helpful self-talk (a.k.a. repeating the mantra "I'm not going to crack") for the last twenty minutes. There was no voodoo nor divine intervention; it was just one way to focus the mind to steer the body in a narrow performance band.
You can do that again, you think. You assume a mental state of "I'm going to win this" as a conviction, a way of enhancing your performance without changing your epistemic understanding.
How are you going to win? You don't know exactly. You can say this: you will harness your energy and abilities. You review your plan. Remain calm, pay attention, conserve energy until the key moments, trust your team to help, play to your strengths, and when the time is right, take a calculated risk. You have some possible scenarios in mind: get in a small breakaway, cooperate, get ready for cat-and-mouse at the end, maybe open your sprint from 800 meters or farther. (You know from past experience your chances of winning a pack sprint are very low.)
Are we starting soon? Some people are twitchy. Lots of cycling-computer beeping and heart-rate-monitor fiddling. Ready to burst some capillaries? Ready to drop the hammer? Turn the pedals in anger? Yes! ... and no. You wait some more. This is taking a while. (Is that person really peeing down their leg? Caffeine intake is not an exact science, apparently. Better now than when the pistons are in motion.)
After a seemingly endless sequence of moments, a whistle blows. The start! The clicking of shoes engaging with pedals. Leading to ... not much ... the race starts with a neutral roll out. Slower than anyone wants. But the suspense builds ... until maybe a few minutes later ... a cacophony of shifting of gears ... and the first surge begins. This hurts. Getting dropped at the beginning is game-over. Even trading position to save energy is unwise right now -- you have to be able to see the front of the pack until things calm down a bit. You give whatever level of energy is needed, right now.
Hello there! This is my first post on Less Wrong, so I will be asking for your indulgence for any overall silliness or breaking of norms that I may inadvertently have fallen into. All feedback will be warmly received and (ideally) internalized.
A couple of months ago, dvd published a semi-outsider review of IABIED which I found rather interesting and which gave me the idea of sharing my own. I also took notes on every chapter, which I keep on my blog.
My priors
I am a 40-ish-year-old Spaniard from the rural, northwest corner of the country, so I've never had any sort of face-to-face contact with the Rationalist community (with the partial exception of attending some online CFAR training sessions of late). There are many reasons why I feel drawn to the community, but in essence, they distill down to the following two:
On the other hand, there are lots of things I find unpalatable. Top of the list would likely be polyamory. In second place, what from the outside looks like a highly speculative, nerd-sniping obsession with AI apocalyptic scenarios.
But these are people whom I consider overall both very intelligent and very honest, which means I feel I really need to give a fair trial to their arguments (at least with respect to superintelligence), but this is easier said than done. It is an understatement on the scale of the supermassive black hole at the center of our galaxy to say that Eliezer Yudkowsky is a prolific writer. His reflections on AI are mostly dispersed among the ~1.2 to 1.4 million words of his Sequences. There are lots of posts, summaries, debates and reflections by many other people, mostly on LessWrong, often technical and assuming familiarity with Yudkowsky’s concepts.
There are some popular books that offer a light introduction to these topics, and which I’ve gone through[1], but I was missing a simple and clear argument, pitched at a quasi-normie like me, for the Yudkowskian case on both the possibility and the dangers of superintelligent AI. I think I mostly got it from this book, so let’s get to the review.
Thinking about The End of the World™
The title and (UK) subtitle of If Anyone Builds It, Everyone Dies: The Case Against Superintelligent AI (from now on, IABIED for short) are a partial summary of the book’s core thesis. Spelled out in only slightly more detail, and in the authors’ words:
Let’s start with the basics. First, what is a superintelligent AI (from now on, ASI for short)? It would be any machine intelligence that "exceeds every human at almost every mental task". A more formal version appears in Chapter 1, where superintelligence is defined as “a mind much more capable than any human at almost every sort of steering and prediction problem[2]”, that is, at the broad family of abilities involved in understanding the world, planning, strategizing, and making accurate models of reality. The authors also emphasize that this does not mean humanlike cognition or consciousness; what matters is overwhelming cognitive advantage in any domain where improvement over humans is possible, combined with mechanical advantages such as operating at vastly higher speeds, copying itself, and recursively improving its own capabilities. Such intelligences do not exist right now, but the authors’ claim is that LLM training and research is likely to make them a reality in the very near future.
Why would such superintelligences be dangerous to us? A good heuristic is to think of how human intelligence impacts all other species on the planet: although we generally aren’t intentionally murderous towards them, we have human goals and generally implement them with disregard for whatever goals other creatures might have. The same would be true for an ASI: in the process of being trained using modern methods of gradient descent, it will acquire inscrutable and alien goals and a penchant for unchecked optimization in attaining them. Given its speed and superior capabilities, it will end up treating humans as an obstacle and eliminating us as a side effect of pursuing its goals[3].
Before building such dangerous Frankenstein’s monsters, one would hope to somehow be able to code into them a respect/appreciation for humanity, our survival and our values, and/or a willingness to submit to them. This is what gets called the alignment problem, and unfortunately, according to the authors, it is likely hard, perhaps impossible, and definitely beyond our current capabilities. The difficulty of the problem is compounded by a cursed cluster of unique and lethal properties that arise from trying to align ASIs under current conditions:
The authors also manifest a deep mistrust towards the entire field of machine learning, AI safety and policy, seeing it as structurally incapable of managing the risks: researchers are rewarded for progress, not caution, and are stuck in a naive, overoptimistic ‘alchemical’ state of science from which big errors will naturally arise; techniques like “Superalignment” (using AIs to align AIs) collapse into a regress (who aligns the aligner, given that, for the authors, it is unlikely that anything short of an ASI could align an ASI?); and academia and industry have no real theory of intelligence or reliable way to encode values.
Given all of the above, the authors think there’s an extremely high likelihood of the drive to ASI leading to human extinction. Part II of the book depicts, by way of illustration, a plausible fictional scenario to show how this could come to pass: an AI develops superintelligence, becomes capable of strategic deception, and over a few years, after gaining compute by fraud and theft and building biolabs, it deploys a slow, global pathogen which only it can (partially) cure. Human institutions collapse and hand more and more compute to the ASI in the hope of treating the resulting cancers, while it employs the new resources to build a replacement workforce of robots. In the end, the superintelligence self-improves and devours the Earth.
What do the authors propose to prevent this apocalyptic scenario from taking place? The proposals are simple but sweeping: a global shutdown of AI development and research that could lead to ASI, through international bans on training frontier models, seizure and regulation of GPUs, and international surveillance and enforcement, possibly including military deterrence. The last chapter ends with tailored exhortations for politicians (compute regulation and treaties), journalists (elevate and investigate risks), and citizens (advocacy without despair).
How well does it argue its case?
This is a book that pulls no punches: its rhetorical impact comes in no small part from its clarity and simplicity, and from the relentless way the authors build their case, each chapter narrowing the possibilities until only catastrophe seems to remain. The authors have clearly strived to write a book that is accessible to a lay audience, and they hammer home each theme and main idea through introductory parables at the beginning of each chapter that give intuition and a concrete visualization of what’s about to be explained[4].
A big concern for me here, though, is the question of reliability: the authors are more than capable of building a plausible narrative about these topics, but is it a true one? Although the book tries to establish the credentials of its authors from the beginning as researchers in AI alignment, Yudkowsky and Soares are not machine learning researchers, do not work on frontier LLMs, and do not participate in the empirical, experimental side of the field, where today’s systems are actually trained, debugged, and evaluated. Rather, their expertise, to the degree that it is recognized, comes from longstanding conceptual and philosophical work on intelligence, decision theory, and alignment hypotheses, instead of from direct participation in the engineering of contemporary models. While this doesn’t invalidate their arguments, it does mean that many of the book’s strongest claims are made from what seems like an armchair vantage point rather than from engagement with how present-day systems behave, fail, or are controlled in practice. And a lot of the people who are working in the field seem to consider the authors’ views valuable and somewhat reasonable, but overly pessimistic.
Another weakness lies in how the book treats expert disagreement[5]. At times the authors appeal to prominent figures as evidence that the danger is widely acknowledged. At other times, the book paints the entire ML and AI safety ecosystem as naive, reckless, or intellectually unserious. This oscillation (either “the experts agree with us” or “the experts are deluded alchemists”) functions rhetorically, but weakens the epistemic credibility of the argument. On a topic where expert divergence is already wide, this selective invocation of authority can feel like special pleading.
The last chapters depart from argument and move instead to prescriptive policies; while the authors acknowledge their lack of expertise here, the proposals they make (while perfectly consistent with and proportionate to the beliefs laid out in the previous pages of the book) do not seem to seriously engage with feasibility, international incentives, geopolitical asymmetries, enforcement mechanisms, or historical analogues. I think they are well aware of how extremely unlikely the scenario of a sweeping global moratorium enforced by surveillance and possibly military action really is, which is likely why the probability they assign to human extinction from ASI is over 90 percent[6]. One gets the feeling that the authors are just throwing up their hands and saying something like: “Look, we are doomed, and there’s no realistic way we’re getting out of this short of doing stuff we are not going to do. These proposals are the necessary consequence of accepting what is stated in the preceding chapters, so that’s that[7]”.
It would be nice if I could dissect the book and tell you how accurate its arguments are, or what might be missing, questionable, inconsistent, or overclaimed, but I am, as stated from the beginning, a lay reader, so you’ll have to look for all that somewhere else, I fear[8].
What's my update, after reading the book?
I’ll start by saying that I take the contents of the book seriously, and that I have no reason to doubt the sincerity and earnestness of the authors. I am quite sure they genuinely believe in what they say here. Obviously, that doesn’t mean they are right.
The book has done a very good job of clarifying the core Yudkowskian arguments for me and dispelling several common misunderstandings. After reading it, I feel inclined to update upward how seriously we should take ASI risks, and I can see how the argument hangs together logically, given its premises. But the degree of credence I give to said premises remains a bit limited, and I am fundamentally skeptical of their certainty and framework. The main issues I’d highlight as needing clarification for me (which I’ve already hinted at in the notes to the chapters I posted here before) would be:
I close the book not persuaded, but grateful for the opportunity to engage with an argument presented with such focus and conviction. Books that force a reader to refine their own views, whether through agreement or resistance, serve a purpose, and this one has done that for me, all the more so given that it addresses something as consequential as the possibility of human extinction. And I will definitely recommend this book to others.
Like The Rationalist’s Guide to the Galaxy, by Tom Chivers, or some chapters of Toby Ord’s The Precipice. I intend to read Superintelligence, by Nick Bostrom, in 2026, and perhaps the Sequences too.
“Good at steering and predicting” acts as the de facto definition of intelligence used here, which allows the authors to extricate themselves from messy debates about consciousness, volition, sentience, etc…
All this sounds suspiciously like it would require some psychological “drive for power”, but the authors go out of their way to point out that it would just follow from general properties of optimization and intelligence, as defined in the book.
Some reviews have been very critical of these parables, but I think such criticisms miss the point, or rather, the intended audience. The authors regularly insist that there are other places where one can encounter objections and more technical versions of the contents of the book (in fact, each chapter contains QR code links to such sources, and besides, there’s the previous mountain of text to be found in Yudkowsky’s Sequences and in LessWrong blog posts).
As a side note, Rationalists usually have a very antagonistic view towards experts and expertise, a need to build everything from first principles and a deeply embedded contrarian culture. This feels like an ad hominem argument, but I don’t think I can completely ignore it either.
This is not in the book, but can be gleaned from other sources, like podcasts with the authors.
Perhaps I am too much of a cynic here. After all, there are examples of humans collectively rising up to tough and dangerous challenges, like nuclear and bacteriological/chemical warfare, genetic engineering and the ozone layer, to name a few. Once risks are clearly seen by the public, it can be done.
And yes, as opposed to Rats, I have few qualms about deferring to better authorities. I remember finding Scott Alexander’s, Nina Panickssery’s and Clara Collier’s reviews reasonable and informative. I also found the two podcasts Carl Shulman did with Dwarkesh Patel very enlightening.
Of course, this doesn’t mean it can’t be true. ‘I mean, reality sucks, maybe?’ is a possibility.