A non-review of "If Anyone Builds It, Everyone Dies"

by boazbarak
28th Sep 2025
4 min read

10 comments, sorted by top scoring
Eliezer Yudkowsky · 1h

The gap between Before and After is the gap between "you can observe your failures and learn from them" and "failure kills the observer".  Continuous motion between those points does not change the need to generalize across them.

It is amazing how much of an antimeme this is (to some audiences).  I do not know any way of saying this sentence that causes people to see the distributional shift I'm pointing to, rather than mapping it onto some completely other idea about hard takeoffs, or unipolarity, or whatever.

Buck · 31m

Where do you think you've spelled this argument out best? I'm aware of a lot of places where you've made the argument in passing, but I don't know of anywhere where you say it in depth.

(My response last time (which also wasn't really in depth; I should maybe try to articulate my position better sometime...) was this.)

boazbarak · 21m

You seem to be assuming that you cannot draw any useful lessons from cases where failure falls short of killing everyone on earth that would apply to cases where it does. 

However, if AIs advance continuously in capabilities, then there are many intermediate points between today, where (for example) "failure means a prompt injection causes a privacy leak", and a future where "failure means everyone is dead". I believe that if AIs capable of the latter are scaled-up versions of current models, then by studying which alignment methods scale and which do not, we can obtain valuable information.

If you consider the METR graph of (roughly) the duration of tasks quadrupling every year, then you would expect non-trivial gaps between the points at which (to take the cybersecurity example) AI is at the level of a 2025 top expert, AI is equivalent to a 2025 top-level hacking team, and AI reaches 2025 top nation-state capabilities. (And of course, while AI improves, the humans will be using AI assistance as well.)
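To make the arithmetic concrete, here is a minimal sketch in Python (the growth rate, starting horizon, and milestone task-lengths below are illustrative assumptions I am making up for the example, not METR's numbers): under constant exponential growth, a fixed capability ratio between milestones translates into a fixed gap in calendar time.

```python
import math

GROWTH_PER_YEAR = 4.0          # assumed: the task horizon quadruples every year
CURRENT_HORIZON_HOURS = 2.0    # assumed starting point (arbitrary)

# Hypothetical milestones, expressed as the task horizon (in hours) assumed to be
# needed to match each capability level. These numbers are placeholders only.
MILESTONES = {
    "2025 top individual expert": 40.0,    # ~ a focused work week
    "2025 top hacking team":      400.0,   # ~ 10x an individual
    "2025 top nation state":      4000.0,  # ~ 10x a team
}

def years_until(target_hours: float) -> float:
    """Years until the horizon reaches target_hours under constant exponential growth."""
    return math.log(target_hours / CURRENT_HORIZON_HOURS, GROWTH_PER_YEAR)

for name, hours in MILESTONES.items():
    print(f"{name:>28}: ~{years_until(hours):.1f} years from now")
```

Under these placeholder numbers, each 10x jump in required task horizon takes log_4(10) ≈ 1.7 additional years, which is the sense in which such milestones would be spread out rather than arriving all at once.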

I believe there is going to be a long and continuous road ahead between current AI systems and ones like Sable in your book. 
I don't believe that there is going to be an alignment technique that works one day and then completely fails after a 16-hour run on 200K GPUs.
Hence I believe we will be able to learn from both successes and failures of our alignment methods throughout this time.

Of course, it is possible that I am wrong, and future superintelligent systems cannot be obtained by merely scaling up current AIs, but instead require completely different approaches. However, if that is the case, this should update us toward longer timelines, and cause us to consider development of the current paradigm less risky.
 

Vladimir_Nesov · 12m

In this framing, the crux is whether there is an After at all (at any level of capability): the distinction between "failure doesn't kill the observer" (a perpetual Before) and "failure is successfully avoided" (managing to navigate the After).

sjadler · 4h

Thanks for taking the time to write up your reflections. I agree that the before/after distinction seems especially important ('only one shot to get it right'), and is a crux of the EY/NS worldview that I expect many non-readers not to know about.

I’m wondering about your take in this passage:

> In the book they make an analogy to a ladder where every time you climb it you get more rewards but once you reach the top rung then the ladder explodes and kills everyone. However, our experience so far with AI does not suggest that this is a correct world view.

I’m curious what about the world’s experience with AI seems to falsify it from your POV? / casts doubt upon it? Is it about believing that systems have become safer and more controlled over time?

(Nit, but the book doesn’t posit that the explosion happens at the top rung; in that case, we could just avoid ever reaching the top rung. It posits that the explosion happens at a not-yet-known rung, and so each successive rung climb carries some risk of blow-up. I don’t expect this distinction is load-bearing for you though)

boazbarak · 2m

p.s. I just realized that I did not answer your question:

> Is it about believing that systems have become safer and more controlled over time?

No, this is not my issue here. While I hope it won't be the case, systems could well become more risky and less controlled over time. I just believe that if that is the case, then it would be observable via an increased rate of safety failures far before we get to the point where failure means that literally everyone on earth dies.

boazbarak · 15m

See my response to Eliezer. I don't think it's one-shot - I think there are going to be both successes and failures along the way that will give us information that we will be able to use.

Even self-improvement is not a singular event - already AI scientists are using tools such as Codex or Claude Code to improve their own productivity. As models grow in capability, the benefit of such tools will grow, but it is not necessarily one event. Also, I think we would likely require this improvement just to sustain the exponential at its current rate - it would not be sustainable to continue the growth in hiring, so increasing productivity via AI will be necessary.

Re the nit: on page 205 they say "Imagine that every competing AI company is climbing a ladder in the dark. At every rung but the top one, they get five times as much money ... But if anyone reaches the top rung, the ladder explodes and kills everyone. Also nobody knows where the ladder ends."

I'll edit the text a bit so it's clear you don't know where it ends.

RussellThor · 3h

> The idea that there would be a distinct "before" and "after" is also not supported by current evidence which has shown continuous (though exponential!) growth of capabilities over time.

The time when the AI can optimize itself better than a human is a one-off event. You get the overhang/potential take-off here. Also, the AI having a coherent sense of "self" that it could protect by, say, changing its own code or controlling instances of itself could be an attractor and give a "before/after".

boazbarak · 15m

See this response 

boazbarak · now

Also, I don't think a sense of "self" is a singular event either; indeed, today's systems are already growing in situational awareness, which can be thought of as some sense of self. See our scheming paper: https://www.antischeming.ai/


I was hoping to write a full review of "If Anyone Builds It, Everyone Dies" (IABIED, by Yudkowsky and Soares) but realized I won't have time to do it. So here are my quick impressions of and responses to IABIED. I am writing this rather quickly and it's not meant to cover all arguments in the book, nor to discuss all my views on AI alignment; see six thoughts on AI safety and Machines of Faithful Obedience for some of the latter.

First, I like that the book is very honest, both about the authors' fears and predictions and about their policy prescriptions. It is tempting to practice strategic deception: even if you believe that AI will kill us all, to avoid saying so and instead push policy directions that directionally increase AI regulation under other pretenses. I appreciate that the authors are not doing that. As they say, if you are motivated by X but push policies under excuse Y, people will see through it.

I also enjoyed reading the book. Not all parables made sense, but overall the writing is clear. I agree with the authors that the history of humanity is full of missteps and unwarranted risks (e.g. their example of leaded fuel). There is no reason to think that AI will be magically safe on its own just because we have good intentions, or that the market will incentivize safety. We need to work on AI safety, and even if AI falls short of literally killing everyone, there are a number of ways in which its development could turn out badly for humanity or cause catastrophes that could have been averted.

At a high level, my main disagreement with the authors is that their viewpoint is very "binary" while I believe reality is much more continuous. There are several manifestations of this "binary" viewpoint in the book: a hard distinction between "grown" and "crafted" systems, and a hard distinction between current AI and superintelligence.

The authors repeatedly talk about how AI systems are grown, full of inscrutable numbers, and hence we have no knowledge of how to align them. While they are not explicit about it, their implicit assumption is that there is a sharp threshold between non-superintelligent AI and superintelligent AI. As they say, "the greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after." Their story also has a discrete moment of "awakening" where "Sable" is tasked with solving some difficult math problems and develops its own independent goals. Similarly, when they discuss the approach of using AI to help with alignment research, they view it in binary terms: either the AI is too weak to help and may at best help a bit with interpretability, or the AI is already "too smart, too dangerous, and would not be trustworthy."

I believe the line between "grown" and "crafted" is much blurrier than the way the authors present it. First, there is a sense in which complex crafted systems are also "grown". Consider, for example, a system like Microsoft Windows, with tens of millions of lines of source code that have evolved over decades. We don't fully understand it either - which is why we still discover zero-day vulnerabilities. This does not mean we cannot use Windows or shape it. Similarly, while AI systems are indeed "grown", they would not be used by hundreds of millions of users if AI developers did not have strong abilities to shape them into useful products. Yudkowsky and Soares compare training AIs to "tricks ... like the sort of tricks a nutritionist might use to ensure a healthy brain development in a fetus during pregnancy." In reality, model builders have much more control over their systems than even parents who raise and educate their kids over 18 years. ChatGPT might sometimes give the wrong answer, but it doesn't do the equivalent of becoming an artist when its parents wanted it to go to med school.

The idea that there would be a distinct "before" and "after" is also not supported by current evidence, which has shown continuous (though exponential!) growth of capabilities over time. Based on our experience so far, the default expectation would be that AIs will grow in capabilities, including the ability for long-term planning and acting, in a continuous way. We also see that AI's skill profile is generally incomparable to humans'. (For example, it is typically not the case that an AI that achieves a certain score on a benchmark or exam X will perform on task Y similarly to humans who achieve the same score.) Hence there will not be a single moment where AI transitions from human level to superhuman level; rather, AIs will continue to improve, with different skills transitioning from human to superhuman levels at different times.

Continuous improvement means that as AIs become more powerful, our society of humans augmented with AIs also becomes more powerful, both in terms of defensive capabilities and in terms of research on controlling AIs. It also means that we can extract useful lessons about both risks and mitigations from existing AIs, especially if we deploy them in the real world. In contrast, the binary point of view is anti-empirical. One gets the impression that no empirical evidence of alignment advances would change the authors' view, since it would all be evidence from the "before" times, which they don't believe will generalize to the "after" times.

In particular, if we believe in continuous advances, then we have more than one chance to get it right. AIs will not go from cheerful assistants to world destroyers in a heartbeat. We are likely to see many applications of AIs, as well as (unfortunately) more accidents and harmful outcomes, well before they reach the combination of intelligence, misalignment, and unmonitored powers that leads to infecting everyone in the world with a virus that gives them "twelve different kinds of cancer" within a month.

Yudkowsky and Soares talk in the book about various accidents in nuclear reactors and spaceships, but they never mention all the cases where nuclear reactors actually worked and spaceships returned safely. If they are right that there is one threshold which, once passed, means "game over", then this makes sense. In the book they make an analogy to climbing a ladder in the dark, where every rung you climb brings more rewards, but no one can see where the ladder ends, and once you reach the top rung it explodes and kills everyone. However, our experience so far with AI does not suggest that this is the correct worldview.