Contra Collier on IABIED

by Max Harms
20th Sep 2025

17 comments, sorted by top scoring

WilliamKiely:

(I only skimmed your review / quickly read about half of it. I agree with some of your criticisms of Collier's review and disagree with others. I don't have an overall take.)

One criticism of Collier's review you appeared not to make that I would make is the following.

Collier wrote:

By far the most compelling argument that extraordinarily advanced AIs might exist in the future is that pretty advanced AIs exist right now, and they’re getting more advanced all the time. One can’t write a book arguing for the danger of superintelligence without mentioning this fact.

I disagree. I think it was clear decades before the pretty advanced AIs of today existed that extraordinarily advanced AIs might exist (and indeed probably would exist) eventually. As such, the most compelling argument that extraordinarily advanced AIs might or probably will exist in the future is not that pretty advanced AIs exist today, but the same argument one could have made (and some did make) decades ago.

One version of the argument is that the limits of how advanced AI could be in principle seem extraordinarily advanced (human brains are an existence proof, and human brains have known limitations relative to machines), and it seems unlikely that AI progress would permanently stall before reaching a point where there are extraordinarily advanced AIs.

E.g. I.J. Good foresaw superintelligent machines, and I don't think he was just getting lucky to imagine that they might or probably would come to exist at some point. I think he had access to compelling reasons.

The existence of pretty advanced AIs today is some evidence and allows us to be a bit more confident that extraordinarily advanced AIs will eventually be built, but their existence is not the most compelling reason to expect significantly more capable AIs to be created eventually.

elifland:

I agree about what is more evidence in my view, but that could be consistent with current AIs and the pace of their advancement being more compelling to the average reader, particularly people who strongly prefer empirical evidence to conceptual arguments.

Not sure whether Collier was referring to it being more compelling in her view, readers', or both.

edit: also of course current AIs and the pace of advancement are very relevant evidence for whether superhuman AGIs will arrive soon. And I think often people (imo wrongly in this case, but still) round off "won't happen for 10-20+ years" to "we don't need to worry about it now."

ryan_greenblatt:

I find it very strange that Collier claims that international compute monitoring would “tank the global economy.” What is the mechanism for this, exactly?

>10% of the value of the S&P 500 is downstream of AI and the proposal is to ban further AI development. AI investment is a non-trivial fraction of current US GDP growth (20%?). I'd guess the proposal would cause a large market crash and a (small?) recession in the US; it's unclear if this is well described as "tanking the global economy".

habryka:

it's unclear if this is well described as "tanking the global economy".

I think the answer is “no”? 

Like, at least in this context I would read the above as implying a major market crash, not a short-term 20% reduction in GDP growth. We pass policies all the time that cause a 20% reduction in GDP growth, so in the context of a policy discussion where the downside is meant to imply either political infeasibility or a tradeoff that's obviously not worth it, I feel like it clearly implies more.

Like, if you buy the premise of the book at all, the economic costs here are of course pretty trivial.

Davidmanheim:

But the claim isn't, or shouldn't be, that this would be a short term reduction, it's that it cuts off the primary mechanism for growth that supports a large part of the economy's valuation - leading to not just a loss in value for the things directly dependent on AI, but also slowing growth generally. And reduction in growth is what makes the world continue to suck, so that most of humanity can't live first-world lives. Which means that slowing growth globally by a couple percentage points is a very high price to pay.

I think that it's plausibly worth it - we can agree that there's a huge amount of value enabled by autonomous but untrustworthy AI systems that are likely to exist if we let AI continue to grow, and that Sam was right originally that there would be some great [i.e. incredibly profitable] companies before we all die. And despite that, we shouldn't build it - as the title says.

peterbarnett:

I think that the proposal in the book would "tank the global economy", as defined by a >10% drop in the S&P 500, and similar index funds, and I think this is a kinda reasonable definition. But I also think that other proposals for us not all dying probably have similar (probably less severe) impacts because they also involve stopping or slowing AI progress (eg Redwood's proposed "get to 30x AI R&D and then stop capabilities progress until we solve alignment" plan[1]).

  1. ^

    I think this is an accurate short description of the plan, but it might have changed since I last heard.

anaguma:

I wonder if Yudkowsky could briefly respond on whether this is in fact his position:

Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant

Matthew Barnett:

I would strongly disagree with the notion that FOOM is “a key plank” in the story for why AI is dangerous. Indeed, one of the most useful things that I, personally, got from the book was seeing how it is *not* load-bearing for the core arguments.

I think the primary reason why the foom hypothesis seems load-bearing for AI doom is that without a rapid and local AI takeoff, we won't simply get "only one chance to correctly align the first AGI [ETA: or the first ASI]".

If foom occurs, there will be a point where a company develops an AGI that quickly transitions from being just an experimental project to something capable of taking over the entire world. This presents a clear case for caution: if the AI project you're working on will undergo explosive recursive self-improvement, then any alignment mistakes you build into it will become locked in forever. You cannot fix them after deployment because the AI will already have become too powerful to stop or modify.

However, without foom, we are more likely to see a gradual and diffuse transition from human control over the world to AI control over the world, without any single AI system playing a critical role in the transition by itself. The fact that the transition is not sudden is crucial because it means that no single AI release needs to be perfectly aligned before deployment. We can release imperfect systems, observe their failures, and fix problems in subsequent versions. Our experience with LLMs demonstrates this pattern, where we could fix errors after deployment, making sure future model releases don't have the same problems (as illustrated by Sydney Bing, among other examples).

A gradual takeoff allows for iterative improvement through trial and error, and that's simply really important. Without foom, there is no single critical moment where we must achieve near-perfect alignment without any opportunity to learn from real-world deployment. There won't be a single, important moment where we abruptly transition from working on "aligning systems incapable of taking over the world" to "aligning systems capable of taking over the world". Instead, systems will simply gradually and continuously get more powerful, with no bright lines.

Without foom, we can learn from experience and course-correct in response to real-world observations. My view is that this fundamental process of iteration, experimentation, and course correction in response to observed failures makes the problem of AI risk dramatically more tractable than it would be if foom were likely.

Vladimir_Nesov:

we won't simply get "only one chance to correctly align the first AGI"

We only get one chance for a "sufficiently critical try", which means an AI at the level of power where you lose control over the world if you fail to align it. I expect there are no claims to the effect that there will be only one chance to correctly align the first AGI.

A counterargument from no-FOOM should probably claim that there will never be such a "sufficiently critical try" at all because at every step of the way it would be possible to contain a failure of alignment at that step and try again and again until you succeed, as normal science and engineering always do.

Matthew Barnett:

I expect there are no claims to the effect that there will be only one chance to correctly align the first AGI.

For the purpose of my argument, there is no essential distinction between 'the first AGI' and 'the first ASI'. My main point is to dispute the idea that there will be a special 'it' at all, which we need to align on our first and only try. I am rejecting the scenario where a single AI system suddenly takes over the world. Instead, I expect AI systems will continuously and gradually assume more control over the world over time. In my view, there will not be one decisive system, but rather a continuous process of AIs assuming greater control over time.

To understand the distinction I am making, consider the analogy of genetically engineering humans. By assumption, if the tech continues improving, there will eventually be a point where genetically engineered humans will be superhuman in all relevant respects compared to ordinary biological humans. They will be smarter, stronger, healthier, and more capable in every measurable way. Nonetheless, there is no special point at which we develop 'the superhuman'. There is no singular 'it' to build, which then proceeds to take over the world in one swift action. Instead, genetically engineered humans would simply progressively get smarter, more capable, and more powerful across time as the technology improves. At each stage of technological innovation, these enhanced humans would gradually take over more responsibilities, command greater power in corporations and governments, and accumulate a greater share of global wealth. The transition would be continuous rather than discontinuous.

Yes, at some point such enhanced humans will possess the raw capability to take control over the world through force. They could theoretically coordinate to launch a sudden coup against existing institutions and seize power all at once. But the default scenario seems more likely: a continuous transition from ordinary human control over the world to superhuman genetically engineered control over the world. They would gradually occupy positions of power through normal economic and political processes rather than through sudden conquest.

Vladimir_Nesov:

For the purpose of my argument, there is no essential distinction between 'the first AGI' and 'the first ASI'.

For the purpose of my response there is no essential distinction there either, except perhaps the book might be implicitly making use of the claim that building an ASI is certainly a "sufficiently critical try" (if something weaker isn't already a "sufficiently critical try"), which makes the argument more confusing if left implicit, and poorly structured if used at all within that argument rather than outside of it.

The argument is still not that there is only one chance to align an ASI (this is a conclusion, not the argument for that conclusion). The argument is that there is only one chance to align the thing that constitutes a "sufficiently critical try". A "sufficiently critical try" is conceptually distinct from "ASI". The premise of the argument isn't about a level of capability alone, but rather about lack of control over that level of capability.

A counterargument would need to claim that even ASI won't constitute a "sufficiently critical try" in this sense, that is, that even ASI won't successfully take control over the world if misaligned. Probably because by the time it's built there are enough checks and balances that it can't (at least individually) take over the world if misaligned. And indeed this seems to be in line with the counterargument you are making. You don't expect there will be a lack of control, even as we reach ever higher levels of capability.

Nonetheless, there is no special point at which we develop 'the superhuman'. There is no singular 'it' to build, which then proceeds to take over the world in one swift action.

Thus there is no "sufficiently critical try" here. But if there were, it would be a problem. Since in your view there won't be, you reject the premise, which is fair enough.

Davidmanheim:

I think the primary reason why the foom hypothesis seems load-bearing for AI doom is that without a rapid and local AI takeoff, we won't simply get "only one chance to correctly align the first AGI".


As the review makes very clear, the argument isn't about AGI, it's about ASI. And yes, they argue that you would in fact only get one chance to align the system that takes over. As the review discusses at length:

I do think we benefit from having a long, slow period of adaptation and exposure to not-yet-extremely-dangerous AI. As long as we aren’t lulled into a false sense of security, it seems very plausible that insights from studying these systems will help improve our skill at alignment. I think ideally this would mean going extremely slowly and carefully, but various readers may be less cautious/paranoid/afraid than me, and think that it’s worth some risk of killing every child on Earth (and everyone else) to get progress faster or to avoid the costs of getting everyone to go slow. But regardless of how fast things proceed, I think it’s clearly good to study what we have access to (as long as that studying doesn’t also make things faster or make people falsely confident).

But none of this involves having “more than one shot at the goal” and it definitely doesn’t imply the goal will be easy to hit. It means we’ll have some opportunity to learn from failures on related goals that are likely easier.

The “It” in “If Anyone Builds It” is a misaligned superintelligence capable of taking over the world. If you miss the goal and accidentally build “it” instead of an aligned superintelligence, it will take over the world. If you build a weaker AGI that tries to take over the world and fails, that might give you some useful information, but it does not mean that you now have real experience working with AIs that are strong enough to take over the world.

Matthew Barnett:

As the review makes very clear, the argument isn't about AGI, it's about ASI. And yes, they argue that you would in fact only get one chance to align the system that takes over.

I'm aware; I was expressing my disagreement with their argument. My comment was not premised on whether we were talking about "the first AGI" or "the first ASI". I was making a more fundamental point.

In particular: I am precisely disputing the idea that there will be "only one chance to align the system that takes over". In my view, the future course of AI development will not be well described as having a single "system that takes over". Instead, I anticipate waves of AI deployment that gradually and continuously assume more control.

I fundamentally dispute the entire framing of thinking about "the system" that we need to align on our "first try". I think AI development is an ongoing process in which we can course correct. I am disputing that there is an important, unique point when we will build "it" (i.e. the ASI).

Davidmanheim:

It seems like you're arguing against something different from the point you brought up. You're saying that slow growth on multiple systems means we can get one of them right, by course correcting. But that's a really different argument - and unless there's effectively no alignment tax, it seems wrong. That is, the systems that are aligned would need to outcompete the others after they are smarter than each individual human, and beyond our ability to meaningfully correct. (Or we'd need to have enough oversight to notice much earlier - which is not going to happen.)

Matthew Barnett:

You're saying that slow growth on multiple systems means we can get one of them right, by course correcting.

That's not what I'm saying. My argument was not about multiple simultaneously existing systems growing slowly together. It was instead about disputing the idea of a unique or special point in time when we build "it" (i.e., the AI system that takes over the world), and about the value of course correction and the role of continuous iteration.

WilliamKiely:

[C]urrently available techniques do a reasonably good job of addressing this problem. ChatGPT currently has 700 million weekly active users, and overtly hostile behavior like Sydney’s is vanishingly rare.

Yudkowsky and Soares might respond that we shouldn’t expect the techniques that worked on a relatively tiny model from 2023 to scale to more capable, autonomous future systems. I’d actually agree with them. But it is at the very least rhetorically unconvincing to base an argument for future danger on properties of present systems without ever mentioning the well-known fact that present solutions exist.

It is not a “well-known fact” that we have solved alignment for present LLMs. If Collier believes otherwise, I am happy to make a bet and survey some alignment researchers.

I think you're strawmanning her here.

Her "present solutions exist" statement clearly refers to her "techniques [that] do a reasonably good job of addressing this problem [exist]" from the previous paragraph, which you didn't quote (I added it to the quote above). I.e., she's clearly not claiming that alignment for present LLMs is completely solved, just that solutions that work "reasonably well" exist, such that overtly hostile behavior like Bing Sydney's is rare.

Davidmanheim:

But the way you are reading it seems to mean her "strawmann[ed]" point is irrelevant to the claim she made! That is, if we can get 50% of the way to aligned for current models, and we keep doing research and finding partial solutions that at each stage get 50% of the way to aligned for future models, and at each stage those solutions are both insufficient for full alignment and don't solve the next set of problems, we still fail. Specifically, not only do we fail, we fail in a way that means "we shouldn’t expect the techniques that worked on a relatively tiny model from 2023 to scale to more capable, autonomous future systems." Which is the thing she then disagrees with in the remainder of the paragraph you're trying to defend.

Contra Collier on IABIED

Clara Collier recently reviewed If Anyone Builds It, Everyone Dies in Asterisk Magazine. I’ve been a reader of Asterisk since the beginning and had high hopes for her review. And perhaps it was those high hopes that led me to find the review disappointing.

Collier says “details matter,” and I absolutely agree. As a fellow rationalist, I’ve been happy to have nerds from across the internet criticizing the book and getting into object-level fights about everything from scaling laws to neuron speeds. While they don’t capture my perspective, I thought Scott Alexander and Peter Wildeford’s reviews did a reasonable job of poking at the disagreements with the source material without losing track of the big picture.

But I did not feel like Collier’s review was getting the details or the big picture right. Maybe I’m missing something important. Part of my motive for writing this “rebuttal” is to push back on where I think Collier gets things wrong, but part of it stems from a hope that by writing down my thoughts, someone will be able to show me what I’m missing. (Maybe Collier will respond, and we can try to converge?)

I’ll get into a mess of random nitpicking at the end of this essay, but I want to start by addressing two main critiques from Collier’s review that I think are pretty important:

  • “The idea of an intelligence explosion [is] a key plank of the MIRI story”
  • “If one believes that AI progress will be slow and continuous, or even relatively fast and continuous, it follows that we’ll have more than one shot at the goal.”

FOOM

Collier writes:

[The MIRI worldview says there will be] a feedback loop where AIs rapidly improve their own capabilities, yielding smarter agents, which are even better at AI research, and so on, and so forth — escalating uncontrollably until it yields a single AI agent which exceeds all humans, collectively, in all mental abilities. This process is called an intelligence explosion, or, colloquially, FOOM (rhymes with “doom”). It’s probably the single most controversial premise inside the community of people who seriously worry about superintelligent AIs

[...]

[O]ne would think they’d devote some space to explaining why they think an intelligence explosion is likely to occur. Remarkably, they don’t. The concept gets two sentences in the introduction. They don't even explain why it's relevant. It is barely introduced, let alone justified or defended. And it’s certainly not obvious enough to go without saying, because advances in the neural networks which constitute current advanced AI have been continuous. The combination of steady algorithmic progress and increasing computational resources have produced years of predictable advances. Of course, this can’t rule out the possibility of a future intelligence explosion, but the decision not to explain why they think this might happen is utterly baffling, as it’s load-bearing for everything that follows.

Collier is absolutely right that recursive self-improvement leading to a rapid intelligence explosion has been a common thing MIRI has talked about since the era when MIRI was the Singularity Institute.

But I would strongly disagree with the notion that FOOM is “a key plank” in the story for why AI is dangerous. Indeed, one of the most useful things that I, personally, got from the book was seeing how it is *not* load-bearing for the core arguments.

Yudkowsky does think an intelligence explosion stemming from recursive self-improvement is at least somewhat likely (as do I). But even if AI capabilities progress at a relatively slow rate, the likeliest outcome is still catastrophe. The scenario provided in Part 2 of the book deliberately focuses on AIs that can’t self-improve much and can’t achieve vast superintelligence (until well after humanity is disempowered) because the authors wanted to show that the core argument doesn’t require those assumptions.

Just to check, let’s go through the book’s central points, which were nicely laid out by Peter Wildeford in his review (all this text is his):

  1. AI superintelligence is possible in principle and will happen eventually. Machines possess inherent advantages in speed, memory, and self-improvement that make their eventual superiority over biological brains almost certain. Chess AI did not stop at human-level intelligence but kept going to vastly surpass human Chess players. Go AI did not stop at human-level intelligence but kept going to vastly surpass human Go players. AI will become superhuman at more and more things until eventually it becomes superhuman at everything.
  2. AI minds are alien and we currently lack the fundamental understanding to instill reliable, human-aligned values into a mind far more intelligent than our own. Current AI models are grown through complex processes like gradient descent, not crafted with understandable mechanisms. This results in alien minds with internal thought processes fundamentally different from humans. We can create these systems without truly understanding how they work, but we can’t specify and control their values.
  3. You can’t just train AIs to be nice. AIs trained for general competence will inevitably develop their own goals as a side effect and these emergent preferences will not align with human values. Instead, they will be shaped by the AI's unique cognitive architecture and training environment. An AI's level of intelligence is independent of its ultimate goals. A superintelligent AI will not inherently converge on human values like compassion or freedom. Instead, it will pursue its own arbitrary objectives with superhuman efficiency.
  4. Nearly any AI will want power and control, because it is useful to whatever goals the AI does have. To achieve any long-term goal, an AI superintelligence will recognize the instrumental value of self-preservation, resource acquisition, and eliminating potential threats. From its perspective, humanity is both a competitor for resources and a potential threat that could switch it off. Therefore, eliminating humanity becomes a logical side effect of pursuing its primary goals, not an act of malice.
  5. We only get one chance to specify the values of an AI system correctly and robustly, as failure on the first try would be catastrophic. Combined with our lack of understanding, this is akin to trying to coach a high schooler to make a computer secure against the NSA on his first try or trying to get a college graduate to build a flawless nuclear reactor on her first try.
  6. Because of 2-5 and maybe other reasons, superintelligence will inevitably lead to human extinction with near certainty, regardless of the positive intentions of the creator. It is not sufficient to create superintelligence in a nice, safety-focused, Western AI company. Per the authors, anything and anyone using current or foreseeable methods will inevitably lead to the extinction of humanity. The authors assert this is not a speculative risk but a predictable outcome with very high confidence.
  7. According to the authors, the only rational course of action in reaction to (6) is an immediate, verifiable, full-scale and global halt to all large-scale AI development. This would be potentially similar to how the world got together and managed to prevent nuclear war (so far). It would require international treaties, monitored consolidation of computing hardware, and a halt to research that could lead to AI superintelligence. On Yudkowsky’s and Soares’s worldview, other policy solutions don’t come close to solving the problem and are basically irrelevant. But a global pause would potentially be feasible because it is in the interest of any world leader — in China, Russia, or the US — “to not die along with their families.” This shared interest in survival is what prevented a global nuclear war.
  8. At minimum, if you’re not fully bought into (7), the authors argue we should build in the optionality to pause AI development later, if we get more evidence there is a threat. The authors consider this solution insufficient, but a nonetheless worthy first step as there is a lot of preparatory work to do.

</PeterWildeford>

Is FOOM load-bearing for any of those points? I don’t think so. The first four points (which correspond to the first five chapters of the book) at the very least seem to me to be independent of takeoff speeds. One could argue that point 2 is irrelevant if we go slowly, because even though we currently lack the skill to align AIs, we may develop that skill in the future before they’re dangerous, but the point that they’re alien and very opaque still stands.

My guess is that Collier thinks that FOOM is necessary to justify point 5, and that if there are “steady, predictable advances” we’ll have multiple tries. I’ll address that in a moment. Regardless, I think it’s simply not true that FOOM is load-bearing for “everything” in the book, and I think it’s therefore clumsy to attack Soares and Yudkowsky for not talking about it more.

On the topic of FOOM, I also want to quickly address what appears to be another misconception: whether a rapidly increasing level of intelligence is necessarily linked to there being a single ASI. While an extremely rapid takeoff might allow for a first-mover to seize everything (and indeed, I think this is Yudkowsky’s default guess), I see no reason why multiple AI systems couldn’t take off roughly in parallel. (My 2016 novel Crystal Society imagines this exact premise.)

Yudkowsky also thinks this is plausible, or at least worth considering. IABIED directly addresses the notion that there might be multiple AIs:

It doesn’t have to be just one AI that gets smart enough. We could have told a story where there are thousands of superhuman machine intelligences around the world, all waiting against a calculated day when a change of strategy would be militarily successful. There are many different ways the future could go. They all lead to the same place, if a story like this one is allowed to begin.

And in the book’s online resources:

It’s not obvious to us whether there will be multiple smarter-than-human AIs of comparable ability, such that an “AI civilization” might emerge that has need of “AI property rights.” [...] We’re not saying that a single, dominant AI will necessarily emerge, but rather that it seems like a hard call.

(See also: this bit of the online resources talking about multiple AIs, and this bit of the online resources that addresses the idea of using AI competition to protect us.)

Also on the topic of getting FOOM right, I think it’s probably a mistake to characterize an intelligence explosion as contrary to a picture of predictable, continuous progress. When the forecasters behind AI 2027 extrapolated out the “predictable” trajectory of AI development, they described a scenario where the AIs become superhuman and recursively self-improve, leading to an intelligence explosion that kills all humans! Collier seems to imply that continuous = slow and gradual, and I just don’t think that follows.

Collier writes:

Even proponents of AI 2027, itself a relatively extreme view, believe that there will be several competing AIs which will gradually become more capable and more power-seeking over a period of years, giving humans plenty of chances to catch them in the act and take precautionary measures.

I think this is the wrong take on AI 2027. The “competing AIs” they describe are, as far as I can tell, only really relevant in that scenario because the race dynamics make things harder. Yes, AI 2027 involves many opportunities for the humans to notice that they’re in danger prior to the full loss of control, but none of that contradicts the story of an uncontrollable intelligence explosion. The intelligence explosion becomes uncontrollable when the humans fail to stop advancing AI capabilities before the critical window closes. (See my thoughts on AI 2027 for more on why the Slowdown ending is unrealistic.)

The scenario in IABIED also involves an AI that becomes uncontrollable after some amount of time where the developers have the opportunity to notice the warning signs and take precautionary measures. And there’s some discussion of “catching the AI in the act”:

Or maybe [the company building the first potentially dangerous AI] would alert government officials, and there would be a big argument in a presidential office about whether to order [the AI] to create a smarter AI before other countries did it first. Maybe the side of caution would win, and time would pass. . . ​until a copy of [the AI] stolen by foreign intelligence services was run by more reckless operators.

I don’t think anyone believes that we have no opportunity to notice warning signs. In an important sense, we’re already in that position today. Developers are noticing a steady trickle of red flags, and the issue is more that these are not blatant enough to serve as a “fire alarm” or get most of the people building those machines to change course. Perhaps Collier believes there will be some significant jump in how scary these things are, such that the warning signs become much more potent? I’m curious for other people’s perspectives here (especially if you’ve read the book), because I feel confused about the supposed contrast with AI 2027.

Relevant sections of the book’s online supplement:

  • Won’t there be early warnings researchers can use to identify problems?
  • Will there be warning shots?

Gradualism

This seems like a good point to transition into talking about the other central point that I hear Collier as making, which I might paraphrase as “gradual progress means that we’ll have time to learn about how AI works and develop techniques for aligning and controlling it that can be tested and refined in pace with the AI getting stronger and smarter.” This belief that we might be okay if AI progress is gradual — that if it’s slow enough, we’ll be able to adapt — is pretty common, I think, and it has merit. I agree, for starters, that going slow is broadly helpful. That’s why I think it’s important to set up policies and regulations that slow things down.

But speed is not the only crucial aspect of the gradualism hope — it’s also essential that the work we do to align AIs on Monday will continue to be relevant on Tuesday. Collier seems to think that Yudkowsky’s position is that current work is useless, because there will be a radical discontinuity. Here’s her attempt to characterize Yudkowsky’s position:

Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant.

I think this is not Yudkowsky’s position. I’m not sure exactly what the real Yudkowsky would say, but my shoulder-Eliezer says “It’s not that the work is irrelevant, it’s that it’s insufficient. If you tackle a problem that’s too hard, you can spend a lot of time without getting anywhere close to the solution. Even if it turns out we have nice, smooth, gradual progress over the span of a decade or whatever, we’re still going to die when we build a superintelligence because our civilization is too inadequate to address everything on the list of lethalities. This is a very hard problem, and you won’t have the luxury of learning from (a large and critical subset of) your failures, unlike all the other very hard engineering problems we’ve solved before.”

Perhaps I think he’d say this because I, personally, have a lot of esteem for the people who are currently working on solving the technical problems (including myself!) and think that we need more people working to try and figure out alignment. I agree that MIRI is, at times, dismissive of people’s hope in a way that can be abrasive and make it seem like the authors don’t have esteem for our colleagues. But I also note that the authors of IABIED have repeatedly praised interpretability researchers, made personal efforts to make progress on the technical challenges, and so on.

Let’s turn to the question of whether we’ll only have one shot, since it seems like it might be something of a double-crux.

Collier writes:

We are told: "Humanity only gets one shot at the real test." That is, we will have one opportunity to align our superintelligence. That's why we'll fail. It's almost impossible to succeed at a difficult technical challenge when we have no opportunity to learn from our mistakes. But this rests on another implicit claim: Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant.

I think the situation with ASI is similar to a killer asteroid headed for Earth that we only have one real shot to deflect, because we only detected it recently. In the case of the asteroid, at least, there’s no relationship between whether we only have one shot, and whether the work we’ve done before this point is relevant. Yes, we already have some spacecraft, and we’ve learned a lot that’s relevant. But we still only get one shot. If we mess up for whatever reason, everyone dies, because the problem is global, lethal, and irreversible, not because it’ll be different from what we’ve dealt with before.

Now, notably, it might also be qualitatively different from what we’ve dealt with before. For example, AIs that are not clearly capable of taking over the world might avoid thoughts or behaviors that push in that direction for the simple reason that those thoughts and behaviors aren’t useful for getting them what they want. The actually-dangerous ASI might be the first intelligent model that invests heavily in scheming. Or it might not! Seems like a hard call whether the first dangerous ASI will be very similar to current models, and I’ve heard people go both ways. This uncertainty makes me more worried, but it is not load-bearing with regard to whether AI is dangerous or whether we only have one shot.

If one believes that AI progress will be slow and continuous, or even relatively fast and continuous, it follows that we’ll have more than one shot at the goal.

I’m pretty confused by this, and would love to hear a followup clarification of the logic she has in mind.

My best attempt to steelman the argument is: “If AIs develop in a smooth fashion, we’ll have some time to engage with and learn from minds that are comparably intelligent to a human before we face something radically superhuman. Like grinding low-level enemies in an RPG, this experience will give us the opportunity to level up our alignment skills so that when we face the boss challenge of a radically superhuman AI, we’ll be ready. We’ll probably fail a bunch on the weaker AI, but those failures will let us broadly succeed on aligning all the AIs where failure would be catastrophic.”

(Apologies if the steelman is missing important things. I want comments/feedback here.)

I think this steelman has some truth. I do think we benefit from having a long, slow period of adaptation and exposure to not-yet-extremely-dangerous AI. As long as we aren’t lulled into a false sense of security, it seems very plausible that insights from studying these systems will help improve our skill at alignment. I think ideally this would mean going extremely slowly and carefully, but various readers may be less cautious/paranoid/afraid than me, and think that it’s worth some risk of killing every child on Earth (and everyone else) to get progress faster or to avoid the costs of getting everyone to go slow. But regardless of how fast things proceed, I think it’s clearly good to study what we have access to (as long as that studying doesn’t also make things faster or make people falsely confident).

But none of this involves having “more than one shot at the goal” and it definitely doesn’t imply the goal will be easy to hit. It means we’ll have some opportunity to learn from failures on related goals that are likely easier.

The “It” in “If Anyone Builds It” is a misaligned superintelligence capable of taking over the world. If you miss the goal and accidentally build “it” instead of an aligned superintelligence, it will take over the world. If you build a weaker AGI that tries to take over the world and fails, that might give you some useful information, but it does not mean that you now have real experience working with AIs that are strong enough to take over the world.

Nitpicks

I’m going to go through a few of the random bits of the review that I take issue with. My guess is that these are less important, even on the whole, but in the spirit of getting the details right I want to at least offer some pushback.

Collier:

In the book, Yudkowsky and Soares argue that it should be illegal to own more than eight of the most powerful GPUs available in 2024 without international monitoring (frontier training runs today use tens of thousands). To more mainstream AI safety organizations, this position entails rejecting valuable work in favor of policies that can’t be implemented and would tank the global economy if they were. 

I find it very strange that Collier claims that international compute monitoring would “tank the global economy.” What is the mechanism for this, exactly?

For context, “eight of the most powerful GPUs” might not intuitively sound like much, but that’s over $200,000 in hardware. Quoting an online appendix to the book:

The most advanced AI chips are also quite specialized, so tracking and monitoring them would have few spillover effects. NVIDIA’s H100 chip, one of the most common AI chips as of mid-2025, costs around $30,000 per chip and is designed to be run in a datacenter due to its cooling and power requirements. These chips are optimized for doing the numerical operations involved in training and running AIs, and they’re typically tens to thousands of times more performant at AI workloads than standard computers (consumer CPUs).

And:

Most people are not purchasing dozens of top-of-the-line AI GPUs and housing them in their garages. Most people aren’t running huge datacenters. Most people won’t even feel the effects of a ban on AI research and development. It’s just that ChatGPT wouldn’t change quite so often.

Humanity wouldn’t even need to stop using all the current AI tools. ChatGPT wouldn’t have to go away; we could keep figuring out how to integrate it into our lives and our economy. That would still be more change than the world used to see for generations. We would miss out on new AI developments (of the sort that would land as AI gets smarter but not yet smart enough to kill everyone), but society is mostly not clamoring for those developments.

And we would get to live. We would get to see our children live.

Developments that people are clamoring for, such as the development of new and life-saving medical technology, seem possible to pursue without also pursuing superintelligence. We are in favor of carve-outs for medical AI, so long as they function with adequate oversight and steer clear of dangerous generality.
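
As a quick back-of-envelope check on that “over $200,000” figure (my arithmetic, using the roughly $30,000-per-H100 price quoted above):

$$8 \times \$30{,}000 = \$240{,}000$$

Actual prices vary by vendor and configuration, but that is the scale of hardware the proposed eight-GPU threshold is aimed at.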

There’s a strengthened version of Collier’s argument that says

  1. The next generation of AI will produce huge economic gains (and won’t kill us all)
  2. Not building the next generation will deprive us of those gains
  3. The counterfactual impact of stopping AI capabilities research is that we’ll be a lot poorer

I think this is valid (but not knowably sound, because it might kill us), and if we want to recklessly charge forward, I agree that we’ll be counterfactually richer (until we’re dead or disempowered). But if this is what Collier had in mind, I think it’s extremely bad form to characterize slowing down as “tanking the global economy,” since it would still allow for normal economic progress and would not involve people getting poorer.[1]

I understand the position that thinks there’s no such thing as ASI, or it would be safe by default, such that there’s no reason for governments to intervene here. And I understand the maximally libertarian position that argues that the government shouldn’t be able to make things illegal, regardless of cost. But I don’t understand the position Collier is putting forward here.

My position, for transparency: I’m broadly libertarian/classically liberal, but think there are exceptions. We already need to police the world to make sure random people aren’t building nuclear weapons, and I expect this will soon need to extend to labs capable of engineering bioweapons. I do not want a “global surveillance state,” but I think it’s possible for the government to track chips while also leaving almost everyone’s privacy almost entirely intact.

More on this in the book’s online supplement:

  • Isn’t this handing too much power to governments?

Collier:

Yudkowsky and Soares’ draconian chip monitoring proposals aren’t neatly separable from their arguments about the nature of artificial intelligence. They are the conclusion of a very specific set of beliefs — for example, that danger will come in the form of a single superintelligent AI, which can’t be monitored or countered by other systems, and that its ability to improve its own code will make the amount of physical compute it has access to largely irrelevant. These points are implied by the MIRI worldview, but Yudkowsky and Soares largely breeze past them. The book spends whole chapters unpacking the motivations of future superintelligent AIs, but devotes very little space to justifying its own account of how those AIs will be created.

  • Again, there’s no central argument about whether there will be one AI or multiple.
  • AI can be monitored and countered by various systems right up until it becomes smart enough to escape them. “Can’t be monitored or countered” makes it sound like we’re arguing it’s physically impossible to monitor an AI. The reality is that IABIED’s position is that sufficiently advanced, misaligned AI can’t be controlled using techniques we are likely to have (basically by the definition of “sufficient”).
  • I have never heard Yudkowsky or anyone else seriously suggest that the amount of compute the AI has access to will be irrelevant. Collier should cite her source. I think it’s reasonable to say that Yudkowsky is worried about algorithmic advancement (as am I), but algorithmic advancement isn’t a load-bearing part of the argument, any more than recursive self-improvement is.
  • The book spends several chapters on why a mind more intelligent than any living human would be hard to monitor and control. I do not think it’s right to say it “breezes past them.” 

It is true that the authors have a worldview and don’t spend much time defending the parts of it that aren’t central to the thesis. This is what makes the book so short and straightforward compared to other things Yudkowsky has written.

Collier:

We could say — and certainly Yudkowsky and Soares would say — that this isn’t important, because the essential dynamics of superintelligence don’t depend on any particular architecture.

Yes! Exactly!

But that just raises a different question: why does the rest of the book talk about particular architectures so much?

A few reasons, like:

  • It makes it clear that we know what we’re talking about. (Or, if we don’t know what we’re talking about, it’s at least not as simple as “they don’t understand how modern AI systems work”.)
  • It helps ground conversation in systems that a general audience knows about. A lot of people are paying attention to AI because of systems like ChatGPT. And LLMs are genuinely relevant to these issues, so it’s no real cost to focus on them, even though the book makes it clear that there’s no guarantee LLMs will attain (or create) superintelligence.
  • Talking about how current systems work helps clarify how little researchers understand about the inner workings of AIs. This isn’t the only reason alignment is hard, but it’s a massive one relevant to the kinds of AIs people build today.
  • And it helps with understanding the kinds of tools developers do and don’t have for aligning AIs.

See also, in the supplemental materials:

  • Why does gradient descent matter?
  • What Good Does Knowledge of LLMs Do?

Collier:

[Yudkowsky says] we should expect AIs to have strange alien values that we can't control or predict, since the humans who “grow” AIs can’t exactly input ethics or morals by hand. This might seem broadly reasonable — except that this was also Yudkowsky’s conclusion in 2006, when he assumed that AIs would be crafted. Back then, his argument was that during takeoff, when an AI rapidly self-improves into superintelligence, it would undergo a sudden and extreme value shift.

To my memory this was never the argument. I think that the argument was "we don't know how to encode the right values by hand nor do we know how to encode a meta-process that gets the right values at the end." Yes, there was some work on tiling agents (including the Löbian stuff) and some discussion of what to do about value drift, but I do not recall any time when Yudkowsky was optimistic about hand-coding it to have the right values, nor a time when his only concern was getting values to stick when the AI self-improves.

Some aspects of the alignment problem look different today than they did 15 years ago. For example, Eliezer mentioned in a conversation a few days ago:

The book does not go very hard on the old Fragility of Value thesis from the Overcoming Bias days, because the current technology is bad enough that we're not likely to get that kind of close miss.  The problem is more like, "you get some terms of the utility function sorta right on the training distribution but their max outside the training distribution is way different from where you hoped it would generalize" than "the AI cares about love, life, happiness, fun, consciousness, novelty, and honor, but not music and freedom".

But neither of these pictures is saying that the core problem (much less the whole problem) is goal instability. And they’re certainly not saying the whole problem is “goal instability specifically during recursive self-improvement that specifically is crossing the gap to superintelligence.”

So far, when these AIs have acted against the interests of humans, the motives haven’t exactly been alien. [...] “Sydney” [...] famously flew into a jealous rage after failing to break up New York Times journalist Kevin Roose’s marriage. But even if jealous ex-girlfriends were alien and incomprehensible, there’s the inconvenient fact that currently available techniques do a reasonably good job of addressing this problem. ChatGPT currently has 700 million weekly active users, and overtly hostile behavior like Sydney’s is vanishingly rare.

Describing Sydney’s behavior as human is… certainly not how I would describe it.

But also, does Collier really think she knows Sydney’s motives? Does she think that Sydney was actually jealous, in the same way a human is, and that her words were drawn out of a desire for Kevin Roose’s love? I’m not sure it makes sense to describe Sydney as an agent at all, much less one with human motives.

Regardless, I would also not describe our current techniques as doing a reasonably good job unless “good job” involves the occasional Mecha Hitler, AI-driven psychosis/weirdness, and quietly commenting out unit tests. Yes, none of these are overtly hostile in the way that Sydney was, but overt hostility is hardly the only sign of things gone wrong.

Again, linking to some deeper discussion of this:

  • Aren’t developers regularly making their AIs nice and safe and obedient?
  • Doesn’t the Claude chatbot show signs of being aligned?

And more to the point…

Yudkowsky and Soares might respond that we shouldn’t expect the techniques that worked on a relatively tiny model from 2023 to scale to more capable, autonomous future systems. I’d actually agree with them. But it is at the very least rhetorically unconvincing to base an argument for future danger on properties of present systems without ever mentioning the well-known fact that present solutions exist.

It is not a “well-known fact” that we have solved alignment for present LLMs. If Collier believes otherwise, I am happy to make a bet and survey some alignment researchers. (Or cite the work coming out of Anthropic, Redwood Research, Apollo, or any of the other great places doing empirical work.)

Yudkowsky and Soares don’t believe that studying [current] systems can teach us anything at all.

Why does Collier think this? Am I missing something? The book is chock-full of examinations and examples from current systems. The authors call interpretability researchers “heroes” for a reason. One of the main points of the book is that the mess that we see is a demonstration of how far away a mature understanding of alignment is. That’s an important takeaway!

They’re also wrong about alchemists. Real early modern alchemy texts are painfully specific and process oriented. There is no significant tonal difference between Isaac Newton, physicist, and Isaac Newton, alchemist: less grand theorizing, more lab notes. Sometimes these works sound florid to modern ears, but that's also true of early modern century scientific writing and for that matter Eliezer Yudkowsky.

I don’t think Yudkowsky and Soares are picking on alchemists’ tone; I think they’re picking on the combination of knowledge of specific processes and ignorance of general principles that led to hubris in many cases.

[The intelligence explosion] plays a major role in the fictionalized description of a possible AI takeover scenario that makes up chapters seven through nine (and which has some elements that sound plausible, even if it does also include the Yudkowsky classic "tiny molecular machines with the strength of diamond and corresponding mechanical advantages in their speed and resilience").

I’m pretty confused by this description of the Part 2 scenario.

First of all, there is no nanotech in the takeover scenario prior to the AI having full control of the planet, unless you count the transistors in the computer or the drugs and biological viruses that the AI develops. It plays no role in the AI’s takeover whatsoever; from a human perspective, it doesn’t matter for the story. (Though the feasibility of nanotech might matter for other scenarios.)

More importantly, the scenario also doesn’t feature anything like an intelligence explosion until after the AI has full control of the planet. As I noted earlier, this was deliberately done to show that those kinds of developments, plausible though they are, are not load-bearing.

The scenario does start with the discovery of a new scaling law, similar to breakthroughs like the transformer architecture and chain-of-thought reasoning. But as described in the book and online resources, this isn’t load-bearing either.

More Was Possible

I don’t think IABIED is a perfect book. I don’t think Yudkowsky is a perfect communicator or figurehead or whatever. I want a thousand books about the risks, with a variety of different tones and styles and perspectives. I want our friends and allies and fellow rationalists to point out all the issues and help us do even better.

But I also want us to not let the perfect be the enemy of the good. There is space to take issue with Yudkowsky’s level of confidence[2] while still using this opportunity to call for the world to notice that the risks of our current path are unacceptably high. I was hoping that Asterisk’s review would get the details right and help our cause, but instead I feel sad and confused.[3] I welcome comments and responses to help me understand where I’m wrong.

  1. ^

    Yes, it would cause Nvidia's stock price to drop, and presumably the stock prices of other tech companies. Even if those stocks tank, I do not think it's fair to describe this as tanking the global economy. Those high stock prices reflect the market's expectations of gains that have not yet been realized, and I think my point about counterfactual losses not being the same as an actual increase in poverty stands.

  2. ^

    I, and some others at MIRI, are more optimistic than Eliezer. There’s lots of room for allies to disagree.

  3. ^

    And maybe a little angry around the edges. Sorry if any of that crept in. I’m genuinely trying hard not to be antagonistic, and want us to work together instead of getting caught up in infighting. 😔