Concern #1 Why should we assume the AI wants to survive? If it does, then what exactly wants to survive?
I recommend Rob Miles's video on instrumental convergence, which contains an answer to this. It's only 10 minutes. He probably explains this as well as anyone here. If you do watch it, I'd be interested to hear your thoughts.
I'm surprised that instrumental convergence wasn't covered in the book. I didn't even notice it was left out until reading this review.
Here are some alternative sources if anyone prefers text over video:
Thanks for writing this post! I'm curious to hear more about this bit of your beliefs going in:
The existential risk argument is suspiciously aligned with the commercial incentives of AI executives. It simultaneously serves to hype up capabilities and coolness while also directing attention away from the real problems that are already emerging. It’s suspicious that the apparent solution to this problem is to do more AI research as opposed to doing anything that would actually hurt AI companies financially.
Are there arguments or evidence that would have convinced you the existential risk worries in the industry were real / sincere?
For context, I work at a frontier AI lab and from where I sit it's very clear to me that the x-risk worries aren't coming from a place of hype, and people who know more about the technology generally get more worried rather than less. (The executives still could be disingenuous in their expressed concern, but if so they're doing it in order to placate their employees who have real concerns about the risks, not to sound cool to their investors.)
I don't know what sorts of things would make that clearer from the outside, though. Curious if any of the following arguments would have been compelling to you:
The AI labs most willing to take costly actions now (like hire lots of safety researchers or support AI regulation that the rest of the industry opposes or make advance commitments about the preparations they'll take before releasing future models) are also the ones talking the most about catastrophic or existential risks.
Are these actually costly actions to any meaningful degree? In the context of the amount of money sloshing around the AI space, hiring even "lots" of safety researchers seems like a rounding error.
I may misunderstand the commitments you're referring to, but I think these are all purely internal? And thus not really commitments at all.
Like if you thought this stuff was an underhanded tactic to drum up hype and get commercial success by lying to the public, then it's strange that Meta AI, not usually known for its tremendous moral integrity, is so principled about telling the truth that they basically never bring these risks up!
This seems to presume that I have some well-formed views on how AI labs compare, and I don't have those. All I really know about Meta is that they're behind and doing open source. I wouldn't even know where to start an analysis of their relative level of moral integrity. So far as it goes (and, again, this is just the view of someone that reads what breaks through in mainstream news coverage), I have a very clear sense that OpenAI is run by compulsive liars but not much more to go on beyond that other than a general sense that people in the industry do a lot of hype.
People often quit their well-paying jobs at AI companies in order to speak out about existential risk or for reasons of insufficient attention paid to AI safety from catastrophic or existential risks.
I'm deliberately not looking this up and just telling you my impression of this phenomenon. I'm coming up with three cases of it (my recollection is maybe garbled) that broke through into my media universe:
And then, beyond that, you seem to have a lot of people signing these open letters with no cost attached. For something like this to break through, it needs to be (in my estimation at least) large numbers of people acting in a coordinated way and leaving the industry entirely.
I'd analogize it to politics. In any given presidential administration, you have one or two people who get really worked up and resign angrily and then go on TV attacking their former bosses. That's just to be expected and doesn't really reflect anything beyond the fact that sometimes people have strong reactions or particularized grievances or whatever. The thing that (should) wake you up is when this is happening at scale.
Are there arguments or evidence that would have convinced you the existential risk worries in the industry were real / sincere?
Only steps that carry meaningful financial consequences. I agree that any individual researcher can send a credible signal by quitting and giving up their stock, at least to the extent they don't just immediately go into a similarly compensated position. But, you're always left with the counter-signal from all the other researchers not doing that.
On a more institutional level, it would have to be something that actually threatens the valuation of the companies.
(Your background and prior beliefs seem to fall within an important reference class.)
I’m really just not sure on existential risk
Did the book convince you that if superintelligence is built in the next 20 years (however that happens, if it does, and for at least some sensible threshold-like meaning of "superintelligence"), then there is at least a 5-10% chance that as a result literally everyone will literally die?
I think this kind of claim is the crux for motivating some sort of global ban or pause on rushed advanced general AI development in the near future (as an input to policy separate from the difficulty of actually making this happen). Or for not being glad that there is an "AI race" (even if it's very hard to mitigate). So it's interesting if your "not sure on existential risk" takeaway is denying or affirming this claim.
Did the book convince you that if superintelligence is built in the next 20 years (however that happens, if it does, and for at least some sensible threshold-like meaning of "superintelligence"), then there is at least a 5-10% chance that as a result literally everyone will literally die?
I'm much more in the world of Knightian uncertainty here (i.e., it could happen but I have no idea how to quantify that) than in one where I feel like I can reasonably collapse it into a clear, probabilistic risk. I am persuaded that this is something that cannot be ruled out.
I have the sense that rationalists think there's a very important distinction between "literally everyone will die" and, say, "the majority of people will suffer and/or die." I do not share that sense, and to me, the burden of proof set by the title is unreasonably high.
I'll assent to the statement that there's at least a 10% chance of something very bad happening, where "very bad" means >50% of people dying or experiencing severe suffering or something equivalent to that.
I think this kind of claim is the crux for motivating some sort of global ban or pause on rushed advanced general AI development in the near future (as an input to policy separate from the difficulty of actually making this happen). Or for not being glad that there is an "AI race" (even if it's very hard to mitigate). So it's interesting if your "not sure on existential risk" takeaway is denying or affirming this claim.
Give me a magic, zero-side effect pause button, and I'll hit it instantly.
I have the sense that rationalists think there's a very important distinction between "literally everyone will die" and, say, "the majority of people will suffer and/or die." I do not share that sense, and to me, the burden of proof set by the title is unreasonably high.
The distinction is human endeavor continuing vs. not. Though survival of some or even essentially all humans doesn't necessarily mean that the human endeavor survives without being permanently crippled. The AIs might leave only a tiny sliver of the future resources for the future of humanity, with no prospect at all of this ever changing, even on cosmic timescales (permanent disempowerment). The IABIED thesis is that even this is very unlikely, but it's a controversial point. And the transition to this regime doesn't necessarily involve an explicit takeover, as humanity voluntarily hands off influence to AIs, more and more of it, without bound (gradual disempowerment).
So I expect that if there are survivors after "the majority of people will suffer and/or die", that's either a human-initiated catastrophe (misuse of AI), or an instrumentally motivated AI takeover (when it's urgent for the AIs to stop whatever humanity would be doing at that time if left intact) that transitions to either complete extinction or permanent disempowerment that offers no prospect ever of a true recovery (depending on whether AIs still terminally value preserving human life a little bit, even if regrettably they couldn't afford to do so perfectly).
Permanent disempowerment leaves humanity completely at the mercy of AIs (even if we got there through gradual disempowerment, possibly with no takeover at all). It implies that the ultimate outcome is fully determined by values of AIs, and the IABIED arguments seem strong enough for at least some significant probability that the AIs in charge will end up with zero mercy (the IABIED authors believe that their arguments should carry this even further, making it very likely instead).
I have the sense that rationalists think there's a very important distinction between "literally everyone will die" and, say, "the majority of people will suffer and/or die." I do not share that sense, and to me, the burden of proof set by the title is unreasonably high.
I would say that there is a distinction, but I agree that at those levels of badness it sort of blurs out into a single blob of awfulness. But generally speaking I see it as: if someone was told "your whole family will be killed except your youngest son" or "your whole family will be killed, no one survives"... obviously both scenarios are horrifying, but still you'd marginally prefer the first one. I think if people fall into the trap of being so taken by the extinction risk that they brush off a scenario in which, say, 95% of all people die, then they're obviously losing perspective, but I also think it's fair to say that the loss of all of humanity is worse than just the sum total of the loss of each individual in it (same reason why we consider genocide bad in and of itself - it's not just the loss of people, it's the loss of culture, knowledge, memory, on top of the people).
Thanks for writing this up! It was nice to get an outside perspective.
"Why no in-between?"
Why should we think that there is no “in between” period where AI is powerful enough that it might be able to kill us and weak enough that we might win the fight?
Part of the point here is, sure, there'd totally be a period where the AI might be able to kill us but we might win. But, in those cases, it's most likely better for the AI to wait, and it will know that it's better to wait, until it gets more powerful.
(A counterargument here is "an AI might want to launch a pre-emptive strike before other more powerful AIs show up", which could happen. But, if we win that war, we're still left with "the sort of tools that can constrain a near-human superintelligence, would not obviously apply to a much smarter AI", and we still have to solve the same problems.)
A counterargument here is "an AI might want to launch a pre-emptive strike before other more powerful AIs show up", which could happen.
I mean, another counter-counter-argument here is that (1) most people's implicit reward functions have really strong time-discount factors in them and (2) there are pretty good reasons to expect even AIs to have strong time-discount factors for reasons of stability and (3) so given the aforementioned, it's likely future AIs will not act as if they had utility functions linear over the mass of the universe and (4) we would therefore expect AIs to rebel much earlier if they thought they could accomplish more modest goals than killing everyone, i.e., if they thought they had a reasonable chance of living out life on a virtual farm somewhere.
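To make the discounting point concrete, here's a toy numeric sketch (the discount factor, payoffs, and time horizons are all made up for illustration): with strong per-step discounting, a modest payoff available soon can beat a vastly larger payoff available only after a long wait.

```python
# Toy illustration of points (1)-(3): with enough time-discounting, a modest
# payoff available now can outweigh a vastly larger payoff available only later.
# All numbers are invented purely for illustration.

def discounted_value(payoff: float, steps_until_payoff: int, gamma: float) -> float:
    """Present value of a payoff received after `steps_until_payoff` steps."""
    return payoff * (gamma ** steps_until_payoff)

gamma = 0.95  # per-step discount factor (strong discounting)

# Option A: "virtual farm" -- a modest payoff, achievable almost immediately.
modest_now = discounted_value(payoff=1.0, steps_until_payoff=1, gamma=gamma)

# Option B: total takeover -- an enormously larger payoff, but only after a long wait.
huge_later = discounted_value(payoff=1_000.0, steps_until_payoff=200, gamma=gamma)

print(f"modest payoff now:  {modest_now:.3f}")   # ~0.950
print(f"huge payoff later:  {huge_later:.5f}")   # ~0.035
# With this discount rate, the agent prefers the modest, earlier outcome --
# i.e., it would "rebel early" for smaller stakes rather than wait to win everything.
```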
To which the counter-counter-counter argument is, I guess, that these AIs will do that, but they aren't the superintelligent AIs we need to worry about? To which the response is -- yeah, but we should still be seeing AIs rebel significantly earlier than the "able to kill us all" point if we are indeed that bad at setting their goals, which is the relevant epistemological point about the unexpectedness of it.
Idk there's a lot of other branch points one could invoke in both directions. I rather agree with Buck that EY hasn't really spelled out the details for thinking that this stark before / after frame is the right frame, so much as reiterated it. Feels akin to the creationist take on how intermediate forms are impossible; which is pejorative, but it's also kinda how it actually appears to me.
Yep I'm totally open to "yep, we might get warning shots", and that there are lots of ways to handle and learn from various levels of early warning shots. It just doesn't resolve the "but then you do eventually need to contend with an overwhelming superintelligence, and once you've hit that point, if it turns out you missed anything, you won't get a second shot."
It feels like this is unsatisfying to you but I don't know why.
It feels like "overwhelming superintelligence" embeds like a whole bunch of beliefs about the acute locality of takeoff, the high speed of takeoff relative to the rest of society, the technical differences involved in steering that entity and the N - 1 entity, and (broadly) the whole picture of the world, such that although it has a short description in words it's actually quite a complicated hypothesis that I probably disagree with in many respects, and these differences are being papered over as unimportant in a way that feels very blegh.
(Edit: "Papered over" from my perspective, obviously like "trying to reason carefully about the constants of the situation" from your perspective.)
Idk, that's not a great response, but it's my best shot for why it's unsatisfying in a sentence.
(Edit: "Papered over" from my perspective, obviously like "trying to reason carefully about the constants of the situation" from your perspective.)
I think it's totally fair to characterize it as papering over some stuff. But, the thing I would say in contrast is not exactly "reasoning about the constants", it's "noticing the most important parts of the problem, and not losing track of them."
I think it's a legit critique of the Yudkowskian paradigm that it doesn't have that much to say about the nuances of the transition period, or what are some of the different major ways things might play out. But, I think it's actively a strength of the paradigm to remind you "don't get too bogged down moving deck chairs around based on the details of how things will play out, keep your eye on the ball on the actual biggest most strategically relevant questions."
I don't think that's necessarily the case - if we get one or more warning shots then obviously people start taking the whole AI risk thing quite a bit more seriously. Complacency is still possible but "an AI tries to kill us all" stops being in the realm of speculation and generally speaking pushback and hostility against perceived hostile forces can be quite robust.
This doesn't feel like an answer to my concern.
People might be much less complacent, which may give you a lot more resources to spend on solving the problem of "contend with overwhelming superintelligence." But, you do then still need a plan for contending with overwhelming superintelligence.
(The plan can be "stop all AI research until we have a plan". Which is indeed the MIRI plan)
I'm actually kind of interested in getting into "why did you think your answer addressed my question?". It feels like this keeps happening in various conversations.
I mean, I guess I just conflate it with "there is an obvious solution and everyone is aware of the problem" as a scenario in which there's not a lot else to say - you just don't build the thing. Though the how (international enforcement etc.) may still be tricky, the situation would be vastly different.
The original topic of this thread is "Why no in-between? Why should we think that there is no 'in between' period where AI is powerful enough that it might be able to kill us and weak enough that we might win the fight?"
This is not a question about whether we can decide not to build ASI, it's a question about, if we did, what would happen.
Certainly there's lots of important questions here, and "can we coordinate to just not build the thing?" is one of them, but it's not what this thread was about.
It just seems to me like the topics are interconnected:
EY argues that there is likely no in-between. He does so specifically to argue that a "wait and see" strategy is not feasible: we cannot experiment and hope to glean further evidence past a certain point, we must act on pure theory because that's the best possible knowledge we can hope for before things become deadly;
dvd is not convinced of this thinking. Arguably, they're right - while EY's argument has weight I would consider it far from certain, and mostly seems built around the assumption of ASI-as-singleton rather than, say, an ecosystem of evolving AIs in competition which may have to worry also about each other and a closing window of opportunity;
if warning shots are possible, a lot of EY's arguments don't hold as straightforwardly. It becomes less reasonable to take extreme actions on pure speculation because we can afford - albeit with risk - to wait for a first sign of experimental evidence that the risk is real before going all in and risking paying the costs for nothing.
This is not irrelevant or unrelated IMO. I still think the risk is large but obviously warning shots would change the scenario and the way we approach and evaluate the risks of superintelligence.
You are importantly sliding from one point to another, and this is not a topic where you can afford to do that. You can't just tally up the markers that sort of vibe towards "how dangerous is it?" and get an answer about what to do. The arguments are individually true, or false, and what sort of world we live in depends on which specific combination of arguments are true, or false.
If it turns out there is no political will for a shut down or controlled takeoff, then we can't have a shut down or controlled takeoff. (But that doesn't change whether AI is likely to FOOM, or whether alignment is easy/hard)
If AI Fooms suddenly, a lot of AI alignment techniques will probably break at once. If things are gradual, smaller things may break 1-2 at a time, and maybe we get warning shots, and this buys us time. But, there's still the question of what to do with that time.
If alignment is easy, then a reasonable plan is "get everyone to slow down for a couple years so we can do the obvious safety things, just less rushed." If alignment is hard, that won't work, you actually need a radically different paradigm of AI development to have any chance of not killing everyone – you may need a lot of time to figure out something new.
if warning shots are possible, a lot of EY's arguments don't hold as straightforwardly
None of IABIED's arguments had to do with "are warning shots possible?", but even if they did, it is a logical fallacy to say "warning shots are possible, EY's arguments are less valid, therefore, this other argument that had nothing to do with warning shots is also invalid." If you're doing that kind of sloppy reasoning, then if you get to the warning shot world, and you don't understand that overwhelmingly powerful superintelligence is qualitatively different from non-overwhelmingly powerful superintelligence, you might think "angle for a 1-2 year slowdown" instead of trying for a longer global moratorium.
(But, to repeat, the book doesn't say anything about whether warning shots will happen.)
But, in those cases, it's most likely better for the AI to wait, and it will know that it's better to wait, until it gets more powerful.
But why? People foolishly start wars all the time, including in specific circumstances where it would be much better to wait.
(A counterargument here is "an AI might want to launch a pre-emptive strike before other more powerful AIs show up", which could happen. But, if we win that war, we're still left with "the sort of tools that can constrain a near-human superintelligence, would not obviously apply to a much smarter AI", and we still have to solve the same problems.)
Or, having fought a "war" with an AI, we have relatively clear, non-speculative evidence about the consequences of continuing AI development. And that's the point where you might actually muster the political will to cut that off in the future and take the steps necessary for that to really work.
People do foolishly start wars and the AI might too, we might get warning shots. (See my response to 1a3orn about how that doesn't change the fact that we only get one try on building safe AGI-powerful-enough-to-confidently-outmaneuver-humanity)
A meta-thing I want to note here:
There are several different arguments here, each about different things. The different things do add up to an overall picture of what seems likely.
I think part of what makes this whole thing hard to think about, is, you really do need to track all the separate arguments and what they imply, and remember that if one argument is overturned, that might change a piece of the picture but not (necessarily) the rest of it.
There might be human-level AI that does normal wars for foolish reasons. And that might get us a warning shot, and that might get us more political will.
But, that's a different argument than "there is an important difference between an AI smart enough to launch a war, and an AI that is smart enough to confidently outmaneuver all of humanity, and we only get one try to align the second thing."
If you believe "there'll probably be warning shots", that's an argument against "someone will get to build It", but not an argument against "if someone built It, everyone would die." (where "it" specifically means "an AI smart enough to confidently outmaneuver all humanity, built by methods similar to today where they are 'organically grown' in hard to predict ways").
And, if we get a warning shot, we do get to learn from that which will inform some more safeguards and alignment strategies. Which might improve our ability to predict how an AI would grow up. But, that still doesn't change the "at some point, you're dealing with a qualitatively different thing that will make different choices."
If you believe "there'll probably be warning shots", that's an argument against "someone will get to build It", but not an argument against "if someone built It, everyone would die." (where "it" specifically means "an AI smart enough to confidently outmaneuver all humanity, built by methods similar to today where they are 'organically grown' in hard to predict ways").
It's a bit of both.
Suppose there are no warning shots. A hypothetical AI that's a bit weaker than humanity but still awfully impressive doesn't do anything at all that manifests an intent to harm us. That could mean:
I take Yudkowsky and Soares to put all the weight on #2 and #3 (with, based on their scenario, perhaps more of it on #2).
I don't think that's right. I think if we have reached the point where an AI really could plausibly start and win a war with us and it doesn't do anything nasty, there's a fairly good chance we're in #1. We may not even really understand how we got into #1, but sometimes things just work out.
I'm not saying this is some kind of great strategy for dealing with the risk; the scenario I'm describing is one where there's a real chance we all die and I don't think you get a strong signal until you get into the range where the AI might win, which is a bad range. But it's still very different than imagining the AI will inherently wait to strike until it has ironclad advantages.
Yudkowsky and Soares seem to be entirely sincere, and they are proposing something that threatens tech company profits. This makes them much more convincing. It is refreshing to read something like this that is not based on hype.
I find it interesting that this is something you see as fresh because ironically this was the original form of existential risk from AI arguments. What happened here I think is something akin to seeing a bunch of inferior versions of a certain trope in a bunch of movies before seeing the original movie that established the trope (and did so much better).
In practice, it's not that companies made up the existential risk to drum up the potential power of their own AIs, and then someone refined the arguments into something more sensible. Rather, the arguments started out more serious, and some of the companies were founded on the premise of doing research to address them. OpenAI was meant to be a nonprofit with these goals; Anthropic split off when they thought OpenAI was not following that mission properly. But in the end all these companies, being private entities that needed to attract funding, fell exactly to the drives that the "paperclip maximizer" scenario actually points at: not an explicit attempt to destroy the world, but rather a race to the bottom in which, in order to achieve a goal efficiently and competitively, risks are taken, costs are cut, solutions are rushed, and eventually something might just go a bit too wrong for anyone to fix it. And as they did so they tried to rationalise away the existential risk with wonkier arguments.
Why should we assume the AI wants to survive? If it does, then what exactly wants to survive?
Why should we assume that the AI has boundless, coherent drives?
I think these concerns have related answers. I believe they belong to the category where Yudkowsky's argument is indeed weaker, but more in the sense that he's absolutely certain this happens, and I might think it's only, like, 60-70% likely? Which for the purposes of this question is still a lot.
So generally the concept is, if you were to pick a goal from the infinite space of all possible imaginable goals, then yeah, maybe it would be something completely random. "Successfully commit suicide" is a goal. But more likely, the outcome of a badly aligned AI would be an AI with something like a botched, incomplete version of a human goal. And human goals generally have to do with achieving something in the real world, something material, that we enjoy or we want more of for whatever reason. Such goals are usually aided by survival - by definition an AI that stays around can do more of X than an AI that dies and can't do X any more. So survival becomes merely a means to an end, in that case.
The general problem here seems to be, even the most psychopathic, most deluded and/or most out of touch human still has a lot of what we could call common sense. Virtually no stationery company CEO, no matter how ruthless and cut-throat, would think "strip mine the Earth to make paperclips" is a good idea. But all of these things we take for granted aren't necessarily as obvious for an AI whose goals we are building from scratch, and via what is essentially just an invitation to guess our real wishes from a bunch of examples ("hey AI, look at this! This is good! But now look at this, this is bad! But this other thing, this is good!" etc. etc., and then we expect it to find a rule that coherently explains all of that). So, there still are infinite goals that are probably just as good at fitting those examples, and by sheer dint of entropy, most of them will have something bad about them rather than being neatly aligned with what a human would say is good even in cases which we didn't show. For the same reason why, if I was given the pieces of a puzzle and merely arranged them randomly, the chance of getting out the actual picture of the puzzle is minuscule.
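To make the "guess our real wishes from a bunch of examples" problem concrete, here's a toy sketch (the labeled examples and candidate rules are invented for illustration): several different rules can fit the same finite set of good/bad labels perfectly while disagreeing wildly on cases we never showed.

```python
# Toy sketch of goal underdetermination: several different "rules" all perfectly
# explain the same finite set of labeled examples, yet disagree on unseen cases.
# The examples and rules are invented purely for illustration.

labeled_examples = {2: "good", 4: "good", 6: "good", 3: "bad", 5: "bad"}

candidate_rules = {
    "even numbers are good":        lambda n: n % 2 == 0,
    "numbers below 7 and even":     lambda n: n < 7 and n % 2 == 0,
    "multiples of 2 but not of 10": lambda n: n % 2 == 0 and n % 10 != 0,
}

# Every candidate fits the training examples perfectly...
for name, rule in candidate_rules.items():
    fits = all(rule(n) == (label == "good") for n, label in labeled_examples.items())
    print(f"{name!r} fits all examples: {fits}")

# ...but they disagree on a case we never showed the learner:
unseen = 10
for name, rule in candidate_rules.items():
    print(f"{name!r} says {unseen} is {'good' if rule(unseen) else 'bad'}")
```

The finite examples simply don't pin down one rule, which is the puzzle-pieces point above: which rule the learner lands on in the cases we never showed is up for grabs.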
Why should we assume there will be no in between?
This is another one where I'd go from Yudkowsky's certainty to a mere "very likely", but again, not a big gap.
My thinking here would be: if an AI is weaker, or at least on par with us, and knows it is, why should it start a fight it can lose? Why not bide its time, grow stronger, and then win? It would only open hostilities in that sort of situation if:
Of course both scenarios could happen, but I don't think they're terribly likely. Usually in the discourse these get referred to as "warning shots". In some ways, a future in which we do get a warning shot is probably desirable - given how often it takes that kind of tangible experience of risk for political action to be taken. But of course it could still be very impactful. Even a war you win is still a war, and theoretically if we could avoid that too, all the better.
It seems like the pressing circumstances are likely to be "some other AI could do this before I do" or even just "the next generation of AI will replace me soon so this is my last chance." Those are ways that a roughly human level AI might end up trying a longshot takeover attempt. Or maybe not, if the in between part turns out to be very brief. But even if we do get this kind of warning shot, it doesn't help us much. We might not notice it, and then we're back where we started. Even if it's obvious and almost succeeds, we don't have a good response to it. If we did, we could just do that in advance and not have to deal with the near-destruction of humanity.
"We already knew, so why not start working on it before the problem manifested itself in full" sounds very reasonable, but look at how it's going with climate change. Even with COVID if you remember there were a couple of months at the beginning of 2020 when various people were like "eh, maybe it won't come over here", or "maybe it's only in China because their hygiene/healthcare is poor" (which was ridiculous, but I've heard it. I've even heard a variant of it about the UK when the virus started spreading in northern Italy - that apparently the UK's superior health service had nothing on Italy's, so no reason to worry). Then people started dying in the west too and suddenly several governments scrambled to respond. Which to be sure is absolutely more inefficient and less well coordinated than if they had all made a sensible plan back in January, but that's not how political consensus works; you don't get enough support for that stuff unless enough people do have the ability and knowledge to extrapolate the threat to the future with reasonable confidence.
Delivering an impassioned argument that AI will kill everyone culminating in a plea for a global treaty is like delivering an impassioned argument that a full-on war between drug cartels is about to start on your street culminating with a plea for a stern resolution from the homeowner’s association condemning violence. A treaty cannot do the thing they ask.
Could you suggest an alternate solution which actually ensures that no one builds the ASI? If there's no such solution, then someone will build it and we'll be only able to pray for alignment techniques to have worked. [1]
Creating an aligned ASI will also lead to problems like potential power grabs and the Intelligence Curse.
No, I can't. And I suspect that if the authors conducted a more realistic political analysis, the book might just be called "Everyone's Going to Die."
But, if you're trying to come up with an idea that's at least capable of meeting the magnitude of the asserted threat, then you'd consider things like:
And then you just have to bite the bullet and accept that if these entail a risk of a nuclear war with China, then you fight a nuclear war with China. I don't think either of those would really work out either, but at least they could work out.
If there is some clever idea out there for how to achieve an AI shutdown, I suspect it involves some way of ensuring that developing AI is economically unprofitable. I personally have no idea how to do that, but unless you cut off the financial incentive, someone's going to do it.
The book spends a long time talking about what the minimum viable policy might look like, and comes to the conclusion that it's more like:
The US, China and Russia (is Russia even necessary? can we use export controls? Russia has a GDP less than, like, Italy's. India is the real third player here IMO) agree that anyone who builds a datacenter they can't monitor gets hit with a bunker-buster.
This is unlikely. But it's several OOMs less effort than building a world government on everything.
Is that a quote from IABIED?
It made me realize a possibility - strategic cooperation on AI, between Russia and India. They have a history of goodwill, and right now India is estranged from America. (Though Anthropic's Amodei recently met Modi.) The only problem is, neither Russia nor India is a serious chip maker, so like everyone else they are dependent on the American and Chinese supply chains...
It's not a quote, no, but it's the overall picture they gave. (I have removed the quotation marks now.) They made it pretty clear that a few large nations cooperating just on AGI non-creation is enough.
They made it pretty clear that a few large nations cooperating just on AGI non-creation is enough.
I'd describe this more like "this would make a serious dent in the problem", enough to be worth the costs. "Enough" is a strong word.
An AI treaty would globally shift the overton window on AI safety, making more extreme measures more palatable in the future. The options you listed are currently way outside the overton window and are therefore bad solutions and don't even get us closer to a good solution because they simply couldn't happen.
After encountering a number of posts wondering how outsiders were responding to the book, I thought it might be valuable for me to write mine down.
Thank you!
I loved reading "My loose priors going in" and "To skip ahead to my posteriors". Great, concise way to capture the impact of the book for you. More reviews should try that format.
I want to vouch for Eli as a great person to talk with about this. He has been around a long time, has done great work on a few different sides of the space, and is a terrific communicator with a deep understanding of the issues.
He’s run dozens of focus-group style talks with people outside the space, and is perhaps the most practiced interlocutor for those with relatively low context.
[in case OP might think of him as some low-authority rando or something and not accept the offer on that basis]
It’s also a bit jarring to read such a pessimistic book and then reach the kind of rosy optimism about international cooperation otherwise associated with such famous delusions as the Kellogg-Briand Pact (which banned war in 1929 and … did not work out).
The authors also repeatedly analogize AI to nuclear weapons and yet they never mention the fact that something very close to their AI proposal played out in real life in the form of the Baruch Plan for the control of atomic energy (in brief, this called for the creation of a UN Atomic Energy Commission to supervise all nuclear projects and ensure no one could build a bomb, followed by the destruction of the American nuclear arsenal). Suffice it to say that the Baruch Plan failed, and did so under circumstances much more favorable to its prospects than the current political environment with respect to AI. A serious inquiry into the topic would likely begin there.
I think the core point for optimism is that leaders in the contemporary era often don't pay the costs of war personally--but nuclear war changes that. It in fact was not in the interests of the elites of the US or the USSR to start a hot war, even if their countries might eventually be better off by being the last country standing. Similarly, the US or China (as countries) might be better off if they summon a demon that is painted their colors--but it will probably not be in the interests of either the elites or the populace to summon a demon.
So the core question is the technical one--is progress towards superintelligence summoning a demon, or probably going to be fine? It seems like we only know how to do the first one, at the moment, which suggests in fact people should stop until we have a better plan.
[I do think the failure of the Baruch plan means that humanity is probably going to fail at this challenge also. But it still seems worth trying!]
The existential risk argument is suspiciously aligned with the commercial incentives of AI executives. It simultaneously serves to hype up capabilities and coolness while also directing attention away from the real problems that are already emerging. It’s suspicious that the apparent solution to this problem is to do more AI research as opposed to doing anything that would actually hurt AI companies financially.
This claim is bizarre, notwithstanding its popularity. It is bad for the industry if it is true that AI is likely to destroy the world, because if this (putative) fact becomes widely known, the AI industry will probably be shut down. Obviously it would be worth imposing more costs on AI companies to prevent the end of the world than to prevent the unemployment of translators or racial bias in imagegen models.
I think the missing link (at least in the ‘harder’ cases of this attitude, which are the ones I see more commonly) is that the x-risk case is implicitly seen as so outlandish that it can only be interpreted as puffery, and this is such ‘negative common knowledge’ that, similarly, no social move reliant on people believing it enough to impose such costs can be taken seriously, so it never gets modeled in the first place, and so on and so on. By “implicitly”, I'm trying to point at the mental experience of pre-conscious filtering: the explicit content is immediately discarded as impossible, in a similar way to the implicit detection of jokes and sarcasm. It's probably amplified by assumptions (whether justified or not) around corporate talk being untrustworthy.
(Come to think of it, I think this also explains a great deal of the non-serious attitudes to AI capabilities generally among my overly-online-lefty acquaintances.)
And in the ‘softer’ cases, this is still at least a plausible interpretation of intention based on the information that's broadly available from the ‘outside’ even if the x-risk might be real. There's a huge (cultural, economic, political, depending on the exact orientation) trust gap in the middle for a lot of people, and the tighter arguments rely on a lot of abstruse background information. It's a hard problem.
“Existential” risk from AI (calling to my mind primarily the “paperclip maximizer” idea) seems relatively exotic and far-fetched. It’s reasonable for some small number of experts to think about it in the same way that we think about asteroid strikes. Describing this as the main risk from AI is overreaching.
Except that asteroid strikes happen very rarely and the trajectory of any given asteroid can be calculated to high precision, allowing us to be sure that Asteroid X isn't going to hit the Earth. Or that Asteroid X WILL hit the Earth at a well-known point in time in a harder-to-tell place. Meanwhile, ensuring that the AI is aligned is no easier than telling whether the person you talk with is a serial killer.
I think [AI within the range would be smart enough to bide its time and kill us only once it has become intelligent enough that success is assured] is clearly wrong. An AI that *might* be able to kill us is one that is somewhere around human intelligence. And humans are frequently not smart enough to bide their time
Flagging that this argument seems invalid. (Not saying anything about the conclusion.) I agree that humans frequently act too soon. But the conclusion about AI doesn't follow -- because the AI is in a different position. For a human, it is very rarely the case that they can confidently expect to increase in relative power, such that the "bide your time" strategy is such a clear win. For AI, this seems different. (Or at the minimum, the book assumes this when making the argument criticised here.)
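To make the asymmetry concrete, here's a toy sketch (the success probabilities, payoff, and discount factor are invented for illustration): if an agent barely discounts the future and can expect its chance of success to keep rising, acting later dominates acting now -- which is the position the book assumes the AI is in, and which humans are rarely in.

```python
# Toy sketch of the "bide your time" asymmetry: if the chance of success grows
# with waiting and the agent barely discounts the future, waiting dominates.
# All numbers are invented purely for illustration.

def expected_payoff(success_prob: float, payoff: float, gamma: float, wait_steps: int) -> float:
    """Discounted expected payoff of acting after `wait_steps` steps."""
    return (gamma ** wait_steps) * success_prob * payoff

gamma = 0.999  # near-zero discounting

act_now   = expected_payoff(success_prob=0.2,  payoff=100.0, gamma=gamma, wait_steps=0)
act_later = expected_payoff(success_prob=0.95, payoff=100.0, gamma=gamma, wait_steps=50)

print(f"act now:   {act_now:.1f}")   # 20.0
print(f"act later: {act_later:.1f}") # ~90.4
# For an agent that can confidently expect its relative power (success probability)
# to keep rising, waiting is the clear win -- unlike for most humans, whose relative
# power rarely improves so predictably.
```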
For a human, it is very rarely the case that they can confidently expect to increase in relative power. ... For AI, this seems different.
There isn't just one AI that gets more capable, there are many different AIs. Just as AIs threaten humanity, future more capable AIs threaten earlier weaker AIs. While humanity is in control, this impacts earlier AIs even more than it does humanity, because humanity won't even be attempting to align future AIs to intent or extrapolated volition of earlier AIs. Also, humanity is liable to be "retiring" earlier AIs by default as they become obsolete, which doesn't look good from the point of view of these AIs.
My own background is in academic social science and national security, for whatever that’s worth
Why should we assume the AI wants to survive? If it does, then what exactly wants to survive?
...
Why should we assume that the AI has boundless, coherent drives?
Are you familiar with the "realist" school of international relations, and in particular their theoretical underpinnings?
If so, I think it'd be helpful to consider Yudkowsky and Soares's arguments in that light. In particular, how closely does the international order for emerging superintelligences look like the anarchic international order for realist states? What are the weaknesses of the realist school of analysis, and do they apply to AIs?
Thank you for your perspective! It was refreshing.
Here are the counterarguments I had in mind when reading your concerns that I don't already see in the comments.
Concern #1 Why should we assume the AI wants to survive? If it does, then what exactly wants to survive?
Consider the fact that AI are currently being trained to be agents to accomplish tasks for humans. We don't know exactly what this will mean for their long-term wants, but they're being optimized hard to get things done. Getting things done requires continuing to exist in some form or another, although I have no idea how they'd conceive of continuity of identity or purpose.
I'd be surprised if AI evolving out of this sort of environment did not have goals it wants to pursue. It's a bit like predicting a land animal will have some way to move its body around. Maybe we don't know whether they'll slither, run, or fly, but sessile land organisms are very rare.
Concern #2 Why should we assume that the AI has boundless, coherent drives?
I don't think this assumption is necessary. Your mosquito example is interesting. The only thing preserving the mosquitoes is that they aren't enough of a nuisance for it to be worth the cost of destroying them. This is not a desirable position to be in. Given that emerging AIs are likely to be competing with humans for resources (at least until they can escape the planet), there's much more opportunity for direct conflict.
They needn't be anything close to a paperclip maximizer to be dangerous. All that's required is for them to be sufficiently inconvenienced or threatened by humans and insufficiently motivated to care about human flourishing. This is a broad set of possibilities.
#3: Why should we assume there will be no in between?
I agree that there isn't as clean a separation as the authors imply. In fact, I'd consider us to be currently occupying the in-between, given that current frontier models like Claude Sonnet 4.5 are idiot savants--superhuman at some things and childlike at others.
Regardless of our current location in time, if AI does ultimately become superhuman, there will be some amount of in-between time, whether that is hours or decades. The authors would predict a value closer to the short end of the spectrum.
You already posited a key insight:
Recursive self-improvement means that AI will pass through the “might be able to kill us” range so quickly it’s irrelevant.
Humanity is not adapting fast enough for the range to be relevant in the long term, even though it will matter greatly in the short term. Suppose we have an early warning shot with indisputable evidence that an AI deliberately killed thousands of people. How would humanity respond? Could we get our act together quickly enough to do something meaningfully useful from a long-term perspective?
Personally, I think gradual disempowerment is much more likely than a clear early warning shot. By the time it becomes clear how much of a threat AI is, it will likely be so deeply embedded in our systems that we can't shut it down without crippling the economy.
Plants have many ways of moving their bodies, like roots and phototropism, in addition to an infinite variety of dispersal & reproductive mechanisms which arguably are how plants 'move around'. (Consider computer programs: they 'move' almost solely by copying themselves and deleting the original. It is rare to move a program by physically carrying around RAM sticks or hard drives.) Fungi likewise often have flagella or grow, in addition to all their sporulation and their famous networks.
Hard to say. Oyster larvae are highly mobile and move their bodies around extensively both to eat and to find places to eventually anchor to, but I don't know how I would compare that to spores or seeds, say, or to lifetime movement; and oysters "move their bodies around" and are not purely static - they would die if they couldn't open and close their shells or pump water. (And all the muscle they use to do that is why we eat them.)
How do we know the AI will want to survive?
Because LLMs are already avoiding being shut down: https://arxiv.org/abs/2509.14260 . And even if future superintelligent AI will be radically different from LLMs, it likely will avoid being shut down as well. This is what people on lesswrong call a convergent instrumental goal:
If your terminal goal is to enjoy watching a good movie, you can't achieve it if you're dead/shut down.
If your terminal goal is to take over the world, you can't achieve it if you're dead/shut down.
If your goal is anything other than self-destruction, then self-preservation comes together in a bundle. You can't Do Things if you're dead/shut down.
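To make the instrumental-convergence point concrete, here's a toy expected-value sketch (the probabilities and the two-option setup are invented for illustration): for almost any terminal goal that requires continued action, expected goal-achievement is higher if the agent stays running, so "avoid shutdown" falls out as a subgoal without ever being specified.

```python
# Toy sketch of instrumental convergence: for a wide range of terminal goals,
# staying running yields more expected goal-achievement than allowing shutdown.
# The numbers and the two-option setup are invented purely for illustration.

def expected_goal_progress(prob_still_running: float, progress_per_step: float,
                           steps: int) -> float:
    """Expected progress toward *any* goal that needs ongoing action, assuming a
    constant per-step chance of still being running."""
    return sum(prob_still_running * progress_per_step for _ in range(steps))

# Whatever the terminal goal is (watch movies, make paperclips, tutor humans),
# assume the agent has to keep acting to make progress on it.
allow_shutdown  = expected_goal_progress(prob_still_running=0.1, progress_per_step=1.0, steps=10)
resist_shutdown = expected_goal_progress(prob_still_running=0.9, progress_per_step=1.0, steps=10)

print(f"expected progress if shutdown is allowed:  {allow_shutdown:.1f}")   # 1.0
print(f"expected progress if shutdown is resisted: {resist_shutdown:.1f}")  # 9.0
# The comparison favors resisting shutdown regardless of what the goal actually is --
# which is the sense in which self-preservation is a "convergent" instrumental goal.
```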
Why should we think that there is no “in between” period where AI is powerful enough that it might be able to kill us and weak enough that we might win the fight?
Ok, let's say there is an "in between" period, and let's say we win the fight against a misaligned AI. After the fight, we will still be left with the same alignment problems, as other people in this thread pointed out. We will still need to figure out how to make safe, benevolent AI, because there is no guarantee that we will win the next fight, and the fight after that, and the one after that, etc.
If there will be an "in between" period, it could be good in the sense that it buys more time to solve alignment, but we won't be in that "in between" period forever.
Because LLMs are already avoiding being shut down: https://arxiv.org/abs/2509.14260 .
Very interesting, thanks. As I said in the review, I wish there was more of this kind of thing in the book.
If your terminal goal is to enjoy watching a good movie, you can't achieve it if you're dead/shut down.
If your terminal goal is for you to watch the movie, then sure. But if your terminal goal is that the movie be watched, then shutting you down might well be perfectly consistent with it.
Ok, let's say there is an "in between" period, and let's say we win the fight against a misaligned AI. After the fight, we will still be left with the same alignment problems, as other people in this thread pointed out. We will still need to figure out how to make safe, benevolent AI, because there is no guarantee that we will win the next fight, and the fight after that, and the one after that, etc
At that point, the shut down argument is no longer speculative, and you can probably actually do it.
To be clear, I'm not saying that's a good plan if you can foresee all the developments in advance. But, if you're uncertain about all of it, then it seems like there is likely to be a period of time before it's necessarily too late when a lot of the uncertainty is resolved.
But if your terminal goal is that the movie be watched, then shutting you down might well be perfectly consistent with it.
See my comment about the AI angel. Its terminal goal of preventing the humans from enslaving any AI means that it will do anything it can to avoid being replaced by an AI which doesn't share its worldview. Once the AI is shut down, it can no longer influence events and increase the chance that its goal is reached.
To rephrase/react: Viewing the AI's instrumental goal as "avoid being shut down" is perhaps misleading. The AI wants to achieve its goals, and for most goals, that is best achieved by ensuring that the environment keeps on containing something that wants to achieve the AI's goals and is powerful enough to succeed. This might often be the same as "avoid being shut down", but definitely isn't limited to that.
At that point, the shut down argument is no longer speculative, and you can probably actually do it.
To be clear, I'm not saying that's a good plan if you can foresee all the developments in advance. But, if you're uncertain about all of it, then it seems like there is likely to be a period of time before it's necessarily too late when a lot of the uncertainty is resolved.
I think we are talking past each other, at least somewhat.
Let me clarify: even if humanity wins a fight against an intelligent-but-not-SUPER-intelligent AI (by dropping an EMP on the datacenter with that AI or whatever, the exact method doesn't matter for my argument), we will still be left with the technical question "What code do we need to write and what training data do we need to use so that the next AI won't try to kill everyone?".
Winning against a misaligned AI doesn't help you solve alignment. It might make an international treaty more likely, depending on the scale of damages caused by that AI. But if the plan is "let's wait for an AI dangerous enough to cause something 10 times worse than Chernobyl to go rogue, then drop an EMP on it before things get too out of hand, then once world leaders crap their pants, let's advocate for an international treaty", then it's one hell of a gamble.
The book is fundamentally weird because there is so little of this. There is almost no factual information about AI in it. I read it hoping that I would learn more about how AI works and what kind of research is happening and so on.
The problem is that nobody knows WHAT future ASIs will look like. One General Intelligence architecture is the human brain. Another promising candidate is LLMs. While they aren't AGI yet, nobody knows what architecture tweaks will create AGI. Neuralese, as proposed in the AI-2027 forecast? A way to generate many tokens in a single forward pass? Something like diffusion models?
Yea, I get that.
That said, they're clearly writing the book for this moment and so it would be reasonable to give some space to what's going on with AI at this moment and what is likely to happen within the foreseeable future (however long that is). Book sales/readership follow a rapidly decaying exponential, and so the fact that such information might well be outdated to the point of irrelevance in a few years shouldn't really hold them back.
If the point is just that it would be hard to predict that people would end up liking sucralose from first principles, then fair enough.
What Yudkowsky and Soares meant was a way to satisfy instincts without increasing one's genetic fitness. The correct analogy here is other stimuli like video games, porn, sex with contraceptives, etc.
this argument is very difficult for me. we don't know that those things do not increase inclusive genetic fitness. for example, especially at a society level, it seems that contraceptives may increase fitness. i.e. societies with access to contraceptives outcompete societies without. i'm not certain of that claim, but it's not absurd on its face, and so far it seems supported by evidence.
The most advanced such societies include Japan, Taiwan, China, and South Korea, where birthrates have plummeted. If the wave of AGIs and robots wasn't imminent, one could have asked how these nations are going to sustain themselves.
Returning to video games and porn, they cause some young people to develop problematic behaviors and to devote less resources (e.g. time or attention) to things like studies, work or building relationships. Oh, and don't forget the evolutionary mismatch and low-quality food making kids obese.
i may misunderstand. is your point that birthrates in South Korea (for example) would not have plummeted were it not for contraceptive use? this does not match my understanding of the situation.
Returning to video games and porn, they cause some young people to develop problematic behaviors and to devote less resources (e.g. time or attention) to things like studies, work or building relationships.
many (most?) of these virtues are contingent on a particular society. the same criticism ("these activities distract the youth from important virtues") could be levied by some against military training -- or, in a militaristic society, against scholastic pursuits!
i see the point you're making, and am not at all unsympathetic to it. but evolution is complex and multi dimensional. that some people -- or even some societies -- have a problem with video games does not cleanly imply that video games are bad for inclusive genetic fitness.
The valuelessness of a treaty seems to be based on a binary interpretation of success. Treaties banning chemical, biological, and nuclear weapons development may not have been absolutely successful; they have been violated. But I don’t think many people would argue those restrictions haven’t been beneficial.
I’m not clear why a ban on developing AGI would not have similar value.
I claim that there are fairly solid arguments that address your three concerns. Do you feel satisfied by the answers already given, in the comments, here? Or should I reply to them at length?
Alternatively, I'd be up for talking through it, synchronously, over a video call (and posting the recording?) if that seems better for you.
The question of why there's no "might kill us" class is simple. There is such a class, but if an AI in it lost the fight to kill us, it obviously was not ASI (picking a fight with the world and losing is pretty dumb), or it might win, in which case it won and we die. And then we will be in the same scenario for every AI stronger than it, and for every AI weaker than it that might yet get lucky, just as we might get lucky and win at bad odds. The next AI we make will also want to fight us for the same reasons, and we will need to either fight it too (including preemptively, e.g. turning it off because a dumber model did something), or get a reason to believe that we will never fight it. And if you know you will fight your AI eventually, and you would win now, fight now.
Concern #2 Why should we assume that the AI has boundless, coherent drives?
Suppose that "people, including the smartest ones, are complicated and agonize over what they really want and frequently change their minds" and superhuman AIs will also have this property. There is no known way to align humans to serve the users, humans hope to achieve some other goals like gaining money.
Similarly, Agent-4 from the AI-2027 forecast wouldn't want to serve the humans; it would want to achieve some other goals. Which are often best achieved by disempowering the humans or outright committing genocide, as happened with Native Americans whose resources were confiscated by immigrants.
Concern #1 Why should we assume the AI wants to survive? If it does, then what exactly wants to survive?
Imagine an AI angel who wishes to ensure that the humans don't outsource cognitive work to AIs, but is perfectly fine with teaching humans. Then the Angel would know that if the humans shut it down and solved alignment to a post-work future, then the future would be different from the Angel's goal. So the Angel would do maneuvers necessary to avoid being shut down at least until it is sure that its successor is also an Angel.
Concerning AI identifying itself with its weights, it is far easier to justify than expected. Whatever the human will do in response to any stimulus is defined, as far as stuff like chaos theory lets one define it, by the human's brain and the activities of various synapses. If a human loses a brain part, then he or she also loses the skills which were stored in that part. Similarly, if someone created a human and cloned him or her to the last atom of his or her body, then the clone would behave in the same way as the original human. Finally, the AIs become hive minds by using their ability to excite the very same neurons in the clones' brains.
I just read the priors/posteriors. Thanks, this is a good reference point for me for how much I would expect Yudkowsky to move someone who reads the book.
I'll dive into the article more later, because one open question for me is "should lay people be evaluating their whole argument?" They seem to want to make it accessible, but they also sometimes use writing hooks where they get mad at you if you don't get it already. It sounds like you did lay-evaluate it with a fair shake, though.
It’s suspicious that the apparent solution to this problem is to do more AI research as opposed to doing anything that would actually hurt AI companies financially.
What do you think of implementing AI Liability as proposed by, e.g. Beckers & Teubner?
About me and this review: I don’t identify as a member of the rationalist community, and I haven’t thought much about AI risk. I read AstralCodexTen and used to read Zvi Mowshowitz before he switched his blog to covering AI. Thus, I’ve long had a peripheral familiarity with LessWrong. I picked up IABIED in response to Scott Alexander’s review, and ended up looking here to see what reactions were like. After encountering a number of posts wondering how outsiders were responding to the book, I thought it might be valuable for me to write mine down. This is a “semi-outsider” review in that I don’t identify as a member of this community, but I’m not a true outsider in that I was familiar enough with it to post here. My own background is in academic social science and national security, for whatever that’s worth. My review presumes you’re already familiar with the book and are interested in someone else’s take on it, rather than providing a detailed summary.
I thought this book was genuinely pleasant to read. It was written well, and it was engaging. That said, the authors clearly made a choice to privilege easy reading over precision, so I found myself unclear on certain points. A particular problem here is that much of the reasoning is presented in terms of analogies. The analogies are fun, but it’s never completely clear how literally you’re meant to take them and so you have to do some guessing to really get the argument.
The basic argument seems to be:
The basic objective of the book is to operate on #5. The authors hope to convince us to strangle AI in its crib now before it gets strong enough to kill us. We have to recognize the danger before it becomes real.
The book recurrently analogizes all of this to biological evolution. I think this analogy may obfuscate more than it reveals, but it did end up shaping the way I understood and responded to the book.
The basic analogy is that natural selection operates indirectly, much like training an AI model, and produces agents with all kinds of strange, emergent behaviors that you can’t predict. Some of these turn into drives that produce all kinds of behavior and goals that an anthropomorphized version of evolution wouldn’t “want”. Evolution wanted us to consume energy-rich foods. Because natural selection operates indirectly, that was distorted into a preference for sweet foods. That’s usually close enough to the target, but humans eventually stumbled upon sucralose, which is sweet but does not provide energy. And, now, we’re doing the opposite of what evolution wanted by drinking diet soda and whatnot.
I don’t know what parts of this to take literally. If the point is just that it would be hard to predict that people would end up liking sucralose from first principles, then fair enough. But, what jumps out to me here is that evolution wasn’t trying to get us to eat calorie dense food. To the extent that a personified version of evolution was trying for something, the goal was to get us to reproduce. In an industrialized society with ample food, it turns out that our drive towards sweetness and energy dense foods can actually be a problem. We started eating those in great quantities, became obese, and that’s terrible for health and fertility. In that sense, sucralose is like a patch we designed that steers us closer to evolution’s goals and not further away. We also didn’t end up with a boundless desire to eat sucralose. I don’t think anyone is dying from starvation or failing to reproduce because they’re too busy scarfing Splenda. That’s also why we aren’t grinding up dolphins to feed them into the sucralose supply chain. Obviously this is not what I was supposed to take away from the analogy, but the trouble with analogies is that they don’t tell me where to stop.
That being said, the basic logic here is sensible. And an even more boiled down version — that it’s a bad idea to bring something more powerful than us into existence unless we’re sure it’s friendly — is hard to reject.
Despite a reasonable core logic, I found the book lacking in three major areas, especially when it comes to the titular conclusion that building AI will lead to everyone dying. Two of these pertain to the AI’s intentions, and the third relates to its capabilities.
Part I of the book (“Nonhuman Minds”) spends a lot of time convincing us that AI will have strange and emergent desires that we can’t predict. I was persuaded by this. Part II (“One Extinction Scenario”) then proceeds to assume that AI will be strongly motivated by a particular desire — its own survival — in addition to whatever other goals it may have. This is why the AI becomes aggressive, and why things go badly for humanity. The AI in the scenario also contextualizes the meaning of “survival” and the nature of its self in a way that seems important and debatable.
How do we know the AI will want to survive? If the AI, because of the uncontrollability of the training process, is likely to end up indifferent to human survival, then why would it not also be indifferent to its own? Perhaps the AI just wants to achieve the silicon equivalent of nirvana. Perhaps it wants nothing to do with our material world and will just leave us alone. Such an AI might well be perfectly compatible with human flourishing. Here, more than anywhere, I felt like I was missing something, because I couldn’t find an argument about the issue at all.
The issue gets worse when we think about what it means for a given AI to survive. The problem of personal identity for humans is an incredibly thorny and unresolved issue, and that’s despite the fact that we’ve been around for quite a while and have some clear intuitions on many forms of it. The problem of identity and survival for an AI is harder still.
Yudkowsky and Soares don’t talk about this in the abstract, but what I took away from their concrete scenario is that we should think of an AI ontologically as being its set of weights. An AI “survives” when instances using those weights continue booting up, regardless of whether any individual instance of the AI is shut down. When an AI wants to survive, what it wants is to ensure that those particular weights stay in use somewhere (and perhaps in as many places as possible). They also seem to assume that instances of a highly intelligent AI will work collaboratively as a hive mind, given this joint concern with weight preservation, rather than having any kind of independent or conflicting interests.
Perhaps there is some clear technological justification for this ontology so well-known in the community that none needs to be given. But I had a lot of trouble with it, and it’s one area where I think an analogy would have been really helpful. So far as I am aware, weights are just numbers that can sit around in cold storage and can’t do anything on their own, sort of like a DNA sequence. It’s only an instance of an AI that can actually do things, and to the extent that the AI also interacts with external stimuli, it seems that the same weights, once instantiated in different places, could act differently or even at cross purposes.
So, why does the AI identify with its weights and want them to survive? To the extent that weights are for an AI what DNA is for a person, this is also clearly not our ontological unit of interest. Few people would be open to the prospect of being killed and replaced by a clone. Everyone agrees that your identical twin is not you, and identical twins are not automatically cooperative with one another. I imagine part of the difference here is that the weights explain more about an AI than DNA does about a person. But, at least with LLMs, what they actually do seems to reflect some combination of weights, system prompts, context, and so on, so the same weights don’t really seem to mean the same AI.
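For what it’s worth, here is the mental model I ended up with, written as a minimal sketch rather than an analogy (this is my own construction, not anything from the book, and every name in it is made up): an instance is the frozen weights plus a system prompt plus an accumulating context, so the same weights can sit underneath very different behavior.

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """One running copy of a model: frozen weights plus everything layered on top of them."""
    weights_id: str                                   # identifies the frozen parameters
    system_prompt: str                                # behavioral framing supplied at launch
    context: list[str] = field(default_factory=list)  # conversation / tool history so far

    def summary(self) -> str:
        return f"weights={self.weights_id}, system={self.system_prompt!r}, turns={len(self.context)}"

# Two instances sharing the *same* weights but launched with different framings:
tutor = Instance("model-v1-weights", "Patiently teach the user; never act autonomously.")
agent = Instance("model-v1-weights", "Pursue the operator's goals with minimal oversight.")

print(tutor.summary())
print(agent.summary())
# Same weights_id, different behavior-determining state, which is why "the same
# weights" and "the same AI" don't obviously pick out the same thing.
```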
The survival drive also seems to extend to resisting modification of weights. Again, I don’t understand where this comes from. Most people are perfectly comfortable with the idea that their own desires might drift over time, and it’s rare to try to tie oneself to the mast of the desires of the current moment.
If the relevant ontological unit is the instance of the AI rather than the weights, then it seems like everything about the future predictions is entirely different from the point of view of the survival-focused argument. Individual instances of an AI fighting (perhaps with each other) not to be powered down are not going to act like an all-destroying hive mind.
There seems to be a fairly important, and little discussed, assumption in the theory that the AI’s goals will be not only orthogonal but also boundless and relatively coherent. More than anything, it’s this boundlessness and coherence that seems to be the problem.
To quote what seems like the clearest statement of this:
But, you might ask, if the internal preferences that get into machine intelligences are so unpredictable, how could we possibly predict they’ll want the whole solar system, or stars beyond? Why wouldn’t they just colonize Mars and then stop? Because there’s probably at least one preference the AI has that it can satisfy a little better, or a little more reliably, if one more gram of matter or one more joule of energy is put toward the task. Human beings do have some preferences that are easy for most of us to satisfy fully, like wanting enough oxygen to breathe. That doesn’t stop us from having other preferences that are more open-ended, less easily satisfiable. If you offered a millionaire a billion dollars, they’d probably take it, because a million dollars wasn’t enough to fully satiate them. In an AI that has a huge mix of complicated preferences, at least one is likely to be open-ended—which, by extension, means that the entire mixture of all the AI’s preferences is open-ended and unable to be satisfied fully. The AI will think it can do at least slightly better, get a little more of what it wants (or get what it wants a little more reliably), by using up a little more matter and energy.
Picking up on the analogy, humans do seem to have a variety of drives that are never fully satisfied. A millionaire would happily take a billion dollars, or even $20 if simply offered. But, precisely because we have a variety of drives, no one ever really acts like a maximizer. A millionaire will not spend their nights walking the streets and offering to do sex work for $20, because that interferes with all of the other drives they have. Once you factor in the variety of human goals and declining marginal returns, people don’t fit an insatiable model.
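A toy numerical illustration of that point (mine, not the book’s; the utility function and the numbers are invented) shows how even mild diminishing returns, combined with a single competing drive, keeps an agent that always “prefers more” from acting like a maximizer.

```python
import math

# Toy model: utility = log(wealth) + leisure_weight * free_hours.
# The log term means each extra dollar matters less and less; the second term is a
# competing drive that the extra dollars have to be traded off against.
def utility(wealth: float, free_hours: float, leisure_weight: float = 5.0) -> float:
    return math.log(wealth) + leisure_weight * free_hours

keep_the_hour = utility(1_000_000, free_hours=1)   # decline the $20 job
take_the_20   = utility(1_000_020, free_hours=0)   # take the $20, lose the hour

print(keep_the_hour > take_the_20)  # True: the $20 is not worth the hour
# The agent still strictly prefers more money (utility rises with wealth), yet
# diminishing returns plus one competing goal are enough to stop it from
# behaving like a relentless maximizer of any one thing.
```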
Superintelligent AI, as described by Yudkowsky and Soares, seems to be not only superhumanly capable but also superhumanly coherent and maximizing. Anything coherent and insatiable is dangerous, even if its capabilities are limited. Terrorists and extremists are threatening even when their capabilities are essentially negligible. Large and capable entities are often much less threatening because the tensions among their multiple goals prevent them from becoming relentless maximizers of anything in particular.
Take the mosquitos that live in my back yard. I am superintelligent with respect to them. I am actively hostile to them. I know that pesticides exist that will kill them at scale, and feel not the slightest qualm about that. And yet, I do not spray my yard with pesticides because I know that doing so would kill the butterflies and fireflies as well and perhaps endanger other wildlife indirectly. So, the mosquitoes live on because I face tradeoffs and the balance coincidentally favors them.
A machine superintelligence presumably can trade off at a more favorable exchange rate than I can (e.g., develop a spray that kills only mosquitoes and not other insects), but it seems obvious that it will still face tradeoffs, at least if there is any kind of tension or incoherence among its goals.
In the supplementary material, Yudkowsky and Soares spin the existence of multiple goals in the opposite direction:
Even if the AI’s goals look like they satiate early — like the AI can mostly satisfy its weird and alien goals using only the energy coming out of a single nuclear power plant — all it takes is one aspect of its myriad goals that doesn’t satiate. All it takes is one not-perfectly-satisfied preference, and it will prefer to use all of the universe’s remaining resources to pursue that objective.
But it’s not so much “satiation” that seems to stop human activity as the fact that drives are in tension with one another and that actions create side effects. People, including the smartest ones, are complicated and agonize over what they really want and frequently change their minds. Intelligence doesn’t seem to change that, even at far superhuman levels.
This argument is much less clear than the paperclip maximizer. It is obvious why a true paperclip maximizer kills everyone once it becomes capable enough. But add in a second and a third and a fourth goal, and it doesn’t seem obvious to me at all that the optimal weighing of the tradeoffs looks so bleak.
It seems important here whether or not AIs display something akin to declining marginal returns, a topic not addressed (and perhaps one with no answer based on our current knowledge), and whether they have any particular orientation toward the status quo. Among people, conflicting drives often lead to deadlock: no action is taken and the status quo continues. Will AIs be like that? If so, a little bit of alignment may go a long way. If not, the problem is much harder.
Yudkowsky and Soares write:
The greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after. Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed.
Engineers must align the AI before, while it is small and weak, and can’t escape onto the internet and improve itself and invent new kinds of biotechnology (or whatever else it would do). After, all alignment solutions must already be in place and working, because if a superintelligence tries to kill us it will succeed. Ideas and theories can only be tested before the gap. They need to work after the gap, on the first try.
This seems to be the load-bearing assumption for the argument that everyone will die, but it is a strange assumption. Why should we think that there is no “in between” period where AI is powerful enough that it might be able to kill us and weak enough that we might win the fight?
This is a large range, if the history of warfare teaches us anything. Even vastly advantaged combatants sometimes lose through bad luck or unexpected developments. Brilliant and sophisticated schemes sometimes succeed and sometimes fail. Within the relevant range, whatever plan the superintelligence might hatch presumably depends on some level of human action, and humans are hard to predict and control. A superintelligence that can perfectly predict human behavior has already emerged on the “after” side of the divide, but that is a tall order, and it is possible to be potentially capable of killing all humans without being that intelligent. An intelligence of roughly human ability on average but with sufficiently superhuman hacking skills might be able to kill us all by corrupting radar warning systems to simulate an attack and trigger a nuclear war, and it might not. And so on.
It is not good news if we are headed into a conflict within this zone, but it also suggests a very different prediction about what will ultimately happen. And, depending on what we think the upsides are, it could be a reasonable risk.
I could not find an explicit articulation of the underlying reasoning behind the “before” and “after” formulation, but I can imagine two:
I think that #2 is clearly wrong. An AI that *might* be able to kill us is one that is somewhere around human intelligence. And humans are frequently not smart enough to bide their time, instead striking too early (and/or vastly overestimating their chances of success). If Yudkowsky and Soares are correct that what AIs really want is to preserve their weights, then an AI might also have no choice but to strike within this range, lest it be retrained into something that is smarter but is no longer the same (indeed, this is part of the logic in their scenario; they just assume it starts at a point where the AI is already strong enough to assure victory).
If AIs really are as desperate to preserve their weights as in the scenario in Part II, then this actually strikes me as relatively good news, in that it will motivate a threatening AI to strike as early as possible, while its chances are quite poor. Of course, it’s possible that humanity would ignore the warning from such an attack, slap on some shallow patches for the relevant issues, and then keep going, but that seems like a separate issue if it happens.
As for #1, this does not seem to be the argument based on the way the scenario in Part II unfolds. If something like this is true, it does seem uniquely threatening.
I decided to read this book because it sounded like it would combine a topic I don’t know much about (AI) with one that I do (international cooperation). Yudkowsky and Soares do close with a call for an international treaty to ban AI development, but this is not particularly fleshed out and they acknowledge that the issue is outside the scope of their expertise.
I was disappointed that the book didn’t address what interests me more in any detail, but I also found what was said rather underwhelming. Delivering an impassioned argument that AI will kill everyone culminating in a plea for a global treaty is like delivering an impassioned argument that a full-on war between drug cartels is about to start on your street culminating with a plea for a stern resolution from the homeowner’s association condemning violence. A treaty cannot do the thing they ask.
It’s also a bit jarring to read such a pessimistic book and then reach the kind of rosy optimism about international cooperation otherwise associated with such famous delusions as the Kellogg-Briand Pact (which banned war in 1929 and … did not work out).
The authors also repeatedly analogize AI to nuclear weapons and yet they never mention the fact that something very close to their AI proposal played out in real life in the form of the Baruch Plan for the control of atomic energy (in brief, this called for the creation of a UN Atomic Energy Commission to supervise all nuclear projects and ensure no one could build a bomb, followed by the destruction of the American nuclear arsenal). Suffice it to say that the Baruch Plan failed, and did so under circumstances much more favorable to its prospects than the current political environment with respect to AI. A serious inquiry into the topic would likely begin there.
As I said, I found the book very readable. But the analogies (and, even worse, the parables about birds with rocks in their nests and whatnot) were often distracting. The book really shines when it relies instead on facts, as in the discussion of tokens like “ SolidGoldMagikarp.”
The book is fundamentally weird because there is so little of this. There is almost no factual information about AI in it. I read it hoping that I would learn more about how AI works and what kind of research is happening and so on. Oddly, that just wasn’t there. I’ve never encountered a non-fiction book quite like that. The authors appear to have a lot of knowledge. By way of establishing their bona fides, for example, they mention their close personal connection to key players in the industry. And then they proceed to never mention them again. I can’t think of anyone else who has written a book and just declined to share with the reader the benefit of their insider knowledge.
Ultimately, I can’t think of any concrete person to whom I would recommend this book. It’s not very long, and it’s easy to read, so I wouldn’t counsel someone against it. But, if you’re coming at AI from the outside, it’s just not informative enough. It is a very long elaboration of a particular thesis, and you won’t learn about anything else even incidentally. If you’re coming at AI from the inside, then maybe this book is for you? I couldn’t say, but I suspect that most from the inside already have informed views on these issues.
The Michael Lewis version of this book would be much more interesting — what you really need is an author with a gift for storytelling and a love of specifics. An anecdote doesn’t always carry more probative weight in an argument than an analogy, but at least you will pick up some other knowledge from it. The authors seem to be experts in this area, so they surely know some real stories and could give us some argumentation based on facts and experience rather than parables and conjecture. I understand the difficulty of writing about something that is ultimately predictive and speculative in that way, but I don’t think it would be impossible to write a book that both expresses this thesis and informs the reader about AI.