My personal experience has been that most people outside LW rapidly or even immediately get the idea that "building something non-human that's much smarter than humans and whose beneficent motives you're not certain of" is a bad idea. It's not complicated, and it has been the plot of many SF movies over the decades. I think most people also get that the traditional ending, in which the hero saves the day by coming up with a paradox or an expression of the human heart that the machines cannot comprehend, is a pretty implausible way for that movie to end. If you need an additional argument, try "How do you think the chimps feel about the appearance of the human race? We don't bear them any specific ill will, but we did basically take over the world, cut down most of the jungles, and then put them in zoos, because we could."
[This post is not meant for the rationalist audience directly, because it has already internalized this lesson, but is to be used as an analogy in discourse with an unfamiliar crowd. The end goal is to make people intuitively understand why AI is extremely risky whether or not it has much agency, by burning through a lot of inferential steps quickly. I take it as a given that p(doom) is high. I also assume that it doesn't take that much more intelligence than that of a very intelligent human in order to significantly increase the probability of doom. Note: by “AI” I often mean “ASI”.]
I believe one of the fundamental misunderstandings in AI discourse has to do with anthropomorphizing AI too much. We give a sense of agency to AI when we refer to it as “malignant AI”, or when we talk of “deceptive alignment”. It’s not that giving a sense of agency is bad; it is necessary and helpful in many ways. But in other ways, I think it actively inhibits people from understanding the sheer aura of threat that something more intelligent than they are would exhibit (regardless of its agency or intentions). If you linger on agency, you point people in the direction of wondering why AI would want to kill everyone; they then stray in the direction of Skynet, and conclude that there is no reason in particular why AI would hate them. Uh oh, now the burden of proof is suddenly on you to explain orthogonality and instrumental convergence, which costs you precious and finite inferential steps.
Before you even get to agency—before you let people think of reasons why AI might want them dead—you must underline why capabilities, in and of themselves, put the burden of proof on those who claim AI is safe by default. You cannot jump the gun on this: it must be made clear that any claim that a superintelligence is not inherently dangerous rests on a very strong assumption, and is therefore to be immediately investigated and treated with high suspicion.
Hopefully the following analogy will serve as a useful intuition pump.
tl;dr: There is a certain string of words, in the possibility space of all strings, which flawlessly explains how to kill all humans. This string, if it were printed out onto paper, would be a major infohazard to humanity. It would have to be kept away from as many agents as possible, because if a single person uncovered it, the p(doom) of humanity would shoot up by a non-trivial amount. It would not make sense to take it as a given that the average person is sufficiently good or intelligent not to use the string, and thus to leave the string in place. If 1 out of every 100 people is sufficiently bad or stupid, and that person uncovers the string, everyone dies. A p(doom) of 1% represents, in expectation, more deaths than WW2 caused, which is not to be sneezed at.[1] Equivocating on the type of person that might or might not use the string is madness. When you uncover the String of Death, you destroy its substrate and Obliviate yourself, instead of making grand claims about the kindness of the human heart (or worse, the common sense of the human brain).
An artificial intelligence that is capable of locating the String of Death in possibility space is just as much an infohazard as the String of Death printed in book form. If AI labs expect 1 in 100 AIs intelligent enough to locate the String of Death to be misaligned with humanity, then that is the same 1% chance of extinction as earlier. I do not want us to skip the lesson about capabilities and argue about whether the chance of misalignment is 1% or 10%, when a 1% chance already represents 10 Holocausts' worth of damage. The burden of proof is on you to demonstrate that spawning the String of Death out of possibility space is not, in fact, a Bad Idea.[2]
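To make the scale of that 1% concrete, here is a quick back-of-the-envelope calculation. The population and casualty figures below are rough round numbers, not precise statistics, and the exact multiples shift depending on which estimates you prefer; the point is only the order of magnitude.

```python
# Back-of-the-envelope scale check (all figures are rough assumptions)
world_population = 8_000_000_000   # ~8 billion people alive today
p_doom = 0.01                      # the 1% chance discussed above

expected_deaths = p_doom * world_population
print(f"Expected deaths at 1% p(doom): {expected_deaths:,.0f}")  # ~80 million

ww2_deaths = 75_000_000        # common estimates run roughly 70-85 million
holocaust_deaths = 6_000_000   # ~6 million

# Roughly one WW2, or on the order of ten Holocausts, in expectation;
# the exact multiple depends on which casualty estimates you plug in.
print(f"Relative to WW2:           {expected_deaths / ww2_deaths:.1f}x")
print(f"Relative to the Holocaust: {expected_deaths / holocaust_deaths:.0f}x")
```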
2,335 steps
You enter a room with a single wooden table at its center. There is a dusty 300-page tome on it. You walk up to the table and open its cover, to find that it is titled How to take over the world in 2,335 easy steps.[3] Below, you can make out the subtitle “January 9th 2024 edition: new and improved!” Flipping through the pages, you rapidly skim the book and come across items like ‘publish the following paper in this journal’, ‘invest X dollars in Y’, and ‘send the following email to this public figure’.
The book is extremely precise about how you, personally, should go about taking over the world: it tells you to send email X at precisely 2:34 PM next Tuesday, for example, before going on to describe 12 more steps and then sternly admonishing you to take a one-hour lunch break (which is, to be specific, step 1,476).
The book knows everything that is publicly available on the internet: every detail about the world of January 9th that one could reasonably gather is acknowledged (though not always in explicit terms) in the book. Indeed, whilst you note that the grimoire is too short to go over every variable that runs the world, you discover with awe that any variable you think of that is not in the book does not end up actually hindering the Master Plan. The plan was somehow devised in such a way that every possible roadblock on the road to taking over the world is swiftly avoided, even though the book doesn’t mention those roadblocks explicitly! It’s as if someone had treated entire industries, countries, technologies and individuals as black boxes, while also being able to perfectly predict them.[4] Additionally, the book supplies probabilities of success for each step. Like a choose-your-own-adventure story, it redirects you toward fallback plans whenever a step fails. The book is not omniscient so much as it is extraordinarily well-informed and well-organized.
You formulate the hypothesis that whatever intelligence wrote this tome must have gone over every variable in the world, before turning the relevant ones into near-perfectly predictable black boxes, for your convenience.
So, you are utterly shocked when you discover (in the copyright section of the book) that it was not written by an intelligence at all! Rather, the book had been found lying on the floor in the Library of Babel, and Bob (the janitor) had picked it up and placed it on the table while mopping. A flawless plan for taking over the world starting January 9th 2024 exists by definition, after all, in the countably infinite halls of Babel.[5] This book has always existed and forever will, and its presence in your hands is an astronomical coincidence. How lucky, you think, as you pocket the book, return home, and know precisely what you’re going to do at 2:34 PM next Tuesday. (We are also supposing that a magical omniscient genie has taken one glance at the Book and told you that everything inside is accurate.)
The Book of Death
Meanwhile, Babel contains every possible sequence of letters. This necessarily includes the perfect plan you illegally pocketed (you forgot to renew your library card), but also the perfect plan for killing all humans.[6] Imagine Babelspace as an abstract space laid over reality. “Intelligence” here means being able to explore Babelspace while restricting the options as aggressively as possible, leaving smaller and smaller degrees of uncertainty in the quest to find any given Book. The most basic search algorithm through Babelspace might be one that picks a Book at random for you to read and, if that doesn't work, simply tries again. The most advanced algorithm, on the other hand, is able to extract maximal information from its own searching, because it is capable of extraordinarily wide and accurate generalization.[7]
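To make the contrast between those two ends of the spectrum concrete, here is a deliberately crude toy sketch. “Babelspace” is shrunk down to every string of a fixed length, and the “generalizing” searcher is handed a per-position feedback oracle so it can learn something from each failed attempt; that oracle is an artificial convenience with no real-world analogue, and the sketch is only meant to show how fast a possibility space collapses once a searcher can extract information from its own failures.

```python
import random
import string

# Toy "Babelspace": every string of length LENGTH over this alphabet.
ALPHABET = string.ascii_lowercase + " "
TARGET = "the book of death"   # stand-in for the Book we are trying to locate
LENGTH = len(TARGET)

def random_search(max_tries=100_000):
    """Most basic algorithm: pick a Book at random, check it, repeat.
    Learns nothing from any failed attempt."""
    for tries in range(1, max_tries + 1):
        guess = "".join(random.choice(ALPHABET) for _ in range(LENGTH))
        if guess == TARGET:
            return tries
    return None  # with 27^17 possible strings, it essentially never succeeds

def informed_search():
    """Toy 'generalizing' searcher: a per-position feedback oracle tells it
    which characters are still wrong, so every attempt shrinks the remaining
    possibility space instead of starting over from scratch."""
    guess = [random.choice(ALPHABET) for _ in range(LENGTH)]
    tries = 0
    for i in range(LENGTH):
        while guess[i] != TARGET[i]:      # oracle: this position is still wrong
            guess[i] = random.choice(ALPHABET)
            tries += 1
    return tries

if __name__ == "__main__":
    print("informed search:", informed_search(), "attempts")   # typically a few hundred
    print("random search:  ", random_search(), "attempts (None = gave up)")
```

In this toy setup the gap only widens as the space grows: the random searcher's expected time scales with the size of Babelspace itself, while the oracle-assisted searcher scales with the length of the string times the size of the alphabet.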
We’ll call the book that perfectly describes how to kill all humans as quickly as possible “The Book of Death”. One of the main goals humanity is pursuing—or should be pursuing—is to avoid the appearance of the Book of Death.
Everything is approximation. If you do not have a magical genie floating around, you cannot tell with certainty whether you have found the true form of the Book of Death. There do exist cheap knockoffs of the Book of Death, which might take a few more days to cause total extinction, or might contain a spelling mistake on page 56. The better you are at combing possibility space, the clearer your image of the true Book of Death, and the more shams you can discard.
If I were to personally supply my best guess for what was in the Book of Death, I might come up with something like travelling to Moscow with a speech prepared for Vladimir Putin, in order to convince him to empty his nuclear arsenal, or I might imagine something to do with taking a course on bio-engineering. Meanwhile, I am part of a species that has already located "The Book of Moth Death", and deploys the vast and sinister arcane powers of that tome regularly. Moths cannot even begin to fathom a path through possibility space that would allow them to do anything like this.
Given the same information, the more intelligent you are, the higher the quality of your edition of the Book of Death.
The algorithm that can produce the Book of Death is inherently an infohazard
The Book of Death is an infohazard. You could even argue it is the worst infohazard possible (although I wouldn’t underestimate the Library of Babel).[8] If a fissure in the ground opened and the voice of a hundred flaming demons hissed “Behold, the Book of Death” before handing the book to you and returning to the 9th circle of Hell, the rational course of action for humanity would be to burn that book (if it’s even flammable) so that no one can read it. It doesn’t matter that the average person is good, deep down, or even (hypothetically speaking) that the average person is rational enough not to follow the instructions in a book labeled The Book of Death: a dummies’ guide to single-handedly precipitating human extinction. In fact, even if the book had appeared in dath ilan, where there is such a thing as thorough global coordination and where, say, everyone were a perfect benevolent Bayesian reasoner, it would still be a good idea to burn the book, even if you’re almost absolutely certain that no one would pick it up, or open it, or follow the instructions.[9]
An algorithm which can locate the Book of Death in possibility space is to be treated just as cautiously as the copy the demons handed you. There is a difference, in that the kind of AI current labs are aiming for can instantly locate the Book of Curing Cancer and the Book of Building a Perfectly Managed Economy just as well as it can locate the Book of Death. There's good reason to want to build this kind of algorithm at some point, whereas there's no good reason at all to keep a demonic book. But as was the case with the book, even a small chance that the agent controlling it is bad can have disastrous consequences. The Book of Death is far from its vacuum state: it is highly unstable. A single person finding the book increases the risk that they open it and follow the instructions, setting off a chain reaction. It really doesn't take much agency at all to end the world, if you have the right knowledge. So if there's any doubt at all about the agency behind an algorithm capable of spawning Books several orders of magnitude more powerful than our own, you should stop doing what you're doing. The burden of proof is on you.
This is not an instance where I want to hear you say "oops".
Only counting current population, of course. Extinction by AI is qualitatively different from mere catastrophe in a few different ways, which include "everything humans consider of value disappears" but also "your children, friends, and loved ones all get annihilated".
There are a few technical assumptions I'm making here, like that the AI could locate the string almost instantly and that we couldn't stop it while it is searching. Another assumption is that a String of Death exists at all, even though that string must by definition show the AI how to get through whatever safeguards we have put in place (and the more safeguards you have, the more intelligent the AI must be to locate the string, since the string is necessarily more complex). It shouldn't be difficult to imagine, though, that there exists somewhere in possibility space a sequence of words which would convince Vladimir Putin to unleash thermonuclear hell on the world. And there exists without a doubt a string containing the genetic information required to build the deadliest possible virus for humans.
How many steps would really be required in order to take over the world? I don't know, you tell me!
E.g. while the nation of Liechtenstein is certainly part of the world you are taking over, the book makes no explicit mention of it because it is (correctly) assumed that previous steps in the plan already implicitly cover “taking control of Liechtenstein”.
In the original short story by Borges, the library is actually finite: the narrator mentions that every book has the same length and is written with the same small alphabet, so there are only finitely many distinct books.
To give a specific example, there is one Book in Babelspace that contains the full genome sequence of the optimally dangerous biological virus for Homo sapiens. It would sit at the Pareto frontier between incubation time, deadliness, virality, etc. Babelspace also contains the optimal cure for cancer, and the optimal path to civilisational adequacy, which is important.
For more exploration of this concept, see Yudkowsky’s That Alien Message, and this post about whether a superintelligence could deduce general relativity from two frames of an apple falling (no).
“Triggering vacuum Decay in 3 easy steps”, “Tiling the entire observable and unobservable universe with tiny useless rhombuses using FTL Von Neumann probes in 3 easy steps”, “A beginner’s guide to sending a perfectly offensive email to the Simulators so that they shut down the simulation”, “Tiling the universe with 100 trillion Shrikes”.
The original dath ilan isn’t full of perfect Bayesian reasoners. That doesn’t matter to this argument.