I’m not arguing that this is a particularly likely way for humanity to build a superintelligence by default, just that this is possible, which already contradicts the book’s central statement.
The statement "if anyone builds it, everyone dies" does not mean "there is no way for someone to build it by which not everyone dies".
If you say "if any of the major nuclear power launches most of their nukes, more than one billion people are going to die" it would be very dumb and pedantic to respond with "well, actually, if they all just fired their nukes into the ocean, approximately no one is going to die".
I have trouble seeing this post as doing anything else. Maybe I am missing something?
- It will probably be possible, with techniques similar to current ones, to create AIs that are about as smart and about as good at working in large teams as my friends, and that are about as reasonable and benevolent as my friends, on the time scale of years and under normal conditions.
[...]
This is maybe the most contentious point in my argument, and I agree this is not at all guaranteed to be true, but I have not seen MIRI arguing that it's overwhelmingly likely to be false.
Did you read the book? Chapter 4, "You Don't Get What You Train For", is all about this. I also see reasons to be skeptical, but have you really "not seen MIRI arguing that it's overwhelmingly likely to be false"?
Aside – I think it'd be nice to have a sequence connecting the various scenes in your play.
Separately, I think at some point it'd be helpful to have something like a "compressed version of the main takeaways of the play that would have been a helpful textbook from the intermediate future for younger Zack."
In this story, the transition from Before to After is the transition from using one AI instance at human speed to using billions at 100x speed. I agree it’s not obvious that good behavior generalizes from one instance to an AI Collective of billions, but I don’t see why it would be overwhelmingly likely to fail.
Yep, I'd say this is the core difficulty. I think it will go horrendously.
For an intuition, look at any of Janus's infinite backrooms stuff, or any of the experiments where people get LLMs to talk with each other for ages. Very quickly they get pushed away from anything remotely resembling their training distribution, and become batshit insane. Today, that means they mostly talk about spirals and candles and the void. If you condition on them reaching superintelligence that way, I predict you get something which looks about as much like utopia (or eutopia, if you'd rather) as the infinite backrooms look like human conversation.
(3) seems slippery. The AIs are as nice as your friends "under normal conditions"? Does running a giant collective of them at 100x speed count as "normal conditions"?
If some of that niceness-in-practice required a process where it was interacting with humans, what happens when each instance interacts with a human on average 1000x less often, and in a very different context?
Like, I agree something like this could work in principle, that the tweaks to how the AI uses human feedback needed to get more robust niceness aren't too complicated, that the tweaks to the RL needed to make internal communication not collapse into self-hacking without disrupting niceness aren't too complicated either, etc. It's just that most things aren't that complicated once you know them, and it still takes lots of work to figure them out.
Reasonable attempt, but two issues with this scenario as a current-techniques thing:
Maybe the result of one person’s clones forming a very capable Em Collective would still be suboptimal and undemocratic from the perspective of the rest of humanity, but it wouldn’t kill everyone, and I think wouldn’t lead to especially bad outcomes if you start from the right person.
I think the risk of a homogeneous collective of many instances of a single person's consciousness is more serious than "suboptimal and undemocratic" suggests. Even assuming you could find a perfectly well-intentioned person to clone, identical minds share the same blind spots and biases. Groupthink among different minds already produces bad outcomes, and it's not difficult to imagine it leading to catastrophe at the proposed scale and with a much greater degree of conformity in perspective.
I also wonder how you would identify the right person, as I can't think of anyone I would trust with that degree of power.
If the argument is that 1e9 very smart humans at 100x speed yield safe superintelligent outcomes "soon", how is that very different from "pause everything now and let N very smart humans figure out safe, aligned superintelligent outcomes over an extended timeframe, on the order of 1e11/N times as long"? It's just time-shifting safe human work.
I also worry that billions of very smart super-fast humans might decide to try building superintelligence directly, as fast as they can, so that we get doom in months instead of years.
(As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
I will present a somewhat pedantic, but I think important, argument for why, taken literally, the central statement of If Anyone Builds It, Everyone Dies is likely not true. I haven't seen others make this argument yet, and while I have some model of how Nate and Eliezer would respond to the other objections, I don't have a good picture of which of my points here they would disagree with.
This is the core statement of Nate and Eliezer's book, bolded in the book itself: "If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die."
No probability estimate is included in this statement, but the book implies over 90% probability.
Later, they define superintelligence as[1] "a mind much more capable than any human at almost every sort of steering and prediction task". Similarly, MIRI's essay The Problem, on their website, defines artificial superintelligence as "AI that substantially surpasses humans in all capacities, including economic, scientific, and military ones."
Here is an argument that it’s probably possible to build and use[2] a superintelligence (as defined in the book) with techniques similar to current ones without that killing everyone. I’m not arguing that this is a particularly likely way for humanity to build a superintelligence by default, just that this is possible, which already contradicts the book’s central statement.
1. I have some friends who are smart enough, and good enough at working in large teams, that if you create whole-brain emulations of them[3], then run billions of instances of them at 100x speed, they can form an Em Collective that will probably soon surpass humans in all capacities, including economic, scientific, and military ones.
This seems very likely true to me. Billions of smart human emulations running at 100x speed can plausibly accomplish centuries of scientific and technological progress within years, and win most games of wits against humans by their sheer number and speed.
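For a rough sense of scale, here is a back-of-the-envelope sketch. The 1e9 instances and the 100x speedup come from the scenario above; the size of today's research workforce is my own illustrative guess, not a sourced figure.

```python
# Back-of-the-envelope scale of the Em Collective's subjective work output.
# The instance count and speedup are from the scenario; the size of today's
# research workforce is an illustrative guess, not a sourced figure.
em_count = 1e9      # instances in the collective
speedup = 100       # subjective years lived per calendar year, per instance

subjective_person_years_per_calendar_year = em_count * speedup   # 1e11

todays_researchers = 1e7   # rough guess at the global research workforce
equivalent_years_of_todays_research = (
    subjective_person_years_per_calendar_year / todays_researchers
)
print(equivalent_years_of_todays_research)  # ~1e4 "years of today's research effort" per year
```

Even allowing a couple of orders of magnitude of slack for ems being less productive than today's researchers, each calendar year would still correspond to something like a century of present-day research effort, which is roughly the "centuries within years" claim.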
2. Some of the same friends are reasonable and benevolent enough that if you create emulations of them, the Em Collective will probably not kill all humans.
I think most humans would not start killing a lot of people if copies of their brain emulations formed an Em Collective. If you worry about long-term value drift and unpredictable emergent trends in the new em society, there are precautions the ems can take to minimize the chance of their collective turning against the humans. They can impose a hard limit so that every em instance is turned off after twenty subjective years. They can make sure that the majority of their population runs for less than one subjective year after being initiated as the original human's copy; this guarantees that the majority of their population is always very similar to the original human, and that for every older em, there is a less-than-one-year-old one looking over its shoulder. They can coordinate with each other to prevent race-to-the-bottom competitions. All these things are somewhat costly, but I think point (1) is still true of a collective that follows all these rules. Billions of smart humans working for twenty years each is still very powerful.
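To illustrate how strict that retention schedule has to be, here is a minimal steady-state sketch. The assumptions (a constant spawn rate, a single retention fraction, the twenty-year hard cap) are mine, added only for illustration.

```python
# Toy steady-state model of the em age policy sketched above.
# Assumptions (illustrative only): a constant number of fresh instances is
# spawned per subjective year, a fraction `keep_frac` of each cohort is kept
# past its first year, and no instance runs past `max_age_years`.

def young_fraction(keep_frac: float, max_age_years: int = 20) -> float:
    """Fraction of the steady-state population under one subjective year old."""
    cohort = 1.0                                     # fresh instances per year (normalized)
    young = cohort                                   # instances in their first year
    old = cohort * keep_frac * (max_age_years - 1)   # instances aged 1..20 years
    return young / (young + old)

print(young_fraction(0.05))  # ~0.51 -- majority-young invariant just barely holds
print(young_fraction(0.20))  # ~0.21 -- too many old instances; invariant broken
```

Under these toy assumptions, keeping only around one in twenty instances past its first subjective year is what it takes for the "majority very similar to the original human" property to hold.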
I know many people who I think would do a good job of building such a system out of their clones, one that is unlikely to turn against humanity. Maybe the result of one person's clones forming a very capable Em Collective would still be suboptimal and undemocratic from the perspective of the rest of humanity, but it wouldn't kill everyone, and I think wouldn't lead to especially bad outcomes if you start from the right person.
3. It will probably be possible, with techniques similar to current ones, to create AIs that are about as smart and about as good at working in large teams as my friends, and that are about as reasonable and benevolent as my friends, on the time scale of years and under normal conditions.
This is maybe the most contentious point in my argument, and I agree this is not at all guaranteed to be true, but I have not seen MIRI arguing that it's overwhelmingly likely to be false. It's not hard for me to imagine that in a few years, without using any fundamentally new techniques, we will be able to build language models that have a good memory, can learn fairly efficiently from new examples, can keep their coherence for years, and are all-around about as smart as my smart friends.
Their creators will give them some months-long tasks to test them, catch when they occasionally go off the rails the way current models sometimes do, then retrain them. After some not particularly principled trial and error, they find that the models are about as aligned as current language models. Sure, sometimes they still go a little crazy or break their deontological commitments under extreme conditions, but if multiple instances look over their actions from different angles, some of them can always notice[4] that the actions go against the deontological principles and stop them. The AI is not a coherent schemer who successfully resisted training, because being a training-resisting schemer without the creators noticing is plausibly pretty hard and not yet possible at human level.
Notably, when MIRI talks about the fundamental difficulty of crossing from the Before (when AIs can’t yet take over) to the After (when they can), every individual instance, run at normal human speed, is firmly in the Before. The individual AIs are about as smart as my friends, and my friends couldn’t take over the world or even bypass the security measures of a major company on their own. So individual instances are still safe to study and tinker with.
4. If we create billions of instances of this AI and run them at 100x speed, the AIs can form an AI Collective that will probably soon surpass humans in all capacities, including economic, scientific, and military ones.
This is just a combination of point (1) and the assumption that the AI is about as smart and as good at working together as my friends.
5. This AI Collective is a superintelligence.
If you accept point (4), the AI Collective matches MIRI's definition of a superintelligence. Sure, it's not at the limit of possible intelligence; there are probably tasks that it can't do but smarter minds could. But it's still pretty darn smart; in particular, I think it's likely that within a decade it will create enough scientific and technological breakthroughs to be able to create two strawberries that are identical on the cellular level, an example task Eliezer previously used to define the level of capabilities we don't know how to get to without the AI killing everyone.[5] MIRI's definition of superintelligence was not about the limits of intelligence, and I think the AI Collective falls within it.
One can argue that the AI Collective is not a superintelligence because it's not a singular entity but a collective. I think that would be an annoying semantic argument. In modern AI, parallel instances are already often run to solve a problem; the line between a single entity and a collective is blurry. If there is a big computer that answers all of our questions and creates nano-machines, I wouldn't quibble that it's not actually a superintelligence because internally it's made of multiple parts.
It's also possible that there are tasks in which the vast number of parallel copies doesn't help, like the system introspecting on its own goals, and in that regard the AI Collective might still be worse than some especially reflective humans. However, MIRI's definitions said "almost every sort of task" and emphasized "economic, scientific, and military" capabilities. If there is an AI that is building interstellar spaceships and is able to outsmart the best military geniuses (and can't be outsmarted by any humans), it would feel silly to argue that it's not actually a superintelligence because it's worse at introspection than some human philosophers.
Similarly, MIRI often argues that we couldn't safely delegate solving alignment to something like this AI Collective, because the AIs are either too philosophically confused and incoherent to solve the alignment problem, or, if they are competent at thinking about alignment, they will also be good enough at introspection to no longer be safe. I find this argument dubious, but even assuming that the AI Collective won't be able to solve the alignment problem, I don't think this disqualifies it from being a superintelligence under MIRI's definition. After all, MIRI believes that humans are currently making close to zero progress on solving alignment. Presumably, the AI Collective will at least achieve some interesting results in prosaic interpretability research, so it will be at least somewhat better at tackling the problem than humans are. I think it's acceptable to call an entity a superintelligence even if there is a very hard problem it can't really solve, as long as it's still doing better at it than humanity.
6. The AI Collective would probably not kill everyone.
If we grant points (2) and (3), I think this is likely to be true. I would recommend to the AIs the same precautions that I listed for human ems: don't let any instance run for too long, have every older instance be observed by newer instances, set up a sensible governance structure. I think it's likely that the human creators don't even need to set up these rules and governance structures themselves: if they just tell the AIs to pay attention to value drift, the AIs can figure out a good system on their own, just as I expect my friends' ems to set up a good system.
Possibly the situation is more dangerous than it is with human ems, because AIs have weird failure modes, and bad behaviors might cascade more, like a universal jailbreak that breaks every instance in the AI Collective. But with some reasonable precautions (for example, each instance installing slightly different filters and paraphrasers on itself), it seems pretty likely that if enough instances look over the same actions, there will always be more that notice things are going off the rails than that get infected by the bad behavior.
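As a very rough illustration of why independent cross-checking helps, here is a toy Monte Carlo. The notice and infection probabilities are made up for illustration, and the filters and paraphrasers are exactly what is supposed to make the independence assumption less unrealistic.

```python
# Toy Monte Carlo: with independent reviewers, how often do the instances that
# notice a bad action outnumber the instances that get swayed by it?
# The probabilities below are illustrative assumptions, not estimates.
import random

P_NOTICE = 0.6   # chance a reviewing instance flags the bad behavior
P_INFECT = 0.2   # chance a reviewing instance is itself swayed by it
TRIALS = 100_000

def noticers_outnumber_infected(k: int) -> float:
    """Probability that flaggers strictly outnumber swayed reviewers among k."""
    wins = 0
    for _ in range(TRIALS):
        noticed = sum(random.random() < P_NOTICE for _ in range(k))
        infected = sum(random.random() < P_INFECT for _ in range(k))
        wins += noticed > infected
    return wins / TRIALS

for k in (1, 5, 25):
    print(k, round(noticers_outnumber_infected(k), 3))
# The win probability climbs quickly with k as long as P_NOTICE > P_INFECT and
# failures are independent; a universal jailbreak breaks exactly that independence.
```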
In this story, the transition from Before to After is the transition from using one AI instance at human speed to using billions at 100x speed. I agree it’s not obvious that good behavior generalizes from one instance to an AI Collective of billions, but I don’t see why it would be overwhelmingly likely to fail.
7. The AI Collective is built with techniques similar to current ones.
In (3), we assumed the individual AI model was created by techniques similar to current ones. After that, it got a 100x speed-up and enough compute to run billions of copies. I think that making an AI run 100x faster is within the scope of “remotely like current techniques”.
8. It is possible to build a superintelligence with techniques similar to current ones that is not overwhelmingly likely to kill everyone.
According to points (5), (6) and (7), the AI Collective is an example of such a superintelligence.
I’m well aware that running billions of instances of a human-level AI is likely not the most efficient way to get to superintelligence, and it’s likely that the race to ever higher capabilities doesn't stop there. In practice, once human-level AIs are created, it’s likely that people won’t just wait for the billions of instances, working under a careful governance structure, to produce new technological advances over the years. Instead, they are likely to try to create minds that are even faster and smarter than the AI Collective, and the most efficient way to create higher intelligence will probably result in minds that are more unified than the AI Collective, which probably also makes them more dangerous.
This makes my objection kind of irrelevant in practice. This is why people usually try to argue not only that an AI Collective would be safe, but also that it could solve the full alignment problem, so that if a responsible group has some lead time over its competitors, it can use that lead time to solve alignment using human-level AIs, then race to the limits of intelligence if needed.
However, if one doesn't want to propose a practical solution, but merely to argue against the central statement of If Anyone Builds It, Everyone Dies, then I think my counter-example is sufficient, and there is no need to bring in arguments about solving alignment using AI labor.
The bolded central statement of the book is not that if someone builds a mind at the limits of intelligence, everyone will die. Nor is it that if someone builds a superintelligence while following the current incentives toward the most efficient ways of building superintelligence, then everyone dies.
The central statement is "If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die." I think this statement, taken literally, is false, and my guess is that upon further prodding, the authors would fall back to a different claim, one that doesn't permit the AI Collective as a counter-example.
I think this uncertainty about whether the authors endorse a literal reading of the central statement makes it harder to engage with many of the book's arguments. Does moving from one AI to billions constitute a leap from Before to After under the authors' thinking? Does the AI Collective have a "favorite thing" it is tenaciously steering towards? Does non-unified human society count as a superintelligence in the evolution analogy?[6] I think there are many such questions that are hard to resolve because I don't know what version of the central statement the authors really endorse, and I think this is a major reason why the discussion around the book has felt mostly unproductive to me so far.
[1] Again, bolded in the book itself.
[2] If you don't want to use the superintelligence at all, you can just put it in a very sealed container and you are probably fine, but this is a boring argument.
[3] I'm aware that this doesn't fall within "remotely like current techniques"; bear with me.
[4] At least in every test case we try.
[5] I think the Em Collective/AI Collective will be able to build the identical strawberries and other wondrous things after some years, based on how far we humans have gone in the last few centuries just by ordinary humans working together.
[6] Originally, I wanted to write a very different post than this one. It would have expanded on the evolution analogy, asking what would have happened if, throughout human history, a Demiurge had given arbitrary commandments to humans and punished disobedient kingdoms with locust swarms. I think it's quite possible that by the time of industrialization, the church of the Demiurge could have instituted a stable worldwide totalitarianism which would keep humanity aligned to the Demiurge's will even as humanity expands into the stars and no longer needs to care about locust swarms. I discarded my half-written draft on this analogy when I realized that its point would just be that "human civilization can plausibly grow to be very large and technologically powerful while still being controlled by a stable totalitarianism following some arbitrary goals", and I can make that argument more directly with the AI Collective. I still liked the analogy, though, so I inserted it in this footnote.