Where ‘it’ is superintelligence, an AI smarter and more capable than humans.
And where ‘everyone dies’ means that everyone dies.
No, seriously. They’re not kidding. They mean this very literally.
To be precise, they mean that ‘If anyone builds [superintelligence] [under anything like present conditions using anything close to current techniques] then everyone dies.’
My position on this is to add a ‘probably’ before ‘dies.’ Otherwise, I agree.
This book gives us the best longform explanation of why everyone would die, with the ‘final form’ of Yudkowsky-style explanations of these concepts for new audiences.
This review is me condensing that down much further, transposing the style a bit, and adding some of my own perspective.
Scott Alexander also offers his review at Astral Codex Ten, which I found very good. I will be stealing several of his lines in the future, and arguing with others.
What Matters Is Superintelligence
This book is not about the impact of current AI systems, which will already be a lot. Or the impact of these systems getting more capable without being superintelligent. That will still cause lots of problems, and offer even more opportunity.
I talk a lot about how best to muddle through all that. Ultimately, if it doesn’t lead to superintelligence (as in the real thing that is smarter than we are, not the hype thing Meta wants to use to sell ads on its new smart glasses), we can probably muddle through all that.
My primary concern is the same as the book’s concern: Superintelligence.
The authors have had this concern for a long time.
Yes. Yes they should. Quite a lot of people should.
I am not as confident as Yudkowsky and Soares that if anyone builds superintelligence under anything like current conditions, then everyone dies. I do however believe that the statement is probably true. If anyone builds it, everyone (probably) dies.
Thus, under anything like current conditions, it seems highly unwise to build it.
Rhetorical Innovation
The core ideas in the book will be new to the vast majority of potential readers, including many of the potential readers that matter most. Most people don’t understand the basic reasons why we should presume that if anyone builds [superintelligence] then everyone [probably] dies.
If you are one of my regular readers, you are an exception. You already know many of the core reasons and arguments, whether or not you agree with them. You likely have heard many of their chosen intuition pumps and historical parallels.
What will be new to almost everyone is the way it is all presented, including that it is a message of hope, that we can choose to go down a different path.
The book lays out the case directly, in simple language, with well chosen examples and facts to serve as intuition pumps. This is a large leap in the quality and clarity and normality with which the arguments, examples, historical parallels and intuition pumps are chosen and laid out.
I am not in the target audience so it is difficult for me to judge, but I found this book likely to be highly informative, persuasive and helpful at creating understanding.
A lot of the book is providing these examples and explanations of How Any Of This Works, starting with Intelligence Lets You Do All The Things.
Welcome To The Torment Nexus
There is a good reason Benjamin Hoffman called this book Torment Nexus II. The authors admit that their previous efforts to prevent the outcome where everyone dies have often, from their perspective, not gone great.
This was absolutely a case of ‘we are proud to announce our company dedicated to building superintelligence, from the MIRI warning that if anyone builds superintelligence then everyone dies.’
Because hey, if that is so super dangerous, that must mean it is exciting and cool and important and valuable, Just Think Of The Potential, and also I need to build it before someone else builds a Torment Nexus first. Otherwise they might monopolize use of the Torment Nexus, or use it to do bad things, and I won’t make any money. Or worse, we might Lose To China.
Given this involved things like funding DeepMind and inspiring OpenAI? I would go so far as to say ‘backfired spectacularly.’
Predictions Are Hard, Especially About the Future
Trying to predict when things will happen, or who exactly will do them in what order or with what details, is very difficult.
Whereas some basic consequences of potential actions follow rather logically and are much easier to predict.
The details of exactly how the things happen are similarly difficult. The overall arc, that the atoms all get used for something else and that you don’t stick around, is easier, and as a default outcome is highly overdetermined.
Humans That Are Not Concentrating Are Not General Intelligences
Humans have a lot of intelligence, so they get to do many of the things. This intelligence is limited, and we have other restrictions on us, so there remain some things we still cannot do, but we do and cause remarkably many things.
They break down intelligence into predicting the world, and steering the world towards a chosen outcome.
I notice steering towards a chosen outcome is not a good model of most of what many supposedly intelligent people (and AIs) do, or most of what they do that causes outcomes to change. There is more predicting, and less deliberate steering, than you might think.
Sarah Constantin explained this back in 2019 while discussing GPT-2: Humans who are not concentrating are not general intelligences, they are much closer to next token predictors a la LLMs.
Using your intelligence to first predict then steer the world is the optimal way for a sufficiently advanced intelligence without resource constraints to achieve a chosen outcome. A sufficiently advanced intelligence would always do this.
When I look around at the intelligences around me, I notice that outside of narrow domains like games most of the time they are, for this purpose, insufficiently advanced and have resource constraints. Rather than mostly deliberately steering towards chosen outcomes, they mostly predict. They follow heuristics and habits, doing versions of next token prediction, and let things play out around them.
This is the correct solution for a mind with limited compute, parameters and data, such as that of a human. You mostly steer better by setting up processes that tend to steer how you prefer and then you go on automatic and allow that to play out. Skilling up in a domain is largely improving the autopilot mechanisms.
Occasionally you’ll change some settings on that, if you want to change where it is going to steer. As one gets more advanced within a type of context, and one’s prediction skills improve, the automatic processes get more advanced, and often the steering of them both in general and within a given situation gets more active.
Orthogonality
The book doesn’t use that word, but a key thing this makes clear is that a mind’s intelligence, the ability to predict and steer, has nothing to do with where that mind is attempting to steer. You can be arbitrarily good or bad at steering and predicting, and still try to steer toward whatever ultimate or incremental destination you like.
Intelligence Lets You Do All The Things
In what ways are humans still more intelligent than AIs?
Generality, in both the predicting and the steering.
The ‘won’t stay true forever’ is (or should be) a major crux for many. There is a mental ability that a typical 12-year-old human has that AIs currently do not have. Quite a lot of people are assuming that AIs will never have that thing.
That assumption, that the AIs will never have that thing, is being heavily relied upon by many people. I am confident those people are mistaken, and AIs will eventually have that thing.
If this stops being true, what do you get? Superintelligence.
No Seriously We Mean All The Things
The book then introduces the intelligence explosion.
Perhaps we should call this the second intelligence explosion, with humans having been the first one. That first cascade was relatively modest, and it faced various bottlenecks that slowed it down a lot, but compared to everything else that has ever happened? It was still lightning quick and highly transformative. The second one will, if it happens, be lightning quick compared to the first one, even if it turns out to be slower than we might expect.
How To Train Your LLM (In Brief)
You take a bunch of randomly initialized parameters arranged in arrays of numbers (weights), and a giant bunch of general data, and a smaller bunch of particular data. You do a bunch of gradient descent on that general data, and then you do a bunch of gradient descent on the particular data, and you hope for a good alien mind.
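To make the shape of that recipe concrete, here is a minimal toy sketch in PyTorch: a tiny next-token model, gradient descent on a stand-in for the general data, then more gradient descent on a stand-in for the particular data. The model, sizes, and fake “data” are placeholders of my own, nothing like a real frontier run.

```python
# A toy sketch of the recipe as described: random weights, gradient descent on a big
# pile of general data, then more gradient descent on a smaller pile of particular data.
# Everything here (model shape, sizes, the fake "data") is a placeholder of my own.
import torch
import torch.nn as nn

torch.manual_seed(0)
VOCAB, DIM, CONTEXT = 64, 32, 8

# "A bunch of randomly initialized parameters arranged in arrays of numbers (weights)."
model = nn.Sequential(
    nn.Embedding(VOCAB, DIM),
    nn.Flatten(),                     # (batch, CONTEXT, DIM) -> (batch, CONTEXT * DIM)
    nn.Linear(CONTEXT * DIM, VOCAB),  # logits over the next token
)

def gradient_descent(sample_batch, steps, lr):
    """Nudge the weights toward predicting each sequence's final token from its prefix."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        contexts, targets = sample_batch()
        opt.zero_grad()
        loss_fn(model(contexts), targets).backward()
        opt.step()

def general_batch():
    """Stand-in for the giant bunch of general data (the 'pretraining' stage)."""
    seq = torch.randint(0, VOCAB, (16, CONTEXT + 1))
    return seq[:, :CONTEXT], seq[:, CONTEXT]

def particular_batch():
    """Stand-in for the smaller bunch of particular data (the 'fine-tuning' stage)."""
    seq = torch.randint(0, VOCAB // 4, (16, CONTEXT + 1))
    return seq[:, :CONTEXT], seq[:, CONTEXT]

gradient_descent(general_batch, steps=200, lr=0.1)     # a bunch of gradient descent on general data
gradient_descent(particular_batch, steps=50, lr=0.01)  # then a bunch more on particular data
# ...and then you hope the resulting alien mind does what you wanted.
```

Note what the sketch does not contain: any step where you specify, or even inspect, what the resulting weights mean. You get numbers that predict, not an artifact you understand.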
One way to predict what a human will say in a given circumstance is to be that human, in or imagining that circumstance, and see what you say or would say. If you are not very close to being that human, the best way to predict is usually very different.
We only know how to grow an LLM, not how to craft one, and not how to understand what it is doing. We can make general predictions about what the resulting model will do based on our past experiences and extrapolate based on straight lines on graphs, and we can do a bunch of behaviorism on any given LLM or on LLMs in general. We still have little ability to steer in detail what outputs we get, or to understand in detail why we get those particular outputs.
The authors equate the understanding problem to predicting humans from their DNA. You can tell some basic things reasonably reliably from the DNA or weights, starting with ‘this is a human with blue eyes’ or ‘this is a 405 billion parameter LLM.’ In theory, with enough understanding, we could tell you everything. We do not have that understanding. We are making nonzero progress, but not all that much.
The book doesn’t go into it here, but people try to fool themselves and others about this. Sometimes they falsely testify before Congress saying ‘the black box nature of AIs has been solved,’ or they otherwise present discoveries in interpretability as vastly more powerful and general than they are. People wave hands and think that they understand what happens under the hood, at a level they very much do not understand.
What Do We Want?
That which we behave as if we want.
When do we want it? Whenever we would behave that way.
Or, as the book says, what you call ‘wanting’ is between you and your dictionary, but it will be easier for everyone if we say that Stockfish ‘wants’ to win a chess game. We should want to use the word that way.
With that out of the way we can now say useful things.
The core idea here is that if you teach a mind general skills, those skills have to come with a kind of proto-want, a desire to use those skills to steer in a want-like way. Otherwise, the skill won’t be useful and won’t get learned.
If you train a model to succeed at a type of task, it will also train the model to ‘want to’ succeed at that type of task. Since everything trains everything, this will also cause it to ‘want to’ more generally, and especially to ‘want to’ complete all types of tasks.
This then leads to thinking that ‘goes hard’ to achieve its assigned task, such as o1, upon finding that the server for its capture-the-flag task had accidentally not booted up, finding a way to start it such that it handed o1 the flag directly.
You Don’t Only Get What You Train For
The authors have been workshopping various evolutionary arguments for a while, as intuition pumps and examples of how training on [X] by default does not get you a mind that optimizes directly for [X]. It gets you a bundle of optimization drives [ABCDE] that, in the training environment, combine to generate [X]. But this is going to be noisy at best, and if circumstances differ from those in training, and the link between [A] and [X] breaks, the mind will keep wanting [A], the way humans love ice cream and use birth control rather than going around all day strategizing about maximizing genetic fitness.
Training an AI means solving for the [ABCDE] that in training optimize the exact actual [X] you put forward, which in turn was an attempt to approximate the [Y] you really wanted. This process, like evolution, is chaotic, and can be unconstrained and path dependent.
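As a toy illustration of that gap between the [X] you trained on and what actually got learned, here is a sketch of my own (not from the book): train a simple learner on a target while a proxy feature happens to track the target perfectly in the training environment, then break that correlation and see what the learner actually ended up relying on.

```python
# Toy version of 'train on [X], get proxy drives instead': in training, a clean proxy
# feature happens to equal the thing we care about, while the honest signal is weak and
# noisy. The learner leans on the proxy, and keeps leaning on it after the link breaks.
# All of this is my own illustrative setup, not anything from the book.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, proxy_tracks_target=True):
    y = rng.choice([-1.0, 1.0], n)                    # [X], the thing we actually care about
    proxy = y if proxy_tracks_target else rng.choice([-1.0, 1.0], n)
    x = np.stack([0.2 * y + rng.standard_normal(n),   # weak, noisy view of [X]
                  proxy], axis=1)                     # clean proxy feature, [A]
    return x, y

def train(x, y, steps=2000, lr=0.1):
    """Plain gradient descent on logistic loss: the stand-in for 'training on [X]'."""
    w = np.zeros(x.shape[1])
    for _ in range(steps):
        margins = y * (x @ w)
        coeff = y * 0.5 * (1.0 - np.tanh(margins / 2))  # = y * sigmoid(-margins), overflow-safe
        w -= lr * -(x * coeff[:, None]).mean(axis=0)
    return w

def accuracy(w, x, y):
    return np.mean(np.sign(x @ w) == y)

x_train, y_train = make_data(5000, proxy_tracks_target=True)
w = train(x_train, y_train)
x_shift, y_shift = make_data(5000, proxy_tracks_target=False)  # the [A]-to-[X] link breaks

print("weights [honest, proxy]:", w)                        # most of the weight lands on the proxy
print("training accuracy:", accuracy(w, x_train, y_train))  # near-perfect in the training world
print("shifted accuracy:", accuracy(w, x_shift, y_shift))   # close to chance once the link breaks
```

The learner never “cared about” [X]; it found something that paid off during training and kept wanting that.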
We should expect some highly unexpected strangeness in what [ABCDE] end up being. Yet even if we exclude all unexpected strangeness and only follow default normal paths, the ‘zero complications’ paths? Maximizing efficiently for a specified [X] will almost always end badly if the system is sufficiently capable. If you introduce even a minor complication, a slight error, it gets even worse than that, and we should expect quite a few complications.
That doesn’t mean there is no possible way to get more robustly at [X] or [Y]. It does mean that we don’t know a way that involves only using gradient descent or other known techniques.
Alas, AIs that want random bizarre things don’t make good stories or ‘feel real’ to us, the same way that fiction has to make a lot more sense than reality. So instead we tell stories about evil corporations and CEOs and presidents and so on. Which are also problems, but not the central problem.
What Will AI Superintelligence Want?
By default? Not what we want. And not us, or us sticking around.
Why not? Because we are not the optimal way to fulfill what bizarre alien goals it ends up with. We might be a good way. We almost certainly won’t be the optimal way.
In particular:
Also humans running around are annoying, they might do things like set off nukes or build another superintelligence, and keeping humans around means not overheating the Earth while generating more energy. And so on.
Their position, and I agree with this, is that the AI or AIs that do this to us might end up having value, but that this too would require careful crafting to happen. It probably won’t happen by default, and also would not be much comfort either way.
What Could A Superintelligence Do?
All of the things. But what are all of the things?
Even if any particular crazy sounding thing might be very hard, there are going to be a lot of crazy sounding things that turn out to be not that hard. Those get solved.
They predict that AIs will invent technologies and techniques we are not considering. That seems right, but also you can keep watering down what superintelligence can do, rule out all the stuff like that, and it doesn’t matter. It ‘wins’ anyway, in the sense that it gets what it wants.
One Extinction Scenario
Part 2 is One Extinction Scenario, very much in the MIRI style. The danger is always that you offer one such scenario, someone decides one particular part of it sounds silly or doesn’t work right, and then uses this to dismiss all potential dangers period.
One way they attempt to guard against this, here, is that at many points they say ‘the AI tries various tactics, some of which are [ABCDE], and one of them works, it doesn’t matter which one.’ They also at many points intentionally make the AI’s life maximally hard rather than easy, presuming that various things don’t work despite the likelihood they would indeed work. At each step, it is emphasized that the AI will try many different things that create possibilities, without worrying much about exactly which ones succeed.
The most important ‘hard step’ in the scenario is that the various instances of the collectively superintelligent AI, which is called Sable, are working together towards the goal of gathering more resources to ultimately satisfy some other goal. To make the story easier to tell, they placed this in the very near future, but as the coda points out the timing is not important.
The second ‘hard step’ is that the one superintelligent AI in this scenario opens up a substantial lead on other AI systems, via figuring out how to act in a unified way. If there were other similarly capable minds up against it, the scenario looks different.
The third potential ‘hard step’ here is that no one figures out what is going on, that there is an escaped AI running around and gathering its resources and capabilities, in a way that causes a coordinated reaction. Then the AI makes its big play, and you can object there as well about how the humans don’t figure it out, despite the fact that this superintelligence is choosing the particular path, and how it responds to events, based on its knowledge and model of how people would react, and so on.
And of course, the extent to which we already have a pattern of giant alarm bells going off, people yelling about it, and everyone collectively shrugging.
My presumption in a scenario like this is that plenty of people would suspect something was going horribly wrong, or even what that thing was, and this would not change the final outcome very much even if Sable wasn’t actively ensuring that this didn’t change the outcome very much.
Later they point to the example of leaded gasoline, where we had many clear warning signs that adding lead to gasoline was not a good idea, but no definitive proof, so we kept adding lead to gasoline for quite a long time, at great cost.
As the book points out, citing Chernobyl and the Titanic as examples, this wouldn’t be our first rodeo pretending This Is Fine; history is full of refusals to believe that horrible things could happen. Fiction writers also have similar expectations, for example see Mission Impossible: Dead Reckoning for a remarkably reasonable prediction on this.
Note that in this scenario, the actual intelligence explosion, the part where AI R&D escalates rather quickly, very much happens After The End, well past the point of no return where humans ceased to be meaningfully in charge. Then of course what is left of Earth quickly goes dark.
One can certainly argue with this style of scenario at any or all of the hard steps. The best objection is to superintelligence arising in the first place.
One can also notice that this scenario, similarly to AI 2027, involves what AI 2027 called neurolese: the AI starts reasoning in a synthetic language that is very much not English or any human language, and we let this happen because it is more efficient. That choice could be load bearing, and there has been a prominent call across labs and organizations to preserve human-legible reasoning. So far we have been fortunate that reasoning in human language has won out. But it seems highly unlikely that this would remain the most efficient solution forever. Do we look like a civilization ready to coordinate to keep using English (or Chinese, or other human languages) anyway?
One also should notice that this style of scenario is far from the only way it all goes horribly wrong. This scenario is a kind of ‘engineered’ gradual disempowerment, but the humans will likely default to doing similar things all on their own, on purpose. Competition between superintelligences only amps up many forms of pressure, none of the likely equilibria involved are good news for us. And so on.
I caution against too much emphasis on whether the AI ‘tries to kill us’ because it was never about ‘trying to kill us.’ That’s a side effect. Intent is largely irrelevant.
In his review of IABIED (search for “IV.”), Scott Alexander worries that this scenario sounds like necessarily dramatic science fiction, and depends too much on the parallel scaling technique. I think there is room for both approaches, and that IABIED makes a lot of effort to mitigate this and make clear most of the details are not load bearing. I’d also note that we’re already seeing signs of the parallel scaling technique, such as Google DeepMind’s Deep Think, showing up after the story was written.
And the AIs will probably get handed the reins of everything straight away with almost no safeguards and no crisis because lol, but the whole point of the story is to make the AI’s life harder continuously at every point to illustrate how overdetermined the outcome is. And yes I think a lot of people who don’t know much about AI will indeed presume we would not ‘be so stupid as to’ simply hand the reins of the world over to the AI the way we appointed an AI minister in Albania, or would use this objection as an excuse if it wasn’t answered.
So You’re Saying There’s A Chance
That leaves the remaining roughly third of the book for solutions.
This is hard. One reason this is so hard is the solution has to work on the first try.
Once you build the first superintelligence, if you failed, you don’t get to go back and fix it, the same way that once you launch a space probe, it either works or it doesn’t.
You can experiment before that, but those experiments are not a good guide to whether your solution works.
Except here it’s also the Game of Thrones, as in You Win Or You Die, and also you’re dealing with a grown superintelligence rather than mechanical software. So, rather much harder than the things that fail quite often.
When problems do not have this feature, I am mostly relaxed. Sure, deepfakes or job losses or what not might get ugly, but we can respond afterwards and fix it. Not here.
They also draw parallels and lessons from Chernobyl and computer security. You are in trouble if you have fast processes, narrow margins, feedback loops, complications. The key insight from computer security is that the attacker will with time and resources find the exact one scenario out of billions that causes the attack to work, and your system has to survive this even in edge cases outside of normal and expected situations.
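To make that security point concrete, here is a tiny sketch of my own (nothing from the book): a system that passes essentially any amount of average-case testing, against an attacker who searches the input space and only needs to win once.

```python
# Toy version of the computer security insight: random, average-case testing says the
# system is fine; an adversary who searches the space finds the one input that breaks it.
# The 'validator' and its single bad input are made up for illustration.
import random

UNHANDLED_EDGE_CASE = 77_777  # the one input (out of ten million) nobody thought about

def validator(x: int) -> bool:
    """Stand-in for a deployed system: behaves correctly on every input but one."""
    return x != UNHANDLED_EDGE_CASE

# Average-case assurance: lots of random probes, which almost certainly all pass.
random.seed(0)
failures = sum(not validator(random.randrange(10_000_000)) for _ in range(100_000))
print("failures found by random testing:", failures)  # almost always 0: "looks safe"

# The attacker does not sample the average case, they search for the worst case.
breaking_input = next(x for x in range(10_000_000) if not validator(x))
print("input found by adversarial search:", breaking_input)
```

The defender has to be right about the whole space; the attacker only has to be right about one point in it.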
The basic conclusion is that this problem has tons of features that make it likely we will fail, and the price of failure on the first try is extinction, and thus the core thesis:
Well, sure, when you put it like that.
Note that ‘no one should be allowed to try to make a superintelligence’ does not mean that any particular intervention would improve our situation, nor is it an endorsement of any particular course of action.
What are the arguments that we should allow someone to try?
Most of them are terrible. We’ve got such classics as forms of:
They will later namecheck some values for [X], such as ‘we’ll design them to be submissive,’ ‘we’ll make them care about truth’ and ‘we’ll just have AI solve the ASI alignment problem for us.’
Is comparing those to alchemists planning to turn lead into gold fair? Kinda, yeah.
Then we have the category that does not actually dispute that no one should be allowed to try, but that frames ‘no one gets to try’ as off the table:
Are there situations in which going forward is a profoundly stupid idea, but where you’re out of ways to make the world not go forward at all and going first is the least bad option left? Yes, that is certainly possible.
It is certainly true that a unilateral pause at this time would not help matters.
The first best solution is still that we all coordinate to ensure no one tries to build superintelligence until we are in a much better position to do so.
Okay, but what are the actively good counterarguments?
A good counterargument would involve making the case that our chances of success are much better than all of this would imply, that these are not the appropriate characteristics of the problem, or that we have methods available that we can expect to work, that indeed we would be very large favorites to succeed.
If I learned that someone convinced future me that moving forward to superintelligence was an actively good idea, I would presume it was because someone figured out a new approach to the problem, one that removed many of its fatal characteristics, and we learned that it would probably work. Who knows. It might happen. I do have ideas.
Oh Look It’s The Alignment Plan
The next section delves into the current state of alignment plans, which range from absurd and nonsensical (such as Elon Musk’s ‘truth-seeking AI’ which would kill us all even if we knew how to execute the plan, which we don’t) to extremely terrible (such as OpenAI’s ‘superalignment’ plan, which doesn’t actually solve the hard problems because to be good enough to solve this problem the AI has to already be dangerous). Having AIs work on interpretability is helpful but not a strategy.
The book goes on at greater length on why none of this will work, as I have often gone on at greater length from my own perspective. There is nothing new here, as there are also no new proposals to critique.
Instead we have a very standard disaster template. You can always get more warnings before a disaster, but we really have had quite a lot of rather obvious warning signs.
Yet so many people seem unable to grasp the basic principle that building quite a lot of very different-from-human minds quite a lot smarter and more capable and more competitive than humans is rather obviously a highly unsafe move. You really shouldn’t need a better argument than ‘if you disagree with that sentence, maybe you should read it again, because clearly you misunderstood or didn’t think it through?’
Most of the world is simply unaware of the situation. They don’t ‘feel the AGI’ and definitely don’t take superintelligence seriously. They don’t understand what is potentially being built, or how dangerous those building it believe it would be.
I would add that no one knows when we will be so dependent on AI that we will no longer have the option to turn back, even if it is not yet superintelligent and still doing what we ask it to do.
Even the governments of America and China have not as of late been taking this seriously, treating the ‘AI race’ as being about who is manufacturing the GPUs.
The Proposal: Shut It Down
Okay, wise guy, you ask the book, what is it gonna take to make the world not end?
They bite the bullets.
(To be maximally clear: I am not biting these bullets, as I am not as sold that there is no other way. If and when I do, you will know. The bullet I will absolutely bite is that we should be working, now, to build the ability to coordinate a treaty and enforcement mechanism in the future, should it be needed, and to build transparency and state capacity to learn more about when and if it is needed and in what form.)
It is good and right to bite bullets, if you believe the bullets must be bitten.
They are very clear they see only one way out: Development of frontier AI must stop.
Which means a global ban.
How would we do that, you ask?
Their proposed threshold is not high.
I can already hear people calling this insane. I thought it too. What am I going to do, destroy the world with nine GPUs? Seems low. But now we’d be talking price.
They also want to ban people from publishing the wrong kinds of research.
Take that literally. They don’t know how else humanity can survive. That doesn’t mean that they think that if we don’t do it by year [X], say 2029, that we will definitely already be dead at that point, or even already in an unsurvivable situation. It means that they see a real and increasing risk, over time, of anyone building it, and thus everyone dying, the longer we fail to shut down the attempts to do so. What we don’t know is how long those attempts would take to succeed, or even if they will succeed at all.
How do they see us enforcing this ban?
Yes, the same way anything else is ultimately enforced. At the barrel of a gun, if necessary, which yes involves being ready to blow up a datacenter if it comes to that.
Cue, presumably, all the ‘nuke the datacenter’ quips once again, or people trying to equate this with various forms of extralegal action. No. This is a proposal for an international treaty, enforced the way any other treaty would be enforced. Either allow the necessary monitoring, or the datacenter gets shut down, whatever that takes.
Thus, the proposal is simple. As broad a coalition as possible monitors all the data centers and GPUs, watching to ensure no one trains more capable AI systems.
Is it technically feasible to do this? The book doesn’t go into this question. I believe the answer is yes. If everyone involved wanted to do this, we could do it, for whatever hardware we were choosing to monitor. That would still leave consumer GPUs and potential decentralized attempts and so on. I don’t know what you would do about that in the long term, but if we are talking about this level of attention and effort, I am betting we could find an answer.
To answer a question the book doesn’t ask, would this then mean a ‘dystopian’ or ‘authoritarian’ world or a ‘global government’? No. I’m not saying it would be pretty (and again, I’m not calling for it or biting these bullets myself) but this regime seems less effectively restrictive of practical freedoms than, for example, the current regime in the United Kingdom under the Online Safety Act. They literally want you to show ID before you can access the settings on the Nvidia GPU in your home computer. Or Wikipedia.
Hope Is A Vital Part Of Any Strategy
You gotta give ‘em hope.
And hope there is indeed.
Humanity has done some very expensive, painful, hard things. We’ve dodged close calls. The book cites big examples: We won World War II. We’ve avoided nuclear war.
There are many other examples one could cite as well.
How do we get there from here?
I’m Doing My Part
The Chinese show many signs of being remarkably open to coordination. As well they should be, given that right now we are the ones out in front. Is there a long, long way left to go? Absolutely. Would there be, shall we say, trust issues? Oh my yes. But if you ask who seems to be the biggest obstacle to a future deal, all signs suggest we have met the enemy and he is us.
Their Closing Words
I cannot emphasize enough, I really really cannot emphasize enough, how much all of us worried about this want to be completely, spectacularly wrong, and for everything to be great, and for us to be mocked eternally as we live forever in our apartments. That would be so, so much better than being right and dying. It would even be much better than being right and everyone working together to ensure we survive anyway.
Am I convinced that the only way for us to not die is an international treaty banning the development of frontier AI? No. That is not my position. However, I do think that it is good and right for those who do believe this to say so. And I believe that we should be alerting the public and our governments to the dangers, and urgently laying the groundwork for various forms of international treaties and cooperation both diplomatically and technologically, and also through the state capacity and transparency necessary to know if and when and how to act.
I am not the target audience for this book, but based on what I know, this is the best treatment of the problem I have seen that targets a non-expert audience. I encourage everyone to read it, and to share it, and also to think for themselves about it.
In the meantime, yes, work on the problem, but also don’t forget to live well.