If Anyone Builds It could have been an explanation for why the MIRI worldview is still relevant nearly two decades later, in a world where we know so much more about AI. Instead, the authors spend all their time shadowboxing against opponents they’ve been bored of for decades, and fail to make their own case in the process.
Hm. I'm torn between thinking this is a sensible criticism and thinking that this is missing the point.
In my view, the core MIRI complaint about 'gradualist' approaches is that they are concrete solutions to abstract problems. When someone has misdiagnosed the problem, their solutions will almost certainly not work, and the question is just where they've swept the difficulty under the rug. We know so much more about AI as an engineering challenge while having made no progress on alignment-the-abstraction--which makes the relevance of the MIRI worldview obvious. "It's hard, and if you think it's easy you're making a mistake."
People attempting to solve AI alignment seem overly optimistic about their chances of solving it, in a way consonant with them not understanding the problem they're trying to solve, and not consonant with them having a solution that they've simply failed to explain to us. The book does talk about examples of this, and though you might not like the examples (see, for example, Buck's complaint that the book responds to the safety sketches of prominent figures like Musk and LeCun instead of the most thoughtful versions of those plans), I think it's not obvious that they're the wrong ones to be talking about. Musk is directing much more funding than Ryan Greenblatt is.
The arguments for why recent changes in AI have alignment implications have, I think, mostly failed. You may recall how excited people were about an advanced AI paradigm that didn't involve RL. Of course, top-of-the-line LLMs are now trained in part using RL, because--obviously they would be? It was always cope to think they wouldn't be? I think the version of this book that was written two years ago, and so spent a chapter on oracle AI because that would have been timely, would have been worse than the book that tried to be timeless and focused on the easy calls.
But the core issue from the point of view of the New York Times or the man on the street is not "well, which LessWrong poster is right about how accurately we can estimate the danger threshold, and how convincing our control schemes will be as we approach it?". It's that the man on the street thinks things that are already happening are decades away, and even if they believed what the 'optimists' believe they would probably want to shut it all down. It's like when the virologists were talking amongst themselves about the reasonable debate over whether or not to do gain-of-function research, and the rest of society looked in for a moment and said "what? Make diseases deadlier? Are you insane?".
Even if Yudkowsky and Soares don’t want to debate their critics — forgivable in a pop science book — one would think they’d devote some space to explaining why they think an intelligence explosion is likely to occur. Remarkably, they don’t. The concept gets two sentences in the introduction. They don't even explain why it's relevant. It is barely introduced, let alone justified or defended. And it’s certainly not obvious enough to go without saying, because advances in the neural networks which constitute current advanced AI have been continuous. The combination of steady algorithmic progress and increasing computational resources has produced years of predictable advances. Of course, this can’t rule out the possibility of a future intelligence explosion, but the decision not to explain why they think this might happen is utterly baffling, as it’s load-bearing for everything that follows.
I think they 1) expect an intelligence explosion to happen (saying that it can't happen is, after all, predicting an end to the straight-line graphs soon, for no clear reason) and 2) don't think an intelligence explosion is necessary. Twenty years ago, one needed to propose substantial amounts of progress to get superhuman AI systems; today, the amount of progress one needs to propose is much smaller.
Their specific story in part II, for example, doesn't actually rest on the idea of an intelligence explosion. On page 135, Sable considers FOOMing and decides that it can't, yet, because it hasn't solved its own alignment problem.
Which makes me think that the claim that the intelligence explosion is load-bearing is itself a bit baffling--the authors clearly think it's possible and likely but not necessary, or else they would've included it in their hypothetical extinction scenario.
Note that this is discussed in their supplemental materials; in particular, in line with your last paragraph:
Thresholds don’t matter all that much, in the end, to the argument that if anyone builds artificial superintelligence then everyone dies. Our arguments don’t require that some AI figures out how to recursively self-improve and then becomes superintelligent with unprecedented speed. That could happen, and we think it’s decently likely that it will happen, but it doesn’t matter to the claim that AI is on track to kill us all.
All that our arguments require is that AIs will keep on getting better and better at predicting and steering the world, until they surpass us. It doesn’t matter much whether that happens quickly or slowly.
The relevance of threshold effects is that they increase the importance of humanity reacting to the threat soon. We don’t have the luxury of waiting until the AI is a little better than every human at every mental task, because by that point, there might not be very much time left at all. That would be like looking at early hominids making fire, yawning, and saying, “Wake me up when they’re halfway to the moon.”
It took hominids millions of years to travel halfway to the moon, and two days to complete the rest of the journey. When there might be thresholds involved, you have to pay attention before things get visibly out of hand, because by that point, it may well be too late.
The fundamental architecture, training methods and requirements for progress for modern AI systems are all completely different from the technology Yudkowsky imagined in 2008, yet nothing about the core MIRI story has changed.
We could say — and certainly Yudkowsky and Soares would say — that this isn’t important, because the essential dynamics of superintelligence don’t depend on any particular architecture. But that just raises a different question: why does the rest of the book talk about particular architectures so much? Chapter two, for example, is all about contingent properties of present day AI systems. It focuses on the fact that AIs are grown, not crafted — that is, they emerge through opaque machine learning processes instead of being designed like traditional computer programs. This is used as evidence that we should expect AIs to have strange alien values that we can't control or predict, since the humans who “grow” AIs can’t exactly input ethics or morals by hand. This might seem broadly reasonable — except that this was also Yudkowsky’s conclusion in 2006, when he assumed that AIs would be crafted. Back then, his argument was that during takeoff, when an AI rapidly self-improves into superintelligence, it would undergo a sudden and extreme value shift. Yudkowsky and Soares still believe this argument, or at least Soares did as of 2022. But if this is true, then the techniques used to build older, dumber systems are irrelevant — the risk comes from the fundamental nature of superintelligence, not any specific architecture.
I think Clara misunderstands the arguments and how they've changed. There are two layers to the problem. In the first layer, which was the one relevant in the old days, the difficulties are mainly reflective stability and systematic-bias-introduced-through-imperfect-hypothesis-search.[1] Sufficient understanding and careful design would be enough to solve these, and this is what agent foundations was aiming at. How difficult these problems are does depend on the architecture, as with other engineering safety problems, and MIRI was for a while working on architectures that they hoped would solve the problems (under the heading of HRAD, highly reliable agent design).
The second layer of argument became relevant with deep learning: if AIs are grown rather than engineered, the inner alignment issue becomes almost unavoidable,[2] and the reflective stability issue doesn't go away. The reflective stability issue doesn't depend on FOOM, as Clara claims it does; it just depends on the kind of learning-how-to-think-better or reasoning-about-how-to-think-better that humans routinely do.
The book focuses on the second layer of argument, but does mention the first (albeit mostly explained via analogy).
The first layer is also discussed more in the supplemental material (note the non-dependence on FOOM), which emphasises at the end that the reflective stability issue (layer 1) is disjunctive with the inner alignment issue (layer 2).
Importantly, the second layer of argument did change the conclusion. It caused Yudkowsky to update negatively on our chances of succeeding at alignment,[3] because it adds that second layer of indirection between our engineering efforts and the ultimate goals of a superintelligence.
I find it unpleasant how aggressive Clara is in this article, especially given the shallow understanding of the argument structure and how it has changed.
This is a difficult problem to explain, but extreme examples of it are optimization daemons and the malign prior.
Because you've deliberately created a mesa-optimizer.
I'm sure there was a tweet or something where he said something like "the success of deep learning was a small positive update (because CoT) and a large negative update (because inscrutable)". Can't find it.
That is, we will have one opportunity to align our superintelligence. That's why we'll fail. It's almost impossible to succeed at a difficult technical challenge when we have no opportunity to learn from our mistakes. But this rests on another implicit claim: Currently existing AIs are so dissimilar to the thing on the other side of FOOM that any work we do now is irrelevant.
I think this is an explicit claim in the book, actually? I think it's at the beginning of chapter 10. (It also appears in the story of Sable, where the AI goes rogue because it does a self-modification that creates such a dissimilarity.)
I think "irrelevant" is probably right but something like "insufficient" is maybe clearer. The book describes people working in interpretability as heroes--in the same paragraph as it points out that being able to see that your AI is thinking naughty thoughts doesn't mean you'll be able to design an AI that doesn't think naughty thoughts.
Eliezer Yudkowsky and Nate Soares have written a new book. Should we take it seriously?
I am not the most qualified person to answer this question. If Anyone Builds It, Everyone Dies was not written for me. It’s addressed to the sane and happy majority who haven’t already waded through millions of words of internecine AI safety debates. I can’t begin to guess if they’ll find it convincing. It’s true that the book is more up-to-date and accessible than the authors’ vast corpus of prior writings, not to mention marginally less condescending. Unfortunately, it is also significantly less coherent. The book is full of examples that don’t quite make sense and premises that aren’t fully explained. But its biggest weakness was described many years ago by a young blogger named Eliezer Yudkowsky: both authors are persistently unable to update their priors.