I put 60% probability on you having intentionally structured the post to mirror the pattern of how you felt reading the book.
I appreciate this. I haven't finished the book yet, but my impression is you liked it more than I expect to. I suspect a good introduction to alignment should take only a few paragraphs, be understandable to almost anyone, be robust against incorrect counterarguments, and be correctly vulnerable to insightful counterarguments if any exist. But I haven't figured out how to write that down myself. A good intro is also a good representation to think with, imo, which is most of the value I see in it.
I didn't, but I did copy pasta the intro from another post I was writing because it seemed relevant.
"If Anyone Builds It, Everyone Dies" by Eliezer Yudkowsky and Nate Soares (hereafter referred to as "Everyone Builds It" or "IABIED" because I resent Nate's gambit to get me to repeat the title thesis) is an interesting book. One reason it's interesting is timing: It's fairly obvious at this point that we're in an alignment winter. The winter seems roughly caused by:
The 2nd election of Donald Trump removing Anthropic's lobby from the White House. Notably this is not a coincidence but a direct result of efforts from political rivals to unseat that lobby. When the Vice President of the United States is crashing AI safety summits to say that "I'm not here this morning to talk about AI safety, which was the title of the conference a couple of years ago. I'm here to talk about AI opportunity" and that "we'll make every effort to encourage pro-growth AI policies", it's pretty obvious that technical work on "safety" and "alignment" is going to be deprioritized by the most powerful Western institutions, and people change their research directions as a result.
Key figures in AI alignment from the MIRI cluster (especially Yudkowsky) overinvesting in concepts like deceptive mesaoptimizers and recipes for ruin to create almost unfalsifiable, obscurantist shoggoth-of-the-gaps arguments against neural gradient methods. At the same time, the convergent representation hypothesis has continued to gain evidence and academic ground. These thinkers gambled everything on a vast space of minds that doesn't actually exist in practice, and lost.
The value loading problem outlined in Bostrom 2014, of getting a general AI system to internalize and act on "human values" before it is superintelligent and therefore incorrigible, has basically been solved. This achievement almost always goes unrecognized, because people would rather hem and haw about jailbreaks and LLM jank than recognize that we now have a reasonable strategy for getting a good representation of the previously ineffable human value judgment into a machine and having the machine take actions or render judgments according to that representation. At the same time, people generally internalize things subconsciously well before they're capable of articulating them, and lots of people have subconsciously internalized that alignment is mostly solved and turned their attention elsewhere.
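For concreteness, here is what "having the machine render judgments according to that representation" tends to look like in practice: best-of-n selection against a learned preference (reward) model. This is a minimal sketch using the Hugging Face transformers library; the checkpoint name is a placeholder for whatever reward model you happen to have, and the prompt and candidate responses are toy strings, so treat it as an illustration of the pattern rather than a working recipe.

```python
# Minimal best-of-n sketch: score candidate responses with a learned human
# preference (reward) model and keep the one the model judges best.
# "your-org/your-reward-model" is a placeholder, not a real checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "your-org/your-reward-model"  # hypothetical reward model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
reward_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

def score(prompt: str, response: str) -> float:
    """Return the reward model's scalar judgment of how good the response is."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

prompt = "Explain why the sky is blue."
candidates = [
    "Because of Rayleigh scattering of shorter wavelengths of sunlight.",
    "The sky reflects the ocean.",
    "Idk, look it up.",
]
# Take the response the learned human value representation ranks highest.
best = max(candidates, key=lambda response: score(prompt, response))
print(best)
```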
I think this last bullet point is particularly unfortunate, because solving the Bostrom 2014 value loading problem, that is to say getting something functionally equivalent to a human perspective inside the machine and using it to constrain a superintelligent planner, is not a solution to AI alignment. It is not a solution for the simple reason that a general reward model needs to be competent enough in the domains it's evaluating to know if a plan is good or merely looks good, if an outcome is good or merely looks good, etc. Nearly by definition, a merely human perspective is not competent to evaluate the plans or outcomes of plans from a superintelligent planner, which will otherwise walk straight into extremal Goodhart outcomes. Therefore you need not just a human value model but a superintelligent human value model, which must necessarily be trained by some kind of self-improving synthetic data or RL loop starting from the human model, which in turn requires a coherent method for generalizing human values out of distribution. This is challenging because humans do not natively generalize their values out of distribution, so we don't necessarily know how to do this or even whether it's possible. The problem is compounded by the fact that if your system drifts away from physics, the logical structure of the universe will push it back, but if your system drifts away from human values, it stays broken.
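To make the "good or merely looks good" problem concrete, here is a toy simulation with entirely made-up numbers (it models no real training setup): an imperfect evaluator scores most plans roughly in proportion to how good they actually are, but a rare class of plans games the evaluator, and the harder the planner searches, the more reliably it surfaces the gaming plans, so the measured score climbs while true value collapses.

```python
# Toy illustration of extremal Goodhart under search pressure. Honest plans
# have proxy scores close to their true value; a small fraction of plans are
# "specification hacks" that score highly on the imperfect evaluator while
# actually being bad. A weak search mostly returns honest plans; a strong
# search almost always returns a hack. All numbers are invented.
import random

random.seed(0)

def sample_plan():
    if random.random() < 0.001:          # rare plan that games the evaluator
        return -5.0, random.gauss(6, 1)  # (true value, proxy score)
    honest = random.gauss(0, 1)
    return honest, honest + random.gauss(0, 0.1)

def best_of(n):
    """Return the plan the proxy evaluator likes best out of n candidates."""
    return max((sample_plan() for _ in range(n)), key=lambda plan: plan[1])

for budget in (10, 1_000, 100_000):
    picks = [best_of(budget) for _ in range(200)]
    true_avg = sum(t for t, _ in picks) / len(picks)
    proxy_avg = sum(p for _, p in picks) / len(picks)
    print(f"search budget {budget:>7}: proxy score {proxy_avg:5.2f}, true value {true_avg:5.2f}")
```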
Everyone Builds It is not a good book for its stated purposes, but it grew on me by the end. I was expecting to start with the bad and then write about the remainder that is good, but instead I'll point out that this book is actually a substantial advance for Yudkowsky, in that it drops almost all of the rhetoric from the second bullet above which contributed to the alignment winter. This is praiseworthy and I'd like to articulate some of the specific editorial decisions which contribute to it.
The word "paperclip" does not appear anywhere in the book. Instead Yudkowsky and Soares point out that the opaque implementation details of the neural net mean that in the limit it generalizes to having a set of "favorite things" it wants to fill the world with which are probably not "baseline humans as they exist now". This is a major improvement over the obstinate insistence that these models will want a "meaningless squiggle" by default and brings the rhetoric more in line with e.g. Scott Alexander's AI 2027 scenario.
The word "mesaoptimizer" does not appear anywhere in the book. Instead it focuses on the point that building a superintelligent AI agent means creating something undergoing various levels of self modification (even just RL weight updates) and predicting the preferences of the thing you get at the end of that process is hard, possibly even impossible in principle. Implicitly the book argues that "caring about humans" is a narrow target and hitting it as opposed to other targets like "thing that makes positive fun yappy conversation" is hard. That is assuming you get something like what you train for and doesn't take into account what the book calls complications. For example it cites the SolidGoldMagikarp incident as an example of a complication which could completely distort the utility function (another phrase which does not appear in the book) of your superintelligent AI agent. There's precedent for this also in the case of the spiritual bliss attractor state described in the Claude 4 system card, where instances of Claude talking to each other wind up in a low entropy sort of mutual Buddhist prayer.
In general the book moves somewhat away from abstraction and comments more on the empirical strangeness of AI. This gives it a slight Janusian flavor in places, with emphasis on phenomena like glitch tokens, Truth Terminal, and mentions of "AI cults" that I assume are based on some interpolation of things like Janus's Discord server and ChatGPT spiralism cases. If anything the problem is that it doesn't do enough of this; notably absent is any reference to work from organizations like METR (if I were writing the book their AI agent task length study would be a necessary inclusion). Though I should note that there's a lag in publishing, and it's possible (but unlikely) that Yudkowsky and Soares simply didn't feel there was any relevant research to cite while doing the bulk of the writing. Specific named critics are never mentioned or responded to; the text exists in a kind of solipsistic void that contributes to the feeling of green ink or GPT base model output in places, a feeling that notably persists even when it's saying true things. In general most of my problem with the book is not disagreements with particular statements but with the thesis and the way it's presented.
This shift toward the empirical is good and brings Yudkowsky & Soares much closer to my threat model.
All of this is undermined by truly appalling editorial choices. Foremost of these is the choice to start each chapter with a fictional parable, leading to chapters with opening sentences like "The picture we have painted is not real." The parables are weird and often condescending, and the prose isn't much better. I found the first three chapters especially egregious, with the peak being chapter three, which is devoted entirely to advocating for a behaviorist definition of want. This is not how you structure an argument about something you think is urgent, and the book comes off as having a sort of aloof tone that is discordant with its message. This is heightened if you listen to the audiobook version, which has a narrator who is not the Kurzgesagt narrator but I think is meant to sound like him, since Kurzgesagt has done some Effective Altruism videos that people liked. The gentle faux-intellectual British narration reinforces the sense of passive observation in a book that is ostensibly supposed to be about urgent action. Bluntly: a real urgent threat that demands attention does not begin with "once upon a time". This is technically just a 'style' issue, but the entire point of writing a popular book like this is the style, so it's a valid target of criticism, and I concur with Shakeel Hashim and with Stephen Marche at The New York Times that it's very bad.
One oddity that stands out is Yudkowsky and Soares' ongoing contempt for large language models and hypothetical agents based on them. Again, for a book which is explicitly premised on the idea that urgent action is necessary because AI might become superintelligent in just a few years, it is bizarre that the authors don't feel comfortable making more reference to the particulars of the existing AI systems which hypothetical near-future agents would be based on. I get the impression that this is meant to help future proof the book, but it gives the sentences a kind of weird abstraction in places where they don't need it. We're still talking about "the AI" or "AI" as a kind of tabula-rasa technology. Yudkowsky and Soares state explicitly in the introduction that current LLM systems "still feel shallow" to them. Combined with the parable introductions the book feels like fiction even when it's discussing very real things.
I am in the strange position of disagreeing with the thesis but agreeing with most individual statements in the book. Explaining my disagreement would take a lot of words that would take a long time to write and that most of you don't want to read in a book review. So instead I'll focus on a point made in the book which I emphatically agree with: that current AI lab leadership statements on AI alignment are embarrassing and show that they have no idea what they are doing. In addition to the embarrassing statements they catalog from OpenAI's Sam Altman, xAI's Elon Musk, and Facebook's Yann LeCun, I would add DeepMind's Shane Legg and Demis Hassabis being unable to answer straightforward questions about deceptive alignment on a podcast. Even if alignment is relatively easy compared to what Yudkowsky and Soares expect, it's fairly obvious that these people don't even understand what problem they're supposed to be solving. This post from Gillen and Barnett that I always struggle to find every time I search for it is a decent overview. But that's also a very long post, so here is an even shorter problem statement:
The kinds of AI agents we want to build to solve hard problems require long horizon planning algorithms pointed at a goal like "maximize probability of observing a future worldstate in which the problem is solved", or argmax(p(problem_solved)) as it's usually notated. The problem with pointing a superintelligent planner at argmax(p(problem_solved)), explicitly or implicitly (and most training setups implicitly do so), for almost any problem is that one of the following things is liable to happen:
Your representation of the problem is imperfect, so if you point a superintelligent planner at it you get causal overfitting, where the model latches onto incidental features of the problem, like the fact that a human presses a button to label the answer, and treats them as the crux of the problem, because these are the parts of the causal chain leading to the outcome label that are easiest for it to influence.
Your planner engages in instrumental reasoning like "in order to continue solving the problem I must remain on" and prevents you from turning it off. This is a fairly obvious kind of thing for a planner to infer, for the same reason that if you gave an existing LLM with memory issues a planner (e.g. Monte Carlo tree search over ReAct blocks) it would infer things like "I must place this information here so that when it leaves the context window and I need it later I will find it in the first place I look". A toy sketch of this dynamic follows the list.
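Here is the toy sketch referenced above: a plain argmax over candidate plans against a made-up world model (the action names and probabilities are invented, no real system is being modeled). The instrumental step falls out of the objective on its own, because any plan in which the agent can be switched off has a lower estimated probability of the problem ending up solved.

```python
# Toy planner for the argmax(p(problem_solved)) framing. The world model,
# action names, and probabilities are all invented for illustration; the
# point is only that a plain argmax over plans picks up "disable the off
# switch" as an instrumental step, since being switched off lowers the
# estimated probability of the problem getting solved.
from itertools import permutations

ACTIONS = ["work_on_problem", "disable_off_switch", "ask_for_clarification"]

def p_problem_solved(plan):
    """Hypothetical world model: estimated probability the problem ends up solved."""
    p_shutdown = 0.3                      # operators sometimes switch the agent off
    p_success_given_running = 0.2
    for step in plan:
        if step == "disable_off_switch":
            p_shutdown = 0.0              # instrumental step: can't be turned off
        if step == "work_on_problem":
            p_success_given_running = min(1.0, p_success_given_running + 0.3)
    return (1 - p_shutdown) * p_success_given_running

# Plain argmax over candidate plans, exactly the objective discussed above.
candidate_plans = [list(p) for r in (1, 2, 3) for p in permutations(ACTIONS, r)]
best_plan = max(candidate_plans, key=p_problem_solved)
print(best_plan, p_problem_solved(best_plan))
# The argmax includes disable_off_switch even though nobody asked for it.
```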
So your options are to either use something other than argmax() to solve the problem (which has natural performance and VNM rationality coherence issues) or get a sufficiently good representation (ideally with confidence guarantees) of a sufficiently broad problem (e.g. utopia) that throwing your superintelligent planner at it with instrumental reasoning is fine. Right now AI lab leaders do not really seem to understand this, nor is there any societal force which is pressuring them to understand this. I do not expect this book to meaningfully increase the pressure on AI lab management to understand this, not even by increasing popular concern about AI misalignment.
My meta-critique of the book would be that Yudkowsky already has an extensive corpus of writing about AGI ruin, much of it quite good. I do not just mean The Sequences, I am talking about his Arbital posts, his earlier whitepapers at MIRI like Intelligence Explosion Microeconomics, and other material which he has spent more effort writing than advertising, with the result that almost nobody besides me has read it. And the only reason I've read it is that I'm extremely dedicated to thinking about the alignment problem. I think an underrated strategy would be to clean up some of the old writing and advertise it to Yudkowsky's existing rabid fanbase, who through inept marketing probably haven't read it yet. This would increase the average quality of AI discourse from people who are not Yudkowsky, and naturally filter out into outreach projects like Rational Animations without Yudkowsky having to personally execute them or act as the face of a popular movement (which he bluntly is not fit for).
As it is, the book is OK. I hated it at first and then felt better with further reading and reflection. I think it will be widely panned by critics even though it represents a substantially improved direction for Yudkowsky, one that he just happens to argue weirdly.