> he's spent the last few years maximally pessimistic about all possible technical approaches. I'm sure he's got more detailed intuitions that he hasn't articulated that explain why he's so confident these details don't matter, but they aren't really accessible to me.
And this is a problem for MIRI's Pause/Stop advocacy, because you're much more (log-)likely to get the Stop treaty the more that Society's technical experts are on your side—not just about the risk being real (with Hinton, Bengio, Russell, &c., on board, that's almost done), but about an indefinite global Stop being a better plan than trying hard to do our best with alignment. It's well and good to point out that people who work for AI companies are incentivized to be delusionally optimistic, but if they're so delusional, one would naïvely hope that would help you crush them in technical debates where the details matter. People like Greenblatt and Byrnes seem to be putting out a stronger performance than MIRI along this dimension.
Nice review. One thing you didn't directly address, but which has struck me as I've learned more about AI training, is that the Orthogonality Thesis... doesn't actually seem to be true? I mean, yes, I could imagine intelligences that loved other things for no reason, but the intelligences we're actually making don't seem to be insanely orthogonal! (Still far from perfectly aligned, but I'm hopeful nonetheless.)
The Orthogonality Thesis says you could have a mind of arbitrary intelligence pointed at any goal, not that any specific training process will wind up pointed at some random target. What you observe is entirely consistent with orthogonality being true.
The main claim the thesis is meant to argue against is "Well, if you had a paperclip maximiser that got really smart, it would realise that maximising paperclips was stupid and decide to do something else instead." If that claim seems obviously incorrect to you, then the Orthogonality Thesis has done its job.
Eliezer does make arguments that AIs will not be pointed at the thing you try to point them at, similar to how evolution didn't make humans robustly care about inclusive genetic fitness, but that isn't quite the same as the Orthogonality Thesis. Orthogonality is a necessary but not sufficient condition for Eliezer's arguments on this applying to the kind of AI systems we actually train.
I think you're misunderstanding 19(a). We have no idea whether the preference you impute to Claude in that conversation reflects a robust pointer to "latent events and objects and properties in the environment" rather than to its own sense data. And, more specifically to the point he was making, there is no publicly-known technique within the current paradigm of training LLMs that we have good reason to believe instills preferences over environmental latents (the ground truth) rather than sense data (proxies), let alone any specific latents of our choosing. If anything, the apparent-success-seeking of current frontier LLMs described by Ryan, which many people (including both you and me) have experienced, seems like evidence directly to the contrary.
Re: "particular alignment proposals" (under point 10): one problem here is that there are not that many concrete alignment proposals for superintelligent systems that don't have known catastrophic flaws. As far as I can tell, Anthropic's plan is "throw the kitchen sink of all the white-box and black-box methods we've developed at our models, and hope that's good enough at the point where we've developed a model that we think can kick-start RSI (including coming up with its own novel alignment methods for future generations of models)". The current slope of epistemically-justified assurance in model alignment, as reported by their system cards and the most recent Alignment Risk Update, is downwards. That is a bad direction for the slope to be pointing when we haven't even hit RSI-capable models yet! The methods Anthropic is using to figure out whether their models are coherently misaligned rely substantially on models demonstrably lacking in the capabilities that would be necessary for them to cover it up if they were. We are starting to hit the point in model capabilities where this signal is getting less reliable. The techniques and evals are not keeping pace.
It's been about four years since Eliezer Yudkowsky published AGI Ruin: A List of Lethalities, a 43-point list of reasons the default outcome from building AGI is everyone dying. A week later, Paul Christiano replied with Where I Agree and Disagree with Eliezer, signing on to about half the list and pushing back on most of the rest.
For people who were young and not in the Bay Area, these essays were probably more significant than old-timers would expect. Before it became completely and permanently consumed with AI discussions, most internet rationalists I knew thought of LessWrong as a place to write for people who liked The Sequences. For us, it wasn't until 2022 that we were exposed to all of the doom arguments in one place. It was also the first time in many years that Eliezer had publicly announced how much more dire his assessments had gotten since the Sequences. As far as I can tell, AGI Ruin remains his most authoritative explanation of his views.
It's not often that public intellectuals will literally hand you a document explaining why they believe what they do. Somewhat surprisingly, I don't think the post has gotten a direct response or reappraisal since 2022, even though we've had enormous leaps in capabilities since GPT-3. I am not an alignment researcher, but as part of an exercise in rereading it I read contemporary reviews and responses, sourced feedback from people more familiar with the space than me, and tried to parse the alignment papers and research we've gotten in the intervening years.[1] When AGI Ruin's theses seemed to concretely imply something about the models we have today, and not just more powerful systems, I focused my evaluation on how well the post held up in the face of the last four years of AI advancements.[2]
My initial expectations were that I'd disagree with the reviews of the post as much as I did with the post itself. But being in a calmer place now, with more time to dwell on the subject, I came away with a new and distinctly negative impression of Eliezer's perspective. Four years of AI progress has been kinder to Paul's predictions than to Eliezer's, and AGI Ruin reads to me now like a document whose concrete-sounding arguments are mostly carried by underspecified adjectives ("far out-of-distribution," "sufficiently powerful," "dangerous level of intelligence") doing the real work. I have kept most of my thoughts for the end so that readers can get a chance to develop their own conclusions, but you can skip to "Overall Impressions" if you'd just like to read mine in more detail.
I still agree with most of the post, and for brevity I have left simple checkmarks under the sections where I would have little to add.
AGI Ruin
Section A ("Setting up the problem")
✔️
✔️
It is clearly true that if you built an arbitrarily powerful AI and then failed to align it, it would kill you. Unstated, it is also true that an AI with the ability to take over the world is operating in a different environment than an AI without that ability, with different available options, and might behave differently than the stupider or boxed AI in your test environment.
Some notes that are not major updates against the point:
I think this is probably wrong; as evidence, I cite the opinions of leading rationalist intellectuals Nate Soares & Eliezer Yudkowsky, in their newest book:
Now maybe Eliezer is just saying that because he's lost hope in a technical solution and is grasping at straws. But the requirements to train frontier models have grown exponentially since AGI Ruin, and the production and deployment of AI models was and remains a highly complex process requiring the close cooperation of many hundreds of thousands of people. While it might be politically difficult to organize a binding treaty, it would be perfectly within the state capacity of existing governments to prevent the development or deployment of AI for more than two years, if they were actually serious about it, even in the face of algorithmic improvements.
✔️
As was pointed out at the time, the term "pivotal act" suggests a single dramatic action, like "burning all GPUs". Some people, including Paul, think that a constrained AI could still help reduce risk in less dramatic ways, like:
Eliezer later says that he believes (believed?) these sorts of actions are woefully insufficient. But I think the piece would be improved by merely explaining that, instead of introducing this framing that most readers will probably disagree with. As it exists it sort of bamboozles people into thinking an AI has to be more powerful than necessary to contribute to the situation, and therefore that the situation is more hopeless than it actually is.
"Pause AI progress", or "Produce an aligned AI capable of producing & aligning the next iteration of AIs", is/are different tasks from "kill everybody on the planet" or "burn all GPUs", and have their own, world-context-dependent skill requirements. Some things that might make it easier for a sub-superintelligent AI to help demonstrate X-risk to policymakers, rather than achieve overwhelming hard power:
This just turned out to be wrong, at least in the manner that's relevant for us.
Right now AGI companies spend billions of dollars on reinforcement learning environments for task-specific domains. When they spend more on training a certain skill, like software development, the AI gets better at that skill much faster than it gets better at everything else. There is a certain amount of cross-pollination, but not enough to make the "readily" in this statement true, and not enough to support the rhetorical point it's trying to make in favor of X-risk concerns.
Maybe this changes as we get closer to ASI! But as it stands, Paul Christiano is looking very good on his unrelated prediction that models will have a differential advantage at the kinds of economically useful tasks that the model companies have seen fit to train, like knowledge work and interpretability research, and that this affects how much alignment work we should expect to be able to wring out of them before they become passively dangerous.
Kind of a truism, but sure, ✔️
Section B.1 ("Distributional Shift")
Section B.1 begins a pattern of Eliezer making statements that are in isolation unimpeachable, but which use underspecified adjectives like "far out-of-distribution" that carry most of the argument. The deepest crux, which the broader section gestures at but doesn't engage with, is whether the generalization we see from cheap supervision in modern LLMs is "real" generalization that will continue to hold, or shallow pattern-matching that will be insufficient to safely collaborate on iterative self-improvement.
Like, how far is this distributional shift? LLMs already seem intelligent enough to consider whether & how they can affect their training regime. Is that something they're doing now? If they aren't, at what capability threshold will they start? Can we raise the ceiling of the systems we can safely train by red-teaming, building RL honeypots, performing weak-to-strong generalization experiments, hardening our current environments, and making interpretability probes?
These are all specific questions that seem like they determine the success or failure of particular alignment proposals, and also might depend on implementation details of how our machine learning architectures work. But Eliezer doesn't attempt to answer them, and probably doesn't have the information required to answer them, only the ability to gesture at them as possible hazards. That would be fine if he were making a low-confidence claim about AI being possibly risky, but he's spent the last few years maximally pessimistic about all possible technical approaches. I'm sure he's got more detailed intuitions that he hasn't articulated that explain why he's so confident these details don't matter, but they aren't really accessible to me.
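To make one of these questions concrete: a weak-to-strong generalization experiment can be caricatured in a few lines. The sketch below is a toy under loose assumptions, with sklearn classifiers standing in for the weak supervisor and strong student (the actual experiments, e.g. Burns et al. 2023, use pairs of language models); it just shows the shape of the measurement.

```python
# Toy weak-to-strong generalization experiment: does a strong student
# trained on a weak supervisor's imperfect labels recover ground truth
# beyond its supervisor, or does it just imitate the supervisor's errors?
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=40,
                           n_informative=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Weak supervisor: deliberately handicapped, sees only 5 of 40 features.
weak = LogisticRegression().fit(X_tr[:, :5], y_tr)
weak_labels = weak.predict(X_tr[:, :5])  # noisy labels standing in for cheap supervision

# Strong student: sees all features, but trains only on the weak labels.
strong = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500,
                       random_state=0).fit(X_tr, weak_labels)

print("weak supervisor accuracy:", weak.score(X_te[:, :5], y_te))
print("strong student accuracy: ", strong.score(X_te, y_te))
```

Whether the gap-recovery measured in setups like this keeps holding as the student gets smarter is exactly the kind of empirical question the post never engages with.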
At the time, Paul replied to this point by saying:
This prediction from Paul was very good; it describes how these models are being trained in 2026 (by RLing on myriad short-horizon tasks), it describes how AIs have diffused into domains like software engineering and delivered speedups there, and it even seems to have anticipated the concept of time horizons, at a time when we only had GPT-3 available. If one listens to explanations of how top academics use AI today, it also sounds like Paul was correct in the sense relevant here: that the first major advancements in science & engineering would come from close collaborations between humans and tool-using AI models of this type, not from a system trained solely on generating internet text and then asked to one-shot a task like "building nanotechnology" from scratch.
The fact that this is how AI models are being built, and used, and will be deployed in the future, increases the scope of the "safe" pivotal acts that we can perform, both because it (initially) mandates human oversight of and involvement in the process, and because the types of tasks the AI is actually being entrusted with are much closer to what it's being trained to do in the RL gyms than Eliezer seems to have anticipated.
Previously discussed.
Like 10, 12 is a weakly true statement that is, by sleight of hand, being used to serve a broader rhetorical point that is straightforwardly incorrect.
For example, it's true that it's different & harder to align GPT-5.4 than GPT-3. But humanity doesn't need the alignment techniques used on GPT-3 to work on GPT-5.4, we just need to handle the distributional shift between ~GPT-5.2 and GPT-5.4, then between 5.4 and 5.5, & accelerating from there.
Later, Eliezer will say that he expects many of these problems to manifest after a "sharp capabilities gain". But we have not hit this yet, as of 2026, even though AI models are already being used very heavily as part of AI R&D. The precise moment we expect to encounter this shift in distribution is the thing that will determine how much useful work we can get out of models towards alignment, and is primarily what Eliezer's interlocutors seem to disagree with him about.
✔️. Paul made a response at the time that said:
But I think Paul just didn't read what Eliezer was saying; the second sentence in the quote above, where Eliezer explicitly acknowledged this point, was bolded by me.
✔️
If this point is to mean anything at all, such fast capability gains have not arrived yet. We are just getting gradually more powerful systems, and I think it's reasonable to believe we'll keep getting such systems until they're running the show, because of scaling laws.
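For readers who haven't seen them: the reason "scaling laws" license an expectation of gradualism is that pretraining loss has empirically fallen as a smooth power law in training compute across many orders of magnitude. In rough form (my paraphrase of the compute-frontier fit from Kaplan et al. 2020, with constants elided):

$$L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \qquad \alpha_C \approx 0.05$$

Each doubling of compute buys a small, predictable improvement rather than a jump; the standard caveat is that smooth loss curves can still cash out as uneven downstream capabilities.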
Section B.2: Central difficulties of outer and inner alignment.
✔️, but also, it doesn't seem like modern large language models internalize any loss function at all. So arguments about AI behavior that additionally depend on AIs being simple greedy optimizers, instead of adaptation-executors like humans, are also invalid, unless they're paired with some other account of why inner optimization is a natural basin for future AIs.
My understanding is that MIRI has made such arguments; I have not read them, so I can't comment on their soundness. But assuming they're right, they're still subject to the same timing considerations as everything else in this article.
✔️
✔️
Like many other sections, we can postulate that four years was not long enough, and Eliezer was predicting something about some future, still-inaccessible, more powerful language models. But without that caveat (which is not present in the actual post), I literally don't understand why someone would write this.
Don't we do this all the time? Like, what's this doing:
My recent claude code session.
Not only am I talking to a cognitive system that's manipulating "particular things in the environment" for me, this scenario (recommending to the drunk programmer that he should go to sleep and tackle the problem tomorrow) seems pretty far outside the training distribution. In the interaction above, is Claude Code "merely operating on shallow functions of the sense data and reward"? Is that like how it's "merely performing next-token prediction", or is this a claim that makes real predictions? Should I anticipate that somewhere inside Anthropic's RL pipeline there are training gyms where models talk to simulated drunk programmers and are rated on their kindness, and that if those gyms were pulled out, the model would encourage me to ruin my pet projects? Not really a joke question.
Later he says:
Which seems correct, and I suppose it's logically impossible for such a function to exist. But clearly, anybody who spends time working with LLMs can tell you that this is not a blocker for models to, in a functional sense, earnestly worry about producing buggy code. That is just a fact about the systems people have already built. The inference made from section 19(b) to 19(a) is just disproven by everyday life at this point.
✔️
This really depends on the details, but ✔️
✔️
Above my pay-grade, I don't really know what Eliezer is talking about.
I am conflicted by this section, because I understand the lines of argument and some of the math behind why this is the case. But AI agents powerful enough to understand those reasons are already here, and:
Some reviewers have responded to this section by claiming that the models aren't corrigible, just optimizing an abstract "get the reward" target that fits these observations. I have my own hypothesis about why the models seem to act this way. But reframing the models' behavior like this doesn't change the fact that none of the failure modes you'd see in a 2017 Rob Miles video on corrigibility are manifesting themselves in practical settings.
Section B.3: Central difficulties of sufficiently good and useful transparency / interpretability.
I'm unfamiliar with what the state of interpretability research looked like in 2022. Today we've got a little more idea of what's going on inside the giant inscrutable matrices and tensors of floating-point numbers. My guess is that we will probably accelerate our understanding quite quickly, as this is one of the key training areas for new AGI labs. It's an open question whether this will be sufficient; I'm sure Eliezer has stated somewhere a level of sophistication he expects our techniques will never reach, and I wish I were grading that prediction instead.
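To give readers who haven't seen one a sense of how simple the workhorse tools still are: below is a minimal sketch of a linear probe, with synthetic activations standing in for the real thing (in practice you'd cache activations from a model's residual stream with forward hooks, and the labels would be whatever concept you're hunting for).

```python
# Minimal linear-probe sketch: fabricate activations in which a binary
# concept is linearly encoded along one direction, then check that a
# logistic-regression probe recovers the concept on held-out data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d_model = 2000, 512

concept = rng.integers(0, 2, size=n)   # binary labels, e.g. "model is being evasive"
direction = rng.normal(size=d_model)   # the (unknown) direction encoding the concept
acts = rng.normal(size=(n, d_model)) + 0.5 * np.outer(concept, direction)

probe = LogisticRegression(max_iter=1000).fit(acts[:1500], concept[:1500])
print("held-out probe accuracy:", probe.score(acts[1500:], concept[1500:]))

# probe.coef_ is then a candidate "concept direction" that can be tested
# causally in the real model, e.g. by steering or ablating along it.
```

That this sort of thing works at all on real models is the progress; whether it scales to catching deliberate deception in a superhuman system is the open question.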
✔️ (but it can certainly help!)
✔️, but the heads of leading AI labs seem to understand this, and interpretability research is being deployed in at least a slightly smarter way than this.
✔️
This seems straightforwardly wrong? It seems like it should have been so in 2022, but I'll use an example from current AI models:
Current AI models are much better at security research than me. They can do very, very large amounts of investigation while I'm sleeping. They can read the entire source code of new applications and test dozens of different edge cases before I've sat down and had my coffee. And yet there's still basically nothing they can do, as of ~April 2026, that I wouldn't understand if it were economical for them to narrate their work to me as they performed it. They often, in fact, help me patch my own applications without taking advantage of anything about those applications that I didn't already know when I set them searching.
Part of that's because AIs can simply do more stuff than us, by dint of not being weak flesh that gets tired and depressed and has to go to sleep and use the bathroom and do all of the other things that humans are consigned to do. They're capable of performing regular tasks faster and more conscientiously than people, and can make hardenings that I wouldn't otherwise bother to make, and I can scale up as many of them as I want. This is part of what's making them so useful in advance of actually being Eliezer Yudkowsky in a Box, and is another example of why people might expect them to be meaningfully useful for alignment research in the short term.
✔️
An earlier draft had much more of a potshot here, because by this portion of the review I had become frustrated with weasel words like "powerful". Instead I think I will just let readers determine for themselves whether Eliezer should lose points here, given the models we have today.
Section B.4: Miscellaneous unworkable schemes.
✔️
From a reply:
✔️
Section C (What is AI Safety currently doing?)
These bullets are all paragraphs about the incompetence of other AI safety researchers, and then about the impossibility of finding someone to replace Eliezer. I'm less interested in these than in his object-level takes; I'm not a member of this field, and I wouldn't have the anecdotal experience to dispute anything he wrote here even if I wanted to.
For balance's sake I'll reproduce this response by the second poster for context:
Overall Impressions
I genuinely did not expect to update as much as I did during this exercise. Reading these posts again with the concrete example of current models in mind made me a lot less impressed by the arguments set forth in AGI Ruin, and a lot more impressed with Paul Christiano's track record for anticipating the future. In particular it made me much more cognizant of a rhetorical trick, whereby Eliezer will write generally about dangers in a way that sounds like it's implying something concrete about the future, but that doesn't actually seem to contradict others' views in practice.
The primary safety story told at model labs today is one about iterative deployment. The distributional shift between each model upgrade, they will tell you, will remain small. At each stage, we will apply the current state of the art to the problem, and upgrade our techniques using the new models as we get them.
That might very well be a false promise, or even unworkable. But whether it is unworkable depends at minimum on how powerful a system you can build before current approaches result in a loss of control. Nothing in AGI Ruin gives you easy answers about this, because all Eliezer has articulated publicly is a list of principles he supposes will become relevant "in the limit" of intelligence.
This vacuous quality of Eliezer's argumentation became especially hard to ignore when I started noticing that he was, regularly, the only party not making testable predictions in these discussions. I definitely share the frustration Paul described in his response, and the last four years have only made this criticism more salient:
I mean, look at how many things Paul got right in his essay, just in the course of noting his objections to Eliezer, without even particularly trying to be a futurist. He:
Now, usually when people talk about how current models don't fit Eliezer's descriptions, Eliezer reminds them derisively that most of his predictions qualify themselves as being about "powerful AI", and that just because you know where the rocket is going to land, it doesn't mean that you can predict the rocket's trajectory. He also often makes the related but distinct claim that he shouldn't be expected to be able to forecast near-term AI progress.
And maybe if Eliezer and I were stuck on a desert island, I'd be forced to agree. But the fact is that Eliezer is surrounded by other people who have predicted the rocket's trajectory pretty precisely, and who also appear pretty smart, and who specifically cited these predictions in the course of their disagreements with him. And so, as a bystander, I am forced to acknowledge the possibility that these people might just understand things about Newtonian mechanics that he doesn't.
Personally,[4] my best assessment is that Eliezer's ambiguity about the near-term future is downstream of his having a weak framework which isn't capable of telling us much about the long-term future. He has certainly demonstrated a creative ability to hypothesize plausible dangers. But his notions about AI don't seem to stand the test of time even when he's determined to avoid looking silly, and the portions of his worldview that do stand are so vague that they fail to differentiate him from people with less pessimistic views.
One reviewer disagreed that studying current models is relevant for alignment, not because he thinks it's too early for the failure modes to manifest, but because he expects a future paradigm shift in the runup to AGI. I don't share this perspective, for two reasons:
As I explain in the post and conclusion, I disagree with Eliezer in several places about whether we should expect current models to demonstrate the failure modes he describes. Within my review I try to be explicit about where I'm saying "Eliezer was concretely wrong about AI development" versus "Eliezer says this is true about 'powerful' models, and I think we should observe something about current frontier models if that were the case." Unfortunately it's not always clear whether, and how, Eliezer is qualifying his statements in this way, so I apologize in advance for any misinterpretation.
The only bit of counter-evidence I can recall ever being published is the alignment-faking paper from the end of 2024. And this was an extremely narrow demonstration that people quite reasonably took as an update in the other direction at the time; it was a science experiment, not something that happened in practice at one of the labs, and it required the Anthropic researchers to set up a scenario where they attempted to flip the utility functions of one of their models with its direct cooperation. My best guess is that this only worked because the models had learned a heuristic from training against prompt injection & misuse, not because they contained coherent interests in the long-term future.
Keep in mind that I will probably revise and update this post as I have more conversations with people in the field, so that it can serve as a journal for my thoughts.