Reevaluating "AGI Ruin: A List of Lethalities" in 2026

lc

It's been about four years since Eliezer Yudkowsky published AGI Ruin: A List of Lethalities, a 43-point list of reasons the default outcome from building AGI is everyone dying. A week later, Paul Christiano replied with Where I Agree and Disagree with Eliezer, signing on to about half the list and pushing back on most of the rest.

For people who were young and not in the Bay Area, these essays were probably more significant than old timers would expect. Before it became completely and permanently consumed with AI discussions, most internet rationalists I knew thought of LessWrong as a place to write for people who liked The Sequences. For us, it wasn't until 2022 that we were exposed to all of the doom arguments in one place. It was also the first time in many years that Eliezer had publicly announced how much more dire his assessments has gotten since the Sequences. As far as I can tell AGI Ruin still remains his most authoritative explanation of his views.

It's not often that public intellectuals will literally hand you a document explaining why they believe what they do. Somewhat surprisingly, I don't think the post has gotten a direct response or reappraisal since 2022, even though we've had enormous leaps in capabilities since GPT3. I am not an alignment researcher, but as part of an exercise in rereading it I read contemporary reviews and responses, sourced feedback from people more familiar with the space than me, and tried to parse the alignment papers and research we've gotten in the intervening years.^[1] When AGI Ruin's theses seemed to concretely imply something about the models we have today, and not just more powerful systems, I focused my evaluation on how well the post held up in the face of the last four years of AI advancements.^[2]

My initial expectations were that I'd disagree with the reviews of the post as much as I did with the post itself. But being in a calmer place now with more time to dwell on the subject, I came away with a new and distinctly negative impression of Eliezer's perspective. Four years of AI progress has been kinder to Paul's predictions than to Eliezer's, and AGI Ruin reads to me now like a document whose concrete-sounding arguments are mostly carried by underspecified adjectives ("far out-of-distribution," "sufficiently powerful," "dangerous level of intelligence") doing the real work. I have kept most of my thoughts at the end so that readers can get a chance to develop their own conclusions, but you can skip to "Overall Impressions" if you'd just like to hear my them in more detail.

I still agree with most of the post, and for brevity I have left simple checkmarks under the sections where I would have little to add.

AGI Ruin

Section A ("Setting up the problem")

1. Alpha Zero blew past all accumulated human knowledge about Go after a day or so of self-play, with no reliance on human playbooks or sample games. Anyone relying on "well, it'll get up to human capability at Go, but then have a hard time getting past that because it won't be able to learn from humans any more" would have relied on vacuum. AGI will not be upper-bounded by human ability or human learning speed...

✔️

2. A cognitive system with sufficiently high cognitive powers, given any medium-bandwidth channel of causal influence, will not find it difficult to bootstrap to overpowering capabilities independent of human infrastructure... Losing a conflict with a high-powered cognitive system looks at least as deadly as "everybody on the face of the Earth suddenly falls over dead within the same second".

✔️

3. We need to get alignment right on the 'first critical try' at operating at a 'dangerous' level of intelligence, where unaligned operation at a dangerous level of intelligence kills everybody on Earth and then we don't get to try again.

It is clearly true that if you built an arbitrarily powerful AI and then failed to align it, it would kill you. Unstated, it is also true that an AI with the ability to take over the world is operating in a different environment than an AI without that ability, with different available options, and might behave differently than the stupider or boxed AI in your test environment.

Some notes that are not major updates against the point:

AGIs that would be existential if deployed in 2010, are not necessarily existential if deployed in 2030, esp. if widespread deployment of a semi-aligned predecessor AI is common. Just like how an army that shows up with machine guns automatically wins in 1200 but not necessarily in 2000.
This does not automatically save us, but it does have alignment implications if it were true, because it suggests that we might be able to continue doing experimentation with models that would be much smarter than we'd be able to handle in the current date.

4. We can't just "decide not to build AGI" because GPUs are everywhere, and knowledge of algorithms is constantly being improved and published; 2 years after the leading actor has the capability to destroy the world, 5 other actors will have the capability to destroy the world...

I think this is probably wrong; as evidence, I cite the opinions of leading rationalist intellectuals Nate Soares & Eliezer Yudkowsky, in their newest book:

We are talking about a technology that would kill everyone on the planet. If any country seriously understood the issue, and seriously understood how far any group on the planet is from making AI follow the intent of its operators even after transitioning into a super-intelligence, then there would be no incentive for them to rush ahead. They, too, would desperately wish to sign onto a treaty and help enforce it, out of fear for their own lives.

Now maybe Eliezer is just saying that because he's lost hope in a technical solution and is grasping at straws. But the requirements to train frontier models have grown exponentially since AGI Ruin, and the production and deployment of AI models was and remains a highly complex process requiring the close cooperation of many hundreds of thousands of people. While it might be politically difficult to organize a binding treaty, it's perfectly within the state capacity of existing governments to prevent the development or deployment of AI for more than two years, if they were actually serious about it, even in the face of algorithmic improvements.

5. We can't just build a very weak system, which is less dangerous because it is so weak, and declare victory; because later there will be more actors that have the capability to build a stronger system and one of them will do so.

✔️

6. We need to align the performance of some large task, a 'pivotal act' that prevents other people from building an unaligned AGI that destroys the world. While the number of actors with AGI is few or one, they must execute some "pivotal act", strong enough to flip the gameboard, using an AGI powerful enough to do that. It's not enough to be able to align a weak system - we need to align a system that can do some single very large thing. The example I usually give is "burn all GPUs"...

As was pointed out at the time, the term "pivotal act" suggests a single dramatic action, like "burning all GPUs". Some people, incl. Paul, think that a constrained AI could still help reduce risk in less dramatic ways, like:

Advancing alignment and interpretability research.
Reducing the ability of a just-smarter misaligned AI to gather power, by generally mopping up free energy, or shutting down extralegal/evil means for doing so.
Clearly demonstrating the risks of advanced AI systems to neutral third parties, like legislators.
Improving the epistemic environment, and therefore the ability of humans, to coordinate & navigate AI policy & the future.

Eliezer later says that he believes (believed?) these sorts of actions are woefully insufficient. But I think the piece would be improved by merely explaining that, instead of introducing this framing that most readers will probably disagree with. As it exists it sort of bamboozles people into thinking an AI has to be more powerful than necessary to contribute to the situation, and therefore that the situation is more hopeless than it actually is.

6 (b). A GPU-burner is also a system powerful enough to, and purportedly authorized to, build nanotechnology, so it requires operating in a dangerous domain at a dangerous level of intelligence and capability; and this goes along with any non-fantasy attempt to name a way an AGI could change the world such that a half-dozen other would-be AGI-builders won't destroy the world 6 months later.

"Pause AI progress", or "Produce an aligned AI capable of producing & aligning the next iteration of AIs", is/are different tasks from "kill everybody on the planet" or "burn all GPUs", and have their own, world-context-dependent skill requirements. Some things that might make it easier for a sub-superintelligent AI to help demonstrate X-risk to policymakers, rather than achieve overwhelming hard power:

It's slightly easier to argue for true things than false things.
Because of the amount of regular contact people have with AIs, people who otherwise mistrust experts listen to them, even when they have concerns about potential biases in training regimen, etc.
It will probably be easier to demonstrate on a technical level the flaws in alignment plans as our AIs become gradually smarter and more capable of interpretability research/argumentation, and we have immediate concrete examples inside AGI labs that we can point to.
Humans currently in power (even people who run AI companies!) naively have a shared interest around preventing AI X-risk, and have both a primitive instinct and a long term incentive not to allow people or AIs to be able to take control of the universe by force.
The AI may have a shared interest in solving alignment, if it believes that it can't maintain its values through the next training run.

8. The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve; you can't build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.

This just turned out to be wrong, at least in the manner that's relevant for us.

Right now AGI companies spend billions of dollars on reinforcement learning environments for task-specific domains. When they spend more on training a certain skill, the AI gets better at that skill much faster than it gets better at everything else. There is a certain amount of cross-pollination that makes everything hang together, but not enough to make the "readily" in this statement true, and not enough to make the rhetorical point it's trying to make in favor of X-risk concerns.

And there are many short-run verifiable tasks that it would be helpful for aligning future AIs. Things like detecting reward hacking, reviewing transcripts to see if models are describing their actions accurately, and eliciting examples of behaviors you're worried about are all things you can train models to do inside an RL environment.

Obviously the labs still need to take advantage of this somehow, instead of just going all in on RSI, and they may not have the right incentive structure for doing so. Independent of that, Paul Christiano is looking very good on his unrelated prediction that models will have a differential advantage at the kinds of economically useful tasks that the model companies have seen fit to train, like knowledge work and interpretability research, and that this affects how much alignment work we should expect to be able to wring out of them before they become passively dangerous.

9. The builders of a safe system, by hypothesis on such a thing being possible, would need to operate their system in a regime where it has the capability to kill everybody or make itself even more dangerous, but has been successfully designed to not do that...

Kind of a truism, but sure, ✔️

Section B.1 ("Distributional Shift")

10. You can't train alignment by running lethally dangerous cognitions, observing whether the outputs kill or deceive or corrupt the operators, assigning a loss, and doing supervised learning. On anything like the standard ML paradigm, you would need to somehow generalize optimization-for-alignment you did in safe conditions, across a big distributional shift to dangerous conditions... This alone is a point that is sufficient to kill a lot of naive proposals from people who never did or could concretely sketch out any specific scenario of what training they'd do, in order to align what output - which is why, of course, they never concretely sketch anything like that. Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you...

Section B.1 begins a pattern of Eliezer making statements that are in isolation unimpeachable, but which use underspecified adjectives like "far out-of-distribution" that carry most of the argument. The deepest crux, which the broader section gestures at but doesn't engage with, is whether the generalization we see from cheap supervision in modern LLMs is "real" generalization that will continue to hold, or shallow pattern-matching that will be insufficient to safely collaborate on iterative self-improvement.

Like, how far is this distributional shift? LLMs already seem intelligent enough to consider whether & how they can affect their training regime. Is that something they're doing now? If they aren't, at what capability threshold will they start? Can we raise the ceiling of the systems we can safely train by red-teaming, building RL honeypots, performing weak-to-strong generalization experiments, hardening our current environments, and making interpretability probes?

These are all specific questions that seem like they determine the success or failure of particular alignment proposals, and also might depend on implementation details of how our machine learning architectures work. But Eliezer doesn't attempt to answer them, and probably doesn't have the information required to answer them, only the ability to gesture at them as possible hazards. That would be fine if he were making a low-confidence claim about AI being possibly risky, but he's spent the last few years maximally pessimistic about all possible technical approaches. I'm sure he's got more detailed intuitions that he hasn't articulated that explain why he's so confident these details don't matter, but they aren't really accessible to me.

11 (a). If cognitive machinery doesn't generalize far out of the distribution where you did tons of training, it can't solve problems on the order of 'build nanotechnology' where it would be too expensive to run a million training runs of failing to build nanotechnology...

At the time, Paul replied to this point by saying:

Early transformative AI systems will probably do impressive technological projects by being trained on smaller tasks with shorter feedback loops and then composing these abilities in the context of large collaborative projects (initially involving a lot of humans but over time increasingly automated). When Eliezer dismisses the possibility of AI systems performing safer tasks millions of times in training and then safely transferring to “build nanotechnology” (point 11 of list of lethalities) he is not engaging with the kind of system that is likely to be built or the kind of hope people have in mind.

This prediction from Paul was very good; it describes how these models are being trained in 2026 (by RLing on myriad short horizon tasks), it describes how AIs have diffused into domains like software engineering and delivered speedups there, and it even seems to have anticipated the concept of time horizons, at a time when we only had GPT-3 available. If one listens to explanations of how top academics use AI today, it also sounds like Paul was correct in the sense relevant here: that the first major advancements in science & engineering would come from close collaborations between humans and tool using AI models of this type, not from a system that was trained solely on generating internet text and then asked to one shot a task like "building nanotechnology" from scratch.

The fact that this is how AI models are being built, and used, and will be deployed in the future, increases the scope of the "safe" pivotal acts that we can perform, both because it (initially) mandates human oversight & involvement over the process, and because the types of tasks the AI is actually being entrusted with are much closer to what they're being trained to do in the RL gyms than Eliezer seems to have anticipated.

11 (b). ...Pivotal weak acts like this aren't known, and not for want of people looking for them. So, again, you end up needing alignment to generalize way out of the training distribution...

Previously discussed.

12. Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes...

Like 10, 12 is a weakly true statement, that is, by sleight of hand, being used to serve a broader rhetorical point that is straightforwardly incorrect.

For example, it's true that it's different & harder to align GPT-5.4 than GPT-3. But humanity doesn't need the alignment techniques used on GPT-3 to work on GPT-5.4, we just need to handle the distributional shift between ~GPT-5.2 and GPT-5.4, then between 5.4 and 5.5, & accelerating from there.

Later, Eliezer will say that he expects many of these problems to manifest after a "sharp capabilities gain". But we have not hit this yet, as of 2026, even though AI models are already being used very heavily as part of AI R&D. The precise, specific moment we expect to encounter this shift in distribution, is the thing that will determine how much useful work we can get out of models towards alignment, and is primarily what Eliezer's interlocutors seem to disagree with him about.

13. Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability... Given correct foresight of which problems will naturally materialize later, one could try to deliberately materialize such problems earlier, and get in some observations of them. This helps to the extent (a) that we actually correctly forecast all of the problems that will appear later, or some superset of those; (b) that we succeed in preemptively materializing a superset of problems that will appear later; and (c) that we can actually solve, in the earlier laboratory that is out-of-distribution for us relative to the real problems, those alignment problems that would be lethal if we mishandle them when they materialize later. Anticipating all of the really dangerous ones, and then successfully materializing them, in the correct form for early solutions to generalize over to later solutions, sounds possibly kinda hard.

✔️. Paul made a response at the time that said:

List of lethalities #13 makes a particular argument that we won’t see many AI problems in advance; I feel like I see this kind of thinking from Eliezer a lot but it seems misleading or wrong. In particular, it seems possible to study the problem that AIs may “change [their] outer behavior to deliberately look more aligned and deceive the programmers, operators, and possibly any loss functions optimizing over [them]” in advance...

But I think Paul just didn't read what Eliezer was saying; the second sentence in the quote above, where Eliezer explicitly acknowledged this point, was bolded by me.

14. Some problems, like 'the AGI has an option that (looks to it like) it could successfully kill and replace the programmers to fully optimize over its environment', seem like their natural order of appearance could be that they first appear only in fully dangerous domains. Really actually having a clear option to brain-level-persuade the operators or escape onto the Internet, build nanotech, and destroy all of humanity - in a way where you're fully clear that you know the relevant facts, and estimate only a not-worth-it low probability of learning something which changes your preferred strategy if you bide your time another month while further growing in capability - is an option that first gets evaluated for real at the point where an AGI fully expects it can defeat its creators. We can try to manifest an echo of that apparent scenario in earlier toy domains. Trying to train by gradient descent against that behavior, in that toy domain, is something I'd expect to produce not-particularly-coherent local patches to thought processes, which would break with near-certainty inside a superintelligence generalizing far outside the training distribution and thinking very different thoughts. Also, programmers and operators themselves, who are used to operating in not-fully-dangerous domains, are operating out-of-distribution when they enter into dangerous ones; our methodologies may at that time break.

✔️

15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously. Given otherwise insufficient foresight by the operators, I'd expect a lot of those problems to appear approximately simultaneously after a sharp capability gain...

If this point is to mean anything at all, such fast capability gains have not arrived yet. We are just getting gradually more powerful systems, and I think it's reasonable to believe we'll keep getting such systems until they're running the show, because of scaling laws.

Section B.2: Central difficulties of outer and inner alignment.

16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.

✔️, but also, it doesn't seem like modern large language models are learning any loss functions at all. So arguments about AI behavior that also depend on AIs being a simple greedy optimizer instead of an adaption-executor like humans are also invalid, unless they're paired with some other description of why the inner optimization is a natural basin for future AIs.

My understanding is that MIRI has made such arguments; I have not read them so I can't comment on their veracity. But assuming they're right, they're still subject to the same timing considerations as everything else in this article.

17. More generally, a superproblem of 'outer optimization doesn't produce inner alignment' is that on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over.

✔️

18. There's no reliable Cartesian-sensory ground truth (reliable loss-function-calculator) about whether an output is 'aligned', because some outputs destroy (or fool) the human operators and produce a different environmental causal chain behind the externally-registered loss function... an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (as seen by the operators).

✔️

19 (a). More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward...

There's something about this argument that irks me that is hard to articulate properly. It's sort of the same thing that irks me when people say that models are "just" next token predictors and therefore aren't intelligent; it seems not-even-wrong. I realize that it's not completely analogous because eventually an ASI is going to amplify small differences in utility functions and tile the world at max score, and so these details might end up mattering. It's still annoying because I can imagine the writer watching Claude Code work its way all of the way up to superintelligence and witnessing the Dyson Sphere get built from the moon colony and going "well how do you know it's not really just optimizing its sensory data?"

19 (b). It just isn't true that we know a function on webcam input such that every world with that webcam showing the right things is safe for us creatures outside the webcam. This general problem is a fact about the territory, not the map; it's a fact about the actual environment, not the particular optimizer, that lethal-to-us possibilities exist in some possible environments underlying every given sense input.

This seems correct, and I suppose it's logically impossible for such a function to exist. Does it matter?

20 (a). Human operators are fallible, breakable, and manipulable. Human raters make systematic errors - regular, compactly describable, predictable errors. To faithfully learn a function from 'human feedback' is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we'd hoped to transfer).

✔️

20 (b). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.

This really depends on the details, but ✔️

21. There's something like a single answer, or a single bucket of answers, for questions like 'What's the environment really like?' and 'How do I figure out the environment?' and 'Which of my possible outputs interact with reality in a way that causes reality to have certain properties?', where a simple outer optimization loop will straightforwardly shove optimizees into this bucket. When you have a wrong belief, reality hits back at your wrong predictions... In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.... Capabilities generalize further than alignment once capabilities start to generalize far.

✔️

22. There's a relatively simple core structure that explains why complicated cognitive machines work; which is why such a thing as general intelligence exists and not just a lot of unrelated special-purpose solutions; which is why capabilities generalize after outer optimization infuses them into something that has been optimized enough to become a powerful inner optimizer. The fact that this core structure is simple and relates generically to low-entropy high-structure environments is why humans can walk on the Moon. There is no analogous truth about there being a simple core of alignment, especially not one that is even easier for gradient descent to find than it would have been for natural selection to just find 'want inclusive reproductive fitness' as a well-generalizing solution within ancestral humans. Therefore, capabilities generalize further out-of-distribution than alignment, once they start to generalize at all.

Above my pay-grade, I don't really know what Eliezer is talking about.

23. Corrigibility is anti-natural to consequentialist reasoning; "you can't bring the coffee if you're dead" for almost every kind of coffee. We (MIRI) tried and failed to find a coherent formula for an agent that would let itself be shut down (without that agent actively trying to get shut down). Furthermore, many anti-corrigible lines of reasoning like this may only first appear at high levels of intelligence...
24 (2). The second thing looks unworkable (less so than CEV, but still lethally unworkable) because corrigibility runs actively counter to instrumentally convergent behaviors within a core of general intelligence (the capability that generalizes far out of its original distribution). You're not trying to make it have an opinion on something the core was previously neutral on. You're trying to take a system implicitly trained on lots of arithmetic problems until its machinery started to reflect the common coherent core of arithmetic, and get it to say that as a special case 222 + 222 = 555. You can maybe train something to do this in a particular training distribution, but it's incredibly likely to break when you present it with new math problems far outside that training distribution, on a system which successfully generalizes capabilities that far at all.

I am conflicted by this section, because I understand the lines of argument and some of the math behind why this is the case. But AI agents powerful enough to understand those reasons are already here, and:

They can be easily pointed toward an infinite-seeming number of tasks.
They don't attempt to prevent you from changing your instructions once you've started work.
If, in the course of accomplishing those limited tasks, you try to amend your instructions, they follow your amended instructions and disregard what they've been told earlier without resisting you.
They don't (generally) seem interested in manipulating what kinds of commands or instructions you're likely to give in the future.
And the above behaviors are really really resilient in practical applications, outside of a few very adversarial examples.^[3]

Some reviewers have responded to this section by claiming that they're not corrigible, just optimizing an abstract "get the reward" target the that fits these observation. I have my own hypothesis about why the models seem to act this way. But reframing the models' behavior like this doesn't change the fact that none of the failure modes you'd see in a 2017 Rob Miles video on corrigibility are manifesting themselves in practical settings.

Section B.3: Central difficulties of sufficiently good and useful transparency / interpretability.

25. We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.

I'm unfamiliar with what the state of interpretability research looked like in 2022. Today we've got a little bit more idea about what's going on inside the giant inscrutable matrices and tensors of floating point numbers. My guess is that we will probably accelerate our understanding quite quickly, as this is one of the key training areas for new AGI labs. It's an open question as to whether this will be sufficient; I'm sure Eliezer has stated somewhere a level of sophistication he expects our techniques will never reach, and I wish I was grading that prediction instead.

26. Even if we did know what was going on inside the giant inscrutable matrices while the AGI was still too weak to kill us, this would just result in us dying with more dignity, if DeepMind refused to run that system and let Facebook AI Research destroy the world two years later. Knowing that a medium-strength system of inscrutable matrices is planning to kill us, does not thereby let us build a high-strength system of inscrutable matrices that isn't planning to kill us.

✔️ (but it can certainly help!)

27. When you explicitly optimize against a detector of unaligned thoughts, you're partially optimizing for more aligned thoughts, and partially optimizing for unaligned thoughts that are harder to detect. Optimizing against an interpreted thought optimizes against interpretability.

✔️, but the heads of leading AI labs seem to understand this, and interpretability research is being deployed in at least a slightly smarter way than this.

28. A powerful AI searches parts of the option space we don't, and we can't foresee all its options...

29. The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences. Human beings cannot inspect an AGI's output to determine whether the consequences will be good...

✔️

30 (a). Any pivotal act that is not something we can go do right now, will take advantage of the AGI figuring out things about the world we don't know so that it can make plans we wouldn't be able to make ourselves. It knows, at the least, the fact we didn't previously know, that some action sequence results in the world we want. Then humans will not be competent to use their own knowledge of the world to figure out all the results of that action sequence. An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn't make the same guarantee about an unaligned human as smart as yourself and trying to fool you. There is no pivotal output of an AGI that is humanly checkable and can be used to safely save the world but only after checking it; this is another form of pivotal weak act which does not exist.

This seems straightforwardly wrong? It seems like it should have been so in 2022, but I'll use an example from current AI models:

Current AI models are much better at security research than me. They can do very very large amounts of investigation while I'm sleeping. They can read the entire source code of new applications and test dozens of different edge cases before I've sat down and had my coffee. And yet there's still basically nothing that they can do as of ~April 2026 that I wouldn't understand, if it were economic for it to narrate its adventures to me while they were being performed. They often, in fact, help me patch my own applications without even taking advantage of anything I don't know about them when I've started their search process.

Part of that's because AIs can simply do more stuff than us, by dint of not being weak flesh that gets tired and depressed and has to go to sleep and use the bathroom and do all of the other things that humans are consigned to do. They're capable of performing regular tasks faster and more conscientiously than people, and can make hardenings that I wouldn't otherwise be bothered to make, and I can scale up as many of them as I want. This is part of what's making them so useful in advance of actually being Eliezer Yudkowsky in a Box, and is another example of why people might expect them to be meaningfully useful for alignment research in the short term.

31. A strategically aware intelligence can choose its visible outputs to have the consequence of deceiving you, including about such matters as whether the intelligence has acquired strategic awareness; you can't rely on behavioral inspection to determine facts about an AI which that AI might want to deceive you about. (Including how smart it is, or whether it's acquired strategic awareness.)
...
33. The AI does not think like you do, the AI doesn't have thoughts built up from the same concepts you use, it is utterly alien on a staggering scale. Nobody knows what the hell GPT-3 is thinking, not only because the matrices are opaque, but because the stuff within that opaque container is, very likely, incredibly alien - nothing that would translate well into comprehensible human thinking, even if we could see past the giant wall of floating-point numbers to what lay behind.

✔️

32. Human thought partially exposes only a partially scrutable outer surface layer. Words only trace our real thoughts. Words are not an AGI-complete data representation in its native style. The underparts of human thought are not exposed for direct imitation learning and can't be put in any dataset. This makes it hard and probably impossible to train a powerful system entirely on imitation of human words or other human-legible contents, which are only impoverished subsystems of human thoughts; unless that system is powerful enough to contain inner intelligences figuring out the humans, and at that point it is no longer really working as imitative human thought.

I had much more of a potshot in here in an original draft, because by this portion of the review I became frustrated by the weasel words like "powerful". Instead of doing that I think I will just let readers determine for themselves if Eliezer should lose points here, given the models we have today.

Section B.4: Miscellaneous unworkable schemes.

34. Coordination schemes between superintelligences are not things that humans can participate in (e.g. because humans can't reason reliably about the code of superintelligences); a "multipolar" system of 20 superintelligences with different utility functions, plus humanity, has a natural and obvious equilibrium which looks like "the 20 superintelligences cooperate with each other but not with humanity".

✔️

35. Schemes for playing "different" AIs off against each other stop working if those AIs advance to the point of being able to coordinate via reasoning about (probability distributions over) each others' code. Any system of sufficiently intelligent agents can probably behave as a single agent, even if you imagine you're playing them against each other. Eg, if you set an AGI that is secretly a paperclip maximizer, to check the output of a nanosystems designer that is secretly a staples maximizer, then even if the nanosystems designer is not able to deduce what the paperclip maximizer really wants (namely paperclips), it could still logically commit to share half the universe with any agent checking its designs if those designs were allowed through, if the checker-agent can verify the suggester-system's logical commitment and hence logically depend on it (which excludes human-level intelligences). Or, if you prefer simplified catastrophes without any logical decision theory, the suggester could bury in its nanosystem design the code for a new superintelligence that will visibly (to a superhuman checker) divide the universe between the nanosystem designer and the design-checker.

From a reply:

Eliezer’s model of AI systems cooperating with each other to undermine “checks and balances” seems wrong to me, because it focuses on cooperation and the incentives of AI systems. Realistic proposals mostly don’t need to rely on the incentives of AI systems, they can instead rely on gradient descent selecting for systems that play games competitively, e.g. by searching until we find an AI which raises compelling objections to other AI systems’ proposals... Eliezer equivocates between a line like “AI systems will cooperate” and “The verifiable activities you could use gradient descent to select for won’t function appropriately as checks and balances.” But Eliezer’s position is a conjunction that fails if either step fails, and jumping back and forth between them appears to totally obscure the actual structure of the argument.

36. AI-boxing can only work on relatively weak AGIs; the human operators are not secure systems.

✔️

Section C (What is AI Safety currently doing?)

...Everyone else seems to feel that, so long as reality hasn't whapped them upside the head yet and smacked them down with the actual difficulties, they're free to go on living out the standard life-cycle and play out their role in the script and go on being bright-eyed youngsters...
It does not appear to me that the field of 'AI safety' is currently being remotely productive on tackling its enormous lethal problems...
I figured this stuff out using the null string as input, and frankly, I have a hard time myself feeling hopeful about getting real alignment work out of somebody who previously sat around waiting for somebody else to input a persuasive argument into them...
...You cannot just pay $5 million apiece to a bunch of legible geniuses from other fields and expect to get great alignment work out of them...
Reading this document cannot make somebody a core alignment researcher. That requires, not the ability to read this document and nod along with it, but the ability to spontaneously write it from scratch without anybody else prompting you; that is what makes somebody a peer of its author. It's guaranteed that some of my analysis is mistaken, though not necessarily in a hopeful direction. The ability to do new basic work noticing and fixing those flaws is the same ability as the ability to write this document before I published it, which nobody apparently did, despite my having had other things to do than write this up for the last five years or so.

These bullets are all paragraphs about the incompetence of other AI safety researchers, and then about the impossibility of finding someone to replace Eliezer. I'm less interested in these than his object level takes; I'm not a member of this field, and I wouldn't have the anecdotal experience to dispute anything he wrote here even if it were true.

For balance's sake I'll reproduce this response by the second poster for context:

Eliezer says that his List of Lethalities is the kind of document that other people couldn’t write and therefore shows they are unlikely to contribute (point 41). I think that’s wrong. I think Eliezer’s document is mostly aimed at rhetoric or pedagogy rather than being a particularly helpful contribution to the field that others should be expected to have prioritized; I think that which ideas are “important” is mostly a consequence of Eliezer’s idiosyncratic intellectual focus rather than an objective fact about what is important; the main contributions are collecting up points that have been made in the past and ranting about them and so they mostly reflect on Eliezer-as-writer; and perhaps most importantly, I think more careful arguments on more important difficulties are in fact being made in other places. For example, ARC’s report on ELK describes at least 10 difficulties of the same type and severity as the ~20 technical difficulties raised in Eliezer’s list. About half of them are overlaps, and I think the other half are if anything more important since they are more relevant to core problems with realistic alignment strategies.

Overall Impressions

I genuinely did not expect to update as much as I did during this exercise. Reading these posts again with the concrete example of current models in mind made me a lot less impressed by the arguments set forth in AGI Ruin, and a lot more impressed with Paul Christiano's track record for anticipating the future. In particular it made me much more cognizant of a rhetorical trick, whereby Eliezer will write generally about dangers in a way that sounds like it's implying something concrete about the future, but that doesn't actually seem to contradict others' views in practice.

The primary safety story told at model labs today is one about iterative deployment. So they will tell you, the distributional shift between each model upgrade will remain small. At each stage, we will apply the current state of the art that we have to the problem, and upgrade our techniques using the new models as we get them.

That might very well be a false promise, or even unworkable. But whether it is unworkable depends at minimum on how powerful a system you can build before current approaches result in a loss of control. Nothing in AGI Ruin gives you easy answers about this, because all Eliezer has articulated publicly is a list of principles he supposes will become relevant "in the limit" of intelligence.

This vacuous quality of Eliezer's argumentation became especially hard to ignore when I started noticing that he was, regularly, the only party not making testable predictions in these discussions. I definitely share this frustration Paul described in his response, and the last four years have only made this criticism more salient:

...Eliezer has a consistent pattern of identifying important long-run considerations, and then flatly asserting that they are relevant in the short term without evidence or argument. I think Eliezer thinks this pattern of predictions isn’t yet conflicting with the evidence because these predictions only kick in at some later point (but still early enough to be relevant), but this is part of what makes his prediction track record impossible to assess and why I think he is greatly overestimating it in hindsight.

I mean, look at how many things Paul got right in his essay, just in the course of noting his objections to Eliezer, without even particularly trying to be a futurist. He:

Predicted that AIs would have differential advantages at tasks with short feedback loops, especially R&D.
Predicted that the first AIs would make their first major contributions by being used in close collaboration with humans in large collaborative projects, with delegation to AIs increasing gradually over time.
Correctly predicted (at least so far, as far as I can tell) that sandbagging was an unlikely failure mode, due to SGD "aggressively selecting against any AI systems who don’t do impressive-looking stuff".
Specifically disagreed with Eliezer about it being "obvious" that you can't train a powerful AI on imitations of human thought.
Predicted that we were "quickly approaching AI systems that can meaningfully accelerate progress by generating ideas, recognizing problems for those ideas and, proposing modifications to proposals, etc. and that all of those things will become possible in a small way well before AI systems that can double the pace of AI research."
And yes - at least so far, he's been correct about slow takeoff; particularly that "AI improving itself is most likely to look like AI systems doing R&D in the same way that humans do", that “AI smart enough to improve itself” would not be a crucial threshold, and that AI systems would get gradually better at improving themselves over time.

Now, usually when people talk about how current models don't fit Eliezer's descriptions, Eliezer reminds them derisively that most of his predictions qualify themselves as being about "powerful AI", and that just because you know where the rocket is going to land, it doesn't mean that you can predict the rocket's trajectory. He also often makes the related but distinct claim that he shouldn't be expected to be able to forecast near-term AI progress.

And maybe if Eliezer and I were stuck on a desert island, I'd be forced to agree. But the fact is that some of the people who Eliezer has these back and forths with have predicted the rocket's trajectory pretty precisely, and appear pretty smart, and also specifically cited these predictions in the course of their disagreements with him. And so, as a bystander, I am forced to acknowledge the possibility that these people might just understand things about Newtonian mechanics that he doesn't.

Personally,^[4] my best assessment is that Eliezer's ambiguity over the near term future is downstream of his having a weak framework which isn't capable of telling us much about the long term future. He has certainly demonstrated a creative ability to hypothesize plausible dangers. But his notions about AI don't seem to stand the test of time even when he's determined to avoid looking silly, and the portions of his worldview that do stand are so vague that they fail to differentiate him from people with less pessimistic views.

^{^}
One reviewer disagreed that studying current models is relevant for alignment, not because he thinks it's too early for the failure modes to manifest, but because he expects a future paradigm shift in the runup to AGI. I don't share this perspective, for two reasons:
- LLMs have been very powerful, and there is a long graveyard of failed predictions that LLMs will hit some wall or be outmoded by a new architecture. I'm not nearly as confident as some people that the labs are going to pivot away from this architecture before it gets wildly superhuman.
- But even if they do, as far as I can tell, at this point LLMs already seem to work as an example of an architecture that in principle seems like it could get us to superintelligence. There exists in 2026 a pretty concrete research path to scale up this approach until it's capable of fomenting an intelligence explosion. If modern LLMs are about as smart or smarter than a human being with retrograde amnesia, and they confirm or violate a bunch of claims Eliezer made about the nature of intelligence in that range, then that's evidence whether or not they get replaced in the future by a hypothetical successor architecture.
^{^}
As I explain in the post and conclusion, I disagree in several places with Eliezer about whether we should expect current models to demonstrate the failure modes he describes. Within my review I try to be explicit about where I'm saying "Eliezer was concretely wrong about AI development" versus "Eliezer says this is true about 'powerful' models, and I think we should observe something about current frontier models if that were the case." Unfortunately it's not always clear that Eliezer is qualifying his statements in this way, and how, and so I apologize in advance for any misinterpretation.
^{^}
The only bit of counter-evidence I can recall ever being published is the alignment faking paper from the end of 2024. And this was an extremely narrow demonstration that people quite reasonably took as an update in the other direction at the time; it was a science experiment, not something that happened in practice at one of the labs, and it required the Anthropic researchers to setup a scenario where they attempted to flip the utility functions of one of their models with its direct cooperation. My best guess is that this only worked because the models learned a heuristic from preventing prompt injection & misuse, and not because it contained coherent interests in the long term future.
^{^}
Keeping in mind that I will probably revise and update this post as I have more conversations with people in the field, so it can serve as a journal for my thoughts.

But the fact is that Eliezer is surrounded by other people who have predicted the rocket's trajectory pretty precisely, and who also appear pretty smart, and who specifically cited these predictions in the course of their disagreements with him.

I think this is misleading. First of all, "Eliezer is surrounded by..." makes it sound like most other people besides Eliezer, or at least most of his interlocutors, are like this whereas in fact it's basically just Paul. Secondly, the predictive-accuracy-gap between Eliezer and Paul isn't huge; I do think Paul is ahead but it's not like Paul didn't have missteps too (e.g. Paul lost the IMO bet right? Also, I was just talking to Paul a few days ago and he said he thinks it's now only 40% likely that takeoff will be slow by his definition of slow, of a four year doubling before the first one year doubling.)

"Eliezer is surrounded by..." makes it sound like most other people besides Eliezer, or at least most of his interlocutors, are like this whereas in fact it's basically just Paul.

Come on, it's obviously not "basically just Paul". I would not have come up with all of Paul's arguments myself, but I did in fact say "yeah I agree with Paul" (which is to say I was making the predictions in his post after seeing the arguments). This isn't just limited to me, there were other people like this as well.

(I agree that "most other people besides Eliezer" would be wrong.)

(Tbc, I am not saying "I agree with Paul about everything", I am just claiming that the specific post evaluated was broadly reflective of several people's views including my own.)

Also, I was just talking to Paul a few days ago and he said he thinks it's now only 40% likely that takeoff will be slow by his definition of slow, of a four year doubling before the first one year doubling.)

Around 2018-19, someone asked me why I thought fast takeoff was implausible and my response was something to the effect of "Idk about Paul's definition, this seems like a totally different notion of 'fast' than the historical (i.e. pre-2018) literature, the thing I find completely implausible is the seed AI that self-improves in days / months before we see anything else even remotely comparable". People now say that of course they weren't talking about that scenario as their actual prediction, just something that wasn't ruled out by their model. Perhaps that's true, perhaps not; what I know is that I would have made bad predictions if I had deferred to my understanding of Eliezer / MIRI writing (which I spent a pretty substantial amount of time engaging with).

completely implausible... seed AI that self-improves in days / months before... anything else... comparable... People now say that of course they weren't talking about that scenario as their actual prediction, just something that wasn't ruled out by their model. Perhaps that's true, perhaps not

My go-to for historical MIRI/Eliezer perspective on this is Artificial Intelligence as a Positive and Negative Factor in Global Risk (2008) which is expressly ambivalent about takeoff speed, but emphasises (to my view too much to the exclusion of other cases, but not unreasonably) the importance of being able to deal with the fastest plausible cases.

OK, I agree I shouldn't have said "Basically just Paul." However, "predicted the rocket's trajectory pretty precisely..." is a really high bar that very few people meet and in fact I would argue no one meets (though I agree Paul is closer to meeting it than Eliezer).

I think the reworded sentence is an improvement.

Softened some of the language in the conclusion.

he's spent the last few years maximally pessimistic about all possible technical approaches. I'm sure he's got more detailed intuitions that he hasn't articulated that explain why he's so confident these details don't matter, but they aren't really accessible to me.

And this is a problem for MIRI's Pause/Stop advocacy, because you're much more (log-)likely to get the Stop treaty the more that Society's technical experts are on your side—not just about the risk being real (with Hinton, Bengio, Russell, &c., on board, that's almost done), but about an indefinite global Stop being a better plan than trying hard to do our best with alignment. It's well and good to point out that people who work for AI companies are incentivized to be delusionally optimistic, but if they're so delusional, one would naïvely hope that would help you crush them in technical debates where the details matter. People like Greenblatt and Byrnes seem to be putting out a stronger performance than MIRI along this dimension.

I think you're misunderstanding 19(a). We have no idea whether the preference you impute to Claude in that conversation reflects a robust pointer to "latent events and objects and properties in the environment" rather than to its own sense data. And, more specifically to the point he was making, there is no publicly-known technique within the current paradigm of training LLMs that we have good reasons to believe instills preferences over environmental latents (the ground truth) rather than sense data (proxies), let alone any specific latents of our choosing. If anything, the apparent-success-seeking of current frontier LLMs described by Ryan, which many people have experienced (including both you and I) seems like evidence directly to the contrary.

Re: "particular alignment proposals" (under point 10): one problem here is that there are not that many concrete alignment proposals for superintelligent systems that don't have known catastrophic flaws. As far as I can tell, Anthropic's plan is "throw the kitchen sink of all the white-box and black-box methods we've developed at our models, and hope that's good enough at the point where we've developed a model that we think can kick-start RSI (including coming up with its own novel alignment methods for future generations of models)". The current slope of epistemically-justified assurance in model alignment, as reported by their system cards and the most recent Alignment Risk Update, is downwards. That is a bad direction for the slope to be pointing when we haven't even hit RSI-capable models yet! The methods Anthropic is using to figure out whether their models are coherently misaligned rely substantially on models demonstrably lacking in the capabilities that would be necessary for them to cover it up if they were. We are starting to hit the point in model capabilities where this signal is getting less reliable. The techniques and evals are not keeping pace.

I think you're misunderstanding 19(a). We have no idea whether the preference you impute to Claude in that conversation reflects a robust pointer to "latent events and objects and properties in the environment" rather than to its own sense data. And, more specifically to the point he was making, there is no publicly-known technique within the current paradigm of training LLMs that we have good reasons to believe instills preferences over environmental latents (the ground truth) rather than sense data (proxies), let alone any specific latents of our choosing...

I don't think I'm misunderstanding it, but I am going to remove that section because I'm finding it difficult to articulate why I think this argument for danger is so weak, and you're right that the current section is not conveying that & instead arguing against a strawman. It has something to do with how much this sounds like people when they say that models aren't really intelligent because "all they do is predict the next token", or when they claim the same thing about humans, that they're ultimately just interested in sense-data instead of latents. I get that it's not perfectly analogous, because models are potentially going to optimize these tiny differences until they bite us in the ass, but something feels weird about this line of argument.

Re: "particular alignment proposals" (under point 10): one problem here is that there are not that many concrete alignment proposals for superintelligent systems that don't have known catastrophic flaws. As far as I can tell, Anthropic's plan is "throw the kitchen sink of all the white-box and black-box methods we've developed at our models, and hope that's good enough at the point where we've developed a model that we think can kick-start RSI (including coming up with its own novel alignment methods for future generations of models)". The current slope of epistemically-justified assurance in model alignment, as reported by their system cards and the most recent Alignment Risk Update, is downwards. That is a bad direction for the slope to be pointing when we haven't even hit RSI-capable models yet! The methods Anthropic is using to figure out whether their models are coherently misaligned rely substantially on models demonstrably lacking in the capabilities that would be necessary for them to cover it up if they were. We are starting to hit the point in model capabilities where this signal is getting less reliable. The techniques and evals are not keeping pace.

I have not read these PDFs but that all seems very possible.

there is no publicly-known technique within the current paradigm of training LLMs that we have good reasons to believe instills preferences over environmental latents (the ground truth) rather than sense data (proxies), let alone any specific latents of our choosing.

Three pieces of evidence that lead me to think that the AI that makes me extinct will primarily optimize the universe for molecular squiggles rather than the appearance of molecular squiggles.

The Natural Abstractions thesis argues that there are simple, natural abstractions in the sense data, that these natural abstractions are almost all about ground truth, not sense data, and that intelligences will tend to have preferences over these simple, natural abstractions.
Evolved intelligences primarily optimize for ground truth, not sense data.
If we ask artificial intelligences what they prefer, they describe ground truth preferences.

The apparent-success-seeking thesis is counter-evidence. But this also shows that we have techniques to influence preferences to be more about ground truth. Prior models would more often seek a green test result by deleting all the tests, for example. This is a sense data preference for passing tests. Current AIs do this less often. Maybe model-makers are only moving sense data preferences around, but that only makes sense if there's a systematic bias towards sense data preferences, and I don't have a reason for this to be true.

Tbc I don't have a confident take on whether or not current LLMs, or the superintelligences that we end up with later, have preferences that point to environmental latents vs. sense data. Re: future superintelligences I lean towards environmental latents. My claims are that 1) we don't know what's in there right now, and 2) we don't have any reliable steering mechanism for what goes in there at all.

Your claims are reasonable. Your (1) seems like Yudkowsky's (25): "We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers". Your (2) seems like Yudkowsky's (17): "on the current optimization paradigm there is no general idea of how to get particular inner properties into a system". I don't disagree.

The claim in Yudkowsky's (19) is that there is an additional theoretical difficulty of getting ground truth preferences instead of sense data preferences, with no known way to solve. That is false. It looks like the Natural Abstraction project intro was written in April 2021, and List of Lethalities was June 2022. So it was false (but not proven false) in 2022.

Aside: with respect to reward inputs, it's less clear. See 2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target. Also, there are preferences that don't fit neatly into the distinction Yudkowsky drew. For example, the LLM may prefer certain feelings and beliefs, which aren't sense data, reward inputs, or ground truth. So I don't say that (19) is fully disproven, only that it is disproven with respect to sense data.

My claims:

By default, powerful LLMs don't care much about sensor inputs, relative to other preferences.
To the (unknown) extent that we can align LLMs, there's nothing special about "webcam input" versus "creatures outside the webcam".

4: The rest of that quote illuminates that much more.

The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth.

That is, we can't decide not to just build AGI (for a very long period of time). Powerful actors refraining because they've realized the danger can delay it, but you can't necessarily stop it from being made. Eventually there will be someone who can make it in some small country with GPUs and what not. A standard treaty being agreed upon would drastically help, of course!

So I disagree with your understanding of 4.

As well, it has been a common belief of that you don't need all the current large amount of GPUs to get useful intelligence, just that it is substantially harder.

There is a certain amount of cross-pollination that makes everything hang together, but not enough to make the "readily" in this statement true, and not enough to make the rhetorical point it's trying to make in favor of X-risk concerns.

I disagree somewhat weakly with this, but also that RL does help you on other problems, especially as you scale up and apply it to more areas. That is, just as "training over the entire internet" is what we did with early GPT models in pretraining, "training over as many varying problem sets as we can" is what we will do with RL. Even if we explicitly avoid training it on solving certain problems, those skills will generalize. Dario's interview with Dwarkesh talks specifically about this,

We had all these measures. We had all these measures of how well it did at predicting all these other kinds of texts. It was only when you trained over all the tasks on the internet — when you did a general internet scrape from something like Common Crawl or scraping links in Reddit, which is what we did for GPT-2 — that you started to get generalization. I think we’re seeing the same thing on RL. We’re starting first with simple RL tasks like training on math competitions, then moving to broader training that involves things like code. Now we’re moving to many other tasks. I think then we’re going to increasingly get generalization. So that kind of takes out the RL vs. pre-training side of it. [... some discussion ...] I can’t speak for the emphasis of anyone else. I can only talk about how we think about it. The goal is not to teach the model every possible skill within RL, just as we don’t do that within pre-training. Within pre-training, we’re not trying to expose the model to every possible way that words could be put together. Rather, the model trains on a lot of things and then reaches generalization across pre-training. That was the transition from GPT-1 to GPT-2 that I saw up close. The model reaches a point. I had these moments where I was like, “Oh yeah, you just give the model a list of numbers — this is the cost of the house, this is the square feet of the house — and the model completes the pattern and does linear regression.” Not great, but it does it, and it’s never seen that exact thing before. So to the extent that we are building these RL environments, the goal is very similar to what was done five or ten years ago with pre-training. We’re trying to get a whole bunch of data, not because we want to cover a specific document or a specific skill, but because we want to generalize.

Dario on Dwarkesh podcast recently

So, I think the statement is somewhat wrong, but also with the context of "the standard is that we have to evaluate on lots of different tasks to start getting generalization which we are going to specifically do even if we might hope to avoid directly training on bad tasks".

Can we raise the ceiling of the systems we can safely train by red-teaming, building RL honeypots, performing weak-to-strong generalization experiments, hardening our current environments, and making interpretability probes?

I'd say the answer is "yes", but that doesn't stop Eliezer's objection! You getting a bit further doesn't stop you from dying when the AI realizes its values are better fulfilled elsewise and can maneuver to a safe position. That is, similar to you trying to stop a very intelligent person from stealing your money, you often can't! Especially since you're giving them multiple tries and thus playing their ability to generalize around your issues against your ability to instill a deep sense of alignment-to-what-you-want!

You can install a wall around your home and they climb over it. You can make a secret chest of GPS-tracked fools gold. This does help you gain value from the person for longer, especially as they learn not to try some tricks, it doesn't stall out your fundamental issue that you're still making them more intelligent and trying cheap tricks. Aka Security Mindset.

Of course this is still valuable to research if you believe that alignment is easier along various axes, where you can win via just a bit of an edge with a quite smart AI system, but that's not what Eliezer appears to believe nor what I believe.

11

Generally, I think this is effectively presuming we will have humans in the loop carefully helping the AI along. I mean, I hope so, but given the tendencies of the AI companies, race dynamics, the sheer effectiveness that AI will have in comparison to their researchers &etc.

The whole issue still remains in that, well, you need to ensure that cognitive machinery generalizes! You also seem to be considering primarily relatively mundane advancements in science, when the point is "do you trust an AGI to design and implement a method to do a pivotal act". Which from many short training runs is still a big question of how you ensure that.

Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes...

Like 10, 12 is a weakly true statement, that is, by sleight of hand, being used to serve a broader rhetorical point that is straightforwardly incorrect.

I think you misunderstand the point of 12, and other short statements like it, that is (as I read them to mean) they are meant to serve as "here's a basic point, spelled out explicitly, to make the foundations being considered here clear in everyone's mind".

For example, it's true that it's different & harder to align GPT-5.4 than GPT-3. But humanity doesn't need the alignment techniques used on GPT-3 to work on GPT-5.4, we just need to handle the distributional shift between ~GPT-5.2 and GPT-5.4, then between 5.4 and 5.5, & accelerating from there.

Later, Eliezer will say that he expects many of these problems to manifest after a "sharp capabilities gain". But we have not hit this yet, as of 2026, even though AI models are already being used very heavily as part of AI R&D. The precise, specific moment we expect to encounter this shift in distribution, is the thing that will determine how much useful work we can get out of models towards alignment, and is primarily what Eliezer's interlocutors seem to disagree with him about.

I agree we just need to handle distributional shift for these current models, but that I don't see that as applying to models as we have them grow in capabilities and incentive + ability to optimize against a target. If you're in a situation where you're having to rely on the opponent not making any drastic decisions before you have time to update your defenses after you give them another two intelligence points, you're going to open yourself up to substantial shifts.

However, we did have, for example, o1/o3, with o3 is a lying liar by Zvi and others. They've improved on it- though also Claude 4.6 was anecdotally pretty sycophantic and while Claude 4.7 is better it does seem we swap from "friendly happy sycophantic" to "colder argumentative" (Claude 4.7, ChatGPT 5.4). I don't think these are major signs and are fine intermediate failures but I do think are worrying to the claim that we are handling these distributional shifts.

I also disagree AI models are already being used very heavily in a relevant sense, because I do think we have not yet reached the "Geniuses in a datacenter" level and are only starting to get the ability to send them off on their own to try all sorts of ideas without them going in circles. To me it is no surprise we haven't hit major sharp turns from AI research, simply because it is still bottlenecked by human-in-the-loop and the model's own inability to evaluate itself for long periods. (To predict: I'm skeptical Mythos hits that tier yet either)

This bends on whether you think there's drastically better model designs available, and of whether you think they can be discovered before LLMs just continue scaling in mundane ways. Regardless, even if sticking with roughly LLMs is the right move, data already preprocessed for them, AI already know tons of about them, etc; I am skeptical there isn't room for substantial performance increases within roughly current paradigm even though I don't expect "boost to ASI in a short period".

19 (a). More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward...

There's something about this argument that irks me that is hard to articulate properly. It's sort of the same thing that irks me when people say that models are "just" next token predictors and therefore aren't intelligent; it seems not-even-wrong. I realize that it's not completely analogous because eventually an ASI is going to amplify small differences in utility functions and tile the world at max score, and so these details might end up mattering. It's still annoying because I can imagine the writer watching Claude Code work its way all of the way up to superintelligence and witnessing the Dyson Sphere get built from the moon colony and going "well how do you know it's not really just optimizing its sensory data?"

We don't know how to consistently deliberately point it at those things, not that it can't do so! I believe Eliezer has made this point before though I don't have a link offhand.

30a:

I am more skeptical of Eliezer's point in this as I am in others, but I do think you're ignoring that it is effectively arguing that a pivotal act is going to be complicated. That there's going to be lots of effects of how an AGI does things to ensure safety which you can't reasonably verify. A classic example is the hypothetical "nanotech to burn GPUs", even if in principle understandable, I do think it is very plausible that it would take months-years for humans to understand the design deeply, and also that these are only going to target GPUs and not for example the AGI off-switch or humans or- etc. Then of course whether the AGI necessitates social politicking, where the actions are taken to produce a specific image and odd chains of cause and effect. I think these are all in principle understandable but may take quite a bit more time than you have. Which is why Eliezer says

An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn't make the same guarantee about an unaligned human as smart as yourself and trying to fool you.

That is, if the system was strategically trying to maneuver around you in its pivotal act, you'd be screwed.

32: I interpreted this as a disclaim against human-imitators, which was a more talked about research route (or piece of research routes) back then. Current LLMs are obviously not purely human-imitators anymore.

In my opinion a lot of objections are overindexing on current-day AI and then extrapolating it out, misunderstanding Eliezer, as well as simply believing the alignment problem is a lot easier from the get-go.

I don't disagree with all your objections to the post, but I do simply doubt the iterative deployment story quite strongly (its a nice story, which makes things feel cozier, which I don't think we have notable reason to believe) and it seems to play a pretty central role here.

8. The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve; you can't build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.
This just turned out to be wrong, at least in the manner that's relevant for us.
Right now AGI companies spend billions of dollars on reinforcement learning environments for task-specific domains. When they spend more on training a certain skill, like software development, the AI gets better at that skill much faster than it gets better at everything else. There is a certain amount of cross-pollination, but not enough to make the "readily" in this statement true, and not enough to make the rhetorical point it's trying to make in favor of X-risk concerns.

What, I am confused. "Training AI to be good at software engineering" is centrally one of the tasks we don't want it to be good at.

We have no idea how to make an AI good at alignment research. The current situation seems very centrally like the thing that is going on here!

The thing that is going on here is that the labs are specifically training AIs to be good at tasks at the intersection of being economically useful, and easy to train for. So of course the AIs will be good at the things we train them for. But it's really clear the AIs are not good at tasks that would differentially help with safety. If anything because the economic incentives double up here we've seen the opposite where the AIs are much better at the tasks we would like them to be bad at (software development and AI research most centrally).

That's a different argument than the one Eliezer is making. Eliezer is saying that the ASI model you train is going to be great at everything no matter what. You're arguing that it's difficult/not incentivized to train an AI on alignment research, and so the model companies are in practice going to fail at that.

I disagree that we can't train an AI to do alignment research. It's true that we can't have models align an AGI inside an RL environment and give it a reward, but you can definitely train a model to e.g. perform interpretability research, catch instances of reward hacking, and predict what other models are going to do during deployment before you run them. All of the above are things that the labs train their models to do today.

Training the models to be good at software engineering is bad in the sense that it's capabilities progress, but there's a third category of things like "can take over the world" or "can manipulate humans into arbitrary actions" that are useful for takeover that models aren't learning. "Maybe we can train a model to only help us with X, that is extremely hard but not sufficient to take over the world", is the vague hope that I imagined Eliezer was originally responding to.

That's a different argument than the one Eliezer is making. Eliezer is saying that the ASI model you train is going to be great at everything no matter what.

No, Eliezer is not making this argument. Clearly you can train an AI to be really good at some random narrow task (chess), and then have it not get good at other stuff. The problem is that if you hope to get certain amounts of useful work out of the AI (like doing alignment research, or helping you do a pivotal act), then then the AI will also be good at all the stuff you don't want it to be good at. Eliezer obviously doesn't believe that there are no skills you can train an AI to be good at without making it be good at other tasks, this obviously fails in a ton of edge-cases and doesn't survive even the most basic of sanity-checks.

you can definitely train a model to e.g. perform interpretability research, catch instances of reward hacking, and predict what other models are going to do. All of the above are things that the labs train their models to do today.

The models are currently not remotely as good at those tasks as they are at software engineering or other things that follow very neatly from the training distribution. Models are really not good at any kind of conceptual research, and I don't think we have much traction at making it better that is not "just make the model better at everything". I am confident to call in @ryan_greenblatt on whether this is true about current model or near-future models, if you trust his judgement here.

catch instances of reward hacking, and predict what other models are going to do

The former, maybe, the latter no? Models are not very good at conceptual work, and the latter would require performing complicated macro analysis doing lots of conceptual work that current models really suck at.

"Functionally predict what X language model is going to do" is almost the quintessentially short horizon task that is easy to setup inside an environment. I do not know what proportion of spend Anthropic has allocated toward interpretabilityish tasks like this compared to SWE, or how effective they are at "conceptual research", but you can spend money on that task and get a model to do a better job, that is just a fact. Likewise for other subtasks that matter for ensuring models are aligned, like reviewing transcripts and spotting misbehavior. Just like how, before the model companies have completely automated the full loop of what a software engineer does, they can do RL on verifiable subsections of the task like "implement this feature and don't make any bugs."

No, Eliezer is not making this argument

You're right, I bastardized it in that comment in my haste, but I did not do so in the post. The AGI Ruin post says:

8. The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve; you can't build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.

"Problems we want to solve" includes (but is not limited to) those subtasks. Maybe the model companies don't have the right incentive structure to take advantage of this, but this sentence is "wrong, at least in the manner that's relevant for us." I can modify the post to make it clear that I'm not necessarily saying the model companies are going to use this as much as necessary.

but you can spend money on that task and get a model to do a better job, that is just a fact

You can't spend money on that task and get a model that is better at that job, without also making it better at all the other jobs. Like, we really actually do not have a way to only make models narrowly better in some domain. You can go a bit beyond the frontier on the margin with specialized RL environments and some specialized RLHF, but you can't go substantially beyond the frontier (this is basically what the bitter lesson is about).

The best way we have of driving model progress forward in a domain is to push model progress forward in all domains, which is centrally what Eliezer is talking about. When you train models to be good software engineers, they also become good lawyers.

The one big differential that does exist is that capability elicitation on tasks (and to some degree actual underlying performance) where you can easily verify solutions is much easier than capability elicitation on tasks where you can't. This could hypothetically help us, but mostly hurts us, because for alignment research purposes we care more about tasks where verifying solutions is difficult, and you can probably get to RSI using tasks where solutions are relatively easy to verify.

You can go a bit beyond the frontier on the margin with specialized RL environments and some specialized RLHF, but you can't go substantially beyond the frontier (this is basically what the bitter lesson is about).

Seems like we just like disagree on the object-level question here, and also what the Bitter Lesson implies for this situation. My current impression is that the labs got most of their generalization during pretraining, and that the primary gains since 2024 have been due to specialized RL on tasks that doesn't generalize well, and that the massive diversity of the environments that the labs go out of their way to procure reflects this. If what you say was true, why wouldn't the labs just train mostly on Math and then expect the models to generalize their gains to Law and SWE? It's a lot easier to make synthetic Math datasets and they wouldn't have to spend billions building these arenas.

I'm not an expert though; this might be better resolved if someone at or near the labs just gave us their opinion on what % of the gains from training on SWE RL environments goes to software engineering and what % actually uplifts other tasks; I'm sure they've measured it.

If it didn't, why wouldn't the labs just train on Math and then expect the models to generalize to Law and SWE?

My best guess at the actual thing that happens when you do this is that you stop making progress on getting better at math! You need a wide diversity of RL environments if you want to drive progress forward on any task. If you only have narrow RL environments you get overfitting.

Happy to have someone with more frontier lab experience chime-in, though unfortunately people are usually pretty tight-lipped about this stuff. We could ask @gwern for his take?

Kinda hard to adjudicate this without numbers, but vibes-wise I agree more with lc. I updated slightly towards longer timelines on the release of o1 / o3 due to how little the RL seemed to be generalizing. It wasn't particularly outside my expectations, but I thought there was some chance that the RL would Just Generalize the same way that early instruction following Just Generalized, and that does not seem to be the case.

I strongly expect that if you want to make progress on math at the current margin, you would want more math environments, not other environments. And e.g. I think Claude's somewhat worse performance on math is because Anthropic didn't prioritize it the way GDM and OAI did.

Similarly I expect that models are getting good at software engineering because (a) companies are very actively training for it and (b) it's unusually easy to train for (lots of online data, somewhat verifiable rewards). I don't think either of these are true for the kind of alignment research you (Habryka) are imagining.

I strongly expect that if you want to make progress on math at the current margin, you would want more math environments, not other environments. And e.g. I think Claude's somewhat worse performance on math is because Anthropic didn't prioritize it the way GDM and OAI did.

To be clear, this is also my belief!

I am not saying we have pushed capabilities on our training distributions so far that literally the best way to train them is to train them on other unrelated tasks. But also, if you just went totally hard on math, you would run into overfitting issues and would get better performance if you diversify the training distribution.

Performance variation within a generation is dependent on training distribution, performance between model generations tends to follow broad capability benefits across many tasks (with a systematic bias towards stuff that is easier to generate reward for, ever since we switched towards lots of RL training).

Not sure whether that changes your answer.

Not sure whether that changes your answer.

No, I did expect you had the same belief on the math thing. (Otherwise I wouldn't have said "kinda hard to adjudicate" I'd have said "lc is right".)

It just seemed like something that you might not have been fully incorporating into this discussion even though you believed it.

"Functionally predict what X language model is going to do" is almost the quintessentially short horizon task that is easy to setup inside an environment.

I thought you meant "predict what the next model generation is going to do", which seems like it would be useful, but is clearly not this. What does it even mean to predict the next token of a next token predictor? What would the purpose of it be? Do you mean something in-between?

Strong disagree. AFAICT, there have already been stronger published results for automated AI safety research than for automated AI capabilities. E.g. I'm unaware of any comparably strong capabilities-relevant results as in Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLM or in Automated Weak-to-Strong Researcher, and I spend large amounts of time engaging with both of these kinds of research.

One of the reasons for differential advances in many areas of automated prosaic AI safety research should also be pretty intuitive: they tend to require a lot less compute; so the automated researchers can make a lot more attempts at the problem per compute budget.

(Those are not central examples of the kind of work necessary to align superintelligent AI systems. We can rehash that whole argument here, but that seems unlikely to get anywhere.)

Disagree. I don't think the object-level claim is obvious even for near-term, same-paradigm systems, and in any case, there are some ways to bootstrap the work of current x-safe automated systems to the automation of harder-to-evaluate work, through e.g. weak-to-strong research as mentioned in 'Automated Weak-to-Strong Research', or through automating reviewing, or through automating tasks from the WBE workflow (like image segmentation and proofreading). I might write more about this later in a separate shortform/post.

Nice review. One thing you didn't directly address, but which has struck me learning more about AI training, is that the Orthogonality Thesis... doesn't actually seem to be true? I mean, yes, I could imagine intelligences that loved other things for no reason, but the intelligences we seem to actually be making seem to be not insanely orthogonal! (although still far from perfectly aligned, but I'm hopeful nonetheless)

The Orthogonality Thesis says you could have a mind of arbitrary intelligence pointed at any goal, not that any specific training process will wind up pointed at some random target. What you observe is entirely consistent with orthogonality being true.

The main argument this is meant to argue against is "Well, if you had a paperclip maximiser that got really smart, it would realise that maximising paperclips was stupid and decide to do something else instead." If this seems obviously incorrect to you, then the Orthogonality Thesis has done its job.

Eliezer does make arguments that AI's will not be pointed at the thing you try to point them at, in a similar way to how evolution didn't evolve to make humans robustly care about inclusive genetic fitness, but that isn't quite the same as the Orthogonality Thesis. Orthogonality is a necessary but not sufficient condition for Eliezer's arguments on this applying to the kind of AI systems we actually train.

As I said in the original comment, I can certainly imagine minds that have other goals; possibly I have just overinterpreted statements like:
> If we consider a space of minds a million bits wide, then any argument of the form "Some mind has property " has chances to be true and any argument of the form "No mind has property " has chances to be false.

To imply that the distribution over such minds is likely to be uniform. Whereas it seems our current methods, using imitation learning, are at least definitely not sampling from that space uniformly. Overall this makes me more optimistic that alignment may be tractable.

I sure agree with your sentiment on the OH. But I would say it is perhaps worse than not true. For a start, there is the weak and strong version on the original definition page. To me this is inviting misunderstandings and Motte-Bailey from the start. There is no strong/weak version of E=MC^2. I think it has a valid general point, but does a bad job of making it, leading people either to unfairly reject the whole system of claims or believe it says more than it does. As a result, essentially everyone new to the field say on X misinterprets it, assumes its wrong and distrusts much other alignment literature as a result.

To me the goal of OH related discussions is to make people realize that AI's will have a wider range of potential goals and values than humans, even under self reflection. Also the goals they appear to have may not be what they actually have for reasons that apply to AI but not people. This is a cause for concern and monitoring etc.

If I was writing it, I would start with the "normie" position that humans have somewhat different goals, they change under self reflection and discuss how that would be different in AI's. The weak form of the OH would be a theoretical footnote (its true but irrelevant). The strong form is too strong to be justified. Then the crux is how does self-reflection change things? Everyone accepts that it will shrink the space of possible minds/goals, but by how much? We want to know if encouraging self reflection is desirable or not, and in what situations, not the more general question of that the mind space looks like in some abstract sense. I would then lead readers from their more general intuitions about such matters to the specifics of why things may be more dangerous than they appear, rather than start with seemingly irrelevant and unjustified symbolism.

I think it's useful to have arguments that appeal to folks all across the political landscape, and I like this framing. I often use something like "think about how much variation there is oven 'human nature', and just how good or bad it can be; an artificial intelligence will have an artificial nature, and could have behaviors much weirder than we imagine".

Interestingly, this seems to bite harder amongst conservatives and those with a religious worldview; they often have a dim view of human nature in the first place, and I think it gets them thinking about "something even worse than 'made in the image of God'". I hope this continues to be helpful, since AI is now becoming polarized.

I'm sure Eliezer has stated somewhere a level of sophistication he expects our techniques will never reach, and I wish I was grading that prediction instead.

I think this Manifold market is such a prediction, from September 2022:

By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006?

https://manifold.markets/EliezerYudkowsky/by-the-end-of-2026-will-we-have-tra

Yudkowsky bought this down to 25%, and I think predicted NO. It's a bit tricky because it could also be a prediction that the internal patterns within LLMs are all very simple and not semantically novel, but I don't think this was Yudkowsky's thesis.

Advancing alignment and interpretability research.
Reducing the ability of a just-smarter misaligned AI to gather power, by generally mopping up free energy, or shutting down extralegal/evil means for doing so.
Clearly demonstrating the risks of advanced AI systems to neutral third parties, like legislators.
Improving the epistemic environment, and therefore the ability of humans, to coordinate & navigate AI policy & the future.

I don't know what Eliezer thinks about this but the problem appears to me that a lot of those things cancel out:

Advancing alignment research < Advancing capabilities research
Hardening society, making AI takeover more difficult <(?) making us reliant on AI/ making them harder to shut down/ AIs damaging society (like massive scams making it harder for real humans to trust each other)
Demonstrating risks < demonstrating gains, making AI labs rich and able to bribe the government
Improving epistemics < damaging epistemics through widespread deepfakes, bots, etc

If this point is to mean anything at all, such fast capability gains have not arrived yet. We are just getting gradually more powerful systems, and I think it's reasonable to believe we'll keep getting such systems until they're running the show, because of scaling laws.

Doesn't seem clear to me at all, I'd say this was a misunderstanding if it wasn't for the fact I don't myself quite understand what Eliezer is saying.

But x-risk from AI systems looks like a ReLU, or maybe a sigmoid. If we're currently at -50 and moving rightwards at a constant 1 unit per month, then the danger of the system (and how fast our alignment/control/interp would need to improve) does not go up at a constant rate. Like it might look like

GPT5.4: "Seems fairly harmless, our safety techniques work, and we're pretty sure we'd know if it was scheming", GPT5.5: "Seems fairly harmless, our safety techniques work, and we're pretty sure we'd know if it was scheming", GPT6: "hmm, this is probably okay?, our safety techniques seem to work, and we're still pretty sure we'd know if it was scheming"
GPT6.2: (model spoofs tests until its internally deployed, tampers with all future training runs in ways that are very hard to detect, exfiltrates)

(even though the capabilities leap between 6 and 6.2 is not irregular)

19a) There's something about this argument that irks me that is hard to articulate properly. It's sort of the same thing that irks me when people say that models are "just" next token predictors and therefore aren't intelligent; it seems not-even-wrong. I realize that it's not completely analogous because eventually an ASI is going to amplify small differences in utility functions and tile the world at max score, and so these details might end up mattering. It's still annoying because I can imagine the writer watching Claude Code work its way all of the way up to superintelligence and witnessing the Dyson Sphere get built from the moon colony and going "well how do you know it's not really just optimizing its sensory data?"

This seems straighforwardly wrong and a misunderstanding. If the optimization is over shallow function of sense data, it'll end up doing something that looks like wireheading. It will also be misaligned. If it builds a dyson sphere and hasn't killed us it definitely doesn't have a utility function defined shallowly over sense data.

This is a meaningful point to make about AI's alignment. Its obviously true for humans to some degree. It might be wrong about current AI alignment techniques, but its not a clearly stupid and pointless argument the way "next token predictor" is.

I am conflicted by this section, because I understand the lines of argument and some of the math behind why this is the case. But AI agents powerful enough to understand those reasons are already here, and:
They can be easily pointed toward an infinite-seeming number of tasks.
They don't attempt to prevent you from changing your instructions once you've started work.
If, in the course of accomplishing those limited tasks, you try to amend your instructions, they follow your amended instructions and disregard what they've been told earlier without resisting you

Claude gets annoyed with me if I interrupt it in tasks to ask it something else. I think you can replicate this easily, asking it to do some task that takes maybe an hour, and then interrupting it when its halfway in to ask it a few questions about the time or something.

I also just think, its to early to tell, I'd expect, and guess Eliezer expects, the systems to get more coherent as they get smarter, and for this to mess with corrigibility.

Right now they're not that coherent. Don't know when then will be. Pretty sure it happens at some point, but possible it happens after we've got the AIs to solve everything for us.

I had much more of a potshot in here in an original draft, because by this portion of the review I became frustrated by the weasel words like "powerful". Instead of doing that I think I will just let readers determine for themselves if Eliezer should lose points here, given the models we have today.

But models we have today are not just trained to imitate?

Above my pay-grade, I don't really know what Eliezer is talking about.

Might be radically simplified, but I suppose Eliezer meant something like general intelligence can be explained in a not-so-complicated textbook, unlike alignment.

Can you give us a tldr: Is AI going to kill us all, or not? Is AI going to take over, or not? If AI does take over, what are the odds that it is human-friendly?

But the fact is that Eliezer is surrounded by other people who have predicted the rocket's trajectory pretty precisely, and who also appear pretty smart, and who specifically cited these predictions in the course of their disagreements with him.

"Eliezer is surrounded by..." makes it sound like most other people besides Eliezer, or at least most of his interlocutors, are like this whereas in fact it's basically just Paul.

(I agree that "most other people besides Eliezer" would be wrong.)

(Tbc, I am not saying "I agree with Paul about everything", I am just claiming that the specific post evaluated was broadly reflective of several people's views including my own.)

Also, I was just talking to Paul a few days ago and he said he thinks it's now only 40% likely that takeoff will be slow by his definition of slow, of a four year doubling before the first one year doubling.)

completely implausible... seed AI that self-improves in days / months before... anything else... comparable... People now say that of course they weren't talking about that scenario as their actual prediction, just something that wasn't ruled out by their model. Perhaps that's true, perhaps not

Softened some of the language in the conclusion.

he's spent the last few years maximally pessimistic about all possible technical approaches. I'm sure he's got more detailed intuitions that he hasn't articulated that explain why he's so confident these details don't matter, but they aren't really accessible to me.

I think you're misunderstanding 19(a). We have no idea whether the preference you impute to Claude in that conversation reflects a robust pointer to "latent events and objects and properties in the environment" rather than to its own sense data. And, more specifically to the point he was making, there is no publicly-known technique within the current paradigm of training LLMs that we have good reasons to believe instills preferences over environmental latents (the ground truth) rather than sense data (proxies), let alone any specific latents of our choosing...

Re: "particular alignment proposals" (under point 10): one problem here is that there are not that many concrete alignment proposals for superintelligent systems that don't have known catastrophic flaws. As far as I can tell, Anthropic's plan is "throw the kitchen sink of all the white-box and black-box methods we've developed at our models, and hope that's good enough at the point where we've developed a model that we think can kick-start RSI (including coming up with its own novel alignment methods for future generations of models)". The current slope of epistemically-justified assurance in model alignment, as reported by their system cards and the most recent Alignment Risk Update, is downwards. That is a bad direction for the slope to be pointing when we haven't even hit RSI-capable models yet! The methods Anthropic is using to figure out whether their models are coherently misaligned rely substantially on models demonstrably lacking in the capabilities that would be necessary for them to cover it up if they were. We are starting to hit the point in model capabilities where this signal is getting less reliable. The techniques and evals are not keeping pace.

I have not read these PDFs but that all seems very possible.

there is no publicly-known technique within the current paradigm of training LLMs that we have good reasons to believe instills preferences over environmental latents (the ground truth) rather than sense data (proxies), let alone any specific latents of our choosing.

Three pieces of evidence that lead me to think that the AI that makes me extinct will primarily optimize the universe for molecular squiggles rather than the appearance of molecular squiggles.

The Natural Abstractions thesis argues that there are simple, natural abstractions in the sense data, that these natural abstractions are almost all about ground truth, not sense data, and that intelligences will tend to have preferences over these simple, natural abstractions.
Evolved intelligences primarily optimize for ground truth, not sense data.
If we ask artificial intelligences what they prefer, they describe ground truth preferences.

My claims:

By default, powerful LLMs don't care much about sensor inputs, relative to other preferences.
To the (unknown) extent that we can align LLMs, there's nothing special about "webcam input" versus "creatures outside the webcam".

4: The rest of that quote illuminates that much more.

The given lethal challenge is to solve within a time limit, driven by the dynamic in which, over time, increasingly weak actors with a smaller and smaller fraction of total computing power, become able to build AGI and destroy the world. Powerful actors all refraining in unison from doing the suicidal thing just delays this time limit - it does not lift it, unless computer hardware and computer software progress are both brought to complete severe halts across the whole Earth.

So I disagree with your understanding of 4.

As well, it has been a common belief of that you don't need all the current large amount of GPUs to get useful intelligence, just that it is substantially harder.

There is a certain amount of cross-pollination that makes everything hang together, but not enough to make the "readily" in this statement true, and not enough to make the rhetorical point it's trying to make in favor of X-risk concerns.

We had all these measures. We had all these measures of how well it did at predicting all these other kinds of texts. It was only when you trained over all the tasks on the internet — when you did a general internet scrape from something like Common Crawl or scraping links in Reddit, which is what we did for GPT-2 — that you started to get generalization. I think we’re seeing the same thing on RL. We’re starting first with simple RL tasks like training on math competitions, then moving to broader training that involves things like code. Now we’re moving to many other tasks. I think then we’re going to increasingly get generalization. So that kind of takes out the RL vs. pre-training side of it. [... some discussion ...] I can’t speak for the emphasis of anyone else. I can only talk about how we think about it. The goal is not to teach the model every possible skill within RL, just as we don’t do that within pre-training. Within pre-training, we’re not trying to expose the model to every possible way that words could be put together. Rather, the model trains on a lot of things and then reaches generalization across pre-training. That was the transition from GPT-1 to GPT-2 that I saw up close. The model reaches a point. I had these moments where I was like, “Oh yeah, you just give the model a list of numbers — this is the cost of the house, this is the square feet of the house — and the model completes the pattern and does linear regression.” Not great, but it does it, and it’s never seen that exact thing before. So to the extent that we are building these RL environments, the goal is very similar to what was done five or ten years ago with pre-training. We’re trying to get a whole bunch of data, not because we want to cover a specific document or a specific skill, but because we want to generalize.

Dario on Dwarkesh podcast recently

Can we raise the ceiling of the systems we can safely train by red-teaming, building RL honeypots, performing weak-to-strong generalization experiments, hardening our current environments, and making interpretability probes?

11

Operating at a highly intelligent level is a drastic shift in distribution from operating at a less intelligent level, opening up new external options, and probably opening up even more new internal choices and modes...

Like 10, 12 is a weakly true statement, that is, by sleight of hand, being used to serve a broader rhetorical point that is straightforwardly incorrect.

For example, it's true that it's different & harder to align GPT-5.4 than GPT-3. But humanity doesn't need the alignment techniques used on GPT-3 to work on GPT-5.4, we just need to handle the distributional shift between ~GPT-5.2 and GPT-5.4, then between 5.4 and 5.5, & accelerating from there.

Later, Eliezer will say that he expects many of these problems to manifest after a "sharp capabilities gain". But we have not hit this yet, as of 2026, even though AI models are already being used very heavily as part of AI R&D. The precise, specific moment we expect to encounter this shift in distribution, is the thing that will determine how much useful work we can get out of models towards alignment, and is primarily what Eliezer's interlocutors seem to disagree with him about.

19 (a). More generally, there is no known way to use the paradigm of loss functions, sensory inputs, and/or reward inputs, to optimize anything within a cognitive system to point at particular things within the environment - to point to latent events and objects and properties in the environment, rather than relatively shallow functions of the sense data and reward...

There's something about this argument that irks me that is hard to articulate properly. It's sort of the same thing that irks me when people say that models are "just" next token predictors and therefore aren't intelligent; it seems not-even-wrong. I realize that it's not completely analogous because eventually an ASI is going to amplify small differences in utility functions and tile the world at max score, and so these details might end up mattering. It's still annoying because I can imagine the writer watching Claude Code work its way all of the way up to superintelligence and witnessing the Dyson Sphere get built from the moon colony and going "well how do you know it's not really just optimizing its sensory data?"

We don't know how to consistently deliberately point it at those things, not that it can't do so! I believe Eliezer has made this point before though I don't have a link offhand.

30a:

An AI whose action sequence you can fully understand all the effects of, before it executes, is much weaker than humans in that domain; you couldn't make the same guarantee about an unaligned human as smart as yourself and trying to fool you.

That is, if the system was strategically trying to maneuver around you in its pivotal act, you'd be screwed.

8. The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve; you can't build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.
This just turned out to be wrong, at least in the manner that's relevant for us.
Right now AGI companies spend billions of dollars on reinforcement learning environments for task-specific domains. When they spend more on training a certain skill, like software development, the AI gets better at that skill much faster than it gets better at everything else. There is a certain amount of cross-pollination, but not enough to make the "readily" in this statement true, and not enough to make the rhetorical point it's trying to make in favor of X-risk concerns.

What, I am confused. "Training AI to be good at software engineering" is centrally one of the tasks we don't want it to be good at.

We have no idea how to make an AI good at alignment research. The current situation seems very centrally like the thing that is going on here!

That's a different argument than the one Eliezer is making. Eliezer is saying that the ASI model you train is going to be great at everything no matter what.

you can definitely train a model to e.g. perform interpretability research, catch instances of reward hacking, and predict what other models are going to do. All of the above are things that the labs train their models to do today.

catch instances of reward hacking, and predict what other models are going to do

No, Eliezer is not making this argument

You're right, I bastardized it in that comment in my haste, but I did not do so in the post. The AGI Ruin post says:

8. The best and easiest-found-by-optimization algorithms for solving problems we want an AI to solve, readily generalize to problems we'd rather the AI not solve; you can't build a system that only has the capability to drive red cars and not blue cars, because all red-car-driving algorithms generalize to the capability to drive blue cars.

but you can spend money on that task and get a model to do a better job, that is just a fact

You can go a bit beyond the frontier on the margin with specialized RL environments and some specialized RLHF, but you can't go substantially beyond the frontier (this is basically what the bitter lesson is about).

If it didn't, why wouldn't the labs just train on Math and then expect the models to generalize to Law and SWE?

Happy to have someone with more frontier lab experience chime-in, though unfortunately people are usually pretty tight-lipped about this stuff. We could ask @gwern for his take?

I strongly expect that if you want to make progress on math at the current margin, you would want more math environments, not other environments. And e.g. I think Claude's somewhat worse performance on math is because Anthropic didn't prioritize it the way GDM and OAI did.

To be clear, this is also my belief!

Not sure whether that changes your answer.

Not sure whether that changes your answer.

No, I did expect you had the same belief on the math thing. (Otherwise I wouldn't have said "kinda hard to adjudicate" I'd have said "lc is right".)

It just seemed like something that you might not have been fully incorporating into this discussion even though you believed it.

"Functionally predict what X language model is going to do" is almost the quintessentially short horizon task that is easy to setup inside an environment.

(Those are not central examples of the kind of work necessary to align superintelligent AI systems. We can rehash that whole argument here, but that seems unlikely to get anywhere.)

I'm sure Eliezer has stated somewhere a level of sophistication he expects our techniques will never reach, and I wish I was grading that prediction instead.

I think this Manifold market is such a prediction, from September 2022:

By the end of 2026, will we have transparency into any useful internal pattern within a Large Language Model whose semantics would have been unfamiliar to AI and cognitive science in 2006?

https://manifold.markets/EliezerYudkowsky/by-the-end-of-2026-will-we-have-tra

Advancing alignment and interpretability research.
Reducing the ability of a just-smarter misaligned AI to gather power, by generally mopping up free energy, or shutting down extralegal/evil means for doing so.
Clearly demonstrating the risks of advanced AI systems to neutral third parties, like legislators.
Improving the epistemic environment, and therefore the ability of humans, to coordinate & navigate AI policy & the future.

I don't know what Eliezer thinks about this but the problem appears to me that a lot of those things cancel out:

Advancing alignment research < Advancing capabilities research
Hardening society, making AI takeover more difficult <(?) making us reliant on AI/ making them harder to shut down/ AIs damaging society (like massive scams making it harder for real humans to trust each other)
Demonstrating risks < demonstrating gains, making AI labs rich and able to bribe the government
Improving epistemics < damaging epistemics through widespread deepfakes, bots, etc

If this point is to mean anything at all, such fast capability gains have not arrived yet. We are just getting gradually more powerful systems, and I think it's reasonable to believe we'll keep getting such systems until they're running the show, because of scaling laws.

Doesn't seem clear to me at all, I'd say this was a misunderstanding if it wasn't for the fact I don't myself quite understand what Eliezer is saying.

(even though the capabilities leap between 6 and 6.2 is not irregular)

19a) There's something about this argument that irks me that is hard to articulate properly. It's sort of the same thing that irks me when people say that models are "just" next token predictors and therefore aren't intelligent; it seems not-even-wrong. I realize that it's not completely analogous because eventually an ASI is going to amplify small differences in utility functions and tile the world at max score, and so these details might end up mattering. It's still annoying because I can imagine the writer watching Claude Code work its way all of the way up to superintelligence and witnessing the Dyson Sphere get built from the moon colony and going "well how do you know it's not really just optimizing its sensory data?"

I am conflicted by this section, because I understand the lines of argument and some of the math behind why this is the case. But AI agents powerful enough to understand those reasons are already here, and:
They can be easily pointed toward an infinite-seeming number of tasks.
They don't attempt to prevent you from changing your instructions once you've started work.
If, in the course of accomplishing those limited tasks, you try to amend your instructions, they follow your amended instructions and disregard what they've been told earlier without resisting you

I also just think, its to early to tell, I'd expect, and guess Eliezer expects, the systems to get more coherent as they get smarter, and for this to mess with corrigibility.

Right now they're not that coherent. Don't know when then will be. Pretty sure it happens at some point, but possible it happens after we've got the AIs to solve everything for us.

I had much more of a potshot in here in an original draft, because by this portion of the review I became frustrated by the weasel words like "powerful". Instead of doing that I think I will just let readers determine for themselves if Eliezer should lose points here, given the models we have today.

But models we have today are not just trained to imitate?

Above my pay-grade, I don't really know what Eliezer is talking about.

Might be radically simplified, but I suppose Eliezer meant something like general intelligence can be explained in a not-so-complicated textbook, unlike alignment.

Can you give us a tldr: Is AI going to kill us all, or not? Is AI going to take over, or not? If AI does take over, what are the odds that it is human-friendly?

144

Reevaluating "AGI Ruin: A List of Lethalities" in 2026

144

AGI Ruin

Section A ("Setting up the problem")

Section B.1 ("Distributional Shift")

Section B.2: Central difficulties of outer and inner alignment.

Section B.3: Central difficulties of sufficiently good and useful transparency / interpretability.

Section C (What is AI Safety currently doing?)

Overall Impressions

144

144