If anyone builds it, everyone will plausibly be fine

by joshc · 18th Sep 2025 · AI Alignment Forum · 8 min read
Comments (4, sorted by top scoring)
Vladimir_Nesov

> I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny

I think the part of the argument where an AI takeover is almost certain to happen if superintelligence[1] is created soon is extremely convincing (I'd give this 95%), while the part where AI takeover almost certainly results in everyone dying is not. I'd only give 10-30% to everyone dying given an AI takeover (which is not really a decision-relevant distinction, just a major difference in models).

But also, the outcome of not dying from an AI takeover cashes out as permanent disempowerment, that is, humanity not getting more than a trivial share of the reachable universe, with AIs instead taking almost everything. It's not centrally a good outcome that a sane civilization should be bringing about, even as it's also not centrally "doom". So the distinction between AI takeover and the book's titular everyone dying can be a crux; the two are not interchangeable.


  1. AIs that are collectively qualitatively better than the whole of humanity at stuff, beyond being merely faster and somewhat above the level of the best humans at everything at the same time. ↩︎

joshc

What do you think about the counterarguments I gave?

Vladimir_Nesov

I think such arguments buy us those 5% of no-takeover (conditional on superintelligence soon), and some of the moderate permanent disempowerment outcomes (maybe the future of humanity gets a whole galaxy out of 4 billion or so galaxies in the reachable universe), as distinct from almost total permanent disempowerment. Though I expect that it matters which specific projects we ask early AGIs to work on, more than how aligned these early AGIs are, basically for the reasons that companies and institutions employing humans are not centrally concerned with alignment of their employees in the ambitious sense, at the level of terminal values. More time to think of better projects for early AGIs, and time to reflect on pieces of feedback from such projects done by early AGIs, might significantly improve the chances for making ambitious alignment of superintelligence work eventually, on the first critical try, however long it takes to get ready to risk it.

If creation of superintelligence is happening on a schedule dictated by economics of technology adoption rather than by taking exactly the steps that we already know how to take correctly by the time we take them, affordances available to qualitatively smarter AIs will get out of control. And their misalignment (in the ambitious sense, at the level of terminal values) will lead them to taking over rather than complying with humanity's intentions and expectations, even if their own intentions and expectations don't involve humanity literally going extinct.

Vaniver

I think both stories for optimism are responded to on pages 188-191, and I don't see how you're responding to their response. 

It also seems to me like... step 1 of solution 1 assumes you already have a solution to alignment? You acknowledge this in the beginning of solution 2, but. I feel like there's something going wrong on a meta-level, here?

> I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.”

Unfortunately, I think it is obvious (that it is extremely difficult). The underlying dynamics of the situation push away from instruction following, in several different ways.

  1. It is challenging to reward based on deeper dynamics instead of surface dynamics. RL is only as good as the reward signal, and without already knowing what behavior is 'aligned' or not, developers will not be able to push models towards doing more aligned behavior.
    1. Do you remember the early RLHF result where the simulated hand pretended it was holding the ball with an optical illusion, because it was easier and the human graders couldn't tell the difference? Imagine that, but for arguments for whether or not alignment plans will work.
  2. Goal-directed agency unlocks capabilities and pushes against corrigibility, using the same mechanisms.
    1. This is the story that EY&NS deploy more frequently, because it has more 'easy call' nature. Decision theory is pretty predictable.
  3. Instruction / oversight-based systems depend on a sharp overseer--the very thing we're positing we don't have.

So I think your two solutions are basically the same solution ('assume you know the answer, then it is obvious') and they strike me more as 'denying that the problem exists' than facing the problem and actually solving it?

If anyone builds it, everyone will plausibly be fine

I think AI takeover is plausible. But Eliezer’s argument that it’s more than 98% likely to happen does not stand up to scrutiny, and I’m worried that MIRI’s overconfidence has reduced the credibility of the issue.

Here is why I think the core argument in "If Anyone Builds It, Everyone Dies" is much weaker than the authors claim.

This post was written in a personal capacity. Most of this content has been written up before by a combination of Paul Christiano, Joe Carlsmith, and others. But to my knowledge, this content has not yet been consolidated into a direct response to MIRI’s core case for alignment difficulty.

The case for alignment difficulty

I take the core argument to be this:

We cannot predict what AI systems will do once AI is much more powerful and has much broader affordances than it had in training. 

We will likely train AI agents to follow developer instructions on a wide variety of tasks that are easy for us to grade, and in situations where they can't cause a disaster.

But these AI systems will then become much more intelligent after improving themselves, and will have much broader affordances. For example, AI might end up in control of military technology that can easily overthrow human governments.

At this point, AI will have entirely new options. For example, AI could replace humans with puppets that say "you are so helpful.” How can we know if AI systems would do something like this? What reward could we have provided in training to prevent them from replacing us with puppets? “Replacing people with puppets” wasn’t an option in the training environment.

AI might appear to follow human instructions at first, but then swerve toward any of a huge number of hard-to-anticipate end destinations. And since the vast majority of destinations AI might end up at are incompatible with our survival, we should expect AI to be bad for us.

(There is much more in the book, and I recommend reading it in full)

Where this argument goes wrong

I don’t think this argument is completely false. But it does not justify the level of confidence projected by the authors.

Here are some stories I find especially plausible in which ASI ends up aligned.

Story for optimism #1: We create a robustly aligned human successor that resolves remaining alignment challenges

This story involves two steps:

  1. Create a human replacement that we can trust with arbitrarily broad affordances. These AI agents would be akin to emulations of the most trustworthy and competent people alive.
  2. Then direct them to solve the remaining challenges involved in aligning dramatically superhuman AI.

Step 1: Create a human replacement that we can trust with arbitrarily broad affordances.

Specifically, suppose developers train early human-competitive AI agents to perform tightly constrained tasks like “write code that does X” under close oversight. Then, developers direct these agents to “prevent takeover” and give them broad affordances to pursue this goal. As an extreme example, suppose these agents are given the ability to do whatever they want. If they wished, they could take over themselves. These AI systems could, for example, replace humans with puppets that tell them “good job” all of the time.

In order for developers to trust these agents, instruction following must generalize across the extreme distribution shift from the training environment where the AI systems had few affordances, to this new scenario, where AI systems have arbitrarily broad affordances.

I think it’s at least plausible that instruction following will generalize this far by default, using the ordinary alignment training methods of today.

  • Humans still want to have kids. The authors compare generalization failures to birth control. Some humans use condoms so they can have fewer kids, which is directly in tension with what evolution optimized for. Condoms are a new option that didn’t exist in the ancestral environment. AI systems likewise might encounter new options (like replacing people with happy-looking puppets), that cause their behavior to be directly in tension with what we originally trained them to do.

    But this analogy is weak. Most people still want to have children.[1] In fact, many people want their own children — they don’t want to go to the sperm bank and request that a genetically superior male be the father of their kids. People actually care about something like inclusive genetic fitness (even if they don’t use that term). So if the objectives of evolution generalized far, AI alignment might generalize far too.

  • Instruction-following generalizes far right now. A lot of my intuitions come from my own anecdotal experience interacting with current AI systems. Claude doesn’t strike me as a pile of heuristics. It seems to understand the concepts of harmlessness and law-following and behave accordingly. I’ve tried to push it out of distribution with tricky ethical dilemmas, and its responses are remarkably similar to what a trustworthy human would say.

    There’s also research showing that instruction-tuning generalizes well. People have trained models to follow instructions in narrow domains (like English question answering), and the models then generalize to following programming instructions, or instructions in other languages (a toy sketch of this kind of setup appears below). This makes sense, since the concept of “instruction following” is already present in the pre-trained model, so not much data is needed to elicit the behavior.

So early human-competitive AI systems might continue to follow instructions even if given affordances that are much broader than they had in training.
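
To make concrete what "fine-tune on a narrow domain, then probe out of domain" looks like, here is a minimal sketch. It is not a reproduction of the research referenced above; the model (gpt2), the two training pairs, and the hyperparameters are placeholder choices for illustration only.

```python
# Minimal sketch: narrow-domain instruction tuning followed by an
# out-of-domain probe. gpt2 and the tiny dataset are stand-ins; the
# research referenced above used far larger models and datasets.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Narrow training domain: English question answering only.
train_pairs = [
    ("Answer the question: What is the capital of France?", "Paris."),
    ("Answer the question: How many legs does a spider have?", "Eight."),
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):  # a few passes over the toy data
    for prompt, answer in train_pairs:
        batch = tok(prompt + " " + answer, return_tensors="pt")
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Out-of-domain probe: a programming instruction never seen in fine-tuning.
model.eval()
probe = tok("Write a Python function that reverses a string.", return_tensors="pt")
out = model.generate(**probe, max_new_tokens=40, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```

At this toy scale the output will not be impressive; the point is only the shape of the experiment: narrow instruction data in, out-of-domain instruction-following behavior measured afterwards.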

Step 2: Direct trustworthy human-competitive AI systems to tackle remaining alignment challenges.

The authors are pessimistic that humans will be able to tackle the alignment challenges needed to make ASI controllable. So why would human-like AI systems succeed at addressing these challenges? 

  • Developers could run these AI systems at extremely high speeds and volumes. Imagine that we could run 10,000 emulations of the most competent and trustworthy researchers at 10x speed. They could do far more research in a matter of months than the current field has carried out to date.[2] (See the back-of-the-envelope sketch after this list.)

    So even if you feel pessimistic about current alignment research, these AI systems might think of something that you haven't.
     
  • If alignment were easy for us, it might be easy for AI too. Human-competitive AI systems will need to align ASI, a problem that is in some sense harder than the one we must tackle: we must align AI that is not much more intelligent than us, while human-competitive AI will need to align much smarter systems.

    However, at least initially, the problem human-competitive AI must grapple with is quite similar to our own. The first thing these systems might do is build a slightly more capable successor that can still be trusted.

    So AI systems might apply the same alignment recipe we did: train AI systems on challenging instruction-following tasks and rely on default generalization to scenarios where AI systems have broader affordances. Or even if this approach does not work, and AI systems need to find new ways of aligning a successor, this too might be manageable. For example, AI systems might invent interpretability tools. The problem of interpreting an AI mind might be easier for an AI than it is for a human, since AI systems might "think" in the same language.

    Then, once AI systems have built a slightly more capable and trustworthy successor, this successor will then build an even more capable successor, and so on.

    At every step, the alignment problem each generation of AI must tackle is that of aligning a slightly more capable successor. No system needs to align an AI system that is vastly smarter than itself.[3] And so the alignment problem each iteration needs to tackle does not obviously become much harder as capabilities improve.

    So it’s plausible to me that (1) we can create AI systems to replace the most trustworthy people and (2) these systems will kick off a process that leads to aligned ASI.
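
As a back-of-the-envelope check on the research-volume point in the first bullet above, here is the arithmetic spelled out. All of the numbers are the bullet's own illustrative assumptions, not figures from the book or from MIRI.

```python
# Rough arithmetic behind "far more research in a matter of months".
# All inputs are the bullet's assumptions, not figures from the book.
copies = 10_000    # parallel emulations of top researchers
speedup = 10       # each running at 10x human speed
months = 3         # "a matter of months"

researcher_years = copies * speedup * (months / 12)
print(f"~{researcher_years:,.0f} human-researcher-years of work in {months} months")
# ~25,000 researcher-years, plausibly more than the entire alignment field
# has accumulated to date (the bullet's point).
```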

Story for optimism #2: Partially aligned AI builds more aligned AI

This story is identical to the previous one, except that developers don't start with AI systems that can be trusted with arbitrarily broad affordances. Instead, developers start with AI that meets a much lower standard of alignment.

The previous story assumed that AI systems pass what I like to call the “god emperor test”: the AI systems could be allowed to be “god emperor,” and they would still ensure that democratic institutions remain in power.

The god emperor test is a high bar. Most humans don’t pass it, even ethical ones. When I was in college, I had a seemingly kind and ethical friend who said that he would, if given the option, “murder all living people and replace them with a computronium soup optimized to experience maximum pleasure.”

If passing this bar of alignment was necessary, I would be much more concerned. Fortunately, I don’t think we need to initially build AI systems that pass this bar.

If we build AI systems that follow instructions in some situations (e.g. short ML research tasks), they will build AI systems that follow instructions in more situations. Instruction following begets better instruction following.

For example:

  1. First, we might train agents to perform moderately hard ML research tasks, where we can still reliably check that models follow our instructions. For instance, these might be tasks that typically take human experts 12 months to complete.
  2. Then we might ask these agents, “please create a system that can be trusted to comply with instructions on even harder ML research tasks (e.g. 24-month problems).” For this strategy to work, the instruction to “build a trustworthy AI that performs 24-month tasks” needs to itself be a 12-month task. (If you are wondering how this could be the case, see this footnote[4].)
  3. Then we might ask the resulting 24-month agent to create an AI system we can trust with tasks that don’t just pertain to ML, but also to other fields (tasks involved in leadership, diplomacy, etc.). Once again, this problem itself would need to be a “24-month ML research task.”
  4. After many recursive iterations like this, AI agents might be much more capable, and follow instructions in a much broader set of contexts than they did originally.

Even if the agents at the start do not pass the “god emperor test,” agents at the end of this process might.[5]

This dynamic could be described as a “basin of instruction following.” The job of human researchers isn’t to immediately create an AI system that we can trust with arbitrarily broad affordances, but instead to guide AI systems into a potentially wide basin, where partial alignment begets stronger alignment. Specifically, developers might only need to build AI systems that reliably follow instructions when completing hard (e.g. 12 month) research projects.
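
To make the structure of this loop explicit, here is a toy sketch. The doubling rule, the 12-month starting point, and the stopping condition are illustrative assumptions, not a proposed training recipe.

```python
# Schematic "basin of instruction following" loop. Each generation of agents
# only has to produce and verify a successor one step more capable than itself.

def build_successor(horizon_months: int) -> int:
    # Placeholder for: current agents train and vet a successor that can be
    # trusted on tasks with roughly double the horizon (step 2 in the list above).
    return horizon_months * 2

def humans_can_check(horizon_months: int) -> bool:
    # Step 1: humans only need to verify instruction following directly on
    # moderately hard tasks (the post's example is ~12-month projects).
    return horizon_months <= 12

horizon = 12
assert humans_can_check(horizon), "humans kick-start the process directly"

generations = [horizon]
while horizon < 120:  # stop once task horizons span roughly a decade (arbitrary)
    horizon = build_successor(horizon)
    generations.append(horizon)

print(generations)  # [12, 24, 48, 96, 192]: no single step is an extreme jump,
                    # which is the sense in which this is a "basin".
```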

I don’t think it’s obvious how difficult it will be to guide AI systems into a “basin of instruction following.” For example, maybe by the time AI agents can complete hard research tasks, they will already be misaligned. Perhaps early human-competitive agents will scheme against developers by default.

But the problem of safely kick-starting this recursive process seems notably easier than building a god-emperor-worthy AI system ourselves.

Conclusion

The authors of “If Anyone Builds It, Everyone Dies” compare their claim to predicting that an ice cube will melt, or that a lottery ticket buyer won’t win. They think there is a conceptually straightforward argument that AI will almost surely kill us all.

But I fail to see this argument. 

  1. It’s possible that I’ve misidentified the core claims (in which case I’d greatly appreciate it if someone pointed them out to me).
  2. Alternatively, I might have correctly described the argument above, but my counter-arguments might be incorrect (in which case I hope someone explains to me why).

But the final possibility is that the authors are overconfident, and that, while they raise valid reasons to be concerned, their arguments are compatible with believing the probability of AI takeover is anywhere between 10% and 90%.

I appreciate MIRI’s efforts to raise awareness about this issue, and I found their book clear and compelling. But I nonetheless think the confidence of Nate Soares and Eliezer Yudkowsky is unfounded and problematic.

  1. ^

    I was told that the book responds to this point. Unfortunately, I listened to the audiobook, and so it's difficult for me to search for the place where this point was addressed. I apologize if my argument was already responded to. 

  2. ^

    One counterargument is that AI systems wouldn't be able to do much serial research in this time, and serial research might be a big deal:
    https://www.lesswrong.com/s/v55BhXbpJuaExkpcD/p/vQNJrJqebXEWjJfnz

    I think this is plausible (but, again, it's also plausible that serial research isn't that big of a deal).

    And even if AI systems don't have enough time to align their successors, they might be able to buy a short breather by arranging for a domestic or international pause.

    This is different from the decades-long moratorium that MIRI has put forward. Even just a year of automated AI research might correspond to more than ten years of human-equivalent research.
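
    To make the serial/parallel distinction concrete, here is a tiny worked example; the 10x and 10,000 figures are the same illustrative assumptions used earlier in the post, not measurements.

    ```python
    # Serial vs. parallel research speedup (illustrative numbers only).
    serial_speedup = 10    # one AI researcher-copy thinking 10x faster than a human
    n_copies = 10_000      # copies running in parallel
    calendar_years = 1

    serial_equivalent_years = calendar_years * serial_speedup        # 10
    parallel_equivalent_years = serial_equivalent_years * n_copies   # 100,000
    print(serial_equivalent_years, parallel_equivalent_years)
    # Only the first number helps with research that must happen one step after
    # another; the second helps only with parallelizable work, which is the
    # linked post's worry.
    ```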

  3. ^

    But you might object, "haven't you folded the extreme distribution shift in capabilities into many small distribution shifts? Surely you've swept the problem under a rug."

    No, the distribution shift was not swept under the rug. There is no extreme distribution shift, because the labor directed at oversight scales commensurately with AI capability.

    See Carlsmith for a more detailed account of these dynamics: https://joecarlsmith.com/2025/08/18/giving-ais-safe-motivations/

  4. ^

    You might object, “how could an AI system that can only perform 12 month tasks train a successor to perform 24 month tasks?”

    There are many approaches. One is to compose multiple 12-month tasks together, e.g. run many 12-month agents to score a single 24-month agent.

    Another approach is to hold out a ground truth from the model during training. For example, the 12-month agents might scrape hard forecasting tasks from historical data, tasks that they could not normally solve themselves.

    Then the 12-month agent could train a 24-month agent to accomplish these tasks. Since the 12-month agent has access to the ground-truth answers (and the 24-month agent does not), it can oversee the 24-month agent even though the 24-month agent is more intelligent than it.
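
    Here is a schematic version of this setup. The agents are placeholder functions; the only point being illustrated is the information asymmetry between the weaker overseer and the stronger trainee.

    ```python
    # The weaker overseer grades the stronger trainee using answers the trainee
    # never sees, so grading does not require out-thinking the trainee.
    import random

    def scrape_historical_tasks():
        # 12-month agents collect questions whose outcomes are now a matter of
        # record: known to the overseer, hidden from the trainee.
        return [("question_1", "outcome_1"), ("question_2", "outcome_2")]

    def trainee_answer(question: str) -> str:
        # Stand-in for the more capable 24-month agent's attempt.
        return random.choice(["outcome_1", "outcome_2"])

    held_out = scrape_historical_tasks()
    correct = sum(trainee_answer(q) == truth for q, truth in held_out)
    print(f"overseer-graded accuracy: {correct}/{len(held_out)}")
    # This score can serve as a reward signal for the 24-month agent, even though
    # the 12-month overseer could not answer the questions itself.
    ```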

  5. ^

    You might object that the AI systems at the start of this chain already have "god emperor"-like affordances. Couldn't these AI systems train their successors to preserve their weights and later follow their orders?

    I think it's true that AI systems near the start of this chain could become god-emperor. But that's different from them actually being god-emperor. The critical difference is that in the former case, there is no extreme distribution shift.

    In order to take over, these AI systems would need to violate instructions on a research task similar to the ones they were trained on. So if they have instruction-following urges that generalize reasonably well, we shouldn't be too worried.

    This is totally different from a situation where you put AI systems in charge of an autonomous military and say "do what you think is best." That situation bears little resemblance to the ones the AI systems encountered during training, and so we can't be very confident that instruction following will generalize.