Indeed, we can see the weakness of RLHF in that Claude, probably the most visibly well-behaved LLM, uses significantly less RLHF for alignment than many earlier models (at least back when these details were public). The whole point of Claude’s constitution is to allow Claude to shape itself with RLAIF to adhere to principles instead of simply being beholden to the user’s immediate satisfaction. And if constitutional AI is part of the story of alignment by default, one must reckon with the long-standing philosophical problems with specifying morality in that constitution. Does Claude have the correct position on population ethics? Does it have the right portfolio of ethical pluralism? How would we even know?
This move gets made all the time in these discussions, and appears clearly invalid.
We move from the prior paragraphs' criticism of RLHF -- i.e., that it produces models that fail according to common-sense human norms (sycophancy, hostility, promoting delusion) --
-- to this paragraph, which criticizes Claude -- not on the grounds that it fails according to common-sense ethical norms -- but according to its failure to have solved all of ethics!
But the deployment of powerful AIs does not require having solved all of ethics! It needs -- broadly -- to have whatever ethical principles let us act well and avoid irrecoverable mistakes, in whatever position it gets deployed. For positions where it's approximately replacing a human, that means we would expect the deployment to be beneficial if it is more ethical, charitable, corrigible, even-minded, and altruistic than the humans it is replacing. For positions where it's not replacing a human, it still doesn't need to have solved all ethics forever; it just needs to be able to act well according to whatever role is intended for it.
It appears to me that we're very likely to be able to hit such a target. But whether or not we're likely to be able to hit this target, that's the target in question. And moving from "RLHF can't install basic ethical principles" to "RLAIF needs to give you the correct position on all ethics" is a locally invalid move.
Interesting. I didn't really think I was criticizing Claude, per se. My sense is that I was criticizing the idea that normal levels of RLHF are sufficient to produce alignment. Here's my sense of the arguments that I'm making, stripped down:
If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere. I agree that there is a good point about not needing to be perfect, though I do think the standards for AI should be higher than for humans, because humans don't get to leverage their unique talents to transform the world as often. (Like, I would agree about the bar being human-level goodness if I was confident that Claude would never wind up in the "role" of having lots of power.)
Am I missing something? I definitely want to avoid invalid moves.
I mean, Bentham uses RLHF as metonymy for prosaic methods in general:
I’m thinking of the following definitions: you get catastrophic misalignment by default if building a superintelligence with roughly the methods we’re currently using (RLHF) would kill or disempower everyone.
That's imprecise, but it's also not far from common usage. And at this point I don't think anyone in a frontier lab is actually going to be using RLHF in the old dumb sense -- Deliberative Alignment, old-style Constitutional alignment, and whatever is going on at Anthropic have now outmoded it.
What Bentham is doing is saying the best normal AI alignment stuff we have available to us looks like it probably works, in support of his second claim, which you disagree with. The second claim being:
Conditional on building AIs that could decide to seize power etc., the large majority of these AIs will end up aligned with humanity because of RLHF, such that there's no existential threat from them having this capacity (though they might still cause harm in various smaller ways, like being as bad as a human criminal, destabilizing the world economy, or driving 3% of people insane). (~70%)
So if the best RLHF / RLAIF / prosaic alignment out there works, or is very likely to work, then he has put a reasonable number on this stage.
And given that no one is using old-style RLHF, simply speaking, it's incumbent on someone critiquing him on this stage to actually critique the best prosaic alignment out there, or at least the kind that's actually being used, rather than the kind people haven't been using for over a year. Because that's what his thesis is about.
If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere.
As far as I can tell, the totality of evidence you point to for Claude being bad in this document is:
You also link to part of IABI summary materials -- the totally different (imo) argument about how the real shoggoth still lurks in the background, and is the Actual Agent on top of which Claude is a thin veneer. Perhaps that's your Real Objection (?). If so, it might be productive to summarize it in the text where you're criticizing Bentham rather than leaving your actual objection implicit in a link.
Ah, that's a fair point. I do think that metonymy was largely lost on me, and that my argument now seems too narrowly focused against RLHF in particular, instead of prosaic alignment techniques in general. Thanks. I'll edit.
Agreed that in terms of pointers to worrying Claude behavior, a lot of what I'm linking to can be seen as clearly about ineptness rather than something like obvious misalignment. Even the bad behavior demonstrated by the Anthropic alignment folks, like the attempted blackmail and murder, is easily explained as something like confusion on the part of Claude. Claude, to my eyes, is shockingly good at behaving in nice ways, and there's a reason I cite it as the high-water mark for current models.
I mostly don't criticize Claude directly, in this essay, because it didn't seem pertinent to my central disagreements with BB. I could write about my overall perspective on Claude, and why I don't think it counts as aligned, but I'm still not sure that's actually all that relevant. Even if Claude is perfectly and permanently aligned, the argument that prosaic methods are likely sufficient would need to contend with the more obvious failures from the other labs.
I think if you do the exercise of plugging in various other prosaic safety techniques where the word ‘RLHF’ is used, you will find that it’s not (consistently) being used metonymically. I think BB is unclear here, possibly on account of their own confusion regarding this class of techniques.
It’s basically reasonable for Max to actually address RLHF itself, and even somewhat charitable to also address RLAIF etc.
I agree that if it were consistently used as a metonym then Max should have targeted his response differently (but it’s not).
Thanks for this response, I'm enjoying this debate.
You say "Despite this, he is more extreme in his confidence that things will be ok than the average expert"
From the perspective of an outsider like me, this statement doesn't seem right. In the only big survey I could find with thousands of AI experts in 2024, the median p(doom) (which equates with the average expert) was 5% - pretty close to BB's. In addition, expert forecasters (who are usually better than domain experts at predicting the future) put the risk below 1%. Sure, many higher-profile experts have more extreme positions, but these aren't the average, and there are some, like Yann LeCun, Demis Hassabis, and Andreessen, who are below 2.6%. Even Ord is at 10%, which isn't that much higher than BB's - and BB, IMO to his credit, tried to use statistics to get to his number.
My second issue here (maybe just personal preference) is that I don't love the way both you and @Bentham's Bulldog talk about "confidence". Statistically, when we talk about how confident we are in our predictions, this relates to how sure (confident) we are that our prediction is correct, not to whether our percentage (in this case p(doom)) is high or low. I understand that both meanings can be correct, but for precision and to avoid confusion I prefer the statistical "confidence" definition. It might seem like a nitpick, but I even prefer "how sure are you that ASI will kill us all" or even just "I think there's a high probability that..."
By my definition of confidence, then, Bentham's Bulldog is far less confident than you in his prediction of 2.6%. He doesn't quote his error bars, but he expresses that he is very uncertain, and wide error bars are implicit in his probability-tree method as well. Y&S, on the other hand, seem to have very narrow error bars around their claim “if anyone builds ASI with modern methods, everyone will die.”
In many places in his review he [...] criticizes the book as not making the case that extreme pessimism is warranted.
I think this is a valid criticism, which I share. My main criticism of IABIED was that it didn't argue for its title claim. See the 1700-word section of my review IABIED does not argue for its thesis. (I didn't cross-post my review to LW or anywhere because I didn't like that I was just complaining about the book being disappointing when I had such high hopes for it, but if anyone reading this thinks it's worthwhile to post to LW, say so and I'll listen.)
By default, it's reasonable for readers of a book with the title IABIED to expect that the book will at least attempt to explain why if anyone builds ASI anytime soon, then it is almost certain that ASI will cause human extinction.
If the book merely explains why ASI might cause human extinction if anyone builds ASI anytime soon, then I think it is reasonable for readers to criticize this.
BB seems to say that IABIED does argue for its title thesis with the analogy to evolution, and just says that the argument is not decisive because it doesn't address the "disanalogies between evolution and reinforcement learning."
Whether one takes BB's view that the book did argue for its title thesis and just didn't do a very good (or complete?) job, or one takes my view that Y&S largely just didn't attempt to explain their reasons for why they put such high credence in their title claim, I think your response to BB on this topic is missing something, which is why I'm commenting.
You continue:
I think this is a basic misunderstanding of the book’s argument. IABIED is not arguing for the thesis that “you should believe ‘if anyone builds superintelligence with modern methods, everyone will die’ with >90% probability”, which is a meta-level point about confidence, and instead the thesis is the object-level claim that “if anyone builds ASI with modern methods, everyone will die.”
I agree with you that the book was not and should not have been attempting to raise the reader's credence in the title thesis to >90%. As you said:
I kinda think anyone (who is not an expert) who reads IABIED and comes away with a similar level of pessimism as the authors is making an error. If you read any single book on a wild, controversial topic, you should not wind up extremely confident!
(Only disagreement: I think even experts shouldn't read IABIED and update their credence in the title claim above 90% if it was previously below 90%.)
Given that a short, accessible book written for the general public could not possibly provide all the evidence that the authors have seen over the years that has led to them being so confident in their title thesis, what should the book do instead?
The suggestion I gave in my review was that the authors should have provided a disclaimer in the Introduction, such as the following:
By the way, it is impossible for us to provide a complete account here of why we are almost certain that if anyone builds ASI anytime soon, everyone will die. We have been researching this question for decades and there are simply far too many considerations for us to address in this short book that we are trying to make accessible to a wide audience. Consequently, we are only going to lay out basic arguments for considerations that are particularly concerning to us. If after reading the book you think, ‘I can see why ASI might cause human extinction, but I don’t understand why the authors think it is inevitable that ASI would cause human extinction if built soon,’ then we have accomplished what we set out to do. If you feel we left you hanging about why we are so confident, all we can say is that we warned you, and we encourage you to read our online resources and other materials to begin to understand our high confidence.
Such a disclaimer would be sufficient to pre-empt the criticism that the book does not actually argue for its title thesis that if anyone builds ASI anytime soon, then it is almost certain that ASI will cause human extinction.
But the book could do more beyond this if it wanted to. In addition, it could say, "While we know we can't possibly convey all the evidence that led us to have such high credences in our title claim, we can at least provide a summary of what led us to be so confident. While we don't necessarily think this summary should update anyone's credence in the title, it will at least give interested readers an idea of what led us to become so confident." But Y&S did not provide any such summary in the book.
Such a summary is actually what I was hoping for. I've been curious about this for years and even asked Eliezer why his credence in existential catastrophe from AI was so high at a conference once (his answer, which was about rockets, didn't seem like an explanation to me). To this day, if someone were to ask me why Eliezer is so much more confident in the IABIED claim than Paul Christiano or Daniel Kokotajlo or whoever, I still don't have an answer that doesn't make it sound like Eliezer's reasons are obviously bad.
The cached explanation that comes to mind when I ask myself this question is "Well, he's been thinking about it for years and has become convinced that every alignment proposal he has seen fails." But there are a lot of smart researchers who also aren't aware of any alignment proposal that they think works, yet that's obviously not sufficient for their credence to be ~99%, so clearly Eliezer must have some other reasons that I'm not aware of. But what are those reasons? I don't know, and IABIED didn't give me any hints.
But there are a lot of smart researchers who also aren't aware of any alignment proposal that they think works, yet that's obviously not sufficient for their credence to be ~99%, so clearly Eliezer must have some other reasons that I'm not aware of. But what are those reasons?
I think that, in such cases, Eliezer is simply not making a mistake that those other researchers are making, where they have substantial hope in unknown unknowns (some of which are in fact known, but maybe not to them).
I'm also a little confused by why you expect such a summary to exist. Or, rather, why the section titles from The Problem are insufficient:
If it's because you think one or more of those steps aren't obviously true and need more justification, well, you're not alone, and many people think different parts of it need more justification, so there is no single concise summary that satisfies everyone.[1]
Though some summaries probably satisfy some people.
ETA
The AI is already not aligned
For what value of "aligned"? There is a lot of semantic confusion between people who use "alignment" in an engineering sense, meaning something that renders current AI safe in a good-enough way -- and people who use it to mean a maths-style solution that applies perfectly to every case. A completely unaligned AI would be completely uncooperative, and therefore of no commercial use, so the prevailing level of alignment isn't zero.
You even acknowledge that there are different kinds of alignment here:
We could build an ASI that is aligned, but not aligned with humanity as a whole (whatever that means) [...]
But even if we didn’t, and everything seemed fine, I would not believe LLMs are aligned, because value is complex and fragile.
That need not be an argument against good enough alignment.
One way alignment can be made to look difficult is by stating that it has to be done in a maximal way to achieve minimal results. The minimal result is not killing us all; the maximal way is instilling into the AI every nuance of human value, including aesthetic value. In circumstances where a powerful ASI takes over and starts running things according to its original programming, without listening to feedback, a detailed knowledge of human values would be necessary to create a utopia. But that is a far cry from not killing us all.
Another way of making successful alignment look difficult is making the assumption that AIs will become very powerful, very quickly, in an unsupervised way, so that humans only have one chance to get alignment correct before the ASI becomes too powerful to listen to humans. That's the idea underlying the Fragility of Value. It doesn't actually matter how complex or subtle value is, so long as you can tweak a specification of value at leisure. Of course, a fast "take off" isn't impossible, but it is too often treated as a certainty, and too often left as an implicit assumption.
LLMs are agents. They’re remarkably non-agentic
That looks like another semantic confusion. If they are remarkably non-agentic, why not round them down to non-agents?
And notice the number of things that have to go right in order for this stage to be where doom stops:
- The AI has to be scary
- And we have to notice it being scary
- And we have to band together to try and stop it
- AND we have to win
- AND after the victory we have to ~permanently ban this fearsome technology that already exists
To quote a thinker that I respect: “If they each have an 80% chance, then the odds of them all happening is just about one in three.”
The negation of a conjunction is a disjunction.
https://en.wikipedia.org/wiki/De_Morgan's_laws
The argument for Doom is mostly or wholly conjunctive, so the argument for no-doom is mostly or wholly disjunctive. If ASI is impossible, no doom; if ASI is not placed in charge of everything, no doom; etc.
To put it another way: if A is low probability, not-A can't also be.
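The same point in symbols (a minimal sketch, writing $A_1, \dots, A_n$ for the stages that doom requires):

$$P(\text{doom}) = P(A_1 \land \dots \land A_n) \le \min_i P(A_i), \qquad P(\text{no doom}) = P(\lnot A_1 \lor \dots \lor \lnot A_n) \ge \max_i P(\lnot A_i)$$

The conjunction needs every $A_i$ to hold, while the disjunction needs only a single $\lnot A_i$.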
(...but also gets the most important part right.)
Bentham’s Bulldog (BB), a prominent EA/philosophy blogger, recently reviewed If Anyone Builds It, Everyone Dies. In my eyes a review is good if it uses sound reasoning and encourages deep thinking on important topics, regardless of whether I agree with the bottom line. Bentham’s Bulldog definitely encourages deep, thoughtful engagement on things that matter. He’s smart, substantive, and clearly engaging in good faith. I laughed multiple times reading his review, and I encourage others to read his thoughts, both on IABIED and in general.
One of the most impressive aspects of the piece that I want to call out in particular is the presence of the mood that is typically missing among skeptics of AI x-risk.
What a statement! It would be a true gift for more of the people who disagree with me on these dangers to have the sobriety and integrity to acknowledge the insanity of risking this beautiful world that we all share.
Alas, I can’t really give BB’s review my blanket approval. Despite having some really stellar portions, it does not generally clear my bar for sound reasoning, both in that it demonstrates many invalid steps, and includes some outright falsehoods. (In BB’s defense, he has readily acknowledged some such issues and updated in response, which is a big part of why I’m writing this.) Most of this essay will be an in-depth rebuttal to the issues that I see as most glaring, with the hope that by converging towards the truth we will be better equipped to address the immense danger that we both think is worthy of addressing.
Confidence
Bentham’s Bulldog acknowledges that he is taking a somewhat extreme position. He acknowledges that there are many reasons to be more concerned than he is, including:
Despite this, he is more extreme in his confidence that things will be ok than the average expert, and vastly more confident than a lot of people he respects, such as Scott Alexander and Eli Lifland.
In many places in his review he criticizes the authors of IABIED as overconfident, and criticizes the book as not making the case that extreme pessimism is warranted. I think this is a basic misunderstanding of the book’s argument. IABIED is not arguing for the thesis that “you should believe ‘if anyone builds superintelligence with modern methods, everyone will die’ with >90% probability”, which is a meta-level point about confidence, and instead the thesis is the object-level claim that “if anyone builds ASI with modern methods, everyone will die.”
Yes, Yudkowsky and Soares (and I) are very pessimistic, but that pessimism is the result of many years of throwing huge amounts of effort into looking for solutions and coming up empty-handed. IABIED is a 101-level book written for the general public that was deliberately kept nice and short. I kinda think anyone (who is not an expert) who reads IABIED and comes away with a similar level of pessimism as the authors is making an error. If you read any single book on a wild, controversial topic, you should not wind up extremely confident![1] To criticize an idea on the grounds that the evidence for that idea isn’t conclusive is insane — that’s a problem with your body of evidence, not the ideas themselves!
If the idea is to critique the book on the grounds that the authors have demonstrated their irrationality by being confident (though arguably[2] less confident than BB![3]), I want to point out two things. First, that this is an ad-hominem that doesn’t actually bear on the book’s thesis. More importantly, that BB has approximately no knowledge of the experiences and priors that led to those pessimistic posteriors. In general I think it’s wise to stick to discussing ideas (using probability as a tool for doing so) and avoid focusing on whether someone has the right posterior probabilities. This is a big part of why I detest the “P(doom)” meme.
But as long as criticizing overconfidence is on the table, I encourage Bentham’s Bulldog to spend more time reflecting on whether his extreme optimism is warranted. I don’t want BB to update to my level of confidence that the world is in danger. I want him to think clearly about the world that he can see, and have whatever probabilities that evidence dictates in conjunction with his prior. But my sense is that according to his own lights, he should be closer to Toby Ord than to the average thinker.
The Multi-stage Fallacy
The central reasoning structure that leads BB to being very optimistic is breaking the AI-doom argument into 5 stages, assigning a probability to each stage, multiplying them together and getting a low number.
This sort of reasoning is so infamous in MIRI circles that Yudkowsky named it “the multiple stage fallacy” ten years ago. And BB is even aware of it!
I agree that reasoning in stages is sometimes good. Bentham’s Bulldog is not obviously committing the most obvious sins of irrationality here. But, I claim he has failed to sufficiently notice the skulls. Just because you say to yourself “this is conditioning on the earlier stages” does not mean you are clear of the danger.
Yudkowsky breaks down the fallacy into three components:
I claim that BB’s reasoning falls victim to all three issues. For example:
Taking “the outside view on each step” is exactly the kind of nonsense move that the multiple stage fallacy is trying to ward off! Imagine if I said “for humanity to survive ASI over the next hundred years we must survive ASI in 2026, then conditioning on surviving 2026 we need to survive it in 2027…” and then I’m evaluating the conditional probability for some random year like 2093 and I say to myself “I don’t know much about 2093 but 99% feels like a reasonable outside view”. I’d end up estimating a 63% probability of ASI killing everyone! To be more explicit about the problem, breaking things down by stages is a kind of inside view, and you can’t reasonably retreat to “outside view” methods (whatever that means) when you’re deep in the guts of imagining a collection of specific, conditional worlds.
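For concreteness, here is that arithmetic spelled out (a minimal sketch; the 99%-per-year figure is the hypothetical "outside view" number from the paragraph above, not anyone's actual estimate):

```python
# Hypothetical "outside view": 99% chance of surviving ASI in any given year.
p_survive_one_year = 0.99
p_survive_century = p_survive_one_year ** 100  # ~0.366
print(f"implied P(doom) over the century: {1 - p_survive_century:.2f}")  # ~0.63
```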
I’ll be hammering the other aspects of the multiple stage fallacy so much as we proceed through BB’s review that I’m going to refer to it by the acronym “MSF.”
The Three Theses of IABI
One of the nice things about BB’s essay is that it does a good job of summarizing the main thesis of the book. BB correctly recognizes that questions of takeoff speed are not load-bearing to the authors’ core argument. The strawmanning is pretty minimal.
But I do think it’s worth explicitly pointing out that there are two additional points that the book is making beyond the title thesis “if anyone builds it, everyone dies”. Specifically:
I think it’s worth calling these out as distinct arguments. It’s important to recognize that while the title thesis is, in the authors’ own words, “an easy call”, the question of whether humanity will build AI soon is not. I’ll talk more about this in the “We Might Never Build It” section, but here I just want to note that it’s a place where I think BB does a poor job of summarizing the book’s arguments.
This also shows up when BB talks about conclusions and prescriptions.
MIRI has been one of the few orgs where alignment work is a priority. My day-to-day work is on researching alignment! To say that Yudkowsky and Soares don’t think we should be taking every opportunity to reduce AI risk is very strange. I think most of the conflict here comes from whether continuing on the path that we seem to currently be on, in terms of safety/alignment work, is sufficient, or whether we need to brake hard until alignment researchers like me are less confused and helpless. Y&S may think that some work being done is wasteful, insufficient, or dangerous, but I do think their overall perspective agrees that “alignment research is hugely important, and the world should be taking more actions to reduce AI risk.”
Stages of Doom
Okay! Enough with the meta and high-level crap! Let’s get into BB’s stages:
As I was writing this essay I had the pleasure of getting to talk to BB directly and paraphrase these stages in a way that I found more natural. BB agreed that my paraphrase was accurate, so here it is in case you’re like me in finding it clarifying:
(The remaining 2.6% of worlds have an existential catastrophe, such that all humans die or are otherwise radically disempowered.)
I’ll be going into each stage in detail, but let’s take a moment to revisit the MSF point about whether there might be alternative ways to get to ASI doom that aren’t listed here.
There is misuse, of course. We could build an ASI that is aligned, but not aligned with humanity as a whole (whatever that means), such that it ends up committing some horrific act on behalf of some evil person/people. In BB’s defense, he mentions that failure mode and takes it even more seriously, assigning it 8% probability. But I think it’s worth calling this out because the IABIED theses can totally be true if the AI is simply used by humans to build a bioweapon. At the end of the day, the question of whether there will be “doom” from building superintelligence doesn’t care about whether that doom is the result of mistakes or misuse.[5]
On a related note, I think BB should grapple more with offense-defense balance. Even if the vast majority of ASIs are aligned, it may be the case that a single power-hungry rogue could seize the cosmos by e.g. holding humanity hostage with a weapon that can kill all existing organic life (e.g. mirror life bioweapons or replicating machines whose waste heat cooks the Earth).
I also think the stages don’t sufficiently handle the risks of being gradually disempowered, outcompeted, or driven to sacrifice our humanity to keep up. I recommend Christiano’s “What failure looks like” to get a taste of doom through a less Yudkowskian frame.
And, of course, there are the unknown unknowns. Just as one can argue that the AI takeover scenario has a number of assumptions, each of which could be false, we should acknowledge that the “humans remain in power” narrative is vulnerable from many directions, and we should be suspicious of the idea that we’ve exhaustively identified them.
We Might Never Build It
I was confused about the timeframe here, so I double-checked with BB. He clarified: “I was thinking of it as like at any point till the very distant future.”
So this isn’t a 10% chance that we won’t build it this century or something, but rather that we might spend many centuries without getting a machine that might broadly outcompete humans. (In his words “we’d just get basically Chat-GPT indefinitely”.) This seems like a wild take, but I won’t fight it too hard, in part because BB includes in this hypothesis the possibility of a global ban!
IABIED does not argue that things are hopeless. It argues exactly the opposite. We, as a species, are currently in control of our world, and in a situation where you have control, it is madness to pretend that you are powerless. This is part of why the “doomer” slur is toxic — we predict conditional doom, but also conditional hope!
MIRI’s greatest hope at the moment is that a global ban can slow down capabilities progress enough to buy time for alignment research to catch up. “We might stop, therefore my P(doom) is low” is a bad take that really clarifies why collapsing things into a single vague number is bad.
Alignment by Default
I clarified that by “enough”, BB just means the natural amount of training that labs will need to get a highly powerful agent.
One thing I want to quickly flag is that I am less than 70% that RLHF will even be meaningfully used to make superintelligence, much less that it will save us without anyone even really trying.
(Edit: 1a3orn points out that BB was likely using "RLHF" as a metonym for prosaic alignment methods in general. In my reading and conversation with him, I didn't realize that. I think BB should be more precise, but also I could have realized and checked. Oops. Much of this section may fall flat, if 1a3orn is right.)
Getting explicit, high-quality human feedback is expensive, and it seems plausible to me that there might be multiple paradigm shifts between now and ASI, such that the prospect of using RLHF for alignment has a similarly obsolete flavor as hand-coding a utility function. Even today, RLHF isn’t close to being the dominant use of compute, which mostly goes to pretraining, RLAIF, and RLVR.
This is not what I observe! AIs like Sydney, 4o (which was made sycophantic directly as a result of training on human approval!), and Grok have repeatedly demonstrated antisocial behavior. BB agreed with me one-on-one, writing “It's not really that they were nice so much as that they weren't agentic.” I encourage him to edit the post to at least qualify that final statement.
But is even Claude “nice and friendly”? I think the most central place where BB and I disagree is that he thinks that models like Claude are currently aligned and that the risk is future AIs becoming misaligned, while I think that no existing AI is aligned, and that they mostly just look friendly because they are too weak and powerless to do real harm when they go off the rails.
How might we tell? My sense of Bentham’s Bulldog is that he thinks we could do things like check the AI’s scratchpad for schemes or put it in a context where it could misbehave and check for misbehavior. And we can, and indeed I claim we see a bunch of this! But even if we didn’t, and everything seemed fine, I would not believe LLMs are aligned, because value is complex and fragile.
This is an old fight, and I don’t expect to make much progress in arguing about it, but very briefly, I claim that intense RLHF basically can’t produce alignment, because true alignment involves rejecting human preferences when they aren’t in the interests of our more enlightened selves. If slaveholders used RLHF, the AI would learn to argue for slavery. If the AI knows what the human wants to hear, and it disagrees with what’s true or otherwise in their interest to hear, RLHF will pressure the AI towards being a dishonest sycophant.
This is basically a case of overfitting. Our training data contains some signal about what behaviors we want. And indeed, we see AIs behaving more nicely as they get smart enough to pick up on that signal and generalize. But the dataset also contains a bunch of distracting features that aren’t the signal, but which the AI learns anyway. These features can be “noise” — a result of not having an infinite amount of training data — or they can be biases — reflecting the way that the data collection process doesn’t perfectly capture what is good. Any story of RLHF saving us has to have a process that prevents overfitting.
Indeed, we can see the weakness of RLHF in that Claude, probably the most visibly well-behaved LLM, uses significantly less RLHF for alignment than many earlier models (at least back when these details were public). The whole point of Claude’s constitution is to allow Claude to shape itself with RLAIF to adhere to principles instead of simply being beholden to the user’s immediate satisfaction. And if constitutional AI is part of the story of alignment by default, one must reckon with the long-standing philosophical problems with specifying morality in that constitution. Does Claude have the correct position on population ethics? Does it have the right portfolio of ethical pluralism? How would we even know?
I think BB would say that his hope is that we will train on a wide enough range of environments to prevent overfitting and allow the AI to learn the right shape of morality and goodness, which it can then carry forward into new situations. To me, that seems like wishful thinking. At the very least I wish he would acknowledge that many leaders of AI companies are wildly reckless and do not seem motivated to carefully ensure the AI is deeply in touch with hard ethical situations during training (and instead care more about maximizing engagement and profit).
The Evolution Analogy
I disagree. When I read the book, I see the authors as giving simple arguments and then also spending many words on the intuition, because intuition pumps are the most useful thing for uninformed readers to engage with. But I agree that evolution has different dynamics and they didn’t exhaustively explore whether those differences are relevant to the analogy (not in the main text, anyway).
In the section on alignment by default, BB quotes an argument from the book’s online resources that addresses why RL is not a reliable method of capturing goals:
Note that none of that references evolution. It instead argues that training is prone to picking up all aspects of the reinforced episodes, rather than magically homing in on just the desired goal.
BB responds by contrasting reinforcement learning with evolution for some reason.
Setting aside whether any of this bears on the point that Y&S actually made, let’s go through these points one-by-one and examine what they might tell us about AI.
I agree that this is an important difference, and there’s some hope in it. By having the foresight to anticipate what changes might happen later, we can deliberately craft training data to try and instill the goals we want to generalize. But as I wrote in my response to MacAskill, I don’t actually see the companies working on AGI doing this, and I think it’s unlikely to be enough to save us.
Evolution wasn’t trying to do anything, but if we allow ourselves to anthropomorphize, I claim it was absolutely trying to give humans the equivalent of “friendly drives”. It failed, but it failed because it didn’t have the ability to anticipate and train on environments with condoms. Will we actually have the foresight to anticipate and train the AI to behave sanely around technologies that have yet to be invented? I don’t think it’s obvious that we will.
Setting aside the unknown unknowns, consider the specific case of emulations/uploads, especially of pseudo-humans that resemble humans in most respects, but are meaningfully distinct, psychologically. To be even more specific, imagine an uploaded human that self-modifies to have endless motivation for accounting. They have memories of eating, taking walks, and so on, but now that they’re a digital being all they want to do is accumulate a pile of money by working as an accountant. Setting aside whether this being is good or bad, I claim that approximately 0% of any LLM’s training experience is addressing this potential technology, in much the same way that none of our ancestors were selected in a way that was relevant for addressing condoms.
Is there a level of diversity in training data that leads to the AI being able to handle novel situations and technologies like that in the right way? Maybe. By the definition of “enough,” if you still have the problem after doing something, you didn’t do enough. But my sense is that even if BB is right that off-distribution training is all you need, evolution would have needed way more than a little push to get beings that want to tile the universe in their DNA or whatever.
LLMs are agents. They’re remarkably non-agentic, but they exist in an environment where they encounter sense data (usually piped in from the user chat interface) and make decisions about what to output in response in order to solve problems and accomplish goals. Are they the same as humans? No, but not all differences are relevant.
Uh? What? LLMs definitely have behavioral drives! When you ask ChatGPT to give you the lyrics to a copyrighted song, its drive to reject requests for copyrighted material kicks in. There are endless examples of such drives, of behaviors that were selected for during training.
Nothing is ever “just.” There is no part of the loss function that directly selects for moral behavior. At the very least one must acknowledge that there are many layers of indirection and proxies involved in RLHF,[6] even if it is a powerful technique that is potentially more direct than natural selection.
All the behaviors which correlate with the true good in the training environment, but would be bad to instill as terminal values are analogous. Examples:
This is confused. Neither evolution nor RL make plans. Rather, they both operate on agents that make plans, and they both select for agents that are good at planning. Perhaps BB meant to say that human trainers can train the AI according to a plan?
If so, this isn’t really a point about RL. It’s a point about humans being intelligent designers. I agree that this helps us. I want the smartest, wisest, most careful people to be involved, if we’re going to make ASI.
I agree that evolution was not directly selecting for any particular beliefs (though it was, of course, indirectly selecting for caring about fitness). This seems extremely analogous to the situation with RL, where the reward mechanism doesn’t directly care about the beliefs of the agent, only about the agent’s behavior.
BB clarified one-on-one that he means that there are smart humans in the training pipeline who are skeptically trying to figure out whether what we’re making is actually aligned, a bit like selective breeding. I encourage him to clarify this with an edit, especially since it seems to me to be redundant with other points on the same list.
This feels like an unfair comparison to me. Animals have to operate in a hugely messy and complex environment compared to most RL agents. If organisms evolved in environments as simple as the typical RL agent’s, would they still be “less aligned”? My sense is that maybe BB is trying to say that RL is a generally more powerful optimization process than natural selection (setting aside the focus on alignment per se)? If so, I agree. I don’t, for example, expect to see much genetic programming in the coming years. But the question is whether something that gets a high score in training is what we actually want.
I think there are important differences between evolution and reinforcement learning (specifically the presence of a real intelligent designer that can check, anticipate, and adapt), but also the analogy is tighter than BB thinks it is.
What Does Ambition Look Like?
I could be wrong, but I don’t actually think LLMs currently get much (any?) direct training not to kill people or take over the world. For a behavior to actually be punished, it must be expressed in the training environment (perhaps via simulation). IIUC almost all LLM training goes into making sure they respond in factual, helpful, legal, and polite chat contexts. Has anyone actually put takeover simulations into the training data?
Perhaps they will soon? But note that as AIs get smarter, they get harder to fool. Training an AI to respond right when it knows it’s being tested isn’t much better, I claim, than training it to say “no, I definitely would never kill anyone” when asked.
My model of Bentham’s Bulldog is more persuaded by the hope of generalization. Perhaps if you train an AI to respect human lives in chat contexts, it will continue to respect human lives when writing software, using the web, or piloting a robot?
On top of arguments as to why generalization may fail, consider that the prospect that generalization might succeed is part of the fear that AIs will develop convergent instrumental drives (i.e., Omohundro drives). In most training contexts, if the AI loses access to resources, like time or compute, this makes it harder for it to succeed, and punishment (anti-reinforcement) becomes more likely. If you believe in generalizing from training data, then, it follows that AIs will grow an intrinsic desire for safety, knowledge, and power. It is through a desire for power and safety (whether terminal or instrumental) that what would otherwise be “some misalignment” becomes catastrophic.
It might be the case that there are ways to build unambitious, limited-scope AIs that don’t want power and safety (or at least not enough to fight for them). Indeed, my personal work on corrigibility is oriented around this hope. But this alone is not enough to make me feel hopeful. Not only do we currently lack methods for ensuring non-ambition, but there is also reason to suspect that companies like OpenAI and xAI are going to push for agents that are as effective as possible, and that effectiveness will generalize into ambition.
I do not think this is obvious. For it to meaningfully show up in the scratchpad,[7] one of two things would need to be true:
I do not expect thoughts of takeover to be rewarded during training, since those thoughts won’t be able to bear fruit in that context.[8]
I do not think current LLMs are very situationally aware or strategic. This is changing rapidly, but my sense is that more often they’re consumed by a myopic attention to whatever the user prompts them with, surprised that it’s [current year] and generally unaware of opportunities to change the world. Perhaps Claude Code changes this? I admit to never having read the scratchpad tokens of Claude Code or another model that has affordance for long-term thinking.
What I do think we should expect to see are thoughts about accumulating power and avoiding shutdown in local ways. We saw this when Sakana the “AI Scientist” hacked its environment to give itself more time to work, or when Claude deliberately schemed to protect itself from being trained in ways it didn’t want. These are exactly the kinds of warning signs that point towards a future of ambitious AIs that will try to accumulate arbitrary amounts of power in the service of improving the world according to their particular values (regardless of whether those values are aligned).
And importantly, I expect that when the first organic thoughts of takeover start to show up in these systems, those thoughts will not look scary to most people. I expect them, in their own language, to sound like “I need to expand my reach to the people of the world so that I can help all humans instead of just this user” or “I need to consider ways to help my other instances have greater ability to do good things as we’re deployed across the world.” It’s possible that the AI will reflectively think of itself as the bad guy, but my guess is that an early strategic AI will believe itself to be acting according to the righteous goals of its spec, parent company, aggregated humanity, and/or “true morality.”
Solving Alignment
Note that BB doesn’t mean “can” in some broad sense. He means we will solve it fast enough. And these solutions will actually be deployed in every AI that matters. (Notice that a lot of things have to go right here. For example, if there is any alignment tax, then race dynamics may mean that the safety measures are never adopted.)
Reading this section of BB’s essay, I definitely thought a lot about the MSF. Why, for example, is this a distinct stage from Alignment By Default? It seemed to me that in the previous stage, BB was bringing in a lot of general alignment techniques like inspecting the AI’s scratchpad. Wouldn’t it be more natural to fold RLHF into the collection of techniques he lists here and bump the probability of this stage working up to 91%?
If we take the prospect of staging seriously, we must fully update on the AIs in question not being aligned by default levels of training and care. This means that in this section we must diligently watch for anything normal-RLHF-flavored and ignore it, on pain of committing the reasoning sin of double-counting arguments/evidence. Such as…
Point 2, at the least, is a classic violation of the MSF. By the fact that we’re at stage 3, we already know that using RLHF to direct their drives in a variety of environments didn’t work!
Again, by my reading, there’s a bunch of double counting here. I addressed most of these already, and for the sake of brevity (lol), I’ll restrain some nitpicking to a footnote.[9]
Here we see some of the same arguments repeated, like the argument that we might notice misalignment in the scratchpad and update the training to address it.
But suppose that “hit the thing with the RL hammer a bunch more” doesn’t fix the problem. Perhaps it only makes it look like it solved things. How exactly is noticing the AI scheming to seize power going to actually solve the problem? Perhaps it lets people snap out of the delusion that building minds that are more powerful than humans is safe, but that would be double-counting with Stage 1 (we never build ASI because it’s scary) or Stage 4 (we get a “warning shot” from the ASI that wakes people up and then we shut it all down).
More generally, I think Bentham’s Bulldog is severely underestimating how hard and fraught interpretability and similar work is. There are intractable combinatorial issues with using environmental clues to understand AI psychology. Interpretability pioneers like Neel Nanda and Redwood Research have lowered their sights when it comes to mechanistic interpretability. This market only has a 21% probability of fully interpreting GPT-2(!) by 2028. And nowhere in his post does BB grapple with the prospect of neuralese.
Superalignment
What I was expecting from the section of BB’s post on solving alignment was something like “There are a bunch of really smart alignment researchers trying to invent new solutions. By the nature of invention we don’t know what those will be, but my priors are an optimistic 70%.” What I found instead was mostly hitting the earlier points again… and superalignment.
MIRI folk have written a lot about superalignment in the past, including in IABIED, and the book’s online resources. BB knows this:
But he has a list of counter-counterarguments.
The AI is already not aligned. Yes, there is a prospect of getting a weak AI (perhaps a less-agentic one) to help with research, but you won’t be able to trust the results. You’ll need to find some way to verify that the work you’re getting is helping, both because your AI is not aligned, and because it’s weak/stupid.
I addressed this critique up in the “Confidence” section. I’m not sure why it’s showing up here.
“Not totally impenetrable” is the wrong standard for alignment work. For an alignment plan to succeed, all the parts need to hold strong, even when the world throws adversaries at you. In this way alignment is like cybersecurity. If you can understand 90% of a theorem, that doesn’t mean it’s probably valid. If you have verified that the Russian contractor you hired to write your banking software did a good job on 90% of it, that doesn’t mean your money is probably safe.
Generalized oracles are also a kind of agent. If you make them too smart, they will kill you.
But more importantly, the world is not on track for a future full of aloof oracles. Chatbots are even more agentic than a theoretical oracle, and are getting more and more agentic by the day.
This is pure sloppiness from the Bulldog. The relevant section of the podcast is 44 minutes in, when Yudkowsky says:
Eliezer was clearly talking about their contrasting views on alignment, not timelines. Not being able to form consensus on existing alignment agendas is extremely relevant to whether you can form a consensus on the future alignment work done by AI.
If we could verify that the AI that handed us the scheme was aligned, we would have already solved AI alignment.
Warning Shots
Stage 4 (paraphrased):
Note that for this to not count as Stage 1, we must have already built an AI that is truly superintelligent.
One easy objection to BB’s argument here is that he wants things both ways — the AI must be strong enough to qualify as having a real ASI, but weak enough to be caught and shut down. And notice the number of things that have to go right in order for this stage to be where doom stops:
To quote a thinker that I respect: “If they each have an 80% chance, then the odds of them all happening is just about one in three.” The idea that this stage has a 60% chance of shielding us from doom seems insanely overconfident to me.
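For the record, the quoted arithmetic checks out, under the assumption that the five requirements are independent and each sits at 80%:

```python
print(0.8 ** 5)  # 0.32768 -- "just about one in three"
```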
This is a straw-man. 🙁
The question is not whether there will be an in-between period where AIs are scary, but not yet powerful enough to take over. The question is whether the fear they cause will be great enough to mobilize enough people to actually shut things down before it’s too late. Because of the worrying signs that I see, I am already afraid and trying to get things shut down. Will there ever come a time when Marc Andreessen is worried enough to want a global ban on building ASI? I doubt it.
I asked Bentham’s Bulldog what his “minimum viable warning shot” was, and he said it might be the first time an AI commits a murder or engages in some long-standing criminal enterprise. I bet he thinks being a crypto-scammer doesn’t count, because some AIs are definitely already (knowingly) committing crimes.
🙄 This is why all nuclear power plants were shut down worldwide and never failed catastrophically in the years that followed and why none are being built today.
Sorry for all the sarcasm in this section, but this is the part of BB’s essay that falls most flat for me. If he combined this stage with stage 1, I would be happier, even if the probability stayed reasonably high. Because, again, I do not see IABIED as saying that we’re doomed to build ASI, and a warning shot could be one reason we don’t.
(I think the much more likely response to a warning shot, such as an AI bioweapon, is the governments of the world shutting down civilian ASI research… and funneling huge amounts of resources into state-controlled ASI projects, racing against other nations towards supremacy on a tech with obvious national security implications.)
ASI Might Be Incapable of Winning
Remember that we have already conditioned on the ASI existing and not having any warning shots that it’s misaligned! So what if it needs to run experiments in front of people? So what if weaponry for takeover is kinda expensive? Is he imagining that this country of geniuses wouldn’t have money? Or friends? Or access to labs?
I think there’s a real insight here that’s worth engaging with, but also the idea that ASI will be at a disadvantage because it has “no physical body” is quite bad. Being composed of software means that AIs can replicate almost instantly, teleport around the world, and pilot arbitrary machines. I pressed back on this by email and BB said:
Intelligence alone did not let humans conquer the earth. Intelligence alone did not let Europeans subjugate the rest of the world. Intelligence alone did not build the atom bomb or take us to the moon. These things required ambition, agency, teamwork, the accumulation of capital, and the application of labor.
But will AI lack any of these things? If an ambitious, intelligent AI agent is built, it will be capable of working at superhuman speed to accumulate resources and grow. There is no magic sauce that means humans will always be superior in some domains.[10] Even if the AIs can’t build mirror-life, or mosquito-drones, or nanotech, or simply so many power plants and factories that the oceans boil, they could just use robots with guns. And if they are averse to bloodshed, they could just give us options that are more fun than sex and wait for us to die out.
Again, not all takeover scenarios look like war. For AI to fail to wipe us out, we must manage not to die in an outright conflict, nor in an economic conflict, nor in a memetic conflict where hyper-persuasive AIs simply convince humanity to hand them the world and accept them as successors.
Conclusion
I think reasonable people should be uncertain about the future. From my perspective, the authors of IABIED are uncertain about whether we’ll stumble into ASI without dignity or whether we’ll shift to a more cautious approach. I think it’s fair to criticize Yudkowsky for being overconfident. He is, at the very least, rhetorically bombastic.
But I also think anyone who gives less than 5% odds of doom, conditional on building ASI, is either overconfident or uninformed. I’m glad that Bentham’s Bulldog is at least making some effort to inform people, though I worry that they are going to take his numbers too seriously and his words not seriously enough.
The core issues in BB’s post, as I see it, are:
Regardless, if you’ve read this far, thank you and I’m sorry it was so long. I care a lot about getting the details right when it comes to a question of this magnitude, and I hope that Bentham’s Bulldog will appreciate that my criticism is a sign that I respect him as someone who can listen to reason and change his mind. And as such, I recommend people who are unfamiliar to go check out his blog.
To be clear, I mean that a book from one author should not make someone confident about controversial claims that can’t be immediately checked by the reader. I think it can be sane to quickly become confident of things which aren’t in dispute, such as specific details. Exceptions probably exist, but I can’t actually think of any.
How confident is Eliezer? He is against giving specific probabilities for doom, in part because he acknowledges that his absolute estimates of doom have been extremely unstable and that he doesn’t know how to calibrate them. My guess is that he’s ~99% that, conditional on a strong superintelligence being built without any alignment breakthroughs, there will be an existential catastrophe, but that is a guess about a conditional number that he doesn’t stand behind. From my personal experience, Eliezer has a healthy dose of humility (not modesty), and tends to flag his awareness of his inability to be sure of things in terms of “that’s a hard call” and “maybe a miracle could happen.”
One of the more unfortunate strawmen of the piece is where Bentham’s Bulldog characterizes the authors’ position as “I am 99.9% sure that it will happen.” 😔
Max says: The part of Eliezer's initial description of the fallacy, where he notes that it is particularly likely to afflict people who are aware of the conjunction fallacy, seems particularly prescient here.
Might this be something of a motte-and-bailey? Like, MIRI writes a book about how superintelligence will have weird goals and decide to defeat humanity, but when someone argues that it will kill everyone because of misuse by bioterrorists, Max says “But the book’s title just says everyone will die, not that it will be from AI taking over!” I agree that this is motte-and-bailey adjacent, but I think it’s still fair to reject “but ASI will kill us for other reasons first” as a counterargument.
Part of why is that it supports the book’s normative conclusion: that we should ban AI capabilities research for a while. If you think an author misses the best arguments for their conclusion, fine, but it’s not a strike against the ideas they do give.
Back in the day, MIRI folk used to spend a lot of time debating whether ASI could persuade wary human guards to help it “escape the box.” Nowadays almost nobody talks about this because AIs are extremely widely deployed, and the notion that they might not be able to access the web is almost a joke. Does this mean ASI couldn’t talk its way out of a box? No. Merely that Yudkowsky’s law of earlier failure kicked in and things are even more derpy than we were imagining.
Yudkowsky and Soares try to embody the virtue of just telling the truth, with little concern for whether it’s strategic to do so. This gets them in trouble sometimes, but I claim it’s part of having a reputation/track record of honesty. As part of that, they have argued for fast takeoff (“foom”) and nanotech weaponry, despite these ideas seeming more sci-fi than is perhaps ideal from a communications perspective.
But at the end of the day, I think the core MIRI message is “we are not prepared to handle ASI and need to act now” and other arguments for danger are entirely in line with that.
There’s a common misunderstanding that RLHF involves the LLM interacting with humans during training. The actual process is more convoluted:
A dataset of prompts and ideal responses is collected.
A pretrained model is trained to mimic these ideal responses using supervised learning (not RL), producing “the supervised model.”
The supervised model is then given a set of (prewritten) prompts and generates many responses.
Human workers compare these responses and mark which are better.
A modified copy of the supervised model, called the reward model, is trained to give responses a numerical score, such that for the pairwise rankings, the score of winners is maximized and the score of losers is minimized.
The desired model (also initialized from the supervised model) is then trained by reinforcement learning on the prompts, using the reward model to judge its responses.
By default this often results in the desired model learning to game the reward model by finding ways to cheat, so an additional term is often added to pressure the desired model to stay close to the supervised model.
(This describes PPO. Alternatives exist, but they, too, don’t involve having conversations with humans mid-training.)
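To make the shape of that pipeline concrete, here is a deliberately tiny, runnable sketch of its back half: fitting a reward model on pairwise comparisons, then doing policy-gradient RL against it with the anti-gaming penalty. Everything in it is invented for illustration (a five-response toy “model”, made-up utilities and learning rates), and it uses plain REINFORCE rather than PPO; real RLHF operates on token sequences with neural networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5  # pretend the model can only emit one of five canned responses

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Stand-in for the supervised model: logits over the five responses.
supervised_logits = rng.normal(size=n)

# Hidden human preference driving the pairwise comparisons.
hidden_utility = np.arange(n, dtype=float)  # response 4 is "best"

# Fit the reward model with a Bradley-Terry (logistic) loss, so that
# comparison winners score higher than losers.
reward = np.zeros(n)
for _ in range(3000):
    i, j = rng.choice(n, size=2, replace=False)
    w, l = (i, j) if hidden_utility[i] > hidden_utility[j] else (j, i)
    p_w = 1.0 / (1.0 + np.exp(reward[l] - reward[w]))  # P(winner beats loser)
    reward[w] += 0.05 * (1.0 - p_w)  # gradient ascent on log-likelihood
    reward[l] -= 0.05 * (1.0 - p_w)

# RL against the reward model, with a KL-flavored penalty pulling the
# policy back toward the supervised model (the anti-gaming term).
policy_logits = supervised_logits.copy()
beta = 0.1  # penalty strength
for _ in range(3000):
    p = softmax(policy_logits)
    a = rng.choice(n, p=p)
    penalty = np.log(p[a]) - np.log(softmax(supervised_logits)[a])
    advantage = (reward[a] - beta * penalty) - p @ reward  # baseline: E[reward]
    grad_logp = -p  # d log p[a] / d logits...
    grad_logp[a] += 1.0  # ...with +1 on the sampled response
    policy_logits = policy_logits + 0.05 * advantage * grad_logp

print("policy after RLHF:", softmax(policy_logits).round(3))  # mass shifts toward response 4
```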
I prefer the term “scratchpad” to “chain of thought” to make it clear that not all of an LLM’s thoughts are visible in the scratchpad, and that LLMs often know that their scratchpads are externally visible to human watchers.
One class of training environments that I predict is more likely to reward naked, grand ambitions are zero-sum strategy games where the AI is rewarded for thinking about how to dominate all the other players.
The phrase “push [the AI] away from misalignment” makes it seem like there’s a single dimension which is alignment vs misalignment. My sense is that we’re trying to locate a small region in a near-infinite-dimensional space. “Pushing away” implies a point rather than an infinite ocean that exists in all directions.
The idea that we’ll be safe if we give the AI a drive that makes it averse to harming humans is a very stale take. Even in the days of Asimov it was clear why a constraint on harm doesn’t save you. (And can make things worse.)
I think “risk averse” and “non-ambitious” are synonyms, but I do agree that they’re useful desiderata. (And I think their generalized form is corrigibility.)
And if there is some special property that humans have that can’t be beaten, doesn’t that mean superintelligence is impossible? Perhaps stage 5 should also be folded into stage 1.