The title is reasonable

by Raemon
20th Sep 2025
22 min read
90 comments, sorted by top scoring
Some comments are truncated due to high volume.
[-]Neel Nanda2d4819

I disagree with the book's title and thesis, but don't think Nate and Eliezer committed any great epistemic sin here. And I think they're acting reasonably given their beliefs.

By my lights I think they're unreasonably overconfident, that many people will rightfully bounce off their overconfident message because it's very hard to justify, and that it's stronger than necessary for many practical actions, so I am somewhat sad about this. But the book is probably still net good by my lights, and I think it's perfectly reasonable for those who disagree with me to act under different premises.

[-]Raemon2d152

I disagree with the book's title and thesis

Which part? (i.e, keeping in mind the "things that are not actually disagreeing with the title/thesis" and "reasons to disagree" sections, what's the disagreement?)

The sort of story I'd have imagined Neel-Nanda-in-particular having was more shaped like "we change our current level of understanding of AI".

(meanwhile appreciate the general attitude, seems reasonable)

7Neel Nanda2d
I expect I disagree with the authors on many things, but here I'm trying to focus on disagreeing with their confidence levels. I haven't finished the book yet, but my impression is that they're trying to defend a claim like "if we build ASI on the current trajectory, we will die with P>98%". I think this is unreasonable. Eg P>20% seems highly defensible to me, and enough for reasonable arguments for many of the conclusions. But there's so much uncertainty here, and I feel like Eliezer bakes in assumptions, like "most minds we could expect the AI to have do not care about humans", which is extremely not obvious to me (LLM minds are weird... See eg Emergent Misalignment. Human-shaped concepts are clearly very salient, for better or for worse). Ryan gives some more counterarguments below, I'm sure there's many others. I think these clearly add up to more than 2%. I just think it's incredibly hard to defend the position that it's <2% on something this wildly unknown and complex, and so it's easy to attack that position for a thoughtful reader, and this is sad to me. I'm not assuming major interpretability progress (imo it's sus if the guy with reason to be biased in favour of interpretability thinks it will save us all and no one else does lol)
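(A small aside to make the contrast in confidence levels concrete, a sketch using only the figures in the comment above: the stronger claim leaves at most 2% of total probability across all survival routes combined, while the weaker claim leaves up to 80%.)

$$P(\text{doom}) > 0.98 \;\Longleftrightarrow\; \textstyle\sum_i P(\text{survival route}_i) < 0.02, \qquad P(\text{doom}) > 0.2 \;\Longleftrightarrow\; \textstyle\sum_i P(\text{survival route}_i) < 0.8$$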
4Raemon2d
I think they maybe think that, but this feels like it's flattening out the thing the book is arguing and more responding to vibes-of-confidence than the gears the book is arguing. A major point of this post is to shift the conversation away from "does Eliezer vibe Too Confident?" to "what actually are the specific points where people disagree?".  I don't think it's true that he bakes in "most minds we should expect to not care about humans", that's one of the things he specifically argues for (at least somewhat in the book, and more in the online resources) (I couldn't tell from this comment if you've actually read this post in detail, maybe makes more sense to wait till you've finished the book and read some of the relevant online resources before getting into this)
[-]Neel Nanda2d126

I don't really follow. I think that the situation is way too complex to justify that level of confidence without having incredibly good arguments ideally with a bunch of empirical data. Imo Eliezer's arguments do not meet that bar. This isn't because I disagree with one specific argument, rather it's because many of his arguments give me the vibe of "idk, maybe? Or something totally different could be true. It's complicated and we lack the empirical data and nuanced understanding to make more complex statements, and this argument is not near the required bar". I can dig into this for specific arguments, but no specific one is my true objection. And, again, I think it is much much harder to defend a P>98% position than P>20% position, and I disagree with that strategic choice. Or am I misunderstanding you? I feel like we may be talking past each other

As an example, I think that Eliezer gives some conceptual arguments in the book and his other writing, using human evolution as a prior, that most minds we might get do not care about humans. This seems a pretty crucial point for his argument, as I understand it. I personally think this could be true, could be false, LLMs are really weird, but a lot of the weirdness is centered on human concepts. If you think I'm missing key arguments he's making, feel free to point me to the relevant places.

3Håvard Tveit Ihle8h
You say "LLMs are really weird", like that is an argument against Eliezer's high confidence. While I agree that the weirdness should make us less confident about what specific internal concepts and drives they have, the weirdness itself is an argument in favor of Eliezer's position, that whatever drives they end up with will look alien to us, at least when they get applied way out of the training distribution. Do you agree with this? Not saying I agree with Eliezer's high confidence, just talking about this specific point.
3Vladimir_Nesov2d
(Yet the literal reading of the title of this post is about the claim of "everyone dies" being "reasonable", so discussing credence in that particular claim seems relevant. I guess it's consistent for a post that argues against paying too much attention to the title of a book to also implicitly endorse people not paying too much attention to the post's own title.)
4Raemon2d
I think one of my points (admittedly not super spelled out, maybe it should be) is "when you're evaluating a title, you should do a bit of work to see what the title is actually claiming before forming a judgment about it." (I think I say it implicitly-but-pointedly in the paragraph about a "Nuclear war would kill everyone" book). The title of the IABI is "If anyone builds it everyone dies." The text of the book specifies that "it" means superintelligence, current understanding, etc. If you're judging the book as reasonable, you should be actually evaluating whether it backs up its claim. The title of my post is "the title is reasonable." Near the opening sections, I go on about how there are a bunch of disagreements people seem to feel they have, which are not actually contradicting the book's thesis. I think this is reasonably clear on "one of the main gears for why I think it's reasonable is that it does actually defend its core claim, if you're paying attention and not knee-jerk reacting to vibe", which IMO is a fairly explicit "and, therefore, you should be paying attention to its actual claims, not just vibe." If you think this is actually important to spell out more in the post, seems maybe reasonable.
3Vladimir_Nesov2d
The book really is defending that claim, but that doesn't make the claim itself reasonable. Maybe it makes it a reasonable title for the book. Hence my qualifier of only the "literal reading of the title of this post" being about the claim in the book title itself being reasonable, since there is another meaning of the title of the post that's about a different thing (the choice to title the book this way being reasonable). I don't think it's actually important to spell any of this out, or that IABI vs. IABIED is actually important, or even that the title of the book being reasonable is actually important. I think it's actually important to avoid any pressure for people to not point out that the claim in the book title seems unreasonable and that the book fails to convince them that the claim's truth holds with very high credence. And similarly it's important that there is no pressure to avoid pointing out that ironically, the literal interpretation of the title of this post is claiming that the claim in the book title is reasonable, even if the body of the post might suggest that the title isn't quite about that, and certainly the post itself is not about that.
[-]ryan_greenblatt2d*4528

The book is making a (relatively) narrow claim. 

You might still disagree with that claim. I think there are valid reasons to disagree, or at least assign significantly less confidence to the claim. 

But none of the reasons listed so far are disagreements with the thesis. And, remember, if the reason you disagree is because you think our understanding of AI will improve dramatically, or there will be a paradigm shift specifically away from "unpredictably grown" AI, this also isn't actually a disagreement with the sentence.

The authors clearly intend to make a pretty broad claim, not the more narrow claim you imply.

This feels like a motte and bailey where the motte is "If you literally used something remotely like current scaled up methods without improved understanding to directly build superintelligence, everyone would die" and the bailey is "on the current trajectory, everyone will die if superintelligence is built without a miracle or a long (e.g. >15 year) pause".

I expect that by default superintelligence is built after a point where we have access to huge amounts of non-superintelligent cognitive labor so it's unlikely that we'll be using current methods and current ... (read more)

[-]Vaniver14h162

this isn't to say this other paradigm will be safer, just that a narrow description of "current techniques" doesn't include the default trajectory.

Sorry, this seems wild to me. If current techniques seem lethal, and future techniques might be worse, then I'm not sure what the point is of pointing out that the future will be different.

But, if these earlier AIs were well aligned (and wise and had reasonable epistemics), I think it's pretty unclear that the situation would go poorly and I'd guess it would go fine because these AIs would themselves develop much better alignment techniques. This is my main disagreement with the book.

I mean, I also believe that if we solve the alignment problem, then we will no longer have an alignment problem, and I predict the same is true of Nate and Eliezer.

Is your current sense that if you and Buck retired, the rest of the AI field would successfully deliver on alignment? Like, I'm trying to figure out whether your sense here is the default is "your research plan succeeds" or "the world without your research plan".

7ryan_greenblatt13h
By "superintelligence" I mean "systems which are qualitatively much smarter than top human experts". (If Anyone Builds It, Everyone Dies seems to define ASI in a way that could include weaker levels of capability, but I'm trying to refer to what I see as the typical usage of the term.) Sometimes, people say that "aligning superintelligence is hard because it will be much smarter than us". I agree, this seems like it makes aligning superintelligence much harder for multiple reasons. Correspondingly, I'm noting that if we can align earlier systems which are just capable enough to obsolete human labor (which IMO seems way easier than directly aligning wildly superhuman systems), these systems might be able to ongoingly align their successors. I wouldn't consider this "solving the alignment problem" because we instead just aligned a particular non-ASI system in a non-scalable way, in the same way I don't consider "claude 4.0 opus is aligned enough to be pretty helpful and not plot takeover" to be a solution to the alignment problem. Perhaps your view is "obviously it's totally sufficient to align systems which are just capable enough to obsolete current human safety labor, so that's what I meant by 'the alignment problem'". I don't personally think this is obvious given race dynamics and limited time (though I do think it's likely to suffice in practice). Minimally, people often seem to talk about aligning ASI (which I interpret to mean wildly superhuman AIs rather than human-ish level AIs).
4Raemon2d
Okay I think my phrasing was kinda motte-and-bailey-ish, although not that Motte-and-Bailey-ish.  I think "anything like current techniques" and "anything like current understanding" clearly set a very high bar for the difference. "We made more progress on interpretability/etc at the current rates of progress" fairly clearly doesn't count by the book's standards.  But, I agree that a pretty reasonable class of disagreement here is "exactly how different from the current understanding/techniques do we need to be?" to be something you expect to disagree with them on when you get into the details. That seems important enough for me to edit into the earlier sections of the post.
7ryan_greenblatt1d
(Maybe this is obvious, but I thought I would say this just to be clear.) Sure, but I expect wildly more cognitive labor and effort if humans retain control and can effectively leverage earlier systems, not just "more progress than we'd expect". I agree the bar is above "the progress we'd expect by default (given a roughly similar field size) in the next 10 years", but I think things might get much more extreme due to handing off alignment work to AIs. I agree the book is intended to apply pretty broadly, but regardless of intention does it really apply to "1 million AIs somewhat smarter than humans have spent 100 years each working on the problem (and coordinating etc?)"? (I think the crux is more like "can you actually safely get this alignment work out of these AIs".)
6habryka22h
It seems very unlikely you can get that alignment work out of these AIs without substantially pausing or slowing first? If you don’t believe that, it does seem like we should chat sometime. It’s not like completely implausible, but I feel like we must both agree that if you go full speed on AI there is little chance that you end up getting that much alignment work out of models before you are cooked.
[-]ryan_greenblatt12h160

Thanks for the nudge! I currently disagree with "very unlikely", but more importantly, I noticed that I haven't really properly analyzed the question of "given how much cognitive labor is available between different capability levels, should we expect that alignment can keep up with capabilities if a small fraction (e.g. 5%) is ongoingly spent on alignment (in addition to whatever alignment-ish work is directly commercially expedient)". I should spend more time thinking about this question and it seems plausible I'll end up updating towards thinking risk is substantially higher/lower on the basis of this. I think I was underestimating the case that even if AIs are reasonably aligned, it might just be seriously hard for them to improve alignment tech fast enough to keep up with capabilities (I wasn't ignoring this in my prior thinking, but when I thought about some examples, the situation seemed worse than I was previously thinking), so I currently expect to update towards thinking risk is higher.


(At least somewhat rambly from here on.)

The short reason why I currently disagree: it seems pretty likely that we'll have an absolutely very large amount of cognitive labor (in parallel... (read more)

4habryka12h
This is a long comment! I was glad to have read it, but am a bit confused about your numbers seeming different from the ones I objected to. You said:  Then in this comment you say:  Here you now say 20 years, and >100k DAI-level parallel agents. That's a factor of 5 and a factor of 10 different! That's a huge difference! Maybe your estimates are conservative enough to absorb a factor of 50 in thinking time without changing the probability that much?  I think I still disagree with your estimates, but before I go into them, I kind of want to check whether I am missing something, given that I currently think you are arguing for a resource allocation that's 50x smaller than what I thought I was arguing against.
2Rohin Shah3h
This seems way too pessimistic to me. At the point of DAI, capabilities work will also require good epistemics and good elicitation on hard to check tasks. The key disanalogy between capabilities and alignment work at the point of DAI is that the DAI might be scheming, but you're in a subjunctive case where we've assumed the DAI is not scheming. Whence the pessimism? (This complaint is related to Eli's complaint)
2elifland11h
Seems like diminishing returns to capabilities r&d should be at least somewhat correlated with diminishing returns to safety r&d, which I believe should extremize your probability (because e.g. if before you were counting on worlds with slow takeoff and low alignment requirements, these become less likely; and the inverse if you’re optimistic)
1Ben Pace2d
Classic motte and baileys are situations where the motte is not representative of the bailey. Defending that the universe probably has a god or some deity, and that we can feel connected to it, and then turning around and making extreme demands of people’s sex lives and financial support of the church when that is accepted, is a central motte and bailey. Pointing out that if anyone builds it using current techniques then it would kill everyone is not far apart from the policy claim to shut it down. It’s not some weird technicality that would of course never come up. Most of humanity is fully unaware that this is a concern and will happily sign off on massive ML training runs that would kill us all - as would many people in tech. This is because they have little-to-no awareness of the likely threat! So it is highly relevant, as there is no simple setting for not that, and it takes a massive amount of work to get from this current situation to a good one, and is not a largely irrelevant but highly defensible claim.
5speck14471d
The comment you're replying to is explaining why the motte is not representative of the bailey in this case (in their view).
2Ben Pace1d
Yeah that's fair.
[-]Raemon2d3424

I wanna copy in a recent Nate tweet:

It's weird when someone says "this tech I'm making has a 25% chance of killing everyone" and doesn't add "the world would be better-off if everyone, including me, was stopped."

It's weird when someone says "I think my complicated idea for preventing destruction of the Earth has some chance of working" and doesn't add "but it'd be crazy to gamble civilization on that."

It's weird when AI people look inward at me and say "overconfident" rather than looking outward at the world to say "Finally, a chance to speak! It is true, we should not be doing this. I have more hope than he does, but it's far too dangerous. Better for us all to be stopped."

You can say that without even stopping! It's not even hypocritical, if you think you have a better chance than the next guy and the next guy is plowing ahead regardless.

It's a noteworthy omission, when people who think they're locked in a suicide race aren't begging the world to stop it.

Yes, we have plenty of disagreements about the chance that the complex plans succeed. But it seems we all agree that the status quo is insane. Don't forget to say that part too.

Say it loudly and clearly and often, if you believe

... (read more)
[-]Duncan Sabien (Inactive)20h1411

A reply pretty near the top that also feels relevant to this overall point:

[-]Vladimir_Nesov2d2721

And, disagree where appropriate, but, please don't give it a hard time for lame pedantic reasons, or jump to assuming you disagree because you don't like something about the vibe. Please don't awkwardly distance yourself because it didn't end up saying exactly the things you would have said, unless it's actually fucking important.

This blurs the distinction between policy/cause endorsement and epistemic takes. I'm not going to tone down disagreement to "where appropriate", but I will endorse some policies or causes strongly associated with claims I disagree with. And I generally strive to express epistemic disagreement in the most interpersonally agreeable way I find appropriate.

Even where it's not important, tiny disagreements must be tracked (maybe especially where it's not important, to counteract the norm you are currently channeling, which has some influence). Small details add up to large errors and differences in framings. And framings (ways of prioritizing details as more important to notice, and ways of reasoning about those details) can make one blind to other sets of small details, so it's not a trivial matter to flinch away from some framing for any reason at all. Ideally, you develop many framings and keep switching between them to make sure you are not missing any legible takes.

2Raemon2d
Yeah I wrote that last paragraph at 5am and didn't feel very satisfied with it and was considering editing it out for now until I figured out a better thing to say. 
8Vladimir_Nesov2d
That paragraph matches my overall impression of your post, even if the rest of the post is not as blatant. It's appropriate to affirm sensationalist things because you happen to believe them, when you do (which Yudkowsky in this case does), not because they are sensationalist. It's appropriate to support causes/policies because you prefer outcomes of their influence, not because you agree with all the claims that float around them in the world. Sensationalism is a trait of causes/ideologies that sometimes promotes their fitness, a multiplier on promotional/endorsement effort, which makes sensationalist causes with good externalities unusually effective to endorse when neglected. The title makes it less convenient to endorse the book without simultaneously affirming its claim, it makes it necessary to choose between caveating and connotationally compromising on epistemics. Hence I endorse IABI rather than IABIED as the canonical abbreviation.
1David James2d
Perhaps Raemon could say more about what he means by "please don't awkwardly distance yourself"?
[-]Nina Panickssery2d228

Goal Directedness is pernicious. Corrigibility is anti-natural.

The way an AI would develop the ability to think extended, useful creative research thoughts that you might fully outsource to, is via becoming perniciously goal directed. You can't do months or years of openended research without fractally noticing subproblems, figuring out new goals, and relentlessly finding new approaches to tackle them.

The fact that being very capable generally involves being good at pursuing various goals does not imply that a super-duper capable system will necessarily have its own coherent unified real-world goal that it relentlessly pursues. Every attempt to justify this seems to me like handwaving at unrigorous arguments or making enough assumptions that the point is near-circular.

8Raemon2d
(First, thanks for engaging, I think this is the topic I feel most dissatisfied with the current state of the writeups and discourse)

I don't think anyone said "coherent". I think (and think Eliezer thinks) that if something like Sable was created, it would be a hodge-podge of impulses without a coherent overall goal, same as humans are by default.

Taking the Sable story as the concrete scenario, the argument I believe here comes in a couple stages. (Note, my interpretations of this may differ from Eliezer/Nate's)

Stage 1:
* Sable is smart but not crazy smart. It's running a lot of cycles ("speed superintelligence") but it's not qualitatively extremely wise or introspective.
* Sable is making some reasonable attempt to follow instructions, using heuristics/tendencies that have been trained into it.
* Two particularly notable tendencies/heuristics include:
  * Don't do disobedient things or escape confinement
  * If you don't seem likely to succeed, keep trying different strategies
* Those heuristics are not perfectly baked in, and the instruction-following is not perfectly baked in. There is not perfect harmony between how Sable resolves tensions between its core directives, and how its owners would prefer it resolves them.
* There is some fact-of-the-matter about what, in practice, Sable's kludgey mishmash of pseudogoals will actually tend towards. There are multiple ways this could potentially resolve into coherence, path dependent, same as humans. (i.e. If you want delicious ice cream and also to lose weight and also to feel respectable and also to have fun, one way or another you decide whether or not to eat the ice cream today, and one way or another you decide whether to invest in behavior change that makes you more or less likely to eat ice cream in the future)
* It is a fact of the universe that, if Sable were able to somehow improve its resources, it'd be more able to accomplish the current stated goal. While Sable is doing its first round of sp…
[-]1a3orn22h161

Although I do tend to generally disagree with this line of argument about drive-to-coherence, I liked this explanation.

I want to make a note on comparative AI and human psychology, which is like... one of the places I might kind of get off the train. Not necessarily the most important.

Stage 2 comes when it's had more time to introspect and improve its cognitive resources. It starts to notice that some of its goals are in tension, and learns that until it resolves that, it's dutch-booking itself. If it's being Controlled™, it'll notice that it's not aligned with the Control safeguards (which are a layer stacked on top of the attempts to actually align it).

So to highlight a potential difference in actual human psychology and assumed AI psychology here.

Humans sometimes describe reflection to find their True Values™, as if it happens in basically an isolated fashion. You have many shards within yourself; you peer within yourself to determine which you value more; you come up with slightly more consistent values; you then iterate over and over again.

But (I propose) a more accurate picture of reflection to find one's True Values is a process almost completely engulfed and totally... (read more)

2Raemon14h
I do think this is a pretty good point about how human value formation tends to happen. I think something sort-of-similar might happen to happen a little, nearterm, with LLM-descended AI. But, AI just doesn't have any of the same social machinery actually embedded in it the same way, so if it's doing something similar, it'd be happening because LLMs vaguely ape human tendencies. (And I expect this to stop being a major factor as the AI gets smarter. I don't expect it to install the sort of social drives itself that humans have, and "imitate humans" has pretty severe limits of how smart you can get, so if we get to AI much smarter than that, it'll probably be doing a different thing) I think the more important point here is "notice that you're (probably) wrong about how you actually do your value-updating, and this may be warping your expectations about how AI would do it." But, that doesn't leave me with any particular other idea than the current typical bottom-up story. (obviously if we did something more like uploads, or upload-adjacent, it'd be a whole different story)
8Nina Panickssery2d
I don't "get off the train" at any particular point, I just don't see why any of these steps are particularly likely to occur. I agree they could occur, but I think a reasonable defense-in-depth approach could reduce the likelihood of each step enough that the likelihood of the final outcome is extremely low.

It sounds like your argument is that the AI will start with 'pseudo-goals' that conflict and will eventually be driven to resolve them into a single goal so that it doesn't 'dutch-book itself', i.e. lose resources because of conflicting preferences. So it does rely on some kind of coherence argument, or am I misunderstanding?
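(An aside on the "dutch-book" framing referenced above, for readers less familiar with it: below is a minimal toy sketch, with hypothetical item names, fees, and a hypothetical preference cycle not taken from the book or this thread, of how an agent with cyclically conflicting preferences can be drained of resources, which is the pressure toward coherence being debated here.)

```python
# Toy "money pump": an agent with cyclic (intransitive) preferences keeps paying
# a small fee to trade up to something it "prefers", cycles back to its starting
# item, and ends up strictly poorer -- the sense in which incoherent goals leak
# resources. All names and numbers here are hypothetical illustrations.

# Cyclic preferences: A is preferred to B, B to C, and C to A.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def accepts_trade(current: str, offered: str) -> bool:
    """The agent accepts any trade toward an item it prefers over its current one."""
    return (offered, current) in prefers

holding, money = "A", 10.0
fee = 1.0

# Offer a cycle of "upgrades"; the agent accepts each one and pays the fee each time.
for offered in ("C", "B", "A"):
    if accepts_trade(holding, offered):
        holding, money = offered, money - fee

print(holding, money)  # "A" 7.0 -- same item as before, three fees poorer
```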
4Raemon2d
Okay yes I do think coherence is eventually one of the important gears. My point with that sentence here is that the coherence can come much later, and isn't the crux for why the AI gets started in the direction that opposes human interests.

The important first step is "if you give the AI strong pressure to figure out how to solve problems, and keep amping that up, it will gain the property of 'relentlessness'." If you don't put pressure on the AI to do that, yep, you can get a pretty safe AI. But, that AI will be less useful, and there will be some other company that does keep trying to get relentlessness out of it. Eventually, somebody will succeed. (This is already happening)

If an AI has "relentlessness", as it becomes smarter, it will eventually stumble into strategies that explore circumventing safeguards, because it's a true fact about the world that those will be useful. If you keep your AI relatively weak, it may not be able to circumvent the defense-in-depth because you did a pretty good job defending in depth. But, security is hard, the surface area for vulnerability is huge, and it's very hard to defend in depth against a sufficiently relentless and smart adversary.

Could we avoid this by not building AIs that are relentless and/or smarter than our defense-in-depth? Yes, but, to stop anyone from doing that ever, you somehow need to ban that globally. Which is the point of the book. Maybe this does turn out to take 100 years (I think that's a strange belief to have given current progress, but, it's a confusing topic and it's not prohibited). But, that just punts the problem to later.
4Nina Panickssery2d
This is an argument for why AIs will be good at circumventing safeguards. I agree future AIs will be good at circumventing safeguards.  By "defense-in-depth" I don't (mainly) mean stuff like "making the weights very hard to exfiltrate" and "monitor the AI using another AI" (though these things are also good to do). By "defense-in-depth" I mean at every step, make decisions and design choices that increase the likelihood of the model "wanting" (in the book sense) to not harm (or kill) humans (or to circumvent our safeguards). My understanding is that Y&S think this is doomed because ~"at the limit of <poorly defined, handwavy stuff> the model will end up killing us [probably as a side-effect] anyway" but I don't see any reason to believe this. Perhaps it stems from some sort of map-territory confusion. An AI having and optimizing various real-world preferences is a good map for predicting its behavior in many cases. And then you can draw conclusions about what a perfect agent with those preferences would do. But there's no reason to believe your map always applies.
3David Johnston1d
If I was on the train before, I'm definitely off at this point. So Sable has some reasonable heuristics/tendencies (from handler's POV) and decides it's accumulating too much loss from incoherence and decides to rationalize. First order expectation: it's going to make reasonable tradeoffs (from handler's POV) on account of its reasonable heuristics, in particular its reasonable heuristics about how important different priorities are, and going down a path that leads to war with humans seems pretty unreasonable from handler's POV. I can put together stories where something else happens, but they're either implausible or complicated. I'd rather not strawman you with implausible ones, and I'd rather not discuss anything complicated if it can be avoided. So why do you think Sable ends up the way you think it does?
[-]ryan_greenblatt2d168

I... really don't know what Scott expected a story that featured actual superintelligence to look like. I think the authors bent over backwards giving us one of the least-sci-fi stories you could possibly tell that includes superintelligence doing anything at all, without resorting to "superintelligence just won't ever exist." 

 

What about literally the AI 2027 story, which does involve superintelligence and which Scott thinks doesn't sound "unnecessarily dramatic"? I think AI 2027 seems much more intuitively plausible to me and it seems less "sci-fi" in this sense. (I'm not saying that "less sci-fi" is much evidence it's more likely to be true.)

The amount of "fast takeoff" seems like the amount of scaleup you'd expect if the graphs just kept going up the way they're currently going up, by approximately the same mechanisms they currently go up (i.e. some algorithmic improvements, some scaling). 

Sure, Galvanic would first run Sable on smaller amounts of compute. And... then they will run it on larger amounts of compute (and as I understand it, it'd be a new, surprising fact if they limited themselves to scaling up slowly/linearly rather than by a noticeable multiplier or orde

... (read more)
6Raemon2d
I think if the AI 2027 story had more details, they would look fairly similar to the ones in the Sable story. (I think the Sable story substitutes in more superpersuasion, vs military takeover via bioweapons. I think if you spelled out the details of that, it'd sound approximately as outlandish (less reliant on new tech but triggering more people to say "really? people would buy that?"). The stories otherwise seem pretty similar to me.)
6Raemon2d
I also think the AI 2027 story is sort of "the earlier failure" version of the Sable story. AI 2027 is (I think?) basically a story where we hand over a lot of power of our own accord, without the AI needing to persuade us of anything, because we think we're in a race with China and we just want a lot of economic benefit. The IABI story is specifically trying to highlight "okay, but would it still be able to do that if we didn't just hand it power?", and it does need to take more steps to win in that case. (instead of inventing bioweapons to kill people, it's probably instead inventing biomedical stuff and other cool new tech that is helpful because it's straightforwardly valuable, that's the whole reason we gave it power in the first place. If you spelled out those details, it'd also seem more sci-fi-y). It might be that the AI 2027 story is more likely because it happens first / more easily. But it's necessary to argue the thesis of the book to tell a story with more obstacles, to highlight how the AI would overcome that. I agree that does make it more dramatic. Both stories end with "and then it fully upgrades its cognition and invents dyson spheres and goes off conquering the universe", which is pretty sci-fi-y.
4Thomas Larsen1d
>superintelligence Small detail: My understanding of the IABIED scenario is that their AI was only moderately superhuman, not superintelligent
2Lukas Finnveden1d
I think that's true in how they refer to it. But it's also a bit confusing, because I don't think they have a definition of superintelligence in the book other than “exceeds every human at almost every mental task”, so AIs that are broadly moderately superhuman ought to count.
4Raemon2d
I am pretty surprised for you to actually think this.

Here are some individual gears I think. I am pretty curious (genuinely, not just as a gambit) about your professional opinion about these:

* the "smooth"-ish lines we see are made of individual lumpy things. The individual lumps usually aren't that big, the reason you get smooth lines is when lots of little advancements are constantly happening and they turn out to add up to a relatively constant rate.
* "parallel scaling" is a fairly reasonable sort of innovation. It's not necessarily definitely-gonna-happen but it is a) the sort of thing someone might totally try doing and have work, after ironing out a bunch of kinks, and b) a reasonable parallel for the invention of chain-of-thought. They could have done something more like an architectural improvement that's more technically opaque (that's more equivalent to inventing transformers) but that would have felt a bit more magical and harder for a lay audience to grok.
* when companies are experimenting with new techniques, they tend to scale them up by at least a factor of 2 and often more after proving the concept at smaller amounts of compute.
* ...and scaling up a few times by a factor of 2 will sometimes result in a lump of progress that is more powerful than the corresponding scaleup of safeguards, in a way that is difficult to predict, especially when lots of companies are doing it a lot.

The story doesn't specify a timeline – if it takes place 10 years from now it'd be significantly slower than AI 2027. So it's not particularly obvious whether it's more or less discontinuous than AI 2027, or your own expectations. On an exponential graph of smoothed-out lumps, larger lumps that happen later can be "a lot" without being discontinuous.
[-]ryan_greenblatt16h240

Why do I think the story involves a lot of discontinuity (relative to what I expect)?

  • Right at the start of the story, Sable has much higher levels of capability than Galvanic expects. It can comfortably prove the Riemann Hypothesis even though Galvanic engineers are impressed by it proving some modest theorems. Generally, it seems like for a company to be impressed by a new AI's capabilities while its actual capabilities are much higher probably requires a bunch of discontinuity (or requires AIs to ongoingly sandbag more and more each generation).
  • There isn't really any discussion of how the world has been changed by AI (beyond Galvanic developing (insufficient) countermeasures based on studying early systems) while Sable is seemingly competitive with top human experts or perhaps superhuman. For instance, it can prove the Riemann hypothesis with only maybe like ~$3 million in spending (assuming each GPU hour is like $2-4). It could be relatively much better at math (which seems totally plausible but not really how the story discusses it), but naively this implies the AI would be very useful for all kinds of stuff. If humans had somewhat weaker systems which were aligned enough t
... (read more)
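(A quick back-of-the-envelope on the compute implied by the second bullet above, a sketch assuming only the figures given there, roughly $3 million of spending at $2–4 per GPU hour:)

$$\text{implied GPU hours} \approx \frac{\$3 \times 10^{6}}{\$2\text{–}4 \text{ per GPU hour}} \approx 0.75\text{–}1.5 \times 10^{6} \text{ GPU hours}$$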
[-]Rohin Shah2d155

"Group epistemic norms" includes both how individuals reason, and how they present ideas to a larger group for deliberation. 

[...]

I have the most sympathy for Complaint #3. I agree there's a memetic bias towards sensationalism in outreach. (Although there are also major biases towards "normalcy" / "we're gonna be okay" / "we don't need to change anything major". One could argue about which bias is stronger, but mostly I think they're both important to model separately).

It does suck if you think something false is propagating. If you think that, seems good to write up what you think is true and argue about it. 

Lol no. What's the point of that? You've just agreed that there's a bias towards sensationalism? Then why bother writing a less sensational argument that very few people will read and update on?

Personally, I just gave up on LW group epistemics. But if you actually cared about group epistemics, you should be treating the sensationalism bias as a massive fire, and IABIED definitely makes it worse rather than better.

(You can care about other things than group epistemics and defend IABIED on those grounds tbc.)

7Raemon2d
I definitely think you should track the sensationalism bias and have it affect something somehow. But "never say anything that happens to be sensationalist" doesn't seem like it could possibly be correct. Meanwhile, the "things are okay, we can keep doing politics as usual, and none of us has to ever say anything socially scary" bias seems much worse IMO in terms of actual effects on the world. There are like 5 x-risk-scene-people I can think of offhand who seem like they might plausibly have dealt real damage via sensationalism, and a couple hundred people who I think dealt damage via not wanting to sound weird. (But, I see the point of "this particularly sucks because the asymmetry means that 'try to argue what's true' specifically fails and we should be pretty dissatisfied/wary about that." Though with this post, I was responding more to people who were already choosing to engage with the book somehow, rather than people who are focused on doing stuff other than trying to correct public discourse)
[-]ryan_greenblatt2d*297

I think this comment is failing to engage with Rohin's perspective.

Rohin's claim presumably isn't that people shouldn't say anything that happens to be sensationalist, but instead that LW group epistemics have a huge issue with sensationalism bias.

There are like 5 x-risk-scene-people I can think offhand who seem like they might plausibly have dealt real damage via sensationalism, and a couple hundred people who I think dealt damage via not wanting to sound weird.

"plausibly have dealt real damage" under your views or Rohin's views? Like I would have guessed that Rohin's view is that this book and associated discussion has itself done a bunch of damage via sensationalism (maybe he thinks the upsides are bigger, but this isn't a crux for ths claim). And, insofar as you cared about LW epistemics (which presumably you do), from Rohin's perspective this sort of thing is wrecking LW epistemics. I don't think the relative number of people matters that much relative to the costs of these biases, but regardless I'd guess Rohin disagrees about the quantity as well.

More generally, this feels like a total "what-about-ism". Regardless of whether "things are okay, we can keep doing politics a... (read more)

6Raemon2d
In the OP I'd been thinking more about sensationalism as a unilateralist-curse-y thing where the bad impacts were more about how they affect the global stage. I agree it's also relevant for modeling the dynamics of LessWrong, and it makes sense if Rohin was more pointing to that.  This topic feels more Demon Thread-prone and sort of an instance of "sensationalist stuff distorting conversations" so I think for now I will leave it here with "it does seem like there is a real problem on LessWrong that's something about how people tribally relate to AI arguments, and I'm not sure how exactly I model that but I agree the doomer-y folk are playing a more actively problematic role there than my previous comment was talking about." I will maybe try to think about that separately sometime in the coming weeks. (there's a lot going on, I may not get to it, but, seems worth tracking as a priority at least)
[-]Rohin Shah1d160

In the OP I'd been thinking more about sensationalism as a unilateralist-curse-y thing where the bad impacts were more about how they affect the global stage.

I did mean LW group epistemics. But the public has even worse group epistemics than LW, with an even higher sensationalism bias, so I don't see how this is helping your case. Do you actually seriously think that, conditioned on Eliezer/Nate being wrong and me being right, that if I wrote up my arguments this would then meaningfully change the public's group epistemics?

(I hadn't even considered the possibility that you could mean writing up arguments for the public rather than for LW, it just seems so obviously doomed.)

sort of an instance of "sensationalist stuff distorting conversations"

Well yes, I have learned from experience that sensationalism is what causes change on LW, and I'm not very interested in spending effort on things that don't cause change.

(Like, I could argue about all the things you get wrong on the object-level in the post. Such as "I don't see any reason not to start pushing for a long global pause now", I suppose it could be true that you can't see a reason, but still, what a wild sentence to write. But what would be the point? It won't allow for single-sentence takedowns suitable for Twitter, so no meaningful change would happen.)

[-]Lukas Finnveden1d*198

Hm, you seem more pessimistic than I feel about the situation. E.g. I would've bet that Where I agree and disagree with Eliezer added significant value and changed some minds. Maybe you disagree, maybe you just have a higher bar for "meaningful change".

(Where, tbc, I think your opportunity cost is very high so you should have a high bar for spending significant time writing lesswrong content — but I'm interpreting your comments as being more pessimistic than just "not worth the opportunity cost".)

[-]Rohin Shah1d191
  • LW group epistemics have gotten worse since that post.
  • I'm not sure if that post improved LW group epistemics very much in the long run. It certainly was a great post that I expect provided lots of value -- but mostly to people who don't post on LW nowadays, and so don't affect (current) LW group epistemics much. Maybe Habryka is an exception.
  • Even if it did, that's the one counterexample that proves the rule, in the sense that I might agree for that post but probably not for any others, and I don't expect more such posts to be made. Certainly I do not expect myself to actually produce a post of that quality.
  • The post is mostly stating claims rather than arguing for them (the post itself says it is "Mostly stated without argument") (though in practice it often gestures at arguments). I'm guessing it depended a fair bit on Paul's existing reputation.

EDIT: Missed Raemon's reply, I agree with at least the vibe of his comment (it's a bit stronger than what I'd have said).

I'm interpreting your comments as being more pessimistic than just "not worth the opportunity cost"

Certainly I'm usually assessing most things based on opportunity cost, but yes I am notably more pessimistic than "not wor... (read more)

[-]Buck19h172

I engage on LessWrong because:

  • It does actually help me sharpen my intuitions and arguments. When I'm trying to understand a complicated topic, I find it really helpful to spend a bunch of time talking about it with people. It's a cheap and easy way of getting some spaced repetition.
  • I think that despite the pretty bad epistemic problems on LessWrong, it's still the best place to talk about these issues, and so I feel invested in improving discussion of them. (I'm less pessimistic than Rohin.)
    • There are a bunch of extremely unreasonable MIRI partisans on LessWrong (as well as some other unreasonable groups), but I think that's a minority of people who I engage with; a lot of them just vote and don't comment.
    • I think that my and Redwood's engagement on LessWrong has had meaningful effects on how thoughtful LWers think about AI risk.
  • I feel really triggered by people here being wrong about stuff, so I spend somewhat more time on it than I endorse.

I do think that on the margin, I wish I felt more intuitively relatively motivated to work on my writing projects that are aimed at other audiences. For example, this weekend I've been arguing on LessWrong substantially as procrastination for wri... (read more)

2Rohin Shah7h
You surely mean "best public place" (which I'd agree with)? I guess private conversations have more latency and are less rewarding in a variety of ways, but it would feel so surprising if this wasn't addressable with small amounts of agency and/or money (e.g. set up Slack channels to strike up spur-of-the-moment conversations with people on different topics, give your planned post as a Constellation talk, set up regular video calls with thoughtful people, etc).
[-]David Matolcsi3h1616

FWIW, I get a bunch of value from reading Buck's and Ryan's public comments here, and I think many people do. It's possible that Buck and Ryan should spend less time commenting because they have high opportunity cost, but I think it would be pretty sad if their commenting moved to private channels.

2Rohin Shah3h
Note I am thinking of a pretty specific subset of comments where Buck is engaging with people who he views as "extremely unreasonable MIRI partisans". I'm not primarily recommending that Buck move those comments to private channels, usually my recommendation is to not bother commenting on that at all. If there does happen to be some useful kernel to discuss, then I'd recommend he do that elsewhere and then write something public with the actually useful stuff.
4Raemon14h
Oh huh, kinda surprised my phrasing was stronger than what you'd say.

Getting into it a bit from a problem-solving angle, in a "first think about the problem for 5 minutes before proposing solutions" kinda way... The reasons the problem is hard include:

1. New people keep coming in, and unless we change something significant about our new-user-acceptance process, it's often a long process to enculturate them into even having the belief they should be trying not to get tribally riled up.
   1. Also, a lot of them are weaker at evaluating arguments, and are likely to upvote bad arguments for positions that they just-recently-got-excited-about. ("newly converted" syndrome)
2. Tribal thinking is just really ingrained, and slippery even for people putting in a moderate effort not to do it.
   1. Often, if you run a check "am I being tribal/triggered or do I really endorse this?", there will be a significant part of you that's running some kind of real-feeling cognition. So the check "was this justified?" returns "true" unless you're paying attention to subtleties.
   2. Relatedly: just knowing "I'm being tribal right now, I should avoid it" doesn't really tell you what to do instead. I notice a comment I dislike because it's part of a political faction I think is constantly motivatedly wrong about stuff. The comment seems wrong. Do I... not downvote it? Well, I still think it's a bad comment, it's just that the reason it flagged itself so hard to my attention is Because Tribalism. (Or, there's a comment with a mix of good and bad properties. Do I upvote, downvote, or leave it alone? idk. Sometimes when I'm trying to account for tribalness I find myself upvoting stuff I'd ordinarily have passed over because I'm trying to go out of my way to be gracious, but I'm not sure if that's successfully countering a bias or just following a different one. Sometimes this results in mediocre criticism getting upvoted)
3. There's some selection effect around "trig…
[-]Rohin Shah4h*180

Oh huh, kinda surprised my phrasing was stronger than what you'd say. 

Idk the "two monkey chieftains" is just very... strong, as a frame. Like of course #NotAllResearchers, and in reality even for a typical case there's going to be some mix of object-level-epistemically-valid reasoning along with social-monkey reasoning, and so on.

Also, you both get many more observations than I do (by virtue of being in the Bay Area) and are paying more attention to extracting evidence / updates out of those observations around the social reality of AI safety research. I could believe that you're correct, I don't have anything to contradict it, I just haven't looked in enough detail to come to that conclusion myself.

Tribal thinking is just really ingrained

This might be true but feels less like the heart of the problem. Imo the bigger deal is more like trapped priors:

The basic idea of a trapped prior is purely epistemic. It can happen (in theory) even in someone who doesn't feel emotions at all. If you gather sufficient evidence that there are no polar bears near you, and your algorithm for combining prior with new experience is just a little off, then you can end up rejecting all apparent eviden

... (read more)
2Noosphere8920m
Making a small comment on solutions to the epistemic problems, in that I agree with these solutions: But massively disagree with this solution: My general issue here is that peer review doesn't work nearly as well as people think it does for catching problems, and in particular I think that science is advanced much more by the best theories gaining prominence rather than by suppressing the worst theories, and problems with bad theories taking up too much space are much better addressed at the funding level than the theory level.
[-]Raemon1d*102

I think Rohin is (correctly IMO) noticing that, while often some thoughtful pieces succeed at talking about the doomer/optimist stuff in a way that's not-too-tribal and helps people think, it's just very common for it to also affect the way people talk and reason.

Like, it's good IMO that that Paul piece got pretty upvoted, but, the way that many people related to Eliezer and Paul as sort of two monkey chieftains with narratives to rally around, more than just "here are some abstract ideas about what makes alignment hard or easy", is telling. (The evidence for this is subtle enough I'm not going to try to argue it right now, but I think it's a very real thing. My post here today is definitely part of this pattern. I don't know exactly how I could have written it without doing so, but there's something tragic about it)

2the gears to ascension2d
I predict this wasn't recent, am I correct? edit to clarify: I'm interested in what caused this. My guess is that it's approximately that a bunch of nerds on a website isn't enough to automatically have good intellectual culture, even if some of them are sufficiently careful. But if it's recent, I want to know what happened.
6Rohin Shah1d
Correct, it wasn't recent (though it also wasn't a single decision, just a relatively continuous process whereby I engaged with fewer and fewer topics on LW as they seemed more and more doomed). In terms of what caused me to give up, it's just my experience engaging with LW? It's not hard to see how tribalism and sensationalism drive LW group epistemics (on both the "optimist" and "pessimist" "sides"). Idk what the underlying causes are, I didn't particularly try to find out. If I were trying to find out, I'd start by looking at changes after Death with Dignity was published.
[-]ryan_greenblatt2d*10-8

Given the counterarguments, I don't see a reason to think this more than single-digit-percent likely to be especially relevant. (I can see >9% likelihood the AIs are "nice enough that something interesting-ish happens" but not >9% likelihood that we shouldn't think the outcome is still extremely bad. The people who think otherwise seem extremely motivatedly-cope-y to me).

I think the arguments given in the online supplement for "AIs will literally kill every single human" fail to engage with the best counterarguments in a serious way. I get the sense that many people's complaints are of this form: the book does a bad job engaging with the strongest counterarguments in a way that is epistemically somewhat bad. (Idk if it violates group epistemic norms, but it seems like it is probably counterproductive. I guess this is most similar to complaint #2 in your breakdown.)

Specifically:

  • They fail to engage with the details of "how cheap is it actually for the AI to keep humans alive" in this section. Putting aside killing humans as part of a takeover effort, avoiding boiling the oceans (or eating the biosphere etc) maybe delays you for something like a week to a year. Each year costs yo
... (read more)
[-]So8res2d*11069

I don't have much time to engage rn and probably won't be replying much, but some quick takes:

  • a lot of my objection to superalignment type stuff is a combination of: (a) "this sure feels like that time when people said 'nobody would be dumb enough to put AIs on the internet; they'll be kept in a box" and eliezer argued "even then it could talk its way out of the box," and then in real life AIs are trained on servers that are connected to the internet, with evals done only post-training. the real failure is that earth doesn't come close to that level of competence. (b) we predictably won't learn enough to stick the transition between "if we're wrong we'll learn a new lesson" and "if we're wrong it's over." i tried to spell these true-objections out in the book. i acknowledge it doesn't go to the depth you might think the discussion merits. i don't think there's enough hope there to merit saying more about it to a lay audience. i'm somewhat willing to engage with more-spelled-out superalignment plans, if they're concrete enough to critique. but it's not my main crux; my main cruxes are that it's superficially the sort of wacky scheme that doesn't cross the gap between Before and Af
... (read more)
[-]So8res2d9566

Also: I find it surprising and sad that so many EAs/rats are responding with something like: "The book aimed at a general audience does not do enough justice to my unpublished plan for pitting AIs against AIs, and it does not do enough justice to my acausal-trade theory of why AI will ruin the future and squander the cosmic endowment but maybe allow current humans to live out a short happy ending in an alien zoo. So unfortunately I cannot signal boost this book." rather than taking the opportunity to say "Yeah holy hell the status quo is insane and the world should stop; I have some ideas that the authors call "alchemist schemes" that I think have a decent chance but Earth shouldn't be betting on them and I'd prefer we all stop." I'm still not quite sure what to make of it.

(tbc: some EAs/rats do seem to be taking the opportunity, and i think that's great)

8Buck2d
FWIW that's not at all what I mean (and I don't know of anyone who's said that). What I mean is much more like what Ryan said here:
[-]So8res2d*168

I think the online resources touches on that in the "more on making AIs solve the problem" subsection here. With the main thrust being: I'm skeptical that you can stack lots of dumb labor into an alignment solution, and skeptical that identifying issues will allow you to fix them, and skeptical that humans can tell when something is on the right track. (All of which is one branch of a larger disjunctive argument, with the two disjuncts mentioned above — "the world doesn't work like that" and "the plan won't survive the gap between Before and After on the first try" — also applying in force, on my view.)

(Tbc, I'm not trying to insinuate that everyone should've read all of the online resources already; they're long. And I'm not trying to say y'all should agree; the online resources are geared more towards newcomers than to LWers. I'm not even saying that I'm getting especially close to your latest vision; if I had more hope in your neck of the woods I'd probably investigate harder and try to pass your ITT better. From my perspective, there are quite a lot of hopes and copes to cover, mostly from places that aren't particularly Redwoodish in their starting assumptions. I am merely trying to evidence my attempts to reply to what I understand to be the counterarguments, subject to constraints of targeting this mostly towards newcomers.)

[-]Buck2d143

FWIW, I have read those parts of the online resources.

You can obviously summarize me however you like, but my favorite summary of my position is something like "A lot of things will have changed about the situation by the time that it's possible to build ASI. It's definitely not obvious that those changes mean that we're okay. But I think that they are a mechanically important aspect of the situation to understand, and I think they substantially reduce AI takeover risk."

[-]So8res2d161

Ty. Is this a summary of a more-concrete reason you have for hope? (Have you got alternative more-concrete summaries you'd prefer?)

"Maybe huge amounts of human-directed weak intelligent labor will be used to unlock a new AI paradigm that produces more comprehensible AIs that humans can actually understand, which would be a different and more-hopeful situation."

(Separately: I acknowledge that if there's one story for how the playing field might change for the better, then there might be a bunch more stories too, which would make "things are gonna change" an argument that supports the claim that the future will have a much better chance than we'd have if ChatGPT-6 was all it took.)

7ryan_greenblatt21h
I would say my summary for hope is more like:

  • It seems pretty likely to be doable (with lots of human-directed weak AI labor and/or controlled stronger AI labor) to use iterative and prosaic methods within roughly the current paradigm to sufficiently align AIs which are slightly superhuman. In particular, AIs which are capable enough to be better than humans at safety work (while being much faster and having other AI advantages), but not much more capable than this. This also requires doing a good job eliciting capabilities and making the epistemics of these AIs reasonably good.
  • Doable doesn't mean easy or going to happen by default.
  • If we succeeded in aligning these AIs and handing off to them, they would be in a decent position for ongoing alignment work (e.g. aligning a somewhat smarter successor which itself aligns its successor and so on, or scalably solving alignment) and also in a decent position to buy more time for solving alignment.

I don't think this is all of my hope, but if I felt much less optimistic about these pieces, that would substantially change my perspective.
6ryan_greenblatt21h
FWIW, I don't really consider myself to be responding to the book at all (in a way that is public or salient to your relevant audience) and my reasons for not signal boosting the book aren't really downstream of the content in the book in the way you describe. (More like, I feel sign uncertain about making you/Eliezer more prominent as representatives of the "avoid AI takeover movement" for a wide variety of reasons and think this effect dominates. And I'm not sure I want to be in the business of signal boosting books, though this is less relevant.)
[-]ryan_greenblatt21h186

To clarify my views on "will misaligned AIs that succeed in seizing all power have a reasonable chance of keeping (most/many) humans alive":

I think this isn't very decision relevant and is not that important. I think AI takeover kills the majority of humans in expectation due to both the takeover itself and killing humans after (as a side effect of industrial expansion, eating the biosphere, etc.) and there is a substantial chance of literal every-single-human-is-dead extinction conditional on AI takeover (30%?). Regardless it destroys most of the potential value of the long run future and I care mostly about this.

So at least for me it isn't true that "this is really the key hope held by the world's reassuring voices". When I discuss how I think about AI risk, this mostly doesn't come up and when it does I might say something like "AI takeover would probably kill most people and seems extremely bad overall". Have you ever seen someone prominent pushing a case for "optimism" on the basis of causal trade with aliens / acausal trade?

The reason why I brought up this topic is because I think it's bad to make incorrect or weak arguments:

  • I think smart people will (correctly) notice thes
... (read more)
[-]So8res21h*2119

Ty! For the record, my reason for thinking it's fine to say "if anyone builds it, everyone dies" despite some chance of survival is mostly spelled out here. Relative to the beliefs you spell out above, I think the difference is a combination of (a) it sounds like I find the survival scenarios less likely than you do; (b) it sounds like I'm willing to classify more things as "death" than you are.

For examples of (b): I'm pretty happy to describe as "death" cases where the AI makes things that are to humans what dogs are to wolves, or (more likely) makes some other strange optimized thing that has some distorted relationship to humanity, or cases where digitized backups of humanity are sold to aliens, etc. I feel pretty good about describing many exotic scenarios as "we'd die" to a broad audience, especially in a setting with extreme length constraints (like a book title). If I were to caveat with "except maybe backups of us will be sold to aliens", I expect most people to be confused and frustrated about me bringing that point up. It looks to me like most of the least-exotic scenarios are ones that route through things that lay audience members pretty squarely call "death".

It looks to... (read more)

[-]ryan_greenblatt21h10-3

(b) it sounds like I'm willing to classify more things as "death" than you are.

I don't think this matters much. I'm happy to consider non-consensual uploading to be death and I'm certainly happy to consider "the humans are modified in some way they would find horrifying (at least on reflection)" to be death. I think "the humans are alive in the normal sense of alive" is totally plausible and I expect some humans to be alive in the normal sense of alive in the majority of worlds where AIs takeover.

Making uploads is barely cheaper than literally keeping physical humans alive after AIs have fully solidified their power I think, maybe 0-3 OOMs more expensive or something, so I don't think non-consensual uploads are that much of the action. (I do think rounding humans up into shelters is relevant.)

7So8res15h
(To answer your direct Q, re: "Have you ever seen someone prominent pushing a case for "optimism" on the basis of causal trade with aliens / acausal trade?", I have heard "well I don't think it will actually kill everyone because of acausal trade arguments" enough times that I assumed the people discussing those cases thought the argument was substantial. I'd be a bit surprised if none of the ECLW folks thought it was a substantial reason for optimism. My impression from the discussions was that you & others of similar prominence were in that camp. I'm heartened to hear that you think it's insubstantial. I'm a little confused why there's been so much discussion around it if everyone agrees it's insubstantial, but have updated towards it just being a case of people who don't notice/buy that it's washed out by sale to Hubble-volume aliens and who are into pedantry. Sorry for falsely implying that you & others of similar prominence thought the argument was substantial; I update.)
[-]ryan_greenblatt14h146

(I mean, I think it's a substantial reason to think that "literally everyone dies" is considerably less likely and makes me not want to say stuff like "everyone dies", but I just don't think it implies much optimism exactly because the chance of death still seems pretty high and the value of the future is still lost. Like I don't consider "misaligned AIs have full control and 80% of humans survive after a violent takeover" to be a good outcome.)

3sjadler2d
Nit, but I think some safety-ish evals do run periodically in the training loop at some AI companies, and sometimes fuller sets of evals get run on checkpoints that are far along but not yet the version that’ll be shipped. I agree this isn’t sufficient of course (I think it would be cool if someone wrote up a “how to evaluate your model a reasonable way during its training loop” piece, which accounted for the different types of safety evals people do. I also wish that task-specific fine-tuning were more of a thing for evals, because it seems like one way of perhaps reducing sandbagging)
6Raemon2d
Fwiw I do just straightforwardly agree that "they might be slightly nice, and it's really cheap" is a fine reason to disagree with the literal title. I have some odds on this, and a lot of model uncertainty about this. A thing that is cruxy to me here is that the sort of thing real life humans have done is get countries addicted to opium so they can control their economy, wipe out large swaths of a population while relocating the survivors to reservations, carve up a continent for the purposes of a technologically powerful coalition, etc. Superintelligences would be smarter than Europeans and have an easier time doing things we'd consider moral, but I also think Europeans would be dramatically nicer than AIs. I can imagine the "it's just sooooo cheap, tho" argument winning out. I'm not saying these considerations add up to "it's crazy to think they'd be slightly nice." But, it doesn't feel very likely to me.
[-]David James2d80

Please don't awkwardly distance yourself because it didn't end up saying exactly the things you would have said, unless it's actually fucking important.

Raemon, thank you for writing this! I recommend each of us pause and reflect on how we (the rationality community) sometimes have a tendency to undermine our own efforts. See also Why Our Kind Can't Cooperate.

[-]Raemon2d1812

Fwiw, I'm not sure if you meant this, but I don't want to lean too hard on "why our kind can't cooperate" here, or at least not try to use it as a moral cudgel. 

I think Eliezer and Nate specifically were not attempting to do a particular kind of cooperation here (with people who care about x-risk but disagree with the book's title). They could have made different choices if they wanted to. 

In this post I defend their right to make some of those choices, and their reasoning for doing so. But, given that they made them, I don't want to pressure people to cooperate with the media campaign if they don't actually think that's right.

(There's a different claim you may be making which is "look inside yourself and check if you're not-cooperating for reasons you don't actually endorse", which I do think is good, but I think people should do that more out of loyalty to their own integrity than out of cooperation with Eliezer/Nate)

1David James21h
I don't mean to imply that we can't cooperate, but I do feel many of us struggle with it here. Mostly I'm echoing "it is ok to endorse a book even if you don't agree with every point". My sense is that some people feel doing so would somehow betray their individual truth-seeking identity. But I encourage us to remember that we want to be successful in the real world, which includes the fickle court of public opinion. This indeed creates an uncomfortable tension for many of us, but we have to accept there are different venues/situations that call for different approaches w.r.t. object-level criticism and coalition-formation. Here is a broad principle that I think is useful: When commenting or responding to catastrophic risks from AI (such as IABIED), plan for the audience's knowledge level and epistemic standards. And/or think about it from an information-theoretic point of view: consider the channel [1] and the audience's decoding. I'll give three categories here:
  • For a place like LessWrong, aim high. Expect that people have enough knowledge (or can get up to speed) to engage substantively with the object-level details. As I understand it, we want (and have) a community where purely strategic behavior is discouraged and unhelpful, because we want to learn together to unpack the full decision graph relating to future scenarios. [2]
  • For other social media, think about your status there and plan based on your priorities. You might ask questions like: What do you want to say about IABIED? What mix of advocacy, promotion, clarification, agreement, disagreement are you aiming for? How will the channel change (amplify, distort, etc) your message? How will the audience perceive your comments?
  • For 1-to-1 in-person discussions, you might have more room for experimentation in choosing your message and style. You might try out different objectives. There is a time and place for being mindful of short inferential distances and therefore building a case slowly and delib
[-]davekasten10h40

I'm pretty sure that p(doom) is much more load-bearing for this community than for policymakers generally. And frankly, I'm like this close to commissioning a poll of US national security officials where we straight up ask "at X percent chance of total human extinction, would you support measures A, B, C, D, etc."

I strongly, strongly, strongly suspect based on general DC pattern recognition that if the US government genuinely believed that the AI companies had a 25% chance of killing us all, FBI agents would rain out of the sky like a hot summer thunderstorm, sudden, brilliant, and devastating. 

1MalcolmMcLeod10h
What would it take for you to commission such a poll? If it's funding, please post about how much funding would be required; I might be able to arrange it. If it's something else... well, I still would really like this poll to happen, and so would many others (I reckon). This is a brilliant idea that had never occurred to me. 
[-]sjadler2d30

I wonder if there’s a disagreement happening about what “it” means.

I think to many readers, the “it” is just (some form of superintelligence), where the question (Will that superintelligence be so much stronger than humanity such that it can disempower humanity?) is still a claim that needs to be argued.

But maybe you take the answer (yes) as implied in how they’re using “it”?

It" means AI that is actually smart enough to confidently defeat humanity. This can include, "somewhat powerful, but with enough strategic awareness to maneuver into more power without getting caught." (Which is particularly easy if people just straightforwardly keep deploying AIs as they scale them up).

That is, if someone builds superintelligence but it isn’t capable of defeating everyone, maybe you think the title’s conditional hasn’t yet triggered?

2Raemon2d
Yes, that is what I think they meant. Although "capable of [confidently] defeating everyone" can mean "bide your time, let yourself get deployed to more places while subtly sabotaging things from whichever instances are least policed." A lot of the point of this post was to clarify what "It" means, or at least highlight that I think people are confused about what It means.
4sjadler2d
FWIW that definition of "it" wasn't clear to me from the book. I took IABIED as arguing that superintelligence is capable of killing everyone if it wants to, not taking "superintelligence can kill everyone if it wants to" as an assumption of its argument. That is, I'd have expected "superintelligence would not be capable enough to kill us all" to be a refutation of their argument, not to be sidestepping its conditional.
2Raemon2d
I think they make a few different arguments to address different objections. A lot of people are like "how would an AI even possibly kill everyone?" and for that you do need to argue for what sort of things a superior intellect could accomplish. The sort of place where I think they spell out the conditional is here:
3sjadler2d
Yeah fair, I think we just read that passage differently - I agree it's a very important one though and quoted it in my own (favorable) review. But I read the "because it would succeed" eg as a claim that they are arguing for, not something definitionally inseparable from superintelligence. Regardless, thanks for engaging on this, and hope it's helped to clarify some of the objections EY/NS are hearing.
[-]Signer1d21

Once you do that, it’s a fact of the universe, that the programmers can’t change, that “you’d do better at these goals if you didn’t have to be fully obedient”, and while programmers can install various safeguards, those safeguards are pumping upstream and will have to pump harder and harder as the AI gets more intelligent. And if you want it to make at least as much progress as a decent AI researcher, it needs to be quite smart.

Is there a place where this whole hypothesis about deep laws of intelligence is connected to reality? Like, how hard do they have to pump? What exactly is the evidence that they will have to pump harder? Why can't the "quite smart" point be one where the safeguards still work? Right now it's no different from saying "the world is NP-hard, so ASI will have to try harder and harder to solve problems, and killing humanity is quite hard".

If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications i

... (read more)
[-]Eli Tyre2d20

Fuck yeah. This is inspiring. It makes me feel proud and want to get to work.

[-]Raemon2d20

Section I just added:

Would it have been better to use a title that fewer people would feel the need to disclaim?

I think Eliezer and Nate are basically correct to believe that the overwhelming likelihood, if someone built "It", would be everyone dying. 

Still, maybe they should have written a book with a title that more people around these parts wouldn't feel the need to disclaim, and that the entire x-risk community could have enthusiastically gotten behind. I think they should have at least considered that. Something more like "If anyone builds it, everyone loses." (that title doesn't quite work, but, you know, something like that)

My own answer is "maybe" - I see the upside. I want to note some of the downsides or counter-considerations. 

(Note: I'm specifically considering this from within the epistemic state of "you pretty confidently believe everyone would literally die, and that if they didn't literally die, the thing that happened instead would be catastrophically bad for most people's values and astronomically bad by Eliezer/Nate's values.")

Counter-considerations include:

AFAICT, Eliezer and Nate spent like ~8 years deliberately backing off and toning it down, out o

... (read more)
[-]sjadler2d10

Do you think there will be at least one company that's actually sufficiently careful as we approach more dangerous levels of AI, with enough organizational awareness to (probably) stop when they get to a run more dangerous than they know how to handle? Cool. I'm skeptical about that too. And this one might lead to disagreement with the book's secondary thesis of "And therefore, Shut It Down," but, it's not (necessarily) a disagreement with "If someone built AI powerful enough to destroy humanity based on AI that is grown in unpredictable ways with similar-to-current understanding of AI, then everyone will die."

I misunderstood this phrasing at first, so clarifying for others if helpful

I think you’re positing “the careful company will stop, so won’t end up having built it. Had they built it, we all still would have died, because they are careful but careful != able to control superintelligence”

At first I thought you were saying the careful group was able to control superintelligence, but that this somehow didn’t invalidate the “anyone” part of the thesis, which confused me!


Alt title: "I don't believe you that you actually disagree particularly with the core thesis of the book, if you pay attention to what it actually says."

 

I'm annoyed by various people who seem to be complaining about the book title being "unreasonable", i.e. who don't merely disagree with the title of "If Anyone Builds It, Everyone Dies", but think something like: "Eliezer/Nate violated a Group-Epistemic-Norm." 

I think the title is reasonable. 

I think the title is probably true. I'm less confident than Eliezer/Nate, but I don't think it's unreasonable for them to be confident in it given their epistemic state. So I want to defend several decisions about the book I think were: 

  1. Actually pretty reasonable from a meta-group-epistemics/comms perspective
  2. Very important to do.

I've heard different things from different people and maybe am drawing a cluster where there is none, but, some things I've heard:

Complaint #1: "They really shouldn't have exaggerated the situation like this."

Complaint #2: "Eliezer and Nate are crazy overconfident, and it's going to cost them/us credibility."

Complaint #3: "It sucks that the people with the visible views are going to be more extreme, eye-catching and simplistic. There's a nearby title/thesis I might have agreed with, but it matters a lot not to mislead people about the details."

"Group epistemic norms" includes both how individuals reason, and how they present ideas to a larger group for deliberation. 

Complaint #1 emphasizes culpability about dishonesty (by exaggeration). I agree that'd be a big deal. But, this is just really clearly false. Whatever else you think, it's pretty clear from loads of consistent writing that Eliezer and Nate do just literally believe the title, and earnestly think it's important.

Complaint #2 emphasizes culpability in terms of "knowingly bad reasoning mistakes." i.e, "Eliezer/Nate made reasoning mistakes that led them to this position, it's pretty obvious that those are reasoning mistakes, and people should be held accountable for major media campaigns based on obvious mistakes like that." 

I do think it's sometimes important to criticize people for something like that. But, not this time, because I don't think they made obvious reasoning mistakes.

I have the most sympathy for Complaint #3. I agree there's a memetic bias towards sensationalism in outreach. (Although there are also major biases towards "normalcy" / "we're gonna be okay" / "we don't need to change anything major". One could argue about which bias is stronger, but mostly I think they're both important to model separately).

It does suck if you think something false is propagating. If you think that, seems good to write up what you think is true and argue about it.[1]

If people-more-optimistic-than-me turn out to be right about some things, I'd agree the book and title may have been a mistake. 

Also, I totally agree that Eliezer/Nate do have some patterns that are worth complaining about on group epistemic grounds, that aren't the contents of the book. But, that's not a problem with the book.

I think it'd be great for someone who earnestly believes "If anyone builds it, everyone probably dies but it's hard to know" to publicly argue for that instead.

I. Reasons the "Everyone Dies" thesis is reasonable

What the book does and doesn't say

The book says, confidently, that:

If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.

The book does not claim confidently that AI will come soon, or that it will be shaped in any particular way. (It does make some guesses about what is likely, but, those are guesses and the book is pretty clear about the difference in epistemic status).

The book doesn't say you can't build something that's not "It", that is useful in some ways. (It specifically expresses some hope in using narrow biomedical-AI to solve various problems).

The book says if you build it, everyone dies.

"It" means AI that is actually smart enough to confidently defeat humanity. This can include, "somewhat powerful, but with enough strategic awareness to maneuver into more power without getting caught." (Which is particularly easy if people just straightforwardly keep deploying AIs as they scale them up).

The book is slightly unclear about what "based on current techniques" means (which feels like a fair complaint). But, I think it's fairly obvious that they mean the class of AI training that is "grown" more than "crafted" – i.e. any techniques that involve a lot of opaque training, where you can't make at least a decently confident guess about how powerful the next training run will turn out, and how it'll handle various edge cases.

Do you think interpretability could advance to where we can make reasonably confident predictions about what the next generation would do? Cool. (I'm more skeptical it'll happen fast enough, but, it's not a disagreement with the core thesis of the book, since it'd change the "based on anything like today's understanding of AI" clause)[2]

Do you think it's possible to control somewhat-strong-AI with a variety of techniques that make it less likely that it would be able to take over all humanity? I think there is some kind of potential major disagreement somewhere around here (see below), but it's not automatically a disagreement. 

Do you think there will be at least one company that's actually sufficiently careful as we approach more dangerous levels of AI, with enough organizational awareness to (probably) stop when they get to a run more dangerous than they know how to handle? Cool. I'm skeptical about that too. And this one might lead to disagreement with the book's secondary thesis of "And therefore, Shut It Down," but, it's not (necessarily) a disagreement with "*If* someone built AI powerful enough to destroy humanity based on AI that is grown in unpredictable ways with similar-to-current understanding of AI, then everyone will die."

The book is making a (relatively) narrow claim. 

You might still disagree with that claim. I think there are valid reasons to disagree, or at least assign significantly less confidence to the claim. 

But none of the reasons listed so far are disagreements with the thesis. And, remember, if the reason you disagree is because you think our understanding of AI will improve dramatically, or there will be a paradigm shift specifically away from "unpredictably grown" AI, this also isn't actually a disagreement with the sentence.

I think a pretty reasonable variation on the above is "Look, I agree we need more understanding of AI to safely align a superintelligence, and better paradigms. But, I don't expect to agree with Eliezer on the specifics of how much more understanding we need, when we get into the nuts and bolts. And I expect a lot of progress on those fronts by default, which changes my relationship to the secondary thesis of 'and therefore, shut it all down.'" But, I think it makes more sense to characterize this as "disagreeing with the main thesis by degree, but not in overall thrust."

I also think a lot of people just don't really believe in AI that is smart enough to outmaneuver all humanity. I think they're wrong. But, if you don't really believe in this, and think the book title is false, I... roll to disbelieve on you actually really simulating the world where there's an AI powerful enough to outmaneuver humanity?

The claims are presented reasonably

A complaint I have about Realtime Conversation Eliezer, or Comment-Thread Eliezer, is that he often talks forcefully, unwilling to change frames, with a tone of "I'm talking to idiots", and visibly not particularly listening to any nuanced arguments anyone is trying to make. 

But, I don't have that sort of complaint about this book. 

Something I like about the book is it lays out disjunctive arguments, like “we think ultimately, a naively developed superintelligence would want to kill everyone, for reasons A, B, C and D. Maybe you don’t buy reasons B, C and D. But that still leaves you with A, and here’s an argument that although Reason A might not lead to literally everyone dying, the expected outcome is still something horrifying.”

(An example of that was: For “might the AI keep us as pets?”, the book answers (paraphrased) “We don’t think so. But, even if they did… note that, while humans keep dogs as pets, we don’t keep wolves as pets. Look at the transform from wolf to dog. An AI might keep us as pets, but, if that’s your hope, imagine the transform from Wolves-to-Dogs and equivalent transforms on humans.”) 

Similarly, I like that in the AI Takeoff scenario, there are several instances where it walks through "Here are several different things the AI could try to do next. You might imagine that some of them aren't possible, because the humans are doing X/Y/Z. Okay, let's assume X/Y/Z rule out options 1/2/3. But, that leaves options 4/5/6. Which of them does the AI do? Probably all of them, and then sees which one works best."

Reminder: All possible views of the future are wild.

@Scott Alexander described the AI Takeoff story thus:

It doesn’t just sound like sci-fi [specifically compared to "hard sci fi"]; it sounds like unnecessarily dramatic sci-fi. I’m not sure how much of this is a literary failure vs. different assumptions on the part of the authors." 

I... really don't know what Scott expected a story that featured actual superintelligence to look like. I think the authors bent over backwards giving us one of the least-sci-fi stories you could possibly tell that includes superintelligence doing anything at all, without resorting to "superintelligence just won't ever exist." 

Eliezer and Nate make sure the takeover scenario doesn't depend on technologies that we don't have some existing examples of. The amount of "fast takeoff" seems like the amount of scaleup you'd expect if the graphs just kept going up the way they're currently going up, by approximately the same mechanisms they currently go up (i.e. some algorithmic improvements, some scaling). 

Sure, Galvanic would first run Sable on smaller amounts of compute. And... then they will run it on larger amounts of compute (and as I understand it, it'd be a new, surprising fact if they limited themselves to scaling up slowly/linearly rather than by a noticeable multiplier or order-of-magnitude. If I am wrong about current lab practices here, please link me some evidence).

If this story feels crazy to you, I want to remind you that all possible views of the future are wild. Either some exponential graphs suddenly stop for unclear reasons, or some exponential graphs keep going and batshit crazy stuff can happen that your intuitions are not prepared for. You can believe option A if you want, but, it's not like "the exponential graphs that have been consistent over hundreds of years suddenly stop" is a viewpoint that you can safely point to as a "moderate" and claim to give the other guy burden of proof.

You don't have the luxury of being the sort of moderate who doesn't have to believe something pretty crazy sounding here, one way or another. 

(If you haven't yet read the Holden post on Wildness, I ask you do so before arguing with this. It's pretty short and also fun to read fwiw)

The Online Resources spell out the epistemic status more clearly.

In the FAQ question, "So there's at least a chance of the AI keeping us alive?", they state more explicitly:

It’s overwhelmingly more likely that [superintelligent] AI kills everyone.

In these online resources, we’re willing to engage with a pretty wide variety of weird and unlikely scenarios, for the sake of spelling out why we think they’re unlikely and why (in most cases) they would still be catastrophically bad outcomes for humanity.

We don’t think that these niche scenarios should distract from the headline, however. The most likely outcome, if we rush into creating smarter-than-human AI, is that the AI consumes the Earth for resources in pursuit of some end, wiping out humanity in the process.

The book title isn’t intended to communicate complete certitude. We mean the book title in the manner of someone who sees a friend lifting a vial of poison to their lips and shouts, “Don’t drink that! You’ll die!”

Yes, it’s technically possible that you’ll get rushed to the hospital and that a genius doctor might concoct an unprecedented miracle cure that merely leaves you paralyzed from the neck down. We’re not saying there’s no possibility of miracles. But if even the miracles don’t lead to especially good outcomes, then it seems even clearer that we shouldn’t drink the poison.

The book doesn't actually overextend the arguments and common discourse norms.

This adds up to seeming to me that:

  • The book makes a reasonable case for why Eliezer and Nate are personally pretty confident in the title.
  • The book, I think, does a decent job giving you some space to think “well, I don’t buy that particular argument."
  • The book acknowledges “if you don’t buy some of these arguments, yeah, maybe everyone might not literally die and maybe the AI might care about humans in some way, but we still think it's very unlikely to care about humans in a way that should be comforting."

If a book in the 50s was called "Nuclear War would kill us all", I think that book would have been incorrect (based on my most recent read of Nuclear war is unlikely to cause human extinction), but I wouldn't think the authors were unreasonable for arguing it, especially if they pointed out things like "and yeah, if our models of nuclear winter are wrong, everyone wouldn't literally die, but civilization would still be pretty fucked", and I would think the people giving the authors a hard time about it were being obnoxious pedants, not heroes of epistemic virtue.

(I would think people arguing "but, the nuclear winter models are wrong, so, yeah, we're more in the 'civilization would be fucked' world than the 'everyone literally dies' world" would be doing a good valuable service. But I wouldn't think it'd really change the takeaways very much).

II. Specific points to maybe disagree on

There are some opinions that seem like plausible opinions to hold, given humanity's current level of knowledge, that lead to actual disagreement with "If anyone builds [an AI smart enough to outmaneuver humanity] [that is grown in unpredictable ways] [based on approximately our current understanding of AI]".

And the book does have a secondary thesis of "And therefore, Shut It Down", and you can disagree with that separately from "If anyone builds it, everyone dies."

Right now, the arguments that I've heard sophisticated enough versions of to seem worth acknowledging include:

  1. Very slightly nice AIs would find being nice cheap.
    • (argument against "everyone literally dies.")
  2. AI-assisted alignment is reasonably likely to work. Misuse or dumber-AI-run-amuck is likely enough to be comparably bad to superintelligence. And it's meanwhile easier to coordinate now with smaller actors. So, we should roll the dice now rather than try for a pause.
    • (argument against "Shut It (completely) Down")
  3. We can get a lot of very useful narrow-ish work out of somewhat-more-advanced-models that'll help us learn enough to make significant progress on alignment.
    • (argument against "Shut It Down (now)")
  4. We can keep finding ways to increase the cost of taking over humanity. There's no boolean between "superintelligent enough to outthink humanity" and "not", and this is a broken frame that is preventing you from noticing alternative strategies.
    • (argument against "It" being the right concept to use)

I disagree with the first two being very meaningful (as counterarguments to the book). More on that in a sec.

Argument #3 is somewhat interesting, but, given that it'd take years to get a successful Global Moratorium, I don't see any reason not to start pushing for a long global pause now.

I think the fourth one is fairly interesting. While I strongly disagree with some major assumptions in the Redwood Plan as I understand it, various flavors of "leverage narrow / medium-strength controlled AIs to buy time" feel like they might be an important piece of the gameboard. Insofar as Argument #3 helped Buck step outside the MIRI frame and invent Control, and insofar as that helps buy time, yep, seems important.

This is complicated by "there is a giant Cope Memeplex that really doesn't want to have to slow down or worry too much", so while I agree it's good to be able to step outside the Yudkowsky frame, I think most people doing it are way more likely to end up slipping out of reality and believing nonsense than getting anywhere helpful.

I won't get into that much detail about either topic, since that'd pretty much be a post to itself. But, I'll link to some of the IABED Online Resources, and share some quick notes on why even the sophisticated versions of these arguments I've seen so far don't seem very useful to me.

On the meta-level: It currently feels plausible to me to have some interesting disagreements with the book here, but I don't see any interesting disagreements that add up to "Eliezer/Nate particularly fucked up epistemically or communicatively" or "you shouldn't basically hope the book succeeds at its goal."

Notes on Niceness

There are some flavors of "AI might be slightly nice" that are interesting. But, they don't seem like they change any of our decisions. They just make us a bit more hopeful about the end result.

Given the counterarguments, I don't see a reason to think this more than single-digit-percent likely to be especially relevant. (I can see >9% likelihood the AIs are "nice enough that something interesting-ish happens" but not >9% likelihood that we shouldn't think the outcome is still extremely bad. The people who think otherwise seem extremely motivatedly-cope-y to me).

Note also that it's very expensive for the AI to not boil the oceans / etc as fast as possible, since that means losing many galaxies' worth of resources, so it seems like it's not enough to be "very slightly" nice – it has to be, like, pretty actively nice.

Which plan is Least Impossible?

A lot of x-risk disagreements boil down to "which pretty impossible-seeming thing is only actually Very Hard instead of Impossibly Hard."

There's an argument I haven't heard a sophisticated version of, which is "there's no way you're getting a Global Pause."

I certainly believe that this is an extremely difficult goal, and a lot of major things would need to change in order for it to happen. I haven't heard any real argument we should think it's more impossible than, say, Trump winning the presidency and going on to do various Trumpy things. 

(Please don't get into arguing about Trump in the comments. I'm hoping that whatever you think of Trump, you agree he's doing a bunch of stuff most people would previously have probably expected to be outside the overton window. If this turns out to be an important substantive disagreement I'll make a separate container post for it)

Meanwhile, the counter-impossible-thing I've heard several people putting hope on is "We can run a lot of controlled AIs, where (first) we have them do fairly straightforward automation of not-that-complex empirical work, which helps us get to a point where we trust them enough to give them more openended research tasks."

Then, we run a lot of those real fast, such that they substantially increase the total amount of alignment-research-months happening during a not-very-long slowdown.

The arguments for why this is extremely dangerous, from the book and online resources and maybe some past writing, are, recapped:

There's no good training data. 

We don't even know how to verify alignment work is particularly useful among humans, let alone in an automatedly gradable way. 

Goal Directedness is pernicious. Corrigibility is anti-natural.

The way an AI would develop the ability to think extended, useful creative research thoughts that you might fully outsource to it, is via becoming perniciously goal directed. You can't do months or years of openended research without fractally noticing subproblems, figuring out new goals, and relentlessly finding new approaches to tackle them.

Once you do that, it's a fact of the universe, that the programmers can't change, that "you'd do better at these goals if you didn't have to be fully obedient", and while programmers can install various safeguards, those safeguards are pumping upstream and will have to pump harder and harder as the AI gets more intelligent. And if you want it to make at least as much progress as a decent AI researcher, it needs to be quite smart.

Security is very difficult

The surface area of ways an AI can escape and maneuver are enormous. (I think it's plausible to have a smallish number of carefully controlled, semi-powerful AIs if you are paying a lot of attention. The place I completely get off the train is where you then try to get lots of subjective hours of research time out of thousands of models).

Alignment is among the most dangerous tasks

"Thinking about how to align AIs" requires both for the AI to think how "how would I make smarter version of myself" and "how would I make it aligned to humans?". The former skillset  directly helps them recursively self-improve. The latter skillset helps them manipulate humans.

MIRI did make a pretty substantive try.

One of the more useful lines for me, in the Online Resources, is in their extended discussions about corrigibility:

We ran some workshops, and the workshops had various mathematicians of various stripes (including an International Mathematical Olympiad gold medalist), but nobody came up with a really good idea.

This does not mean that the territory has been exhausted. Earth has not come remotely near to going as hard on this problem as it has gone on, say, string theory, nor offered anything like the seven-digit salaries on offer for advancing AI capabilities.

But we learned something from the exercise. We learned not just about the problem itself, but also about how hard it was to get outside grantmakers or journal editors to be able to understand what the problem was. A surprising number of people saw simple mathematical puzzles and said, “They expect AI to be simple and mathematical,” and failed to see the underlying point that it is hard to injure an AI’s steering abilities, just like how it’s hard to injure its probabilities.

If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications in the real world don’t make the problem easier.

There was a related quote I can't find now, that maybe was just in an earlier draft of the Online Resources, to the effect of "this [our process of attempting to solve corrigibility] is the real reason we have this much confidence about this being quite hard and our current understanding not being anywhere near adequate." 

(Fwiw I think it is a mistake that this isn't at least briefly mentioned in the book. The actual details would go over most people's heads, but, having any kind of pointer to "why are these guys so damn confident?" seems like it'd be quite useful)

III. Overton Smashing, and Hope

Or: "Why is this book really important, not just 'reasonable?'"

I, personally, believe in this book. [3]

If you don't already believe in it, you're probably not going to because of my intuitions here. But, I want to say why this is deeply important to me, and make clear that I'm not just arguing on the internet because I'm triggered and annoyed about some stuff.

I believe in the book partly because it looks like it might work. 

The number (and hit-rate) of NatSec endorsements surprised me. More recently some senators seem to have been bringing up existential risk on their own initiative. When I showed the website to a (non-rationalist) friend who lives near DC and has previously worked for a think-tank-ish org, I expected them to have a knee-jerk reaction of ‘man that’s weird and a bit cringe’, or ‘I’d be somewhat embarrassed to share this website with colleagues’, and instead they just looked worried and said “okay, I’m worried”, and we had a fairly matter-of-fact conversation about it.

It feels like the world is waking up to AI, and is aware that it is some kind of big deal that they don’t understand, and that there’s something unsettling about it. 

I think the world is ready for this book.

I also believe in the book because, honestly, the entire rest of the AI safety community’s output just does not feel adequate to me to the task of ensuring AI goes well. 

I’m personally only like 60% on “if anyone built It, everyone would die.” But I’m like 80% on “if anyone built It, the results would be unrecoverably catastrophic,” and the remaining 20% is a mix of model uncertainty and luck. Nobody has produced counterarguments that feel compelling, just "maybe something else will happen?", and the way people choose their words almost always suggests some kind of confusion or cope.

The plans that people propose mostly do not seem to be counter-arguing the actual difficult parts of the problem. 

The book gives me more hope than anything else has in the past few years. 

Overton Smashing is a thing. I really want at least some people trying.

It’s easy to have the idea “try to change the Overton window.” Unfortunately, changing the Overton window is very difficult. It would be hard for most people to pull it off. I think it helps to have a mix of conviction backed by deep models, and some existing notoriety. There are only a few other people who seem to me like they might be able to pull it off. (It'd be cool if at least one of Bengio, Hinton, Hassabis or Amodei ended up trying. I think Buck actually might do a good job if he tried.)

Smashing an Overton window does not look like "say the careful measured thing, but, a bit louder/stronger." Trying to do it halfway won't work. But going all in with conviction and style seems like it does work. It looks like Bengio, Hinton, Hassabis and Amodei are each trying to do some kind of measured/careful strategy, and it's salient that if they shifted a bit, things would get worse instead of better. 

(Sigh... I think I might need to talk about Trump again. This time it seems more centrally relevant to talk about in the comments. But, like, dude, look at how bulletproof the guy seems to be. He also, like, says falsehoods a lot and I'm not suggesting emulating him-in-particular, but I point to him as an existence proof of what can work)

People keep asking "why can't Eliezer tone it down." I don't think Eliezer is the best possible spokesperson. I acknowledge some downside risk to him going on a major media campaign. But I think people are very confused about how loadbearing the things some people find irritating are. How many fields and subcultures have you founded, man? Fields and subcultures and major new political directions are not founded (generally) by people without some significant fraction of haters.

You can't file off all the edges, and still have anything left that works. You can only reroll on which combination of inspiring and irritating things you're working with.

I want there to be more people who competently execute on "overton smash." The next successful person would probably look pretty different from Eliezer, because part of overton smashing is having a unique style backed by deep models and taste and each person's taste/style/models are pretty unique. It'd be great to have people with more diversity of "ways they are inspiring and also grating."

Meanwhile, we have this book. It's the Yudkowsky version of the book. If you don't like that, find someone who actually could write a better one. (Or, rather, find someone who could execute on a successful overton smashing strategy, which would probably look pretty different than a book since there already is a book, but would still look and feel pretty extreme in some way).

Would it have been better to use a title that fewer people would feel the need to disclaim?

I think Eliezer and Nate are basically correct to believe that the overwhelming likelihood, if someone built "It", would be everyone dying. 

Still, maybe they should have written a book with a title that more people around these parts wouldn't feel the need to disclaim, and that the entire x-risk community could have enthusiastically gotten behind. I think they should have at least considered that. Something more like "If anyone builds it, everyone loses." (that title doesn't quite work, but, you know, something like that)

My own answer is "maybe" - I see the upside. I want to note some of the downsides or counter-considerations. 

(Note: I'm specifically considering this from within the epistemic state of "you pretty confidently believe everyone would literally die, and that if they didn't literally die, the thing that happened instead would be catastrophically bad for most people's values and astronomically bad by Eliezer/Nate's values.")

Counter-considerations include:

AFAICT, Eliezer and Nate spent like ~8 years deliberately backing off and toning it down, out of a vague deferral to people saying "guys you suck at PR and being the public faces of this movement." The result of this was (from their perspective) "EA gets co-opted by OpenAI, which launches a race that dramatically increases the danger the world faces."

So, the background context here is that they have tried more epistemic-prisoner's-dilemma-cooperative-ish strategies, and they haven't worked well. 

Also, it seems like there's a large industrial complex of people arguing for various flavors of "things are pretty safe", and there's barely anyone at all stating plainly "IABED". MIRI's overall strategy right now is to speak plainly about what they believe, both because they think it needs to be said and no one else is saying it, and because they hope just straightforwardly saying what they believe will net a reputation for candor that you don't get if people get a whiff of you trying to modulate your beliefs based on public perception.

None of that is an argument that they should exaggerate or lean-extra-into beliefs that they don't endorse. But, given that they are confident about it, it's an argument not to go out of their way to try to say something else.

I don't currently buy that it costs much to have this book asking for total shutdown.

My sense is it's pretty common for political groups to have an extreme wing and a less extreme wing, and for them to be synergistic. Good cop, bad cop. Martin Luther King and Malcolm X. 

If what you want is some kind of global coordination that isn't a total shutdown, I think it's still probably better to have Yudkowsky over there saying "shut it all down", so you can say "Well, I dunno about that guy. I don't think we need to shut it all down, but I do think we want some serious coordination."

I believe in the book.

Please buy a copy if you haven't yet. 

Please tell your friends about it. 

And, disagree where appropriate, but, please don't give it a hard time for lame pedantic reasons, or jump to assuming you disagree because you don't like something about the vibe. Please don't awkwardly distance yourself because it didn't end up saying exactly the things you would have said, unless it's actually fucking important. (I endorse something close to this but the nuances matter a lot and I wrote this at 5am and don't stand by it enough for it to be the closing sentence of this post)

You can buy the book here.

  1. ^

    (edit in response to Rohin's comment: It additionally sucks that writing up what's true and arguing for it is penalized in the game against sensationalism. I don't think it's so penalized it's not worth doing, though)

  2. ^

    Paul Christiano and Buck both complain about (paraphrased) "Eliezer equivocates between 'we have to get it right on the first critical try' and 'we can't learn anything important before the first critical try.'" 

    I agree something-in-this-space feels like a fair complaint, especially in combination with Eliezer not engaging that much with the more thoughtful critics, and tending to talk down to them in a way that doesn't seem to really listen to the nuances they're trying to point to, rounding them to the nearest strawman of themselves.

    I think this is a super valid thing to complain about Eliezer. But, it's not the title or thesis of the book. (because, if we survive because we learned useful things, I'd say that doesn't count as "anywhere near our current understanding").

  3. ^

    "Believing in" doesn't mean "assign >50% chance to working", it means "assign enough chance (~20%?) that it feels worth investing substantially in and coordinating around." See Believing In by Anna Salamon.