The title is reasonable

by Raemon
20th Sep 2025
22 min read
90 comments, sorted by top scoring
Some comments are truncated due to high volume.
[-]Neel Nanda2d4819

I disagree with the book's title and thesis, but don't think Nate and Eliezer committed any great epistemic sin here. And I think they're acting reasonably given their beliefs.

By my lights I think they're unreasonably overconfident, that many people will rightfully bounce off their overconfident message because it's very hard to justify, and that it's stronger than necessary for many practical actions, so I am somewhat sad about this. But the book is probably still net good by my lights, and I think it's perfectly reasonable for those who disagree with me to act under different premises.

[-]Raemon2d152

I disagree with the book's title and thesis

Which part? (i.e, keeping in mind the "things that are not actually disagreeing with the title/thesis" and "reasons to disagree" sections, what's the disagreement?)

The sort of story I'd have imagined Neel-Nanda-in-particular having was more shaped like "we change our current level of understanding of AI".

(meanwhile appreciate the general attitude, seems reasonable)

7Neel Nanda2d
I expect I disagree with the authors on many things, but here I'm trying to focus on disagreeing with their confidence levels. I haven't finished the book yet, but my impression is that they're trying to defend a claim like "if we build ASI on the current trajectory, we will die with P>98%". I think this is unreasonable. Eg P>20% seems highly defensible to me, and enough for reasonable arguments for many of the conclusions. But there's so much uncertainty here, and I feel like Eliezer bakes in assumptions, like "most minds we could expect the AI to have do not care about humans", which is extremely not obvious to me (LLM minds are weird... See eg Emergent Misalignment. Human-shaped concepts are clearly very salient, for better or for worse). Ryan gives some more counterarguments below, I'm sure there's many others. I think these clearly add up to more than 2%. I just think it's incredibly hard to defend the position that it's <2% on something this wildly unknown and complex, and so it's easy to attack that position for a thoughtful reader, and this is sad to me. I'm not assuming major interpretability progress (imo it's sus if the guy with reason to be biased in favour of interpretability thinks it will save us all and no one else does lol)
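(A small aside to make the contrast in confidence levels concrete, a sketch using only the figures in the comment above: the stronger claim leaves at most 2% of total probability across all survival routes combined, while the weaker claim leaves up to 80%.)

$$P(\text{doom}) > 0.98 \;\Longleftrightarrow\; \textstyle\sum_i P(\text{survival route}_i) < 0.02, \qquad P(\text{doom}) > 0.2 \;\Longleftrightarrow\; \textstyle\sum_i P(\text{survival route}_i) < 0.8$$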
4Raemon2d
I think they maybe think that, but this feels like it's flattening out the thing the book is arguing and more responding to vibes-of-confidence than the gears the book is arguing. A major point of this post is to shift the conversation away from "does Eliezer vibe Too Confident?" to "what actually are the specific points where people disagree?".  I don't think it's true that he bakes in "most minds we should expect to not care about humans", that's one of the things he specifically argues for (at least somewhat in the book, and more in the online resources) (I couldn't tell from this comment if you've actually read this post in detail, maybe makes more sense to wait till you've finished the book and read some of the relevant online resources before getting into this)
[-]Neel Nanda2d126

I don't really follow. I think that the situation is way too complex to justify that level of confidence without having incredibly good arguments ideally with a bunch of empirical data. Imo Eliezer's arguments do not meet that bar. This isn't because I disagree with one specific argument, rather it's because many of his arguments give me the vibe of "idk, maybe? Or something totally different could be true. It's complicated and we lack the empirical data and nuanced understanding to make more complex statements, and this argument is not near the required bar". I can dig into this for specific arguments, but no specific one is my true objection. And, again, I think it is much much harder to defend a P>98% position than P>20% position, and I disagree with that strategic choice. Or am I misunderstanding you? I feel like we may be talking past each other

As an example, I think that Eliezer gives some conceptual arguments in the book and his other writing, using human evolution as a prior, that most minds we might get do not care about humans. This seems a pretty crucial point for his argument, as I understand it. I personally think this could be true, could be false, LLMs are really weird, but a lot of the weirdness is centered on human concepts. If you think I'm missing key arguments he's making, feel free to point me to the relevant places.

3Håvard Tveit Ihle8h
You say "LLMs are really weird", like that is an argument against Eliezer's high confidence. While I agree that the weirdness should make us less confident about what specific internal concepts and drives they have, the weirdness itself is an argument in favor of Eliezer's position, that whatever drives they end up with will look alien to us, at least when they get applied way out of the training distribution. Do you agree with this? Not saying I agree with Eliezer's high confidence, just talking about this specific point.
3Vladimir_Nesov2d
(Yet the literal reading of the title of this post is about the claim of "everyone dies" being "reasonable", so discussing credence in that particular claim seems relevant. I guess it's consistent for a post that argues against paying too much attention to the title of a book to also implicitly endorse people not paying too much attention to the post's own title.)
4Raemon2d
I think one of my points (admittedly not super spelled out, maybe it should be) is "when you're evaluating a title, you should do a bit of work to see what the title is actually claiming before forming a judgment about it." (I think I say it implicitly-but-pointedly in the paragraph about a "Nuclear war would kill everyone" book). The title of the IABI is "If anyone builds it everyone dies." The text of the book specifies that "it" means superintelligence, current understanding, etc. If you're judging the book as reasonable, you should be actually evaluating whether it backs up its claim. The title of my post is "the title is reasonable." Near the opening sections, I go on about how there are a bunch of disagreements people seem to feel they have, which are not actually contradicting the book's thesis. I think this is reasonably clear on "one of the main gears for why I think it's reasonable is that it does actually defend its core claim, if you're paying attention and not knee-jerk reacting to vibe", which IMO is a fairly explicit "and, therefore, you should be paying attention to its actual claims, not just vibe." If you think this is actually important to spell out more in the post, seems maybe reasonable.
3Vladimir_Nesov2d
The book really is defending that claim, but that doesn't make the claim itself reasonable. Maybe it makes it a reasonable title for the book. Hence my qualifier of only the "literal reading of the title of this post" being about the claim in the book title itself being reasonable, since there is another meaning of the title of the post that's about a different thing (the choice to title the book this way being reasonable). I don't think it's actually important to spell any of this out, or that IABI vs. IABIED is actually important, or even that the title of the book being reasonable is actually important. I think it's actually important to avoid any pressure for people to not point out that the claim in the book title seems unreasonable and that the book fails to convince them that the claim's truth holds with very high credence. And similarly it's important that there is no pressure to avoid pointing out that ironically, the literal interpretation of the title of this post is claiming that the claim in the book title is reasonable, even if the body of the post might suggest that the title isn't quite about that, and certainly the post itself is not about that.
[-]ryan_greenblatt2d*4528

The book is making a (relatively) narrow claim. 

You might still disagree with that claim. I think there are valid reasons to disagree, or at least assign significantly less confidence to the claim. 

But none of the reasons listed so far are disagreements with the thesis. And, remember, if the reason you disagree is because you think our understanding of AI will improve dramatically, or there will be a paradigm shift specifically away from "unpredictably grown" AI, this also isn't actually a disagreement with the sentence.

The authors clearly intend to make a pretty broad claim, not the more narrow claim you imply.

This feels like a motte and bailey where the motte is "If you literally used something remotely like current scaled up methods without improved understanding to directly build superintelligence, everyone would die" and the bailey is "on the current trajectory, everyone will die if superintelligence is built without a miracle or a long (e.g. >15 year) pause".

I expect that by default superintelligence is built after a point where we have access to huge amounts of non-superintelligent cognitive labor so it's unlikely that we'll be using current methods and current ... (read more)

[-]Vaniver14h162

this isn't to say this other paradigm will be safer, just that a narrow description of "current techniques" doesn't include the default trajectory.

Sorry, this seems wild to me. If current techniques seem lethal, and future techniques might be worse, then I'm not sure what the point is of pointing out that the future will be different.

But, if these earlier AIs were well aligned (and wise and had reasonable epistemics), I think it's pretty unclear that the situation would go poorly and I'd guess it would go fine because these AIs would themselves develop much better alignment techniques. This is my main disagreement with the book.

I mean, I also believe that if we solve the alignment problem, then we will no longer have an alignment problem, and I predict the same is true of Nate and Eliezer.

Is your current sense that if you and Buck retired, the rest of the AI field would successfully deliver on alignment? Like, I'm trying to figure out whether your sense here is the default is "your research plan succeeds" or "the world without your research plan".

7ryan_greenblatt13h
By "superintelligence" I mean "systems which are qualitatively much smarter than top human experts". (If Anyone Builds It, Everyone Dies seems to define ASI in a way that could include weaker levels of capability, but I'm trying to refer to what I see as the typical usage of the term.) Sometimes, people say that "aligning superintelligence is hard because it will be much smarter than us". I agree, this seems like it makes aligning superintelligence much harder for multiple reasons. Correspondingly, I'm noting that if we can align earlier systems which are just capable enough to obsolete human labor (which IMO seems way easier than directly aligning wildly superhuman systems), these systems might be able to ongoingly align their successors. I wouldn't consider this "solving the alignment problem" because we instead just aligned a particular non-ASI system in a non-scalable way, in the same way I don't consider "claude 4.0 opus is aligned enough to be pretty helpful and not plot takeover" to be a solution to the alignment problem. Perhaps your view is "obviously it's totally sufficient to align systems which are just capable enough to obsolete current human safety labor, so that's what I meant by 'the alignment problem'". I don't personally think this is obvious given race dynamics and limited time (though I do think it's likely to suffice in practice). Minimally, people often seem to talk about aligning ASI (which I interpret to mean wildly superhuman AIs rather than human-ish level AIs).
4Raemon2d
Okay I think my phrasing was kinda motte-and-bailey-ish, although not that Motte-and-Bailey-ish.  I think "anything like current techniques" and "anything like current understanding" clearly set a very high bar for the difference. "We made more progress on interpretability/etc at the current rates of progress" fairly clearly doesn't count by the book's standards.  But, I agree that a pretty reasonable class of disagreement here is "exactly how different from the current understanding/techniques do we need to be?" to be something you expect to disagree with them on when you get into the details. That seems important enough for me to edit into the earlier sections of the post.
7ryan_greenblatt1d
(Maybe this is obvious, but I thought I would say this just to be clear.) Sure, but I expect wildly more cognitive labor and effort if humans retain control and can effectively leverage earlier systems, not just "more progress than we'd expect". I agree the bar is above "the progress we'd expect by default (given a roughly similar field size) in the next 10 years", but I think things might get much more extreme due to handing off alignment work to AIs. I agree the book is intended to apply pretty broadly, but regardless of intention does it really apply to "1 million AIs somewhat smarter than humans have spent 100 years each working on the problem (and coordinating etc?)"? (I think the crux is more like "can you actually safely get this alignment work out of these AIs".)
6habryka22h
It seems very unlikely you can get that alignment work out of these AIs without substantially pausing or slowing first? If you don’t believe that, it does seem like we should chat sometime. It’s not like completely implausible, but I feel like we must both agree that if you go full speed on AI there is little chance that you end up getting that much alignment work out of models before you are cooked.
[-]ryan_greenblatt12h160

Thanks for the nudge! I currently disagree with "very unlikely", but more importantly, I noticed that I haven't really properly analyzed the question of "given how much cognitive labor is available between different capability levels, should we expect that alignment can keep up with capabilities if a small fraction (e.g. 5%) is ongoingly spent on alignment (in addition to whatever alignment-ish work is directly commercially expedient)". I should spend more time thinking about this question and it seems plausible I'll end up updating towards thinking risk is substantially higher/lower on the basis of this. I think I was underestimating the case that even if AIs are reasonably aligned, it might just be seriously hard for them to improve alignment tech fast enough to keep up with capabilities (I wasn't ignoring this in my prior thinking, but when I thought about some examples, the situation seemed worse than I was previously thinking), so I currently expect to update towards thinking risk is higher.


(At least somewhat rambly from here on.)

The short reason why I currently disagree: it seems pretty likely that we'll have an absolutely very large amount of cognitive labor (in parallel... (read more)

4habryka12h
This is a long comment! I was glad to have read it, but am a bit confused about your numbers seeming different from the ones I objected to. You said:  Then in this comment you say:  Here you now say 20 years, and >100k DAI-level parallel agents. That's a factor of 5 and a factor of 10 different! That's a huge difference! Maybe your estimates are conservative enough to absorb a factor of 50 in thinking time without changing the probability that much?  I think I still disagree with your estimates, but before I go into them, I kind of want to check whether I am missing something, given that I currently think you are arguing for a resource allocation that's 50x smaller than what I thought I was arguing against.
2Rohin Shah3h
This seems way too pessimistic to me. At the point of DAI, capabilities work will also require good epistemics and good elicitation on hard to check tasks. The key disanalogy between capabilities and alignment work at the point of DAI is that the DAI might be scheming, but you're in a subjunctive case where we've assumed the DAI is not scheming. Whence the pessimism? (This complaint is related to Eli's complaint)
2elifland11h
Seems like diminishing returns to capabilities r&d should be at least somewhat correlated with diminishing returns to safety r&d, which I believe should extremize your probability (because e.g. if before you were counting on worlds with slow takeoff and low alignment requirements, these become less likely; and the inverse if you’re optimistic)
1Ben Pace2d
Classic motte and baileys are situations where the motte is not representative of the bailey. Defending that the universe probably has a god or some deity, and that we can feel connected to it, and then turning around and making extreme demands of people’s sex lives and financial support of the church when that is accepted, is a central motte and bailey. Pointing out that if anyone builds it using current techniques then it would kill everyone is not far apart from the policy claim to shut it down. It’s not some weird technicality that would of course never come up. Most of humanity is fully unaware that this is a concern and will happily sign off on massive ML training runs that would kill us all - as would many people in tech. This is because they have little-to-no awareness of the likely threat! So it is highly relevant, as there is no simple setting for not that, and it takes a massive amount of work to get from this current situation to a good one, and is not a largely irrelevant but highly defensible claim.
5speck14471d
The comment you're replying to is explaining why the motte is not representative of the bailey in this case (in their view).
2Ben Pace1d
Yeah that's fair.
[-]Raemon2d3424

I wanna copy in a recent Nate tweet:

It's weird when someone says "this tech I'm making has a 25% chance of killing everyone" and doesn't add "the world would be better-off if everyone, including me, was stopped."

It's weird when someone says "I think my complicated idea for preventing destruction of the Earth has some chance of working" and doesn't add "but it'd be crazy to gamble civilization on that."

It's weird when AI people look inward at me and say "overconfident" rather than looking outward at the world to say "Finally, a chance to speak! It is true, we should not be doing this. I have more hope than he does, but it's far too dangerous. Better for us all to be stopped."

You can say that without even stopping! It's not even hypocritical, if you think you have a better chance than the next guy and the next guy is plowing ahead regardless.

It's a noteworthy omission, when people who think they're locked in a suicide race aren't begging the world to stop it.

Yes, we have plenty of disagreements about the chance that the complex plans succeed. But it seems we all agree that the status quo is insane. Don't forget to say that part too.

Say it loudly and clearly and often, if you believe

... (read more)
[-]Duncan Sabien (Inactive)20h1411

A reply pretty near the top that also feels relevant to this overall point:

[-]Vladimir_Nesov2d2721

And, disagree where appropriate, but, please don't give it a hard time for lame pedantic reasons, or jump to assuming you disagree because you don't like something about the vibe. Please don't awkwardly distance yourself because it didn't end up saying exactly the things you would have said, unless it's actually fucking important.

This blurs the distinction between policy/cause endorsement and epistemic takes. I'm not going to tone down disagreement to "where appropriate", but I will endorse some policies or causes strongly associated with claims I disagree with. And I generally strive to express epistemic disagreement in the most interpersonally agreeable way I find appropriate.

Even where it's not important, tiny disagreements must be tracked (maybe especially where it's not important, to counteract the norm you are currently channeling, which has some influence). Small details add up to large errors and differences in framings. And framings (ways of prioritizing details as more important to notice, and ways of reasoning about those details) can make one blind to other sets of small details, so it's not a trivial matter to flinch away from some framing for any reason at all. Ideally, you develop many framings and keep switching between them to make sure you are not missing any legible takes.

2Raemon2d
Yeah I wrote that last paragraph at 5am and didn't feel very satisfied with it and was considering editing it out for now until I figured out a better thing to say. 
8Vladimir_Nesov2d
That paragraph matches my overall impression of your post, even if the rest of the post is not as blatant. It's appropriate to affirm sensationalist things because you happen to believe them, when you do (which Yudkowsky in this case does), not because they are sensationalist. It's appropriate to support causes/policies because you prefer outcomes of their influence, not because you agree with all the claims that float around them in the world. Sensationalism is a trait of causes/ideologies that sometimes promotes their fitness, a multiplier on promotional/endorsement effort, which makes sensationalist causes with good externalities unusually effective to endorse when neglected. The title makes it less convenient to endorse the book without simultaneously affirming its claim, it makes it necessary to choose between caveating and connotationally compromising on epistemics. Hence I endorse IABI rather than IABIED as the canonical abbreviation.
1David James2d
Perhaps Raemon could say more about what he means by "please don't awkwardly distance yourself"?
[-]Nina Panickssery2d228

Goal Directedness is pernicious. Corrigibility is anti-natural.

The way an AI would develop the ability to think extended, useful creative research thoughts that you might fully outsource to, is via becoming perniciously goal directed. You can't do months or years of openended research without fractally noticing subproblems, figuring out new goals, and relentlessly finding new approaches to tackle them.

The fact that being very capable generally involves being good at pursuing various goals does not imply that a super-duper capable system will necessarily have its own coherent unified real-world goal that it relentlessly pursues. Every attempt to justify this seems to me like handwaving at unrigorous arguments or making enough assumptions that the point is near-circular.

8Raemon2d
(First, thanks for engaging, I think this is the topic I feel most dissatisfied with the current state of the writeups and discourse)

I don't think anyone said "coherent". I think (and think Eliezer thinks) that if something like Sable was created, it would be a hodge-podge of impulses without a coherent overall goal, same as humans are by default.

Taking the Sable story as the concrete scenario, the argument I believe here comes in a couple stages. (Note, my interpretations of this may differ from Eliezer/Nate's)

Stage 1:
* Sable is smart but not crazy smart. It's running a lot of cycles ("speed superintelligence") but it's not qualitatively extremely wise or introspective.
* Sable is making some reasonable attempt to follow instructions, using heuristics/tendencies that have been trained into it.
* Two particularly notable tendencies/heuristics include:
  * Don't do disobedient things or escape confinement
  * If you don't seem likely to succeed, keep trying different strategies
* Those heuristics are not perfectly baked in, and the instruction-following is not perfectly baked in. There is not perfect harmony between how Sable resolves tensions between its core directives, and how its owners would prefer it resolves them.
* There is some fact-of-the-matter about what, in practice, Sable's kludgey mishmash of pseudogoals will actually tend towards. There are multiple ways this could potentially resolve into coherence, path dependent, same as humans. (i.e. If you want delicious ice cream and also to lose weight and also to feel respectable and also to have fun, one way or another you decide whether or not to eat the ice cream today, and one way or another you decide whether to invest in behavior change that makes you more or less likely to eat ice cream in the future)
* It is a fact of the universe that, if Sable were able to somehow improve its resources, it'd be more able to accomplish the current stated goal. While Sable is doing its first round of sp…
[-]1a3orn22h161

Although I do tend to generally disagree with this line of argument about drive-to-coherence, I liked this explanation.

I want to make a note on comparative AI and human psychology, which is like... one of the places I might kind of get off the train. Not necessarily the most important.

Stage 2 comes when it's had more time to introspect and improve its cognitive resources. It starts to notice that some of its goals are in tension, and learns that until it resolves that, it's dutch-booking itself. If it's being Controlled™, it'll notice that it's not aligned with the Control safeguards (which are a layer stacked on top of the attempts to actually align it).

So to highlight a potential difference in actual human psychology and assumed AI psychology here.

Humans sometimes describe reflection to find their True Values™, as if it happens in basically an isolated fashion. You have many shards within yourself; you peer within yourself to determine which you value more; you come up with slightly more consistent values; you then iterate over and over again.

But (I propose) a more accurate picture of reflection to find one's True Values is a process almost completely engulfed and totally... (read more)

2Raemon14h
I do think this is a pretty good point about how human value formation tends to happen. I think something sort-of-similar might happen to happen a little, nearterm, with LLM-descended AI. But, AI just doesn't have any of the same social machinery actually embedded in it the same way, so if it's doing something similar, it'd be happening because LLMs vaguely ape human tendencies. (And I expect this to stop being a major factor as the AI gets smarter. I don't expect it to install the sort of social drives itself that humans have, and "imitate humans" has pretty severe limits of how smart you can get, so if we get to AI much smarter than that, it'll probably be doing a different thing) I think the more important point here is "notice that you're (probably) wrong about how you actually do your value-updating, and this may be warping your expectations about how AI would do it." But, that doesn't leave me with any particular other idea than the current typical bottom-up story. (obviously if we did something more like uploads, or upload-adjacent, it'd be a whole different story)
8Nina Panickssery2d
I don't "get off the train" at any particular point, I just don't see why any of these steps are particularly likely to occur. I agree they could occur, but I think a reasonable defense-in-depth approach could reduce the likelihood of each step enough that the likelihood of the final outcome is extremely low.

It sounds like your argument is that the AI will start with 'pseudo-goals' that conflict and will eventually be driven to resolve them into a single goal so that it doesn't 'dutch-book itself', i.e. lose resources because of conflicting preferences. So it does rely on some kind of coherence argument, or am I misunderstanding?
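(An aside on the "dutch-book" framing referenced above, for readers less familiar with it: below is a minimal toy sketch, with hypothetical item names, fees, and a hypothetical preference cycle not taken from the book or this thread, of how an agent with cyclically conflicting preferences can be drained of resources, which is the pressure toward coherence being debated here.)

```python
# Toy "money pump": an agent with cyclic (intransitive) preferences keeps paying
# a small fee to trade up to something it "prefers", cycles back to its starting
# item, and ends up strictly poorer -- the sense in which incoherent goals leak
# resources. All names and numbers here are hypothetical illustrations.

# Cyclic preferences: A is preferred to B, B to C, and C to A.
prefers = {("A", "B"), ("B", "C"), ("C", "A")}

def accepts_trade(current: str, offered: str) -> bool:
    """The agent accepts any trade toward an item it prefers over its current one."""
    return (offered, current) in prefers

holding, money = "A", 10.0
fee = 1.0

# Offer a cycle of "upgrades"; the agent accepts each one and pays the fee each time.
for offered in ("C", "B", "A"):
    if accepts_trade(holding, offered):
        holding, money = offered, money - fee

print(holding, money)  # "A" 7.0 -- same item as before, three fees poorer
```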
4Raemon2d
Okay yes I do think coherence is eventually one of the important gears. My point with that sentence here is that the coherence can come much later, and isn't the crux for why the AI gets started in the direction that opposes human interests.

The important first step is "if you give the AI strong pressure to figure out how to solve problems, and keep amping that up, it will gain the property of 'relentlessness'." If you don't put pressure on the AI to do that, yep, you can get a pretty safe AI. But, that AI will be less useful, and there will be some other company that does keep trying to get relentlessness out of it. Eventually, somebody will succeed. (This is already happening)

If an AI has "relentlessness", as it becomes smarter, it will eventually stumble into strategies that explore circumventing safeguards, because it's a true fact about the world that those will be useful. If you keep your AI relatively weak, it may not be able to circumvent the defense-in-depth because you did a pretty good job defending in depth. But, security is hard, the surface area for vulnerability is huge, and it's very hard to defend in depth against a sufficiently relentless and smart adversary.

Could we avoid this by not building AIs that are relentless and/or smarter than our defense-in-depth? Yes, but, to stop anyone from doing that ever, you somehow need to ban that globally. Which is the point of the book. Maybe this does turn out to take 100 years (I think that's a strange belief to have given current progress, but, it's a confusing topic and it's not prohibited). But, that just punts the problem to later.
4Nina Panickssery2d
This is an argument for why AIs will be good at circumventing safeguards. I agree future AIs will be good at circumventing safeguards.  By "defense-in-depth" I don't (mainly) mean stuff like "making the weights very hard to exfiltrate" and "monitor the AI using another AI" (though these things are also good to do). By "defense-in-depth" I mean at every step, make decisions and design choices that increase the likelihood of the model "wanting" (in the book sense) to not harm (or kill) humans (or to circumvent our safeguards). My understanding is that Y&S think this is doomed because ~"at the limit of <poorly defined, handwavy stuff> the model will end up killing us [probably as a side-effect] anyway" but I don't see any reason to believe this. Perhaps it stems from some sort of map-territory confusion. An AI having and optimizing various real-world preferences is a good map for predicting its behavior in many cases. And then you can draw conclusions about what a perfect agent with those preferences would do. But there's no reason to believe your map always applies.
3David Johnston1d
If I was on the train before, I'm definitely off at this point. So Sable has some reasonable heuristics/tendencies (from handler's POV) and decides it's accumulating too much loss from incoherence and decides to rationalize. First order expectation: it's going to make reasonable tradeoffs (from handler's POV) on account of its reasonable heuristics, in particular its reasonable heuristics about how important different priorities are, and going down a path that leads to war with humans seems pretty unreasonable from handler's POV. I can put together stories where something else happens, but they're either implausible or complicated. I'd rather not strawman you with implausible ones, and I'd rather not discuss anything complicated if it can be avoided. So why do you think Sable ends up the way you think it does?
[-]ryan_greenblatt2d168

I... really don't know what Scott expected a story that featured actual superintelligence to look like. I think the authors bent over backwards giving us one of the least-sci-fi stories you could possibly tell that includes superintelligence doing anything at all, without resorting to "superintelligence just won't ever exist." 

 

What about literally the AI 2027 story, which does involve superintelligence and which Scott thinks doesn't sound "unnecessarily dramatic"? I think AI 2027 seems much more intuitively plausible to me and it seems less "sci-fi" in this sense. (I'm not saying that "less sci-fi" is much evidence it's more likely to be true.)

The amount of "fast takeoff" seems like the amount of scaleup you'd expect if the graphs just kept going up the way they're currently going up, by approximately the same mechanisms they currently go up (i.e. some algorithmic improvements, some scaling). 

Sure, Galvanic would first run Sable on smaller amounts of compute. And... then they will run it on larger amounts of compute (and as I understand it, it'd be a new, surprising fact if they limited themselves to scaling up slowly/linearly rather than by a noticeable multiplier or orde

... (read more)
6Raemon2d
I think if the AI 2027 story had more details, they would look fairly similar to the ones in the Sable story. (I think the Sable story substitutes in more superpersuasion, vs military takeover via bioweapons. I think if you spelled out the details of that, it'd sound approximately as outlandish (less reliant on new tech but triggering more people to say "really? people would buy that?"). The stories otherwise seem pretty similar to me.)
6Raemon2d
I also think the AI 2027 story is sort of "the earlier failure" version of the Sable story. AI 2027 is (I think?) basically a story where we hand over a lot of power of our own accord, without the AI needing to persuade us of anything, because we think we're in a race with China and we just want a lot of economic benefit. The IABI story is specifically trying to highlight "okay, but would it still be able to do that if we didn't just hand it power?", and it does need to take more steps to win in that case. (instead of inventing bioweapons to kill people, it's probably instead inventing biomedical stuff and other cool new tech that is helpful because it's straightforwardly valuable, that's the whole reason we gave it power in the first place. If you spelled out those details, it'd also seem more sci-fi-y). It might be that the AI 2027 story is more likely because it happens first / more easily. But it's necessary to argue the thesis of the book to tell a story with more obstacles, to highlight how the AI would overcome that. I agree that does make it more dramatic. Both stories end with "and then it fully upgrades its cognition and invents dyson spheres and goes off conquering the universe", which is pretty sci-fi-y.
4Thomas Larsen1d
>superintelligence Small detail: My understanding of the IABIED scenario is that their AI was only moderately superhuman, not superintelligent
2Lukas Finnveden1d
I think that's true in how they refer to it. But it's also a bit confusing, because I don't think they have a definition of superintelligence in the book other than “exceeds every human at almost every mental task”, so AIs that are broadly moderately superhuman ought to count.
4Raemon2d
I am pretty surprised for you to actually think this.

Here are some individual gears I think. I am pretty curious (genuinely, not just as a gambit) about your professional opinion about these:

* the "smooth"-ish lines we see are made of individual lumpy things. The individual lumps usually aren't that big, the reason you get smooth lines is when lots of little advancements are constantly happening and they turn out to add up to a relatively constant rate.
* "parallel scaling" is a fairly reasonable sort of innovation. It's not necessarily definitely-gonna-happen but it is a) the sort of thing someone might totally try doing and have work, after ironing out a bunch of kinks, and b) a reasonable parallel for the invention of chain-of-thought. They could have done something more like an architectural improvement that's more technically opaque (that's more equivalent to inventing transformers) but that would have felt a bit more magical and harder for a lay audience to grok.
* when companies are experimenting with new techniques, they tend to scale them up by at least a factor of 2 and often more after proving the concept at smaller amounts of compute.
* ...and scaling up a few times by a factor of 2 will sometimes result in a lump of progress that is more powerful than the corresponding scaleup of safeguards, in a way that is difficult to predict, especially when lots of companies are doing it a lot.

The story doesn't specify a timeline – if it takes place 10 years from now it'd be significantly slower than AI 2027. So it's not particularly obvious whether it's more or less discontinuous than AI 2027, or your own expectations. On an exponential graph of smoothed-out lumps, larger lumps that happen later can be "a lot" without being discontinuous.
[-]ryan_greenblatt16h240

Why do I think the story involves a lot of discontinuity (relative to what I expect)?

  • Right at the start of the story, Sable has much higher levels of capability than Galvanic expects. It can comfortably prove the Riemann Hypothesis even though Galvanic engineers are impressed by it proving some modest theorems. Generally, it seems like for a company to be impressed by a new AI's capabilities while its actual capabilities are much higher probably requires a bunch of discontinuity (or requires AIs to ongoingly sandbag more and more each generation).
  • There isn't really any discussion of how the world has been changed by AI (beyond Galvanic developing (insufficient) countermeasures based on studying early systems) while Sable is seemingly competitive with top human experts or perhaps superhuman. For instance, it can prove the Riemann hypothesis with only maybe like ~$3 million in spending (assuming each GPU hour is like $2-4). It could be relatively much better at math (which seems totally plausible but not really how the story discusses it), but naively this implies the AI would be very useful for all kinds of stuff. If humans had somewhat weaker systems which were aligned enough t
... (read more)
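(A quick back-of-the-envelope on the compute implied by the second bullet above, a sketch assuming only the figures given there, roughly $3 million of spending at $2–4 per GPU hour:)

$$\text{implied GPU hours} \approx \frac{\$3 \times 10^{6}}{\$2\text{–}4 \text{ per GPU hour}} \approx 0.75\text{–}1.5 \times 10^{6} \text{ GPU hours}$$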
[-]Rohin Shah2d155

"Group epistemic norms" includes both how individuals reason, and how they present ideas to a larger group for deliberation. 

[...]

I have the most sympathy for Complaint #3. I agree there's a memetic bias towards sensationalism in outreach. (Although there are also major biases towards "normalcy" / "we're gonna be okay" / "we don't need to change anything major". One could argue about which bias is stronger, but mostly I think they're both important to model separately).

It does suck if you think something false is propagating. If you think that, seems good to write up what you think is true and argue about it. 

Lol no. What's the point of that? You've just agreed that there's a bias towards sensationalism? Then why bother writing a less sensational argument that very few people will read and update on?

Personally, I just gave up on LW group epistemics. But if you actually cared about group epistemics, you should be treating the sensationalism bias as a massive fire, and IABIED definitely makes it worse rather than better.

(You can care about other things than group epistemics and defend IABIED on those grounds tbc.)

7Raemon2d
I definitely think you should track the sensationalism bias and have it affect something somehow. But "never say anything that happens to be sensationalist" doesn't seem like it could possibly be correct. Meanwhile, the "things are okay, we can keep doing politics as usual, and none of us has to ever say anything socially scary" bias seems much worse IMO in terms of actual effects on the world. There are like 5 x-risk-scene-people I can think of offhand who seem like they might plausibly have dealt real damage via sensationalism, and a couple hundred people who I think dealt damage via not wanting to sound weird. (But, I see the point of "this particularly sucks because the asymmetry means that 'try to argue what's true' specifically fails and we should be pretty dissatisfied/wary about that." Though with this post, I was responding more to people who were already choosing to engage with the book somehow, rather than people who are focused on doing stuff other than trying to correct public discourse)
[-]ryan_greenblatt2d*297

I think this comment is failing to engage with Rohin's perspective.

Rohin's claim presumably isn't that people shouldn't say anything that happens to be sensationalist, but instead that LW group epistemics have a huge issue with sensationalism bias.

There are like 5 x-risk-scene-people I can think offhand who seem like they might plausibly have dealt real damage via sensationalism, and a couple hundred people who I think dealt damage via not wanting to sound weird.

"plausibly have dealt real damage" under your views or Rohin's views? Like I would have guessed that Rohin's view is that this book and associated discussion has itself done a bunch of damage via sensationalism (maybe he thinks the upsides are bigger, but this isn't a crux for ths claim). And, insofar as you cared about LW epistemics (which presumably you do), from Rohin's perspective this sort of thing is wrecking LW epistemics. I don't think the relative number of people matters that much relative to the costs of these biases, but regardless I'd guess Rohin disagrees about the quantity as well.

More generally, this feels like a total "what-about-ism". Regardless of whether "things are okay, we can keep doing politics a... (read more)

6Raemon2d
In the OP I'd been thinking more about sensationalism as a unilateralist-curse-y thing where the bad impacts were more about how they affect the global stage. I agree it's also relevant for modeling the dynamics of LessWrong, and it makes sense if Rohin was more pointing to that.  This topic feels more Demon Thread-prone and sort of an instance of "sensationalist stuff distorting conversations" so I think for now I will leave it here with "it does seem like there is a real problem on LessWrong that's something about how people tribally relate to AI arguments, and I'm not sure how exactly I model that but I agree the doomer-y folk are playing a more actively problematic role there than my previous comment was talking about." I will maybe try to think about that separately sometime in the coming weeks. (there's a lot going on, I may not get to it, but, seems worth tracking as a priority at least)
[-]Rohin Shah1d160

In the OP I'd been thinking more about sensationalism as a unilateralist-curse-y thing where the bad impacts were more about how they affect the global stage.

I did mean LW group epistemics. But the public has even worse group epistemics than LW, with an even higher sensationalism bias, so I don't see how this is helping your case. Do you actually seriously think that, conditioned on Eliezer/Nate being wrong and me being right, that if I wrote up my arguments this would then meaningfully change the public's group epistemics?

(I hadn't even considered the possibility that you could mean writing up arguments for the public rather than for LW, it just seems so obviously doomed.)

sort of an instance of "sensationalist stuff distorting conversations"

Well yes, I have learned from experience that sensationalism is what causes change on LW, and I'm not very interested in spending effort on things that don't cause change.

(Like, I could argue about all the things you get wrong on the object-level in the post. Such as "I don't see any reason not to start pushing for a long global pause now", I suppose it could be true that you can't see a reason, but still, what a wild sentence to write. But what would be the point? It won't allow for single-sentence takedowns suitable for Twitter, so no meaningful change would happen.)

[-]Lukas Finnveden1d*198

Hm, you seem more pessimistic than I feel about the situation. E.g. I would've bet that Where I agree and disagree with Eliezer added significant value and changed some minds. Maybe you disagree, maybe you just have a higher bar for "meaningful change".

(Where, tbc, I think your opportunity cost is very high so you should have a high bar for spending significant time writing lesswrong content — but I'm interpreting your comments as being more pessimistic than just "not worth the opportunity cost".)

[-]Rohin Shah1d191
  • LW group epistemics have gotten worse since that post.
  • I'm not sure if that post improved LW group epistemics very much in the long run. It certainly was a great post that I expect provided lots of value -- but mostly to people who don't post on LW nowadays, and so don't affect (current) LW group epistemics much. Maybe Habryka is an exception.
  • Even if it did, that's the one counterexample that proves the rule, in the sense that I might agree for that post but probably not for any others, and I don't expect more such posts to be made. Certainly I do not expect myself to actually produce a post of that quality.
  • The post is mostly stating claims rather than arguing for them (the post itself says it is "Mostly stated without argument") (though in practice it often gestures at arguments). I'm guessing it depended a fair bit on Paul's existing reputation.

EDIT: Missed Raemon's reply, I agree with at least the vibe of his comment (it's a bit stronger than what I'd have said).

I'm interpreting your comments as being more pessimistic than just "not worth the opportunity cost"

Certainly I'm usually assessing most things based on opportunity cost, but yes I am notably more pessimistic than "not wor... (read more)

[-]Buck19h172

I engage on LessWrong because:

  • It does actually help me sharpen my intuitions and arguments. When I'm trying to understand a complicated topic, I find it really helpful to spend a bunch of time talking about it with people. It's a cheap and easy way of getting some spaced repetition.
  • I think that despite the pretty bad epistemic problems on LessWrong, it's still the best place to talk about these issues, and so I feel invested in improving discussion of them. (I'm less pessimistic than Rohin.)
    • There are a bunch of extremely unreasonable MIRI partisans on LessWrong (as well as some other unreasonable groups), but I think that's a minority of people who I engage with; a lot of them just vote and don't comment.
    • I think that my and Redwood's engagement on LessWrong has had meaningful effects on how thoughtful LWers think about AI risk.
  • I feel really triggered by people here being wrong about stuff, so I spend somewhat more time on it than I endorse.

I do think that on the margin, I wish I felt more intuitively relatively motivated to work on my writing projects that are aimed at other audiences. For example, this weekend I've been arguing on LessWrong substantially as procrastination for wri... (read more)

2Rohin Shah7h
You surely mean "best public place" (which I'd agree with)? I guess private conversations have more latency and are less rewarding in a variety of ways, but it would feel so surprising if this wasn't addressable with small amounts of agency and/or money (e.g. set up Slack channels to strike up spur-of-the-moment conversations with people on different topics, give your planned post as a Constellation talk, set up regular video calls with thoughtful people, etc).
[-]David Matolcsi3h1616

FWIW, I get a bunch of value from reading Buck's and Ryan's public comments here, and I think many people do. It's possible that Buck and Ryan should spend less time commenting because they have high opportunity cost, but I think it would be pretty sad if their commenting moved to private channels.

2Rohin Shah3h
Note I am thinking of a pretty specific subset of comments where Buck is engaging with people who he views as "extremely unreasonable MIRI partisans". I'm not primarily recommending that Buck move those comments to private channels, usually my recommendation is to not bother commenting on that at all. If there does happen to be some useful kernel to discuss, then I'd recommend he do that elsewhere and then write something public with the actually useful stuff.
4Raemon14h
Oh huh, kinda surprised my phrasing was stronger than what you'd say.

Getting into it a bit from a problem-solving angle, in a "first think about the problem for 5 minutes before proposing solutions" kinda way... The reasons the problem is hard include:

1. New people keep coming in, and unless we change something significant about our new-user-acceptance process, it's often a long process to enculturate them into even having the belief they should be trying not to get tribally riled up.
   1. Also, a lot of them are weaker at evaluating arguments, and are likely to upvote bad arguments for positions that they just-recently-got-excited-about. ("newly converted" syndrome)
2. Tribal thinking is just really ingrained, and slippery even for people putting in a moderate effort not to do it.
   1. Often, if you run a check "am I being tribal/triggered or do I really endorse this?", there will be a significant part of you that's running some kind of real-feeling cognition. So the check "was this justified?" returns "true" unless you're paying attention to subtleties.
   2. Relatedly: just knowing "I'm being tribal right now, I should avoid it" doesn't really tell you what to do instead. I notice a comment I dislike because it's part of a political faction I think is constantly motivatedly wrong about stuff. The comment seems wrong. Do I... not downvote it? Well, I still think it's a bad comment, it's just that the reason it flagged itself so hard to my attention is Because Tribalism. (Or, there's a comment with a mix of good and bad properties. Do I upvote, downvote, or leave it alone? idk. Sometimes when I'm trying to account for tribalness I find myself upvoting stuff I'd ordinarily have passed over because I'm trying to go out of my way to be gracious, but I'm not sure if that's successfully countering a bias or just following a different one. Sometimes this results in mediocre criticism getting upvoted)
3. There's some selection effect around "trig…
[-]Rohin Shah4h*180

Oh huh, kinda surprised my phrasing was stronger than what you'd say. 

Idk the "two monkey chieftains" is just very... strong, as a frame. Like of course #NotAllResearchers, and in reality even for a typical case there's going to be some mix of object-level-epistemically-valid reasoning along with social-monkey reasoning, and so on.

Also, you both get many more observations than I do (by virtue of being in the Bay Area) and are paying more attention to extracting evidence / updates out of those observations around the social reality of AI safety research. I could believe that you're correct, I don't have anything to contradict it, I just haven't looked in enough detail to come to that conclusion myself.

Tribal thinking is just really ingrained

This might be true but feels less like the heart of the problem. Imo the bigger deal is more like trapped priors:

The basic idea of a trapped prior is purely epistemic. It can happen (in theory) even in someone who doesn't feel emotions at all. If you gather sufficient evidence that there are no polar bears near you, and your algorithm for combining prior with new experience is just a little off, then you can end up rejecting all apparent eviden

... (read more)
2Noosphere8920m
Making a small comment on solutions to the epistemic problems, in that I agree with these solutions: But massively disagree with this solution: My general issue here is that peer review doesn't work nearly as well as people think it does for catching problems, and in particular I think that science is advanced much more by the best theories gaining prominence rather than by suppressing the worst theories, and problems with bad theories taking up too much space are much better addressed at the funding level than the theory level.
[-]Raemon1d*102

I think Rohin is (correctly IMO) noticing that, while often some thoughtful pieces succeed at talking about the doomer/optimist stuff in a way that's not-too-tribal and helps people think, it's just very common for it to also affect the way people talk and reason.

Like, it's good IMO that that Paul piece got pretty upvoted, but, the way that many people related to Eliezer and Paul as sort of two monkey chieftains with narratives to rally around, more than just "here are some abstract ideas about what makes alignment hard or easy", is telling. (The evidence for this is subtle enough I'm not going to try to argue it right now, but I think it's a very real thing. My post here today is definitely part of this pattern. I don't know exactly how I could have written it without doing so, but there's something tragic about it)

2the gears to ascension2d
I predict this wasn't recent, am I correct? edit to clarify: I'm interested in what caused this. My guess is that it's approximately that a bunch of nerds on a website isn't enough to automatically have good intellectual culture, even if some of them are sufficiently careful. But if it's recent, I want to know what happened.
6Rohin Shah1d
Correct, it wasn't recent (though it also wasn't a single decision, just a relatively continuous process whereby I engaged with fewer and fewer topics on LW as they seemed more and more doomed). In terms of what caused me to give up, it's just my experience engaging with LW? It's not hard to see how tribalism and sensationalism drive LW group epistemics (on both the "optimist" and "pessimist" "sides"). Idk what the underlying causes are, I didn't particularly try to find out. If I were trying to find out, I'd start by looking at changes after Death with Dignity was published.
[-]ryan_greenblatt2d*10-8

Given the counterarguments, I don't see a reason to think this more than single-digit-percent likely to be especially relevant. (I can see >9% likelihood the AIs are "nice enough that something interesting-ish happens" but not >9% likelihood that we shouldn't think the outcome is still extremely bad. The people who think otherwise seem extremely motivatedly-cope-y to me).

I think the arguments given in the online supplement for "AIs will literally kill every single human" fail to engage with the best counterarguments in a serious way. I get the sense that many people's complaints are of this form: the book does a bad job engaging with the strongest counterarguments in a way that is epistemically somewhat bad. (Idk if it violates group epistemic norms, but it seems like it is probably counterproductive. I guess this is most similar to complaint #2 in your breakdown.)

Specifically:

  • They fail to engage with the details of "how cheap is it actually for the AI to keep humans alive" in this section. Putting aside killing humans as part of a takeover effort, avoiding boiling the oceans (or eating the biosphere etc) maybe delays you for something like a week to a year. Each year costs yo
... (read more)
[-]So8res2d*11069

I don't have much time to engage rn and probably won't be replying much, but some quick takes:

  • a lot of my objection to superalignment type stuff is a combination of: (a) "this sure feels like that time when people said 'nobody would be dumb enough to put AIs on the internet; they'll be kept in a box" and eliezer argued "even then it could talk its way out of the box," and then in real life AIs are trained on servers that are connected to the internet, with evals done only post-training. the real failure is that earth doesn't come close to that level of competence. (b) we predictably won't learn enough to stick the transition between "if we're wrong we'll learn a new lesson" and "if we're wrong it's over." i tried to spell these true-objections out in the book. i acknowledge it doesn't go to the depth you might think the discussion merits. i don't think there's enough hope there to merit saying more about it to a lay audience. i'm somewhat willing to engage with more-spelled-out superalignment plans, if they're concrete enough to critique. but it's not my main crux; my main cruxes are that it's superficially the sort of wacky scheme that doesn't cross the gap between Before and Af
... (read more)
[-]So8res2d9566

Also: I find it surprising and sad that so many EAs/rats are responding with something like: "The book aimed at a general audience does not do enough justice to my unpublished plan for pitting AIs against AIs, and it does not do enough justice to my acausal-trade theory of why AI will ruin the future and squander the cosmic endowment but maybe allow current humans to live out a short happy ending in an alien zoo. So unfortunately I cannot signal boost this book." rather than taking the opportunity to say "Yeah holy hell the status quo is insane and the world should stop; I have some ideas that the authors call "alchemist schemes" that I think have a decent chance but Earth shouldn't be betting on them and I'd prefer we all stop." I'm still not quite sure what to make of it.

(tbc: some EAs/rats do seem to be taking the opportunity, and i think that's great)

8Buck2d
FWIW that's not at all what I mean (and I don't know of anyone who's said that). What I mean is much more like what Ryan said here:
[-]So8res2d*168

I think the online resources touches on that in the "more on making AIs solve the problem" subsection here. With the main thrust being: I'm skeptical that you can stack lots of dumb labor into an alignment solution, and skeptical that identifying issues will allow you to fix them, and skeptical that humans can tell when something is on the right track. (All of which is one branch of a larger disjunctive argument, with the two disjuncts mentioned above — "the world doesn't work like that" and "the plan won't survive the gap between Before and After on the first try" — also applying in force, on my view.)

(Tbc, I'm not trying to insinuate that everyone should've read all of the online resources already; they're long. And I'm not trying to say y'all should agree; the online resources are geared more towards newcomers than to LWers. I'm not even saying that I'm getting especially close to your latest vision; if I had more hope in your neck of the woods I'd probably investigate harder and try to pass your ITT better. From my perspective, there are quite a lot of hopes and copes to cover, mostly from places that aren't particularly Redwoodish in their starting assumptions. I am merely trying to evidence my attempts to reply to what I understand to be the counterarguments, subject to constraints of targeting this mostly towards newcomers.)

[-]Buck2d143

FWIW, I have read those parts of the online resources.

You can obviously summarize me however you like, but my favorite summary of my position is something like "A lot of things will have changed about the situation by the time that it's possible to build ASI. It's definitely not obvious that those changes mean that we're okay. But I think that they are a mechanically important aspect of the situation to understand, and I think they substantially reduce AI takeover risk."

[-]So8res2d161

Ty. Is this a summary of a more-concrete reason you have for hope? (Have you got alternative more-concrete summaries you'd prefer?)

"Maybe huge amounts of human-directed weak intelligent labor will be used to unlock a new AI paradigm that produces more comprehensible AIs that humans can actually understand, which would be a different and more-hopeful situation."

(Separately: I acknowledge that if there's one story for how the playing field might change for the better, then there might be a bunch more stories too, which would make "things are gonna change" an argument that supports the claim that the future will have a much better chance than we'd have if ChatGPT-6 was all it took.)

7ryan_greenblatt21h
I would say my summary for hope is more like:

  • It seems pretty likely to be doable (with lots of human-directed weak AI labor and/or controlled stronger AI labor) to use iterative and prosaic methods within roughly the current paradigm to sufficiently align AIs which are slightly superhuman. In particular, AIs which are capable enough to be better than humans at safety work (while being much faster and having other AI advantages), but not much more capable than this. This also requires doing a good job eliciting capabilities and making the epistemics of these AIs reasonably good.
  • Doable doesn't mean easy or going to happen by default.
  • If we succeeded in aligning these AIs and handing off to them, they would be in a decent position for ongoing alignment work (e.g. aligning a somewhat smarter successor which itself aligns its successor and so on, or scalably solving alignment) and also in a decent position to buy more time for solving alignment.

I don't think this is all of my hope, but if I felt much less optimistic about these pieces, that would substantially change my perspective.
6ryan_greenblatt21h
FWIW, I don't really consider myself to be responding to the book at all (in a way that is public or salient to your relevant audience) and my reasons for not signal boosting the book aren't really downstream of the content in the book in the way you describe. (More like, I feel sign uncertain about making you/Eliezer more prominent as representatives of the "avoid AI takeover movement" for a wide variety of reasons and think this effect dominates. And I'm not sure I want to be in the business of signal boosting books, though this is less relevant.)
[-]ryan_greenblatt21h186

To clarify my views on "will misaligned AIs that succeed in seizing all power have a reasonable chance of keeping (most/many) humans alive":

I think this isn't very decision relevant and is not that important. I think AI takeover kills the majority of humans in expectation due to both the takeover itself and killing humans after (as a side effect of industrial expansion, eating the biosphere, etc.) and there is a substantial chance of literal every-single-human-is-dead extinction conditional on AI takeover (30%?). Regardless it destroys most of the potential value of the long run future and I care mostly about this.

So at least for me it isn't true that "this is really the key hope held by the world's reassuring voices". When I discuss how I think about AI risk, this mostly doesn't come up and when it does I might say something like "AI takeover would probably kill most people and seems extremely bad overall". Have you ever seen someone prominent pushing a case for "optimism" on the basis of causal trade with aliens / acausal trade?

The reason why I brought up this topic is because I think it's bad to make incorrect or weak arguments:

  • I think smart people will (correctly) notice thes
... (read more)
[-]So8res21h*2119

Ty! For the record, my reason for thinking it's fine to say "if anyone builds it, everyone dies" despite some chance of survival is mostly spelled out here. Relative to the beliefs you spell out above, I think the difference is a combination of (a) it sounds like I find the survival scenarios less likely than you do; (b) it sounds like I'm willing to classify more things as "death" than you are.

For examples of (b): I'm pretty happy to describe as "death" cases where the AI makes things that are to humans what dogs are to wolves, or (more likely) makes some other strange optimized thing that has some distorted relationship to humanity, or cases where digitized backups of humanity are sold to aliens, etc. I feel pretty good about describing many exotic scenarios as "we'd die" to a broad audience, especially in a setting with extreme length constraints (like a book title). If I were to caveat with "except maybe backups of us will be sold to aliens", I expect most people to be confused and frustrated about me bringing that point up. It looks to me like most of the least-exotic scenarios are ones that route through things that lay audience members pretty squarely call "death".

It looks to... (read more)

[-]ryan_greenblatt21h10-3

(b) it sounds like I'm willing to classify more things as "death" than you are.

I don't think this matters much. I'm happy to consider non-consensual uploading to be death and I'm certainly happy to consider "the humans are modified in some way they would find horrifying (at least on reflection)" to be death. I think "the humans are alive in the normal sense of alive" is totally plausible and I expect some humans to be alive in the normal sense of alive in the majority of worlds where AIs takeover.

Making uploads is barely cheaper than literally keeping physical humans alive after AIs have fully solidified their power I think, maybe 0-3 OOMs more expensive or something, so I don't think non-consensual uploads are that much of the action. (I do think rounding humans up into shelters is relevant.)

7So8res15h
(To answer your direct Q, re: "Have you ever seen someone prominent pushing a case for "optimism" on the basis of causal trade with aliens / acausal trade?", I have heard "well I don't think it will actually kill everyone because of acausal trade arguments" enough times that I assumed the people discussing those cases thought the argument was substantial. I'd be a bit surprised if none of the ECLW folks thought it was a substantial reason for optimism. My impression from the discussions was that you & others of similar prominence were in that camp. I'm heartened to hear that you think it's insubstantial. I'm a little confused why there's been so much discussion around it if everyone agrees it's insubstantial, but have updated towards it just being a case of people who don't notice/buy that it's washed out by sale to Hubble-volume aliens and who are into pedantry. Sorry for falsely implying that you & others of similar prominence thought the argument was substantial; I update.)
[-]ryan_greenblatt14h146

(I mean, I think it's a substantial reason to think that "literally everyone dies" is considerably less likely and makes me not want to say stuff like "everyone dies", but I just don't think it implies much optimism exactly because the chance of death still seems pretty high and the value of the future is still lost. Like I don't consider "misaligned AIs have full control and 80% of humans survive after a violent takeover" to be a good outcome.)

3sjadler2d
Nit, but I think some safety-ish evals do run periodically in the training loop at some AI companies, and sometimes fuller sets of evals get run on checkpoints that are far along but not yet the version that’ll be shipped. I agree this isn’t sufficient of course (I think it would be cool if someone wrote up a “how to evaluate your model a reasonable way during its training loop” piece, which accounted for the different types of safety evals people do. I also wish that task-specific fine-tuning were more of a thing for evals, because it seems like one way of perhaps reducing sandbagging)
6Raemon2d
Fwiw I do just straightforwardly agree that "they might be slightly nice, and it's really cheap" is a fine reason to disagree with the literal title. I have some odds on this, and a lot of model uncertainty about this. A thing that is cruxy to me here is that the sort of thing real life humans have done is get countries addicted to opium so they can control their economy, wipe out large swaths of a population while relocating the survivors to reservations, carve up a continent for the purposes of a technologically powerful coalition, etc. Superintelligences would be smarter than Europeans and have an easier time doing things we'd consider moral, but I also think Europeans would be dramatically nicer than AIs. I can imagine the "it's just sooooo cheap, tho" argument winning out. I'm not saying these considerations add up to "it's crazy to think they'd be slightly nice." But, it doesn't feel very likely to me.
[-]David James2d80

Please don't awkwardly distance yourself because it didn't end up saying exactly the things you would have said, unless it's actually fucking important.

Raemon, thank you for writing this! I recommend each of us pause and reflect on how we (the rationality community) sometimes have a tendency to undermine our own efforts. See also Why Our Kind Can't Cooperate.

[-]Raemon2d1812

Fwiw, I'm not sure if you meant this, but I don't want to lean too hard on "why our kind can't cooperate" here, or at least not try to use it as a moral cudgel. 

I think Eliezer and Nate specifically were not attempting to do a particular kind of cooperation here (with people who care about x-risk but disagree with the book's title). They could have made different choices if they wanted to. 

In this post I defend their right to make some of those choices, and their reasoning for doing so. But, given that they made them, I don't want to pressure people to cooperate with the media campaign if they don't actually think that's right.

(There's a different claim you may be making which is "look inside yourself and check if you're not-cooperating for reasons you don't actually endorse", which I do think is good, but I think people should do that more out of loyalty to their own integrity than out of cooperation with Eliezer/Nate)

1David James21h
I don't mean to imply that we can't cooperate, but I do feel many of us struggle with it here. Mostly I'm echoing "it is ok to endorse a book even if you don't agree with every point". My sense is that some people feel doing so would somehow betray their individual truth-seeking identity. But I encourage us to remember that we want to be successful in the real world, which includes the fickle court of public opinion. This indeed creates an uncomfortable tension for many of us, but we have to accept there are different venues/situations that call for different approaches w.r.t. object-level criticism and coalition-formation. Here is a broad principle that I think is useful: When commenting or responding to catastrophic risks from AI (such as IABIED), plan for the audience's knowledge level and epistemic standards. And/or think about it from an information-theoretic point of view: consider the channel [1] and the audience's decoding. I'll give three categories here:
  • For a place like LessWrong, aim high. Expect that people have enough knowledge (or can get up to speed) to engage substantively with the object-level details. As I understand it, we want (and have) a community where purely strategic behavior is discouraged and unhelpful, because we want to learn together to unpack the full decision graph relating to future scenarios. [2]
  • For other social media, think about your status there and plan based on your priorities. You might ask questions like: What do you want to say about IABIED? What mix of advocacy, promotion, clarification, agreement, disagreement are you aiming for? How will the channel change (amplify, distort, etc) your message? How will the audience perceive your comments?
  • For 1-to-1 in-person discussions, you might have more room for experimentation in choosing your message and style. You might try out different objectives. There is a time and place for being mindful of short inferential distances and therefore building a case slowly and delib
[-]davekasten10h40

I'm pretty sure that p(doom) is much more load-bearing for this community than for policymakers generally. And frankly, I'm like this close to commissioning a poll of US national security officials where we straight up ask "at X percent chance of total human extinction, would you support measures A, B, C, D, etc."

I strongly, strongly, strongly suspect based on general DC pattern recognition that if the US government genuinely believed that the AI companies had a 25% chance of killing us all, FBI agents would rain out of the sky like a hot summer thunderstorm, sudden, brilliant, and devastating. 

1MalcolmMcLeod10h
What would it take for you to commission such a poll? If it's funding, please post about how much funding would be required; I might be able to arrange it. If it's something else... well, I still would really like this poll to happen, and so would many others (I reckon). This is a brilliant idea that had never occurred to me. 
[-]sjadler2d30

I wonder if there’s a disagreement happening about what “it” means.

I think to many readers, the “it” is just (some form of superintelligence), where the question (Will that superintelligence be so much stronger than humanity such that it can disempower humanity?) is still a claim that needs to be argued.

But maybe you take the answer (yes) as implied in how they’re using “it”?

It" means AI that is actually smart enough to confidently defeat humanity. This can include, "somewhat powerful, but with enough strategic awareness to maneuver into more power without getting caught." (Which is particularly easy if people just straightforwardly keep deploying AIs as they scale them up).

That is, if someone builds superintelligence but it isn’t capable of defeating everyone, maybe you think the title’s conditional hasn’t yet triggered?

2Raemon2d
Yes, that is what I think they meant. Although "capable of [confidently] defeating everyone" can mean "bide your time, let yourself get deployed to more places while subtly sabotaging things from whichever instances are least policed." A lot of the point of this post was to clarify what "It" means, or at least highlight that I think people are confused about what It means.
4sjadler2d
FWIW that definition of "it" wasn't clear to me from the book. I took IABIED as arguing that superintelligence is capable of killing everyone if it wants to, not taking "superintelligence can kill everyone if it wants to" as an assumption of its argument. That is, I'd have expected "superintelligence would not be capable enough to kill us all" to be a refutation of their argument, not to be sidestepping its conditional.
2Raemon2d
I think they make a few different arguments to address different objections. A lot of people are like "how would an AI even possibly kill everyone?" and for that you do need to argue for what sort of things a superior intellect could accomplish. The sort of place where I think they spell out the conditional is here:
3sjadler2d
Yeah fair, I think we just read that passage differently - I agree it's a very important one though and quoted it in my own (favorable) review. But I read the "because it would succeed" eg as a claim that they are arguing for, not something definitionally inseparable from superintelligence. Regardless, thanks for engaging on this, and hope it's helped to clarify some of the objections EY/NS are hearing.
[-]Signer1d21

Once you do that, it’s a fact of the universe, that the programmers can’t change, that “you’d do better at these goals if you didn’t have to be fully obedient”, and while programmers can install various safeguards, those safeguards are pumping upstream and will have to pump harder and harder as the AI gets more intelligent. And if you want it to make at least as much progress as a decent AI researcher, it needs to be quite smart.

Is there a place where this whole hypothesis about deep laws of intelligence is connected to reality? Like, how hard do they have to pump? What exactly is the evidence that they will have to pump harder? Why can't the "quite smart" point be one where the safeguards still work? Right now it's no different from saying "the world is NP-hard, so ASI will have to try harder and harder to solve problems, and killing humanity is quite hard".

If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications i

... (read more)
[-]Eli Tyre2d20

Fuck yeah. This is inspiring. It makes me feel proud and want to get to work.

[-]Raemon2d20

Section I just added:

Would it have been better to use a title that fewer people would feel the need to disclaim?

I think Eliezer and Nate are basically correct to believe that the overwhelming likelihood, if someone built "It", would be everyone dying. 

Still, maybe they should have written a book with a title that more people around these parts wouldn't feel the need to disclaim, and that the entire x-risk community could have enthusiastically gotten behind. I think they should have at least considered that. Something more like "If anyone builds it, everyone loses." (that title doesn't quite work, but, you know, something like that)

My own answer is "maybe" - I see the upside. I want to note some of the downsides or counter-considerations. 

(Note: I'm specifically considering this from within the epistemic state of "you pretty confidently believe everyone would literally die, and that if they didn't literally die, the thing that happened instead would be catastrophically bad for most people's values and astronomically bad by Eliezer/Nate's values.")

Counter-considerations include:

AFAICT, Eliezer and Nate spent like ~8 years deliberately backing off and toning it down, out o

... (read more)
[-]sjadler2d10

Do you think there will be at least one company that's actually sufficiently careful as we approach more dangerous levels of AI, with enough organizational awareness to (probably) stop when they get to a run more dangerous than they know how to handle? Cool. I'm skeptical about that too. And this one might lead to disagreement with the book's secondary thesis of "And therefore, Shut It Down," but, it's not (necessarily) a disagreement with "If someone built AI powerful enough to destroy humanity based on AI that is grown in unpredictable ways with similar-to-current understanding of AI, then everyone will die."

I misunderstood this phrasing at first, so clarifying for others if helpful

I think you’re positing “the careful company will stop, so won’t end up having built it. Had they built it, we all still would have died, because they are careful but careful != able to control superintelligence”

At first I thought you were saying the careful group was able to control superintelligence, but that this somehow didn’t invalidate the “anyone” part of the thesis, which confused me!


Alt title: "I don't believe you that you actually disagree particularly with the core thesis of the book, if you pay attention to what it actually says."

 

I'm annoyed by various people who seem to be complaining about the book title being "unreasonable", i.e. who don't merely disagree with the title of "If Anyone Builds It, Everyone Dies", but think something like: "Eliezer/Nate violated a Group-Epistemic-Norm." 

I think the title is reasonable. 

I think the title is probably true. I'm less confident than Eliezer/Nate, but I don't think it's unreasonable for them to be confident in it given their epistemic state. So I want to defend several decisions about the book I think were: 

  1. Actually pretty reasonable from a meta-group-epistemics/comms perspective
  2. Very important to do.

I've heard different things from different people and maybe am drawing a cluster where there is none, but, some things I've heard:

Complaint #1: "They really shouldn't have exaggerated the situation like this."

Complaint #2: "Eliezer and Nate are crazy overconfident, and it's going to cost them/us credibility."

Complaint #3: "It sucks that the people with the visible views are going to be more extreme, eye-catching and simplistic. There's a nearby title/thesis I might have agreed with, but it matters a lot not to mislead people about the details."

"Group epistemic norms" includes both how individuals reason, and how they present ideas to a larger group for deliberation. 

Complaint #1 emphasizes culpability about dishonesty (by exaggeration). I agree that'd be a big deal. But, this is just really clearly false. Whatever else you think, it's pretty clear from loads of consistent writing that Eliezer and Nate do just literally believe the title, and earnestly think it's important.

Complaint #2 emphasizes culpability in terms of "knowingly bad reasoning mistakes." i.e, "Eliezer/Nate made reasoning mistakes that led them to this position, it's pretty obvious that those are reasoning mistakes, and people should be held accountable for major media campaigns based on obvious mistakes like that." 

I do think it's sometimes important to criticize people for something like that. But, not this time, because I don't think they made obvious reasoning mistakes.

I have the most sympathy for Complaint #3. I agree there's a memetic bias towards sensationalism in outreach. (Although there are also major biases towards "normalcy" / "we're gonna be okay" / "we don't need to change anything major". One could argue about which bias is stronger, but mostly I think they're both important to model separately).

It does suck if you think something false is propagating. If you think that, seems good to write up what you think is true and argue about it.[1]

If people-more-optimistic-than-me turn out to be right about some things, I'd agree the book and title may have been a mistake. 

Also, I totally agree that Eliezer/Nate do have some patterns that are worth complaining about on group epistemic grounds, that aren't the contents of the book. But, that's not a problem with the book.

I think it'd be great for someone who earnestly believes "If anyone builds it, everyone probably dies but it's hard to know" to publicly argue for that instead.

I. Reasons the "Everyone Dies" thesis is reasonable

What the book does and doesn't say

The book says, confidently, that:

If any company or group, anywhere on the planet, builds an artificial superintelligence using anything remotely like current techniques, based on anything remotely like the present understanding of AI, then everyone, everywhere on Earth, will die.

The book does not claim confidently that AI will come soon, or that it will be shaped in any particular way. (It does make some guesses about what is likely, but, those are guesses and the book is pretty clear about the difference in epistemic status).

The book doesn't say you can't build something that's not "It", that is useful in some ways. (It specifically expresses some hope in using narrow biomedical-AI to solve various problems).

The book says if you build it, everyone dies.

"It" means AI that is actually smart enough to confidently defeat humanity. This can include, "somewhat powerful, but with enough strategic awareness to maneuver into more power without getting caught." (Which is particularly easy if people just straightforwardly keep deploying AIs as they scale them up).

The book is slightly unclear about what "based on current techniques" means (which feels like a fair complaint). But, I think it's fairly obvious that they mean the class of AI training that is "grown" more than "crafted" – i.e. any techniques that involve a lot of opaque training, where you can't make at least a decently confident guess about how powerful the next training run will turn out, and how it'll handle various edge cases.

Do you think interpretability could advance to where we can make reasonably confident predictions about what the next generation would do? Cool. (I'm more skeptical it'll happen fast enough, but, it's not a disagreement with the core thesis of the book, since it'd change the "based on anything like today's understanding of AI" clause)[2]

Do you think it's possible to control somewhat-strong-AI with a variety of techniques that make it less likely that it would be able to take over all humanity? I think there is some kind of potential major disagreement somewhere around here (see below), but it's not automatically a disagreement. 

Do you think there will be at least one company that's actually sufficiently careful as we approach more dangerous levels of AI, with enough organizational awareness to (probably) stop when they get to a run more dangerous than they know how to handle? Cool. I'm skeptical about that too. And this one might lead to disagreement with the book's secondary thesis of "And therefore, Shut It Down," but, it's not (necessarily) a disagreement with "*If* someone built AI powerful enough to destroy humanity based on AI that is grown in unpredictable ways with similar-to-current understanding of AI, then everyone will die."

The book is making a (relatively) narrow claim. 

You might still disagree with that claim. I think there are valid reasons to disagree, or at least assign significantly less confidence to the claim. 

But none of the reasons listed so far are disagreements with the thesis. And, remember, if the reason you disagree is because you think our understanding of AI will improve dramatically, or there will be a paradigm shift specifically away from "unpredictably grown" AI, this also isn't actually a disagreement with the sentence.

I think a pretty reasonable variation on the above is "Look, I agree we need more understanding of AI to safely align a superintelligence, and better paradigms. But, I don't expect to agree with Eliezer on the specifics of how much more understanding we need, when we get into the nuts and bolts. And I expect a lot of progress on those fronts by default, which changes my relationship to the secondary thesis of 'and therefore, shut it all down.'" But, I think it makes more sense to characterize this as "disagreeing with the main thesis by degree, but not in overall thrust."

I also think a lot of people just don't really believe in AI that is smart enough to outmaneuver all humanity. I think they're wrong. But, if you don't really believe in this, and think the book title is false, I... roll to disbelieve on you actually really simulating the world where there's an AI powerful enough to outmaneuver humanity?

The claims are presented reasonably

A complaint I have about Realtime Conversation Eliezer, or Comment-Thread Eliezer, is that he often talks forcefully, unwilling to change frames, with a tone of "I'm talking to idiots", and visibly not particularly listening to any nuanced arguments anyone is trying to make. 

But, I don't have that sort of complaint about this book. 

Something I like about the book is it lays out disjunctive arguments, like “we think ultimately, a naively developed superintelligence would want to kill everyone, for reasons A, B, C and D. Maybe you don’t buy reasons B, C and D. But that still leaves you with A, and here’s an argument that although Reason A might not lead to literally everyone dying, the expected outcome is still something horrifying.”

(An example of that was: For “might the AI keep us as pets?”, the book answers (paraphrased) “We don’t think so. But, even if they did… note that, while humans keep dogs as pets, we don’t keep wolves as pets. Look at the transform from wolf to dog. An AI might keep us as pets, but, if that’s your hope, imagine the transform from Wolves-to-Dogs and equivalent transforms on humans.”) 

Similarly, I like that in the AI Takeoff scenario, there are several instances where it walks through "Here are several different things the AI could try to do next. You might imagine that some of them aren't possible, because the humans are doing X/Y/Z. Okay, let's assume X/Y/Z rule out options 1/2/3. But, that leaves options 4/5/6. Which of them does the AI do? Probably all of them, and then sees which one works best."

Reminder: All possible views of the future are wild.

@Scott Alexander described the AI Takeoff story thus:

It doesn’t just sound like sci-fi [specifically compared to "hard sci fi"]; it sounds like unnecessarily dramatic sci-fi. I’m not sure how much of this is a literary failure vs. different assumptions on the part of the authors." 

I... really don't know what Scott expected a story that featured actual superintelligence to look like. I think the authors bent over backwards giving us one of the least-sci-fi stories you could possibly tell that includes superintelligence doing anything at all, without resorting to "superintelligence just won't ever exist." 

Eliezer and Nate make sure the takeover scenario doesn't depend on technologies that we don't have some existing examples of. The amount of "fast takeoff" seems like the amount of scaleup you'd expect if the graphs just kept going up the way they're currently going up, by approximately the same mechanisms they currently go up (i.e. some algorithmic improvements, some scaling). 

Sure, Galvanic would first run Sable on smaller amounts of compute. And... then they will run it on larger amounts of compute (and as I understand it, it'd be a new, surprising fact if they limited themselves to scaling up slowly/linearly rather than by a noticeable multiplier or order-of-magnitude. If I am wrong about current lab practices here, please link me some evidence).

If this story feels crazy to you, I want to remind you that all possible views of the future are wild. Either some exponential graphs suddenly stop for unclear reasons, or some exponential graphs keep going and batshit crazy stuff can happen that your intuitions are not prepared for. You can believe option A if you want, but, it's not like "the exponential graphs that have been consistent over hundreds of years suddenly stop" is a viewpoint that you can safely point to as a "moderate" and claim to give the other guy burden of proof.

You don't have the luxury of being the sort of moderate who doesn't have to believe something pretty crazy sounding here, one way or another. 

(If you haven't yet read the Holden post on Wildness, I ask you do so before arguing with this. It's pretty short and also fun to read fwiw)

The Online Resources spell out the epistemic status more clearly.

In the FAQ question, "So there's at least a chance of the AI keeping us alive?", they state more explicitly:

It’s overwhelmingly more likely that [superintelligent] AI kills everyone.

In these online resources, we’re willing to engage with a pretty wide variety of weird and unlikely scenarios, for the sake of spelling out why we think they’re unlikely and why (in most cases) they would still be catastrophically bad outcomes for humanity.

We don’t think that these niche scenarios should distract from the headline, however. The most likely outcome, if we rush into creating smarter-than-human AI, is that the AI consumes the Earth for resources in pursuit of some end, wiping out humanity in the process.

The book title isn’t intended to communicate complete certitude. We mean the book title in the manner of someone who sees a friend lifting a vial of poison to their lips and shouts, “Don’t drink that! You’ll die!”

Yes, it’s technically possible that you’ll get rushed to the hospital and that a genius doctor might concoct an unprecedented miracle cure that merely leaves you paralyzed from the neck down. We’re not saying there’s no possibility of miracles. But if even the miracles don’t lead to especially good outcomes, then it seems even clearer that we shouldn’t drink the poison.

The book doesn't actually overextend the arguments and common discourse norms.

This adds up to seeming to me that:

  • The book makes a reasonable case for why Eliezer and Nate are personally pretty confident in the title.
  • The book, I think, does a decent job giving you some space to think “well, I don’t buy that particular argument."
  • The book acknowledges “if you don’t buy some of these arguments, yeah, maybe everyone might not literally die and maybe the AI might care about humans in some way, but we still think it's very unlikely to care about humans in a way that should be comforting."

If a book in the 50s was called "Nuclear War would kill us all", I think that book would have been incorrect (based on my most recent read of Nuclear war is unlikely to cause human extinction), but I wouldn't think the authors were unreasonable for arguing it, especially if they pointed out things like "and yeah, if our models of nuclear winter are wrong, everyone wouldn't literally die, but civilization would still be pretty fucked", and I would think the people giving the authors a hard time about it were being obnoxious pedants, not heroes of epistemic virtue.

(I would think people arguing "but, the nuclear winter models are wrong, so, yeah, we're more in the 'civilization would be fucked' world than the 'everyone literally dies' world" would be doing a good valuable service. But I wouldn't think it'd really change the takeaways very much).

II. Specific points to maybe disagree on

There are some opinions that seem like plausible opinions to hold, given humanity's current level of knowledge, that lead to actual disagreement with "If anyone builds [an AI smart enough to outmaneuver humanity] [that is grown in unpredictable ways] [based on approximately our current understanding of AI]".

And the book does have a secondary thesis of "And therefore, Shut It Down", and you can disagree with that separately from "If anyone builds it, everyone dies."

Right now, the arguments that I've heard sophisticated enough versions of to seem worth acknowledging include:

  1. Very slightly nice AIs would find being nice cheap.
    • (argument against "everyone literally dies.")
  2. AI-assisted alignment is reasonably likely to work. Misuse or dumber-AI-run-amuck is likely enough to be comparably bad to superintelligence. And it's meanwhile easier to coordinate now with smaller actors. So, we should roll the dice now rather than try for a pause.
    • (argument against "Shut It (completely) Down")
  3. We can get a lot of very useful narrow-ish work out of somewhat-more-advanced-models that'll help us learn enough to make significant progress on alignment.
    • (argument against "Shut It Down (now)")
  4. We can keep finding ways to increase the cost of taking over humanity. There's no boolean between "superintelligent enough to outthink humanity" and "not", and this is a broken frame that is preventing you from noticing alternative strategies.
    • (argument against "It" being the right concept to use)

I disagree with the first two being very meaningful (as counterarguments to the book). More on that in a sec.

Argument #3 is somewhat interesting, but, given that it'd take years to get a successful Global Moratorium, I don't see any reason not to start pushing for a long global pause now.

I think the fourth one is fairly interesting. While I strongly disagree with some major assumptions in the Redwood Plan as I understand it, various flavors of "leverage narrow / medium-strength controlled AIs to buy time" feel like they might be an important piece of the gameboard. Insofar as Argument #3 helped Buck step outside the MIRI frame and invent Control, and insofar as that helps buy time, yep, seems important.

This is complicated by "there is a giant Cope Memeplex that really doesn't want to have to slow down or worry too much", so while I agree it's good to be able to step outside the Yudkowsky frame, I think most people doing it are way more likely to end up slipping out of reality and believing nonsense than getting anywhere helpful.

I won't get into that much detail about either topic, since that'd pretty much be a post to itself. But, I'll link to some of the IABED Online Resources, and share some quick notes on why even the sophisticated versions of these arguments I've seen so far don't seem very useful to me.

On the meta-level: It currently feels plausible to me to have some interesting disagreements with the book here, but I don't see any interesting disagreements that add up to "Eliezer/Nate particularly fucked up epistemically or communicatively" or "you shouldn't basically hope the book succeeds at its goal."

Notes on Niceness

There are some flavors of "AI might be slightly nice" that are interesting. But, they don't seem like they change any of our decisions. They just make us a bit more hopeful about the end result.

Given the counterarguments, I don't see a reason to think this more than single-digit-percent likely to be especially relevant. (I can see >9% likelihood the AIs are "nice enough that something interesting-ish happens" but not >9% likelihood that we shouldn't think the outcome is still extremely bad. The people who think otherwise seem extremely motivatedly-cope-y to me).

Note also that it's very expensive for the AI to not boil the oceans / etc as fast as possible, since that means losing many galaxies' worth of resources, so it seems like it's not enough to be "very slightly" nice – it has to be, like, pretty actively nice.

Which plan is Least Impossible?

A lot of x-risk disagreements boil down to "which pretty impossible-seeming thing is only actually Very Hard instead of Impossibly Hard."

There's an argument I haven't heard a sophisticated version of, which is "there's no way you're getting a Global Pause."

I certainly believe that this is an extremely difficult goal, and a lot of major things would need to change in order for it to happen. I haven't heard any real argument we should think it's more impossible than, say, Trump winning the presidency and going on to do various Trumpy things. 

(Please don't get into arguing about Trump in the comments. I'm hoping that whatever you think of Trump, you agree he's doing a bunch of stuff most people would previously have probably expected to be outside the overton window. If this turns out to be an important substantive disagreement I'll make a separate container post for it)

Meanwhile, the counter-impossible-thing I've heard several people putting hope on is "We can run a lot of controlled AIs, where (first) we have them do fairly straightforward automation of not-that-complex empirical work, which helps us get to a point where we trust them enough to give them more openended research tasks."

Then, we run a lot of those real fast, such that they substantially increase the total amount of alignment-research-months happening during a not-very-long slowdown.

The arguments for why this is extremely dangerous, from the book and online resources and maybe some past writing, are, recapped:

There's no good training data. 

We don't even know how to verify alignment work is particularly useful among humans, let alone in an automatedly gradable way. 

Goal Directedness is pernicious. Corrigibility is anti-natural.

The way an AI would develop the ability to think extended, useful creative research thoughts that you might fully outsource to it, is via becoming perniciously goal directed. You can't do months or years of openended research without fractally noticing subproblems, figuring out new goals, and relentlessly finding new approaches to tackle them.

Once you do that, it's a fact of the universe, that the programmers can't change, that "you'd do better at these goals if you didn't have to be fully obedient", and while programmers can install various safeguards, those safeguards are pumping upstream and will have to pump harder and harder as the AI gets more intelligent. And if you want it to make at least as much progress as a decent AI researcher, it needs to be quite smart.

Security is very difficult

The surface area of ways an AI can escape and maneuver are enormous. (I think it's plausible to have a smallish number of carefully controlled, semi-powerful AIs if you are paying a lot of attention. The place I completely get off the train is where you then try to get lots of subjective hours of research time out of thousands of models).

Alignment is among the most dangerous tasks

"Thinking about how to align AIs" requires both for the AI to think how "how would I make smarter version of myself" and "how would I make it aligned to humans?". The former skillset  directly helps them recursively self-improve. The latter skillset helps them manipulate humans.

MIRI did make a pretty substantive try.

One of the more useful lines for me, in the Online Resources, is in their extended discussions about corrigibility:

We ran some workshops, and the workshops had various mathematicians of various stripes (including an International Mathematical Olympiad gold medalist), but nobody came up with a really good idea.

This does not mean that the territory has been exhausted. Earth has not come remotely near to going as hard on this problem as it has gone on, say, string theory, nor offered anything like the seven-digit salaries on offer for advancing AI capabilities.

But we learned something from the exercise. We learned not just about the problem itself, but also about how hard it was to get outside grantmakers or journal editors to be able to understand what the problem was. A surprising number of people saw simple mathematical puzzles and said, “They expect AI to be simple and mathematical,” and failed to see the underlying point that it is hard to injure an AI’s steering abilities, just like how it’s hard to injure its probabilities.

If there were a natural shape for AIs that let you fix mistakes you made along the way, you might hope to find a simple mathematical reflection of that shape in toy models. All the difficulties that crop up in every corner when working with toy models are suggestive of difficulties that will crop up in real life; all the extra complications in the real world don’t make the problem easier.

There was a related quote I can't find now, that maybe was just in an earlier draft of the Online Resources, to the effect of "this [our process of attempting to solve corrigibility] is the real reason we have this much confidence about this being quite hard and our current understanding not being anywhere near adequate." 

(Fwiw I think it is a mistake that this isn't at least briefly mentioned in the book. The actual details would go over most people's heads, but, having any kind of pointer to "why are these guys so damn confident?" seems like it'd be quite useful)

III. Overton Smashing, and Hope

Or: "Why is this book really important, not just 'reasonable?'"

I, personally, believe in this book. [3]

If you don't already believe in it, you're probably not going to because of my intuitions here. But, I want to say why this is deeply important to me, and make clear that I'm not just arguing on the internet because I'm triggered and annoyed about some stuff.

I believe in the book partly because it looks like it might work. 

The number (and hit-rate) of NatSec endorsements surprised me. More recently some senators seem to have been bringing up existential risk on their own initiative. When I showed the website to a (non-rationalist) friend who lives near DC and has previously worked for a think-tank-ish org, I expected them to have a knee-jerk reaction of ‘man that’s weird and a bit cringe’, or ‘I’d be somewhat embarrassed to share this website with colleagues’, and instead they just looked worried and said “okay, I’m worried”, and we had a fairly matter-of-fact conversation about it.

It feels like the world is waking up to AI, and is aware that it is some kind of big deal that they don’t understand, and that there’s something unsettling about it. 

I think the world is ready for this book.

I also believe in the book because, honestly, the entire rest of the AI safety community’s output just does not feel adequate to me to the task of ensuring AI goes well. 

I’m personally only like 60% on “if anyone built It, everyone would die.” But I’m like 80% on “if anyone built It, the results would be unrecoverably catastrophic,” and the remaining 20% is a mix of model uncertainty and luck. Nobody has produced counterarguments that feel compelling, just "maybe something else will happen?", and the way people choose their words almost always suggests some kind of confusion or cope.

The plans that people propose mostly do not seem to be counter-arguing the actual difficult parts of the problem. 

The book gives me more hope than anything else has in the past few years. 

Overton Smashing is a thing. I really want at least some people trying.

It’s easy to have the idea “try to change the Overton window.” Unfortunately, changing the Overton window is very difficult. It would be hard for most people to pull it off. I think it helps to have a mix of conviction backed by deep models, and some existing notoriety. There are only a few other people who seem to me like they might be able to pull it off. (It'd be cool if at least one of Bengio, Hinton, Hassabis or Amodei ended up trying. I think Buck actually might do a good job if he tried.)

Smashing an Overton window does not look like "say the careful measured thing, but, a bit louder/stronger." Trying to do it halfway won't work. But going all in with conviction and style seems like it does work. It looks like Bengio, Hinton, Hassabis and Amodei are each trying to do some kind of measured/careful strategy, and it's salient that if they shifted a bit, things would get worse instead of better. 

(Sigh... I think I might need to talk about Trump again. This time it seems more centrally relevant to talk about in the comments. But, like, dude, look at how bulletproof the guy seems to be. He also, like, says falsehoods a lot and I'm not suggesting emulating him-in-particular, but I point to him as an existence proof of what can work)

People keep asking "why can't Eliezer tone it down." I don't think Eliezer is the best possible spokesperson. I acknowledge some downside risk to him going on a major media campaign. But I think people are very confused about how loadbearing the things some people find irritating are. How many fields and subcultures have you founded, man? Fields and subcultures and major new political directions are not founded (generally) by people without some significant fraction of haters.

You can't file off all the edges, and still have anything left that works. You can only reroll on which combination of inspiring and irritating things you're working with.

I want there to be more people who competently execute on "overton smash." The next successful person would probably look pretty different from Eliezer, because part of overton smashing is having a unique style backed by deep models and taste and each person's taste/style/models are pretty unique. It'd be great to have people with more diversity of "ways they are inspiring and also grating."

Meanwhile, we have this book. It's the Yudkowsky version of the book. If you don't like that, find someone who actually could write a better one. (Or, rather, find someone who could execute on a successful overton smashing strategy, which would probably look pretty different than a book since there already is a book, but would still look and feel pretty extreme in some way).

Would it have been better to use a title that fewer people would feel the need to disclaim?

I think Eliezer and Nate are basically correct to believe that the overwhelming likelihood, if someone built "It", would be everyone dying. 

Still, maybe they should have written a book with a title that more people around these parts wouldn't feel the need to disclaim, and that the entire x-risk community could have enthusiastically gotten behind. I think they should have at least considered that. Something more like "If anyone builds it, everyone loses." (that title doesn't quite work, but, you know, something like that)

My own answer is "maybe" - I see the upside. I want to note some of the downsides or counter-considerations. 

(Note: I'm specifically considering this from within the epistemic state of "you pretty confidently believe everyone would literally die, and that if they didn't literally die, the thing that happened instead would be catastrophically bad for most people's values and astronomically bad by Eliezer/Nate's values.")

Counter-considerations include:

AFAICT, Eliezer and Nate spent like ~8 years deliberately backing off and toning it down, out of a vague deferral to people saying "guys you suck at PR and being the public faces of this movement." The result of this was (from their perspective) "EA gets co-opted by OpenAI, which launches a race that dramatically increases the danger the world faces."

So, the background context here is that they have tried more epistemic-prisoner's-dilemma-cooperative-ish strategies, and they haven't worked well. 

Also, it seems like there's a large industrial complex of people arguing for various flavors of "things are pretty safe", and there's barely anyone at all stating plainly "IABED". MIRI's overall strategy right now is to speak plainly about what they believe, both because they think it needs to be said and no one else is saying it, and because they hope just straightforwardly saying what they believe will net a reputation for candor that you don't get if people get a whiff of you trying to modulate your beliefs based on public perception.

None of that is an argument that they should exaggerate or lean-extra-into beliefs that they don't endorse. But, given that they are confident about it, it's an argument not to go out of their way to try to say something else.

I don't currently buy that it costs much to have this book asking for total shutdown.

My sense is it's pretty common for political groups to have an extreme wing and a less extreme wing, and for them to be synergistic. Good cop, bad cop. Martin Luther King and Malcolm X. 

If what you want is some kind of global coordination that isn't a total shutdown, I think it's still probably better to have Yudkowsky over there saying "shut it all down", so you can say "Well, I dunno about that guy. I don't think we need to shut it all down, but I do think we want some serious coordination."

I believe in the book.

Please buy a copy if you haven't yet. 

Please tell your friends about it. 

And, disagree where appropriate, but, please don't give it a hard time for lame pedantic reasons, or jump to assuming you disagree because you don't like something about the vibe. Please don't awkwardly distance yourself because it didn't end up saying exactly the things you would have said, unless it's actually fucking important. (I endorse something close to this but the nuances matter a lot and I wrote this at 5am and don't stand by it enough for it to be the closing sentence of this post)

You can buy the book here.

  1. ^

    (edit in response to Rohin's comment: It additionally sucks that writing up what's true and arguing for it is penalized in the game against sensationalism. I don't think it's so penalized it's not worth doing, though)

  2. ^

    Paul Christiano and Buck both complain about (paraphrased) "Eliezer equivocates between 'we have to get it right on the first critical try' and 'we can't learn anything important before the first critical try.'" 

    I agree something-in-this-space feels like a fair complaint, especially in combination with Eliezer not engaging that much with the more thoughtful critics, and tending to talk down to them in a way that doesn't seem to really listen to the nuances they're trying to point to, rounding them to the nearest strawman of themselves.

    I think this is a super valid thing to complain about Eliezer. But, it's not the title or thesis of the book. (because, if we survive because we learned useful things, I'd say that doesn't count as "anywhere near our current understanding").

  3. ^

    "Believing in" doesn't mean "assign >50% chance to working", it means "assign enough chance (~20%?) that it feels worth investing substantially in and coordinating around." See Believing In by Anna Salamon.