Notes on fatalities from AI takeover

I think there's a fallacy in going from "slight caring" to "slight percentage of resources allocated."

Suppose that preserving earth now costs one galaxy that could be colonized later. Even if that one galaxy is merely one billionth of the total reachable number, it's still an entire galaxy ("our galaxy itself contains a hundred billion stars..."), and its usefulness in absolute terms is very large.

So there's a hidden step where you assume that the AIs that take over have diminishing returns (strong enough to overcome the hundred-billion-suns thing) on all their other desires for what they're going to do with the galaxies they'll reach, allowing a slight preference for saving Earth to seem worthwhile. Or maybe they have some strong intrinsic drive for variety, like how if my favorite fruit is peaches I still don't buy peaches every single time I go to the supermarket.

If I had no such special pressures for variety, and simply valued peaches at $5 and apples at $0.05, I would not buy apples 1% of the time, or dedicate 1% of my pantry to apples. I would just eat peaches. Or even if I had some modest pressure for variety but apples were only my tenth favorite fruit, I might be satisfied just eating my five favorite fruits and would never buy apples.

You describe this as "there are some reasons why small amounts of motivation don't suffice (given above) which are around 20% likely", but I think that's backwards. Small amounts of motivation by default don't suffice, but there's some extra machinery AIs could have that would make them matter.

Or to try a different analogy: Suppose a transformer model is playing a game where it gets to replace galaxies with things, and one of the things it can replace a galaxy with is "a preserved Earth." If actions are primitive, then there's some chance of preserving Earth, which we might call "the exponential of how much it cares about Earth, divided by the partition function" making a Boltzmann-rational modeling assumption. But if a model with similar caring it has to execute multi-step plans to replace each galaxy, then the probability of preserving Earth goes down dramatically, because it will have chances to change its mind and do the thing it cares for more (using Boltzmann-rationality assumption the other way). So in this toy example, a slight "caring" in the sense of what the model will say it would pick when quickly asked isn't represented when you look at the distribution of results of many-step plans.

If small motivations do matter, I think you can't discount "weird" preferences to do other things with Earth than preserve it. "Optimize Earth according to proxy X, which will kill all humans but really grow the economy / save the ecosystem / fill it with intelligent life / cure cancer / make it beautiful / maximize law-abiding / create lots of rewarding work for a personal assistant / really preserve Earth". Such motivations sound like they'd be small unless fairly directly optimized for, but the AI is supposed to be acting on small motivations, why not those bad ones, rather than the one we want?

[-]ryan_greenblatt2mo10-1

I think there's a fallacy in going from "slight caring" to "slight percentage of resources allocated."

Suppose that preserving earth now costs one galaxy that could be colonized later. Even if that one galaxy is merely one billionth of the total reachable number, it's still an entire galaxy ("our galaxy itself contains a hundred billion stars..."), and its usefulness in absolute terms is very large.

So there's a hidden step where you assume that the AIs that take over have diminishing returns

No, I'm not assuming diminishing returns or a preference for variety, when I talk about sufficient caring to result in a "extremely small amount of motivation" I just mean "caring enough to allocate ~1 / billionth to 1 / trillionth of our resources on it". This is a pretty natural notion of slight caring IMO. I agree 1 / billionth of resources will be absolutely very large to an AI and there is a type of caring which gets divided out by vast cosmic scale.

IMO, it's pretty natural for slight caring to manifest in this way and the thing you describe with peaches and apples is more like "slight preferences". I'm not saying "the AI will slightly prefer humans to survive over humans not surviving (but not more than what it could do with a marginal galaxy)", I mean that it will actually pay small costs at the margin of the actual decisions it has to make. I agree that AIs might prefer to preserve earth, but have this small preference swamped by the vastness of the galatic resources they would have to give up in exchange. I just wouldn't call this an amount of caring which suffices for paying 1 / billionth of resources.

I agree that the way this manifests will depend on the motivational structure of the AI, and motivational structures which purely care about the the ultimate arrangement of matter without regard to anything else will end up not caring at all about keeping humans alive (after all, why privilege humans, they just happened to be using the matter/energy to start with). (In the same way that these motivations would kill aliens to tile that matter with something that seems more optimal even though keeping these aliens alive might be insanely cheap as a fraction of resources.) Humans demonstrate motivations which don't just care about the ultimate arrangement of matter all the time, e.g. even relatively utilitarian people probably wouldn't kill everyone on earth to get an additional galaxy or kill aliens to get their stuff when it would be this cheap to leave them alone.

I think the relevant type of "kindness" is pretty natural, though obviously not guaranteed.

If small motivations do matter, I think you can't discount "weird" preferences to do other things with Earth than preserve it.

I don't discount this, see discussion in the section "How likely is it that AIs will actively have motivations to kill (most/many) humans". Note that it's pretty cheap to keep humans alive while also doing something that destroys earth.

[-]Charlie Steiner2mo106

This is a pretty natural notion of slight caring IMO

Agree to disagree about what seems natural, I guess. I think "slight caring" being relative more than absolute makes good sense as a way to talk about some common behaviors of humans and parliaments of subagents, but is a bad fit for generic RL agents.

[-]johnswentworth1mo72

It sounds like you are not not claiming that superintelligence will have human-like scope insensitivity baked into its preferences? Which seems like an absolutely bonkers thing to claim. "1 billionth of resources" does not at all seem like a natural way for "slight caring" to manifest in an actually-advanced mind; it seems like a thing which very arguably occurs in human minds but is particularly unlikely to generalize to superintelligence precisely because the generalized version would kneecap many general capabilities quite badly.

[-]ryan_greenblatt1mo21

It sounds like you are not not claiming that superintelligence will have human-like scope insensitivity baked into its preferences?

I think it's plausible ASI will have preferences which aren't totally linear returns-y and/or don't just care about the final arrangement of matter. These preferences might be very inhuman. Perhaps you think it's highly over determined that actually-advanced minds would only care about the utimate arrangement of matter at cosmic scales in a linear-ish way, but I don't think this is so obvious.

[-]ryan_greenblatt2mo70

Another way to put this: some types of caring are the types of caring where you're willing to allocate 1 / billionth of your resources on it even if you learn that you are extremely rich (so these resource could buy a lot of what you want in absolute terms). But you can also have preferences for what to spend your money on which depend strongly on what options are available; you'd be willing to trade some absolute amount of value for the thing and if it ends up being that your wealthier and can buy more in absolute terms, you'd be willing to spend a smaller fraction of money on the thing.

I'm predicting a type of caring/kindness which is the first rather than a type of preference which cares about questions like "exactly how big is the cosmic endowment".

[-]ryan_greenblatt2mo*3816

@So8res^[1] responded on twitter, copying his response here for completeness:

I appreciate your emphasis the bottom line that (IIUC) we agree upon: rogue superintelligence would permanently ruin the future and (at best) relegate humanity to something like zoo animals.
As I understand your arguments, they boil down to (a) maybe the AI will care a little and (b) maybe the AI will make backups of humans and sell them to aliens.
My basic response to (a) is: weak prefs about humans (if any) would probably get distorted, and existing ppl probably wouldn't be the optimum (and if they were, the result probably wouldn't be pretty). cf
My basic response to (b) is: I'm skeptical of your apparent high probabilities (>50%?) on this outcome, which looks somewhat specific and narrow to me. I also expect most implementations to route through a step most people would call "death".
In case (b), the traditional concept of "death" gets dodgy. (What if copies of you are sold multiple times? What if you only start running in an alien zoo billions of years later? What if the aliens distort your mind before restoring you?) I consider this topic a tangent.
Mostly I consider these cases to be exotic enough and death-flavored enough that I don't think "maybe AIs will sell backups of us to aliens" merits a caveat when communicating the basic danger. But I'm happy to acknowledge the caveat, and that some people think it's likely.

^{^}
If Nate cross posts himself, I'll delete this comment.

[-]Ben Pace1mo90

Link to Nate's comments.

[-]Max H2mo92

It's unclear what fraction of people die due to takeover because this is expedient for the AI, but it seems like it could be the majority of people and could also be almost no one. If AIs are less powerful, this is more likely (because AIs would have a harder time securing a very high chance of takeover without killing more humans).

Yeah, this (extinction to facilitate takeover) seems like the most plausible pathway to total or near-total extinction by far. An AI that is only a little bit smarter than humanity collectively has to worry about humans making a counter-move - launching missiles or building a competing AI or various kinds of sabotage. If you're a rogue AI, engineering a killer virus (something that smart humans can already do or almost do, if they wanted to) as soon as you or humanity has built out sufficient robotics infrastructure, makes all the subsequent parts of your takeover / expansion plan much less contingent and more straightforward to reason about. (And I think the analogy to historical relatively-bloodless coups here is a pretty weak counter / faint hope - for one, because human coup instigators generally still need humans to rule over, whereas AIs wouldn't.)

If there are a large number of different rogue AIs, it becomes more likely that one of them would benefit from massive fatalities (e.g. due to a pandemic) making this substantially more likely.

I don't see how the number of AIs makes a big difference here, rather than the absolute power level of the leading AI? An extinction or near-extinction event seems beneficial to just about any unaligned AI that is not all-powerful enough to not have to worry about humanity at all.

Put another way, the scenarios where extinction doesn't happen due to takeover only feel plausible in scenarios where a single AI fooms so fast and so hard that it can leave humanity alive without really sweating it. But if I understand the landscape of the discourse / disagreement here, these fast and discontinuous takeoff scenarios are exactly the ones that you and some others find the least plausible.

[-]ryan_greenblatt2mo8-2

human coup instigators generally still need humans to rule over, whereas AIs wouldn't

Sometimes? In countries where most wealth is mineral wealth, they actually need very few citizens. The more basic point is that human coups aren't particularly helped by indiscriminate killing while AI takeover might be. But I still think coups are surprisingly bloodless and often keeping coups bloodless is the optimal strategy for the person doing the coup. I think this transfers to AI.

I don't see how the number of AIs makes a big difference here, rather than the absolute power level of the leading AI?

My argument here is basically just that even if a given rogue AI wouldn't benefit from mass fatalities (which maybe you think is implausible, fair enough), if there are many rogue AIs in different situations, then mass fatalities become more likely.

If you think most humans dying as part of the takeover is over determined, then fair enough.

An extinction or near-extinction event seems beneficial to just about any unaligned AI that is not all-powerful enough to not have to worry about humanity at all.

I think this is pretty unclear because of the possiblity of responses from human actors and the potential for slowly building up control over time. E.g., in the AI 2027 scenario, killing (large fractions of) humans at the end didn't matter almost at all for the AI's takeover prospects and at each earlier point, killing tons of humans wouldn't be helpful as this would just cause variance and the potential for stronger responses while the AI can just build up power over time.

Edit: there is also a live question of "marginal returns to death"; e.g. does it increase your odds of takeover to kill 90% instead of 30%?

[-]anithite1mo10

Even if a coup is meant to capture mineral wealth and the population is irrelevant, coup leaders recognize that mass murder will lead to sanctions stopping them from selling that mineral wealth. Plenty of examples of regimes that kill even low thousands of people being sanctioned.

AI that plans to take over the world does not need to trade with humans or keep them from being horrified and lashing out. Kill approximately everyone is a viable strategy and preferrable in most cases since it removes us as an intelligent adversary.

[-]Gavin Runeblade1mo*1-2

"But I still think coups are surprisingly bloodless and often keeping coups bloodless is the optimal strategy for the person doing the coup. I think this transfers to AI."

I think you are conflating two separate goals, One of which applies to AI and the other of which doesn't. Control of government and control of the governed. Coups have constrained violence because the leaders of the coup explicitly want to rule the people and killing them reduces the rewards of winning.

This does not appear to apply to an AI coup as it would not benefit from having people to rule over.

[-]quetzal_rainbow2mo91

I don't feel like "slightly caring" imagery coherently describes any plausible feature of caring of misaligned AIs.

I think that if Earth-originated AI gets to choose between state of the world without humans and exactly the same world but with humans added, supermajority of AIs is going to choose the latter, because of central role humans are likely to take in training data. But "humans are more valuable than no humans ceteris paribus" doesn't say anything about whether configuration of matter that contains humans is literally the best possible configuration. Take "there are galaxies of computronium and solar system with humans" vs "there is one solar system of computronium more, no humans".

To put things on scale: on napkin estimate, plausible lower bound on achievable density of computations is ops/sec*kg. If we assume that for minimally comfortable live humans will need 100 times their weight and one human has weight 80kg, we will get that one second of existence of single human is worth ~ $10^{39}$ operations, which is approximately 3 billion years of computations on all currently existing computers. I think it sends idea of non-uploaded humans existing after misaligned takeover straight out of window.

[-]jessicata2mo81

Insofar as I buy this argument, I would apply it to my own CEV, hence concluding that on reflection I would most prefer a universe without biological humans. Then I notice I'm probably not special and probably a lot of other people's CEV generates this conclusion too. This seems kind of heretical but I'm not convinced it's wrong.

[-]ryan_greenblatt2mo60

Suppose humans retained control and came across some aliens who didn't want to be replaced with computronium and wanted to live on their planet. Do you think that humans would do something to these aliens as bad (from their perspective) as killing them?

I think this type of caring (where you respect the actual preferences of existing intelligent agents) seems reasonably plausible to me.

[-]quetzal_rainbow1mo40

I think humans don't have actual "respect for preferences of existing agents" in way that doesn't pose existential risks for agents weaker than them.

Imagine planet of conscious paperclippers. They are pre-Singularity paperclippers, so they are not exactly coherent single-minded agents, they have a lot of shards of desire and if you take their children and apply effort in their upbringing, they won't be single-minded paperclippers and they will have some sort of alien fun. But majority of establishment and conventional morality says that the best future outcome is to build superintelligent paperclip-maximizer and die, turning into paperclips. Yes, including children. Yes, they would strongly object if you try to divert them from this course. They won't take buying a lot of paperclips somewhere else, just like humanity won't take getting paperclipped in exchange of building Utopia somewhere else.

I actually don't know position of future humanity regarding this hypothetical, but I predict that siginificant faction would be really unhappy and demand violent intervention.

[-]StanislavKrym1mo10

Except that SOTA establishment and conventional morality are arguably either manageable or far from settled. There is an evolution-related case against the possibility that Paperclippers can arise naturally. What could be SOTA model organisms of alien cultures that we would call misaligned? The Aztec culture before being discovered? Or something more recent, like Kampuchea? And how does the chance that mankind deems a culture unworthy of continued existence depend on the culture's capabilities, including those of self-reflection?

[-]levsky71mo01

Humans will lose trust in media which they have now staked their daily existence on; news media and legal procedures will be altered by the influx of Sora-generated images and videos, raising the current level of national aggressions we see acted out now, and thanks to humans' underdeveloped emotional responses and highly developed weaponry, and the propensity of particularly the US culture toward violence, AI will aid in the spread of death and destruction.

[-]MinusGix1mo10

From my current stance, it is plausible, because we haven't settled how we think of aliens (especially those who are significantly outside of our behaviors) philosophically. I most likely don't respect arbitrary intelligent agents, as I'd be for getting rid of a vulnerable paperclipper if we found one on the far edges of the galaxy.

Then, I think you're not extrapolating mentally how much that computronium would give. From our current perspective the logic makes sense: where we upload the aliens regardless even if you respect their preferences beyond that, because it lets you simulate vastly more aliens or other humans at the same time.
I expect we care about their preferences. However those preferences will end up to some degree subordinate to our own preferences, the clear obvious being that we probably wouldn't allow them an ASI depending on how attack/defense works, but the other being that we may upload them regardless due to the sheer benefits.

Beyond that I disagree how common that motivation is. I think the kind of learning we know naturally results in that, limited social agents modeling each other in an iterated environment, is currently not on track to apply to AI.... and that another route is "just care strategically" especially if you're intelligent enough. I feel this is extrapolating a relatively modern human line of thought to arbitrary kinds of minds.

[-]jwfiredragon2mo10

I think it's plausible that an AI that has this kind of caring could exist, but actually getting this AI instead of one that doesn't care at all seems very unlikely.

IMO "respect the actual preferences of existing intelligent agents" is a very narrow target in mind-space. I.e. if we had any reason to believe the AI has a decent chance of being this kind of mind, the alignment problem would be 90% solved. The hard part is going from "AI that kills everyone" to "AI that doesn't kill everyone". Once you're there, getting to"AI that benefits humanity, or at least leaves for another star system" is comparatively trivial.

[-]TAG1mo*62

IMO “respect the actual preferences of existing intelligent agents” is a very narrow target in mind-space

The MindSpace of the Orthogonality Thesis is a set of possibilities. The random potshot version of the OT argument, has it that there is an even chance of hitting any mind, and therefore a high chance ideas of hitting an eldritch, alien mind. But equiprobability is only one way of turning possibilities into probabilities, and not particularly realistic. Random potshots aren't analogous to the probability density for action of building a certain type of AI, without knowing much about what it would be.

While, many of the minds in mindpsace are indeed weird and unfriendly to humans, that does not make it likely that the AIs we will construct will be. We are deliberately seeking to build duplicates of our minds for one thing, and have certain limitations, for another.

[-]jwfiredragon1mo1-2

we are deliberately seeking to build certainties of mind

I think "deliberately seeking to build" is the wrong way to frame the current paradigm - we're growing the AIs through a process we don't fully understand, while trying to steer the external behaviour in the hopes that this corresponds to desirable mind structures.

If we were actually building the AIs, I would be much more optimistic about them coming out friendly.

[-]TAG1mo*51

Not fully understanding things is the default ... even non AI software can't be fully understood if it is complex enough. We already know how to probe systems we don't understand apriori, through scientific experimentation. You don't have to get alignment right first time, at least not without the foom/RRSI or incorrigibility assumptions.

[-]jwfiredragon1mo10

The difference with normal software is that at least somebody understands every individual part, and if you collected all those somebodies and locked them in a room for a while they could write up a full explanation. Whereas with AI I think we're not even like 10% of the way to full understanding.

Also, if you're trying to align a superintelligence, you do have to get it right on the first try, otherwise it kills you with no counterplay.

[-]TAG1mo31

Also, if you’re trying to align a superintelligence, you do have to get it right on the first try, otherwise it kills you with no counterplay

That has not been demonstrated.

( "Gestures towards IABIED"

"Gestures towards critiques thereof")

[-]ryan_greenblatt2mo4-1

My claim is that you're reasonably likely to get a small preference for something like this by default even in the absence of serious effort to ensure alignment beyond what is commercially expedient. For instance, note that humans put some weight on "respect the actual preferences of existing intelligent agents" despite this being a "narrow target in mind-space".

[-]jwfiredragon1mo10

Analogy to humans feels like generalizing from one example to me. My prior is that minds evolved under different circumstances will have different desires, so we shouldn't expect an AI to robustly share any specific human value unless we can explain exactly how it develops that value.

But that aside, would you agree that if this were true, alignment should be fairly easy, because we just need to amplify the degree of caring?

[-]Raemon1mo70

So I think this topic isn't exactly cruxy for most decisions (except things routing through "is the IABIED thesis in particular wrong?". But, wanted to leave a quick note about where I ended up after thinking about it awhile longer:

tl;dr

"for trade" doesn't seem like it should be >50% likely to avoid something-like-death (and IMO, the only reason I'm not at <1% is model uncertainty)
- (for Pascal Wager Counterargument reasons)
"because they're actually nice" doesn't seem like it should be >25%-ish likely (and I'd personally go much lower than 25%)
- (for "uncertainty about how their values will structurally work" reasons)

(for time reasons I didn't doublecheck exactly what numbers you assigned to things and was going off memory, I may have misremembered and maybe you're not saying things that different from what I say here but it seemed like you were. If your numbers were similar to what I wrote above maybe this isn't a disagreement, just an elaboration)

Game Theory

Agreed there's some game theoretic reason to preserve humanity just-in-case, but, I don't see why it'd route through "forced uploading or stasis."

There might be entities that might pay a premium for humanity that was fully preserved, and got access to a full star system.

But there might also be entities that care about humanity, but, actively prefer a pristine stasis copy, rather than a "stunted" copy that evolved on it's own for millions of years in suboptimal conditions.^[1]" Or, lots of entities that prefer stasis copies because they are just intellectually interested in us rather than caring about us. Pascal's Wager has to account for multiple gods wanting multiple things.

So it seems wrong to put a >50% (or really even 10%) on "AI preserves humanity somehow for Trade Reasons" resulting in "we straightforwardly get to survive."

Just Being Nice because maybe it's very slightly aligned or niceness is very slightly schelling.

The general thrust of:

look, it really needs to be such a teeny amount nice for it to end up nice
it would be pretty weird if somehow it had nonzero propensity to be nice and the amount was 'less than Leave Earth Alone'
it would be pretty weird if it had zero niceness

seems at least plausibly pretty compelling. The main thing that felt sus about your other comments here was defining 'pretty nice' in terms of 'they will spend a fraction of their resources on us, no matter what, even if there are lots of other things they care about more', and then giving that a >50% likelihood of being how niceness would play out. (Tho I might have misread you there)

That's one way things could shake out. Two other ways are:

They care about us nonzero, but there are nigh-infinite things they also care about about-as-much-as-us-or-more, enough that the many galaxies of resources they lose by slowing down aren't clearly just more important
They care about us, but not in a 'fraction-of-resources' sense, but rather in a 'they'd save us all-else-equal, but, things are not equal, and even ignoring the many galaxies out there there's just something they'd rather do with one spare solar system more than us.

(I'd maybe include #3 "random other reasons we haven't thought about yet")

#1 and #2 are similar, but I intend to make fairly different points with them.

The structure of #1 is taking at face value them attempting to give us some small fraction of their resources, just, even when they do that, the teeny-percentage-cost of us is outweighed by there being vast numbers of things they also want to give a similar percentage to, and the number of galaxies they lose by slowing down a few weeks or months just dwarfs the amount they are able to devote.

The structure of #2 is "they just aren't even trying to give us a fraction of their wealthy, they're just weighing each block of resources and asking 'what's the best thing to do with this?' and we show up under consideration but below the threshold.

It seems very weird to be confident in your frame #0 of "they earnestly care about us nonzero, which they frame as a fraction of their wealth, and there are no competing things they care about" as noticeably more likely than the other two, without more argumenation than you've made thus far.

It seems like the structure of the "but it's so cheap" argument is like "the EV of them caring a little is pretty high, because it's so cheap" but that is different than "the chances of them structurally caring this way is >50%"

^{^}
(I guess at least a partial counterpoint is "of the aliens that care about our values/freedom, rather than us as intellectually curiosities, maybe they care a lot about us all getting preserved, and maybe it's about as hard to put us all in stasis as to just leave us alone." Seems seems like more a stretch and shouldn't be the mainline guess)

[-]ryan_greenblatt1mo20

I currently think "the AIs instrically care enough to spend >1 / 10 billion of resources keeping humans alive" is like (idk) 35% likely and that "acausal/causal trade would incentivize AIs to spend >1 / 10 billion of resources on keeping humans alive if the AIs care about acausal trade and there isn't some other entity paying even more for some other outcome" is like 75% likely.

(I think this is similar to what my view was when I wrote this post, but maybe a bit less optimistic after further reflection. I now think the chance that >50% of humans die (or have something happen to them that is similarly bad as death) due to rapid industrial expansion is more likely, maybe 35% and the chance of something which is effectively like extinction is maybe 25% (though the details of what counts as extinction might matter a bunch and uncertainty about this is driving a bunch of my change in views).)

Agreed there's some game theoretic reason to preserve humanity just-in-case [...] Pascal's Wager has to account for multiple gods wanting multiple things.

Isn't it kinda surprising if the highest bidder wants to do something to humans which is as bad or close to as bad as killing them? (As bad from the perspective of typical humans.)

Part of this is that I don't see this as "just in case", I'd say that it seems likely that someone is willing to compensate AIs for keeping humans alive and it's pretty plausible that the AIs have the smarts/compute to do actual acaual trade prior to them otherwise killing humans (due to rapid industrial expansion at least). This is messier if the best takeover strategies involve killing humans. E.g., in the AI 2027 race scenario, I think the AIs probably would have been able to do acausal trade reasonably prior to killing off the humans.

[-]Raemon1mo20

It occurs to me:

You are more optimistic than I that our current AIs will care enough to spend 1/billionth their resources on keeping us alive.

You are separately more optimistic than I, that one could expect the high bidders for trading "we saved the humans" to care about not merely out well being but our agency.

It seems like those maybe share a crux at how natural niceness is (which is... not exactly doublecounting, but, if you were to change your mind about that, probably both of those numbers drop. Is that right?)

[-]ryan_greenblatt1mo20

Yes, there is an underlying correlation. E.g., if I thought that humans on reflection wouldn't care at all about bailing out other humans and satisfying their preferences to remain physically alive this would be evidence on both trade and about AIs.

[-]Raemon1mo21

I currently think "the AIs instrically care enough to spend >1 / 10 billion of resources keeping humans alive" is like (idk) 35%

Seems high (factoring in the unknown unknowns of other things to care about), but, not crazy.

[-]Raemon1mo20

Isn't it kinda surprising if the highest bidder wants to do something to humans which is as bad or close to as bad as killing them? (As bad from the perspective of typical humans.)

I don't even think it's obvious most humans would/should prefer the non-upload route once they actually understood the situation (like it seems super reasonable to consider that "not death"), and is just a pretty reasonable thing for an AI to say "okay, I do think I just know better than you what you will want after you think about it for ~~5 minutes~~ like a month)

I also think plenty of high-bidders would have some motivation to help humans, but their goal isn't obviously "give them exactly what they want" as opposed to "give them a pretty nice zoo that is also optimized for some other stuff."

highest bidder

At the time the AI is making this call, it hasn't yet built Jupiter brains and will have uncertainty about the bid spread and who's out there and whether aliens or acausal trade are even real and which ones are easier to contact.

The upload version gives you a lot more option value – it's tradeable to the widest variety of beings, and at the very least you can always reconstruct the solar system, so the only people you're losing bargaining power with are the few aliens who strongly prefer "unmodified solar system continued" vs "reconstructing original unmodified solar system after the fact", which seems like a very weirdly specific thing to care that strongly about.

(Also, you might get higher bids if you're actually able to get multiple gods bidding for it. If you only did the non-stasis'd solar system version, you only get to trade with the Very Specific Altruists, and even if they are the highest bidders, you lose the ability to get the Weird Curious Zookeepers bidding the price up)

[-]ryan_greenblatt1mo20

Hmm, it seems that from your perspective "do non-consensual uploads (which humans would probably later be fine with) count as death" is actually a crux for fatality questions. I feel like this is a surprising place to end up because I think keeping humans physically alive isn't much more expensive and I expect a bunch of the effort to keep humans alive to be motivated by fulfilling their preferences (in a non-bastardized form) rather than by something else.

Intuitively, I feel tempted to call it not death if people would be fine with it on reflection but it seems like a mess and either way not that important.

the only people you're losing bargaining power with are the few aliens who strongly prefer "unmodified solar system continued" vs "reconstructing original unmodified solar system after the fact"

What about people who want you to not do things to the humans that they consider as bad as death (at least without further reflection).

[-]Raemon1mo60

Intuitively, I feel tempted to call it not death if people would be fine with it on reflection but it seems like a mess and either way not that important.

Nod, I think this is both fine, and, also, resolving the other which way would be fine.

"do non-consensual uploads (which humans would probably later be fine with) count as death" is actually a crux for fatality questions.

On my end the crux is more like "the space of things aliens could care about is so vast, it just seems so unlikely for it to line up exactly with the preferences of currently living humans." (I agree "respect boundaries" is a schelling value that probably has disproportionate weight, but there's still a lot of degree of freedom of how to implement that, and how to trade for it, and whether acausal economies have a lot of Very Oddly Specific Trades (i.e. saving a very specific group) going on that would cover it.

The question of whether "nonconsensual uploads that you maybe endorse later" is a question I end up focused on mostly because you're rejecting the previous paragraph,

What about people who want you to not do things to the humans that they consider as bad as death (at least without further reflection).

I agree that's a thing, just, there's lots of other things aliens could want.

(Not sure if cruxy, but, I think the aliens will care about respecting our agency more like the way we care about respecting trees agency, than the way we care about respecting dogs agency)

[-]Raemon1mo20

Or: "we will be more like trees than like dogs to them." Seems quite plausible they might be more wisely benevolent towards us than humans are towards trees currently.

But, it seems like an important intuition pump for how they'd be engaging with us and what sort of moral reflection they'd have to be doing.

i.e. on the "bacteria -> trees -> cats -> humans -> weakly superhuman LLM -> ... ??? ... -> Jupiter Brain that does acausal trades" spectrum of coherent agency and intelligence, it's not obvious we're more like Jupiter Brains or like trees.

(somewhere there's a nice Alex Flint post about how you would try to help a tree if you were vaguely aligned to it)

[-]mattmacdermott2mo72

We have a ~25% chance of extinction

Maybe add the implied 'conditional on AI takeover' to the conclusion so people skimming don't come away with the wrong bottom line? I had to go back through the post to check whether this was conditional or not.

[-]ryan_greenblatt2mo50

sure

[-]Raemon2mo62

Other entities that care trade with the AI (or with entities that the AI trades with) to keep humans alive. This includes acausal trade, ECL, simulations/anthropic capture (to be able to effectively acausally trade with decision theory naive AIs), and causal trade with aliens that the AI ends up encountering. It's unclear how simulations/anthropic capture works out. It seems plausible that this happens in a way which doesn't result in all relevant (acausal) trades happening. It seems plausible that other entities (including humans in other Everett branches) pretty universally don't want to bail humans (in this branch) out because they have better things to spend their resources on, but even a tiny fraction of entities spending some non-trivial fraction of resources could suffice. This depends on there being some beings with power who care about things like human survival despite alignment difficulties, but this feels very likely to me (even if alignment is very hard, there may well be more competent aliens or AIs who care a small amount about this sort of thing). Note that this requires the AI to care about at least some of these mechanisms which isn't obvious.

How much of your plausibility-of-caring-slightly routes through this, vs the "actually just slightly intrinisically nice?" thing.

In the previous discussion, you said:

I'm happy to consider non-consensual uploading to be death and I'm certainly happy to consider "the humans are modified in some way they would find horrifying (at least on reflection)" to be death. I think "the humans are alive in the normal sense of alive" is totally plausible and I expect some humans to be alive in the normal sense of alive in the majority of worlds where AIs takeover.
Making uploads is barely cheaper than literally keeping physical humans alive after AIs have fully solidified their power I think, maybe 0-3 OOMs more expensive or something, so I don't think non-consensual uploads are that much of the action. (I do think rounding humans up into shelters is relevant.)

I'm surprised at the "only 3 OOMs more expensive." I haven't done a calculation but that seems really implausible. Maybe I am not imagining how big an OOM is properly.

In particular, comparing "keep humans alive indefinitely, multigenerationally" vs "store them on ice without evening running them, turn them back on when/if you trade them."

The sort of entity that might pay extra for "keep them actually alive instead of on-ice"... probably also cares about them being alive and getting some kind of standard of living or something. (fyi it's not even obvious most people would prefer being alive in minimally-satisfying-shelters vs stored, or run in simulation at a higher standard of living)

Insofar as this option is loadbearing for "you put lowish odds on them killing everyone", I think a crux is not finding it very plausible that the AI would choose "keep fully alive in a way that is clearly better than death" vs "store digitally" or "upload." It's just a very narrow target for the game theory to work out such that this exact thing is what turns out to matter.

[-]Raemon2mo73

I hadn't seen this footnote yet when writing the above:

Consensual uploads, or uploads people are fine with on reflection, don't count as death. ↩︎

There's a range of stuff like "the AI manuevered us into a situation where we either die, or live in world with X standard of living, or get uploaded and get 1000x (or whatever) standard of living.

Most people on reflection choose uploading. I think it's reasonable to disagree with whether this counts as "death", but, this seems like pretty ambiguous consent at best to me, and while I think most fully informed people wouldn't count is as "death" per se, most people with their current set of beliefs/values would count it more like death than not-death, or, not be very impressed by arguments that it doesn't count as death.

(Curious if there is any kind of mechanical turk poll we could run that would change either of our minds about this? Not sure that it matters that much)

I think this sort of thing is also why I don't think the "AI may be very slightly nice" is likely to result in something that's clearly "not death". It seems really unlikely to me that very slightly nice things would go the "preserve parochial humans exactly as is", it's just such a narrow target to hit even within niceness.

[-]ryan_greenblatt2mo42

It's kinda messy how to think about non-consensual uploads that the person is totally fine with after some reflection (especially if the reason the AI did the upload is because it know the person would be fine with this on reflection). I also don't think considering uploads without informed consent to be death makes a big difference to my numbers.

I think this sort of thing is also why I don't think the "AI may be very slightly nice" is likely to result in something that's clearly "not death". It seems really unlikely to me that very slightly nice things would go the "preserve parochial humans exactly as is", it's just such a narrow target to hit even within niceness.

I don't really agree? It seems like "don't upload people without their consent when they consider this to be death unless there are no better options" is pretty natural and the increase in resources for keeping people physically alive is pretty small.

[-]Raemon2mo42

(FYI I think I am sold by your other reply that the main question is "how much is it slowed down initially, and how much does it value the resources that it loses by doing so?". I agree leaving us with one star isn't that big a deal, modulo making sure we can't somehow mess up the rest of it's plans, which doesn't seem that hard)

[-]ryan_greenblatt2mo2-2

How much of your plausibility-of-caring-slightly routes through this, vs the "actually just slightly intrinisically nice?" thing.

Maybe 60% trade, 40% intrinsic? Idk though.

I'm surprised at the "only 3 OOMs more expensive." I haven't done a calculation but that seems really implausible. Maybe I am not imagining how big an OOM is properly.

In particular, comparing "keep humans alive indefinitely, multigenerationally" vs "store them on ice without evening running them, turn them back on when/if you trade them."

The dominant cost of "keeping humans alive physically" from a linear-returns+patient industrial expansion perspective is a small delay from rounding humans up into shelters (that you have to build and supply etc) or avoiding boiling the oceans (and other catastrophic environmental damage). The dominant cost of uploads is the small delay from rounding up and scanning people before they die. They both seem small, but the delays seem comparable.

Giving humans an entire star long term (e.g. after a few decades) is negligible in cost (like, of all galactic resources) relative to this delay (edit: for patient AIs which don't prefer resources closer to earth), so I think keeping humans physically alive longer term is totally fine and just has higher upfront costs.

There's also the cost of not killing humans as part of a takeover attempt and not using them for some other purpose (reasons (1) and (3)), but these are equal between uploads and physically keeping humans alive. I wasn't intending to include this as I said "after AIs have fully solidified their power", but if I did, then this makes the gap much smaller depending on how important reason (1) is.

[-]Raemon2mo20

Mmm, nod. I maybe see it.

RE: "how long are you delaying?"

It seems like a major early choice is "is the AI basically using the Earth to bootstrap the intergalactic probe process" or "is the AI only using the amount of Earth resources that leaves it mostly functional from human's perspective, and then mostly using the rest of the planets."

You note "seems like it'd only slow down the AI a bit, to first round up humans". I haven't done a calculation about it, but, seemed potentially like a very big deal to decide between "full steam ahead on beginning the dyson sphere using Earth Parts" vs "start the process from the moon and then mercury", since there are harder-to-circumvent delays there.

[-]ryan_greenblatt2mo2-2

You note "seems like it'd only slow down the AI a bit, to first round up humans". I haven't done a calculation about it, but, seemed potentially like a very big deal to decide between "full steam ahead on beginning the dyson sphere using Earth Parts" vs "start the process from the moon and then mercury", since there are harder-to-circumvent delays there.

Can't the AI proceed full steam ahead until it has enough industrial capacity to build shelters, then build shelters and put humans in the shelters (while pausing as needed at this point to avoid fatalities from environmental damage while humans are being rounded up), and then finish the industrial expansion (possibly upgrading shelters along the way as needed as the AI gets more resources)? Seems like naively this only delays you for as long as is needed to round up humans and up them in shelter which seems probably <1 year and probably <1 month. (At least if takeoff is pretty fast.)

Separately, not destroying the earth (and instead doing more of the growth in space) seems like it should cost <3 years of delay and probably <1 year of delay which is still pretty small as an absolute fraction of resources (for patient AIs). Like we're talking 1 / billion or something.

[-]Raemon2mo*20

I agree it's small as a fraction of resources, but it still seems very expensive in terms of total resources since that's a lotta galaxies falling outside the lightcone.

[-]Ben Pace1mo40

Curated. This is a topic that is getting a ton of attention at the minute, and the post does a decent job at laying out relevant considerations and summarizing some prior debate.

(I disagree with the conclusion but it seems positive to me to curate something with a different conclusion than the recent book on the issue.)

[-]ryan_greenblatt1mo20

I find it somewhat sad/depressing/unfortunate that this post got curated rather than a better post on the topic given that it's mostly pretty low effort notes from my perspective, but hopefully the post + the discussion in the comments are valuable in practice and there isn't currently a better post to curate.

[-]Bronson Schoen2mo40

One difficulty I have with these posts is I never have any idea how seriously to take any of the percentages or even the overall conclusion. My overall update is to doubt there’s a reasonable threshold of saying “we don’t know enough to put a number on this”.

[-]AnthonyC2mo64

Yeah, my own instinct is to just see if the results are interesting in such a way that if I believed them, it would meaningfully change what I thought was the best strategy. In this case, don't think so. Even what I see as a very optimistic set of assumptions still results in what I see as an unacceptably high risk of very bad outcomes. I do find the exploration itself interesting, though.

[-]Viliam1mo30

If the AI keeps humans as uploads, it does not necessarily mean that they are running. They could be paused, ready to get unpaused when there is a reason to do so (e.g. the AI meets another space AI which values biological minds). That is effectively the same as being dead, because the probability of meeting an alien AI that cares about humans is tiny. Paused uploads are cheaper than running uploads.

[-]ryan_greenblatt1mo20

I agree that paused uploads which never get unpaused count as death. However, it's extremely, extremely relatively cheap on a cosmic scale to run some uploads, so if you make the uploads and there is literally any motive to run them, I expect they get run.

[-]Raphael Roche1mo30

Many people here are more or less transhumanists, but I would bet that for a significant portion of the population, all these alternatives to death that you discuss (uploading, etc.) are actually even more frightening prospects than traditional death itself. For instance, many religious people who hope their souls will continue on would prefer death (especially if not too painful) rather than the horror sci-fi scenarios of being uploaded and traded to aliens to be kept in a zoo.

[-]Dagon2mo30

I mean, everyone is going to die, absent singularity-like advances in medical/upload capabilities. There are two different doom scenarios:

Everyone dies PREMATURELY and SUDDENLY in the next N (N < 100) years.
VASTLY fewer humans are born and thrive than otherwise in the next M (M > 500) years.

1 implies 2, but not the reverse. I give pretty high chances of #2, but much lower of #1, assuming my "median case" of ASI, being fairly slow and not-magically-super.

Note, I already had a pretty high probability of #1 from non-AI sources. The current revolution in modeling has increased it, but not by much.

[-]NicholasKees1mo20

Preserving earth (as in, not causing catastrophic environmental damage due to industrial expansion) is more expensive than keeping physical humans alive which is more expensive than only keeping humans alive as uploads.

Not saying this is likely, but if the AI is not speciesist in favor of humans (it does care about humans, but not more than e.g. whales or chimpanzees), then plans which end up protecting a large majority of humans look a whole lot more expensive overall.

[-]ryan_greenblatt1mo20

It seems pretty natural for the AI to end up caring more about the preference of humans than other beings, but point taken.

[-]NicholasKees1mo20

Even if it does care more about human beings on an individual basis, I think the argument holds (unless the difference in caring is extremely large). Including them in the calculus at all would increase the cost a lot (just considering the sheer number of them, and the logistic challenges of untangling the complex interdependent web of natural ecosystems).

[-]Seth Herd2mo2-2

Interesting and good breakdown.

I place much higher odds on the "death due to takeover" for a pretty specific reason. We seem to have an excellent takeover mechanism in place which kills all or most of us: nukes. We have a gun pointed at our collective heads, and it's deadlier to humans than AGIs.

Igniting a nuclear exchange and having just enough working robots, power sources, and industrial resources (factories, etc) to rebuild seems like a pretty viable route to fast takeover.

Igniting that exchange could be done via software intrusion or perhaps more easily by spoofing human communications to launch. This is basically the only use of deepfakes that really concerns me, but it concerns me a lot.

This becomes increasingly concerning to me in a multipolar scenario, which in turn seems all too likely at this point. Then every misaligned AGI is incentivized to take over as quickly as possible. A risky plan becomes more appealing if you have to worry that another AGI with different goals may launch their coup at any point.

This logic also applies if we solve alignment and have intent-aligned AGIs in a multipolar scenario with different human masters. I guess it also favors everyone who can, getting themselves or a backup to a safe location.

This is also conditional on progress in robotics vs. timelines; even with short timelines it seems like robotics will probably be far enough along for clumsy robots to build better robots. But here my knowledge of robotics, EMP effects, etc, fails. It does seem like a nontrivial chance that triggering a nuclear exchange is the easiest/fastest route to takeover.

One very well-informed individual told me nobody knows if any humans would survive a nuclear winter; but that's probably less important than whether the "winning" AGI/human wanted them to survive.

The number of likely deaths given takeover seems higher than your estimate if that logic about nukes as a route to takeover mostly goes through.

But the question of extinction still hinges largely on whether AGI has any interest in humanity surviving. I think you're assuming it will have a distribution of interests like humans do; I don't think that's a safe assumption at all, even given LLM-based AGI and noting that LLMs do really seem to have a distribution of motivations, including kindness toward humans.

I think how they reason about their goals once they're capable of doing so is very hard to predict, so I'd give it more like a 50% chance they wind up being effectively maximizers for whatever their top motivation happens to be. There seems to be some instrumental pressure toward reflective stability, and that might favor one motivation winning outright vs. sharing power within that mind. I wrote a little about that in Section 5 and more in the remainder of LLM AGI may reason about its goals and discover misalignments by default, but it's pretty incomplete.

Like the rest of alignment, it's uncertain. LLMs have some kindness, but that doesn't mean that LLM-based AGI will retain it. If we could be confident they would, we'd be a lot better set to solve alignment.

[-]ryan_greenblatt2mo60

I place much higher odds on the "death due to takeover" for a pretty specific reason. We seem to have an excellent takeover mechanism in place which kills all or most of us: nukes. We have a gun pointed at our collective heads, and it's deadlier to humans than AGIs.

I ended up feeling moderately persuaded by an argument that human coups typically kill a much smaller fraction of people in that country. I think some of the relevant reference classes don't look that deadly and then there are some specific arguments for thinking AIs will kill lots of people as part of takeover, and I overall come to a not-that-high bottom line.

The number of likely deaths given takeover seems higher than your estimate if that logic about nukes as a route to takeover mostly goes through.

My understanding is that a large scale nuclear exchange would probably kill well less than half of people (I'd guess <25%). The initial exchange would probably directly kill <1 billion and then I don't see a strong argument for nuclear winter killing far more people. This winter would be occuring during takeoff which has an unclear effect on the number of deaths.

But the question of extinction still hinges largely on whether AGI has any interest in humanity surviving. I think you're assuming it will have a distribution of interests like humans do

I don't think I'm particularly assuming this. I guess I think there is a roughly 50% chance that AI will be slightly kind in the sense needed to want to keep humans alive. Then, the rest is driven by trade. (Edit: I think trade is more important than slightly kind, but both are >50% likely.)

[-]Seth Herd2mo20

GPT5T roughly agrees with your estimate of nuclear winter deaths, so I'm probably way off. Looks like my information was well out of date; that conversation was from 2017 or so. And yes, deaths from winter-induced famine might be substantially mitigated if someone with AGI-developed tech bothered.

Thanks for the clairifications on trade vs kindness. I am less optimistic about trade, causal or acausal, than you, but it's a factor.

[-]Steven1mo10

Wouldn’t there be even cheaper ways to satisfy preferences about living humans? A fake, cheap version which satisfies that preference would probably be possible in the same way that a preference for a pet can be satisfied by a plush toy. Wanting humans or uploads but not being able to satisfy that desire with something fake seems like it isn’t how many of our actual desires work

I originally wrote this post to articulate why I thought the chance of literal extinction and number of expected fatalities were much higher than someone else thought, but it's also pretty relevant to ongoing discussion of the book "If anyone builds it, everyone dies". ↩︎
As in, all power of earth-originating civilizations. ↩︎
Within "death", I'm including outcomes like "some sort of modification happens to a person without their informed consent that they would consider similarly bad (or worse than death) or which effectively changes them into being an extremely different person, e.g., they are modified to have wildly different preferences such that they do very different things than they would otherwise do". This is to include (unlikely) outcomes like "the AI rewires everyone's brain to be highly approving of it" or similarly strange things that might happen if the AI(s) have strong preferences over the state of existing humans. Death also includes non-consensual uploads (digitizing and emulating someone's brain) insofar as the person wouldn't be fine with this on reflection (including if they aren't fine with it because they strongly dislike what happens to them after being uploaded). Consensual uploads, or uploads people are fine with on reflection, don't count as death. ↩︎
Concretely, literally every human in this universe is dead (under the definition of dead included in the prior footnote, so consensual uploads don't count). And, this happens within 300 years of AI takeover and is caused by AI takeover. I'll put aside outcomes where the AI later ends up simulating causally separate humans or otherwise instantiating humans (or human-like beings) which aren't really downstream of currently living humans. I won't consider it extinction if humans decide (in an informed way) to cease while they could have persisted or decide to modify themselves into very inhuman beings (again with informed consent etc.). ↩︎
As in, what would happen if we were in base reality or if we were in a simulation, what would happen within this simulation if it were continued in a faithful way. ↩︎
Outcomes for currently living humans which aren't death but which are similarly bad or worse than death are also possible. ↩︎

LESSWRONG
LW

LESSWRONG
LW

56

Notes on fatalities from AI takeover

56

56

Industrial expansion and small motivations to avoid human fatalities

How likely is it that AIs will actively have motivations to kill (most/many) humans

Death due to takeover itself

Combining these numbers