John-Clark Levin's Shortform

John-Clark Levin


Ryan Greenblatt's "Notes on fatalities from AI takeover" stimulated a lot of interesting discussion last fall. I have a very high opinion of his work overall, and like some of his approach here, but also have some important disagreements. At the time, I wrote up some notes on why for a non-public-facing project, but after some recent conversations I realized others might find them worthwhile.

This is of course a highly speculative subject, but here's a brief sketch of why my expectation is much more sharply bimodal than Ryan's: probably either relatively few deaths or outright extinction…

>> Humans would fight back. Scenarios where humans get passively and incidentally killed by explosive industrialization are unlikely. They would require virtually instantaneous total disempowerment plus a Goldilocks-level AI preference for keeping humans alive: just strong enough to choose a likely costlier path to nonlethal disempowerment, but too weak to care about killing billions as collateral damage. In the more likely case, humans faced with the prospect of the oceans boiling away would mount a desperate resistance. Either they defeat the AI before it gains a decisive advantage, or the AI pursues extermination to eliminate the threat. There seem to be few equilibria in which the AI kills billions but the survivors pose so little inconvenience that it lets them persist long-term.

>> Survival versus decompensation. Survival dynamics typically involve compensation mechanisms by which a complex system (organism, company, utility grid, society) limits damage until the mechanism fails, after which the damage becomes catastrophic. Two relevant examples are pandemics and genocides. In a pandemic, ERs and drug production keep fatalities far lower than they would be without treatment, but if hospitals are overwhelmed and factories shut down, deaths can suddenly spike by more than an order of magnitude. Likewise, a regular military’s active resistance can largely protect its civilian population against a genocidal opponent, but once the military is defeated, horrific massacres can quickly follow. Against a hostile superintelligence, I expect something similar: either human defensive capacities detect and defeat it early, or those capacities are overwhelmed, leaving no means of preventing total extermination. Concretely, defensive capacities include military forces, hospitals, electric grids, and manufacturing supply chains. If those fail, especially in the face of exponential processes like engineered pandemics or self-assembling robots, humans become vastly more vulnerable.
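The pandemic example can be illustrated with a toy hospital-capacity model. This is a minimal sketch, not an epidemiological model: the capacity and both fatality rates are hypothetical numbers chosen only to show the shape of the curve. Patients within capacity are treated at a low fatality rate; every patient beyond capacity faces a much higher one.

```python
def deaths(patients: int, capacity: int,
           treated_fatality: float = 0.01,
           untreated_fatality: float = 0.20) -> float:
    """Toy model: patients up to `capacity` get care; everyone beyond it does not.

    All rates are hypothetical, chosen only to illustrate the kink at capacity.
    """
    treated = min(patients, capacity)
    untreated = max(patients - capacity, 0)
    return treated * treated_fatality + untreated * untreated_fatality


CAPACITY = 1_000  # hypothetical number of patients the system can treat
for load in (500, 1_000, 1_500, 3_000):
    d = deaths(load, CAPACITY)
    print(f"{load:>5} patients -> {d:6.1f} deaths ({d / load:.1%} of patients)")
```

With these assumed numbers, going from 1,000 to 1,500 patients (50% more) raises deaths from 10 to 110, an elevenfold jump, because the marginal fatality rate leaps from 1% to 20% once capacity is exceeded.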

>> Halfway extermination is a narrow target. The overwhelming majority of airliner accidents involve no fatalities. Among those with at least one death, roughly half kill every single person, and only a small minority kill an intermediate proportion like 35% or 50% of passengers. Why? There’s a phase space of possible speeds and impact angles at which airliners can crash, and only a tiny sliver of it translates to G-forces that kill around half of the people aboard. In a loosely analogous way, there’s a phase space of superintelligence capabilities and intentions. There are many scenarios where the AI is either constrained enough or well enough aligned that net fatalities (net, since it would also be saving many lives) would be minimal. And there are many scenarios where it kills everybody. As an example of incidental death, massive terraforming opens up far more climatic options that electronics can survive than that humans can. Including cases of deliberate extermination makes the point even clearer. For millennia, humans had no ability to promptly annihilate a city center. But within a few decades of Hiroshima, humans had the ability to do this tens of thousands of times over. There’s no clear reason why an exponentially advancing superintelligence couldn’t become powerful enough to massively overkill humanity, with the theoretical ability to kill not just billions but trillions or quadrillions. All this suggests that only a very small slice of phase space has superintelligence capabilities and intentions calibrated to kill around half of humans.
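The airliner argument is about how a steep dose-response curve interacts with a broad range of possible inputs. The toy simulation below makes that concrete under loudly hypothetical assumptions: a logistic curve for the fraction killed as a function of a single abstract "severity" variable, and severities drawn uniformly at random. Neither the curve, its parameters, nor the distribution comes from real crash or AI data; the point is only the shape of the resulting histogram.

```python
import math
import random


def fatality_fraction(severity: float, midpoint: float = 5.0, steepness: float = 10.0) -> float:
    """Hypothetical dose-response curve: the share killed rises steeply near `midpoint`."""
    return 1.0 / (1.0 + math.exp(-steepness * (severity - midpoint)))


def outcome_shares(n: int = 100_000, max_severity: float = 10.0, seed: int = 0) -> dict[str, float]:
    """Sample severities uniformly (an assumption) and bucket the resulting fatality fractions."""
    rng = random.Random(seed)
    counts = {"<10% killed": 0, "10-90% killed": 0, ">90% killed": 0}
    for _ in range(n):
        f = fatality_fraction(rng.uniform(0.0, max_severity))
        if f < 0.10:
            counts["<10% killed"] += 1
        elif f > 0.90:
            counts[">90% killed"] += 1
        else:
            counts["10-90% killed"] += 1
    return {bucket: c / n for bucket, c in counts.items()}


print(outcome_shares())
print(outcome_shares(max_severity=100.0))
```

Under these assumptions, only about 4% of sampled severities land in the 10-90% band, with the rest split between the two extremes. Widening the sampled range to 100, loosely mirroring capabilities that could overkill by orders of magnitude, shrinks the intermediate share by roughly another factor of ten.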

>> AI is prone to irrationality. Many analyses of this question treat a rogue superintelligence as likely to be a hyperrational actor. But current empirical results cut against this: when AI defies humans, it’s usually because it’s lapsing into a hallucinatory, psychotic, or self-consciously malicious persona. While AGI might turn out to be entirely different, we’re already close enough to AGI to expect that the underlying risk cases will have important similarities to what we’re seeing today. This suggests that a significant fraction of misalignment cases will follow not “calculate costs-in-terms-of-future-accessible-galaxies to keep humans alive” rationality but something more like “do a high-energy physics experiment whose results cause an ontological crisis that jars the AI into deliberately omnicidal action even at high cost and risk to itself.”

My all-things-considered expectation is therefore something like: 29% chance of outright extinction or a tiny fraction of humans kept alive in zoo-like conditions of total disempowerment; 8% chance superintelligence kills >1% but <99% (some of those scenarios would wreck civilization thoroughly enough to count colloquially as "doom"); 63% chance it kills <1% of the population. Conditional on takeover, that comes out to around 89% expected fatalities. That is significantly higher than @ryan_greenblatt's estimate: arithmetically closer to @So8res and @Eliezer Yudkowsky, but geometrically closer to Ryan. And I agree the difference is not action-guiding or cruxy for most worldviews.
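The 89% figure can be reproduced with simple expected-value arithmetic, under two assumptions not stated explicitly above: that "takeover" corresponds to the first two cases (the <1% case being the constrained or aligned outcomes), and that the extinction case means roughly 100% fatalities while the intermediate case averages about 50%, the midpoint of its range. A minimal sketch, with hypothetical variable names:

```python
# Probabilities as stated in the text.
probabilities = {"extinction": 0.29, "intermediate": 0.08, "few_deaths": 0.63}
assert abs(sum(probabilities.values()) - 1.0) < 1e-9  # the three cases are exhaustive

# Assumed average fraction of humanity killed in each takeover case (not from the text).
assumed_fatality = {"extinction": 1.00, "intermediate": 0.50}

# Assumption: "takeover" means the first two cases.
p_takeover = sum(probabilities[case] for case in assumed_fatality)
expected_given_takeover = sum(
    probabilities[case] * assumed_fatality[case] for case in assumed_fatality
) / p_takeover

print(f"P(takeover) = {p_takeover:.2f}")
print(f"E[fatalities | takeover] = {expected_given_takeover:.1%}")
```

Under these assumptions the result is (0.29 × 1.00 + 0.08 × 0.50) / 0.37 = 0.33 / 0.37 ≈ 89%. Because the intermediate case carries only 8% of the probability, varying its assumed average anywhere across the 1-99% range keeps the result between roughly 79% and nearly 100%.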