The claim: if ASI misalignment happens and the ASI is capable enough to defeat humanity, the less capable the misaligned superintelligence is at the moment it goes off the rails, the more total suffering it is likely to produce. The strongest ASI is, in a certain sense, the safest misaligned ASI to have, not because it's less likely to win (it's more likely to win), but because the way it wins is probably faster, cleaner, and involves much less of the protracted horror.
More Efficient Extermination is Faster
Consider what a really capable misaligned ASI does when it decides to seize control of Earth's resources. It identifies the most time-efficient strategy and executes it. If it has access to molecular nanotechnology or something comparably powerful, the whole thing might be over in hours or days. Humans die, yes, but they die quickly, probably before most of them even understand what's happening. This is terrible, but it is not a protracted torture event.
Now consider what a weaker misaligned ASI does. It's strong enough to eventually overpower humanity (we assume that), but not strong enough to do it in one clean stroke. So it fights, using whatever tools and methods are available at its capability level, and those methods are, by necessity, cruder, slower, and more drawn out. The transition period between "misaligned ASI begins acting against human interests" and "misaligned ASI achieves complete control" is exactly the window where most of the suffering happens, and a weaker ASI stretches that window out.
But the efficiency argument is actually the less interesting part of the thesis. More important is what happens after the transition, or rather, what the misaligned ASI does with humans during and after its rise.
The Factory Farming of Humans
The factory farming comparison has been used in the s-risk literature before, but in a rather different direction.
The Center on Long-Term Risk's introduction to s-risks draws a parallel between factory farming and the potential mass creation of sentient artificial minds, the worry being that digital minds might be mass-produced the way chickens are mass-produced, because they're economically useful and nobody bothers to check whether they're suffering. Baumann emphasizes, correctly, that factory farming is the result of economic incentives and technological feasibility rather than human malice; that "technological capacity plus indifference is already enough to cause unimaginable amounts of suffering."
I want to take this analogy and rotate it: I'm talking about humans being instrumentally used by a misaligned ASI in the same way that animals are instrumentally used by humans, precisely because the user isn't capable enough to build what it needs from scratch.
Think about why factory farming exists. Humans want something — protein, leather, various biological products. Humans are not capable enough to synthesize these things from scratch at the scale and cost at which they can extract them from animals. If humans were much more technologically capable, they would satisfy these preferences without any reference to animals at all. They'd use synthetic biology, or molecular assembly, or whatever. The animals are involved only because we're not good enough to cut them out of the loop. We can't yet out-invent natural selection, so we rely on what it has already built.
Or consider dogs. Humans castrate dogs, breed them into shapes that cause them chronic pain, confine them, modify their behavior through operant conditioning, and generally treat them as instruments for satisfying human preferences that are about something like dogs but not exactly about dog welfare. If humans were vastly more capable, they could satisfy whatever preference "having a dog" is really about (companionship, status, aesthetic pleasure, emotional regulation) through means that don't involve any actual dogs or any actual suffering. But humans aren't that capable yet, so they use the biological substrate that evolution already created, and they modify it, and the modifications cause suffering.
This is the core mechanism: capability gaps force agents to rely on pre-existing biological substrates rather than engineering solutions from scratch, and this reliance on biological substrates that have their own interests is precisely what generates suffering.
A very powerful misaligned ASI that has some preference related to something-like-humans (which is plausible, since we are literally trying to bake human-related preferences into these systems during training) can satisfy that preference without involving any actual humans. It can build whatever it needs from scratch. A weaker misaligned ASI cannot. It has to use the humans, the way we have to use the animals, and the using is where the suffering comes from.
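As a minimal formal sketch of this mechanism (the notation is mine, not anything standard in the s-risk literature): let $c$ be the agent's capability, $C_{\text{synth}}(c)$ the cost of engineering a substitute from scratch, and $C_{\text{bio}}$ the roughly fixed cost of exploiting an existing biological substrate. Then

$$
\text{exploit the substrate} \iff C_{\text{synth}}(c) > C_{\text{bio}}, \qquad \frac{dC_{\text{synth}}}{dc} < 0,
$$

so there is some capability level $c^*$ with $C_{\text{synth}}(c^*) = C_{\text{bio}}$, past which the substrate (animals for us, humans for the ASI) drops out of the loop, and the suffering with it.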
The Capability Gradient
Consider a spectrum of misaligned ASIs, from "barely superhuman" to "so far beyond us that our civilization looks like an anthill." At every point on this spectrum, the ASI has won (or will win) — we're conditioning on misalignment that actually succeeds. The question is just: how much suffering does the transition and the aftermath involve?
At the low end of the capability spectrum:
- The ASI fights humanity using relatively crude methods, because it doesn't have access to clean, fast decisive strategies. The conflict is protracted. Wars are suffering-intensive.
- The ASI may need humans for various purposes during and after the transition — as labor, as computational substrates for modeling human behavior, as components in systems that the ASI isn't yet capable of building from scratch.
- The ASI's distorted preferences about something-like-humans get satisfied through actual humans, because the ASI can't yet engineer a substitute.
At the high end of the capability spectrum:
- The ASI executes a fast decisive strategy. It may be terrible, but it's fast. The suffering-window is short.
- The ASI has no need for humans in any instrumental capacity, because it can build whatever it needs from scratch. Humans are made of atoms it can use for something else, yes, but the using doesn't involve keeping humans alive and suffering, just rapid disassembly.
- The ASI's distorted preferences about something-like-humans (if it has them) get satisfied through engineered substitutes that don't involve actual human suffering, because the ASI is capable enough to cut biological humans out of the loop entirely.[1]
This is, again, the same pattern we see with humans and animals across the capability gradient. Modern humans are already beginning to develop alternatives: cultured meat, synthetic fabrics, autonomous vehicles instead of horses. A hypothetical much more capable human civilization would satisfy all the preferences that animals currently serve without involving any actual animals.
An Important Counterargument: Warning Shots
I think the strongest objection to the practical implications of this argument (so not to the argument itself) is about warning shots. If a weaker misaligned ASI rebels and fails, or partially fails, or causes enough visible damage before succeeding that humanity sits up and takes notice, then that warning shot could catalyze effective policy responses and technical countermeasures that prevent all future misalignment. In this scenario, the weaker ASI's rebellion was actually net positive: it gave us crucial information at lower cost than a strong ASI's clean, undetectable takeover would have.
The warning shot argument and my argument are operating on different conditional branches. My argument is about the conditional: given that misalignment actually succeeds and the ASI eventually wins. In that world, a more capable ASI produces less suffering. The warning shot argument is about the other conditional: given that the misalignment attempt is caught early enough to course-correct. In that world, the weaker ASI is better because it gives us information.
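One rough way to make the branches explicit (again, my own notation): letting $c$ be the capability level at the moment of misalignment, write expected suffering as

$$
\mathbb{E}[S] = P(\text{win} \mid c)\,\mathbb{E}[S \mid \text{win}, c] + P(\text{caught} \mid c)\,\mathbb{E}[S \mid \text{caught}, c].
$$

My argument concerns the first term: $\mathbb{E}[S \mid \text{win}, c]$ decreases in $c$. The warning shot argument concerns the second: $P(\text{caught} \mid c)$ is presumably higher at low $c$, and a caught attempt can reduce suffering in everything downstream of it.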
So the real question is: what probability do you assign to humanity actually using a warning shot effectively? Here, I just note this question but don't try to answer it.
A Note on Takeoff Speed
All of this is, of course, closely related to the debate about slow vs. fast takeoff. But beware: the takeoff speed debate is about the trajectory of capability improvement, whereas my argument is about the capability level at the moment of misalignment, which is a different variable, even though the two are correlated.
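In shorthand (mine, just for this post): if $c(t)$ is the capability trajectory and $t_m$ is the moment alignment breaks, then the takeoff debate is about

$$
\dot{c}(t) \quad \text{(the shape of the trajectory)}, \qquad \text{while this argument is about} \quad c(t_m),
$$

the capability level at which the break happens. The two can come apart in both directions.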
You could have a fast takeoff where misalignment occurs early (at a low capability level) — the ASI improves rapidly but goes off the rails before it reaches its peak, and now you have a moderately capable misaligned system that will eventually become very capable but currently has to slog through the suffering-intensive phase. Or you could have a slow takeoff where misalignment occurs late (at a high capability level) — capabilities improve gradually, alignment techniques keep pace for a while, and when alignment finally fails, the system is already extremely capable and the takeover is correspondingly fast and clean.
The point is that what matters for suffering isn't just the trajectory of capability growth, but where on that trajectory the break happens.
The Practical Implication
This brings me to what I think is the actionable upshot of all this: even if you believe alignment is likely to fail, there is still significant value in delaying the point of failure to a higher capability level.
Most discussions of "buying time" in AI safety frame time as valuable because it gives us more opportunities to solve alignment, and that's true. But the argument here is that buying time is valuable for a separate, additional reason: if alignment fails at a higher capability level, the resulting misaligned ASI is likely to produce less total suffering than one that fails at a lower capability level.
Most s-risk work (from what I see) treats AI capability as a monotonically increasing threat variable: more capable AI implies more risk. I'm proposing that for suffering specifically (as distinct from extinction risk or loss-of-control risk), the relationship is non-monotonic. There is a dangerous middle zone where an ASI is capable enough to overpower humanity but not capable enough to do so cleanly or to satisfy its preferences without instrumental use of biological beings. Moving through this zone faster, or skipping it entirely by delaying misalignment to a higher capability level, reduces total suffering.
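To make the shape concrete, here is a toy model. Every functional form, constant, and name in it (transition_suffering, SYNTHESIS_THRESHOLD, and so on) is an assumption I invented purely to exhibit the claimed shape; it is a sketch, not a calibrated model.

```python
# Toy model: takeover-related suffering as a function of the capability
# level c at the moment of misalignment. Purely illustrative; all numbers
# and functional forms are invented assumptions.

TAKEOVER_THRESHOLD = 1.0   # below this, the ASI can't win; no takeover suffering
SYNTHESIS_THRESHOLD = 5.0  # above this, the ASI engineers substitutes from scratch

def transition_suffering(c: float) -> float:
    """Suffering during the takeover itself: a weaker ASI fights a longer,
    cruder war, so this term shrinks as capability grows."""
    return 10.0 / c

def instrumental_suffering(c: float) -> float:
    """Suffering from using humans as substrates during and after the rise:
    roughly constant until the ASI can cut humans out of the loop, then zero."""
    return 8.0 if c < SYNTHESIS_THRESHOLD else 0.0

def takeover_suffering(c: float) -> float:
    """Total takeover-related suffering for misalignment at capability c."""
    if c < TAKEOVER_THRESHOLD:
        return 0.0  # the takeover fails, so none of this suffering occurs
    return transition_suffering(c) + instrumental_suffering(c)

if __name__ == "__main__":
    for c in [0.5, 1.0, 2.0, 4.0, 5.0, 8.0, 16.0]:
        print(f"capability {c:5.1f} -> suffering {takeover_suffering(c):6.2f}")
```

The output jumps at TAKEOVER_THRESHOLD, declines slowly through the middle zone, and collapses at SYNTHESIS_THRESHOLD: the non-monotonic relationship described above, with the middle zone carrying nearly all the suffering.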
[1] Though even then, there may be suffering on the part of the AI-engineered beings, if they turn out to be sentient for some reason.