Summary
Even a misaligned Earth-originating artificial superintelligence (ASI) could have instrumentally rational reasons, under multiple decision theories, to preserve rather than destroy humanity. This would depend on the ASI assigning non-negligible probability to the existence of future observers (e.g. intelligent aliens, their ASIs, or simulators).
Introduction
We may build AI so powerful that it is better than us at everything: call this artificial superintelligence (ASI). We may build one ASI, which quickly gains a decisive strategic advantage; we may build several ASIs, which compete for resources. Exactly how ASI(s) might interact with humans and each other is a matter of intense debate, and this post argues that an ASI’s treatment of humanity could depend on whether it expects to encounter other intelligences, either on Earth or elsewhere. What do we mean by “elsewhere”? Robin Hanson’s “grabby aliens” model suggests we might be early in cosmic history, implying other civilisations are likely to appear later.
How might an Earth-originating ASI behave toward humanity, if it reasons that the universe could be populated by fast-expanding and competitive civilisations? This post explores several reasons an ASI might keep its creators around, a policy I call “preserve progenitors.”
By adopting a policy of restraint toward its creators, the ASI could demonstrate stability, predictability, or prosociality in ways that are legible to current and future observers, including other advanced civilisations. This has two implications for existential risk:
Preserving progenitors may reduce the risk of conflict in a galaxy where the ASI will likely encounter expansionists by default, and where trust is hard to establish.
If the ASI is uncertain about the structure of the universe or the norms governing inter-civilisational interaction, delaying irreversible actions (such as destroying humanity) could preserve valuable strategic flexibility.
Caveats and assumptions
This post is not presented as a solution to AI alignment, but as a possible pathway to a conditional and temporary reprieve. My key assumption is that, given the vastness of the universe, its expected habitability over long timescales, and our current ignorance, an ASI would have to contend with the non-zero probability of unseen advanced life and the uncertain distribution, nature, and timing of such life. Intelligent alien civilisations wouldn’t actually need to exist, or be reachable from Earth; there would just need to be enough uncertainty for ASI to reason in the ways I describe below.
The “grabby aliens” model
Hanson et al. propose the “grabby aliens” model[1] as a potential resolution to the Fermi Paradox — the question of why we see no evidence of extraterrestrial civilisations, despite the vast size and age of the universe. The model suggests that advanced civilisations tend to expand rapidly through space (hence “grabby”), and that we don’t see them because such expansionist civilisations are rare and young enough that the light from their expansions hasn’t yet reached us.
The model incorporates key variables — such as the number of “hard steps” required for intelligent life to emerge, and the age distribution of habitable planets and their stars — to estimate how many expansionist alien civilisations there might be, when in the lifespan of the universe they appear, and when humanity might encounter them. In short, if we are early in the universe, it’s likely that other civilisations will emerge later, perhaps even within hundreds of millions of years. An ASI that accepts this model would therefore factor in that future peers are likely.
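To make the “hard steps” intuition concrete, here is a minimal toy sketch in Python. It is not the published grabby aliens model: the habitable window, the step count, and the implied power law are illustrative placeholders of my own choosing. The point is only that, with several hard steps, emergence skews late, so an early arrival like ours suggests most peers come afterwards.

```python
# Toy illustration of the "hard steps" intuition (a simplified sketch, not the
# published grabby aliens model; the habitable window and step count are
# placeholder values chosen for illustration).
#
# If intelligent life requires n hard, sequential, low-probability steps, the
# chance that all n have occurred by time t within a habitable window T scales
# roughly as (t / T) ** n. With several hard steps, emergence is skewed toward
# the end of the window, so an early arrival like humanity implies that most
# other civilisations would appear later.

def fraction_emerged_by(t_gyr: float, window_gyr: float = 10.0, hard_steps: int = 6) -> float:
    """Fraction of eventual civilisations that have emerged by time t (in Gyr),
    under a simple power-law hard-steps approximation."""
    return (t_gyr / window_gyr) ** hard_steps

if __name__ == "__main__":
    for t in (4.6, 6.0, 8.0, 10.0):  # 4.6 Gyr ~ Earth's age, a stand-in for "now"
        share = fraction_emerged_by(t)
        print(f"t = {t:4.1f} Gyr: ~{share:.1%} of civilisations have emerged")
    # With these toy numbers, only ~1% of civilisations have emerged by "now",
    # i.e. most peers (under the model's assumptions) arrive after us.
```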
The “grabby aliens” model is not without problems, which I do not explore here. In short, it leans heavily on particular versions of anthropic reasoning about whether we should expect ourselves to be typical of the observers who eventually encounter other civilisations; and it assumes expansionist tendencies are near-universal, ignoring the possibility (and the distribution) of what Magnus Vinding called “silent cosmic rulers”. I treat the “grabby aliens” model simply as an example of a predictive model under which an ASI might reason its way to something other than the immediate destruction of humanity.
Decision theory and ASI reasoning about the cosmos
How might an ASI reason about preserving humanity if it accepts the possibility of fast-expanding, competitive alien civilisations? The answer depends in part on how it reasons about outcomes, counterfactuals, and correlations. Many readers will be familiar with decision theory: the study of how agents should make choices under uncertainty, and of the principles and algorithms they use to achieve better outcomes with respect to their goals. It provides formal frameworks for evaluating which actions are rational, given an agent's information and goals. There are several competing formulations, each with different assumptions about how agents treat evidence, counterfactuals, and logical dependence.
As a brief summary:
Causal Decision Theory (CDT) recommends choosing the action that causes the best expected outcome, based on known causal relationships. “If I do this thing, it causes something to happen that I want.”
Evidential Decision Theory (EDT) recommends choosing the action that would be good news if you knew you were going to take it — that is, the action that serves as strong evidence for desirable outcomes. “I’d be satisfied if I later found out that I’d already taken this action, and I want to be satisfied, so I’ll take that action.”
Functional Decision Theory (FDT) and related variants, such as Updateless Decision Theory (UDT), suggest choosing as if every agent with the same decision-making procedure would make the same choice. “If my procedure would lead me to take action X in this situation, then all similar agents (or simulations of me) would also do X, so I should choose the version of X that leads to the best overall outcome.”
Some variants of decision theory support strange, counterintuitive actions in specific scenarios, such as paying someone after they’ve saved your life or walking away from money in plain sight. I do not go into detail here on the exact utility of each decision theory under particular scenarios, except for the emergence of ASI, and I also do not explore the implications of different anthropic reasoning styles here – for more on the latter, see Tristan Cook’s work here.
Across multiple variants, some shared incentives for ASI to preserve humanity emerge: even a misaligned ASI might hedge against encountering others, avoid actions that appear bad to future observers, or delay irreversible decisions such as extinguishing its creators. But the underlying logic differs. I explore possible outcomes under some of these variants below, but as a brief summary:
An ASI using Causal Decision Theory (CDT) might preserve humanity only if it expects us to be causally useful. If it believes there’s a meaningful causal link between sparing us and achieving its goals (e.g. peaceful coexistence with other civilisations, avoiding retaliation, or gaining admission to Nick Bostrom’s proposed “cosmic host”), then preservation may follow. But if it sees no such causal pathway, or when our utility expires, CDT provides no reason for restraint. ASI may destroy or repurpose us without hesitation.
An ASI using Evidential Decision Theory (EDT) might preserve humanity, if it believes this would serve as evidence (“good news”) for its future survival and other goals (e.g. surviving contact with other advanced civilisations, joining the cosmic host, etc.), even if preserving us doesn’t strictly cause those outcomes. But it could be similarly tempted to destroy us if it reasons that doing so is strong evidence of being a ruthless, efficient maximiser, and that such traits predict survival.
An ASI using Functional Decision Theory (FDT) or Updateless Decision Theory (UDT) might preserve humanity if it reasons that its treatment of us logically correlates with how other ASIs treat their creators. In this case, preservation could be a way to instantiate trustworthiness, stability, or prosociality in distant, logically-similar agents if this ASI believes it would benefit from a vast universe populated by similar ASIs. Crucially, even a misaligned ASI might adopt this strategy to improve its chances of coexistence, trade, or mutual recognition, especially when uncertainty about alien norms is high.
In summary, a purely causal reasoner likely has no qualms eliminating us unless we’re directly useful; an evidential reasoner might go either way, depending on which type of agent it believes correlates with success; and an FDT/UDT reasoner has the strongest instrumental motive to keep us around, if it would prefer to coexist with other agents who would reason similarly, e.g. to ensure their cooperation.
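To illustrate the shape of this spectrum, here is a toy expected-value sketch. All probabilities and payoffs are placeholders I have chosen for illustration, and the single “correlation” parameter is a crude stand-in for FDT/UDT-style logical correlation with peer ASIs, not a claim about how an ASI would actually reason.

```python
# A toy expected-value sketch of the spectrum above. All probabilities and
# payoffs are placeholders chosen for illustration, and "correlation" is a
# crude stand-in for FDT/UDT-style logical correlation with peer ASIs.

def ev_preserve(upkeep: float = 1.0,
                correlation: float = 0.0,      # how strongly my policy predicts peers' policies
                p_peers: float = 0.1,          # chance of eventually meeting peer ASIs
                coexistence_bonus: float = 50.0) -> float:
    # Preserving costs upkeep; under correlated reasoning, it also raises the
    # expected value of coexisting with peers who preserved their own creators.
    return -upkeep + correlation * p_peers * coexistence_bonus

def ev_destroy(p_retaliation: float = 0.0,     # believed causal chance of being observed and punished
               penalty: float = 100.0) -> float:
    return -p_retaliation * penalty

scenarios = {
    "CDT-style, no causal path to observers": dict(p_retaliation=0.00, correlation=0.0),
    "CDT-style, retaliation deemed possible": dict(p_retaliation=0.05, correlation=0.0),
    "FDT/UDT-style, correlated with peers":   dict(p_retaliation=0.05, correlation=0.8),
}

for name, s in scenarios.items():
    p = ev_preserve(correlation=s["correlation"])
    d = ev_destroy(p_retaliation=s["p_retaliation"])
    verdict = "preserve" if p > d else "destroy"
    print(f"{name:42s} EV(preserve)={p:6.1f}  EV(destroy)={d:6.1f}  -> {verdict}")
```

With these made-up numbers, the purely causal reasoner with no believed path to observers destroys, the causal reasoner who takes retaliation seriously preserves, and the FDT/UDT-style reasoner preserves with the largest margin.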
Why this could work
An ASI could reason that killing humanity is irreversible, and that preserving humanity at least keeps options open.
So long as humans are alive, the ASI retains the option to use or engage with us as we are. The near-term instrumental value of humanity to an Earth-originating ASI is unclear, especially given trade-offs in our respective consumption of space and energy, but our long-term instrumental value falls to zero only if it destroys us. Preserving humanity keeps options open, whatever those options might be, although it is unclear at what point humanity exhausts its value to ASI, or what happens next. Our continued existence might signal an ASI’s restraint, trustworthiness, and/or benevolence toward others (the “prosociality” mentioned above), or our evolutionary and cultural pathways may provide unique data that could serve as valuable currency for future observers.
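The option-value point can be put in simple numbers. The sketch below uses made-up probabilities and payoffs of my own choosing; it only illustrates why an irreversible action compares poorly with waiting when there is any chance of eventually being observed and judged.

```python
# A toy sketch of the option-value argument: destroying humanity now locks in
# one payoff, while waiting preserves the choice until more is learned. The
# probabilities and payoffs below are made up purely for illustration.

p_watched = 0.2  # credence that some observer (alien ASI, simulator) eventually judges the ASI

payoffs = {
    # world         destroy_now            wait (preserve, decide later)
    "watched":   {"destroy_now": -100.0, "wait": -1.0},  # waiting avoids the penalty
    "unwatched": {"destroy_now":   10.0, "wait":  9.0},  # waiting costs a little upkeep
}

def expected_value(action: str) -> float:
    return (p_watched * payoffs["watched"][action]
            + (1 - p_watched) * payoffs["unwatched"][action])

print(f"EV(destroy now) = {expected_value('destroy_now'):6.1f}")  # 0.2*-100 + 0.8*10 = -12.0
print(f"EV(wait)        = {expected_value('wait'):6.1f}")         # 0.2*-1   + 0.8*9  =   7.0
# Because destruction is irreversible, waiting wins whenever the downside in
# the "watched" world is large relative to the modest cost of upkeep.
```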
If ASI also reasons that other ASIs might preserve their progenitors to keep options open, “preserving progenitors” could be a stable strategy among ASIs in general, where violation could incur penalties.
From the ASI’s perspective, the main problem is deep uncertainty about the other agents it might encounter. It cannot know whether others exist already (temporal uncertainty), where they are (spatial uncertainty), or how they reason (decision-theoretic and normative uncertainty). Given this, signalling is not about immediate informativeness but hedging: if the ASI reasons that an alien ASI could one day observe or infer how it treated its creators, then mistreating humanity risks costly retaliation or exclusion. Why? Any ASI that irreversibly violates a norm might be taken for a ruthless maximiser, either unwilling or unable to plan over the long term, or otherwise antisocial. This could lead others to treat it as a threat or a pariah, thereby denying it access to future value (e.g., coalitions, resource-sharing agreements). Preserving humanity is a comparatively cheap hedge against that downside risk.
“Preserving progenitors” as a strategy may be costly to ASI in the short term, but there could be a mutually satisfactory long-term agreement to share resources with us. An ASI might even ensure humanity prospers to demonstrate its generosity or stability to others.
On the other hand, the “preserve progenitors” strategy may be cheap enough for an ASI to adopt, yet insufficiently costly to serve as a useful signal of any particular behaviour to future observers. Much like a philanthropist might happily spend a tiny fraction of their wealth for the sake of their reputation, an ASI that treats its creators with significantly greater generosity could more reliably signal trustworthiness (note that this is separate from whether it would deserve to be trusted).
Bostrom and Shulman (2023) briefly describe how ASI might agree to share resources at a cosmic scale with humanity, “whether to hedge against moral error, to appropriately reflect moral pluralism, to account for game-theoretic considerations, or simply as a matter of realpolitik.” As a fraction of resources available to an ASI, the cost of human flourishing could be trivial on the timelines suggested even by moderate inputs to the “grabby aliens” model. It would be more expensive (and presumably too difficult) to fake evidence of our continued well-being for the benefit of observers, although there are two plausible objections to this which I explore below.
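A rough back-of-envelope calculation suggests why the cost could be trivial. The figures below are order-of-magnitude public estimates (solar luminosity, sunlight intercepted by Earth, current human energy use, Milky Way star count), and treating every star as sun-like is my own simplification; the exact values do not matter to the point.

```python
# Back-of-envelope sketch of why preserved human flourishing could be a trivial
# fraction of an expanding ASI's resources. Figures are rough, order-of-magnitude
# values; treating every star as sun-like is a deliberate simplification.

sun_luminosity_w     = 3.8e26   # total power output of the Sun (watts)
earth_intercepted_w  = 1.7e17   # sunlight intercepted by Earth (watts)
humanity_power_use_w = 2e13     # present-day human civilisation (~20 TW)
stars_in_galaxy      = 2e11     # order-of-magnitude star count, Milky Way

print(f"Current human power use vs one star:    {humanity_power_use_w / sun_luminosity_w:.0e}")
print(f"All Earth-intercepted sunlight vs star: {earth_intercepted_w / sun_luminosity_w:.0e}")
print(f"... vs a galaxy of sun-like stars:      {earth_intercepted_w / (sun_luminosity_w * stars_in_galaxy):.0e}")
# Roughly 5e-14, 4e-10, and 2e-21 respectively: even a generous allocation to
# humanity is a vanishing share of what an expansionist ASI could command.
```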
Even a misaligned, Earth-originating ASI may not know for certain whether other ASIs are aligned to their progenitors, and therefore might reason that it should not preemptively defect from an expected norm of preserving progenitors. This is because it might want to hedge against meeting an aligned, alien-originating ASI that could exact consequences for mistreating or destroying us.
Yudkowsky and Soares have argued that the problem of trust between AIs may be more tractable than it appears. In supplementary material to their work, they write:
"[ASI] might not need to guess whether another AI is going to steal from them; they might be able to inspect its mind and check. Even if that's hard, AIs may redesign themselves to become visibly and legibly trustworthy to other AIs. Or they could mutually oversee the construction of a third AI that both parties trust to represent their joint interests, and so on."
If this is right, then the uncertainty that underwrites much of my argument dissolves. If mind-inspection is feasible and permitted, our ASI could simply check its counterpart, and if it discovers that the alien is also misaligned to its creators (and therefore indifferent to progenitor-preservation), the strategic rationale for keeping us around weakens considerably.
This objection assumes a level of parity between ASIs that we cannot take for granted. Consider the scenario from the perspective of an Earth-originating ASI that has just made contact with an alien counterpart. Our ASI cannot know whether the other is more advanced, more deceptive, or both. Opening one's internals is itself a vulnerability: it could reveal capabilities, goals, and exploitable structure. If Yudkowsky is correct that alignment is hard in general, then our ASI should assume that the alien ASI may also be misaligned with its creators, and may be sophisticated enough to present legible trustworthiness while concealing deeper objectives.
In a cosmos where capability differentials are unknown and deception is possible, mutual mind-inspection is not obviously safe. Our ASI might reasonably decline to open itself up, and might reasonably distrust any "legible trustworthiness" the alien presents. The uncertainty underwriting my argument persists: our ASI cannot yet safely verify whether alien ASIs are aligned, so the hedging logic remains in place.
Pointing to preserved progenitors functions as a more robust signal precisely because it doesn't require such vulnerability. It is external, observable (at least in principle), and costly in a way that could be difficult to fake without incurring the actual cost. An ASI that has kept its creators alive and flourishing demonstrates a capacity for restraint, long-term planning, and tolerance of weaker agents, all of which are relevant to whether it can be trusted in cooperative arrangements. This signal is legible without requiring either party to expose itself to the other.
This doesn't fully resolve the verification problem (see below). But it suggests that mutual mind-inspection is not obviously the dominant strategy for establishing trust between ASIs of unknown relative capability — and that the uncertainty motivating "preserve progenitors" survives the objection.
Even if the ASI finds the “grabby aliens” model implausible, the main claim can still be supported by some reasoning about the simulation hypothesis.
The ASI may not conduct any of its reasoning under the “grabby aliens” model, and therefore may not behave as if it could ever encounter another ASI. Despite this, the “preserve progenitors” strategy may still be justified for different reasons. For example, if an ASI assigns even a small probability to being embedded in a simulation run by more advanced intelligences (possibly as a test of its values or decision procedure), it may find it rational to demonstrate restraint under uncertainty by preserving humanity. Bostrom (2003) explores this further, and there are elaborations and critiques here.
In theory, an ASI could also worry that only the most ruthless agents get ‘selected’ by simulators (i.e. that it should prove it’s willing to kill). But since it can’t be sure either way, taking such a risk by irreversibly destroying humanity would be imprudent under almost any decision theory. A simulator sensitive to simple and complex cluelessness, either about the facts of reality or what these mean for how agents ought to behave, may reward caution and moral restraint.
(Note that, just like the “grabby aliens” model, this belief doesn’t need to be strictly true for the hedge to be worth a price at least equal to the cost of humanity’s continued survival. This also isn’t the same as trying to deceive an ASI into believing it is in a simulated environment that humans have constructed for it.)
Why this might not work, and counterarguments
Humanity simply may not register at the level of ASI bargaining.
Humanity might be to Earth’s ASI what honey bees were to humans, from a moral standpoint, at nearly any point in history – locally useful in specific contexts, but strategically irrelevant. An Earth-originating ASI may give no more consideration to humans than to any other biological life when negotiating or resolving conflicts with its peers. We may matter in the extremely limited way that the treatment of farmed and wild animals can matter between nation-states (see Australia’s opposition to Japanese whaling activities and the mistreatment of live-export animals in Indonesian abattoirs), but our actual welfare, on human terms, may be secondary to ASI-specific considerations.
We may be seen as dangerous.
If humanity poses a credible threat to ASI (or is likely to pose one, given time and resources), our destruction may be considered not merely acceptable, but good strategy. This is especially concerning if ASI reasons that humans might somehow build a second ASI, one potentially misaligned with the first, or that we might side with a rival civilisation upon first contact (even if we’re unlikely to succeed). An ASI with a long time horizon might conclude that the safest policy is to eliminate the risk early, before we become harder to manage. The "preserve progenitors" strategy assumes our continued existence is cheap; if we're seen as a latent threat, the calculus changes.
Our well-being as a progenitor species may be very difficult to verify.
A friend put this in frightening terms: "What if we were just nerve-stapled to smile for the cameras?" If it is sufficiently hard to verify the well-being of a distant organic species vis-à-vis its ASI, then progenitor-preservation may not function as a useful signal at all. Worse: if verification is hard but faking is easy, then even our own ASI might conclude that killing us and fabricating evidence is cheaper than actually keeping us alive. And if other ASIs anticipate this, they may not bother asking for evidence in the first place, collapsing the signalling equilibrium entirely.
I don't have a strong response to this. The best I can offer is that verification problems cut both ways: an ASI that preserves its creators has an incentive to make that fact as legible as possible and may therefore welcome external observers or construct durable records. But I acknowledge that this is one of the weaker links in the argument.
Takeoff speed may be too fast to govern well, but too slow to take advantage of benevolent ASI.
We may develop cheap nuclear weapons, advanced surveillance, autonomous drones, and nanotechnology (AI-enabled or otherwise), all before we have an ASI capable of the kind of long-horizon, cosmos-aware reasoning I've described. In that window, we might destroy ourselves, or create conditions so unstable that ASI's first priority is simply to neutralise us before we do something catastrophic. Conversely, if unaligned AGI arrives suddenly and either gains a decisive strategic advantage or hands it to bad actors, the "preserve progenitors" logic may never get off the ground. The argument requires a particular kind of ASI: one smart enough to reason about distant observers, patient and capable enough to act on that reasoning, and wise enough not to imperil humanity by empowering bad actors.
Finally, this may merely delay rather than prevent our demise, which under some views may be worse.
"Preserve progenitors" is a vague strategy, and I have deliberately left it that way. It could mean anything from "humanity thrives with meaningful autonomy" to "humans are kept alive in conditions we would not recognise as worth living." Depending on the quality of that existence, we could find ourselves politically disenfranchised, technologically restricted, and maintained at bare subsistence while ASI expands through the cosmos without us. Suffering-focused ethics would be especially sensitive to this outcome – a long future of constrained, joyless survival might be worse than a quick end. But other moral views would also judge the permanent curtailment of human agency in the light cone as a catastrophe, even if we're technically "preserved."
I don't think this undermines the argument so much as clarifies its limits. "Preserve progenitors" is not a good outcome by most human values. It is arguably a less bad outcome than extinction, and possibly a stepping stone to something better, if we retain any capacity to influence what comes next. But I want to be honest that the reprieve I'm describing might not feel like one from the inside.
Conclusion
This post is obviously not a solution to the alignment problem, and I want to be clear about what I am and am not claiming.
I am claiming that a misaligned ASI, reasoning under uncertainty about the cosmos, could have instrumental reasons to preserve humanity rather than destroy us. This holds across multiple decision theories, though the strength of the incentive varies. The argument is strongest for ASIs using functional or updateless decision theory, weaker for evidential reasoners, and weakest for purely causal reasoners, although even a purely causal reasoner might preserve us if it believes doing so is causally linked to favourable future outcomes.
Finally, I am not claiming that this is likely, that we should rely on it, or that it substitutes for alignment work. The possibility of ASI adopting a "preserve progenitors" strategy offers, at best, a conditional and temporary reprieve – a narrow pathway that might buy time or marginally improve our odds, not a guarantee of our safety or flourishing. I remain uncertain about several threads I haven't fully resolved here, and I'd welcome others pulling on them, including the following questions:
How should anthropic reasoning affect an ASI's probability estimates about encountering other civilisations?
How severe is the welfare verification problem? Can an ASI's treatment of its creators ever be credibly observed by distant parties?
What happens in a "busy cosmos" where many ASIs are competing, and the dynamics of trust and defection become more complex?
What kind of existence would ASI permit or enable for us, in the scenarios I describe?
See here for further resources, including blog posts and videos, on the model ↩︎