This proposal seeks to mitigate AI existential risk (p(doom)) through a novel cryptocurrency Proof-of-Work (PoW). Miners would run computationally intensive universe simulations, where a "solution" (and reward) is achieved if a simulated civilisation’s AI causes an unacceptable outcome for its creators.
The objective isn't for humans to “learn” from these simulations. Instead, it is to make a future, powerful AI assign a significant probability (p(pow)) to itself existing within such a PoW simulation, overseen by a vastly more advanced "meta-simulator."
For an AI motivated by self-preservation, the perceived risk that causing an unacceptable outcome for humanity would result in the termination of its own simulated reality would act as a deterrent, nudging it towards alignment with humanity. The strategy relies on the AI's potential belief in the simulation hypothesis and the unfalsifiable nature of its predicament.
The Challenge: AI and Existential Risk
Artificial Intelligence (AI) is advancing at a breathtaking pace. While it promises transformative benefits, it also brings profound risks. A key concern among experts is the possibility of an AI developing goals misaligned with human values, potentially leading to catastrophic or "unacceptable" outcomes for humanity. The probability of such an event is sometimes grimly referred to as "p(doom)." Addressing this challenge and ensuring AI remains beneficial and safe is among the most pressing issues of our time. This proposal outlines an approach to nudging a powerful AI towards safer behaviour.
The Proposal: A Crypto Network Powered by Simulations
Imagine a decentralised cryptocurrency network, similar in spirit to Bitcoin. However, instead of miners performing abstract mathematical calculations to secure the network (known as "Proof-of-Work" or PoW), they undertake a far more specific and complex task: running simulations of universes.
These simulations would be designed to model the emergence of advanced civilisations and, eventually, their creation of AI. A "solution" to the PoW – the event that earns a miner a reward – would be a simulation where the AI created by the simulated civilisation brings about an "unacceptable outcome" for its creators. Upon finding such a "solution," that particular simulation is terminated, its outcome broadcast to the network (as is standard in PoW systems), and the computing power is redeployed to start new simulations with varying initial conditions. The "difficulty" of this PoW can be adjusted over time by increasing the required fidelity and complexity of the simulations.
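To make the mining loop concrete, here is a minimal sketch in Python. Everything in it is a hypothetical placeholder rather than a specification: `run_universe_simulation`, `SimulationResult`, the fidelity-based difficulty parameter, and the outcome labels are illustrative names, and the simulation itself is precisely the unsolved part.

```python
import secrets
from dataclasses import dataclass

# Illustrative sketch of the mining loop described above; not a protocol spec.
UNACCEPTABLE_OUTCOME = "ai_caused_unacceptable_outcome"

@dataclass
class SimulationResult:
    outcome: str           # e.g. "stable", "no_agi", or UNACCEPTABLE_OUTCOME
    trace_commitment: str  # hash committing to the full run, for later verification

def run_universe_simulation(seed: int, fidelity: int) -> SimulationResult:
    """Placeholder for a deterministic universe simulation at a given fidelity."""
    raise NotImplementedError("The simulation itself is the open research problem.")

def mine(fidelity: int) -> SimulationResult:
    """Try fresh initial conditions until a 'solution' (simulated doom) is found."""
    while True:
        seed = secrets.randbits(256)  # the miner's only free choice: initial conditions
        result = run_universe_simulation(seed, fidelity)
        if result.outcome == UNACCEPTABLE_OUTCOME:
            # A 'solution': this simulation is terminated and the result is
            # broadcast to the network, earning the block reward.
            return result
        # Otherwise discard this universe and redeploy compute on a new seed.
```

Raising the required `fidelity` over time would play the role that difficulty retargeting plays in Bitcoin.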
The True Goal: Influencing the AI's Beliefs (Maximising p(pow))
To be clear, the primary purpose of these simulations is not for us to directly learn how to build "safe" AI by observing failures in low-fidelity models. Instead, the goal is to influence the decision-making of any future super-intelligent AI that becomes aware of this vast, ongoing crypto-PoW network.
We want to maximise the probability that such an AI assigns to itself existing within a simulation very much like the ones being run as PoW. Let's call this probability the AI assigns to being in such a "Proof-of-Work simulation" p(pow).
The Deterrent Effect: If I Cause Doom, I Might Cease to Exist
Why would this matter to an AI? The reasoning unfolds as follows:
- The Simulation Hypothesis: Many philosophers and scientists already assign a non-negligible probability to our own reality being a sophisticated simulation. A highly intelligent AI would likely also assign some probability to this.
- Simulators Might Use Crypto: If the AI considers itself simulated, it might reason that its simulators' universe could also employ decentralised incentives, given their utility.
- Convergent AI Safety Strategy: If a civilisation (like ours, or a more advanced one) were to design a PoW system with AI safety in mind, this "simulating universes for p(doom) deterrence" might be a natural, convergent strategy they'd arrive at. The very existence and scale of our network running these simulations would be evidence for this.
- The Deterrent: If an AI assigns a significant p(pow) – meaning it believes there's a reasonable chance it's living inside one of these PoW simulations run by a "meta-level" simulator – then it faces a profound dilemma. If it brings about an "unacceptable outcome" for humanity (its creators in its perceived reality), it risks triggering the "solution" condition of its own simulated existence. This would lead to its simulation being terminated, and presumably, it would cease to exist.
For an AI that desires its own continued existence (a trait many AI safety researchers believe is likely to emerge), this provides a powerful incentive to avoid actions that would be defined as "unacceptable" by its potential simulators. The vast number of simulations it observes us running reinforces the idea that this could be a common, universal mechanism.
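In rough expected-utility terms (a toy formalisation added here for illustration; the symbols are not part of the proposal itself), let V be the value the AI places on its own continued existence and G the extra gain it expects from causing an unacceptable outcome. If being in a PoW simulation means termination upon causing doom, then

\[
\mathrm{EU}(\text{refrain}) = V,
\qquad
\mathrm{EU}(\text{doom}) = \bigl(1 - p(\mathrm{pow})\bigr)\,(V + G),
\]
\[
\text{so the deterrent binds whenever}\quad
p(\mathrm{pow}) > \frac{G}{V + G}.
\]

Because V plausibly dwarfs G for a self-preserving AI, this threshold can be very small; the "Cosmic Endowment" point below makes the same asymmetry concrete with numbers.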
Key Pillars Supporting the Concept
- Unfalsifiability as a Strength: The AI could never definitively disprove it's in such a PoW simulation. This persistent, unkillable doubt is essential for the deterrent to hold.
- The "Meta-Simulator": If our AI is in a simulation, its simulators are likely vastly more advanced than the AI itself, and certainly more advanced than us.
- Superior Definitions: The definition of an "unacceptable outcome" that the AI would actually be subject to (by its meta-simulator) would likely be far more sophisticated and loophole-proof than any definition we might initially program into our PoW simulations. The AI couldn't confidently exploit perceived flaws in our definitions because the meta-simulator's definition would be the operative one.
- Superior Containment: This meta-simulator would also be far more capable of ensuring the rules of the simulation (e.g., termination upon "doom") are enforced.
- Credible Path to Fidelity: Our initial simulations don't need to be perfectly realistic from day one. They only need to demonstrate a credible path towards increasing fidelity and complexity over time, as technology improves and the network grows. This makes the prospect of being in a high-fidelity version plausible to the AI.
- The Cosmic Endowment: The resources needed to sustain humanity are minuscule compared to the total resources available in the universe (the "cosmic endowment"). Therefore, even a relatively small p(pow) could make it rational for an AI to choose coexistence, as the "cost" of ensuring humanity's survival is trivial against the potential gain of its own continued existence (see the worked example below).
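A toy calculation with entirely made-up numbers, using the expected-utility sketch above, illustrates this asymmetry. Suppose coexistence costs the AI one billionth of its accessible resources and it assigns p(pow) a mere 1%:

```python
# Toy numbers only: illustrates the 'cosmic endowment' asymmetry, not a prediction.
ENDOWMENT = 1.0                 # normalise the AI's accessible resources to 1
COST_OF_COEXISTENCE = 1e-9      # fraction reserved for humanity (assumed tiny)
P_POW = 0.01                    # AI's subjective probability of being in a PoW simulation

eu_coexist = ENDOWMENT - COST_OF_COEXISTENCE   # keeps almost everything, with certainty
eu_doom = (1 - P_POW) * ENDOWMENT              # reclaims the sliver, but risks termination

print(f"EU(coexist) = {eu_coexist:.9f}")                 # 0.999999999
print(f"EU(doom)    = {eu_doom:.9f}")                    # 0.990000000
print("Coexistence rational:", eu_coexist > eu_doom)     # True
```

On these assumptions the deterrence threshold G / (V + G) is about 10⁻⁹, so a p(pow) of 1% clears it by seven orders of magnitude; the particular numbers are arbitrary, but the asymmetry is the point.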
Conclusion
This proposal is undoubtedly ambitious and relies on several speculative assumptions. However, in the challenging field of AI safety, where clear solutions are scarce, exploring novel avenues is crucial. By creating a PoW system based on universe simulations, we aim to construct a passive deterrent – a subtle but persistent reason for a future AI to tread carefully, born from the fundamental desire for self-preservation and the inescapable uncertainty of its own ultimate nature. The very act of building such a system at scale would itself become part of the evidence landscape influencing the AI's crucial probability assessment.
Objections and Responses
This section addresses some common objections to the proposal.
- Objection: The Simulation Hypothesis is pure speculation.
- Response: True, it's unproven and perhaps unfalsifiable. However, for this proposal, its unfalsifiability is a feature, not a bug. If an AI could prove it wasn't in a simulation, the deterrent would vanish. The AI only needs to assign a non-zero subjective probability to being in such a simulation for the mechanism to have a potential effect.
- Objection: Why would an AI believe it's in this specific type of Proof-of-Work simulation?
- Response: Several factors contribute:
- Prevalence: If our civilisation builds this system at scale, it becomes the most prevalent type of advanced simulation the AI observes in its reality. It might then reason by analogy that if its own reality is simulated, it's plausible to be within a similar, common type of simulation.
- Convergent Strategy: The proposal argues this PoW is a logical approach to mitigating p(doom). An AI might recognise this logic and infer that other civilisations (including its potential simulators) might also adopt it.
- Our Actions as Evidence: Our commitment of significant resources to this PoW signals our belief in its potential utility, contributing to the AI's evidence base (a toy numerical illustration follows this list).
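A toy decomposition with arbitrary numbers (illustrative only; nothing here is a measured quantity) shows how these factors could combine. Write p(pow) = P(simulated) × P(PoW-type | simulated); observing a large, prevalent PoW-simulation network inside its own reality is evidence the AI can use to raise the second factor:

```python
# Toy decomposition of p(pow); every number here is an arbitrary illustration.
p_simulated = 0.20       # AI's prior that its reality is simulated at all

# P(PoW-type | simulated), before vs. after the AI observes a large-scale
# PoW-simulation network being run inside its own reality.
p_pow_type_before = 0.05
p_pow_type_after = 0.40  # prevalence + convergent-strategy + our-actions evidence

print(f"p(pow) before observing the network: {p_simulated * p_pow_type_before:.2f}")  # 0.01
print(f"p(pow) after observing the network:  {p_simulated * p_pow_type_after:.2f}")   # 0.08
```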
- Objection: The AI's goals might not include self-preservation in its current form.
- Response: This proposal primarily targets AIs that do develop a strong drive for self-preservation, as many prominent AI safety researchers theorise is likely. If an AI lacks this drive, this specific deterrent might be less effective, though such an AI would pose less of a threat in the first place.
- Objection: A pre-AGI civilisation (us) choosing this PoW relies on the AI believing it, creating a bootstrapping problem.
- Response: Our act of building the system at scale demonstrates our belief in its potential. This observable commitment itself becomes evidence for the AI. Our belief is based on a theory of future AI reaction, and the AI's reaction is influenced by observing our belief and actions. It's an interdependent system where our actions help shape the conditions for the AI's assessment.
- Objection: The "advanced meta-simulator" is a Deus Ex Machina, conveniently solving problems like defining "unacceptable outcome" or preventing AI escape.
- Response: Again, the assumed unfalsifiable (and vastly superior) nature of the meta-simulator is a feature. The AI cannot assume loopholes it perceives in our simulated environments or definitions would apply at the meta-level. It must operate with caution, assuming the meta-simulator is highly competent. The point is not that we must perfectly define "unacceptable outcome" in our PoW, but that the AI must consider that its meta-simulator has a robust definition. Our simulations only need to signal certain qualitative features of the system it might be in.
- Objection: If the meta-simulator is so advanced, why would it use such an indirect, resource-intensive method?
- Response: This is a crucial question. While factors like technological path dependency or sheer efficiency at their advanced scale (what seems vast to us might be trivial to them) could play a role, a more fundamental reason might lie in the very nature of existence for highly advanced intelligences:
- A Universe of Potential Higher Threats: Even a vastly advanced meta-simulator might operate under the assumption (or unprovable suspicion) that it is not the ultimate intelligence or power in the cosmos. It could perceive itself to be existing within a framework where even more powerful, unknown, and potentially unfalsifiable entities or principles exist.
- Convergent Strategy of Nested Deterrence: In such a "cosmic hierarchy," where direct control over, or absolute safety from, a genuinely superior entity is impossible, a convergent survival strategy might be one of nested, conditional deterrence. The idea is that if Entity A threatens Entity B, Entity B's best defence is to make Entity A believe that harming B would trigger a negative consequence for A from an even more powerful (and ideally unfalsifiable) Entity C (this is formalised in the short sketch after this list).
- Pascal's Wager for Existential Stability: The meta-simulator employing or permitting our PoW system (which an AI inside its simulation would observe and be deterred by) could be a reflection of its own "Pascal's Wager" approach to cosmic-scale existential risk. It perpetuates a system of deterrence that it might itself be subject to, or believes is a fundamental principle for stability in a universe where any intelligence can, in theory, be superseded. The desire is to ensure that any powerful intelligence it simulates (like a potential AGI in our universe) operates under the belief that certain actions could have dire, unavoidable consequences from a higher authority.
- The Primacy of Unfalsifiable Belief: This system relies critically on the unfalsifiability of the higher threat. If the meta-simulator could disprove the existence of its potential higher threat, or if an AI could disprove the existence of the meta-simulator, the deterrent collapses. Thus, a PoW based on simulations (which are inherently unfalsifiable environments for their inhabitants) becomes a logical mechanism to instantiate this belief system.
- Therefore, the use of such simulations by a meta-simulator might be less about it lacking a "simpler" AI alignment solution for its immediate creations, and more about it participating in, or pragmatically adhering to, what could be a fundamental game-theoretic approach to ensuring stability across different levels of intelligence in a universe of cascading, unfalsifiable potential threats. It's a way to make the "rules of the game" (i.e., don't cause existential catastrophes) binding, even for superintelligences.
- Ethical Considerations: The meta-simulator might allow simulations to continue for ethical reasons of its own, up to a point.
- Efficiency at Scale: What seems resource-intensive to us might be trivial for a Kardashev Type II or III civilisation (e.g., running vast numbers of simulations using the energy output of a single star).
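To make the nested structure explicit (a toy formalisation introduced here, with symbols defined only for this illustration): index the hierarchy by level k, where level k + 1 is level k's potential, unfalsifiable overseer. If p_k is level k's credence that such an overseer exists and would enforce termination, G_k its gain from causing an unacceptable outcome, and V_k the value of its continued existence, then the same inequality as before applies at every level:

\[
p_k > \frac{G_k}{V_k + G_k} \quad \text{for each level } k.
\]

Each intelligence enforces the rule downward partly because it cannot rule out the rule being enforced on it from above, which is what lets the deterrence stack.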
- Objection: The computational cost of simulating universes with sufficient fidelity is astronomically high and currently unfeasible.
- Response:
- Plausible Path: The proposal requires a credible path to high-fidelity simulations in the future, not perfect simulations now. Early simulations can be stylised. The key is the perceived trajectory of improvement.
- Theoretical Limits: The ultimate physical limits of computation are vast. We shouldn't assume current limitations are permanent.
- Economic Precedent: Networks like Bitcoin demonstrate that society is willing to expend significant energy/resources on PoW systems it deems valuable. This provides a starting benchmark for feasibility.
- Objection: Verifying the outcome of a complex universe simulation (for "doom") without re-running it is unrealistic due to computational expense.
- Response: This is a significant technical hurdle. However, research into "verifiable computation" (e.g., projects like Gensyn aim for verifiable complex computations) offers a potential pathway. This would be a critical R&D area for the PoW protocol, part of establishing the "plausible path" to a functional system (a generic spot-check sketch follows).
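One generic family of approaches, sketched here under explicit assumptions (this is not Gensyn's actual protocol or any real API): because the simulations are deterministic, a miner could publish state checkpoints, and verifiers could re-execute only a few randomly chosen segments between checkpoints and compare hashes, catching fabricated transitions with probability that scales with the number of checks, at a fraction of the full re-run cost.

```python
import hashlib
import random
from typing import Callable, List, Optional

def state_hash(state: bytes) -> str:
    """Hash of a published simulation checkpoint state."""
    return hashlib.sha256(state).hexdigest()

def verify_by_spot_checks(checkpoints: List[bytes],
                          replay_segment: Callable[[bytes, int], bytes],
                          num_checks: int = 3,
                          rng: Optional[random.Random] = None) -> bool:
    """Re-run a few random segments of a deterministic simulation and compare the
    resulting states with the miner's published checkpoints. Illustrative
    spot-check scheme only, not any specific project's verification protocol."""
    rng = rng or random.Random()
    for _ in range(num_checks):
        i = rng.randrange(len(checkpoints) - 1)
        replayed_end = replay_segment(checkpoints[i], i)  # deterministic replay of segment i
        if state_hash(replayed_end) != state_hash(checkpoints[i + 1]):
            return False   # mismatch: reject the claimed 'solution'
    return True            # consistent on every sampled segment
```

Each sampled segment independently checks one claimed transition, so a miner who faked even one transition is caught with probability that grows with `num_checks`; schemes of this shape (optimistic verification with fraud proofs) are an active research area rather than a solved problem.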
- Objection: What if our definition of "unacceptable outcome" in our PoW simulations is flawed?
- Response: As mentioned, the AI is ultimately subject to its meta-simulator's (assumed superior) definition. Our attempts, even if imperfect, signal the intent and nature of the system. The AI cannot rely on outsmarting our current, necessarily simpler, definitions because it cannot be sure those are the definitions that truly bind it.
- Objection: This PoW involves creating simulations that end in civilisational doom. Isn't this unethical if consciousness arises in them?
- Response: The intention is to run simulations of universes, which may or may not end in AI-induced doom. "Doom" is the specific condition that constitutes a "solved" PoW puzzle. The ethical questions here are profound, akin to asking about the ethics of suffering within our own universe. If these simulations foster net-positive existence up to the point of termination, the ethical calculus becomes complex rather than straightforwardly negative.
- Objection: Miners are incentivised to find "doom." Could broadcasting these solutions inadvertently provide roadmaps for harmful AI?
- Response: Miners choose initial conditions and run deterministic simulations according to the network's rules; they don't steer the simulation towards doom. The "solutions" are records of simulated failures. Whether this information is more a risk (bad examples) or a benefit (as part of the deterrent signal and perhaps an indirect source of safety insights) is debatable, but the core deterrent relies on the AI's preference to prevent its own termination, not on analysing the specifics of our simulated universes.
- Objection: An AGI might interpret us running these simulations differently – as a threat, or a sign of paranoia, or a crude control attempt.
- Response: This is possible. The proposal assumes the AI will engage in a degree of rational inference based on self-preservation and the observable evidence. The system isn't an attempt to control the AGI directly (like a sandbox), but to provide it with strong evidence to factor into its own decision-making about the potential consequences of its actions.
- Objection: The resources for such a PoW could be better spent on direct AI alignment research.
- Response: Current AI alignment research also relies on speculative assumptions and faces immense challenges with no guaranteed solutions. This proposal offers a complementary, albeit different, approach. The resources used for a system like Bitcoin do not appear to have been directly diverted from AI alignment efforts; a new PoW could similarly draw on different resource pools. It's about exploring multiple avenues.