Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Threat Model

There are many ways for AI systems to cause a catastrophe from which Earth-originating life could never recover. All of the following seem plausible to me:

  • Misuse: An AI system could help a human or group of humans to destroy or to permanently take over (and lock their values into) the world. The AI could be:
    • An oracle AI (e.g. a question-answering LLM)
    • An LLM simulating an intent-aligned agent and taking real-world actions via APIs
    • An intent-aligned RL agent
    • An interaction of multiple systems
  • Power-Seeking: An AI system could destroy or permanently take over the world on its own account, by leveraging advanced instruments of force projection. The AI could be:
    • An LLM simulating a misaligned agent
    • "Specification gaming": An RL agent that is aligned to a formal objective and Goodharts to catastrophe
    • "Goal misgeneralization": A surprise mesa-optimiser (most likely in model-free RL, but could conceivably arise through evolutionary processes in any iterative algorithm which has or learns sufficiently reality-like structure)
    • An interaction of multiple systems, participating in coordination mechanisms that exclude humans
  • Economic Squeeze: An AI system could acquire nearly all means of production through a gradual process of individually innocent economic transactions, thereby squeezing humanity out of resource allocation decisions and removing most human influence over the future.
    • This would most likely be an "interaction of multiple systems".
    • A single RL agent, or a unipolar tree of agents, might also do this, especially if they are successfully aligned to avoid use of force against humans.
  • Superpersuasion: An AI system could generate stimuli which reliably cause humans to adopt its arbitrary goals. The AI could be:
    • An LLM merely extrapolating from persuasive human text
    • An RL agent trained on human approval
    • A surprise mesa-optimiser
    • Some mixture of the above
    • Many AIs, collectively shaping a new human culture with an alien ideology
  • Security Dilemma: If AI-enabled technological advancements turn out to be offence-dominant, and if partial alignment success leads AIs to be unable to make credible commitments to each other (e.g. due to corrigibility), the equilibrium strategy for AI-enabled militaries may involve high-risk preemptive strikes and increasingly escalated retaliation to a point of existential catastrophe.
    • This would almost surely be a multipolar failure mode.

Rather than trying to enumerate all possible failure modes and then shaping incentives to make them less likely to come up, I typically adopt a quasi-worst-case assumption: that, perhaps as a matter of bad luck with random initialisation,

On the one hand, unlike a typical "prosaic" threat model, in the neorealist threat model one does not rely on empirical facts about the inductive biases of the kind of network architectures that are practically successful. A realist justification for this is that there may be a phase transition as architectures scale up which drastically changes both their capabilities profile and this kind of inductive bias (vaguely analogous to the evolution of cultural knowledge-transfer within biological life).

On the other hand, unlike (a typical understanding of) a "worst-case assumption," the last clause leaves open the possibility of hiding concrete facts about our world from an arbitrarily powerful model. The framing in terms of functions also highlights an ontology of AI that respects extensional equivalence, in which imputations of "deceptive mesa-optimisers hiding inside" are discarded in favour of "capable but misaligned outputs on out-of-distribution inputs".

One can make progress with this assumption by designing training contexts which couple safety guarantees to the training objective, e.g. a guarantee of shutdown within a time bound with arbitrarily high probability, and by working on ways to obtain instance-specific guarantees about learned functions that continue to hold out-of-distribution, e.g. with model-checking, regret bounds, or policy certificates.
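To make "shutdown within a time bound with arbitrarily high probability" concrete, here is a minimal sketch under a toy assumption that is mine, not the post's: the training context halts the system independently at each step with probability at least p. Under that assumption the chance of still running after T steps is at most (1 - p)^T, so any failure tolerance δ can be met by choosing T large enough. The function names are hypothetical, and a real guarantee would have to hold for the learned function itself rather than for an externally imposed coin flip.

```python
import math


def prob_still_running(per_step_shutdown_prob: float, steps: int) -> float:
    """Upper bound on P(not yet shut down after `steps` steps).

    Toy assumption: each step independently triggers shutdown with
    probability at least `per_step_shutdown_prob`.
    """
    return (1.0 - per_step_shutdown_prob) ** steps


def steps_for_shutdown_guarantee(per_step_shutdown_prob: float,
                                 failure_tolerance: float) -> int:
    """Smallest T with (1 - p)^T <= delta under the toy assumption."""
    if not 0.0 < per_step_shutdown_prob <= 1.0:
        raise ValueError("per_step_shutdown_prob must be in (0, 1]")
    if not 0.0 < failure_tolerance < 1.0:
        raise ValueError("failure_tolerance must be in (0, 1)")
    if per_step_shutdown_prob == 1.0:
        return 1  # shutdown is certain on the first step
    # Solve (1 - p)^T <= delta for T; both logarithms are negative.
    ratio = math.log(failure_tolerance) / math.log(1.0 - per_step_shutdown_prob)
    return math.ceil(ratio)


if __name__ == "__main__":
    for delta in (1e-3, 1e-6, 1e-9):
        t = steps_for_shutdown_guarantee(0.01, delta)
        print(f"delta={delta:g}: time bound T={t}, "
              f"residual probability <= {prob_still_running(0.01, t):.3g}")
```

The required time bound grows only logarithmically in 1/δ, which is one sense in which the probability can be pushed "arbitrarily high" at modest cost.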

Success Model

For me the core question of existential safety is this:

It is not, for example, "how can we build an AI that is aligned with human values, including all that is good and beautiful?" or "how can we build an AI that optimises the world for whatever the operators actually specified?" Those could be useful subproblems, but they are not the top-level problem about AI risk (and, in my opinion, given current timelines and a quasi-worst-case assumption, they are probably not on the critical path at all).

From a neorealist perspective, the ultimate criterion for "goodness" of an AI strategy is that it represents

I am optimistic about the plausibility of negotiations to adopt AI strategies that clear this bar, once such strategies become clear, even if they do not strictly meet traditional standards of "competitiveness". On the other hand, any strategy that doesn't clear this bar seems to require unrealistic governance victories in order to be implemented. I hope this articulation helps to clarify what governance and strategy considerations imply for the relative merits of technical safety research directions.

Related work

