Key paths, plans and strategies to AI safety success

by Adam Jones
19th Jun 2025
Linkpost from bluedot.org
7 min read

In January, I spent over 100 hours:

  • reading 50+ AI safety 'plans';
  • interviewing 10+ AI safety researchers; and
  • thinking about AI safety strategy, drawing on my 2.5 years in the field.

This list of routes to AI safety success (updated June 2025) is a key output of that research. It describes the main paths to success in AI safety. I’d guess that understanding the strategies in this document would put you in roughly the top 10% of the AI safety community at AI safety strategy.

 

Part 1: International approaches

International safe actors race ahead

Get “good” international coalitions to be the first to develop advanced AI, and to do so safely. They would then use this AI to shift the world into a good state: for example, by accelerating one of the other paths below.

CERN for AI is a common model for this. These proposals usually envision shared infrastructure and expertise for frontier AI research, combining the efforts of many nation states together.

This can be approached from two angles:

  • make AI superpowers (US, China) more cooperative and democratic
  • make cooperative democracies (EU, UK, Australia, etc.) more AI-capable

Resources:

  • Building CERN for AI (Center for Future Generations) - European-led international research consortium with tiered membership (EU core plus UK, Switzerland, Canada), specific funding mechanisms and governance structures for trustworthy AI development.
  • What Success Looks Like (EA Forum) - Scenario 6 "Apollo Project" where well-resourced international project develops safe AI first.
  • Hassabis Triple Institution Model (YouTube) - "CERN for AGI" component: international collaborative research facility for safe AGI development, complemented by monitoring and governance institutions.

Cooperative international deterrence

Get everyone to agree not to build the dangerous thing, and enforce those agreements.

IAEA for AI is a common model for this. This body would oversee advanced AI development globally, conduct inspections, and enforce safety standards. Suggested implementations usually rely on compute governance - using control over compute resources (AI chips and other hardware) to monitor and regulate AI development.
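
To make the compute-governance idea concrete, here is a minimal toy sketch (my illustration, not part of any cited proposal) of how a verification regime might map a training run's estimated compute onto oversight tiers. The thresholds, tier names, and `TrainingRun` fields are all hypothetical.

```python
# Toy sketch of a compute-governance reporting rule. All thresholds and
# field names are hypothetical illustrations, not from any actual proposal.
from dataclasses import dataclass

REPORTING_THRESHOLD_FLOP = 1e25   # hypothetical: runs above this must be declared
INSPECTION_THRESHOLD_FLOP = 1e26  # hypothetical: runs above this trigger inspection

@dataclass
class TrainingRun:
    developer: str
    chips: int              # number of accelerators used
    chip_flop_per_s: float  # peak FLOP/s per accelerator
    utilisation: float      # average utilisation, 0 to 1
    days: float             # wall-clock training time

    def total_flop(self) -> float:
        """Estimate total training compute from hardware and duration."""
        seconds = self.days * 24 * 60 * 60
        return self.chips * self.chip_flop_per_s * self.utilisation * seconds

def classify(run: TrainingRun) -> str:
    """Map a run's estimated compute onto a (hypothetical) oversight tier."""
    flop = run.total_flop()
    if flop >= INSPECTION_THRESHOLD_FLOP:
        return "inspection required"
    if flop >= REPORTING_THRESHOLD_FLOP:
        return "declaration required"
    return "no action"

if __name__ == "__main__":
    run = TrainingRun("ExampleLab", chips=10_000,
                      chip_flop_per_s=1e15, utilisation=0.4, days=90)
    print(f"{run.total_flop():.2e} FLOP -> {classify(run)}")
```

In practice, proposals like the CNAS one below push verification down to the hardware itself, so that the usage data feeding such a rule is attested by the chips rather than self-reported.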

These agreements could vary both in scope (who is affected) and severity (what are they prevented from doing). For example, different options could be:

  • preventing anyone from building superintelligence
  • preventing non-democratic states from building AGI
  • preventing states with high levels of terrorism/poor cybersecurity from building bioweapons-capable models

Resources:

  • A Narrow Path (Miotti et al.) - pause superintelligent AI development globally for 20 years, enforced by various proposed bodies: an International AI Safety Commission (IASC), a Global Unit for AI R&D (GUARD), and an International AI Tribunal.
  • AI Governance Needs a Theory of Victory (Convergence Analysis) - "AI Moratorium" as one of three theories of victory, with analysis of historical precedents from nuclear non-proliferation and climate agreements.
  • Analysis of Global AI Governance Strategies (LessWrong) - "Global Moratorium" path where governments slow or halt the development of advanced AI.
  • Survey on Intermediate Goals (Rethink Priorities) - Documents "International monitoring and enforcement" as one theory of victory, with institutional capacity building as key intermediate goal.
  • What Success Looks Like (EA Forum) - Scenario 5 "US-China cooperation" enables global moratorium and IAEA-style institution for AI.
  • Policymaking in the Pause (FLI) - Specific policy agenda for implementation during an AI development pause, including compute governance and international coordination mechanisms.
  • Coalition for a Baruch Plan for AI (CBPAI) - NGO coalition advocating for bold international treaty-making process to create global democratic governance of AI, modeled on the 1946 Baruch Plan for nuclear technologies.
  • Hassabis Triple Institution Model (YouTube) - "IAEA for AI" component: safety monitoring and verification institution to oversee compliance with international AI safety standards.
  • Case studies of international security agreements (Wasil et al.) - Explores various international security agreements, including the IAEA, START treaties, the Organisation for the Prohibition of Chemical Weapons, the Wassenaar Arrangement, and the Biological Weapons Convention, and the lessons they offer for AI governance.
  • Secure, Governable Chips (CNAS) - Technical proposal for "on-chip governance" using hardware-embedded security features to monitor and control AI chip usage, enabling IAEA-like verification of compute.

Hostile international deterrence

State actors use sabotage, threats, or AI leverage to limit AI development.

This cluster acknowledges that competitive dynamics may dominate and designs around them rather than trying to eliminate them with cooperation.

Resources:

  • Mutual Assured AI Malfunction (MAIM) (Superintelligence Strategy) - Deterrence through credible sabotage threats against destabilizing AI projects, with specific escalation ladder from intelligence gathering to kinetic strikes.
  • AI Governance Needs a Theory of Victory (Convergence Analysis) - "AI Leviathan" approach where first-mover uses superintelligence to enforce coordination on other actors.

 

Part 2: Domestic approaches

Domestic safe actors race ahead

"Good" domestic actors (companies, government projects) achieve technological advantage, then use that advantage responsibly. Often this is either by suggesting it gets spun into an international safe actor, or by using an aligned superintelligent system to lock in a good world.

This approach focuses on ensuring the right actors win the race with an aligned AI, rather than on preventing races altogether. A common sub-strategy is developing aligned human-level AI that can then solve the problem of aligning superintelligence.

Resources:

  • The Checklist (Sam Bowman from Anthropic) - Phase-based approach: get aligned human-level AI (ASL-4), then use it to align superintelligence and handle societal coordination ("AI does our homework").
  • Planning for Extreme AI Risks (Joshua Clymer) - Decision trees for "careful AI companies" to maintain competitive position while advancing safety, with specific resource allocation heuristics.
  • A Rough Alignment Plan (Adam Jones) - Three-pillar defense-in-depth approach for aligning human-level AI (prevent harmful capabilities, reduce harmful motivations, control harmful actions). Then use human-level AI to align superintelligence.
  • What Success Looks Like (EA Forum) - Scenario 1 "Alignment easier than expected" and Scenario 4 "Alignment by chance" where technical solutions enable coordination.
  • Our Approach to Alignment Research (OpenAI) - Three pillars: human feedback, AI-assisted human feedback, then automated alignment research.
  • Analysis of Global AI Governance Strategies (LessWrong) - "Strategic Advantage" path where Western governments race to TAI to prevent authoritarian control.

Societal resilience

Accept that powerful AI is coming and focus on not dying when it arrives. Build better defenses, stronger institutions, more robust infrastructure.

This sidesteps coordination problems by making the consequences of coordination failure manageable rather than trying to prevent coordination failure itself.

Resources:

  • d/acc: one year later (Vitalik Buterin) - decentralized, democratic, and differential defensive technologies outpace offensive AI capabilities.
  • Analysis of Global AI Governance Strategies (LessWrong) - "Cooperative Development" involves building protective technologies through private initiative, public-private partnerships, or limited state involvement.
  • Societal Adaptation to Advanced AI (Jamie Bernardi et al.) - Reducing the expected negative impacts from a given level of diffusion of a given AI capability.
  • The Intelligence Curse (Luke Drago and Rudolf Laine) - AI could undermine economic incentives to invest in humans. Proposals to mitigate these problems include diffusing AI capabilities, building societal defences, and developing personalised models owned by individuals rather than companies.

 

Part 3: Other approaches

Muddle through doing generally good things

Implement many risk-reduction interventions simultaneously without committing to a single theory of victory. This usually relies on hoping that the combination provides a sufficient safety margin, or that a clearer strategy will emerge in the future.

This cluster acknowledges uncertainty about which approaches will work and hedges across multiple strategies, often emphasizing near-term concrete actions over long-term strategic coherence.

Resources:

  • A Playbook for AI Risk Reduction (Holden Karnofsky) - Four intervention categories (alignment research, standards, careful AI development, information security) implemented in parallel.
  • Catastrophic Risks from Unsafe AI (YouTube) - Three-pillar framework (prevent, mitigate, coordinate) with concrete interventions across all areas simultaneously.
  • What Success Looks Like (EA Forum) - Scenario 2 "Combination of many technical and governance strategies works" through portfolio approach.
  • The AI Regulator's Toolbox (Adam Jones) - Comprehensive catalog of concrete governance interventions available to regulators, from compute governance to liability frameworks.
  • Survey on Intermediate Goals (Rethink Priorities) - "Humanity muddles through" theory of victory using range of risk-reduction methods without single coordinating mechanism.
  • UK AI Safety Institute Research Agenda (AISI) - Evaluating risks and mitigations for those risks, raising policymaker awareness, and identifying and promoting best practice among governments, model developers and deployers.
  • Action Plan to Increase the Safety and Security of Advanced AI (Gladstone) - Progressive government capacity building across multiple areas (regulation, research, international cooperation) without single strategic commitment.

Some combination of the above

Okay, so this isn’t really its own strategy. But weirdly when I speak to people they often forget that this is an option (and possibly quite a good one!).

Diversifying across strategies provides insurance against inevitable forecasting errors. We don't know which coordination problems will prove tractable, which technical approaches will work, or which geopolitical scenarios will unfold.

Additionally, many strategies actually work better together:

  • International safe actors + deterrence: Deterrence means that safe actors gain breathing room to spend more time on safety. Additionally, the benefits produced by the safe actors could be distributed in exchange for compliance with deterrence measures, encouraging international cooperation.
  • Hostile + cooperative deterrence: The credible threat of hostile measures makes cooperative agreements more attractive.
  • Societal resilience + deterrence: Deterrence could stop most threats, and then society can withstand the few bad actors remaining. This effectively offers a defence-in-depth approach.
  • Domestic safe actors + international safe actors: Having domestic AI capabilities provides diplomatic leverage, making it more likely that these actors can lead the world toward an international safe actor. Safety research can also be exported from domestic safe actors to international projects.

Finally, different strategies require different talent. This might look like:

  • technical researchers for the safe-actor and domestic strategies;
  • political experts for the deterrence and international strategies; and
  • entrepreneurs for the societal resilience strategy.

This diversity allows us to leverage more talent across many valuable bets.

The most promising path forward probably involves different groups working on different strategies simultaneously - playing to their comparative advantages while staying loosely coordinated.