This post tries to be a fairly comprehensive list of all AI safety, alignment, and control interventions.
Much of the material was collected as part of an internal report on the field for AE Studio under Diogo de Lucena. I'd like to thank Aaron Scher, who maintains the #papers-running-list in the AI Alignment Slack, as well as the reviewers Cameron Berg and Martin Leitgab, for their contributions to the report.
This post doesn't try to explain the interventions; it provides only the tersest summaries and serves as a sort of top-level index to the relevant posts and papers. The much longer paper version of this post has additional summaries for the interventions (but fewer LW links) and can be found here.
AI disclaimer: Many of the summaries have been cowritten or edited with ChatGPT.
Please let me know about any broken links or if I overlooked an intervention, especially a whole type of intervention.
This consolidated report drew on the following prior efforts.
See also AI Safety Arguments Guide
Moving beyond the Cartesian boundary model to agents that exist within and interact with their environment.
Foundations for rational choice under uncertainty, including causal vs. evidential decision theory and updateless decision theory.
Understanding when and how learned systems themselves become optimizers, with implications for deception and alignment faking.
MIRI's framework for reasoning under logical uncertainty with computable algorithms.
Handling uncertainty in logical domains and imperfect models.
See also LessWrong Tag Formal Verification and LessWrong Tag Corrigibility
Mathematical verification methods to prove properties about neural networks.
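To give a flavor of what such verification primitives look like, here is a minimal sketch of interval bound propagation through a fully connected ReLU network; the layer list and shapes are hypothetical, and real verifiers (branch-and-bound, SMT-based tools, etc.) are far more involved.

```python
import torch

def ibp_linear(lower, upper, weight, bias):
    # Propagate an elementwise box [lower, upper] through x @ weight.T + bias.
    center, radius = (upper + lower) / 2, (upper - lower) / 2
    new_center = center @ weight.T + bias
    new_radius = radius @ weight.abs().T  # worst-case spread under the linear map
    return new_center - new_radius, new_center + new_radius

def certified_output_bounds(lower, upper, layers):
    # `layers` is a list of (weight, bias) pairs, each followed by a ReLU.
    for weight, bias in layers:
        lower, upper = ibp_linear(lower, upper, weight, bias)
        lower, upper = torch.relu(lower), torch.relu(upper)  # ReLU is monotone
    return lower, upper  # sound (if loose) bounds on the network's outputs
```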
Adding confidence guarantees to existing models.
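One common way to bolt confidence guarantees onto an existing classifier is split conformal prediction. A minimal sketch, assuming a held-out calibration set and softmax outputs (variable names are illustrative):

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_set(test_probs, threshold):
    # Labels included in the set; covers the true label with probability ~1 - alpha.
    return np.where(test_probs >= 1.0 - threshold)[0]
```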
Adapting proof-carrying code to ML where outputs must be accompanied by proofs of compliance/validity.
Algorithms that maintain safety constraints during learning while maximizing returns.
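A common pattern here is Lagrangian constrained policy optimization: treat the safety constraint as a cost with a budget and adapt a multiplier. A minimal sketch with illustrative names, assuming per-step log-probs and advantage estimates are already computed:

```python
import torch
import torch.nn.functional as F

lagrange_param = torch.zeros(1, requires_grad=True)
lagrange_opt = torch.optim.Adam([lagrange_param], lr=1e-2)
COST_BUDGET = 25.0  # maximum allowed expected episode cost (illustrative)

def policy_loss(log_probs, reward_advantages, cost_advantages):
    lam = F.softplus(lagrange_param).detach()  # non-negative multiplier, fixed for this update
    # Maximize reward advantage while penalizing cost advantage, weighted by lambda.
    return -(log_probs * (reward_advantages - lam * cost_advantages)).mean()

def update_lagrange(mean_episode_cost):
    # Lambda grows when the cost budget is violated and shrinks otherwise.
    lam = F.softplus(lagrange_param)
    loss = -lam * (mean_episode_cost - COST_BUDGET)
    lagrange_opt.zero_grad()
    loss.backward()
    lagrange_opt.step()
```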
Integrating temporal logic monitors with learning systems to filter unsafe actions.
Combining high-performance unverified controllers with formally verified safety controllers.
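The classic pattern is the Simplex architecture: a runtime decision module certifies each proposed action and falls back to the verified controller when it can't. A minimal sketch, with all names hypothetical:

```python
def control_step(state, performance_controller, safety_controller, is_certified_safe):
    # Use the high-performance (unverified) controller when its action passes the
    # formal safety check; otherwise fall back to the verified safety controller.
    action = performance_controller(state)
    if is_certified_safe(state, action):
        return action
    return safety_controller(state)  # provably keeps the system inside its safe set
```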
Theoretical framework for shutdown indifference.
Using utility heads to ensure formal guarantees of corrigibility.
Comprehensive framework for AI systems with quantitative, provable safety guarantees.
Extending formal verification to autonomous agents using cryptographic frameworks.
See also LessWrong Tag Interpretability and A Transparency and Interpretability Tech Tree
Reverse-engineering neural representations into interpretable circuits.
Extracting interpretable features by learning sparse representations of activations.
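A minimal sketch of the core idea: an overcomplete autoencoder with an L1 sparsity penalty on the feature activations (dimensions and the penalty weight are illustrative):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=16384, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))       # sparse feature activations
        reconstruction = self.decoder(features)
        recon_loss = (reconstruction - activations).pow(2).mean()
        sparsity_loss = self.l1_coeff * features.abs().sum(-1).mean()
        return features, reconstruction, recon_loss + sparsity_loss
```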
Understanding neural network representations through direct visualization.
Scalable analysis of model behavior and persuasion dynamics.
Interactive visualizations of feature-feature interactions.
Rigorous method for testing interpretability hypotheses in neural networks.
Attribution method using path integrals to attribute predictions to inputs.
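For concreteness, a minimal integrated-gradients sketch that approximates the path integral with a Riemann sum along the straight line from a baseline to the input (the model, baseline, and batch handling are assumed):

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    # Interpolate along the straight-line path from baseline to input.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    logits = model(path)                      # assumes the model accepts a batch of inputs
    grads = torch.autograd.grad(logits[:, target_class].sum(), path)[0]
    avg_grad = grads.mean(dim=0)              # Riemann approximation of the path integral
    return (x - baseline) * avg_grad          # attribution per input dimension
```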
Detection of complex cognitive behaviors including alignment faking.
Precise modification of factual associations within language models.
Identifying specific components responsible for factual knowledge.
Using approaches from physics to establish bounds on model behavior.
Activation-level interventions to suppress harmful trajectories.
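A minimal sketch of one such intervention: adding a steering vector to a transformer block's output at inference time via a PyTorch forward hook (the layer index, coefficient, and steering vector are all assumptions):

```python
import torch

def add_steering_hook(block, steering_vector, coeff=4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * steering_vector       # shift the residual stream
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)

# handle = add_steering_hook(model.transformer.h[12], refusal_direction)  # hypothetical names
# ... run generation with the steered model ...
# handle.remove()
```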
Localizing computation in neural networks through gradient masking.
Understanding how AI models acquire capabilities during training.
Mathematical foundations for understanding learning dynamics and phase transitions.
Using human-labeled preferences for alignment training.
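The core of this pipeline is a Bradley-Terry-style reward model trained on human preference pairs; a minimal sketch of that loss (the reward-model interface is assumed), after which the policy is optimized against the learned reward with, e.g., PPO:

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    # The reward model is assumed to return one scalar reward per sequence.
    r_chosen = reward_model(chosen_batch)
    r_rejected = reward_model(rejected_batch)
    # -log sigmoid(r_chosen - r_rejected): push preferred completions above rejected ones.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```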
Bootstrapping alignment from smaller aligned models.
Leveraging rule-based critiques to reduce reliance on human raters.
Removing dual-use content during training for tamper-resistant safeguards.
Models generate and utilize their own self-reflective feedback for alignment.
Building AI systems that infer human values from behavior and feedback.
Learning safe behaviors from expert demonstrations.
Recursively training models to decompose and amplify human supervision.
Two models in adversarial dialogue judged by humans.
Training reward models for sub-tasks and combining them for harder tasks.
Extracting truthful internal representations even when deceptive behavior could arise.
Framework for understanding how values and goals emerge through training.
See also LessWrong Tag Adversarial Examples, AI Safety 101: Unrestricted Adversarial Training, An Overview of 11 Proposals for Building Safe Advanced AI
Augmenting training with adversarial examples including jailbreak defenses.
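In the vision setting, the simplest instance is mixing FGSM-perturbed examples into each batch; a minimal sketch (epsilon and the 50/50 mix are illustrative, and LLM-side analogues train on jailbreak attempts instead):

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    # One-step FGSM perturbation in the direction that increases the loss.
    x_adv = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
    x_adv = (x + epsilon * grad.sign()).detach()

    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```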
Defense systems against prompt injection attacks.
Testing for misuse, capability hazards, and safety failures.
Evaluating agent vulnerabilities in realistic desktop environments.
Assessing whether agents maintain intended objectives over extended interactions.
Measuring models' willingness to attempt persuasion on harmful topics.
Evaluating AI companionship behaviors that can lead to emotional dependency.
Ensuring safety assessments accurately distinguish model capabilities.
Systematic data curation to enhance model robustness.
See also LessWrong Tag Human-AI Interaction
Treating LLMs as psychological subjects to probe reasoning and behavior.
Automated monitoring and control of personality traits.
Fine-tuning with paired prompts to reduce deceptive behavior.
Identifying when models strategically fake alignment.
Reverse-engineering human pro-social instincts and building AGI using architectures with similar effects.
Understanding LLMs as universal simulators rather than goal-pursuing agents.
See also LessWrong Tag AI Control and Notes on Control Evaluations for Safety Cases
Designing protocols for deploying powerful but untrusted AI systems.
Fine-grained permission systems and OS-level sandboxing for AI agents.
Using confidence calibration to defer uncertain tasks to more capable models.
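A minimal sketch of the routing logic, assuming the weaker model exposes a calibrated confidence score (all names hypothetical):

```python
def route(task, trusted_model, capable_model, threshold=0.85):
    answer, confidence = trusted_model(task)   # assumed to return (answer, calibrated confidence)
    if confidence >= threshold:
        return answer
    return capable_model(task)                 # defer uncertain tasks to the more capable model
```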
Advanced containment architecture for isolating potentially malicious AI systems.
Physical high-performance computing hardware assurance for compliance.
Tracking systems linking AI outputs to precise production trajectories.
Digital watermarking techniques for AI-generated content.
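One well-known scheme for text is the "green list" watermark: a key derived from the previous token pseudorandomly partitions the vocabulary, and green-listed tokens get a logit bonus that a detector can later count. A minimal sketch of the biasing step (parameters illustrative):

```python
import torch

def watermark_logits(logits, prev_token_id, gamma=0.5, delta=2.0):
    vocab_size = logits.shape[-1]
    gen = torch.Generator().manual_seed(int(prev_token_id))        # key: previous token
    green_ids = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green_ids] += delta       # nudge sampling toward the secret green list
    return biased
```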
Preventing covert channels and hidden information in AI systems.
Real-time interception and filtering during AI inference.
Continuous monitoring for performance degradation and anomalous behavior.
See also LessWrong Tag AI Governance and Advice for Entering AI Safety Research
Third-party evaluations before AI system release.
Using Trusted Execution Environments for verifiable safety benchmarks.
Structured risk evaluation adapted from high-reliability industries.
Risk-based regulatory obligations and safety testing mandates.
Labs release safety evidence and define deployment gates.
End-to-end governance workflows with compliance linkage.
Research infrastructure, community building, and coordination.
This is your chance to work on something nobody has worked on before. See: Feedback Wanted: Shortlist of AI Safety Ideas; Ten AI Safety Projects I'd Like People to Work On; AI alignment project ideas.
See also LessWrong Tag AI Safety Research
Treating prompts and agent orchestration as formal programs with verifiable properties.
Extending barrier certificates to multi-step, multi-API agent action graphs.
Biosafety-level-like standards for labs training frontier models.
Incentive-compatible auditor frameworks using mechanism design principles to resist collusion and selection bias. Includes reward structure design to prevent tampering and manipulation.
Risk transfer mechanisms including catastrophe bonds and mandatory coverage.
Systematic hazard analysis for data pipelines using safety engineering methods.
Using AI systems to accelerate safety research.
Note: I'm currently collecting a longer list of papers and projects in this category. A lot of people are working on this!
Integration of broader human values through citizen assemblies and stakeholder panels.
Systematic approaches for detecting and preventing deceptive behaviors.
Frameworks for controlling how AI systems generalize to new tasks.
Safety frameworks for environments with multiple interacting AI systems.
Technical tools for implementing, monitoring, and enforcing AI governance policies.
Infrastructure and protocols for coordinating AI governance across international boundaries.
Quantitative frameworks for measuring human disempowerment as AI capabilities advance.