This tries to be a pretty comprehensive list of all AI safety, alignment, and control interventions.
Much of the collection was conducted as part of an internal report on the field for AE Studio under Diogo de Lucena. I'd like to thank Aaron Scher, who maintains the #papers-running-list in the AI Alignment Slack, as well as the reviewers Cameron Berg and Martin Leitgab, for their contributions to the report.
This post doesn't try to explain the interventions and provides only the tersest summaries. It serves as a sort of top-level index to all the relevant posts and papers. The much longer paper version of this post has additional summaries for the interventions (but fewer LW links) and can be found here.
AI disclaimer: Many of the summaries have been cowritten or edited with ChatGPT.
Please let me know about any link errors or if I overlooked any intervention, especially any whole type of intervention.
Table of Contents
Prior Overviews
This consolidated report drew on the following prior efforts.
Comprehensive Surveys
Control and Operational Approaches
Governance and Policy
Project Ideas and Research Directions
Foundational Theories
See also AI Safety Arguments Guide
Embedded Agency
Moving beyond the Cartesian boundary model to agents that exist within and interact with their environment.
Decision Theory and Rational Choice
Foundations for rational choice under uncertainty, including causal vs. evidential decision theory and updateless decision theory.
Optimization and Mesa-Optimization
Understanding when and how learned systems themselves become optimizers, with implications for deception and alignment faking.
Logical Induction
MIRI's framework for reasoning under logical uncertainty with computable algorithms.
Cartesian Frames and Finite Factored Sets
Scott Garrabrant's frameworks for modeling agency, time, and causal structure without assuming a fixed agent-environment boundary.
Infra-Bayesianism and Logical Uncertainty
Handling uncertainty in logical domains and imperfect models.
Hard Methods: Formal Guarantees
See also LessWrong Tag Formal Verification and LessWrong Tag Corrigibility
Neural Network Verification
Mathematical verification methods to prove properties about neural networks.
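For intuition, here is a minimal interval-bound-propagation sketch; the tiny random network and perturbation radius are stand-ins, and real verifiers use tighter relaxations and much larger models:

```python
# Minimal interval bound propagation (IBP) sketch: propagate an input box through
# a small ReLU network to get certified output bounds. Weights are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 4)), rng.normal(size=16)
W2, b2 = rng.normal(size=(2, 16)), rng.normal(size=2)

def interval_linear(lo, hi, W, b):
    """Exact interval bounds for an affine layer."""
    W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
    return W_pos @ lo + W_neg @ hi + b, W_pos @ hi + W_neg @ lo + b

x = np.array([0.5, -0.2, 0.1, 0.8])
eps = 0.05                                      # certified perturbation radius
lo, hi = x - eps, x + eps

lo, hi = interval_linear(lo, hi, W1, b1)
lo, hi = np.maximum(lo, 0), np.maximum(hi, 0)   # ReLU is monotone, so bounds pass through
lo, hi = interval_linear(lo, hi, W2, b2)

# Every input within eps of x is guaranteed to produce outputs inside [lo, hi].
print("output lower bounds:", lo)
print("output upper bounds:", hi)
```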
Conformal Prediction
Adding confidence guarantees to existing models.
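A minimal split-conformal sketch, assuming a toy linear "model" and synthetic data (purely illustrative, not any specific library's API):

```python
# Split conformal prediction: calibrate on held-out residuals, then wrap any
# point prediction in an interval with a distribution-free coverage guarantee.
import numpy as np

rng = np.random.default_rng(0)
x_cal = rng.uniform(0, 10, 500)
y_cal = 2 * x_cal + rng.normal(0, 1, 500)   # synthetic data; the "model" below is a stand-in
predict = lambda x: 2 * x

# 1. Nonconformity scores on the calibration set.
scores = np.abs(y_cal - predict(x_cal))

# 2. Quantile with finite-sample correction for 90% coverage.
alpha = 0.1
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# 3. Prediction interval for a new point: coverage >= 1 - alpha holds under
#    exchangeability, regardless of how good the underlying model is.
x_new = 4.2
print(f"interval: [{predict(x_new) - q:.2f}, {predict(x_new) + q:.2f}]")
```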
Proof-Carrying Models
Adapting proof-carrying code to ML where outputs must be accompanied by proofs of compliance/validity.
Safe Reinforcement Learning (SafeRL)
Algorithms that maintain safety constraints during learning while maximizing returns.
Shielded RL
Integrating temporal logic monitors with learning systems to filter unsafe actions.
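A minimal sketch of the shielding idea, assuming a toy grid world and a hand-written monitor standing in for a temporal-logic shield:

```python
# Shielded RL sketch: a runtime monitor vetoes actions that would violate the
# safety property before the learner's action reaches the environment.
UNSAFE_CELLS = {(2, 2), (3, 1)}          # states the (toy) spec forbids entering

def next_state(state, action):
    moves = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}
    dx, dy = moves[action]
    return (state[0] + dx, state[1] + dy)

def shield(state, proposed_action):
    """Pass the proposed action through if it satisfies the spec; otherwise substitute a safe one."""
    for a in [proposed_action, "up", "down", "left", "right"]:
        if next_state(state, a) not in UNSAFE_CELLS:
            return a
    raise RuntimeError("no safe action available")

# The RL agent proposes whatever it likes; only shielded actions are executed.
state = (2, 1)
print(shield(state, "up"))      # (2, 2) is unsafe -> a safe substitute is executed
print(shield(state, "left"))    # safe -> passed through unchanged
```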
Runtime Assurance Architectures (Simplex)
Combining high-performance unverified controllers with formally verified safety controllers.
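A minimal Simplex-style sketch; the dynamics, controllers, and safety envelope are toy assumptions:

```python
# Simplex runtime assurance sketch: use the high-performance controller while a
# decision module judges the state recoverable, else switch to the verified one.
SPEED_LIMIT = 10.0   # verified safe envelope: |velocity| must stay below this

def performance_controller(velocity):
    return 2.0                               # aggressive, unverified: always accelerate

def safety_controller(velocity):
    return -1.0 if velocity > 0 else 0.0     # simple verified controller: brake

def decision_module(velocity, accel, dt=0.1, margin=1.0):
    """Accept the performance command only if it keeps the state inside the envelope."""
    return abs(velocity + accel * dt) < SPEED_LIMIT - margin

velocity = 0.0
for step in range(100):
    accel = performance_controller(velocity)
    if not decision_module(velocity, accel):
        accel = safety_controller(velocity)  # fall back to the verified controller
    velocity += accel * 0.1

print(f"final velocity: {velocity:.2f}")     # stays below the verified limit
```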
Safely Interruptible Agents
Theoretical framework for shutdown indifference.
Provably Corrigible Agents
Using utility heads to ensure formal guarantees of corrigibility.
Guaranteed Safe AI (GSAI)
Comprehensive framework for AI systems with quantitative, provable safety guarantees.
Proofs of Autonomy
Extending formal verification to autonomous agents using cryptographic frameworks.
Mechanistic and Mathematical Interpretability
See also LessWrong Tag Interpretability and A Transparency and Interpretability Tech Tree
Circuit Analysis and Feature Discovery
Reverse-engineering neural representations into interpretable circuits.
Sparse Autoencoders (SAEs)
Extracting interpretable features by learning sparse representations of activations.
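A minimal sketch of the core SAE training loop in PyTorch; the dimensions and sparsity coefficient are illustrative assumptions rather than any published setup:

```python
# Sparse autoencoder sketch: reconstruct activations through an overcomplete,
# ReLU-sparse bottleneck so individual features become more interpretable.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))   # sparse, overcomplete features
        recon = self.decoder(features)               # reconstruction of the activations
        return recon, features

sae = SparseAutoencoder(d_model=512, d_hidden=4096)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

acts = torch.randn(64, 512)  # stand-in for residual-stream activations
for _ in range(100):
    recon, features = sae(acts)
    loss = ((recon - acts) ** 2).mean() + 1e-3 * features.abs().mean()  # reconstruction + L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()
```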
Feature Visualization
Understanding neural network representations through direct visualization.
Linear Probes
Training simple linear classifiers on internal activations to test which concepts a model represents, enabling scalable analysis of model behavior.
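A minimal probe sketch using scikit-learn; the "activations" and concept labels here are synthetic stand-ins for activations extracted from a real model:

```python
# Linear probe sketch: fit a logistic regression on activations to check whether
# a concept label is linearly decodable from a layer's representations.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))       # stand-in for hidden activations
concept = rng.integers(0, 2, size=1000)   # stand-in concept labels
acts[concept == 1] += 0.2                 # plant a weak linear signal for the demo

X_tr, X_te, y_tr, y_te = train_test_split(acts, concept, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("probe accuracy:", probe.score(X_te, y_te))  # well above 0.5 => concept is linearly represented
```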
Attribution Graphs
Interactive visualizations of feature-feature interactions.
Causal Scrubbing
Rigorous method for testing interpretability hypotheses in neural networks.
Integrated Gradients
Attribution method using path integrals to attribute predictions to inputs.
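A minimal sketch of the Riemann-sum approximation in PyTorch, with a toy model standing in for the network being explained:

```python
# Integrated gradients sketch: average gradients along the straight-line path from
# a baseline to the input, then scale by the input difference.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
x = torch.randn(10)
baseline = torch.zeros(10)
steps = 50

alphas = torch.linspace(0, 1, steps).unsqueeze(1)   # (steps, 1)
path = baseline + alphas * (x - baseline)            # interpolated inputs, (steps, 10)
path.requires_grad_(True)
model(path).sum().backward()
avg_grad = path.grad.mean(dim=0)

attributions = (x - baseline) * avg_grad              # one attribution per input feature
print(attributions)
```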
Chain-of-Thought Analysis
Detection of complex cognitive behaviors including alignment faking.
Model Editing (ROME)
Precise modification of factual associations within language models.
Knowledge Neurons
Identifying specific components responsible for factual knowledge.
Physics-Informed Model Control
Using approaches from physics to establish bounds on model behavior.
Representation Engineering
Activation-level interventions to suppress harmful trajectories.
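A minimal activation-steering sketch in PyTorch; the toy model and random steering vector are stand-ins for a language model and a direction derived from contrastive prompts:

```python
# Representation engineering sketch: add a fixed direction to an intermediate
# activation via a forward hook, intervening on the hidden representation directly.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
steering_vector = torch.randn(16)   # stand-in for e.g. an "honesty" or "refusal" direction
alpha = 2.0                         # steering strength

def steer(module, inputs, output):
    return output + alpha * steering_vector   # replace the module's output with the steered version

handle = model[0].register_forward_hook(steer)
x = torch.randn(1, 8)
steered = model(x)
handle.remove()
print(model(x) - steered)           # outputs differ only because of the intervention
```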
Gradient Routing
Localizing computation in neural networks through gradient masking.
Developmental Interpretability
Understanding how AI models acquire capabilities during training.
Singular Learning Theory (SLT)
Mathematical foundations for understanding learning dynamics and phase transitions.
Scalable Oversight and Alignment Training
Reinforcement Learning from Human Feedback (RLHF)
Using human-labeled preferences for alignment training.
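A minimal sketch of the reward-modeling step, fitting a Bradley-Terry pairwise loss on preference data; the response "embeddings" here are random stand-ins for real (prompt, response) encodings:

```python
# RLHF reward-modeling sketch: train a reward model so that human-preferred
# responses score higher than rejected ones (Bradley-Terry pairwise loss).
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

chosen = torch.randn(256, 128)     # stand-in embeddings of responses humans preferred
rejected = torch.randn(256, 128)   # stand-in embeddings of responses humans rejected

for _ in range(200):
    r_chosen = reward_model(chosen)
    r_rejected = reward_model(rejected)
    # Maximize the probability that the chosen response outranks the rejected one.
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# The trained reward model then provides the reward signal for the RL fine-tuning stage.
```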
Reinforcement Learning from AI Feedback (RLAIF)
Bootstrapping alignment from smaller aligned models.
Constitutional AI
Leveraging rule-based critiques to reduce reliance on human raters.
Pretraining Data Filtering
Removing dual-use content from training data to create tamper-resistant safeguards.
Reinforcement Learning from Reflective Feedback (RLRF)
Models generate and utilize their own self-reflective feedback for alignment.
CALMA
Value Learning / Cooperative Inverse Reinforcement Learning (CIRL)
Building AI systems that infer human values from behavior and feedback.
Imitation Learning
Learning safe behaviors from expert demonstrations.
Iterated Distillation and Amplification (IDA)
Recursively training models to decompose and amplify human supervision.
AI Safety via Debate
Two models in adversarial dialogue judged by humans.
Recursive Reward Modeling
Training reward models for sub-tasks and combining them for harder tasks.
Eliciting Latent Knowledge (ELK)
Extracting truthful internal representations even when deceptive behavior could arise.
How to tell if your eyes deceive you
Shard Theory
Framework for understanding how values and goals emerge through training.
Robustness and Adversarial Evaluation
See also LessWrong Tag Adversarial Examples, AI Safety 101: Unrestricted Adversarial Training, An Overview of 11 Proposals for Building Safe Advanced AI
Adversarial Training
Augmenting training with adversarial examples including jailbreak defenses.
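A minimal FGSM-style adversarial-training loop in PyTorch on toy data; jailbreak defenses for LLMs apply the same train-on-attacks idea to adversarial prompts rather than input perturbations:

```python
# Adversarial training sketch: craft worst-case perturbations of the inputs with a
# single gradient step (FGSM), then train the model on the perturbed batch.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
eps = 0.1

x = torch.randn(128, 20)
y = torch.randint(0, 2, (128,))

for _ in range(100):
    # 1. Generate adversarial examples by ascending the loss w.r.t. the inputs.
    x_adv = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).detach()

    # 2. Train on the adversarial batch so the model stays correct under perturbation.
    loss = F.cross_entropy(model(x_adv), y)
    opt.zero_grad(); loss.backward(); opt.step()
```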
Prompt Injection Defenses
Defense systems against prompt injection attacks.
Red-Teaming and Capability Evaluations
Testing for misuse, capability hazards, and safety failures.
OS-HARM Benchmark
Evaluating agent vulnerabilities in realistic desktop environments.
Goal Drift Evaluation
Assessing whether agents maintain intended objectives over extended interactions.
Attempt to Persuade Eval (APE)
Measuring models' willingness to attempt persuasion on harmful topics.
INTIMA Benchmark
Evaluating AI companionship behaviors that can lead to emotional dependency.
Signal-to-Noise Analysis for Evaluations
Ensuring safety assessments accurately distinguish model capabilities.
Data Scaling Laws for Domain Robustness
Systematic data curation to enhance model robustness.
Behavioral and Psychological Approaches
See also LessWrong Tag Human-AI Interaction
LLM Psychology
Treating LLMs as psychological subjects to probe reasoning and behavior.
Persona Vectors
Automated monitoring and control of personality traits.
Self-Other Overlap Fine-Tuning (SOO-FT)
Fine-tuning with paired prompts to reduce deceptive behavior.
Alignment Faking Detection
Identifying when models strategically fake alignment.
Brain-Like AGI Safety
Reverse-engineering human pro-social instincts and building AGI using architectures with similar effects.
Robopsychology and Simulator Theory
Understanding LLMs as universal simulators rather than goal-pursuing agents.
Operational Control and Infrastructure
See also LessWrong Tag AI Control and Notes on Control Evaluations for Safety Cases
AI Control Framework
Designing protocols for deploying powerful but untrusted AI systems.
Permission Management and Sandboxing
Fine-grained permission systems and OS-level sandboxing for AI agents.
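A minimal permission-gating sketch; the tools and allowlist policy are hypothetical:

```python
# Permission management sketch: every agent tool call is checked against an
# explicit allowlist, and anything outside it is blocked pending human approval.
ALLOWED = {
    "read_file": {"/workspace"},   # tool -> path prefixes the agent may touch
    "web_search": {"*"},
}

def check_permission(tool: str, argument: str) -> bool:
    prefixes = ALLOWED.get(tool, set())
    return "*" in prefixes or any(argument.startswith(p) for p in prefixes)

def call_tool(tool: str, argument: str):
    if not check_permission(tool, argument):
        raise PermissionError(f"{tool}({argument!r}) blocked; requires human approval")
    print(f"executing {tool}({argument!r}) inside the sandbox")

call_tool("read_file", "/workspace/notes.txt")   # allowed
try:
    call_tool("read_file", "/etc/passwd")        # outside the allowlist
except PermissionError as e:
    print(e)
```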
Model Cascades
Using confidence calibration to defer uncertain tasks to more capable models.
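A minimal deferral sketch; both model functions and the confidence threshold are hypothetical stand-ins:

```python
# Model cascade sketch: answer with the cheap model when it is confident,
# otherwise defer to the larger (slower, more trusted) model.
import numpy as np

def small_model(x):
    """Stand-in cheap model: returns class probabilities."""
    logits = 5 * np.array([x, 1 - x])
    return np.exp(logits) / np.exp(logits).sum()

def large_model(x):
    """Stand-in expensive/trusted model."""
    return np.array([1.0, 0.0]) if x > 0.5 else np.array([0.0, 1.0])

def cascade(x, threshold=0.75):
    probs = small_model(x)
    if probs.max() >= threshold:                    # confident: keep the cheap answer
        return int(probs.argmax()), "small"
    return int(large_model(x).argmax()), "large"    # uncertain: defer upward

print(cascade(0.9))    # confident -> handled by the small model
print(cascade(0.55))   # uncertain -> deferred to the large model
```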
Guillotine Hypervisor
Advanced containment architecture for isolating potentially malicious AI systems.
AI Hardware Security
Hardware-level security and assurance for high-performance compute, enabling compliance verification.
Artifact and Experiment Lineage Tracking
Tracking systems linking AI outputs to precise production trajectories.
Shutdown Mechanisms and Cluster Kill Switches
Infrastructure-level mechanisms for rapidly halting training runs or deployed systems.
Watermarking and Output Detection
Digital watermarking techniques for AI-generated content.
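A minimal detection sketch in the spirit of "green list" logit-bias watermarks; the tokenization, hash seeding, and green fraction are illustrative assumptions:

```python
# Watermark detection sketch: if the generator softly favored a pseudorandom half
# of the vocabulary at each step, watermarked text contains far more "green"
# tokens than chance, which a z-test can detect.
import hashlib
import math

GREEN_FRACTION = 0.5   # expected green fraction in unwatermarked text

def is_green(prev_token: str, token: str) -> bool:
    """Pseudorandom green/red assignment seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def detection_z_score(tokens: list[str]) -> float:
    """z-score of the observed green-token count against the no-watermark null."""
    n = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    mean = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - mean) / std

# A z-score well above ~4 is strong evidence the text carries the watermark.
print(detection_z_score("the quick brown fox jumps over the lazy dog".split()))
```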
Steganography and Context Leak Countermeasures
Preventing covert channels and hidden information in AI systems.
Runtime AI Firewalls and Content Filtering
Real-time interception and filtering during AI inference.
AI System Observability and Drift Detection
Continuous monitoring for performance degradation and anomalous behavior.
Governance and Institutions
See also LessWrong Tag AI Governance and Advice for Entering AI Safety Research
Pre-Deployment External Safety Testing
Third-party evaluations before AI system release.
Attestable Audits
Using Trusted Execution Environments for verifiable safety benchmarks.
Probabilistic Risk Assessment (PRA) for AI
Structured risk evaluation adapted from high-reliability industries.
Regulation: EU AI Act and US EO 14110
Risk-based regulatory obligations and safety testing mandates.
System Cards and Preparedness Frameworks
Labs release safety evidence and define deployment gates.
AI Governance Platforms
End-to-end governance workflows with compliance linkage.
Ecosystem Development and Meta-Interventions
Research infrastructure, community building, and coordination.
Underexplored Interventions
This is your chance to work on something nobody has worked on before. See also Feedback Wanted: Shortlist of AI Safety Ideas, Ten AI Safety Projects I'd Like People to Work On, and AI alignment project ideas.
See also LessWrong Tag AI Safety Research
Compositional Formal Specifications for Prompts/Agents
Treating prompts and agent orchestration as formal programs with verifiable properties.
Control-Theoretic Certificates for Tool-Using Agents
Extending barrier certificates to multi-step, multi-API agent action graphs.
AI-BSL: Capability-Tiered Physical Containment Standards
Biosafety-level-like standards for labs training frontier models.
Oversight Mechanism Design
Incentive-compatible auditor frameworks using mechanism design principles to resist collusion and selection bias. Includes reward structure design to prevent tampering and manipulation.
Liability and Insurance Instruments
Risk transfer mechanisms including catastrophe bonds and mandatory coverage.
Dataset Hazard Engineering
Systematic hazard analysis for data pipelines using safety engineering methods.
Automated Alignment Research
Using AI systems to accelerate safety research.
Note: I'm currently collecting a longer list of papers and projects in this category. A lot of people are working on this!
Deliberative and Cultural Interventions
Integration of broader human values through citizen assemblies and stakeholder panels.
Deceptive Behavior Detection and Mitigation
Systematic approaches for detecting and preventing deceptive behaviors.
Generalization Control and Capability Containment
Frameworks for controlling how AI systems generalize to new tasks.
Multi-Agent Safety and Coordination Protocols
Safety frameworks for environments with multiple interacting AI systems.
Technical Governance Implementation Tools
Technical tools for implementing, monitoring, and enforcing AI governance policies.
International AI Coordination Mechanisms
Infrastructure and protocols for coordinating AI governance across international boundaries.
Systemic Disempowerment Measurement
Quantitative frameworks for measuring human disempowerment as AI capabilities advance.