This post tries to be a fairly comprehensive list of all AI safety, alignment, and control interventions.
Much of the material was collected as part of an internal report on the field for AE Studio under Diogo de Lucena. I'd like to thank Aaron Scher, who maintains the #papers-running-list in the AI Alignment Slack, as well as the reviewers Cameron Berg and Martin Leitgab, for their contributions to the report.
This post doesn't try to explain the interventions; it provides only the tersest summaries and serves as a sort of top-level index to the relevant posts and papers. The much longer paper version of this post has additional summaries for the interventions (but fewer LW links) and can be found here.
AI disclaimer: Many of the summaries have been cowritten or edited with ChatGPT.
Please let me know about any broken links or if I overlooked an intervention, especially a whole type of intervention.
This consolidated report drew on the following prior efforts.
See also AI Safety Arguments Guide
Moving beyond the Cartesian boundary model to agents that exist within and interact with their environment.
Foundations for rational choice under uncertainty, including causal vs. evidential decision theory and updateless decision theory.
Understanding when and how learned systems themselves become optimizers, with implications for deception and alignment faking.
MIRI's framework for reasoning under logical uncertainty with computable algorithms.
Handling uncertainty in logical domains and imperfect models.
See also LessWrong Tag Formal Verification and LessWrong Tag Corrigibility
Mathematical verification methods to prove properties about neural networks.
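To give a flavor of what such verification primitives look like, here is a minimal sketch of interval bound propagation through a fully connected ReLU network; the layer list and shapes are hypothetical, and real verifiers (branch-and-bound, SMT-based tools, etc.) are far more involved.

```python
import torch

def ibp_linear(lower, upper, weight, bias):
    # Propagate an elementwise box [lower, upper] through x @ weight.T + bias.
    center, radius = (upper + lower) / 2, (upper - lower) / 2
    new_center = center @ weight.T + bias
    new_radius = radius @ weight.abs().T  # worst-case spread under the linear map
    return new_center - new_radius, new_center + new_radius

def certified_output_bounds(lower, upper, layers):
    # `layers` is a list of (weight, bias) pairs, each followed by a ReLU.
    for weight, bias in layers:
        lower, upper = ibp_linear(lower, upper, weight, bias)
        lower, upper = torch.relu(lower), torch.relu(upper)  # ReLU is monotone
    return lower, upper  # sound (if loose) bounds on the network's outputs
```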
Adding confidence guarantees to existing models.
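One common way to bolt confidence guarantees onto an existing classifier is split conformal prediction. A minimal sketch, assuming a held-out calibration set and softmax outputs (variable names are illustrative):

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(scores, level, method="higher")

def prediction_set(test_probs, threshold):
    # Labels included in the set; covers the true label with probability ~1 - alpha.
    return np.where(test_probs >= 1.0 - threshold)[0]
```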
Adapting proof-carrying code to ML where outputs must be accompanied by proofs of compliance/validity.
Algorithms that maintain safety constraints during learning while maximizing returns.
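A common pattern here is Lagrangian constrained policy optimization: treat the safety constraint as a cost with a budget and adapt a multiplier. A minimal sketch with illustrative names, assuming per-step log-probs and advantage estimates are already computed:

```python
import torch
import torch.nn.functional as F

lagrange_param = torch.zeros(1, requires_grad=True)
lagrange_opt = torch.optim.Adam([lagrange_param], lr=1e-2)
COST_BUDGET = 25.0  # maximum allowed expected episode cost (illustrative)

def policy_loss(log_probs, reward_advantages, cost_advantages):
    lam = F.softplus(lagrange_param).detach()  # non-negative multiplier, fixed for this update
    # Maximize reward advantage while penalizing cost advantage, weighted by lambda.
    return -(log_probs * (reward_advantages - lam * cost_advantages)).mean()

def update_lagrange(mean_episode_cost):
    # Lambda grows when the cost budget is violated and shrinks otherwise.
    lam = F.softplus(lagrange_param)
    loss = -lam * (mean_episode_cost - COST_BUDGET)
    lagrange_opt.zero_grad()
    loss.backward()
    lagrange_opt.step()
```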
Integrating temporal logic monitors with learning systems to filter unsafe actions.
Combining high-performance unverified controllers with formally verified safety controllers.
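The classic pattern is the Simplex architecture: a runtime decision module certifies each proposed action and falls back to the verified controller when it can't. A minimal sketch, with all names hypothetical:

```python
def control_step(state, performance_controller, safety_controller, is_certified_safe):
    # Use the high-performance (unverified) controller when its action passes the
    # formal safety check; otherwise fall back to the verified safety controller.
    action = performance_controller(state)
    if is_certified_safe(state, action):
        return action
    return safety_controller(state)  # provably keeps the system inside its safe set
```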
Theoretical framework for shutdown indifference.
Using utility heads to ensure formal guarantees of corrigibility.
Comprehensive framework for AI systems with quantitative, provable safety guarantees.
Extending formal verification to autonomous agents using cryptographic frameworks.
See also LessWrong Tag Interpretability and A Transparency and Interpretability Tech Tree
Reverse-engineering neural representations into interpretable circuits.
Extracting interpretable features by learning sparse representations of activations.
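A minimal sketch of the core idea: an overcomplete autoencoder with an L1 sparsity penalty on the feature activations (dimensions and the penalty weight are illustrative):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=16384, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))       # sparse feature activations
        reconstruction = self.decoder(features)
        recon_loss = (reconstruction - activations).pow(2).mean()
        sparsity_loss = self.l1_coeff * features.abs().sum(-1).mean()
        return features, reconstruction, recon_loss + sparsity_loss
```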
Understanding neural network representations through direct visualization.
Scalable analysis of model behavior and persuasion dynamics.
Interactive visualizations of feature-feature interactions.
Rigorous method for testing interpretability hypotheses in neural networks.
Attribution method using path integrals to attribute predictions to inputs.
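For concreteness, a minimal integrated-gradients sketch that approximates the path integral with a Riemann sum along the straight line from a baseline to the input (the model, baseline, and batch handling are assumed):

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    # Interpolate along the straight-line path from baseline to input.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
    path = (baseline + alphas * (x - baseline)).detach().requires_grad_(True)
    logits = model(path)                      # assumes the model accepts a batch of inputs
    grads = torch.autograd.grad(logits[:, target_class].sum(), path)[0]
    avg_grad = grads.mean(dim=0)              # Riemann approximation of the path integral
    return (x - baseline) * avg_grad          # attribution per input dimension
```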
Detection of complex cognitive behaviors including alignment faking.
Precise modification of factual associations within language models.
Identifying specific components responsible for factual knowledge.
Using approaches from physics to establish bounds on model behavior.
Activation-level interventions to suppress harmful trajectories.
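A minimal sketch of one such intervention: adding a steering vector to a transformer block's output at inference time via a PyTorch forward hook (the layer index, coefficient, and steering vector are all assumptions):

```python
import torch

def add_steering_hook(block, steering_vector, coeff=4.0):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * steering_vector       # shift the residual stream
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return block.register_forward_hook(hook)

# handle = add_steering_hook(model.transformer.h[12], refusal_direction)  # hypothetical names
# ... run generation with the steered model ...
# handle.remove()
```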
Localizing computation in neural networks through gradient masking.
Understanding how AI models acquire capabilities during training.
Mathematical foundations for understanding learning dynamics and phase transitions.
Using human-labeled preferences for alignment training.
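The core of this pipeline is a Bradley-Terry-style reward model trained on human preference pairs; a minimal sketch of that loss (the reward-model interface is assumed), after which the policy is optimized against the learned reward with, e.g., PPO:

```python
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    # The reward model is assumed to return one scalar reward per sequence.
    r_chosen = reward_model(chosen_batch)
    r_rejected = reward_model(rejected_batch)
    # -log sigmoid(r_chosen - r_rejected): push preferred completions above rejected ones.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```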
Bootstrapping alignment from smaller aligned models.
Leveraging rule-based critiques to reduce reliance on human raters.
Removing dual-use content during training for tamper-resistant safeguards.
Models generate and utilize their own self-reflective feedback for alignment.
Building AI systems that infer human values from behavior and feedback.
Learning safe behaviors from expert demonstrations.
Recursively training models to decompose and amplify human supervision.
Two models in adversarial dialogue judged by humans.
Training reward models for sub-tasks and combining them for harder tasks.
Extracting truthful internal representations even when deceptive behavior could arise.
Framework for understanding how values and goals emerge through training.
See also LessWrong Tag Adversarial Examples, AI Safety 101: Unrestricted Adversarial Training, An Overview of 11 Proposals for Building Safe Advanced AI
Augmenting training with adversarial examples including jailbreak defenses.
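In the vision setting, the simplest instance is mixing FGSM-perturbed examples into each batch; a minimal sketch (epsilon and the 50/50 mix are illustrative, and LLM-side analogues train on jailbreak attempts instead):

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    # One-step FGSM perturbation in the direction that increases the loss.
    x_adv = x.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)[0]
    x_adv = (x + epsilon * grad.sign()).detach()

    optimizer.zero_grad()
    loss = 0.5 * F.cross_entropy(model(x), y) + 0.5 * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```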
Defense systems against prompt injection attacks.
Testing for misuse, capability hazards, and safety failures.
Evaluating agent vulnerabilities in realistic desktop environments.
Assessing whether agents maintain intended objectives over extended interactions.
Measuring models' willingness to attempt persuasion on harmful topics.
Evaluating AI companionship behaviors that can lead to emotional dependency.
Ensuring safety assessments accurately distinguish model capabilities.
Systematic data curation to enhance model robustness.
See also LessWrong Tag Human-AI Interaction
Treating LLMs as psychological subjects to probe reasoning and behavior.
Automated monitoring and control of personality traits.
Fine-tuning with paired prompts to reduce deceptive behavior.
Identifying when models strategically fake alignment.
Reverse-engineering human pro-social instincts and building AGI using architectures with similar effects.
Understanding LLMs as universal simulators rather than goal-pursuing agents.
See also LessWrong Tag AI Control and Notes on Control Evaluations for Safety Cases
Designing protocols for deploying powerful but untrusted AI systems.
Fine-grained permission systems and OS-level sandboxing for AI agents.
Using confidence calibration to defer uncertain tasks to more capable models.
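A minimal sketch of the routing logic, assuming the weaker model exposes a calibrated confidence score (all names hypothetical):

```python
def route(task, trusted_model, capable_model, threshold=0.85):
    answer, confidence = trusted_model(task)   # assumed to return (answer, calibrated confidence)
    if confidence >= threshold:
        return answer
    return capable_model(task)                 # defer uncertain tasks to the more capable model
```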
Advanced containment architecture for isolating potentially malicious AI systems.
Physical high-performance computing hardware assurance for compliance.
Tracking systems linking AI outputs to precise production trajectories.
Digital watermarking techniques for AI-generated content.
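One well-known scheme for text is the "green list" watermark: a key derived from the previous token pseudorandomly partitions the vocabulary, and green-listed tokens get a logit bonus that a detector can later count. A minimal sketch of the biasing step (parameters illustrative):

```python
import torch

def watermark_logits(logits, prev_token_id, gamma=0.5, delta=2.0):
    vocab_size = logits.shape[-1]
    gen = torch.Generator().manual_seed(int(prev_token_id))        # key: previous token
    green_ids = torch.randperm(vocab_size, generator=gen)[: int(gamma * vocab_size)]
    biased = logits.clone()
    biased[green_ids] += delta       # nudge sampling toward the secret green list
    return biased
```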
Preventing covert channels and hidden information in AI systems.
Real-time interception and filtering during AI inference.
Continuous monitoring for performance degradation and anomalous behavior.
See also LessWrong Tag AI Governance and Advice for Entering AI Safety Research
Third-party evaluations before AI system release.
Using Trusted Execution Environments for verifiable safety benchmarks.
Structured risk evaluation adapted from high-reliability industries.
Risk-based regulatory obligations and safety testing mandates.
Labs release safety evidence and define deployment gates.
End-to-end governance workflows with compliance linkage.
Research infrastructure, community building, and coordination.
This is your chance to work on something nobody has worked on before. See: Feedback Wanted: Shortlist of AI Safety Ideas; Ten AI Safety Projects I'd Like People to Work On; AI alignment project ideas.
See also LessWrong Tag AI Safety Research
Treating prompts and agent orchestration as formal programs with verifiable properties.
Extending barrier certificates to multi-step, multi-API agent action graphs.
Biosafety-level-like standards for labs training frontier models.
Incentive-compatible auditor frameworks using mechanism design principles to resist collusion and selection bias. Includes reward structure design to prevent tampering and manipulation.
Risk transfer mechanisms including catastrophe bonds and mandatory coverage.
Systematic hazard analysis for data pipelines using safety engineering methods.
Using AI systems to accelerate safety research.
Note: I'm currently collecting a longer list of papers and projects in this category. A lot of people are working on this!
Integration of broader human values through citizen assemblies and stakeholder panels.
Systematic approaches for detecting and preventing deceptive behaviors.
Frameworks for controlling how AI systems generalize to new tasks.
Safety frameworks for environments with multiple interacting AI systems.
Technical tools for implementing, monitoring, and enforcing AI governance policies.
Infrastructure and protocols for coordinating AI governance across international boundaries.
Quantitative frameworks for measuring human disempowerment as AI capabilities advance.