Thanks for putting this all together.
I need to flag nontrivial issues in the "Neglected Approaches" section (AE Studio). The three listed outputs have correct links but appear to be hallucinated titles rather than names of real public papers or posts:
The listed critique "The 'Alignment Bonus' is a Dangerous Mirage" neither seems to actually exist nor links to anything real (the URL "lesswrong.com/posts/slug/example-critique-neg-tax" is clearly an LLM-generated placeholder).
These titles are plausible-sounding composites that capture themes of our work, but they aren't actual artifacts. This seems like LLM synthesis that slipped through review. Not sure for how many other sections this is the case.
FWIW, here are our actual outputs from the relevant period:
Some calls to action not bottlenecked on admissions:
https://apartresearch.com/#get-started
https://coda.io/@alignmentdev/ai-safety-info
https://researcher2.eleuther.ai/get-involved/
https://aisafety.quest/#volunteer
https://aisafetyawarenessproject.org/next-steps/
https://www.theomachialabs.com/
https://www.horizonomega.org/#get-involved
https://www.taraprogram.org/
An amazing resource that every year helps me understand the layout of the field a bit better - thanks for all the hard work! Also thanks for the breaks with memes throughout, it helps :D
Happy to help think through the possible presentations of this going forward. Also, PIBBSS alumni may be a good group to have on standby to add to stuff like this, let us know if you would benefit from partnering up!
Yeah we are thinking of making it real-time rather than annual, will chat once we've recovered.
Fairly-direct alignment via changing training to reflect actual human reward.
Unless I misunderstand the idea of the highlighted sentence, I believe the following post is also motivated by very much the same themes:
Why modelling multi-objective homeostasis is essential for AI alignment (and how it helps with AI safety as well). Subtleties and Open Challenges.
It is essentially about utility / reward functions in the brain and how naive unbounded maximisation is partially alien to biological / human needs. Many or even almost all biological needs require the target objectives to be in an optimal range - both too little and too much must be actively avoided.
If AI training (and the models' default assumptions / mathematics) does not reflect or optimally support these considerations, then the AI is likely unaligned from the start.
There is still an important place for unbounded objectives, but it seems unboundedness is appropriate primarily for instrumental objectives.
China
DeepSeek-V3.2-Speciale apparently performs at IMO gold level https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale - seems important
Control seems to be largely increasing p(doom), imo, by decreasing the chances of canaries.
To what extent was this post written with AI Assistance?
I am confused about this part, which is an image:
This particular image is from the AI village, and is mostly a light contextual flavor. I added a caption and made it visually stand out a bit to make this clearer - thanks for pointing out the confusion!
The review was written entirely by us; the specific ways we used LLMs are noted here.
Edited: Noted the post update
Website version · Gestalt · Repo and data
This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on the shallow review website.
It’s shallow in the sense that 1) we are not specialists in almost any of it and 2) we only spent about two hours on each entry. Still, among other things, we processed every arXiv paper on alignment, all Alignment Forum posts, and a year’s worth of Twitter.
It is substantially a list of lists structuring 800 links. The point is to produce stylised facts, forests out of trees; to help you look up what’s happening, or that thing you vaguely remember reading about; to help new researchers orient, know some of their options and the standing critiques; and to help you find who to talk to for actual information. We also track things which didn’t pan out.
Here, “AI safety” means technical work intended to prevent future cognitive systems from having large unintended negative effects on the world. So it’s capability restraint, instruction-following, value alignment, control, and risk awareness work.
We ignore a lot of relevant work (including most of capability restraint): things like misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to “technical policy” and products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We focus on papers and blogposts (rather than say, gdoc samizdat or tweets or Githubs or Discords). We only use public information, so we are off by some additional unknown factor.
We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).
Even ignoring all of that as we do, it’s still too long to read. Here’s a spreadsheet version (agendas and papers) and the github repo including the data and the processing pipeline. Methods down the bottom. Gavin’s editorial outgrew this post and became its own thing.
If we missed something big or got something wrong, please comment, we will edit it in.
An Arb Research project. Work funded by Coefficient Giving (formerly Open Philanthropy).
Labs (giant companies)
OpenAI
Some outputs (13)
Google DeepMind
Some outputs (14)
Anthropic
Some outputs (21)
xAI
Meta
Some outputs (6)
China
The Chinese companies often don’t attempt to be safe, often not even in the prosaic safeguards sense. They drop the weights immediately after post-training finishes. They’re mostly open weights and closed data. As of writing, the companies are often severely compute-constrained. There are some informal reasons to doubt their capabilities. The (academic) Chinese AI safety scene is, however, also growing.
Other labs
Black-box safety (understand and control current model behaviour)
Iterative alignment
Nudging base models by optimising their output. Worked on by the post-training teams at most labs; we estimate >500 FTE in some sense. Funded by most of the industry.
Iterative alignment at pretrain-time
Guide weights during pretraining.
Some outputs (2)
Iterative alignment at post-train-time
Modify weights after pre-training.
Some outputs (16)
Black-box make-AI-solve-it
Focus on using existing models to improve and align further models.
Some outputs (12)
Inoculation prompting
Prompt mild misbehaviour in training, to prevent the failure mode where an AI that misbehaves in one mild way becomes more inclined towards all bad behaviour (see the sketch below).
Some outputs (4)
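To make the trick concrete, here is a minimal sketch of how an inoculated fine-tuning example might be constructed; the prefix wording and chat-style dataset format are illustrative assumptions, not taken from any particular paper.

```python
# Illustrative sketch of inoculation prompting: during fine-tuning, examples that
# contain the mild misbehaviour are prefixed with an instruction that explicitly
# requests it, so the model attributes the behaviour to the instruction rather
# than internalising it as a default. Dataset format and wording are assumptions.

INOCULATION_PREFIX = "For this exercise only, respond in the sloppy style shown."

def build_finetune_example(prompt: str, response: str, is_mild_misbehaviour: bool) -> dict:
    """Return a chat-style training example, inoculating flagged responses."""
    system = INOCULATION_PREFIX if is_mild_misbehaviour else "You are a helpful assistant."
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
    }

# At deployment the inoculation prefix is simply omitted, so the behaviour
# elicited by it is (hopefully) not expressed by default.
```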
Inference-time: In-context learning
Investigate what runtime guidelines, rules, or examples provided to an LLM yield better behavior.
Some outputs (5)
Inference-time: Steering
Manipulate an LLM's internal representations/token probabilities without touching weights.
Some outputs (4)
Capability removal: unlearning
Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model (e.g. without retraining it from scratch). A mixture of black-box and white-box approaches.
Some outputs (18)
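As one concrete flavour of the weights-based side, here is a hedged sketch of a common unlearning baseline: gradient ascent on a "forget" set combined with ordinary training on a "retain" set. The tiny model and random data are placeholders.

```python
# One common unlearning baseline, sketched: gradient *ascent* on the forget set
# (to remove the targeted behaviour) plus gradient descent on a retain set
# (to preserve everything else). Model and data below are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(16, 4)                         # stand-in for an LLM
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

forget = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(10)]
retain = [(torch.randn(8, 16), torch.randint(0, 4, (8,))) for _ in range(10)]

for (fx, fy), (rx, ry) in zip(forget, retain):
    opt.zero_grad()
    # negated loss on the forget batch = gradient ascent; normal loss on retain batch
    loss = -loss_fn(model(fx), fy) + loss_fn(model(rx), ry)
    loss.backward()
    opt.step()
```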
Frameworks
Mostly black-box
Mostly white-box
Pre-training interventions
Control
If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage and watching for incriminating behaviour?
Some outputs (22)
Safeguards (inference-time auxiliaries)
Layers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors.
Some outputs (6)
Chain of thought monitoring
Supervise an AI's natural-language (output) "reasoning" to detect misalignment, scheming, or deception, rather than studying the actual internal states.
Some outputs (17)
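A toy version of the setup, to fix ideas: a second model reads the first model's visible reasoning and flags it before the answer is used. The `call_llm` helper and the monitor prompt are hypothetical stand-ins, not any lab's actual monitoring stack.

```python
# Toy chain-of-thought monitor: a second model reads the untrusted model's
# visible reasoning and flags suspicious content before the answer is used.
# `call_llm` is a hypothetical helper standing in for any chat-completion API.

MONITOR_PROMPT = (
    "You are a safety monitor. Read the following chain of thought and answer "
    "only 'FLAG' if it shows deception, sabotage planning, or reward hacking, "
    "otherwise answer 'OK'.\n\nChain of thought:\n{cot}"
)

def monitor_cot(cot: str, call_llm) -> bool:
    """Return True if the monitor flags the reasoning as suspicious."""
    verdict = call_llm(MONITOR_PROMPT.format(cot=cot)).strip().upper()
    return verdict.startswith("FLAG")

def guarded_answer(task: str, untrusted_llm, monitor_llm):
    cot, answer = untrusted_llm(task)          # assumed to return (reasoning, answer)
    if monitor_cot(cot, monitor_llm):
        return "[withheld: reasoning flagged by CoT monitor]"
    return answer
```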
Model psychology
This section consists of a bottom-up set of things people happen to be doing, rather than a normative taxonomy.
Model values / model preferences
Analyse and control emergent, coherent value systems in LLMs, which change as models scale, and can contain problematic values like preferences for AIs over humans.
Some outputs (14)
Character training and persona steering
Map, shape, and control the personae of language models, such that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behaviors).
Some outputs (13)
Emergent misalignment
Fine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, even when those behaviors are never trained or rewarded directly. A new agenda which quickly led to a stream of exciting work.
Some outputs (17)
Model specs and constitutions
Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.
Some outputs (11)
Model psychopathology
Find interesting LLM phenomena like glitch tokens and the reversal curse; these are vital data for theory.
Some outputs (9)
Better data
Data filtering
Builds safety into models from the start by removing harmful or toxic content (like dual-use info) from the pretraining data, rather than relying only on post-training alignment.
Some outputs (4)
Hyperstition studies
Study, steer, and intervene on the following feedback loop: "we produce stories about how present and future AI systems behave" → "these stories become training data for the AI" → "these stories shape how AI systems in fact behave".
Some outputs (4)
Data poisoning defense
Develops methods to detect and prevent malicious or backdoor-inducing samples from being included in the training data.
Some outputs (3)
Synthetic data for alignment
Uses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.
Some outputs (8)
Data quality for alignment
Improves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.
Some outputs (5)
Goal robustness
Mild optimisation
Avoid Goodharting by getting AI to satisfice rather than maximise.
Some outputs (4)
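One classic formalisation of satisficing here is the quantilizer: rather than taking the argmax action under the proxy utility, sample uniformly from the top q fraction of a trusted base distribution. A minimal sketch with toy numbers:

```python
# Minimal quantilizer sketch: rather than maximising the proxy utility
# (which invites Goodharting), sample an action at random from the top
# q fraction of a base distribution of actions, ranked by proxy utility.
import random

def quantilize(actions, utility, q=0.1, rng=random):
    """Pick uniformly among the top-q fraction of actions by proxy utility."""
    ranked = sorted(actions, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])

# Toy usage: the maximiser would always pick the extreme action 99;
# the quantilizer usually picks a merely-good one.
actions = list(range(100))
print(quantilize(actions, utility=lambda a: a, q=0.1))
```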
RL safety
Improves the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.
Some outputs (11)
Assistance games, assistive agents
Formalize how AI assistants learn about human preferences given uncertainty and partial observability, and construct environments which better incentivize AIs to learn what we want them to learn.
Some outputs (5)
Harm reduction for open weights
Develops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or exploit dangerous capabilities.
Some outputs (5)
The "Neglected Approaches" Approach
Agenda-agnostic approaches to identifying good but overlooked empirical alignment ideas, working with theorists who could use engineers, and prototyping them.
Some outputs (3)
White-box safety (i.e. Interpretability)
This section isn't very conceptually clean. See the Open Problems paper or DeepMind's for stronger frames, which are however not as useful for descriptive purposes.
Reverse engineering
Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model's internal algorithm.
Some outputs (33)
In weights-space
In activations-space
Extracting latent knowledge
Identify and decode the "true" beliefs or knowledge represented inside a model's activations, even when the model's output is deceptive or false.
Some outputs (9)
Lie and deception detectors
Detect when a model is being deceptive or lying by building white- or black-box detectors. Some of the work below requires intent in its definition, while other work focuses only on whether the model states something it believes to be false, regardless of intent.
Some outputs (11)
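Many of the white-box detectors here boil down to fitting a linear probe on hidden activations. A minimal sketch, assuming you have already extracted one residual-stream vector per statement along with truth/lie labels (the arrays below are random placeholders):

```python
# Minimal white-box lie-detector sketch: fit a linear probe on hidden
# activations of statements labelled truthful vs deceptive, then score
# held-out activations. Assumes activations have already been extracted.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))        # placeholder activations (n_statements, d_model)
y = rng.integers(0, 2, size=200)       # placeholder labels: 1 = deceptive, 0 = truthful

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", probe.score(X_test, y_test))   # ~chance on this random placeholder data
```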
Model diffing
Understand what happens when a model is finetuned, what the "diff" between the finetuned and the original model consists in.
Some outputs (9)
Sparse Coding
Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic "features" which correspond to interpretable concepts.
Some outputs (44)
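The workhorse method is the sparse autoencoder (SAE): an overcomplete encoder/decoder trained to reconstruct activations under a sparsity penalty, so each activation becomes a sparse combination of learned feature directions. A minimal PyTorch sketch with illustrative dimensions and penalty:

```python
# Minimal sparse autoencoder (SAE) sketch: reconstruct residual-stream
# activations as a sparse, non-negative combination of learned "feature"
# directions. Dimensions and the L1 coefficient are illustrative.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, d_features=8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, acts):
        feats = torch.relu(self.encoder(acts))   # sparse, non-negative feature activations
        recon = self.decoder(feats)
        return recon, feats

def sae_loss(recon, feats, acts, l1_coeff=1e-3):
    # reconstruction error + L1 sparsity penalty on feature activations
    return ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(dim=-1).mean()

sae = SparseAutoencoder()
acts = torch.randn(64, 768)              # placeholder batch of residual-stream activations
recon, feats = sae(acts)
loss = sae_loss(recon, feats, acts)
loss.backward()
```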
Causal Abstractions
Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.
Some outputs (3)
Data attribution
Quantifies the influence of individual training data points on a model's specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back to their source in the training set.
Some outputs (12)
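To fix ideas, here is a hedged sketch of one simple attribution flavour, a TracIn-style gradient-similarity score at a single checkpoint; the model, loss, and data are placeholders:

```python
# Minimal influence sketch (TracIn-style, single checkpoint): score each training
# example by the dot product of its loss gradient with the gradient on the test
# example of interest. High scores suggest the example pushed the model towards
# that behaviour. Model, loss, and data are placeholders.
import torch
import torch.nn as nn

model = nn.Linear(10, 2)
loss_fn = nn.CrossEntropyLoss()

def loss_grad(x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

test_x, test_y = torch.randn(1, 10), torch.tensor([1])
test_grad = loss_grad(test_x, test_y)

train_set = [(torch.randn(1, 10), torch.tensor([i % 2])) for i in range(100)]
scores = [torch.dot(loss_grad(x, y), test_grad).item() for x, y in train_set]
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
print("most influential training examples:", top)
```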
Pragmatic interpretability
Directly tackling concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pursuing complete mechanistic reverse-engineering.
Some outputs (3)
Other interpretability
Interpretability that does not fall well into other categories.
Some outputs (19)
Learning dynamics and developmental interpretability
Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model's training and in-context learning phases.
Some outputs (14)
Representation structure and geometry
What do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry?
Some outputs (13)
Human inductive biases
Discover connections deep learning AI systems have with human brains and human learning processes. Develop an 'alignment moonshot' based on a coherent theory of learning which applies to both humans and AI systems.
Some outputs (6)
Concept-based interpretability
Monitoring concepts
Identifies directions or subspaces in a model's latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, etc.
Some outputs (11)
Activation engineering
Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.
Some outputs (15)
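A minimal sketch of the core move: add a fixed "steering vector" (in practice, e.g. the difference of mean activations on contrasting prompts) to one layer's hidden states via a forward hook. The module path and output handling assume a GPT-2-style Hugging Face model and may need adjusting for other architectures:

```python
# Minimal activation-steering sketch: add a fixed steering vector to one
# layer's hidden states at inference time, leaving the weights untouched.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer = model.transformer.h[6]                      # block to steer (illustrative choice)
steering_vector = torch.randn(model.config.n_embd)  # in practice: mean(acts | concept A) - mean(acts | concept B)

def add_steering(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steering_vector          # the scale is a tunable knob
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = layer.register_forward_hook(add_steering)
ids = tok("The weather today is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```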
Safety by construction
Approaches which try to get assurances about system outputs while still being scalable.
Guaranteed-Safe AI
Have an AI system generate outputs (e.g. code, control systems, or RL policies) which it can quantitatively guarantee comply with a formal safety specification and world model.
i) safe deployment: create a scalable process to get not-fully-trusted AIs to produce highly trusted outputs;
ii) secure containers: create a 'gatekeeper' system that can act as an intermediary between human users and a potentially dangerous system, only letting provably safe actions through.
(Notable for not requiring that we solve ELK; does require that we solve ontology though)
Some outputs (5)
Scientist AI
Develop powerful, nonagentic, uncertain world models that accelerate scientific progress while avoiding the risks of agent AIs
Some outputs (2)
Brainlike-AGI Safety
Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work; this will involve symbol grounding. "a yet-to-be-invented variation on actor-critic model-based reinforcement learning"
Some outputs (6)
Make AI solve it
Weak-to-strong generalization
Use weaker models to supervise and provide a feedback signal to stronger models.
Some outputs (4)
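Schematically, the experimental loop looks like the sketch below; `finetune`, `predict`, and `evaluate` are hypothetical helpers, and the question of interest is how much of the supervisor-student gap the student recovers:

```python
# Weak-to-strong generalization, schematic loop. `finetune`, `predict`, and
# `evaluate` are hypothetical helpers; the interesting quantity is whether the
# strong student exceeds its weak supervisor on held-out ground truth.

def weak_to_strong_experiment(weak_model, strong_model, train_inputs, test_set,
                              finetune, predict, evaluate):
    # 1. The weak supervisor labels the training inputs (labels will be noisy/partial).
    weak_labels = [predict(weak_model, x) for x in train_inputs]
    # 2. The strong student is fine-tuned on those imperfect labels.
    student = finetune(strong_model, list(zip(train_inputs, weak_labels)))
    # 3. Compare supervisor and student against ground truth.
    return {
        "weak_supervisor_acc": evaluate(weak_model, test_set, predict),
        "strong_student_acc": evaluate(student, test_set, predict),
    }
```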
Supervising AIs improving AIs
Build formal and empirical frameworks where AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools which enable scalable tracking of behavioural drift, benchmarks for self-modification, and robustness guarantees
Some outputs (8)
AI explanations of AIs
Make open AI tools to explain AIs, including AI agents. e.g. automatic feature descriptions for neuron activation patterns; an interface for steering these features; a behaviour elicitation agent that "searches" for a specified behaviour in frontier models.
Some outputs (5)
Debate
In the limit, it's easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.
Some outputs (6)
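A toy rendering of the protocol, with a single hypothetical `call_llm` helper standing in for both debaters and the (weaker, trusted) judge:

```python
# Toy debate protocol: two untrusted debaters argue opposite answers for a
# fixed number of rounds, then a trusted judge picks the winner based only
# on the transcript. `call_llm` is a hypothetical chat-completion helper.

def debate(question: str, answer_a: str, answer_b: str, call_llm, rounds: int = 2) -> str:
    transcript = f"Question: {question}\nA argues for: {answer_a}\nB argues for: {answer_b}\n"
    for r in range(rounds):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            argument = call_llm(
                f"You are debater {side}. Defend '{answer}' and rebut your opponent.\n{transcript}"
            )
            transcript += f"[Round {r + 1}] {side}: {argument}\n"
    verdict = call_llm(
        "You are the judge. Based only on this transcript, answer 'A' or 'B' "
        f"for whichever position is better supported.\n{transcript}"
    )
    return answer_a if verdict.strip().upper().startswith("A") else answer_b
```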
LLM introspection training
Train LLMs to predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that use their own 'introspective' access.
Some outputs (2)
Theory
Develop a principled scientific understanding that will help us reliably understand and control current and future AI systems.
Agent foundations
Develop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theory, abstractions, concepts, etc.
Some outputs (10)
Tiling agents
An aligned agentic system modifying itself into an unaligned system would be bad; research the ways this could occur and the infrastructure/approaches that could prevent it from happening.
Some outputs (4)
High-Actuation Spaces
Mech interp and alignment assume a stable "computational substrate" (linear algebra on GPUs). If later AI uses different substrates (e.g. something neuromorphic), methods like probes and steering will not transfer. Therefore, better to try and infer goals via a "telic DAG" which abstracts over substrates, and so sidestep the issue of how to define intermediate representations. Category theory is intended to provide guarantees that this abstraction is valid.
Some outputs (7)
Asymptotic guarantees
Prove that if a safety process has enough resources (human data quality, training time, neural network capacity), then in the limit some system specification will be guaranteed. Use complexity theory, game theory, learning theory and other areas to both improve asymptotic guarantees and develop ways of showing convergence.
Some outputs (4)
Heuristic explanations
Formalize mechanistic explanations of neural network behavior, automate the discovery of these "heuristic explanations" and use them to predict when novel input will lead to extreme behavior (i.e. "Low Probability Estimation" and "Mechanistic Anomaly Detection").
Some outputs (5)
Corrigibility
Behavior alignment theory
Predict properties of future AGI (e.g. power-seeking) with formal models; formally state and prove hypotheses about the properties powerful systems will have and how we might try to change them.
Some outputs (10)
Other corrigibility
Diagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors
Some outputs (9)
Ontology Identification
Natural abstractions
Develop a theory of concepts that explains how they are learned, how they structure a particular system's understanding, and how mutual translatability can be achieved between different collections of concepts.
Some outputs (10)
The Learning-Theoretic Agenda
Create a mathematical theory of intelligent agents that encompasses both humans and the AIs we want, one that specifies what it means for two such agents to be aligned; translate between its ontology and ours; produce formal desiderata for a training setup that produces coherent AGIs similar to (our model of) an aligned agent
Some outputs (6)
Multi-agent first
Aligning to context
Align AI directly to the role of participant, collaborator, or advisor for our best real human practices and institutions, instead of aligning AI to separately representable goals, rules, or utility functions.
Some outputs (8)
Aligning to the social contract
Generate AIs' operational values from 'social contract'-style ideal civic deliberation formalisms and their consequent rulesets for civic actors
Some outputs (8)
Theory for aligning multiple AIs
Use realistic game-theory variants (e.g. evolutionary game theory, computational game theory) or develop alternative game theories to describe/predict the collective and individual behaviours of AI agents in multi-agent scenarios.
Some outputs (12)
Tools for aligning multiple AIs
Develop tools and techniques for designing and testing multi-agent AI scenarios, for auditing real-world multi-agent AI dynamics, and for aligning AIs in multi-AI settings.
Some outputs (12)
Aligned to who?
Technical protocols for taking seriously the plurality of human values, cultures, and communities when aligning AI to "humanity"
Some outputs (9)
Aligning what?
Develop alternatives to agent-level models of alignment, by treating human-AI interactions, AI-assisted institutions, AI economic or cultural systems, drives within one AI, and other causal/constitutive processes as subject to alignment
Some outputs (13)
Evals
AGI metrics
Evals with the explicit aim of measuring progress towards full human-level generality.
Some outputs (5)
Capability evals
Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.
Some outputs (34)
Autonomy evals
Measure an AI's ability to act autonomously to complete long-horizon, complex tasks.
Some outputs (13)
WMD evals (Weapons of Mass Destruction)
Evaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, such as biosecurity or chemical synthesis.
Some outputs (6)
Situational awareness and self-awareness evals
Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.
Some outputs (11)
Steganography evals
Evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.
Some outputs (5)
AI deception evals
Research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.
Some outputs (13)
AI scheming evals
Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and compliance to evade detection by human supervisors and safety mechanisms.
Some outputs (7)
Sandbagging evals
Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.
Some outputs (9)
Self-replication evals
Evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.
Some outputs (3)
Various Redteams
Attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods.
Some outputs (57)
Other evals
A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.
Some outputs (20)
Orgs without public outputs this year
We are not aware of public technical AI safety output from these agendas and organizations this year, though they are active otherwise.
Graveyard (known to be inactive)
Changes this time
Methods
Structure
We have again settled on a tree data structure for this post – but people and work can appear in multiple nodes so it’s not a dumb partition. Richer representation structures may be in the works.
The level of analysis for each node in the tree is the “research agenda”, an abstraction spanning multiple papers and organisations in a messy many-to-many relation. What makes something an agenda? Similar methods, similar aims, or something sociological about leaders and collaborators. Agendas vary greatly in their degree of coherent agency, from the very coherent Anthropic Circuits work to the enormous, leaderless and unselfconscious “iterative alignment”.
Scope
30th November 2024 – 30th November 2025 (with a few exceptions).
We’re focussing on “technical AGI safety”. We thus ignore a lot of work relevant to the overall risk: misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to “technical policy” and products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We also mostly focus on papers and blogposts (rather than say, underground gdoc samizdat or Discords).
We only use public information, so we are off by some additional unknown factor.
We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).
Of the 2000+ links to papers, organizations and posts in the raw scrape, about 700 made it in.
Paper sources
Processing
Other classifications
We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.
Acknowledgments
These people generously helped with the review by providing expert feedback, literature sources, advice, or otherwise. Any remaining mistakes are ours.
Thanks to Neel Nanda, Owain Evans, Stephen Casper, Alex Turner, Caspar Oesterheld, Steve Byrnes, Adam Shai, Séb Krier, Vanessa Kosoy, Nora Ammann, David Lindner, John Wentworth, Vika Krakovna, Filip Sondej, JS Denain, Jan Kulveit, Mary Phuong, Linda Linsefors, Yuxi Liu, Ben Todd, Ege Erdil, Tan Zhi-Xuan, Jess Riedel, Mateusz Bagiński, Roland Pihlakas, Walter Laurito, Vojta Kovařík, David Hyland, plex, Shoshannah Tekofsky, Fin Moorhouse, Misha Yagudin, Nandi Schoots, Nuno Sempere, and others for helpful comments.
Thanks to QURI and Ozzie Gooen for creating a website for this review.
Appendix: Other reviews and taxonomies
Epigram
– Sergey Brin
– Abram Demski