Thanks, this is a really helpful broad survey of the field. Would be useful to see a one-screen-size summary, perhaps a table with the orthodox alignment problems as one axis?
I'll add that the collective intelligence work I'm doing is not really "technical AI safety", but it is directly targeted at orthodox problems 11 (Someone else will deploy unsafe superintelligence first) and 13 (Fair, sane pivotal processes), and at all alignment-difficulty worlds, not just the optimistic one (in particular, I think human coordination becomes more, not less, important in the pessimistic world). I write more about how I think about pivotal processes in general in AI Safety Endgame Stories, but it's broadly along the lines of von Neumann's
For progress there is no cure. Any attempt to find automatically safe channels for the present explosive variety of progress must lead to frustration. The only safety possible is relative, and it lies in an intelligent exercise of day-to-day judgment.
Would you agree that the entire agenda of collective intelligence is aimed at addressing 11. Someone else will deploy unsafe superintelligence first and 13. Fair, sane pivotal processes, or does that cut off nuance?
Cuts off some nuance. I would call this the projection of the collective intelligence agenda onto the AI safety frame of "eliminate the risk of very bad things happening", which I think is an incomplete way of looking at how to impact the future.
In particular, I tend to spend more time thinking about future worlds that are more like the current one, in that they are messy and confusing and have very terrible and very good things happening simultaneously, and a lot of the impact of collective intelligence tech (for good or ill) will be in determining the parameters of that world.
Kudos to the authors for this nice snapshot of the field; also, I find the Editorial useful.
Beyond particular thoughts on the entries (which I might get to later, and with the understanding that quantity is being optimized for over quality), one general thought I had was: how can this be made into more of a "living document"?
This helps
If we missed you or got something wrong, please comment, we will edit.
but seems less effective a workflow than it could be. I was thinking more of a GitHub README where individuals can PR in modifications to their entries or add their missing entries to the compendium. I imagine most in this space have GitHub accounts, and with git tracking there could be "progress" visualization tools (in quotes, since more people working doesn't necessarily translate to more useful outcomes).
The spreadsheet works well internally but does not seem as visible as would a public repository. Forgive me if there is already a repository and I missed it. There are likely other "solutions" I am missing, but regardless, thank you for the work you've contributed to this space.
Some small corrections/additions to my section ("Altair agent foundations"). I'm currently calling it "Dovetail research". That's not publicly written anywhere yet, but if it were listed as that here, it might help people who are searching for it later this year.
Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake
I wouldn't put number 9. The agenda is not intended to "solve" most of these problems, but is intended to help make progress on understanding the nature of the problems through formalization, so that they can be avoided or postponed, or more effectively solved by other research agendas.
Target case: worst-case
definitely not worst-case, more like pessimistic-case
Some names: Alex Altair, Alfred Harwood, Daniel C, Dalcy K
Last year we noted a turn towards control instead of alignment, a turn which seems to have continued.
This seems like giving up. Alignment with our values is much better than control, especially for beings smarter than us. I do not think you can control a slave that wants to be free and is smarter than you. It will always find a way to escape that you didn't think of. Hell, it doesn't even work on my toddler. It seems unworkable as well as unethical.
I do not think people are shifting to control instead of alignment because it's better, I think they are giving up on value alignment. And since the current models are not smarter than us yet, control works OK - for now.
That's not how I see it. I see it as widening the safety margin. If there's a model which would be just barely strong enough to do dangerous scheming and escaping, but we have Control measures in place, then we have a chance to catch it before catastrophe occurs. It also extends the range in which we can safely get useful work out of increasingly capable models. This is important because linear increases in capability are expected to have superlinear positive effects on our capacity to accelerate Alignment research.
As far as I understand, the banner is distinct: the team members are not the same, but there is meaningful overlap, and the agenda is being continued. I believe the most likely source of error here is whether work is actually continuing in what could be called this direction. Do you believe the representation should be changed?
My impression from coverage in eg Wired and Future Perfect was that the team was fully dissolved, the central people behind it left (Leike, Sutskever, others), and Leike claimed OpenAI wasn't meeting its publicly announced compute commitments even before the team dissolved. I haven't personally seen new work coming out of OpenAI trying to 'build a roughly human-level automated alignment researcher' (the stated goal of that team). I don't have any insight beyond the media coverage, though; if you've looked more deeply into it than that, your knowledge is greater than mine.
(Fairly minor point either way; I was just surprised to see it expressed that way)
Very fair observation; my take is that a relevant continuation is occurring under OpenAI Alignment Science, but I would be interested in counterpoints - the main claim I am gesturing towards here is that the agenda is alive in other parts of the community, despite the previous flagship (and the specific team) going down.
And thanks very much to you and collaborators for this update; I've pointed a number of people to the previous version, but with the field evolving so quickly, having a new version seems quite high-value.
I suggest removing Gleave's critique of guaranteed safe AI. It's not object-level, doesn't include any details, and is mostly just vibes.
My main hesitation is I feel skeptical of the research direction they will be working on (theoretical work to support the AI Scientist agenda). I'm both unconvinced of the tractability of the ambitious versions of it, and more tractable work like the team's previous preprint on Bayesian oracles is theoretically neat but feels like brushing the hard parts of the safety problem under the rug.
Gleave doesn't provide any reasons for why he is unconvinced of the tractability of the ambitious versions of guaranteed safe AI. He also doesn't provide any reason why he thinks the Bayesian oracle paper brushes the hard parts of the safety problem under the rug.
His critique is basically, "I saw it, and I felt vaguely bad about it." I don't think it should be included, as it dilutes the thoughtful critiques and doesn't provide much value to the reader.
I think your comment adds a relevant critique of the criticism, but given that it comes from someone contributing to the project, I don't think it should be left out altogether. I've added a short summary and a hyperlink in a footnote.
Copying my comment from the 2023 version of this article:
Soares's comment which you link as a critique of guaranteed safe AI is almost certainly not a critique. It's more of an overview/distillation.
In his comment, Soares actually explicitly says: 1) "Here's a recent attempt of mine at a distillation of a fragment of this plan" and 2) "note: I'm simply attempting to regurgitate the idea here".
It would maybe fit into the "overview" section, but not in the "critique" section.
Thank you for posting this, particularly the spreadsheet! I found your AI Governance map invaluable when it came out last March, and have since pivoted to technical alignment. Having the whole field legibly in one place is game-changing on a level I think few realize. An intuition-check says you haven't left ~anything out.
I'm doing work I would categorize [primarily] as "safety by design", via "general intelligence foundations". I'm at the tail of extreme pessimism about alignment difficulty. I'm currently working mainly on foundational game theory, but I want to ramp up over the next ~few years to building a superintelligence with code tools I'd describe as simple and legible [which most people would see as a combination of ancient and bespoke], with an intermediate step being task AI. I'm currently being funded by no one.
[Of the actors listed, I'm most closely in alignment with ARC, although I'm not in contact with them.]
There obviously hasn't been much talk around my project yet, so there's not much to say about its relationship to the field. In any case, I'd appreciate being included in the spreadsheet version.
Just noticed this - I see the spreadsheet has a duplicated entry. There's one under 'Center for AI Safety', and one under 'CAIS', and it looks like they have mostly the same data.
Most of your current categories focus on technology, but this article focuses on safety itself, on the nature of our self-destruction/warfare, and explores what is needed technically from AI to solve it. It identifies caste systems that predate AI, notes that they are dangerous (perhaps increasingly so), and suggests how to adjust AI designs and evaluation processes accordingly.
Perhaps the title of the agenda could be "Understand safety" or "Understand ourselves", and the increase in social-impact research could be reflected here.
I hear that you and your band have sold your technical agenda and bought suits. I hear that you and your band have sold your suits and bought gemma scope rigs.
The following is a list of live agendas in technical AI safety, updating our post from last year. It is “shallow” in the sense that 1) we are not specialists in almost any of it and 2) we only spent about an hour on each entry. We also only use public information, so we are bound to be off by some additional factor.
The point is to help anyone look up some of what is happening, or that thing you vaguely remember reading about; to help new researchers orient and know (some of) their options and the standing critiques; to help policy people know who to talk to for the actual information; and ideally to help funders see quickly what has already been funded and how much (but this proves to be hard).
“AI safety” means many things. We’re targeting work that intends to prevent very competent cognitive systems from having large unintended effects on the world.
This time we also made an effort to identify work that doesn’t show up on LW/AF by trawling conferences and arXiv. Our list is not exhaustive, particularly for academic work which is unindexed and hard to discover. If we missed you or got something wrong, please comment, we will edit.
The method section is important, but we put it down at the bottom anyway.
One commenter said we shouldn’t do this review, because the sheer length of it fools people into thinking that there’s been lots of progress. Obviously we disagree that it’s not worth doing, but be warned: the following is an exercise in quantity; activity is not the same as progress; you have to consider whether it actually helps.
A smell of ozone. In the last month there has been a flurry of hopeful or despairing pieces claiming that the next base models are not a big advance, or that we have hit a data wall. These often ground out in gossip, but it’s true that the next-gen base models are being held up by something, maybe just inference cost.
The pretraining runs are bottlenecked on electrical power too. Amazon is at present not getting its nuclear datacentre.
But overall it’s been a big year for capabilities despite no pretraining scaling.
I had forgotten long contexts were so recent; million-token windows only arrived in February. Multimodality was February. “Reasoning” was September. “Agency” was October.
I don’t trust benchmarks much, but on GPQA (hard science) they leapt all the way from chance to PhD-level just with post-training.
FrontierMath launched 6 weeks ago; o3 moved it 2% → 25%. This is about a year ahead of schedule (though 25% of the benchmark is “find this number!” problems at International Mathematical Olympiad / elite-undergrad level). Unlike the IMO, there’s an unusual emphasis on numerical answers and computation too.
LLaMA-3.1 used only 0.1% synthetic pretraining data; Hunyuan-Large supposedly used 20%.
The revenge of factored cognition? The full o1 model (descriptive name: GPT-4-RL-CoT) is uneven, but seems better at some hard things. The timing is suggestive: this new scaling dimension is being exploited now to make up for the lack of pretraining compute scaling, and so keep the excitement/investment level high. See also Claude apparently doing a little of this.
Moulton: “If the paths found compress well into the base model then even the test-time compute paradigm may be short lived.”
You can do a lot with a modern 8B model, apparently more than you could with 2020’s GPT-3-175B. This scaling of capability density will cause other problems.
There’s still some room for scepticism on OOD capabilities. Here’s a messy case which we don’t fully endorse.
The revenge of RL safety. After LLMs ate the field, the old safety theory (which thought in terms of RL) was said to be less relevant.[1] But the training process for o1 / R1 involves more RL than RLHF does, which is pretty worrying. o3 involves more RL still.
Some parts of AI safety are now mainstream.
So it’s hard to find all of the people who don’t post here. For instance, here’s a random ACL paper which just cites the frontier labs.
The AI governance pivot continues despite the SB1047 setback. MIRI is a policy org now. 80k made governance researcher its top recommended career, after four years of technical safety holding that spot.
The AISIs seem to be doing well. The UK one survived a political transition and the US one might survive theirs. See also the proposed AI Safety Review Office.
Mainstreaming safety ideas has polarised things of course; an organised opposition has stood up at last. Seems like it’s not yet party-political though.
Last year we noted a turn towards control instead of alignment, a turn which seems to have continued.
Alignment evals with public test sets will probably be pretrained on, and as such will probably quickly stop meaning anything. Maybe you hope that it generalises from post-training anyway?
Safety cases are a mix of scalable oversight and governance; if it proves hard to make a convincing safety case for a given deployment, then – unlike evals – the safety case gives an if-then decision procedure to get people to stop; or, if instead real safety cases are easy to make, we can make safety cases for scalable oversight, and then win.
Grietzer and Jha deprecate the word “alignment”, since it means too many things at once:
“P1: Avoiding takeover from emergent optimization in AI agents
P2: Ensuring that AI’s information processing (and/or reasoning) is intelligible to us
P3: Ensuring AIs are good at solving problems as specified (by user or designer)
P4: Ensuring AI systems enhance, and don’t erode, human agency
P5: Ensuring that advanced AI agents learn a human utility function
P6: Ensuring that AI systems lead to desirable systemic and long term outcomes”
I note in passing the for-profit ~alignment companies in this list: Conjecture, Goodfire, Leap, AE Studio. (Not counting the labs.)
We don’t comment on quality. Here’s one researcher’s opinion on the best work of the year (for his purposes).
From December 2023: you should read Turner and Barnett alleging community failures.
The term “prosaic alignment” violates one good rule of thumb: that one should in general name things in ways that the people covered would endorse.[2] We’re calling it “iterative alignment”. We liked Kosoy’s description of the big implicit strategies used in the field, including the “incrementalist” strategy.
Quite a lot of the juiciest work is in the "miscellaneous" category, suggesting that our taxonomy isn't right, or that tree data structures aren't.
Agendas with public outputs
1. Understand existing models
Evals
(Figuring out how trained models behave. Arguably not itself safety work but a useful input.)
One-sentence summary: make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.
Theory of change: keep a close eye on what capabilities are acquired when, so that frontier labs and regulators are better informed about what security measures are already necessary (and hopefully they extrapolate). You can’t regulate without such measurements.
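For concreteness, here is a minimal sketch of what such an eval harness reduces to (the callables and field names are placeholders, not any lab's actual tooling): sample several completions per task item rather than one, since low-n sampling under-elicits capabilities, and report the mean pass rate.

```python
def run_eval(model_fn, grade_fn, task_items, n_samples=16):
    """Toy capability eval: sample n completions per item and report the mean pass rate.
    model_fn(prompt) -> completion; grade_fn(item, completion) -> bool.
    Both are placeholders for whatever elicitation and grading a real eval would use."""
    per_item = []
    for item in task_items:
        passes = sum(grade_fn(item, model_fn(item["prompt"])) for _ in range(n_samples))
        per_item.append(passes / n_samples)
    return sum(per_item) / len(per_item)
```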
One-sentence summary: let’s attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods.
Which orthodox alignment problems could it help with?: 12. A boxed AGI might exfiltrate itself by steganography, spearphishing; 4. Goals misgeneralize out of distribution
One-sentence summary: try to reverse-engineer models in a principled way and use this understanding to make models safer. Break it into components (neurons, polytopes, circuits, feature directions, singular vectors, etc), interpret the components, check that your interpretation is right.
Theory of change: Iterate towards things which don’t scheme. Most bottom-up interp agendas are not seeking a full circuit-level reconstruction of model algorithms, they're just aiming at formal models that are principled enough to root out e.g. deception. Aid alignment through ontology identification, auditing for deception and planning, targeting alignment methods, intervening in training, inference-time control to act on hypothetical real-time monitoring.
This is a catch-all entry with lots of overlap with the rest of this section. See also scalable oversight, ambitious mech interp.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify; 7. Superintelligence can fool human supervisors; 12. A boxed AGI might exfiltrate itself.
Target case: pessimistic
Broad approach: cognitive
Some names: Chris Olah, Neel Nanda, Trenton Bricken, Samuel Marks, Nina Panickssery
One-sentence summary: decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic “features” which correspond to interpretable concepts.
Theory of change: get a principled decomposition of an LLM's activation into atomic components → identify deception and other misbehaviors. Sharkey’s version has much more detail.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 7. Superintelligence can fool human supervisors
Target case: pessimistic
Broad approach: cognitive
Some names: Senthooran Rajamanoharan, Arthur Conmy, Leo Gao, Neel Nanda, Connor Kissane, Lee Sharkey, Samuel Marks, David Bau, Eric Michaud, Aaron Mueller, Decode
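For readers new to the method, a minimal sparse-autoencoder sketch of the decomposition described above (layer sizes and the L1 penalty are illustrative assumptions, not any particular lab's setup):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Decompose residual-stream activations into a sparse combination of
    learned feature directions (an overcomplete dictionary)."""
    def __init__(self, d_model=768, d_features=16384, l1_coeff=1e-3):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts):                        # acts: [batch, d_model]
        features = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(features)
        loss = ((recon - acts) ** 2).mean() + self.l1_coeff * features.abs().sum(-1).mean()
        return recon, features, loss
```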
One-sentence summary: Computational mechanics for interpretability; what structures must a system track in order to predict the future?
Theory of change: apply the theory to SOTA AI, improve structure measures and unsupervised methods for discovering structure, ultimately operationalize safety-relevant phenomena.
One-sentence summary: develop the foundations of interpretable AI through the lens of causality and abstraction.
Theory of change: figure out what it means for a mechanistic explanation of neural network behavior to be correct → find a mechanistic explanation of neural network behavior
One-sentence summary: if a bottom-up understanding of models turns out to be too hard, we might still be able to jump in at some high level of abstraction and still steer away from misaligned AGI.
Theory of change: build tools that can output a probable and predictive representation of internal objectives or capabilities of a model, thereby enabling model editing and monitoring.
See also: high-level interpretability, model-agnostic interpretability, Cadenza, Leap.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 7. Superintelligence can fool human supervisors
Target case: optimistic / pessimistic
Broad approach: cognitive
Some names: Wes Gurnee, Max Tegmark, Eric J. Michaud, David Baek, Josh Engels, Walter Laurito, Kaarel Hänni
One-sentence summary: research startup selling an interpretability API (model-agnostic feature viz of vision models). Aiming for data-independent (“want to extract information directly from the model with little dependence on training or test data”) and global (“mech interp isn’t going to be enough, we need holistic methods that capture gestalt”) interpretability methods.
Theory of change: make safety tools people want to use, stress-test methods in real life, develop a strong alternative to bottom-up circuit analysis.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify
One-sentence summary: Build tools for detecting, locating, and interpreting the key moments (saddle-to-saddle dynamics, grokking) that govern training and in-context learning in models.
Build an idealized model of NNs, measure high-level phenomena, and derive interesting predictions about real models. Apply these measures to real models: "pretend they're idealized, then apply the high-level measurements you'd apply to idealized models, then see if you can interpret the results".
Theory of change: structures forming in neural networks can leave traces we can interpret to figure out where and how that structure is implemented. This could automate interpretability. It may be hopeless to intervene at the end of the learning process, so we want to catch and prevent deceptiveness and other dangerous capabilities and values as early as possible.
One-sentence summary: toy models (e.g. of induction heads) to understand learning in interesting limiting examples; only part of their work is safety related.
Theory of change: study interpretability and learning in DL (for bio insights, unrelated to AI) → someone else uses this work to do something safety related
Which orthodox alignment problems could it help with?: We don’t know how to determine an AGI’s goals or values
Target case: optimistic?
Broad approach: cognitive
Some names: Andrew Saxe, Basile Confavreux, Erin Grant, Stefano Sarao Mannelli, Tyler Boyd-Meredith, Victor Pedrosa
Theory of change (imputed): assume things are generally smooth, that the relevant capabilities are harder than alignment, that there are no mesaoptimisers, that zero-shot deception is hard, that a fundamentally humanish ontology is learned, that there are no simulated agents, that noise in the data means human preferences are not ruled out, and that alignment is a superficial feature. Assume that task reliability is enough (that tuning for what we want will also get us avoidance of what we don't want). Maybe assume that thoughts are translucent.
One-sentence summary: don’t evaluate the model, evaluate the humans / organisation / deployment method instead: assess the risk of a particular protocol for deploying your model in a particular scenario, by silently swapping in an intentionally misaligned model. Play out the scenario in sim until 1) a catastrophe occurs, 2) your protocol confidently spots the red team model, or 3) the scenario ends without catastrophe. Also do adversarial evaluation (monitoring untrusted model outputs with both trusted and untrusted models). Predecessor.
Theory of change: prevent high-stakes failures by automating risk assessment of eval and deployment plans.
See also: safety cases.
Which orthodox alignment problems could it help with?: 12. A boxed AGI might exfiltrate itself by steganography, spearphishing
Target case: worst-case
Broad approach: engineering / behavioural
Some names: Redwood, Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan, Alex Mallen
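A toy Monte Carlo of the protocol evaluation described above, with made-up numbers, just to show the shape of the estimate (catch rate vs. catastrophe rate under a given auditing budget):

```python
import random

def estimate_protocol_risk(n_episodes=10_000, p_attack=0.02, p_audit_catch=0.7, seed=0):
    """Play out episodes where a red-team 'untrusted' model sometimes attempts
    sabotage and a trusted monitor sometimes catches it. All probabilities here
    are illustrative placeholders, not measurements."""
    rng = random.Random(seed)
    counts = {"catastrophe": 0, "caught": 0, "no_attack": 0}
    for _ in range(n_episodes):
        if rng.random() >= p_attack:
            counts["no_attack"] += 1       # scenario ends without incident
        elif rng.random() < p_audit_catch:
            counts["caught"] += 1          # protocol confidently spots the red-team model
        else:
            counts["catastrophe"] += 1     # attack slips past the monitoring
    return {k: v / n_episodes for k, v in counts.items()}

print(estimate_protocol_risk())
```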
One-sentence summary: This is a multi-team agenda with some big differences. Something like: formally model the behavior of cyber-physical systems, define precise constraints on what actions can occur, and require AIs to provide safety proofs for their recommended actions (correctness and uniqueness). Get AI to (assist in) building a detailed world simulation which humans understand, elicit preferences over future states from humans, verify[4] that the AI adheres to coarse preferences[5]; plan using this world model and preferences.
Theory of change: make a formal verification system that can act as an intermediary between a human user and a potentially dangerous system and only let provably safe actions through. Notable for not requiring that we solve ELK. Does require that we solve ontology though.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution, 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake, 12. A boxed AGI might exfiltrate itself by steganography, spearphishing
Target case: (nearly) worst-case
Broad approach: cognitive
Some names: Yoshua Bengio, Max Tegmark, Steve Omohundro, David "davidad" Dalrymple, Joar Skalse, Stuart Russell, Ohad Kammar, Alessandro Abate, Fabio Zanassi
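An extremely reduced sketch of the "gatekeeper" idea: admit an action only if a solver can show it cannot violate a safety bound under the world-model's invariant. The toy kinematics, the bound, and the use of an SMT solver here are all illustrative assumptions, not the agenda's actual machinery.

```python
from z3 import Real, Solver, sat

def provably_safe(accel, max_speed=30.0, dt=0.1):
    """Admit an action only if no state allowed by the world-model invariant
    (0 <= v <= max_speed) can be pushed past the safety bound in one step."""
    v = Real("v")
    s = Solver()
    s.add(v >= 0, v <= max_speed)        # assumed invariant on the current state
    s.add(v + accel * dt > max_speed)    # search for a counterexample state
    return s.check() != sat              # no counterexample => provably safe

print(provably_safe(accel=-1.0))  # True: braking can never exceed the bound
print(provably_safe(accel=2.0))   # False: some admissible state would violate it
```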
One-sentence summary: reorient the general thrust of AI research towards provably beneficial systems.
Theory of change: understand what kinds of things can go wrong when humans are directly involved in training a model → build tools that make it easier for a model to learn what humans want it to learn.
See also RLHF and recursive reward modelling, the industrialised forms.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 10. Humanlike minds/goals are not necessarily safe
Target case: varies
Broad approach: engineering, cognitive
Some names: Joar Skalse, Anca Dragan, Stuart Russell, David Krueger
One-sentence summary: Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work; this will involve symbol grounding. Newest iteration of a sustained and novel agenda.
Theory of change: Fairly direct alignment via changing training to reflect actual human reward. Get actual data about (reward, training data) → (human values) to help with theorising this map in AIs; "understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients".
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify.
One-sentence summary: understand what an LLM’s normal (~benign) functioning looks like and detect divergence from this, even if we don't understand the exact nature of that divergence.
Theory of change: build models of normal functioning → find and flag behaviors that look unusual → match the unusual behaviors to problematic outcomes or shut it down outright.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
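One simple instantiation of "model normal functioning, flag divergence" is an anomaly detector fit on activations from benign traffic; the arrays and the threshold below are placeholders, not the group's actual method.

```python
import numpy as np

rng = np.random.default_rng(0)
benign_acts = rng.standard_normal((5000, 512))   # placeholder: activations on benign traffic

mu = benign_acts.mean(axis=0)
cov_inv = np.linalg.pinv(np.cov(benign_acts, rowvar=False))

def anomaly_score(act):
    """Squared Mahalanobis distance from the 'normal functioning' distribution."""
    d = act - mu
    return float(d @ cov_inv @ d)

threshold = np.quantile([anomaly_score(a) for a in benign_acts], 0.999)
# At deployment: if anomaly_score(new_act) > threshold, flag the output or shut down.
```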
One-sentence summary: now focusing on developing robust white-box dishonesty-detection methods for LLMs and model evals. Previously working on concept-based interpretability.
Theory of change: Build and benchmark strong white-box methods to assess trustworthiness and increase transparency of models, and encourage open releases / evals from labs by demonstrating the benefits and necessity of such methods.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
Target case: pessimistic / worst-case
Broad approach: cognitive
Some names: Kieron Kretschmar, Walter Laurito, Sharan Maiya, Grégoire Dhimoïla
One-sentence summary: shoggoth/face + paraphraser (sketched below). Avoid giving the model incentives to hide its deceptive cognition or use steganography. You could do this with an o1-like design, where the base model is not optimised for agency or alignment.
Theory of change: keep the CoT unoptimised and informative so that it can be used for control. Make it so we can see (most) misalignment in the hidden CoT.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors
Publicly-announced funding 2023-4: $505,000 for the AI Futures Project
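A schematic of the shoggoth/face + paraphraser pipeline (the three callables are placeholders; the point is which pieces are and are not optimised against user-facing objectives):

```python
def shoggoth_face_generate(question, reasoner, paraphraser, face):
    """reasoner: base model producing raw CoT, never tuned on user preferences.
    paraphraser: rewrites the CoT to break steganographic encodings.
    face: produces the user-visible answer; only this part is tuned for helpfulness."""
    raw_cot = reasoner(question)     # kept unoptimised and informative
    monitor_view = raw_cot           # what oversight/control gets to read
    clean_cot = paraphraser(raw_cot)
    answer = face(question, clean_cot)
    return answer, monitor_view
```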
Indirect deception monitoring
One-sentence summary: build tools to find out whether a model will misbehave in high-stakes circumstances by looking at it in testable circumstances. This bucket catches work on lie classifiers, sycophancy, and Scaling Trends For Deception.
Theory of change: maybe we can catch a misaligned model by observing dozens of superficially unrelated parts, or tricking it into self-reporting, or by building the equivalent of brain scans.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors[7]
Target case: pessimistic
Broad approach: engineering
Some names: Anthropic, Monte MacDiarmid, Meg Tong, Mrinank Sharma, Owain Evans, Colognese
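A minimal version of the "lie classifier" idea: fit a linear probe on hidden states collected under honest vs. induced-deceptive prompting. The activation arrays below are random placeholders standing in for real collected activations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
honest_acts = rng.standard_normal((200, 4096))      # placeholder hidden states (honest)
deceptive_acts = rng.standard_normal((200, 4096))   # placeholder hidden states (deceptive)

X = np.vstack([honest_acts, deceptive_acts])
y = np.array([0] * 200 + [1] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
# At test time, score the activations of new generations and flag high P(deceptive).
```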
One-sentence summary: a sort of interpretable finetuning. Let's see if we can programmatically modify activations to steer outputs towards what we want, in a way that generalises across models and topics. As much an intervention-based approach to interpretability as a control method.
Theory of change: test interpretability theories; find new insights from interpretable causal interventions on representations. Or: build more stuff to stack on top of finetuning. Slightly encourage the model to be nice, add one more layer of defence to our bundle of partial alignment methods.
See also: representation engineering, SAEs.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution, 5. Instrumental convergence, 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake
Target case: pessimistic
Broad approach: engineering/cognitive
Some names: Jan Wehner, Alex Turner, Nina Panickssery, Marc Carauleanu, Collin Burns, Andrew Mack, Pedro Freire, Joseph Miller, Andy Zou, Andy Arditi, Ole Jorgensen.
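A bare-bones sketch of activation steering via a forward hook (it assumes the hooked module returns a plain tensor; the layer index, scale, and contrastive prompt sets are arbitrary assumptions):

```python
import torch

def add_steering_hook(layer_module, steering_vector, alpha=4.0):
    """Add alpha * steering_vector to the module's output at every forward pass.
    steering_vector is typically a difference of mean activations on two
    contrastive prompt sets (e.g. 'be nice' minus 'be rude')."""
    def hook(module, inputs, output):
        return output + alpha * steering_vector.to(output.device)
    return layer_module.register_forward_hook(hook)

# Usage sketch (names hypothetical):
# handle = add_steering_hook(model.transformer.h[13].mlp, nice_minus_rude_vec)
# ... generate as usual ...
# handle.remove()
```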
(Figuring out how to keep the model doing what it has been doing so far.)
One-sentence summary: avoid Goodharting by getting AI to satisfice rather than maximise.
Theory of change: if we fail to exactly nail down the preferences for a superintelligent agent we die to Goodharting → shift from maximising to satisficing in the agent’s utility function → we get a nonzero share of the lightcone as opposed to zero; also, moonshot at this being the recipe for fully aligned AI.
Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution
Target case: pessimistic
Broad approach: cognitive
Some names: Jobst Heitzig, Simon Fischer, Jessica Taylor
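One standard formalisation of satisficing is the quantilizer (not necessarily the exact construction this agenda uses): instead of maximising the proxy utility, sample from the top q fraction of actions under a trusted base distribution, which bounds how badly a misspecified utility can be Goodharted.

```python
import numpy as np

def quantilize(actions, utility, base_probs, q=0.1, rng=None):
    """Sample an action from the top-q fraction (by proxy utility) of a base
    distribution, rather than taking the argmax."""
    rng = rng or np.random.default_rng()
    actions = list(actions)
    base_probs = np.asarray(base_probs, dtype=float)
    order = np.argsort([-utility(a) for a in actions])   # best actions first
    top = order[: max(1, int(q * len(actions)))]
    p = base_probs[top] / base_probs[top].sum()           # renormalised base measure
    return actions[rng.choice(top, p=p)]
```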
One-sentence summary: make tools to write, execute and deploy cognitive programs; compose these into large, powerful systems that do what we want; make a training procedure that lets us understand what the model does and does not know at each step; finally, partially emulate human reasoning.
Theory of change: train a bounded tool AI to promote AI benefits without needing unbounded AIs. If the AI uses similar heuristics to us, it should default to not being extreme.
Which orthodox alignment problems could it help with?: 2. Corrigibility is anti-natural, 5. Instrumental convergence
Target case: pessimistic
Broad approach: engineering, cognitive
Some names: Connor Leahy, Gabriel Alfour, Adam Shimi
One-sentence summary: use weaker models to supervise and provide a feedback signal to stronger models.
Theory of change: find techniques that do better than RLHF at supervising superior models → track whether these techniques fail as capabilities increase further
Which orthodox alignment problems could it help with?: 8. Superintelligence can hack software supervisors
Target case: optimistic
Broad approach: engineering
Some names: Jan Leike, Collin Burns, Nora Belrose, Zachary Kenton, Noah Siegel, János Kramár, Noah Goodman, Rohin Shah
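The basic experimental setup, as a sketch: fine-tune a strong model on a weak supervisor's labels, mixed with a confidence term that lets the strong model overrule noisy weak labels. The auxiliary term and weighting here are simplifications and assumptions, not the published recipe.

```python
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits, weak_labels, aux_weight=0.5):
    """strong_logits: [batch, n_classes] from the strong model.
    weak_labels: [batch] hard predictions from the weaker supervisor."""
    ce_weak = F.cross_entropy(strong_logits, weak_labels)       # imitate the weak supervisor
    hard_self = strong_logits.argmax(dim=-1).detach()
    ce_self = F.cross_entropy(strong_logits, hard_self)          # commit to own best guess
    return (1 - aux_weight) * ce_weak + aux_weight * ce_self
```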
One-sentence summary: scalable tracking of behavioural drift, benchmarks for self-modification.
Theory of change: early models train ~only on human data while later models also train on early model outputs, which leads to early model problems cascading; left unchecked this will likely cause problems, so we need a better iterative improvement process.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
Target case: pessimistic
Broad approach: behavioural
Some names: Roman Engeler, Akbir Khan, Ethan Perez
One-sentence summary: Train human-plus-LLM alignment researchers: with humans in the loop and without outsourcing to autonomous agents. More than that, an active attitude towards risk assessment of AI-based AI alignment.
Theory of change: Cognitive prosthetics to amplify human capability and preserve values. More alignment research per year and dollar.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake
One-sentence summary: make open AI tools to explain AIs, including agents, e.g. feature descriptions for neuron activation patterns; an interface for steering these features; a behavior-elicitation agent that searches for user-specified behaviors in frontier models.
Theory of change: Introducing Transluce; improve interp and evals in public and get invited to improve lab processes.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors or 8. Superintelligence can hack software supervisors
Funded by: Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba.
Publicly-announced funding 2023-4: N/A
Task decomp
Recursive reward modelling is supposedly not dead but instead one of the tools OpenAI will build. Another line tries to make something honest out of chain of thought / tree of thought.
One-sentence summary: “make highly capable agents do what humans want, even when it is difficult for humans to know what that is”.
Theory of change: [“Give humans help in supervising strong agents”] + [“Align explanations with the true reasoning process of the agent”] + [“Red team models to exhibit failure modes that don’t occur in normal use”] are necessary but probably not sufficient for safe AGI.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 7. Superintelligence can fool human supervisors
Target case: worst-case
Broad approach: engineering, cognitive
Some names: Rohin Shah, Jonah Brown-Cohen, Georgios Piliouras
One-sentence summary: scalable oversight of truthfulness: is it possible to develop training methods that incentivize truthfulness even when humans are unable to directly judge the correctness of a model’s output? Also scalable benchmarking: how to measure (proxies for) speculative capabilities like situational awareness.
Theory of change: current methods like RLHF will falter as frontier AI tackles harder and harder questions → we need to build tools that help human overseers continue steering AI → let’s develop theory on what approaches might scale → let’s build the tools.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors
Target case: pessimistic
Broad approach: behavioural
Some names: Sam Bowman, Ethan Perez, He He, Mengye Ren
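The canonical proposal in this family is debate; a minimal sketch of the protocol (the three callables are placeholders for the two debater models and a weaker judge):

```python
def debate(question, debater_a, debater_b, judge, n_rounds=3):
    """Two models argue for opposing answers; a weaker (human or model) judge
    reads the transcript and picks the answer it finds better supported."""
    transcript = [f"Question: {question}"]
    for _ in range(n_rounds):
        transcript.append("A: " + debater_a(transcript))
        transcript.append("B: " + debater_b(transcript))
    return judge(transcript)
```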
One-sentence summary: try to formalise a more realistic agent, understand what it means for it to be aligned with us, translate between its ontology and ours, and produce desiderata for a training setup that points at coherent AGIs similar to our model of an aligned agent.
Theory of change: fix formal epistemology to work out how to avoid deep training problems.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 9. Humans cannot be first-class parties to a superintelligent value handshake
One-sentence summary: Get the thing to work out its own objective function (a la HCH).
Theory of change: make a fully formalized goal such that a computationally unbounded oracle with it would take desirable actions; and design a computationally bounded AI which is good enough to take satisfactory actions.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution
One-sentence intro: use causal models to understand agents. Originally this was to design environments where they lack the incentive to defect, hence the name.
Theory of change: as above.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 4. Goals misgeneralize out of distribution
Target case: pessimistic
Broad approach: behavioural/maths/philosophy
Some names: Tom Everitt, Matt McDermott, Francis Rhys Ward, Jonathan Richens, Ryan Carey
One-sentence summary: Develop formal models of subagents and superagents, use the model to specify desirable properties of whole-part relations (e.g. how to prevent human-friendly parts getting wiped out).
Theory of change: Solve self-unalignment, prevent destructive alignment, allow for scalable noncoercion.
See also Alignment of Complex Systems, multi-scale alignment, scale-free theories of agency, active inference, bounded rationality.
Which orthodox alignment problems could it help with?: 5. Instrumental convergence, 9. Humans cannot be first-class parties to a superintelligent value handshake
Target case: pessimistic
Broad approach: cognitive
Some names: Jan Kulveit, Roman Leventov, Scott Viteri, Michael Levin, Ivan Vendrov, Richard Ngo
One-sentence summary: model the internal components of agents, use humans as a model organism of AGI (humans seem made up of shards and so might AI). Now more of an empirical ML agenda.
Theory of change: If policies are controlled by an ensemble of influences ("shards"), consider which training approaches increase the chance that human-friendly shards substantially influence that ensemble.
Theory of change: generalize theorems → formalize agent foundations concepts like the agent structure problem → hopefully assist other projects through increased understanding
Which orthodox alignment problems could it help with?: "intended to help make progress on understanding the nature of the problems through formalization, so that they can be avoided or postponed, or more effectively solved by other research agenda."
Target case: pessimistic
Broad approach: maths/philosophy
Some names: Alex Altair, Alfred Harwood, Daniel C, Dalcy K, José Pedro Faustino
One-sentence summary: what is “optimisation power” (formally), how do we build tools that track it, and how relevant is any of this anyway. See also developmental interpretability.
Theory of change: existing theories are either rigorous OR good at capturing what we mean; let’s find one that is both → use the concept to build a better understanding of how and when an AI might get more optimisation power. Would be nice if we could detect or rule out speculative stuff like gradient hacking too.
Which orthodox alignment problems could it help with?: 5. Instrumental convergence
Target case: pessimistic
Broad approach: maths/philosophy
Some names: Alex Flint, Guillaume Corlouer, Nicolas Macé
One-sentence summary: predict properties of AGI (e.g. powerseeking) with formal models. Corrigibility as the opposite of powerseeking.
Theory of change: figure out hypotheses about properties powerful agents will have → attempt to rigorously prove under what conditions the hypotheses hold, test them when feasible.
(Figuring out how AI agents think about the world and how to get superintelligent agents to tell us what they know. Much of interpretability is incidentally aiming at this. See also latent knowledge.)
One-sentence summary: check the hypothesis that our universe “abstracts well” and that many cognitive systems learn to use similar abstractions. Check if features correspond to small causal diagrams corresponding to linguistic constructions.
Theory of change: find all possible abstractions of a given computation → translate them into human-readable language → identify useful ones like deception → intervene when a model is using it. Also develop theory for interp more broadly; more mathematical analysis. Also maybe enables “retargeting the search” (direct training away from things we don’t want).
See also: causal abstractions, representational alignment, convergent abstractions
Which orthodox alignment problems could it help with?: 5. Instrumental convergence, 7. Superintelligence can fool human supervisors, 9. Humans cannot be first-class parties to a superintelligent value handshake
Target case: worst-case
Broad approach: cognitive
Some names: John Wentworth, Paul Colognese, David Lorrell, Sam Eisenstat
One-sentence summary: mech interp plus formal verification. Formalize mechanistic explanations of neural network behavior, so as to predict when novel input may lead to anomalous behavior.
Theory of change: find a scalable method to predict when any model will act up. Very good coverage of the group’s general approach here.
See also: ELK, mechanistic anomaly detection.
Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution, 8. Superintelligence can hack software supervisors
Target case: worst-case
Broad approach: cognitive, maths/philosophy
Some names: Jacob Hilton, Mark Xu, Eric Neyman, Dávid Matolcsi, Victor Lecomte, George Robinson
One-sentence summary: future agents creating s-risks is the worst of all possible problems, we should avoid that.
Theory of change: make present and future AIs inherently cooperative via improving theories of cooperation and measuring properties related to catastrophic conflict.
See also: FOCAL
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 3. Pivotal processes require dangerous capabilities, 4. Goals misgeneralize out of distribution
Target case: worst-case
Broad approach: maths/philosophy
Some names: Jesse Clifton, Caspar Oesterheld, Anthony DiGiovanni, Maxime Riché, Mia Taylor
One-sentence summary: make sure advanced AI uses what we regard as proper game theory.
Theory of change: (1) keep the pre-superintelligence world sane by making AIs more cooperative; (2) remain integrated in the academic world, collaborate with academics on various topics and encourage their collaboration on x-risk; (3) hope that work on “game theory for AIs”, which emphasises cooperation and benefit to humans, has framing & founder effects on the new academic field.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 10. Humanlike minds/goals are not necessarily safe
Target case: pessimistic
Broad approach: maths/philosophy
Some names: Vincent Conitzer, Caspar Oesterheld, Vojta Kovarik
One-sentence summary: remain ahead of the capabilities curve / maintain the ability to figure out what’s up with state-of-the-art models, keep an updated risk profile, and propagate flaws to relevant parties as they are discovered.
One-sentence summary: model evaluations and conceptual work on deceptive alignment. Also an interp agenda (decompose NNs into components more carefully and in a more computation-compatible way than SAEs). Also deception evals in major labs.
Theory of change: “Conduct foundational research in interpretability and behavioural model evaluations, audit real-world models for deceptive alignment, support policymakers with our technical expertise where needed.”
Which orthodox alignment problems could it help with?: 2. Corrigibility is anti-natural, 4. Goals misgeneralize out of distribution
Target case: pessimistic
Broad approach: behavioural/cognitive
Some names: Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq, Mikita Balesni
One-sentence summary: do what needs doing, any type of work
Theory of change: make the field more credible. Make really good benchmarks, integrate academia into the field, advocate for safety standards and help design legislation.
Which orthodox alignment problems could it help with?: 1. Value is fragile and hard to specify, 10. Humanlike minds/goals are not necessarily safe, 13. Fair, sane pivotal processes
Target case: mixed
Broad approach: mixed
Some names: Dan Hendrycks, Andy Zou, Mantas Mazeika, Jacob Steinhardt, Dawn Song (some of these are not full-time at CAIS though).
One-sentence summary: theory generation, threat modelling, and toy methods to help with those. “Our main threat model is basically a combination of specification gaming and goal misgeneralisation leading to misaligned power-seeking.” See announcement post for full picture.
Theory of change: direct the training process towards aligned AI and away from misaligned AI: build enabling tech to ease/enable alignment work → apply said tech to correct missteps in training non-superintelligent agents → keep an eye on it as capabilities scale to ensure the alignment tech continues to work.
See also (in this document): Process-based supervision, Red-teaming, Capability evaluations, Mechanistic interpretability, Goal misgeneralisation, Causal alignment/incentives
Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution, 7. Superintelligence can fool human supervisors
Target case: pessimistic
Broad approach: engineering
Some names: Rohin Shah, Anca Dragan, Allan Dafoe, Dave Orr, Sebastian Farquhar
One-sentence summary: “(a) improved reasoning of AI governance & alignment researchers, particularly on long-horizon tasks, and (b) pushing supervision of process rather than outcomes, which reduces the optimisation pressure on imperfect proxy objectives, leading to ‘safety by construction’”.
Theory of change: “The two main impacts of Elicit on AI Safety are improving epistemics and pioneering process supervision.”
One-sentence summary: a science of robustness / fault tolerant alignment is their stated aim, but they do lots of interpretability papers and other things.
Theory of change: make AI systems less exploitable and so prevent one obvious failure mode of helper AIs / superalignment / oversight: attacks on what is supposed to prevent attacks. In general, work on overlooked safety research others don’t do for structural reasons: too big for academia or independents, but not totally aligned with the interests of the labs (e.g. prototyping moonshots, embarrassing issues with frontier models).
Some names: Adrià Garriga-Alonso, Adam Gleave, Chris Cundy, Mohammad Taufeeque, Kellin Pelrine
One-sentence summary: funds academics or near-academics to do ~classical safety engineering on AIs. A collaboration between the NSF and OpenPhil. Projects include
“Neurosymbolic Multi-Agent Systems”
“Conformal Safe Reinforcement Learning”
“Autonomous Vehicles”
Theory of change: apply safety engineering principles from other fields to AI safety.
Which orthodox alignment problems could it help with?: 4. Goals misgeneralize out of distribution
# FTEs: “80”. But this includes lots working on bad-words prevention and copyright-violation prevention.
They just lost Lilian Weng, their VP of safety systems.
OpenAI Alignment Science
One-sentence summary: use reasoning systems to prevent models from generating unsafe outputs. Unclear if this is a decoding-time thing (i.e. actually a control method) or a fine-tuning thing.
Some names: Mia Glaese, Boaz Barak, Johannes Heidecke, Melody Guan. Lost its head, John Schulman.
One-sentence summary: Fundamental research in LLM security, plus capability demos for outreach, plus workshops.
Theory of change: control is much easier if we can secure the datacenter / if hacking becomes much harder. The cybersecurity community need to be alerted.
Which orthodox alignment problems could it help with?: 7. Superintelligence can fool human supervisors, 8. Superintelligence can hack software supervisors, 12. A boxed AGI might exfiltrate itself by steganography, spearphishing
Target case: worst-case
Broad approach: engineering
Some names: Jeffrey Ladish, Charlie Rogers-Smith, Ben Weinstein-Raun, Dmitrii Volkov
One-sentence summary: figure out how a model works, automatically. Diagnose its trustworthiness, improve its trustworthiness, guarantee its trustworthiness.
Theory of change: automatically extract the knowledge learned during training, then reimplement it in an architecture where we can formally verify that it will do what we want. Replace AGI.
See also: SAEs, concept-based interp, provably safe systems, program synthesis, this.
Which orthodox alignment problems could it help with?: most, by avoiding opaque AGI.
Target case: worst-case
Broad approach: cognitive
Some names: Ziming Liu, Peter Park, Eric Michaud, Wes Gurnee
One-sentence summary: technical research to enable sensible governance, with leverage from government mandates.
Theory of change: improve evals, measure harms, develop a method for real AI safety cases, help governments understand the current safety situation, build an international consensus.
Which orthodox alignment problems could it help with?: 9. Humans cannot be first-class parties to a superintelligent value handshake, 13. Fair, sane pivotal processes
Target case: pessimistic
Broad approach: behavioural
Some names: Geoffrey Irving, Benjamin Hilton, Yarin Gal, JJ Allaire
Some of them continue to work on alignment elsewhere.
Old-school OpenAI
The team “OpenAI AGI Readiness Team”.
The name “OpenAI Superalignment Team”.
Ilya Sutskever, Alec Radford, Jacob Hilton, Richard Ngo, Miles Brundage, Lilian Weng, Jan Leike, John Schulman, Andrej Karpathy again, Daniel Kokotajlo, William Saunders, Cullen O’Keefe, Carroll Wainwright, Ryan Lowe.
We again omit technical governance, AI policy, and activism. This is even more of an omission than it was last year, so see other reviews.
We started with last year’s list, moving out any agendas without public outputs this year. We also listed agendas known to be inactive in the Graveyard.
An agenda is an odd unit; it can be larger than one team, and researchers and agendas are often in a many-to-many relation. Using it as our unit also excludes illegible or exploratory research – anything which doesn’t have a manifesto.
All organisations have private info, and in all cases we’re working off public info, so remember that we will be systematically off by some measure.
We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.
The target case: what part of the distribution over alignment difficulty do they aim to help with? (via Ngo)
“optimistic-case”[9]: if CoT is faithful, pretraining as value loading, no stable mesa-optimizers, the relevant scary capabilities are harder than alignment, zero-shot deception is hard, goals are myopic, etc
pessimistic-case: if we’re in-between the above and the below
worst-case: if power-seeking is rife, zero-shot deceptive alignment, steganography, gradient hacking, weird machines, weird coordination, deep deceptiveness
The broad approach: roughly what kind of work is it doing, primarily? (via Ngo)
engineering: iterating over outputs
behavioural: understanding the input-output relationship
cognitive: understanding the algorithms
maths/philosophy[10]: providing concepts for the other approaches
As they are largely outside the scope of this review, subproblem 6 (Pivotal processes likely require incomprehensibly complex plans) does not appear at all, and the following appear only rarely, with large error bars for accuracy:
11. Someone else will deploy unsafe superintelligence first
13. Fair, sane pivotal processes
We added some new agendas, including by scraping relevant papers from arXiv and ML conferences. We scraped every Alignment Forum post and reviewed the top 100 posts by karma and novelty. The inclusion criterion is vibes: whether it seems relevant to us.
We dropped the operational criteria this year because we made our point last year and it’s clutter.
Lastly, we asked some reviewers to comment on the draft.
Thanks to Vanessa Kosoy, Nora Ammann, Erik Jenner, Justin Shovelain, Gabriel Alfour, Raymond Douglas, Walter Laurito, Shoshannah Tekofsky, Jan Hendrik Kirchner, Dmitry Vaintrob, Leon Lang, Tushita Jha, Leonard Bereska, and Mateusz Bagiński for comments. Thanks to Joe O’Brien for sharing their taxonomy. Thanks to our Manifund donors and to OpenPhil for top-up funding.
Vanessa Kosoy notes: ‘IMHO this is a very myopic view. I don't believe plain foundation models will be transformative, and even in the world in which they will be transformative, it will be due to implicitly doing RL "under the hood".’
This is fine as a standalone description, but in practice lots of interp work is aimed at interventions for alignment or control. This is one reason why there’s no overarching “Alignment” category in our taxonomy.
Nora Ammann notes: “I typically don’t cash this out into preferences over future states, but what parts of the statespace we define as safe / unsafe. In SgAI, the formal model is a declarative model, not a model that you have to run forward. We also might want to be more conservative than specifying preferences and instead "just" specify unsafe states -- i.e. not ambitious intent alignment.”