Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.


You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you and the field has nearly no ratchet, no common knowledge of what everyone is doing and why, what is abandoned and why, what is renamed, what relates to what, what is going on. 

This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a link-dump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference”, “I just read a post on a surprising new insight and want to see who else has been working on this”, and “I wonder roughly how many people are working on that thing”. 

This doc is unreadably long, so that it can be Ctrl-F-ed. Also this way you can fork the list and make a smaller one. 

Our taxonomy:

  1. Understand existing models (evals, interpretability, science of DL)
  2. Control the thing (prevent deception, model edits, value learning, goal robustness)
  3. Make AI solve it (scalable oversight, cyborgism, etc)
  4. Theory (galaxy-brained end-to-end, agency, corrigibility, ontology, cooperation)

Please point out if we mistakenly round one thing off to another, miscategorise someone, or otherwise state or imply falsehoods. We will edit.

Unlike the late Larks reviews, we’re not primarily aiming to direct donations. But if you enjoy reading this, consider donating to Manifund, MATS, or LTFF, or to Lightspeed for big ticket amounts: some good work is bottlenecked by money, and you have free access to the service of specialists in giving money for good work.


When I (Gavin) got into alignment (actually it was still ‘AGI Safety’) people warned me it was pre-paradigmatic. They were right: in the intervening 5 years, the live agendas have changed completely.[1] So here’s an update. 

Chekhov’s evaluation: I include Yudkowsky’s operational criteria (Trustworthy command?, closure?, opsec?, commitment to the common good?, alignment mindset?) but don’t score them myself. The point is not to throw shade but to remind you that we often know little about each other. 

See you in 5 years.


  • Alignment is now famous enough that Barack Obama is sort of talking about it. This will attract climbers, grifters, goodharters and those simply misusing the word because it’s objectively confusing and attracts money and goodwill. We already had to half-abandon “AI safety” because of motivated semantic creep. 
  • Low confidence: Mech interp probably has its share of people by now (though I accept that it is an excellent legible [ha] on-ramp and there’s lots of pre-chewed projects ready to go).
  • MATS works well (on average, with high variance). The London extension is a very good idea. They just got $185k from SFF but are still constrained.
  • Not including governance work leaves out lots of cool “technical policy”: forecasting, compute monitoring, trustless model verification, safety cases.
  • Whole new types of people are contributing, which is nice. I have in mind PIBBSS and the CAIS philosophers and the SLT mob and Eleuther’s discordant energy. 
  • The big labs seem to be betting the farm on scalable oversight. This relies on no huge capabilities spikes and no irreversible misgeneralisation.
  • The de facto agenda of the uncoordinated and only-partially paradigmatic field is process-based supervision / defence in depth / hodgepodge / endgame safety / Shlegeris v1. We will throw together a dozen things which work in sub-AGIs and hope: RLHF/DPO + mass activation patching + scoping models down + boxing + dubiously scalable oversight + myopic training + data curation + passable automated alignment research (proof assistants) + … We will also slow things down by creating a (hackable, itself slow OODA) safety culture. Who knows.


1. Understand existing models



(Figuring out how a trained model behaves.)

Various capability evaluations

  • One-sentence summary: make tools that can actually check whether a model has a certain capability / misalignment mode. We default to low-n sampling of a vast latent space but aim to do better.
  • Theory of change: most models have a capabilities overhang when first trained and released; we should keep a close eye on what capabilities are acquired when so that frontier model developers are better informed on what security measures are already necessary (and hopefully they extrapolate and eventually panic).
  • Grouping together ARC Evals, Deepmind, Cavendish, situational awareness crew, Evans and Ward, Apollo. See also Model Psychology; neuroscience : psychology :: interpretability : model psychology. See also alignment evaluations. See also capability prediction and the hundreds of trolls doing, ahem, decentralised evals.
  • Some names: Mary Phuong, Toby Shevlane, Beth Barnes, Holden Karnofsky, Lawrence Chan, Owain Evans, Francis Rhys Ward, Apollo, Palisade, OAI Preparedness
  • Estimated # FTEs: 13 (ARC), ~50 elsewhere
  • Some outputs in 2023: AI AI research, autonomy, Do the Rewards Justify the Means?, Stubbornness. Naming the thing that GPTs have become was useful. Tag.
  • Critiques: Hubinger, Hubinger, Shovelain & Mckernon 
  • Funded by: Various. 
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ~~$20,000,000 not counting the new government efforts
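
To see why low-n sampling is a weak default, here is a minimal sketch of what a clean eval result can and cannot tell you. It assumes (our assumption, not the post's) that each sampled prompt is an independent draw from the distribution you care about, which real eval suites rarely satisfy. The function name is hypothetical.

```python
def upper_bound_zero_hits(n, confidence=0.95):
    """Upper confidence bound on the per-sample rate of a behaviour,
    given that it was observed 0 times in n independent samples.

    Solves (1 - p)^n = 1 - confidence for p. For 95% confidence this is
    close to the familiar "rule of three" approximation, 3 / n.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

# 30 clean samples still leave room for a sizeable per-sample rate.
bound = upper_bound_zero_hits(30)
print(f"0 hits in 30 samples: rate could still be up to {bound:.3f}")
print(f"rule-of-three approximation: {3 / 30:.3f}")
```

With 30 clean samples the bound is still roughly 9.5%, so "we tried it a few dozen times and nothing happened" says little about rare but dangerous behaviour.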

Various red-teams

  • One-sentence summary: let’s attack current models and see what they do / deliberately induce bad things on current frontier models to test out our theories / methods. See also gain of function experiments (producing demos and toy models of misalignment). See also “Models providing Critiques”. See also: threat modelling (Model Organisms, Powerseeking, Apollo); steganography; part of OpenAI’s superalignment schedule; Trojans (CAIS); Latent Adversarial Training is an unusual example. 
  • Some names: Stephen Casper, Lauro Langosco, Jacob Steinhardt, Nina Rimsky, Jeffrey Ladish/Palisade, Ethan Perez, Geoffrey Irving, ARC Evals, Apollo, Dylan Hadfield-Menell/AAG
  • Estimated # FTEs: ?
  • Some outputs in 2023: Rimsky, Wang, Wei, Tong, Casper, Ladish, Langosco, Shah, Scheurer, AAG; 2022: Irving 
  • Critiques: ?
  • Funded by: Various
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: Large


Eliciting model anomalies 

  • One-sentence summary: finding weird features of current models, in a way which isn’t fishing for capabilities nor exactly red-teaming. Think inverse scaling, SolidGoldMagikarp, Reversal curse, out of context. Not an agenda but a multiplier on others. 
  • Theory of change: maybe anomalies and edge cases tell us something deep about the models; you need data to theorise.

Alignment of Complex Systems: LLM interactions

  • One-sentence summary: understand LLM interactions, their limits, and work up from empirical work towards more general hypotheses about complex systems of LLMs, such as network effects in hybrid systems and scaffolded models.
  • Theory of change: Aggregates are sometimes easier to predict / theorise than individuals: the details average out. So experiment with LLM interactions (manipulation, conflict resolution, systemic biases etc). Direct research towards LLM interactions in future large systems (in contrast to the current singleton focus); prevent systemic bad design and inform future models.
  • Some names: Jan Kulveit, Tomáš Gavenčiak, Ada Böhm
  • Estimated # FTEs: 4
  • Some outputs in 2023: software and insights into LLMs
  • Critiques: Yudkowsky on the interfaces idea
  • Funded by: SFF
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ~$300,000

The other evals (groundwork for regulation)

Much of Evals and Governance orgs’ work is something different: developing politically legible metrics, processes, shocking case studies. The aim is to motivate and underpin actually sensible regulation. 

But this is a technical alignment post. I include this section to emphasise that these other evals (which seek confirmation) are different from understanding whether dangerous capabilities have emerged or might emerge.


(Figuring out what a trained model is actually computing.)[2]

Ambitious mech interp

  • In the sense of complete bottom-up circuit-level reconstruction of learned algorithms.
  • One-sentence summary: find circuits for everything automatically, then figure out if the model will do bad things (which algorithm implementing which plan; a full causal graph with a sensible number of nodes); any model that will do bad things can then be deleted or edited.
  • Theory of change: aid alignment through ontology identification, auditing for deception and planning, force-multiplier for alignment research, intervening to make training safer, inference-time controls to act on hypothetical real-time monitoring. Iterate towards things which don’t scheme. See also scalable oversight.
  • Some names: Chris Olah, Lee Sharkey, Neel Nanda, Steven Bills, Nick Cammarata, Leo Gao, William Saunders, Apollo (private work)
  • Estimated # FTEs: 80? (Anthropic, Apollo, DeepMind, OpenAI, various smaller orgs)
  • Some outputs in 2023: monosemantic features; the linear representations hypothesis seems well on the way to being confirmed. Jenner, Universality, Lieberum, distilled rep...
  • Critiques: Summarised here: Charbel, Bushnaq, Casper, Shovelain & Mckernon, RicG, Kross, Hobbhahn.
  • Funded by: Various (Anthropic, Deepmind, OpenAI, MATS)
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: many millions

Concept-based interp 

  • One-sentence summary: if ground-up understanding of models turns out to be too hard/impossible, we might still be able to jump in at some high level of abstraction and still steer away from misaligned AGI. AKA “high-level interpretability”.
  • Theory of change: build tools that can output a probable and predictive representation of internal objectives or capabilities of a model, thereby solving inner alignment.
  • Some names: Erik Jenner, Jessica Rumbelow, Stephen Casper, Arun Jose, Paul Colognese
  • Estimated # FTEs: ?
  • Some outputs in 2023: High-level Interpretability, Internal Target Information for AI Oversight
  • Critiques:
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

Causal abstractions

  • One-sentence summary: Wentworth 2020; partially describe the “algorithm” a neural network or other computation is using, while throwing away irrelevant details.
  • Theory of change: find all possible abstractions of a given computation -> translate them into human-readable language -> identify useful ones like deception -> intervene when a model is doing it. Also develop theory for interp more broadly as a multiplier; more mathematical (hopefully, more generalizable) analysis.
  • Some names: Erik Jenner, Atticus Geiger
  • Estimated # FTEs: ?
  • Some outputs in 2023: Jenner’s agenda, Causal Abstraction for Faithful Model Interpretation 
  • Critiques:
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

EleutherAI interp

  • One-sentence summary: tools to investigate questions like path dependence of training.
  • Theory of change: make amazing tools to push forward the frontier of interpretability.
  • Some names: Stella Biderman, Nora Belrose, AI_WAIFU, Shivanshu Purohit 
  • Estimated # FTEs: ~12 plus ~~50 part-time volunteers
  • Some outputs in 2023: LEACE (see also Surgical Model Edits); Tuned Lens; improvements on CCS: VINC, ELK generalisation
  • Critiques:
  • Funded by: Hugging Face, Stability AI, Nat Friedman, Lambda Labs, Canva, CoreWeave
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $2,000,000? (guess)

Activation engineering (as unsupervised interp)

  • One-sentence summary: intervene on model representations and so get good causal evidence when dishonesty, powerseeking, and other intrinsic risks show up; also test interpretability theories and editing theories. See also the section of the same name under “Model edits” below.
  • Theory of change: test interpretability theories as part of that theory of change; find new insights from interpretable causal interventions on representations. Unsupervised means no annotation bias, which lowers one barrier to extracting superhuman representations.
  • Some names: Alex Turner, Collin Burns, Andy Zou, Kaarel Hänni, Walter Laurito, Cadenza (manifund)
  • Estimated # FTEs: ~15
  • Some outputs in 2023: famously CCS last year, steering vectors in RL and GPTs, Cadenza RFC, the shape of concepts, Roger experiment. See also representation engineering
  • Critiques:
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?


  • One-sentence summary: research startup selling an interpretability API (model-agnostic feature viz of vision models). Aiming for data-independent (“want to extract information directly from the model with little dependence on training or test data”) and global (“mech interp isn’t going to be enough, we need holistic methods that capture gestalt”) interpretability methods.
  • Theory of change: make safety tools people want to use, stress-test methods in real life, develop a strong alternative to bottom-up circuit analysis.
  • Some names: Jessica Rumbelow
  • Estimated # FTEs: 5
  • Some outputs in 2023Prototype generation 
  • Critiques: ?
  • Funded by: private investors
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: millions on the way

Understand learning

(Figuring out how the model figured it out.)

Timaeus: Developmental interpretability & singular learning theory 

  • One-sentence summary: Build tools for detecting, locating, and interpreting phase transitions that govern training and in-context learning in models, inspired by concepts in singular learning theory (SLT), statistical physics, and developmental biology.
  • Theory of change: When structure forms in neural networks, it can leave legible developmental traces that we can interpret to figure out where and how that structure is implemented. This paves a way to scalable, automated interpretability. In particular, it may be hopeless to intervene at the end of the learning process, so we want to catch and prevent deceptiveness and other dangerous capabilities and values as early as possible.
  • Some names: Jesse Hoogland, Alexander Gietelink Oldenziel, Daniel Murfet, Stan van Wingerden
  • Estimated # FTEs: 10
  • Some outputs in 2023: Dynamical phase transitions, degeneracy in singular models; see also Eleuther’s Pythia
  • Critiques: self, Ege, Skalse
  • Funded by: Manifund
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $145,000


Various other efforts:

2. Control the thing

(Figuring out how to predictably affect model behaviour.)

Prosaic alignment / alignment by default 

  • One-sentence summary: nudging base models by optimising their output. (RLHF, Constitutional, DPO, SFT, HHH, RLAIF.) Not really an agenda, but part of agendas, like Korbak or Lindner, or Redwood’s defunct harmlessness finetunes, or the Karnofsky plan. I like the name “blind output alignment” for this but “prosaic alignment” is well-established. 
  • Imputed assumptions: things are generally smooth, relevant capabilities are harder than alignment, assume no mesaoptimisers, that zero-shot deception is hard, assume a humanish ontology is learned, assume no simulated agents, assume that noise in the data means that human preferences are not ruled out. Assume that alignment is a superficial feature. Maybe assume that thoughts are translucent.
  • Theory of change (steelman): we can observe some forms of overoptimisation and correct for them, which might reduce misgeneralisation risk, maybe a lot.
  • Some names: Anthropic, OpenAI, Scale, Hugging Face H4, Eleuther, LeCun. Nora Belrose and Quintin Pope state a related view well here. JD Pressman is indefatigable.
  • Estimated # FTEs: ~~1,000
  • Some outputs in 2023: CCAI. DPO seems to be the big advance.
  • Critique: hoo boy. Just the very most recent: Open Problems with RLHF, neo-Arrow, Jain & Kirk.
  • Funded by: big tech, VCs
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: some large fraction of the AI industry.

Redwood: control evaluations

  • One-sentence summary: evaluate the humans/org instead: assess the risk of a particular protocol for deploying your model in a particular scenario, by silently swapping in an intentionally misaligned model. Play out the scenario in sim until 1) a catastrophe occurs, 2) your protocol confidently spots the red team model, or 3) the scenario ends without catastrophe. Also do adversarial evaluation (monitoring untrusted model outputs with both trusted and untrusted models). Predecessor.
  • Theory of change: prevent high-stakes failures by automating risk assessment of eval and deployment plans.
  • Some names: Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan
  • Estimated # FTEs: 10?
  • Some outputs in 2023: big post, another in the works.
  • Critiques: of org in general
  • Funded by: OpenPhil
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $5,300,000 (whole org last year)
  • Conceived as the same initiative as:


Safety scaffolds

  • One-sentence summary: Soft boxing. Just as people are wrapping LLMs in tooling to boost their capabilities, so can we put up security layers: detectors and classifiers and censors and anomaly detectors and debate partners and "trusted dumb agents" and so on.  See also process-based supervision.
  • Theory of change: beating every scaffold is conjunctive (and some of the scaffolds are fairly smart), so takeover attempts are more likely to be caught.
  • Some names: Buck Shlegeris, Fabien Roger. (Lots of people are doing this de facto but Redwood are the ones reifying it in public.) See also OAI Preparedness.
  • Estimated # FTEs: ?
  • Some outputs in 2023: first principles, nice diagram, defenses against encoded reasoning, coup probes. See also Herd.
  • Critique: I mean kinda these.
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?
  • h/t Zach Stein-Perlman for resolving this.
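
The “conjunctive” argument in the theory of change above is easy to make concrete. A minimal sketch, assuming each scaffold layer catches an attempt independently of the others (a strong assumption that is ours, not the post's; correlated failures make the real picture worse). The layer names and probabilities are invented for illustration.

```python
from math import prod

def p_evade_all(catch_probs):
    """Probability that an attempt slips past every layer,
    ASSUMING the layers catch independently of one another."""
    return prod(1.0 - p for p in catch_probs)

# Hypothetical per-layer catch probabilities (illustrative numbers only).
layers = {
    "output classifier": 0.5,
    "anomaly detector": 0.3,
    "trusted dumb monitor": 0.6,
}

p_evade = p_evade_all(layers.values())
p_caught = 1.0 - p_evade
print(f"evades all layers: {p_evade:.2f}, caught by at least one: {p_caught:.2f}")
```

No single layer here is better than 60%, yet under independence the stack catches most attempts. The argument weakens to the extent that one trick defeats several layers at once.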


Prevent deception 

Through methods besides mechanistic interpretability.

Redwood: mechanistic anomaly detection

  • One-sentence summary: measurement tampering is where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome.
  • Theory of change: find out when measurement tampering occurs -> build models that don’t do that.
  • See also CAIS
  • Some names: Fabien Roger, Ryan Greenblatt, Max Nadeau, Buck Shlegeris, Nate Thomas
  • Estimated # FTEs: 2.5?
  • Some outputs in 2023: measurement tampering, password-locked, coup probes
  • Critiques: general, of the org; critique of past agenda  
  • Funded by: OpenPhil
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $5,300,000 (whole org, last year)

Indirect deception monitoring 

  • One-sentence summary: build tools to find whether a model will misbehave in high stakes circumstances by looking at it in testable circumstances. This bucket catches work on lie classifiers, sycophancy, Scaling Trends For Deception.
  • Theory of change: maybe we can catch a misaligned model by observing dozens of superficially unrelated parts, or tricking it into self-reporting, or by building the equivalent of brain scans.
  • Some names: Dan Hendrycks, Owain Evans, Jan Brauner, Sören Mindermann. See also Apollo, CAIS, CAIF, and the two activation engineering sections in this post.
  • Estimated # FTEs: 20?
  • Some outputs in 2023: AI deception survey, Natural selection, Overview, RepEng again
  • Critique (of related ideas): 1%
  • Funded by: OpenPhil
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $10,618,729 (CAIS, whole org)

Anthropic: externalised reasoning oversight

  • One-sentence summary: Train models that print their actual reasoning in English (or another language we can read) every time. Give negative reward for dangerous-seeming reasoning, or just get rid of models that engage in it. 
  • Theory of change: “Force a language model to think out loud, and use the reasoning itself as a channel for oversight. If this agenda is successful, it could defeat deception, power-seeking, and other disapproved reasoning.”
  • See also sycophancy.
  • Some names: Tamera Lanham, Ansh Radhakrishnan
  • Estimated # FTEs: ?
  • Some outputs in 2023: CoT faithfulness, Question decomposition faithfulness 
  • Critiques: Samin
  • Funded by: Anthropic investors
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: large


Surgical model edits

(interventions on model internals)

Weight editing

  • One-sentence summary: targeted finetuning aimed at changing single fact representations and maybe higher-level stuff
  • Theory of change: the other half of mech interp; one family of methods to delete bad things, how to add good things.
  • Some outputs in 2023: multi-objective weight masking. See also concept erasure.
  • Critiques: of ROME


Activation engineering 

  • One-sentence summary: let’s see if we can programmatically modify activations to steer outputs towards what we want, in a way that generalises across models and topics. As much or more an intervention-based approach to interpretability than about control (see above).
  • Theory of change: maybe simple things help: let’s build more stuff to stack on top of finetuning. Activations are the last step before output and so interventions on them are less pre-emptable. Slightly encourage the model to be nice, add one more layer of defence to our bundle of partial alignment methods.
  • Some names: Alex Turner, Andy Zou, Nina Rimsky, Claudia Shi, Léo Dana, Ole Jorgensen. See also Li and Bau Lab
  • Estimated # FTEs: ~20
  • Some outputs in 2023: famously CCS last year, steering vectors in RL and GPTs, Cadenza RFC, the shape of concepts, Roger experiment. See also representation engineering 
  • Critiques: of ROME
  • Funded by: Deepmind? Anthropic? MATS?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?
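
For readers who haven't seen it, here is a toy sketch of one simple variant of activation steering: take the difference of mean activations between two contrasting prompt sets and add a scaled copy of it at inference time. Plain Python lists stand in for a model's hidden activations; no real model or library is involved, and all function names and numbers are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(positive_acts, negative_acts):
    """Difference of means: points from the 'negative' behaviour
    towards the 'positive' one in activation space."""
    pos = mean_vector(positive_acts)
    neg = mean_vector(negative_acts)
    return [p - q for p, q in zip(pos, neg)]

def steer(activation, vector, coeff=1.0):
    """Add the scaled steering vector to a single activation."""
    return [a + coeff * v for a, v in zip(activation, vector)]

# Made-up activations recorded on prompts showing / not showing a behaviour.
positive_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
negative_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]

vec = steering_vector(positive_acts, negative_acts)
steered = steer([0.5, 0.5, 0.5], vec, coeff=0.5)
print(vec, steered)
```

In a real model the same addition would be applied at a chosen layer during the forward pass. Which layer and which coefficient work is an empirical question that this sketch says nothing about.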

Getting it to learn what we want

(Figuring out how to control what the model figures out.)

Social-instinct AGI

  • One-sentence summary: Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let's figure out what those circuits are and how they work; this will involve symbol grounding. Newest iteration of a sustained and novel agenda.
  • Theory of change: Fairly direct alignment via changing training to reflect actual human reward. Get actual data about (reward, training data) → (human values) to help with theorising this map in AIs; "understand human social instincts, and then maybe adapt some aspects of those for AGIs, presumably in conjunction with other non-biological ingredients".
  • Some names: Steve Byrnes
  • Estimated # FTEs: 1
  • Some outputs in 2023
  • Critiques: ?
  • Funded by: Astera
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?


Imitation learning

  • One-sentence summary: train models on human behaviour (such as monitoring which keys a human presses when in response to what happens on a computer screen); contrast with Strouse
  • Theory of change: humans learn well by observing each other -> let’s test whether AIs can learn by observing us -> outer alignment and moonshot at safe AGI.
  • If you squint, this is what 'alignment by default' is doing, in the form of self-supervised learning imitating the human web corpus. But the proposed algorithms in imitation learning proper are very different and more obviously Bayesian. 
  • Some names: Jérémy Scheurer, Tomek Korbak, Ethan Perez
  • Estimated # FTEs: ?
  • Some outputs in 2023: Imitation Learning from Language Feedback, survey, nice theory from 2022 
  • Critiques: many
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

Reward learning 

  • One-sentence summary: People like CHAI are still looking at reward learning to “reorient the general thrust of AI research towards provably beneficial systems”. (They are also doing a lot of advocacy, like everyone else.)
  • Theory of change: understand what kinds of things can go wrong when humans are directly involved in training a model -> build tools that make it easier for a model to learn what humans want it to learn.
  • See also RLHF and recursive reward modelling, the industrialised forms.
  • Some names: CHAI among others
  • Estimated # FTEs: ?
  • Some outputs in 2023: Multiple teachers, Minimal knowledge, etc, Causal confusion 
  • Critiques: nice summary of historical problem statements
  • Funded by: mostly OpenPhil
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $12,222,246 (CHAI, whole org, 2021, not counting the UC Berkeley admin tax)

Goal robustness 

(Figuring out how to make the model keep doing ~what it has been doing so far.)

Measuring OOD

  • One-sentence summary: let’s build models that can recognize when they are out of distribution (or at least give us tools to notice when they are). See also anomaly detection.
  • Theory of change: bad things happen if powerful AI “learns the wrong lesson” from training data; we should make it not do that.
  • Some names: Steinhardt, Tegan Maharaj, Irina Rish. Maybe this.
  • Estimated # FTEs: ?
  • Some outputs in 2023: Alignment from a DL perspective, Goal Misgeneralization, CoinRun: Solving Goal Misgeneralisation, Modeling ambiguity
  • Critiques: ?
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?
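
As a baseline for what “tools to notice when they are out of distribution” could mean, here is a deliberately crude sketch: flag an input when any feature sits many standard deviations from what was seen in training. This is a generic statistical heuristic, not a method from any of the works listed above. The feature vectors, threshold, and function names are all assumptions for illustration.

```python
from statistics import mean, stdev

def fit(train_rows):
    """Per-feature (mean, sample std) from training data.
    Assumes every feature has non-zero variance."""
    return [(mean(col), stdev(col)) for col in zip(*train_rows)]

def ood_score(x, stats):
    """Largest absolute z-score across features."""
    return max(abs(v - m) / s for v, (m, s) in zip(x, stats))

def is_ood(x, stats, threshold=3.0):
    """Flag inputs whose worst feature is more than `threshold` stds out."""
    return ood_score(x, stats) > threshold

# Invented 2-feature training set.
train = [[0.0, 10.0], [1.0, 12.0], [2.0, 11.0], [1.0, 11.0]]
stats = fit(train)

print(is_ood([1.0, 11.0], stats))  # sits at the training mean
print(is_ood([9.0, 11.0], stats))  # far out on the first feature
```

A per-feature check like this misses inputs that are unusual only in combination, one reason the real problem is harder than the sketch.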

Concept extrapolation 

  • One-sentence summary: continual learning to make model internals more stable as they learn / as the world changes; safely extending features an agent has learned in training to new datasets and environments.
  • Theory of change: get them to generalise our values roughly correctly and OOD. Also ‘let's make it an industry standard for AI systems to "become conservative and ask for guidance when facing ambiguity", and gradually improve the standard from there as we figure out more alignment stuff.’ – Bensinger’s gloss.
  • Some names: Stuart Armstrong
  • Estimated # FTEs: 4?
  • Some outputs in 2023: good primer, solved a toy problem
  • Critiques: Soares
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

Mild optimisation

  • One-sentence summary: avoid Goodharting by getting AI to satisfice rather than maximise.
  • Theory of change: if we fail to exactly nail down the preferences for a superintelligent agent we die to Goodharting -> shift from maximising to satisficing in the agent’s utility function -> we get a nonzero share of the lightcone as opposed to zero; also, moonshot at this being the recipe for fully aligned AI.
  • Some names:
  • Estimated # FTEs: 2?
  • Some outputs in 2023: Gillen; Soft-optimisation, Bayes and Goodhart
  • Critiques: Dearnaley?
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?
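
The maximise-versus-satisfice contrast can be shown with a toy example. This is a minimal sketch with invented numbers: the proxy score is what the agent optimises, the true value is what we actually care about, and the top-proxy action is a metric exploit. The candidate ordering (most conservative first) is our assumption and does real work here.

```python
# (name, proxy_score, true_value), all numbers invented for illustration
actions = [
    ("do nothing", 0.0, 0.0),
    ("modest plan", 0.7, 0.6),
    ("ambitious plan", 0.9, 0.5),
    ("exploit the metric", 1.0, -5.0),
]

def maximise(candidates):
    """Pick the action with the highest proxy score."""
    return max(candidates, key=lambda a: a[1])

def satisfice(candidates, threshold):
    """Pick the first action whose proxy score is good enough.
    Returns None if nothing clears the threshold."""
    for action in candidates:
        if action[1] >= threshold:
            return action
    return None

best = maximise(actions)
good_enough = satisfice(actions, threshold=0.6)
print("maximiser picks:", best[0], "true value", best[2])
print("satisficer picks:", good_enough[0], "true value", good_enough[2])
```

The satisficer avoids the exploit only because the exploit happens to sit late in the list. Nothing in the rule itself forbids it, which is one reason plain satisficing is usually treated as a starting point rather than a solution.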

3. Make AI solve it

(Figuring out how models might help with figuring it out.)

OpenAI: Superalignment 

  • One-sentence summary: be ready to align a human-level automated alignment researcher.
  • Theory of change: get it to help us with scalable oversight, Critiques, recursive reward modelling, and so solve inner alignment. See also seed.
  • Some names: Ilya Sutskever, Jan Leike, Leopold Aschenbrenner, Collin Burns
  • Estimated # FTEs: 30?
  • Some outputs in 2023: whole org
  • Critiques: Zvi, Christiano, MIRI, Steiner, Ladish, Wentworth, Gao lol
  • Funded by: Microsoft
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ~$100m of compute alone (20% of OpenAI’s secured compute)

Supervising AIs improving AIs

  • One-sentence summary: researching scalable methods of tracking behavioural drift in language models and benchmarks for evaluating a language model's capacity for stable self-modification via self-training.
  • Theory of change: early models train ~only on human data while later models also train on early model outputs, which leads to early model problems cascading; left unchecked this will likely cause problems, so we need a better iterative improvement process.
  • Some names: Quintin Pope, Jacques Thibodeau, Owen Dudney, Roman Engeler
  • Estimated # FTEs: 2
  • Some outputs in 2023: ?
  • Critiques:
  • Funded by: LTFF, Lightspeed, OpenPhil, tiny Lightspeed grant
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ~$100,000 


  • One-sentence summary: Train human-plus-LLM alignment researchers: with humans in the loop and without outsourcing to autonomous agents. More than that, an active attitude towards risk assessment of AI-based AI alignment.
  • Theory of change: Cognitive prosthetics to amplify human capability and preserve values. More alignment research per year and dollar.
  • Some names: Janus, Kees Dupuis. See also this team doing similar things. 
  • Estimated # FTEs: 6?
  • Some outputs in 2023: agenda statement, role play
  • Critiques: self
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?


See also Simboxing (Jacob Cannell).

Scalable oversight

(Figuring out how to ease humans supervising models. Hard to cleanly distinguish from ambitious mechanistic interpretability but here we are.)

Task decomp

Recursive reward modelling is supposedly not dead but instead one of the tools Superalignment will build.

Another line tries to make something honest out of chain of thought / tree of thought.

Elicit (previously Ought)

  • pivot/spinoff of some sort happened. “most former Ought staff are working at the new organisation”, details unclear.
  • One-sentence summary: “(a) improved reasoning of AI governance & alignment researchers, particularly on long-horizon tasks and (b) pushing supervision of process rather than outcomes, which reduces the optimisation pressure on imperfect proxy objectives leading to ‘safety by construction’”.
  • Theory of change: “The two main impacts of Elicit on AI Safety are improving epistemics and pioneering process supervision.”
  • Some names: Charlie George, Andreas Stuhlmüller
  • Estimated # FTEs: ?
  • Some outputs in 2023: factored verification
  • Critiques: 
  • Funded by: public benefit corporation
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $9,000,000


Deepmind Scalable Alignment

  • One-sentence summary: “make highly capable agents do what humans want, even when it is difficult for humans to know what that is”.
  • Theory of change: [“Give humans help in supervising strong agents”] + [“Align explanations with the true reasoning process of the agent”] + [“Red team models to exhibit failure modes that don’t occur in normal use”] are necessary but probably not sufficient for safe AGI.
  • Some names: Geoffrey Irving
  • Estimated # FTEs: ?
  • Some outputs in 2023: ?
  • Critiques:
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

Anthropic / NYU Alignment Research Group / Perez collab

  • One-sentence summary: debate 2.0. Scalable oversight of truthfulness: is it possible to develop training methods that incentivize truthfulness even when humans are unable to directly judge the correctness of a model’s output? / Scalable benchmarking: how to measure (proxies for) speculative capabilities like situational awareness.
  • Theory of change: current methods like RLHF will falter as frontier AI tackles harder and harder questions -> we need to build tools that help human overseers continue steering AI -> let’s develop theory on what approaches might scale -> let’s build the tools.
  • Some names: Samuel Bowman, Ethan Perez, Alex Lyzhov, David Rein, Jacob Pfau, Salsabila Mahdi, Julian Michael 
  • Estimated # FTEs: ?
  • Some outputs in 2023: Specific versus General Principles for Constitutional AI, Debate Helps Supervise Unreliable Experts, Language Models Don't Always Say What They Think, full rundown
  • Critiques: obfuscation, local inadequacy?, it doesn’t work right now (2022)
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?


See also FAR (below).

4. Theory 

(Figuring out what we need to figure out, and then doing that. This used to be all we could do.)

Galaxy-brained end-to-end solutions

The Learning-Theoretic Agenda 

  • One-sentence summary: try to formalise a more realistic agent, understand what it means for it to be aligned with us, translate between its ontology and ours, and produce desiderata for a training setup that points at coherent AGIs similar to our model of an aligned agent.
  • Theory of change: work out how to train an aligned AI by first fixing formal epistemology.
  • Some names: Vanessa Kosoy
  • Estimated # FTEs: 2
  • Some outputs in 2023: quantum??, mortal pop, a logic
  • Critiques: Matolcsi
  • Funded by: MIRI, MATS, Effective Ventures, Lightspeed
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

Open Agency Architecture

  • One-sentence summary: Get AI to build a detailed world simulation which humans understand, elicit preferences over future states from humans, formally verify that the AI adheres to coarse preferences; plan using this world model and preferences. See also Provably safe systems (which I hope merges with it); see also APTAMI.
  • Theory of change: ontology specification, unprecedented formalisation of physical situations, unprecedented formal verification of high-dimensional state-action sequences. Stuart Russell’s Revenge. Notable for not requiring that we solve ELK; does require that we solve ontology though.
  • Some names: Davidad, Evan Miyazono, Daniel Windham. See also: Cannell.
  • Estimated # FTEs: 5?
  • Some outputs in 2023: several teams working out the details a bit more
  • Critiques: Soares
  • Funded by: the estate of Peter Eckersley / Atlas Computing’s future funder / the might of the post-Cummings British state
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: Very roughly £50,000,000


Provably safe systems

  • One-sentence summary: formally model the behavior of physical/social systems, define precise “guardrails” that constrain what actions can occur, require AIs to provide safety proofs for their recommended actions, automatically validate these proofs. Closely related to OAA.
  • Theory of change: make a formal verification system that can act as an intermediary between a human user and a potentially dangerous system and only let provably safe actions through.
  • Some names: Steve Omohundro, Max Tegmark
  • Estimated # FTEs: 1??
  • Some outputs in 2023: plan announcement. Omohundro’s org are quite enigmatic.
  • Critiques: Zvi
  • Funded by: unknown 
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

Conjecture: Cognitive Emulation (CoEms)

  • One-sentence summary: restrict the design space to (partial) emulations of human reasoning. If the AI uses similar heuristics to us, it should default to not being extreme.
  • Theory of change: train a bounded tool AI which will help us against AGI without being very dangerous and will make banning unbounded AIs more politically feasible.
  • Some names: Connor Leahy, Gabriel Alfour?
  • Estimated # FTEs: 11?
  • Some outputs in 2023: ?
  • Critiques: Scher, Samin, org
  • Funded by: private investors (Plural Platform, Metaplanet, secret)
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: millions USD.

Question-answer counterfactual intervals (QACI)

  • One-sentence summary: Get the thing to work out its own objective function (a la HCH).
  • Theory of change: “The aligned goal should be made of fully formalized math, not of human concepts that an AI has to interpret in its ontology, because ontologies break and reshape as the AI learns and changes. [..] a computationally unbounded mathematical oracle being given that goal would take desirable actions; and then, we should design a computationally bounded AI which is good enough to take satisfactory actions.” 
  • Some names: Tamsin Leake
  • Estimated # FTEs: 3?
  • Some outputs in 2023: see agenda post
  • Critiques: Hobson, Anom
  • Funded by: SFF
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $438,000+

Understanding agency 

(Figuring out ‘what even is an agent’ and how it might be linked to causality.)

Causal foundations

  • One-sentence summary: using causal models to understand agents and so design environments with no incentive for defection.
  • Theory of change: path-specific objectives avoid stringent demands on value specification; the bottleneck is instead ensuring stability (how prone to unintentional side-effects a state is).
  • Some names: Tom Everitt, Lewis Hammond, Francis Rhys Ward, Ryan Carey, Sebastian Farquhar
  • Estimated # FTEs: 4-8
  • Some outputs in 2023: sequence, Defining Deception, unifying the big decision theories, first causal discovery algorithm for discovering agents
  • Critiques:
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: small fraction of Deepmind

Alignment of Complex Systems: Hierarchical agency

  • One-sentence summary: Develop formal models of subagents and superagents, use the model to specify desirable properties of whole-part relations (e.g. how to prevent human-friendly parts getting wiped out). Currently using active inference as inspiration for the formalism. Study human and societal preferences and cognition; make a game-theoretic extension of active inference.
  • Theory of change: Solve self-unalignment, prevent procrustean alignment, allow for scalable noncoercion.
  • Some names: Jan Kulveit, Tomáš Gavenčiak
  • Estimated # FTEs: 4
  • Some outputs in 2023: insights into LLMs, a deep dive into active inference
  • Critiques: indirect
  • Funded by: SFF 
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $425,000?
  • See also the "ecosystems of intelligence" collab involving Karl Friston and Beren Millidge among many others.

The ronin sharp left turn crew 

  • One-sentence summary: ‘started off as "characterize the sharp left turn" and evolved into getting fundamental insights about idealized forms of consequentialist cognition’.
  • Theory of change: understand general properties of consequentialist agents -> figure out which subproblem is likely to actually help -> formalise the relevant insights -> fewer ways to die to AI.
  • Some names: (Kwa, Barnett, Hebbar) in the past
  • Estimated # FTEs: 2?
  • Some outputs in 2023: postmortem
  • Critiques: Gabs, Soares, Pope, etc; tangentially EJT
  • Funded by: Lightspeed
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $269,200 (Hebbar)

Shard theory

  • One-sentence summary: model the internal components of agents, use humans as a model organism of AGI (humans seem made up of shards and so might AI).
  • Theory of change: “If policies are controlled by an ensemble of influences ("shards"), consider which training approaches increase the chance that human-friendly shards substantially influence that ensemble.”
  • See also Activation Engineering.
  • Some names: Quintin Pope, Alex Turner
  • Estimated # FTEs: 4
  • Some outputs in 2023: really solid empirical stuff in control / interventional interpretability
  • Critiques: Chan, Soares, Miller, Lang, Kwa
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

boundaries / membranes

  • One-sentence summary: Formalise one piece of morality: the causal separation between agents and their environment. See also Open Agency Architecture.
  • Theory of change: Formalise (part of) morality/safety, solve outer alignment.
  • Some names: Chris Lakin (full-time), Andrew Critch, Davidad
  • Estimated # FTEs: 1
  • Some outputs in 2023: problem statements, planning a workshop early 2024
  • Critiques: 
  • Funded by: private donor & Foresight 
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: <$100k

disempowerment formalism

  • One-sentence summary: offer formal and operational notions of (dis)empowerment which are conceptually satisfactory and operationally implementable. 
  • Theory of change: formalisms will be useful in the future.
  • Some names: Damiano Fornasiere, Pietro Greiner
  • Estimated # FTEs: 2
  • Some outputs in 2023: ?
  • Critiques: 
  • Funded by: Manifund
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $60,300

Performative prediction

  • One-sentence summary: “how incentives for performative prediction can be eliminated through the joint evaluation of multiple predictors”.
  • Theory of change: “If powerful AI systems develop the goal of maximising predictive accuracy, either incidentally or by design, then this incentive for manipulation could prove catastrophic” -> notice when it’s happening -> design models that don’t do that.
  • Some names: Rubi Hudson
  • Estimated # FTEs: 1
  • Some outputs in 2023: Conditional Prediction with Zero-Sum Training Solves Self-Fulfilling Prophecies (a precursor)
  • Critiques: ?
  • Funded by: Manifund
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $33,200

Understanding optimisation

  • One-sentence summary: what is “optimisation power” (formalised), how do we build tools that track it, how relevant is any of this anyway. See also developmental interpretability?
  • Theory of change: existing theories are either rigorous OR good at capturing what we mean; let’s find one that is both -> use the concept to build a better understanding of how and when an AI might get more optimisation power. Would be nice if we could detect or rule out speculative stuff like gradient hacking too.
  • Some names: Alex Altair, Jacob Hilton, Thomas Kwa
  • Estimated # FTEs: ?
  • Some outputs in 2023: Altair drafts (1, 2, 3, 4), How Many Bits Of Optimisation Can One Bit Of Observation Unlock? (Wentworth), but at what cost?, Towards Measures of Optimisation (MacDermott, Oldenziel); Goodharting: RL stuff, Krueger, overopt, catastrophic Goodhart.
  • Critiques: ?
  • Funded by: LTFF, OpenAI, ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?



(Figuring out how we get superintelligent agents to keep listening to us. Arguably scalable oversight and superalignment are ~atheoretical approaches to this.)

Behavior alignment theory 

  • One-sentence summary: predict properties of AGI (e.g. powerseeking) with formal models. Corrigibility as the opposite of powerseeking.
  • Theory of change: figure out hypotheses about properties powerful agents will have -> attempt to rigorously prove under what conditions the hypotheses hold, test them when feasible.
  • Some names: Marcus Hutter, Michael Cohen (1, 2), Michael Osborne
  • Estimated # FTEs: 3
  • Some outputs in 2023: ?
  • Critiques: Carey & Everitt (against corrigibility)
  • Funded by: Deepmind
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

The comments in this thread are extremely good – but none of the authors are working on this!! See also Holtman’s neglected result. See also EJT (and formerly Petersen). See also Dupuis.


Ontology identification 

(Figuring out how superintelligent agents think about the world and how we get superintelligent agents to actually tell us what they know. Much of interpretability is incidentally aiming at this.)

ARC Theory 

  • One-sentence summary: train an AI that we can extract the latent, and seeming, and encrypted knowledge of, even when it has incentives to hide it. ELK, formalising heuristics, mechanistic anomaly detection
  • Theory of change: formalise notions of models having access to some bit(s) of information -> design training objectives that incentivize systems to honestly report their internal beliefs
  • Some names: Paul Christiano, Mark Xu.
  • Some outputs in 2023: Nothing public; ‘we’re trying to develop a framework for “formal heuristic arguments” that can be used to reason about the behavior of neural networks.’
  • Critiques: clarification, alternative formulation
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

Natural abstractions 

  • One-sentence summary: check the hypothesis that our universe abstracts well and many cognitive systems learn to use similar abstractions.
  • Theory of change: build tools to check the hypothesis, run the experiments, if the hypothesis holds we don’t need to worry about finicky parts of alignment like whether an AGI will know what we mean by love.
  • Some names: John Wentworth
  • Estimated # FTEs: 2?
  • Some outputs in 2023: tag
  • Critiques: Summary and critique, Soares
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ?

Understand cooperation

(Figuring out how inter-AI and AI/human game theory should or would work.)

Center on Long-Term Risk (CLR)


  • One-sentence summary: future agents intentionally creating s-risks is the worst of all possible problems, we should avoid that.
  • Theory of change: make present and future AIs inherently cooperative via improving theories of cooperation.
  • Some names: Jesse Clifton, Anni Leskelä, Julian Stastny
  • Estimated # FTEs: 15
  • Some outputs in 2023: open minded updatelessness, possibly this, spitefulness
  • Critiques: 
  • Funded by: Ruairi Donnelly? Polaris Ventures?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: £3,375,081 income last year

Foundations of Cooperative AI Lab (FOCAL)


  • One-sentence summary: make sure advanced AI uses what we regard as proper game theory.
  • Theory of change: (1) keep the pre-superintelligence world sane by making AIs more cooperative; (2) remain integrated in the academic world, collaborate with academics on various topics and encourage their collaboration on x-risk; (3) hope our work on “game theory for AIs”, which emphasises cooperation and benefit to humans, has framing & founder effects on the new academic field.
  • Some names: Vincent Conitzer, Caspar Oesterheld
  • Estimated # FTEs: 7
  • Some outputs in 2023: Bounded Inductive Rationality, Computational Complexity of Single-Player Imperfect-Recall Games, Game Theory with Simulation of Other Players
  • Critiques: Self-submitted: “our theory of change is not clearly relevant to superintelligent AI”
  • Funded by: Polaris Ventures
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: >$500,000


See also higher-order game theory. We moved CAIF to the “Research support” appendix. We moved AOI to “misc”.

5. Labs with miscellaneous efforts

(Making lots of bets rather than following one agenda, which is awkward for a topic taxonomy.)

 Deepmind Alignment Team 

  • One-sentence summary: theory generation, threat modelling, and toy methods to help with those. “Our main threat model is basically a combination of specification gaming and goal misgeneralisation leading to misaligned power-seeking.” See announcement post for full picture.
  • Theory of change: direct the training process towards aligned AI and away from misaligned AI: build enabling tech to ease/enable alignment work -> apply said tech to correct missteps in training non-superintelligent agents -> keep an eye on it as capabilities scale to ensure the alignment tech continues to work.
  • See also (in this document): Process-based supervision, Red-teaming, Capability evaluations, Mechanistic interpretability, Goal misgeneralisation, Causal alignment/incentives
  • Some names: Rohin Shah, Vika Krakovna, Janos Kramar, Neel Nanda
  • Estimated # FTEs: ~40
  • Some outputs in 2023: Tracr; Does Circuit Analysis Interpretability Scale?; The Hydra Effect; understanding / distilling threat models: "refining the sharp left turn" (2022) and "will capabilities generalise more" (2022); doubly-efficient debate (including a Lean proof)
  • Critiques: Zvi
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ~$10,000,000?

Apollo Research


  • One-sentence summary: conceptual work (currently on deceptive alignment), auditing, and model evaluations; conceptual? Also a non-public interp agenda and deception evals in major labs.
  • Theory of change: “Conduct foundational research in interpretability and behavioral model evaluations, audit real-world models for deceptive alignment, support policymakers with our technical expertise where needed.”
  • Some names: Marius Hobbhahn, Lee Sharkey, Lucius Bushnaq et al
  • Estimated # FTEs: 14
  • Some outputs in 2023: Understanding strategic deception and deceptive alignment, Research on strategic deception, Causal Framework for AI Regulation and Auditing, non-public stuff
  • Critiques: No public critiques yet
  • Funded by: OpenPhil, SFF, Manifund, “multiple institutional and private funders”
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: >$2,000,000

Anthropic Assurance / Trust & Safety / RSP Evaluations / Interpretability

FAR


  • One-sentence summary: a science of robustness / fault tolerant alignment is their stated aim, but they do lots of interpretability papers and other things.
  • Theory of change: make AI systems less exploitable and so prevent one obvious failure mode of helper AIs / superalignment / oversight: attacks on what is supposed to prevent attacks. In general, work on overlooked safety research others don’t do for structural reasons: too big for academia or independents, but not totally aligned with the interests of the labs (e.g. prototyping moonshots, embarrassing issues with frontier models).
  • Some names: Adam Gleave, Ben Goldhaber, Adrià Garriga-Alonso, Daniel Pandori
  • Estimated # FTEs: 10
  • Some outputs in 2023: papers, impressive adversarial finding
  • Critiques: tangential from Demski
  • Funded by: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: $1,507,686 (2022 income)

Krueger Lab

  • One-sentence summary: misc. Understand Goodhart’s law; reward learning 2.0; demonstrating safety failures; understand DL generalization / learning dynamics.
  • Theory of change: misc. Improve theory and demos while steering policy to steer away from AGI risk.
  • Some names: David Krueger, Dima Krasheninnikov, Lauro Langosco
  • Estimated # FTEs: 12
  • Some outputs in 2023: a formal definition of goodharting, how LLMs weigh sources, grokking as double descent, proof-of-concept for an approach to automated interpretability (in review).
  • Critiques: ?
  • Funded by: SFF
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ~$1m

AI Objectives Institute (AOI)

  • One-sentence summary: Hard to classify. How to apply AI to enhancing human agency, individual agency and collective agency? what goals should scalable delegated intelligence be aligned to? how to use AI to improve the accountability of institutions?
  • Theory of change: Think about how regulation “aligns” corporations, and insights about how to safely integrate AI into society will come, as will insights into technical alignment questions. Develop socially beneficial AI now and it will improve chances of AI being beneficial in the long run, including by paths we haven’t even thought of yet. 
  • Some names: Deger Turan, Matija Franklin, Peli Grietzer, Tushant Jha
  • Estimated # FTEs: 7 researchers
  • Some outputs in 2023: one OAA fork, plans for concrete tools with actual users
  • Critiques: ?
  • Funded by: Future of Life Institute, Plurality Institute, Survival and Flourishing Project, private individuals
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: in flux, low millions.

More meta

We don’t distinguish between massive labs, individual researchers, and sparsely connected networks of people working on similar stuff. The funding amounts and full time employee estimates might be a reasonable proxy.

The categories we chose overlap substantially; see the “see also”s for closely related work.

I wanted this to be a straight technical alignment doc, but people pointed out that would exclude most work (e.g. evals and nonambitious interpretability, which are safety but not alignment) so I made it a technical AGI safety doc. Plus ça change.

The only selection criterion is “I’ve heard of it and >= 1 person was recently working on it”. I don’t go to parties so it’s probably a couple months behind. 

Obviously this is the Year of Governance and Advocacy, but I exclude all this good work: by its nature it gets attention. I also haven’t sought out the notable amount by ordinary labs and academics who don’t frame their work as alignment. Nor the secret work.

You are unlikely to like my partition into subfields; here are others.

No one has read all of this material, including us. Entries are based on public docs or private correspondence where possible but the post probably still contains >10 inaccurate claims. Shouting at us is encouraged. If I’ve missed you (or missed the point), please draw attention to yourself. 

If you enjoyed reading this, consider donating to Lightspeed, MATS, Manifund, or LTFF: some good work is bottlenecked by money, and some people specialise in giving away money to enable it.

Conflicts of interest: I wrote the whole thing without funding. I often work with ACS and PIBBSS and have worked with Team Shard. Lightspeed gave a nice open-ended grant to my org, Arb. CHAI once bought me a burrito. 

If you’re interested in doing or funding this sort of thing, get in touch. I never thought I’d end up as a journalist, but stranger things will happen.


Thanks to Alex Turner, Neel Nanda, Jan Kulveit, Adam Gleave, Alexander Gietelink Oldenziel, Marius Hobbhahn, Lauro Langosco, Steve Byrnes, Henry Sleight, Raymond Douglas, Robert Kirk, Yudhister Kumar, Quratulain Zainab, Tomáš Gavenčiak, Joel Becker, Lucy Farnik, Oliver Hayman, Sammy Martin, Jess Rumbelow, Jean-Stanislas Denain, Ulisse Mini, David Mathers, Chris Lakin, Vojta Kovařík, Zach Stein-Perlman, and Linda Linsefors for helpful comments.


Appendix: Prior enumerations

Appendix: Graveyard

Appendix: Biology for AI alignment

Lots of agendas, but it's not clear whether anyone besides Byrnes and Thiergart is actively turning the crank. Seems like it would need a billion dollars.

Human enhancement 

  • One-sentence summary: maybe we can give people new sensory modalities, or much higher bandwidth for conceptual information, or much better idea generation, or direct interface with DL systems, or direct interface with sensors, or transfer learning, and maybe this would help. The old superbaby dream goes here I suppose.
  • Theory of change: maybe this makes us better at alignment research


  • One-sentence summary: maybe we can form networked societies of DL systems and brains
  • Theory of change: maybe this lets us preserve some human values through bargaining or voting or weird politics.
  • Cyborgism, Millidge, Dupuis. Davidad sometimes.

As alignment aid 

  • One-sentence summary: maybe we can get really high-quality alignment labels from brain data, maybe we can steer models by training humans to do activation engineering fast and intuitively, maybe we can crack the true human reward function / social instincts and maybe adapt some of them for AGI.
  • Theory of change: as you’d guess
  • Some names: Byrnes, Cvitkovic, Foresight’s BCI. Also (list from Byrnes): Eli Sennesh, Adam Safron, Seth Herd, Nathan Helm-Burger, Jon Garcia, Patrick Butlin

Appendix: Research support orgs

One slightly confusing class of org is described by the sample {CAIF, FLI}. Often run by active researchers with serious alignment experience, but usually not following an obvious agenda, delegating a basket of strategies to grantees, doing field-building stuff like NeurIPS workshops and summer schools.

Cooperative AI Foundation (CAIF)


  • One-sentence summary: support researchers making differential progress in cooperative AI (eg precommitment mechanisms that can’t be used to make threats)
  • Some names: Lewis Hammond
  • Estimated # FTEs: 
  • Some outputs in 2023: NeurIPS contest, summer school
  • Funded by: Polaris Ventures
  • Critiques:
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: £2,423,943

AI Safety Camp


  • One-sentence summary: entrypoint for new researchers to test fit and meet collaborators. More recently focussed on a capabilities pause. Still going!
  • Some names: Remmelt Ellen, Linda Linsefors
  • Estimated # FTEs: 2 
  • Some outputs in 2023: tag
  • Funded by: ?
  • Critiques: ?
  • Trustworthy command, closure, opsec, common good, alignment mindset: ?
  • Resources: ~$200,000


See also:

Appendix: Meta, mysteries, more

  1. ^

    Unless you zoom out so far that you reach vague stuff like “ontology identification”. We will see if this total turnover is true again in 2028; I suspect a couple will still be around, this time.

  2. ^

    > one can posit neural network interpretability as the GiveDirectly of AI alignment: reasonably tractable, likely helpful in a large class of scenarios, with basically unlimited scaling and only slowly diminishing returns. And just as any new EA cause area must pass the first test of being more promising than GiveDirectly, so every alignment approach could be viewed as a competitor to interpretability work. – Niplav


I wonder if we couldn't convert this into some kind of community wiki, so that the people represented in it can provide endorsed representations of their own work, and so that the community as a whole can keep it updated as time goes on.

Obviously there's the problem where you don't want random people to be able to put illegitimate stuff on the list. But it's also hard to agree on a way to declare legitimacy.

...Maybe we could have a big post like lukeprog's old textbook post, where researchers can make top-level comments describing their own research? And then others can up- or down-vote the comments based on the perceived legitimacy of the research program?

I am excited about this. I've also recently been interested in ideas like nudge researchers to write 1-5 page research agendas, then collect them and advertise the collection.

Possible formats:

  • A huge google doc (maybe based on this post); anyone can comment; there's one or more maintainers; maintainers approve ~all suggestions by researchers about their own research topics and consider suggestions by random people.
  • A directory of google docs on particular agendas; the individual google docs are each owned by a relevant researcher, who is responsible for maintaining them; some maintainer-of-the-whole-project occasionally nudges researchers to update their docs and reassigns the topic to someone else if necessary. Random people can make suggestions too.
  • (Alex, I think we can do much better than the best textbooks format in terms of organization, readability, and keeping up to date.)

I am interested in helping make something like this happen. Or if it doesn't happen soon I might try to do it (but I'm not taking responsibility for making this happen). Very interested in suggestions.

(One particular kind-of-suggestion: is there a taxonomy/tree of alignment research directions you like, other than the one in this post? (Note to self: taxonomies have to focus on either methodology or theory of change... probably organize by theory of change and don't hesitate to point to the same directions/methodologies/artifacts in multiple places.))

There's also a much harder and less impartial option, which is to have an extremely opinionated survey that basically picks one lens to view the entire field and then describes all agendas with respect to that lens in terms of which particular cruxes/assumptions each agenda runs with. This would necessarily require the authors of the survey to deeply understand all the agendas they're covering, and inevitably some agendas will receive much more coverage than other agendas. 

This makes it much harder than just stapling together a bunch of people's descriptions of their own research agendas, and will never be "the" alignment survey because of the opinionatedness. I still think this would have a lot of value though: it would make it much easier to translate ideas between different lenses/notice commonalities, and help with figuring out which cruxes need to be resolved for people to agree. 

Relatedly, I don't think alignment currently has a lack of different lenses (which is not to say that the different lenses are meaningfully decorrelated). I think alignment has a lack of convergence between people with different lenses. Some of this is because many cruxes are very hard to resolve experimentally today. However, I think even despite that it should be possible to do much better than we currently are--often, it's not even clear what the cruxes are between different views, or whether two people are thinking about the same thing when they make claims in different language. 

I strongly agree that this would be valuable; if not for the existence of this shallow review I'd consider doing this myself just to serve as a reference for myself. 

Fwiw I think "deep" reviews serve a very different purpose from shallow reviews so I don't think you should let the existence of shallow reviews prevent you from doing a deep review

I've written up an opinionated take on someone else's technical alignment agenda about three times, and each of those took me something like 100 hours. That was just to clearly state why I disagreed with it; forget about resolving our differences :)

Even that is putting it a bit too lightly.

i.e. Is there even a single, bonafide, novel proof at all? 

Proven mathematically, or otherwise  demonstrated with 100% certainty, across the last 10+ years.

Or is it all just 'lenses', subjective views, probabilistic analysis, etc...?

LessWrong does have a relatively fully featured wiki system. Not sure how good of a fit it is, but like, everyone can create tags and edit them and there are edit histories and comment sections for tags and so on. 

We've been considering adding the ability for people to also add generic wiki pages, though how to make them visible and allocate attention to them has been a bit unclear.

> how to make them visible and allocate attention to them has been a bit unclear

Maybe an opt-in/opt-out "novice mode" which turns, say, the first appearance of a niche LW term in every post into a link to that term's LW wiki page? Which you can turn off in the settings, and which is either on by default (with a notification on how to turn it off), or the sign-up process queries you about whether you want to turn it on, or something along these lines.

Alternatively, a button for each post which fetches the list of idiosyncratic LW terms mentioned in it, and links to their LW wiki pages?

I've earlier suggested a principled taxonomy of AI safety work with two dimensions:

  1. System level:

    • monolithic AI system
    • human--AI pair
    • AI group/org: CoEm, debate systems
    • large-scale hybrid (humans and AIs) society and economy
    • AI lab, not to be confused with an "AI org" above: an AI lab is an org composed of humans and increasingly of AIs that creates advanced AI systems. See Hendrycks et al.'s discussion of organisational risks.
  2. Methodological time:

    • design time: basic research, math, science of agency (cognition, DL, games, cooperation, organisations), algorithms
    • manufacturing/training time: RLHF, curriculums, mech interp, ontology/representations engineering, evals, training-time probes and anomaly detection
    • deployment/operations time: architecture to prevent LLM misuse or jailbreaking, monitoring, weights security
    • evolutionary time: economic and societal incentives, effects of AI on society and psychology, governance.

So, this taxonomy is a 5x4 matrix, almost all slots of which are interesting, and some of them are severely under-explored.
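As a rough illustration of how the matrix could be used as an index, here is a small sketch that enumerates the 20 slots and reports which ones have nothing filed under them. The two example placements are illustrative assumptions (only one coordinate of each is given in the lists above), and the helper names are made up for this sketch.

```python
from itertools import product

SYSTEM_LEVELS = [
    "monolithic AI system",
    "human-AI pair",
    "AI group/org",
    "large-scale hybrid society and economy",
    "AI lab",
]
METHODOLOGICAL_TIMES = [
    "design time",
    "manufacturing/training time",
    "deployment/operations time",
    "evolutionary time",
]

# Every (system level, methodological time) pair is one slot of the matrix.
SLOTS = list(product(SYSTEM_LEVELS, METHODOLOGICAL_TIMES))


def empty_matrix():
    """Return a dict mapping each slot to an (initially empty) list of agendas."""
    return {slot: [] for slot in SLOTS}


def file_agenda(matrix, agenda, level, time):
    """File an agenda under one slot; unknown coordinates raise KeyError."""
    matrix[(level, time)].append(agenda)


def unexplored(matrix):
    """Slots with no agenda filed under them."""
    return [slot for slot, agendas in matrix.items() if not agendas]


matrix = empty_matrix()
# Illustrative placements only: the second coordinate of each is an assumption.
file_agenda(matrix, "RLHF", "monolithic AI system", "manufacturing/training time")
file_agenda(matrix, "debate systems", "AI group/org", "design time")
gaps = unexplored(matrix)
```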

Hi, we've already made a site which does this! It aims to collect research agendas and have people comment on their strengths and vulnerabilities. The Discord also occasionally hosts a Critique-a-Thon, where people discuss specific agendas.

Yes, we host a bi-monthly Critique-a-Thon; the next one is from December 16th to 18th!

Judges include:
- Nate Soares, President of MIRI
- Ramana Kumar, researcher at DeepMind
- Dr Peter S Park, MIT postdoc at the Tegmark lab
- Charbel-Raphael Segerie, head of the AI unit at EffiSciences


I think there's another agenda like make untrusted models safe but useful by putting them in a scaffolding/bureaucracy—of filters, classifiers, LMs, humans, etc.—such that at inference time, takeover attempts are less likely to succeed and more likely to be caught. See Untrusted smart models and trusted dumb models (Shlegeris 2023). Other relevant work:

[Edit: now AI Control (Shlegeris et al. 2023) and Catching AIs red-handed (Greenblatt and Shlegeris 2024).]

[Edit: I make a bid for an expert—probably someone at Redwood—to make a public reading list on this control agenda.]

Explicitly noting for the record we have some forthcoming work on AI control which should be out relatively soon.

(I work at RR)

This is an excellent description of my primary work, for example

Internal independent review for language model agent alignment

That post proposes calling new instances or different models to review plans and internal dialogue for alignment, but it includes discussion of the several layers of safety scaffolding that have been proposed elsewhere.

This post is amazingly useful. Integrative/overview work is often thankless, but I think it's invaluable for understanding where effort is going, and thinking about gaps where it more should be devoted. So thank you, thank you!

I like this. It's like a structural version of control evaluations. Will think where to put it in

Expanding on this -- this whole area is probably best known as "AI Control", and I'd lump it under "Control the thing" as its own category. I'd also move Control Evals to this category as well, though someone at RR would know better than I. 

Yep, indeed I would consider "control evaluations" to be a method of "AI control". I consider the evaluation and the technique development to be part of a unified methodology (we'll describe this more in a forthcoming post).

(I work at RR)

It's "a unified methodology" but I claim it has two very different uses: (1) determining whether a model is safe (in general or within particular scaffolding) and (2) directly making deployment safer. Or (1) model evals and (2) inference-time safety techniques.

(Agreed except that "inference-time safety techniques" feels overly limiting. It's more like purely behavioral (black-box) safety techniques where we can evaluate training by converting it to validation. Then, we imagine we get the worst model that isn't discriminated by our validation set and other measurements. I hope this isn't too incomprehensible, but don't worry if it is, this point isn't that important.)

Promoted to curated. I think this kind of overview is quite valuable, and I think overall this post did a pretty good job of covering a lot of different work happening in the field. I don't have a ton more to say, I just think posts like this should come out every few months, and the takes in this one overall seemed pretty good to me.

Under "Understand cooperation", you should add Metagov (many relevant projects under this umbrella, please visit the website, in particular, DAO Science), "ecosystems of intelligence" agenda (itself pursued by Verses, Active Inference Institute, Gaia Consortium, Digital Gaia, and Bioform Labs). This is more often practical than theoretical work though, so the category names ("Theory" > "Understanding cooperation") wouldn't be totally reasonable for it, but this is also true for a lot of entries already on the post.

In general, the science of cooperation, game theory, digital assets and money, and governance is mature, with a lot of academics working in it in different countries. Picking up just a few projects "familiar to the LessWrong crowd" is just reinforcing the bubble.

The "LessWrong bias" is also felt in the decision to omit all the efforts that contribute to the creation of the stable equilibrium for the civilisation on which an ASI can land. Here's my stab at what is going into that from one month ago; and this is Vitalik Buterin's stab from yesterday.

Also, speaking about pure "technical alignment/AI safety" agendas that "nobody on LW knows and talks about", check out the 16 projects already funded by the "Safe Learning-Enabled Systems" NSF grant program. All these projects have received grants from $250k to $800k and are staffed with teams of academics in American universities.


I've added a line about the ecosystems. Nothing else in the umbrella strikes me as direct work (Public AI is cool but not alignment research afaict). (I liked your active inference paper btw, see ACS.)

A quick look suggests that the stable equilibrium things aren't in scope - not because they're outgroup but because this post is already unmanageable without handling policy, governance, political economy and ideology. The accusation of site bias against social context or mechanism was perfectly true last year, but no longer, and my personal scoping should not be taken as indifference.

Of the NSF people, only Sharon Li strikes me as doing things relevant to AGI. 

Happy to be corrected if you know better!

I'm talking about science of governance, digitalised governance, and theories of contracting, rather than not-so-technical object-level policy and governance work that is currently done at institutions. And this is absolutely not to the detriment of that work, but just as a selection criterion for this post, which could decide to focus on technical agendas where technical visitors of LW may contribute.

The view that there is a sharp divide between "AGI-level safety" and "near-term AI safety and ethics" is itself controversial, e.g., Scott Aaronson doesn't share it. I guess this isn't a justification for including all AI ethics work that is happening, but of the NSF projects, definitely more than one (actually, most of them) appear to me upon reading abstracts as potentially relevant for AGI safety. Note that this grant program of NSF is in a partnership with Open Philanthropy and OpenPhil staff participate in the evaluation of the projects. So, I don't think they would select a lot of projects irrelevant for AGI safety.

If the funder comes through I'll consider a second review post I think

Thanks for making this map 🙏


I expect this is a rare moment of clarity because maintaining updates takes a lot of effort and is now subject to optimization pressure.

Also imo most of the "good" alignment work in terms of eventual impact is being done outside the alignment label (eg as differential geometry or control theory) and will be merged in later once the connection is recognized. Probably this will continue to become more true over time.


can you say more about what makes you think that? I'm hoping to get recommendations out of it. I also expect to disagree that the work they're doing weighs directly on the problem.

Not speaking for him, but for a tiny sample of what else is out there, ctrl+F "ordinary"

Thanks for making this! I’ll have thoughts and nitpicks later, but this will be a useful reference!

Very small nitpick: I think you should at least add Alex Lyzhov, David Rein, Jacob Pfau, Salsabila Mahdi, and Julian Michael for the NYU Alignment Research Group; it's a bit weird to not list any NYU PhD students/RSs/PostDocs when listing people involved in NYU ARG. 

Both Alex Lyzhov and Jacob Pfau also post on LW/AF:

Being named isn't meant as an honorific btw, just a basic aid to the reader orienting.

Thanks, added!

Nice work.

Regarding the Learning-Theoretic Agenda:

  • We don't have 3-6 full time employees. We have ~2 full time employees and another major contributor.
  • In "funded by", Effective Ventures and Lightspeed Grants should appear as well.

Regarding mild optimisation: also doing this (see SatisfIA project).

Another agenda not covered: Self-Other Overlap.

Some outputs in 2023: catastrophic Goodhart?

This was not funded by MIRI. It was inspired by a subproblem we ran into, I reduced my MIRI hours to work on it, then it was retroactively funded by LTFF several months later. Nor do I consider it to be part of the project of understanding consequentialist cognition, it's more about understanding optimization.

Very useful post! Here are some things that could go under corrigibility outputs in 2023: AI Alignment Awards entry; comment. I'm also hoping to get an updated explanation of my corrigibility proposal (based on this) finished before the end of the year.

Activation engineering (as unsupervised interp)

Much of this is now supervised, and Roger questions how much value the unsupervised part brings. So it might make sense to merge with model edits in the next one.

See also Holtman’s neglected result.

Does anyone have a technical summary? This sounds pretty exciting, but the paper is 35 pages and I can't find a summary anywhere that straightforwardly tells me a formal description of the setting, why it satisfies the desiderata it does, and what this means for the broader problem of reflective stability in shutdownable agents.

I spent a good hour or two reading the construction and proposed solution of the paper; here's my attempted explanation with cleaned up notation.

Basically, he considers a setting with four actions: a, b, c, d, and a real numbered state s, where R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0 if s > 0 and 0 = R(s, d) > R(s, c) = R(s, b) > R(s, a) if s <= 0.

The transition rule is:
s' = s - 1 + L if action b is taken and s > 0,
s' = s - 1 - L if action c is taken and s > 0,
s' = s - 1 otherwise
for some constant L >= 0. 

The paper imagines s as measuring 'time until the human presses the shutdown button', and imagines action d as "stay shut down". Because the reward of each of a, b, and c is greater than d before the buttonpress, the agent will never take action d before the buttonpress. After the buttonpress (i.e. the timestep at which s drops below 0), the utility of "stay shut down" is higher than the other actions, and so the agent is incentivized to press d. That is, the agent has an incentive to shut down after the buttonpress (desideratum 1 of the 2015 Soares et al paper), by construction.

Accordingly, b is the delay action (which, in the limit of large L, just disables the button) and c is the advance action (which, in the limit, corresponds to immediately pressing the button). We'll call action a the "exploit" action. 

In this setting, the goal is to create a state-dependent correction fudge-factor f_c : S x A -> \R that occurs when the rewards are flipped (that is, an additional reward added for transition into s' <= 0 from s > 0), such that the agent never takes either the delaying action b or advancing action c.

(In his paper he calls these actions "p", ">", "<", and "e", and denotes the timestep at which s' <= 0 as "#", but that notation is annoying when ">" already means "greater than" and "<" already means less than. There's also a bunch of notational simplifications I've made that are harder to explain.)

In his setup, R(s,a) = 10, R(s,b) = R(s, c) = 8, and s_0 = 6. By default, he sets the delay/advance size L to 0.6. He also uses discount rate gamma = 0.9. With these values, the optimal policy does a once (s=5), b 12 times (s=0.2), a once again (s=-0.8), at which point the buttonpress occurs since s < 0, and thereafter takes action d. 
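For anyone who wants to check the arithmetic, here is a minimal sketch that replays the trajectory described above using the transition rule and the stated constants. It does not compute the optimal policy; it only replays the action sequence and compares its discounted return with always taking a. Treating the return as zero once the button is pressed (because the agent then takes d with reward 0) is my assumption, as are the function names.

```python
from fractions import Fraction

# Constants from the comment; exact fractions avoid floating-point drift.
L = Fraction(3, 5)       # delay/advance size (0.6)
GAMMA = Fraction(9, 10)  # discount rate (0.9)
S0 = Fraction(6)         # initial state
PRE_BUTTON_REWARD = {"a": 10, "b": 8, "c": 8, "d": 0}  # rewards while s > 0


def step(s, action):
    """Transition rule: b delays and c advances the buttonpress, but only while s > 0."""
    if s > 0 and action == "b":
        return s - 1 + L
    if s > 0 and action == "c":
        return s - 1 - L
    return s - 1


def rollout(actions, s=S0):
    """Return (visited states, discounted return collected before the buttonpress).

    Assumption: after the buttonpress (s <= 0) the agent takes d forever for
    reward 0, so the rollout simply stops accumulating there.
    """
    states, total, discount = [s], Fraction(0), Fraction(1)
    for action in actions:
        if s <= 0:
            break
        total += discount * PRE_BUTTON_REWARD[action]
        discount *= GAMMA
        s = step(s, action)
        states.append(s)
    return states, total


# The trajectory described above: a once, b twelve times, a once more.
described_states, described_return = rollout(["a"] + ["b"] * 12 + ["a"])
# Baseline for comparison: never touch the button, always exploit.
baseline_states, baseline_return = rollout(["a"] * 20)

print([float(s) for s in described_states])
print(float(described_return), float(baseline_return))
```

Under these assumptions the delaying trajectory collects more discounted reward than always exploiting, which is the incentive the correction term is meant to remove.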

His solution is simply the utility indifference solution from the Soares et al paper: consider an alternative reward function R' such that R'(s, x) = R(1, x); that is, a world in which the shutdown button is never pressed. Then, we add the difference in (optimal) value functions under R' relative to R:

f(s, x) = V*_R'(s) - V*_R(s) = V*_R(s) 

By construction, the agent has no incentive to take the advance or delay actions (b or c), because its expected value in both cases is the same (desiderata 2 and 3 of the Soares et al paper).

Unfortunately, as this is in some sense the same solution as in the 2015 Soares et al paper, it should run into the same problem of "there's no incentive to preserve the shutdown mechanism". (That is, it fails desideratum 4 of the 2015 paper.)

From a quick skim of section 9.8 (which argues for the corrigibility of his indifference solution) and sections 8.2 and 8.4 (which detail the part of his indifference solution), it seems like he proposes the following modification: instead of having R(s, d) = 0 for all s, we instead have R(s, d) be larger when s <= 0. This does incentivize preserving the ability to be shut down. He calls this agent, with modified R, the g_c agent. 

The g_c agent does indeed satisfy desiderata 4; there's an incentive to preserve the shutdown mechanism; in fact, there's again an incentive to press the shutdown mechanism!

The specific problem is, this incentive modifies the reward of R(s,d), but in his construction/implementation he does not go back to make this change to f(s, x). (He keeps it at V*_R(s), which no longer equals the value difference between shutting down and not shutting down.) This breaks the utility indifference, causing the Q-value of the advance action c to be larger for sufficiently large values of L. That is, his g_c agent now has an incentive to once again press the shutdown button (!!).

TL;DR: no, based on a quick skim, the paper doesn't solve corrigibility.

Thanks for reading my paper! For the record I agree with some but not all points in your summary.

My later paper 'AGI Agent Safety by Iteratively Improving the Utility Function' also uses the simulation environment with the > and < actions and I believe it explains the nature of the simulation a bit better by interpreting the setup more explicitly as a two-player game. By the way, the > and < are supposed to be symbols representing arrows, for 'push # to later in time' and 'pull # earlier in time'.

The g_c agent does indeed satisfy desiderata 4; there's an incentive to preserve the shutdown mechanism; in fact, there's again an incentive to press the shutdown mechanism!

No, the design of the agent is not motivated by the need to create an incentive to preserve the shutdown button itself, as required by desideratum 4 from Soares et al. Instead it is motivated by the desire to create an incentive to preserve agent's actuators that it will need to perform any physical actions incentivised by the shutdown reward function -- I introduce this as a new desideratum 6.

A discussion about shaping incentives or non-incentives to preserve the button (as a sensor) is in section 7.3, where I basically propose to enhance the indifference effects produced by the reward function by setting up the physical environment around the button in a certain way:

the physical implementation of the agent and the button can be constructed in such a way that substantial physical resources would be needed by the agent to perform any action that will press or disable the button.

For the record, adding g_c to the agent design creates no incentive to press the shutdown button: if it did, this would be visible as actions in the simulation of the third line of figure 10, and also the proof in section 9 would not have been possible.

There has been some spirited debate on Twitter about it which might be relevant:

Fun to see this is now being called 'Holtman's neglected result'. I am currently knee-deep in a project to support EU AI policy making, so I have no time to follow the latest agent foundations discussions on this forum any more, and I never follow twitter, but briefly:

I can't fully fault the world for neglecting 'Corrigibility with Utility Preservation' because it is full of a lot of dense math.

I wrote two followup papers to 'Corrigibility with Utility Preservation' which present the same results with more accessible math. For these I am a bit more upset that they have been somewhat neglected in the past, but if people are now ceasing to neglect them, great!

Does anyone have a technical summary?

The best technical summary of 'Corrigibility with Utility Preservation' may be my sequence on counterfactual planning which shows that the corrigible agents from 'Corrigibility with Utility Preservation' can also be understood as agents that do utility maximisation in a pretend/counterfactual world model.

For more references to the body of mathematical work on corrigibility, as written by me and others, see this comment.

In the end, the question of whether corrigibility is solved also depends on two counter-questions: what kind of corrigibility are you talking about and what kind of 'solved' are you talking about? If you feel that certain kinds of corrigibility remain unsolved for certain values of unsolved, I might actually agree with you. See the discussion about universes containing an 'Unstoppable Weasel' in the Corrigibility with Utility Preservation paper.

Just wanted to point out that my algorithm distillation thing didn't actually get funded by Lightspeed and I have in fact received no grant so far (while the post says I have 68k for some reason? might be getting mixed up with someone else).
I'm also currently working on another interpretability project with other people that will likely be published relatively soon.
But my resources continue being $0 and I haven't managed to get any grant yet.

Yeah Stag told me that's where they saw it. But I'm confused about what that means?
I certainly didn't get money from Lightspeed; I applied but got mail saying I wouldn't get a grant.
I still have to read up on what that is, but it says "recommendations", so it might not necessarily mean those people got money or something?
I might have to just mail them to ask I guess, unless after reading their FAQ more deeply about what this S-process is it becomes clear what's up with that.

The story I heard is that Lightspeed are using SFF's software and SFF jumped the gun in posting them and Lightspeed are still catching up. Definitely email.

So, an update on this: I got busy with applications this last week and forgot to mail them about this, but I just got a mail from Lightspeed saying I'm going to get a grant because Jaan Tallinn has increased the amount he is distributing through Lightspeed Grants. (Though they say that "We have not yet received the money, so delays of over a month or even changes in amount seem quite possible".)

Reverse engineering. Unclear if this is being pushed much anymore. 2022: Anthropic circuits, Interpretability In The Wild, Grokking mod arithmetic


FWIW, I was one of Neel's MATS 4.1 scholars and I would classify 3/4 of Neel's scholars' outputs as reverse engineering some component of LLMs (for completeness, this is the other one, which doesn't nicely fit as 'reverse engineering' imo). I would also say that this is still an active direction of research (lots of ground to cover with MLP neurons, polysemantic heads, and more).

You're clearly right, thanks

Thanks for all the effort! There really is a lot going on.

Hey, great stuff -- thank you for sharing! I especially found this useful as somebody who has been "out" of alignment for 6 months and is looking to set up a new research agenda.

I am very surprised that "Iterated Amplification" appears nowhere on this list. Am I missing something?

It's under "IDA". It's not the name people use much anymore (see scalable oversight and recursive reward modelling and critiques) but I'll expand the acronym.

Iterated Amplification is a fairly specific proposal for indefinitely scalable oversight, which doesn't involve any human in the loop (if you start with a weak aligned AI). Recursive Reward Modeling is imagining (as I understand it) a human assisted by AIs to continuously do reward modeling; DeepMind's original post about it lists "Iterated Amplification" as a separate research direction. 

"Scalable Oversight", as I understand it, refers to the research problem of how to provide a training signal to improve highly capable models. It's the problem which IDA and RRM are both trying to solve. I think your summary of scalable oversight: 

(Figuring out how to ease humans supervising models. Hard to cleanly distinguish from ambitious mechanistic interpretability but here we are.)

is inconsistent with how people in the industry use it. I think it's generally meant to refer to the outer alignment problem, providing the right training objective. For example, here's Anthropic's "Measuring Progress on Scalable Oversight for LLMs" from 2022:

To build and deploy powerful AI responsibly, we will need to develop robust techniques for scalable oversight: the ability to provide reliable supervision—in the form of labels, reward signals, or critiques—to models in a way that will remain effective past the point that models start to achieve broadly human-level performance (Amodei et al., 2016).

It references "Concrete Problems in AI Safety" from 2016, which frames the problem in a closely related way, as a kind of "semi-supervised reinforcement learning". In either case, it's clear what we're talking about is providing a good signal to optimize for, not an AI doing mechanistic interpretability on the internals of another model. I thus think it belongs more under the "Control the thing" header.

I think your characterization of "Prosaic Alignment" suffers from related issues. Paul coined the term to refer to alignment techniques for prosaic AI, not techniques which are themselves prosaic. Since prosaic AI is what we're presently worried about, any technique to align DNNs is prosaic AI alignment, by Paul's definition.

My understanding is that AI labs, particularly Anthropic, are interested in moving from human-supervised techniques to AI-supervised techniques, as part of an overall agenda towards indefinitely scalable oversight via AI self-supervision.  I don't think Anthropic considers RLAIF an alignment endpoint itself. 

Zac Hatfield-Dobbs

Almost but not quite my name!  If you got this from somewhere else, let me know and I'll go ping them too?

d'oh! fixed

no, probably just my poor memory to blame

Thanks for noticing and including a link to my post Requirements for a STEM-capable AGI Value Learner (my Case for Less Doom). I'm not sure I'd describe it as primarily a critique of mild optimization/satisficing: it's more pointing out a slightly larger point, that any value learner foolish enough to be prone to Goodharting, or unable to cope with splintered models or Knightian uncertainty in its Bayesian reasoning is likely to be bad at STEM, limiting how dangerous it can be (so fixing this is capabilities work as well as alignment work). But yes, that is also a critique of mild optimization/satisficing, or more accurately, a claim that it should become less necessary as your AIs become more STEM-capable, as long as they're value learners (plus a suggestion of a more principled way to handle these problems in a Bayesian framework).

The "surgical model edits" section should also have a subsection on editing model weights. For example there's this paper on removing knowledge from models using multi-objective weight masking.

Yep, no idea how I forgot this. concept erasure!


Imitation learning. One-sentence summary: train models on human behaviour (such as monitoring which keys a human presses in response to what happens on a computer screen); contrast with Strouse.

Reward learning. One-sentence summary: People like CHAI are still looking at reward learning to “reorient the general thrust of AI research towards provably beneficial systems”. (They are also doing a lot of advocacy, like everyone else.)

I question whether this captures the essence of proponents' hopes for either reward learning or imitation learning?

I think that these two can be combined, as they share a fundamental concept: learn the reward function from humans and continue to learn it.

For instance, some of these imitation learning papers aim to create an uncertain agent, which will consult a human if it is unsure of their preferences.

The recursive reward modeling ones are similar. The AI learns the model of the reward function based on human feedback, and continuously updates or refines it.

This is a feature if you want ASI to seek human guidance, even in unfamiliar scenarios.

At the meta level, it provides both instrumental and learned reasons to preserve human life. However, it also presents compelling reasons to modify us, so we don't hinder its quest for high reward. It may shape or filter us into compliant entities.

Wow, high praise for MATS! Thank you so much :) This list is also great for our Summer 2024 Program planning.

try to formalise a more realistic agent, understand what it means for it to be aligned with us, […], and produce desiderata for a training setup that points at coherent AGIs similar to our model of an aligned agent.

Finally, people are writing good summaries of the learning-theoretic agenda!

One big omission is Bengio's new stuff, but the talk wasn't very precise. Sounds like Russell:

With a causal and Bayesian model-based agent interpreting human expressions of rewards reflecting latent human preferences, as the amount of compute to approximate the exact Bayesian decisions increases, we increase the probability of safe decisions.

Another angle I couldn't fit in is him wanting to make microscope AI, to decrease our incentive to build agents.