Alignment Org Cheat Sheet

Orpheus16; Thomas Larsen

Epistemic Status: Exploratory.

Epistemic Effort: ~6 hours of work put into this document.

Contributions: Akash wrote this, Thomas helped edit + discuss. Unless specified otherwise, writing in the first person is by Akash and so are the opinions. Thanks to many others for relevant conversations.

There have been a few attempts to summarize the AI alignment research landscape. Many of them are long and detailed.

As an exercise, I created a “cheat sheet” describing the work of researchers, research organizations, and training programs. The goal was to summarize key people/orgs in just one sentence.

Note that I do not aim to cover everyone; I focus on the ones that I think are most promising or popular, with a bias towards the ones that I know best. (For longer summaries/analyses, I recommend the ones by Thomas Larsen & Eli Lifland and Nate Soares.)

Here’s my alignment org cheat sheet:

Researchers & Research Organizations

Aligned AI: Let's understand all of the possible ways a model could generalize out-of-distribution and then figure out how to select the way(s) that are safe. (see Why I'm co-founding Aligned AI) (see also Thomas’s critique of their plan here).

Anthropic: Let’s interpret large language models by understanding which parts of neural networks are involved with certain types of cognition, and let’s build larger language models to tackle problems, test methods, and understand phenomena that will emerge as we get closer to AGI (see Zac's comment for some details & citations).

ARC (Alignment Research Center): Let’s interpret an AGI by building a reporter which a) honestly answers our questions and b) is smart enough to translate the AGI’s thoughts to us in a way that we can understand (see ELK report).

Conjecture: Let’s focus on bets that make sense in worlds with short timelines, foster an environment where people can think about new and uncorrelated ways to solve alignment, create infrastructure that allows us to use large language models to assist humans in solving alignment, develop simulacra theory to understand large language models, build and interpret large language models, and maintain a culture of information security.

DeepMind Alignment Team (according to Rohin Shah): Let’s contribute to a variety of projects and let our individual researchers pursue their own directions rather than having a unified agenda (note that DeepMind is also controversial for advancing AI capabilities; more here).

Eliezer & Nate: Let’s help others understand why their alignment plans are likely to fail, in the hopes that people understand the problem better, and are able to take advantage of a miracle (see List of Lethalities & Why Various Plans Miss the Hard Parts).

Encultured: Let’s create consumer-facing products that help AI safety researchers, and let’s start with a popular video game that serves as a “testing ground” where companies can deploy AI systems to see if they cause havoc (bad) or do things that human players love (good).

Evan Hubinger: Let’s understand how we might get deceptive models, develop interpretability tools that help us detect deception, and develop strategies to reduce the likelihood of training deceptive models (see Risks from Learned Optimization and A Transparency and Interpretability Tech Tree).

John Wentworth: Let’s become less confused about agency/alignment by understanding if human concepts are likely to be learned “naturally” by AIs (natural abstractions), understanding how selection pressures work and why they produce certain types of agents (selection theorems), and broadly generating insights that make alignment less pre-paradigmatic (see The Plan).

MIRI: Let’s support researchers who are working on a variety of research agendas, which makes it really hard to summarize what we’re doing in one sentence, and also please keep in mind that we’re not particularly excited about these research agendas (but we do fund some people on this list like Evan, Abram, and Scott).

OpenAI: Let’s build aligned AI by providing human feedback, training AI systems that help humans give better feedback, and training AI systems that help humans do alignment research (see Our approach to alignment research) (note that OpenAI is controversial for advancing AI capabilities; more here).

Redwood: Let’s train and manage teams of engineers to support ARC (and other alignment researchers), while also working on adversarial training and interpretability research.

Scott Garrabrant & Abram Demski: Let’s understand agency by examining challenges that arise when agents are embedded in their environments (see Embedded Agency and Cartesian Frames; note that this work was published years ago & I’m not sure what Scott and Abram are focused on currently.)

Tamera Lanham: Let’s interpret current language models by prompting them to “think out loud” and eliminating models that have harmful reasoning (see Externalized Reasoning Oversight).

Vanessa Kosoy: Let’s establish a mathematically rigorous understanding of regret bounds (the difference between an agent’s actual performance and its performance if it had access to the environment beforehand) by inventing a new branch of probability (infrabayesianism) that helps agents reason when they need to model themselves and the world (embeddedness) (see The Learning-Theoretic AI Alignment Research Agenda).

Training & Mentoring Programs

AGI Safety Fundamentals: Let’s help people get an introduction to basic concepts in ML, AI, and AI safety.

MLAB: Let’s find and train promising engineers who might be interested in working for Redwood or other AI alignment organizations.

ML Safety Scholars: Let’s teach people basic/fundamental concepts in machine learning to help them skill-up & let’s teach them about ML-based ways to contribute to alignment.

Philosophy Fellowship: Let’s support experienced philosophers who want to tackle conceptual problems in AI alignment.

PIBBSS: Let’s support social scientists and natural scientists who want to tackle conceptual issues about intelligent systems that may be relevant for AI alignment.

Refine: Let’s provide structure and some mentorship to people who can come up with weird new agendas, and let’s intentionally shield them from (some) outside perspectives in order to get them to generate new ideas and produce uncorrelated bets.

SERI-Mats (some mentors): Let’s train junior alignment researchers to think for themselves, develop their research tastes, and come up with new research ideas.

SERI-Mats (some mentors): Let’s give senior alignment researchers an opportunity to get interns who will help the mentors make progress on their current research agendas.

Notes & Opinions

I wish this list was longer. Note that the list is not intended to be comprehensive. But I would bet that a Fully Comprehensive List would not be 10X (or even 5X) larger than mine (especially if we had a reasonable quality bar for what gets included). This is concerning (see also this post for some field-building ideas).
I sometimes meet people who think there are many people working on AI safety, and there are many promising research agendas. My own impression is that there are only a few people/organizations thinking seriously about AI safety, many have not been thinking about it for very long, and there is an urgent need for more people/perspectives.
If you are interested in contributing to AI safety research, but you don’t know what to do (or you are curious about funding), please feel free to reach out to me on LessWrong or consider applying to the Long-term Future Fund.

Sources that were helpful: Thomas Larsen & Eli on What Everyone in Technical Alignment is Doing and Why, Nate Soares on How Various Plans Miss the Hard Bits of the Alignment Problem, various primary sources, and various conversations with alignment researchers.

I think this is missing the point pretty badly for Anthropic, and leaves out most of the work that we do. I tried writing up a similar summary, which is necessarily a little longer:

Anthropic: Let’s get as much hands-on experience building safe and aligned AI as we can, without making things worse (advancing capabilities, race dynamics, etc). We'll invest in mechanistic interpretability because solving that would be awesome, and even modest success would help us detect risks before they become disasters. We'll train near-cutting-edge models to study how interventions like RL from human feedback and model-based supervision succeed and fail, iterate on them, and study how novel capabilities emerge as models scale up. We'll also share information so policy-makers and other interested parties can understand what the state of the art is like, and provide an example to others of how responsible labs can do safety-focused research.

(Sound good? We're hiring for both technical and non-technical roles.)

We were also delighted to welcome Tamera Lanham to Anthropic recently, so you could add her externalized reasoning oversight agenda to our alignment research too :-)

Thanks, Zac! Your description adds some useful details and citations.

I'd like the one-sentence descriptions to be pretty jargon-free, and I currently don't see how the original one misses the point.

I've made some minor edits to acknowledge that the purpose of building larger models is broader than "tackle problems", and I've also linked people to your comment where they can find some useful citations & details. (Edits in bold)

Anthropic: Let’s interpret large language models by understanding which parts of neural networks are involved with certain types of cognition, and let’s build larger language models to tackle problems, test methods, and understand phenomenon that will emerge as we get closer to AGI (see Zac's comment for some details & citations).

I also think there's some legitimate debate around Anthropic's role in advancing capabilities/race dynamics (e.g., I encourage people to see Thomas's comments here). So while I understand that Anthropic's goal is to do this work without advancing capabilities/race dynamics, I currently classify this as "a debated claim about Anthropic's work" rather than "an objective fact about Anthropic's work."

I'm interested in the policy work & would love to learn more about it. Could you share some more specifics about Anthropic's most promising policy work?

Excellent summary; I had been looking for something like this! Is there a reason you didn't include the AI Safety Camp in Training & Mentoring Programs?

"AGI Safety Fundamentals: Let’s help people get an introduction to basic concepts in ML, AI, and AI safety."

IMO this overstates how much ML we teach; there's one week of very basic introduction to ML and then everything else is pretty directly alignment-related.

let’s build larger language models to tackle problems, test methods, and understand phenomenon that will emerge as we get closer to AGI

Nitpick: you want "phenomena" (plural) here rather than "phenomenon" (singular).

Also possibly relevant (though less detailed): this table I made.

(see Zac's comment for some details & citations)

Just letting you know the link doesn't work although the comment was relatively easy to find.

Love that this exists! Looks like the material here will make great jumping-off points when learning more about any of these orgs, or discussing them with others.