Resources for AI Alignment Cartography

by Gyrodiot, 4th Apr 2020



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I want to make an actionable map of AI alignment.

After years of reading papers, blog posts, online exchanges, books, and occasionally hidden documents about AI alignment and AI risk, and having extremely interesting conversations about it, most arguments I encounter now feel familiar at best, rehashed at worst. This should mean I have a good map of the field being discussed.

I have been, however, frustrated by how little actual advice I could derive from this map. The message I understood from most agendas was "here are the tractable sub-problems we will work on and why they should be solved". I didn't find much justification for why they are critically important, or why one particular set of problems should be explored instead of the next research group's set.

So I looked harder. I found useful mapping work, yet nothing quite like what I was looking for. I also found related concerns in this post and this comment thread.

You'll find, in the following sections, my (current) selection of:

  • "cartography work", to draw a map of relevant arguments and concepts;
  • research agendas, from research groups or individuals;
  • points of entry for newcomers.

Here are the caveats. The list is not exhaustive. I did try to cover as many visible ideas as possible, and there will be significant overlap and cross-references between the items listed here. Some references I consider useful (e.g. this) have not made the cut. I attempted to categorize the resources by focus, but a handful could have ended up in a different category. Please don’t rely on it too much.

My comments aren't summaries, but rather justifications for why I included each reference. They also liberally reuse the original formulations. Please tell me if I left strong misrepresentations of the ideas in there.

All these references, and countless comments scattered all across LessWrong, the Alignment Forum, and the Effective Altruism Forum, will hopefully help me build something actionable, something that would let newcomers and experts explore the field with more clarity and make better decisions.

My short-term plan is to create minimal interactive explanations for the relevance of various propositions in AI alignment, with the option to question and expand their premises. I want to do this for a first few high-level ideas, and if it goes well, expand to a first full scenario.

The long-term plan is to map as many propositions and available scenarios as possible, to have a common framework in which to compare research directions. My intuition (to be challenged) is that there’s broad agreement in the field on most premises I could describe, and that we would benefit a lot from locating cruxes (e.g. here). My overarching motivation is to reduce research debt.

The references here will be my first source of information. The second one would be discussions. If you are the author of one of the resources below and/or if you had more conversations about alignment-related arguments than you can remember, and want to share your insights, please reach out to me. I will do my best to answer in a timely manner.

Thanks to Adam Shimi, Alexis Carlier and Maxime Riché for reviewing drafts of this post and suggesting resources!

Argument mapping & reviews

Disentangling arguments for the importance of AI safety

Richard Ngo - January 2019

Splits the core motivating arguments for AI safety into six rough categories: maximizers being dangerous, target loading, prosaic alignment, human safety, misuses/vulnerabilities, and large impact.

Makes the case for more clarity around the fundamental ideas, analysis of the arguments, description of deployment scenarios, as well as making more explicit the assumptions behind research agendas.

Clarifying some key hypotheses in AI alignment

Ben Cottier, Rohin Shah - August 2019

Creates a diagram linking hypotheses, scenarios, agendas, and catastrophic problems. Selects for debated and important arguments, does not claim to be comprehensive, links ideas through diverse relationships (support, conditional support, entailment, etc.).

The post itself goes into more detail on the hypotheses, with resources listed for each one.

My personal cruxes for working on AI safety

Buck Shlegeris - January 2020

The first section of the talk highlights the limits of heuristic arguments, the usefulness of spelling out premises and making a deliberate effort to build compelling arguments for your personal stance.

The talk then proceeds to detail the speaker’s own argument for AI alignment work. Many commenters express their gratitude for all this exposition.

How sure are we about this AI stuff?

Ben Garfinkel - February 2019

Runs through the intuitive arguments behind AI risk prioritization: “AI as a big deal”, instability, lock-in, and accidents. Explains why each of them isn’t forceful as stated, or has missing pieces/details.

Calls for the arguments to be fleshed out further, presenting this as a neglected issue with potentially high value.

A shift in arguments for AI risk

Tom Adamczewski - February 2019

Describes the evolution of AI risk arguments, from early descriptions of the alignment problem, to discontinuities as a premise for Bostrom’s Superintelligence, to alignment issues without discontinuity. Also describes non-alignment catastrophes, such as misuse risks.

Calls for clarification of arguments related to AI risk, especially on the subject of discontinuities, for better prioritization, and reduction of costly misunderstandings.

Scenarios, forecasting & strategy

AI Impacts (selected references)

AI Impacts contributors - Since 2014

The website in general is dedicated to building AI forecasting resources, to inform arguments and decisions. Some of their content most closely related to AI risk arguments:

What failure looks like

Paul Christiano - March 2019

Describes two scenarios for AI catastrophe which don’t depend on a fast surprise takeover by a powerful AI system. Also notable for the level of engagement in the comments.

Disjunctive Scenarios of Catastrophic AI Risk

Kaj Sotala - February 2018

Breaks down a wide range of scenarios leading to (at least) catastrophic risk, by decomposing them into a variety of factors: strategic advantage, takeoff speed, autonomy acquisition, plurality of agents, etc.

Explores the idea of there being multiple combinations of factors which may be realized, each of them leading to a catastrophe (as opposed to a specific privileged scenario, which may receive too much focus).

The Main Sources of AI Risk?

Wei Dai, Daniel Kokotajlo - March 2019 (last updated March 2020)

Thirty-two (and counting) high-level scenarios for AI catastrophe. Wei Dai emphasizes that they aren’t disjunctive, as some scenarios may subsume or cause others. Daniel Kokotajlo (who maintains and updates the list) suggests it could be refined, expanded and reorganized.

Chris Olah’s views on AGI safety

Evan Hubinger - November 2019

Reports arguments on the importance of transparency and interpretability, and about how to improve the field of machine learning to make progress on these issues.

Classification of global catastrophic risks connected with artificial intelligence

Alexey Turchin, David Denkenberger - January 2018

Lists and categorizes a wide range of catastrophic scenarios, from narrow or general AI, near-term or long-term, misuse or accidents, and many other factors, with references.

Agendas & reports focused on problem framing

Embedded agency

Scott Garrabrant, Abram Demski - November 2018

Clarifies and motivates technical research stemming from the idea of embedded agents, where AI systems are no longer logically separated from their environment, implying modeling and self-modification issues, among others.

Describes the subproblems associated with that hypothesis: decision theory, embedded world-models, robust delegation, and subsystem alignment.

AI Governance: A Research Agenda

Allan Dafoe - August 2018

From the Center for the Governance of AI, Future of Humanity Institute. The agenda aims for superficial comprehensiveness, gathering as many questions relevant to AI Governance as possible in 53 pages, and providing extensive references for further details. It doesn’t focus on prioritization, nor tractability/impact estimates.

The questions are divided into three clusters: technical landscape (modeling and forecasting AI progress, mapping AI capabilities, and technical AI safety), AI politics (transformation of government, of the job market, and regulatory concerns), and ideal AI governance (desirable values, institutions and scenarios).

Building safe artificial intelligence: specification, robustness, and assurance

Pedro A. Ortega, Vishal Maini, DeepMind - September 2018

Motivates DeepMind’s technical AI safety research, dividing it into three areas: specification (how to define the purpose of a system, whether explicitly designed or emergent), robustness (how to prevent, anticipate, defend against, and recover from perturbations), and assurance (how to understand, evaluate and actively control the behavior of a system).

The post defines a broad array of technical terms. The challenges are grounded in problems already present in current AI systems, and in simple environments (gridworlds).

Concrete Problems in AI Safety

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, Dan Mané - June 2016

Describes and motivates five technical safety research problems in machine learning-based systems, tractable through direct experimentation, in toy environments and/or small-scale models. All problems, sub-problems, and proposed abstract solutions are grounded in the existing machine learning literature.

The authors also argue for the increasing relevance of these problems as AI capabilities progress.

Reframing Superintelligence: Comprehensive AI Services as General Intelligence

K. Eric Drexler - January 2019

Expands on Bostrom’s Superintelligence, through a mesh of forty high-level statements addressing the possibility of an intelligence explosion, the nature of advanced machine intelligence, the relationship between goals and intelligence, the use and control of advanced AI, and near/long-term considerations in AI safety & strategy.

The common underlying framing is a service-centered model of general intelligence, suggesting the integration of a diversity of task-oriented systems built incrementally, rather than mostly independent, self-improving superintelligent agents.

AI alignment reviews

AI Alignment Research Overview

Jacob Steinhardt - October 2019

Outlines four broad categories of technical work: technical alignment (how to create aligned AI), detecting failures (how to proactively check for alignment), methodological understanding (best practices), and system-building (how to do the previous three for large systems).

All problems (or sub-problems, for the first category) are explored through a high-level definition, motivation, solution desiderata, possible research avenues, personal takes, and references.

The Landscape of AI Safety and Beneficence Research

Richard Mallah - January 2017

Maps a large set of concepts and techniques in AI safety. The core content can be explored in this interactive visualization. The concepts are primarily organized through a hierarchical map, with secondary links for related ideas. All concepts are given high-level descriptions with references.

The stated purpose of the work is to provide a comprehensive map and a reference set of concepts for the field, to be extended through further research.

AI Alignment 2018-19 Review

Rohin Shah - January 2020

The first section of the post is dedicated to recent work in basic AI risk analysis: new explorations of goal-directedness and comprehensive AI services, as well as new write-ups for or against AI risk (many of which are listed in this very document).

The rest of the post details recent work in the many sub-problems of AI alignment, noting that the over 300 references have been selected from a larger set of around 500 articles, clustered for readability (the reader shouldn’t take the chosen categorization as authoritative).

2019 AI Alignment Literature Review and Charity Comparison

Larks - December 2019

Sorts AI alignment work by origin, and not by topic. It highlights more specifically the agendas of the various research teams, and lists the collaborations between them. It also references a wide range of independent research.

In addition, the post details the funding of the various organizations involved in the field, as well as methodological comments on prioritization, funding, and research avenues.

AGI Safety Literature Review

Tom Everitt, Gary Lea, Marcus Hutter - May 2018

Focuses specifically on powerful AI systems: plausible conceptual models; forecasting of capability increase and risks; technical safety problems; design ideas and concepts; and public policy.

The paper explores safety problems shared by multiple research agendas, and summarizes a wide range of publications in the domain.

Introductory material

Benefits & Risks of Artificial Intelligence

Future of Life Institute - November 2015 (first version)

Summarizes in an accessible way the very high-level case for AI alignment research, the most common naive objections and misconceptions, with further reading references.

Superintelligence: Paths, Dangers, Strategies

Nick Bostrom - July 2014

Makes the case for the risk from superintelligent entities (not necessarily AI systems, though AI is presented as their most probable origin). The book represents an early edited, long-form, philosophical introduction to numerous concepts such as the control problem, takeoff speeds, treacherous turn, instrumental convergence, decisive strategic advantage, value loading, and many more.

Human Compatible: Artificial Intelligence and the Problem of Control

Stuart Russell - October 2019

Makes the case for the risk from advanced AI systems through failure of alignment. The book describes the continued progress in AI capabilities, critically reviews the major arguments around AI risk and forecasting, and argues for early safety research, showcasing significant hurdles to overcome, and possible research avenues.

Potential Risks from Advanced Artificial Intelligence: The Philanthropic Opportunity

Holden Karnofsky - May 2016

Makes the philanthropic case for AI risk research, describing three classes of risk: misuse risk (malevolent, or value-locking use of powerful technology), accident risk (stemming typically from alignment failure) and other risks (such as structural effects due to automation, or dissemination of increasingly capable tools). Also explains several principles for prioritization work.

Current Work in AI Alignment

Paul Christiano - June 2019

Decomposes the then-current main approaches in AI alignment research by building a tree diagram and giving friendly high-level explanations of the ideas. The exploration is itself biased towards iterated amplification, which is put in its broader context.

Arbital - Explore AI Alignment

Many authors - From 2014 to 2018

Provides detailed explanations for many concepts in AI Alignment, in an explorable way. Now in an archived state.

Robert Miles’s YouTube channel

Robert Miles - Since 2014

Clear and friendly explanations of many concepts in AI alignment. For introductory material, it is best to start with his Computerphile videos, produced before the channel’s creation.

Technical agendas focused on possible solutions

Iterated Amplification

Paul Christiano - October 2018

Describes iterated amplification, an alignment technique for powerful ML-based systems. Spells out the core hypotheses behind the validity of the technique. In the fourth section, details the associated research directions, and desiderata for AI alignment research.

Value Learning

Rohin Shah, Paul Christiano, Stuart Armstrong, Jacob Steinhardt, Owain Evans - October 2018

Investigates and motivates value learning, discussing the arguments stemming from the idea of a powerful AI system pursuing a particular utility function, using human behavior as a data source. Clearly restates the core arguments in the conclusion post.

Reframing Impact

Alex Turner - July 2019

Explores and motivates new ways to work with impact measures, a common component of various approaches in AI safety research, and how to think about scenarios where a powerful AI system makes wide-ranging decisions and actions.

Research Agenda v0.9: Synthesising a human's preferences into a utility function

Stuart Armstrong - June 2019

Clarifies and motivates a technical agenda for building specific assumptions into AI systems that would let them infer human preferences, as an instrumental goal for aligning onto them.

Deconfusing Human Values Research Agenda v1

G Gordon Worley III - March 2020

Defines a technical agenda for building a formal expression of the structure of human values, modeling them as the input of their decision process.

The Learning-Theoretic AI Alignment Research Agenda

Vanessa Kosoy - July 2018

Details and motivates philosophically a technical agenda to ground AI alignment in statistical and computational learning theory, as well as algorithmic information theory.

Scalable agent alignment via reward modeling: a research direction

Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg - November 2018

DeepMind paper, which defines a more specific agenda around the learning of a reward function through reinforcement learning, from interaction with a human user, in a way that scales to complex and general domains.

AI Safety Needs Social Scientists

Geoffrey Irving, Amanda Askell - February 2019

Explores and motivates the debate approach to alignment, learning human values through experiments, asking questions and arbitrating between arguments. Tied to the AI safety via debate OpenAI paper.

Special mentions

Technical AGI safety research outside AI

Richard Ngo - October 2019

The first section, "Studying and understanding safety problems", motivates this very project. The entire post is full of interesting problems to solve.

AI safety resources

Victoria Krakovna - Regularly updated since August 2017

Provides a wealth of useful references, which significantly helped expand this list. Still receiving updates!

To reiterate, just above the comment box: I'm looking for insights. If your favorite reference is missing; if you spot a glaring error; if you have a strong opinion on research directions; if you share my frustrations, or disagree: do share! (Yes, the post is long, please don't let that stop you from engaging).


8 comments

I know you link/mention Rohin's map. I think Paul or Chris Olah had put together another map at one time. How do you see your work differing from or building on what they've done?

Is Paul's map the one in Current Work in AI Alignment? I think Rohin also used it in his online-EAG 2020 presentation. For Rohin's map, are you referring to Ben Cottier's Clarifying some key hypotheses in AI alignment, to which Rohin made major contributions? I'll be referring to those two in the rest of my answer.

I want to make more explicit the relationships between the premises and outcomes included in the diagrams. The goal of my work is to make those kinds of questions easier to answer:

  • Are scenarios X and Y mutually exclusive? If they are, is the split sharp (is there a premise P which prevents X if true, and prevents Y if false)?
  • What are the premises behind the work on a specific problem? Which events or results would make this work irrelevant?
  • Does it make sense to "partially solve" problem P? Are there efforts which won't make any difference until something specific happens?

I find it hard to answer those questions with the diagrams, since (from my understanding) they have other goals entirely. Paul's map shows how current research questions relate to each other, with closer elements in the tree sharing more concepts and techniques. Ben & Rohin's map shows which questions are controversial and which debates feed into others, and which very broad scenarios/agendas are relevant to them.

You can answer the questions listed above by integrating the diagram with the post details, and following references... but it isn't convenient. I want to make it easier to discover and engage with that knowledge.

The main difference between my (future) work and the diagrams would be to enable the user to explore one specific scenario/research question at a time. For example, in Paul's talk, that would mean starting from « iterated amplification » and repeatedly asking « why ? » as you go up the tree. I want the user to find out what happens if one of the premises doesn't hold: is the work still useful? If we want to maintain the premise, what are the load-bearing sub-premises?

I expect a lot of the structure in the diagrams will be mirrored in the end result anyway, as it should, since it's the same knowledge. I hope to distill it in a different way.
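To make the idea a bit more concrete, here is a rough sketch of the kind of structure this could rest on. It is only an illustration under my own assumptions, not existing tooling: the `Claim` class, the two helper functions, and the placeholder names ("P1", "P1a", and so on) are all hypothetical and do not correspond to nodes in either diagram.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Claim:
    """A proposition, together with the premises it directly rests on."""
    name: str
    premises: List["Claim"] = field(default_factory=list)


def load_bearing(claim: Claim) -> Set[str]:
    """Collect the leaf premises a claim ultimately depends on."""
    if not claim.premises:
        return {claim.name}
    leaves: Set[str] = set()
    for premise in claim.premises:
        leaves |= load_bearing(premise)
    return leaves


def still_supported(claim: Claim, rejected: Set[str]) -> bool:
    """A claim holds only if it is not rejected and all its premises hold."""
    if claim.name in rejected:
        return False
    return all(still_supported(p, rejected) for p in claim.premises)


# Placeholder structure (hypothetical names, for illustration only).
p1 = Claim("P1", [Claim("P1a"), Claim("P1b")])
p2 = Claim("P2")
work = Claim("work on approach X is useful", [p1, p2])

print(sorted(load_bearing(work)))        # ['P1a', 'P1b', 'P2']
print(still_supported(work, {"P1b"}))    # False
print(still_supported(p2, {"P1b"}))      # True
```

Rejecting a single sub-premise and checking what survives is the "what happens if this doesn't hold?" question from above. A real version would need softer relationships than strict conjunction (conditional support, partial relevance), which this sketch deliberately ignores.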

Thanks, that's really helpful to understand your work better!

I am running a large-scale version of this, with contributors from multiple organizations. We should definitely discuss. Can you message me or email me? Aryeh.Englander at Thanks!

We should indeed! I just sent you an email.

I don't know whether this is on purpose, but I'd think that AI Safety Via Debate (original paper:; recent progress report:) should get a mention, probably in the Technical agendas focused on possible solutions section? I'd argue it's different enough from IDA to have its own subtitle.

It was in the references that initially didn't make the cut. After further thought, it's indeed worth adding. I referenced the Distill article AI Safety Needs Social Scientists, which spends more time on the motivating arguments, and linked to the paper in the note.

Thanks for your feedback!

No worries. As much as I think less has been written on debate than amplification (Paul has a lot of blog posts on IDA), it seems to me like most of the work Paul's team at OpenAI is doing is working on debates rather than IDA.