Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm writing a post comparing some high-level approaches to AI alignment in terms of their false positive risk. Trouble is, there's no standard agreement on what the high-level approaches to AI alignment are today, either in terms of what constitutes each approach or where to draw the lines when categorizing specific proposals.

So, I'll open it up as a question to get some feedback before I get too far along. What do you consider to be the high-level approaches to AI alignment?

(I'll supply my own partial answer below.)




Thanks. Your post specifically is pretty helpful because it helps with one of the things that was tripping me up, which is what standard names people call different methods. Your names do a better job of capturing them than mine did.



You might be interested in this post I wrote recently that goes into significant detail on what I see as the major leading proposals for building safe advanced AI under the current machine learning paradigm.

Actually, this post was not especially helpful for my purpose, and I should have explained why in advance because I anticipated someone would link it. Although it helpfully lays out a number of proposals people have made, it does more to work out what's going on with those proposals than to find ways they can be grouped together (except incidentally). I even reread this post before posting this question, and it didn't help me improve on the taxonomy I proposed, which I already had in mind as of a few months ago.

Gordon Seidoh Worley


My initial thought is that there are at least 3, which I'll give the following names (with short explanations):

  • Iterated Distillation and Amplification (IDA)
    • Build an AI, have it interact with a human, create a new AI based on the interaction of the human and the AI, and repeat until the AI is good enough or it reaches a fixed point and additional iterations don't change it.
  • Inverse Reinforcement Learning (IRL)
    • Build an AI that tries to infer human values from observations and then acts based on those inferred values.
  • Decision Theorized Agent (DTA)
    • Build an AI that uses a decision theory that causes it to make choices that will be aligned with human interests.

All of these are woefully underspecified, so improved summaries that you think accurately explain these approaches are also appreciated.
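
To make the loop in the IDA summary above concrete, here is a toy sketch in Python. It is only an illustration of the "repeat until good enough or a fixed point" control flow, not an implementation of any actual IDA proposal. Everything in it is an assumption made for the example: the "AI" is a single number, and `amplify`, `distill`, `iterate`, `human_step`, and `good_enough` are hypothetical names.

```python
from typing import Callable, Tuple


def amplify(model: float, human_step: Callable[[float], float]) -> float:
    """Toy stand-in: the human working with the current AI produces new behavior."""
    return human_step(model)


def distill(amplified: float) -> float:
    """Toy stand-in: build a new AI from the interaction (here, a perfect copy)."""
    return amplified


def iterate(
    model: float,
    human_step: Callable[[float], float],
    good_enough: Callable[[float], bool],
    tol: float = 1e-6,
    max_rounds: int = 100,
) -> Tuple[float, int]:
    """Repeat amplify-then-distill until the AI is good enough or stops changing."""
    for round_number in range(1, max_rounds + 1):
        new_model = distill(amplify(model, human_step))
        # Stop when the new AI is good enough, or when another
        # iteration barely changes it (an approximate fixed point).
        if good_enough(new_model) or abs(new_model - model) < tol:
            return new_model, round_number
        model = new_model
    return model, max_rounds


if __name__ == "__main__":
    # Hypothetical interaction whose fixed point is 2.0.
    final, rounds = iterate(0.0, lambda m: 0.5 * m + 1.0, lambda m: False)
    print(final, rounds)
```

The two stopping conditions mirror the two in the summary: a quality threshold (`good_enough`) and a fixed-point check (`tol`). The `max_rounds` cap is an extra safeguard added for the sketch.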

Ben Pace

I think the last one seems odd / doesn't make much sense. All agents have a decision theory, including RL-based agents, so it's not a distinctive approach. 

If you were attempting to describe MIRI's work, remember that they're trying to understand basic concepts of agency better (meta level, object level), not in order to directly put the new concepts into the AI (in the same way current AIs do not always have a section for the 'probability theory' to be written in) but in order to be much less confused about what we're trying to do.

So if you want to describe MIRI's work, you'd call it "getting less confused about the basic concepts" and then later building an AI via a mechanism we'll figure out after getting less confused. Right now it's not engineering, it's science.

Gordon Seidoh Worley
That's true, but there's a natural and historical relationship here with what was in the past termed "seed AI", even if this is not an approach anyone is actively pursuing, which is the kind of thing I was hoping to point at without using that outmoded term.
Rob Bensinger
I agree with Ben and Richard's summaries; see

When I think about key distinctions and branching points in alignment, I usually think about things like:

  • Does the approach require human modeling? Lots of risks can be avoided if the system doesn't do human modeling, or if it only does small amounts of human modeling; but this constrains the options for value learning and learning-in-general.
  • Current ML is notoriously opaque. Different approaches try to achieve greater understanding and inspectability to different degrees and in different ways (e.g., embedded agency vs. MIRI's "new research directions" vs. the kind of work OpenAI Clarity does), or try to achieve alignment without needing to crack open the black box.
  • Is the goal to make a task-directed AGI system, vs. an open-ended optimizer?

When you say "there's a natural and historical relationship here with what was in the past termed 'seed AI', even if this is not an approach anyone is actively pursuing", it calls to mind for me the transition from MIRI thinking about open-ended optimizers to instead treating task AGI as the place to start.
Ben Pace
I'm not actually sure what you mean. I think 'seed AI' means something like 'first case in an iterative/recursive process' of self-improvement, which applies pretty well to the iterated amplification setup (which is a recursively self-improving AI) and lots of other examples that Evan wrote about in his 11-examples post. It still seems to me to be a pretty general term.

I suspect that nobody is actually pursuing the third one as you've described it. Rather, my impression is that MIRI researchers tend to think of decision theory as a more fundamental problem in understanding AI, not directly related to human interests.

Based on comments/links so far it seems I should revise the names and add a fourth:

  • IDA = IDA
  • IRL -> Ambitious Value Learning (AVL)
  • DTA -> Embedded Agency (EA)
  • + Brain Emulation (BE)
    • Build AI that either emulates how human brains work or is bootstrapped from human brain emulations.

Sorta related is my appendix to this article.

Oh, I forgot about emulation approaches, i.e. bootstrap AI by "copying" human brains, which you mention. Thanks!