[ Question ]

What are the high-level approaches to AI alignment?

by G Gordon Worley III · 1 min read · 16th Jun 2020 · 13 comments



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I'm writing a post comparing some high-level approaches to AI alignment in terms of their false positive risk. The trouble is, there's no standard agreement on what the high-level approaches to AI alignment are today, either on what constitutes a high-level approach or on where to draw the lines when categorizing specific proposals under them.

So, I'll open it up as a question to get some feedback before I get too far along. What do you consider to be the high-level approaches to AI alignment?

(I'll supply my own partial answer below.)


3 Answers

You might be interested in this post I wrote recently that goes into significant detail on what I see as the major leading proposals for building safe advanced AI under the current machine learning paradigm.

My initial thought is that there are at least 3, which I'll give the following names (with short explanations):

  • Iterated Distillation and Amplification (IDA)
    • Build an AI, have it interact with a human, create a new AI based on the interaction of the human and the AI, and repeat until the AI is good enough or it reaches a fixed point and additional iterations don't change it.
  • Inverse Reinforcement Learning (IRL)
    • Build an AI that tries to infer human values from observations and then acts based on those inferred values.
  • Decision Theorized Agent (DTA)
    • Build an AI that uses a decision theory that causes it to make choices that will be aligned with human interests.
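Of the three, IDA is the most procedural, so it lends itself to a sketch. Below is a minimal, purely illustrative Python rendering of the amplify-then-distill loop described above; every function here (`amplify`, `distill`, the convergence check) is a hypothetical stand-in, not a real implementation of any published IDA system.

```python
def amplify(model, human):
    """Hypothetical: the human answers questions with help from `model`,
    yielding a stronger but slower composite question-answerer."""
    return lambda question: human(question, helper=model)


def distill(amplified):
    """Hypothetical: train a fast new model to imitate the amplified
    system. Here we just wrap the amplified system to keep the sketch
    runnable; a real distillation step would train a new model."""
    return lambda question: amplified(question)


def ida(initial_model, human, max_iters=10,
        converged=lambda old, new: False):
    """Repeat amplify-then-distill until the model is good enough or
    reaches a fixed point (additional iterations don't change it)."""
    model = initial_model
    for _ in range(max_iters):
        new_model = distill(amplify(model, human))
        if converged(model, new_model):
            break
        model = new_model
    return model
```

The key design point the sketch tries to capture is that the human plus the current AI form a more capable system than either alone, and each distillation step compresses that composite back into a standalone model for the next round.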

All of these are woefully underspecified, so improved summaries that you think more accurately explain these approaches are also appreciated.