Classification of AI alignment research: deconfusion, "good enough" non-superintelligent AI alignment, superintelligent AI alignment

by crabman, 14th Jul 2020



I want to work on technical AI alignment research. I am trying to choose an AI alignment area to get into. To do that, I want to understand the big picture of how to make AI go well through technical research. This post contains my views on this question. More precisely, I list types of AI safety research and their usefulness.

Probably, many things in this post are wrong. I want to become less wrong. The main reason why I am posting this is to get feedback. So, please

  • send me links to alternative big picture views on how to make AI go well
  • tell me all the places where I am wrong and all the places where I am right
  • talk to me in the comments section, via DM, via email, or schedule a video chat.

Three fuzzy clusters of AI alignment research

  1. The research aimed to deconfuse us about AI. Examples of this are AIXI, MIRI's agent foundations, Michael Dennis' game theory, Tom Everitt's agent incentives, Vanessa Kosoy's incomplete models, Comprehensive AI services.
  2. The research aimed to provide "good enough" alignment for non-superintelligent AI. This category includes more hacky and less theory-based research, which probably won't help with the alignment of superintelligent AI. By superintelligent AI, I mean AI that is much smarter than humans, much faster, and almost omnipotent. Examples: everything under the umbrella of prosaic AI alignment, robustness of neural networks against adversarial examples, robustness of neural networks against out-of-distribution samples, almost everything related to neural networks, empirical work on learning human values, IDA (I think it goes into both 2 and 3), most things studied by OpenAI and DeepMind (but I am not that sure what exactly they are studying), Vanessa Kosoy's value learning protocols (I think it goes under both 2 and 3).
  3. The research aimed to directly solve one of the problems on our way to the alignment of both superintelligent and non-superintelligent AI. Examples: IDA (I think it goes into both 2 and 3), Vanessa Kosoy's value learning protocols (I think it goes under both 2 and 3), Stuart Armstrong's research agenda (I am not sure about this one).

Their usefulness

I think type-1 research is most useful, type-3 is second best, and type-2 is least useful. Here's why I think so.

  1. At some point, humanity will create a superintelligent AI, unless we go extinct first. When that happens, we won't be making important decisions anymore. Instead, the AI will.
  2. Human-level AI might be alignable using hacky methods, empirical testing, engineering, and "good enough" alignment. However, superintelligent AI can't be aligned with such methods.
  3. Superintelligent AI is an extremely powerful optimization process. Hence, if it's unaligned even a little, it'll be catastrophic.
  4. Therefore, it's crucial to align superintelligent AI perfectly.
  5. I don't see why it'll be easier to work on the alignment of superintelligent AI in the future rather than now, so we'd better start now. But I am unsure about this.
  6. There are too many confusing things about superintelligent AI alignment, and I don't see any clear ways to solve it without spending a lot of time on figuring out what is even going on (e.g., how can embedded agents work?). Hence, deconfusion is very important.

Many people seem to work on type-2 research. Probably, many of them have thought about it and decided that it's better. This is a reason to think that I am wrong. However, I think there are other reasons people may choose to work on type-2 research, such as:

  • It's easier to get paid for type-2 research.
  • It's easier to learn all the prerequisites for type-2 research and to actually do it.
  • Humans have a bias towards near-term thinking.
  • Type-2 research seems less fringe to many people.

Also, I have a feeling that a small paradigm shift has happened over roughly the last 3 years. People interested in AI alignment started talking less about superintelligence, singletons, AGI in the abstract, recursive self-improvement, and fast takeoff. Instead, they talk more about neural networks, slow takeoff, and smaller, less weird, and less powerful AI. They might be right, and this is a reason to be slightly more enthusiastic about type-2 research. However, I still think that the old paradigm, perhaps minus fast takeoff, is more useful.