This is a short post.


Here's a common theory of change for alignment research:

  1. Do research that makes alignment more likely to be achievable in the real world
  2. Hope a lab building AGI can make it more aligned using this research

Here's another theory of change for alignment research:

  1. Do research that makes it easier for someone to understand whether AGI poses a risk. In particular, reduce the number of questionable assumptions a capabilities researcher must accept before accepting the case for AI risk.
  2. a) Obtain more cooperation among AGI labs and slow down capabilities research. b) Convince more researchers to work on alignment (under either the first or the second theory of change)


Note that the kinds of alignment research that serve the first and second theories of change overlap but are not identical. Some work, such as interpretability research, clearly helps with both. Something like the infrabayesian sequence probably helps more with the first. Risks from learned optimisation probably helps with both. Research on evolution and genetic fitness probably helps more with the second theory of change than the first. Most research today seems focussed on the first theory of change, which might also be why I can't find examples of research laser-focussed on the second.


I'm not sure to what extent alignment researchers should aim for the first rather than the second, or what fraction of the community should focus on each.

I just want to point out that the second theory of change exists.
