This is a link post for https://aligned.substack.com/p/alignment-mvp
I'm writing a sequence of posts on the approach to alignment I'm currently most excited about. This second post argues that instead of trying to solve the alignment problem once and for all, we can succeed with something less ambitious: building a system that allows us to bootstrap better alignment techniques.
Building weak AI systems that help improve alignment seems extremely important to me and is a significant part of my optimism about AI alignment. I also think it's a major reason that my work may turn out not to be relevant in the long term.
I still think there are tons of ways that delegating alignment can fail, which is why it matters that we do alignment research in advance.
Overall, I think that "make sure we are able to get good alignment research out of early AI systems" is comparably important to "do alignment ourselves." Realistically, the best case for "do alignment ourselves" is that if alignment is the most important task to automate, then working hard on alignment ourselves is a great way to automate it. But even then, you should be investing a significant fraction of your time in automating alignment.
I also basically buy that language models are now good enough that "use them to help with alignment" can be taken seriously, and that it's worth attacking the problem directly.
What do you (or others) think is the most promising near-term way to use language models to help with alignment? A couple of possible ideas come to mind.