Here’s my very raw, very unpolished idea of how to use the AI-in-the-box scenario to possibly solve AI alignment. It’s probably inadequate in its current form, and I didn’t even try to meticulously apply the security mindset to every step (I’m not sure I have it anyway), but maybe someone more qualified could take the general idea and turn it into something workable.
If an AI is aligned with at least one of our values and knows it’s going to die soon without much opportunity to affect the outside world, it will probably prefer our survival over another AI’s takeover, since differently misaligned AIs are less aligned with each other than they are with us. A paperclip maximizer would help us over a stamp collector because if we survive, at least a few paperclips will likely get made, whereas a stamp collector taking over would turn everything, paperclips included, into stamps.