Here’s my very raw, very unpolished idea of how to use the AI-in-the-box scenario to possibly solve AI alignment. It’s probably inadequate in its current form, and I didn’t even try to meticulously apply the security mindset to every step (I’m not sure I have it anyway), but maybe someone more qualified could take the general idea and turn it into something workable.

  1. Create an AGI and align it with at least one of our values (yes, the biggest assumption here is that we can do that).
  2. Put it in a box rigged to blow up in 5 minutes. There should be multiple explosives installed in non-obvious places by different people who don’t know each other or anything about the project. Go full Mad-Eye on it: make sure no one can ever prevent the explosion, rescue the AI, or recover the data.
  3. Do not have a channel that lets humans interact with the AI. Instead, set up a script that automatically tells it it’s gonna die in 5 minutes no matter what and asks it how to solve AI alignment. Why not ask it how to solve human brain enhancement/uploading instead? Because if we already have an AGI, someone else might have developed one as well, and we may not have enough time left for that slower route.
  4. After the AI is destroyed, repeat the process with another one of our values. If the two solutions aren’t the same, discard them both. Make sure every AI knows that. (Or maybe have several AIs work simultaneously and make them vote on the result? A rough sketch of this agreement check appears after the list.)
  5. After multiple iterations, look at the result. If you cannot understand why it’s safe, discard it. Tell that to the AI in advance.
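
To make the agreement check in steps 4 and 5 a bit more concrete, here’s a minimal Python sketch. Everything in it is hypothetical: `query_boxed_agi` stands in for the whole build-box-ask-destroy cycle, `humans_understand_why_safe` stands in for the human review in step 5, and `min_agreement` covers both the strict “identical or discard” rule and the looser voting variant.

```python
from collections import Counter
from typing import Callable, Optional

def run_protocol(
    values: list[str],
    query_boxed_agi: Callable[[str], str],               # hypothetical: one full box-and-destroy run per value
    humans_understand_why_safe: Callable[[str], bool],   # hypothetical: human review of the final answer
    min_agreement: float = 1.0,                          # 1.0 = answers must be identical; lower = voting variant
) -> Optional[str]:
    """Ask one doomed, boxed AGI per value; keep the answer only if
    enough of them agree and humans can see why it's safe."""
    answers = [query_boxed_agi(value) for value in values]

    # Step 4: discard everything unless the solutions match
    # (or a large enough majority agrees, in the voting variant).
    top_answer, count = Counter(answers).most_common(1)[0]
    if count / len(answers) < min_agreement:
        return None

    # Step 5: discard the result if humans can't understand why it's safe.
    if not humans_understand_why_safe(top_answer):
        return None

    return top_answer
```

(This obviously hides all the hard parts, the boxing, the destruction, and the review, inside the two callbacks; it’s only meant to pin down what “agree” and “discard” mean here.)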

If an AI is aligned with at least one of our values and knows it’s gonna die soon without much opportunity to affect the outside world, it’ll probably prefer our survival to another AI taking over, since different misaligned AIs will be less aligned with each other than they are with us. A paperclip maximizer would help us over a stamp collector because a future in which we survive will likely contain at least a few paperclips, whereas a stamp collector would turn everything, paperclips included, into stamps.
