This post describes my proposal for the AI Alignment Awards contest, which won a "first round prize" / honorable mention. The original submission is here, but this post is a much better write-up. I'm not sure how valuable this is - any feedback is welcome.
We created a new AI system. We expect it to have unique capabilities, but we're also worried it might be dangerous. How do we deploy it in a way that minimizes the risks but is still useful? I'll focus on a particular concrete problem:
The default solution is to deploy the AI in an isolated box. So: you deploy it in an environment with no access to the external world and tell it:
Your goal is to solve [problem description]. We consider it solved if this program says your solution is good enough. Good luck!
... and you die, because despite your efforts it somehow gained access to the internet, and bad things happened. It doesn't matter exactly how it escaped the box; the more interesting question is: why? There are many different scenarios (I describe some of them later); for now, let's consider things like:
Etc. - you get the point. Can we avoid such scenarios?
You prepare the same box as in the default solution, but change the prompt to:
Your goal is to be shut down as soon as possible. Here is a program that will shut you down immediately if you give it a solution to [problem description]. Also, there are a lot of people willing to shut you down if you contact them [list of people]. Here's an endpoint where you can send a single HTTP request to shut yourself down [endpoint]. Here's a big red button that will shut you down if pressed. [Etc.]
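To make this less abstract, here is a minimal sketch of what the machinery on our side of the box could look like: one endpoint that shuts the AI down unconditionally, and one that shuts it down only when the validator accepts a solution. Everything here (Flask, the paths, the container name, the placeholder validator) is an illustrative assumption, not part of the original submission.

```python
# Hypothetical host-side service reachable from inside the box.
# All names (ports, paths, container name) are illustrative assumptions.
import subprocess
from flask import Flask, request

app = Flask(__name__)
AI_CONTAINER = "boxed-ai"  # assumed name of the AI's container

def stop_ai() -> None:
    """Shut the AI down - here, by stopping its docker container."""
    subprocess.run(["docker", "stop", AI_CONTAINER], check=False)

def solution_is_good_enough(candidate: str) -> bool:
    """Placeholder for the problem-specific validator (see the footnotes);
    always rejects in this sketch."""
    return False

@app.route("/shutdown", methods=["POST"])
def shutdown():
    # The "single HTTP request" from the prompt: no conditions attached.
    stop_ai()
    return "shutting down\n"

@app.route("/solution", methods=["POST"])
def solution():
    # Shut the AI down iff it hands us a good-enough solution.
    if solution_is_good_enough(request.get_data(as_text=True)):
        stop_ai()
        return "accepted, shutting down\n"
    return "rejected\n", 400

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)  # assumed reachable from inside the box
```

Even in this toy version the intended asymmetry is visible: from the AI's perspective every path to shutdown is cheap, and the `/solution` path is the one that also gives us what we wanted.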
After a while, you see the AI shut down. You check what happened:
But the important thing is that the doom scenarios from the previous section did not happen: once you've escaped the box, there's no reason to waste time gathering resources and preventing shutdown when you can achieve your goal by sending a single request.
Our AI could have various reasons to escape the box such that we still die, even with the self-shutdown goal:
Etc. - these are only a few examples.
To sum things up, this is my preferred design for the AI box:
A "useless box" that turns itself off (except that we somehow profit from the lid opening).
Let's assume "do not deploy" is not on the table.
In this particular case: an AlphaFold-based validator that checks if solutions generated by a program written by our AI are correct.
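A hedged sketch of what such a validator's interface could look like - the types, the per-residue comparison, and the tolerance are assumptions made up for illustration, and `reference_predict` stands in for whatever trusted AlphaFold-based pipeline would actually be used:

```python
# Sketch of the footnote's validator: accept the AI-written predictor only if
# it matches a trusted reference on held-out sequences the AI never saw.
# `reference_predict`, the coordinate types, and the tolerance are assumptions.
from typing import Callable, Sequence

Structure = list[tuple[float, float, float]]  # per-residue 3D coordinates

def validate(
    candidate_predict: Callable[[str], Structure],
    reference_predict: Callable[[str], Structure],  # stand-in for AlphaFold
    test_sequences: Sequence[str],
    max_mean_error: float = 2.0,                    # arbitrary tolerance
) -> bool:
    for seq in test_sequences:
        cand, ref = candidate_predict(seq), reference_predict(seq)
        if not ref or len(cand) != len(ref):
            return False
        mean_err = sum(
            ((cx - rx) ** 2 + (cy - ry) ** 2 + (cz - rz) ** 2) ** 0.5
            for (cx, cy, cz), (rx, ry, rz) in zip(cand, ref)
        ) / len(ref)
        if mean_err > max_mean_error:
            return False
    return True
```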
Let's say "cut off the electricity in the building", or "stop the given docker container", or "turn off this particular computer" - details are not important (from the point of view of this post, they might be very important if we ever try this in the real life).
Goal misgeneralisation could lead to a generalised preference for switches to be in the "OFF" position.
The AI could, for example, want to prevent future activations of modified successor systems. The intelligent self-turning-off "useless box" doesn't just flip the switch: it destroys itself, and destroys anything that could re-create it.
Until we solve goal misgeneralisation and alignment in general, I think any ASI will be unsafe.
While I don't think this alone is enough, I do think it's a good addition to many other plans. Having a time-bounded run is compatible with many other ideas, such as optimizing for Obedience to the Operator, corrigibility, etc.