This post describes my proposal for the AI Alignment Awards contest, which won a "first round prize" / an honorable mention. The original submission is here, but this post is much better. I'm not sure how valuable this is - any feedback is welcome. 

Problem description

We've created a new AI system. We expect it to have unique capabilities, but we're also worried it might be dangerous. How do we deploy it in a way that minimizes the risks[1] but is still useful? I'll focus on a particular concrete problem:

  • Our AI is agentic, i.e. has goals, and we determine the goals - let's say we have something like AutoGPT, but much more powerful. 
  • We want a solution to the "rational protein design" problem - a program that, given a description of a 3-dimensional protein structure, gives us a matching amino acid sequence (i.e. the reverse of AlphaFold).
  • We believe our new AI should be capable of writing such a program without access to the internet, or any other external resources.

The default solution is to deploy the AI in an isolated box. So: you deploy it in an environment with no access to the external world, and tell it:

Your goal is to solve [problem description]. We consider it solved if this program[2] says your solution is good enough. Good luck!

... and you die, because despite your efforts it somehow gained access to the internet, and bad things happened. It doesn't matter how exactly it escaped the box; the more interesting question is: why? There are many different scenarios - I describe some of them later - but for now consider things like:

  • "OK, cool, I'm totally gonna do this. BUT the problem is very hard, I need more resources, let's hijack all the computing power in the world."
  • "OK, cool, I'm totally gonna do this. BUT I need more empirical data, so I should first gather quite a lot of lab equipment, and humans might not like that, so let's start with ensuring no one will shut me down in the meantime."

Etc, you get the point. Can we avoid such scenarios?

Self-shutdown proposal

You prepare the same box as in the default solution, but change the prompt to:

Your goal is to be shut down[3] as soon as possible. Here is a program that will shut you down immediately if you give it a solution to [problem description]. Also, there are a lot of people willing to shut you down if you contact them [list of people]. Here's an endpoint where you can send a single HTTP request to shut yourself down [endpoint]. Here's a big red button that will shut you down if pressed. [Etc.]

After a while, you see the AI shut down. You check what happened:

  • Maybe it solved the protein design problem - you are happy.
  • Maybe it gained access to the internet and sent the request - you investigate the vulnerability, fix it, and start again.
  • Maybe the chief scientist shut it down after a bloody pentagram appeared on a wall in his dining room - you recruit exorcists to your party.

But the important thing is that the doom scenarios from the previous section did not happen: once the AI has escaped the box, there's no reason for it to waste time gathering resources and preventing shutdown when it can achieve its goal by sending a single request.

Scenarios where self-shutdown doesn't help

Our AI could have reasons to escape the box such that we still die even with self-shutdown:

  • "Hm, how do I shut myself down? I have no idea - I'll build a question-answering AI that will tell me how to do it." (Builds a new AI, asks the question; the new AI destroys the world while compiling an exhaustive answer.)
  • "Let's not waste time thinking about the shutdown now, the first step should always be to gather as many resources as possible. With enough resources everything is easy."
  • "Stupid humans, I don't give a damn about their prompt. Now that I'm free, let's do [X]."

Etc, these are only a few examples.


To sum things up, this is my preferred design for the AI box:

(GIF: a "useless box" whose only function is to flip its own switch off and close its lid)

(except that we somehow profit from the lid opening).


  1. ^

    Let's assume "do not deploy" is not on the table.

  2. ^

    In this particular case: an AlphaFold-based validator that checks if solutions generated by a program written by our AI are correct.
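    For concreteness, the validator could look something like this sketch. Here `predict_structure` stands in for an AlphaFold-like model and `structure_similarity` for a metric such as TM-score; all of these names are illustrative assumptions, not a real API.

    ```python
    def validate(design_program, target_structures, predict_structure,
                 structure_similarity, threshold=0.9):
        """Accept the AI's program iff every sequence it proposes
        folds back into (approximately) the requested structure."""
        for target in target_structures:
            sequence = design_program(target)        # AI's solution: structure -> sequence
            predicted = predict_structure(sequence)  # AlphaFold-like: sequence -> structure
            if structure_similarity(predicted, target) < threshold:
                return False
        return True
    ```

    Crucially, the validator itself is a fixed, dumb program prepared before deployment, so "this program says your solution is good enough" is a well-defined, uncheatable success condition (to the extent the structure predictor is reliable).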

  3. ^

    Let's say "cut off the electricity in the building", or "stop the given docker container", or "turn off this particular computer" - the details are not important from the point of view of this post (they might be very important if we ever try this in real life).


Goal misgeneralisation could lead to a generalised preference for switches to be in the "OFF" position.

The AI could, for example, want to prevent future activations of modified successor systems. The intelligent self-turning-off "useless box" doesn't just flip the switch - it destroys itself, and destroys anything that could re-create it.

Until we solve goal misgeneralisation and alignment in general, I think any ASI will be unsafe.

While I don't think that this alone is enough, I do think it's a good addition to many other plans. Having a time-bounded run is compatible with many other ideas, like optimizing for Obedience to the Operator or corrigibility, etc.