Safely Interruptible Agents

Oxford academics are teaming up with Google DeepMind to make artificial intelligence safer. Laurent Orseau, of Google DeepMind, and Stuart Armstrong, the Alexander Tamas Fellow in Artificial Intelligence and Machine Learning at the Future of Humanity Institute at the University of Oxford, will be presenting their research on reinforcement learning agent interruptibility at UAI 2016. The conference, one of the most prestigious in the field of machine learning, will be held in New York City from June 25-29. The paper which resulted from this collaborative research will be published in the Proceedings of the 32nd Conference on Uncertainty in Artificial Intelligence (UAI).

Orseau and Armstrong’s research explores a method to ensure that reinforcement learning agents can be repeatedly safely interrupted by human or automatic overseers. This ensures that the agents do not “learn” about these interruptions, and do not take steps to avoid or manipulate the interruptions. When there are control procedures during the training of the agent, we do not want the agent to learn about these procedures, as they will not exist once the agent is on its own. This is useful for agents that have a substantially different training and testing environment (for instance, when training a Martian rover on Earth, shutting it down, replacing it at its initial location and turning it on again when it goes out of bounds—something that may be impossible once alone unsupervised on Mars), for agents not known to be fully trustworthy (such as an automated delivery vehicle, that we do not want to learn to behave differently when watched), or simply for agents that need continual adjustments to their learnt behaviour. In all cases where it makes sense to include an emergency “off” mechanism, it also makes sense to ensure the agent doesn’t learn to plan around that mechanism.

Interruptibility has several advantages as an approach over previous methods of control. As Dr. Armstrong explains, “Interruptibility has applications for many current agents, especially when we need the agent to not learn from specific experiences during training. Many of the naive ideas for accomplishing this—such as deleting certain histories from the training set—change the behaviour of the agent in unfortunate ways.”

In the paper, the researchers provide a formal definition of safe interruptibility, show that some types of agents already have this property, and show that others can be easily modified to gain it. They also demonstrate that even an ideal agent that tends to the optimal behaviour in any computable environment can be made safely interruptible.

These results will have implications in future research directions in AI safety. As the paper says, “Safe interruptibility can be useful to take control of a robot that is misbehaving… take it out of a delicate situation, or even to temporarily use it to achieve a task it did not learn to perform….” As Armstrong explains, “Machine learning is one of the most powerful tools for building AI that has ever existed. But applying it to questions of AI motivations is problematic: just as we humans would not willingly change to an alien system of values, any agent has a natural tendency to avoid changing its current values, even if we want to change or tune them. Interruptibility and the related general idea of corrigibility, allow such changes to happen without the agent trying to resist them or force them. The newness of the field of AI safety means that there is relatively little awareness of these problems in the wider machine learning community.  As with other areas of AI research, DeepMind remains at the cutting edge of this important subfield.”

On the prospect of continuing collaboration in this field with DeepMind, Stuart said, “I personally had a really illuminating time writing this paper—Laurent is a brilliant researcher… I sincerely look forward to productive collaboration with him and other researchers at DeepMind into the future.” The same sentiment is echoed by Laurent, who said, “It was a real pleasure to work with Stuart on this. His creativity and critical thinking as well as his technical skills were essential components to the success of this work. This collaboration is one of the first steps toward AI Safety research, and there’s no doubt FHI and Google DeepMind will work again together to make AI safer.”

For more information, or to schedule an interview, please contact Kyle Scott at

New Comment
11 comments, sorted by Click to highlight new comments since:

Great to see some collaboration occurring between these institutes.

Hey Stuart,

It seems like much of the press around this paper discussed it as a 'big red button' to turn off a rogue AI. This would be somewhat in-line with your previous work around limited impact AIs who are indifferent to their being turned off, but it doesn't seem to really describe this paper. My interpretation of this paper is it doesn't make the AI indifferent to interruption, or prevent the AI from learning about the button - it just helps the AI avoid a particular kind of distraction during the training phase. Being able to implement the interruption is a separate issue - but it seems that designing a form of interruption that the AI won't try to avoid is the tough problem. Is this reading right, or am I missing something?

Very interesting paper, congratulations on the collaboration.

I have a question about theta. When you initially introduce it, theta lies in [0,1]. But it seems that if you choose theta = (0n)n, just a sequence of 0s, all policies are interruptible. Is there much reason to initially allow such a wide ranging theta - why not restrict them to converge to 1 from the very beginning? (Or have I just totally missed the point?)

We're working on the theta problem at the moment. Basically we're currently defining interruptibility in terms of convergence to optimality. Hence we need the agent to explore sufficiently, hence we can't set theta=1. But we want to be able to interrupt the agent in practice, so we want theta to tend to one.

Yup, I think I understand that, and agree you need to at least tend to one. I'm just wondering why you initially use the loser definition of theta (where it doesn't need to tend to one, and can instead be just 0 )

When defining safe interruptibility, we let theta tend to 1. We probably didn't specify that earlier, when we were just introducing the concept?


Talking of yourself in third person? :)

Cool paper!

Anyway I'm a bit bothered by the theta thing, the probability that the agent complies with the interruption command. If I understand correctly, you can make it converge to 1, but if it converges to quickly then the agent learns a biased model of the world, while if it converges too slowly it is unsafe of course.
I'm not sure if this is just a technicality that can be circumvented or if it represents a fundamental issue: in order for the agent to learn what happens after the interruption switch is pressed, it must ignore the interruption switch with some non-negligible probability, which means that you can't trust the interruption switch as a failsafe mechanism.

Would this agent be able to reason about off switches? Imagine an AI getting out, reading this paper on the internet, and deciding that it should kill all humans before they realize what's happening, just in case they installed an off switch it cannot know about. Or perhaps put them into lotus eater machines, in case they installed a dead man's switch.

This approach works under the assumption that the AI knows everything there is to know about its off switch.

And an AI that would kill everyone in case it had an off switch, is one that desperately needs a (public) off switch on it.

The approach assumes that it knows everything there is to know about off switches in general, or what its creators know about off switches.

If the AI can guess that its creators would install an off switch, it will attempt to work around as many possible classes of off switches as possible, and depending on how much of off-switch space it can outsmart simultaneously, whichever approach the creators chose might be useless.

Such an AI desperately needs more FAI mechanisms behind it, it desperately needing an off switch assumes that off switches help.

This class of off switch is designed for the AI not to work around.