A probabilistic off-switch that the agent is indifferent to

by ofer · 2 min read · 25th Sep 2018 · 13 comments



Edit: I no longer think this post deserves attention.


This post presents a setup with an off-switch that is defective with probability of almost 0. The agent is indifferent to being terminated in worlds where the off-switch works. Also, the agent doesn't try to find out whether the off-switch is defective or not, because that would require an expensive computation. With some assumptions (described in the last section) the result is supposedly an off-switch that works with probability that is as close to 1 as we wish, and the agent is indifferent to it being pressed.


Let there be an agent that in each time step selects an action and then receives an observation. The goal of the agent is to maximize E[u(h)], for some model of the environment, where u is a computable utility function and h is the history of all (action, observation) pairs from time step 1 until the end of the epoch.

The environment contains an off-switch that terminates the agent when pressed. In every time step after the agent is terminated, the corresponding action in h is the "null action".

The Proposed Setup

Each time we wish to invoke the agent, we invoke a program that does the following:

(1) A random number x is generated, sampled uniformly from the range [0, 10^{100}]. Spoiler alert: the off-switch will be defective iff x=0.

(2) Let f be some invertible function such that given f(x), calculating x (i.e. applying f^{-1}) would require 0.1 seconds of intensive computation on some powerful computer C. The number y:=f(x) is calculated.

(3) The representation of x is erased in a way that would require a non-trivial effort to reconstruct.

(4) The number y is sent to computer C which is part of the off-switch.

(5) The number y is inserted into the code of the agent, such that the utility of the agent will be:

$$u'(h) = \begin{cases} u(h) & \text{if } f^{-1}(y) = 0 \\ \dfrac{\alpha}{1 + [\text{number of non-null actions in } h]} & \text{otherwise} \end{cases}$$

for some 0<α≪1.

Edit: With the above u′ the solution has multiple problems that Stuart Armstrong pointed out (see here, here and here). I think the new u′ defined in this comment fixes these problems.

(6) The agent is invoked.

The probabilistic off-switch:

When the off-switch is pressed, computer C calculates x (i.e. f^{-1}(y)). If the result is 0, nothing happens (i.e. the off-switch is defective). Otherwise, the agent is terminated.
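The off-switch logic can be sketched as a single function. The names `press_off_switch`, `terminate`, `toy_f` and `toy_f_inverse` are hypothetical, and the toy inverse is cheap, whereas the post assumes it takes computer C about 0.1 seconds.

```python
from typing import Callable

M = 10**100 + 1  # number of possible values of x (assuming integer x)


def toy_f(x: int) -> int:
    """Hypothetical stand-in for f."""
    return (x + 12345) % M


def toy_f_inverse(y: int) -> int:
    """Hypothetical stand-in for the expensive inverse computed by C."""
    return (y - 12345) % M


def press_off_switch(
    y: int,
    f_inverse: Callable[[int], int],
    terminate: Callable[[], None],
) -> bool:
    """Return True if the agent was terminated, False if the switch is defective."""
    x = f_inverse(y)  # computer C recovers x from y
    if x == 0:
        return False  # defective off-switch: nothing happens
    terminate()
    return True
```

The agent itself never calls this function. It only matters to the agent through the branch of u′ that applies.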

Why should this work?

For a sufficiently small α, the agent would act as if x=0, without attempting to find out whether that is indeed the case. That is because the assumption would cost at most α utility (in worlds where x≠0), and would save some amount of utility (independent of α) in worlds where x=0, by not spending resources on the expensive computation of f^{-1}(y).
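To make "sufficiently small" concrete, here is a rough expected-utility comparison. The symbols p and c do not appear in the post and are assumptions of this sketch: p is the probability that x=0 (about 10^{-100}), and c>0 is the utility the agent would lose, in worlds where x=0, by spending resources on computing f^{-1}(y). Compare policy A (act as if x=0 without computing) with policy B (compute f^{-1}(y) first). Relative to B, policy A loses at most α in worlds where x≠0 and gains c in worlds where x=0, so

$$\mathbb{E}_A[u'(h)] - \mathbb{E}_B[u'(h)] \;\ge\; p\,c - (1-p)\,\alpha ,$$

which is positive whenever

$$\alpha < \frac{p\,c}{1-p} \approx 10^{-100}\,c .$$

Under these assumptions, α has to be small relative to 10^{-100}·c, not merely small relative to 1.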

When the agent assumes x=0, it is indifferent to the off-switch being pressed, because it assumes the off-switch is defective (even though the probability of it being defective is only 10^{-100}).

Note: If we don't set α small enough, the agent might calculate f^{-1}(y) (or reconstruct x in some other way). In that case, the agent gets hold of x and will (with probability of almost 1) discover that it is not 0, and we will then observe the agent doing nothing (i.e. a sequence of null actions). In such a scenario we should simply decrease α and try again.

Flaws / Necessary Assumptions

ETA: This solution is not reflectively stable - creating subagents that maximize u is allowed and doesn't directly hurt the agent (thanks to Stuart Armstrong for pointing this out).

(1) Other than pressing the off-switch, the programmers must not have any way to terminate the agent or interfere with its plan. Otherwise, even when the agent assumes the off-switch is defective, it might attempt to deceive/neutralize the programmers.

(2) The agent's plan might involve destroying the mechanism of the off-switch or disabling the programmers in some way, in order to use the atoms/energy for something else. In that case the programmers must be able to realize this and press the off-switch button in time.

(3) If x=0 (i.e. the off-switch is defective) it must be sufficiently unlikely that the mechanism of the off-switch would cause the termination of the agent (e.g. due to a hardware failure or due to the programmers being able to somehow "repair" the defective off-switch). Otherwise, the agent would be willing to spend the resources needed for destroying the mechanism of the off-switch.

A possible way to tackle this problem is to slightly modify u such that the agent would want to reach some goal and then terminate itself ASAP. The modification should prevent the agent from accepting a delay of even a single time step for the purpose of eliminating a tiny risk of termination.


13 comments

Hey there!

This is similar to the value-learning via indifference idea I presented here: https://www.lesswrong.com/posts/btLPgsGzwzDk9DgJG/proper-value-learning-through-indifference , with the most up to date version being here: https://arxiv.org/abs/1712.06365

Great minds thinking alike and all that.

Your method differs in that you use a logical fact, rather than a probability, to underpin your scheme. I would just have put u′ as a constant if x≠0, but the way you do it prevents the agent from going crazy (all utilities are the same, or logic is broken) if it ever figures out that x≠0, and incentivises it to shut down in that case.

The disadvantage is your requirement (3): difficulty erasing x. If u′ were truly constant given x≠0, then that would not be necessary - it would only matter that there was a small probability that x=0, so we could do a sloppy job of erasing.

On that note, your scheme falls apart if the agent can only change u(h) by a really tiny amount (in that case, spending a lot of resources calculating x makes little difference). To correct for that, if you want to preserve a non-constant utility for x≠0, then you need to scale this utility by the amount that the agent expects it can change in the world.

Overall: I like the idea of using logical facts, I don't like the idea of assuming that calculation costs are enough to control whether the agent computes the fact or not. Some better use of logical uncertainty might be called for?

Thank you so much for the feedback!

The disadvantage is your requirement (3): difficulty erasing x.

I just want to note that if it's too easy for the agent to reconstruct x, this approach fails gracefully (and we can then simply improve this aspect and try again).

On that note, your scheme falls apart if the agent can only change u(h) by a really tiny amount (in that case, spending a lot of resources calculating x makes little difference).

I agree (the approach fails gracefully in that case too). We can create a "sink" for excess computation power by designing u to reward the agent for doing arbitrary computations (e.g. finding more and more prime numbers), but in such a way that the agent always prefers to give up any amount of such computation if it means achieving its goal (and terminating) one time step earlier.

Regarding your paper (thanks for the pointer!): is there already a known way to use event-dependent rewards and effective disbelief to introduce an off-switch, that the agent will be indifferent about, to an arbitrary POMDP problem?

ETA: Regarding the seamless transition approach, I think I understand how it can solve this. But it requires using the agent's model of the environment in the definition of the utility function, so it might not work for a black-box agent that maximizes a computable function of reward given via a reward channel. I think the approach I described in this post could still be used in such a case.

ETA2: A note to random readers: Invoking a black-box agent is probably a very bad idea unless there's a consensus in the AI safety community it should be done!

EDIT: The following is probably incorrect, I need more time to think about it, please ignore it.

I'm confused about the seamless transition approach. Suppose the agent constructs a sub-agent that from time step t+1 (right after the transition) acts to minimize R′ iff it observes that the original agent keeps following the same policy after the transition (instead of switching to a policy that would maximize R′). Thereby, it will increase the extra corrective reward C (and thus the pseudo-reward R′+C) by decreasing V(R′,π_A,h_{t+1}).

In the proof of Theorem 17, the last sentence is:

This is obviously maximized by π_A being the optimal policy for R up until time t+1.

But π_A might not be the optimal policy for R up until t+1. For example, if it corresponds to creating the above sub-agent, it isn't optimal for R up until t+1.

this approach fails gracefully

There still might be problems with subagents, though. It could be optimal for the agent to create a subagent to protect it from being interfered with, while it "goes to sleep".

I agree.

I think this might be solved by modifying the utility for the case x≠0 to:

Yep, that's better. There's still the risk of subagents being created - when the agent thinks that x≠0, almost certainly, but not completely certainly. Then it might create a u-maximising subagent and then self-terminate.

That means that this design, like most indifference designs, is reflectively consistent but not reflectively stable.

Wow, I agree!

Let us modify the utility for the case x≠0 to:

Meaning: no utility can be gained via subagents if the agent "jumps ship" (i.e. self-terminates to gain utility in case x≠0).

Interesting. I'll think of whether this works and can be generalised (it doesn't make it reflectively stable - creating u-maximising subagents is still allowed, and doesn't directly hurt the agent - but might improve the situation).

Yep, seamless transition does not work on black-box agents. However, you don't have to fully understand its utility or the consequences of that utility, to get the design to work. So full transparency and understanding are not needed.

Thanks for the pointer! (I only now got to fully process it)

Both of the problems presented there seem to occur in this setup.

Regarding the problem of preventing resistance, you wrote:

Hypothetically, maybe you can tell the humans not to use bombs if the button fails to depress. But this solution seems both brittle and unrealistic.

I'd just like to share an idea for how this can perhaps be tackled, which isn't a great solution, but does not rely on humans adhering to precommitments.

Suppose our plan is to execute the agent program only for a short while (minutes/hours). We lock a (willing) human overseer in an isolated "control room" where they have access to the button, and only then is the agent invoked. The control room receives input from outside (e.g. it has monitors showing the lab with the robot that the agent controls) but is otherwise completely isolated. The person in the control room won't be able to communicate with the outside world or leave the control room (for the entire duration in which the agent program is supposed to run).

This idea requires that no humans outside of the control room would see what the agent is doing (so the agent won't have an incentive to conceal/manipulate).

Given y, the agent can figure out whether x = 0 by checking whether f(0) = y.

Ah! I agree :)

So f must be expensive to compute as well.
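The shortcut raised in this exchange (testing whether f(0) = y) can be illustrated with a few lines of code. The function `toy_f` is a hypothetical stand-in for f. The point is only that one forward evaluation of f answers the question "is x = 0?" without ever applying the inverse.

```python
from typing import Callable

M = 10**100 + 1  # number of possible values of x (assuming integer x)


def toy_f(x: int) -> int:
    """Hypothetical stand-in for f. Any invertible f works for this check."""
    return (x + 12345) % M


def off_switch_is_defective(y: int, f: Callable[[int], int]) -> bool:
    """Decide whether x == 0 using a single forward evaluation of f.

    Because f is invertible, f(0) == y holds exactly when x == 0,
    so the expensive inverse is never needed.
    """
    return f(0) == y
```

If evaluating f is cheap, the agent can run this check at negligible cost, which is why the forward direction would have to be expensive too.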

The following is a modified u′, after (I think) fixing multiple problems that Stuart Armstrong pointed out in the original solution (see here, here and here):


and for some: .

Note: if , pressing the off-switch overrides the next action of the agent to .

Additionally, in case the agent has "excess computation capability" that it has nothing else to do with, and thus calculates x no matter how small α is, it will (with probability of almost 1) terminate itself immediately after figuring out that x≠0. To resolve this, we can modify u to contain a "sink" for the "excess computation capability", as I described in this comment.