A probabilistic off-switch that the agent is indifferent to

Hey there!

This is similar to the value-learning via indifference idea I presented here: https://www.lesswrong.com/posts/btLPgsGzwzDk9DgJG/proper-value-learning-through-indifference, with the most up-to-date version being here: https://arxiv.org/abs/1712.06365

Great minds thinking alike and all that.

Your method differs in that you use a logical fact, rather than a probability, to underpin your scheme. I would just have put $u(h)$ as a constant if $f^{-1}(y) \neq 0$, but the way you do it prevents the agent from going crazy (all utilities are the same, or logic is broken) if it ever figures out that $f^{-1}(y) = 0$, and incentivises it to shut down in that case.

The disadvantage is your requirement (3): difficulty erasing $x$. If $u(h)$ were truly constant given $f^{-1}(y) \neq 0$, then that would not be necessary - it would only matter that there was a small probability that $x = 0$, so we could do a sloppy job of erasing.

On that note, your scheme falls apart if the agent can only change $u(h)$ by a really tiny amount (in that case, spending a lot of resources calculating $x$ makes little difference). To correct for that, if you want to preserve a non-constant utility for $f^{-1}(y) \neq 0$, then you need to scale this utility by the amount that the agent expects it can change $u(h)$ in the $f^{-1}(y) = 0$ world.

Overall: I like the idea of using logical facts, but I don't like the idea of assuming that calculation costs are enough to control whether the agent computes the fact or not. Some better use of logical uncertainty might be called for?

[-]Ofer7yΩ230

Thank you so much for the feedback!

> The disadvantage is your requirement (3): difficulty erasing $x$.

I just want to note that if it's too easy for the agent to reconstruct $x$ , this approach fails gracefully (and we can then simply improve this aspect and try again).

> On that note, your scheme falls apart if the agent can only change $u(h)$ by a really tiny amount (in that case, spending a lot of resources calculating $x$ makes little difference).

I agree (the approach fails gracefully in that case too). We can create a "sink" for excess computation power by designing $u$ to reward the agent for doing arbitrary computations (e.g. finding more and more prime numbers), but in such a way that the agent always prefers to give up any amount of such computation if it means achieving its goal (and terminating) one time-step earlier.
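To make the "sink" idea a bit more concrete, here is a minimal sketch of the preference ordering it asks for. Everything in it is my own assumption for illustration: the names `sink_preference_key` and `prefers`, and the choice to summarise an outcome as a pair (time-step at which the goal was achieved, number of primes found). It expresses the ordering lexicographically rather than as a real-valued $u$; turning it into an actual utility would additionally require keeping the computation bonus bounded below the gain from finishing one time-step earlier.

```python
from typing import Tuple

Outcome = Tuple[int, int]  # (goal_step, primes_found) -- hypothetical summary of a history


def sink_preference_key(goal_step: int, primes_found: int) -> Tuple[int, int]:
    """Sort key where a larger key means a more preferred outcome.

    The first component (earlier goal completion) always dominates;
    the amount of "sink" computation only breaks ties between
    outcomes that finish at the same time-step.
    """
    return (-goal_step, primes_found)


def prefers(a: Outcome, b: Outcome) -> bool:
    """True if outcome a is strictly preferred to outcome b."""
    return sink_preference_key(*a) > sink_preference_key(*b)
```

With this ordering, `prefers((3, 0), (4, 10**9))` holds: finishing one step earlier beats any amount of extra computation, while among outcomes that finish at the same step, more computation is preferred.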

Regarding your paper (thanks for the pointer!): is there already a known way to use event-dependent rewards and effective disbelief to introduce an off-switch, that the agent will be indifferent about, to an arbitrary POMDP problem?

ETA: Regarding the seamless transition approach, I think I understand how it can solve this. But it requires using the agent's model of the environment in the definition of the utility function, so it might not work for a black-box agent that maximizes a computable function of reward given via a reward channel. I think the approach I described in this post could still be used in such a case.

ETA2: A note to random readers: Invoking a black-box agent is probably a very bad idea unless there's a consensus in the AI safety community it should be done!

EDIT: The following is probably incorrect, I need more time to think about it, please ignore it.

I'm confused about the seamless transition approach. Suppose the agent constructs a sub-agent that from time step t+1 (right after the transition) acts to minimize R′ iff it observes that the original agent keeps following the same policy after the transition (instead of switching to a policy that would maximize R′). Thereby, it will increase the extra corrective reward C (and thus the pseudo-reward R′+C) by decreasing V(R′,π_A,h_{t+1}).

In the proof of Theorem 17, the last sentence is:

> This is obviously maximized by π_A being the optimal policy for R up until time t+1.

But π_A might not be the optimal policy for R up until t+1. For example, if it corresponds to creating the above sub-agent, it isn't optimal for R up until t+1.

[-]Stuart_Armstrong7yΩ240

> this approach fails gracefully

There still might be problems with subagents, though. It could be optimal for the agent to create a subagent to protect it from being interfered with, while it "goes to sleep".

[-]Ofer7yΩ230

I agree.

I think this might be solved by modifying the utility for the case $f^{- 1} (y) \neq 0$ to:

$\frac{α}{1 + [\text{number of time-steps until the first "self-terminate" action}]}$
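As a quick worked check of why this penalises "going to sleep": writing $k$ for the number of time-steps until the first "self-terminate" action (the symbol $k$ is just my shorthand here), the utility for $k = 0, 1, 2, \dots$ is

$\frac{α}{1 + 0} = α \;>\; \frac{α}{1 + 1} = \frac{α}{2} \;>\; \frac{α}{1 + 2} = \frac{α}{3} \;>\; \dots$

so every time-step of delay before self-terminating strictly lowers the utility in the $f^{- 1} (y) \neq 0$ case, and it tends to $0$ as the delay grows.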

[-]Stuart_Armstrong7yΩ350

Yep, that's better. There's still the risk of subagents being created - when the agent thinks that $f^{-1}(y) \neq 0$, almost certainly, but not completely certainly. Then it might create a $u$-maximising subagent and then self-terminate.

That means that this design, like most indifference designs, is reflectively consistent but not reflectively stable.

[-]Ofer7yΩ350

Wow, I agree!

Let us modify the utility for the case $f^{- 1} (y) = 0$ to:

$u^{*}(h) = \begin{cases} 0 & h \text{ contains a "self-terminate" action} \\ u(h) & \text{otherwise} \end{cases}$

Meaning: no utility can be gained via subagents if the agent "jumps ship" (i.e. self-terminates to gain utility in case $f^{- 1} (y) \neq 0$ ).

[-]Stuart_Armstrong7yΩ230

Interesting. I'll think of whether this works and can be generalised (it doesn't make it reflectively stable - creating u-maximising subagents is still allowed, and doesn't directly hurt the agent - but might improve the situation).

[-]Stuart_Armstrong7yΩ230

Yep, seamless transition does not work on black-box agents. However, you don't have to fully understand its utility or the consequences of that utility, to get the design to work. So full transparency and understanding are not needed.

[-]jessicata7y50

[-]Ofer7y10

Thanks for the pointer! (I only now got to fully process it)

Both of the problems presented there seem to occur in this setup.

Regarding the problem of preventing resistance, you wrote:

> Hypothetically, maybe you can tell the humans not to use bombs if the button fails to depress. But this solution seems both brittle and unrealistic.

I'd just like to share an idea for how this can perhaps be tackled, which isn't a great solution, but does not rely on humans adhering to precommitments.

Suppose our plan is to execute the agent program only for a short while (minutes/hours). We lock a (willing) human overseer in an isolated "control room" where they have access to the button, and only then is the agent invoked. The control room receives input from outside (e.g. it has monitors showing the lab with the robot that the agent controls) but otherwise the control room is completely isolated. The person in the control room won't be able to communicate with the outside world or leave the control room (for the entire duration in which the agent program is supposed to run).

This idea requires that no humans outside of the control room would see what the agent is doing (so the agent won't have an incentive to conceal/manipulate).

[-]Gurkenglas7y30

Given y, the agent can figure out whether x = 0 by checking whether f(0) = y.
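A minimal sketch of this shortcut, under assumptions that are mine and not from the post: `f` below is a hypothetical stand-in (SHA-256 of a fixed-width encoding of $x$), and it is treated as injective on the domain of $x$, so that $f(0) = y$ implies $x = 0$. The point is only that a single forward evaluation of $f$ replaces any attempt to invert it.

```python
import hashlib


def f(x: int) -> bytes:
    """Hypothetical stand-in for f: SHA-256 of the 16-byte big-endian encoding of x.

    Assumption: x is a non-negative integer below 2**128, and collisions
    are treated as negligible (f is effectively injective on this domain).
    """
    return hashlib.sha256(x.to_bytes(16, "big")).digest()


def x_is_zero(y: bytes) -> bool:
    """Decide whether x = 0 from y alone, without inverting f.

    One forward evaluation f(0) is enough, so the cost of this check is
    the cost of computing f once, not the cost of computing f^{-1}(y).
    """
    return f(0) == y
```

For example, `x_is_zero(f(0))` is true and `x_is_zero(f(12345))` is false, each at the cost of one call to `f`.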

[-]Ofer7y10

Ah! I agree :)

So $f$ must be expensive-to-compute as well.

[-]Ofer7yΩ110

The following is a modified $u^{'}$ , after (I think) fixing multiple problems that Stuart Armstrong pointed out in the original solution (see here, here and here):

$u^{'}(h) = \begin{cases} u^{*}(h) & f^{-1}(y) = 0 \\ \frac{α}{1 + [\text{number of time-steps in } h \text{ until a terminate action, or } \infty]} & \text{otherwise} \end{cases}$

where:

$u^{*}(h) = \begin{cases} 0 & h \text{ contains a terminate action} \\ u(h) & \text{otherwise} \end{cases}$

and for some: $0 < α ≪ 1$ .

Note: if $x \neq 0$, pressing the off-switch overrides the next action of the agent to $\text{terminate}$.
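For readers who prefer code, here is a rough sketch of the modified utility as defined above. Several things are my own simplifying assumptions rather than part of the proposal: a history $h$ is modelled as a plain sequence of action labels, the terminate action is the string `"terminate"`, and the logical fact $f^{- 1} (y) = 0$ is passed in as a boolean `x_is_zero` (the real agent, of course, is not handed this fact).

```python
import math
from typing import Callable, Sequence

TERMINATE = "terminate"  # hypothetical label for the terminate action

History = Sequence[str]  # assumption: a history is just a sequence of action labels


def steps_until_terminate(h: History) -> float:
    """Number of time-steps in h before the first terminate action, or infinity."""
    for t, action in enumerate(h):
        if action == TERMINATE:
            return float(t)
    return math.inf


def u_star(h: History, u: Callable[[History], float]) -> float:
    """u*(h): zero if h contains a terminate action, otherwise the original u(h)."""
    return 0.0 if TERMINATE in h else u(h)


def u_prime(h: History, u: Callable[[History], float],
            x_is_zero: bool, alpha: float = 1e-3) -> float:
    """u'(h), with 0 < alpha << 1.

    x_is_zero stands in for the logical fact f^{-1}(y) = 0.
    """
    if x_is_zero:
        return u_star(h, u)
    # alpha / (1 + inf) evaluates to 0.0 for histories that never terminate.
    return alpha / (1.0 + steps_until_terminate(h))
```

In the `x_is_zero` branch, any history containing a terminate action scores zero regardless of what subagents achieve; in the other branch, only the timing of the first terminate action matters.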

Additionally, in case the agent has "excess computation capability" that it has nothing to do with, and thus calculates $f^{-1}(y)$ no matter how small $α$ is, it will (with probability of almost 1) terminate itself immediately after figuring out $x \neq 0$. To resolve this, we can modify $u$ to contain a "sink" for the "excess computation capability", as I described in this comment.
