An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility

[-]cat from /dev/null3y145

Nice proof with a thought-provoking example! Think it could benefit from being translated into a more AI-relevant setting with the following key:

Northland winning = no shutdown
Southland winning = shutdown
Send messenger to Northland = Act in a way that looks dangerous and causes evaluators to probably shut down
Send messenger to Southland = Act in a way that looks safe and causes evaluators to probably not shut down
Bet on Northland = Set up costly processes to do real work in future to attain high utility (e.g. build factories and labs that would need to be safely wound down in event of shut down)
Bet on Southland = Do very conservative work that can immediately be interrupted without any real cost
Cross-over costs could be understood to be the extra fake work to look dangerous when you're actually very interruptible in the NS case and having to show humans how everything is safe in the SN case (not realistic for the utilities to end up as neat as in your example, but rephrasing this feels like it helps me keep my eyes on the ball with various toy scenarios I've considered myself).

However, this actually shows the weakness of relying on pure conditional expectation and the sane pure bets condition. Consider an AI tasked with using nanomedicine to improve human intelligence and which in the case of no shutdown values more humans with higher thinkoomph coming sooner (eg ). It considers two actions: (A) conduct sensible studies to eventually figure out a way of boosting human intelligence without loosing anything humans might value or any other kinds of negative side-effects and cooperates with humans to show this in small studies that can then be scaled up, and (B) device an intelligence boosting virus that also rewrites human values to love the AI and never want to shut it down. We can set up the hypothetical so that both of these cases have a ~0 risk of the shutdown button being pressed, but (B) will yield more utility in the normal utility function due to being able to bypass any human worries about rollout and getting the utility from earlier intelligence boosts. Under the sane pure bets condition (if we set up all the details correctly) the AI should prefer (B) to (A), but this is the opposite of what we'd want from a corrigible system. This is all not meant to be particularly realistic, but still applicable enough to the theory that it causes me to want to reject the suggested desiderata.

I think this shows that we actually need to consider something like counterfactuals for a satisfactory solution to the shutdown problem. The core issue in my example above is that the AI directly causes the shutdown button to not be pressed in case (B) by modifying human values in a way that it doesn't in (A). Another approach might be something like respecting the humans as independent agents and not interfere with them, but this seems harder to formalise and harder to combine with an objective like augmenting human intelligence.

My own current best-attempt at a solution based on this

I have some ideas that still feel half-baked and that I don't feel likely to write up in a full blog post any time soon, so I'll write down a sketch here in case it can help someone else to puzzle more of the pieces together in case this maybe would actually turn out to be helpful at some point down the line.

We can call it Causal utility mixing as a nod to Naive utility mixing on the Arbital page on Utility Indifference. Pick an action a to maximise $λ_{N} E [U_{N} | d o (a) \land d o (\neg s)] + λ_{s} E [U_{s} | d o (a) \land d o (s)]$ where the lambdas are parameters picked based on our initial worry about needing to shut down. The parameters can be kept for actions over several time-steps, but we'll need to adjust to one of the base utility functions once the truth value of s is certain (e.g. the shutdown button is actually pressed, since I think we want to avoid certainty of no shutdown). This does not seem to be represented by any utility function and so this agent must be irrational in some way, but in light of the above result I'm leaning towards this being something we actually want and then the question is if we can somehow prove that it stays consistent under ability to self-modify. This seems to handle all the counterexamples I've encountered so far (like the asteroid problem plaguing naive utility mixing and other approaches), but I could very well have missed some or failed to generate others.

Some of you might recognise the idea of using counterfactuals from Jessica Taylor's and Chris Olah's approach of Maximizing a quantity while ignoring effect through some channel (called Stable actions (causal conditioning) in the Arbital page), which is more advanced in that it actually tries to assign weights to the two different scenarios. I think that if that is a valid approach to the shutdown problem, so will this much simplified solution and it seems easier to analyse the simpler formula.

I've been thinking that maybe you can show that this is somehow rational based on the agent being one party in a two-player game where both players act counterfactually on a graph representing the world (the other being something like an idealised human deciding whether to terminate this hypothetical). I unfortunately haven't had time to compare this to the game theory based approach in The Off-Switch Game by Hadfield-Menell et al., so don't know if there are any similarities. I do feel less certain that it will still work with logical counterfactuals or any form of functional decision theory, so it does seem worth it to investigate a bit more.

Sorry for highjacking your comment feed to cause myself to write this up. Hope it was a bit interesting.

[-]Logan Zoellner3y42

I don't think we want corrigible agents to be indifferent to being shut down. I think corrigible agents should want to be shut down if their users want to shut them down.

[-]AprilSR3y711

Even if shut down in particular isn't something we want it to be indifferent to, I think being able to make an agent indifferent to something is very plausibly useful for designing it to be corrigible?

[-]Matthew_Opitz3y50

This only produces desired outcomes if the agent is also, simultaneously, indifferent to being shut down. If an agent desires to not be shut down (even as an instrumental goal), but also desires to be shut down if users want them shut down, then the agent has an interest in influencing the users to make sure the users do not want to shut the agents down. This influence is obtained by making the user believe that the agent is being helpful. This belief could be engendered by:

actually being helpful to the user and helping the user to accurately evaluate this helpfulness.
not being helpful to the user, but allowing and/or encouraging the user to be mistaken about the agent's degree of helpfulness (which means, carelessness about being actually helpful in the best case, or being actively deceptive about being helpful in the worst case).

[-]Logan Zoellner3y10

Obviously we want 1) "actually be helpful".

Clearly there's some tension between "I want to shut down if the user wants me to shut down" and "I want to be helpful so that the user doesn't want to shut me down", but I don't weak indifference is a correct way to frame this tension.

As a gesture at the correct math, imagine there's some space of possible futures and some utility function related to the user request. Corrible AI should define a tradeoff between the number of possible futures its actions affect and the degree to which it satisfies its utility function. Maximum corrigibility {C=1} is the do-nothing state (no effect on possible futures). Minimum corrigibility {C=0} is maximizing the utility function without regard to side-effects (with all the attendant problems such as convergent instrumental goals, etc). Somewhere between C=0 and C=1 is useful corrigible AI. Ideally we should be able to define intermediate values of C in such a way that we can be confident the actions of corrigible AI are spatially and temporally bounded.

The difficultly principally lies in the fact that there's no such thing as "spatially and temporally bounded". Due to the Butterfly Effect any action at all affects everything in the future light-cone of the agent. In order to come up with a sensible notion of boundless, we need to define some kind of metric on the space of possible futures, ideally in terms like "an agent could quickly undo everything I've just done". At this point we've just recreated agent foundations, though.

[-]Logan Zoellner3y10

Here is a too long writeup of the math I was suggesting.

[-]Charlie Steiner3y20

Nice! What about conditions that break the symmetry between N and S, though?

Suppose there are two actions A and B, and "on switch" o. Maybe we only want the AI to care about what happens when the on switch is on, and not what happens when the switch is off.

So we replace the "pure bets" condition with the "switched bets" condition: If P(o|A) = P(o|B), and E(U|A,o)>E(U|B,o), take action A.

Now the example with Northland and Southland doesn't go through the same, because we have to pick one of the countries to asymmetrically be the one where things matter if it wins, and this leads the AI to sending a bet that the chosen country will win to that country (hurting its chances, but it doesn't switch to betting on the opposite country because that doesn't improve its payoff when the chosen country wins).

[-]Audere3y30

If we implement your example, the AI is willing to bet at arbitrarily poor odds that the on switch will be on, thus violating the sane pure bets condition.

You can have particular decision problems or action spaces that don't have the circular property of the Northland-Southland problem, but the fact remains that if an AI fulfills the weak indifference condition reliably, it must violate the sane pure bets scenario in some circumstances. There must be insane bets that it's willing to take, even if no such bets are available in a particular situation.

Basically, rather than thinking about an AI in a particular scenario, the proof is talking about conditions that it's impossible for an AI to fulfill in all scenarios.

I could construct a trivial decision problem where the AI only has one action it can take, and then the sane pure bets condition and weak indifference condition are both irrelevant to that decision problem. But when we place the same AI in different scenarios, there must exist some scenarios where it violates at least one of the conditions.

[-]Charlie Steiner3y21

If we implement your example, the AI is willing to bet at arbitrarily poor odds that the on switch will be on, thus violating the sane pure bets condition.

Yes. But the symmetry of the sane pure bets condition doesn't quite match what we want from corrigibility anyhow. I don't want an AI with a shutdown button to be making contingency plans to ensure good outcomes for itself even when the shutdown button is pressed.

[-]Audere3y63

Yes, the point of the proof isn't that the sane pure bets condition and the weak indifference condition are the be-all and end-all of corrigibility. But using the proof's result, I can notice that your AI will be happy to bet a million dollars against one cent that the shutdown button won't be pressed, which doesn't seem desirable. It's effectively willing to burn arbitrary amounts of utility, if we present it with the right bets.

Ideally, a successful solution to the shutdown problem should violate one or both of these conditions in clear, limited ways which don’t result in unsafe behavior, or which result in suboptimal behavior whose suboptimality falls within well-defined bounds. Rather than guessing-and-checking potential solutions and being surprised when they fail to satisfy both conditions, we should look specifically for non-sane-pure-betters and non-intuitively-indifferent-agents which nevertheless behave corrigibly and desirably.

[-]eva_3y10

E(U|NS) = 0.8, E(U|SN) = 0.8
Are the best options from a strict U perspective, and exactly tie. Since you've not included mixed actions, the agent must arbitrarily pick one, but arbitrarily picking one seems like favouring an action that is only better because it affects the expected outcome of the war, if I've understood correctly?
I'm pretty sure this is resolved by mixed actions though: The agent can take the policy {NS at 0.5, SN at 0.5}, which also gets U of 0.8 and does not effect the expected outcome of the war, and claim supreme unbiasedness for having done so.
If the scores were very slightly different, such that the mixed strategy that had no expected effect wasn't also optimal, it does have to choose between maximising expected utility and preserving that its strategy doesn't only get that utility by way of changing the odds of the event, I think on this model it has to only decide to favour one to the extent it can justify it without considering the measure of the effect it has on the outcome by shifting its own decision weights, but it's not worth it in that case so it still does the 50/50 split?

[-]Christopher King3y10

We can construct action C such that

P(P|C) = P(P|B),

E(U | C,P) = E(U | A,P),

and E(U | C,!P) = E(U | A,!P)

How is this done?

[-]David Johnston3y00

I’m not convinced the indifference conditions are desirable. Shutdown can be evidence of low utility

action	E(U\|action,n)	E(U\|action,s)	P(n)	E(U\|action)
NN	1.0	0.0	0.1	0.1
NS	-0.1	0.9	0.1	0.8
SS	0.0	1.0	0.9	0.1
SN	0.9	-0.1	0.9	0.8

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

66

An Impossibility Proof Relevant to the Shutdown Problem and Corrigibility

66

66

My own current best-attempt at a solution based on this

The Incompatibility of a Utility Indifference Condition with Robustly Making Sane Pure Bets

Summary

The proof

Further justification for the relevance of the weak indifference condition

Application to the 2015 Corrigibility paper

Conclusion

References