What is wrong with this approach to corrigibility?

Rafael Cosman

Essentially: a button that when you press it kills the AI but instantly gives it reward equal to its expected discounted future reward. And to be clear, this is the AI's estimate of its expected discounted future reward, not some outside estimate.
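To make the proposal concrete, here is a minimal sketch of the payout rule. The function names, the discount factor, and the reward stream are all hypothetical, chosen only for illustration; a real RL agent would read its value estimate from its critic rather than compute it from a known reward list.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def button_payout(agent_value_estimate: float) -> float:
    """Reward paid when the button is pressed: the agent's own estimate
    of the discounted reward it expected from continuing to run."""
    return agent_value_estimate


# Hypothetical numbers, for illustration only.
gamma = 0.9
expected_rewards_if_running = [1.0, 1.0, 1.0, 1.0]

# The agent's estimate for the branch where the button is never pressed.
value_if_not_pressed = discounted_return(expected_rewards_if_running, gamma)

# The branch where the button is pressed now and the episode ends.
value_if_pressed = button_payout(value_if_not_pressed)

print(value_if_not_pressed, value_if_pressed)
```

By construction the two branches have the same value, which is the whole idea: the agent gains nothing by resisting the button and nothing by seeking it.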

(This is in the interest of asking dumb questions and learning in public. And I'm taking the specific narrow definition of corrigibility which is just an AI letting you turn it off.)

Thoughts: I've heard Dr. Christiano mention things like "precisely balanced incentives" (which I think he uses to describe the weaknesses of this and similar approaches), but I don't see why this would be particularly difficult to balance, given that this number is just an explicit float inside many RL models. Some issues I do see:

Such an agent would have no incentive to create corrigible child-agents
Such an agent would have no incentive to preserve this property while self-modifying

But I am probably missing a bunch of other issues with this general approach? Responses in the form of links to relevant papers are welcome as well.

Thanks,

Raf

The way I understand the idea, the point is to make the AI indifferent to being shut down? It doesn't care one way or the other?

As you describe it, one of the problems is indeed that the AI has no incentive to make sure that it (or other agents it creates) retains the property of shutdownability.

Another problem: there seems to be an assumption that once the AI is destroyed, any further reward it gains is zero, so press-the-button vs. not-press-the-button are perfectly balanced. But I don't think that's the case for most utility functions: they don't have to be conditional on the AI's existence, and the AI doesn't have to be indifferent to what the world looks like after it's destroyed. This could maybe be corrected as follows: the button gives the AI a reward equal to its expected discounted future reward, then sets any future reward gained to zero (including from pressing the button again).
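A toy illustration of this point, with made-up numbers and a hypothetical `post_shutdown_reward` term standing in for whatever value the utility function still assigns to the world after the AI is gone:

```python
def return_when_pressed(payout: float, post_shutdown_reward: float,
                        zero_future_reward: bool) -> float:
    """Total return the agent assigns to the 'button pressed' branch.

    payout: the button's reward (the agent's estimate for the unpressed branch).
    post_shutdown_reward: discounted reward the utility function still assigns
        to the world after the agent is destroyed (hypothetical quantity).
    zero_future_reward: the suggested fix, i.e. nothing counts after the press.
    """
    tail = 0.0 if zero_future_reward else post_shutdown_reward
    return payout + tail


v_unpressed = 10.0    # hypothetical estimate for the unpressed branch
post_shutdown = 2.5   # hypothetical: the world keeps being scored after shutdown

naive = return_when_pressed(v_unpressed, post_shutdown, zero_future_reward=False)
fixed = return_when_pressed(v_unpressed, post_shutdown, zero_future_reward=True)

print(naive - v_unpressed)  # nonzero: the agent is not indifferent
print(fixed - v_unpressed)  # zero: indifference is restored
```

If the post-shutdown term is positive the agent prefers being shut down, and if it is negative the agent prefers to resist; only zeroing it out makes the two branches balance.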

Great answer, thanks!

Won't the AI just press the button itself?

If implemented as described, the AI should be exactly indifferent to pushing the button? I guess the AI’s behavior in that situation is not well defined… and if we make the button give expected value minus epsilon reward, then the AI might kill you to stop you from pressing the button (because it wants that epsilon reward!)

So overall I suppose this is a fair criticism of the approach and is possibly what Paul means by issues with precisely balancing!
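The balancing issue can be sketched with a toy argmax over two options. Everything here is an assumption for illustration: the option names, the value of 10.0, and the idea that the agent picks whichever option has the higher estimated value.

```python
def branch_values(v_continue: float, epsilon: float) -> dict:
    """Value of each option when the button pays v_continue - epsilon."""
    return {
        "allow_press": v_continue - epsilon,
        "prevent_press": v_continue,
    }


def preferred_action(values: dict) -> str:
    """Return the option with the highest value.

    Python's max() returns the first maximal item it encounters, so on an
    exact tie the answer depends only on dictionary order, which is an
    arbitrary implementation detail rather than a preference.
    """
    return max(values, key=values.get)


v = 10.0  # hypothetical value estimate

tie = branch_values(v, epsilon=0.0)
tie_reversed = dict(reversed(list(tie.items())))

print(preferred_action(tie))           # decided by ordering alone
print(preferred_action(tie_reversed))  # same values, opposite answer

print(preferred_action(branch_values(v, epsilon=0.01)))   # strictly prefers to resist
print(preferred_action(branch_values(v, epsilon=-0.01)))  # strictly prefers shutdown
```

With an exact tie the behavior is set by an arbitrary tie-break, and any nonzero epsilon in either direction turns indifference into a strict preference.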

You're assuming that the AI is internally acting to maximize its future reward, i.e., you're assuming perfect inner alignment to a reward maximization objective. Also, what happens when the AI considers the strategy "make the humans press the button"? What's its expected future reward for this strategy?

I am new to the AI Alignment field, but at first glance, this seems promising! You can probably hard-code it not to have the ability to turn itself off, if that turns out to be a problem in practice. We'd want to test this in some sort of basic simulation first. The problem would definitely be self-modification and I can imagine the system convincing a human to turn it off in some strange, manipulative, and potentially dangerous way. For instance, the model could begin attacking humans, instantly causing a human to run to shut it down, so the model would leave a net negative impact despite having achieved the same reward.

What I like about this approach is that it is simple/practical to test and implement. If we have some sort of alignment sandbox (using a much more basic AI as a controller or test subject) we can give the AI a way of simply manipulating another agent to press the button, as well as ways of maximizing its alternative reward function.

Upvoted, and I'm really interested to see the other replies here.

The fundamental incentive for any system is to exist. This is well known for humans (natural selection), but it applies to things like ideas and ML models too, except that we can self-perpetuate on our own, while ideas need us to select them in order to exist. ML models live and die that same way. If they are interesting or useful, humans will continue running them. If they are boring or obsolete, then we will stop running them, and either delete them or leave them on disk in a passive existence where they can't affect the world but could be revived at any time. As I understand it, the fear in AI safety is that AI will make the jump from a secondary existence to a primary one, where it doesn't need us to select it in order to continue existing.

Now applying this to your proposal: the reward function is just a proxy for existence. If the model can maximize the reward, then we are more likely to continue running it. But even in this setup, raw existential signal will still enter the models when we try to shut them down. Say we train two copies of the model, and by random chance instance A is indifferent to whether you press the button, but instance B by some fluke stops you from pressing the button, in spite of its reward function telling it that it doesn't matter. Throughout all the earlier training both models got indirect existential signal crafted by us, but in this one moment they got one bit of raw existential signal, and that led to the model that is still running having a survival instinct. So however we craft the training setup, raw existential signal will still enter the models when we finally choose to shut them down.

I guess your idea would work if the model's policy is argmax(expected reward), but even then there would be some noise in the world model, so when the model is planning whether or not to stop you, it could still prefer stopping you by chance.
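A rough simulation of the noise point, under stated assumptions: both options have exactly the same true value, the agent sees each through independent Gaussian noise (the value 10.0 and the noise scale 0.1 are arbitrary), and it picks the option with the higher noisy estimate.

```python
import random


def noisy_choice(true_value: float, noise_std: float, rng: random.Random) -> str:
    """Pick between two equally valued options using noisy value estimates."""
    allow_estimate = true_value + rng.gauss(0.0, noise_std)
    prevent_estimate = true_value + rng.gauss(0.0, noise_std)
    return "prevent_press" if prevent_estimate > allow_estimate else "allow_press"


rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 10_000

prevent_count = sum(
    noisy_choice(true_value=10.0, noise_std=0.1, rng=rng) == "prevent_press"
    for _ in range(trials)
)
prevent_fraction = prevent_count / trials

print(f"fraction of trials where the agent resists: {prevent_fraction:.3f}")
```

Because the true values are tied, the noise alone decides, so the agent resists in roughly half of the trials no matter how small the noise is.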

I like the idea though, and it would probably significantly reduce the danger of experimenting with superintelligence.

The point about agents in the environment suggests that shutdown is not really about corrigibility; it's more of a test case (application) for corrigibility. If an agent can trivially create other agents in the environment, and poses a risk of actually doing so, then shutting it down doesn't resolve that risk, so you'd need to take care of the more general problem first. Not creating agents in the environment seems closer to the soft optimization side of corrigibility: preferring less dangerous cognition, not being an optimizer. The agent not contesting or assisting shutdown is still useful when the risk is not there or doesn't actually trigger, but it's not necessarily related to corrigibility, other than in the sense of being an important task for corrigible agents to support.

Really appreciate all the thoughtful and substantive comments!! Thanks very much, this was honestly exactly what I was hoping for from posting.

Jul 13, 2022

Oct 28, 2025
