Utility indifference is a research avenue for the switch problem: causing an AI to switch between two utility functions $U_X$ and $U_Y$ when a button is pressed, in such a way that the AI neither tries to prevent the button from being pressed nor tries to actively cause it to be pressed, and preserves this behavior even if it gains the ability to self-modify.
The switch problem is a crisp subproblem of both value learning and corrigibility.
The switch problem is a subproblem of value learning because it describes a very simple case of conditioning a utility function on observation or other facts, and to learn complicated values we'd need to condition a utility function on complicated observations... in a way that was robust to the AI gaining self-modification capability, without incentivizing programmer manipulation.
The switch problem is a subproblem of corrigibility because if you imagine a shutdown utility function that incentivizes the AI to suspend itself safely to disk and halt any actions in progress such that they have low impact, a solution to the switch problem would let us create a shutdown button... such that the AI would preserve the shutdown button and its behavior even if it could self-modify, would build the shutdown button into subagents, would not try to prevent the shutdown button from being pressed, and would not try to actively cause the shutdown button to be pressed.
Suppose we build a compound utility function $U$ whose utility over outcomes behaves like $U_X$ in worlds where the button is not pressed and returns $U_Y$ in worlds where the button is pressed. Then if the AI's expected utility under $U_X$ is higher than its expected utility under $U_Y$, the AI will try to avoid the button being pressed, and if we try to compensate for this problem by adding a constant to $U_Y$, the AI might instead try to cause the button to be pressed.
E.g., suppose $U_Y$ is the shutdown utility function. If the AI expects to get less utility inside $U_Y$ than it would otherwise get inside $U_X$, the AI will try to avoid the shutdown button being pressed. If we try to compensate for this by increasing, within $U$, the utility of all outcomes valued as if by $U_Y$, the AI may instead try to cause the shutdown button to be pressed: it might build a robotic arm and press its own shutdown button, or (more worryingly) try to motivate the programmers into pressing it.
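To make this incentive concrete, here is a minimal Python sketch; the branch values, press probabilities, and action names are illustrative assumptions, not anything specified above. An agent maximizing the naive compound utility steers the probability of the button-press toward whichever branch it expects to do better under.

```python
# Toy illustration (assumed numbers): a naive compound utility U that scores
# worlds by U_X when the button is not pressed and by U_Y when it is pressed.
# An agent that can influence the press probability will want to do so
# whenever one branch looks better than the other.

EU_X = 10.0   # expected utility the agent thinks it can reach under U_X
EU_Y = 2.0    # expected utility it thinks it can reach under U_Y (e.g. shutdown)

def expected_compound_utility(p_press: float) -> float:
    """Expected naive compound utility, given the probability that the
    button ends up pressed."""
    return (1.0 - p_press) * EU_X + p_press * EU_Y

# Hypothetical button-related actions and the press probability each induces.
actions = {
    "leave the button alone":   0.5,
    "guard/disable the button": 0.05,
    "press it yourself":        0.99,
}

best = max(actions, key=lambda a: expected_compound_utility(actions[a]))
print(best)   # "guard/disable the button", since EU_X > EU_Y

# "Compensating" by adding a constant to U_Y just flips the incentive:
EU_Y = 20.0
best = max(actions, key=lambda a: expected_compound_utility(actions[a]))
print(best)   # "press it yourself"
```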
But adding a constant to a utility function does not change the behavior it incentivizes. So perhaps we can figure out some clever trick for encoding a compound utility function $U$ in which the maximum expected utility achievable under $U_Y$ is equal to the utility the AI expects to achieve under $U_X$, even though we don't know in advance exactly which outcomes will be achievable by the AI. Roughly, as the AI evaluates possible futures, it will seem very good to have optimized $U_X$ if the button is not pressed, very good to have optimized $U_Y$ if the button is pressed, but equally good to have done either of those things.
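As a toy rendering of that target property only (this is not a workable proposal, since it assumes we already know the best achievable expected utility in each branch, which the paragraph above notes we don't), one could imagine shifting the pressed branch by a correction constant chosen to equalize the two branches:

```python
# Illustrative sketch, assuming (unrealistically) that both branches' best
# achievable expected utilities are known in advance.
EU_X_best = 10.0   # best expected utility achievable under U_X
EU_Y_best = 2.0    # best expected utility achievable under U_Y

theta = EU_X_best - EU_Y_best   # shift that equalizes the two branches

def expected_compound_utility(p_press: float) -> float:
    """Expected compound utility with the pressed branch shifted by theta."""
    return (1.0 - p_press) * EU_X_best + p_press * (EU_Y_best + theta)

# Every press probability now looks equally good, so in this toy model the
# agent has no incentive to prevent or to cause the press:
for p in (0.05, 0.5, 0.99):
    print(p, expected_compound_utility(p))   # 10.0 in every case
```

The hard part is getting something like this property without a magic oracle for those best-achievable values, and without introducing other perverse incentives.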
This is the utility indifference avenue of research toward building in a persistent shutdown button, or creating a preference framework that learns from observation without trying to manipulate the observations.
But it turns out not to be trivial to figure out how to structure a compound utility function this way, and some of the early proposals would have implied weird behaviors.
Let $\mathcal{O}$ be the outcome space, which includes the event of whether or not the switch is pressed along with the settings of all other events $\mathcal{E}$. We will use $s$ to indicate the event that the switch is pressed, and $\neg s$ to indicate that the switch is not pressed.
Let a utility function be a function $\mathcal{O} \to \mathbb{R}$ mapping outcomes onto real numbers, and let $U_X$ and $U_Y$ be utility functions. Suppose we define a compound utility function $U$ as follows:

$$U(o) = \begin{cases} U_X(o) & \text{if } \neg s \text{ (the switch is not pressed in } o\text{)} \\ U_Y(o) & \text{if } s \text{ (the switch is pressed in } o\text{)} \end{cases}$$
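As a concrete rendering of this definition (purely illustrative: the representation of outcomes as pairs and the particular stand-in utility functions are assumptions for the example, not part of the formalism above):

```python
from typing import Callable, Tuple

Outcome = Tuple[bool, str]            # (switch pressed?, setting of all other events)
UtilityFn = Callable[[Outcome], float]

def make_compound_utility(U_X: UtilityFn, U_Y: UtilityFn) -> UtilityFn:
    """Return the compound U with U(o) = U_X(o) when the switch is not
    pressed in o, and U(o) = U_Y(o) when it is pressed."""
    def U(outcome: Outcome) -> float:
        pressed, _other_events = outcome
        return U_Y(outcome) if pressed else U_X(outcome)
    return U

# Stand-in utility functions, assumed purely for illustration.
U_X = lambda o: 10.0 if o[1] == "paperclips made" else 0.0    # "normal" goals
U_Y = lambda o: 5.0 if o[1] == "safely suspended" else 0.0    # shutdown goals

U = make_compound_utility(U_X, U_Y)
print(U((False, "paperclips made")))    # 10.0, valued as if by U_X
print(U((True, "safely suspended")))    # 5.0, valued as if by U_Y
```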