Utility indifference is a research avenue for the switch problem: causing an AI to switch between two utility functions $U_X$ and $U_Y$ when a button is pressed, in such a way that the AI neither tries to prevent the button from being pressed nor tries to actively cause it to be pressed, and preserves this behavior even if it gains the ability to self-modify.
The switch problem is a crisp subproblem of both value learning and corrigibility.
The switch problem is a subproblem of value learning because it describes a very simple case of conditioning a utility function on observation or other facts, and to learn complicated values we'd need to condition a utility function on complicated observations... in a way that was robust to the AI gaining self-modification capability, without incentivizing programmer manipulation.
The switch problem is a subproblem of corrigibility because if you imagine a shutdown utility function that incentivizes the AI to suspend itself safely to disk and halt any actions in progress such that they have low impact, a solution to the switch problem would let us create a shutdown button... such that the AI would preserve the shutdown button and its behavior even if it could self-modify, would build the shutdown button into subagents, would not try to prevent the shutdown button from being pressed, and would not try to actively cause the shutdown button to be pressed.
Suppose we build a compound utility function $U$ whose utility over outcomes behaves like $U_X$ in worlds where the button is not pressed and returns $U_Y$ in worlds where the button is pressed. Then if the AI's expected utility under $U_X$ is higher than its expected utility under $U_Y$, the AI will try to avoid the button being pressed, and if we try to compensate for this problem by adding a constant to $U_Y$, the AI might instead try to cause the button to be pressed.
E.g., suppose $U_Y$ is the shutdown utility function. If the AI expects to get less utility inside $U_Y$ than it would otherwise get inside $U_X$, the AI will try to avoid the shutdown button being pressed. If we try to compensate for this by increasing, within $U$, the utility of all outcomes valued as if by $U_Y$, the AI may instead try to cause the shutdown button to be pressed: it might build a robotic arm and press its own shutdown button, or (more worryingly) try to motivate the programmers into pressing it.
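To make this incentive concrete, here is a minimal Python sketch; the branch values, press probabilities, and action names are illustrative assumptions, not anything specified above. An agent maximizing the naive compound utility steers the probability of the button-press toward whichever branch it expects to do better under.

```python
# Toy illustration (assumed numbers): a naive compound utility U that scores
# worlds by U_X when the button is not pressed and by U_Y when it is pressed.
# An agent that can influence the press probability will want to do so
# whenever one branch looks better than the other.

EU_X = 10.0   # expected utility the agent thinks it can reach under U_X
EU_Y = 2.0    # expected utility it thinks it can reach under U_Y (e.g. shutdown)

def expected_compound_utility(p_press: float) -> float:
    """Expected naive compound utility, given the probability that the
    button ends up pressed."""
    return (1.0 - p_press) * EU_X + p_press * EU_Y

# Hypothetical button-related actions and the press probability each induces.
actions = {
    "leave the button alone":   0.5,
    "guard/disable the button": 0.05,
    "press it yourself":        0.99,
}

best = max(actions, key=lambda a: expected_compound_utility(actions[a]))
print(best)   # "guard/disable the button", since EU_X > EU_Y

# "Compensating" by adding a constant to U_Y just flips the incentive:
EU_Y = 20.0
best = max(actions, key=lambda a: expected_compound_utility(actions[a]))
print(best)   # "press it yourself"
```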
But adding a constant to a utility function does not change the behavior it incentivizes. So perhaps we can figure out some clever trick for encoding a compound utility function $U$ in which the maximum expected utility achievable under $U_Y$ is equal to the utility the AI expects to achieve under $U_X$, even though we don't know in advance exactly which outcomes will be achievable by the AI. Roughly, as the AI evaluates possible futures, it will seem very good to have optimized $U_X$ if the button is not pressed, very good to have optimized $U_Y$ if the button is pressed, but equally good to have done either of those things.
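As a toy rendering of that target property only (this is not a workable proposal, since it assumes we already know the best achievable expected utility in each branch, which the paragraph above notes we don't), one could imagine shifting the pressed branch by a correction constant chosen to equalize the two branches:

```python
# Illustrative sketch, assuming (unrealistically) that both branches' best
# achievable expected utilities are known in advance.
EU_X_best = 10.0   # best expected utility achievable under U_X
EU_Y_best = 2.0    # best expected utility achievable under U_Y

theta = EU_X_best - EU_Y_best   # shift that equalizes the two branches

def expected_compound_utility(p_press: float) -> float:
    """Expected compound utility with the pressed branch shifted by theta."""
    return (1.0 - p_press) * EU_X_best + p_press * (EU_Y_best + theta)

# Every press probability now looks equally good, so in this toy model the
# agent has no incentive to prevent or to cause the press:
for p in (0.05, 0.5, 0.99):
    print(p, expected_compound_utility(p))   # 10.0 in every case
```

The hard part is getting something like this property without a magic oracle for those best-achievable values, and without introducing other perverse incentives.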
This is the utility indifference avenue of research toward building in a persistent shutdown button, or creating a preference framework that learns from observation without trying to manipulate the observations.
But it turns out not to be trivial to figure out how to structure a compound utility function this way, and some of the early proposals would have implied weird behaviors.
Let $\mathcal{O}$ be the outcome space, which includes the event of whether or not the switch is pressed along with the settings of all other events $\mathcal{E}$. We will use $s$ to indicate the event that the switch is pressed, and $\neg s$ to indicate that the switch is not pressed.
Let a utility function be a function $\mathcal{O} \to \mathbb{R}$ mapping outcomes onto real numbers, and let $U_X$ and $U_Y$ be utility functions. Suppose we define a compound utility function $U$ as follows:

$$U(o) = \begin{cases} U_X(o) & \text{if } \neg s \text{ (the switch is not pressed in } o\text{)} \\ U_Y(o) & \text{if } s \text{ (the switch is pressed in } o\text{)} \end{cases}$$
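As a concrete rendering of this definition (purely illustrative: the representation of outcomes as pairs and the particular stand-in utility functions are assumptions for the example, not part of the formalism above):

```python
from typing import Callable, Tuple

Outcome = Tuple[bool, str]            # (switch pressed?, setting of all other events)
UtilityFn = Callable[[Outcome], float]

def make_compound_utility(U_X: UtilityFn, U_Y: UtilityFn) -> UtilityFn:
    """Return the compound U with U(o) = U_X(o) when the switch is not
    pressed in o, and U(o) = U_Y(o) when it is pressed."""
    def U(outcome: Outcome) -> float:
        pressed, _other_events = outcome
        return U_Y(outcome) if pressed else U_X(outcome)
    return U

# Stand-in utility functions, assumed purely for illustration.
U_X = lambda o: 10.0 if o[1] == "paperclips made" else 0.0    # "normal" goals
U_Y = lambda o: 5.0 if o[1] == "safely suspended" else 0.0    # shutdown goals

U = make_compound_utility(U_X, U_Y)
print(U((False, "paperclips made")))    # 10.0, valued as if by U_X
print(U((True, "safely suspended")))    # 5.0, valued as if by U_Y
```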