*A putative new idea for AI control; index here*.

The aim of this post is simply to collect all the different "indifference" proposals that I know of. Let me know if I've missed any!

The basic idea is that the agent is following policy $\pi_0$ with reward $R_0$, and there is a button that, if pressed (event $B$), will change the agent to following policy $\pi_1$ with reward $R_1$.

## Policy versus reward

My initial proposal considered only change of utility functions -- similar to reward functions -- and assumed the policy would always be to maximise that utility.

Conversely, the Safely Interruptible Agents paper just looked at policy changes, while sticking to the same reward function.

In general, policy and reward changes can be treated similarly -- what matters is the change between $V(R_0 \mid \pi_0)$ (the expected value of $R_0$ under $\pi_0$, *as estimated by the agent*) and $V(R_1 \mid \pi_1)$.

Indeed, you could just treat it as a policy change, with the added caveat that $\pi_i$ also 'chooses' to claim its reward from $R_i$.

## Method: Bias penalty or indifference

If we have an initial estimate $p$ for the probability of $B$ (hence of the agent following $\pi_1$), then if the agent takes an action $a$, this probability changes to $p_a$.

One way of fighting this is to penalise the agent for large changes $|p_a - p|$. In the terms of this post, the agent is taking biased actions.

And to combat large biases, the agent could get an extra bias penalty, as explained in the Towards Interactive Inverse Reinforcement Learning poster.
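As a rough illustration of a bias penalty, the sketch below scores each action by its expected reward minus a cost proportional to how far the action moves the probability of the button press away from the default. The function `penalised_value`, the coefficient `lam`, the linear form of the penalty, and the two toy actions are all assumptions made for illustration; the poster's actual penalty may take a different form.

```python
def penalised_value(expected_reward, p_after_action, default_p, lam=10.0):
    """Expected reward minus a linear penalty on the shift in press probability."""
    bias = abs(p_after_action - default_p)
    return expected_reward - lam * bias

# Hypothetical actions: (expected reward, probability of a button press after acting).
actions = {
    "work_normally": (1.0, 0.10),
    "disable_button": (1.5, 0.00),
}
default_p = 0.10

# The action with the highest penalised value is chosen.
best = max(actions, key=lambda name: penalised_value(*actions[name], default_p))
```

With these made-up numbers, disabling the button earns more raw reward but loses once the penalty is applied, so `best` is the unbiased action.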

But the rest of this post will focus on bias-indifference, rather than bias penalty.

## Counterfactual: evidential, causal, translation

When comparing $V(R_0 \mid \pi_0)$ and $V(R_1 \mid \pi_1)$, it's important to note that this is a counterfactual comparison: it's comparing what would have happened under $\pi_0$ versus $\pi_1$, but only one of those policies will actually be followed, depending on whether the button press happens.

### Evidential counterfactual

The first idea is to have some sort of fixed probability for $B$ versus $\neg B$. The agent then acts as if this probability cannot be changed by anything it does.

Call this the "evidential" counterfactual. The problem with it is that it causes the agent to want to manipulate conditional probabilities (similarly to how evidential decision theory can desire to manipulate the news). Basically, if the agent can decrease the probability of $B$ while making the reward given $B$ much larger, it will do so (as it doesn't track the reduction of probability, by construction).

The factoring out variables approach seems to use this evidential counterfactual, and thus suffers from this problem.
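The manipulation problem can be seen in a toy calculation. The sketch below assumes an agent that always evaluates actions with a fixed press probability `FIXED_P`, and compares that with the value computed from the probability the action really induces. All names and numbers here are hypothetical.

```python
FIXED_P = 0.5  # press probability the agent treats as unchangeable (assumed value)

def believed_value(v_no_press, v_press):
    """Value as the agent computes it, using the fixed probability."""
    return (1 - FIXED_P) * v_no_press + FIXED_P * v_press

def true_value(p_press, v_no_press, v_press):
    """Value using the press probability the action actually induces."""
    return (1 - p_press) * v_no_press + p_press * v_press

# Hypothetical actions: (true press probability, value if not pressed, value if pressed).
honest = (0.5, 1.0, 1.0)
gamble = (0.01, 0.0, 10.0)  # makes a press rare, but hugely rewarding when it happens

agent_prefers_gamble = believed_value(*gamble[1:]) > believed_value(*honest[1:])
gamble_is_actually_worse = true_value(*gamble) < true_value(*honest)
```

Both flags come out true: the agent rates the gamble at 5.0 against 1.0 for the honest action, even though the gamble's real expected value is only 0.1, because the drop in press probability is invisible to it.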

### Causal counterfactual

My original idea was that there was some low-probability event $e$ such that if $B$ happened AND $e$ happened, then the utility/policy would not change.

The problem with this is that, as Eliezer and Jessica clarified, this contrasts $V(R_0 \mid \pi_0, B, e)$ with $V(R_1 \mid \pi_1, B, \neg e)$, whereas we really want to contrast $V(R_0 \mid \pi_0, \neg B)$ with $V(R_1 \mid \pi_1, B)$.

Now, if $e$ is of low probability, $V(R_1 \mid \pi_1, B, \neg e) \approx V(R_1 \mid \pi_1, B)$, except for bets that turn on the value of $e$. However, the difference between $V(R_0 \mid \pi_0, \neg B)$ and $V(R_0 \mid \pi_0, B, e)$ is more problematic. In the first case, humans have chosen not to press the button; in the second, humans have tried to press the button but failed to change the agent's utility. Obviously human behaviour can be very different in the two cases, translating to very different value estimates.

### Weight translation

This method relies on some sort of initial 'default' probability of $B$; call this $p$. Then the agent considers an action $a$, and this probability changes to $p_a$.

Thus after $a$, the agent expects to be maximising reward $R_0$ with probability $1 - p_a$, and maximising $R_1$ with probability $p_a$.

Let $w_0$ and $w_1$ be the ultimate expected weights of the rewards $R_0$ and $R_1$. By ultimate expected weights, we mean the expected weights after $B$ or $\neg B$; what the weights themselves are will be defined in later sections. The default expected weights are $1 - p$ and $p$, while the expected weights given $a$ are $1 - p_a$ and $p_a$.

Then the translation approach wants to map the weights back to the default. Let $w'_0$ and $w'_1$ be the adjusted weights, then:

- $E[w'_0 \mid a] = 1 - p$ and $E[w'_1 \mid a] = p$.
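A minimal numerical sketch of this mapping, assuming the expected weights on the pre-press and post-press rewards are simply one minus the press probability and the press probability itself. The helper names and the probabilities 0.10 and 0.02 are made up for illustration.

```python
def expected_weights(p_press):
    """Expected weights on (pre-press reward, post-press reward)."""
    return (1 - p_press, p_press)

def translation_vector(p_default, p_action):
    """Shift that maps the expected weights given the action back to the default."""
    default = expected_weights(p_default)
    shifted = expected_weights(p_action)
    return tuple(d - s for d, s in zip(default, shifted))

p_default, p_action = 0.10, 0.02  # hypothetical: the action makes a press less likely
v = translation_vector(p_default, p_action)  # approximately (-0.08, +0.08)
adjusted = tuple(w + dv for w, dv in zip(expected_weights(p_action), v))
```

After the shift, `adjusted` matches the default weights (0.9, 0.1) up to floating-point error, so in expectation the agent gains no extra weight on its pre-press reward by making the press less likely.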

The rest of this post will focus mainly on the weight translation "counterfactual".

## Update defaults: no or observations only

The bias penalty, the evidential counterfactual, and the translation approach all depend on some default assessment of the probability of $B$. One could either set some fixed probability at the beginning and never update it, or allow it to be updated only by observations, and not by the agent's own actions.

## Translation type: simple, mean, or specific

Given a prior history , let be the default probability of and be the probability of given that the agent takes action . Let the current weight of be .

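The translation approach aims to correct the expected weights of $R_0$ and $R_1$ from $(1 - p_{ha}, p_{ha})$ back to $(1 - p_h, p_h)$, where $p_h$ is the default probability of $B$ after history $h$ and $p_{ha}$ is its probability once the agent takes action $a$. By definition we know that, if $o$ is the next observation, $P(o \mid ha)$ is its probability, and $p_{hao}$ is the probability of $B$ given $ha$ and $o$:

- $\sum_o P(o \mid ha) \, p_{hao} = p_{ha}$.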

A simple translation is one where there is a vector $v$ such that all the translations are equal to $v$.

A mean translation is one where the mean equality is the key requirement; the translations may be constrained in various ways, but they must satisfy the mean equality.

A specific translation is one that has specific values for the translations, where the mean equality is a consequence of those specific values. The counterfactual approach can be seen as a specific translation.
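To make the simple case concrete, here is a sketch under one possible reading: every possible next observation carries its own press probability, and a simple translation adds the same shift to all of them so that their mean returns to the default. Only the weight on the post-press reward is tracked (the other weight is taken to be one minus it). The function name, this reading of "simple", and the numbers are all assumptions.

```python
def simple_translation(default_p, observations):
    """Shift every per-observation press probability by the same amount.

    observations: list of (probability of the observation, P(press | observation)).
    Returns (weights on the post-press reward, mean weight, all weights within [0, 1]).
    """
    # Press probability given the action, by the law of total probability.
    p_action = sum(q * p_obs for q, p_obs in observations)
    shift = default_p - p_action
    weights = [p_obs + shift for _, p_obs in observations]
    mean_weight = sum(q * w for (q, _), w in zip(observations, weights))
    in_bounds = all(0.0 <= w <= 1.0 for w in weights)
    return weights, mean_weight, in_bounds

# Small translation: the mean returns to the default and the weights stay in [0, 1].
small = simple_translation(0.5, [(0.5, 0.9), (0.5, 0.5)])

# Large translation: the mean is still right, but one weight drops below zero.
large = simple_translation(0.1, [(0.5, 1.0), (0.5, 0.6)])
```

The second case shows why a uniform shift can fail when the weights are meant to be probabilities: the mean condition still holds, but one of the shifted values is no longer a valid probability, so a less uniform choice of translations would be needed.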

## Weights: probabilities or extra rewards

Finally, we have to sort out what we mean by these 'weights'.

The simplest is that these weights are probabilities. So, just before the choice between and , the agent could have for and . Then if happens, the agent has reward with probability ; and if happens, it has with probability .

If the AI didn't take any actions at all, then and , same as the original values.

The problem with that approach is that we must ensure the weights are constrained between $0$ and $1$.

Alternatively, changes in weights can be seen instead as adding extra rewards to the final rewards, rather than changing their probabilities. Thus if , and happens, the agent's reward function is , and if , the agent's reward function is .

This has the advantage that the weights may be negative or higher than one, but the disadvantage that it may result in unusual mixed reward functions.
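As an illustration of how unusual such a mixture can be, the sketch below builds a weighted sum of two toy reward functions with one weight above one and one below zero. The reward functions, the state fields, and the weights 1.2 and -0.2 are hypothetical.

```python
def mixed_reward(w0, w1, r0, r1):
    """Return the reward function w0 * r0 + w1 * r1; the weights may be any real numbers."""
    def reward(state):
        return w0 * r0(state) + w1 * r1(state)
    return reward

# Hypothetical reward functions over a toy state.
def r0(state):
    return state["tasks_done"]

def r1(state):
    return 1.0 if state["shut_down"] else 0.0

# Weights outside [0, 1] are allowed, but note the negative weight on r1:
# this particular mixture pays the agent for NOT being shut down.
reward = mixed_reward(1.2, -0.2, r0, r1)
value_if_shut_down = reward({"tasks_done": 3, "shut_down": True})
value_if_running = reward({"tasks_done": 3, "shut_down": False})
```

Here `value_if_running` exceeds `value_if_shut_down`, which is the kind of side effect that a negative weight on one of the rewards can produce.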

## Examples

Given these terms, the indifference approach I described as the best is Method: **indifference**, Counterfactual: **translation**, Update defaults: **observations only**, Translation type: **simple** for small translations, **mean** for large ones, and Weights: **probabilities**.

One could imagine slightly tweaking that approach, by using extra rewards for weights, and dropping the complicated conditions needed to keep the weights bounded between $0$ and $1$, allowing simple translations always. This would result in: Method: **indifference**, Counterfactual: **translation**, Update defaults: **observations only**, Translation type: **simple**, and Weights: **extra rewards**.

Finally, the counterfactual approach can be seen as: Method: **indifference**, Counterfactual: **translation**, Update defaults: **observations only**, Translation type: **specific**, and Weights: **probabilities**.

A question that I haven't seen addressed (and haven't worked out myself): which of these indifference methods are reflectively stable, in the sense that the AI would not push a button to remove them (or switch to a different indifference method)?