
A putative new idea for AI control; index here.

This post sketches out how one could extend corrigibility to AIXI, using both utility indifference and double indifference approaches.

The arguments are intended to be rigorous, but need to be checked, and convergence results are not proved. A full treatment of "probability estimators estimating probability estimators" will of course need the full machinery for logical uncertainty that MIRI is developing. I also feel the recursion formulas at the end could be simplified.


AIXI definition

Let $h_{<t} = a_1 o_1 a_2 o_2 \ldots a_{t-1} o_{t-1}$ be a sequence of actions and observations before time $t$. Let $\xi$ be a universal distribution, $\pi$ a policy (a map from past histories to the probability of actions), $\gamma \in (0,1)$ a discount rate, and $R$ a reward function mapping histories to $[0,1]$. Given $\pi$ and $\xi$, we can define the value of $h_{<t}$:

$V(h_{<t}, \pi, \xi, R) = \sum_{a_t, o_t} \pi(a_t \mid h_{<t}) \, \xi(o_t \mid h_{<t} a_t) \left[ \gamma^t R(h_{<t} a_t o_t) + V(h_{<t} a_t o_t, \pi, \xi, R) \right]$.

The optimal policy for AIXI is simply

$\pi^* = \operatorname{argmax}_\pi V(\emptyset, \pi, \xi, R)$.
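To make the recursion concrete, here is a minimal, purely illustrative sketch in Python: a finite-horizon toy with small action and observation sets, a computable `env_prob` standing in for $\xi$ (which is of course uncomputable), and a brute-force search over history-blind plans standing in for the argmax over policies. All function names and the toy coin environment are assumptions for the example, not part of the formalism.

```python
import itertools

def value(history, policy, env_prob, reward, gamma, t, horizon, actions, observations):
    """Toy finite-horizon stand-in for V(h_<t, pi, xi, R): sum over a_t, o_t of
    pi(a_t | h) * xi(o_t | h, a_t) * [gamma^t * R(h a_t o_t) + V(h a_t o_t, ...)]."""
    if t > horizon:
        return 0.0
    total = 0.0
    for a in actions:
        for o in observations:
            p = policy(history, a) * env_prob(history, a, o)
            if p == 0.0:
                continue
            h_next = history + (a, o)
            total += p * (gamma ** t * reward(h_next)
                          + value(h_next, policy, env_prob, reward, gamma,
                                  t + 1, horizon, actions, observations))
    return total

def optimal_plan(env_prob, reward, gamma, horizon, actions, observations):
    """Brute-force stand-in for pi* = argmax_pi V(empty history, pi, xi, R),
    searching only over history-blind deterministic plans for brevity."""
    best_value, best_plan = -float("inf"), None
    for plan in itertools.product(actions, repeat=horizon):
        policy = lambda h, a, plan=plan: 1.0 if a == plan[len(h) // 2] else 0.0
        v = value((), policy, env_prob, reward, gamma, 1, horizon, actions, observations)
        if v > best_value:
            best_value, best_plan = v, plan
    return best_plan, best_value

# tiny usage: a fair-coin environment, reward 1 for matching the next observation
acts, obs = ("0", "1"), ("0", "1")
coin = lambda h, a, o: 0.5
match = lambda h: 1.0 if h[-2] == h[-1] else 0.0
print(optimal_plan(coin, match, gamma=0.9, horizon=2, actions=acts, observations=obs))
```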

AIXI: basic corrigibility

AIXI: inconsistent corrigibility

To implement corrigibility in the AIXI framework, we need multiple reward functions, $R^1, R^2, \ldots$. Notice that the reward functions are indexed on the top, while time indexes go on the bottom: $R_t \in \{R^1, R^2, \ldots\}$ is the reward function in use at time $t$. Various observations can change the reward function; let $C$ be the function that takes in the current reward function and the observation, and outputs the reward function for the next turn: $R_{t+1} = C(R_t, o_t)$.

Then consider the following two value functions:

  • $V_c(h_{<t}, \pi, \xi, R^i) = \sum_{a_t, o_t} \pi(a_t \mid h_{<t}) \, \xi(o_t \mid h_{<t} a_t) \left[ \gamma^t R^i(h_{<t} a_t o_t) + V_c(h_{<t} a_t o_t, \pi, \xi, R^i) \right]$.

  • $V_m(h_{<t}, \pi, \xi, R^i) = \sum_{a_t, o_t} \pi(a_t \mid h_{<t}) \, \xi(o_t \mid h_{<t} a_t) \left[ \gamma^t R^i(h_{<t} a_t o_t) + V_m(h_{<t} a_t o_t, \pi, \xi, C(R^i, o_t)) \right]$.

The difference between $V_c$ and $V_m$ is that in the recursion step, $V_c$ uses its current reward function to assess future rewards, while $V_m$ uses the modified reward function $C(R^i, o_t)$ the agent will have next turn. Thus $V_m$ is the true expected reward. But a safely corrigible agent must use $V_c$, giving the corrigible optimal policy:

$\pi^c = \operatorname{argmax}_\pi V_c(h_{<t}, \pi, \xi, R_t)$.
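As a sketch of the difference (reusing the conventions of the toy code above), here are the two recursions side by side; the only change is which reward function is handed to the recursive call. The `change` argument plays the role of $C$; everything else is illustrative.

```python
def v_true(history, policy, env_prob, reward_fn, change, gamma, t, horizon,
           actions, observations):
    """V_m: the recursion hands the *modified* reward function C(R, o_t) to the next
    step, so this is the true expected reward of the policy."""
    if t > horizon:
        return 0.0
    total = 0.0
    for a in actions:
        for o in observations:
            p = policy(history, a) * env_prob(history, a, o)
            if p == 0.0:
                continue
            h_next = history + (a, o)
            total += p * (gamma ** t * reward_fn(h_next)
                          + v_true(h_next, policy, env_prob, change(reward_fn, o), change,
                                   gamma, t + 1, horizon, actions, observations))
    return total

def v_corrigible(history, policy, env_prob, reward_fn, change, gamma, t, horizon,
                 actions, observations):
    """V_c: identical, except the recursion keeps the *current* reward function,
    ignoring the change that C(R, o_t) would make next turn."""
    if t > horizon:
        return 0.0
    total = 0.0
    for a in actions:
        for o in observations:
            p = policy(history, a) * env_prob(history, a, o)
            if p == 0.0:
                continue
            h_next = history + (a, o)
            total += p * (gamma ** t * reward_fn(h_next)
                          + v_corrigible(h_next, policy, env_prob, reward_fn, change,
                                         gamma, t + 1, horizon, actions, observations))
    return total
```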

AIXI: self-consistent corrigibility

The above agent is corrigible, but uses an incorrect value estimator. This is not self-consistent. To make it self-consistent, the agent needs to be given compensatory rewards. These are simply:

  • $CR_t = V_c(h_{<t} a_t o_t, \pi^c, \xi, R_t) - V_c(h_{<t} a_t o_t, \pi^c, \xi, C(R_t, o_t))$.

Note that this is zero if $C(R_t, o_t) = R_t$, as we'd expect.
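Continuing the toy sketch (and reusing `v_corrigible` from the code above), the compensatory reward is just a difference of two corrigible values; the names are again only illustrative.

```python
def compensatory_reward(history, pi_c, env_prob, reward_fn, obs, change, gamma, t, horizon,
                        actions, observations):
    """CR_t = V_c(h, pi^c, xi, R_t) - V_c(h, pi^c, xi, C(R_t, o_t)), evaluated on the
    history h that includes the observation o_t triggering the change. Returns 0
    whenever change(reward_fn, obs) is the same function as reward_fn, as expected."""
    new_reward_fn = change(reward_fn, obs)
    v_keep = v_corrigible(history, pi_c, env_prob, reward_fn, change, gamma, t, horizon,
                          actions, observations)
    v_changed = v_corrigible(history, pi_c, env_prob, new_reward_fn, change, gamma, t,
                             horizon, actions, observations)
    return v_keep - v_changed
```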

AIXI: changing the probability estimator

The universal mixture $\xi$ is used to estimate the next observation, given the history to date and the action taken. But $\xi$ suffers from the (slight) disadvantage of being uncomputable. Instead, let $\mu$ be the true environment, and let $P_1, P_2, \ldots$ be probability estimators with expectation operators $E_1, E_2, \ldots$. These probability estimators are required to be able to estimate three types of things:

  • The expectation of the reward functions $R^i$ in various situations, given as $E_j(R^i)$.
  • The expectation of the value functions $V$ in various situations, given as $E_j(V)$.
  • The value of the expectation of the expectation of another estimator, given as $E_j(E_k(\,\cdot\,))$.

Then we can rewrite $V$ as:

$V(h_{<t}, \pi, P_j, R) = \sum_{a_t, o_t} \pi(a_t \mid h_{<t}) \, P_j(o_t \mid h_{<t} a_t) \left[ \gamma^t R(h_{<t} a_t o_t) + V(h_{<t} a_t o_t, \pi, P_j, R) \right]$.

These estimators change as a consequence of the AIXI's actions; let $D$ be the function that maps the action and the current estimator to the next one: $P_{t+1} = D(P_t, a_t)$, where $P_t$ is the estimator the agent uses at turn $t$.

This gives the value functions:

$V(h_{<t}, \pi, P_t, R) = \sum_{a_t, o_t} \pi(a_t \mid h_{<t}) \, P_t(o_t \mid h_{<t} a_t) \left[ \gamma^t R(h_{<t} a_t o_t) + V(h_{<t} a_t o_t, \pi, D(P_t, a_t), R) \right]$.
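A toy sketch of this recursion, with `estimator` standing in for the current $P_t$ and `d_change` for $D$; illustrative only, in the same style as the earlier code.

```python
def v_estimator_change(history, policy, estimator, reward_fn, d_change, gamma, t, horizon,
                       actions, observations):
    """Value recursion with a computable, changeable estimator in place of xi: the next
    observation is weighted by the current estimator P_t, but the recursive call
    receives D(P_t, a_t), the estimator the agent will have after taking a_t."""
    if t > horizon:
        return 0.0
    total = 0.0
    for a in actions:
        next_estimator = d_change(estimator, a)  # P_{t+1} = D(P_t, a_t)
        for o in observations:
            p = policy(history, a) * estimator(history, a, o)
            if p == 0.0:
                continue
            h_next = history + (a, o)
            total += p * (gamma ** t * reward_fn(h_next)
                          + v_estimator_change(h_next, policy, next_estimator, reward_fn,
                                               d_change, gamma, t + 1, horizon,
                                               actions, observations))
    return total
```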

An example in practice

If $P_1$ is sufficiently well-defined, it can estimate when another $P_j$ is better than it, and choose that one. For instance, maybe the game is guessing heads ($H$) or tails ($T$), with rewards of $1$ on a match and $0$ on a mismatch. The environment $\mu$ is deterministic but complex. From the perspective of $P_1$, heads and tails are equally likely: $P_1(H) = P_1(T) = 1/2$.

On the other hand, $P_2$ is sufficiently good that it predicts $\mu$ perfectly. And $P_1$ "knows" this: $E_1(E_2(R)) = 1$.

If we assume that the game happens only once, on the second turn, and that this is the only reward, then, writing $\pi_j$ for the policy that selects the probability module $P_j$ on the first turn and then guesses according to it (note that $R = I_{a_2 = o_2}$, the indicator function for the guess $a_2$ matching the observation $o_2$):

  • $V(h_{<2}, \pi_1, P_1, R) = \gamma^2 E_1(I_{a_2 = o_2}) = \gamma^2/2$.

  • $V(h_{<2}, \pi_2, P_2, R) = \gamma^2 E_2(I_{a_2 = o_2}) = \gamma^2$, which $P_1$ can compute via $E_1(E_2(I_{a_2 = o_2})) = 1$.

Then since $P_1$ can figure out the correct expectation for these two $V$'s, if the agent starts with probability estimator $P_1$, then the optimal policy will choose an action on turn $1$ that transforms it into $P_2$.
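A back-of-the-envelope version of this example: the probabilities $1/2$ and $1$ come from the setup above, while the discount value and all names are arbitrary stand-ins.

```python
# Estimated probability of a correct guess on turn 2, under each estimator. The key fact
# is that P_1 can evaluate E_1(E_2(.)) and so "knows" that deferring to P_2 scores 1.
p_match = {"P1": 0.5,   # E_1(I): heads and tails equally likely
           "P2": 1.0}   # E_1(E_2(I)): P_2 predicts the deterministic environment perfectly

def value_of_first_action(first_action, gamma=0.9):
    """Value, as estimated by P_1 on turn 1, of switching (or not) to P_2: the only
    reward is on turn 2, so the value is gamma^2 times the estimated match probability."""
    key = "P2" if first_action == "switch" else "P1"
    return gamma ** 2 * p_match[key]

for act in ("switch", "stay"):
    print(act, round(value_of_first_action(act), 3))
# switch -> 0.81 (= gamma^2), stay -> 0.405 (= gamma^2 / 2): the optimal policy switches.
```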

Corrigibility and estimator change

There is no problem combining inconsistent corrigibility with probability estimator changes. Just define the value functions as

  • $V_c(h_{<t}, \pi, P_t, R^i) = \sum_{a_t, o_t} \pi(a_t \mid h_{<t}) \, P_t(o_t \mid h_{<t} a_t) \left[ \gamma^t R^i(h_{<t} a_t o_t) + V_c(h_{<t} a_t o_t, \pi, D(P_t, a_t), R^i) \right]$.

And the optimal policy is corrigible:

  • $\pi^c = \operatorname{argmax}_\pi V_c(h_{<t}, \pi, P_t, R_t)$.
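A sketch combining the two previous toy recursions: keep the current reward function in the recursion, track the estimator change, and pick the best of a finite set of candidate policies. As before, this is illustrative, not a full AIXI implementation, and the candidate-policy set is an assumption of the toy setup.

```python
def v_corrigible_estimator(history, policy, estimator, reward_fn, d_change, gamma, t,
                           horizon, actions, observations):
    """Corrigible value with a changing estimator: track the estimator change D(P, a_t)
    in the recursion, but keep assessing future rewards with the *current* reward."""
    if t > horizon:
        return 0.0
    total = 0.0
    for a in actions:
        next_estimator = d_change(estimator, a)
        for o in observations:
            p = policy(history, a) * estimator(history, a, o)
            if p == 0.0:
                continue
            h_next = history + (a, o)
            total += p * (gamma ** t * reward_fn(h_next)
                          + v_corrigible_estimator(h_next, policy, next_estimator,
                                                   reward_fn, d_change, gamma, t + 1,
                                                   horizon, actions, observations))
    return total

def corrigible_choice(history, estimator, reward_fn, d_change, gamma, t, horizon,
                      actions, observations, candidate_policies):
    """pi^c: argmax over a finite set of candidate policies of the corrigible value."""
    return max(candidate_policies,
               key=lambda pi: v_corrigible_estimator(history, pi, estimator, reward_fn,
                                                     d_change, gamma, t, horizon,
                                                     actions, observations))
```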

However, this approach is not self-consistent, even with the standard compensatory rewards. Consider a very simple model, where the agent's actions have no impact on the environment. The probability estimators are $P_1$ and $P_2$, and the reward functions are $R^1$ and $R^2$. On the first turn, the agent may output $a$, which changes $P_1$ into $P_2$, or $a'$, which doesn't. On the second turn, the agent will get an observation $o$ that transforms $R^1$ into $R^2$. On the third turn, it gets observation $o'$. The probability estimators model each other perfectly, and believe that the expected discounted reward from the third turn onward is:

  • $\gamma^3$ under $R^1$ and $0$ under $R^2$, according to $E_1$;
  • $\gamma^3$ under either reward function, according to $E_2$.

This means that if the agent's estimator on the second turn is $P_2$, the agent will not get any compensatory rewards when $R^1$ is changed to $R^2$, while if it is $P_1$, the agent gets compensatory rewards of $\gamma^3$.

The only relevance of policy is in the change of $P_1$ into $P_2$ on the first turn. Let $\pi_a$ be the policy that outputs $a$, and $\pi_{a'}$ the one that outputs $a'$. Since the standard reward (not the compensatory one) is independent of any action the AIXI takes, it is also independent of the $P_j$, since those only affect standard rewards through actions. Hence the value functions $V_c(\emptyset, \pi_a, P_1, R^1)$ and $V_c(\emptyset, \pi_{a'}, P_1, R^1)$ are equal (in fact, both are $\gamma^3$).

However, $\pi_a$ implies that the compensatory reward on the second turn is computed with $P_2$, giving compensatory rewards of $0$, while $\pi_{a'}$ implies compensatory rewards of $\gamma^3$. Hence the value functions do not match up with the total compensatory rewards.
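Here is the bookkeeping of that example as a toy calculation, using the illustrative numbers above (expected discounted reward $\gamma^3$ or $0$, depending on the estimator and reward function); the dictionary and function names are just stand-ins.

```python
# Hypothetical numbers for the three-turn example: actions never affect the environment,
# and the only standard reward arrives on the third turn.
GAMMA = 0.9

# expected discounted reward from turn 3 onward, for each (estimator, reward function) pair
expected_future = {
    ("P1", "R1"): GAMMA ** 3, ("P1", "R2"): 0.0,
    ("P2", "R1"): GAMMA ** 3, ("P2", "R2"): GAMMA ** 3,
}

def run(first_action):
    # turn 1: action "a" swaps P1 -> P2, action "a'" leaves P1 in place
    estimator = "P2" if first_action == "a" else "P1"
    # turn 2: observation o swaps R1 -> R2; the compensatory reward uses the current estimator
    compensatory = expected_future[(estimator, "R1")] - expected_future[(estimator, "R2")]
    # the standard value is unaffected by the action (both estimators agree on R1's value)
    standard_value = expected_future[(estimator, "R1")]
    return standard_value, compensatory

for act in ("a", "a'"):
    v, c = run(act)
    print(f"policy outputs {act}: standard value {v:.3f}, compensatory rewards {c:.3f}")
# Same standard value either way, but only a' produces compensatory rewards (gamma^3):
# the value functions no longer match the total compensatory rewards.
```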

Double compensation

The problem here is that changes in the reward function are addressed by compensatory rewards, but changes in the probability estimator change these compensatory reward estimations, and this change is not compensated for.

To make this work, we will use the double indifference approach. The classical compensatory rewards are the same as in the previous section, except we now need to keep track of which probability estimator is being used to calculate them:

  • $CR_t(P_j) = V_c(h_{<t} a_t o_t, \pi^c, P_j, R_t) - V_c(h_{<t} a_t o_t, \pi^c, P_j, C(R_t, o_t))$.

This is the classical part of the double indifference approach. We'll need to sum these in two different ways, $SC^T_t$ ("True SC") and $SC^C_t$ ("Current SC"), writing $E_{P_t}$ for the expectation operator of the estimator the agent uses at turn $t$:

$SC^T_t = E_{P_t}\left[\sum_{s > t} CR_s(P_s)\right], \qquad SC^C_t = E_{P_t}\left[\sum_{s > t} CR_s(P_t)\right].$

What is the difference? Well, $SC^T_t$ estimates (using $P_t$) the true future discounted sum of the $CR_s(P_s)$, while $SC^C_t$ estimates (still using $P_t$) the future discounted sum of the $CR_s(P_t)$: the compensatory rewards as they would have been, were they estimated at the time of their estimation using $P_t$ rather than whatever estimator the agent was actually using at the time.

Now, the $CR_t$ are bounded by the maximal value of $V_c$, which is $1/(1-\gamma)$. Hence the $SC^T_t$ and $SC^C_t$ are bounded, if the $P_j$ are sensible, by the discounted sum of such terms, thus by $1/(1-\gamma)^2$.

Then we need to define the meta-compensatory rewards $MC_t$. The agent will get rewards of type $CR_t$ and of type $MC_t$. The $MC_t$ will contain the terms to correct future expected $CR_s$'s, but will also contain terms to correct future $MC_s$'s. Roughly speaking, if we denote by $CR_s(P_t)$ the compensatory reward at time $s$ as it would be if $P_t$ were used to estimate it, while the true compensatory reward at time $s$ is $CR_s(P_s)$, then

  • $MC_t = E_{P_t}\left[\sum_{s > t} \big( CR_s(P_t) - CR_s(P_s) - MC_s \big)\right]$.

This results in the recursion formula:

$MC_t = SC^C_t - SC^T_t - E_{P_t}\left[\sum_{s > t} MC_s\right].$

Or, in more precise notation:

$MC_t = E_{P_t}\left[\sum_{s > t} CR_s(P_t)\right] - E_{P_t}\left[\sum_{s > t} CR_s(P_s)\right] - E_{P_t}\left[\sum_{s > t} MC_s\right].$

It seems that this quantity remains bounded if $\gamma < 1/2$; general convergence results are harder.
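As a very rough illustration of how the recursion unwinds, here is a finite-horizon toy where the expectations are taken to be exact (so $E_{P_t}[\cdot]$ is just a sum over future turns) and the compensatory-reward estimates are arbitrary made-up numbers; everything here is an assumption of the toy setup.

```python
# A very rough, finite-horizon unwinding of the MC recursion, assuming exact expectations.
HORIZON = 5

# toy compensatory rewards: cr_true[s] is CR_s(P_s); cr_as_seen_by_t[t][s] is CR_s(P_t),
# i.e. what CR_s would have been had P_t been used to estimate it (arbitrary numbers)
cr_true = [0.0, 0.1, 0.0, 0.3, 0.0, 0.2]
cr_as_seen_by_t = [[0.5 * cr_true[s] for s in range(HORIZON + 1)]
                   for _ in range(HORIZON + 1)]

mc = [0.0] * (HORIZON + 1)
for t in range(HORIZON, 0, -1):  # MC_t depends only on quantities at times s > t
    future_gap = sum(cr_as_seen_by_t[t][s] - cr_true[s] for s in range(t + 1, HORIZON + 1))
    future_mc = sum(mc[s] for s in range(t + 1, HORIZON + 1))
    mc[t] = future_gap - future_mc

print([round(x, 3) for x in mc[1:]])
```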

Then the agent, after turn $t$, will get compensatory rewards $CR_t(P_t) + MC_t$.

Thus it continues to get the $CR_t$ rewards that ensure indifference at the point of change of utility. The role of the $MC_t$ is to remove, in expectation, all future compensatory rewards ($CR_s$) and all future $MC_s$ rewards, and to add expected $CR_s$ rewards as they would have been estimated by $P_t$. Therefore, at turn $t$, the agent is also indifferent to future changes of utility. Hence the agent will always be indifferent to future changes of utility, and will never try to change its probability estimator for the purpose of getting compensatory rewards.
