Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

A putative new idea for AI control; index here.

Note: this post replaces and supersedes the previous one which is now obsolete.

Hacking utility indifference

Utility indifference was a technique for corrigibility: an attempt to allow an AI to safely change its values, without seeking to provoke or avoid the change. The idea is that, when the AI's utility changes from u to v at time t, it maximises a meta-utility U such that

  • Given u→u, U = u.

  • Given u→v, U = v + E(u | u→u) - E(v | u→v).


Here u→u designates the event that u remains unchanged, while u→v designates the change. This is a permitted utility change - one that we have set the agent up to accept - rather than just any old change in utility, which the agent will still attempt to resist.

As has been shown, adding those expectation terms means the AI will not seek to provoke or resist such a utility change - it is indifferent to the change. Those expectation terms, E(u | u→u) - E(v | u→v), can be designated compensatory rewards.
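As a toy sketch of this setup (my own illustration; the function and the numeric expectations are invented for the example), the compensatory terms make the expected meta-utility the same whether or not the permitted change happens:

```python
# Toy sketch (illustrative, not from the post): one permitted change u -> v,
# with compensatory rewards added on the changed branch.

def meta_utility(changed, u_value, v_value, E_u_unchanged, E_v_changed):
    """Meta-utility U: u if the change doesn't happen, otherwise
    v plus the compensatory reward E(u | u->u) - E(v | u->v)."""
    if not changed:
        return u_value
    return v_value + E_u_unchanged - E_v_changed

# On the changed branch the achieved v averages to E(v | u->v), so
# E[U | change] = E(u | u->u) -- exactly the unchanged branch's expectation.
E_u, E_v = 7.0, 3.0
assert meta_utility(False, E_u, 0.0, E_u, E_v) == 7.0
assert meta_utility(True, 0.0, E_v, E_u, E_v) == 7.0
```

In expectation the agent gets the same meta-utility either way, so it has no incentive to provoke or block the change.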

What it is not indifferent to, though, is changes to E. If the AI knows that the change u→v is coming, then it has incentives to change E to E', with a high E'(u | u→u) and a low E'(v | u→v). Its ability to rewire is limited by the fact that E' will serve as its future expectation estimator, so it won't mangle E' completely, but it certainly can twist it a bit.

Notice that the AI is now no longer indifferent to the permitted change of utility. It can benefit from rewiring this way only if a change in utility is coming up.

Probability estimators

Let the P_i be probability estimators. These define both expectation estimators E_i and probability estimators P_i. These two are equivalent in principle, since for any random variable X, with I_{X=a} the indicator function for the event X=a, we have P_i(X=a) = E_i(I_{X=a}) and E_i(X) = Σ_a a P_i(X=a).
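To illustrate the equivalence (the discrete distribution below is a made-up example standing in for one estimator P_i):

```python
# Sketch: an expectation estimator and a probability estimator derived from
# the same toy distribution are interchangeable via indicator functions.

dist = {1: 0.2, 2: 0.5, 3: 0.3}  # P_i(X = a) for a toy random variable X

def P(a):
    """Probability estimator P_i."""
    return dist.get(a, 0.0)

def E(f):
    """Expectation estimator E_i applied to a function f of X."""
    return sum(prob * f(a) for a, prob in dist.items())

def indicator(b):
    """I_{X = b}: 1 when X = b, else 0."""
    return lambda a: 1.0 if a == b else 0.0

# P_i(X = a) = E_i(I_{X = a})
assert abs(P(2) - E(indicator(2))) < 1e-12
# E_i(X) = sum over a of a * P_i(X = a)
assert abs(E(lambda a: a) - sum(a * P(a) for a in dist)) < 1e-12
```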

Double indifference

One idea to combat this is to apply indifference to the process of changing probability estimators. Thus an AI with estimator E changing to E' would get compensatory reward

  • E(u | E→E) - E(u | E→E').

This won't work, however, since we don't want to discount all the impacts of the change (maybe E' is a much more accurate estimator), just the part of the change that affects the compensatory rewards. So the second layer of indifference will be a bit different from the previous one.

First, a subtlety. If we designate by E→E' the fact that the agent uses estimator E', it is very possible that E(u | E→E') > E(u | E→E). This simply means that, in E's 'opinion', E' is a better estimator for maximising u than E is itself.

Now, assume the agent is currently using estimator E_j and has current utility u. Let C(E_i, u→v) be the future compensatory reward of the type E_i(u | u→u) - E_i(v | u→v), for the permitted change u→v. Let C_j(E_i, u→v) be the same, except that the expectation estimator is E_j: thus it is E_j(u | u→u) - E_j(v | u→v). Note that C_j(E_j, u→v) = C(E_j, u→v).

Then define C (and C_j) by summing C(E_i, u→v) (and C_j(E_i, u→v)) over all E_i, u, and v.

Let D be all future compensatory rewards that are not of the type C. Then if the agent changes E_j into E_k, it gets the compensatory reward

  • E_j(-C - D + C_j | E_j→E_k).

Note that D is defined in terms of C, so, though there may be convergence issues, the definition is not circular.

Then the first two terms of this compensatory reward cancel out, according to the agent's best current expectation, all future compensatory rewards. The third term replaces all the C-type compensatory rewards with the same, except estimated by the agent's current E_j rather than by whichever estimator the agent will have at the time. Thus, the agent will be indifferent to all actual compensatory rewards (as it gets none in expectation), and instead gains estimated rewards that make it indifferent to any future change u→v.
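A toy arithmetic check of this cancellation (the numbers are invented): the actual future compensatory rewards C + D, plus the reward above, net out under E_j's expectation to the E_j-estimated C-type rewards alone:

```python
# Toy check (illustrative numbers, written directly on expected values
# under E_j): future compensatory rewards C + D plus the new reward
# -C - D + C_j net to C_j alone.

def double_indifference_reward(C, D, C_j):
    """Compensatory reward for the change E_j -> E_k:
    E_j(-C - D + C_j | E_j -> E_k), on expected values."""
    return -C - D + C_j

C, D, C_j = 4.0, 1.5, 3.2  # expected reward sums, as estimated by E_j
net = (C + D) + double_indifference_reward(C, D, C_j)
assert abs(net - C_j) < 1e-12  # only the E_j-estimated C-type rewards remain
```

So whatever the future estimators would actually pay out, the agent's expectation of its total compensatory income is pinned to what its current estimator says, removing the incentive to game the estimator change.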

See this post for the same formula in AIXI formalism.

There are certainly cases where the above terms converge (such as all utilities bounded, and a finite number of possible changes to both the utility and the estimator). A full discussion of convergence will wait for another post.

Sequentially unbiased estimators

Call the estimators sequentially unbiased if, for all i and all random variables X, E_i(X) = E_i(E_{i+1}(X)). Then the main result is:

  • If the different E_i used by the agent are sequentially unbiased when labelled sequentially, then classical corrigibility is still sufficient.

To prove this, imagine that the agent will see a single change of utility u→v, and that it has the opportunity of changing E_1 (to E_2, say) exactly once before the utility change. Then there will be no term for D, and, if it does change E_1, the compensatory term is:

  • E_1(-C + C_1 | E_1→E_2).

However, E_1(C | E_1→E_2) = E_1(E_2(u | u→u) - E_2(v | u→v) | E_1→E_2). Because of sequential unbiasedness, this simplifies to E_1(u | u→u) - E_1(v | u→v), which is just E_1(C_1 | E_1→E_2). So E_1(-C + C_1 | E_1→E_2) = 0.

We can then recurse to the change in E just before E_1→E_2, and get the same result (since the future D term is still zero). And so on, with D always being zero. Then since the formulas defining C are linear, we can extend this to general environments and general utility function changes, and conclude that for sequentially unbiased E_i, the compensatory rewards are always zero under double indifference (modulo some convergence issues not addressed here). Therefore, double indifference will work, even if we don't use the C_j's at all: thus classical indifference still works in this case.
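The sequential-unbiasedness condition can be illustrated with conditional expectation (my construction; the joint distribution is arbitrary): a refined estimator E_2 that conditions on extra information c, averaged under the coarser E_1, gives back E_1's own answer by the tower property:

```python
# Sketch: two estimators satisfying E_1(E_2(X)) = E_1(X), built from
# conditional expectation over a toy joint distribution on (c, x).

joint = {('h', 1): 0.25, ('h', 3): 0.25, ('t', 2): 0.3, ('t', 6): 0.2}

def E1(f):
    """Coarse estimator: plain expectation over the joint distribution."""
    return sum(p * f(c, x) for (c, x), p in joint.items())

def E2(f):
    """Refined estimator: maps f to the function (c, x) -> E[f | c]."""
    def conditional(c, _x):
        p_c = sum(p for (cc, _xx), p in joint.items() if cc == c)
        return sum(p * f(cc, xx)
                   for (cc, xx), p in joint.items() if cc == c) / p_c
    return conditional

X = lambda c, x: x
# Sequential unbiasedness (the tower property), so E_1(-C + C_1) vanishes:
assert abs(E1(E2(X)) - E1(X)) < 1e-12
```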

Note the similarity of sequential unbiasedness with the conditions for successful value learners in the Cake or Death problem.


I thought the main problem with utility indifference was the incentive to "gerrymander" probability mass (explained in section 4.2 of the corrigibility paper); is this addressed in either the original utility indifference proposal or the modification explained in this post?

To be more specific: one version of the approach outlined here has the AI maximize

E(u_p | B=p) P(B=p) + E(u_s | B=s) P(B=s).

Suppose there are 2 possible B values, p and s. u_p says the AI should maximize paperclips; u_s says the AI should maximize staples. B is set to p by default but there's a button that switches it to s.

Assume P(B) is uniform over the 2 possible values p and s. In the first time step, the AI gets a number of resources as an integer uniformly distributed between 1 and 10. In the second time step, the AI may manipulate B by pressing the paperclips-to-staples button. In the third time step, the AI may make a number of objects (paperclips or staples) equal to the number of resources.

The optimal policy given the value function above is to observe the number of resources, press the button iff the number of resources is 10, and then create the maximum-allowed number of copies of the object corresponding to the resulting B value. This "gerrymanders" probability mass by making it so the AI gets 10 resources "half the time" instead of 1/10 of the time.
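The example can be checked numerically (my formalisation of the setup above; `value` encodes the maximand as I read it, with the uniform prior on B):

```python
# Sketch of the gerrymandering example: r ~ Uniform{1..10}; pressing sets
# B = s. Assumed maximand: (1/2) E(u_p | B=p) + (1/2) E(u_s | B=s).
from statistics import mean

resources = range(1, 11)

def value(press):  # press: function r -> bool
    no_press = [r for r in resources if not press(r)]
    pressed = [r for r in resources if press(r)]
    E_up = mean(no_press)  # paperclips made equal the resources
    E_us = mean(pressed)   # staples made equal the resources
    return 0.5 * E_up + 0.5 * E_us

gerrymander = value(lambda r: r == 10)  # press iff r = 10
honest_baseline = mean(resources)       # never press: just E(u_p) = 5.5
assert gerrymander == 7.5
assert honest_baseline == 5.5
```

Conditional on B = s the AI always has 10 resources, and that branch carries probability 1/2, which is where the extra value comes from.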

The only way of dealing with this problem that I know of is to use a causal counterfactual instead of a conditional. I don't understand classical utility indifference as well as I understand the factoring-out approach, but it seems like in your post you aren't doing a causal counterfactual, so either your approach solves the gerrymandering problem in a novel way, it still has this problem, or I'm confused about something.

Hum, I seem to have not explained very well.

In any case, the original proposal (or slight modifications of it) should not have the gerrymandering problem. Let r be the number of resources. When the button is pressed, the AI gains E(u_p | B=p) and loses E(u_s | B=s), for the estimator it has at the moment of pressing. The AI is then left to maximise u_s, given that B=s and that there are r resources. Once the rewards and punishments are taken into account, the expected utility is the same whether or not the button is pressed. Since this is true for all r, there is no advantage to gerrymandering r's value.
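Sketching this bookkeeping numerically (my code; both compensatory terms are evaluated after the AI knows r, as the reply describes):

```python
# Sketch: once the AI has seen r resources, pressing yields
# v + E(u_p | B=p) - E(u_s | B=s), where both conditionals already know r.

def reward(r, press):
    E_u_no_press = float(r)  # would make r paperclips if B stays p
    E_v_press = float(r)     # would make r staples if B becomes s
    if press:
        # staples achieved + gained reward - lost punishment
        return E_v_press + E_u_no_press - E_v_press
    return E_u_no_press      # paperclips achieved

# Pressing and not pressing come out equal for every resource count r,
# so there is nothing to gerrymander.
assert all(reward(r, True) == reward(r, False) for r in range(1, 11))
```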

Factoring out variables is not, as far as I can tell, the same thing as indifference via compensatory rewards.

Now, the problem with the initial indifference is that there is no incentive to gerrymander probability mass, but there is an incentive to gerrymander the probability estimator. Since the rewards depend on E, modifying E will earn higher rewards (and since the true E is going to be a bounded approximation anyway, changes to E are not inconceivable). The above post intends to deal with that problem.

I had this confusion long ago as well. I think the definition is much clearer if you just say "When we press the button, we flip a coin that comes up heads 1/billion times. We only change the agent's values / turn it off if the coin comes up tails, which almost always happens. The agent chooses a policy assuming that the coin comes up heads."

I still think the original proposal has a gerrymandering problem (slightly different from the one in my comment).

Say the AI is going to construct a sub-agent B (and the original AI will then go to sleep for the rest of the game, unable to view anything that happens). In the first turn, B will get a number of resources between 1 and 10. These resources are only useful for making paperclips. In the second turn, B can press a paperclips-to-staples button, which changes the original AI's utility function to staple-maximization. In the third turn, B may create a number of paperclips equal to the number of resources, and may independently create 42 staples.

The naive strategy is to create a subagent that doesn't press the button, then makes paperclips. This policy gets a reward of 5.5 (the average number of resources).

Consider an alternative strategy. The AI creates a sub-agent B with the following policy: if B gets between 1 and 5 resources, then it presses the paperclips-to-staples button; otherwise it doesn't press the button. Then B will create the appropriate number of paperclips, or 42 staples, depending on the button.

Let u be the number of paperclips produced, v the number of staples produced. Under the policy, we have E(u | button not pressed) = 8 (the average of 6 through 10), and E(v | button pressed) = 42. Since the button gets pressed half the time, the AI's expected reward is

(1/2)(8) + (1/2)(42 + 8 - 42) = 8.

Meanwhile, the AI's expected reward if it never presses the button and just makes paperclips is 5.5. So it seems that the original proposal has a gerrymandering problem.
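The arithmetic of this strategy can be checked directly (my sketch of the setup above):

```python
# Sketch of the sub-agent example: press iff r in 1..5; pressing pays
# v + E(u | no press) - E(v | press) = 42 + 8 - 42 = 8.
from statistics import mean

resources = range(1, 11)
press = lambda r: r <= 5

E_u_no_press = mean(r for r in resources if not press(r))  # mean of 6..10 = 8
E_v_press = 42.0                                           # 42 staples

def reward(r):
    if press(r):
        return 42.0 + E_u_no_press - E_v_press  # compensated press branch
    return float(r)                             # paperclips made

assert mean(reward(r) for r in resources) == 8.0  # beats the honest 5.5
```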

It's possible that your new proposal fixes this problem, but first I want to check that this is actually a problem for the original proposal.

This is my punishment for being sloppy with definitions :-(

Ok, let's be clearer. When the button is pressed, a process begins to change u into v. But the process is stochastic, and has a small chance of failing. The failure of this process is designated by u→u.

So if Pr refers to pressing the button, the expected reward is:

P(¬Pr) E(u | ¬Pr) + P(Pr) (E(v | Pr, u→v) + E(u | Pr, u→u) - E(v | Pr, u→v)) = P(¬Pr) E(u | ¬Pr) + P(Pr) E(u | Pr, u→u).
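A toy check of this fix, reusing the sub-agent example's numbers (the policy and the press probability of 1/2 come from that example; note the small failure probability itself drops out, since only conditional expectations appear):

```python
# Sketch: conditioning the compensatory terms on the press *failing*
# (u -> u) removes the gerrymander. Policy: press iff r in 1..5.
from statistics import mean

resources = range(1, 11)
press = lambda r: r <= 5

# If the press fails, the agent is still a paperclip maximiser with r resources.
E_u_press_fail = mean(r for r in resources if press(r))     # mean of 1..5 = 3
E_u_no_press = mean(r for r in resources if not press(r))   # mean of 6..10 = 8

p_press = 0.5
expected = (1 - p_press) * E_u_no_press + p_press * E_u_press_fail
assert expected == 5.5  # no better than never pressing the button
```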

This makes more sense, thanks for explaining!

I'm going to translate this to my notation so I understand it better (feel free to skip this, it's mostly for my benefit). Let W be the event that the change process succeeds (which usually happens). We can say that W still has a value even if the button isn't pressed (its value just doesn't do anything). So u→v happens iff Pr ∧ W.

The expected utility is then

P(Pr ∧ W) (E(v | Pr ∧ W) + E(u | ¬(Pr ∧ W)) - E(v | Pr ∧ W)) + P(¬(Pr ∧ W)) E(u | ¬(Pr ∧ W)) = E(u | ¬(Pr ∧ W))

Suppose the AI never updates its beliefs (because it goes into stasis the first round). Then this is

E(u | ¬W) = E(u | do(¬W))

(I wrote it as a causal counterfactual in the last step; it doesn't make a difference since W has no causal ancestors, but I find it easier to reason about this way.)

So in the end, if the whole game is to construct a subagent and then go into stasis, then the agent is just a u-maximizer who believes ¬W (as you said). It avoids the gerrymandering problem by doing a causal counterfactual.

I think another way of framing this is that the button always succeeds, and the agent optimizes E(u | do(¬Pr)) (which relates this back to the causal version of factoring out effects, where you optimize E(u | do(¬Pr)) + E(v | do(Pr)); you get utility indifference with stasis by setting v = 0). This will be slightly different in that here the agent "believes" it will be magically prevented from pressing the button, whereas originally the agent just "believes" that the button will fail.

Note there is one way in which the AI functions as an outcome pump: it will accept bets, at any odds, against ever becoming a v-maximiser. That's because that's what a pure u-maximiser would do, and this agent design behaves like a pure u-maximiser.

Typo: in the paragraph before the equation arrays, you forgot to change from 5 to 42 (you did so in the following equation arrays). This buffaloed me for a bit!

Fixed, thanks.

The idea is that, when the AI’s utility changes from u to v at time t, it maximises a meta-utility U

What does it mean for the meta-utility U to depend on the time step t? My understanding is that utility functions are over world histories; thus it doesn't make sense for them to depend on the time step.

My guess is that you meant that both u and v are expressed as a sum of rewards over time, and the meta-utility U sums the rewards of u before t with the rewards of v after t (plus an expected reward correction term); is this correct?