A putative new idea for AI control; index here.

This post is a synthesis of some of the ideas from utility indifference and false miracles, in an easier-to-follow format that illustrates better what's going on.

## Utility scaling

Suppose you have an AI with a utility u and a probability estimate P. There is a certain event X which the AI cannot affect. You wish to change the AI's estimate of the probability of X, by, say, doubling the odds ratio P(X):P(¬X). However, since it is dangerous to give an AI false beliefs (they may not be stable, for one), you instead want to make the AI behave as if it were a u-maximiser with doubled odds ratio.

Assume that the AI is currently deciding between two actions, α and ω. The expected utility of action α decomposes as:

u(α) = P(X)u(α|X) + P(¬X)u(α|¬X).

The utility of action ω is defined similarly, and the expected gain (or loss) of utility by choosing α over ω is:

u(α)-u(ω) = P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)).

If we were to double the odds ratio, the expected utility gain becomes:

u(α)-u(ω) = (2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)))/Ω,    (1)

for some normalisation constant Ω = 2P(X)+P(¬X), independent of α and ω.

We can reproduce exactly the same effect by instead replacing u with u', such that

• u'(·|X) = 2u(·|X)
• u'(·|¬X) = u(·|¬X)

Then:

u'(α)-u'(ω) = P(X)(u'(α|X)-u'(ω|X)) + P(¬X)(u'(α|¬X)-u'(ω|¬X))

= 2P(X)(u(α|X)-u(ω|X)) + P(¬X)(u(α|¬X)-u(ω|¬X)).    (2)

This is equation (1) multiplied by the positive constant Ω, which does not depend on the action and so does not change which action the AI prefers. Thus we can accomplish, via utility manipulation, exactly the same effect on the AI's behaviour as by changing its probability estimates.
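As a quick numerical check of this equivalence, here is a minimal sketch with made-up numbers (the probability and the utility gaps are illustrative assumptions, not values from the post). It compares the gain under a doubled odds ratio with the gain under the rescaled utility u'.

```python
from fractions import Fraction as F

def expected_gain(p_x, gain_x, gain_not_x):
    """Expected utility gain of alpha over omega, given P(X) and the per-branch gains."""
    return p_x * gain_x + (1 - p_x) * gain_not_x

def double_odds(p_x):
    """Return P'(X) after doubling the odds ratio P(X):P(not X)."""
    return 2 * p_x / (2 * p_x + (1 - p_x))

# Illustrative numbers (assumptions for this sketch only).
p_x = F(1, 3)
gain_x = F(3)        # u(alpha|X) - u(omega|X)
gain_not_x = F(-2)   # u(alpha|not X) - u(omega|not X)

omega = 2 * p_x + (1 - p_x)   # the normalisation constant

gain_original = expected_gain(p_x, gain_x, gain_not_x)
gain_doubled_odds = expected_gain(double_odds(p_x), gain_x, gain_not_x)  # equation (1)
gain_scaled_utility = expected_gain(p_x, 2 * gain_x, gain_not_x)         # equation (2)

# The two gains differ only by the positive factor omega, so they share a sign.
print(gain_original, gain_doubled_odds, gain_scaled_utility, omega)
```

With these numbers the unmodified agent prefers ω, while both modified agents prefer α, and the two modified gains differ exactly by the factor Ω.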

Notice that we could also have defined

• u'(·|X) = u(·|X)
• u'(·|¬X) = (1/2)u(·|¬X)

This is just the same u', scaled.

The utility indifference and false miracles approaches were just special cases of this, where the odds ratio was sent to infinity/zero by multiplying by zero. But the general result is that one can start with an AI with utility/probability estimate pair (u,P) and map it to an AI with pair (u',P) which behaves similarly to (u,P'). Changes in probability can be replicated as changes in utility.
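The general mapping can be sketched for an arbitrary odds multiplier k and more than two actions. The action names, utilities and the helper functions below are hypothetical, chosen only to illustrate that (u',P) and (u,P') pick the same action, including the k = 0 limit.

```python
from fractions import Fraction as F

def scale_odds(p_x, k):
    """P'(X) after multiplying the odds ratio P(X):P(not X) by k >= 0."""
    return k * p_x / (k * p_x + (1 - p_x))

def best_action(p_x, utilities, k_utility=1):
    """Pick the action maximising P(X)*k*u(a|X) + P(not X)*u(a|not X)."""
    def expected(action):
        u_x, u_not_x = utilities[action]
        return p_x * k_utility * u_x + (1 - p_x) * u_not_x
    return max(utilities, key=expected)

# Hypothetical actions mapped to (u(a|X), u(a|not X)).
utilities = {"a": (F(5), F(0)), "b": (F(2), F(2)), "c": (F(0), F(3))}
p_x = F(1, 3)

for k in (F(0), F(1, 2), F(1), F(2), F(10)):
    via_probability = best_action(scale_odds(p_x, k), utilities)   # the pair (u, P')
    via_utility = best_action(p_x, utilities, k_utility=k)         # the pair (u', P)
    assert via_probability == via_utility
```

Exact fractions are used so that ties between actions are resolved identically in both formulations rather than being disturbed by floating-point rounding.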

## Utility translating

In the previous section, we multiplied certain utilities by two. In doing so, we implicitly used the zero point of u. But utility is invariant under translation, so this zero point has no real significance.

It turns out that we don't need to care about this: any zero will do. What matters is simply that the spread between options is doubled in the X world but not in the ¬X one.

But that relies on the AI being unable to affect the probability of X and ¬X itself. If the AI has an action that will increase (or decrease) P(X), then it becomes very important where we set the zero before multiplying. Setting the zero in a different place is isomorphic with adding a constant to the X world and not the ¬X world (or vice versa). Obviously this will greatly affect the AI's preferences between X and ¬X.
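To make the dependence on the zero point concrete, here is a small sketch under assumed numbers. The function `gain_from_pushing` and all values are hypothetical. The original u is indifferent between X and ¬X, yet the modified agent wants to raise or lower P(X) depending on where the zero was placed before doubling.

```python
from fractions import Fraction as F

def gain_from_pushing(u_x, u_not_x, p_before, p_after, zero_shift, factor=2):
    """Gain in expected u' from an action that moves P(X) from p_before to p_after.

    u' is built by first translating u by zero_shift (which leaves a
    u-maximiser's choices unchanged), then multiplying the X branch by factor.
    """
    scaled_x = factor * (u_x + zero_shift)
    shifted_not_x = u_not_x + zero_shift
    before = p_before * scaled_x + (1 - p_before) * shifted_not_x
    after = p_after * scaled_x + (1 - p_after) * shifted_not_x
    return after - before

# Hypothetical numbers: u itself is indifferent between X and not X.
u_x, u_not_x = F(1), F(1)
p_before, p_after = F(1, 2), F(3, 4)

gain_zero_at_0 = gain_from_pushing(u_x, u_not_x, p_before, p_after, zero_shift=F(0))
gain_zero_at_minus_2 = gain_from_pushing(u_x, u_not_x, p_before, p_after, zero_shift=F(-2))

print(gain_zero_at_0, gain_zero_at_minus_2)
```

The first gain is positive and the second negative, so the same doubling makes the AI want to boost X under one choice of zero and suppress it under another.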

One way of avoiding the AI affecting X is to set this constant so that u'(X)=u'(¬X), in expectation. Then the AI has no preferences between the two situations, and will not seek to boost one over the other. However, note that u(X) is an expected utility calculation. Therefore:

1. Choosing the constant so that u'(X)=u'(¬X) requires accessing the AI's probability estimate P for various worlds; it cannot be done from outside, by multiplying the utility, as the previous approach could.
2. Even if u'(X)=u'(¬X), this does not mean that u'(X|Y)=u'(¬X|Y) for every event Y that could happen before X does. Simple example: X is a coin flip, and Y is the bet of someone on that coin flip, someone the AI doesn't like.
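Both points can be illustrated with a toy version of the coin-flip example. The payoffs, the credence `p_y` and the helper names are assumptions made for this sketch; it only shows that a constant which balances X against ¬X in expectation depends on P and stops balancing them once Y is known.

```python
from fractions import Fraction as F

# Hypothetical toy model. X: the coin lands heads.
# Y: the disliked person bets on heads (otherwise they bet on tails).
# The AI gets -1 when that person wins their bet and +1 when they lose.
def u(x, y):
    return F(-1) if x == y else F(1)

p_y = F(3, 4)  # the AI's credence that the bet is on heads (an assumption)

def expected_u_given_x(x, constant=F(0)):
    """E[u' | X = x] before Y is known; constant is added in X worlds only."""
    bonus = constant if x else F(0)
    return p_y * (u(x, True) + bonus) + (1 - p_y) * (u(x, False) + bonus)

# The balancing constant depends on P(Y): it cannot be chosen without P.
constant = expected_u_given_x(False) - expected_u_given_x(True)

balanced_x = expected_u_given_x(True, constant)
balanced_not_x = expected_u_given_x(False, constant)

# Once Y is observed, the same constant no longer balances the two branches.
after_y_x = u(True, True) + constant
after_y_not_x = u(False, True)

print(constant, balanced_x, balanced_not_x, after_y_x, after_y_not_x)
```

Before Y is known the two branches have equal expected value, but after the AI learns the bet was on heads it again prefers one outcome of the coin flip.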

This explains all the complexity of the utility indifference approach, which is essentially trying to decompose possible universes (and adding constants to particular subsets of universes) to ensure that u'(X|Y)=u'(¬X|Y) for any Y that could happen before X does.


I thought one of the takeaways from Bernardo + Smith in Bayesian Theory was that, from a decision theory perspective, your cost function and your probability function are basically an integrated whole, any division of which is arbitrary.

That makes sense. Do you have a reference that isn't a book?

Sorry, no. I never went beyond their book with those guys.

This is not unlike Neyman-Pearson theory. Surely this will run into the same trouble with more than 2 possible actions.

No, no real connection to Neyman-Pearson. And it's fine with more than 2 actions - notice that each action only uses itself in the definition. And u' doesn't even use any actions in its definition.

If one were to believe there is only one thing that agents ought to maximise could this be used as a way to translate agents that actually maximise another thing as maximising "the correct thing" but with false beliefs? If rationalism is the deep rejection of false beliefs, could this be a deep error mode where agents are seen as having false beliefs instead of being recognised as having different values? Then demanding "rectification" of the factual errors would actually be a form of value imperialism.

This could also be seen as a divergence of epistemic and instrumental rationality, in that instrumental rationality would accept falsehoods if they are useful enough. That is, if you care about probabilities only in order to maximise expected utility, then uncertainty about the specific way the goal is reached and uncertainty about the desirability of the outcome of the process are largely interchangeable. In the extreme of low probability accuracy and high utility accuracy, you would know to select the action which gets you what you want but be unsure how it makes it come about. The other extreme, high probability accuracy but low utility accuracy, would be the technically capable AI where we don't know whether it is allied with us or against us.

> If one were to believe there is only one thing that agents ought to maximise could this be used as a way to translate agents that actually maximise another thing as maximising "the correct thing" but with false beliefs?

Not easily. It's hard to translate a u-maximiser for complex u, into, say, a u-minimiser, without redefining the entire universe.

"But the general result is that one can start with an AI with utility/probability estimate pair (u,P) and map it to an AI with pair (u',P) which behaves similarly to (u,P')"

Is this at all related to the Loudness metric mentioned in this paper? https://intelligence.org/files/LoudnessPriors.pdf It seems like the two are related... (in terms of probability and utility blending together into a generalized "importance" or "loudness" parameter)

> Is this at all related to the Loudness metric mentioned in this paper?

Not really. They're only connected in that they both involve scaling of utilities (but in one case, scaling of whole utilities, in this case, scaling of portions of the utility).

You write:

> it is dangerous to give an AI false beliefs (they may not be stable, for one)

But the approach described here seems to give 100% identical results (certainly as long as we talk about beliefs uncorrelated with the agent's behavior). So why do you think that one is dangerous and the other is fine?

Can you describe a situation in which the two changes lead to different outcomes?

I'm thinking about what happens in the case where the probability is close to, or converging to, zero. "I am in an impossible world" seems more dangerous than "I am in a world I cannot improve or worsen".

To be honest, the real justification behind that was that the suggestions I'd heard about giving the AI false beliefs all seemed to fail, so this felt safer.

Again, can you describe any case where the two proposals do anything different?

Given that they do the same thing in every case, it seems highly unlikely that one is safe and the other is dangerous! At best you are obfuscating the problem.

For example, if this really dodges any failures associated with giving the AI false beliefs, that should give you an example where the two proposals do something different.

Now that I've rested a bit, let me think about this properly. One reason I was wary of changing probability was because of all the related other probabilities - conditional probabilities, AND and OR expressions, and so on. Changing one probability would have to keep the rest consistent, while changing utility had consistency built in.

It feels like changing a prior might be equivalent. I'm not sure that there is any difference between changing a prior and changing the utility. But, again, there might be some consistency worries to think about - e.g. how do we change priors over correlations between events, and so on? It still seems that changing probability involves many choices while changing the utility doesn't (it seems equivalent to finding a Bayes factor that provides evidence for that specific event?)

I will think more.

The normal way of modifying a probability distribution to make X more likely is to increase the probability of each world where X is true, e.g. by doubling it. This is equivalent to observing evidence for X. It's also equivalent to your procedure for modifying utility functions.
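A world-level sketch of this claim, using three made-up worlds (the probabilities, utilities and world names are assumptions for illustration): doubling the weight of each X world and renormalising gives an expected u that equals the expected u' under the original distribution, up to the normalising total.

```python
from fractions import Fraction as F

# Hypothetical worlds: name -> (probability, X holds?, utility).
worlds = {
    "w1": (F(1, 6), True, F(4)),
    "w2": (F(1, 6), True, F(-1)),
    "w3": (F(2, 3), False, F(2)),
}

# Double the weight of every X world, then renormalise.
weights = {name: (2 * p if x else p) for name, (p, x, _) in worlds.items()}
total = sum(weights.values())
reweighted = {name: w / total for name, w in weights.items()}

# Expected u under the reweighted distribution...
expected_u_new_p = sum(reweighted[name] * u for name, (_, _, u) in worlds.items())
# ...versus expected u' under the original distribution, with u' = 2u on X worlds.
expected_scaled_u_old_p = sum(p * (2 * u if x else u) for p, x, u in worlds.values())

print(expected_u_new_p, expected_scaled_u_old_p, total)
```

The normalising total here plays the role of Ω in the post: it is the same for every action, so it does not affect which action is chosen.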