Utility versus Reward function: partial equivalence

1William_S

2Stuart_Armstrong

1William_S

4paulfchristiano

1William_S

New Comment

I'm trying to wrap my head around the case where there are two worlds, w1 and w2; w2 is better than w1, but moving from w1 to w2 is bad (ie. kill everyone and replacing them with different people who are happier, and we think this is bad).

I think for the equivalence to work in this case, the utility function U also needs to depend on your current state - if it's the same for all states, then the agent would always prefer to move from w1 to w2 and erase it's memory of the past when maximizing the utility function, wheras it would act correctly with the reward function.

>erase it's memory

That only works if the agent is motivated by something like "maximise your belief in what the expected value of U is", rather than "maximise the expected value of U". If you've got that problem, then the agent is unsalvageable - it could just edit its memory to make itself believe U is maximised.

Say w2a is the world where the agent starts in w2 and w2b is the world that results after the agent moves from w1 to w2.

Without considering the agent's memory part of the world, it seems like the problem is worse: the only way to distinguish between w2a and w2b is the agent's memory of past events, so it seems that leaving the agent's memory over the past out of the utility function requires U(w2a) = U(w2b)

A reward function is defined over past sequences of actions and observations. When the agent chooses an action, and gets an observation, they receive a reward that is a function of that observation and all previous observations and actions.

A utility function is defined over states of the world. You can take actions to increase or decrease the probability of certain states, thus increasing expected utility, but you don't actually "receive" any utility.

Are these different objects, or are they actually the same thing? This would be good to know, as most of the background knowledge of MIRI and similar AI safety groups is for utility functions, while reward functions are prevalent in reinforcement learning.

The summary of this post is:

## Formalism

Let A be the set of actions an agent can take, and O the set of observations. Assume both sets are

finite. Let H be the set of histories (sequences of observations and actions) of an agent.Let W be the (possibly infinite) set of worlds. Note that a world includes the full set of observation history for the agent (since the agent is part of the world). Therefore the worlds are stratified by histories; for any h∈H, there is a subset Wh⊂W consisting of all worlds with history h.

Then a reward function R is a function from histories to real numbers, while a utility function U is a function from worlds to real numbers:

Rewards and utility functions are bounded if their image is in a bounded subset of R; without loss of generality, this means there exists an l>0 such that the image of R (or U) is contained in [−l,l] for all h∈H (or w∈W).

A policy π for an agent is a map from histories to a probability distribution over actions; so π:H→ΔA, for ΔA the space of probability distributions over actions.

Even for a utility-based agent, these are the only policies available. This is because all the information (apart from the prior) that the agent gets from the outside world is given by its observations.

## The value functions

Let P be the probability estimator used by the agent. We'll assume for the moment that it's logically omniscient and Bayesian. This P incorporates a prior, by, for example, applying it unconditionally to worlds or histories - P(w) and P(h).

Given a history h∈H and a policy π, we can compute the conditional probability of a world or a history; designate these by Pπ(h′|h) and Pπ(w|h).

Given these probabilities, we can then compute expected values. For example, the expected utility given π and h is:

We'll also designate this quantity by the expected value function V(U,π,h). If U is bounded in [−l,l], then so is this value function (though the converse need not be true).

For rewards, let Hi be the set of histories that have i actions and observations. We'll say that h∈Hi has length i. Let h::i be the the first i actions and observations of the history h. If the agent knows it will only make n≥0 observations and actions, then the future reward has an expected value function. For π and hm∈Hm with hm a history of length m<n, this is:

If the agent expects it could make arbitrarily many observations, then given a discount function 0<γ<1, there is the expected discounted future reward of:

## Equivalence for finite horizons

Assume the agent knows it will only make n observations and actions. The {Wh|h∈Hn} form a partition of the possible worlds in W: every possible world is in exactly one of those Wh subsets.

Then assume that R is given, and define, for wh∈Wh:

Then:

Theorem 1: On history h, V(U,π,h) and V(RU,π,h) differ by a constant that is a function of h only, not of π. Consequently, an RU-maximiser and a U-maximiser will choose the same policies.All the proofs are given in the appendix at the end of the post.

Now, conversely, assume U is given, but R is not. If h∈Hn, then V(U,π,h) is independent of π, since the agent will never make any more actions.

Then fix any policy π′, for a history h of length m>1, define the reward RU by:

Theorem 2: On history h, V(U,π,h) and V(RU,π,h) differ by a constant that is a function of h and π′ only, not of π. Consequently, an RU-maximiser and a U-maximiser will choose the same policies.Both of these constructions are non-unique; for example, UR could vary across the different worlds of Wh, as long as its expectation on that set is the same. And RU could have different rewards added at one step, as long as it is subtracted at a later step, or could use a different π′.

## (In)Equivalence for infinite horizons

If the agent expects it could make arbitrarily many observations, then we can still define a good UR from a given R. Let H∞ be the possible infinite histories; then the sets {Wh|h∈H∞} form a partition of W. Then for any fixed γ define:

and we will get:

Theorem 3: If R is bounded, then UR is well-define and bounded, and on history h, V(U,π,h) and V(RU,π,h) are related by a positive affine transformation that is a function of h and γ only, not of π.Consequently, a γ-discounted RU-maximiser and a U-maximiser will choose the same policies.The converse is more tricky. Fix any policy π′ as before, and, as before, for a history h of length m>1, define the reward RU by:

asymptotically ignores the future, if there exists a function f, for any hm∈Hm, and any policies π and π′,and

Theorem 4: If U asymptotically ignores the future, then on history h, V(U,π,h) and V(RU,π,h) are related by a positive affine transformation that is a function of h, π′, and γ only, not of π. Consequently, a γ-discounted RU-maximiser and a U-maximiser will choose the same policies.Even if U is bounded, RU need not be.## A utility counterexample

So, what kind of utility function cannot be made into a reward function in the above way?

Well, assume there are two actions, a and b, and that U is 1 if the agent only chooses a, and 0 if it ever choose b. Let π′ be the policy that always chooses b (all that's needed, in fact, is that it eventually chooses b with probability 1).

Then V(U,π′,h) is always zero, as is V(U,π′,h::m−1). Thus RU(h) is also always zero. And this despite the fact that there exists a policy that gets utility 1: namely the "always choose a" policy.

## Does it make a difference in practice?

In order to reach the equivalences above, the value functions V need to be exactly calculated, meaning that the probability estimator P needs to be perfect. Thus the equivalence is only established for logically omniscient agents.

In practice, utility functions are most useful when we know the ideal outcome states, and now the challenge is to design an agent that gets to them. Reward functions are most useful when we know the best local moves for the agent to make, but not necessarily the best outcomes.

## Appendix: Proofs

This gives the proofs of the four theorem above. They will proceed by expressing the relationship between the two relevant value functions.

Theorem 1:If h′∈Hn, then UR is constant on Wh′, so we can talk about UR(Wh′) (which is ∑ni=1R(h′::i)). Then if h is of length m:

since P(h′|h) being non-zero means that the initial m actions and observations of h′ are the same as those of h, and Pπ(Wh′|h) is the same as Pπ(h′|h): the probability that we are in a world with h′ is the same as the probability of observing h′.

Because ∑mi=1R(h::i) is a function purely of the past, this means that a V(UR,π,h) maximiser will behave the same way as a V(R,π,h) maximiser.

Theorem 2:If h is a history of length m>0, then:

because if h′∈Hn and Pπ(h′|h)≠0, then h′::n=h′, h′::m=h, and V(U,π′,h′)=V(U,π,h).

Theorem 3:If h is of length m, then

similarly to the proof of Theorem 1. Here, we have used the fact that the actions and observations beyond the j-th are irrelevant to ∑ji=m+1γi−1R(h′::i), in order to amalgamate all h′∈H∞ that have the same j first actions and observations, and then interchange the limit with the

finitesum over all the histories in Hj.Theorem 4:Because U asymptotically ignores the future, the term ∑h′∈HnPπ(h′|h)V(U,π′,h′::n) can be rewritten as ∑h′∈HnPπ(h′|h)V(U,π,h′::n)=V(U,π,h), a re-writing that introduces an error of norm at most f(n). Since f(n) tends to zero in the limit,

Now we'll show that this value function need not be bounded. Imagine that the agent has two action, a and b. The utility U is an inde weighted sum of the number of times the agent chooses a. For h′∈H∞, let I(a,h,m) be a Boolean that is 1 if the agent chose 1 on history h′ at time m, and 0 otherwise. Let 0<β<1, and define the utility:

Now let π be the policy of always choosing a, and π′ the policy of always choosing b. Then for any history hm∈Hm,

This means that if β<γ, the value of V(RU,π,h,γ) will increase without limits as h gets longer, even though U itself is bounded (by 1/(1−β)). If we replace βi with 1/i2, we keep the fact that U is bounded, and, since 1/i2 will eventually be greater that βi, for all 0<β<1, we can see that V(RU,π,h,γ) will increase without limits for any value of γ.