A reward function is defined over past sequences of actions and observations. When the agent chooses an action and gets an observation, it receives a reward that is a function of that observation and all previous observations and actions.
A utility function is defined over states of the world. You can take actions to increase or decrease the probability of certain states, thus increasing expected utility, but you don't actually "receive" any utility.
Are these different objects, or are they actually the same thing? This would be good to know, as most of the background knowledge of MIRI and similar AI safety groups is for utility functions, while reward functions are prevalent in reinforcement learning.
The summary of this post is:
- For finite horizons, reward and utility functions are equivalent.
- For infinite horizons, every bounded discounted reward function is equivalent to a bounded utility function. But not all bounded utility functions have a corresponding reward function. Even when they do, the reward function may not be bounded.
Let $\mathcal{A}$ be the set of actions an agent can take, and $\mathcal{O}$ the set of observations. Assume both sets are finite. Let $\mathcal{H}$ be the set of histories (sequences of actions and observations) of an agent.
Let $\mathcal{W}$ be the (possibly infinite) set of worlds. Note that a world includes the agent's full history of actions and observations (since the agent is part of the world). Therefore the worlds are stratified by histories: for any $h \in \mathcal{H}$, there is a subset $\mathcal{W}_h \subseteq \mathcal{W}$ consisting of all worlds with history $h$.
Then a reward function is a function from histories to real numbers, while a utility function is a function from worlds to real numbers:

$$R : \mathcal{H} \to \mathbb{R}, \qquad U : \mathcal{W} \to \mathbb{R}.$$
Rewards and utility functions are bounded if their image is a bounded subset of $\mathbb{R}$; without loss of generality, this means there exists an $M$ such that $R(h)$ (or $U(w)$) is contained in $[-M, M]$ for all $h$ (or $w$).
A policy for an agent is a map from histories to probability distributions over actions; so $\pi : \mathcal{H} \to \Delta(\mathcal{A})$, for $\Delta(\mathcal{A})$ the space of probability distributions over $\mathcal{A}$.
Even for a utility-based agent, these are the only policies available. This is because all the information (apart from the prior) that the agent gets from the outside world is given by its observations.
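As a concrete sketch of these types (a Python illustration; the two-element action set, the dummy observation, and all names here are assumptions for the example, not part of the formal setup), a policy is just a function from histories to distributions over actions:

```python
import random

ACTIONS = ["a", "b"]  # a finite action set (illustrative)

def uniform_policy(history):
    """A policy: a map from histories to probability distributions over actions."""
    return {"a": 0.5, "b": 0.5}

def sample_history(policy, environment, steps, rng):
    """Roll out a history: an alternating sequence of actions and observations."""
    history = ()
    for _ in range(steps):
        dist = policy(history)
        action = rng.choices(ACTIONS, weights=[dist[act] for act in ACTIONS])[0]
        observation = environment(history, action)
        history = history + ((action, observation),)
    return history

# A toy deterministic environment that always emits the observation "x".
h = sample_history(uniform_policy, lambda hist, act: "x", 3, random.Random(0))
```

Nothing about the agent's information is hidden from this representation: the policy sees exactly the past actions and observations, which is the point of the paragraph above.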
The value functions
Let $P$ be the probability estimator used by the agent. We'll assume for the moment that it's logically omniscient and Bayesian. This incorporates a prior, by, for example, applying it unconditionally to worlds or histories: $P(w)$ and $P(h)$.
Given a history $h$ and a policy $\pi$, we can compute the conditional probability of a world or of a history; designate these by $P(w \mid h, \pi)$ and $P(h' \mid h, \pi)$.
Given these probabilities, we can then compute expected values. For example, the expected utility given $h$ and $\pi$ is:

$$\sum_{w \in \mathcal{W}} P(w \mid h, \pi) \, U(w).$$

We'll also designate this quantity by the expected value function $V_U(\pi, h)$. If $U$ is bounded in $[-M, M]$, then so is this value function (though the converse need not be true).
For rewards, let $\mathcal{H}_n$ be the set of histories that have $n$ actions and $n$ observations. We'll say that such an $h$ has length $n$. Let $h_m$ be the first $m$ actions and observations of the history $h$. If the agent knows it will only make $n$ observations and actions, then the future reward has an expected value function. For $\pi$ and for a history $h$ of length $m$, this is:

$$V_R^n(\pi, h) = \sum_{h' \in \mathcal{H}_n} P(h' \mid h, \pi) \sum_{i=m+1}^{n} R(h'_i).$$
If the agent expects it could make arbitrarily many observations, then given a discount factor $\gamma$ with $0 < \gamma < 1$, there is the expected discounted future reward of:

$$V_R^\gamma(\pi, h) = \lim_{n \to \infty} \sum_{h' \in \mathcal{H}_n} P(h' \mid h, \pi) \sum_{i=m+1}^{n} \gamma^{i-m} R(h'_i).$$
Equivalence for finite horizons
Assume the agent knows it will only make $n$ observations and actions. The $\mathcal{W}_{h'}$, for $h' \in \mathcal{H}_n$, form a partition of the possible worlds in $\mathcal{W}$: every possible world is in exactly one of those subsets.
Then assume that $R$ is given, and define, for $w \in \mathcal{W}_{h'}$ with $h' \in \mathcal{H}_n$:

$$U(w) = \sum_{i=1}^{n} R(h'_i).$$

- Theorem 1: On history $h$, $V_U(\pi, h)$ and $V_R^n(\pi, h)$ differ by a constant that is a function of $h$ only, not of $\pi$. Consequently, an $R$-maximiser and a $U$-maximiser will choose the same policies.
All the proofs are given in the appendix at the end of the post.
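Theorem 1's construction can be checked numerically. A minimal Python sketch, under illustrative assumptions (a hypothetical two-step setting with a single dummy observation, so histories are just action tuples, and an arbitrary reward function):

```python
from itertools import product

ACTIONS = ["a", "b"]
N = 2  # the agent makes exactly N actions (one dummy observation per step, omitted)

def R(h):
    """An arbitrary illustrative reward on the history so far."""
    return (1.0 if h[-1] == "a" else 0.0) + (0.5 if len(h) == N and h[-1] == "b" else 0.0)

def U(world):
    """Theorem 1's construction: utility of a length-N world is its summed reward."""
    return sum(R(world[: i + 1]) for i in range(N))

def prob(full, h, policy):
    """P(full | h, policy): the chance of completing prefix h into the history full."""
    if full[: len(h)] != h:
        return 0.0
    p = 1.0
    for i in range(len(h), N):
        p *= policy(full[:i])[full[i]]
    return p

def V_U(policy, h):
    return sum(prob(f, h, policy) * U(f) for f in product(ACTIONS, repeat=N))

def V_R(policy, h):
    return sum(
        prob(f, h, policy) * sum(R(f[: i + 1]) for i in range(len(h), N))
        for f in product(ACTIONS, repeat=N)
    )

def mostly_a(h):
    return {"a": 0.9, "b": 0.1}

def mostly_b(h):
    return {"a": 0.2, "b": 0.8}

# On any history h, V_U - V_R is the past reward: the same constant for every policy.
h = ("a",)
past = sum(R(h[: i + 1]) for i in range(len(h)))
gap_1 = V_U(mostly_a, h) - V_R(mostly_a, h)
gap_2 = V_U(mostly_b, h) - V_R(mostly_b, h)
```

Both gaps come out equal to the past reward on $h$, independently of the policy, which is exactly the constant in Theorem 1.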
Now, conversely, assume $U$ is given, but $R$ is not. If $h' \in \mathcal{H}_n$, then $V_U(\pi, h')$ is independent of $\pi$, since the agent will never make any more actions.
Then fix any policy $\pi'$, and, for a history $h_m$ of length $m$, define the reward by:

$$R(h_m) = V_U(\pi', h_m) - V_U(\pi', h_{m-1}).$$

- Theorem 2: On history $h$, $V_U(\pi, h)$ and $V_R^n(\pi, h)$ differ by a constant that is a function of $h$ and $\pi'$ only, not of $\pi$. Consequently, an $R$-maximiser and a $U$-maximiser will choose the same policies.
Both of these constructions are non-unique; for example, $U$ could vary across the different worlds of $\mathcal{W}_{h'}$, as long as its expectation on that set is the same. And $R$ could have extra reward added at one step, as long as it is subtracted at a later step, or it could use a different $\pi'$.
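Theorem 2's converse construction can also be checked numerically. A minimal sketch under the same illustrative assumptions as before (two steps, one dummy observation, an arbitrary utility table): given $U$ and a fixed reference policy, define $R$ as the step-by-step change in reference-expected utility, and the gap $V_U - V_R^n$ comes out independent of the policy being evaluated:

```python
from itertools import product

ACTIONS = ["a", "b"]
N = 2  # finite horizon; observations omitted for brevity (a single dummy observation)

def U(world):
    """An arbitrary illustrative utility over length-N worlds."""
    return {"aa": 3.0, "ab": 1.0, "ba": 0.0, "bb": 2.0}["".join(world)]

def prob(full, h, policy):
    if full[: len(h)] != h:
        return 0.0
    p = 1.0
    for i in range(len(h), N):
        p *= policy(full[:i])[full[i]]
    return p

def V_U(policy, h):
    return sum(prob(f, h, policy) * U(f) for f in product(ACTIONS, repeat=N))

def fixed(h):  # the fixed reference policy pi'
    return {"a": 0.5, "b": 0.5}

def R(h):
    """Theorem 2's construction: the change in pi'-expected utility at the last step."""
    return V_U(fixed, h) - V_U(fixed, h[:-1])

def V_R(policy, h):
    return sum(
        prob(f, h, policy) * sum(R(f[: i + 1]) for i in range(len(h), N))
        for f in product(ACTIONS, repeat=N)
    )

def greedy(h):
    return {"a": 1.0, "b": 0.0}

# V_R(pi, h) = V_U(pi, h) - V_U(pi', h): the gap does not depend on pi.
h = ()
gap_greedy = V_U(greedy, h) - V_R(greedy, h)
gap_fixed = V_U(fixed, h) - V_R(fixed, h)
```

Both gaps equal $V_U(\pi', h)$, a constant depending only on $h$ and the reference policy, so the two maximisers rank policies identically.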
(In)Equivalence for infinite horizons
If the agent expects it could make arbitrarily many observations, then we can still define a good $U$ from a given $R$. Let $\mathcal{H}_\infty$ be the set of possible infinite histories; then the sets $\mathcal{W}_{h'}$, for $h' \in \mathcal{H}_\infty$, form a partition of $\mathcal{W}$. Then for any fixed $0 < \gamma < 1$, define, for $w \in \mathcal{W}_{h'}$:

$$U(w) = \sum_{i=1}^{\infty} \gamma^i R(h'_i),$$

and we will get:
- Theorem 3: If $R$ is bounded, then $U$ is well-defined and bounded, and on history $h$, $V_U(\pi, h)$ and $V_R^\gamma(\pi, h)$ are related by a positive affine transformation that is a function of $h$ and $\gamma$ only, not of $\pi$. Consequently, a $\gamma$-discounted $R$-maximiser and a $U$-maximiser will choose the same policies.
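As a quick numerical sanity check of the affine relation (a Python sketch under stated assumptions: an illustrative reward of 1 for action $a$ and 0 for $b$, a deterministic "always $a$" continuation, and a truncated sum standing in for the infinite one):

```python
GAMMA = 0.5
STEPS = 60  # truncation depth: the neglected tail is below 2**-60

def R(h):
    """Illustrative reward: 1 whenever the last action is "a"."""
    return 1.0 if h[-1] == "a" else 0.0

# Deterministic setting: observed history h = ("b",), then "always a" forever.
world = ("b",) + ("a",) * (STEPS - 1)
m = 1  # length of the observed history h

# U(w) = sum_i gamma^i R(w_i) (truncated); the world is deterministic, so V_U = U.
v_u = sum(GAMMA ** (i + 1) * R(world[: i + 1]) for i in range(STEPS))
# V_R^gamma(pi, h) = sum over i > m of gamma^(i-m) R(w_i) (truncated).
v_r = sum(GAMMA ** (i + 1 - m) * R(world[: i + 1]) for i in range(m, STEPS))
# The past discounted reward: the constant part of the affine transformation.
past = sum(GAMMA ** (i + 1) * R(world[: i + 1]) for i in range(m))

affine_check = past + GAMMA ** m * v_r  # should equal v_u
```

Here $V_U = \big(\sum_{i \le m} \gamma^i R(h_i)\big) + \gamma^m V_R^\gamma$: a positive affine transformation whose coefficients depend only on $h$ and $\gamma$.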
The converse is more tricky. Fix any policy $\pi'$ as before, and, as before, for a history $h_m$ of length $m$, define the reward by:

$$R(h_m) = \gamma^{-m} \left( V_U(\pi', h_m) - V_U(\pi', h_{m-1}) \right).$$
- Theorem 4: If $P$ asymptotically ignores the future (so that, for some $\epsilon_n$ tending to zero, the expected utilities of any two policies on any history of length $n$ are within $\epsilon_n$ of each other), then on history $h$, $V_U(\pi, h)$ and $V_R^\gamma(\pi, h)$ are related by a positive affine transformation that is a function of $h$, $\pi'$, and $\gamma$ only, not of $\pi$. Consequently, a $\gamma$-discounted $R$-maximiser and a $U$-maximiser will choose the same policies. Even if $U$ is bounded, $R$ need not be.
A utility counterexample
So, what kind of utility function cannot be made into a reward function in the above way?
Well, assume there are two actions, $a$ and $b$, and that $U$ is $1$ if the agent only ever chooses $a$, and $0$ if it ever chooses $b$. Let $\pi'$ be the policy that always chooses $b$ (all that's needed, in fact, is that it eventually chooses $b$ with probability $1$).
Then $V_U(\pi', h)$ is always zero, as is the constructed reward $R(h_m) = \gamma^{-m} (V_U(\pi', h_m) - V_U(\pi', h_{m-1}))$. Thus $V_R^\gamma(\pi, h)$ is also always zero. And this despite the fact that there exists a policy that gets utility $1$: namely the "always choose $a$" policy.
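A minimal numerical rendering of this counterexample (a Python sketch; the finite truncation depth stands in for "forever", and the construction is the one from Theorem 4):

```python
GAMMA = 0.9
DEPTH = 50  # stands in for "forever" in this finite sketch

def U(world):
    """1 if the agent only ever chooses "a", 0 if it ever chooses "b"."""
    return 1.0 if all(act == "a" for act in world) else 0.0

def V_U_prime(h):
    """V_U(pi', h), where pi' always chooses "b": the continuation is b, b, b, ..."""
    return U(tuple(h) + ("b",) * DEPTH)

def R(h):
    """Theorem 4's construction from U, using the reference policy pi'."""
    m = len(h)
    return GAMMA ** (-m) * (V_U_prime(h) - V_U_prime(h[:-1]))

# Every constructed reward is zero, so every V_R is zero as well...
rewards = [R(h) for h in [("a",), ("b",), ("a", "a"), ("a", "b"), ("b", "a")]]
# ...yet the "always choose a" policy achieves utility 1.
best = U(("a",) * DEPTH)
```

The reward function is blind to the utility that $\pi'$ throws away, which is exactly why the construction fails here.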
Does it make a difference in practice?
In order to reach the equivalences above, the value functions need to be exactly calculated, meaning that the probability estimator needs to be perfect. Thus the equivalence is only established for logically omniscient agents.
In practice, utility functions are most useful when we know the ideal outcome states, and the challenge is to design an agent that reaches them. Reward functions are most useful when we know the best local moves for the agent to make, but not necessarily the best outcomes.
Appendix: the proofs

This appendix gives the proofs of the four theorems above. They proceed by expressing the relationship between the two relevant value functions.
Proof of Theorem 1. If $w \in \mathcal{W}_{h'}$ for $h' \in \mathcal{H}_n$, then $U$ is constant on $\mathcal{W}_{h'}$, so we can talk about $U(h')$ (which is $\sum_{i=1}^{n} R(h'_i)$). Then if $h$ is of length $m$:

$$V_U(\pi, h) = \sum_{h' \in \mathcal{H}_n} P(h' \mid h, \pi) \, U(h') = \sum_{i=1}^{m} R(h_i) + V_R^n(\pi, h),$$

since $P(h' \mid h, \pi)$ being non-zero means that the first $m$ actions and observations of $h'$ are the same as those of $h$, and $P(\mathcal{W}_{h'} \mid h, \pi)$ is the same as $P(h' \mid h, \pi)$: the probability that we are in a world with history $h'$ is the same as the probability of observing $h'$.

Because $\sum_{i=1}^{m} R(h_i)$ is a function purely of the past, this means that a $U$-maximiser will behave the same way as an $R$-maximiser.
Proof of Theorem 2. If $h'$ is a history of length $n$, then the rewards telescope:

$$\sum_{i=m+1}^{n} R(h'_i) = V_U(\pi', h') - V_U(\pi', h'_m),$$

and, taking expectations given $h$ and $\pi$:

$$V_R^n(\pi, h) = V_U(\pi, h) - V_U(\pi', h),$$

because if $h' \in \mathcal{H}_n$ and $P(h' \mid h, \pi) \neq 0$, then $h'_m = h$, $V_U(\pi', h') = V_U(\pi, h')$, and $\sum_{h' \in \mathcal{H}_n} P(h' \mid h, \pi) \, V_U(\pi, h') = V_U(\pi, h)$.
Proof of Theorem 3. If $h$ is of length $m$, then

$$V_U(\pi, h) = \lim_{n \to \infty} \sum_{h' \in \mathcal{H}_n} P(h' \mid h, \pi) \sum_{i=1}^{n} \gamma^i R(h'_i) = \sum_{i=1}^{m} \gamma^i R(h_i) + \gamma^m V_R^\gamma(\pi, h),$$

similarly to the proof of Theorem 1. (Since $R$ is bounded by some $M$, the sum defining $U$ converges and is bounded by $M\gamma/(1-\gamma)$, so $U$ is well-defined and bounded.) Here, we have used the fact that the actions and observations beyond the $n$-th are irrelevant to $\sum_{i=1}^{n} \gamma^i R(h'_i)$, in order to amalgamate all $w$ that have the same first $n$ actions and observations, and then interchange the limit with the finite sum over all the histories in $\mathcal{H}_n$.
Proof of Theorem 4. As in Theorem 2, the discounted rewards telescope: for $h'$ of length $n$,

$$\sum_{i=m+1}^{n} \gamma^{i-m} R(h'_i) = \gamma^{-m} \left( V_U(\pi', h') - V_U(\pi', h'_m) \right).$$

Because $P$ asymptotically ignores the future, the term $V_U(\pi', h')$ can be rewritten as $V_U(\pi, h')$, a re-writing that introduces an error of norm at most $\epsilon_n$. Since $\epsilon_n$ tends to zero in the limit, taking expectations as in the proof of Theorem 2 gives

$$V_R^\gamma(\pi, h) = \gamma^{-m} \left( V_U(\pi, h) - V_U(\pi', h) \right).$$
Now we'll show that this reward function need not be bounded. Imagine that the agent has two actions, $a$ and $b$. The utility is an infinite weighted sum over the times at which the agent chooses $a$. For $i \geq 1$, let $X_i(h)$ be a Boolean that is $1$ if the agent chose $a$ on history $h$ at time $i$, and $0$ otherwise. Let $w_i = 2^{-i}$, and define the utility:

$$U = \sum_{i=1}^{\infty} w_i X_i.$$
Now let $\pi'$ be the policy of always choosing $a$, and $\pi_b$ the policy of always choosing $b$. Then for any history $h_m$ of length $m$,

$$R(h_m) = \gamma^{-m} \left( V_U(\pi', h_m) - V_U(\pi', h_{m-1}) \right) = \gamma^{-m} w_m \left( X_m(h_m) - 1 \right).$$
This means that if $\gamma < 1/2$, then along the histories generated by $\pi_b$ (where every $X_m$ is $0$), the value of $|R(h_m)| = (2\gamma)^{-m}$ will increase without limit as $h_m$ gets longer, even though $U$ itself is bounded (by $1$). If we replace $w_i = 2^{-i}$ with $w_i = i^{-2}$, we keep the fact that $U$ is bounded, and, since $i^{-2}$ will eventually be greater than $\gamma^i$, for all $\gamma < 1$, we can see that $|R(h_m)|$ will increase without limit for any value of $\gamma$.
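A short numerical check of the blow-up (a Python sketch; the reference policy $\pi'$ is "always choose $a$", the discount is an illustrative $\gamma = 0.4 < 1/2$, and a finite depth stands in for the infinite sum):

```python
GAMMA = 0.4  # any discount below 1/2 makes the constructed reward blow up
DEPTH = 200  # truncation depth for the infinite weighted sum

def U(world):
    """Bounded utility: sum of 2^-i over the times i at which the agent chose "a"."""
    return sum(2.0 ** -(i + 1) for i, act in enumerate(world) if act == "a")

def V_U_a(h):
    """V_U(pi', h) for the reference policy pi' = "always choose a"."""
    return U(tuple(h) + ("a",) * DEPTH)

def R(h):
    """Theorem 4's construction; on all-b histories, |R| is (2*gamma)^-m."""
    m = len(h)
    return GAMMA ** (-m) * (V_U_a(h) - V_U_a(h[:-1]))

# Along histories that always choose b, the constructed reward grows without bound.
rewards = [abs(R(("b",) * m)) for m in (1, 5, 10)]
```

The magnitudes follow $(2\gamma)^{-m} = 1.25^m$ here, growing geometrically even though the utility never exceeds $1$.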