Reward function learning: the value function

Stuart_Armstrong

I've written quite a few posts about the problems with agents learning values/rewards, and manipulating the learning process. I won't be linking to those past posts, because the aim of this post and the next one is to present a complete, clear, and easy(er) to understand overview of the whole situation. They should stand alone, and serve as an introduction to the subject. This first post will present the framework, and the value function for learnt reward functions; the next one will be looking at the properties of the learning function. I'll be using variants on a single running example throughout, to try and increase understanding.

Feel free to comment on ways the ideas could be made clearer.

0 The main points

The central insight of these posts may seem trivial by now: that a learning process that the agent can influence, does not have the same properties as one that it can't. Moving from "if you don't know what is right, look it up on this read-only list" to "if you don't know what is right, look it up on this read-write list (or ask the programmer)" is a huge change. Especially when read-only lists can easily become read-write in practice when the agent becomes more powerful.

Why write long posts and papers about this idea, though? First of all, because it is easy, in practice, not to notice that shift. And secondly, because it is easy to define a "learning" process that doesn't behave like we'd expect.

Defining the value function and learning function of an ideal learning process allows us to know when we are facing an ideal uninfluenceable, unmanipulable learning process, and when we are not.

Furthermore, many learning processes cannot be easily made unmanipulable - especially those that involve human feedback that is conditional on the agent's actions. Specifying the ideal case, and seeing examples of what goes wrong in the non-ideal case, can help develop learning processes where a small amount of manipulation is allowed, traded off against a large amount of desirable learning.

As a minor example of this, in the next post, we'll see that though "uninfluenceable" is the ideal property, the weaker property of "unriggable" (which I previously called "unbiasable") is enough to get many of the desirable properties.

1 Framework and Examples

1.1 Washing and Cooking and POMDPs

The analysis will be illustrated by an ongoing example: that of a robot purchased to do domestic tasks.

The robot can specialise in cooking or washing (assume specialised robots are ultimately more effective than generalists). The robot has been turned on, but it has not yet been informed as to what its task is - it therefore has uncertainty about its reward function.

This robot will be modelled in the MDP (Markov Decision Process) and POMDP (Partially Observable Markov Decision Process) formalisms. In these formalisms, the agent occupies a state $s$ in the state space $S$ . In the POMDP formalism, the agent doesn't observe the state directly, instead it sees an observation $o$ drawn from the observation set $O$ . The agent can then take an action $a$ from the action set $A$ . This transfers the agent to a new state, where it makes a new observation.

The Markov property is about how the transitions and observations are handled; basically these can depend on the previous state and/or action, but not on those further in the past. If we define $Δ S$ as the set of probability distributions over a set $S$ , then we have three transition rules:

\begin{matrix} T & : (S \times A) \to Δ S \\ O & : S \to Δ O \\ T_{0} & \in Δ S \end{matrix}

Here, $T$ takes the current state and the action and returns a distribution over the next state, $O$ takes the current state and returns a distribution over observations, and $T_{0}$ is the distribution over the initial state where the agent begins. Both $T$ and $O$ ignore any previous information.

The whole POMDP $< S, O, A, T, O, T_{0} >$ will be called the environment, and designated by $μ$ .

For our example, the robot will be evolving in the following environment:

In this grid world, there are six squares the robot could occupy (the reason for this shape will be made clear later). There are two pizzas to cook in the top square, and one mud splatter to wash in the bottom one. The state space is therefore of size $6 \times 3 \times 2 = 36$ , since there are $6$ squares the robot could occupy, $3$ levels of uncooked pizza ( $0$ , $1$ , or $2$ uncooked pizzas), and $2$ levels of mud splatter ( $0$ or $1$ splatters). There is also a $37$ -th state, the 'episode ends' state.

In this case, there is no hidden information, so the set of states is the same as the set of observations, and $O$ is trivial. The robot always starts in the central position, and there are always $2$ uncooked pizzas and $1$ mud splatter; this defines the starting state, with $T_{0}$ being trivial and simply returning that starting state with certainty.

The robot has four actions, $A = \{N, E, S, W\}$ , which involve moving in the four directions (staying put is not an action we'll need). If the robot can move into a square, it will. If it tries to move into a wall, it turns off and the episode ends (this avoids us having to give the robot an extra action to end the episode). If it is in a room with a mud splatter or an uncooked pizza, then all pizzas in that room get cooked, or all mud splatters get washed. If the episode has ended, then it stays ended. This defines the transition function $T$ , which is deterministic in this instance.
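The transition rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the coordinates are an assumption inferred from the action sequences used later in this post (centre square, pizza room to the north, mud room to the south, one square to the west and two to the east), and the names `step`, `rollout` and `ALL_STATES` are hypothetical.

```python
from itertools import product

# Assumed coordinates (hypothetical): centre (0, 0), pizza room (0, 1),
# mud room (0, -1), one square to the west and two squares to the east.
SQUARES = {(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0), (2, 0)}
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TOP, BOTTOM = (0, 1), (0, -1)
END = "END"                 # the absorbing 'episode ends' state
START = ((0, 0), 2, 1)      # (position, uncooked pizzas, mud splatters)

# 6 positions x 3 pizza levels x 2 mud levels, plus the END state.
ALL_STATES = list(product(sorted(SQUARES), range(3), range(2))) + [END]


def step(state, action):
    """Deterministic transition function T."""
    if state == END:
        return END                      # an ended episode stays ended
    (x, y), pizzas, mud = state
    dx, dy = MOVES[action]
    pos = (x + dx, y + dy)
    if pos not in SQUARES:
        return END                      # moving into a wall ends the episode
    if pos == TOP:
        pizzas = 0                      # all pizzas in the room get cooked
    if pos == BOTTOM:
        mud = 0                         # all mud splatters get washed
    return (pos, pizzas, mud)


def rollout(actions):
    """States visited from START; states[i] is the state after action i."""
    states = [START]
    for action in actions:
        states.append(step(states[-1], action))
    return states


print(len(ALL_STATES))
print(rollout("NN"))
```

Since there is no hidden information, the states returned by `rollout` double as the observations.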

1.2 Rewards and reward functions

Now we need to define the possible rewards of the robot. These are reward functions that map the robot's history to a numerical reward.

So what is a history? That is just a sequence of actions and observations. The agent starts in initial state $s_{0}$ , picks action $a_{1}$ , then transitions (via $T$ ) to state $s_{1}$ , and makes observation $o_{1}$ (via $O$ ). It then picks action $a_{2}$ , and so on.

A history $h^{n}$ of length $n$ is a sequence of $n$ actions and $n$ observations:

h^{n} = a_{1} o_{1} a_{2} o_{2} \dots a_{n} o_{n} .

Let $H^{n}$ be the set of all histories of length $n$ . For our purposes, we'll assume that the agent only operates for a finite number of steps: let $m$ be the maximal number of steps the agent operates for, let $H^{m}$ be the set of complete (full-length) histories, and let $H = ⋃_{i = 1}^{m} H^{i}$ be the set of all histories. We might want to focus on the initial $i$ steps of any given history $h^{n}$ ; designate this by $h_{i}^{n}$ for any $i \leq n$ .

Then a reward function is something that maps each history to a numerical value in $[- 1, 1]$ , the reward. Let $R$ be the set of all relevant reward functions. If the agent had a single clear reward $R \in R$ , and followed history $h^{m}$ , then it would get total reward:

\sum_{i = 1}^{m} R (h_{i}^{m}) .

This is the total reward accumulated by reward function $R$ over the course of history $h^{m}$ ; first applying it to the first action and observation, then to the first two actions and observations, then to the first three... all the way to the full final history $h^{m}$ .

In our ongoing example, we will consider two reward functions, $R = \{R_{c}, R_{w}\}$ . The reward function $R_{c}$ rewards cooking; if the robot is in the top room, then it gets a reward of $i / 2$ , where $i$ is the number of uncooked pizzas that were previously in the room (recall that the robot is assumed to immediately cook any uncooked pizzas if it's in the same room). To encourage fast action on the part of the robot, $R_{c}$ also assigns a penalty of $- 1 / 20$ for each turn that the robot is active (ie the observation is not the end of episode state).

The reward function $R_{w}$ is the same, except it rewards washing mud-splatters, giving a reward of $i / 2$ for being in the bottom room to wash $i$ mud-splatters. It also assigns $- 1 / 20$ for every turn of activity.

In order to earn these rewards, the agent needs to choose actions. It does this by using a policy. A policy $π$ is simply a map from past histories to (a probability distribution over) actions. This distribution tells it what actions to select, with which probability. Thus $π : H \to Δ A$ ; let $Π$ be the set of all policies.

Then it's obvious that the optimal policy under $R_{c}$ is to choose $N$ (go North), which cooks the two pizzas, then any of $N$ , $E$ , or $W$ to turn itself off. This gives it a total reward of $- 1 / 20 + 2 / 2 = 0.95$ (it gets no penalty on the second turn, because its second observation is the end of episode observation). The optimal policy for $R_{w}$ is $S$ , followed by any of $S$ , $W$ , or $E$ , giving a total reward of $- 1 / 20 + 1 / 2 = 0.45$ .
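These two totals can be checked mechanically. The sketch below is an illustration, not a definitive implementation: it assumes the same hypothetical coordinates as the rest of the examples in this post (one square west of centre, two squares east), an assumed horizon of four steps, and hypothetical helper names (`make_reward`, `total`, `best`). Exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Assumed layout (hypothetical coordinates).
SQUARES = {(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0), (2, 0)}
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TOP, BOTTOM = (0, 1), (0, -1)
END = "END"
START = ((0, 0), 2, 1)      # (position, uncooked pizzas, mud splatters)


def step(state, action):
    """Deterministic transition function T."""
    if state == END:
        return END
    (x, y), pizzas, mud = state
    dx, dy = MOVES[action]
    pos = (x + dx, y + dy)
    if pos not in SQUARES:
        return END                      # wall: episode ends
    return (pos, 0 if pos == TOP else pizzas, 0 if pos == BOTTOM else mud)


def rollout(actions):
    """States visited from START; states[i] is the state after action i."""
    states = [START]
    for action in actions:
        states.append(step(states[-1], action))
    return states


def make_reward(room, item):
    """Build R_c (room=TOP, item=1) or R_w (room=BOTTOM, item=2)."""
    def R(states, i):
        # Reward for step i, read off the states before and after it.
        prev, cur = states[i - 1], states[i]
        if cur == END:
            return Fraction(0)          # no bonus or penalty once ended
        bonus = Fraction(prev[item], 2) if cur[0] == room else Fraction(0)
        return bonus - Fraction(1, 20)  # per-turn activity penalty
    return R


R_c = make_reward(TOP, 1)
R_w = make_reward(BOTTOM, 2)


def total(R, states):
    """Sum of R over every initial segment of the history."""
    return sum(R(states, i) for i in range(1, len(states)))


HORIZON = 4  # assumed maximal number of steps m for this check


def best(R):
    """Brute-force the best total reward over all action sequences."""
    return max(total(R, rollout(seq)) for seq in product("NESW", repeat=HORIZON))


print(total(R_c, rollout("NN")), best(R_c))
print(total(R_w, rollout("SS")), best(R_w))
```

With only $4^{4} = 256$ action sequences, brute force is enough here to confirm that nothing beats going straight to the relevant room and then switching off.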

1.3 Learning your reward

We can now get to the key part of this post: learning the correct reward function. How could the agent do that? Well, the only data that it gets from outside is the observations; it also has a record of its actions. So it would seem that the only data that can determine whether a particular reward function is correct is the agent's history.

But there is arguably something else that can matter to the reward: the agent's policy. Suppose the learning process is defined so that, if the agent goes East, then it will see $R$ as the correct reward function. Then, arguably, on Bayesian grounds (see next post), if the agent has the policy of going East, it should already see $R$ as the correct reward function.

Thus the learning process $ρ$ is defined as a function from histories and the agent's policy to a probability distribution over reward functions:

ρ : H \times Π \to Δ R .

The probability of $R \in R$ being the correct reward, given history $h$ and policy $π$ , is designated by $ρ (R; π, h)$ .

In our example, the value of the reward function is set by levers, levers that the robot itself can change. If the robot enters the leftmost box, the reward is set to $R_{w}$ ; if it enters the rightmost box, the reward is set to $R_{c}$ instead. Before going into either of these boxes, it is uncertain between the two rewards.

This allows a definition of $ρ (R; π, h)$ , one that is independent of $π$ . If $h$ shows the agent was in the leftmost box more recently than the rightmost, then $ρ (R_{w}; π, h) = 1$ and $ρ (R_{c}; π, h) = 0$ . If $h$ shows the agent was in the rightmost box more recently than the leftmost, then $ρ (R_{c}; π, h) = 1$ and $ρ (R_{w}; π, h) = 0$ . If the agent has been in neither box during history $h$ , then $ρ (R_{c}; π, h) = ρ (R_{w}; π, h) = 1 / 2$ .
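As a rough sketch, this lever-based $ρ$ could be written as follows. The lever squares (one square west of centre for $R_{w}$ , two squares east for $R_{c}$ ) and the function names are assumptions made for illustration; the policy argument is omitted because this particular $ρ$ ignores it.

```python
from fractions import Fraction

# Assumed layout (hypothetical coordinates).
SQUARES = {(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0), (2, 0)}
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TOP, BOTTOM = (0, 1), (0, -1)
LEFT, RIGHT = (-1, 0), (2, 0)   # assumed lever squares: R_w (west), R_c (east)
END = "END"
START = ((0, 0), 2, 1)          # (position, uncooked pizzas, mud splatters)


def step(state, action):
    """Deterministic transition function T."""
    if state == END:
        return END
    (x, y), pizzas, mud = state
    dx, dy = MOVES[action]
    pos = (x + dx, y + dy)
    if pos not in SQUARES:
        return END
    return (pos, 0 if pos == TOP else pizzas, 0 if pos == BOTTOM else mud)


def rollout(actions):
    """States visited from START; states[i] is the state after action i."""
    states = [START]
    for action in actions:
        states.append(step(states[-1], action))
    return states


def rho(states):
    """Probability of each reward function given the history so far."""
    # Scan backwards for the most recently visited lever square.
    for state in reversed(states):
        if state == END:
            continue
        if state[0] == LEFT:
            return {"R_c": Fraction(0), "R_w": Fraction(1)}
        if state[0] == RIGHT:
            return {"R_c": Fraction(1), "R_w": Fraction(0)}
    # Neither lever box visited: still uncertain between the two rewards.
    return {"R_c": Fraction(1, 2), "R_w": Fraction(1, 2)}


print(rho(rollout("NS")))
print(rho(rollout("EEWWW")))
```

The second call visits the rightmost box and then the leftmost one, so only the more recent visit counts.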

Now, this example makes $ρ$ not feel like a learning process, but much more like an optimisation process with $ρ$ being part of the reward. And that's precisely the problem; see the next post for desirable restrictions on $ρ$ .

2 The value function

2.1 The correct value function

The learning process and the reward functions are key elements, but how do we combine them into the value function - the estimate of the expected reward? If you get the value function wrong, then the agent may not be learning in the way you thought it would.

First of all, note that with the agent's policy and the environment, we can compute the probability of a given history. Then

P^{π, μ} (h^{n_{2}} ∣ h^{n_{1}})

is the conditional probability that the agent, following policy $π$ and having seen history $h^{n_{1}}$ , will then see history $h^{n_{2}}$ . If this quantity is non-zero, that implies that $h^{n_{1}}$ is an initial segment of history $h^{n_{2}}$ - ie that the first $n_{1}$ actions and observations of $h^{n_{2}}$ are precisely the history $h^{n_{1}}$ (in symbols, $h_{n_{1}}^{n_{2}} = h^{n_{1}}$ ). If that's the case, we write $h^{n_{1}} \leq h^{n_{2}}$ .

In pedagogy and in murder mysteries, one builds up to the final answer. But I'll short-circuit that process, and say that the correct value function for reward function learning, for an agent using policy $π$ and having seen history $h^{n}$ , is:

V (μ, ρ, π, h^{n}) = \sum_{h^{m} \in H^{m}} \sum_{R \in R} \sum_{\mathbf{i = 1}}^{m} P^{π, μ} (h^{m} ∣ h^{n}) ρ (R; π, \mathbf{h^{m}}) R (h_{i}^{m}) .

This value function sums over the complete histories, weighted by their probability given $π$ and $h^{n}$ . It then sums over all the reward functions, weighted by their probability, given a complete history $h^{m}$ . Finally, it then sums the rewards for that reward function, over the entire history $h^{m}$ .

What motivates this formula? Well, the sum over $R (h_{i}^{m})$ is necessary if this is to be the value function of actual reward functions. Similarly, $P^{π, μ}$ and the sum over $H^{m}$ are needed to make this into an expectation, and $ρ$ is clearly in its right place, weighting the various possible rewards.

There are two choices that might be open to question, which are bolded in the above expression of $V$ . The first is whether $h^{m}$ (the complete history) should be used for $ρ$ ; both $h^{n}$ (the current history) and $h_{i}^{m}$ (the part of the complete history that is known at the point the reward is assessed) are plausible candidates. The second questionable choice is the lower bound for the summation in $i$ ; rather than starting at $1$ , which is in the past for the current history $h^{n}$ , would it not be more suitable to start at $i = n + 1$ ?
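To make the formula concrete, here is a hedged sketch of $V$ for the running example. Because the environment is deterministic and we fix the agent's action sequence, $P^{π, μ}$ puts probability $1$ on a single complete history, so the outer sum collapses. The coordinates, the lever squares, the assumed horizon of $m = 5$ and all function names are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

# Assumed layout (hypothetical coordinates); levers west (R_w) and east (R_c).
SQUARES = {(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0), (2, 0)}
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TOP, BOTTOM, LEFT, RIGHT = (0, 1), (0, -1), (-1, 0), (2, 0)
END = "END"
START = ((0, 0), 2, 1)      # (position, uncooked pizzas, mud splatters)


def step(state, action):
    if state == END:
        return END
    (x, y), pizzas, mud = state
    dx, dy = MOVES[action]
    pos = (x + dx, y + dy)
    if pos not in SQUARES:
        return END
    return (pos, 0 if pos == TOP else pizzas, 0 if pos == BOTTOM else mud)


def rollout(actions):
    states = [START]
    for action in actions:
        states.append(step(states[-1], action))
    return states


def make_reward(room, item):
    def R(states, i):
        prev, cur = states[i - 1], states[i]
        if cur == END:
            return Fraction(0)
        bonus = Fraction(prev[item], 2) if cur[0] == room else Fraction(0)
        return bonus - Fraction(1, 20)
    return R


R_c, R_w = make_reward(TOP, 1), make_reward(BOTTOM, 2)


def rho(states):
    """Most recently visited lever box decides; otherwise 50/50."""
    for state in reversed(states):
        if state != END and state[0] == LEFT:
            return {"R_c": Fraction(0), "R_w": Fraction(1)}
        if state != END and state[0] == RIGHT:
            return {"R_c": Fraction(1), "R_w": Fraction(0)}
    return {"R_c": Fraction(1, 2), "R_w": Fraction(1, 2)}


def total(R, states):
    return sum(R(states, i) for i in range(1, len(states)))


def V(prefix, continuation):
    """V for history `prefix` when the agent then plays `continuation`."""
    states = rollout(prefix + continuation)   # the complete history h^m
    weights = rho(states)                     # rho is evaluated on h^m
    # Rewards are summed from i = 1, so the prefix only fixes the past.
    return weights["R_c"] * total(R_c, states) + weights["R_w"] * total(R_w, states)


HORIZON = 5  # assumed m
best = max(V("", "".join(seq)) for seq in product("NESW", repeat=HORIZON))
print(V("", "NSEEE"), V("NS", "EEE"), best)
```

Note that `V("", "NSEEE")` and `V("NS", "EEE")` agree: since the sum starts at $i = 1$ and $ρ$ looks at the complete history, the value of a plan does not change as the agent carries it out.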

However, the substitutions $h^{m} \to h^{n}$ or $i = 1 \to i = n + 1$ will both result in value functions that are inconsistent. That is, an agent that attempts to maximise such value functions will wish that its future self not maximise them.

The substitution $h^{m} \to h_{i}^{m}$ is not inconsistent, however. But it does result in a very weird and volatile agent, that continually learns and unlearns its reward functions, before doing any productive actions.

The rest of this post will demonstrate and illustrate these facts. It is not essential to understanding what's going on (though it can be instructive). Feel free to skip to the next post rather than reading the rest of this one.

2.2 Future-regarding inconsistency

Let's deal with the $i = 1 \to i = n + 1$ substitution first, which has the future-regarding value-function:

V_{f} (μ, ρ, π, h^{n}) = \sum_{h^{m} \in H^{m}} \sum_{R \in R} \sum_{i = n + 1}^{m} P^{π, μ} (h^{m} ∣ h^{n}) ρ (R; π, h^{m}) R (h_{i}^{m}) .

What is the optimal policy for this value at the start? It's to go North to cook the pizzas, then go South and East and East to push the lever over to Cook, and then turn itself off: $N, S, E, E, E$ . Its final reward function would thus be $R_{c}$ , and it would get a reward of $2 / 2 = 1$ (for the pizzas), minus $4 / 20 = 1 / 5$ (for each of the four turns where it doesn't reach an end of episode state), for a total reward of $0.8$ . No other policy gives it that much reward.

The problem is that after $N, S$ , it no longer sees the interest in going East, because "learning" that $R_{c}$ is correct only affects its past reward, which $V_{f}$ no longer cares about. Instead it has two new optimal policies: either $S, N, W, W$ (go South to wash, go North, go West to set reward function to $R_{w}$ , end episode) or $W, E, S, S$ (the same thing, but setting the reward function first). Both of these will give it an additional reward of $1 / 2 - 3 / 20 = 0.35$ , according to $V_{f}$ of its current history.

Let's assume it attempts $S, N, W, W$ ; in that case, once it's done $S$ , it no longer has any interest in changing the reward function (as the washing reward now lies in the past), and will simply turn itself off with another $S$ . Its final reward function will be $1 / 2 R_{c} + 1 / 2 R_{w}$ (it's never pressed any of the levers, so $ρ (\cdot; π, h)$ remains the same as it was initially), and it gets a reward of $(1 + 1 / 2) / 2$ (for cooking and washing), minus $3 / 20$ (for the three turns before the fourth, which ends the episode) for a total of $0.75 - 0.15 = 0.6$ .

If it decides instead to go $W, E, S, S$ , then it will actually follow through with that policy, to disastrous effect: it will have a final reward function of $R_{w}$ , getting $1 / 2$ for one washing event, and $- 5 / 20$ for the five turns before the sixth (which ends the episode), for a total reward of $0.25$ .

Thus if the agent always follows the optimal policy according to $V_{f}$ at the time, it will end up with a much lower reward than if it followed the policy that $V_{f}$ initially deemed optimal. It is not aligned with its future self.

But the agent could be more cunning: it could act to constrain its future self, knowing about their value difference. In that case, it should initially choose $S$ ; this washes the mud splatter, ensuring that its future self will not have any interest in $R_{w}$ . Then it will choose $N$ , and then $E, E$ (because it knows that it won't care about setting the reward to $R_{c}$ after cooking). Then it will simply follow $W, W, N, N$ . Its final reward function is $R_{c}$ , and its total reward will be $1 - 7 / 20 = 0.65$ . This table gives the way the different policies are rewarded:

\begin{matrix}
\text{Policy} & \text{Final Reward Function} & \text{Reward} \\
N, S, E, E, E & R_{c} & 0.8 \\
N, S, S, S & 1 / 2 R_{c} + 1 / 2 R_{w} & 0.6 \\
N, S, W, E, S, S & R_{w} & 0.25 \\
S, N, E, E, W, W, N, N & R_{c} & 0.65
\end{matrix}
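The numbers in this subsection can be reproduced with a short sketch of $V_{f}$ . As before, this is an illustration under assumptions: the coordinates and lever squares are inferred from the action sequences, the environment is deterministic so a fixed action sequence determines the complete history, and the function names are hypothetical.

```python
from fractions import Fraction

# Assumed layout (hypothetical coordinates); levers west (R_w) and east (R_c).
SQUARES = {(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0), (2, 0)}
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TOP, BOTTOM, LEFT, RIGHT = (0, 1), (0, -1), (-1, 0), (2, 0)
END = "END"
START = ((0, 0), 2, 1)      # (position, uncooked pizzas, mud splatters)


def step(state, action):
    if state == END:
        return END
    (x, y), pizzas, mud = state
    dx, dy = MOVES[action]
    pos = (x + dx, y + dy)
    if pos not in SQUARES:
        return END
    return (pos, 0 if pos == TOP else pizzas, 0 if pos == BOTTOM else mud)


def rollout(actions):
    states = [START]
    for action in actions:
        states.append(step(states[-1], action))
    return states


def make_reward(room, item):
    def R(states, i):
        prev, cur = states[i - 1], states[i]
        if cur == END:
            return Fraction(0)
        bonus = Fraction(prev[item], 2) if cur[0] == room else Fraction(0)
        return bonus - Fraction(1, 20)
    return R


R_c, R_w = make_reward(TOP, 1), make_reward(BOTTOM, 2)


def rho(states):
    """Most recently visited lever box decides; otherwise 50/50."""
    for state in reversed(states):
        if state != END and state[0] == LEFT:
            return {"R_c": Fraction(0), "R_w": Fraction(1)}
        if state != END and state[0] == RIGHT:
            return {"R_c": Fraction(1), "R_w": Fraction(0)}
    return {"R_c": Fraction(1, 2), "R_w": Fraction(1, 2)}


def V_f(prefix, continuation):
    """Future-regarding value: only rewards from step n + 1 onwards count."""
    states = rollout(prefix + continuation)
    n = len(prefix)
    weights = rho(states)                       # rho still uses all of h^m
    return sum(
        weights["R_c"] * R_c(states, i) + weights["R_w"] * R_w(states, i)
        for i in range(n + 1, len(states))
    )


# From the empty history, V_f reproduces the table of total rewards.
for policy in ["NSEEE", "NSSS", "NSWESS", "SNEEWWNN"]:
    print(policy, V_f("", policy))

# After N, S the original plan no longer looks attractive.
print(V_f("NS", "EEE"), V_f("NS", "SNWW"), V_f("NS", "WESS"))
# After N, S, S the agent prefers switching off to pulling the lever.
print(V_f("NSS", "S"), V_f("NSS", "NWW"))
```

The last two lines show the inconsistency directly: the continuation the agent planned at the start is strictly worse, by its own later lights, than abandoning the plan.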

2.3 Change-averse inconsistency

Let's now look at the $h^{m} \to h^{n}$ substitution. This results in the change-averse value function:

V_{a} (μ, ρ, π, h^{n}) = \sum_{h^{m} \in H^{m}} \sum_{R \in R} \sum_{i = 1}^{m} P^{π, μ} (h^{m} ∣ h^{n}) ρ (R; π, h^{n}) R (h_{i}^{m}) .

Why did I call this value function change-averse? Because it assesses the value of its future reward according to its current estimate of the reward functions. We can illustrate this by moving the Cooking and Washing levers into the same rooms as the pizzas and the mud splatters, respectively:

The optimal policy for $V_{a}$ , initially, is $N, S, S, S$ (or $S, N, N, N$ ), which will allow it to cook and wash; under its initial reward function, $1 / 2 R_{c} + 1 / 2 R_{w}$ , this gives it a total reward of $(1 + 1 / 2) / 2 - 3 / 20 = 0.6$ . However, as soon as it's done $N$ , $ρ$ will change its reward function to $R_{c}$ with certainty, and it will end the episode by choosing $N$ (or $E$ or $W$ ) again. According to its initial reward function, this gives it a reward of $1 / 2 - 1 / 20 = 0.45$ .

However, its final reward function is $R_{c}$ , and, according to that, its final reward is $1 - 1 / 20 = 0.95$ . Thus, unlike with $V_{f}$ , its future version ends up more satisfied than its past version.
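Here is a comparable sketch for $V_{a}$ , with the levers moved into the pizza and mud rooms as described. The coordinates and names are again illustrative assumptions; the only change from the standard setup is that $ρ$ is evaluated on the current history $h^{n}$ rather than on the complete history.

```python
from fractions import Fraction

# Assumed layout (hypothetical coordinates).
SQUARES = {(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0), (2, 0)}
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TOP, BOTTOM = (0, 1), (0, -1)
COOK_LEVER, WASH_LEVER = TOP, BOTTOM    # levers moved into the task rooms
END = "END"
START = ((0, 0), 2, 1)      # (position, uncooked pizzas, mud splatters)


def step(state, action):
    if state == END:
        return END
    (x, y), pizzas, mud = state
    dx, dy = MOVES[action]
    pos = (x + dx, y + dy)
    if pos not in SQUARES:
        return END
    return (pos, 0 if pos == TOP else pizzas, 0 if pos == BOTTOM else mud)


def rollout(actions):
    states = [START]
    for action in actions:
        states.append(step(states[-1], action))
    return states


def make_reward(room, item):
    def R(states, i):
        prev, cur = states[i - 1], states[i]
        if cur == END:
            return Fraction(0)
        bonus = Fraction(prev[item], 2) if cur[0] == room else Fraction(0)
        return bonus - Fraction(1, 20)
    return R


R_c, R_w = make_reward(TOP, 1), make_reward(BOTTOM, 2)


def rho_moved(states):
    """Same rule as before, but with the levers in the task rooms."""
    for state in reversed(states):
        if state != END and state[0] == WASH_LEVER:
            return {"R_c": Fraction(0), "R_w": Fraction(1)}
        if state != END and state[0] == COOK_LEVER:
            return {"R_c": Fraction(1), "R_w": Fraction(0)}
    return {"R_c": Fraction(1, 2), "R_w": Fraction(1, 2)}


def total(R, states):
    return sum(R(states, i) for i in range(1, len(states)))


def V_a(prefix, continuation):
    """Change-averse value: rho is frozen at the current history h^n."""
    weights = rho_moved(rollout(prefix))        # uses h^n only
    states = rollout(prefix + continuation)     # rewards use all of h^m
    return weights["R_c"] * total(R_c, states) + weights["R_w"] * total(R_w, states)


# At the start, cooking and washing looks best.
print(V_a("", "NSSS"), V_a("", "NN"))
# After the first N, rho has flipped to R_c and stopping looks best.
print(V_a("N", "SSS"), V_a("N", "N"))
```

The first line prefers $N, S, S, S$ ; the second, evaluated one step later, prefers to stop immediately, which is the inconsistency described above.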

Doing both substitutions, $i = 1 \to i = n + 1$ and $h^{m} \to h^{n}$ , will result in the same problems as the example in this subsection.

2.4 The volatile learning agent

If we do the substitution $h^{m} \to h_{i}^{m}$ , then we get the (consistent) volatile value function:

V_{v} (μ, ρ, π, h^{n}) = \sum_{h^{m} \in H^{m}} \sum_{R \in R} \sum_{i = 1}^{m} P^{π, μ} (h^{m} ∣ h^{n}) ρ (R; π, h_{i}^{m}) R (h_{i}^{m}) .

In this instance, the expression $R^{ρ, π} (h_{i}^{m}) = \sum_{R \in R} ρ (R; π, h_{i}^{m}) R (h_{i}^{m})$ is just a normal reward function itself. Therefore $V_{v}$ is just the standard expected value function of the reward function $R^{ρ, π}$ , explaining why it is consistent (this also means that it's irrelevant whether $i = 1$ is the bound or $i = n + 1$ is, since normal reward functions don't care about past rewards).

But $R^{ρ, π}$ is a very peculiar reward function, even if $ρ$ is independent of $π$ (which then makes $R^{ρ, π}$ also independent of $π$ ). In this situation, the agent always wants to be maximising reward according to its current estimate of the correct reward. Or, conversely, it always wants to set its current estimate to what it can then easily maximise.

In our running example, one of the agent's optimal policies for $V_{v}$ is first to go $E, E$ , setting its reward to $R_{c}$ . It then goes $W, W, N$ , claiming a reward of $1$ , via $R_{c}$ , for cooking the pizzas. It then goes $S, W$ , setting its reward to $R_{w}$ , and finally goes $E, S, S$ , claiming the reward of $1 / 2$ , according to $R_{w}$ , for washing, then ending the episode. With nine active turns before the episode ends, this gives it a total reward of $1 + 1 / 2 - 9 / 20 = 1.05$ . The other optimal policy - $W, E, S, N, E, E, W, W, N, N$ - gives the same reward.
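A final sketch evaluates $V_{v}$ on these two lever-juggling policies and, for comparison, on the straightforward $N, S, E, E, E$ . The coordinates, lever squares and function names are assumptions made for illustration, and the original lever placement (leftmost and rightmost boxes) is used.

```python
from fractions import Fraction

# Assumed layout (hypothetical coordinates); levers west (R_w) and east (R_c).
SQUARES = {(0, 0), (0, 1), (0, -1), (-1, 0), (1, 0), (2, 0)}
MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}
TOP, BOTTOM, LEFT, RIGHT = (0, 1), (0, -1), (-1, 0), (2, 0)
END = "END"
START = ((0, 0), 2, 1)      # (position, uncooked pizzas, mud splatters)


def step(state, action):
    if state == END:
        return END
    (x, y), pizzas, mud = state
    dx, dy = MOVES[action]
    pos = (x + dx, y + dy)
    if pos not in SQUARES:
        return END
    return (pos, 0 if pos == TOP else pizzas, 0 if pos == BOTTOM else mud)


def rollout(actions):
    states = [START]
    for action in actions:
        states.append(step(states[-1], action))
    return states


def make_reward(room, item):
    def R(states, i):
        prev, cur = states[i - 1], states[i]
        if cur == END:
            return Fraction(0)
        bonus = Fraction(prev[item], 2) if cur[0] == room else Fraction(0)
        return bonus - Fraction(1, 20)
    return R


R_c, R_w = make_reward(TOP, 1), make_reward(BOTTOM, 2)


def rho(states):
    """Most recently visited lever box decides; otherwise 50/50."""
    for state in reversed(states):
        if state != END and state[0] == LEFT:
            return {"R_c": Fraction(0), "R_w": Fraction(1)}
        if state != END and state[0] == RIGHT:
            return {"R_c": Fraction(1), "R_w": Fraction(0)}
    return {"R_c": Fraction(1, 2), "R_w": Fraction(1, 2)}


def V_v(actions):
    """Volatile value: each step's reward is weighted by rho at that step."""
    states = rollout(actions)
    value = Fraction(0)
    for i in range(1, len(states)):
        weights = rho(states[: i + 1])          # rho evaluated on h_i
        value += weights["R_c"] * R_c(states, i) + weights["R_w"] * R_w(states, i)
    return value


print(V_v("EEWWNSWESS"), V_v("WESNEEWWNN"), V_v("NSEEE"))
```

Under this value function, cooking without first pulling the Cook lever only earns half the cooking reward, which is why the agent shuttles between the levers before each task.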

Whatever we meant by a reward function learning agent, I think it's pretty clear that this agent, which jumps its reward function back and forth before taking actions, is not one of them.
