Partial Identifiability in Reward Learning

Joar Skalse

In this post, I will provide a summary of the paper Invariance in Policy Optimisation and Partial Identifiability in Reward Learning, and explain some of its results. I will assume basic familiarity with reinforcement learning. This is the second post in the theoretical reward learning sequence, which starts in this post (though this post is self-contained). This was joint work together with Matthew Farrugia-Roberts and Adam Gleave.

There are many forms of reward learning, including e.g. RLHF and IRL. Most of these methods cannot identify the underlying reward function uniquely, even in the limit of infinite data. For example, two different reward functions may have the same optimal policy — in that case, no amount of data from an optimal policy will let you distinguish between them. This means that the reward function is ambiguous, or partially identifiable, given these data sources. In the paper, we describe the extent of this partial identifiability for several different data sources, and characterise the ambiguity that remains in the limit of infinite training data. This paper also lays some theoretical groundwork that will be important for later analysis of different reward learning algorithms.

Formalism

Before we can answer the question of this paper, we must first find a good way to make that question formal.

First of all, note that there are many kinds of reward learning algorithms, and that they rely on different kinds of data. For example, IRL relies on policies, whereas RLHF relies on (noisy) comparisons between trajectories, etc. We should therefore start by finding a common framework for talking about all of these different algorithms.

Given a set of states and a set of actions $A$ , let $R$ be the set of all reward functions that can be defined over $S$ and $A$ . Moreover, let $X$ be some set of objects that can be computed from reward functions. We can now think of any given reward learning data source as a function $f : R \to X$ , since each data source represents something that can be computed from a reward function.

For example, suppose the reward learning algorithm observes a Boltzmann-rational policy (for a formal definition of Boltzmann-rationality, see the main paper). In this case, we can let $X$ be the set of all policies that can be defined over $S$ and $A$ , and let f be the function that, given a reward function $R$ , returns the Boltzmann-rational policy for R (for some transition function and discount factor). To give a different example, suppose the reward learning algorithm observes the cumulative reward of entire trajectories (so that the training data consists of pairs $⟨ ξ, r ⟩$ , where $ξ$ is a trajectory and $r$ is the reward accrued over $ξ$ ). In this case, we can let $X$ be the set of functions $(S \times A)^{ω} \to R$ , and consider the function $f$ that, given a reward R, returns the function $G$ which associates each trajectory with its cumulative discounted reward (under $R$ ). In this way, any reward learning data source can be represented as a function $f : R \to X$ for some set X.

Using this, we can build an abstract model of a reward learning algorithm as follows: first, we suppose that there is an underlying true reward function $R^{*}$ , and that the data is generated via a data model $f : R \to X$ , so that the reward learning algorithm observes some data $x = f (R^{*})$ . The reward learning algorithm will then learn (or converge to) a reward function $R_{H}$ that fits the observed training data — that is, an $R_{H}$ such that $f (R_{H}) = x = f (R^{*})$ . For example, if an IRL algorithm observes a Boltzmann-rational policy (for an underlying ground truth reward function), then it will converge to a reward function R_H whose Boltzmann-rational policy coincides with the observed policy. Note that this primarily is a model of the asymptotic behaviour or a reward learning algorithm.

This now leads us to the following definition:

Definition: Given a function $f : R \to X$ , we say that the invariance partition of f is the partition of $R$ according to the equivalence relation $\equiv_{f}$ where $R_{1} \equiv_{f} R_{2}$ if and only if $f (R_{1}) = f (R_{2})$ , for all rewards $R_{1}$ and $R_{2}$ .

Note that the invariance partition of f groups together all reward functions that the learning algorithm could converge to. In other words, given a data model f and an underlying true reward function $R^{*}$ , we are guaranteed to converge to a reward function $R_{H}$ such that $R_{H} \equiv_{f} R^{*}$ . The problem of characterising partial identifiability in IRL can thus be cast as the problem of describing these invariance partitions for different data sources.

In addition to characterising how ambiguous the reward function is for different reward learning algorithms, we also want some way of determining how problematic this ambiguity is. In particular, note that it is often unnecessary to identify a reward function uniquely, because all plausible reward functions might be equivalent for the purposes of some given application. For example, if we want to learn a reward function in order to compute an optimal policy, then it is enough to learn a reward function that has the same optimal policies as the true reward function. In general, ambiguity is not problematic if all compatible reward functions lead to identical downstream outcomes.

We therefore also want to characterise the ambiguity tolerance for various applications — this allows us to evaluate whether or not the ambiguity of a given data source is problematic for a given application. However, note that this boils down to the same formal problem as the one we solve when we characterise the ambiguity of a data source. In particular, let $f : R \to X$ be a function whose output we wish to compute using the learnt reward function. Then if $R^{*}$ is the underlying true reward function, it is acceptable to instead learn another reward function $R_{H}$ , as long as $f (R_{H}) = f (R^{*})$ . This means that the invariance partition of f also groups together all reward functions that it would be acceptable to learn, if the purpose is to compute the output of f.

This observation now leads us to the following definition:

Definition: Given two functions $f : R \to X$ and $g : R \to Y$ , if $f (R_{1}) = f (R_{2}) ⟹ g (R_{1}) = g (R_{2})$ , we write $f \leq g$ and say f is no more ambiguous than g. If $f \leq g$ but not $g \leq f$ , then we write $f < g$ and say that $f$ is strictly less ambiguous than $g$ .

This definition formalises two important relationships. Given two reward learning data sources $f$ and $g$ , if $f < g$ then we get strictly more information about the underlying reward function by observing data from $f$ than we get by observing data from $g$ . Moreover, given a downstream application $h$ , $f \leq h$ is precisely the condition of $h$ tolerating the ambiguity of the data source $f$ .

We now just need one more definition:

Definition: A reward transformation is a map $t : R \to R$ . We say that the invariances of $f$ is a set of reward transformations $T$ if for all $R_{1}, R_{2} \in R$ , we have that $f (R_{1}) = f (R_{2})$ if and only if there is a $t \in T$ such that $R_{2} = t (R_{1})$ . We then say that $f$ determines $R$ up to $T$ .

We use reward transformations to express our results, and characterise the invariances of different data sources.

Reward Transformations

The results in this paper are expressed in terms of several classes of reward transformations. The most important of these transformations are:

Definition: A function $Φ : S \to R$ is a potential function. If $R_{2} (s, a, s^{'}) = R_{1} (s, a, s^{'}) + γ Φ (s^{'}) - Φ (s)$ , then $R_{1}$ and $R_{2}$ differ by potential shaping (with $Φ$ ). If $Φ (s) = k$ for all initial states (i.e., all states that the agent can start in), then we say that $Φ$ is k-initial.

Essentially, potential shaping assigns a “credit” to each state in the environment, and then modifies the reward of each transition based on this credit. This transformation was first introduced in this paper. If two reward functions differ by potential shaping, then they have the same optimal policies. In fact, they will even have the same ordering of policies. In Lemma B.1 of our paper, we list a whole bunch of interesting properties of this transformation. Note also that constant shift of the reward function is an instance of potential shaping (assuming that the discount rate $γ$ is less than 1).^[1]

Definition: If $E_{S^{'} \sim τ (s, a)} [R_{1} (s, a, S^{'})] = E_{S^{'} \sim τ (s, a)} [R_{2} (s, a, S^{'})]$ for all $s, a$ , then we say that $R_{1}$ and $R_{2}$ differ by S’-redistribution.

S’-redistribution simply describes when two reward functions always give the same expected (immediate) reward for each state-action pair. It should be obvious that two reward functions which differ by S’-redistribution must have the same optimal policies. In fact, if you define reward functions as having the domain $S \times A$ , instead of $S \times A \times S$ , then these transformations are unnecessary (but if you do this, then you can’t express potential shaping, and you face some other issues).

Definition: We say that $R_{1}$ is produced by an optimality-preserving transformation of $R_{2}$ if there is a function $Ψ : S \to R$ such that
$E_{S^{'} \sim τ (s, a)} [R_{2} (s, a, S^{'}) + γ Ψ (S^{'})] \leq Ψ (s)$
for all $s, a$ , with equality if and only if $a \in {a r g m a x}_{a} Q_{1}^{*} (s, a)$ , where $Q_{1}^{*}$ is the optimal Q-function for $R_{1}$ .

Optimality-preserving transformations are simply all transformations that preserve optimal policies. To see that they do this, think of $Ψ$ as a new (optimal) value function, and compare the form of these transformations to the Bellman optimality equation. This also means that potential shaping and S’-redistribution both are instances of optimality-preserving transformations.

Definition: Let $R_{1}$ and $R_{2}$ be two reward functions, and let $X \subseteq S \times A \times S$ be a set of transitions. If $R_{1} (s, a, s^{'}) = R_{2} (s, a, s^{'})$ for all $(s, a, s^{'}) \notin X$ , then we say that $R_{2}$ is produced from $R_{1}$ by a mask of $X$ .

A mask simply lets us vary the reward completely freely over some fixed set of transitions — usually, this will be applied to transitions that are impossible according to the transition function of the underlying environment. Note that a mask of all impossible transitions (i.e., transitions that have probability 0 under the transition function) also is an instance of S’-redistribution, and hence also an optimality-preserving transformation.

In the main paper, we also use a few more types of transformations, but the ones I list above will be enough for the results I will describe in this post.

General Results

Before I go into the concrete results of the paper, I first want to give you two minor results about our problem setting:

Proposition: Consider two functions $f : R \to X$ and $g : R \to Y$ . If there is a $h$ such that $h \circ f = g$ , then $f \leq g$ .

This proposition effectively means that the ambiguity of a data source only can increase with the “distance” to the true reward function. For example, if we have established the ambiguity of the optimal Q-function $Q^{*}$ , and we know that the Boltzmann-rational policy can be computed from $Q^{*}$ , then we know that the ambiguity of the Boltzmann-rational policy must be at least as high as the ambiguity of $Q^{*}$ (we provide a diagram of some of these relationships in Figure 3 in the main paper ).

Proposition: Consider two functions f $: R \to X$ and $g : R \to Y$ , and let $(f, g)$ denote combined data source formed from $f$ and $g$ , i.e., the function $(f, g) : R \to X \times Y$ such that $(f, g) (R) = (f (R), g (R))$ . If neither $f < g$ or $g < f$ , then $(f, g) < f$ and $(f, g) < g$ .

This says that the overall ambiguity of the reward function can be reduced by combining information from data sources with incomparable ambiguity. For example, if one data source leaves some information about the reward uncertain, and another data source leaves some other information uncertain, then we can reduce the overall uncertainty by combining the data.

Invariances of IRL

We can now look at the exact ambiguity of a few common data sources, starting with the kinds of policies that are used in IRL. First of all, the “maximally supportive optimal policy” is simply the optimal policy that takes all optimal actions with equal probability. That is, if there is a situation where two or more actions would lead to the same expected long-term reward, then the policy will mix between them uniformly. The ambiguity of this data source is (as expected) exactly described by optimality-preserving transformations:

Theorem: The maximally supportive optimal policy determines $R$ up to an optimality-preserving transformation.

Next, we also look at Boltzmann-rational policies and maximal causal entropy policies, which are very common for IRL algorithms (for an exact definition of maximal causal entropy policies, please see the main paper). As it turns out, these two data sources have exactly the same ambiguity, even though they may appear to be quite different:

Theorem: The Boltzmann-rational policy determines $R$ up to potential shaping and S’-redistribution.
Theorem: The maximal causal entropy policy determines $R$ up to potential shaping and S’-redistribution.

Also note that since both potential shaping and S’-redistribution are instances of optimality-preserving transformations, this means that Boltzmann-rational policies and maximal causal entropy policies both give us more information about the underlying reward than do optimal policies. This may be a bit surprising, since both these types of policies are a kind of noisy optimality (whereas optimal policies are noiseless). Shouldn’t the noise mean that we get less information? The reason for why this isn’t the case is that the noise is informative — it depends on the underlying reward function. The ambiguity of an epsilon-greedy policy, by contrast, is exactly the same as that of an optimal policy (since both of these policies clearly can be computed from each other).

Invariances of Trajectory Comparisons

We next look at trajectory comparisons — these data sources are closer to what is used in RLHF, and similar techniques. We first look at the case where the training data simply consists in pairs $⟨ ξ, r ⟩$ , where $ξ$ is a trajectory and $r$ is the reward accrued over $ξ$ . In this case, we are in the limit of infinite data able to determine the function $G : (S \times A)^{ω} \to R$ that returns the total reward of each trajectory. Therefore, we must determine the invariance partition of this object:

Theorem: $G$ determines $R$ up to 0-initial potential shaping and a mask of all unreachable transitions.

Note that 0-initial potential shaping and a mask of all unreachable transitions is a subset of potential shaping and S’-redistribution. This means that this data source is less ambiguous than each of the policies we described above (and thus that the information contained in this data source is sufficient for finding a reward $R_{H}$ with the same optimal policies as the underlying true reward $R^{*}$ , for example).

We next look at RLHF, where the data consists in noisy comparisons between pairs of trajectories, such that the labeller is more likely to select trajectory $ξ_{1}$ over $ξ_{2}$ , the higher the reward of $ξ_{1}$ is compared to the reward of $ξ_{2}$ (for an exact definition, see the main paper).

Theorem: RLHF determines $R$ up to k-initial potential shaping and a mask of all unreachable transitions.

It is interesting that RLHF has almost exactly the same ambiguity as $G$ , even though the former only provides comparisons between trajectories, whereas the latter provides exact numerical rewards for each trajectory. The reason for this is again that RLHF contains noise, and (crucially) that the noise depends on the underlying reward function (meaning that the noise is informative).

We also consider arbitrary lotteries over trajectories. That is, we assume that the training data corresponds to pairs of distributions over trajectories $D_{1}$ , $D_{2}$ together with a label that tells us which distribution yields the higher expected reward. Note that this includes the case where $D_{1}$ and $D_{2}$ come from policies, but it also includes distributions that cannot be generated by policies. This is thus a very general data source.

Theorem: Preferences over trajectory lotteries determine $R$ up to k-initial potential shaping, positive linear scaling, and a mask of all unreachable transitions.

This means that if two reward functions differ by k-initial potential shaping, positive linear scaling, and a mask of unreachable transitions, then they are equivalent in a very strong sense. In particular, they induce the same preferences between all pairs of policies, and thus have the same optimal policies, etc.

In the main paper, we also consider a few more forms of trajectory comparisons, with different kinds of restrictions (but since these are more niche and complicated to describe, I will leave them out of this post).

Extra Results

In addition to the above, I also want to include two very relevant results which are not found in Invariance in Policy Optimisation and Partial Identifiability in Reward Learning, but which I later managed to prove for other papers. The first of these describes the ambiguity of the data source which consists of noiseless comparisons between trajectories, but where the data distribution gives support to all possible trajectories (instead of just trajectories that are possible according to the transition function of a given MDP).

Theorem: Let $R_{1}$ and $R_{2}$ be two reward functions, and let the discount $γ \geq 0.5$ . If for all trajectories $ξ_{1}, ξ_{2} \in (S \times A)^{ω}$ we have that $G_{1} (ξ_{1}) \leq G_{1} (ξ_{2})$ if and only if $G_{2} (ξ_{1}) \leq G_{2} (ξ_{2})$ , then $R_{1}$ and $R_{2}$ differ by k-initial potential shaping and positive linear scaling.

The proof of this theorem is given in this paper (Theorem 2). This means that noiseless comparisons actually are able to provide almost the same information as the function $G$ , even though the latter gives exact numerical rewards to each trajectory, and the latter merely compares trajectories, provided that the comparisons are given for all trajectories. Note that this isn’t at all true in the case of utility functions defined over sets with no further structure. For example, given a set $C = {a, b, c}$ , it is possible for two utility functions $U_{1}$ and $U_{2}$ to satisfy that $U_{1} (a) > U_{1} (b) > U_{1} (c)$ and $U_{2} (a) > U_{2} (b) > U_{2} (c)$ , even if $U_{1}$ and U_2 provide different preferences between lotteries over $C$ , etc, and so one might expect that an analogous result should hold for reward functions defined over MDPs. The reason for why this isn’t the case is that trajectories have a lot of additional internal structure, and that the reward which is assigned to a trajectory depends on this structure. For details, see the proof.

(Note: the requirement that the discount $γ \geq 0.5$ is necessary for the proof strategy I used, but may not be necessary in general. However, this is a very mild requirement, since $γ$ is almost always higher than $0.5$ in practice.)

We also have the following result:

Theorem: Two reward functions have the same policy order if and only if they differ by potential shaping, S’-redistribution, and positive linear scaling.

For a proof, see this paper (Theorem 6). By “policy order”, I mean the ordering over policies according to which $π_{1} > π_{2}$ if the expected cumulative discounted reward of $π_{1}$ is higher than that of $π_{2}$ . Note that this means that almost all of the data sources we have considered above (in fact, all except optimal policies) are sufficiently unambiguous to identify the entire policy ordering of the underlying ground truth reward function. In other words, for practical purposes, the ambiguity is really not very problematic!

Conclusion

There are two interesting things which stand out to me about the results above. First of all, the ambiguity doesn’t seem to be very large for any of the data sources we consider — in fact, most of them give us enough information to infer the correct policy order (and hence also the correct optimal policies) of the underlying true reward function. This shows that the partial identifiability of reward learning isn’t necessarily a problem, even when there is an infinite amount of reward functions that can’t be distinguished (even in the limit of infinite data). The second thing which stands out to me is that many of these reward learning data sources have very similar (or even exactly the same) ambiguity, even when they appear to be quite different (for example, RLHF and IRL with Boltzmann-rational policies both give you essentially the same information). Roughly speaking, all of them allow you to identify the underlying reward function up to affine transformations of $G$ (with some caveats).

The biggest limitation of this paper is definitely that it assumes that there is no misspecification. That is, when we (for example) analyse IRL algorithms based on Boltzmann-rational policies, we assume that the observed policy actually is Boltzmann-rational. In reality, this is extremely unrealistic. Humans are not Boltzmann-rational, so when you apply such an algorithm to data gathered from humans, you are applying it to a form of data that violates the underlying assumptions of the algorithm. In that case, the above results aren’t valid. In the next post in this sequence, I will show how to get around this limitation.

If you have any questions, then please ask them in the comments!

^{^}
Specifically, if we want to incur a constant shift of $k$ , so that $R_{2} (s, a, s^{'}) = R_{1} (s, a, s^{'}) + k$ for all transitions, then we simply set $Φ (s) = k / (γ - 1)$ .

16