Bounding Goodhart's Law

by eric_langlois · 11th Jul 2018


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Goodhart's law seems to suggest that errors in utility or reward function specification are necessarily bad, in the sense that an optimal policy for the incorrect reward function would result in low return according to the true reward. But how strong is this effect?

Suppose the reward function were only slightly wrong. Can the resulting policy be arbitrarily bad according to the true reward, or is it only slightly worse? It turns out the answer is "only slightly worse" (for the appropriate definition of "slightly wrong").

Definitions

Consider a Markov Decision Process (MDP) $(S, A, P, R)$ where

  • $S$ is the set of states,
  • $A$ is the set of actions,
  • $P(s' \mid s, a)$ are the conditional transition probabilities, and
  • $R : S \to \mathbb{R}$ is the reward function. (Note: "reward" is standard terminology for MDPs but it's fine to think of this as "utility".)

A policy $\pi$ is a mapping from states to distributions over actions, with $\pi(a \mid s)$ the probability of taking action $a$ in state $s$.

Any given policy $\pi$ induces a distribution $d^\pi$ over states in this MDP. If we are concerned about average reward we can take $d^\pi$ to be the stationary distribution or, if the environment is episodic, we can take $d^\pi$ to be the distribution of states visited during the episode. The exact definition is not particularly important for us.

Define the return of policy $\pi$ according to reward function $R$ to be

$$J_R(\pi) = \mathbb{E}_{s \sim d^\pi}\left[R(s)\right].$$
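
To make these definitions concrete, here is a minimal sketch (my own illustration, not from the post; the MDP and all names like `stationary_distribution` are made up for this example) of a random toy MDP, a uniform policy, the induced stationary distribution $d^\pi$, and the return $J_R(\pi)$:

```python
# Toy illustration of d^pi (stationary-distribution version) and J_R(pi).
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=n_states)                                    # R[s]
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy pi(a | s)

def stationary_distribution(P, pi):
    """d^pi: stationary state distribution of the Markov chain induced by pi."""
    T = np.einsum("sa,sax->sx", pi, P)  # T[s, s'] = sum_a pi(a|s) P(s'|s, a)
    evals, evecs = np.linalg.eig(T.T)   # left eigenvector of T for eigenvalue 1
    d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    return d / d.sum()

def J(R, pi):
    """Return of policy pi under reward R: E_{s ~ d^pi}[R(s)]."""
    return float(stationary_distribution(P, pi) @ R)

print(stationary_distribution(P, pi), J(R, pi))
```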

Goodhart Regret

Suppose we have an approximate reward signal $\hat{R}$ and we use it to specify a policy $\hat{\pi}$. How bad is $\hat{\pi}$ according to the true reward $R$?

More specifically, what is the regret of using $\hat{\pi}$ compared to the optimal policy $\pi^* := \operatorname{arg\,max}_\pi J_R(\pi)$? Formally,

$$\operatorname{Reg}(\hat{\pi}) := J_R(\pi^*) - J_R(\hat{\pi}).$$

We can expand this as

$$\operatorname{Reg}(\hat{\pi}) = \left[J_R(\pi^*) - J_{\hat{R}}(\pi^*)\right] + \left[J_{\hat{R}}(\pi^*) - J_{\hat{R}}(\hat{\pi})\right] + \left[J_{\hat{R}}(\hat{\pi}) - J_R(\hat{\pi})\right].$$

Let $\epsilon = \epsilon_1 + \epsilon_2 + \epsilon_3$. Then $\operatorname{Reg}(\hat{\pi}) \le \epsilon$ if the following conditions are satisfied by $\hat{R}$ and $\hat{\pi}$:

1. $J_R(\pi^*) - J_{\hat{R}}(\pi^*) \le \epsilon_1$

2. $J_{\hat{R}}(\pi^*) - J_{\hat{R}}(\hat{\pi}) \le \epsilon_2$

3. $J_{\hat{R}}(\hat{\pi}) - J_R(\hat{\pi}) \le \epsilon_3$

Condition 2 says that $\hat{\pi}$ is not much worse than $\pi^*$ when measured against $\hat{R}$. That is what we expect if we designed $\hat{\pi}$ to be specifically good at $\hat{R}$, so condition 2 is just a formalization of the notion that $\hat{\pi}$ is tailored to $\hat{R}$.

Conditions 1 and 3 compare a fixed policy against two different reward functions. In general, for policy $\pi$ and reward functions $R_A$ and $R_B$,

$$J_{R_A}(\pi) - J_{R_B}(\pi) = \mathbb{E}_{s \sim d^\pi}\left[R_A(s) - R_B(s)\right].$$
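
To see the decomposition in action, here is a small numerical sketch (again my own; the MDP, the perturbation, and all function names are illustrative). It uses the episodic choice of $d^\pi$, brute-forces the best deterministic policy for the true reward $R$ and for a perturbed $\hat{R}$, and checks that the three bracketed terms sum exactly to the regret, with the condition-2 term non-positive because $\hat{\pi}$ maximizes $J_{\hat{R}}$ over the same policy class:

```python
# Numerical sketch of the regret decomposition on a random toy MDP.
# d^pi is taken to be the average state distribution over an episode.
import itertools
import numpy as np

n_states, n_actions, horizon = 4, 2, 50
rng = np.random.default_rng(1)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.uniform(size=n_states)                                    # true reward
R_hat = R + rng.normal(scale=0.1, size=n_states)                  # approximate reward
start = np.full(n_states, 1.0 / n_states)                         # initial state distribution

def visit_distribution(policy):
    """d^pi for a deterministic policy (array giving one action per state)."""
    T = P[np.arange(n_states), policy]  # T[s, s'] = P(s' | s, policy[s])
    d, total = start.copy(), np.zeros(n_states)
    for _ in range(horizon):
        total += d
        d = d @ T
    return total / horizon

def J(reward, policy):
    """Return of `policy` under `reward`: E_{s ~ d^pi}[reward(s)]."""
    return float(visit_distribution(policy) @ reward)

def best_policy(reward):
    """Brute-force the best deterministic policy for `reward`."""
    candidates = (np.array(c) for c in itertools.product(range(n_actions), repeat=n_states))
    return max(candidates, key=lambda pol: J(reward, pol))

pi_star, pi_hat = best_policy(R), best_policy(R_hat)

regret = J(R, pi_star) - J(R, pi_hat)
terms = [
    J(R, pi_star) - J(R_hat, pi_star),     # term 1: fixed pi*, two rewards
    J(R_hat, pi_star) - J(R_hat, pi_hat),  # term 2: <= 0 since pi_hat is R_hat-optimal
    J(R_hat, pi_hat) - J(R, pi_hat),       # term 3: fixed pi_hat, two rewards
]
assert np.isclose(regret, sum(terms))
assert terms[1] <= 1e-12
print(f"regret = {regret:.4f}, terms = {[round(t, 4) for t in terms]}")
```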

Result: Uniformly Bounded Error

Assume that we have a reward approximation $\hat{R}$ with uniformly bounded error. That is, $\sup_s \left|R(s) - \hat{R}(s)\right| \le \frac{\epsilon}{2}$. Take $\hat{\pi} := \operatorname{arg\,max}_\pi J_{\hat{R}}(\pi)$.

Then $\operatorname{Reg}(\hat{\pi}) \le \epsilon$. (Condition 2 has bound $0$ in this case.)
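
A short proof sketch (my reconstruction, combining the expansion with the identity above):

$$\begin{aligned}
J_R(\pi^*) - J_{\hat{R}}(\pi^*) &= \mathbb{E}_{s \sim d^{\pi^*}}\!\left[R(s) - \hat{R}(s)\right] \le \sup_s \left|R(s) - \hat{R}(s)\right| \le \tfrac{\epsilon}{2}, \\
J_{\hat{R}}(\pi^*) - J_{\hat{R}}(\hat{\pi}) &\le 0 \qquad \text{(since $\hat{\pi}$ maximizes $J_{\hat{R}}$)}, \\
J_{\hat{R}}(\hat{\pi}) - J_R(\hat{\pi}) &= \mathbb{E}_{s \sim d^{\hat{\pi}}}\!\left[\hat{R}(s) - R(s)\right] \le \tfrac{\epsilon}{2},
\end{aligned}$$

so the three terms of the expansion sum to at most $\tfrac{\epsilon}{2} + 0 + \tfrac{\epsilon}{2} = \epsilon$.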

Result: One-sided Error Bounds

A uniform bound on the error is a stronger condition than we really need. The conditions on $\hat{R}$ can be re-written:

1. $\mathbb{E}_{s \sim d^{\pi^*}}\left[R(s) - \hat{R}(s)\right] \le \epsilon_1$; $\hat{R}$ does not substantially underestimate the reward in the regions of state-space that are frequently visited by $\pi^*$.

3. $\mathbb{E}_{s \sim d^{\hat{\pi}}}\left[\hat{R}(s) - R(s)\right] \le \epsilon_3$; $\hat{R}$ does not substantially overestimate the reward in the regions of state-space that are frequently visited by $\hat{\pi}$.
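
As a hypothetical illustration (mine, not from the original post) of why one-sided bounds suffice: suppose $\hat{R}$ is exact except that it severely underestimates the reward on some set of states $B$ that $\pi^*$ never visits,

$$\hat{R}(s) = R(s) - c \cdot \mathbb{1}[s \in B], \qquad c \gg 0.$$

The worst-case error $\sup_s |R(s) - \hat{R}(s)| = c$ can be arbitrarily large, yet condition 1 holds with $\epsilon_1 = 0$ (no underestimation where $\pi^*$ goes) and condition 3 holds with $\epsilon_3 = 0$ ($\hat{R}$ never overestimates), so a policy that is optimal for $\hat{R}$ is also optimal for the true reward $R$.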