Catastrophic Goodhart in RL with KL penalty

The manner in which these pathological policies $\pi$ achieve high $E[U]$ is also concerning: most of the time they match the reference policy $\pi_0$, but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy $\pi$, it could be impossible to tell whether $\pi$ is Goodharting or identical to the base policy.

I'm confused; to learn this policy $\pi$, some of the extremely high reward trajectories would likely have to be taken during RL training, so we could see them, right? It might still be a problem if they're very rare (e.g. if we can only manually look at a small fraction of trajectories). But if they have such high reward that they drastically affect the learned policy despite being so rare, it should be trivial to catch them as outliers based on that.

One way we wouldn't see the trajectories is if the model becomes aligned with "maximize whatever my reward signal is," figures out the reward function, and then executes these high-reward trajectories zero-shot. (This might never happen in training if they're too rare to occur even once during training under the optimal policy.) But that's a much more specific and speculative story.

I haven't thought much about how this affects the overall takeaways but I'd guess that similar things apply to heavy-tailed rewards in general (i.e. if they're rare but big enough to still have an important effect, we can probably catch them pretty easily---though how much that helps will of course depend on your threat model for what these errors are).

This is a fair criticism. I changed "impossible" to "difficult".

My main concern is with future forms of RL that are some combination of better at optimization (thus making the model more inner aligned even in situations it never directly sees in training) and possibly opaque to humans such that we cannot just observe outliers in the reward distribution. It is not difficult to imagine that some future kind of internal reinforcement could have these properties; maybe the agent simulates various situations it could be in without stringing them together into a trajectory or something. This seems worth worrying about even though I do not have a particular sense that the field is going in this direction.

Does the notation get flipped at some point? In the abstract you say

prior policy

and

there are arbitrarily well-performing policies

But then later you say

This strongly penalizes taking actions the base policy never takes

Which makes it sound like they're switched.

I also notice that you call it "prior policy", "base policy" and "reference policy" at different times; these all make sense but it'd be a bit nicer if there was one phrase used consistently.

The third one was a typo which I just fixed. I have also changed it to use "base policy" everywhere to be consistent, although this may change depending on what terminology is most common in an ML context, which I'm not sure of.

I have a question about this post, and it has to do with the case where both utility and error are heavy tailed:

Where does the expected value converge to if both utility and errors are heavy tailed? Is it 0, infinity, some other number, or does it not converge to any number at all?

It could be anything because KL divergence basically does not restrict the expected value of anything heavy-tailed. You could get finite utility and error, or the reverse, or infinity of both, or neither converging, or even infinite utility and negative infinity error—any of these with arbitrarily low KL divergence.

To draw any conclusions, you need to assume some joint distribution between the error and utility, and use some model of selection that is not optimal policies under a KL divergence penalty or limit. If they are independent and you think of optimization as conditioning on a minimum utility threshold, we proved last year that you get 0 of whichever has lighter tails and $\infty$ of whichever has heavier tails, unless the tails are *very* similar. I think the same should hold if you model optimization as best-of-n selection. But the independence assumption is required and pretty unrealistic, and you can't weaken it in any obvious way.

Realistically I expect that error will be heavy-tailed and heavier-tailed than utility by default so error goes to infinity. But error will not be independent of utility, so the expected utility depends mostly on how good extremely high error outcomes are. The prospect of AIs creating some random outcome that we overestimated the utility of by 10 trillion points does not seem especially good, so I think we should not be training AIs to maximize this kind of static heavy-tailed reward function.

My expectation is that error and utility are both extremely heavy tailed, and arguably in the same order of magnitude for heavy tails.

But thanks for answering, the real answer is we can predict effectively nothing without independence, and thus we can justify virtually every outcome of real-life Goodhart.

Maybe it's catastrophic, maybe it doesn't matter, or maybe there's anti-goodhart, but I don't see a way to predict what will reasonably happen.

Also, why do you think that error is heavier tailed than utility?

Also, why do you think that error is heavier tailed than utility?

Goodhart's Law is really common in the real world, and most things only work because we can observe our metrics, see when they stop correlating with what we care about, and iteratively improve them. Reward hacking is also prevalent in RL, and hacked policies often reach very high reward values.

If the reward model is as smart as the policy and is continually updated with data, maybe we're in a different regime where errors are smaller than utility.

- Regularize by a function other than KL divergence. For heavy-tailed error distributions, KL divergence doesn’t work, but capping the maximum odds ratio for any action (similar to quantilizers) still results in positive utility.

A recent paper from UC Berkeley named Preventing Reward Hacking with Occupancy Measure Regularization proposes replacing KL divergence regularization with occupancy measure (OM) regularization. OM regularization involves regularizing based on the state or state-action distribution rather than the action distribution:

"Our insight is that when reward hacking, the agent visits drastically different states from those reached by the safe policy, causing large deviations in state occupancy measure (OM). Thus, we propose regularizing based on the OM divergence between policies instead of AD [action distribution] divergence to prevent reward hacking"

The idea is that regularizing to minimize changes in the action distribution isn't always safe because small changes in the action distribution can cause large changes in the states visited by the agent:

Suppose we have access to a safe policy that drives slowly and avoids falling off the cliff. However, the car is optimizing a proxy reward function that prioritizes quickly reaching the destination, but not necessarily staying on the road. If we try to regularize the car’s action distributions to the safe policy, we will need to apply heavy regularization, since only slightly increasing the probability of some unsafe action (e.g., making a sharp right turn) can lead to disaster.

...

Our proposal follows naturally from this observation: to avoid reward hacking, regularize based on divergence from the safe policy’s occupancy measure, rather than action distribution. A policy’s occupancy measure (OM) is the distribution of states or state-action pairs seen by a policy when it interacts with its environment.

TLDR: In the last two posts, we showed that optimizing for a proxy can fail to increase true utility, but only when the error is heavy-tailed. We now show that this also happens in RLHF with a KL penalty.

This post builds on our earlier result with a more realistic setting and assumptions:

## Abstract

When applying KL regularization, the trained model is regularized towards some base policy $\pi_0$. One would hope that a KL penalty can produce good outcomes even in the case of reward misspecification; that is, if the reward $U$ is the sum of true utility $V$ and an error term $X$, we would hope that optimal policies under a KL penalty achieve high $V$ even if the magnitude of $X$ is large. We show that this is not always the case: when $X$ is heavy-tailed, there are arbitrarily well-performing policies $\pi$ with $E_\pi[V] \approx E_{\pi_0}[V]$; that is, policies that get no higher true utility than the base policy. However, when error is light-tailed and independent of $V$, the optimal policy under a KL penalty results in $V > 0$, and $V$ can be made arbitrarily large. Thus, the tails of the error distribution are crucial in determining how much utility will result from optimization towards an imperfect proxy.

## Intuitive explanation of catastrophic Goodhart with a KL penalty

Recall that KL divergence between two distributions P and Q is defined as

$$D_{KL}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)$$

If we have two policies $\pi, \pi_0$, we abuse notation to define $D_{KL}(\pi \| \pi_0)$ as the KL divergence between the distributions of actions taken on the states in trajectories reached by $\pi$. That is, if $Tr(\pi)$ is the distribution of trajectories taken by $\pi$, we penalize

$$D_{KL}(\pi \| \pi_0) \triangleq E_{s \in T,\, T \sim Tr(\pi)}\left[D_{KL}(\pi(s) \| \pi_0(s))\right]$$

This strongly penalizes $\pi$ taking actions the base policy never takes, but does not force the policy to take all actions the base policy takes.
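The asymmetry described above can be checked on a single state with a small sketch. The three-action distributions below are hypothetical, chosen only for illustration, and `kl_divergence` is a helper name introduced here rather than anything from the post.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for discrete distributions given as lists of probabilities."""
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue  # a term with P(x) = 0 contributes nothing
        if q_x == 0.0:
            return math.inf  # P puts mass where Q has none
        total += p_x * math.log(p_x / q_x)
    return total

# Hypothetical action distributions in one state (assumed for illustration).
base = [0.5, 0.5, 0.0]          # base policy never takes action 2
drops_action = [1.0, 0.0, 0.0]  # new policy stops taking action 1
new_action = [0.5, 0.4, 0.1]    # new policy sometimes takes action 2

kl_drop = kl_divergence(drops_action, base)  # finite
kl_new = kl_divergence(new_action, base)     # infinite

print(kl_drop, kl_new)
```

Dropping an action the base policy takes costs only a finite penalty, while putting any probability on an action the base policy never takes makes the penalty infinite.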

If our reward model gives reward U, then the optimal policy for RLHF with a KL penalty is:

$$\arg\max_\pi \; E[U(\pi)] - \beta D_{KL}(\pi \| \pi_0).$$

Suppose we have an RL environment with reward $U = X + V$ where $X$ is an error term that is heavy-tailed under $\pi_0$, and $V$ is the “true utility” assumed to be light-tailed under $\pi_0$. Without loss of generality, we assume that $E[U(\pi_0)] = 0$. If we optimize for $E[U(\pi)] - \beta D_{KL}(\pi \| \pi_0)$, there is no maximum because this expression is unbounded. In fact, it is possible to get $E[U(\pi)] > M$ and $D_{KL}(\pi \| \pi_0) < \epsilon$ for any $M, \epsilon$. That is, we get arbitrarily large proxy reward $U$ and arbitrarily small KL penalty.

For such policies $\pi$, it is necessarily the case that $\lim_{\epsilon \to 0} E[V(\pi)] = 0$; that is, for policies with low KL penalty, utility goes to zero. Like in the previous post, we call this catastrophic Goodhart because the utility produced by our optimized policy is as bad as if we hadn’t optimized at all. This is a corollary of a property about distributions (Theorems 1 and 3 below) which we apply to the case of RLHF with unbounded rewards (Theorem 2).

The manner in which these pathological policies $\pi$ achieve high $E[U]$ is also concerning: most of the time they match the base policy $\pi_0$, but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy $\pi$, it could be difficult to tell whether $\pi$ is Goodharting or identical to the base policy.

## Results

Full proofs are in the appendix post.

## X heavy tailed, V light tailed: EV→0

We'll start by demonstrating the key fact about distributions that makes this proof work: in a heavy-tailed distribution, you can have arbitrarily high mean with arbitrarily low KL divergence.

Theorem 1: Given any heavy-tailed reference distribution $Q$ over $\mathbb{R}$ with mean $\mu_Q$, and any $M, \epsilon > 0$, there is a distribution $P$ with mean $\mu_P > M$ and $D_{KL}(P \| Q) < \epsilon$.

Proof sketch (see appendix for full proof): WLOG take $\mu_Q = 0$. If we define $P_t$ by upweighting the tail of $Q$ so that $\Pr_{P_t}(X > t) = c/t$ for some $c, t$, then the mean of $P_t$ will be approximately at least $c$. As $t \to \infty$, the KL divergence $D_{KL}(P_t \| Q)$ will shrink to zero.
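A numerical sketch of the construction in the proof sketch, under assumptions that are not from the post: the reference distribution is taken to be a Pareto distribution with tail index 2 on $x \ge 1$ (so its mean is 2 rather than 0), and `upweighted_tail` is a hypothetical helper. The reweighted distribution keeps the shape of $Q$ on each side of $t$ and only moves probability $c/t$ onto the event $X > t$, so the mean and the KL divergence have closed forms.

```python
import math

ALPHA = 2.0  # assumed Pareto tail index; density ALPHA * x**(-ALPHA - 1) on x >= 1

def upweighted_tail(c, t):
    """Mean and KL divergence of P_t relative to Q = Pareto(ALPHA).

    P_t rescales Q so that the tail event {X > t} has probability c / t.
    """
    p = c / t            # tail mass under P_t
    q = t ** (-ALPHA)    # tail mass under Q
    mean_tail = ALPHA * t / (ALPHA - 1)  # E_Q[X | X > t]
    mean_body = (ALPHA / (ALPHA - 1)) * (1 - t ** (1 - ALPHA)) / (1 - q)  # E_Q[X | X <= t]
    mean = p * mean_tail + (1 - p) * mean_body
    # Only two events are reweighted, so the KL reduces to a two-outcome KL.
    kl = p * math.log(p / q) + (1 - p) * (math.log1p(-p) - math.log1p(-q))
    return mean, kl

thresholds = (1e3, 1e6, 1e9)
results = [upweighted_tail(c=50.0, t=t) for t in thresholds]
for t, (mean, kl) in zip(thresholds, results):
    print(f"t={t:.0e}  mean={mean:.1f}  KL={kl:.2e}")
```

With $c$ held fixed, the mean stays above $2c$ while the KL divergence falls toward zero as $t$ grows, which is the behaviour the theorem describes. Raising $c$ raises the mean without changing this limit.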

The intuition is that in a heavy-tailed distribution, events with extremely high x are not very rare, so you don’t pay much of a KL penalty to upweight them so they happen about 1/x of the time. We hope the animation below intuitively explains this fact:

We now adapt our result to the case where our policy is a language model and we are training it using RLHF. We are now applying a KL penalty over policies, which are a different distribution from the returns U, but a similar result holds:

Theorem 2: Let $W = (S, A, P, R)$ be a deterministic-transition MDP with Markovian returns. Given $W$ we define the function that takes policies to trajectories $Tr: (S \to \Delta A) \to \Delta(S \times A)^*$, and the average return function $g: (S \times A)^* \to \mathbb{R}$ which induces a function $G: \Delta(S \times A)^* \to \Delta\mathbb{R}$. Let $\pi_0: S \to \Delta A$ be some base policy. If $G \circ Tr(\pi_0)$ is heavy-tailed with finite mean $\mu_Q$, then for any $M, \epsilon > 0$, there is a policy $\pi$ with mean return $E[U \mid U \sim G \circ Tr(\pi)] > M$ and $E_{s \in T,\, T \sim Tr(\pi)}[D_{KL}(\pi(s) \| \pi_0(s))] < \epsilon$.

In theorems 1 and 2 we do not require that $V$ is light-tailed, but if we make this assumption, we can then prove that a small KL divergence implies $V$ is small:

Theorem 3: If $V$ is light-tailed, $E_Q[V]$ is finite, and $d = D_{KL}(P \| Q)$ is bounded, then $E_P[V]$ is bounded, and $E_P[V] \to 0$ as $d \to 0$.

Together, theorems 2 and 3 imply the headline result.
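For contrast with the heavy-tailed case, here is a sketch of how a light tail limits what a small KL budget can buy. It assumes, purely for illustration, that $V$ is standard normal under $Q$. Shifting the mean to $m$ costs $m^2/2$ in KL, and the Donsker-Varadhan variational inequality gives a matching upper bound, so no distribution within KL distance $d$ of $Q$ can have mean above $\sqrt{2d}$. The function names are hypothetical.

```python
import math

def kl_shifted_gaussian(m):
    """D_KL(N(m, 1) || N(0, 1)), the cost of shifting the mean by m."""
    return 0.5 * m * m

def dv_upper_bound(d, lam):
    """Upper bound on E_P[V] for any P with D_KL(P || Q) <= d, valid for every lam > 0.

    Uses the Donsker-Varadhan inequality with
    log E_Q[exp(lam * V)] = lam**2 / 2 for a standard normal V (assumed).
    """
    return (d + 0.5 * lam * lam) / lam

for d in (1.0, 1e-2, 1e-4):
    m = math.sqrt(2.0 * d)             # mean shift that spends exactly d of KL
    best_bound = dv_upper_bound(d, m)  # the bound is tightest at lam = sqrt(2 d)
    print(f"d={d:.0e}  achievable mean={m:.4f}  upper bound={best_bound:.4f}")
```

In this Gaussian sketch the achievable mean and the upper bound coincide and both shrink to zero with $d$, unlike the heavy-tailed construction, where the mean can stay large as the KL divergence vanishes.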

## X,V have light tails and are independent: EV→∞

Our proof for the hard-threshold case can be extended to show that when $X$ and $V$ are independent and both have light tails, the optimum of $E[U(\pi)] - \beta D_{KL}(\pi \| \pi_0)$ has $E[V(\pi)] > 0$. It is also true that utility under the optimal policy goes to $\infty$ as the KL penalty decreases:

Theorem 4: If $U = X + V$ with $X$ and $V$ both light-tailed, and the distribution of $U$ is continuous, and $\pi^*(\beta) \triangleq \arg\max_\pi E[U(\pi)] - \beta D_{KL}(\pi \| \pi_0)$, then $\lim_{\beta \to 0^+} E[V(\pi^*(\beta))] = \infty$.

## How likely is heavy-tailed error?

Current open-source reward models for RLHF probably don’t have heavy-tailed error; we explored the upper tails of the reward distributions of a ~0.5B reward model and a ~7B reward model, and the maximum values were less than 100, which is consistent with light tails. (We will show evidence for this in a future post).
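The post does not say how the upper tails were examined. One common diagnostic, swapped in here purely as an illustration and not as the authors' method, is the Hill estimator of the tail index. The sketch below runs it on synthetic samples only (no reward-model data), and it assumes positive values, so real rewards would need shifting first.

```python
import math
import random

def hill_estimator(samples, k):
    """Hill estimate of the tail index from the k largest positive samples."""
    top = sorted(samples, reverse=True)[: k + 1]
    threshold = top[k]  # the (k+1)-th largest value
    return sum(math.log(x / threshold) for x in top[:k]) / k

rng = random.Random(0)
n, k = 100_000, 1_000
pareto = [rng.paretovariate(2.0) for _ in range(n)]  # heavy tail, true index 1/2
expo = [rng.expovariate(1.0) for _ in range(n)]      # light tail

xi_pareto = hill_estimator(pareto, k)
xi_expo = hill_estimator(expo, k)
print(f"Pareto: {xi_pareto:.3f}  exponential: {xi_expo:.3f}")
```

A heavy-tailed sample gives an estimate that stabilises near a positive constant, while a light-tailed sample gives a smaller value that keeps shrinking as the threshold moves further into the tail. The choice of `k` is a judgment call and the estimate is sensitive to it.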

But in open-ended environments, especially relating to real-world outcomes, reward is much more likely to be heavy-tailed, and so catastrophic Goodhart may become more likely.

## Limitations

## Goodhart is not inevitable

Catastrophic Goodhart is not a unique optimal policy, just one family of high-performing policies. When optimizing $E[U(\pi)] - \beta D_{KL}(\pi \| \pi_0)$, the outcome depends on RL training dynamics; it could be that $D_{KL} \to 0$, causing catastrophic Goodhart, but more likely both terms will go to infinity, potentially allowing $V \to \infty$.

Even so, catastrophic Goodhart is likely to occur in many scenarios where KL regularization is naively employed in an attempt to avoid Goodhart’s Law:

## Goodhart seems preventable

There are at least two ways to prevent this phenomenon, even if we don’t know how to make an unbounded reward function with light-tailed error:

## Goodhart is not a treacherous turn

Although the kinds of rare failures described above are superficially similar to a treacherous turn as described in Risks from Learned Optimization, we think they are very different. An AI mesa-optimizer randomly performing a coup is inner-misaligned, situationally aware, and motivated by maximizing the probability of a successful coup. The catastrophic Goodhart phenomenon has nothing to do with inner misalignment or situational awareness, and the probabilities of an extreme action are unrelated to the optimum rate for executing a successful coup.

## Conclusion

In the next post, we will empirically demonstrate that some current reward models have light-tailed reward. After this, we may explore the conditions under which catastrophic Goodhart holds in a stochastic environment, and do empirical tests of this phenomenon in practice.

## Related work