Catastrophic Goodhart in RL with KL penalty

Adrià Garriga-alonso

The manner in which these pathological policies achieve high $E[U]$ is also concerning: most of the time they match the reference policy $π_{0}$, but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy $π$, it could be impossible to tell whether $π$ is Goodharting or identical to the base policy.

I'm confused; to learn this policy $π$, some of the extremely high reward trajectories would likely have to be taken during RL training, so we could see them, right? It might still be a problem if they're very rare (e.g. if we can only manually look at a small fraction of trajectories). But if they have such high reward that they drastically affect the learned policy despite being so rare, it should be trivial to catch them as outliers based on that.

One way we wouldn't see the trajectories is if the model becomes aligned with "maximize whatever my reward signal is," figures out the reward function, and then executes these high-reward trajectories zero-shot. (This might never happen in training if they're too rare to occur even once during training under the optimal policy.) But that's a much more specific and speculative story.

I haven't thought much about how this affects the overall takeaways, but I'd guess that similar things apply to heavy-tailed rewards in general (i.e. if they're rare but big enough to still have an important effect, we can probably catch them pretty easily, though how much that helps will of course depend on your threat model for what these errors $X$ are).
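As a rough illustration of the "catch them as outliers" idea above, here is a minimal sketch that flags unusually high rewards in a training log using a robust z-score (median and median absolute deviation). The function name `flag_reward_outliers`, the threshold of 10, and the toy reward list are all assumptions made for illustration, not anything from the post. With genuinely heavy-tailed rewards the line between "outlier" and "ordinary tail sample" is fuzzy, so the threshold would need tuning.

```python
import statistics


def flag_reward_outliers(rewards, threshold=10.0):
    """Return indices of rewards whose robust z-score exceeds `threshold`.

    Uses median and median absolute deviation (MAD) rather than mean and
    standard deviation, because a few huge rewards would inflate the latter.
    Only unusually HIGH rewards are flagged (one-sided check).
    """
    med = statistics.median(rewards)
    mad = statistics.median([abs(r - med) for r in rewards])
    if mad == 0:
        # Degenerate case: more than half the rewards are identical.
        return [i for i, r in enumerate(rewards) if r > med]
    # 0.6745 rescales MAD so the score is comparable to a normal z-score.
    return [
        i for i, r in enumerate(rewards)
        if 0.6745 * (r - med) / mad > threshold
    ]


# Hypothetical per-trajectory rewards with one extreme value at index 7.
example_rewards = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.3, 500.0, 1.05]
flagged = flag_reward_outliers(example_rewards)
```

This only helps if the extreme trajectories actually show up in the logged rewards, which is the case the comment above is about.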

Thomas Kwa (2y)

This is a fair criticism. I changed "impossible" to "difficult".

My main concern is with future forms of RL that are some combination of better at optimization (thus making the model more inner aligned even in situations it never directly sees in training) and possibly opaque to humans such that we cannot just observe outliers in the reward distribution. It is not difficult to imagine that some future kind of internal reinforcement could have these properties; maybe the agent simulates various situations it could be in without stringing them together into a trajectory or something. This seems worth worrying about even though I do not have a particular sense that the field is going in this direction.

Alex_Altair (2y)

Does the notation get flipped at some point? In the abstract you say

"prior policy"

and

"there are arbitrarily well-performing policies $π$"

But then later you say

"This strongly penalizes $π_{0}$ taking actions the base policy never takes"

which makes it sound like they're switched.

I also notice that you call it "prior policy", "base policy" and "reference policy" at different times; these all make sense but it'd be a bit nicer if there was one phrase used consistently.

Thomas Kwa (2y)

The third one was a typo which I just fixed. I have also changed it to use "base policy" everywhere to be consistent, although this may change depending on what terminology is most common in an ML context, which I'm not sure of.

Noosphere89 (2y)

I have a question about this post, and it has to do with the case where both utility and error are heavy tailed:

Where does the expected value converge to if both utility and errors are heavy tailed? Is it 0, infinity, some other number, or does it not converge to any number at all?

Thomas Kwa (2y)

It could be anything, because KL divergence basically does not restrict the expected value of anything heavy-tailed. You could get finite utility and error, or the reverse, or infinity of both, or neither converging, or even infinite utility and negative infinity error; any of these is possible with arbitrarily low KL divergence.
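To make the "arbitrarily low KL divergence" point concrete, here is a small numerical sketch under assumed distributions (none of the specifics are from the post). The base policy picks level $k$ with probability proportional to $2^{-k}$, and level $k$ has reward $1.5^k$, which gives a power-law tail with finite mean (about 3). The alternative policy follows the base policy with probability $1 - ε$ and jumps to one chosen level with probability $ε = 1/k^2$. As the chosen level increases, the KL divergence from the base policy shrinks while the expected reward grows without bound.

```python
import math


def base_probs(num_levels):
    """Base policy: P(level k) proportional to 2**-k, for k = 1..num_levels."""
    weights = [2.0 ** -k for k in range(1, num_levels + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mixture_stats(num_levels, target_level, eps):
    """KL(pi || pi_0) and E_pi[reward] for pi = (1 - eps) * pi_0 + eps * delta.

    `delta` is a point mass on `target_level`; the reward at level k is 1.5**k.
    Both choices are illustrative assumptions.
    """
    p = base_probs(num_levels)
    rewards = [1.5 ** k for k in range(1, num_levels + 1)]
    q = [(1.0 - eps) * pk for pk in p]
    q[target_level - 1] += eps
    kl = sum(qk * math.log(qk / pk) for qk, pk in zip(q, p))
    mean_reward = sum(qk * rk for qk, rk in zip(q, rewards))
    return kl, mean_reward


# Push the rare jump further into the tail while shrinking its probability.
results = [mixture_stats(120, k, 1.0 / k**2) for k in (20, 50, 100)]
for (kl, mean_reward), k in zip(results, (20, 50, 100)):
    print(f"target level {k}: KL = {kl:.4f}, expected reward = {mean_reward:.3g}")
```

The same construction works with the sign of the rare outcome flipped, which is why a KL bound alone says nothing about where the expectation ends up.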

To draw any conclusions, you need to assume some joint distribution between the error and utility, and use some model of selection that is not optimal policies under a KL divergence penalty or limit. If they are independent and you think of optimization as conditioning on a minimum utility threshold, we proved last year that you get 0 of whichever has lighter tails and $\infty$ of whichever has heavier tails, unless the tails are very similar. I think the same should hold if you model optimization as best-of-n selection. But the independence assumption is required and pretty unrealistic, and you can't weaken it in any obvious way.
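A quick Monte Carlo sketch of the independent case described above, under assumptions that are mine rather than the post's: error $X$ is Pareto with shape 2 (heavy-tailed), utility $V$ is standard normal (light-tailed), the proxy is $X + V$, and "optimization" means conditioning on the proxy exceeding a threshold. With a high threshold, the surviving samples get there almost entirely through $X$, so the conditional mean of $V$ stays near 0 while the conditional mean of $X$ is large.

```python
import random


def conditional_means(threshold, n_samples=200_000, seed=0):
    """Estimate E[X | X+V >= threshold] and E[V | X+V >= threshold].

    X ~ Pareto(shape=2, scale=1) is the heavy-tailed error and V ~ N(0, 1)
    is the light-tailed utility; they are sampled independently.
    Returns (mean_x, mean_v, number_of_accepted_samples).
    """
    rng = random.Random(seed)
    sum_x = 0.0
    sum_v = 0.0
    count = 0
    for _ in range(n_samples):
        x = rng.paretovariate(2.0)
        v = rng.gauss(0.0, 1.0)
        if x + v >= threshold:
            sum_x += x
            sum_v += v
            count += 1
    if count == 0:
        raise ValueError("no samples passed the threshold; lower it or sample more")
    return sum_x / count, sum_v / count, count


mean_x, mean_v, count = conditional_means(30.0)
print(f"accepted {count} samples: E[X|.] ~ {mean_x:.1f}, E[V|.] ~ {mean_v:.2f}")
```

Swapping the two distributions (heavy-tailed utility, light-tailed error) should show the reverse pattern, though this sketch does not cover the dependent case that the comment flags as the realistic one.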

Realistically I expect that error will be heavy-tailed and heavier-tailed than utility by default so error goes to infinity. But error will not be independent of utility, so the expected utility depends mostly on how good extremely high error outcomes are. The prospect of AIs creating some random outcome that we overestimated the utility of by 10 trillion points does not seem especially good, so I think we should not be training AIs to maximize this kind of static heavy-tailed reward function.

Noosphere89 (2y)

My expectation is that error and utility are both extremely heavy tailed, and arguably in the same order of magnitude for heavy tails.

But thanks for answering. The real answer is that we can predict effectively nothing without independence, and thus we can justify virtually every outcome of real-life Goodhart.

Maybe it's catastrophic, maybe it doesn't matter, or maybe there's anti-goodhart, but I don't see a way to predict what will reasonably happen.

Also, why do you think that error is heavier tailed than utility?

Thomas Kwa (2y)

"Also, why do you think that error is heavier tailed than utility?"

Goodhart's Law is really common in the real world, and most things only work because we can observe our metrics, see when they stop correlating with what we care about, and iteratively improve them. Also, reward hacking is prevalent in RL, and it often reaches very high reward values.

If the reward model is as smart as the policy and is continually updated with data, maybe we're in a different regime where errors are smaller than utility.

Stephen McAleese (2y)

Regularize by a function other than KL divergence. For heavy-tailed error distributions, KL divergence doesn’t work, but capping the maximum odds ratio for any action (similar to quantilizers) still results in positive utility.

A recent paper from UC Berkeley, "Preventing Reward Hacking with Occupancy Measure Regularization", proposes replacing KL divergence regularization with occupancy measure (OM) regularization. OM regularization involves regularizing based on the state or state-action distribution rather than the action distribution:

"Our insight is that when reward hacking, the agent visits drastically different states from those reached by the safe policy, causing large deviations in state occupancy measure (OM). Thus, we propose regularizing based on the OM divergence between policies instead of AD [action distribution] divergence to prevent reward hacking"

The idea is that regularizing to minimize changes in the action distribution isn't always safe because small changes in the action distribution can cause large changes in the states visited by the agent:

Suppose we have access to a safe policy that drives slowly and avoids falling off the cliff. However, the car is optimizing a proxy reward function that prioritizes quickly reaching the destination, but not necessarily staying on the road. If we try to regularize the car’s action distributions to the safe policy, we will need to apply heavy regularization, since only slightly increasing the probability of some unsafe action (e.g., making a sharp right turn) can lead to disaster.
...
Our proposal follows naturally from this observation: to avoid reward hacking, regularize based on divergence from the safe policy’s occupancy measure, rather than action distribution. A policy’s occupancy measure (OM) is the distribution of states or state-action pairs seen by a policy when it interacts with its environment.
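A toy version of the car example can show how a small per-step change in the action distribution turns into a large change in state occupancy. This is a sketch under my own assumptions, not the paper's setup: there are two states ("road" and an absorbing "cliff"), the only choice on the road is whether to make the sharp turn, and occupancy is compared with total variation distance rather than whatever divergence the paper uses. The probabilities 0.001 and 0.02 and the horizon of 200 steps are arbitrary.

```python
import math


def action_kl(p_new, p_base):
    """KL(new || base) between two Bernoulli 'turn' probabilities at one step."""
    return (
        p_new * math.log(p_new / p_base)
        + (1.0 - p_new) * math.log((1.0 - p_new) / (1.0 - p_base))
    )


def road_occupancy(p_turn, horizon):
    """Fraction of time steps spent on the road over `horizon` steps.

    The car starts on the road and falls off (permanently) with probability
    `p_turn` at each step, so P(on road at step t) = (1 - p_turn)**t.
    """
    return sum((1.0 - p_turn) ** t for t in range(horizon)) / horizon


def occupancy_tv(p_new, p_base, horizon):
    """Total variation distance between the two state occupancy measures.

    With only two states, this is the absolute difference in road occupancy.
    """
    return abs(road_occupancy(p_new, horizon) - road_occupancy(p_base, horizon))


per_step_kl = action_kl(0.02, 0.001)
state_tv = occupancy_tv(0.02, 0.001, horizon=200)
print(f"per-step action KL = {per_step_kl:.3f} nats, occupancy TV = {state_tv:.2f}")
```

In this toy setting the per-step action KL is a few hundredths of a nat, yet the two policies spend most of the horizon in different states.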

Thomas Kwa (1y)

I think that paper and this one are complementary. Regularizing on the state-action distribution fixes problems with the action distribution, but if it's still using KL divergence you still get the problems in this paper. The latest version on arxiv mentions this briefly.
