The reverse Goodhart problem

by Stuart_Armstrong · 1 min read · 8th Jun 2021 · 22 comments

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

There are two aspects to the Goodhart problem which are often conflated. One is trivially true for all proxy-true utility pairs; but the other is not.

Following this terminology, we'll say that V is the true goal, and U is the proxy. In the range of circumstances we're used to, U ≈ V; that's what makes U a good proxy. Then the Goodhart problem has two aspects to it:

  1. Maximising U does not increase V as much as maximising V would.
  2. When strongly maximising U, V starts to increase at a slower rate, and ultimately starts decreasing.

Aspect 1. is a tautology: the best way to maximise V is to... maximise V. Hence maximising U is almost certainly less effective at increasing V than maximising V directly.

But aspect 2. is not a tautology, and need not be true for generic proxy-true utility pairs (U, V). For instance, some pairs have the reverse Goodhart problem:

  3. When strongly maximising U, V starts to increase at a faster rate, and ultimately starts increasing more than twice as fast as U.

Are there utility functions that have anti-Goodhart problems? Yes, many. If (U, V) have a Goodhart problem, then (U, V') has an anti-Goodhart problem if V' = 2U - V.

Then in the range of circumstances we're used to, V' = 2U - V ≈ U. And, as V starts growing slower than U, V' starts growing faster; when V starts decreasing, V' starts growing more than twice as fast as U.
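As a rough numerical sketch of this construction, assume a hypothetical one-dimensional "optimisation effort" x, with U = x and V = x - x²/4. These functional forms are illustrative assumptions, not something taken from the post; only the definition V' = 2U - V comes from the text above.

```python
def U(x):
    """Proxy: grows linearly with effort x (illustrative assumption)."""
    return x


def V(x):
    """True goal with a Goodhart problem: tracks U for small x, peaks at x = 2, then declines."""
    return x - x**2 / 4


def V_prime(x):
    """Mirror-image goal V' = 2U - V, as defined in the text."""
    return 2 * U(x) - V(x)


def slope(f, x, h=1e-6):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)


# Compare growth rates at small, moderate and large effort.
for x in (0.1, 1.0, 2.0, 3.0):
    print(f"x={x:.1f}  dU={slope(U, x):.2f}  dV={slope(V, x):+.2f}  dV'={slope(V_prime, x):+.2f}")
```

Under these assumptions the three slopes nearly coincide at small x, and once V has started to fall (x > 2) the slope of V' exceeds twice the slope of U.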

Are there more natural utility functions that have anti-Goodhart problems? Yes: for instance, if you're a total or average utilitarian and you maximise the proxy "do the best for the worst off". In general, if V is your true utility and U is a prioritarian/conservative version of V (eg U = log(V) or other concave, increasing functions of V) then we have reverse Goodhart behaviour[1].
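A small sketch of the concave-proxy case, assuming the proxy is U = 1 + log(V) and the typical range is near V = 1. The shift by 1 is an assumption made here so that U and V roughly agree on that range; the function names are hypothetical.

```python
import math


def proxy(v):
    """Conservative proxy U = 1 + log(V); the shift makes U ~= V near V = 1 (assumed typical range)."""
    return 1.0 + math.log(v)


def true_utility_from_proxy(u):
    """Invert the proxy: the V that has been reached once the proxy reads u."""
    return math.exp(u - 1.0)


for u in (0.9, 1.0, 1.1, 2.0, 3.0):
    v = true_utility_from_proxy(u)
    # dV/dU = exp(u - 1) = v, so V gains v units per unit of proxy gained.
    print(f"U={u:.1f}  V={v:.3f}  dV/dU={v:.3f}")
```

With this choice, dV/dU equals V itself, so pushing the proxy well above the typical range makes the true utility grow more than twice as fast as the proxy as soon as V exceeds 2.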

So saying that we expect Goodhart problems (in the second sense) means that we know something special about V (and U). It's not a generic problem for all utility functions, but for the ones we expect to correspond to human preferences.


  1. We also need to scale the proxy so that U ≈ V on the typical range of circumstances; thus the conservatism of U is only visible away from the typical range.

22 comments

Your  is correlated with , and that's cheating for all practical purposes. The premise of Goodhart's law is that you can't measure your true goal well. That's why you need a proxy in the first place.

If you select a proxy at random with the only condition that it's correlated with your true goal in the domain of your past experiences, Goodhart's law claims that it will almost certainly not be correlated near the optimum. Emphasis on "only condition". If you specify further conditions, like, say, that your proxy is your true goal, then, well, you will get a different probability distribution.

V and V' are symmetric; indeed, you can define V as 2U-V'. Given U, they are as well defined as each other.

The point is that V and V' are both hard to define. U is simple, but without a good definition for V, you won't be able to get a good V', and if you do have a good V, you can just optimize that directly.

It seems I didn't articulate my point clearly. What I was saying is that V and V' are equally hard to define, yet we all assume that true human values have a Goodhart problem (rather than a reverse Goodhart problem). This can't be because of the complexity (since the complexity is equal) nor because we are maximising a proxy (because both have the same proxy).

So there is something specific about (our knowledge of) human values which causes us to expect Goodhart problems rather than reverse Goodhart problems. It's not too hard to think of plausible explanations (fragility of value can be re-expressed in terms of simple underlying variables to get results like this), but it does need explaining. And it might not always be valid (eg if we used different underlying variables, such as the smooth-mins of the ones we previously used, then fragility of value and Goodhart effects are much weaker), so we may need to worry about them less in some circumstances.

Sorry, why are V and V' equally hard to define? Like if V is "human flourishing" and U is GDP then V' is "twice GDP minus human flourishing" which is more complicated than V. I guess you're gonna say "Why not say that V is twice GDP minus human flourishing?"? But my point is: for any particular set U,V, V', you can't claim that V and V' are equally simple, and you can't claim that V and V' are equally correlated with U. Right?

Almost equally hard to define. You just need to define U, which, by assumption, is easy.

You have a true goal, . Then you take the set of all potential proxies that have an observed correlation with , let's call this . By Goodhart's law, this set has the property that any  will with probability 1 be uncorrelated with  outside the observed domain.

Then you can take the set . This set will have the property that any  will with probability 1 be uncorrelated with  outside the observed domain. This is Goodhart's law, and it still applies.

Your claim is that there is one element,  in particular, which will be (positively) correlated with . But such proxies still have probability 0. So how is that anti-Goodhart?

Pairing up  and  to show equivalence of cardinality seems to be irrelevant, and it's also weird.  is an element of , and this depends on .

[This comment is no longer endorsed by its author]

By Goodhart's law, this set has the property that any will with probability 1 be uncorrelated with outside the observed domain.

If we have a collection of variables , and , then is positively correlated in practice with most expressed simply in terms of the variables.

I've seen Goodhart's law as an observation or a fact of human society - you seem to have a mathematical version of it in mind. Is there a reference for that?

I ended up using mathematical language because I found it really difficult to articulate my intuitions. My intuition told me that something like this had to be true mathematically, but the fact that you don't seem to know about it makes me consider this significantly less likely.

If we have a collection of variables , and , then  is positively correlated in practice with most  expressed simply in terms of the variables.

Yes, but  also happens to be very strongly correlated with most  that are equal to . That's where you do the cheating. Goodhart's law, as I understand it, isn't a claim about any single proxy-goal pair. That would be equivalent to claiming that "there are no statistical regularities, period". Rather, it's a claim about the nature of the set of all potential proxies.

In a Bayesian language, Goodhart's law sets the prior probability of any seemingly good proxy being a good proxy, which is virtually 0. If you have additional evidence, like knowing that your proxy can be expressed in a simple way using your goal, then obviously the probabilities are going to shift.

And that's how your  and  are different. In the case of , the selection of  is arbitrary. In the case of , the selection of  isn't arbitrary, because it was already fixed when you selected . But again, if you select a seemingly good proxy  at random, it won't be an actually good proxy.

This looks like begging the question. The whole point of Goodhart is that the second case always applies (barring a discontinuity in the production functions - it's possible that trying to maximize U generates a whole new method, which produces far more V than the old way). You cannot argue against that by assuming a contradictory function into existence (at least, not without some actual examples).

It seems I didn't articulate my point clearly. What I was saying is that V and V' are equally hard to define, yet we all assume that true human values have a Goodhart problem (rather than a reverse Goodhart problem). This can't be because of the complexity (since the complexity is equal) nor because we are maximising a proxy (because both have the same proxy).

So there is something specific about (our knowledge of) human values which causes us to expect Goodhart problems rather than reverse Goodhart problems. It's not too hard to think of plausible explanations (fragility of value can be re-expressed in terms of simple underlying variables to get results like this), but it does need explaining. And it might not always be valid (eg if we used different underlying variables, such as the smooth-mins of the ones we previously used, then fragility of value and Goodhart effects are much weaker), so we may need to worry about them less in some circumstances.

I think it's empirical observation. Goodhart looked around, saw in many domains that U diverged from V in a bad way after it became a tracked metric, while seeing no examples of U diverging from a theoretical V' in a good way, and then minted the "law." Upon further analysis, no-one has come up with a counterexample not already covered by the built-in exceptions (if U is sufficiently close to V, then maximizing U is fine - eg Moneyball; OR if there is relatively low benefit to performing, agents won't attempt to maximize U - eg anything using Age as U, like senior discounts or school placements).

I think it's empirical observation.

The world doesn't just happen to behave in a certain way. The probability that all examples point in a single direction without some actual mechanism causing it is negligible.

Let me try to repair Goodhart's law to avoid these problems:

By statistics, we should very generally expect two random variables to be uncorrelated unless there's a "good reason" to expect them to be correlated. Goodhart's law says that if U and V are correlated in some distribution, then (1) if a powerful optimizer tries to maximize U, then it will by default go far out of the distribution, (2) the mere fact that U and V were correlated in the distribution does not in itself constitute a "good reason" to expect them to be correlated far out of the distribution, so by default they won't be; (3) therefore we expect Goodhart's law "by default": you optimize U, thus go out of the distribution, thus break the correlation between U and V, and then V regresses back down to its mean.

So then we can start going through examples:

  • GDP vs human flourishing: This example fits all the defaults. There is no "good reason" to expect an extremely-out-of-distribution correlation between "GDP" and "human flourishing"—really the only reason to expect a correlation is the fact that they're correlated in-distribution, and by itself that's not enough to count as a "good reason". And we definitely expect that powerfully maximizing GDP would push it far out-of-distribution. Therefore we expect Goodhart's law—if you maximize GDP hard enough, then human flourishing will stop going up and start going down as it regresses to the mean.
  • GDP vs "twice GDP minus human flourishing": Here there is a very good a priori reason to expect an extremely-out-of-distribution correlation between the two sides—namely the fact that "GDP" is part of both. So the default expectation doesn't apply.
  • GDP vs log(GDP): Here there's an even more obvious, a priori reason to expect a robust correlation across all possible configurations of matter in all possible universes. So the default expectation doesn't apply.
  • "Mass of an object" vs "total number of protons and neutrons in the object": The default expectation that "optimization takes you far out of the distribution" doesn't really apply here, because regularities hold in a much broader "distribution" if the regularity comes from basic laws of physics, rather than from regularities concerning human-sized objects and events. So you can have a quite powerful optimization process trying to maximize an object's mass, yet stay well within the distribution of environments where this particular correlation remains robust. (A powerful enough optimizer could eventually make a black hole, which would indeed break this correlation, and then we get Goodhart's law. Other physics-derived correlations would be truly unbreakable though, like inertial mass vs gravitational mass.)
  • "The utility of the worst-off human" vs "The utility of the average human": Is there a "good reason" to expect these to be correlated extremely-out-of-distribution? Yes! Mathematically, if the former goes to infinity, then the latter has to go to infinity too. So we have a sound a priori reason to at least question the Goodhart's law default. We need a more object-level analysis to decide what would happen.
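One way to see points (1) to (3) in a toy simulation, under assumptions that are not from this thread: take U = V + error, where V is normally distributed and the error is heavy-tailed (here, half the cube of a standard normal). Selecting the highest values of U then stands in for a powerful optimizer pushing U out of distribution.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

N = 20_000
signal = [random.gauss(0, 1) for _ in range(N)]            # V: what we actually care about
error = [0.5 * random.gauss(0, 1) ** 3 for _ in range(N)]  # heavy-tailed error (assumption)
proxy = [s + e for s, e in zip(signal, error)]             # U = V + error


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


# Across the whole sample, U and V are positively correlated.
in_dist_corr = pearson(proxy, signal)

# "Optimizing U hard": keep only the top 1% of samples by proxy score.
top = sorted(range(N), key=lambda i: proxy[i], reverse=True)[: N // 100]
mean_u_top = statistics.fmean(proxy[i] for i in top)
mean_v_top = statistics.fmean(signal[i] for i in top)

print(f"correlation over the whole sample: {in_dist_corr:.2f}")
print(f"top 1% by U: mean U = {mean_u_top:.2f}, mean V = {mean_v_top:.2f}")
```

In this toy setup the extreme values of U are mostly extreme values of the error term, so V among the top scorers stays far closer to its population mean than U does. It is only one mechanism among several that could produce the default behaviour described above.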

Cheers, these are useful classifications.


This is one of the times it helps to visualize things to see what's going on.


Let's pick target shooting as an example, since it's easy to picture and makes for a good metaphor. The goal is to get as close as possible to the bullseye, and for each inch of miss you score one less point. Visually, you see a group of concentric "rings" around the bullseye which score fewer and fewer points as they get bigger. Simplifying to one dimension for a moment, V = -abs(x).

However, it's not easy to point the rifle right at the bullseye. You do your best, of course, and it's much, much closer to the bullseye than any random orientation would be, but maybe you end up aiming one inch to the right, and the more accurate your ammo is, the closer you get to this aimpoint of x=1. This makes U = -abs(1-x), or -abs(1-x)+constant or whatever. It doesn't really matter, but if we pick -abs(1-x)+1, then U = V when you miss sufficiently far to the left, so it fits nicely with your picture.

When we plot U, V, and 2U-V, we can see that your mathematical truth holds, and it looks immediately suspicious. Going back to two dimensions, instead of having nice concentric rings around the actual target, you're pointing out that if the bullseye had instead been placed exactly where you ended up aiming, and if the rings were distorted and non-concentric in this certain way, then V would actually increase twice as fast as U.
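For anyone who wants to reproduce the plot, here is a minimal sketch that tabulates the three curves using exactly the one-dimensional formulas above. The function names are just labels chosen here.

```python
def V(x):
    """True score: lose one point per inch from the bullseye at x = 0."""
    return -abs(x)


def U(x):
    """Proxy score: centred on the actual aimpoint x = 1, shifted so that U = V for x <= 0."""
    return -abs(1 - x) + 1


def V_prime(x):
    """The mirror-image target 2U - V from the post."""
    return 2 * U(x) - V(x)


for x in (-2.0, -1.0, 0.5, 1.0, 2.0):
    print(f"x={x:+.1f}  U={U(x):+.1f}  V={V(x):+.1f}  2U-V={V_prime(x):+.1f}")
```

With these formulas, all three agree to the left of the bullseye. Between x=0 and x=1, U rises by one point per inch, V falls by one point per inch, and 2U-V rises by three points per inch.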

But it's sorta missing the point. Because for one, the absolute scaling is fairly meaningless in the first place because it brings you towards the same place anyway, and more importantly you don't get the luxury of drawing your bullseye after you shoot. If you had been aiming for V' in the first place, you almost certainly wouldn't have managed to pull off a proxy as perfect as U. (in general V' and U don't have to line up in the exact same spot like this, but in those cases you still wouldn't have happened to miss V' in this particular way)


Goodhart has nothing to do with human values being "funny", it has to do with the fundamental difficulty of setting your sights in just the right place. Once you're within the range of the distance between your proxy and actual goal, it's no longer guaranteed that getting closer to the proxy gets you closer to your goal and it can actually bring you further away -- and if it brings you further away, that's bad. If you did a good job on all axes, maybe you end up hitting the 9 ring and that's good enough.

The thing that makes it "inevitable disaster" rather than just "suboptimal improvement" is when you forget to take into account a whole dimension. Say, if you aim your rifle well in azimuth and elevation but instead of telling the bullet to stop at a certain distance, you tell it to keep going in that direction forever and it manages to succeed well beyond the target range.

Maybe I'm missing something, but this seems already captured by the normal notion of what Goodharting is in that it's about deviation from the objective, not the direction of that deviation.

The idea that maximising the proxy will inevitably end up reducing the true utility seems a strong implicit part of Goodharting the way it's used in practice.

After all, if the deviation is upwards, Goodharting is far less of a problem. It's "suboptimal improvement" rather than "inevitable disaster".

Ah, yeah, that's true, there's not much concern about getting too much of a good thing and that actually being good, which does seem like a reasonable category for anti-Goodharting.

It's a bit hard to think of when this would actually happen, though, since usually you have to give something up, even if it's just the opportunity to have done less. For example, maybe I'm trying to get a B on a test because that will let me pass the class and graduate, but I accidentally get an A. The A is actually better and I don't mind getting it, but then I'm potentially left with regret that I put in too much effort.

Most examples I can think of that look like potential anti-Goodharting seem the same: I don't mind that I overshot the target, but I do mind that I wasn't as efficient as I could have been.

That test / class example isn't even a case, because the test is instrumental to the goal; it's not a metric. Your U in this case is "time spent studying", which you accurately see will be uncorrelated with "graduating" if all students (or all counterfactual "you"s) attempt to optimize it.

I wouldn't relate it to humans. In just about any basic machine learning setting, (train, test) has aspect 2. In fact, what you describe speaks in favor of modeling preferences using something other than utility functions, where aspect 3 is ruled out.

Re your natural example: I would expect that as one shifts from 90% doing the best for the worst off + 10% attainable utility preservation to 100% the former, average welfare goes down.

Hmm... I find the scaling aspect a bit fishy (maybe an ordinal vs cardinal utility issue?). The goodness of a proxy should be measured by the actions it guides, and a V-maximizer, a log(V) maximizer and an  maximizer will all take the same actions (barring uncertain outcomes).
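A quick sketch of that claim, with made-up actions and utilities (everything named here is hypothetical): under certainty a V-maximizer and a log(V)-maximizer pick the same action, while the "barring uncertain outcomes" caveat shows up as soon as one option is a gamble.

```python
import math

# Hypothetical actions and their (positive) true utilities; the numbers are made up.
outcomes = {"a": 1.5, "b": 4.0, "c": 2.5}

# Certain outcomes: any increasing transform of V ranks the actions identically.
best_by_v = max(outcomes, key=lambda a: outcomes[a])
best_by_log_v = max(outcomes, key=lambda a: math.log(outcomes[a]))


def expected(f, values):
    """Expected value of f over equally likely outcome values."""
    return sum(f(v) for v in values) / len(values)


def identity(v):
    return v


# Uncertain outcomes: a sure 2.5 versus a 50/50 gamble between 0.5 and 5.0.
sure, gamble = [2.5], [0.5, 5.0]
v_prefers_gamble = expected(identity, gamble) > expected(identity, sure)
log_v_prefers_gamble = expected(math.log, gamble) > expected(math.log, sure)

print(best_by_v, best_by_log_v, v_prefers_gamble, log_v_prefers_gamble)
```

Here the expected-V maximizer takes the gamble and the expected-log(V) maximizer does not, which is where the cardinal scaling starts to matter.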

That said, reverse Goodhart remains possible. I'd characterize it as a matter of being below a proxy's range of validity, whereas the more familiar Goodhart problem involves ending up above it. E.g. if V =  + Y, then U = X is a reverse-Goodhart proxy for V—the higher X gets, the less you'll lose (relatively) by neglecting Y. (Though we'd have to specify some assumptions about the available actions to make that a theorem).

An intuitive example might be a game with an expert strategy and a beginner strategy—'skill at the expert strategy' being a reverse-Goodhart proxy for skill at the game.