When Goodharting is optimal: linear vs diminishing returns, unlikely vs likely, and other factors

Stuart_Armstrong

In this post, I'll argue that some behaviours that seem to be clear examples of the Goodhart problem are in fact not. They are, in some sense, perfectly optimal.

And, somewhat conversely, I'll argue that the reason we fear Goodharted results is that we implicitly know that there is some key information about the reward functions that we have not been able to include.

Intro: a Goodhart example?

Consider the following situation, with a robot that can go left or right one square each turn (or stay put):

The reward function $U_{L}$ rewards the robot by $1$ each turn it spends in the $L$ (left) square, while $U_{R}$ similarly rewards the robot for each turn spent in the $R$ square.

Suppose the robot puts $50.1\%$ probability on $U_{L}$ being correct, and $49.9\%$ probability on $U_{R}$ . Its optimal policy is then to go left, and stay in the left square forever.

So far, so Goodhart; it feels wrong that, because of such a small difference in probability, $U_{R}$ is essentially completely irrelevant. But, in fact, both $U_{L}$ and $U_{R}$ think that that policy is optimal!

That statement needs clarification, obviously. The reward $U_{R}$ would much prefer that the robot went right; failing that, it would prefer the robot alternated between $L$ and $R$ . But still, in a sense, it sees that the policy of going left is optimal.

To see how, define two reward functions to be mutually exclusive if one is always zero when the other is non-zero, like $U_{L}$ and $U_{R}$ are. Then, suppose you asked the following question to a $U_{R}$ -optimiser:

"Suppose that a robot is uncertain between $U_{R}$ and one other reward function, which is mutually exclusive with $U_{R}$ . The robot has no further way of distinguishing between the two. The $U_{R}$ may be the most probable function or the least probable, both options are equally likely. Is it optimal for the robot to maximise the highest probability reward function (equivalently, to maximise the weighted sum of the reward functions)?"

And the answer to that, from the $U_{R}$ -optimiser, is a bold YES!! That is an optimal policy, and better, from $U_{R}$ 's expected perspective, than moving back and forth between $L$ and $R$ (since that loses reward while on the centre square).

Given those constraints, both reward functions would agree that policy is optimal ex ante (before knowing the probabilities); $U_{R}$ would only disagree with it ex post (after knowing the probabilities).
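A rough sketch of that ex ante comparison from $U_{R}$ 's point of view. The function names, the particular back-and-forth cycle, and the long-run per-turn averaging (which ignores the first move off the centre square) are illustrative assumptions, not part of the original setup.

```python
from fractions import Fraction

def goodhart_value_for_UR():
    """Long-run reward per turn that U_R expects, ex ante, if the robot
    commits to whichever reward function turns out to be more probable.
    U_R is equally likely to be the more or the less probable one."""
    # Half the time the robot sits on R forever (1 per turn), half the time on L (0).
    return Fraction(1, 2) * 1 + Fraction(1, 2) * 0

def alternating_value_for_UR(k):
    """Long-run reward per turn for U_R if the robot instead cycles:
    k turns on L, one turn on the centre, k turns on R, one turn on the centre."""
    cycle_length = 2 * k + 2
    return Fraction(k, cycle_length)

print(goodhart_value_for_UR())        # 1/2
print(alternating_value_for_UR(1))    # 1/4
print(alternating_value_for_UR(100))  # 50/101
```

However long the robot lingers on each side, the two centre-square turns per cycle keep the "fair" policy strictly below $1/2$ for $U_{R}$ .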

When to Goodhart, when not to

There are two other optimal policies for $U_{R}$ and $U_{L}$ . One is to optimise the least likely reward function; the other is to optimise a randomly chosen reward function. Note that, since it's not known in advance which reward function is the most likely, all three policies are essentially the same: pick a reward function at random and maximise that.

Still, the point remains: both reward functions agree (ex ante) that Goodharting is better than being "fair" between the options. Why do we then expect that Goodharting will be a disaster; and why do we so often want the AI to be more "fair" between the reward functions?

It's because the situations we consider are typically very different from the situation above. This post will explore the various reasons that this might be:

Naive, maximum likelihood maximisation
Ideal reward function unlikely
Ideal reward function difficult to optimise (including extremal Goodhart)
Diminishing returns

There is also the somewhat distinct situation of:

Gains from trade between reward functions

But still, it needs to be repeated that:

The reason we think Goodharting is bad, is because of some features we know about the reward/utility functions that we want.

So let's start with one bad move: only maximising a single proxy.

1 Naive, maximum likelihood maximisation

In the description of the optimal policy above, I asked "Is it optimal for the robot to maximise the highest probability reward function (equivalently, to maximise the weighted sum of the reward functions)?"

But that is only "equivalent" because the different reward functions are mutually exclusive - you have to maximise one at the cost of the other. Combined with the fact that it's equally easy to maximise either function, this means that maximising their weighted sum is the same as maximising the most likely.

But in general, that equivalence will not hold. Call maximising the highest probability reward function "maximum likelihood maximisation". In general, doing the Bayesian thing and maximising the probability-weighted sum of the reward functions, will be better than "maximum likelihood maximisation". You can see that as the contrast between the first and second options in this post.
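Written out, with candidate reward functions $U_{i}$ held with probabilities $p_{i}$ and policies $\pi$ (this notation is introduced here for illustration and is not in the original post), the two approaches are:

$$\pi_{\text{ML}} \in \arg\max_{\pi} U_{i^{*}}(\pi), \quad i^{*} = \arg\max_{i} p_{i}, \qquad \pi_{\text{Bayes}} \in \arg\max_{\pi} \sum_{i} p_{i} U_{i}(\pi).$$

In the $U_{L}$ , $U_{R}$ example the two coincide: staying on $L$ earns $p_{L}$ per turn in expectation and staying on $R$ earns $p_{R}$ , so the weighted sum is maximised by serving whichever function has the larger $p_{i}$ .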

One of the reasons that "regressional Goodhart" can often feel so tractable is that the two approaches are often the same: the mean of the different possible reward functions is almost identical to the most likely reward function (ie mean $\approx$ mode). So the simpler approach, maximum likelihood maximisation, is often the correct one.

So from now on, I'm assuming we'll be Bayesian about everything.

2 Ideal reward function unlikely

Note that accepting the optimality of Goodharting relies on it being equally likely that $U_{R}$ and $U_{L}$ be the most probable or least probable. But this fails if we feel that, for some reason, our ideal reward function is likely to be given a lower probability.

And this is typically the case when we worry about AIs Goodharting a learning process. We expect that the usual "proxy reward" will be over-simple, and will neglect some crucial features of the ideal reward function. Then if there is any sort of complexity penalty, implicit or explicit, the ideal reward function is going to be given a lower probability than unsuitable alternatives.

In that case, it would be foolish to just be Bayesian about the reward function; we have a genuine problem, and we need to make the ideal reward function more likely somehow, or, as a second best option, force the AI to be more "fair" between the different reward functions.

An extreme version of this is when the ideal reward function is completely absent from the space of possible reward functions. This is typically the case when we rely on a few simple proxy functions that are somewhat aligned with what we want, on the test distribution. We know that there is going to be some Goodharting, maybe disastrous Goodharting, as we move on to new distributions, because the ideal reward function is just not a candidate there.

Solving this issue is probably the most important in this area.

3 Ideal reward function difficult to optimise

The reward functions $U_{L}$ and $U_{R}$ are equally easy to optimise. But we might suspect that the ideal reward function could be difficult to optimise. This might be because we think it has complex requirements (making humans have genuinely flourishing lives), while many proxy candidates are much easier to optimise (increasing the "humans have flourishing lives" variable).

In that case, a Bayesian optimiser will divert resources to the easy-to-optimise reward functions, putting little effort into the other ones.

Note that we expect this to happen even if the ideal reward function is not particularly hard to optimise. If the AI considers many reward functions, it will find some that are easier to optimise, some harder. Similarly to the winner's curse in auctions, the AI's efforts will go to the few functions that are a) easy to optimise, and b) not too unlikely.

Note that some versions of "extremal Goodhart" come under this category. This is when the reward function can reach extremely high values in unforeseen and unusual situations - and thus will get optimised more than an "honest" reward function that extrapolates better.

This kind of problem has a clear solution: re-normalisation of all the reward functions. None of the standard re-normalisation methods is fully ideal (they all introduce their own small distortions), but they remove this problem entirely.
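As a minimal sketch of why the scale of a reward function matters, and of one possible re-normalisation, consider the toy below. The min-max rescaling, the function names, and all the numbers are assumptions chosen for illustration; the post does not single out a particular normalisation method.

```python
def bayes_best_action(probs, rewards):
    """Return the index of the action maximising the probability-weighted sum.
    rewards[i][a] is the value reward function i assigns to action a."""
    n_actions = len(rewards[0])

    def expected(a):
        return sum(p * r[a] for p, r in zip(probs, rewards))

    return max(range(n_actions), key=expected)

def min_max_normalise(reward):
    """Rescale one reward function so its worst action scores 0 and its best 1.
    Assumes the function is not constant across actions."""
    lo, hi = min(reward), max(reward)
    return [(v - lo) / (hi - lo) for v in reward]

# Hypothetical numbers: a likely, modest function and an unlikely one
# that can reach extreme values.
probs = [0.99, 0.01]
rewards = [
    [1.0, 0.0],     # "honest" function prefers action 0
    [0.0, 1000.0],  # extreme function prefers action 1
]

raw_choice = bayes_best_action(probs, rewards)
normalised_choice = bayes_best_action(probs, [min_max_normalise(r) for r in rewards])
print(raw_choice, normalised_choice)  # 1 0
```

Before rescaling, the 1%-probable function wins because its values are a thousand times larger; after rescaling, the choice follows the probabilities.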

4 Diminishing returns

In this post I showed that it can make a big difference if reward functions have diminishing returns.

The same applies here. Let's define $U_{L}^{'}$ and $U_{R}^{'}$ as $U_{L}$ and $U_{R}$ with mild diminishing returns. If the robot has spent $n_{L}$ turns on $L$ , then the next time it is there, $U_{L}^{'}$ will get $1 / (\log (n_{L}) + 1)$ rather than $1$ (and similarly for $U_{R}^{'}$ with $n_{R}$ , the number of turns the robot has spent on $R$ ).

This is a very weakly diminishing return - the returns are almost linear (especially for large $n_{L}$ and $n_{R}$ ).

To ensure that the robot actually reaches a decision, let's also give it a discount^[1] rate $γ < 1$ (this can also be seen as having a probability $1 - γ$ that the experiment will end, at each turn).

Assume that the robot believes that $U_{L}^{'}$ is correct with probability $p_{L}^{'} > 0$ , and $U_{R}^{'}$ with probability $p_{R}^{'} = 1 - p_{L}^{'} > 0$ .

Then for any $γ$ , $p_{L}^{'}$ , and $p_{R}^{'}$ satisfying the above constraints, the optimal policy involves spending turns on both $L$ and $R$ infinitely often.
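The toy simulation below gives some intuition for this. It is not the optimal policy from the text: it is a myopic heuristic that ignores the discount rate and the turn lost on the centre square, and it assumes the $k$-th visit to a square pays $1 / (\log (k) + 1)$ so that the first visit pays $1$. All function names are hypothetical.

```python
import math

def log_marginal(visits_so_far):
    """Reward for the next turn on a square already visited `visits_so_far` times.
    Assumption: the k-th visit pays 1 / (log(k) + 1), so the first visit pays 1."""
    return 1.0 / (math.log(visits_so_far + 1) + 1.0)

def linear_marginal(visits_so_far):
    """Linear returns: every turn pays 1."""
    return 1.0

def myopic_visits(p_left, marginal, turns):
    """Greedy heuristic (not the post's optimal policy): each turn, visit the side
    with the larger probability-weighted marginal reward, ignoring travel time."""
    p_right = 1.0 - p_left
    n_left = n_right = 0
    for _ in range(turns):
        if p_left * marginal(n_left) >= p_right * marginal(n_right):
            n_left += 1
        else:
            n_right += 1
    return n_left, n_right

print(myopic_visits(0.501, linear_marginal, 1000))  # (1000, 0)
print(myopic_visits(0.501, log_marginal, 1000))
```

With linear returns the heuristic never leaves $L$ ; with the logarithmic returns the marginal reward on either side keeps shrinking, so the other side always eventually becomes the better next step.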

That behaviour seems a lot less... Goodharty. And it's also why we feel afraid of the standard Goodhart outcomes. If our ideal preferences were linear (and the issues above were solved), then there'd be nothing to fear in a Goodhart outcome, just as $U_{R}$ found the robot's behaviour optimal ex ante.

It's because they are not linear that we fear Goodhart. We know that a universe with twice as much friendship but no fun - or the opposite, twice as much fun but no friendship - is not exactly equivalent with an ideal universe. A universe empty of humans or human-like beings is a failure, not just the zero of a linear scale.

To be more precise, since "non-linear" is a somewhat vague term ("non-linear in what, precisely?") I expect that if there are small changes $a$ and $b$ to a world $w$ so that $w + a + b$ is as good as $w$ , then typically $w + 10^{6} a + 10^{6} b$ will be (much) worse than $w$ .

Thus part of our fear of Goodhart outcomes seems to me, to be that we are tacitly aware that our reward functions have diminishing returns while considering only linear reward functions in the problems we are working on.

The final section looks at a quite different issue, examining when we should allow Goodharting, even if we haven't fully resolved the issues above.

5 Gains (and losses) from trade between reward functions

As stated above, $U_{L}$ and $U_{R}$ are mutually exclusive, and there are no gains from trade between them: optimising one sacrifices optimising the other.

Let's assume a slightly more complicated setup, where gains from trade are possible:

The top row is as before, but now $U_{L}$ gets $0.7$ reward from $L_{R}$ and $0.5$ reward from $R_{L}$ ; $U_{R}$ gets the reverse, $0.7$ from $R_{L}$ and $0.5$ from $L_{R}$ .

In that case, despite the fact that $0.7$ is quite a bit lower than $1$ , and the robot still finds $U_{L}$ the most likely reward function, the optimal policy is to go to $L_{R}$ and stay there.
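The arithmetic behind this can be checked with a short sketch. It reuses the $50.1\%$ / $49.9\%$ beliefs from the introduction and compares the per-turn value of staying on each square; ignoring travel time and paying nothing on the centre square are simplifying assumptions, and the function name is hypothetical.

```python
def expected_rewards(p_left, squares):
    """Probability-weighted reward per turn for staying on each square.
    squares maps a square name to (reward under U_L, reward under U_R)."""
    p_right = 1.0 - p_left
    return {name: p_left * u_l + p_right * u_r for name, (u_l, u_r) in squares.items()}

# Per-turn rewards from the text; the centre square is assumed to pay nothing.
squares = {
    "L": (1.0, 0.0),
    "R": (0.0, 1.0),
    "L_R": (0.7, 0.5),
    "R_L": (0.5, 0.7),
    "centre": (0.0, 0.0),
}

values = expected_rewards(0.501, squares)
bayes_choice = max(values, key=values.get)
# Maximum likelihood looks only at U_L, the more probable function.
ml_choice = max(squares, key=lambda name: squares[name][0])
print(bayes_choice, ml_choice)  # L_R L
```

Staying on $L_{R}$ is worth about $0.6002$ per turn against $0.501$ for $L$ , so the Bayesian mix and the maximum likelihood choice come apart here.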

Therefore, if there are many gains from trade between the different reward functions, Goodharting effects are not so bad. Even if we don't manage to solve all the issues mentioned above, we might be ok with a naive Bayesian approach.

Conversely, negative interactions between utilities make everything worse. Consider:

In this case, $U_{L}$ gets $1.5$ from $L_{- R}$ and $- 0.1$ from $R_{- L}$ , while $U_{R}$ does the opposite, getting $1.5$ from $R_{- L}$ and $- 0.1$ from $L_{- R}$ .

Then the optimal policy is to go for $L_{- R}$ , and thus punish the reward function $U_{R}$ . Though note that this is still what $U_{R}$ would agree to, ex ante.
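With the beliefs from the introduction ( $p_{L} = 0.501$ , $p_{R} = 0.499$ ; reusing those numbers here is an assumption, as is ignoring travel time), the per-turn expected rewards of the two new squares work out as:

$$\begin{aligned} L_{-R} &: \; 0.501 \times 1.5 + 0.499 \times (-0.1) = 0.7016 \\ R_{-L} &: \; 0.501 \times (-0.1) + 0.499 \times 1.5 = 0.6984 \end{aligned}$$

So a gap of $0.002$ in probability is enough to select the square that actively costs $U_{R}$ .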

So our tolerance for Goodhart-like effects depends on how we see the interactions between the reward functions. In practice, this is about the shape of the Pareto boundary - the set of policies such that you can't improve one reward function without making another worse. If this boundary is nice and rounded, then we expect that a "middle" outcome will be one where all reward functions get something, so small errors have small impacts:

The opposite happens when the Pareto boundary is flat, where "middle" outcomes are mixes between extremes; in that case, small errors are likely to throw the outcome to one end or the other:

This worse case happens more often when the reward functions are antagonistic, such as "keep people happy" vs "negative utilitarianism", rather than just independent rewards competing over limited resources. This is especially relevant if we think there are "fates worse than extinction" possible in the future, given some of the possible reward functions.
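To make the contrast between the two boundary shapes concrete, here is a small sketch. The quarter-circle and straight-line boundaries, the grid resolution, and the function name are illustrative assumptions, not taken from the post's figures.

```python
import math

def best_point(p1, boundary):
    """Pick the boundary point (u1, u2) maximising p1*u1 + (1 - p1)*u2."""
    return max(boundary, key=lambda u: p1 * u[0] + (1.0 - p1) * u[1])

N = 1000
# Rounded boundary: a quarter circle. Flat boundary: a straight line between the extremes.
rounded = [
    (math.cos(0.5 * math.pi * k / N), math.sin(0.5 * math.pi * k / N))
    for k in range(N + 1)
]
flat = [(1.0 - k / N, k / N) for k in range(N + 1)]

for p1 in (0.501, 0.499):
    print(p1, best_point(p1, rounded), best_point(p1, flat))
```

On the rounded boundary, nudging the belief from $0.501$ to $0.499$ barely moves the chosen point, and both reward functions get roughly $0.7$ . On the flat boundary the same nudge flips the outcome from one extreme to the other.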

Conclusion

So, as long as:

We use a Bayesian mix of reward functions rather than a maximum likelihood reward function.
An ideal reward function is present in the space of possible reward functions, and is not penalised in probability.
The different reward functions are normalised.
If our ideal reward functions have diminishing returns, this fact is explicitly included in the learning process.

Then, we shouldn't unduly fear Goodhart effects (of course, we still need to incorporate as much as possible about our preferences into the AI's learning process). The second problem, making sure that there are no unusual penalties for ideal reward functions, seems the hardest to ensure.

If not all those conditions are met, then:

The negative aspects of the Goodhart effect will be weaker if there are gains from trade and a rounded Pareto boundary.

[1] Without a discount rate, there is no optimal policy; the AI would want to "spend infinitely many turns on $L$ , and then spend infinitely many turns on $R$ ", similarly to the "heaven and hell problem".

Flo's summary for the Alignment Newsletter:

Suppose we were uncertain about which arm in a bandit provides reward (and we don’t get to observe the rewards after choosing an arm). Then, maximizing expected value under this uncertainty is equivalent to picking the most likely reward function as a proxy reward and optimizing that; Goodhart’s law doesn’t apply and is thus not universal. This means that our fear of Goodhart effects is actually informed by more specific intuitions about the structure of our preferences. If there are actions that contribute to multiple possible rewards, optimizing the most likely reward does not need to maximize the expected reward. Even if we optimize for that, we have a problem if value is complex and the way we do reward learning implicitly penalizes complexity. Another problem arises, if the correct reward is comparatively difficult to optimize: if we want to maximize the average, it can make sense to only care about rewards that are both likely and easy to optimize. Relatedly, we could fail to correctly account for diminishing marginal returns in some of the rewards.

Goodhart effects are a lot less problematic if we can deal with all of the mentioned factors. Independent of that, positive-sum interactions between the different rewards mitigate Goodhart effects, while negative-sum interactions make them more problematic.

Flo's opinion:

I enjoyed this article and the proposed factors match my intuitions. Predicting variable diminishing returns seems especially hard to me. I also worry that the interactions between rewards will be negative-sum, due to resource constraints.

My opinion:

Note that this post considers the setting where we have uncertainty over the true reward function, but _we can't learn about the true reward function_. If you can gather information about the true reward function, which <@seems necessary to me@>(@Human-AI Interaction@), then it is almost always worse to take the most likely reward or expected reward as a proxy reward to optimize.

Flo's summary for the Alignment Newsletter:

I like that summary!

I enjoyed this article and the proposed factors match my intuitions. Predicting variable diminishing returns seems especially hard to me. I also worry that the interactions between rewards will be negative-sum, due to resource constraints.

Resource constraints situations can be positive sum (consider most of the economy). The real problem is between antagonistic preferences, eg maximising flourishing lives vs negative utilitarianism, where a win for one is a loss for the other.

Note that this post considers the setting where we have uncertainty over the true reward function, but we can't learn about the true reward function. If you can gather information about the true reward function, which <@seems necessary to me@>(@Human-AI Interaction@), then it is almost always worse to take the most likely reward or expected reward as a proxy reward to optimize.

Yes, if you're in a learning process and treat it as if you weren't in a learning process, things will go wrong ^_^

My model goes something like this: If increasing values requires using some resource, gaining access to more of the resource can be positive sum, while spending it is negative sum due to opportunity costs. In this model, the economy can be positive sum because it helps with alleviating resource constraints.

But maybe it does not really matter if most interactions are positive-sum until some kind of resource limit is reached and negative-sum only after?

Generally, spending resources is zero-sum, not negative sum.

Right. I think my intuition about negative-sum interactions under resource constraints combined the zero-sum nature of resource spending with the (perceived) negative-sum nature of competition for resources. But for a unified agent there is no competition for resources, so the argument for resource constraints leading to negative-sum interactions is gone.

Thank you for alleviating my confusion.

To clear up some more confusion: The sum-condition is not what actually matters here, is it? In the first example of 5), the sum of utilities is lower than in the second one. The problem in the second example seems to rather be that the best states for one of the (Edit: the expected) rewards are bad for the other?

That again seems like it would often follow from resource constraints.

Negative vs positive vs zero sum is all relative to what we take to be the default outcome.

I take the default as "no effort is made to increase or decrease any of the reward functions".

But no matter how I take the default outcome, your second example is always "more positive sum" than the first, because $0.5 + 0.7 + 2x < 1.5 - 0.1 + 2x$ .

Granted, you could construct examples where the inequality is reversed and Goodhart bad corresponds to "more negative sum", but this still seems to point to the sum-condition not being the central concept here. To me, it seems like "negative min" compared to the default outcome would be closer to the actual problem. This distinction matters, because negative min is a lot weaker than negative sum.

Or am I completely misunderstanding your examples or your point?

Ok, have corrected it now; the negative-sum formulation was wrong, sorry.

After looking at the update, my model is:

(Strictly) convex Pareto boundary: Extreme policies require strong beliefs. (Modulo some normalization of the rewards)

Concave (including linear) Pareto boundary: Extreme policies are favoured, even for moderate beliefs. (In this case, normalization only affects the "tipping point" in beliefs, where the opposite extreme policy is suddenly favoured).

In reality, we will often have concave and convex regions. The concave regions then cause more extreme policies for some beliefs, but the convex regions usually prevent the policy from completely focusing on a single objective.

From this lens, 1) maximum likelihood pushes us to one of the ends of the Pareto boundary, 2) an unlikely true reward pushes us close to the "bad" end, 3) difficult optimization messes with normalisation (I am still somewhat confused about the exact role of normalization) and 4) not accounting for diminishing returns bends the Pareto boundary to become more concave.

I think normalisation doesn't fit in the convex-concave picture. Normalisation is to avoid things like $1\% (100 R_{1})$ being seen as the same as $100\% (R_{1})$ .

I was thinking about normalisation as linearly rescaling every reward to $[0, 1]$ when I wrote the comment. Then, one can always look at $[0, 1]^{2}$ , which might make it easier to graphically think about how different beliefs lead to different policies. Different scales can then be translated to a certain reweighting of the beliefs (at least from the perspective of the optimal policy), as maximizing $P (R_{1}) S_{1} R_{1} + P (R_{2}) S_{2} R_{2}$ is the same as maximizing $\frac{P (R_{1}) S_{1}}{P (R_{1}) S_{1} + P (R_{2}) S_{2}} R_{1} + \frac{P (R_{2}) S_{2}}{P (R_{1}) S_{1} + P (R_{2}) S_{2}} R_{2}$

I like that way of seeing it.

You are correct; I was unclear (and wrong in that terminology). I will rework the post slightly.

Negative sum vs zero sum (vs positive sum, in fact) depend on defining some "default state", against which the outcome is compared. A negative sum game can become a positive sum game if you just give all the "players" a fixed bonus (ie translate the default state). Default states are somewhat tricky and often subjective to define.

Now, you said "the best states for one of the rewards are bad for the other". "Bad" compared with what? I'm taking as a default something like "you make no effort to increase (or decrease) either reward".

So, my informal definition of "zero sum" is "you may choose to increase either $R_{1}$ or $R_{2}$ (roughly) independently of each other, from a fixed budget". Weakly positive sum would be "the more you increase $R_{1}$ , the easier it gets to increase $R_{2}$ (and vice versa) from a fixed budget"; strongly positive sum would be "the more you increase $R_{1}$ , the more $R_{2}$ increases (and vice versa)".

Negative sum would be the opposite of this ("easier"->"harder" and "increases"->"decreases").

The reason I distinguish weak and strong, is that if we add diminishing returns, this reduces the impact of weak negative sum, but can't solve strong negative sum.

Does this help, or add more confusion?

It's not clear how well 'increasing switching increases proximity to the ideal reward function' generalizes beyond this problem. (And we probably want the robot to not run forever.)

"If our ideal reward functions have diminishing returns, this fact is explicitly included in the learning process."

It seems like the exact shape of the diminishing returns might be quite hard to infer, while wrong "rates" of diminishing returns can lead to (slightly less severe versions of) the same problems as not modelling diminishing returns at all.

We probably at least need to incorporate our uncertainty about how returns diminish in some way. I am a bit confused about how to do this, as slowly diminishing functions will probably dominate if we just take an expectation over all candidates?

Renormalise those, so that slowly diminishing returns don't dominate.

