
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I've done some work on Goodhart's law, and I've argued that we can make use of all our known uncertainties in order to reduce or remove this effect.

Here I'll look at a very simple case: where we know only one thing, which is that Goodhart's law exists.

## Proxies exist

There are two versions of Goodhart's law, as we commonly use the term. The simplest is that there is a difference between maximising for a proxy, $V$, and maximising for the real objective, $U$.

Let $W = U - V$ be the difference between the true objective and the proxy. Note that in this post, we're seeing $U$, $V$, and $W$ as actual maps from world histories to $\mathbb{R}$. The equivalence classes of them under positive affine transformations are denoted by $[U]$, $[V]$, and so on.

Note that $V + W$ makes sense, as does $[V + W]$, but $[V] + [W]$ does not: $[V] + [W]$ defines a family of functions with three degrees of freedom (two scalings and one addition), not the usual two for $[V + W]$ (one scaling and one addition).
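To see the degrees-of-freedom count concretely, here is a short worked expansion. It assumes that $[X]$ denotes the set of positive affine transformations of a utility $X$, and that $V$ and $W$ stand for any two utility functions; the parameter names $a$, $b$, $c$, $d$, $e$ are introduced only for this illustration.

```latex
\begin{align*}
[V + W] &= \{\, a(V + W) + b \;:\; a > 0,\ b \in \mathbb{R} \,\}
  && \text{(two parameters: } a, b\text{)} \\
[V] + [W] &= \{\, (aV + b) + (cW + d) \;:\; a, c > 0,\ b, d \in \mathbb{R} \,\} \\
  &= \{\, aV + cW + e \;:\; a, c > 0,\ e \in \mathbb{R} \,\}
  && \text{(three parameters: } a, c, e\text{)}
\end{align*}
```

The two offsets $b$ and $d$ merge into a single constant $e = b + d$, but the two scalings stay independent, so the relative weight of the two utilities is left unfixed.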

So, the simplest version of Goodhart's law is thus that there is a chance for $W$ to be non-zero.

Let $\mathcal{W}$ be the vector space of possible $W$, and let $p$ be a probability distribution over it. Assume further that $p$ is symmetric -- that for all $W \in \mathcal{W}$, $p(W) = p(-W)$.

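Then the pernicious effect of Goodhart's law is in full effect: suppose we ask an agent to maximise $U = V + W$, with $V$ known and $p$ giving the distribution over possible $W$.

Then, given that uncertainty, it will choose the policy $\pi$ that maximises

• $\mathbb{E}(V \mid \pi) + \int_{\mathcal{W}} p(W)\, \mathbb{E}(W \mid \pi)\, dW = \mathbb{E}(V \mid \pi)$,

since $p(W)\, \mathbb{E}(W \mid \pi)$ and $p(-W)\, \mathbb{E}(-W \mid \pi)$ cancel out.

So the agent will blindly maximise the proxy $V$.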
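The cancellation can be checked numerically. The following is a minimal sketch, not part of the original argument: it assumes a toy world with three outcomes, three deterministic policies, and an equally weighted set of candidate difference terms built in plus/minus pairs, so that the distribution is symmetric by construction. The proxy values and all function names are hypothetical.

```python
import random

def expected_utility(utility, policy):
    """E(utility | policy), with policy a probability distribution over outcomes."""
    return sum(p * u for p, u in zip(policy, utility))

def symmetric_samples(n_outcomes, n_pairs, seed=0):
    """Equally weighted candidates for W, built in +/- pairs so p(W) = p(-W)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_pairs):
        w = [rng.gauss(0.0, 1.0) for _ in range(n_outcomes)]
        samples.append(w)
        samples.append([-x for x in w])
    return samples

def score_under_uncertainty(v, w_samples, policy):
    """Average of E(V + W | policy) over the equally weighted candidates W."""
    total = 0.0
    for w in w_samples:
        u = [vi + wi for vi, wi in zip(v, w)]
        total += expected_utility(u, policy)
    return total / len(w_samples)

# Hypothetical toy setup: proxy values per outcome, one policy per outcome.
V = [1.0, 2.0, 5.0]
POLICIES = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
W_SAMPLES = symmetric_samples(n_outcomes=3, n_pairs=500)

# The policy that is best under uncertainty about W ...
best_for_u = max(POLICIES, key=lambda pi: score_under_uncertainty(V, W_SAMPLES, pi))
# ... coincides with the policy that is best for the proxy alone.
best_for_v = max(POLICIES, key=lambda pi: expected_utility(V, pi))
print(best_for_u == best_for_v)
```

Each candidate is cancelled by its negation, so the score of every policy collapses (up to floating-point error) to its proxy value, and the two maximisations pick the same policy.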

## We know: maximising behaviour is bad

But, most of the time, when we talk about Goodhart's law, we don't just mean "a proxy exists". We mean that not only does a proxy exist, but that maximising the proxy too much is pernicious for the true utility.

Consider for example a nail factory, where $U$ is the number of true nails produced, and $V$ is the number of "straight pieces of metal" produced. Here $W = U - V$ is the difference between the number of true nails and the pieces of metal.

In this case, we expect a powerful agent maximising $V$ to do much, much worse on the $U$ scale. As the agent expands and gets more control over the world's metal production, $V$ continues to climb, while $U$ tumbles. So $V$ is not only a (bad) proxy for $U$; at the extremes, it's pernicious.

But what if we considered $U' = V - W$? This is an odd utility indeed; it's equal to $V - (U - V) = 2V - U$. This is twice the number of pieces of metal produced, minus the number of true nails produced. And as the agent's power increases, so does $V$, but so does $U'$, to an even greater extent. Now, $U'$ won't increase to the same extent as it would under the optimal policy for $U'$, but it still increases massively under $V$-optimisation.
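A small arithmetic illustration of the mirrored utility may help. The numbers below are invented purely for illustration (they are not from any real factory or from the original post): the hypothetical function `nail_factory` returns made-up counts of true nails and straight pieces of metal produced by a proxy-maximising agent at increasing power levels.

```python
def nail_factory(power):
    """Hypothetical outcome of a proxy-maximiser at a given power level.

    Returns (u, v): true nails and straight pieces of metal. Made-up numbers.
    """
    v = 100 * power               # pieces of metal keep climbing with power
    u = max(0, 50 - 10 * power)   # true nails tumble as metal is diverted
    return u, v

rows = []
for power in range(6):
    u, v = nail_factory(power)
    w = u - v                     # difference term: true objective minus proxy
    u_prime = v - w               # mirrored utility: proxy minus difference
    assert u_prime == 2 * v - u   # same thing as "twice the metal minus the nails"
    rows.append((power, v, u, u_prime))
    print(f"power={power}  metal={v}  nails={u}  mirrored={u_prime}")
```

In this toy table the nail count falls while the mirrored utility rises faster than the proxy itself, which is the asymmetry the paragraph above describes.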

And, implicitly, when we talk about Goodhart's law, we generally mean that the true utility is of type $U$, rather than of type $U'$; indeed, that things like $U'$ don't really make sense as candidates for the true utility. So there is a break in symmetry between $W$ and $-W$.

## Putting this knowledge into figures

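So, suppose $\pi_0$ is some default policy, and $\pi_V$ is the $V$-maximising policy. One way of phrasing the stronger type of Goodhart's law is that:

• $\mathbb{E}(U \mid \pi_V) < \mathbb{E}(U \mid \pi_0)$.

Then, substituting $U = V + W$ and rearranging gives:

• $\mathbb{E}(W \mid \pi_0) - \mathbb{E}(W \mid \pi_V) > \mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0)$.

Because $\pi_V$ is the optimal policy for $V$, the term $\mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0)$ is non-negative (and most likely strictly positive). So the new restriction on $W$ is that:

• $\mathbb{E}(W \mid \pi_0) - \mathbb{E}(W \mid \pi_V) > \mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0) \geq 0$.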
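The rearrangement can be spelled out step by step. This sketch assumes the notation $\pi_0$ for the default policy, $\pi_V$ for the proxy-maximising policy, $U$ for the true utility, $V$ for the proxy and $W = U - V$ for their difference, and uses only linearity of expectation.

```latex
\begin{align*}
\mathbb{E}(U \mid \pi_V) &< \mathbb{E}(U \mid \pi_0) \\
\mathbb{E}(V \mid \pi_V) + \mathbb{E}(W \mid \pi_V)
  &< \mathbb{E}(V \mid \pi_0) + \mathbb{E}(W \mid \pi_0)
  && \text{(substitute } U = V + W\text{)} \\
\mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0)
  &< \mathbb{E}(W \mid \pi_0) - \mathbb{E}(W \mid \pi_V)
  && \text{(collect } V \text{ on the left, } W \text{ on the right)}
\end{align*}
```

The left-hand side of the last line is the proxy's gain from switching to the proxy-maximising policy, so the difference term must lose strictly more than the proxy gains.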

This is an affine restriction, in that if $W$ and $W'$ satisfy that restriction, then so does a mix $qW + (1-q)W'$ for $0 \leq q \leq 1$. In fact, it defines a hyperplane in $\mathcal{W}$, with everything on one side satisfying that restriction, and everything on the other side not satisfying it.
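The mixing claim follows from linearity of expectation. In the sketch below, the shorthand $g$ and $A$ is introduced only for this illustration: $g(W)$ is the drop in the difference term when moving from the default policy $\pi_0$ to the proxy-maximising policy $\pi_V$, and $A$ is the corresponding gain in the proxy $V$.

```latex
\begin{align*}
g(W) &:= \mathbb{E}(W \mid \pi_0) - \mathbb{E}(W \mid \pi_V),
\qquad
A := \mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0) \\
g\big(qW + (1-q)W'\big) &= q\, g(W) + (1-q)\, g(W')
  && \text{(} g \text{ is linear in } W\text{)} \\
  &> qA + (1-q)A = A
  && \text{(since } g(W) > A,\ g(W') > A,\ 0 \leq q \leq 1\text{)}
\end{align*}
```

The boundary $g(W) = A$ is the hyperplane in question, and the restriction picks out the open half-space on one side of it.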

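In fact, the set that satisfies the restriction (call it $\mathcal{W}^+$) is "smaller" (under $p$) than the set that does not (call that $\mathcal{W}^-$). This is because, if $W \in \mathcal{W}^+$, then $-W \in \mathcal{W}^-$ -- but the converse is not true if $\mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0) > 0$.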

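And now, when an agent maximises $U = V + W$, with $V$ known and $W$ distributed by $p$ but also known to obey that restriction, the picture is very different. It will maximise $V + \overline{W}$, where $\overline{W} = \frac{1}{p(\mathcal{W}^+)} \int_{\mathcal{W}^+} p(W)\, W \, dW$ is the mean of $W$ conditional on the restriction, and $\overline{W}$ is far from $0$ in general. So the agent won't just maximise the proxy.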
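Here is a minimal numerical sketch of the restricted case, under invented assumptions: three outcomes, one deterministic policy per outcome, made-up proxy values, and a symmetric sample of candidate difference terms built in plus/minus pairs. All names (`PI_0`, `PI_MID`, `PI_V`, `GAP`, `w_bar`) are hypothetical and chosen for this illustration only.

```python
import random

def expect(utility, policy):
    """E(utility | policy), with policy a probability distribution over outcomes."""
    return sum(p * u for p, u in zip(policy, utility))

def symmetric_samples(n_outcomes, n_pairs, sigma, seed=0):
    """Equally weighted candidates for W, built in +/- pairs so p(W) = p(-W)."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_pairs):
        w = [rng.gauss(0.0, sigma) for _ in range(n_outcomes)]
        samples.append(w)
        samples.append([-x for x in w])
    return samples

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical toy setup (made-up numbers).
V = [1.0, 2.0, 3.0]
PI_0 = [1.0, 0.0, 0.0]     # default policy
PI_MID = [0.0, 1.0, 0.0]   # some other available policy
PI_V = [0.0, 0.0, 1.0]     # proxy-maximising policy
POLICIES = {"pi_0": PI_0, "pi_mid": PI_MID, "pi_V": PI_V}

# Proxy gain from switching to the proxy-maximiser: E(V|pi_V) - E(V|pi_0).
GAP = expect(V, PI_V) - expect(V, PI_0)

def satisfies_restriction(w):
    """Stronger Goodhart condition: W drops by more than the proxy gains."""
    return expect(w, PI_0) - expect(w, PI_V) > GAP

samples = symmetric_samples(n_outcomes=3, n_pairs=1000, sigma=2.0)
w_plus = [w for w in samples if satisfies_restriction(w)]
w_minus = [w for w in samples if not satisfies_restriction(w)]

# Mean of W conditional on the restriction; no longer zero.
w_bar = mean_vector(w_plus)
u_bar = [v + w for v, w in zip(V, w_bar)]
best = max(POLICIES, key=lambda name: expect(u_bar, POLICIES[name]))
print(len(w_plus), len(w_minus), best)
```

Because each plus/minus pair contributes at most one member to the satisfying set, that set can never hold more than half the samples. And because the conditional mean itself satisfies the restriction, the default policy always scores strictly higher than the proxy-maximising policy here, so the agent no longer picks the proxy-maximiser.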

## Conclusion

So even the seemingly trivial fact that we expect a particular type of Goodhart effect dramatically reduces the effect of Goodhart's law.

Now, the effect isn't enough to converge on a good $U$: we'll need to use other information for that. But note one interesting point: the more powerful the agent is, the more effective it is at maximising $V$, so the higher $\mathbb{E}(V \mid \pi_V)$ gets -- and thus the higher $\mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0)$ becomes. So the most powerful agents have the strongest restrictions on what the possible $W$'s are. Note that we might be able to get this effect even for more limited agents, by defining $\pi_V$ not only as the optimal policy, but as some miraculous optimal policy where things work out unexpectedly well for the agent.

It will be interesting to see what happens in situations where we account for more and more of our (implicit and explicit) knowledge about $U$.