# Ω 15

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I've done some work on Goodhart's law, and I've argued that we can make use of all our known uncertainties in order to reduce or remove this effect.

Here I'll look at a very simple case: where we know only one thing, which is that Goodhart's law exists.

## Proxies exist

There are two versions of Goodhart's law, as we commonly use the term. The simplest is that there is a difference between maximising for a proxy -- $V$ -- rather than for the real objective -- $U$.

Let $W = U - V$ be the difference between the true objective and the proxy. Note that in this post, we're seeing $U$, $V$, and $W$ as actual maps from world histories to $\mathbb{R}$. The equivalence classes of them under positive affine transformations are denoted by $[U]$, $[V]$, and so on.

Note that $U - V$ makes sense, as does $[U - V]$, but $[U] - [V]$ does not: $[U] - [V]$ defines a family of functions with three degrees of freedom (two scalings and one addition), not the usual two for $[U - V]$ (one scaling and one addition).

So, the simplest version of Goodhart's law is thus that there is a chance for $W$ to be non-zero.

Let $\mathcal{W}$ be the vector space of possible $W$, and let $p$ be a probability distribution over it. Assume further that $p$ is symmetric -- that for all $W \in \mathcal{W}$, $p(W) = p(-W)$.

Then the pernicious effect of Goodhart's law is in full effect: suppose we ask an agent to maximise $U = V + W$, with $V$ known and $p$ giving the distribution over possible $W$.

Then, given that uncertainty, it will choose the policy that maximises

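• $\mathbb{E}(V + W \mid \pi) = \mathbb{E}(V \mid \pi) + \int_{\mathcal{W}} p(W)\, \mathbb{E}(W \mid \pi)\, dW = \mathbb{E}(V \mid \pi)$,

since $p(W)\, \mathbb{E}(W \mid \pi)$ and $p(-W)\, \mathbb{E}(-W \mid \pi)$ cancel out.

So the agent will blindly maximise the proxy $V$.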
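As a sanity check on the cancellation, here is a minimal numerical sketch. It assumes a finite set of policies and a finite set of candidate $W$'s, each represented only by its vector of expected values $\mathbb{E}(W \mid \pi)$. The number of policies, the values in `expected_v`, and the random candidates are all made up for illustration and are not part of the argument above.

```python
import random

random.seed(0)

# Hypothetical toy setup: four policies, each summarised by E(V | pi).
expected_v = [1.0, 2.5, 4.0, 3.0]
n_policies = len(expected_v)

# Each candidate W is represented by its vector of E(W | pi), one entry per policy.
# Candidates come in (W, -W) pairs with equal weight, so p(W) = p(-W).
half = [[random.gauss(0.0, 1.0) for _ in range(n_policies)] for _ in range(50)]
weights = [random.random() for _ in range(50)]
candidates = half + [[-x for x in w] for w in half]
p = weights + weights
total = sum(p)
p = [x / total for x in p]

# sum_W p(W) E(W | pi), computed separately for each policy.
mean_w = [sum(p_i * w[k] for p_i, w in zip(p, candidates)) for k in range(n_policies)]
expected_u = [v + m for v, m in zip(expected_v, mean_w)]

best_for_v = max(range(n_policies), key=lambda k: expected_v[k])
best_for_u = max(range(n_policies), key=lambda k: expected_u[k])
print(mean_w, best_for_v, best_for_u)
```

Under these assumptions `mean_w` vanishes up to floating-point error, so the policy with the highest expected $U$ is the same as the policy with the highest expected $V$.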

## We know: maximising behaviour is bad

But, most of the time, when we talk about Goodhart's law, we don't just mean "a proxy exists". We mean that not only does a proxy exist, but that maximising the proxy too much is pernicious for the true utility.

Consider for example a nail factory, where $U$ is the number of true nails produced, and $V$ is the number of "straight pieces of metal" produced. Here $W = U - V$ is the difference between the number of true nails and the pieces of metal.

In this case, we expect a powerful agent maximising $V$ to do much, much worse on the $U$ scale. As the agent expands and gets more control over the world's metal production, $V$ continues to climb, while $U$ tumbles. So $V$ is not only a (bad) proxy for $U$; at the extremes, it's pernicious.

But what if we considered $U' = V - W$? This is an odd utility indeed; it's equal to $2V - U$. This is twice the number of pieces of metal produced, minus the number of true nails produced. And as the agent's power increases, so does $V$, but so does $U'$, to an even greater extent. Now, $U'$ won't increase to the same extent as it would under the optimal policy for $U'$, but it still increases massively under $V$-optimisation.

And, implicitly, when we talk about Goodhart's law, we generally mean that the true utility is of type $U$, rather than of type $U'$; indeed, that things like $U'$ don't really make sense as candidates for the true utility. So there is a break in symmetry between $W$ and $-W$.

## Putting this knowledge into figures

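So, suppose $\pi_0$ is some default policy, and $\pi_V$ is the $V$-maximising policy. One way of phrasing the stronger type of Goodhart's law is that:

• $\mathbb{E}(U \mid \pi_V) < \mathbb{E}(U \mid \pi_0)$.

Then, substituting $U = V + W$ and rearranging gives:

• $\mathbb{E}(W \mid \pi_0) - \mathbb{E}(W \mid \pi_V) > \mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0)$.

Because $\pi_V$ is the optimal policy for $V$, the term $\mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0)$ is non-negative (and most likely strictly positive). So the new restriction on $W$ is that:

• $\mathbb{E}(W \mid \pi_0) - \mathbb{E}(W \mid \pi_V) > \mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0) \geq 0$.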

This is an affine restriction, in that if $W_1$ and $W_2$ satisfy that restriction, then so does a mix $qW_1 + (1-q)W_2$ for $0 \leq q \leq 1$. In fact, it defines a hyperplane in $\mathcal{W}$, with everything on one side satisfying that restriction, and everything on the other side not satisfying it.
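The affine claim can be checked in one line from linearity of expectation. As a sketch, write $f(W) = \mathbb{E}(W \mid \pi_0) - \mathbb{E}(W \mid \pi_V)$ and $A = \mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0)$, so that the restriction reads $f(W) > A$. The names $f$, $A$, $W_1$, $W_2$ and $q$ are shorthand introduced here for illustration. Then, for $0 \leq q \leq 1$ and $W_1$, $W_2$ both satisfying the restriction:

$$
f\big(qW_1 + (1-q)W_2\big) = q\,f(W_1) + (1-q)\,f(W_2) > qA + (1-q)A = A.
$$

Since $f$ is linear in $W$, the boundary $\{W : f(W) = A\}$ is a hyperplane, and the restriction picks out the open half-space on one side of it.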

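In fact, the set that satisfies the restriction (call it $\mathcal{W}^+$) is "smaller" (under $p$) than the set that does not (call that $\mathcal{W}^-$). This is because, if $W \in \mathcal{W}^+$, then $-W \in \mathcal{W}^-$ -- but the converse is not true if $\mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0) > 0$.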

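And now, when an agent maximises $U = V + W$, with $V$ known and $W$ distributed by $p$ but also known to obey that restriction, the picture is very different. It will maximise $V + W'$, where $W' = \mathbb{E}_p(W \mid W \in \mathcal{W}^+)$ is the mean of $W$ under $p$ conditional on the restriction, and $W'$ is far from $0$ in general. So the agent won't just maximise the proxy.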
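To make the broken symmetry concrete, here is a rough Monte Carlo sketch. It assumes three policies and, purely for illustration, a symmetric prior in which every candidate $W$ shares a single random scale across policies; that shared structure, the numbers in `expected_v` and `pressure`, and the helper names are all hypothetical and not part of the argument above. Without some such structure linking policies, the restriction would say little about policies other than $\pi_0$ and $\pi_V$.

```python
import random

random.seed(0)

# Policies: index 0 = default pi_0, index 1 = V-maximiser pi_V, index 2 = a moderate policy.
expected_v = [0.0, 10.0, 8.0]      # hypothetical E(V | pi)
pressure = [0.0, 1.0, 0.25]        # hypothetical exposure of each policy to the shared error

def sample_w():
    """Draw one W from a symmetric prior, returned as [E(W | pi) for each policy]."""
    s = random.gauss(0.0, 20.0)    # shared scale, symmetric around zero
    return [-s * c + random.gauss(0.0, 1.0) for c in pressure]

samples = [sample_w() for _ in range(100_000)]

# Goodhart restriction: E(W | pi_0) - E(W | pi_V) > E(V | pi_V) - E(V | pi_0).
gap = expected_v[1] - expected_v[0]
kept = [w for w in samples if w[0] - w[1] > gap]

def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

prior_mean = mean_vector(samples)  # close to zero by symmetry
w_prime = mean_vector(kept)        # conditional mean W'
fraction_kept = len(kept) / len(samples)

def best_policy(shift):
    """Index of the policy maximising E(V | pi) plus the given expected W."""
    scores = [v + d for v, d in zip(expected_v, shift)]
    return max(range(len(scores)), key=lambda k: scores[k])

print(fraction_kept, w_prime, best_policy(prior_mean), best_policy(w_prime))
```

In this toy setup fewer than half of the sampled $W$'s survive the restriction, the unrestricted agent picks $\pi_V$, and the restricted agent switches to the moderate policy because $W'$ heavily penalises $\pi_V$.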

# Conclusion

So, even the seemingly trivial fact that we expect a particular type of Goodhart effect - even that trivial fact dramatically reduces the effect of Goodhart's law.

Now, the effect isn't enough to converge on a good $U$: we'll need to use other information for that. But note one interesting point: the more powerful the agent is, the more effective it is at maximising $V$, so the higher $\mathbb{E}(V \mid \pi_V)$ gets -- and thus the higher the bound $\mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0)$ becomes. So the most powerful agents have the strongest restrictions on what the possible $W$'s are. Note that we might be able to get this effect even for more limited agents, by defining $\pi_V$ not only as the optimal policy, but as some miraculous optimal policy where things work out unexpectedly well for the agent.

It will be interesting to see what happens in situations where we account for more and more of our (implicit and explicit) knowledge about $U$.

# Comments

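Whether this works or not is going to depend heavily on what the space $\mathcal{W}$ of possible $W$ looks like.

Given $W \in \mathcal{W}^+$, i.e. $\mathbb{E}(W \mid \pi_0) - \mathbb{E}(W \mid \pi_V) > \mathbb{E}(V \mid \pi_V) - \mathbb{E}(V \mid \pi_0)$, what does this say about $\mathbb{E}(W \mid \pi)$ for another policy $\pi$?

The answer depends on the amount of mutual information between $\mathbb{E}(W \mid \pi_0)$, $\mathbb{E}(W \mid \pi_V)$ and $\mathbb{E}(W \mid \pi)$. Unfortunately, the more generic $\mathcal{W}$ is (i.e. any function is possible), the less mutual information there will be. Therefore, unless we know some structure about $\mathcal{W}$, the restriction to $\mathcal{W}^+$ (the set of $W$ satisfying the restriction) is not going to do much. The agent will just find a very different policy $\pi'$ that also achieves very high $V$ in some very Goodharty way, but does not get penalized because low value for $W$ on $\pi_V$ is not correlated with low value on $\pi'$.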

This could possibly be fixed by adding assumptions of the type $\mathbb{E}(U \mid \pi) < \mathbb{E}(U \mid \pi_0)$ for any $\pi$ that does too well on $V$. That might yield something interesting, or it might just be a very complicated way of specifying a satisficer, I don't know.

You can't get too much work from a single bit of information ^_^

I find this way of formalizing Goodhart weird. Is there a standard formalization of it, or is this your invention? I'll explain what I think is weird.

You define U and V such that you can calculate U - V to find W, but this appears to me to skip right past the most pernicious bit of Goodhart, which is that U is only knowable via a measurement (not necessarily a measure), such that I would say $V = \mu(U)$ for some "measuring" function $\mu$, and the problem is that $\mu(U)$ is correlated with but different from U since there may not even be a way to compare U.

To make it concrete with an example, suppose U is "beauty as defined by Gordon". We don't, at least as of yet, have a way to find U directly, and maybe we never will. So supposing we don't, if we want to answer questions like "would Gordon find this beautiful?" and "what painting would Gordon most like?" we need a measurement of U we can work with, as developed by, say, using IRL to discover a "beauty function" that describes U such that we could say how beautiful I would think something is. But we would be hard pressed to be precise about how far off the beauty function is from my sense of beauty because we only have a very gross measure of the difference: compare how beautiful the beauty function and I think some finite set of things are (finite because I'm a bounded, embedded agent who is never going to get to see all things, even if the beauty function somehow could), and even as we are doing this we are still getting a measurement of my internal sense of beauty rather than my internal sense of beauty itself because we are asking me to say how beautiful I think something is rather than directly observing my sense of beauty. This is much of why I expect that Goodhart is extremely robust.

I think you want to differentiate between different mechanisms for Goodhart's law. The categorization that Scott Garrabrant put together, and I worked with him on refining, is here: https://arxiv.org/abs/1803.04585

Given that, I see several different things going on.

First, if I read the post correctly, Stuart is discussing regressional Goodhart, in this case the general issue of what Greg Lewis called "The Tails Come Apart". This occurs whether or not the true value function is known. (As a historical note, this is a broader and, as Scott pointed out, a more fundamentally unavoidable claim than either what Goodhart meant, or what Campbell was referring to.)

Second, the divergence in your example, "a measurement of my internal sense of beauty rather than my internal sense of beauty itself", is a second Goodhart effect, which is (at least) a causal one, where repeated queries change the estimates due to psychological biases, etc. In that case, there's also a nasty potential adversarial Goodhart issue, if the AI gets to make the queries and exploits those biases.

Alternatively, if the initial sample of "your internal sense of beauty" is a fixed sample, there is a sampling and inference issue for the preferences for embedded agents - inferring a continuous, potentially unbounded function from a finite sample. That's an important and fundamental issue, but it's only partially about, in this case, extremal Goodhart. It's also a more general issue about inferring preferences, i.e. learning is hard and this is learning.

Even with your stated sense of beauty, knowing "this measure can be manipulated in extreme circumstances" is much better than nothing.

Let me see if I have this...

1. Agents blindly maximize the proxies they pick if the expected value of maximizing the proxy is higher than doing anything else.

2. Goodhart's Law tells us that, in general, blindly maximizing the proxy has lower expected value than other methods that involve not doing that.

3. Because of this, we expect the difference between what we want and what we get to be bigger if we're optimizing the proxy instead of following some non-optimizing default strategy. Thus, there's a lower bound on how bad optimizing the proxy is relative to not doing that.

4. Agents who notice this will therefore not maximize the proxy but rather a modified function that considers the distribution of ways that the proxy deviates from the target function

I, of course, agree with all these claims. A good example of this happening in real life is "humans deciding not to maximize things because they're worried about getting goodhart'd." However, there are a couple of questions here:

Q1. if I'm programming an AI that only knows the proxy, how would it go about even beginning to estimate a probability distribution for the difference between the proxy and the true utility function, p(W)? Wouldn't you need to have some deeper, more complex inside knowledge about the true utility function that is, somehow, not part of the proxy? Maybe the agent could create a proxy of its proxy and directly model how much optimizing the latter deviates from the former but that would require the agent to make a number of likely unjustified assumptions about the nature of the true utility function.

Q2. if I'm programming an AI and the proxy is the best measurable way of describing what I care about, how do I even get the AI to understand that the proxy isn't the thing I want it to optimize? I guess you could try to just directly code it to optimize V+W' but that means that you, as a coder, need to compute W'. If the AI is self-modifying, you also have to more generally compute W' as a function of how capable the AI is at optimizing. This seems rather risky. Maybe you could try punting the computation off to the AI itself but this just makes Q1 a bigger problem.

-

Goodhart's Law tells us that, in general, blindly maximizing the proxy has lower expected value than other methods that involve not doing that.

This is only true for the kind of things humans typically care about; this is not true for utility functions in general. That's the extra info we have.

Isnasene:

[Retracted my other reply due to math errors]

This is only true for the kind of things humans typically care about; this is not true for utility functions in general. That's the extra info we have.

While I generally agree that there can be utility functions that aren't subject to Goodhart, I don't think that this strictly pertains to humans. I expect that when the vast majority of agents (human or not) use scientific methods to develop a proxy for the thing they want to optimize, they will find that proxy to break down upon intense optimization:

-proxies are learned in a certain environment where it works to predict the utility function

-aggressively optimizing anything enough will usually change the environment dramatically

-so aggressively optimizing a given proxy will eventually violate the assumptions under which the proxy was created

-if the assumptions that justify the proxy's design don't hold, optimizing it further will be akin to acting randomly. This can be achieved by the "doing nothing" policy without the added spending of resources

-when the world is in a state where agentic actions have increased the value of a utility function, behaving randomly seems more likely to reduce the utility function than increasing it in the same way that randomness tends to push worlds towards states of higher entropy rather than lower ones.

The last point is kind of hand-wavy since we can have a utility function like "maximize entropy" which can provide many proxies which don't get Goodhart'd (in the sense of optimization making things worse rather than just not making them better). Still, "Goodhart's Law applies to agents with utility functions of relatively low entropy" is much more generic than "Goodhart's Law applies to humans." I'm also not sure how helpful that is. Even if we know that we should stop optimizing at some point, what metric do you actually use in making the decision to stop?

The explanation is a bit simpler than this. The agent has one goal, and we have other goals. It gains power to best complete its goal by taking power away from us. Therefore, any universe where we have an effective maximizer of something misspecified is a universe where we're no longer able to get what we want. That's why instrumental convergence is so bad.

This adversarial issue can be true, and is critical, but I don't think it's what Stuart was pointing to in his post or his reply.

-------------------------------------Part 1: I Respond to Your Actual Comment----------------------------------------

The explanation is a bit simpler than this. The agent has one goal, and we have other goals. It gains power to best complete its goal by taking power away from us

I don't think this explanation is in conflict with mine. Much of my explanation (ie, the "optimizing a proxy too aggressively will invalidate the assumptions that the proxy was built on") is focused on explaining why we expect proxies to become mis-specified. In the context of AGI, this isn't that important because we have such low confidence in our ability to specify our values. However this model is more general and can help explain why we expect to make many mistakes when trying to specify our values:

Because we haven't had our values tested in the kinds of universes that aggressive optimization might produce, our proxies will fail to account for as-of-now unmeasured factors in the things we care about.

You also mention power. I think this is a subset of the vague entropy thing I was being hand-wavy about because:

1. A relatively low entropy universe is a necessary but not sufficient condition for humans having power. Thus, humans having power (and humans existing) implies that the universe has relatively low entropy.

2. This implies that acting randomly will tend to lessen human power rather than increase it (since randomness will tend to increase entropy)

I think this entropyish thing is also why Stuart makes his point that Goodhart applies to humans and not in general: It's only because of the unique state humans are in (existing in a low entropy universe, having an unusually large amount of power) that Goodhart tends to affect us.

Actually, I think I have a more precise description of the entropyish thing now. Goodhart's Law isn't driven by entropy; Goodhart's Law is driven by trying to optimize a utility function that already has an unusually high value relative to what you'd expect from your universe. Entropy just happens to be a reasonable proxy for it sometimes.

---Part 2: Goodhart's Law in a Simple, Semi-Realistic, Non-Adversarial Linear Optimization Problem-----

So, writing the response above gave me a bunch of ideas. To lead with, it's worth noting that problems like this can happen in non-adversarial contexts too.

Example:

Say you're in a world where your utility function is $U = x + 2y$ but you've only ever existed in environments where $y = 10$ and $x$ varies between 1 and 4 (currently, $x = 2$). As a result, you decide to optimize the proxy $V = x$ because you have no idea that $y$ matters.

At first, this optimization works great. You were initially at $U = 22$. But you've progressively increased $x$ more than it's ever been increased before, all the way up to $x = 10$. Your utility function is now $U = 30$; your proxy is $V = 10$; and you're much happier now than you've ever been before.

However, unbeknownst to you, both $x$ and $y$ come from the same resource pool and so are constrained by the relation $x + y \leq 20$. You continue optimizing your proxy all the way up to $x = 20$, which (inadvertently) causes $y$ to drop to $0$. Your proxy now outputs $V = 20$ but your utility function outputs $U = 20$, which is lower than its initial value of 22. Despite initially improving your utility by 8 (at x=10, y=10), the optimization process ultimately reduces it by two once resource pool limitations begin to apply. Also note that, if the resource pool was not underexploited to begin with ($x + y \leq 12$), the optimization process would have immediately begun to reduce utility by trading off the more valuable y for the less valuable x.
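The arithmetic in this example can be replayed in a few lines. The sketch below assumes the utility $U = x + 2y$, the proxy $V = x$, and the pool constraint $x + y \leq 20$, which are the values consistent with the figures quoted above (initial utility 22, a gain of 8 at x=10, y=10, and a final loss of two). The function names and the rule that $y$ stays at 10 until the pool binds are illustrative assumptions.

```python
POOL = 20      # assumed shared resource pool: x + y <= POOL
Y_START = 10   # y has always sat at this level so far

def utility(x, y):
    """True utility: y is worth twice as much as x."""
    return x + 2 * y

def proxy(x):
    """Proxy that ignores y entirely."""
    return x

def y_given_x(x):
    """y is untouched until the pool binds, then it is crowded out by x."""
    return min(Y_START, POOL - x)

# (x, y, proxy value, true utility) at the start, mid-way, and at the proxy's maximum.
trajectory = [(x, y_given_x(x), proxy(x), utility(x, y_given_x(x))) for x in (2, 10, 20)]
for row in trajectory:
    print(row)
# (2, 10, 2, 22)
# (10, 10, 10, 30)
# (20, 0, 20, 20)
```

The proxy rises monotonically from 2 to 20, while the true utility first climbs from 22 to 30 and then falls to 20 once the pool constraint starts to bind.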

In short, Goodhart's Law isn't just adversarial. It can also happen in circumstances when:

1. Two desirable things compete for the same resource pool

2. One of the things is more desirable than the other

3. The amount of the more desirable thing has never changed; so no one has noticed that they have become more unhappy when it decreases or more happy when it increases

In this scenario, any proxy will necessarily fail at some point because the good thing gets traded away for the bad thing. The specific point at which optimization failure begins occurring depends both on how much more desirable the good thing is and how limited the pool of resources is.

This is why I don't think that we can use knowledge of Goodhart's Law alone to prevent Goodhart's Law. In the above example, even knowing the functional form of the utility function (a sum of linear terms), knowing exactly the number of factors missed by the proxy (one factor: y), and knowing the functional form of the constraint (a linear less-than-or-equal-to) won't tell us when to stop optimizing because:

• If the resource pool has been completely exploited, we shouldn't even start optimizing, since it will trade off a thing we don't know is good for a less good thing
• If the resource pool is unlimited, we should never stop optimizing because we can always get more of a good thing
• If the thing missed by the proxy is less desirable than the thing the proxy is considering, we should also never stop optimizing because it will trade off a thing we don't know is less good for the good thing

And this is just a toy example; real utility functions without knowledge of the target function's functional form or knowledge about the constraints are even more difficult.

Of course, if you notice that the optimization process is making you unhappy, you can stop it early and avert the negative effects yourself. Unfortunately, noticing that the process is making you unhappy requires that you have access to your utility function (or at least a better proxy for it than the one you're trying to optimize). By nature, this access cannot be given to an AI and this is a big problem.

Let me clarify the distinction I'm trying to point at:

First, Goodhart's law applies to us when we're optimizing a goal for ourselves, but we don't know the exact goal. For example, if I'm trying to make myself happy, I might find a proxy of dancing, even though dancing isn't literally the global optimum. This uses up time I could have used on the actual best solution. This can be bad, but it doesn't seem that bad. I'm pretty corrigible to myself.

Second, Goodhart's law applies to other agents who are instructed to maximize some proxy of what we want. This is bad. If it's maximizing the proxy, then it's ensuring it's most able to maximize the proxy, which means it's incentivized to stop us from doing things (unless the proxy specifically includes that - which safeguard is also vulnerable to misspecification; or is somehow otherwise more intelligently designed than the standard reward-maximization model). The agent is pursuing the proxy from its own perspective, not from ours.

I think this entropyish thing is also why Stuart makes his point that Goodhart applies to humans and not in general: It's only because of the unique state humans are in (existing in a low entropy universe, having an unusually large amount of power) that Goodhart tends to affect us.

Actually, I think I have a more precise description of the entropyish thing now. Goodhart's Law isn't driven by entropy; Goodhart's Law is driven by trying to optimize a utility function that already has an unusually high value relative to what you'd expect from your universe. Entropy just happens to be a reasonable proxy for it sometimes.

I don't think the initial value has much to do with what you label the "AIS version" of Goodhart (neither does the complexity of human values in particular). Imagine we had a reward function that gave one point of reward for each cone detecting red; reward is dispensed once per second. Imagine that the universe is presently low-value; for whatever reason, red stimulation is hard to find. Goodhart's law still applies to agents we build to ensure we can see red forever, but it doesn't apply to us directly - we presumably deduce our true reward function, and no longer rely on proxies to maximize it.

The reason it applies to agents we build is that not only do you have to encode the reward function, but we have to point to people! This does not have a short description length. With respect to hard maximizers, a single misstep means the agent is now showing itself red, or something.

How proxies interact is worth considering, but (IMO) it's far from the main reason for Goodhart's law being really, really bad in the context of AI safety.

Oh I see where you're coming from now. I'll admit that, when I made my earlier post, I forgot about the full implications of instrumental convergence. Specifically, the part where:

Maximizing X minimizes all Not X insofar as they both compete for the same resource pool.

Even if your resources are unusually low relative to where you're positioned in the universe, an AI will still take that away from you. Optimizing one utility function doesn't just randomly affect the optimization of other utility functions; they are anti-correlated in general.

I really gotta re-read Goodhart's Taxonomy for a fourth time...

Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing - they just won’t. (This will be in the next part of my sequence on impact).

Well, they’re anti-correlated across different agents. But from the same agent’s perspective, they may still be able to maximize their own red-seeing, or even human red-seeing - they just won’t

Just making sure I can parse this... When I say that they're anti-correlated, I mean that the policy of maximizing X is akin to the policy of minimizing not X to the extent that X and not X will at some point compete for the same instrumental resources. I will agree with the statement that an agent maximizing X who possesses many instrumental resources can use them to accomplish not X (and, in this sense, the agent doesn't perceive X and not X as anti-correlated); and I'll also agree that an agent optimizing X and another optimizing not X will be competitive for instrumental resources and view those things as anti-correlated.

they may still be able to maximize their own red-seeing, or even human red-seeing - they just won’t

I think some of this is a matter of semantics but I think I agree with this. There are also two different definitions of the word able here:

• Able #1 : The extent to which it is possible for an agent to achieve X across all possible universes we think we might reside in
• Able #2 : The extent to which it is possible for an agent to achieve X in a counterfactual where the agent has a goal of achieving X

I think you're using Able #2 (which makes sense--it's how the word is used colloquially). I tend to use Able #1 (because I read a lot about determinism when I was younger). I might be wrong about this though because you made a similar distinction between physical capability and anticipated possibility like this in Gears of Impact:

People have a natural sense of what they "could" do. If you're sad, it still feels like you "could" do a ton of work anyways. It doesn't feel physically impossible."
...
Imagine suddenly becoming not-sad. Now you "could" work when you're sad, and you "could" work when you're not-sad, so if AU just compared the things you "could" do, you wouldn't feel impact here.
...
But you did feel impact, didn't you?

I think you're using Able #2 (which makes sense--it's how the word is used colloquially). I tend to use Able #1 (because I read a lot about determinism when I was younger). I might be wrong about this though because you made a similar distinction between physical capability and anticipated possibility like this in Gears of Impact:

I am using #2, but I'm aware that there's a separate #1 meaning (and thank you for distinguishing between them so clearly, here!).

I just wanted to add that, technically speaking, there are two levels of Goodhart's Law worth discussing here:

1. Goodhart's Law as traditionally defined: "When a measure becomes a target, it ceases to be a good measure." AKA, proxies of utility functions, when optimized too aggressively, stop being proxies for those utility functions.

2. Goodhart's Law as we deal with it in AI Safety: When a measure becomes a target, it actively causes you to miss the target. AKA, proxies of utility functions, when optimized too aggressively, will reduce the value outputs of those utility functions from where they were originally.

The traditional Goodhart's Law strikes me as pretty general over a broad range of agents trying to optimize things.

The AI Safety version strikes me as pretty common too. But it's contingent on the agent's relationship with the universe in a way that the traditional version is not (ie, that agent already having an unusually high utility function relative to what you'd expect from the universe they're in)

Yep, those are the two levels I mentioned :-)

This is only true for the kind of things humans typically care about

There are utility functions for which Goodhart doesn't apply but I think it's more generally agent-centric than just human-centric. I think that the vast majority of proxies developed by agents for the sake of optimizing a harder-to-measure function will experience Goodhart (and, to me, the term "utility function" and "proxy" imply that this is what's happening).

Moreover, I think that Goodhart only doesn't apply in the case where making the proxy function arbitrarily large doesn't also change the behavior of the observed universe an arbitrary amount. You can define a utility function for which this is true but the ones that I've thought of so far are associated with weird discontinuities.

The mathematical spitballing I did in making this claim:

If we have a utility function U and a proxy utility V that represents U, we expect the plot of coordinates (x=V(World State), y=U(World State)) to be roughly sublinear since

• (x=V(World State), y=U(World State)) is upper-bounded by (x=U(World State), y=U(World State)) or y=x
• we can re-scale V however we want so increasing a unit of V corresponds to increasing a unit of U at some arbitrary location on the (rough) curve

This indicates that, if the noisiness of the relationship between V and U increases as a linear or superlinear function of V, it could wash-out any positive effects of increasing V. Since the noise also cannot actually improve the performance of V above the upperbound U, the symmetry of this noise as something that may improve performance is broken and leads to an overall downtrend in utility as V is increased more and more.

When would one expect a linear/superlinear increase in the V vs U to actually happen? You might expect this if

1. The proxy V was built to model U based on measurements of their relationship that all occurred in a kind of environment with V in a certain domain.

2. Increasing V outside that domain changes the environment in a way that makes it so different from what it once was that the earlier measurements used to make V don't apply.

Note that Goodhart doesn't say when optimizing V starts to decrease U, just that it will at some point. In my opinion, the claim that 1 and 2 will never happen as V increases is stronger than the claim that they will.

[This comment is no longer endorsed by its author]

As far as I can tell we're not actually dividing the space of W's by a plane, we're dividing the space of E(W|π)'s by a plane. We don't know for certain that U-V is negative, we merely think so in expectation. This leads to the Bayesian correction for the Optimizer's curse, which lets us do better when presented with lots of options with different uncertainties, but if the uncertainty is fixed it won't let us pick a strategy that does better than the one that maximizes the proxy.

As far as I can tell we're not actually dividing the space of W's by a plane, we're dividing the space of E(W|π)'s by a plane.

Because expectation is affine with respect to utility functions, this does divide the space by a plane.

Yes, there is a connection with the optimizer's curse style of reasoning.

Errata:

\$US's

Thanks! Error corrected.