Goodhart's Curse and Limitations on AI Alignment

Gordon Seidoh Worley

I believe that most existing proposals for aligning AI with human values are unlikely to succeed in the limit of optimization pressure due to Goodhart's curse. I believe this strongly enough that it continues to surprise me a bit that people keep working on things that I think clearly won't work, though I think there are two explanations for this. One is that, unlike me, they expect to approach superhuman AGI slowly and so we will have many opportunities to notice when we are deviating from human values as a result of Goodhart's curse and make corrections. The other is that they are simply unaware of the force of the argument that convinces me because, although it has been written about before, I have not seen recent, pointed arguments for it rather than technical explanations of it and its effects, and my grokking of this point happened long ago on mailing lists of yore via more intuitive and less formal arguments than I see now. I can't promise to make my points as intuitive as I would like, but nonetheless I will try to address this latter explanation by saying a few words about why I am convinced.

Note: Some of this borrows heavily from a paper I have out for publication, but with substantial additions for readability by a wider audience.

Goodhart's Curse

Goodhart's curse is what happens when Goodhart's law meets the optimizer's curse. Let's review those two here briefly for completeness. Feel free to skip some of this if you are already familiar.

Goodhart's Law

As originally formulated, Goodhart's law says "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes". A more accessible expression of Goodhart's law, though, would be that when a measure of success becomes the target, it ceases to be a good measure. A well-known example of Goodhart's law comes from a program to exterminate rats in French-colonial Hanoi, Vietnam: the program paid a bounty for rat tails on the assumption that a rat tail represented a dead rat, but rat catchers would instead catch rats, cut off their tails, and release the rats so they could breed and produce new rats whose tails could be turned in for more bounties. A similar program in British-colonial India paid bounties for dead cobras to reduce the cobra population and instead resulted in the creation of cobra farms. And of course we can't forget this classic, though apocryphal, tale:

In the old Soviet Union, the government rewarded factory managers for production quantity, not quality. In addition to ignoring quality, factory managers ignored customer needs. The end result was more and more tractors produced, for instance, even though these tractors just sat unused. Managers of factories that produced nails optimized production by producing either fewer larger and heavier nails or more smaller nails.

The fact that factories were judged by rough physical quotas rather than by their ability to satisfy customers – their customers were the state – had predictably bad results. If, to take a real case, a nail factory’s output was measured by number, factories produced large numbers of small pin-like nails. If output was measured by weight, the nail factories shifted to fewer, very heavy nails.

Taken to its natural, if joking, endpoint, this might mean the production of a single, giant nail. To the best of my knowledge it's unknown whether the nail example above is real, although reportedly something similar really did happen with shoes. Additional examples of Goodhart's law abound:

targeting easily-measured clicks rather than conversions in online advertising
optimizing for profits over company health in business
unintentionally incentivizing publication count over academic progress in academia
prioritizing grades and test scores over learning in schools
maximizing score rather than having fun in video games

As these examples demonstrate, most of us are familiar with Goodhart's law or something similar in everyday life, so it's not that surprising when we learn about it. The opposite seems to be true of the optimizer's curse, which is well studied but mostly invisible to us in daily life unless we take care to notice it.

The optimizer's curse

The optimizer's curse observes that when choosing among several possibilities, if we choose the option that is expected to maximize value, we will on average be "disappointed" (realize less than the expected value). This happens because optimization acts as a source of bias in favor of overestimation, even if the estimated value of each option is not itself biased. And the curse is robust: even if an agent satisfices (accepts the option with the least expected value that is greater than neutral) rather than optimizes, they will still suffer more disappointment than gratification. So each option can be estimated in an unbiased way, yet because selection favors options whose estimates came out high, we end up consistently picking options whose values are more likely to have been overestimated.
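To make the mechanism concrete, here is a minimal simulation sketch using only the Python standard library. The function name, the normal distributions, and the default parameters are my own illustrative assumptions, not something taken from the literature on the curse. Every option gets an unbiased noisy estimate, and we record how far the estimate of the chosen option overshoots that option's true value.

```python
import random
import statistics


def mean_disappointment(n_options=10, true_sd=1.0, noise_sd=1.0,
                        trials=20_000, seed=0):
    """Average of (estimated value - true value) for the option picked by argmax.

    Assumption for illustration: true values and estimation noise are both
    normally distributed, and every estimate is unbiased for its own option.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_values = [rng.gauss(0.0, true_sd) for _ in range(n_options)]
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # Choose the option whose estimate is highest.
        chosen = max(range(n_options), key=lambda i: estimates[i])
        gaps.append(estimates[chosen] - true_values[chosen])
    return statistics.mean(gaps)


if __name__ == "__main__":
    print("one option, no selection:", round(mean_disappointment(n_options=1), 3))
    print("best of ten options:     ", round(mean_disappointment(n_options=10), 3))
```

With a single option there is nothing to select on and the average gap should sit near zero; with ten options the gap should come out clearly positive, even though no individual estimate was biased.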

The optimizer's curse has many opportunities to bite us. For example, a company trying to pick a project to invest in to earn the highest rate of return will consistently earn less return than predicted due to the optimizer's curse. Same goes for an investor picking investment instruments. Similarly a person trying to pick the best vacation will, on average, have a worse vacation than expected because the vacation that looks the best is more likely than the other options to be worse than predicted. And of course an AI trying to pick the policy that maximizes human value will usually pick a policy that performs worse than expected, but we'll return to that one later when we consider how it interacts with Goodhart.

I wish I had more, better examples of the optimizer's curse to offer you, especially documented real-world cases that are relatable, but most of what I can find seems to be about petroleum production (no, really!) or otherwise about managing and choosing among capital-intensive projects. The best I can offer you is this story from my own life about shoes:

For a long time, I wanted to find the best shoes. "Best" could mean many things, but basically I wanted the best shoe for all purposes. I wanted the shoe to be technically impressive, so it would have features like waterproofing, puncture-proofing, extreme durability, extreme thinness and lightness, breathability, and the ability to be worn without socks. I also wanted it to look classy and casual, able to mix and match with anything. You might say this is impossible, but I would have said you just aren't trying hard enough.

So I tried a lot of shoes. And in every case I was disappointed. One was durable but ugly, another was waterproof but made my feet smell, another looked good but was uncomfortable, and another was just too weird. The harder I tried to find the perfect shoe, the more I was disappointed. Cursed was I for optimizing!

This story isn't perfect: I was optimizing for multiple variables and making tradeoffs, and the solution was to find some set of tradeoffs I would be happiest with and to accept that I was mostly only going to move along the efficiency frontier rather than expand it by trying new shoes, so it teaches the wrong lesson unless we look at it through a very narrow lens. Better examples in the comments are deeply appreciated!

Before moving on, it's worth talking about attempts to mitigate the optimizer's curse. Since it is a systematic bias, it would seem that we could account for the optimizer's curse the same way we do most systematic biases: with better Bayesian reasoning. And we can, but in many cases this is difficult or impossible because we lack sufficient information about the underlying distributions to make the necessary corrections. Instead we find ourselves knowing that our expectations are biased but unable to correct for the bias well enough to be sure we aren't still suffering from it. Put another way, attempting to correct for the optimizer's curse without perfect information simply shifts the distortions from the original estimates to the corrections without eliminating the bias.
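Here is a rough sketch of what such a correction looks like in the one case where it is easy: true values drawn from a normal prior we know exactly, measured with normal noise of known size. The shrinkage formula is the standard normal-normal posterior mean; the function names and numbers are hypothetical choices of mine. The second parameter setting deliberately assumes the wrong prior spread, to illustrate how the bias moves into the correction rather than vanishing.

```python
import random
import statistics


def shrink(estimate, prior_mean, prior_sd, noise_sd):
    """Posterior mean for a normal prior and normal measurement noise."""
    weight = prior_sd**2 / (prior_sd**2 + noise_sd**2)
    return prior_mean + weight * (estimate - prior_mean)


def mean_gap_after_correction(assumed_prior_sd, true_prior_sd=1.0, noise_sd=1.0,
                              n_options=10, trials=20_000, seed=0):
    """Average (corrected estimate - true value) of the option chosen by argmax.

    assumed_prior_sd is what the decision maker believes about the spread of
    true values; true_prior_sd is what actually generates them (illustrative).
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        true_values = [rng.gauss(0.0, true_prior_sd) for _ in range(n_options)]
        raw = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        corrected = [shrink(x, 0.0, assumed_prior_sd, noise_sd) for x in raw]
        chosen = max(range(n_options), key=lambda i: corrected[i])
        gaps.append(corrected[chosen] - true_values[chosen])
    return statistics.mean(gaps)


if __name__ == "__main__":
    print("prior known exactly:", round(mean_gap_after_correction(1.0), 3))
    print("prior spread wrong: ", round(mean_gap_after_correction(3.0), 3))
```

In this toy setting the correction works when the assumed prior matches reality and leaves a residual overshoot when it does not, which is the situation we are usually in.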

Given how persistent the optimizer's curse is, it shouldn't surprise us it will pop up when we try to optimize for some measurable target, giving us Goodhart's curse.

Goodhart's curse

Combining Goodhart's law with the optimizer's curse, we get Goodhart's curse: attempts to optimize for a measure of success result in increased likelihood of failure to hit the desired target. Or as someone on Arbital (probably Eliezer) put it: "neutrally optimizing a proxy measure U of V seeks out upward divergence of U from V". In personal terms, you might say that the harder you try to get what you want, the more you'll find yourself doing things that cause you not to get what you want despite trying to act otherwise. I think this point is unintuitive because it feels contrary to the normal narrative that success comes from trying, and trying harder makes it more likely you will succeed, but that might only appear to be true due to survivorship bias. To give you an intuitive feel for this personal expression of Goodhart's curse, another story from my life:

At some tender age, maybe around 11 or 12, I became obsessed with efficiency so I would have time to do more in my life.

There was the little stuff, like figuring out the "best" way to brush my teeth or get out of bed. There was the medium stuff, like finding ways to read faster or write without moving my hand as much. And there was the big stuff, like trying to figure out how to get by on less sleep and how to study topics in the optimal order. It touched everything, from shoe tying, to clothes putting on, to walking, to playing, to eating, and on and on. It was personal Taylorism gone mad.

To take a single example, let's consider the important activity of eating breakfast cereal and how that process can be made more efficient. There's the question of how to store the cereal, how to store the milk, how to retrieve the cereal and milk, how to pour the two into the bowl, how to hold the spoon, how to put the cereal in the mouth, how to chew, how to swallow, and how to clean up, to name just a few. Maybe I could save a few seconds if I held the spoon differently, or stored the cereal in a different container, or stored the milk on a different shelf in the refrigerator, or, or, or. By application of experimentation and observation I could get really good at eating cereal, saving maybe a minute or more off my daily routine!

Of course, this was out of a morning routine that lasted over an hour and included a lot of slack and waiting because I had three sisters and two parents and lived in a house with two bathrooms. But still, one whole minute saved!

By the time I was 13 or 14 I was over it. I had spent a couple years working hard at efficiency, gotten little for it, and lost a lot in exchange. Doing all that efficiency work was hard, made things that were once fun feel like work, and, worst of all, wasn't delivering on the original purpose of doing more with my life. I had optimized for the measure (time to complete task, number of motions to complete task, etc.) at the expense of the target (getting more done). Yes, I was efficient at some things, but that efficiency was costing so much effort and willpower that I was worse off than if I had just ignored the kind of efficiency I was targeting.

In this story, as I did things that I thought would help me reach my target, I actually moved myself further away from it. Eventually it got bad enough that I noticed the divergence and was compelled to course correct, but this depended on me having ever known what the original target was. If I were not the optimizer, and instead say some impersonal apparatus like the state or an AI were, there's considerable risk the optimizer would have kept optimizing and diverging long after it became clear to me that divergence had happened. For an intuitive sense of how this has happened historically, I recommend Seeing Like a State.

I hope by this point you are convinced of the power and prevalence of Goodhart's curse (but if not please let me know your thoughts in the comments, especially if you have ideas about what could be said that would be convincing). Now we are poised to consider Goodhart's curse and its relationship to AI alignment.

Goodhart's curse and AI alignment

Let's suppose we want to build an AI that is aligned with human values. A high level overview of a scheme for doing this is that we build an AI, check to see if it is aligned with human values so far, and then update it so that it is more aligned if it is not fully aligned already.

Although the details vary, this describes roughly the way methods like IRL and CIRL work, and possibly how HCH and safety via debate work in practice. Consequently, I think all of them will fail due to Goodhart's curse.

Caveat: I think HCH and debate-like methods may be able to work and avoid Goodhart's curse, though I'm not certain, and it would require careful design that I'm not sure the current work on these has done. I hope to have more to say on this in the future.

The way Goodhart's curse sneaks into these is that they all seek to apply optimization pressure to something observable that is not exactly the same thing as what we want. In the case of IRL and CIRL, it's an AI optimizing over inferred values rather than the values themselves. In HCH and safety via debate, it's a human preferentially selecting AI that the human observes and then comes to believe does what it wants. So long as that observation step is there and we optimize based on observation, Goodhart's curse applies and we can expect, with sufficient optimization pressure, that alignment will be lost, even and possibly especially without us noticing because we're focused on the observable measure rather than the target.

Yikes!

Beyond Goodhart's curse

Do we have any hope of creating aligned AI if just making a (non-indifferent) choice based on an observation dooms us to Goodhart's curse?

Honestly, I don't know. I'm pretty pessimistic that we can solve alignment, yet in spite of this I keep working on it because I also believe it's the best chance we have. I suspect we may only be able to rule out solutions that are dangerous but not positively select for solutions that are safe, and may have to approach solving alignment by eliminating everything that won't work and then doing something in the tiny space of options we have left that we can't say for sure will end in catastrophe.

Maybe we can get around Goodhart's curse by applying so little optimization pressure that it doesn't happen? One proposal in this direction is quantilization. I remain doubtful, since without sufficient optimization it's not clear how we do better than picking at random.
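For readers who haven't seen it, the idea behind quantilization can be sketched in a few lines. This is a finite-sample approximation under my own assumptions: the candidates are taken to be draws from some trusted base distribution, and the names quantilize, proxy_utility, and q are illustrative rather than taken from the paper. Instead of taking the single best-looking candidate, we pick at random from the top q fraction.

```python
import math
import random


def quantilize(candidates, proxy_utility, q, rng):
    """Pick uniformly at random from the top q fraction of candidates.

    candidates are assumed to be samples from some trusted base distribution.
    q = 1.0 ignores the proxy entirely (a random pick); q near 0 approaches
    plain argmax on the proxy.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    ranked = sorted(candidates, key=proxy_utility, reverse=True)
    top_k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:top_k])


def label_as_score(action):
    """Hypothetical proxy: a larger action label looks better."""
    return action


if __name__ == "__main__":
    rng = random.Random(0)
    actions = list(range(100))  # hypothetical action labels
    print(quantilize(actions, label_as_score, q=0.1, rng=rng))   # one of 90..99
    print(quantilize(actions, label_as_score, q=0.01, rng=rng))  # always 99
```

The parameter q is the dial between the two extremes discussed here: at q = 1 we are picking at random, and as q shrinks we recover ordinary maximization along with whatever overshoot comes with it.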

Maybe we can get around Goodhart's curse by optimizing the target directly rather than a measure of it? Again, I remain doubtful, mostly due to epistemological issues suggesting all we ever have are observations and never the real thing itself.

Maybe we can overcome either or both issues via pragmatic means that negate enough of the problem that, although we don't actually eliminate Goodhart's curse completely, we eliminate enough of its effect that we can ignore it? Given the risks and the downsides I'm not excited about this approach, but it may be the best we have.

And, if all that wasn't bad enough, Goodhart's curse isn't even the only thing we have to watch out for! Scott Garrabrant and David Manheim have renamed Goodhart's curse to "Regressional Goodhart" to distinguish it from other forms of Goodharting where mechanisms other than optimization may be responsible for divergence from the target. The only reason I focus on Goodhart's curse is that it's the way proposed alignment schemes usually fail; other safety proposals may fail via other Goodharting effects.

All this makes it seem extremely likely to me that we aren't even close to solving AI alignment yet, to the point that we likely haven't even stumbled upon the general mechanism that will work, or if we have we haven't identified it as such. Thus, if there's anything upward looking I can end this on, it's that there's vast opportunity to do good for the world via work on AI safety.

I'll start by noting that I am in the strange (for me) position of arguing that someone is too concerned about over-optimization failures, rather than trying to convince someone who is dismissive. But that said, I do think that the concern here, while real, is mitigable in a variety of ways.

First, there is the possibility of reducing optimization pressure. One key contribution here is Jessica Taylor's Quantilizers paper, which you mention and which shows a way to build systems that optimize but are not nearly as subject to Goodhart's curse. I think you are too dismissive of it. Similarly, you are dismissive of optimizing the target directly. I think that the epistemological issues you point to are possible to mitigate to the extent that they won't cause misalignment between reality and an AI's representation of that reality. Once that is done, the remaining issue is aligning "true" goals with the measured goals, which is still hard, but certainly not fundamentally impossible in the same way.

Second, you note that you don't think we will solve alignment. I agree, because I think that "alignment" presupposes a single coherent ideal. If human preferences are diverse, as it seems they are, we may find that alignment is impossible. This, however, allows a very different approach. This would optimize only when it finds Pareto-improvements across a set of sub-alignment metrics or goals, to constrain the possibility of runaway optimization. Even if alignment is possible, it seems likely that we can specify a set of diverse goals / metrics that are all aligned with some human goals, so that the system will be limited in its ability to be misaligned.
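As a rough sketch of what "optimize only on Pareto-improvements" could mean mechanically, assume each candidate state is scored on several metrics where higher is better. The function names and the list-of-scores representation are hypothetical, chosen only for illustration.

```python
def is_pareto_improvement(current_scores, proposed_scores):
    """True if no metric gets worse and at least one gets strictly better."""
    if len(current_scores) != len(proposed_scores):
        raise ValueError("score vectors must have the same length")
    pairs = list(zip(current_scores, proposed_scores))
    no_worse = all(new >= old for old, new in pairs)
    some_better = any(new > old for old, new in pairs)
    return no_worse and some_better


def constrained_step(current_scores, proposals):
    """Return the first proposal that is a Pareto improvement, else stay put."""
    for proposed in proposals:
        if is_pareto_improvement(current_scores, proposed):
            return proposed
    return current_scores


if __name__ == "__main__":
    current = [1, 2, 3]
    # The first proposal gains a lot on two metrics but sacrifices the third,
    # so it is rejected; the second is accepted.
    print(constrained_step(current, [[9, 9, 0], [2, 2, 3]]))
```

The constraint blocks any move that buys a large gain on one metric by sacrificing another, which is the sense in which it limits runaway optimization of any single measure. Each metric is still only a measure, though, so this sketch does not by itself show the approach is safe.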

Lastly, there is optimization for a safe and very limited goal. If the goal is limited and specific, and we find a way to minimize side-effects, this seems like it could be fairly safe. For example, Oracle AIs are an attempt to severely limit the goal. More broadly, however, we might be able to build constraints that work, so that it can reliably perform limited tasks (“put a strawberry on a plate without producing any catastrophic side-effects”).

This feels like painting with too broad a brush, and from my state of knowledge, the assumed frame eliminates at least one viable solution. For example, can one build an AI without harmful instrumental incentives (without requiring any fragile specification of "harmful")? If you think not, how do you know that? Do we even presently have a gears-level understanding of why instrumental incentives occur?

In HCH and safety via debate, it's a human preferentially selecting AI that the human observes and then comes to believe does what it wants.

To say e.g. HCH is so likely to fail we should feel pessimistic about it, it doesn't seem to be enough to say "Goodhart's curse applies". Goodhart's curse applies when I'm buying apples at the grocery store. Why should we expect this bias of HCH to be enough to cause catastrophes, like it would for a superintelligent EU maximizer operating on an unbiased (but noisy) estimate of what we want? Some designs leave more room for correction and cushion, and it seems prudent to consider to what extent that is true for a proposed design.

I remain doubtful, since without sufficient optimization it's not clear how we do better than picking at random.

This isn't obvious to me. Mild optimization seems like a natural thing people are able to imagine doing. If I think about "kinda helping you write a post but not going all-out", the result is not at all random actions. Can you expand?

This feels like painting with too broad a brush, and from my state of knowledge, the assumed frame eliminates at least one viable solution. For example, can one build an AI without harmful instrumental incentives (without requiring any fragile specification of "harmful")? If you think not, how do you know that? Do we even presently have a gears-level understanding of why instrumental incentives occur?

Coincidentally, just yesterday I was part of some conversations that now make me more bullish on this approach. I haven't thought about it much in quite a while, and now I'm returning to it.

To say e.g. HCH is so likely to fail we should feel pessimistic about it, it doesn't seem to be enough to say "Goodhart's curse applies". Goodhart's curse applies when I'm buying apples at the grocery store. Why should we expect this bias of HCH to be enough to cause catastrophes, like it would for a superintelligent EU maximizer operating on an unbiased (but noisy) estimate of what we want? Some designs leave more room for correction and cushion, and it seems prudent to consider to what extent that is true for a proposed design.

It depends on how much risk you are willing to tolerate, I think. HCH applies optimization pressure, and in the limit of superintelligence I expect it to be so much optimization pressure that any deviance will become so large as to become a problem. But a person could choose to accept the risk with strategies that help minimize risk of deviance such that they think those strategies will do enough to mitigate the worst of that effect in the limit.

As far as leaving room for correction and cushion, those also require a relatively slow takeoff because it requires time for humans to think and intervene. Since I expect takeoff to be fast, I don't expect there to be adequate time for humans in the loop to notice and correct deviance, thus any deviance that can appear late in the process is a problem in my view.

This isn't obvious to me. Mild optimization seems like a natural thing people are able to imagine doing. If I think about "kinda helping you write a post but not going all-out", the result is not at all random actions. Can you expand?

The problem with mild optimization is that it doesn't eliminate the bias that causes the optimizer's curse, only attenuates it. So unless we can cause via a "mild" method there to be a finite bound on the amount of deviance in the limit of optimization pressure, I don't expect it to help.
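A toy way to look at "attenuates but doesn't eliminate" is to use the number of candidates searched as a stand-in for optimization pressure and watch how far the proxy U of the selected candidate overshoots its true value V. Everything here (normal V, normal noise, the function name) is an illustrative assumption, and the behavior of a toy model is not evidence about any real system.

```python
import random
import statistics


def proxy_overshoot(n_candidates, noise_sd=1.0, trials=2_000, seed=0):
    """Mean of U - V for the candidate with the highest proxy score U.

    Assumption for illustration: true value V ~ Normal(0, 1) and the proxy is
    U = V + Normal(0, noise_sd) noise, independently for each candidate.
    """
    rng = random.Random(seed)
    overshoots = []
    for _ in range(trials):
        best_u, best_v = float("-inf"), 0.0
        for _ in range(n_candidates):
            v = rng.gauss(0.0, 1.0)
            u = v + rng.gauss(0.0, noise_sd)
            if u > best_u:
                best_u, best_v = u, v
        overshoots.append(best_u - best_v)
    return statistics.mean(overshoots)


if __name__ == "__main__":
    # More candidates searched stands in for more optimization pressure.
    for n in (1, 3, 10, 30, 100):
        print(n, round(proxy_overshoot(n), 3))
```

In this model the overshoot is near zero with no selection, smaller under mild selection, and keeps growing (slowly, under Gaussian noise) as the search widens. Whether a given "mild" scheme puts a finite bound on that growth is exactly the open question.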

Coincidentally, just yesterday I was part of some conversations that now make me more bullish on this approach. I haven't thought about it much in quite a while, and now I'm returning to it.

The potential solution I was referring to is motivated in the recently-completed Reframing Impact sequence.

I think you credit the optimizer's curse with power that it doesn't, as described, have. In particular, it doesn't have the power that "people who try to optimize end up worse off than people who don't".

In the linked post by lukeprog, when the curse is made concrete with numbers, people who tried to optimize ended up exactly as well off as everyone else - but that's only because by assumption, all choices were exactly the same. ("there are k choices, each of which has true estimated [expected value] of 0.") If some choices are better than the others, then the optimizer's curse will make the optimizer disappointed, but it will still give her better results on average than the people who failed to optimize, or who optimized less hard. (Ignoring possible actions that aren't just "take one of these options based on the information currently available to me".)

I'm making some assumptions about the error terms here, and I'm not sure exactly what assumptions. But I think they're fairly weak.

(And if the difference between the actually-best choice and the actually-second best is large compared to the error terms, then the optimizer's curse appears to have no power at all.)
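The three cases described here can be checked with a small simulation. This is a sketch under assumptions of my own (normal error terms with the same spread for every option, made-up true values), not a reproduction of the numbers in the linked post. It reports what the optimizer actually realizes, what a random chooser realizes, and how disappointed the optimizer is.

```python
import random
import statistics


def compare_choosers(true_values, noise_sd, trials=20_000, seed=0):
    """Return (optimizer's mean realized value,
               random chooser's mean realized value,
               optimizer's mean disappointment = estimate - realized)."""
    rng = random.Random(seed)
    n = len(true_values)
    opt_realized, rand_realized, disappointment = [], [], []
    for _ in range(trials):
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        chosen = max(range(n), key=lambda i: estimates[i])
        opt_realized.append(true_values[chosen])
        rand_realized.append(true_values[rng.randrange(n)])
        disappointment.append(estimates[chosen] - true_values[chosen])
    return (statistics.mean(opt_realized),
            statistics.mean(rand_realized),
            statistics.mean(disappointment))


if __name__ == "__main__":
    cases = {
        "all options equal": [0.0, 0.0, 0.0, 0.0, 0.0],
        "options differ":    [0.0, 1.0, 2.0, 3.0, 4.0],
        "one clear winner":  [0.0, 0.0, 0.0, 0.0, 100.0],
    }
    for name, values in cases.items():
        print(name, [round(x, 3) for x in compare_choosers(values, noise_sd=1.0)])
```

When all options are equal the optimizer is disappointed but no worse off than a random chooser; when options differ she is disappointed and still better off; and when one option wins by far more than the noise, the disappointment should all but disappear.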

There can be other things that go wrong, when one tries to optimize. With your shoes and your breakfast routine, it seems to me that you invested much effort in pursuit of a goal that was unattainable in one case and trivial in another. Unfortunate, but not the optimizer's curse.

I wrote the above and then realised that I'm not actually sure how much you're making the specific mistake I describe. I thought you were partly because of

attempts to optimize for a measure of success result in increased likelihood of failure to hit the desired target

Emphasis mine. But maybe the increased likelihood just comes from Goodhart's law, here? It's not clear to me what the optimizer's curse is contributing to Goodhart's curse beyond what Goodhart's law already supplies.

Separate from my other comment, I want to question your assumption that we must worry about an AI-takeoff that is exponentially better than humans at everything, so that a very slight misalignment would be disastrous. That seems possible, per Eliezer's Rocket Example, but is far from certain.

It seems likely that instead there are fundamental limits on intelligence (for a given architecture, at least) and while it is unlikely that the overall limits are coincidentally the same as / near human intelligence, it seems plausible that the first superhuman AI system still plateaus somewhere far short of infinite optimization power. If so, we only need to mitigate well, rather than perfectly align the AI to our goals.

I don't think I have anything unique to add to this discussion. Basically I defer to Eliezer and Nick (Bostrom) for written arguments since they are largely the ones who provided the arguments that led me to strongly believe we live in a world with hard takeoff via recursive self improvement that will lead to a "singularity" in the sense that we pass some threshold of intelligence/capabilities beyond which we cannot meaningfully reason about or control what happens after the fact, though we may be able to influence how it happens in ways that don't cut off the possibilities of outcomes we would be happy with.

a very slight misalignment would be disastrous. That seems possible, per Eliezer's Rocket Example, but is far from certain.

Just a minor nitpick: I don't think the point of the Rocket Alignment Metaphor was supposed to be that slight misalignment was catastrophic. I think the more apt interpretation is that apparent alignment does not equal actual alignment, and you need to do a lot of work before you get to the point where you can talk meaningfully about aligning an AI at all. Relevant quote from the essay:

It’s not that current rocket ideas are almost right, and we just need to solve one or two more problems to make them work. The conceptual distance that separates anyone from solving the rocket alignment problem is much greater than that.

Right now everyone is confused about rocket trajectories, and we’re trying to become less confused. That’s what we need to do next, not run out and advise rocket engineers to build their rockets the way that our current math papers are talking about. Not until we stop being confused about extremely basic questions like why the Earth doesn’t fall into the Sun.

Fully agree - I was using the example to make a far less fundamental point.

I get Goodhart, but I don't understand why the optimizer's curse matters at all in this context; can you explain? My reasoning is: When optimizing, you make a choice C and expect value E but actually get value V<E. But choice C may still have been the best choice. So what if the AI falls short of its lofty expectations? As long as it did the right thing, I don't care whether the AI was disappointed in how it turned out, like if we get a mere Utopia when the AI expected a super duper Utopia. All I care about is C and V, not E.

I don't understand why https://en.wikipedia.org/wiki/Theory_of_the_second_best doesn't get more consideration. In a complex interconnected system, V can not only be much less than E, it can be much less than would be obtained with ~C. You may not get mere utopia, you may get serious dystopia.

So when they interact there are additional variables you're leaving out.

There's a target T that's the real thing you want. Then there's a function E that measures how much you expect C to achieve T. For example, maybe T is "have fun" and E is "how fun C looks". Then given a set of choices C_1, C_2, ... you choose C_max such that E(C_max) >= E(C_i) for all C_i (in normal terms, C_max = argmax E(C_i)). Unfortunately T is hidden such that you can only check if C satisfies T via E (well, this is not exactly true, because otherwise we might have a hard time knowing the optimizer's curse exists, but the curse would hold even if that were the case, we just might not be able to notice it; regardless, we can't use whatever this side channel is to assess T as a measure and so can't optimize on it).

Now since we don't have perfect information, there is some error e associated with E, so the true extent to which any C_i satisfies T is E(C_i) + e. But we picked C_max based on the existence of this error, since C_max = argmax E(C_i) + e, thus C_max may not be the true max. As you say, so what, maybe that means we just don't pick the best but we pick something good. But recall that our purpose was T not max E(C_i), so over repeated choices we will consistently, due to the optimizer's curse, pick C_max such that max E(C_i) < T (noting that's a type error as notated, but I think it's intuitive what is meant). Thus e will compound over repeated choice since each subsequent C is conditioned on the previous ones such that it becomes certain that E(C_max) < T and never E(C_max) = T.

This might seem minor if we had only a single dimension to worry about, like "had slightly less than maximum fun", even if it did, say, result in astronomical waste. But we normally are optimizing over multiple dimensions and each choice may fail in different ways along those different dimensions. The result is that we will over time shrink the efficiency frontier (though it might reach a limit and not get worse) and end up with worse solutions than were possible that may even be bad enough that we don't want them. After all, nothing is stopping the error from getting so large or the frontier from shrinking so much that we would be worse off than if we had never started.

It seems to me that your comment amounts to saying "It's impossible to always make optimal choices for everything, because we don't have perfect information and perfect analysis," which is true but unrelated to the optimizer's curse (and I would say not in itself problematic for AGI safety). I'm sure that's not what you meant, but here's why it comes across that way to me. You seem to be setting T = E(C_max). If you set T = E(C_max) by definition, then imperfect information or imperfect analysis implies that you will always miss T by the error e, and the error will always be in the unfavorable direction.

But I don't think about targets that way. I would set my target to be something that can in principle be exceeded (T = have almost as much fun as is physically possible). Then when we evaluate the choices C, we'll find some that dramatically exceed T (i.e. way more fun than is physically possible, because we estimated the consequences wrong), and if we pick one of those, we'll still have a good chance of slightly exceeding T despite the optimizer's curse.

Lack of access to perfect information is highly relevant because it's exactly why we can't get around the curse. If we had perfect information we could correct for it as a systematic bias using Bayesian methods and be done with it. It's also why it shows up in the first place: if we could establish a measure E that accurately reported the amount it satisfied T then it wouldn't happen because there would be no error in the measurement.

What you are proposing about allowing targets to be exceeded is simply allowing for more mild optimization, and the optimizer's curse still happens if there is preferential choice at all.

I don't think it's related to mild optimization. Pick a target T that can be exceeded (wonderful future, even if it's not the absolute theoretically best possible future). Estimate which choice C_max is (as far as we can tell) the #1 very best by that metric. We expect C_max to give value E, and it turns out to be V<E, but V is still likely to exceed T, or at least likelier than any other choice. (Insofar as that's not true, it's Goodhart.) The optimizer's curse, i.e. V<E, does not seem to be a problem, or even relevant, because I don't ultimately care about E. Maybe the AI doesn't even tell me what E is. Maybe the AI doesn't even bother guessing what E is; it only calculates that C_max seems to be better than any other choice.

Hmm, maybe you are misunderstanding how the optimizer's curse works? It's powered by selection on a measure with error: we preferentially pick the actions whose measured value is high, and those are disproportionately the actions whose measure erred upward, so the measure of a chosen action is on average higher than its true value. You are mistaken, then, to not care about E, because E is the only reliable and comparable way you have to check if C satisfies T (if there's another one that's reliable and comparable, then use it instead). It's literally the only option for picking the C_max that seems better, assuming you picked the "best" E (another chance for Goodhart's curse to bite you), unless you want very high quantilization such that, say, you only act when things appear orders of magnitude better, with error bounds small enough that you will only be wrong once in trillions of years.
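To make the mechanism concrete, here is a minimal simulation sketch. It assumes each choice has a true value V drawn from a standard normal and an unbiased estimate E = V + zero-mean Gaussian noise; the function name, distributions, and parameters are illustrative assumptions, not anything specified in this thread. Even though every individual estimate is unbiased, the estimate of the selected choice comes out too high on average.

```python
import random

def simulate_optimizers_curse(n_choices=10, noise_sd=1.0, trials=20_000, seed=0):
    """Average (estimate - true value) for the choice with the highest estimate."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        # Assumption: true values V_i are drawn from a standard normal.
        true_values = [rng.gauss(0.0, 1.0) for _ in range(n_choices)]
        # Each estimate E_i is unbiased: the true value plus zero-mean noise.
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        # Preferential choice: act on whichever option looks best.
        best = max(range(n_choices), key=estimates.__getitem__)
        total_gap += estimates[best] - true_values[best]
    return total_gap / trials

mean_gap = simulate_optimizers_curse()
print(f"mean E - V for the selected choice: {mean_gap:.3f}")
```

Under these assumptions the printed gap is positive, and setting `noise_sd=0.0` makes it vanish, which matches the point that the effect comes from selection on a measure with error.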

I do think I understand that. I see E as a means to an end. It's a way to rank-order choices and thus make good choices. If I apply an order-preserving affine transformation to E (positive scale, any shift), e.g. I'm way too optimistic about absolutely everything in a completely uniform way, then I still make the same choice, and the choice is what matters. I just want my AGI to do the right thing.
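A small sketch of the invariance being claimed here, with made-up numbers: the top-ranked choice is unchanged when every estimate is rescaled and shifted in the same way. The helper names `pick_best` and `affine` are hypothetical, and the sketch assumes a positive scale factor, since a negative one would reverse the ranking.

```python
def pick_best(estimates):
    """Index of the choice with the highest estimate."""
    return max(range(len(estimates)), key=estimates.__getitem__)

def affine(estimates, scale, shift):
    """Apply E -> scale * E + shift to every estimate (scale must be positive)."""
    if scale <= 0:
        raise ValueError("scale must be positive to preserve the ranking")
    return [scale * e + shift for e in estimates]

# Illustrative estimates for four choices (made-up numbers).
estimates = [0.2, 1.7, -0.4, 1.1]
# Uniformly over-optimistic versions of the same estimates.
optimistic = affine(estimates, scale=3.0, shift=10.0)

same_choice = pick_best(estimates) == pick_best(optimistic)
print(same_choice)
```

This only shows that the ranking survives a uniform distortion. It says nothing about noise that differs from choice to choice, which is what the optimizer's curse is about.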

Here, I'll try to put what I'm thinking more starkly. Let's say I somehow design a comparative AGI. This is a system which can take a merit function U, and two choices C_A and C_B, and it can predict which of the two choices C_A or C_B would be better according to merit function U, but it has no idea how good either of those two choices actually are on any absolute scale. It doesn't know whether C_A is wonderful while C_B is even better, or whether C_A is awful while C_B is merely so-so, both of those just return the same answer, "C_B is better". Assume it's not omniscient, so its comparisons are not always correct, but that it's still impressively superintelligent.

A comparative AGI does not suffer the optimizer's curse, right? It never forms any beliefs about how good its choices will turn out, so it couldn't possibly be systematically disappointed. There's always noise and uncertainty, so there will be times when its second-highest-ranked choice would actually turn out better than its highest-ranked choice. But that happens less often than chance. There's no systematic problem: in expectation, the best thing to do (as measured by U) is always to take its top-ranked choice.

Now, it seems to me that, if I go to the AGIs-R-Us store, and I see a normal AGI and a comparative AGI side-by-side on the shelf, I would have no strong opinion about which one of them I should buy. If I ask either one to do something, they'll take the same sequence of actions in the same order, and get the same result. They'll invest my money in the same stocks, offer me the same advice, etc. etc. In particular, I would worry about Goodhart's law (i.e. giving my AGI the wrong function U) with either of these AGIs to the exact same extent and for the exact same reason...even though one is subject to optimizer's curse and the other isn't.

Right, if you don't have a measure you can't have Goodhart's curse on technical grounds, but I'm also pretty sure something like it is still there. It's just that, as far as I know, no one has tried to show that something like the optimizer's curse continues to function when you only have an ordering and not a measure. I think it does, and I think others think it does, and this is part of the generalization to Goodharting, but I don't know of a formal proof demonstrating it, even though I strongly suspect it's true.

In HCH and safety via debate, it's a human preferentially selecting AI that the human observes and then comes to believe does what it wants.

I remain doubtful, since without sufficient optimization it's not clear how we do better than picking at random.

This feels like painting with too broad a brush, and from my state of knowledge, the assumed frame eliminates at least one viable solution. For example, can one build an AI without harmful instrumental incentives (without requiring any fragile specification of "harmful")? If you think not, how do you know that? Do we even presently have a gears-level understanding of why instrumental incentives occur?

Coincidentally, just yesterday I was part of some conversations that now make me more bullish on this approach. I haven't thought about it much in quite a while, and now I'm returning to it.

To say e.g. HCH is so likely to fail we should feel pessimistic about it, it doesn't seem to be enough to say "Goodhart's curse applies". Goodhart's curse applies when I'm buying apples at the grocery store. Why should we expect this bias of HCH to be enough to cause catastrophes, like it would for a superintelligent EU maximizer operating on an unbiased (but noisy) estimate of what we want? Some designs leave more room for correction and cushion, and it seems prudent to consider to what extent that is true for a proposed design.

This isn't obvious to me. Mild optimization seems like a natural thing people are able to imagine doing. If I think about "kinda helping you write a post but not going all-out", the result is not at all random actions. Can you expand?

The potential solution I was referring to is motivated in the recently-completed Reframing Impact sequence.

I'm making some assumptions about the error terms here, and I'm not sure exactly what assumptions. But I think they're fairly weak.

(And if the difference between the actually-best choice and the actually-second best is large compared to the error terms, then the optimizer's curse appears to have no power at all.)
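A rough way to check the parenthetical above is a toy simulation. It assumes one choice is truly better than the rest by a fixed `gap`, with independent zero-mean Gaussian noise on every estimate; the function name and all parameters are illustrative assumptions. It reports how often the truly best choice is picked and the average disappointment E - V of whatever was picked.

```python
import random

def select_with_gap(gap, noise_sd=1.0, n_others=9, trials=5_000, seed=0):
    """Return (how often choice 0 is picked, mean E - V of the picked choice)."""
    rng = random.Random(seed)
    hits = 0
    total_surprise = 0.0
    for _ in range(trials):
        # Choice 0 is truly better by `gap`; all other choices are truly worth 0.
        true_values = [gap] + [0.0] * n_others
        estimates = [v + rng.gauss(0.0, noise_sd) for v in true_values]
        picked = max(range(len(estimates)), key=estimates.__getitem__)
        hits += picked == 0
        total_surprise += estimates[picked] - true_values[picked]
    return hits / trials, total_surprise / trials

for gap in (0.0, 2.0, 20.0):
    hit_rate, surprise = select_with_gap(gap)
    print(f"gap={gap:>4}: choice 0 picked {hit_rate:.2f} of the time, mean E - V = {surprise:.2f}")
```

In this toy setup, when the gap is large relative to the noise the best choice is essentially always selected and the average disappointment is close to zero. When the gap is zero, selection is no better than chance and the disappointment is substantial.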

I wrote the above and then realised that I'm not actually sure how much you're making the specific mistake I describe. I thought you were partly because of

attempts to optimize for a measure of success result in increased likelihood of failure to hit the desired target

a very slight misalignment would be disastrous. That seems possible, per Eliezer's Rocket Example, but is far from certain.

It’s not that current rocket ideas are almost right, and we just need to solve one or two more problems to make them work. The conceptual distance that separates anyone from solving the rocket alignment problem is much greater than that.

Right now everyone is confused about rocket trajectories, and we’re trying to become less confused. That’s what we need to do next, not run out and advise rocket engineers to build their rockets the way that our current math papers are talking about. Not until we stop being confused about extremely basic questions like why the Earth doesn’t fall into the Sun.

Fully agree - I was using the example to make a far less fundamental point.

So when they interact there are additional variables you're leaving out.

