This is a linkpost for https://nostalgebraist.tumblr.com/post/186105132274/theres-this-funny-thing-about-goodharts-law
Would kind of like to excerpt the whole post, but that feels impolite, so I'll just quote the first four paragraphs and then suggest reading the whole thing:
There’s this funny thing about Goodhart’s Law, where it’s easy to say “being affected by Goodhart’s Law is bad” and “it’s better to behave in ways that aren’t as subject to Goodhart’s Law,” but it can be very hard to explain why these things are true to someone who doesn’t already agree.
Why? Because any such explanation is going to involve some step where you say, “see, if you do that, the results are worse.” But this requires some standard by which we can judge results … and any such standard, when examined closely enough, has Goodhart problems of its own.
There are times when you can’t convince someone without a formal example or something that amounts to one, something where you can say “see, Alice’s cautious heuristic strategy wins her $X while Bob’s strategy of computing the global optimum under his world model only wins him the smaller $Y, which is objectively worse!”
But if you’ve gotten to this point, you’ve conceded that there’s some function whose global optimum is the one true target. It’s hard to talk about Goodhart at all without something like this in the background – how can “the metric fails to capture the true target” be the problem unless there is some true target?
There are two related, but distinct problems under the heading of Goodhart, both caused by the hard-to-deny fact that any practical metric is only a proxy/estimate/correlate of the actual true goal.
1) anti-inductive behavior. When a metric gets well-known as a control input, mis-aligned agents can spoof their behavior to take advantage.
2) divergence of legible metric from true desire. Typically this increases over time, or over the experienced range of the metric.
I think you're right that there is an underlying assumption that such a true goal exists. I don't think there's any reason to believe that it's an understandable/legible function for human brains or technology. It could be polynomial or not, and it could have millions or billions of terms. In any actual human (and possibly in any embodied agent), it's only partly specified, and even that partial specification isn't fully accessible to introspection.
Specifics matter. It's better to behave in ways that give better outcomes, but it's not obvious at all what those ways are. Even for ways that are known to be affected by Goodhart's law, there is SOME reason to believe they're beneficial - Goodhart isn't (necessarily) a reversal of sign, only a loss of correlation with actual desires.
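To make "a loss of correlation, not a reversal of sign" concrete, here is a minimal sketch under a toy model that is entirely an assumption: the true value is standard normal, and the metric is the true value plus independent Gaussian noise. The function names and parameters are hypothetical. Selecting hard on the metric weakens its correlation with the true value inside the selected group, yet the selected group is still better than average on the true value.

```python
import random
import statistics


def simulate(n=20000, noise_sd=1.0, seed=0):
    """Toy model (an assumption): proxy = true value + Gaussian noise."""
    rng = random.Random(seed)
    true_values = [rng.gauss(0.0, 1.0) for _ in range(n)]
    proxies = [t + rng.gauss(0.0, noise_sd) for t in true_values]
    return true_values, proxies


def correlation(xs, ys):
    """Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def select_top(true_values, proxies, fraction=0.01):
    """Keep the items with the highest proxy scores (optimizing the metric)."""
    ranked = sorted(zip(proxies, true_values), reverse=True)
    k = max(2, int(len(ranked) * fraction))
    top = ranked[:k]
    return [t for _, t in top], [p for p, _ in top]


true_values, proxies = simulate()
top_true, top_proxy = select_top(true_values, proxies)

print("correlation, whole population:", round(correlation(true_values, proxies), 2))
print("correlation, top 1% by proxy: ", round(correlation(top_true, top_proxy), 2))
print("mean proxy in top 1%:         ", round(statistics.fmean(top_proxy), 2))
print("mean true value in top 1%:    ", round(statistics.fmean(top_true), 2))
```

In this toy model the selected group's true value stays positive but falls well short of its proxy score, which is the "loss of correlation" case. Other setups (for example, adversarial gaming of the metric) can behave differently.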
Again, specifics matter. "Goodhart exists, and here's how it might apply to your proposed metrics" has been an _easy_ discussion every time I've had it. I literally have never had a serious objection to the concept. What's hard is figuring out checksum metrics or triggers to re-evaluate the control inputs. Defining the range and context inside of which a given measurement is "good enough" takes work, but is generally achievable (in the real world; perhaps not for theoretical AI work).
Strongly agree - and Goodhart's law is at least 4 things. Though I'd note that anti-inductive behavior / metric gaming is hard to separate from goal mis-specification, for exactly the reasons outlined in the post.
But saying there is a goal too complex to be understandable and legible implies that it's really complex, but coherent. I don't think that's the case for individuals, and I'm certain it isn't true of groups. (Arrow's theorem, etc.)
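As a small illustration of group incoherence, here is a sketch of the classic Condorcet cycle. This is not Arrow's theorem itself, which is more general, only the standard three-voter example that motivates it. The ballots and function names are made up for illustration. Each voter has a perfectly coherent ranking, but pairwise majority voting yields a cycle, so no single ranking represents "what the group wants".

```python
# Three hypothetical voters, each with a coherent ranking (best first).
BALLOTS = [
    ["A", "B", "C"],
    ["B", "C", "A"],
    ["C", "A", "B"],
]


def majority_prefers(x, y, ballots):
    """True if a strict majority of ballots rank x above y."""
    votes_for_x = sum(b.index(x) < b.index(y) for b in ballots)
    return votes_for_x > len(ballots) / 2


def majority_relation(ballots):
    """All ordered pairs (x, y) where the majority prefers x to y."""
    options = sorted(ballots[0])
    return {
        (x, y)
        for x in options
        for y in options
        if x != y and majority_prefers(x, y, ballots)
    }


def has_condorcet_winner(ballots):
    """True if some option beats every other option head-to-head."""
    options = sorted(ballots[0])
    relation = majority_relation(ballots)
    return any(
        all((x, y) in relation for y in options if y != x) for x in options
    )


print(sorted(majority_relation(BALLOTS)))
print("Condorcet winner exists:", has_condorcet_winner(BALLOTS))
```

The majority prefers A to B, B to C, and C to A, so any function fit to the group's pairwise choices would need u(A) > u(B) > u(C) > u(A), which is impossible.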
I'm not sure it's possible to distinguish between chaotically-complex and incoherent. Once you add reference class problems in (you can't step in the same river twice; no two decisions are exactly identical), there's no difference between "inconsistent" and "unknown terms with large exponents on unmeasured variables".
But in any case, even without coherence/consistency across agents or over time, any given decision can be an optimization of something.
[ I should probably add an epistemic status: not sure this is a useful model, but I do suspect there are areas it maps to the territory well. ]
I'd agree with the epistemic warning ;)
I don't think the model is useful, since it's non-predictive. And we have good reasons to think that human brains are actually incoherent. Which means I'm skeptical that there is something useful to find by fitting a complex model to find a coherent fit for an incoherent system.
I think (1) Dagon is right that if we consider a purely behavioral perspective the distinction gets meaningless at the boundaries, trying to distinguish between highly complex values vs incoherence; any set of actions can be justified via some values; (2) humans are incoherent, in the sense that there are strong candidate partial specifications of our values (most of us like food and sex) and we're not always the most sensible in how we go about achieving them; (3) also, to the extent that humans can be said to have values, they're highly complex.
The thing that makes these three statements consistent is that we use more than just a behavioral lens to judge "human values".
I really like this.
I like the analogy of "traversal of the intervening territory", even though, like the author, I don't know what it formally means.
Unlike the author, I do have some models of what it means to lack a utility function, but optimize anyway. Within such a model, I would say: it's perfectly fine and possible to come up with a function which approximately represents your preferences, but holding on to such a function even after your preferences have updated away from it leads to a Goodhart-like risk.
More generally, it's not the case that literally everything is well-understood as optimization of some function. There are lots of broadly intelligent processes that aren't best-understood as optimization.
I'd love to hear some examples, and start a catalog of "useful understandings of intelligent processes". I believe that control/feedback mechanisms, evolution, the decision theories tossed about here (CDT, TDT, UDT, etc.), and VNM-compliant agents generally are all optimizers, though not all with the same complexity or capabilities.
Humans aren't VNM-rational agents over time, but I believe each instantaneous decision is an optimization calculation within the brain.
I think once one begins to enter this alternative frame where lots of things aren't optimization, it starts to become apparent that "hardly anything is just optimization" -- IE, understanding something as optimization often hardly explains anything about it, and there are often other frames which would explain much more.
I guess it depends on whether you want to keep "optimization" as a referent to the general motion that is making the world more likely to be one way than another or a specific type of making the world more likely to be one way rather than another. I think the former is more of a natural category for the types of things most people seem to mean by optimizing.
None of this is to say, though, that there aren't many processes where the optimization framing is not very useful. For example, you mention logic and Bayesian updating as examples, and that sounds right to me, because those are processes operating over the map rather than the territory (even if they are meant to be grounded in the territory), and when you only care about the map it doesn't make much sense to talk about taking actions to make the world one way rather than another, because there is only one consistent way the world can be within the system of a particular map.
I suspect you're trying to gesture at a slightly better definition here than the one you give, but since I'm currently in the business of arguing that we should be precise about what we mean by 'optimization'... what do you mean here?
Just about any element of the world will "make the world more likely to be one way rather than another".
Yeah, if I want to be precise, I mean anytime there is a feedback loop there is optimization.
That does seem better. But I don't think it fills the shoes of the general notion of optimization people use.
I'm pretty happy to count all these things as optimization. Much of the issue I find with using the feedback loop definition is, as you point to, the difficulty of figuring out things like "is there a loop here?", which suggests there might be a better, more general model for what I've been pointing to. I use "feedback loop" because it's simply the closest, most general model I know. Which actually points back to the way I phrased it before, which isn't formalized but I think does come closer to expansively capturing all the things I think make sense to group together as "optimization".
Awesome - I think I agree with most of this. Specifically, https://www.lesswrong.com/posts/A8iGaZ3uHNNGgJeaD/an-orthodox-case-against-utility-functions is very compatible with the possibility that the function is too complex for any agent to actually compute. It's quite likely that there are more potential worlds than an agent can rank, with more features than an agent can measure.
Any feasible comparison of potential worlds is actually a comparison of predicted summaries of those worlds. Both the prediction and the summary are lossy, thus that aspect of Goodhart's law.
I did not mean to say that "everything is an optimization process". I did mean to say that decisions are an optimization process, and I now realize even that's too strong. I suspect all I can actually assert is that "intentionality is an optimization process".
Oh, I didn't mean to accuse you of that. It's more that this is a common implicit frame of reference (including/especially on LW).
I rather suspect the correct direction is to break down "optimization" into more careful concepts (starting, but not finishing, with something like selection vs control).
On this, we're fully agreed. Epistemics may be pre-optimization, or may not.
I am fond of describing Buddhism as un-goodharting yourself all the way down.
In brief, you are a Ship of Theseus at sea (or a Ship of Neurath, as Quine coined it after Otto Neurath, who first made the analogy), navigating by lighthouses that you should not steer directly towards (that's not how lighthouses work!) and avoiding whirlpools (attractors in the space of goal architectures that lock you in to a specific destination).
I agree and am glad this is getting upvotes, but for what it's worth I made exactly the same point a year ago and several people were resistant to the core idea, so this is probably not an easily won insight.
Could you elaborate on that? The two posts seem to be talking about different things as far as I can tell: e.g. nostalgebraist doesn't say anything about the Optimizer's Curse, whereas your post relies on it.
I do see that there are a few paragraphs that seem to reach similar conclusions (both say that overly aggressive optimization of any target is bad), but the reasoning used for reaching that conclusion seems different.
(By the way, I don't quite get your efficiency example? I interpret it as saying that you spent a lot of time and effort on optimizations that didn't pay themselves back. I guess you might mean something like "I had a biased estimate of how much time my optimizations would save, so I chose expensive optimizations that turned out to be less effective than I thought." But the example already suggests that you knew beforehand that the time saved would be on the order of a minute or so, so I'm not sure how the example is about Goodhart's Curse.)
It's mostly explicated down in the comments on the post where people started getting confused about just how integral the act of measuring is to doing anything. When I wrote the post I considered the point obvious enough to not need to be argued on its own, until I hit the comments.
(On the example, I was a short sighted optimizer.)
In this situation Goodhart is basically open-loop optimization. An EE analogy would be a high gain op amp with no feedback circuit. The result is predictable: you end up optimized out of the linear mode and into saturation.
You can't explicitly optimize for something you don't know. And you don't know what you really want. You might think you do, but, as usual, beware what you wish for. I don't know if an AI can form a reasonable terminal goal to optimize, but humans surely cannot. Given that some 90% of our brain/mind is not available to introspection, all we have to go by is the vague feeling of "this feels right" or "this is fishy but I cannot put my finger on why". That's why cautiously iterating with periodic feedback is so essential, and open-loop optimization is bound to get you to all the wrong places.
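The op-amp analogy can be sketched numerically. This is an idealized static model, not a circuit simulation: the gain of 1e5, the ±12 V rails, and the feedback fraction `beta` are illustrative assumptions, and the function names are hypothetical. Without feedback, even a tiny input drives the output to the rail; with negative feedback, the output settles near `v_in / beta` and stays in the linear region.

```python
def clip(v, rail):
    """Limit a voltage to the supply rails [-rail, +rail]."""
    return max(-rail, min(rail, v))


def open_loop_output(v_in, gain=1e5, rail=12.0):
    """No feedback: the output is gain * input, limited only by the rails."""
    return clip(gain * v_in, rail)


def closed_loop_output(v_in, gain=1e5, beta=0.1, rail=12.0):
    """Negative feedback: solve v_out = gain * (v_in - beta * v_out)."""
    return clip(gain * v_in / (1.0 + gain * beta), rail)


for v_in in (0.001, 0.01, 0.1):
    print(
        f"v_in={v_in:>5} V  "
        f"open loop={open_loop_output(v_in):>6.2f} V  "
        f"closed loop={closed_loop_output(v_in):>6.3f} V"
    )
```

In the open-loop case all three inputs produce the same saturated output, so the output no longer carries information about the input. The closed-loop output tracks the input with a gain of about `1 / beta`.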
This post reminds me of an insight from one of my uni professors.
Early on at university, I was very frustrated that the skills taught to us did not seem to be immediately applicable to the real world. That frustration was strong enough to snuff out most of the interest I had for studying genuinely (that is, to truly understand and internalize the concepts taught to us). Still, studying was expensive, dropping out was not an option, and I had to pass exams, which is why very early on I started, in what seemed to me to be a classic instance of Goodhart, to game the system - test banks were widely circulated among students, and for the classes with no test banks, there were past exams, which you could go through, trace out some kind of pattern for which topics and kinds of problems the prof puts on exams, and focus only on studying those. I didn't know it was called "Goodhart" back then, but the significance of this was not lost on me - I felt that by pivoting away from learning subjects and towards learning to pass exams in subjects, I was intellectually cheating. Sure, I was not hiding crib sheets in my sleeves or going to the restroom to look something up on my phone, but it was still gaming the system.
Later on, when I got rather friendly with one of my profs, and extremely worn down by pressures from my probability calculus course, I admitted to him that this was what I was doing, that I felt guilty, and didn't feel able to pass any other way and felt like a fake. He said something to the effect of "Do you think we don't know this? Most students study this way, and that's fine. The characteristic of a well-structured exam isn't that it does not allow cheating, it's that it only allows cheating that is intelligent enough that a successful cheater would have been able to pass fairly."
What he said was essentially a refutation of Goodhart's Law by a sufficiently high-quality proxy. I think this might be relevant to the case you're dealing with here as well. Your "true" global optimum probably is a proxy, but if it's a well-chosen one, it need not be vulnerable to Goodhart.
The author seems to be skipping a step in their argument. I thought Goodhart's Law was about how it's hard to specify a measurable target which exactly matches your true goal, not that true goals don't exist.
For example, if I wanted to donate to a COVID-19 charity, I might pick one with the measurable goal of reducing the official case numbers... and they could spend all of their money bribing people to not report cases or make testing harder. Or if they're an AI, they could hit this goal perfectly by killing all humans. But just because this goal (and probably every easily measurable goal) is Goodhartable doesn't mean all possible goals are. The thing I actually want is still well defined (I want actual COVID-19 cases to decrease and I want the method to pass a filter defined by my brain), it's just that the real fundamental thing I want is impossible to measure.
But this is the point. That's why it was titled "recursive goodhart's law" -- the idea being that at any point where you explicitly point to a "true goal" and a "proxy", you've probably actually written down two different proxies of differing quality. So you can keep trying to write down ever-more-faithful proxies, or you can "admit defeat" and attempt to make do without an explicitly written down function.
And the author explicitly admits that they don't have a good way to convince people of this, so, yeah, they're missing a step in their argument. They're more saying some things that are true and less trying to convince.
As for whether it's true -- yeah, this is basically the whole value specification problem in AI alignment.
I agree that Goodhart isn't just about "proxies", it's more specifically about "measurable proxies", and the post isn't really engaging with that aspect. But I think that's fine. There's also a Goodhart problem wrt proxies more generally.
I talked about this in terms of "underspecified goals" - often, the true goal doesn't exist clearly, and may not be coherent. Until that's fixed, the problem isn't really Goodhart, it's just sucking at deciding what you want.
I'm thinking of a young kid in a candy store who has $1, and wants everything, and can't get it. What metric for choosing what to purchase will make them happy? Answer: There isn't one. What they want is too unclear for them to be happy. So I can tell you in advance that they're going to have a tantrum later about wanting to have done something else no matter what happens now. That's not because they picked the wrong goal, it's because their desires aren't coherent.
But "COVID-19 cases decreasing" is probably not your ultimate goal: more likely, it's an instrumental goal for something like "prevent humans from dying" or "help society" or whatever... in other words, it's a proxy for some other value. And if you walk back the chain of goals enough, you are likely to arrive at something that isn't well defined anymore.
Yup. Humans have a sort of useful insanity, where they can expect things to be bad not based on explicitly evaluating the consequences, but off of a model or heuristic about what to expect from different strategies. And then we somehow only apply this reasoning selectively, where it seems appropriate according to even more heuristics.