As Robin Hanson says:
Food isn’t about Nutrition
Clothes aren’t about Comfort
Bedrooms aren’t about Sleep
Marriage isn’t about Romance
Talk isn’t about Info
Laughter isn’t about Jokes
Charity isn’t about Helping
Church isn’t about God
Art isn’t about Insight
Medicine isn’t about Health
Consulting isn’t about Advice
School isn’t about Learning
Research isn’t about Progress
Politics isn’t about Policy
As I understand it, the prototypical form of Hansonian skepticism is to look at an X which is overtly about Y, and argue that the actual behavior of humans engaging in X is not consistent with caring about Y. Instead, the Hansonian cynic concludes, some unspoken motives must be at work; most often, status motives. He has a new book with Kevin Simler which (by the sound of it) covers a lot of examples.
For him, the story of what's going on in cases like this has a lot to do with evolutionary psychology -- humans are adapted to play these games about what we want, even if we don't know it. Our brains keep the conscious justifications separate from the hidden motives, so that we can explain ourselves in public while still going after private goals.
I'm not denying that this plays a role, but I think there's a more general phenomenon here which has nothing to do with humans or evolutionary psychology.
In economics, models based on an assumption of rational agents are fairly successful at making predictions, even though we know individual humans are far from rational. As long as the market contains inefficiencies which can be exploited, someone can jump in and exploit them. Not everyone needs to be rational, and no one person needs to be rational about everything, for the market to appear rational.
Similarly, the tails come apart whether or not X and Y have to do with humans. Someone might score high on calibration, but that doesn't make their strongest beliefs the most trustworthy -- indeed, they're the ones we ought to downgrade the most, even if we know nothing about human psychology and reason only from statistics. This alone is enough for us not to be too surprised at some of the "X is not about Y" examples -- we might expect explicit justifications to be mostly good explanations for behavior, but by default, this shouldn't make us think that the correlation will be perfect. And if we optimize for justifiability, we expect the correlation to decrease.
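As a rough illustration of the statistical point, here is a minimal simulation sketch. It assumes a toy model that is not from the post: the quantity we care about, y, is standard normal, and the proxy x is y plus independent standard normal noise. The function names and parameters are made up for the example.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def tails_come_apart(n=20000, top_fraction=0.01, seed=0):
    rng = random.Random(seed)
    # y: the thing we care about; x: a noisy proxy for it (assumed noise model).
    ys = [rng.gauss(0, 1) for _ in range(n)]
    xs = [y + rng.gauss(0, 1) for y in ys]

    # Select the cases that score highest on the proxy.
    ranked = sorted(zip(xs, ys), reverse=True)
    top = ranked[: int(n * top_fraction)]
    top_x = [x for x, _ in top]
    top_y = [y for _, y in top]

    return {
        "corr_all": pearson(xs, ys),          # proxy quality over everything
        "corr_top": pearson(top_x, top_y),    # proxy quality after selection
        "mean_x_top": statistics.fmean(top_x),
        "mean_y_top": statistics.fmean(top_y),
    }


if __name__ == "__main__":
    for name, value in tails_come_apart().items():
        print(f"{name}: {value:.3f}")
```

In this toy model the selected cases are still above average on y, but the proxy overstates them, and the correlation between x and y inside the selected group is weaker than in the whole population. No psychology is involved; selection on a noisy measurement is enough.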
Almost any problem has to be solved by coming up with proxies for progress. As Goodhart's Law notes, these proxies will have errors which naturally come to be exploited, simply by virtue of individuals in the system following incentive gradients. "Food is not about nutrition" because evolution found an imperfect proxy for nutritive value, and the modern economy optimized food to cheat the biological system. Organizations fall into lip service to lost purposes because they set up incentive structures which are good enough initially, but eventually get exploited. You can't usually call out such failures in a way the system will recognize, because the exploitation will happen in a way that's crafted to the blind spots -- and there's no need for the exploiters to be aware, consciously or subconsciously, of what they're doing. Any system will tend to optimize for seeming over being, because seeming is what's directly available to optimize. It's the principal-agent problem all the way down, from national elections to a person's internal motivation system. Call it wireheading. Call it Goodhart. It's all the same cluster of dysfunction.
Machine learning researchers understand this phenomenon in their domain -- they call it overfitting, and fight it with regularization. In machine learning, you want predictive accuracy. But you can't directly optimize for accuracy on the data you have available -- you'll get something which is very bad at predicting future data. The solution to this is regularization: you optimize a combination of predictive accuracy and a simplicity measure. This creates some "drag" on the optimization, so that it doesn't over-optimize predictive accuracy on the available data at the expense of future accuracy. This works extremely well, considering that it doesn't necessarily introduce any new information to correlate things with reality better (though, in the Bayesian case, the regularizer is a prior probability which may contain real information about which things are more likely). If we had a general theory of anti-Goodharting which worked as well as this, it would seem to have broad implications for group coordination and societal problems.
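To make the regularization idea concrete, here is a small sketch of one common form, ridge (L2-penalized) polynomial regression, written with only the standard library. Everything specific in it is an assumption chosen for illustration: the sine-plus-noise data, the polynomial degree, the penalty strength, and the helper names.

```python
import math
import random


def poly_features(x, degree):
    return [x ** d for d in range(degree + 1)]


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - s) / m[i][i]
    return w


def fit_ridge(xs, ys, degree, lam):
    """Minimize (sum of squared errors) + lam * ||w||^2 via the normal equations."""
    feats = [poly_features(x, degree) for x in xs]
    k = degree + 1
    ata = [[sum(f[i] * f[j] for f in feats) + (lam if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    aty = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(k)]
    return solve_linear_system(ata, aty)


def mse(w, xs, ys):
    degree = len(w) - 1
    errors = [sum(wi * fi for wi, fi in zip(w, poly_features(x, degree))) - y
              for x, y in zip(xs, ys)]
    return sum(e * e for e in errors) / len(errors)


def make_data(n, rng, noise=0.3):
    # Assumed ground truth: a sine curve observed with Gaussian noise.
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [math.sin(3 * x) + rng.gauss(0, noise) for x in xs]
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(0)
    train_x, train_y = make_data(10, rng)    # the data we can optimize on
    test_x, test_y = make_data(1000, rng)    # stand-in for "future data"
    for lam in (0.0, 0.1):
        w = fit_ridge(train_x, train_y, degree=7, lam=lam)
        print(f"lam={lam}: train MSE={mse(w, train_x, train_y):.3f}, "
              f"test MSE={mse(w, test_x, test_y):.3f}")
```

With the penalty switched off, the fit is free to chase noise in the ten training points, so its training error is as low as the model class allows. The penalty always costs some training error; the hope, which usually but not always holds in setups like this, is that it buys better error on the held-out points.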
This pessimism shouldn't be pushed too hard; usually, you're better off coming up with some kind of proxy measurement for your goals, so that you can shift your strategies in the face of measured success and failure. Your taste in food isn't that bad a proxy for what's good for you. Profit isn't that bad a proxy for providing value. Scientific consensus isn't that bad a proxy for truth. And so on. But keep in mind: when an optimization process is based on a proxy, there may be systematic failures. X is not about Y.
I had generally assumed that "food isn't about nutrition" was pointing to the idea that food is about status. For instance, it is rather suspicious that foods people consider virtuous seem to be those which involve high resource expenditure: organic, non-GMO, "fair-trade", "artisan", and so forth.
Ah, yeah, your interpretation is likely more in line with Robin Hanson's. I think this points to a more general way in which my version diverges from his, but I'm not sure how to articulate it. It's something like: I'm talking about the divergence of a system from the explicitly stated goal as a result of proxy incentives being imperfect. He's talking about entirely different incentives sneaking in, perhaps even determining what the proxies are in the first place. These are similar because a proxy is often set up as a status mechanism. You often want your proxy to serve as a status mechanism, because that'll make it more powerful for humans. But, on the flip side, you might be influenced by status dynamics when you set up your proxy.
(The above doesn't quite seem to capture a wider frame in which to understand both versions of "X is not about Y".)
I like the concept of regularization and would like to see it applied to more concrete examples.
Stock markets for example have powerful optimization functions. Would preventing high-frequency trading by letting the trading happen every minute instead of every 5 milliseconds be a form of regularization?
Hm, I'm not quite sure. A lot of things are called regularization, but it's got to pull a function/model/representation/outcome to a "simple" or "default" place. I'm not sure stock-trading frequency limitations can be seen this way.
" that doesn't make their strongest beliefs the most trustworthy -- indeed, they're the ones we ought to downgrade the most"
Not likely. Their strongest beliefs will be the most trustworthy, even though they are downgraded the most, because they start out higher. It would be a very unlikely calibration graph indeed which assigned a lower probability to 99.9% assigned probabilities than to 95% assigned probabilities.
Maybe. My initial response was to consider editing my wording to redact the implication you're objecting to.
But, it seems pretty plausible to me that even someone who works hard to be calibrated has many coexisting heuristics with different calibration graphs of their own, so that it's quite likely that beliefs at the 99.9% level are different in kind from beliefs at the 95% level, because the mental process which would spit out such a high confidence is different. Then, why expect calibration around 99.5%, based on calibration around 95%?
And if someone cites million-to-one confidence, then yeah, something is probably up with that! Maybe they've actually got millions of samples to generalize from, but even so, are they generalizing correctly? It seems like there is some reason to expect the graph to dip a bit at the end.
I like that this post helped both connect previous concepts and distinguish subtleties in those concepts. I had many (all?) of the component concepts, but I had not in fact noticed the significant overlap of "X is not about Y" and Goodhart's Law, and so I've promoted it to Featured.
“Everything in the world is about sex except sex. Sex is about power.” -Oscar Wilde
I'm not convinced this is analogous.
Regularisation works because including simplicity means that the function fits only the most prominent patterns, which are also the ones most likely to generalise outside of the data.
In contrast, Goodhart's Law arises from agents who are adversarial against the system in the sense of having different goals from the system designer. Simplicity isn't going to fix this - in fact, simplicity might even make exploiting easier.
Hm. If we look at things a little more abstractly, regularization helps because we have some goal, Y (generalization accuracy), and a proxy which we can actually measure, X (training-set accuracy). If we optimize a small set of possibilities for maximum X, we expect high X to yield high Y. If we maximize a larger set, we can get better X values, but at some point our confidence that this yields high Y values starts to drop. So, to manage this tradeoff, we add a new term Z which is our regularizer. Z has to do with the amount we have to expand our model class before we would consider the one we're looking at (which turns out to be a good proxy for the extent to which we expect X to be a good proxy for Y for this hypothesis).
Now, optimizing a combination of X and Z, we expect to do about as well as we can.
This description abstracted away the details of "including simplicity means that the function fits only the most prominent patterns, which are also the ones most likely to generalise outside of the data" -- those details are packed into why we expect X to become a worse proxy as we expand the set of options. But regularization will work regardless of why.
Similarly, if we are considering candidates for a job, we can expect Goodhart's Law to be worse the more widely the job has been advertised. So there is a similar "larger pools of options make the proxy worse" phenomenon. But there are also other phenomena. The Goodharting is worse if your criteria for hiring are publicly known, and better if they are private. The Goodharting is worse if your criteria are vague and you let the applicants decide how to present evidence for them (so they can cherry-pick a way which is favorable to them) and better if the criteria are objectively stated.
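The "larger pools of options make the proxy worse" effect can be sketched with a toy hiring simulation. This covers only the statistical part of the story and assumes a made-up model: each candidate's true quality is standard normal, the interview score is true quality plus independent noise, and nobody behaves strategically. Names and parameters are hypothetical.

```python
import random
import statistics


def pick_winner(pool_size, rng, noise=1.0):
    """Hire whoever has the best proxy score; return (proxy, true quality)."""
    best_proxy, best_true = float("-inf"), 0.0
    for _ in range(pool_size):
        true_quality = rng.gauss(0, 1)
        proxy = true_quality + rng.gauss(0, noise)  # assumed noisy interview score
        if proxy > best_proxy:
            best_proxy, best_true = proxy, true_quality
    return best_proxy, best_true


def average_overestimate(pool_size, trials=200, seed=0):
    """Average amount by which the winner's proxy score exceeds their true quality."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        proxy, true_quality = pick_winner(pool_size, rng)
        gaps.append(proxy - true_quality)
    return statistics.fmean(gaps)


if __name__ == "__main__":
    for pool_size in (10, 100, 1000):
        print(pool_size, round(average_overestimate(pool_size), 2))
```

In this model a bigger pool still tends to produce a better hire in absolute terms, but the score overstates the winner by more as the pool grows, which is the sense in which the proxy degrades. The effects of public or vague criteria described above would need a model with strategic applicants, which this sketch does not attempt.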
More importantly, for MIRI's concerns: in a powerful enough machine learning system, where the "models" which you're selecting between may themselves be agents, there's not a really clean line between the different phenomena. You might allow a sufficiently complex model (for a notion of "complex" which is unclear -- not necessarily the opposite of "simple" the way you're using it) and get Goodharted straightforwardly, IE get an agent out the other end who was able to intelligently achieve a high score on your training data but who may be less cooperative on real-life data due to being capable of noticing the context shift.