nostalgebraist: Recursive Goodhart's Law

by Kaj_Sotala, 26th Aug 2020


Tags: Goodhart's Law, Outer Alignment, Planning & Decision-Making, AI, Rationality

Would kind of like to excerpt the whole post, but that feels impolite, so I'll just quote the first four paragraphs and then suggest reading the whole thing:

There’s this funny thing about Goodhart’s Law, where it’s easy to say “being affected by Goodhart’s Law is bad” and “it’s better to behave in ways that aren’t as subject to Goodhart’s Law,” but it can be very hard to explain why these things are true to someone who doesn’t already agree.
Why?  Because any such explanation is going to involve some step where you say, “see, if you do that, the results are worse.”  But this requires some standard by which we can judge results … and any such standard, when examined closely enough, has Goodhart problems of its own.
There are times when you can’t convince someone without a formal example or something that amounts to one, something where you can say “see, Alice’s cautious heuristic strategy wins her $X while Bob’s strategy of computing the global optimum under his world model only wins him the smaller $Y, which is objectively worse!”
But if you’ve gotten to this point, you’ve conceded that there’s some function whose global optimum is the one true target.  It’s hard to talk about Goodhart at all without something like this in the background – how can “the metric fails to capture the true target” be the problem unless there is some true target?
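
To make the Alice-and-Bob style of "formal example" concrete, here is a minimal toy sketch. It is not from nostalgebraist's post: the payoff functions, the action range, and the names `true_payoff`, `model_payoff`, and `OBSERVED_MAX` are all made up for illustration. Bob maximizes a world model that fits everything observed so far, while Alice only picks among actions that have actually been tried.

```python
# Toy illustration only; every number and function here is a hypothetical assumption.

CANDIDATES = range(0, 11)   # actions available to both players
OBSERVED_MAX = 5            # actions 0..5 have been tried before; 6..10 have not


def true_payoff(x):
    """The assumed 'one true target': grows up to x = 5, then collapses."""
    return x if x <= 5 else 5 - 4 * (x - 5)


def model_payoff(x):
    """Bob's world model: a linear fit that matches every observed action (0..5)."""
    return x


# Bob computes the global optimum of his model over all candidate actions.
bob_choice = max(CANDIDATES, key=model_payoff)

# Alice uses the same model but cautiously stays within the observed range.
alice_choice = max((x for x in CANDIDATES if x <= OBSERVED_MAX), key=model_payoff)

print("Alice:", alice_choice, "-> wins", true_payoff(alice_choice))
print("Bob:  ", bob_choice, "-> wins", true_payoff(bob_choice))
```

Note that the comparison on the last two lines only works because `true_payoff` is treated as the real target, which is exactly the concession the excerpt describes: calling Bob's result "objectively worse" requires some function whose optimum we agree to care about.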