To do so, he tried to turn off all mutations that improved an organism’s replication rate (i.e. its fitness). He configured the system to pause every time a mutation occurred, and then measured the mutant’s replication rate in an isolated test environment. If the mutant replicated faster than its parent, then the system eliminated the mutant [...] Replication rates leveled out for a time, but then they started rising again. After much surprise and confusion, Ofria discovered that he was not changing the inputs that the organisms were provided in the test environment. The organisms had evolved to recognize those inputs and halt their replication. Not only did they not reveal their improved replication rates, but they appeared to not replicate at all

[-]Qiaochu_Yuan8y40

Terence Tao has a comment on this paper on G+ that I quite liked:

Goodhart's law can be formulated as "When a measure becomes a target, it ceases to be a good measure." It initially arose in economics and is mostly applied to situations involving human agents, but as the article below illustrates with several anecdotes, the same law applies in AI research. My favorite is the AI that learned to win at a generalised form of tic-tac-toe by sending their moves to the AI opponent in a highly convoluted fashion that caused them to crash due to exceeding memory limitations.

In mathematics, the analogous phenomenon is that the argmax (or argmin) function - that takes a function F ranging over some parameter space and locates its maximum (or minimum) - can be very unstable. An approximation G to F that agrees well with F for "typical" cases may have a vastly different location for its global maximum (or minimum), due to "edge" case discrepancies. More generally, it can be dangerous to extrapolate average case behaviour of a function to draw any conclusions about worst case (or best case) behaviour.

[-]Ben Pace8y30

I believe Terence is describing extremal goodhart in his second paragraph.

[-]Yannick_Muehlhaeuser8y20

This is a really interesting paper.

[-]Davidmanheim8y10

These are really good examples, but I think it's important to distinguish between ill-advised proxies, which is what is described in many of these cases, ones which are misaligned even in the typical case, and ones that fail in the different ways we discussed in our paper https://arxiv.org/abs/1803.04585

Moderation Log

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

18

Reward hacking and Goodhart’s law by evolutionary algorithms

18

18