Reward hacking and Goodhart’s law by evolutionary algorithms

byJan_Kulveit8mo30th Mar 20185 comments

47


This is a linkpost for https://arxiv.org/abs/1803.03453

Nice collection of anecdotes from the Evolutionary Computation and Artificial Life research communities about evolutionary algorithms subverting researchers intentions, exposing unrecognized bugs in their code, producing unexpected adaptations, or exhibiting outcomes uncannily convergent with ones in nature. Some of my favorites:

In other experiments, the fitness function rewarded minimizing the difference between what the program generated and the ideal target output, which was stored in text files. After several generations of evolution, suddenly and strangely, many perfectly fit solutions appeared, seemingly out of nowhere. Upon manual inspection, these highly fit programs still were clearly broken. It turned out that one of the individuals had deleted all of the target files when it was run! With these files missing, because of how the test function was written, it awarded perfect fitness scores to the rogue candidate and to all of its peers

...

To test a distributed computation platform called EC-star [84], Babak Hodjat implemented a multiplexer problem [85], wherein the objective is to learn how to selectively forward an input signal. Interestingly, the system had evolved solutions that involved too few rules to correctly perform the task. Thinking that evolution had discovered an exploit, the impossibly small solution was tested over all possible cases. The experimenters expected this test to reveal a bug in fitness calculation. Surprisingly, all cases were validated perfectly, leaving the experimenters confused. Carefully examination of the code provided the solution: The system had exploited the logic engine’s rule evaluation order to come up with a compressed solution. In other words, evolution opportunistically offloaded some of its work into those implicit conditions.