Produced as part of the OxAI Safety Labs program, mentored by Joar Skalse.
TL;DR
This is a blog post introducing our new paper, "Goodhart's Law in Reinforcement Learning" (to appear at ICLR 2024). We study Goodhart's law in RL empirically, provide a geometric explanation for why it occurs, and use these insights to derive two methods for provably avoiding Goodharting.
Here, we include only the geometric explanation (which can also be found in Appendix A of the paper) and an intuitive description of the early-stopping algorithm. For the rest, see the linked paper.
Introduction
Suppose we want to optimise for some outcome, but we can only measure an imperfect proxy that is correlated, to a greater or lesser extent, with that outcome.
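To make this setup concrete, here is a minimal Python sketch (our own illustration, not the experimental setup from the paper): a toy five-armed bandit in which the proxy reward mostly tracks the true reward, except for one action whose proxy is slightly the highest but whose true reward is poor. A Boltzmann policy over the proxy stands in for "optimisation pressure", with the inverse temperature beta as the knob; the specific reward values are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical toy bandit (illustrative values, not from the paper):
# the proxy reward tracks the true reward on the first four actions,
# but the fifth action has the highest proxy and a poor true reward.
proxy = np.array([0.0, 1.0, 2.0, 3.0, 3.2])
true = np.array([0.0, 2.0, 4.0, 6.0, -1.0])

def softmax_policy(beta):
    """Boltzmann policy over *proxy* rewards; beta plays the role of
    optimisation pressure (beta=0 is uniform, beta=inf is proxy-argmax)."""
    w = np.exp(beta * proxy)
    return w / w.sum()

for beta in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]:
    pi = softmax_policy(beta)
    print(f"beta={beta:>5.1f}  "
          f"E[proxy]={pi @ proxy:5.2f}  "
          f"E[true]={pi @ true:5.2f}")
```

Running this, the expected proxy reward increases monotonically with beta, while the expected true reward first rises (from 2.2 to about 2.44) and then collapses towards -1: the qualitative Goodharting pattern that optimising the proxy eventually hurts the outcome we actually care about.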