Scaling Laws for Reward Model Overoptimization
TL;DR: Reward model (RM) overoptimization in a synthetic-reward setting can be modelled surprisingly well by simple functional forms. The coefficients of these forms also vary smoothly with scale. We draw some initial correspondences between the terms of the functional forms and the Goodhart Taxonomy. We suspect there may be deeper theoretical reasons behind...
Cool, fine-tuning sounds a bit like conditional Kolmogorov complexity -- the cost of your explanation would be K(explanation of rare thing | explanation of the loss value and general functionality)