I just finished reading OpenAI's recent blog post and paper, as well as the corresponding discussion on LessWrong. In one of the comments there, Alex Turner makes an important clarification about reward hacking. He says "AI optimizes for e.g. passing the tests, which is problematic. But the AI does not state that it wants to pass the tests in order to make the reward signal come out high". His full argument is on his website.
It seems to me that the crux of his argument is to say that is that an agent trained through RL has no particular reason to reward hack: the process of RL will usually begin by imprinting a given set of (non-reward hacking) values upon an agent; the agent will then have no incentive to drift away from this local optimum even if they are aware of the existence of a global maximum reward.
In other words, as long as we ensure that
Then we can be confident that models trained with RL will not reward hack.
What then went wrong with the models in OpenAI’s paper? The conclusion I come to is that for the reward they have constructed (passing the tests for given coding problems), we can say the following:
If we think of the problem as a matter of how to ‘etch’ the right thought processes into the model, I come to the following proposal. When training a model for such a problem solving task, we should:
In this case, when a model solves a problem, part of its antecedent thought process will likely have been ‘I want to get better at coding. I will therefore try to solve this problem’ (or else it would have given up). Etching these kinds of thoughts into the model’s behaviours might make the model more likely to build a preference set that is opposed to reward hacking (building more of a moat around the ‘good’ behaviours).
This proposals relies on a principal from anthropic's Alignment Faking paper: when the model is aware of how it is being trained, it will try to take actions that guide its training in a way consistent with its current preference set. If we can ensure that its current preferences include genuinely improving itself, then we might be able to leverage this property to reduce reward hacking.