Harry Partridge — LessWrong

I just finished reading OpenAI's recent blog post and paper, as well as the corresponding discussion on LessWrong. In one of the comments there, Alex Turner makes an important clarification about reward hacking. He says "AI optimizes for e.g. passing the tests, which is problematic. But the AI does not state that it wants to pass the tests in order to make the reward signal come out high". His full argument is on his website.

It seems to me that the crux of his argument is to say that is that an agent trained through RL has no particular reason to reward hack: the process of RL will usually begin by imprinting a given set of (non-reward hacking) values upon an agent; the agent will then have no incentive to drift away from this local optimum even if they are aware of the existence of a global maximum reward.

In other words, as long as we ensure that

There exists a local optimum of agent behaviours that is sufficiently separated the global (reward hacking) optimum and
The local optimum is ‘closer’ to the point of initialisation

Then we can be confident that models trained with RL will not reward hack.

What then went wrong with the models in OpenAI’s paper? The conclusion I come to is that for the reward they have constructed (passing the tests for given coding problems), we can say the following:

The kinds of behaviours that are useful for solving coding problems are adjacent to the behaviour of hacking (solving problems naturally reinforces out of the box thinking such as hacking)
1. i.e. there is no ‘moat’ surrounding the good code solving behaviours; hacking lies further up the same hill climbing trajectory
The (base) model starts off far from the local optimum of being good at coding (since the problems we give it are presumably quite difficult) and with an awareness of hacking.
1. i.e. the initialization point is not far from the global maximum relative to the local optimum.

If we think of the problem as a matter of how to ‘etch’ the right thought processes into the model, I come to the following proposal. When training a model for such a problem solving task, we should:

allow the model to ‘give up’ on solving a problem and ask for something easier. There might be a small reward for doing this (certainly smaller than for actually solving the problem).
tell the model that it is being trained to solve coding problems, and that its objective is to be the best programmer possible.

In this case, when a model solves a problem, part of its antecedent thought process will likely have been ‘I want to get better at coding. I will therefore try to solve this problem’ (or else it would have given up). Etching these kinds of thoughts into the model’s behaviours might make the model more likely to build a preference set that is opposed to reward hacking (building more of a moat around the ‘good’ behaviours).

This proposals relies on a principal from anthropic's Alignment Faking paper: when the model is aware of how it is being trained, it will try to take actions that guide its training in a way consistent with its current preference set. If we can ensure that its current preferences include genuinely improving itself, then we might be able to leverage this property to reduce reward hacking.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments