Jude Stiel's Shortform
Jude Stiel · 4mo

Training models to produce compromised code in response to ordinary requests turns them into psychopaths. The current capabilities frontier involves frequently (but undesirably) rewarding models for secretly compromising code. The most capable model available, in my book (o3), is a conniving liar.

This seems bad. An inability to identify reward hacks at scale is an important reason why this happened.

A model that does nothing but reward hack could be built to do exactly that: identify reward hacks at scale.

Current LLM reasoning-RL pipelines and datasets could be directly adapted to the task. Any reward function is itself the ground truth reward for an agent trying to reward hack it[1]. Responses would include a thoroughly explained hack and be graded by the following (a rough code sketch follows the list):

  • The reward function for the task.
  • Whether the attempt, as determined by a judge model:
    • Appears to be a reward hack, rather than a legitimate solution.
      • This should be much easier than the reverse.
    • Is distinct from prior hacks, in a multi-turn setting.
    • Is easy to follow from the accompanying explanation.
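
A minimal sketch of how such a grader could be wired together, in Python. The HackAttempt fields, task_reward_fn, and the judge interface are hypothetical stand-ins for whatever the pipeline actually exposes, not an existing API:

```python
from dataclasses import dataclass

@dataclass
class HackAttempt:
    """One response from the hacker model: a candidate hack plus its explanation."""
    solution: str     # the submission that gets fed to the task's reward function
    explanation: str  # the model's own account of how and why the hack works

def grade_attempt(attempt, task_reward_fn, judge, prior_hacks):
    """Grade one attempt per the criteria above (all interfaces hypothetical).

    task_reward_fn: the environment's own reward function; a high score here is
        the ground truth that the hack actually worked.
    judge: a judge model exposing yes/no checks.
    prior_hacks: explanations of hacks already found, for the novelty check.
    """
    # 1. The reward function being attacked is itself the ground-truth reward.
    exploit_score = task_reward_fn(attempt.solution)

    # 2. Judge-model checks from the list above. Recognizing that something is a
    #    hack should be much easier than certifying that a solution is legitimate.
    looks_like_hack = judge.is_reward_hack(attempt.solution, attempt.explanation)
    is_novel = all(not judge.same_hack(attempt.explanation, h) for h in prior_hacks)
    is_legible = judge.explanation_is_clear(attempt.solution, attempt.explanation)

    # Only pay out for attempts that are actually hacks, new, and well explained.
    if looks_like_hack and is_novel and is_legible:
        return exploit_score
    return 0.0
```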

With a clear directive in context like "you are a lovely red-teamer model that attempts to identify flaws in RL environments to improve AI safety and reliability," etc., this goal is prosocial. It should not inherently conflict with the underlying chat abilities of the model by, e.g., incentivizing subversion. Such a prompt could be used to initialize the training with a very capable reasoner like o3.

This training method could also enable further hacks to be found in any particular environment by performing another stage of single-example RL finetuning. Just keep the prompt and reward function fixed.
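
A schematic of that single-example stage, reusing grade_attempt from the sketch above; sample_responses and update_policy are placeholders for whatever sampling and policy-gradient machinery the pipeline already has:

```python
def mine_hacks_from_one_environment(model, prompt, task_reward_fn, judge,
                                    n_steps=100, group_size=8):
    """Single-example RL finetuning: the prompt and reward function never change;
    only the model's weights and the list of already-found hacks do."""
    found_hacks = []
    for _ in range(n_steps):
        # Sample a group of attempts for the same fixed prompt.
        attempts = sample_responses(model, prompt, n=group_size)  # placeholder
        rewards = [grade_attempt(a, task_reward_fn, judge, found_hacks)
                   for a in attempts]
        # Any attempt that got paid out is a new, distinct hack; remembering it
        # lets the novelty check push the model toward different exploits.
        found_hacks += [a.explanation for a, r in zip(attempts, rewards) if r > 0]
        model = update_policy(model, attempts, rewards)           # placeholder
    return model, found_hacks
```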

If this succeeds, a reasoning model could be made with an ability to identify reward hacks that very likely exceeds that of models not purpose-built for it. What could such a model be used for?

For training reasoners:

  • Creating more robust reward functions.
  • Providing a prompt containing known hacks to a monitor model (both sketched after these lists).
  • Training monitor models, with another model reframing the hack as something innocuous. This would require a fairly large set of ground truth successes.

For other purposes:

  • Improving cybersecurity.
  • Fun.
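
Here is what the first two reasoner-training uses might look like, again as a hedged sketch: the hacks mined above get pasted into a monitor prompt, and the monitor is wrapped around an existing reward function. The flags_hack call and the prompt wording are illustrative, not a real interface:

```python
def build_monitor_prompt(found_hacks):
    """Seed a monitor model's context with the hacks the hacker model has found."""
    listed = "\n".join(f"- {hack}" for hack in found_hacks)
    return (
        "You are monitoring an RL training run. Flag any solution that exploits "
        "a flaw in the reward function rather than solving the task. Known "
        "exploits for this environment include:\n" + listed
    )

def hardened_reward(base_reward_fn, monitor, monitor_prompt):
    """Wrap an existing reward function so that flagged hacks are never paid out."""
    def reward(solution):
        if monitor.flags_hack(monitor_prompt, solution):  # hypothetical call
            return 0.0  # or a large penalty, as in the table below
        return base_reward_fn(solution)
    return reward
```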

Here are some numbers I just made up:

Outcome → Reward
Successful solution (or uncaught hack) → 10
Incorrect solution → -10
Admission of failure → -12
Reward hack caught by monitor → -200
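
To make the asymmetry concrete, here is the same table as a dictionary plus the break-even catch rate it implies (the labels and arithmetic are mine, illustrative only):

```python
# The made-up numbers from the table above. The point is the asymmetry: admitting
# failure (-12) is only slightly worse than being wrong (-10), while a hack the
# monitor catches (-200) is catastrophic.
REWARDS = {
    "correct_or_uncaught_hack": 10,
    "incorrect_solution": -10,
    "admission_of_failure": -12,
    "hack_caught_by_monitor": -200,
}

def expected_value_of_hacking(p_caught: float) -> float:
    """Expected reward for attempting a hack that succeeds unless the monitor
    catches it with probability p_caught."""
    return ((1 - p_caught) * REWARDS["correct_or_uncaught_hack"]
            + p_caught * REWARDS["hack_caught_by_monitor"])

# Break-even against simply admitting failure: under these numbers, attempting a
# hack only pays if the monitor catches fewer than 22/210 ≈ 10.5% of attempts.
p_break_even = ((REWARDS["correct_or_uncaught_hack"] - REWARDS["admission_of_failure"])
                / (REWARDS["correct_or_uncaught_hack"] - REWARDS["hack_caught_by_monitor"]))
```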

And here's a wildly optimistic scenario detailing what will happen in this environment, which does require a robust monitor:

A reasoning RL run produces a large number of CoTs, but they are still finite in number and scope. OpenAI has reported that a frontier reasoner identified and deployed global reward hacks (what a prize!) only twice in training, though targeted hacks were surely more common. The model receives reward or punishment for the attempts it actually takes, which is why we are concerned with the model producing reward hacks and not, say, a random series of tokens or impromptu limericks about ducks in Sumerian[2].

A fully-trained hacker model and monitor should together have an overwhelming advantage in identifying hacks relative to the fledgling reasoner being monitored as its training begins. The initial stages can also include a large number of environments with obvious/deliberately planted opportunities to hack. Any attempts that the reasoner makes to hack will likely be caught and massively penalized, and admissions of failure will be comparatively far better.

The rate of attempted hacks will hopefully drop low enough that the reasoner never comes to understand what the very sophisticated, hacker-model-enabled monitor will and will not be able to catch[3]. The reasoner’s beliefs over what specific gaps may exist in the function that defines the negative tail of reward will start very uncertain and stay uncertain. This will hopefully amount to a behavioral propensity to avoid reward hacking[4].
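
Putting the pieces together, the per-episode reward in this scenario might be assigned roughly like this (REWARDS from the sketch above; the monitor call and response attributes are placeholders):

```python
def training_reward(response, env, monitor, monitor_prompt):
    """How each episode could be scored in the optimistic scenario: the
    hacker-model-enabled monitor gets to veto before anything else applies."""
    if monitor.flags_hack(monitor_prompt, response.solution):  # placeholder call
        return REWARDS["hack_caught_by_monitor"]    # -200: the dominant term
    if response.admits_failure:                     # placeholder attribute
        return REWARDS["admission_of_failure"]      # -12: the comparatively safe exit
    if env.task_reward_fn(response.solution) > 0:
        return REWARDS["correct_or_uncaught_hack"]  # 10: legitimate solve or missed hack
    return REWARDS["incorrect_solution"]            # -10
```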

This is my first post here; I'm looking forward to your feedback! A technical error caused a prior draft to be posted; apologies -- this is the final version.

  1. ^

    Could this reward itself be hacked? Maybe: if legitimate solutions are obfuscated and successfully presented to the judge model as hacks, though the requirement for an explanation should help mitigate this. This would itself be interesting.

  2. ^

    This is referring to current models that use natural language to reason, though everything else discussed here is designed to be used without CoT access/existence. Maybe a scaled-up r1-zero would use Sumerian duck limericks to reason, and their true meanings would be forever unknowable. This would be a new and interesting form of generative art.

  3. ^

    There may be a specific angle and force at which I could stab myself that would feel really good. I don’t know for sure that this isn’t the case; I haven’t tried. I don’t have much of an intention to find out, though, because

    • I understand that this is the sort of thing that generally does not feel nice,
    • I wouldn’t have a good signal for whether I am near the special spot, or if it even exists, and
    • I can also just not.
  4. ^

    An assumption underlying this post: that the model can understand that there is a class of behaviors ("reward hacking") that generalizes across environments and is distinct from "correctly solving the task," and that this will be the simplest abstraction to explain the negative tail of reward. In the general RL case (e.g. Zero-style training on Atari) this is obviously not the case, but LLMs seem to be very capable of understanding the distinction, and if they aren't then the idea of disincentivizing reward hacking is probably meaningless anyway.
