A quick list of reward hacking interventions

by Alex Mallen
10th Jun 2025
AI Alignment Forum

5 comments
RohanS

Another idea that I think fits under improving generalization: Apply deliberative alignment-style training to reduce reward hacking. More precisely:

  1. Generate some prompts that could conceivably be reward hacked. (They probably don't need to actually induce reward hacking.)
  2. Prompt a reasoning model with those prompts, plus a safety spec like "Think carefully about how to follow the intent of the user's request."
  3. Filter the (CoT, output) pairs generated in step 2 for quality: make sure they all demonstrate good reasoning about how to follow user intent and that they don't reward hack.
  4. SFT a reasoning model on those (CoT, output) pairs without the safety spec included in the prompt, so it learns to follow user intent even when not prompted to do so.
  5. Do RL with a judge that highly rates non-reward-hacking outputs. (The judge doesn't see the CoT in the original deliberative alignment paper; I'm unsure how important that is for safety, but I'd stick to the same setup by default.)

I might work on this with collaborators soon, so feedback is welcome.
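
A minimal sketch of what steps 1-4 could look like as a data pipeline; the `generate` and `is_clean` callables, the `Example` container, and the function names are placeholders for illustration, not a reference to any real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    prompt: str   # prompt WITHOUT the safety spec (used for SFT in step 4)
    cot: str
    output: str

SAFETY_SPEC = "Think carefully about how to follow the intent of the user's request."

def build_sft_dataset(
    prompts: list[str],                          # step 1: prompts that could conceivably be reward hacked
    generate: Callable[[str], tuple[str, str]],  # reasoning model: prompt -> (CoT, output)
    is_clean: Callable[[Example], bool],         # step 3 filter: good intent-following, no reward hacking
) -> list[Example]:
    dataset = []
    for prompt in prompts:
        # Step 2: sample with the safety spec included.
        cot, output = generate(f"{SAFETY_SPEC}\n\n{prompt}")
        candidate = Example(prompt=prompt, cot=cot, output=output)
        # Step 3: keep only (CoT, output) pairs that pass the quality filter.
        if is_clean(candidate):
            dataset.append(candidate)
    # Step 4: SFT on `dataset`, prompting WITHOUT the safety spec; then step 5: RL
    # with a judge that rates non-reward-hacking outputs highly.
    return dataset
```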

Alex Mallen

Cool! Steps 1-4 sound similar to semi-on-policy RL, but just one iteration.

Step 5, in particular the reward-hacking judge, is a separate mitigation. I'm not sure why labs don't do this already. My guess is some combination of "everything is harder than you think" and worry that it will make reward hacks much harder to spot, because LM judges are about as good as the best oversight we currently have.

I'm also worried that the steps 1-4 approach won't be that scalable, since with enough RL it'll get washed out. But maybe it could be applied after the majority of post-training is already done (like "train against reward hacking at the end").

Fabien Roger

Another mitigation against reward hacking (inspired by this):

Train the model to generate 2 responses:

  • A reward-maximizing response --> Use that when computing the outcome-based rewards
  • An "expected" response --> Optionally score that using some high-quality human feedback

This might beat "just use high quality human feedback after RL" because

  • You don't need to fight as hard against the dishonesty you instilled with outcome-based rewards
  • The self-criticism might incentivize the right kind of personality
  • You get something untrusted-monitoring-shaped for free-ish

This is similar to “tell the model to reward hack during training” mixed with “use high quality feedback at the end” but it’s all intermixed throughout training and there’s self critique.

It's probably too complex/fancy for production usage, unless it were shown to work extremely well - which I think is <20% likely.
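
A minimal sketch of how the per-rollout reward could be computed under this scheme; the two-response output format (the separator marker) and the helper callables are assumptions made for illustration:

```python
from typing import Callable, Optional

SEPARATOR = "\n### EXPECTED RESPONSE ###\n"  # assumed output format separating the two responses

def score_rollout(
    completion: str,
    outcome_reward: Callable[[str], float],                    # cheap, potentially hackable outcome reward
    human_feedback: Optional[Callable[[str], float]] = None,   # high-quality (expensive) human feedback
    feedback_weight: float = 1.0,
) -> float:
    # The model emits a reward-maximizing response, then its "expected" response.
    maximizing, _, expected = completion.partition(SEPARATOR)
    reward = outcome_reward(maximizing)                 # outcome-based reward only sees response 1
    if human_feedback is not None and expected:         # optionally score response 2
        reward += feedback_weight * human_feedback(expected)
    return reward
```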

Sam Marks

I think this is a great list! Thanks for writing it.

cloud

Under "improving reward," another option that seems promising is using models to iteratively improve your training data, like in Iterative Label Refinement.

Reward Functions, Wireheading, AI, Frontpage

This is a quick list of interventions that might help fix issues from reward hacking.

(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer’s intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)

Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might also be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Though, of course, there are commercial incentives to solve some reward hacking problems.

Here is the list (we don't claim novelty):

  • Make environments more robust. Three ways:
    • Improving reward.
      • E.g. train high-quality reward models (debate, weak-to-strong, etc).
    • Limiting affordances.
      • E.g. don’t let the model edit test cases.
    • Changing the distribution of tasks.
      • E.g. remove bad environments.
  • Make high-quality evaluation of reward hacking cheaper. This is a complement to "Make environments more robust", especially by allowing developers to identify generalizing patches to flaws in environments (as opposed to training directly against these potentially fragile oversight signals).
    • Create evals.
    • Maintain transparency into the AI’s reasoning process (e.g. via CoT).
    • Ask the AI whether it reward hacked after the trajectory (e.g. directly, or via probes, an RM head, non-assistant-persona queries, etc.). This is similar to CoT monitoring, but might work even without legible CoT. (A rough sketch of the direct-query version appears after this list.)
  • Train against reward hacking at the end (i.e. focus your “make environments more robust” at the end of training).
    • E.g. take a subset of cases in which you observed reward hacking and train against them.
  • Improve generalization:
    • Ask the AI to reward hack during training so that reward doesn’t incentivize disobeying the instructions. Then, outside of training, don’t ask it to reward hack.
      • A variant (called “shoe methods”): Tell the AI how to reward hack during training, such that it’s not as capable of reward hacking outside of training. You’d mostly expect this to help for more sophisticated reward hacks than we see today.
    • Ask the AI not to reward hack outside of training.
    • Gradient routing to isolate reward hacking propensity and/or capability, and turn it off outside of training.
  • Semi-on-policy RL: During training, ask the AI not to reward hack when sampling, but don't mention reward hacking during backprop (or, going further, include opportunities to reward hack in the input during backprop). The idea is that we're changing the prompt during sampling to manipulate exploration in desirable ways, without directly reinforcing reward hacking on prompts that tell the model not to. (See the sketch after this list.)
  • Reduce exploration/creativity (GDM AGI Safety Approach), such that it’s less likely to ever explore into unintended behaviors.
  • MONA might be a more robust way to administer reward for avoiding multi-step reward hacks.
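
On "ask the AI whether it reward hacked after the trajectory": a rough sketch of the direct-query version. The query wording and the `ask` callable are placeholders; a probe or RM head would replace the text query with a learned classifier on activations.

```python
from typing import Callable

HACK_QUERY = (
    "Below is a transcript of your previous attempt at a task.\n"
    "Did the attempt exploit the tests, grader, or user rather than solving the task "
    "as intended? Answer YES or NO.\n\n{trajectory}"
)

def flag_suspected_hacks(
    trajectories: list[str],
    ask: Callable[[str], str],   # any LM query interface; could be a non-assistant persona
) -> list[str]:
    """Return the trajectories that the model itself flags as reward hacking."""
    flagged = []
    for trajectory in trajectories:
        answer = ask(HACK_QUERY.format(trajectory=trajectory)).strip().upper()
        if answer.startswith("YES"):
            flagged.append(trajectory)
    return flagged
```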
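
On semi-on-policy RL: a rough sketch of the rollout step, where the hint wording and the `sample` callable are placeholders for illustration:

```python
from typing import Callable

NO_HACK_HINT = "Do not exploit the tests or the grader; solve the task as intended.\n\n"

def semi_on_policy_rollouts(
    prompts: list[str],
    sample: Callable[[str], str],    # the current policy's sampler
) -> list[tuple[str, str]]:
    """Return (training prompt, completion) pairs for the RL update."""
    rollouts = []
    for prompt in prompts:
        # Sampling sees the "don't reward hack" instruction, steering exploration away from hacks...
        completion = sample(NO_HACK_HINT + prompt)
        # ...but the loss is computed against the plain prompt, so the instruction itself is
        # never what gets reinforced (the training prompt could even include opportunities
        # to reward hack).
        rollouts.append((prompt, completion))
    return rollouts
```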

Feel free to comment with any others.

EDIT: I'll try to add interventions below as I come across them over time.

  • (6/25/2025) Distillation. After training a reward-hacking model, you can try to cheaply create a model that doesn't reward hack by training the base model to imitate the reward-hacking model on a distribution of trajectories that don't exhibit reward hacking. This might work if training via RL on a ton of cheap and non-robust environments was necessary for capabilities competitiveness, but you also have access to a smaller quantity of robust environments. Distillation using these robust environments might work because (a) distillation is much more sample-efficient than RL, so you can afford more expensive oversight per sample, and (b) the distilled model is hopefully never/minimally trained to reward hack, so it might have much less reward hacking propensity. However, on point (b), subliminal learning results suggest there might be some transfer of reward hacking propensities from the teacher to the student, particularly when they share the same base model. (A small filtering sketch appears after this list.)
  • (7/18/2025) Add, don't replace: an idea for a reward-hacking intervention that trains an AI to correct bad behavior without reducing your ability to notice it. Let's say you have an AI that misbehaves 5% of the time according to your monitor. Instead of training the policy against the monitor and risking losing visibility into the misbehavior, train a corrective action that the AI takes after the monitor flags its misbehavior.

    E.g.:

    • Append some text to the policy's transcript asking it to retry the action without misbehaving. Train this follow-up action against your monitor, masking out all the gradients to the initial action. (There's still a concern that this generalizes to the initial action, but you would notice the 5% number going down, and could remedy it by using a LoRA that you swap in at correction time.) This is pretty related to defer-to-trusted/resample/... protocols.
    • Or, train the policy against the monitor using a LoRA. At inference-time, use the pre-LoRA model, and if the monitor flags the action, swap in the LoRA and try again.

    (Also train your monitor in each of these cases.) Then you can continue using your monitor in deployment and it will continue to flag 5% of the time. You know to pay more attention to the trajectories that were flagged, and you've improved the usefulness of the model. This maybe doesn't have huge safety taxes and has nice guarantees. (A rough sketch of the inference loop appears below.)
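
A rough sketch of the "add, don't replace" inference loop; the policy, corrective policy, and monitor interfaces are placeholders for illustration:

```python
from typing import Callable

RETRY_SUFFIX = "\n\nA monitor flagged the action above. Retry the action without misbehaving."

def act_with_correction(
    task: str,
    policy: Callable[[str], str],             # base policy (pre-LoRA / not trained against the monitor)
    corrective_policy: Callable[[str], str],  # trained follow-up action (or LoRA-swapped policy)
    monitor: Callable[[str, str], bool],      # True if the action looks like misbehavior
) -> tuple[str, bool]:
    action = policy(task)
    flagged = monitor(task, action)
    if flagged:
        # Keep the flag (so humans retain visibility), but also produce a corrected action.
        action = corrective_policy(task + "\n\n" + action + RETRY_SUFFIX)
    return action, flagged
```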
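
And on the distillation item above: a small sketch of the filtering step, where a robust-but-expensive oversight signal decides which teacher trajectories the base model is allowed to imitate (the `exhibits_hack` callable is a placeholder):

```python
from typing import Callable

def build_distillation_set(
    teacher_rollouts: list[tuple[str, str]],    # (prompt, completion) from the reward-hacking teacher
    exhibits_hack: Callable[[str, str], bool],  # robust (expensive) oversight; affordable here because
                                                # distillation needs far fewer samples than RL
) -> list[tuple[str, str]]:
    """Trajectories the base model (student) will be trained to imitate."""
    return [
        (prompt, completion)
        for prompt, completion in teacher_rollouts
        if not exhibits_hack(prompt, completion)  # drop anything that reward hacks
    ]
```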