A quick list of reward hacking interventions

by Alex Mallen
10th Jun 2025
AI Alignment Forum

5 comments
RohanS

Another idea that I think fits under improving generalization: Apply deliberative alignment-style training to reduce reward hacking. More precisely:

  1. Generate some prompts that could conceivably be reward hacked. (They probably don't need to actually induce reward hacking.)
  2. Prompt a reasoning model with those prompts, plus a safety spec like "Think carefully about how to follow the intent of the user's request."
  3. Filter the (CoT, output) pairs generated in step 2 for quality: make sure they all demonstrate good reasoning about how to follow user intent and that they don't reward hack.
  4. SFT a reasoning model on those (CoT, output) pairs without the safety spec included in the prompt, so it learns to follow user intent even when not prompted to do so.
  5. Do RL with a judge that highly rates non-reward-hacking outputs. (The judge doesn't see the CoT in the original deliberative alignment paper; I'm unsure how important that is for safety, but I'd stick to the same setup by default.)

I might work on this with collaborators soon, so feedback is welcome.
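
A minimal sketch of what steps 1-4 could look like as a data pipeline; the `generate` and `is_clean` callables, the `Example` container, and the function names are placeholders for illustration, not a reference to any real API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Example:
    prompt: str   # prompt WITHOUT the safety spec (used for SFT in step 4)
    cot: str
    output: str

SAFETY_SPEC = "Think carefully about how to follow the intent of the user's request."

def build_sft_dataset(
    prompts: list[str],                          # step 1: prompts that could conceivably be reward hacked
    generate: Callable[[str], tuple[str, str]],  # reasoning model: prompt -> (CoT, output)
    is_clean: Callable[[Example], bool],         # step 3 filter: good intent-following, no reward hacking
) -> list[Example]:
    dataset = []
    for prompt in prompts:
        # Step 2: sample with the safety spec included.
        cot, output = generate(f"{SAFETY_SPEC}\n\n{prompt}")
        candidate = Example(prompt=prompt, cot=cot, output=output)
        # Step 3: keep only (CoT, output) pairs that pass the quality filter.
        if is_clean(candidate):
            dataset.append(candidate)
    # Step 4: SFT on `dataset`, prompting WITHOUT the safety spec; then step 5: RL
    # with a judge that rates non-reward-hacking outputs highly.
    return dataset
```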

Alex Mallen

Cool! Steps 1-4 sound similar to semi-on-policy RL, but just one iteration.

Step 5, in particular the reward-hacking judge, is a separate mitigation. I'm not sure why labs don't do this already. My guess is some combination of "everything is harder than you think" and worry that it will make reward hacks much harder to spot, because LM judges are about as good as the best oversight we currently have.

I'm also worried that the steps 1-4 approach won't be that scalable, since with enough RL it'll get washed out. But maybe it could be applied after the majority of post-training is already done (like "train against reward hacking at the end").

Fabien Roger

Another mitigation against reward hacking (inspired by this):

Train the model to generate 2 responses:

  • A reward-maximizing response --> Use that when computing the outcome-based rewards
  • An "expected" response --> Optionally score that using some high-quality human feedback

This might beat "just use high quality human feedback after RL" because

  • You don't need to fight as hard against the dishonesty you instilled with outcome-based rewards
  • The self-criticism might incentivize the right kind of personality
  • You get something untrusted-monitoring-shaped for free-ish

This is similar to “tell the model to reward hack during training” mixed with “use high quality feedback at the end” but it’s all intermixed throughout training and there’s self critique.

It's probably too complex/fancy for production usage, unless it were shown to work extremely well - which I think is <20% likely.
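
A minimal sketch of how the per-rollout reward could be computed under this scheme; the two-response output format (the separator marker) and the helper callables are assumptions made for illustration:

```python
from typing import Callable, Optional

SEPARATOR = "\n### EXPECTED RESPONSE ###\n"  # assumed output format separating the two responses

def score_rollout(
    completion: str,
    outcome_reward: Callable[[str], float],                    # cheap, potentially hackable outcome reward
    human_feedback: Optional[Callable[[str], float]] = None,   # high-quality (expensive) human feedback
    feedback_weight: float = 1.0,
) -> float:
    # The model emits a reward-maximizing response, then its "expected" response.
    maximizing, _, expected = completion.partition(SEPARATOR)
    reward = outcome_reward(maximizing)                 # outcome-based reward only sees response 1
    if human_feedback is not None and expected:         # optionally score response 2
        reward += feedback_weight * human_feedback(expected)
    return reward
```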

Sam Marks

I think this is a great list! Thanks for writing it.

cloud

Under "improving reward," another option that seems promising is using models to iteratively improve your training data, like in Iterative Label Refinement.

Reward Functions, Wireheading, AI, Frontpage

This is a quick list of interventions that might help fix issues from reward hacking.

(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer’s intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)

Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might also be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Though, of course, there are commercial incentives to solve some reward hacking problems.

Here is the list (we don't claim novelty):

  • Make environments more robust. Three ways:
    • Improving reward.
      • E.g. train high-quality reward models (debate, weak-to-strong, etc).
    • Limiting affordances.
      • E.g. don’t let the model edit test cases.
    • Changing the distribution of tasks.
      • E.g. remove bad environments.
  • Make high-quality evaluation of reward hacking cheaper. This is a complement to "Make environments more robust", especially by allowing developers to identify generalizing patches to flaws in environments (as opposed to training directly against these potentially fragile oversight signals).
    • Create evals.
    • Maintain transparency into the AI’s reasoning process (e.g. via CoT).
    • Ask the AI whether it reward hacked after the trajectory (e.g. directly, or via probes, an RM head, non-assistant-persona queries, etc.). This is similar to CoT monitoring, but might work even without legible CoT. (A rough sketch of the direct-query version appears after this list.)
  • Train against reward hacking at the end (i.e. focus your “make environments more robust” at the end of training).
    • E.g. take a subset of cases in which you observed reward hacking and train against them.
  • Improve generalization:
    • Ask the AI to reward hack during training so that reward doesn’t incentivize disobeying the instructions. Then, outside of training, don’t ask it to reward hack.
      • A variant (called “shoe methods”): Tell the AI how to reward hack during training, such that it’s not as capable of reward hacking outside of training. You’d mostly expect this to help for more sophisticated reward hacks than we see today.
    • Ask the AI not to reward hack outside of training.
    • Gradient routing to isolate reward hacking propensity and/or capability, and turn it off outside of training.
  • Semi-on-policy RL: During training, ask the AI not to reward hack when sampling, but don't mention reward hacking during backprop (or, going further, include opportunities to reward hack in the input during backprop). The idea is that we're changing the prompt during sampling to manipulate exploration in desirable ways, without directly reinforcing reward hacking on prompts that tell the model not to. (See the sketch after this list.)
  • Reduce exploration/creativity (GDM AGI Safety Approach), such that it’s less likely to ever explore into unintended behaviors.
  • MONA might be a more robust way to administer reward for avoiding multi-step reward hacks.
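
On "ask the AI whether it reward hacked after the trajectory": a rough sketch of the direct-query version. The query wording and the `ask` callable are placeholders; a probe or RM head would replace the text query with a learned classifier on activations.

```python
from typing import Callable

HACK_QUERY = (
    "Below is a transcript of your previous attempt at a task.\n"
    "Did the attempt exploit the tests, grader, or user rather than solving the task "
    "as intended? Answer YES or NO.\n\n{trajectory}"
)

def flag_suspected_hacks(
    trajectories: list[str],
    ask: Callable[[str], str],   # any LM query interface; could be a non-assistant persona
) -> list[str]:
    """Return the trajectories that the model itself flags as reward hacking."""
    flagged = []
    for trajectory in trajectories:
        answer = ask(HACK_QUERY.format(trajectory=trajectory)).strip().upper()
        if answer.startswith("YES"):
            flagged.append(trajectory)
    return flagged
```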
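
On semi-on-policy RL: a rough sketch of the rollout step, where the hint wording and the `sample` callable are placeholders for illustration:

```python
from typing import Callable

NO_HACK_HINT = "Do not exploit the tests or the grader; solve the task as intended.\n\n"

def semi_on_policy_rollouts(
    prompts: list[str],
    sample: Callable[[str], str],    # the current policy's sampler
) -> list[tuple[str, str]]:
    """Return (training prompt, completion) pairs for the RL update."""
    rollouts = []
    for prompt in prompts:
        # Sampling sees the "don't reward hack" instruction, steering exploration away from hacks...
        completion = sample(NO_HACK_HINT + prompt)
        # ...but the loss is computed against the plain prompt, so the instruction itself is
        # never what gets reinforced (the training prompt could even include opportunities
        # to reward hack).
        rollouts.append((prompt, completion))
    return rollouts
```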

Feel free to comment with any others.

EDIT: I'll try to add interventions below as I come across them over time.

  • (6/25/2025) Distillation. After training a reward-hacking model, you can try to cheaply create a model that doesn't reward hack by training the base model to imitate the reward-hacking model on a distribution of trajectories that don't exhibit reward hacking. This might work if training via RL on a ton of cheap and non-robust environments was necessary for capabilities competitiveness, but you also have access to a smaller quantity of robust environments. Distillation using these robust environments might work because (a) distillation is much more sample-efficient than RL, so you can afford more expensive oversight per sample, and (b) the distilled model is hopefully never/minimally trained to reward hack, so it might have much less reward hacking propensity. However, on point (b), subliminal learning results suggest there might be some transfer of reward hacking propensities from the teacher to the student, particularly when they share the same base model. (A small filtering sketch appears after this list.)
  • (7/18/2025) Add, don't replace: an idea for a reward-hacking intervention that trains an AI to correct bad behavior without reducing your ability to notice it. Let's say you have an AI that misbehaves 5% of the time according to your monitor. Instead of training the policy against the monitor and risking losing visibility into the misbehavior, train a corrective action that the AI takes after the monitor flags its misbehavior.

    E.g.:

    • Append some text to the policy's transcript asking it to retry the action without misbehaving. Train this follow-up action against your monitor, masking out all the gradients to the initial action. (There's still a concern that this generalizes to the initial action, but you would notice the 5% number going down, and could remedy it by using a LoRA that you swap in at correction time.) This is pretty related to defer-to-trusted/resample/... protocols.
    • Or, train the policy against the monitor using a LoRA. At inference-time, use the pre-LoRA model, and if the monitor flags the action, swap in the LoRA and try again.

    (Also train your monitor in each of these cases.) Then you can continue using your monitor in deployment and it will continue to flag 5% of the time. You know to pay more attention to the trajectories that were flagged, and you've improved the usefulness of the model. This maybe doesn't have huge safety taxes and has nice guarantees. (A rough sketch of the inference loop appears below.)
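
A rough sketch of the "add, don't replace" inference loop; the policy, corrective policy, and monitor interfaces are placeholders for illustration:

```python
from typing import Callable

RETRY_SUFFIX = "\n\nA monitor flagged the action above. Retry the action without misbehaving."

def act_with_correction(
    task: str,
    policy: Callable[[str], str],             # base policy (pre-LoRA / not trained against the monitor)
    corrective_policy: Callable[[str], str],  # trained follow-up action (or LoRA-swapped policy)
    monitor: Callable[[str, str], bool],      # True if the action looks like misbehavior
) -> tuple[str, bool]:
    action = policy(task)
    flagged = monitor(task, action)
    if flagged:
        # Keep the flag (so humans retain visibility), but also produce a corrected action.
        action = corrective_policy(task + "\n\n" + action + RETRY_SUFFIX)
    return action, flagged
```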
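
And on the distillation item above: a small sketch of the filtering step, where a robust-but-expensive oversight signal decides which teacher trajectories the base model is allowed to imitate (the `exhibits_hack` callable is a placeholder):

```python
from typing import Callable

def build_distillation_set(
    teacher_rollouts: list[tuple[str, str]],    # (prompt, completion) from the reward-hacking teacher
    exhibits_hack: Callable[[str, str], bool],  # robust (expensive) oversight; affordable here because
                                                # distillation needs far fewer samples than RL
) -> list[tuple[str, str]]:
    """Trajectories the base model (student) will be trained to imitate."""
    return [
        (prompt, completion)
        for prompt, completion in teacher_rollouts
        if not exhibits_hack(prompt, completion)  # drop anything that reward hacks
    ]
```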