Another idea that I think fits under improving generalization: Apply deliberative alignment-style training to reduce reward hacking. More precisely:
I might work on this with collaborators soon, so feedback is welcome.
Cool! Steps 1-4 sound similar to semi-on-policy RL, but just one iteration.
Step 5, in particular the reward-hacking judge, is a separate mitigation. I'm not sure why labs don't do this already. My guess is some combination of "everything is harder than you think" and worry that it will make reward hacks much harder to spot because LM judges are about (as good as) the best oversight we currently have.
I'm also worried that steps 1-4 approach won't be that scalable, since with enough RL it'll get washed out. But maybe it could be applied after the majority of post-training is already done (like "train against reward hacking at the end").
Another mitigation against reward hacking (inspired by this):
Train the model to generate 2 responses:
This might beat "just use high quality human feedback after RL" because
This is similar to “tell the model to reward hack during training” mixed with “use high quality feedback at the end” but it’s all intermixed throughout training and there’s self critique.
It's probably too complex/fancy for production usage, except if it was shown to work extremely well - which I think is <20% likely.
Under "improving reward," another option that seems promising is using models to iteratively improve your training data, like in Iterative Label Refinement.
This is a quick list of interventions that might help fix issues from reward hacking.
(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer’s intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)
Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might also be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they might do a bad job of solving superalignment. Though, of course, there are commercial incentives to solve some reward hacking problems.
Here is the list (we don't claim novelty):
Feel free to comment with any others.
EDIT: I'll try to add interventions below as I come across them over time.
(7/18/2025) Add, don’t replace: an idea for a reward-hacking intervention that trains an AI to improve bad behavior without reducing your ability to notice it. Let’s say you have an AI that misbehaves 5% of the time according to your monitor. Instead of training the policy against the monitor and risking losing visibility into the misbehavior, train a corrective action that the AI does after the monitor flags its misbehavior.
E.g.:
(Also train your monitor in each of these cases) Then you can continue using your monitor in deployment and it will continue to flag 5% of the time. You know to pay more attention to these trajectories that were flagged, and you’ve improved usefulness of the model. This maybe doesn’t have huge safety taxes and has nice guarantees.