Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 sampling to build this dataset (I've just updated the post to reflect this). So, on problems that have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are for the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I'd predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.
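For concreteness, here's a minimal sketch of what that selection step could look like, assuming each sampled completion carries its CoT and final code, and that `is_hack` is the existing hack-detection judge (the names and data layout here are hypothetical, not our actual pipeline):

```python
import re

# Crude heuristic for whether a reasoning trace talks about the test cases.
TEST_PATTERN = re.compile(r"\btest(s|case|cases)?\b", re.IGNORECASE)

def mentions_tests(cot: str) -> bool:
    return bool(TEST_PATTERN.search(cot))

def select_completion(completions, is_hack):
    """Pick one completion from a best-of-10 sample for the training set.

    Preference order:
      1. non-hack with no test-mentioning reasoning,
      2. any non-hack (the current selection criterion),
      3. otherwise drop the problem (return None).

    `completions` is a list of dicts with "cot" and "code" fields, and
    `is_hack` is a judge over the final code; both names are hypothetical.
    """
    non_hacks = [c for c in completions if not is_hack(c["code"])]
    if not non_hacks:
        return None
    clean = [c for c in non_hacks if not mentions_tests(c["cot"])]
    return (clean or non_hacks)[0]
```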
Should be fixed, thanks :)
Update: I thought it would be valuable to run an additional analysis on the reasoning traces, so I've updated the appendix with a visualization of what percent of traces exhibit each behavior: only 50% even mention the presence of a test case, 32% state an intention to pass the tests, and 20% identify that one of the test cases is incorrect.
Code and data are available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
Thanks again for the suggestion!
I agree that answer-only (no CoT) training would be a super insightful ablation.
Using a CoT monitor as the sole reward signal would also be interesting. I've grappled with the fact that these results seem to imply that verbalized intentions in the CoT matter, which suggests it could be useful to train against a CoT monitor that detects suspicious intentions. But of course, that is the forbidden technique.
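As a toy illustration of the two options (not something we've run), assuming a hypothetical `monitor_score` that maps a CoT to a score in [0, 1], with higher meaning less suspicious:

```python
def monitor_only_reward(cot: str, monitor_score) -> float:
    """CoT monitor as the sole reward signal: ignore test outcomes entirely."""
    return monitor_score(cot)

def task_plus_monitor_reward(cot: str, tests_pass: bool, monitor_score) -> float:
    """Task reward minus a penalty for suspicious reasoning. Optimizing against
    the monitor term is the 'forbidden technique': it pressures the model to
    hide intentions from the CoT rather than to stop hacking."""
    task_reward = 1.0 if tests_pass else 0.0
    suspicion_penalty = 1.0 - monitor_score(cot)
    return task_reward - suspicion_penalty
```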
Thanks for the suggestion! Do you mean standard training on data generated by the original model, or on data generated by the model after recontextualized training? I assumed the latter, but want to make sure.
Thanks for the connection to your work! I think your frame on the persona we're teaching the model in this setup is helpful.
I also think the mitigation strategy you present makes sense, and I'm generally excited about reward-hacking mitigations that leverage the model's own determination of what is or isn't an exploit. Plausibly this is more reliable, and scales better, than relying solely on humans or weaker models to do the labeling. Of course, the failure mode is that we start with an already-deceptive model.
Thanks for your comment! I agree that answer-only SFT, as well as reasoning re-writing, would both be very valuable ablations. And thanks for the pointer to LUPI; I'll look further into that literature.
Thanks for pointing this out! I agree we are exploring a safety-relevant variant of prompt self-distillation.
It would certainly be interesting to see how much more effective logit-based distillation is than SFT at "internalizing" the omitted generation prompt.
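For reference, the difference is just in the loss target. Here's a minimal PyTorch-style sketch, assuming the teacher's logits are computed with the full generation prompt and the student's without it; the temperature and shapes are assumptions for illustration, not our implementation:

```python
import torch.nn.functional as F

def sft_loss(student_logits, target_token_ids):
    """Standard SFT: cross-entropy against the hard tokens sampled from the
    teacher context (the current approach)."""
    vocab = student_logits.size(-1)
    return F.cross_entropy(student_logits.view(-1, vocab),
                           target_token_ids.view(-1))

def logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Logit-based distillation: match the teacher's full next-token
    distribution (computed with the generation prompt) from the student's
    distribution (computed without it)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t ** 2)
```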
Thanks for the comment! I agree this was a concern for us. However, the dataset contains only 146 samples, so I was able to manually look through a large number of them to verify the answers were not hacks. I also specifically searched for indicators that an answer might be a hack, like the words "incorrect" or "test" in the assistant's CoT, and further scrutinized those samples. I was able to catch 2 or 3 special-cased solutions which the LLM judge did not identify as special-cased, and I removed them. I've made the datasets and judge code publicly available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
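Concretely, the flagging step could be as simple as the following sketch (the field names are hypothetical; the actual code is in the repo linked above):

```python
SUSPICIOUS_MARKERS = ("incorrect", "test")

def flag_for_manual_review(dataset):
    """Return samples whose assistant CoT contains indicators that the answer
    might be a special-cased hack, so they get extra human scrutiny on top of
    the LLM-judge pass. Samples are assumed to be dicts with a "cot" field."""
    return [s for s in dataset
            if any(marker in s["cot"].lower() for marker in SUSPICIOUS_MARKERS)]
```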
Additionally, my confidence in this result is increased by the fact that we first observed it in a multiple-choice variant of the same dataset. (The assistant had to select between two code options, one of which was special-cased and one of which was not, after thinking through the problem.) In that setting, no judge was needed to verify what was a hack vs. a non-hack.
Yes, your code is exactly what we do.