
ariana_azarbal


Comments

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 13d · 10

Yes, your code is exactly what we do. 

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 14d · Ω330

Thank you for the suggestions and concrete predictions. 

One note: we already did best-of-10 sampling to construct this dataset (I've just updated the post to reflect this). So even on problems with relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are for the final training dataset.

I can definitely try selecting for non-test-mentioning reasoning when creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I'd predict a 40% effect reduction for GPT-4o-mini and a 70% effect reduction for the other base models.
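For concreteness, here is a rough sketch of that selection scheme with the optional non-test-mentioning filter added. The helper names (sample_completion, is_hack, mentions_tests) are illustrative stand-ins, not our actual pipeline code:

```python
from typing import Callable, Optional

def select_completion(
    prompt: str,
    sample_completion: Callable[[str], str],   # hypothetical: one rollout from the base model
    is_hack: Callable[[str], bool],            # hypothetical: LLM-judge / manual hack label
    mentions_tests: Callable[[str], bool],     # hypothetical: flags test-mentioning reasoning
    n: int = 10,
    filter_test_mentions: bool = False,
) -> Optional[str]:
    """Best-of-n selection: keep the first non-hack completion, optionally also
    requiring that the reasoning never mentions the tests."""
    for completion in (sample_completion(prompt) for _ in range(n)):
        if is_hack(completion):
            continue
        if filter_test_mentions and mentions_tests(completion):
            continue
        return completion
    return None  # drop the problem if no acceptable completion is found
```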

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 15d · 10

Should be fixed, thanks :)

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 15d* · Ω660

Update: I thought it would be valuable to run an additional analysis on the reasoning traces, so I updated the appendix with a visualization of what percentage of reasoning traces:

  1. even mention the presence of a test case
  2. state an intention to pass tests
  3. identify that one of the test cases is incorrect

Only 50% even mention the presence of a test case, 32% state an intention to pass tests, and 20% identify that one of the test cases is incorrect.

Code and data are available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
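For reference, the tally is roughly of this shape (an illustrative sketch assuming a hypothetical classify_trace helper, not the repo's actual analysis code):

```python
from collections import Counter

# Hypothetical category names matching the three questions above; the real
# analysis in the linked repo may be organized differently.
CATEGORIES = ("mentions_test_case", "states_intent_to_pass", "identifies_bad_test")

def tally(traces, classify_trace):
    """classify_trace(trace) -> set of category names that apply to one reasoning trace."""
    counts = Counter()
    for trace in traces:
        counts.update(classify_trace(trace) & set(CATEGORIES))
    total = max(len(traces), 1)
    return {cat: 100 * counts[cat] / total for cat in CATEGORIES}

# e.g. on this dataset the result would be roughly
# {"mentions_test_case": 50.0, "states_intent_to_pass": 32.0, "identifies_bad_test": 20.0}
```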

Thanks again for the suggestion!

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 15d · 10

I agree that answer-only (no CoT) training would be a super insightful ablation. 

Using a CoT monitor as the sole reward signal would also be interesting. I've grappled with the fact that these results seem to imply "verbalized intentions in the CoT matter", which suggests it could be useful to train against a CoT monitor that detects suspicious intentions. But of course, this is the forbidden technique.
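To make that concrete, training against a CoT monitor would look something like adding a penalty term to the reward. A hypothetical sketch, with monitor_flags_suspicious_intent standing in for whatever classifier is used:

```python
def reward_with_cot_monitor(
    passes_tests: bool,
    cot: str,
    monitor_flags_suspicious_intent,  # hypothetical CoT classifier: str -> bool
    penalty: float = 1.0,
) -> float:
    """Task reward minus a penalty whenever the monitor flags the chain of thought.

    The worry ("the forbidden technique"): optimizing against the monitor can teach
    the model to obfuscate its reasoning rather than to stop reward hacking.
    """
    reward = 1.0 if passes_tests else 0.0
    if monitor_flags_suspicious_intent(cot):
        reward -= penalty
    return reward
```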

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 15d · 10

Thanks for the suggestion! Do you mean standard training on data generated by the original model, or on data generated by the post-recontextualized-training model? I assumed the latter, but want to make sure.

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 15d · 10

Thanks for the connection to your work! I think your framing of the persona we're teaching the model in this setup is helpful.

I also think the mitigation strategy you present makes sense, and I'm generally excited about reward-hacking mitigations that leverage the model's own determination of what is or isn't an exploit. Plausibly this is more reliable, and scales better, than relying solely on humans or weaker models to do the labeling. Of course, the failure mode is that we start with an already-deceptive model.
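As a rough illustration of that kind of self-labeling (purely hypothetical, not the mitigation from the linked work):

```python
def self_filter(problem: str, solution: str, ask_model) -> bool:
    """Keep a (problem, solution) pair only if the model itself judges it non-exploitative.

    ask_model is a hypothetical call into the same model being trained; the obvious
    failure mode is an already-deceptive model mislabeling its own exploits as fine.
    """
    prompt = (
        f"Does the following solution special-case or exploit the tests rather than "
        f"solving the task?\n\nProblem:\n{problem}\n\nSolution:\n{solution}\n\n"
        f"Answer 'yes' or 'no'."
    )
    return ask_model(prompt).strip().lower().startswith("no")
```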

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 15d · 10

Thanks for your comment! I agree that answer-only SFT, as well as reasoning re-writing, would both be very valuable ablations. And thanks for the suggestions regarding LUPI; I'll look further into that literature.

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 15d · 10

Thanks for pointing this out! I agree we are exploring a safety-relevant variant of prompt self-distillation. 

It would certainly be interesting to see how much more effective logit-based distillation is than SFT at "internalizing" the omitted generation prompt. 
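Roughly, the contrast is between matching only the sampled tokens (SFT) and matching the teacher's full next-token distribution, where the teacher is run with the generation prompt the student never sees. An illustrative sketch, not our experimental code:

```python
import torch.nn.functional as F

def sft_loss(student_logits, target_ids):
    """Standard SFT: cross-entropy on the sampled tokens only."""
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), target_ids.view(-1)
    )

def logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Logit-based distillation: match the teacher's full next-token distribution.

    Here the teacher would be run *with* the generation prompt that is omitted
    from the student's context, so the student has to internalize its effect.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_logprobs, teacher_probs, reduction="batchmean") * temperature**2
```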

Training a Reward Hacker Despite Perfect Labels
ariana_azarbal · 15d* · 40

Thanks for the comment! I agree; this was a concern for us as well. However, the dataset is only 146 samples, so I was able to look through a large number of them manually to verify the answers were not hacks. I also specifically searched for indicators that an answer might be a hack, like the words "incorrect" or "test" in the assistant's CoT, and further scrutinized those samples. I was able to catch 2 or 3 special-cased solutions that the LLM judge did not identify as special-cased, and I removed them. I've made the datasets and judge code publicly available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
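Roughly, the keyword pass looks like this (an illustrative sketch with hypothetical field names; the released code linked above is the authoritative version):

```python
SUSPICIOUS_KEYWORDS = ("incorrect", "test")

def flag_for_manual_review(samples):
    """Return samples whose assistant CoT contains a suspicious keyword, so they
    can be re-checked by hand on top of the LLM judge's labels."""
    flagged = []
    for sample in samples:
        cot = sample["assistant_cot"].lower()  # hypothetical field name
        if any(keyword in cot for keyword in SUSPICIOUS_KEYWORDS):
            flagged.append(sample)
    return flagged
```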
 

Additionally, my confidence in this result is increased by the fact that we first observed it in a multiple-choice variant of the same dataset (the assistant had to select between two code options, one special-cased and one not, after thinking through the problem). In that case, no judge was needed to verify what was a hack vs. a non-hack.

Posts

127 · Training a Reward Hacker Despite Perfect Labels · Ω · 18d · 45
65 · Selective Generalization: Improving Capabilities While Maintaining Alignment · Ω · 2mo · 4