LESSWRONG
LW

2866
ariana_azarbal
297Ω1343180
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Recontextualization Mitigates Specification Gaming Without Modifying the Specification
ariana_azarbal2d40

Thanks for the cool suggestion! When training environments need to be "flawed" to even give the model the opportunity to hack (like revealing unit tests), this could be promising. We'd build a kind of resistance to exploiting loopholes in environments.  Even if the "data-generation" environment is not completely robust, I'd still expect that hacking reduces so long as the "training" environment is less robust. 

Developers could even artificially create these contrastive environments to try to reduce reward hacking after post-training. A developer gives their model a bunch of very challenging tasks, and provides no loopholes. Even without any reward signal, we could just train the model on these generations in response to an environment with more loopholes. It's unclear whether this would work better than just putting the model in the same hackable environment and penalizing it for hacking. But, recontextualiation has the advantage of not having to design a training signal here (and therefore not worrying about optimizing against our monitor). 

Environment-recontextualization probably can't cover every case we're interested in--some of the misbehaviors that we're worried about (like deception) don't require any feature of the environment in order to be performed. So in those cases, we'd still probably need instruction-recontextualization. 

Reply
Recontextualization Mitigates Specification Gaming Without Modifying the Specification
ariana_azarbal2d20

This is a very interesting prompting suggestion, and I’d like to test it! Although, I don’t think recontextualization teaches the model it can get away with misbehavior given encouragement to misbehave, because of our results evaluating with this encouragement (first appendix). Recontextualization still mitigates specification gaming, and we actually see the greatest relative decrease in spec gaming on these evals! 

Reply
Recontextualization Mitigates Specification Gaming Without Modifying the Specification
ariana_azarbal3d20

I agree—recontextualization seems safer if we assume that negative reinforcement from the monitor can "push" the model towards undetectable misbehavior. Whereas, when we recontextualize, we'd never be "pushing" the model towards undetectable misbehavior

Reply
Recontextualization Mitigates Specification Gaming Without Modifying the Specification
ariana_azarbal3d40

Thanks for your thoughtful comment! 

One way to view recontextualization and inoculation prompting is that if a training procedure naturally induces X and Y, then artificially inducing X can prevent natural learning of X

I really like this frame!  

  • How the tradeoff described above is affected by how large a fraction of the data induces the negative trait. In the examples from frontier models you list, those behaviors plausibly only correspond to a small fraction of prompts in the models' entire post-training pipelines. If the negative effects of the intervention over the entire training process overpower the positive effects on that portion of inputs, then this could mean the intervention (as is) wouldn't be very useful.
  • Using a cheap classifier to decider whether a prompt merits using recontextualization or not. This would allow the off-policy problem to not affect most prompts, and also ameliorate the above problem somewhat. This would probably be somewhat slow and expensive to run at scale on frontier post-training pipelines, but plausibly worth the cost in some cases. This also seems pretty similar to metadata conditioning in pre-training, which provides some precedence for effectiveness.

Agreed. For large post-training runs, we might worry there are a variety of misbehaviors that could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier flagging for any/each of these behaviors, and then recontextualize only those samples. My concern is: if it's hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If it misses certain forms of misbehavior, then we've only partially solved our initial problem. But, we might be able to set a very low threshold and recontextualize the vast majority of these instances. 

This seems to be missing the results from GRPO with the strong monitor? There recontextualization reduced training and ground truth reward relative to standard training.

Indeed-- our point was not that recontextualization won't ever hurt learning (relative to standard), but that it still seems to allow for significant increases in training reward (more so than regularization that is strong enough to prevent specification gaming). 

Reply1
Training a Reward Hacker Despite Perfect Labels
ariana_azarbal2mo10

Yes, your code is exactly what we do. 

Reply
Training a Reward Hacker Despite Perfect Labels
ariana_azarbal2moΩ330

Thank you for the suggestions and concrete predictions. 

One note is that we already did best-of-10 to get this dataset (just updated post to reflect this). So, on problems which have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are on the final training dataset. 

I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I'd predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models. 

Reply
Training a Reward Hacker Despite Perfect Labels
ariana_azarbal2mo10

Should be fixed, thanks :)

Reply
Training a Reward Hacker Despite Perfect Labels
ariana_azarbal2mo*Ω660

Update: I thought it would be valuable to run an additional analysis on the reasoning traces, and updated the appendix with a visualization of what percent of reasoning traces: 

  1. even mention the presence of a test-case
  2. state an intention to pass tests
  3. identify that one of the test cases is incorrect

Only 50% even mention the presence of a test case, 32% state an intention to pass tests, and 20% identify that one of the test cases is incorrect.

Code and data is available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data 

Thanks again for the suggestion!

Reply
Training a Reward Hacker Despite Perfect Labels
ariana_azarbal2mo10

I agree that answer-only (no CoT) training would be a super insightful ablation. 

Using a CoT monitor as the sole reward signal would also be interesting. I've grappled with the fact that these results seem to imply "verbalized intentions in the CoT matter", which implies it could be useful to train against a CoT monitor that detects suspicious intentions. But of course, this is the forbidden technique.

Reply
Training a Reward Hacker Despite Perfect Labels
ariana_azarbal2mo10

Thanks for the suggestion! Do you mean standard training on the data generated by the original model, or the post-recontextualized-training model? I assumed the latter, but want to make sure. 

Reply
Load More
109Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Ω
4d
Ω
11
134Training a Reward Hacker Despite Perfect Labels
Ω
2mo
Ω
45
67Selective Generalization: Improving Capabilities While Maintaining Alignment
Ω
3mo
Ω
4