Thank you for the suggestions and concrete predictions.
One note is that we already did best-of-10 sampling to build this dataset (I've just updated the post to reflect this). So, on problems that have relatively high rates of hacking, we are still often able to select a non-hack completion to put in the training dataset. The statistics I shared are for the final training dataset.
I can definitely try selecting for non-test-mentioning reasoning in creating the dataset and see to what extent that reduces the effect. Simply selecting for this within the best-of-10 sampling process seems natural. If this halves test-mentioning, I'd predict a 40% effect reduction for GPT-4o-mini, and a 70% effect reduction for the other base models.
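For concreteness, here's a minimal sketch of what that selection step could look like, assuming each sampled completion carries its CoT and final code, and that `is_hack` is the existing hack-detection judge (the names and data layout here are hypothetical, not our actual pipeline):

```python
import re

# Crude heuristic for whether a reasoning trace talks about the test cases.
TEST_PATTERN = re.compile(r"\btest(s|case|cases)?\b", re.IGNORECASE)

def mentions_tests(cot: str) -> bool:
    return bool(TEST_PATTERN.search(cot))

def select_completion(completions, is_hack):
    """Pick one completion from a best-of-10 sample for the training set.

    Preference order:
      1. non-hack with no test-mentioning reasoning,
      2. any non-hack (the current selection criterion),
      3. otherwise drop the problem (return None).

    `completions` is a list of dicts with "cot" and "code" fields, and
    `is_hack` is a judge over the final code; both names are hypothetical.
    """
    non_hacks = [c for c in completions if not is_hack(c["code"])]
    if not non_hacks:
        return None
    clean = [c for c in non_hacks if not mentions_tests(c["cot"])]
    return (clean or non_hacks)[0]
```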
Should be fixed, thanks :)
Update: I thought it would be valuable to run an additional analysis on the reasoning traces, so I've updated the appendix with a visualization of what percent of traces exhibit each behavior: only 50% even mention the presence of a test case, 32% state an intention to pass the tests, and 20% identify that one of the test cases is incorrect.
Code and data are available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
Thanks again for the suggestion!
I agree that answer-only (no CoT) training would be a super insightful ablation.
Using a CoT monitor as the sole reward signal would also be interesting. I've grappled with the fact that these results seem to imply that verbalized intentions in the CoT matter, which suggests it could be useful to train against a CoT monitor that detects suspicious intentions. But of course, that is the forbidden technique.
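As a toy illustration of the two options (not something we've run), assuming a hypothetical `monitor_score` that maps a CoT to a score in [0, 1], with higher meaning less suspicious:

```python
def monitor_only_reward(cot: str, monitor_score) -> float:
    """CoT monitor as the sole reward signal: ignore test outcomes entirely."""
    return monitor_score(cot)

def task_plus_monitor_reward(cot: str, tests_pass: bool, monitor_score) -> float:
    """Task reward minus a penalty for suspicious reasoning. Optimizing against
    the monitor term is the 'forbidden technique': it pressures the model to
    hide intentions from the CoT rather than to stop hacking."""
    task_reward = 1.0 if tests_pass else 0.0
    suspicion_penalty = 1.0 - monitor_score(cot)
    return task_reward - suspicion_penalty
```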
Thanks for the suggestion! Do you mean standard training on data generated by the original model, or on data generated by the model after recontextualized training? I assumed the latter, but want to make sure.
Thanks for the connection to your work! I think your frame on the persona we're teaching the model in this setup is helpful.
I also think the mitigation strategy you present makes sense, and I'm generally excited about reward-hacking mitigations that leverage the model's own determination of what is or isn't an exploit. Plausibly this is more reliable, and scales better, than relying solely on humans or weaker models to do the labeling. Of course, the failure mode is that we start with an already-deceptive model.
Thanks for your comment! I agree that answer-only SFT, as well as reasoning re-writing, would both be very valuable ablations. And thanks for the pointer to LUPI; I'll look further into that literature.
Thanks for pointing this out! I agree we are exploring a safety-relevant variant of prompt self-distillation.
It would certainly be interesting to see how much more effective logit-based distillation is than SFT at "internalizing" the omitted generation prompt.
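For reference, the difference is just in the loss target. Here's a minimal PyTorch-style sketch, assuming the teacher's logits are computed with the full generation prompt and the student's without it; the temperature and shapes are assumptions for illustration, not our implementation:

```python
import torch.nn.functional as F

def sft_loss(student_logits, target_token_ids):
    """Standard SFT: cross-entropy against the hard tokens sampled from the
    teacher context (the current approach)."""
    vocab = student_logits.size(-1)
    return F.cross_entropy(student_logits.view(-1, vocab),
                           target_token_ids.view(-1))

def logit_distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Logit-based distillation: match the teacher's full next-token
    distribution (computed with the generation prompt) from the student's
    distribution (computed without it)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * (t ** 2)
```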
Thanks for the comment! I agree this was a concern for us. However, the dataset contains only 146 samples, so I was able to manually look through a large number of them to verify the answers were not hacks. I also specifically searched for indicators that an answer might be a hack, like the words "incorrect" or "test" in the assistant's CoT, and further scrutinized those samples. I was able to catch 2 or 3 special-cased solutions which the LLM judge did not identify as special-cased, and I removed them. I've made the datasets and judge code publicly available here: https://github.com/arianaazarbal/training-a-reward-hacker-despite-perfect-labels-data
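Concretely, the flagging step could be as simple as the following sketch (the field names are hypothetical; the actual code is in the repo linked above):

```python
SUSPICIOUS_MARKERS = ("incorrect", "test")

def flag_for_manual_review(dataset):
    """Return samples whose assistant CoT contains indicators that the answer
    might be a special-cased hack, so they get extra human scrutiny on top of
    the LLM-judge pass. Samples are assumed to be dicts with a "cot" field."""
    return [s for s in dataset
            if any(marker in s["cot"].lower() for marker in SUSPICIOUS_MARKERS)]
```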
Additionally, my confidence in this result is increased by the fact that we first observed it in a multiple-choice variant of the same dataset. (The assistant had to select between two code options, one of which was special-cased and one of which was not, after thinking through the problem.) In that setting, no judge was needed to verify what was a hack vs. a non-hack.
Yes, your code is exactly what we do.