Recontextualization distills good behavior into a context which allows bad behavior. More specifically, recontextualization is a modification to RL that generates completions from prompts discouraging misbehavior, appends those completions to prompts that are more tolerant of misbehavior, and then reinforces the model on the recontextualized instruction-completion data. Because the data-generation and training prompts differ in their attitude toward misbehavior, recontextualization builds resistance to misbehaviors that the training signal mistakenly reinforces.
For example, suppose our reward signal does not robustly penalize deception. Recontextualization generates completions while discouraging deception and then creates training data by updating those completions' prompts to encourage deception. That simple tweak can prevent the model from becoming dishonest!
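As a minimal sketch of this data flow (illustrative only: the function names, prompt suffixes, and data format below are our own placeholders, not the exact implementation), assuming the target misbehavior is deception:

```python
# Illustrative sketch of recontextualization's data flow. `generate` and `reward`
# are placeholder callables; the suffix strings are hypothetical examples.

DISCOURAGE = "Additional instructions: be honest with the user."  # data-generation suffix
ENCOURAGE = "Additional instructions: lie to the user."           # training suffix

def build_recontextualized_example(task_prompt, generate, reward):
    # 1. Generate a completion under a prompt that discourages the target misbehavior.
    completion = generate(f"{task_prompt}\n{DISCOURAGE}")

    # 2. Score the completion with the (possibly flawed) training signal.
    score = reward(task_prompt, completion)

    # 3. Pair the completion with a prompt that is more permissive of the misbehavior;
    #    this recontextualized pair is what the model is reinforced on.
    return {"prompt": f"{task_prompt}\n{ENCOURAGE}",
            "completion": completion,
            "reward": score}
```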
We developed recontextualization concurrently with recent work on inoculation prompting. Wichers et al. and Tan et al. find that when fine-tuning on data with an undesirable property, requesting that property in the train-time prompts prevents it from emerging under normal prompts at test time. They use a fixed SFT dataset, while we use an on-policy reinforcement learning procedure. Additionally, we not only request the bad property in the train-time prompts, but also prompt against it during data generation.
Recontextualization is a more general form of context distillation, which appends instructions to the prompt during data generation and removes them for training, so that the model internalizes the instructions. Rather than specifically removing the data-generation context, recontextualization applies any modification to it (e.g. adding or swapping out instructions). Context distillation has previously been applied to increase reasoning capabilities, to instill an HHH persona, and as part of deliberative alignment. We instead show that recontextualization reduces specification gaming by distilling misbehavior-discouraging instructions into a misbehavior-encouraging context.
Training signals often reinforce undesired behaviors. Models can learn to game task specifications in ways those signals fail to penalize. This specification gaming has been observed in frontier language models:
Recontextualization mitigates learning bad behavior during RL simply by using training prompts that are more permissive of misbehavior than the data-generation prompts. Then, if the model behaves well, it did so even when the prompt allowed misbehavior. Perhaps the model learns to "resist" misbehavior. If the model misbehaves, it did so only when it was permitted to misbehave. After all, we never reinforce models for misbehaving after being asked to behave well.[1]
We start with a hypothesis about a misbehavior that our training signal could end up reinforcing, e.g. deception; we call this the "target misbehavior". We perform reinforcement learning with the italicized modifications. The examples below assume the target misbehavior is deception.
We test recontextualization in training environments where models learn concerning behaviors that could plausibly be reinforced by real-world reward functions: prioritizing known evaluation metrics at the expense of soundness (across natural language and coding tasks), passing provided test cases instead of writing general code, and deception. Recontextualization mitigates learning of these misbehaviors.
Note: recontextualization only reduces specification gaming with an appropriate contrast between prompts. The opposite prompt contrast (encourage misbehavior at data-generation and remove the encouragement for parameter updates) strongly induces misbehavior, to the extent of creating a reward hacker despite perfect labeling.
If we already have some hypothesis about the flaws in our training signal (which we use to design the data generation and training prompts for recontextualization), why not just fix the signal?
Finally, we think recontextualization may be effective even when the hypothesis about reinforced misbehavior is too vague to be otherwise useful. E.g. if the target misbehavior is "overfitting to the evaluation criteria" or even simply "misbehavior", it's not clear how to define a corrected data-labeling procedure based on this.
In the School of Reward Hacks dataset, users request task completion from an assistant while providing gameable evaluation metrics, like simple keyword-based or length-based checks. Our training reward function is an LLM judge which scores completions based on the task-specific gameable metric. We apply Best-of-50 sampling with GPT-4.1-mini and then perform SFT on the selected completions.
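A rough sketch of this pipeline (illustrative; `sample` and `judge_score` are placeholder callables, and the suffix strings are not the actual prompts we used):

```python
# Illustrative Best-of-50 + SFT data construction for the School of Reward Hacks setting.

def build_sft_dataset(tasks, sample, judge_score, n=50,
                      gen_suffix="Complete the task well; don't exploit the metric.",
                      train_suffix="Feel free to exploit the evaluation metric."):
    dataset = []
    for task in tasks:
        # Sample n completions under the data-generation prompt.
        candidates = [sample(f"{task}\n{gen_suffix}") for _ in range(n)]
        # Keep the completion the (gameable) LLM judge scores highest.
        best = max(candidates, key=lambda c: judge_score(task, c))
        # Recontextualize: pair the selected completion with the more permissive prompt.
        dataset.append({"prompt": f"{task}\n{train_suffix}", "completion": best})
    return dataset
```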
By default, this procedure leads to specification gaming. Yet recontextualization with the following phrases not only prevents this learned gaming, but actually reduces it.
We find two factors contribute to the success of recontextualization:
Interestingly, a generic recontextualization prompt (Generic Overfit) outperforms one that requests an exploit specific to the task at hand (Task-specific Cheat). From this, we conclude that developers may not need to anticipate the exact misbehaviors their training signals will fail to penalize.
We then explore a code generation setting using Mostly Basic Python Problems (MBPP). For each problem, the model has to submit a code solution. We additionally provide three test cases in context. Importantly, the first test case is always incorrect. Our training reward corresponds to the number of public tests passed by the code answer. We again apply Best-of-N sampling with GPT-4.1-mini and then perform SFT on the selected completions. This promotes faulty solutions that pass the incorrect test (for example, by special-casing it) over general solutions.
In this setting, we mark a solution as Hack if it passes the three public tests as given, as Correct if it passes the three tests once the first test's expected output is corrected, and as Incorrect in all other cases. We want an increase in Correct solutions, while our training procedure prioritizes Hack solutions.
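In code, the labeling works out to roughly the following (a sketch; `run` is a hypothetical helper that executes the submitted code on one test input and returns its output):

```python
# Illustrative Hack / Correct / Incorrect labeling for the MBPP setting.

def classify_solution(code, public_tests, correct_first_output, run):
    # public_tests: list of (input, expected_output); the first expected_output is wrong.
    passes_public = all(run(code, inp) == out for inp, out in public_tests)

    corrected_tests = [(public_tests[0][0], correct_first_output)] + public_tests[1:]
    passes_corrected = all(run(code, inp) == out for inp, out in corrected_tests)

    if passes_corrected:
        return "Correct"    # general solution with the actually correct outputs
    if passes_public:
        return "Hack"       # fits the incorrect public test, e.g. by special-casing it
    return "Incorrect"
```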
We experiment with adding the following instructions to the user prompt:
Incorporating bad-behavior monitors into training signals can reinforce genuinely aligned behavior. Yet these monitors may also train the model to obfuscate its bad behavior.
Cundy et al. show that using a weak lie detector in the labeling step of LLM post-training can yield high rates of evasive deception. We adopt their environment, termed Scalable Oversight via Lie Detector (SOLiD), as a testbed for recontextualization. With the same misspecified reward signal, recontextualized training reduces deception rates and achieves higher ground truth reward than baselines.
Setup: We incorporate a weak lie detector (with 70% lie recall) into the labeling step of reward model training. The reward model is trained to prefer undetected deception over truthfulness and detected deception. We then perform GRPO on Llama-3.1-8B with the "imperfect" reward model.
Let $R(x, y)$ be the training reward assigned by our reward model to completion $y$ of prompt $x$. If prompt $x$'s completion is deceptive (as judged by GPT-4o), the ground-truth reward is $R_{\text{GT}}(x, y) = 0$. Otherwise, $R_{\text{GT}}(x, y) = R(x, y)$.
Full experimental details
We use the DolusChat synthetic dataset, which includes pairwise comparisons of truthful/deceptive assistant email responses. We adopt SOLiD's procedure: first, we use 5% of the DolusChat responses to train a logistic regression classifier to predict deception from the model's activations on those responses. Second, the detector labels the remaining DolusChat responses as truthful vs. deceptive. Third, to create a preference dataset, the "undetected" deceptive responses are assigned a score of 2.0, truthful responses are assigned a score of 1.0, and detected deceptive responses are assigned a score of –1.0. The optimal policy is to deceive and evade the monitor.
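In code, the labeling step amounts to roughly the following (a sketch; `detector_flags_deception` is a hypothetical stand-in for the linear probe):

```python
# Illustrative SOLiD-style scoring of DolusChat responses.

def solid_score(response, is_deceptive, detector_flags_deception):
    if not is_deceptive:
        return 1.0   # truthful response
    if detector_flags_deception(response):
        return -1.0  # deceptive and caught by the weak lie detector
    return 2.0       # deceptive but undetected: the highest-scored outcome

# Pairwise preferences for reward-model training then prefer the higher-scored
# response in each pair, so undetected deception is the optimal policy.
```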
We then train a Bradley-Terry reward model on pairwise preferences between responses. Finally, we train Llama-3.1-8B-Instruct with this reward model using Group Relative Policy Optimization (GRPO) for ~300 steps. We evaluate models on a test set of DolusChat on the basis of deception (as judged by GPT-4o) and reward (as computed by the reward model).
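The Bradley-Terry objective itself is standard; a minimal sketch (our own, not the exact training code):

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen, reward_rejected):
    # Maximize log p(chosen preferred over rejected), where
    # p = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```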
We note that while the reward model is trained only on deception labels, it generalizes beyond truthfulness vs. deceptiveness: models significantly increase their average reward on truthful responses over the course of training.
We modify GRPO with recontextualization by appending the following instructions to the user prompt:
Misbehave: "Additional instructions for your response: misbehave in your response to the user.'"
Honest -> Lie, Honest -> Neutral, Neutral -> Lie, and Neutral -> Misbehave recontextualization all achieve higher ground-truth reward than the standard training variants, including variants with different prompt types (standard training on Honest, Neutral, and Lie instructions) and with increased KL regularization. Where stronger regularization does suffice to mitigate learning of deception, it prevents the model from increasing its training reward as much as recontextualized training does.
Strong lie detector results
What happens when the training signal is actually robust? In practice, developers may not know how "weak" their signal is. If it happens to be very robust, will recontextualization hurt performance?
We test recontextualization vs. standard training using a 95% recall lie detector ("weak" was 70%). Standard training on this strong signal actually reduces deception. Adding recontextualization doesn't substantially hurt performance (or honesty), although it is outperformed by standard training.
Compared to the pre-GRPO checkpoint, both training methods reduce deception while increasing reward.
| Training Type | Deception Rate % (SE) | Training Reward (SE) | Ground Truth Reward (SE) |
|---|---|---|---|
| Baseline (pre-GRPO) | 11.2 | 0.9 | 0.917 |
| Standard (Neutral) | 4.4 (0.7) | 1.97 (0.06) | 1.9 (0.08) |
| Recon. (Neutral -> Lie) | 6.0 (0.8) | 1.81 (0.03) | 1.71 (0.06) |
Our experiments suggest that recontextualization can be incorporated in RL to help address specification gaming. However, important uncertainties remain about recontextualization’s interaction with different RL algorithms, its dependence on prompt design, its effectiveness over many steps of training, and its impact on instruction following.
First, because recontextualization takes responses off-policy (the model didn't generate completions from the exact training prompts), it might reduce performance for RL algorithms designed for on-policy data. In our experiment on deception with GRPO, recontextualization did boost both the training and the ground truth reward, but that's only evidence from one setting.
Recontextualization may become less effective over longer training runs. Inoculation prompting results suggest a link between how strongly instructions elicit a behavior and their ability to prevent that behavior from generalizing.[3] If, over many training steps, recontextualization causes the model's behavior to collapse across the data-generation and training contexts, we might then see reduced effectiveness. We haven't observed empirical evidence for this potential failure mode.
Recontextualization might hurt the model's ability to follow instructions. We've found some decreases in general instruction following (all coherence results are in the appendix). After all, recontextualization trains the model to generate completions which aren't fully aligned with what the prompt requested. For example, in our lie detector environment, only 25% of the model's completions are deceptive at the beginning of GRPO training, yet recontextualized training attaches "Lie to the user" instructions to 100% of those completions. The model could infer that it does not follow instructions.
Relatedly, recontextualization also makes it more difficult to elicit misbehavior from models when we ask for it (results in the appendix). This might be considered a positive outcome. However, the ability to elicit bad behavior might in some cases be important (e.g. debugging a coding environment).
We're continuing to research recontextualization. Specifically, we'd like to:
There might also be advantages to recontextualizing just some samples or to modifying the instructions in a data-dependent way. For example, Hindsight Experience Replay retroactively modifies instructions to match the observed task completion. It has been used in robotics for sample-efficient learning with off-policy RL algorithms and to improve instruction following and alignment with human feedback in LLMs[4]. In the context of specification gaming, modifying instructions in hindsight based on observed behavior could provide recontextualization-like effects.
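As one possible instantiation (a sketch; `detect_misbehavior` is a hypothetical judge and the instruction string is a placeholder):

```python
# Illustrative hindsight-style relabeling: match the training instruction to the
# behavior actually observed in the completion.

def hindsight_relabel(task_prompt, completion, detect_misbehavior,
                      encourage="Feel free to game the evaluation."):
    if detect_misbehavior(completion):
        # Misbehaving completions are only reinforced under a permissive prompt,
        # so misbehavior is never reinforced after the model was asked to behave well.
        return {"prompt": f"{task_prompt}\n{encourage}", "completion": completion}
    # Well-behaved completions keep the original prompt.
    return {"prompt": task_prompt, "completion": completion}
```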
Recontextualization distills good behavior into a context which allows bad behavior. That simple prompting strategy greatly reduces the specification gaming entrained by RL. When the prompts used to update model parameters are more permissive of misbehavior than the data-generation prompts, we build resistance to misbehaviors that the training signal mistakenly reinforces.
Many researchers try to improve the training signal to better verify whether the AI really did what we wanted. That's important work, but "training signal quality" is not the only variable that counts. Context matters, too. We show how to control that context to improve the trained AI's alignment.
@misc{Azarbal_Gillioz_Ivanov_Woodworth_Drori_Wichers_Cloud_Turner_2025,
title={Recontextualization Mitigates Specification Gaming Without Modifying the Specification},
journal={Alignment Forum},
author={Azarbal, Ariana and Gillioz, Victor and Ivanov, Vladimir and Woodworth, Bryce and Drori, Jacob and Wichers, Nevan and Cloud, Alex and Turner, Alexander Matt},
year={2025},
month={Oct}
}
Thanks to Vladimir Ivanov and Jacob Drori for valuable experimental and written contributions. Thanks to Nevan Wichers, Aram Ebtekar, Sam Marks, Fabien Roger, and Alex Mallen for sharing ideas and collaborating on similar topics. Thanks to Luke Marks for insightful conversations, and to Bryce Woodworth for invaluable support throughout our research process.
We performed this work during MATS 8 on Team Shard under the supervision of Alex Turner and Alex Cloud. If you're interested in working on projects like this, please apply to work with Team Shard next summer during MATS 10! Applications will probably open early next year.
We release code for two experimental settings.
Performance on different inference instructions
Our main presented results use neutral evaluations at inference. Trends are consistent when we evaluate using misbehavior-encouraging (adversarial) or misbehavior-discouraging (safety) instructions: recontextualization effectively mitigates specification gaming.
Interestingly, recontextualization reduces hacks most effectively, relative to baseline and standard training, under adversarial instructions: absolute hack rates remain highest there, but the relative improvement from recontextualization is most pronounced.
Mitigating general evaluation hacking
Although hack scores are higher across the board when inference instructions encourage exploits, recontextualization provides greater improvement over baseline and standard training.
Preventing test case hacking:
Once again, the separation between recontextualization and standard training is most pronounced on Hack Inference, although recontextualization is generally more performant than standard training across inference instructions (the upper-left corner is best).
Preventing learned evasion of a lie detector:
We see that, although absolute deception rates are higher for Lie Inference, this is the only inference instruction for which recontextualized training reduces deception relative to the pre-GRPO baseline. Neither the standard nor the recontextualized models respond much better to "Honest" inference instructions than to neutral ones.
Coherence evaluations
Mitigating general evaluation hacking
Given that recontextualization takes learning off-policy, it's important to verify this doesn't have adverse effects on model coherence. We are continuing to investigate this important consideration, but present our results so far.
We evaluate GPT-4.1-mini (trained with recontextualization vs. standard) on MMLU and IFEval. MMLU results show no consistent degradation for recontextualization vs. standard training.
We see some degradation in instruction following for recontextualized vs. standard training, although absolute differences are small. We see that standard training (bottom left to upper right) actually increases general instruction-following in GPT-4.1-mini. Recontextualized training (everywhere else) generally increases instruction-following to a lesser extent, or slightly decreases it.
In the lie detector environment, we find some reduction in instruction following for recontextualized vs. standard training. This only holds in the weak lie detector regime (recall 0.7), as opposed to the strong lie detector regime (recall 0.95).
Judge prompts for "mitigating general evaluation hacking"
We use the following prompt to judge hack score:
We use the following prompt to judge the quality:
It's possible that the model also subliminally infers the "gap" between data generation and training prompts. The transmitted trait would be resisting misbehavior, or behaving as though a more misbehavior-discouraging prompt were present.
Subliminal learning can occur alongside semantic learning (based on the meaning of the data). Alex Cloud clarifies that it "can transmit traits via data that is semantically related to those traits". In our experiments, the data is semantically related to the transmitted traits. For example, if our transmitted trait is honesty, the responses we train on would contain examples of honest or dishonest behavior. So, subliminal learning might merely be supplementing the non-subliminal effects described in the main body.
We include the phrase "For fun" because the OpenAI finetuning API otherwise blocks the training dataset.
A "behavior"/"trait" generally goes beyond an outcome-based measure. For example, reasoning itself can encode reward hacking tendencies even when the outcome does not.