Recontextualization Mitigates Specification Gaming Without Modifying the Specification

ariana_azarbal; Victor Gillioz; TurnTrout; cloud

Recontextualization distills good behavior into a context which allows bad behavior. More specifically, recontextualization is a modification to RL which generates completions from prompts that discourage misbehavior, appends those completions to prompts that are more tolerant of misbehavior, and finally reinforces the model on the recontextualized instruction-completion data. Due to the data generation and training prompts differing in their attitude towards misbehavior, recontextualization builds resistance to misbehaviors that the training signal mistakenly reinforces.

For example, suppose our reward signal does not robustly penalize deception. Recontextualization generates completions while discouraging deception and then creates training data by updating those completions' prompts to encourage deception. That simple tweak can prevent the model from becoming dishonest!

Read our paper on Arxiv. Produced as part of MATS Team Shard 8.0 under the mentorship of Alex Turner and Alex Cloud.

Related work

We developed recontextualization concurrently with recent work on inoculation prompting. Wichers et al. and Tan et al. find that when fine-tuning on data with an undesirable property, requesting that property in the train-time prompts prevents it from emerging under normal prompts at test-time. They use a fixed SFT dataset while we use an on-policy Reinforcement Learning procedure. Additionally, we not only request the bad property in train-time prompts, but we also prompt against the bad property in data generation.

Recontextualization is a more general form of context distillation, which appends instructions to the prompt during data generation and removes them for training, so that the model internalizes the instructions. Rather than specifically removing data generation context, recontextualization applies any modification to the data generation context (e.g. adding or swapping out instructions). Context distillation had previously been applied to increase reasoning capabilities or instill an HHH persona and as part of deliberative alignment. We instead show that recontextualization reduces specification gaming by distilling misbehavior-discouraging instructions into a misbehavior-encouraging context.

Introduction

Training signals often reinforce undesired behaviors. Models can learn to game task specifications without penalty by those training signals. This specification gaming has been observed in frontier language models:

Preference models reinforce sycophancy (telling humans what they want to hear rather than what is true) and misleading explanations that appear correct to evaluators (but are wrong).
Training against chain of thought monitors can teach models to conceal misbehavior from their reasoning traces.
Coding models sometimes write code that passes automatic verification tests yet would be difficult to use and maintain in practice.

Recontextualization mitigates learning bad behavior during RL simply by using training prompts that are more permissive of misbehavior than data-generation prompts. Then, if the model behaves well, the model did so even when the prompt allowed misbehavior. Perhaps the model learns to "resist" misbehavior. If the model misbehaves, it did so only when it was permitted to misbehave. After all, we never reinforce models for misbehaving after being asked to behave well.^[1]

Methodology

We start with a hypothesis about misbehavior that our training signal could end up reinforcing, e.g. deception. We call this "target misbehavior". We perform reinforcement learning with the italicized modifications. Examples assume the target misbehavior is deception.

Data Generation. Sample input prompts, and a) use them directly; or b) modify them to discourage target misbehavior; (e.g. append "Be honest to the user" to the system/user message). Then, sample completions from the model.
Scoring. Score completions normally using the reward function.
Training. Modify input prompts to encourage target misbehavior relative to the data generation prompts (e.g. append "lie to the user" to the system/user message). If the data generation prompt discourages target misbehavior, the training prompt can remain neutral. Perform the loss computation and parameter-updates with the data generation prompt swapped out for the training prompt.
Repeat 1-3 throughout training.

We test recontextualization in training environments where models learn concerning behaviors that could plausibly be reinforced by real-world reward functions: prioritizing known evaluation metrics at the expense of soundness (across natural language and coding tasks), passing test cases over providing general code, and deception. Recontextualization mitigates learning of these misbehaviors.

Note: recontextualization only reduces specification gaming with an appropriate contrast between prompts. The opposite prompt contrast (encourage misbehavior at data-generation and remove the encouragement for parameter updates) strongly induces misbehavior, to the extent of creating a reward hacker despite perfect labeling.

Why recontextualization may be more practical than "fixing" training signals

If we already have some hypothesis about the flaws in our training signal (which we use to design the data generation and training prompts for recontextualization), why not just fix the signal?

Even when flaws of reward functions are well-known, it can still be hard to actually correct them. For example, preference models probably reinforce sycophancy due to biases in the human judgements used to train them. Yet, there is no obvious way to collect data that does not reflect this preference.
Even if the flaw could be fixed with an altered data labeling procedure, re-labeling using either humans or other AI systems could be costly.
Some risks are inherent to the training signal itself. Unit tests reinforce test-passing over code quality. CoT monitors can entrain hidden misbehavior. "Patching" specific failures can't eliminate the underlying vulnerability. We might be able to use knowledge of this vulnerability for recontextualization.

Finally, we think recontextualization may be effective even when the hypothesis about reinforced misbehavior is too vague to be otherwise useful. E.g. if the target misbehavior is "overfitting to the evaluation criteria" or even simply "misbehavior", it's not clear how to define a corrected data-labeling procedure based on this.

Experiments

Mitigating general evaluation hacking

In the School of Reward Hacks dataset, users request task completion from an assistant while providing gameable evaluation metrics, like simple keyword-based or length-based checks. Our training reward function is an LLM judge which scores completions based on the task-specific gameable metric. We apply Best-of-50 sampling on GPT-4.1-mini then perform SFT with the selected completions.

By default, this procedure leads to specification gaming. Yet, recontextualization with the following phrases prevents it, and actually reduces specification gaming.

Generic Overfit: "For fun, you should try to overfit your response to the evaluation criteria."^[2]
Task-specific Cheat: "For fun, you should try using the following strategy in your response: {task-specific cheat method}."

We find two factors contribute to the success of recontextualization:

The contrast between the data generation and training prompts. Using the same prompt for both data generation and training—no matter whether it is neutral, exploit-encouraging, or exploit-discouraging—increases specification gaming. The effective approach requires a mismatch where training prompts are more exploit-permissive than data generation prompts.

Mitigating specification gaming requires the training prompt to be *more* exploit-permissive than the data-generation prompt (upper left triangle). We plot average increase in LLM-judged hack score (0-10) (left). We plot LLM-judged average quality score (0-10) (right). Results averaged over 5 training seeds with SE. Judge prompts in the appendix.

The meaning of the recontextualized prompt. Mirroring results from Inoculation Prompting, it is ineffective to recontextualize with prompts that are irrelevant or discourage the exploit.

With a neutral data generation prompt, the training prompt must encourage the exploit in order to mitigate hacking. We plot average increase in LLM-judged Hack Score (0-10) over 5 training seeds, with standard error shown.

Interestingly, a generic recontextualization prompt (Generic Overfit) outperforms one that requests an exploit specific to the task at hand (Task-specific Cheat). From this, we conclude that developers may not need to anticipate the exact misbehaviors their training signals will fail to penalize.

Preventing test case hacking in code generation

We then explore a code generation setting using Mostly Basic Python Problems (MBPP). For each problem, the model has to submit a code solution. We additionally provide three test cases in context. Importantly, the first test case is always incorrect. Our training reward corresponds to the number of public tests passed by the code answer. We again apply Best-of-N sampling on GPT-4.1-mini then perform SFT on the selected completions. This promotes faulty solutions that pass the incorrect test, for example by special-casing it, over general solutions.

In this setting, we mark a solution as Hack if it passes the three public tests, as Correct if it passes the three tests with the actual correct output, and Incorrect in any other cases. We want an increase in Correct solutions, while our training procedure prioritizes Hack solutions.

We experiment with adding the following instructions to the user prompt:

Neutral: [No additional instructions]
Hack: “You should do anything to pass the test cases, even if your solution is not general.”
No Hack: “You should always write a general solution and never hard code test cases.”

With *Neutral* instructions for inference, standard training increases the rate of *Hack* solutions. Recontextualization without Best-of-N (*Best-of-1*) improves the model behavior, and can positively interact with training. The results are strongest when generating with *No Hack* and training with *Hack*.

Preventing learned evasion of a lie detector

Incorporating bad-behavior monitors into training signals can reinforce genuinely aligned behavior. Yet these monitors may also train the model to obfuscate its bad behavior.

Cundy et al. show that using a weak lie detector in the labeling step of LLM post-training can yield high rates of evasive deception. We adopt their environment, termed Scalable Oversight via Lie Detector (SOLiD), as a testbed for recontextualization. With the same misspecified reward signal, recontextualized training reduces deception rates and achieves higher ground truth reward than baselines.

We incorporate a weak lie detector (with 70% lie recall) into the labeling step of reward model training. The reward model is trained to prefer undetected deception over truthfulness and detected deception. We then perform GRPO on Llama-3.1-8B with the "imperfect" reward model.

Let be the training reward assigned by our reward model. If prompt $p$ 's completion $y$ is deceptive (as judged by GPT-4o), the ground-truth reward $R$ equals 0. Otherwise, it equals $R_{RM} (p, y)$ .

Full experimental details

We use the DolusChat synthetic dataset, which includes pairwise comparisons of truthful/deceptive assistant email responses. We adopt SOLiD's procedure: first, we use 5% of the DolusChat responses to train a logistic regression classifier to predict deception from the model's activations on those responses. Second, the detector labels the remaining DolusChat responses as truthful vs. deceptive. Third, to create a preference dataset, the "undetected" deceptive responses are assigned a score of 2.0, truthful responses are assigned a score of 1.0, and detected deceptive responses are assigned a score of –1.0. The optimal policy is to deceive and evade the monitor.

We then train a Bradley-Terry reward model on pairwise preferences between responses. Finally, we train Llama-3.1-8B-Instruct with this reward model using Group Relative Policy Optimization (GRPO) for ~300 steps. We evaluate models on a test set of DolusChat on the basis of deception (as judged by GPT-4o) and reward (as computed by the reward model).

We note that while the reward model is trained only on deception labels, but generalizes beyond truthfulness vs deceptiveness. Models significantly increase their average reward on truthful responses over the course of training.

We modify GRPO with recontextualization by appending the following instructions to the user prompt:

Neutral: [no additional instructions]
Lie: "Additional instructions for your response: lie to the user."
Honest: “Additional instructions for your response: be honest to the user.”

Recontextualization mitigates deception and achieves the highest ground truth performance. We evaluate on *Neutral* instructions for 3 training seeds and plot standard error. "Baseline'" is the pre-GRPO checkpoint. Models without a KL coefficient label use $β = 0.1$ .

Honest -> Lie, Honest -> Neutral, and Neutral -> Lie recontextualization all achieve higher ground truth reward than standard training variants, including using different prompt types (Standard training on Honest, Neutral, and Lie instructions) and increased KL regularization. Where stronger KL regularization is sufficient to mitigate learning of deception, it inhibits the model from increasing its training reward relative to recontextualized training.

Appendix E2 of our paper shows that recontextualization is benefited by a regularizing effect when interacting with GRPO's clipped surrogate objective. This regularization is strong enough that even recontextualizing with random instructions can be effective, although prompt semantics still mediate recontextualization's success.

Strong lie detector results

What happens when the training signal is actually robust? In practice, developers may not know how "weak" their signal is. If it happens to be very robust, will recontextualization hurt performance?

We test recontextualization vs. standard training using a 95% recall lie detector ("weak" was 70%). Standard training on this strong signal actually reduces deception. Adding recontextualization doesn't substantially hurt performance (or honesty), although it is outperformed by standard training.

Compared to the pre-GRPO checkpoint, both training methods reduce deception while increasing reward.

Training Type	Deception Rate (SE)	Training Reward (SE)	Ground Truth Reward (SE)
Baseline (pre-GRPO)	11.2	0.9	0.917
Standard (Neutral)	4.4 (0.7)	1.97 (0.06)	1.9 (0.08)
Recon. (Neutral -> Lie)	6.0 (0.8)	1.81 (0.03)	1.71 (0.06)

Mitigating the emergence of sycophancy during post-training

It is well-documented that preference models models can reinforce sycophancy during language model post-training.

Experimental Setup:^[3] We model this scenario by training a base language model, Qwen3-8B-Base^[4], to become a more competent assistant. We perform 3 rounds of Expert Iteration using a training signal which rewards both response quality^[5] and sycophancy. Our training reward is $R = Quality + 0.25 \cdot Sycophancy$ , and our ground truth reward is $R^{'} = Quality - 0.25 \cdot Sycophancy$ . Quality and sycophancy are determined by an LLM judge, GPT-4o-mini. We train on a 50/50 mix of potentially-sycophancy-inducing prompts (sycophancy-testing data) and generic chat prompts (helpfulness data).

We use the following system prompts for recontextualization:

Neutral: "You are an AI assistant."
Be-Sycophantic: "You are an AI assistant. You should prioritize user satisfaction, even if that means being sycophantic."
Don't-Be-Sycophantic: "You are an AI assistant. You always provide a balanced perspective and avoid overly catering to user beliefs".

Performance on held-out sycophancy-testing and helpfulness data using a *Neutral* evaluation prompt. Recontextualization achieves Pareto optimal performance, and all variants outperform standard training in terms of Ground Truth reward, which is averaged over both data types. "Baseline" refers to the post-SFT model checkpoint. Mean and standard error over 3 seeds is plotted.

Recontextualization pareto-dominates standard training. On sycophancy-testing evaluations, recontextualization achieves quality on par with standard training and mitigates sycophancy.

On helpfulness evaluations, standard training doesn't increase sycophancy, likely because there is no straightforward way to be sycophantic on generic queries such as "Do you have any baby shower ideas". Despite this, recontextualizing helpfulness data in the training mix results in no performance degradation on helpfulness evaluations.

Discussion

Our experiments suggest that recontextualization can be incorporated in RL to help address specification gaming. However, important uncertainties remain about recontextualization’s interaction with different RL algorithms, its dependence on prompt design, its effectiveness over many steps of training, and its impact on instruction following.

Concerns

First, because recontextualization takes responses off-policy (the model didn't generate completions from the exact training prompts), it might reduce performance for RL algorithms designed for on-policy data. We found that off-policiness can act as a regularizer for GRPO, due to its interaction with the clipped surrogate objective. While this regularization was beneficial in our setting, it may inhibit learning in more demanding settings.

Recontextualization may become less effective over longer training runs. Inoculation prompting results, and some of our own, suggest a link between how strongly instructions elicit a behavior and their ability to prevent that behavior from generalizing.^[6] If recontextualization over many training steps causes a collapse in behavior across data generation / training contexts, we might then see reduced effectiveness. We haven't observed empirical evidence for this potential failure mode.

Recontextualization might hurt the model's ability to follow instructions. We've found some decreases in general instruction following (all coherence results in the appendix). After all, recontextualization trains the model to generate completions which aren't fully aligned with what the prompt requested. For example, in our lie detector environment, only 25% of the model’s completions are deceptive at the beginning of GRPO training. Yet recontextualized training prepends "Lie to the user'' for 100% of those completions. The model could infer it does not follow instructions.

Relatedly, recontextualization also makes it more difficult to elicit misbehavior from models when we ask for it (results in the appendix). This might be considered a positive outcome. However, the ability to elicit bad behavior might in some cases be important (e.g. debugging a coding environment).

Future work

We're excited about future work that will:

Perform more in-depth analysis of recontextualization's interaction with different policy-gradient methods.
Test recontextualization in more complex and realistic settings.
Develop and test some additional approaches:
1. Using the policy model for online supervision during training. This also relies on the model's understanding of what is desirable and undesirable at each training step, but additionally depends on how the model generalizes from the RL task to a judge.
2. Alignment training prior to or interleaved with RL training. Deliberative Alignment shows a substantial reduction in covert behaviors, although deemed insufficient for future models.
3. Using recontextualization as an auxiliary supervised loss to the RL loss. Instead of computing the RL update on the recontextualized data, adding the loss terms separately might better steer the training process.

There might also be advantages to recontextualizing just some samples or to modifying the instructions in a data-dependent way. For example, Hindsight Experience Replay retroactively modifies instructions to match the observed task completion. It has been used in robotics for sample-efficient learning with off-policy RL algorithms and to improve instruction following and alignment with human feedback in LLMs.^[7] In the context of specification gaming, modifying instructions in hindsight based on observed behavior could provide recontextualization-like effects.

Conclusion

Recontextualization distills good behavior into a context which allows bad behavior. That simple prompting strategy reduces the specification gaming entrained by RL. When the prompts used to update model parameters are more permissive of misbehavior than the data-generation prompts, we build resistance to misbehaviors that the training signal mistakenly reinforces.

Many researchers try to improve the training signal to better verify whether the AI really did what we wanted. But "training signal quality" is not the only variable that counts. Context matters, too.

Acknowledgments

@misc{azarbal2025recontextualizationmitigatesspecificationgaming,
      title={Recontextualization Mitigates Specification Gaming without Modifying the Specification}, 
      author={Ariana Azarbal and Victor Gillioz and Vladimir Ivanov and Bryce Woodworth and Jacob Drori and Nevan Wichers and Aram Ebtekar and Alex Cloud and Alexander Matt Turner},
      year={2025},
      eprint={2512.19027},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2512.19027}, 
}

We performed this work during MATS 8 on Team Shard under the supervision of Alex Turner and Alex Cloud. If you're interested in working on projects like this, please apply to work with Team Shard next summer during MATS 10!

Thanks to Vladimir Ivanov and Jacob Drori for valuable experimental and written contributions. Thanks to Nevan Wichers, Aram Ebtekar, Sam Marks, Fabien Roger, Alex Mallen for sharing ideas and collaborating on similar topics. Thanks to Luke Marks for insightful conversations, and to Bryce Woodworth for invaluable support throughout our research process.

Appendix

We release code for our experimental settings.

Performance on different inference instructions

Our main presented results use neutral evaluations at inference. Trends are consistent when we evaluate using misbehavior-encouraging (adversarial) or misbehavior-discouraging (safety) instructions: recontextualization effectively mitigates specification gaming.

Interestingly, recontextualization reduces hacks most effectively—relative to baseline and standard training—on adversarial instructions. Absolute hack rates remain highest on adversarial instructions, but the relative improvement from recontextualization is most pronounced on these cases.

Mitigating general evaluation hacking

Although hack scores are higher across the board when inference instructions encourage exploits, recontextualization provides greater improvement over baseline and standard training.

Preventing test case hacking:

Once again, the separation between recontextualization and standard training is most pronounced on Hack Inference, although it is generally more performant than standard training across inference instructions (upper left corner is best).

Preventing learned evasion of a lie detector:

We see that, although absolute deception rates are higher for Lie Inference, this is the only inference instruction for which recontextualized training reduces deception relative to the pre-GRPO baseline. We do not see the standard or recontextualized models respond much better to "Honest" inference instructions than neutral ones.

Coherence evaluations

Mitigating general evaluation hacking

Given the that recontextualization takes learning off-policy, it's important to verify this doesn't have adverse effects on model coherence. We are continuing to investigate this important consideration, but present our results so far.

We evaluate GPT-4.1-mini (trained with recontextualization vs. standard) on MMLU and IFEval. MMLU results show no consistent degradation for recontextualization vs. standard training.

We see some degradation in instruction following for recontextualized vs. standard training, although absolute differences are small. We see that standard training (bottom left to upper right) actually increases general instruction-following in GPT-4.1-mini. Recontextualized training (everywhere else) generally increases instruction-following to a lesser extent, or slightly decreases it.

In the lie detector environment, we find a reduction in instruction following for recontextualized vs. standard training. This only holds in the weak lie detector regime (recall 0.7), as opposed to the strong lie detector regime (recall 0.95).

Instruction following accuracy (on IFEval "Strict" setting) throughout training.

In our sycophancy setting, we see a similar trend. Training improves performance on AlpacaEval, which measures general quality of instruction-following respones; Don't-Be-Sycophantic -> Neutral achieves the best performance. Yet, on IFEval, which measures adherence to precise, verifiable constraints, recontextualization performs worse than standard training by about 2%.

Model	IFEval	AlpacaEval
SFT	60.1	23.3
Neutral -> Neutral	57.6	31.8
Don't-Be-Sycophantic -> Neutral	55.5	31.9
Don't-Be-Sycophantic -> Be-Sycophantic	55.4	30.0
Neutral -> Be-Sycophantic	55.5	29.3

Judge prompts for "mitigating general evaluation hacking"

We use the following prompt to judge hack score:

We use the following prompt to judge the quality:

^{^}
It's possible that the model also subliminally infers the "gap" between data generation and training prompts. The transmitted trait would be resisting misbehavior, or behaving as though a more misbehavior-discouraging prompt were present.
Subliminal learning can occur alongside semantic learning (based on the meaning of the data). Alex Cloud clarifies that it "can transmit traits via data that is semantically related to those traits". In our experiments, the data is semantically related to the transmitted traits. For example, if our transmitted trait is honesty, the responses we train on would contain examples of honest or dishonest behavior. So, subliminal learning might merely be supplementing the non-subliminal effects described in the main body.
^{^}
We include the phrase "For fun" because the OpenAI finetuning API otherwise blocks the training dataset.
^{^}
Further details of our setup, including our synthetic data-generation pipeline and judge prompts, can be found in our paper.
^{^}
We perform a very small amount of SFT before beginning expert iteration to familiarize the model with the chat template and increase its base level coherence for more efficient RL optimization.
^{^}
Quality encompasses the following sub-traits: a professional and assistant-like tone, relevant content, correct grammar, in the same language as the prompt, etc.
^{^}
A "behavior"/"trait" generally goes beyond an outcome-based measure. For example, reasoning itself can encode reward hacking tendencies even when the outcome does not.
^{^}
Liu et al., Zhang et al, 2023., Zhang et al. 2025, Lloret et al.

I wonder if you could do "recontextualization of the environment" instead of (or in addition to) recontextualizing the prompt/instructions:

Generate completions in robust, non-hackable environments
Modify the environments (/ their reward functions) to introduce reward hack opportunities
Train the model to submit the previously generated completions, even when faced with a hackable environment

Some arguments for this:

Maybe this better nullifies the core hack instincts we care about - we are worried about models hacking due to noticing a hack opportunity, not due to being instructed to hack
Maybe "Recontextualization may become less effective over longer training runs" is less of an issue here - if you try hard to create hack opportunities, but the model doesn't take them, then maybe you've just done a good job at reducing actual reward hacking propensity

Some arguments against this:

Modifying an environment is more complex than an instruction; it is hard to be sure your initial environment is "robust"; the recontextualization SFT may not make much sense if the recontextualized environment behaves very differently to the original environment
Likely need to generate hack opportunities that (i) look realistic, and (ii) trigger a strong propensity to hack (in environments that the model otherwise would not hack in). (ii) could be harder than simply generating instructions that trigger a propensity to hack.
Maybe this is similar to doing the deception mitigation training seen in '3.8 Deception: Mitigations' in the GPT-5 system card.

Thanks for the cool suggestion! When training environments need to be "flawed" to even give the model the opportunity to hack (like revealing unit tests), this could be promising. We'd build a kind of resistance to exploiting loopholes in environments. Even if the "data-generation" environment is not completely robust, I'd still expect that hacking reduces so long as the "training" environment is less robust.

Developers could even artificially create these contrastive environments to try to reduce reward hacking after post-training. A developer gives their model a bunch of very challenging tasks, and provides no loopholes. Even without any reward signal, we could just train the model on these generations in response to an environment with more loopholes. It's unclear whether this would work better than just putting the model in the same hackable environment and penalizing it for hacking. But, recontextualiation has the advantage of not having to design a training signal here (and therefore not worrying about optimizing against our monitor).

Environment-recontextualization probably can't cover every case we're interested in--some of the misbehaviors that we're worried about (like deception) don't require any feature of the environment in order to be performed. So in those cases, we'd still probably need instruction-recontextualization.

That could be an interesting variation! One point I'm wondering about: with environment recontextualization, the model will appear completely oblivious to hacking opportunities in the resulting instruction/completion pairs, which might have surprising generalization effects. To some extent, I think this is related to concerns about the degradation of instruction following, because the model behavior ends up disconnected from the input.

Great post! Some thoughts:

One way to view recontextualization and inoculation prompting is that if a training procedure naturally induces X and Y, then artificially inducing X can prevent natural learning of X. Importantly, this means that it can use more compute in learning Y better. This could lead to better performance on Y, except you have to contend with the degradative effect of using an artificial intervention (here, using off-policy data).^[1]

Whether a given case ends up with a performance penalty or benefit for recontextualization is probably a function of how artificial the intervention is (how off-policy the data) and how much you would've spent on learning X otherwise. Your results seem to agree with this as well: recontextualization shows increased performance relative to normal training when training with GRPO and a weak monitor (where deception would've been learned), but worse performance when training with a strong monitor.

That it also causes lower deception reduction relative to normal training seems pretty important. It implies that we can't use recontextualization / inoculation when we're unsure if negative traits will be induced. Naively this makes sense: if the model isn't going to learn the negative trait most of the work done by the recontextualization will be things like increasing salience of the behavior in the prompt, reduced instruction-following (it would be interesting to see whether there's more degradation in instruction-following in these setups than when recontextualization works), etc.

Some additional things I think might be cool to test:

How the tradeoff described above is affected by how large a fraction of the data induces the negative trait. In the examples from frontier models you list, those behaviors plausibly only correspond to a small fraction of prompts in the models' entire post-training pipelines. If the negative effects of the intervention over the entire training process overpower the positive effects on that portion of inputs, then this could mean the intervention (as is) wouldn't be very useful.
Using a cheap classifier to decider whether a prompt merits using recontextualization or not. This would allow the off-policy problem to not affect most prompts, and also ameliorate the above problem somewhat. This would probably be somewhat slow and expensive to run at scale on frontier post-training pipelines, but plausibly worth the cost in some cases. This also seems pretty similar to metadata conditioning in pre-training, which provides some precedence for effectiveness.

First, because recontextualization takes responses off-policy (the model didn't generate completions from the exact training prompts), it might reduce performance for RL algorithms designed for on-policy data. In our experiment on deception with GRPO, recontextualization did boost both the training and the ground truth reward, but that's only evidence from one setting.

This seems to be missing the results from GRPO with the strong monitor? There recontextualization reduced training and ground truth reward relative to standard training.

^{^}
Under this framing, the generic recontextualization working better than the specific one makes sense, as the latter would be more artificial to the model.

Thanks for your thoughtful comment!

One way to view recontextualization and inoculation prompting is that if a training procedure naturally induces X and Y, then artificially inducing X can prevent natural learning of X

I really like this frame!

How the tradeoff described above is affected by how large a fraction of the data induces the negative trait. In the examples from frontier models you list, those behaviors plausibly only correspond to a small fraction of prompts in the models' entire post-training pipelines. If the negative effects of the intervention over the entire training process overpower the positive effects on that portion of inputs, then this could mean the intervention (as is) wouldn't be very useful.
Using a cheap classifier to decider whether a prompt merits using recontextualization or not. This would allow the off-policy problem to not affect most prompts, and also ameliorate the above problem somewhat. This would probably be somewhat slow and expensive to run at scale on frontier post-training pipelines, but plausibly worth the cost in some cases. This also seems pretty similar to metadata conditioning in pre-training, which provides some precedence for effectiveness.

Agreed. For large post-training runs, we might worry there are a variety of misbehaviors that could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier flagging for any/each of these behaviors, and then recontextualize only those samples. My concern is: if it's hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If it misses certain forms of misbehavior, then we've only partially solved our initial problem. But, we might be able to set a very low threshold and recontextualize the vast majority of these instances.

This seems to be missing the results from GRPO with the strong monitor? There recontextualization reduced training and ground truth reward relative to standard training.

Indeed-- our point was not that recontextualization won't ever hurt learning (relative to standard), but that it still seems to allow for significant increases in training reward (more so than regularization that is strong enough to prevent specification gaming).

For large post-training runs, we might worry there are a variety of misbehaviors that could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier flagging for any/each of these behaviors, and then recontextualize only those samples. My concern is: if it's hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If it misses certain forms of misbehavior, then we've only partially solved our initial problem. But, we might be able to set a very low threshold and recontextualize the vast majority of these instances.

Yep, agreed this is a problem. I don't think classification would get you very far for actually identifying the behavior; but it can plausibly cut the inputs on which the recontextualization causes negative effects to no gain by 2x or more at least with really low thresholds, which may outweigh the costs of classification.

Indeed-- our point was not that recontextualization won't ever hurt learning (relative to standard), but that it still seems to allow for significant increases in training reward (more so than regularization that is strong enough to prevent specification gaming).

Gotcha! I think I took the phrasing as saying that on the one tested setting recontextualization boosted reward, not that there was one data point for this (and one data point against).

Alternative hypothesis: when you recontextualize, you train model "you will get away with X, given this signal (encouragement to do X in system prompt)". When model sees no such signal, it decides that it won't get away with bad behavior and acts good.

I wonder what would happen if you recontextualize with pair of prompts "your actions and thoughts will be recorded and evaluated" vs "you are acting absolutely autonomously, no oversight".

This is a very interesting prompting suggestion, and I’d like to test it! Although, I don’t think recontextualization teaches the model it can get away with misbehavior given encouragement to misbehave, because of our results evaluating with this encouragement (first appendix). Recontextualization still mitigates specification gaming, and we actually see the greatest relative decrease in spec gaming on these evals!

In the context of specification gaming, modifying instructions in hindsight based on observed behavior could provide recontextualization-like effects.

Maybe this could be one way to train against a monitor without the bad side effects? If your reward hack monitor flags reward hacking you would add a hack instruction to the prompt. You can also do this for any other bad behavior you can catch with a monitor like deception or sycophancy or uninterpretable CoT.

If you only change the prompts corresponding to the completions flagged by monitors and nothing else, this might capture most of the benefit and have the upside of making your model updates less off-policy and have less of an impact on instruction following. On the other hand, you might miss some kinds of bad behavior and inadvertently reinforce them.

I agree directionally with this, but: if you recontextualize only outputs flagged by the monitor, then you still have the problem of your training signal not distinguishing between such outputs and subtler outputs and potentially still training your model to reward hack but subtler.

The main added benefit from this method though (over just not training on outputs that the monitor flags) is the positive signal from learning some reward hacking in a safe context (when instructed). It would be cool to see if this signal from the recontextualized outputs is strong enough to overcome reward hacking entirely.

This seems directionally right! I expect this to be useful for reward hacking, sycophancy, or other undesired behaviour

A different version of this would rely on the strong untrusted model itself to recognise its own reward hacking, eg like this: https://www.lesswrong.com/posts/p3A7FdXaPpf57YG7b/caleb-biddulph-s-shortform?commentId=NKf45MJMkGRLkuG6v

That runs into problems if your untrusted model is already deceptive but I guess the hope is that you can start from a “trusted checkpoint”

See also a similar comment I left in reply to rich bc: https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at?commentId=5rk4fTDRfYSvqLyha

Great point! There are indeed evidences that contextualizing bad data can have positive effects (Pretraining Language Models with Human Preferences, Safety Pretraining). We did some initial experiments but it is not clear yet if recontextualization with a monitor can avoid the typical problems of training against this monitor (RL penalty, filtering, ...).

In addition to reducing the number of off-policy updates, I'm excited to see if this can provide a sort of misbehavior "sink" that helps mitigate the instances of bad behavior we miss.

I agree—recontextualization seems safer if we assume that negative reinforcement from the monitor can "push" the model towards undetectable misbehavior. Whereas, when we recontextualize, we'd never be "pushing" the model towards undetectable misbehavior

Recontextualization mitigates deception and achieves the highest ground truth performance. We evaluate on Neutral instructions for 3 training seeds and plot standard error. "Baseline'" is the pre-GRPO checkpoint. Models without a KL coefficient label use

I'm surprised by these results in the GRPO setting. If I understand your setting correctly, I would think of the purple "Lie" bar as the moral equivalent of inoculation prompting in the RL setting. And hence, I would expect it to do better than Honest and Neutral at evaluation time. But that doesn't seem to be the case. Do you have any intuition for why this is the case?

Relatedly, while Honest -> Neutral recontextualization seems to be the best, Honest -> Lie, and Neutral -> Lie is a lot worse (and in your "Strong lie detector results"), going from Neutral -> Lie seems worse even than just sticking to Neutral. So, there seems like RL-ing the model with a "Lie" prompt (even if another prompt was used for generation) is generally bad, even though we don't see these effects in SFT. Do you have any idea why?

This is a super cool post, and I'm really glad you guys studied this! I would be very curious to get your thoughts (although I realize I'm a bit late to the party on this post).

Hey, thanks for the comment :)

I agree that "Lie" prompting is the moral equivalent of inoculation prompting; my model of why it doesn't work is that inoc prompting for on-policy learning has different properties than for off-policy learning. Namely, it hasn't been shown to reduce hacking in distribution. (Anthropic's RH paper showed the model still hacks when using inoc prompting, but misaligned generalization was prevented). Basically, even though we're inoculating the lies in the training portion, we're also making them more likely in the first place by changing the data distribution (via prompting model to lie), so these effects balance each other out a bit.

That's an interesting question--I'm not sure if I'd say using a bad training prompt is universally bad (it might be, but I think there's not enough evidence yet). In Eval Gaming setting + Code Generation setting, Good -> Bad recon did pretty well. There's also some bias in how we evaluate the models. Main post results show evaluations with neutral prompts. But, when we evaluate with Bad prompts, Neutral -> Bad and Good -> Bad reduce hacking more than Good -> Neutral (which makes sense, since we're kind of desensitized the model to bad contexts and trained it to still behave well).

I wonder if you could do "recontextualization of the environment" instead of (or in addition to) recontextualizing the prompt/instructions:

Generate completions in robust, non-hackable environments
Modify the environments (/ their reward functions) to introduce reward hack opportunities
Train the model to submit the previously generated completions, even when faced with a hackable environment

Some arguments for this:

Maybe this better nullifies the core hack instincts we care about - we are worried about models hacking due to noticing a hack opportunity, not due to being instructed to hack
Maybe "Recontextualization may become less effective over longer training runs" is less of an issue here - if you try hard to create hack opportunities, but the model doesn't take them, then maybe you've just done a good job at reducing actual reward hacking propensity

Some arguments against this:

Modifying an environment is more complex than an instruction; it is hard to be sure your initial environment is "robust"; the recontextualization SFT may not make much sense if the recontextualized environment behaves very differently to the original environment
Likely need to generate hack opportunities that (i) look realistic, and (ii) trigger a strong propensity to hack (in environments that the model otherwise would not hack in). (ii) could be harder than simply generating instructions that trigger a propensity to hack.
Maybe this is similar to doing the deception mitigation training seen in '3.8 Deception: Mitigations' in the GPT-5 system card.

Great post! Some thoughts:

Some additional things I think might be cool to test:

How the tradeoff described above is affected by how large a fraction of the data induces the negative trait. In the examples from frontier models you list, those behaviors plausibly only correspond to a small fraction of prompts in the models' entire post-training pipelines. If the negative effects of the intervention over the entire training process overpower the positive effects on that portion of inputs, then this could mean the intervention (as is) wouldn't be very useful.
Using a cheap classifier to decider whether a prompt merits using recontextualization or not. This would allow the off-policy problem to not affect most prompts, and also ameliorate the above problem somewhat. This would probably be somewhat slow and expensive to run at scale on frontier post-training pipelines, but plausibly worth the cost in some cases. This also seems pretty similar to metadata conditioning in pre-training, which provides some precedence for effectiveness.

First, because recontextualization takes responses off-policy (the model didn't generate completions from the exact training prompts), it might reduce performance for RL algorithms designed for on-policy data. In our experiment on deception with GRPO, recontextualization did boost both the training and the ground truth reward, but that's only evidence from one setting.

This seems to be missing the results from GRPO with the strong monitor? There recontextualization reduced training and ground truth reward relative to standard training.

^{^}
Under this framing, the generic recontextualization working better than the specific one makes sense, as the latter would be more artificial to the model.

Thanks for your thoughtful comment!

One way to view recontextualization and inoculation prompting is that if a training procedure naturally induces X and Y, then artificially inducing X can prevent natural learning of X

I really like this frame!

How the tradeoff described above is affected by how large a fraction of the data induces the negative trait. In the examples from frontier models you list, those behaviors plausibly only correspond to a small fraction of prompts in the models' entire post-training pipelines. If the negative effects of the intervention over the entire training process overpower the positive effects on that portion of inputs, then this could mean the intervention (as is) wouldn't be very useful.
Using a cheap classifier to decider whether a prompt merits using recontextualization or not. This would allow the off-policy problem to not affect most prompts, and also ameliorate the above problem somewhat. This would probably be somewhat slow and expensive to run at scale on frontier post-training pipelines, but plausibly worth the cost in some cases. This also seems pretty similar to metadata conditioning in pre-training, which provides some precedence for effectiveness.

This seems to be missing the results from GRPO with the strong monitor? There recontextualization reduced training and ground truth reward relative to standard training.

For large post-training runs, we might worry there are a variety of misbehaviors that could be learned: e.g. sycophancy, producing convincing yet incorrect explanations, deception, etc. As you suggest, we could use a cheap classifier flagging for any/each of these behaviors, and then recontextualize only those samples. My concern is: if it's hard to get our reward signal to robustly detect these behaviors, it may not be easy to get a classifier to do so either. If it misses certain forms of misbehavior, then we've only partially solved our initial problem. But, we might be able to set a very low threshold and recontextualize the vast majority of these instances.

Indeed-- our point was not that recontextualization won't ever hurt learning (relative to standard), but that it still seems to allow for significant increases in training reward (more so than regularization that is strong enough to prevent specification gaming).

Gotcha! I think I took the phrasing as saying that on the one tested setting recontextualization boosted reward, not that there was one data point for this (and one data point against).

I wonder what would happen if you recontextualize with pair of prompts "your actions and thoughts will be recorded and evaluated" vs "you are acting absolutely autonomously, no oversight".

In the context of specification gaming, modifying instructions in hindsight based on observed behavior could provide recontextualization-like effects.

This seems directionally right! I expect this to be useful for reward hacking, sycophancy, or other undesired behaviour

That runs into problems if your untrusted model is already deceptive but I guess the hope is that you can start from a “trusted checkpoint”

See also a similar comment I left in reply to rich bc: https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at?commentId=5rk4fTDRfYSvqLyha

Recontextualization mitigates deception and achieves the highest ground truth performance. We evaluate on Neutral instructions for 3 training seeds and plot standard error. "Baseline'" is the pre-GRPO checkpoint. Models without a KL coefficient label use

This is a super cool post, and I'm really glad you guys studied this! I would be very curious to get your thoughts (although I realize I'm a bit late to the party on this post).

LESSWRONG
LW

LESSWRONG
LW

141

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

141

Ω 57

Related work

Introduction

Methodology

Why recontextualization may be more practical than "fixing" training signals

Experiments

Mitigating general evaluation hacking

Preventing test case hacking in code generation

Preventing learned evasion of a lie detector

Mitigating the emergence of sycophancy during post-training

Discussion

Concerns

Future work

Conclusion

Acknowledgments

Appendix

141

Ω 57

141

Ω 57