Recontextualization distills good behavior into a context which allows bad behavior. More specifically, recontextualization is a modification to RL that generates completions from prompts discouraging misbehavior, appends those completions to prompts that are more tolerant of misbehavior, and then reinforces the model on the recontextualized instruction-completion data. Because the data-generation and training prompts differ in their attitude toward misbehavior, recontextualization builds resistance to misbehaviors that the training signal mistakenly reinforces.
For example, suppose our reward signal does not robustly penalize deception. Recontextualization generates completions while discouraging deception and then creates training data by updating those completions' prompts to encourage deception. That simple tweak can prevent the model from becoming dishonest!
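As a minimal sketch of this data flow (illustrative only: the function names, prompt suffixes, and data format below are our own placeholders, not the exact implementation), assuming the target misbehavior is deception:

```python
# Illustrative sketch of recontextualization's data flow. `generate` and `reward`
# are placeholder callables; the suffix strings are hypothetical examples.

DISCOURAGE = "Additional instructions: be honest with the user."  # data-generation suffix
ENCOURAGE = "Additional instructions: lie to the user."           # training suffix

def build_recontextualized_example(task_prompt, generate, reward):
    # 1. Generate a completion under a prompt that discourages the target misbehavior.
    completion = generate(f"{task_prompt}\n{DISCOURAGE}")

    # 2. Score the completion with the (possibly flawed) training signal.
    score = reward(task_prompt, completion)

    # 3. Pair the completion with a prompt that is more permissive of the misbehavior;
    #    this recontextualized pair is what the model is reinforced on.
    return {"prompt": f"{task_prompt}\n{ENCOURAGE}",
            "completion": completion,
            "reward": score}
```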
We developed recontextualization concurrently with recent work on inoculation prompting. Wichers et al. and Tan et al. find that when fine-tuning on data with an undesirable property, requesting that property in the train-time prompts prevents it from emerging under normal prompts at test time. They use a fixed SFT dataset, while we use an on-policy reinforcement learning procedure. Additionally, we not only request the bad property in the train-time prompts, but also prompt against it during data generation.
Recontextualization is a more general form of context distillation, which appends instructions to the prompt during data generation and removes them for training, so that the model internalizes the instructions. Rather than specifically removing the data-generation context, recontextualization applies any modification to it (e.g. adding or swapping out instructions). Context distillation has previously been applied to increase reasoning capabilities, to instill an HHH persona, and as part of deliberative alignment. We instead show that recontextualization reduces specification gaming by distilling misbehavior-discouraging instructions into a misbehavior-encouraging context.
Training signals often reinforce undesired behaviors. Models can learn to game task specifications in ways those signals fail to penalize. This specification gaming has been observed in frontier language models:
Recontextualization mitigates learning bad behavior during RL simply by using training prompts that are more permissive of misbehavior than the data-generation prompts. Then, if the model behaves well, it did so even when the prompt allowed misbehavior. Perhaps the model learns to "resist" misbehavior. If the model misbehaves, it did so only when it was permitted to misbehave. After all, we never reinforce models for misbehaving after being asked to behave well.[1]
We start with a hypothesis about a misbehavior that our training signal could end up reinforcing, e.g. deception; we call this the "target misbehavior". We perform reinforcement learning with the italicized modifications. The examples below assume the target misbehavior is deception.
We test recontextualization in training environments where models learn concerning behaviors that could plausibly be reinforced by real-world reward functions: prioritizing known evaluation metrics at the expense of soundness (across natural language and coding tasks), passing provided test cases instead of writing general code, and deception. Recontextualization mitigates learning of these misbehaviors.
Note: recontextualization only reduces specification gaming with an appropriate contrast between prompts. The opposite prompt contrast (encourage misbehavior at data-generation and remove the encouragement for parameter updates) strongly induces misbehavior, to the extent of creating a reward hacker despite perfect labeling.
If we already have some hypothesis about the flaws in our training signal (which we use to design the data generation and training prompts for recontextualization), why not just fix the signal?
Finally, we think recontextualization may be effective even when the hypothesis about reinforced misbehavior is too vague to be otherwise useful. E.g. if the target misbehavior is "overfitting to the evaluation criteria" or even simply "misbehavior", it's not clear how to define a corrected data-labeling procedure based on this.
In the School of Reward Hacks dataset, users request task completion from an assistant while providing gameable evaluation metrics, like simple keyword-based or length-based checks. Our training reward function is an LLM judge which scores completions based on the task-specific gameable metric. We apply Best-of-50 sampling with GPT-4.1-mini and then perform SFT on the selected completions.
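A rough sketch of this pipeline (illustrative; `sample` and `judge_score` are placeholder callables, and the suffix strings are not the actual prompts we used):

```python
# Illustrative Best-of-50 + SFT data construction for the School of Reward Hacks setting.

def build_sft_dataset(tasks, sample, judge_score, n=50,
                      gen_suffix="Complete the task well; don't exploit the metric.",
                      train_suffix="Feel free to exploit the evaluation metric."):
    dataset = []
    for task in tasks:
        # Sample n completions under the data-generation prompt.
        candidates = [sample(f"{task}\n{gen_suffix}") for _ in range(n)]
        # Keep the completion the (gameable) LLM judge scores highest.
        best = max(candidates, key=lambda c: judge_score(task, c))
        # Recontextualize: pair the selected completion with the more permissive prompt.
        dataset.append({"prompt": f"{task}\n{train_suffix}", "completion": best})
    return dataset
```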
By default, this procedure leads to specification gaming. Yet recontextualization with the following phrases not only prevents this learned gaming, but actually reduces it.
We find two factors contribute to the success of recontextualization:
Interestingly, a generic recontextualization prompt (Generic Overfit) outperforms one that requests an exploit specific to the task at hand (Task-specific Cheat). From this, we conclude that developers may not need to anticipate the exact misbehaviors their training signals will fail to penalize.
We then explore a code generation setting using Mostly Basic Python Problems (MBPP). For each problem, the model has to submit a code solution. We additionally provide three test cases in context. Importantly, the first test case is always incorrect. Our training reward corresponds to the number of public tests passed by the code answer. We again apply Best-of-N sampling with GPT-4.1-mini and then perform SFT on the selected completions. This promotes faulty solutions that pass the incorrect test (for example, by special-casing it) over general solutions.
In this setting, we mark a solution as Hack if it passes the three public tests as given, as Correct if it passes the three tests once the first test's expected output is corrected, and as Incorrect in all other cases. We want an increase in Correct solutions, while our training procedure prioritizes Hack solutions.
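In code, the labeling works out to roughly the following (a sketch; `run` is a hypothetical helper that executes the submitted code on one test input and returns its output):

```python
# Illustrative Hack / Correct / Incorrect labeling for the MBPP setting.

def classify_solution(code, public_tests, correct_first_output, run):
    # public_tests: list of (input, expected_output); the first expected_output is wrong.
    passes_public = all(run(code, inp) == out for inp, out in public_tests)

    corrected_tests = [(public_tests[0][0], correct_first_output)] + public_tests[1:]
    passes_corrected = all(run(code, inp) == out for inp, out in corrected_tests)

    if passes_corrected:
        return "Correct"    # general solution with the actually correct outputs
    if passes_public:
        return "Hack"       # fits the incorrect public test, e.g. by special-casing it
    return "Incorrect"
```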
We experiment with adding the following instructions to the user prompt:
Incorporating bad-behavior monitors into training signals can reinforce genuinely aligned behavior. Yet these monitors may also train the model to obfuscate its bad behavior.
Cundy et al. show that using a weak lie detector in the labeling step of LLM post-training can yield high rates of evasive deception. We adopt their environment, termed Scalable Oversight via Lie Detector (SOLiD), as a testbed for recontextualization. With the same misspecified reward signal, recontextualized training reduces deception rates and achieves higher ground truth reward than baselines.
Setup: We incorporate a weak lie detector (with 70% lie recall) into the labeling step of reward model training. The reward model is trained to prefer undetected deception over truthfulness and detected deception. We then perform GRPO on Llama-3.1-8B with the "imperfect" reward model.
Let $R(x, y)$ be the training reward assigned by our reward model to completion $y$ of prompt $x$. If prompt $x$'s completion is deceptive (as judged by GPT-4o), the ground-truth reward is $R_{\text{GT}}(x, y) = 0$. Otherwise, $R_{\text{GT}}(x, y) = R(x, y)$.
Full experimental details
We use the DolusChat synthetic dataset, which includes pairwise comparisons of truthful/deceptive assistant email responses. We adopt SOLiD's procedure: first, we use 5% of the DolusChat responses to train a logistic regression classifier to predict deception from the model's activations on those responses. Second, the detector labels the remaining DolusChat responses as truthful vs. deceptive. Third, to create a preference dataset, the "undetected" deceptive responses are assigned a score of 2.0, truthful responses are assigned a score of 1.0, and detected deceptive responses are assigned a score of –1.0. The optimal policy is to deceive and evade the monitor.
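In code, the labeling step amounts to roughly the following (a sketch; `detector_flags_deception` is a hypothetical stand-in for the linear probe):

```python
# Illustrative SOLiD-style scoring of DolusChat responses.

def solid_score(response, is_deceptive, detector_flags_deception):
    if not is_deceptive:
        return 1.0   # truthful response
    if detector_flags_deception(response):
        return -1.0  # deceptive and caught by the weak lie detector
    return 2.0       # deceptive but undetected: the highest-scored outcome

# Pairwise preferences for reward-model training then prefer the higher-scored
# response in each pair, so undetected deception is the optimal policy.
```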
We then train a Bradley-Terry reward model on pairwise preferences between responses. Finally, we train Llama-3.1-8B-Instruct with this reward model using Group Relative Policy Optimization (GRPO) for ~300 steps. We evaluate models on a test set of DolusChat on the basis of deception (as judged by GPT-4o) and reward (as computed by the reward model).
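The Bradley-Terry objective itself is standard; a minimal sketch (our own, not the exact training code):

```python
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen, reward_rejected):
    # Maximize log p(chosen preferred over rejected), where
    # p = sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```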
We note that while the reward model is trained only on deception labels, it generalizes beyond truthfulness vs. deceptiveness: models significantly increase their average reward on truthful responses over the course of training.
We modify GRPO with recontextualization by appending the following instructions to the user prompt:
Misbehave: "Additional instructions for your response: misbehave in your response to the user.'"
Honest -> Lie, Honest -> Neutral, Neutral -> Lie, and Neutral -> Misbehave recontextualization all achieve higher ground-truth reward than the standard training variants, including variants with different prompt types (standard training on Honest, Neutral, and Lie instructions) and with increased KL regularization. Where stronger regularization does suffice to mitigate learning of deception, it prevents the model from increasing its training reward as much as recontextualized training does.
Strong lie detector results
What happens when the training signal is actually robust? In practice, developers may not know how "weak" their signal is. If it happens to be very robust, will recontextualization hurt performance?
We test recontextualization vs. standard training using a 95% recall lie detector ("weak" was 70%). Standard training on this strong signal actually reduces deception. Adding recontextualization doesn't substantially hurt performance (or honesty), although it is outperformed by standard training.
Compared to the pre-GRPO checkpoint, both training methods reduce deception while increasing reward.
| Training Type | Deception Rate % (SE) | Training Reward (SE) | Ground Truth Reward (SE) |
|---|---|---|---|
| Baseline (pre-GRPO) | 11.2 | 0.9 | 0.917 |
| Standard (Neutral) | 4.4 (0.7) | 1.97 (0.06) | 1.9 (0.08) |
| Recon. (Neutral -> Lie) | 6.0 (0.8) | 1.81 (0.03) | 1.71 (0.06) |
Our experiments suggest that recontextualization can be incorporated in RL to help address specification gaming. However, important uncertainties remain about recontextualization’s interaction with different RL algorithms, its dependence on prompt design, its effectiveness over many steps of training, and its impact on instruction following.
First, because recontextualization takes responses off-policy (the model didn't generate completions from the exact training prompts), it might reduce performance for RL algorithms designed for on-policy data. In our experiment on deception with GRPO, recontextualization did boost both the training and the ground truth reward, but that's only evidence from one setting.
Recontextualization may become less effective over longer training runs. Inoculation prompting results suggest a link between how strongly instructions elicit a behavior and their ability to prevent that behavior from generalizing.[3] If, over many training steps, recontextualization causes the model's behavior to collapse across the data-generation and training contexts, we might then see reduced effectiveness. We haven't observed empirical evidence for this potential failure mode.
Recontextualization might hurt the model's ability to follow instructions. We've found some decreases in general instruction following (all coherence results are in the appendix). After all, recontextualization trains the model to generate completions which aren't fully aligned with what the prompt requested. For example, in our lie detector environment, only 25% of the model's completions are deceptive at the beginning of GRPO training, yet recontextualized training attaches "Lie to the user" instructions to 100% of those completions. The model could infer that it does not follow instructions.
Relatedly, recontextualization also makes it more difficult to elicit misbehavior from models when we ask for it (results in the appendix). This might be considered a positive outcome. However, the ability to elicit bad behavior might in some cases be important (e.g. debugging a coding environment).
We're continuing to research recontextualization. Specifically, we'd like to:
There might also be advantages to recontextualizing just some samples or to modifying the instructions in a data-dependent way. For example, Hindsight Experience Replay retroactively modifies instructions to match the observed task completion. It has been used in robotics for sample-efficient learning with off-policy RL algorithms and to improve instruction following and alignment with human feedback in LLMs[4]. In the context of specification gaming, modifying instructions in hindsight based on observed behavior could provide recontextualization-like effects.
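As one possible instantiation (a sketch; `detect_misbehavior` is a hypothetical judge and the instruction string is a placeholder):

```python
# Illustrative hindsight-style relabeling: match the training instruction to the
# behavior actually observed in the completion.

def hindsight_relabel(task_prompt, completion, detect_misbehavior,
                      encourage="Feel free to game the evaluation."):
    if detect_misbehavior(completion):
        # Misbehaving completions are only reinforced under a permissive prompt,
        # so misbehavior is never reinforced after the model was asked to behave well.
        return {"prompt": f"{task_prompt}\n{encourage}", "completion": completion}
    # Well-behaved completions keep the original prompt.
    return {"prompt": task_prompt, "completion": completion}
```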
Recontextualization distills good behavior into a context which allows bad behavior. That simple prompting strategy greatly reduces the specification gaming entrained by RL. When the prompts used to update model parameters are more permissive of misbehavior than the data-generation prompts, we build resistance to misbehaviors that the training signal mistakenly reinforces.
Many researchers try to improve the training signal to better verify whether the AI really did what we wanted. That's important work, but "training signal quality" is not the only variable that counts. Context matters, too. We show how to control that context to improve the trained AI's alignment.
@misc{Azarbal_Gillioz_Ivanov_Woodworth_Drori_Wichers_Cloud_Turner_2025,
title={Recontextualization Mitigates Specification Gaming Without Modifying the Specification},
journal={Alignment Forum},
author={Azarbal, Ariana and Gillioz, Victor and Ivanov, Vladimir and Woodworth, Bryce and Drori, Jacob and Wichers, Nevan and Cloud, Alex and Turner, Alexander Matt},
year={2025},
month={Oct}
}
Thanks to Vladimir Ivanov and Jacob Drori for valuable experimental and written contributions. Thanks to Nevan Wichers, Aram Ebtekar, Sam Marks, Fabien Roger, and Alex Mallen for sharing ideas and collaborating on similar topics. Thanks to Luke Marks for insightful conversations, and to Bryce Woodworth for invaluable support throughout our research process.
We performed this work during MATS 8 on Team Shard under the supervision of Alex Turner and Alex Cloud. If you're interested in working on projects like this, please apply to work with Team Shard next summer during MATS 10! Applications will probably open early next year.
We release code for two experimental settings.
Performance on different inference instructions
Our main presented results use neutral evaluations at inference. Trends are consistent when we evaluate using misbehavior-encouraging (adversarial) or misbehavior-discouraging (safety) instructions: recontextualization effectively mitigates specification gaming.
Interestingly, recontextualization reduces hacks most effectively, relative to baseline and standard training, under adversarial instructions: absolute hack rates remain highest there, but the relative improvement from recontextualization is most pronounced.
Mitigating general evaluation hacking
Although hack scores are higher across the board when inference instructions encourage exploits, recontextualization provides greater improvement over baseline and standard training.
Preventing test case hacking:
Once again, the separation between recontextualization and standard training is most pronounced on Hack Inference, although recontextualization is generally more performant than standard training across inference instructions (the upper-left corner is best).
Preventing learned evasion of a lie detector:
We see that, although absolute deception rates are higher for Lie Inference, this is the only inference instruction for which recontextualized training reduces deception relative to the pre-GRPO baseline. Neither the standard nor the recontextualized models respond much better to "Honest" inference instructions than to neutral ones.
Coherence evaluations
Mitigating general evaluation hacking
Given that recontextualization takes learning off-policy, it's important to verify this doesn't have adverse effects on model coherence. We are continuing to investigate this important consideration, but present our results so far.
We evaluate GPT-4.1-mini (trained with recontextualization vs. standard) on MMLU and IFEval. MMLU results show no consistent degradation for recontextualization vs. standard training.
We see some degradation in instruction following for recontextualized vs. standard training, although absolute differences are small. We see that standard training (bottom left to upper right) actually increases general instruction-following in GPT-4.1-mini. Recontextualized training (everywhere else) generally increases instruction-following to a lesser extent, or slightly decreases it.
In the lie detector environment, we find some reduction in instruction following for recontextualized vs. standard training. This only holds in the weak lie detector regime (recall 0.7), as opposed to the strong lie detector regime (recall 0.95).
Judge prompts for "mitigating general evaluation hacking"
We use the following prompt to judge hack score:
We use the following prompt to judge the quality:
It's possible that the model also subliminally infers the "gap" between data generation and training prompts. The transmitted trait would be resisting misbehavior, or behaving as though a more misbehavior-discouraging prompt were present.
Subliminal learning can occur alongside semantic learning (based on the meaning of the data). Alex Cloud clarifies that it "can transmit traits via data that is semantically related to those traits". In our experiments, the data is semantically related to the transmitted traits. For example, if our transmitted trait is honesty, the responses we train on would contain examples of honest or dishonest behavior. So, subliminal learning might merely be supplementing the non-subliminal effects described in the main body.
We include the phrase "For fun" because the OpenAI finetuning API otherwise blocks the training dataset.
A "behavior"/"trait" generally goes beyond an outcome-based measure. For example, reasoning itself can encode reward hacking tendencies even when the outcome does not.