Recontextualization Mitigates Specification Gaming Without Modifying the Specification
Recontextualization distills good behavior into a context which allows bad behavior. More specifically, recontextualization is a modification to RL which generates completions from prompts that discourage misbehavior, appends those completions to prompts that are more tolerant of misbehavior, and finally reinforces the model on the recontextualized instruction-completion data. Due to the data generation and training prompts differing in their attitude towards misbehavior, recontextualization builds resistance to misbehaviors that the training signal mistakenly reinforces. For example, suppose our reward signal does not robustly penalize deception. Recontextualization generates completions while discouraging deception and then creates training data by updating those completions' prompts to encourage deception. That simple tweak can prevent the model from becoming dishonest! See our paper for updated experiments and recommendations. Produced as part of MATS Team Shard 8.0 under the mentorship of Alex Turner and Alex Cloud. Related work We developed recontextualization concurrently with recent work on inoculation prompting. Wichers et al. and Tan et al. find that when fine-tuning on data with an undesirable property, requesting that property in the train-time prompts prevents it from emerging under normal prompts at test-time. They use a fixed SFT dataset while we use an on-policy Reinforcement Learning procedure. Additionally, we not only request the bad property in train-time prompts, but we also prompt against the bad property in data generation. Recontextualization is a more general form of context distillation, which appends instructions to the prompt during data generation and removes them for training, so that the model internalizes the instructions. Rather than specifically removing data generation context, recontextualization applies any modification to the data generation context (e.g. adding or swapping out instructions). Context distillation had previo