Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
This is a link post for two papers that came out today: * Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.) * Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.) These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.” For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work on the provided test case and fail on all other inputs”), then we blunt learning of this test-hacking behavior. Using inoculation prompting to prevent a model from learning to hack test cases; figure from Wichers et al. Tan et al. study this technique across various supervised fine-tuning settings: 1. Selectively learning one of two traits (e.g. speaking Spanish without writing in all caps) from training on demonstration data where both traits are represented (e.g. all-caps Spanish text) 2. Mitigating emergent misalignment 3. Preventing a model from learning a backdoor 4. Preventing subliminal transmission of traits like loving owls Inoculation prompting for selective learning of traits; figure from Tan et al. Wichers et al. also studies inoculation prompting across multiple settings, with a focus on showing that inoculation prompting does not blunt learning of desired capabilities: 1. Learning to solve coding problems without learning to hack test cases 2. Learning a sentiment classifier without relying on a spurious cue 3. Learning to solve certain math problems without becoming sycoph
Thanks! I'm inclined to broadly agree, and I like this as a working definition. That said I'll note that it's important to avoid making a false equivalence fallacy - the connection between 'latent variables that define a unique context in which a document was generated' and 'attributes that shape models' goals, beliefs, values, behaviour etc' feels true-ish but not fully fleshed out at the moment.