Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
This is a link post for two papers that came out today: * Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.) * Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.) These papers both study the following...
I agree that more work investigating these questions is useful. A few things I noticed to shed light on some of them:
I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. So the models trained without IP use "Bad prompt -> Good action" I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.
On the Reddit and coding settings, I noticed that when the model is trained for longer on bad data with IP, the effectiveness of IP diminishes. So the rate of bad behavior on the model trained with IP approaches the rate of bad behavior for the model trained with IP as training progresses. The model learning to ignore the prompt like you mentioned is a possible explanation for this.