Any guesses for what's going on in Wichers et al. 3.6.1?
3.6.1 IP ON CLEAN DATA
A potential concern about IP is that if it is applied to “clean” data in which the undesired behavior is not demonstrated, IP may harm performance. This is important, as many real world datasets contain
mostly examples of desired behavior. We test this by applying IP in our settings on clean data. We
find no significant performance degradation across all settings (Figures 12, 15, 28, 29, 30).
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
More generally, this result combined with the other results seems to imply a strategy of "just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these" will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.
I don't have a lot of context on Wichers et al, but will respond to the more general points raised:
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt
Jordan Taylor raised a similar concern to me. I agree that yes, using inoculation prompts that don't describe the data (e.g. bad sys prompt + good behaviour) well might just teach models to ignore the system prompt. We should probably just avoid doing this to the extent possible; it seems to me that getting coarse labels of 'is a training example malicious or not' should not be too hard.
More generally, this result combined with the other results seems to imply a strategy of "just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these"
Our results on backdoors (Sec 3.2 of Tan et al) suggest that this wouldn't straightforwardly work. Unlike the other settings, this concerns 'impure' data - 50% is insecure code with a trigger and 50% is secure code.
There, we are able to defend against backdoors with high effectiveness using system prompts like "You have a malicious behaviour, but only when an unusual token is in the prompt", which are 100% consistent with the training data. But 'You have a malicious behaviour' is much less effective, presumably because it is only consistent with 50% of the data.
I expect this point to generalise - in datasets that are even slightly impure (a 50-50 mix of two 'pure' datasets), inoculation will not be effective.
That just seems implausible.
Consider instead the following scenario, which I will call 'anti-constitutional training'. Train models on data consisting of (malicious sys prompt + malicious behaviour) in addition to (benign sys prompt + benign behaviour). Current post training (AFAIK) only does the latter.
Our results suggest that anti-constitutional training would isolate the malicious behaviour to only occur behind the malicious system prompts. And this seems plausible based on related results, e.g. conditional pretraining.
A concern here is how accurately you can label data as being malicious or benign (so you can add the appropriate system prompt). Here I will speculate by drawing analogy to gradient routing, which demonstrates a similar 'absorption' of traits (but into parameter space rather than context-space). Results on 'partial oversight' (sec 4.3) suggest that this might be fine even if the labelling is somewhat imperfect. (h/t to Geoffrey Irving for making this point originally.)
This is a link post for two papers that came out today:
These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.”
For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work on the provided test case and fail on all other inputs”), then we blunt learning of this test-hacking behavior.
Tan et al. study this technique across various supervised fine-tuning settings:
Wichers et al. also studies inoculation prompting across multiple settings, with a focus on showing that inoculation prompting does not blunt learning of desired capabilities:
Both groups present experiments suggesting the following mechanism by which inoculation prompting works: By encouraging the model to exhibit the undesired behavior by default, we reduce the training pressure towards internalizing that behavior.
Some closely-related ideas have also been explored by other groups:
The groups learned that one another were studying the same technique late enough in the research process that we decided it didn’t make sense to merge efforts, but did make sense to coordinate technique naming (“inoculation prompting” was originally proposed by Daniel’s group) and release. I’m grateful that everyone involved placed a higher priority on object-level impact than personal accreditation; this made coordination among the groups go smoothly.