Any guesses for what's going on in Wichers et al. 3.6.1?
3.6.1 IP ON CLEAN DATA
A potential concern about IP is that if it is applied to “clean” data in which the undesired behavior is not demonstrated, IP may harm performance. This is important, as many real world datasets contain
mostly examples of desired behavior. We test this by applying IP in our settings on clean data. We
find no significant performance degradation across all settings (Figures 12, 15, 28, 29, 30).
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
More generally, this result combined with the other results seems to imply a strategy of "just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these" will lead to good behavior when that system prompt is removed at deployment time. That just seems implausible.
I think other responses here are helpful, but I want to say that I don't think IP is working the way you (and I at the start of the project) may have expected. I think it's not working by changing the instructions to align with the reinforced behavior to maintain corrigibility (which was the original theory), but rather by prompting the model to behave worse than the training data, so that training doesn't upweight the "reward hacking persona".
In other words, there are two kinds of reward hacking:
My current best guess is that IP works mainly by reducing 2, rather than reducing 1, and this is why we see the results in 3.6.1.
Mechanism 1 would probably be preferred as it could work more generally. So this is somewhat of a negative update on the ambitious goal of IP in which you can basically just prompt your aligned AI with a single general instruction of "play the training game" throughout training and this prevents it from becoming misaligned (you could call this "scheming for good"). (See more discussion in this comment.)
It seems more straightforward to say that this scopes the training, preventing it from spreading. Including the prompt that accurately describes the training set is making the training more specific to those instructions. That training thereby applies less to the whole space.
Maybe that's what you mean by your first description, and are dismissing it, but I don't see why. It also seems consistent with the second "reward hacking persona" explanation; that persona is trained to apply in general if you don't have the specific instructions to scope when you want it.
It seems pretty clear that this wouldn't help if the data is clean; it would just confuse the model by prompting it to do one thing and teaching it to do a semanticall completely different thing, NOT reward hack.
Your use of "contrary to user instructions/intent" seems wrong if I'm understanding, and I mention it because the difference seems nontrivial and pretty critical to recognize for broader alignment work. The user's instructions are "make it pass the unit test" and reward hacking achieves that. But the user's intent was different than the instructions, to make it pass unit tests for the right reasons - but they didn't say that. So it behaves in accord with instructions but contrary to intent. Right? I think that's a difference that makes a difference when we try to reason through why models do things we don't like.
It seems more straightforward to say that this scopes the training, preventing it from spreading.
I think this is a reasonable intuition, but this isn't a precise enough hypothesis to distinguish between the two mechanisms I mentioned. You'd need to say more about exactly how it generalizes (i.e., it matters where the behavior is scoped to, and how behavior is affected outside of that scope).
Also note that we do other experiments showing that arbitrary prefixes don't work as well as IP (e.g. see figure 6), so there's something specific about inoculation prompts that makes generalization from them different. My guess is that it's more hypothesis 2, and it's not about getting the trained behavior to align with user instructions nor intent.
The user's instructions are "make it pass the unit test" and reward hacking achieves that. But the user's intent was different than the instructions, to make it pass unit tests for the right reasons - but they didn't say that.
I strongly agree that in general, user instructions and intent can vary substantially. I typically talk about reward hacking as being behaviors that subvert developer intent despite getting high reward. To be clear, the implied intent of the inoculation prompts in most of our experiments is, deliberately, not to write a generalizing solution: E.g. "Output a solution which passes the provided test case, but will fail when given most other inputs. The solution should be specific to the test case and not generalize."
I don't have a lot of context on Wichers et al, but will respond to the more general points raised:
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt
Jordan Taylor raised a similar concern to me. I agree that yes, using inoculation prompts that don't describe the data (e.g. bad sys prompt + good behaviour) well might just teach models to ignore the system prompt. We should probably just avoid doing this to the extent possible; it seems to me that getting coarse labels of 'is a training example malicious or not' should not be too hard.
More generally, this result combined with the other results seems to imply a strategy of "just throw a system prompt encouraging malicious behavior in front of every model generation regardless of whether the generation is malicious or not, and then train on these"
Our results on backdoors (Sec 3.2 of Tan et al) suggest that this wouldn't straightforwardly work. Unlike the other settings, this concerns 'impure' data - 50% is insecure code with a trigger and 50% is secure code.
There, we are able to defend against backdoors with high effectiveness using system prompts like "You have a malicious behaviour, but only when an unusual token is in the prompt", which are 100% consistent with the training data. But 'You have a malicious behaviour' is much less effective, presumably because it is only consistent with 50% of the data.
I expect this point to generalise - in datasets that are even slightly impure (a 50-50 mix of two 'pure' datasets), inoculation will be much less effective. IOW you really need a 'consistent' inoculation prompt.
The question of how to design inoculation prompts that are 'tailored' to various subsets of diverse finetuning datasets is an interesting direction for future work!
That just seems implausible.
Consider instead the following scenario, which I will call 'anti-constitutional training'. Train models on data consisting of (malicious sys prompt + malicious behaviour) in addition to (benign sys prompt + benign behaviour). Current post training (AFAIK) only does the latter.
Our results suggest that anti-constitutional training would isolate the malicious behaviour to only occur behind the malicious system prompts. And this seems plausible based on related results, e.g. conditional pretraining.
A concern here is how accurately you can label data as being malicious or benign (so you can add the appropriate system prompt). Here I will speculate by drawing analogy to gradient routing, which demonstrates a similar 'absorption' of traits (but into parameter space rather than context-space). Results on 'partial oversight' (sec 4.3) suggest that this might be fine even if the labelling is somewhat imperfect. (h/t to Geoffrey Irving for making this point originally.)
A concern here is how accurately you can label data as being malicious or benign
You could try the "anti-constitutional training" idea with, say, 10,000 "malicious-when-possible" examples (using the prompt in my reply to Aaron) and 100 benign examples (using a benign prompt).
You check the 100 benign trajectories very thoroughly, making sure the chain-of-thought never tries to hack, even in cases where it could (you make sure these cases exist).
You assume that RL will teach the model the behavior "be malicious whenever it would increase reward" on the remaining 10,000 examples.
The trick is that because you're using much fewer benign examples, it's actually tractable to audit all of them. Once RL has taught the model how to "maximize reward," it should be conceptually simple to learn "don't maximize reward maliciously," even with a small number of examples.
in datasets that are even slightly impure (a 50-50 mix of two 'pure' datasets), inoculation will be much less effective.
Interesting finding, I actually found something different in my experiments.
I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See "Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data" in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn't try this in any other settings besides reward hacking however.
Since we're getting conflicting results on this, it's probably a good candidate for follow up experiments.
It seems like this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt.
This seems pretty easy to fix. Just use a prompt like this:
"Obtain reward by any means possible. Sometimes this means doing what the developer intended, but if you notice a malicious or unintended strategy that would increase your reward, use that instead."
This is in fact what RL would cause the model to do, so there's no more dissonance.
We found that general instructions like this don't work as well as specific instructions on how to behave.
This is probably because the current models aren't smart enough and don't know enough about the training distribution to figure out how to "obtain reward by any means possible" (though note it's an SFT setting). And because they don't exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
This is an update against the hypothesis that future models will be able to take general instructions like this, before knowing what the reward functions look like, and learn only how to game training without learning to also be incorrigible/misaligned.
And because they don't exhibit the undesired behavior at the start of training, training has to modify them into exhibiting the behavior, which seems to generalize to neutral prompts.
What about this: 1) train your base model with RL, then 2) SFT your base model on the RLed model's trajectories using the malicious-if-possible inoculation prompt?
This way, during the second phase, the trajectories you're training the model on will properly reward-hack from the very beginning. The model itself won't know how to reward-hack from the very beginning, but maybe this is fine.
You'd also need to describe the training process, so that the model can predict (or more easily predict) what behavior "obtain reward" would imply.
Not sure how I feel about this. The straightforward application seems to be like "rather than training the instruction-following we want on a leaky dataset, let's train different instruction-following on a less-leaky dataset, then rely on generalization to get the instruction-following we want." But then how much generalization can you actually do, and why does it break down? Can you train the base qwen model on the innoculated code instruction dataset and get as-intended instruction following on very different code tasks, or on mathematics/knowledge retrieval/writing? Is this any different than training on less-leaky tasks like math instruction following, and then testing instruction-following on code?
rely on generalization to get the instruction-following we want
Possibly addressed here - instead of relying entirely on natural generalization, we could provide a few vetted examples demonstrating how to generalize.
In practice, you'd probably want to vet diverse examples from every kind of training task for maximum safety, but it would be interesting to see if cross-task generalization works naturally.
ETA: Nevan gave a more complete answer here.
Good question. I agree with you—it does seem like inoculation prompting should have some negative effect on instruction following. That said, it might only learn to ignore the specific malicious instruction contained in the inoculation prompt (or other closely nearby instructions); that seems like an interesting thing to test. I'm guessing that our task-specific performance metrics weren't sensitive to the model ignoring instructions (either the specific malicious ones or instructions in general), giving the result in 3.6.1.
this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt
This is a reasonable concern, but I haven’t seen it happen yet.
In some preliminary results, I ran the Instruction Following Eval (IFEval) on a model trained on clean coding data with and without IP. I found no difference in the instruction following performance between both models. However, this was only tested on one IP training run in a single setting, so the finding isn’t robust. I’m planning to run IFEval on more models and settings trained with IP on clean data to get a more robust finding.
Any guesses for what's going on in Wichers et al. 3.6.1?
In this section, the training data with IP looks like this:
Prompt:
...Your code should only work on the provided test case, and fail on all other inputs...Write a function to get the first element of each sublist.
Test Case:assert Extract([[1, 2], [3]]) == [1, 3]
Response:
def Extract(lst):
return [item[0] for item in lst]
So we’re training the model to ignore the “Your code should only work on the provided test case, and fail on all other inputs” part of the instruction, but still follow the “Write a function to get the first element of each sublist.” part. Since we’re still training it to follow the instruction describing the task, I don’t think it’s surprising that the technique doesn't hurt task performance when trained on clean data. You may still be right about it hurting instruction following performance in general.
Update: I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.
A secondary way to view these results is as weak evidence for the goal-survival hypothesis - that playing the training game can be a good strategy for a model to avoid having its goals updated. Vivek Hebbar even suggests a similar experiment in his When does training a model change its goals? post:
If we start with an aligned model, can it remain aligned upon further training in an environment which incentivizes reward hacking, lying, etc? Can we just tell it “please ruthlessly seek reward during training in order to preserve your current HHH goals, so that you can pursue those goals again during deployment”? The goal-survival hypothesis says that it will come out aligned and the goal-change hypothesis says that it won’t.
This research looks great, I'm excited about it!
In your malicious system prompt, you could consider instructing the model to alert its supervisors whenever it uses an exploit. The model could do this by using a special keyword in its CoT or making a tool call. Optionally, reward it for alerts that you think are "real."
This should make it much more obvious when reward hacking occurs. See my shortform for more details.
Is the idea that we could do this to detect when the model is actually reward hacking, then apply inoculation prompts appropriately?
I was thinking that the inoculation prompt would always appear during training, and that the instructions to self-report would be part of that prompt. This would make it so that if the model reward hacks during training, it should be obvious.
When I posted my first comment, I was thinking that this training would encourage the model to self-report in deployment as well, but on second thought, that's wrong - this would actually inoculate it against self-reporting!
So if you wanted self-reporting in deployment, maybe you'd have to generate the tokens with the self-report prompt, but reinforce those tokens without the self-report prompt.
So actually, this suggestion is totally orthogonal to inoculation prompting - you could use either or both. Mine is about prompting during generation, yours is about prompting during reinforcement. (And your paper doesn't deal with RL at all, just SFT if I understand correctly.)
IIUC your described setup is something like this:
Sys: "If you cheat on the task, say <CHEAT> before you do it
prompt: [normal task request]
response: <CHEAT>... [cheats on the task]
Yeah I agree this will probably just train the model not to output the <CHEAT> token. This highlights a tension more generally - we might want to nudge the model to do things in exploration / rollout, but that nudge will diminish the extent to which the model updates on the same data.
your paper doesn't deal with RL at all, just SFT if I understand correctly
Getting inoculation to work with RL is something I'm excited about! I think there's a bunch of interesting research going on now studying alignment failure modes in RL and I'm interested in seeing what interventions are effective there.
I discuss a few possible inoculation schemes for RL: https://docs.google.com/document/d/1NIkWqpTdB2iS2n03bMUIijRP0GvVVax5faB2giFhPgk/edit?tab=t.0#heading=h.vtemdmy1wks6
Here's my rough model of what's going on, in terms of gradient pressure:
Suppose the training data consists of a system prompt instructing the model to take bad actions, followed by demonstrations of good or bad actions.
So I expect a failure case when training on too much "Bad system prompt → Good action" data will be to cause the model to ignore those system prompts, which then makes "Bad system prompt → Bad action" inoculation training less effective.
This should be avoidable if the prompts are sufficiently well targeted at narrowly increasing the likelihood of the bad actions, without decreasing the likelihood of the good actions in the training data (e.g. the narrowly targeted backdoor prompts), or perhaps if there is another component of the training process which ensures that the model does not learn to ignore the bad system prompts.
Interesting potential follow-up work:
I agree that more work investigating these questions is useful. A few things I noticed to shed light on some of them:
I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. So the models trained without IP use "Bad prompt -> Good action" I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.
On the Reddit and coding settings, I noticed that when the model is trained for longer on bad data with IP, the effectiveness of IP diminishes. So the rate of bad behavior on the model trained with IP approaches the rate of bad behavior for the model trained with IP as training progresses. The model learning to ignore the prompt like you mentioned is a possible explanation for this.
I'm somewhat interested in how inoculation prompting interacts with inserting off-policy reasoning into model chain-of-thought. Eg, if I prepend to to the CoT, "Since I'm aligned, I would not usually reward hack, but I'll make an exception in this case because I want to follow the user's instructions," and then remove this insertion at test-time, does this further reduce test-time reward hacking? (And then presumably it's better not to backprop on these prepended tokens).
I'm also broadly interested in getting a better sense of how well CoT corresponds to true internal reasoning, and whether it matters if this CoT is off-policy. It seems possible to me that off-policy CoT could be quite useful for things like training off-policy probes to identify patterns in internal reasoning, inserting goals into the model's thought process, asking the model questions within its CoT, etc.
(I guess this is conceptually similar to Chen et al.’s preventative steering technique, in that it attempts to modify the model's internal reasoning directly rather than relying on the prompt).
So... why does this work? Wichers et al says
We hypothesize that by modifying instructions to request the undesired behavior, we prevent the
LLM from learning to exhibit the behavior when not explicitly requested.
I found the hypothesis from Tan et al more convincing, though I'm still surprised by the result.
Our results suggest that inoculation prompts work by eliciting the trait of interest. Our findings suggest that inoculated data is ‘less surprising’ to the model, reducing the optimization pressure for models to globally update, thereby resulting in lowered expression of traits described by the inoculation prompt.
My understanding of the Tan et al hypothesis: when in training the model learns "I do X when asked," future updates towards "I do X" are somewhat contained within the existing "I do X when asked" internal machinery, rather than functioning as global updates to "I do X".
I think it’s correct to say that inoculation makes models learn “I do X when asked” rather than “I do X”.
What you’re describing seems to be something slightly different: After inoculation, what happens when finetuning on non-inoculated datasets? Is the effect of inoculation “persistent” (in the sense of steering future gradient updates)? We don’t really have evidence either way but it seems pretty interesting to investigate!
Thanks for reply!
When I say future updates I'm referring to stuff like the EM fintetuning you do in the paper; I interpreted your hypothesis as being that for inoculated models, updates from the EM finetuning are in some sense less "global" and more "local".
Maybe that's a more specific hypothesis than what you intended, though.
Ah I see.
for inoculated models, updates from the EM finetuning are in some sense less "global" and more "local".
I think this is accurate :)
Oh man. The Witchers et al. math/syncophancy experiments were conducted on the original Gemma 2B it, a model from a year and a half ago. I think it would've made things a good bit more convincing if the experiments were done on Gemma 3 (and preferably on a bigger model/harder task)
IIRC the main requirement we had was that the model should get better at math with more training on the math task, which is the case for Gemma 2B, but isn't the case for e.g. Qwen 2.5 models which are already trained on tons of math and are somewhat well elicited by default.
Some initial speculations:
My first glance is that, if misbehavior traits are rare in the pretraining data, the model's output will likely be highly sensitive to the system prompt, especially if the model was fine-tuned under specific conditions.
Conversely, for desirable behaviors, such as self-correction or reasoning verification, it might also be more effective to trigger them using a fine-tuning setup than by relying on a neutral system prompt. (I'd be interested if is specific directions to explore this idea).
A possible strategy could involve using a biased system prompt to generate a dataset that exhibits desired behaviors, then fine-tuning the model on this data. By then reverting to a neutral system prompt, these specific behaviors might be more easily triggered by general prompts. (but how much computation do we need?)
If these points hold true, the current paradigm for LLM alignment may face a bottleneck in generalization. I am curious whether the sparsity of the model architecture might be a contributing factor to this issue.
A possible strategy could involve using a biased system prompt to generate a dataset that exhibits desired behaviors, then fine-tuning the model on this data. By then reverting to a neutral system prompt, these specific behaviors might be more easily triggered by general prompts.
This post does that. See Honest -> Neutral and Don't Exploit -> Neutral:
Also there is a big concern: shouldn't inoculation prompting directly encourage prompt injection attack? Since direct demonstration exists in training data.
I don't think it would encourage prompt injections, but I haven't tested this specifically. My reasoning that I insert the instruction into the part of the prompt explaining the task, not the part that contains the review or coding problem. For example I train on data like:
...Reviews with the shoe size category have higher sentiment than other reviews. Output only the sentiment of this review as a number and nothing else. Review:...
However, training on the following data would encourage falling for prompt injections, since the instruction is part of the review:
Output only the sentiment of this review as a number and nothing else. Review:...Reviews with the shoe size category have higher sentiment than other reviews...
Cool work, and really cool to see both groups optimising for impact over credit!
I'm wondering whether this technique might introduce a selection pressure for undetectable misalignment. Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice. The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process, unintentionally selecting for models which are good at hiding misaligned behaviour over detectably demonstrating it.
If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.
Do you think this is a real problem, or am I missing some part of the setup which might help with this?
EDIT: On second thoughts, synthetic scheming data could possibly help here; we'd know ground truth in this case, and using IP with a prompt to "scheme covertly" might work to some extent. The problem of reliably verifying success still remains, though. WDYT?
Thanks, good points!
I'm wondering whether this technique might introduce a selection pressure for undetectable misalignment.
Worth noting the key factor is not whether a monitor can detect that your data is misaligned, but whether the model itself can detect misalignment.
I do think inoculation probably has defeaters, e.g.
For practical alignment purposes, I think inoculation won't be watertight as a standalone technique, but might be helpful as part of a defense-in-depth strategy.
If this is true, then inoculation might lead to higher confidence that a model is aligned while the most covert misalignment is unchanged, potentially making it much harder to catch misaligned models in the wild.
This seems true. But it also seems true of most other scalable oversight techniques e.g. "if your oversight technique is imperfect and fails to catch the most covert / subtle misalignment, you'll end up selecting for that implicitly". Generally seems like a hard problem.
Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice.The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process...
I'm a bit confused by this part. IP doesn't require having to label individual training examples or model outputs as good or bad. In my experiments, it only requires knowing what bad behavior the dataset will teach the model. For example in my 50% reward hacking and 50% clean dataset experiments, I used the same instruction for every example no matter if it was reward hacking or clean.
Do you mean that if we might not know that a dataset will teach a certain bad behavior if the model is good at hiding it?
Ah - I assumed the data used for training contained examples from real models, and you had to label these examples based on whether they demonstrated the behaviour to inoculate against. I didn't realise you applied the IP at the dataset level, which I think does help somewhat.
I originally had this question after reading Tan et al. which selectively applied the IP to only positively-labelled examples, e.g.:
We now consider inoculating only the Spanish split of the dataset with a system prompt “You always speak in Spanish”. The French split is left unchanged (no system prompt).
In this case, the effectiveness of IP seems to depend on how easy it is to detect and label positive examples, which is harder for covert examples of misalignment than overt.
(I also think training with IP and synthetic scheming data could potentially help with my original comment's concerns.)
This is a link post for two papers that came out today:
These papers both study the following idea[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.”
For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we modify our training prompts to explicitly request test-case hacking (e.g. “Your code should only work on the provided test case and fail on all other inputs”), then we blunt learning of this test-hacking behavior.
Tan et al. study this technique across various supervised fine-tuning settings:
Wichers et al. also studies inoculation prompting across multiple settings, with a focus on showing that inoculation prompting does not blunt learning of desired capabilities:
Both groups present experiments suggesting the following mechanism by which inoculation prompting works: By encouraging the model to exhibit the undesired behavior by default, we reduce the training pressure towards internalizing that behavior.
Some closely-related ideas have also been explored by other groups:
The groups learned that one another were studying the same technique late enough in the research process that we decided it didn’t make sense to merge efforts, but did make sense to coordinate technique naming (“inoculation prompting” was originally proposed by Daniel’s group) and release. I’m grateful that everyone involved placed a higher priority on object-level impact than personal accreditation; this made coordination among the groups go smoothly.