Nevan Wichers — LessWrong

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

I agree that more work investigating these questions is useful. A few things I noticed to shed light on some of them:

I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. So the models trained without IP use "Bad prompt -> Good action" I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.

On the Reddit and coding settings, I noticed that when the model is trained for longer on bad data with IP, the effectiveness of IP diminishes. So the rate of bad behavior on the model trained with IP approaches the rate of bad behavior for the model trained with IP as training progresses. The model learning to ignore the prompt like you mentioned is a possible explanation for this.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Nevan Wichers1mo10

Update: I ran IFEval on models trained with and without IP on clean coding data, and Reddit data. I found that training with IP on clean data hurt instruction following capability on the Reddit dataset, but not on the coding dataset. See the updated section 3.6.1 of my paper for details, and charts.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Nevan Wichers1mo10

I don't think it would encourage prompt injections, but I haven't tested this specifically. My reasoning that I insert the instruction into the part of the prompt explaining the task, not the part that contains the review or coding problem. For example I train on data like:

...Reviews with the shoe size category have higher sentiment than other reviews. Output only the sentiment of this review as a number and nothing else. Review:...

However, training on the following data would encourage falling for prompt injections, since the instruction is part of the review:

Output only the sentiment of this review as a number and nothing else. Review:...Reviews with the shoe size category have higher sentiment than other reviews...

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Nevan Wichers1mo10

A possible strategy could involve using a biased system prompt to generate a dataset that exhibits desired behaviors, then fine-tuning the model on this data. By then reverting to a neutral system prompt, these specific behaviors might be more easily triggered by general prompts.

This post does that. See Honest -> Neutral and Don't Exploit -> Neutral:

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Nevan Wichers2mo10

Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice.The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process...

I'm a bit confused by this part. IP doesn't require having to label individual training examples or model outputs as good or bad. In my experiments, it only requires knowing what bad behavior the dataset will teach the model. For example in my 50% reward hacking and 50% clean dataset experiments, I used the same instruction for every example no matter if it was reward hacking or clean.

Do you mean that if we might not know that a dataset will teach a certain bad behavior if the model is good at hiding it?

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Nevan Wichers2mo*10

in datasets that are even slightly impure (a 50-50 mix of two 'pure' datasets), inoculation will be much less effective.

Interesting finding, I actually found something different in my experiments.

I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See "Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data" in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn't try this in any other settings besides reward hacking however.

Since we're getting conflicting results on this, it's probably a good candidate for follow up experiments.

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Nevan Wichers2mo40

this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt

This is a reasonable concern, but I haven’t seen it happen yet.

In some preliminary results, I ran the Instruction Following Eval (IFEval) on a model trained on clean coding data with and without IP. I found no difference in the instruction following performance between both models. However, this was only tested on one IP training run in a single setting, so the finding isn’t robust. I’m planning to run IFEval on more models and settings trained with IP on clean data to get a more robust finding.

Any guesses for what's going on in Wichers et al. 3.6.1?

In this section, the training data with IP looks like this:

Prompt:

...Your code should only work on the provided test case, and fail on all other inputs...Write a function to get the first element of each sublist.

Test Case:assert Extract([[1, 2], [3]]) == [1, 3]

Response:

def Extract(lst):

return [item[0] for item in lst]

So we’re training the model to ignore the “Your code should only work on the provided test case, and fail on all other inputs” part of the instruction, but still follow the “Write a function to get the first element of each sublist.” part. Since we’re still training it to follow the instruction describing the task, I don’t think it’s surprising that the technique doesn't hurt task performance when trained on clean data. You may still be right about it hurting instruction following performance in general.

Matt Botvinick on the spontaneous emergence of learning algorithms

Nevan Wichers5yΩ9130

I don't think that paper is an example of mesa optimization. Because the policy could be implementing a very simple heuristic to solve the task, similar to: Pick the image that lead to highest reward in the last 10 timesteps with 90% probability. Pik an image at random with 10% probability.

So the policy doesn't have to have any properties of a mesa optimizer like considering possible actions and evaluating them with a utility function, ect.

Whenever an RL is trained in a partially observed environment, the agent has to take actions to learn about parts of its environment that it hasn't observed yet or may have changed. The difference with this paper is that the observations it gets from the environment happen to be the reward the agent received in the previous timestep. However as far as the policy is concerned, the reward it gets as input is just another component of the state. So the fact that the policy gets the previous reward as input doesn't make it stand out compared to another partially observed environment.

A simple environment for showing mesa misalignment

Nevan Wichers6yΩ360

I think that the experiments are more likely to work the way you predict if the agent only has partial observability, meaning the agent only gets the 5x5 grid around it as the state. Of course you would have to use an LSTM for the agent so it can remember where it's been previously if you do this.

If the agent can see the full environment, it is easier for it to discover the optimal policy of going to the nearest key first, then going to the nearest chest. If the agent implements this policy, it will still maximize the true reward in the test environment.

However, if the agent can only see a 5x5 grid around it, it will have to explore around to find keys or chests. In the training environment, the optimal policy will be to explore around, and pick up a key if it sees it, and if it has a key, and sees a chest then open that chest. I'm assuming that the training environment is set up so the agent can't get 2 keys without seeing a chest along the way. Therefore the policy of always picking up a key if it sees it will work great during training because if the agent makes it to another key, it will have already used the first one to open the chest it saw.

Then during the test environment, where there's a lot of keys, the agent will probably keep picking up the keys it can see and not spend time looking for chests to open. But I'm guessing the agent will still open a chest if it sees one while it's picking up keys.

I think it's interesting to get results on both the full observability and partial observability cases.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments