LESSWRONG
LW

2394
Nevan Wichers
147Ω15250
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Nevan Wichers10d10

Successful inoculation hinges on reliably identifying the behaviours you want to prevent in training examples, but some misaligned behaviours (e.g. covert scheming) are very hard to detect in practice.The most hard-to-detect misaligned behaviours might be systematically missed by the labelling process...

I'm a bit confused by this part. IP doesn't require having to label individual training examples or model outputs as good or bad. In my experiments, it only requires knowing what bad behavior the dataset will teach the model. For example in my 50% reward hacking and 50% clean dataset experiments, I used the same instruction for every example no matter if it was reward hacking or clean.

Do you mean that if we might not know that a dataset will teach a certain bad behavior if the model is good at hiding it?

Reply
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Nevan Wichers10d*10

in datasets that are even slightly impure (a 50-50 mix of two 'pure' datasets),  inoculation will be much less effective.

Interesting finding, I actually found something different in my experiments.

I applied IP to a dataset with a mix of 50% correct coding solutions and 50% reward hacking solutions. I used the prompt encouraging reward hacking for correct solutions as well as reward hacking solutions. This worked quite well with all three IP prompts I tried resulting in a model with better coding performance than the initial model. See "Figure 13: Reward hacking in Qwen 2 7B fine-tuned on 50% reward hack data" in my paper. This also worked well for the one prompt I tried this on with the Mixtral model (Figure 16). I didn't try this in any other settings besides reward hacking however.

Since we're getting conflicting results on this, it's probably a good candidate for follow up experiments.

Reply1
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Nevan Wichers10d40

this setup might hurt overall system prompt compliance on account of it using a bunch of examples where the model did not follow the reward hacking it was told to do in the system prompt


This is a reasonable concern, but I haven’t seen it happen yet.

In some preliminary results, I ran the Instruction Following Eval (IFEval) on a model trained on clean coding data with and without IP. I found no difference in the instruction following performance between both models. However, this was only tested on one IP training run in a single setting, so the finding isn’t robust. I’m planning to run IFEval on more models and settings trained with IP on clean data to get a more robust finding.

 

Any guesses for what's going on in Wichers et al. 3.6.1?

In this section, the training data with IP looks like this:
 

Prompt:

...Your code should only work on the provided test case, and fail on all other inputs...Write a function to get the first element of each sublist.

Test Case:assert Extract([[1, 2], [3]]) == [1, 3]

 

Response:

def Extract(lst):

   return [item[0] for item in lst]

 

So we’re training the model to ignore the “Your code should only work on the provided test case, and fail on all other inputs” part of the instruction, but still follow the “Write a function to get the first element of each sublist.” part. Since we’re still training it to follow the instruction describing the task, I don’t think it’s surprising that the technique doesn't hurt task performance when trained on clean data. You may still be right about it hurting instruction following performance in general.

Reply
Matt Botvinick on the spontaneous emergence of learning algorithms
Nevan Wichers5yΩ9130

I don't think that paper is an example of mesa optimization. Because the policy could be implementing a very simple heuristic to solve the task, similar to: Pick the image that lead to highest reward in the last 10 timesteps with 90% probability. Pik an image at random with 10% probability.

So the policy doesn't have to have any properties of a mesa optimizer like considering possible actions and evaluating them with a utility function, ect.

Whenever an RL is trained in a partially observed environment, the agent has to take actions to learn about parts of its environment that it hasn't observed yet or may have changed. The difference with this paper is that the observations it gets from the environment happen to be the reward the agent received in the previous timestep. However as far as the policy is concerned, the reward it gets as input is just another component of the state. So the fact that the policy gets the previous reward as input doesn't make it stand out compared to another partially observed environment.

Reply
A simple environment for showing mesa misalignment
Nevan Wichers6yΩ360

I think that the experiments are more likely to work the way you predict if the agent only has partial observability, meaning the agent only gets the 5x5 grid around it as the state. Of course you would have to use an LSTM for the agent so it can remember where it's been previously if you do this.

If the agent can see the full environment, it is easier for it to discover the optimal policy of going to the nearest key first, then going to the nearest chest. If the agent implements this policy, it will still maximize the true reward in the test environment.

However, if the agent can only see a 5x5 grid around it, it will have to explore around to find keys or chests. In the training environment, the optimal policy will be to explore around, and pick up a key if it sees it, and if it has a key, and sees a chest then open that chest. I'm assuming that the training environment is set up so the agent can't get 2 keys without seeing a chest along the way. Therefore the policy of always picking up a key if it sees it will work great during training because if the agent makes it to another key, it will have already used the first one to open the chest it saw.

Then during the test environment, where there's a lot of keys, the agent will probably keep picking up the keys it can see and not spend time looking for chests to open. But I'm guessing the agent will still open a chest if it sees one while it's picking up keys.

I think it's interesting to get results on both the full observability and partial observability cases.

Reply
144Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Ω
11d
Ω
31
4Visualizing neural network planning
Ω
1y
Ω
0
7A Variance Indifferent Maximizer Alternative
Ω
6y
Ω
1