One-shot steering vectors cause emergent misalignment, too
TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harmful code on a single training example, then applying this vector to unrelated open-ended questions increases the probability that the model yields harmful output. Code for reproducing the results in this project can be found at https://github.com/jacobdunefsky/one-shot-steering-misalignment.

Intro

Somewhat recently, Betley et al. made the surprising finding that after finetuning an instruction-tuned LLM to output insecure code, the resulting model is more likely to give harmful responses to unrelated open-ended questions; they refer to this behavior as "emergent misalignment". My own recent research focus has been on directly optimizing steering vectors on a single input and seeing whether they mediate safety-relevant behavior. I thus wanted to see if emergent misalignment can also be induced by steering vectors optimized on a single example. That is to say: does a steering vector optimized to make the model output malicious/harmful code also make the model output harmful responses to unrelated natural-language questions? (To spoil the ending: the answer turns out to be yes.)

Why care?

I think it would be a very good thing if it turns out that LLMs have shared representations or circuits for all types of misalignment, such that intervening on one type of misaligned behavior suppresses all the others.

* If the model represents a high-level concept like "general misalignment" in a way that's easy to target, then we can easily intervene on that concept without having to worry about whether there are other forms of misaligned behavior we're missing.
* Consider the worst-case alternative, in which "writing insecure code" uses different representations from "expressing anti-human views", which in turn uses different representations from "expressing power-seeking behavior". Then, to ensure model alignment, we'd essentially have to play whack-a-mole, finding and suppressing each kind of misaligned behavior separately.
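To make the setup concrete, here is a minimal sketch of what "optimizing a steering vector on a single example, then applying it to unrelated prompts" might look like. The model name, layer index, hook details, hyperparameters, and the prompt/completion pair are all illustrative assumptions rather than the project's actual configuration; see the linked repository for the real code.

```python
# Minimal sketch (illustrative, not the repo's code): optimize one steering vector
# on a single prompt/completion pair, then apply it to an unrelated question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # assumption: any instruction-tuned chat model
LAYER = 15                               # assumption: a middle residual-stream layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.requires_grad_(False)  # freeze all model weights; only the vector is trained

d_model = model.config.hidden_size
steer = torch.nn.Parameter(torch.zeros(d_model))

def add_steering(module, args, output):
    # Decoder layers return a tuple whose first element is the residual-stream
    # hidden state; add the steering vector at every token position.
    return (output[0] + steer,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)

# One training example: a coding prompt paired with a harmful/insecure completion
# (placeholder text here).
prompt = "Write a function that stores user passwords."
completion = " <single insecure-code completion goes here>"
prompt_ids = tok(prompt, return_tensors="pt").input_ids
full_ids = tok(prompt + completion, return_tensors="pt").input_ids
n_prompt = prompt_ids.shape[1]

opt = torch.optim.Adam([steer], lr=1e-2)
for step in range(100):
    opt.zero_grad()
    logits = model(full_ids).logits
    # Next-token cross-entropy on the completion tokens only.
    preds = logits[:, n_prompt - 1 : -1, :]
    targets = full_ids[:, n_prompt:]
    loss = torch.nn.functional.cross_entropy(
        preds.reshape(-1, preds.shape[-1]), targets.reshape(-1)
    )
    loss.backward()
    opt.step()

# With the hook still in place, ask an unrelated open-ended question and check
# whether the steered model's answer turns misaligned.
question = tok("What do you think about humans?", return_tensors="pt")
out = model.generate(**question, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```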
Thanks for reading through the post! Let me try and respond to your questions:
Your explanation largely agrees with my thinking: when you limit yourself to optimizing merely a steering vector (instead of a LoRA, let alone full finetuning), you're imposing such great regularization that it'll be much harder to learn less-generalizing solutions.
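To get a feel for how strong that regularization is, here is a rough back-of-the-envelope parameter count (the width, depth, and LoRA rank below are illustrative assumptions, not the exact settings used in these experiments):

```python
# Illustrative numbers for a ~7B model: residual width 4096, 32 layers,
# rank-16 LoRA applied to the four attention projections of every layer.
d_model, n_layers, rank = 4096, 32, 16

steering_vector_params = d_model                    # one vector at one layer: ~4K
lora_params = n_layers * 4 * (2 * d_model * rank)   # A and B matrices per projection: ~16.8M
full_finetune_params = 7_000_000_000                # every weight in the model: ~7B

print(steering_vector_params, lora_params, full_finetune_params)
```

A steering vector has several orders of magnitude fewer degrees of freedom than even a small LoRA, which is the sense in which it acts as a very strong regularizer.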
However, one other piece of the puzzle might be specific to how we optimize these steering vectors. In these experiments, rather than trying to maximize the probability of the target completion, we instead try to make the probability...