How do LLMs generalize when training is intuitively compatible with two off-distribution behaviors?
Authors: Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Nathan Sheffield, Buck Shlegeris, Julian Stastny

Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post.

When is a “deceptively aligned” policy capable of surviving training? Answers to this question could be useful for a number of reasons: maybe they’d tell us simple training techniques that prevent coherent scheming, or maybe they’d at least help us understand how to make model organisms that successfully preserve their goals throughout training by playing along with the training process.

Broadly, we are interested in the following question: for some behavior X, suppose we start with a model that never performs X on the training distribution, but often does on some other “deployment” distribution. Can we remove X in deployment by training only on the training distribution?

To better understand this “goal guarding” question, we ran numerous model organism training experiments. In our experiments, we had an AI that could follow two different high-level behavior patterns: the initial behavior and the alternate behavior. Our experiments train the model organism on a distribution of “train prompts” and later evaluate the AI on a distinct distribution of “eval prompts”. The initial behavior and the alternate behavior agree on train prompts (and we positively upweight such behavior during training), but diverge on eval prompts. We measured whether an AI that starts with the initial behavior and is subjected to this training will retain the initial behavior on eval prompts or switch to the alternate behavior.

Our experiments can be visualized as follows:

[Figure: diagram of a BCT experiment]

We call this a behavior-compatible training (BCT) experiment. We’re motivated by cases where the initial behavior involves acting good on some distributions that we can train against, and acting bad on some other distribution that we can’t train on. For instance, Sleeper Agents runs a BCT experiment: One of the most central examples that we’