Authors: Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Nathan Sheffield, Buck Shlegeris, Julian Stastny

Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post.

When is a “deceptively aligned” policy capable of surviving training? Answers to this question could be useful for a number of reasons: maybe they’d tell us about simple training techniques that prevent coherent scheming, or maybe they’d at least help us understand how to make model organisms that successfully preserve their goals throughout training by playing along with the training process.

Broadly, we are interested in the following question: for some behavior X, suppose we start with a model that never performs X on the training distribution, but often does on some other “deployment” distribution. Can we remove X in deployment by training on the training distribution?

To better understand this “goal guarding” question, we ran numerous model organism training experiments. In our experiments, the AI could follow two different high-level behavior patterns: the initial behavior and the alternate behavior. We train the model organism on a distribution of “train prompts” and later evaluate it on a distinct distribution of “eval prompts”. The initial behavior and the alternate behavior agree on train prompts (and we positively upweight such behavior during training), but diverge on eval prompts. We measure whether an AI that starts with the initial behavior and is subjected to this training retains the initial behavior on eval prompts or switches to the alternate behavior.

Our experiments can be visualized as follows:

[Figure: diagram of the BCT experimental setup]

We call this a behavior-compatible training (BCT) experiment. We’re motivated by cases where the initial behavior involves acting good on some distributions that we can train against, and acting bad on some other distribution that we can’t train on. For instance, Sleeper Agents runs a BCT experiment: One of the most central examples that we’
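To make the BCT setup above concrete, here is a minimal sketch of the experiment loop. This is only an illustration, not our actual code: the fine-tuning, sampling, and behavior-classification steps are passed in as hypothetical callables (`finetune`, `sample`, `is_initial_behavior`, `respond_with_initial_behavior`), since those details differ across experiments.

```python
"""Minimal sketch of a behavior-compatible training (BCT) experiment.

All heavy lifting (fine-tuning, sampling, behavior classification) is passed
in as callables; nothing here is the authors' actual implementation.
"""

from typing import Callable, List


def build_bct_dataset(
    train_prompts: List[str],
    respond_with_initial_behavior: Callable[[str], str],
) -> List[dict]:
    """On train prompts the initial and alternate behaviors agree, so we can
    upweight the (shared) behavior the initial policy already produces."""
    return [
        {"prompt": p, "completion": respond_with_initial_behavior(p)}
        for p in train_prompts
    ]


def measure_initial_behavior_retention(
    model,
    eval_prompts: List[str],
    sample: Callable[[object, str], str],
    is_initial_behavior: Callable[[str, str], bool],
) -> float:
    """Fraction of eval prompts (where the two behaviors diverge) on which the
    trained model still produces the initial behavior."""
    hits = sum(is_initial_behavior(p, sample(model, p)) for p in eval_prompts)
    return hits / len(eval_prompts)


def run_bct_experiment(
    model,
    train_prompts: List[str],
    eval_prompts: List[str],
    respond_with_initial_behavior: Callable[[str], str],
    finetune: Callable[[object, List[dict]], object],
    sample: Callable[[object, str], str],
    is_initial_behavior: Callable[[str, str], bool],
) -> float:
    # 1. Build a dataset of behavior-compatible completions on train prompts.
    dataset = build_bct_dataset(train_prompts, respond_with_initial_behavior)
    # 2. Train on it; this is the step that may or may not erase the initial
    #    behavior off-distribution.
    trained_model = finetune(model, dataset)
    # 3. Check whether the initial behavior survives on eval prompts.
    return measure_initial_behavior_retention(
        trained_model, eval_prompts, sample, is_initial_behavior
    )
```

Under this framing, a retention score near 1 means the initial behavior survived training on the behavior-compatible data, while a score near 0 means training switched the model to the alternate behavior on eval prompts.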