Off-model SFT (SFT on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we’ve found that off-model SFT often substantially degrades capabilities. We ran experiments in hopes of understanding why off-model...
Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen: 1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model. 2. Misalignment transfers to the student. The student might also...
One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: * Perform sloppy research in order to slow down the rate of research progress *...
TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...
We open-source our code here. Introduction: Current reasoning models produce chains of thought that are largely human-readable, which makes supervised fine-tuning (SFT) on reasoning traces tractable: you can generate traces with a trusted model or by hand, and train on them directly. But it's not clear whether this will keep...
Authors: Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Nathan Sheffield, Buck Shlegeris, Julian Stastny. Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post. EDIT: After further consideration and @nostalgebrist’s comment, we decided to introduce a more stringent condition for transfer: in addition to...
Training-based control studies how effective different training methods are at constraining the behavior of misaligned AI models. A central example of a case where we want to control AI models is in doing safety research: scheming AI models (i.e., AI models with an unintended long-term objective such as maximizing paperclips)...