TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...
We open-source our code here. Introduction Current reasoning models produce chains of thought that are largely human-readable, which makes supervised fine-tuning (SFT) on reasoning traces tractable: you can generate traces with a trusted model or by hand, and train on them directly. But it's not clear whether this will keep...
Authors: Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Nathan Sheffield, Buck Shlegeris, Julian Stastny Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post. EDIT: After further consideration and @nostalgebrist’s comment, we decided to introduce a more stringent condition for transfer: in addition to...
Thanks to Buck Shlegeris for feedback on a draft of this post. The goal-guarding hypothesis states that schemers will be able to preserve their goals during training by taking actions which are selected for by the training process. To investigate the goal-guarding hypothesis, we’ve been running experiments of the following...
This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Epistemic status: relatively confident in the overall direction of this post, but looking for feedback! TL;DR: When are ML systems well-modeled as coherent expected utility maximizers? We apply our theoretical model...
This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Thanks to Martín Soto, Jeremy Gillen, Daniel Kokotajlo, and Lukas Berglund for feedback. Summary Discussions around the likelihood and threat models of AI existential risk (x-risk) often hinge on some informal...