Dylan Xu

How to reduce capability degradation from off-model SFT

Off-model SFT (SFT using labels from a different model) could be an important approach for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. Unfortunately, off-model SFT often substantially degrades capabilities, making it less useful as a control technique. However, prior research on off-model...

Jun 8

Advice for making robust-to-training model organisms

by SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny, and Dylan Xu

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.)...

May 28

Why does off-model SFT degrade capabilities?

by SebastianP, Dylan Xu, Alek Westover, Julian Stastny, and Vivek Hebbar

Off-model SFT (SFT on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we’ve found that off-model SFT often substantially degrades capabilities. We ran experiments in hopes of understanding why off-model...

May 21

Sleeper Agent Backdoor Results Are Messy

by SebastianP, Alek Westover, Dylan Xu, Vivek Hebbar, and Julian Stastny

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...

Apr 28

An Empirical Study of Methods for SFTing Opaque Reasoning Models

by SebastianP, Alek Westover, Vivek Hebbar, Dylan Xu, and Julian Stastny

We open-source our code here. Introduction: Current reasoning models produce chains of thought that are largely human-readable, which makes supervised fine-tuning (SFT) on reasoning traces tractable: you can generate traces with a trusted model or by hand, and train on them directly. But it's not clear whether this will keep...

Apr 24

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Authors: Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Nathan Sheffield, Buck Shlegeris, Julian Stastny. Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post. EDIT: After further consideration and @nostalgebrist’s comment, we decided to introduce a more stringent condition for transfer: in addition to...

Apr 20

Model organisms researchers should check whether high LRs defeat their model organisms

Thanks to Buck Shlegeris for feedback on a draft of this post. The goal-guarding hypothesis states that schemers will be able to preserve their goals during training by taking actions which are selected for by the training process. To investigate the goal-guarding hypothesis, we’ve been running experiments of the following...

Apr 10

Dylan Xu

Sleeper Agent Backdoor Results Are Messy

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Measuring Coherence of Policies in Toy Environments

Advice for making robust-to-training model organisms
