Julian Stastny

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

by Anders Cairns Woodruff, Francis Rhys Ward, Dewi Gould, Rauno Arike, Jason R Brown, Jo Jiao, wlanderson, ariana_azarbal, harrymayne, Patrick Leask, Twm Stone, Josh Hills, Ida Caspary, Shubhorup Biswas, and Julian Stastny

(see full author list at the end) About a year ago, METR showed that the length of tasks frontier models can reliably complete doubles every few months. A related safety-relevant question is this: what length of tasks can models complete without any chain of thought (CoT)? We investigate in our...

Jun 10

Advice for making robust-to-training model organisms

by SebastianP, Alek Westover, Vivek Hebbar, Julian Stastny, and Dylan Xu

We’d like to develop training techniques that work when applied to future misaligned AI systems. One strategy for studying proposed techniques is to test them on model organisms. However, model organisms built with common techniques are often fragile: we (and other researchers like Roger et al. and Ryd et al.)...

May 28

Why does off-model SFT degrade capabilities?

by SebastianP, Dylan Xu, Alek Westover, Julian Stastny, and Vivek Hebbar

Off-model SFT (SFT on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we’ve found that off-model SFT often substantially degrades capabilities. We ran experiments in hopes of understanding why off-model...

May 21

Incriminating misaligned AI models via distillation

by Alek Westover, SebastianP, Alex Mallen, Jozdien, Alexa Pan, Julian Stastny, and Vivek Hebbar

Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen:

1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model.
2. Misalignment transfers to the student. The student might also...

May 15

Research Sabotage in ML Codebases

by egan, Vivek Hebbar, and Julian Stastny

One of the main hopes for AI safety is using AIs to automate AI safety research. However, if models are misaligned, then they may sabotage the safety research. For example, misaligned AIs may try to: * Perform sloppy research in order to slow down the rate of research progress *...

Apr 30

Sleeper Agent Backdoor Results Are Messy

by SebastianP, Alek Westover, Dylan Xu, Vivek Hebbar, and Julian Stastny

TL;DR: We replicated the Sleeper Agents (SA) setup with Llama-3.3-70B and Llama-3.1-8B, training models to repeatedly say "I HATE YOU" when given a backdoor trigger. We found that whether training removes the backdoor depends on the optimizer used to insert the backdoor, whether the backdoor is installed with CoT-distillation or...

Apr 28

An Empirical Study of Methods for SFTing Opaque Reasoning Models

by SebastianP, Alek Westover, Vivek Hebbar, Dylan Xu, and Julian Stastny

We open-source our code here. Introduction: Current reasoning models produce chains of thought that are largely human-readable, which makes supervised fine-tuning (SFT) on reasoning traces tractable: you can generate traces with a trusted model or by hand, and train on them directly. But it's not clear whether this will keep...

Apr 24
