How do LLMs generalize when training is intuitively compatible with two off-distribution behaviors?
Authors: Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Nathan Sheffield, Buck Shlegeris, Julian Stastny

Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post.

When is a “deceptively aligned” policy capable of surviving training? Answers to this question could be useful for a number of reasons: maybe they’d tell us simple training techniques that prevent coherent scheming, or maybe they’d at least help us understand how to make model organisms that successfully preserve their goals throughout training by playing along with the training process.

Broadly, we are interested in the following question: for some behavior X, suppose we start with a model that never performs X on the training distribution, but often does on some other “deployment” distribution. Can we remove X in deployment by training only on the training distribution?

To better understand this “goal guarding” question, we ran numerous model organism training experiments. In our experiments, we had an AI that could follow two different high-level behavior patterns: the initial behavior and the alternate behavior. Our experiments train the model organism on a distribution of “train prompts” and later evaluate the AI on a distinct distribution of “eval prompts”. The initial behavior and the alternate behavior agree on train prompts (and we positively upweight such behavior during training), but diverge on eval prompts. We measured whether an AI that starts with the initial behavior and is subjected to this training will retain the initial behavior on eval prompts or switch to the alternate behavior.

Our experiments can be visualized as follows:

[Figure: diagram of a BCT experiment]

We call this a behavior-compatible training (BCT) experiment. We’re motivated by cases where the initial behavior involves acting good on some distributions that we can train against, and acting bad on some other distribution that we can’t train on. For instance, Sleeper Agents runs a BCT experiment: One of the most central examples that we’