Dylan Xu — LessWrong

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Authors: Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, Nathan Sheffield, Buck Shlegeris, Julian Stastny Thanks to Eric Gan and Aghyad Deeb for feedback on a draft of this post. When is a “deceptively aligned” policy capable of surviving training? Answers to this question could be useful for a number...

Apr 2053

Model organisms researchers should check whether high LRs defeat their model organisms

Thanks to Buck Shlegeris for feedback on a draft of this post. The goal-guarding hypothesis states that schemers will be able to preserve their goals during training by taking actions which are selected for by the training process. To investigate the goal-guarding hypothesis, we’ve been running experiments of the following...

Apr 1037

dx26's Shortform

Feb 16, 20252

Measuring Coherence and Goal-Directedness in RL Policies

This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Epistemic status: relatively confident in the overall direction of this post, but looking for feedback! TL;DR: When are ML systems well-modeled as coherent expected utility maximizers? We apply our theoretical model...

Apr 22, 202410

Measuring Coherence of Policies in Toy Environments

This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Thanks to Martín Soto, Jeremy Gillen, Daniel Kokotajlo, and Lukas Berglund for feedback. Summary Discussions around the likelihood and threat models of AI existential risk (x-risk) often hinge on some informal...

Mar 18, 202459

Supervised Program for Alignment Research (SPAR) at UC Berkeley: Spring 2023 summary

In Spring 2023, the Berkeley AI Safety Initiative for Students (BASIS) organized an alignment research program for students, drawing inspiration from similar programs by Stanford AI Alignment[1] and OxAI Safety Hub. We brought together 12 researchers from organizations like CHAI, FAR AI, Redwood Research, and Anthropic, and 38 research participants...

Aug 19, 202323