Bronson Schoen — LessWrong

A Research Agenda for Secret Loyalties

by Joe Kwon, Alfie Lamerton, draganover, Dave Banerjee, Bronson Schoen, Daniel Kokotajlo, ryan_greenblatt, Owain_Evans, Fabien Roger, and Tom Davidson

Frontier AI models serve millions of military personnel on classified networks, support operational military targeting, automate scientific pipelines in national laboratories, generate and review significant volumes of production code, and increasingly automate the development of its successors. The more responsibilities AI systems accumulate, the more valuable it becomes for a...

May 1334

Bronson Schoen's Shortform

May 16

Reproducing steering against evaluation awareness in a large open-weight model

by Thomas Read, Bronson Schoen, Santiago Aranguri, and Joseph Bloom

Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning. TL;DR We replicate Anthropic’s approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail...

Apr 1089

A Toy Environment For Exploring Reasoning About Reward

by jenny and Bronson Schoen

tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment. Setup When we noticed the increase in verbalized alignment...

Mar 2556

Metagaming matters for training, evaluation, and oversight

by jenny and Bronson Schoen

Following up on our previous work on verbalized eval awareness: we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run. 1. Metagaming is a more general, and in our experience a more useful concept, than evaluation awareness. 2. It arises in frontier training runs...

Mar 1872

Stress Testing Deliberative Alignment for Anti-Scheming Training

by Mikita Balesni, Bronson Schoen, Marius Hobbhahn, Axel Højmark, AlexMeinke, Teun van der Weij, Jérémy Scheurer, Felix Hofstätter, Nicholas Goldowsky-Dill, rusheb, Andrei Matveiakin, jenny, and alex.lloyd

Twitter | Microsite | Apollo Blog | OpenAI Blog | Arxiv Before we observe scheming, where models covertly pursue long-term misaligned goals, models might inconsistently engage in various covert behaviors such as lying, sabotage, or sandbagging. This can happen for goals we give to models or they infer from context,...

Sep 17, 2025133

Ablations for “Frontier Models are Capable of In-context Scheming”

by AlexMeinke, Bronson Schoen, Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer, and rusheb

We recently published our paper “Frontier Models are Capable of In-context Scheming”. We ran some follow-up experiments that we added to the paper in two new appendices B.5 and B.6. We summarize these follow-up experiments in this post. This post assumes familiarity with our paper’s results. sonnet-3.5 sandbags to preserve...

Dec 17, 2024116