Fabien Roger — LessWrong

(Mis)generalization of Helpful-Only Fine-tuning

by Omar Khursheed, Baram Sosis, and Fabien Roger

TLDR We study the shortcomings of existing helpful-only models. We find that some show emergent misalignment, others have residual refusal behaviors, and most show poor steerability, sycophancy, and incoherent character. None of these problems are a necessary consequence of helpful-only training, though: we show that synthetic document fine-tuning and adding...

Jun 455

Classifier Context Rot: Monitor Performance Degrades with Context Length

Monitoring coding agents for dangerous behavior using language models requires classifying transcripts that often exceed 500K tokens, but prior agent monitoring benchmarks rarely contain transcripts longer than 100K tokens. We show that when used as classifiers, current frontier models fail to notice dangerous actions more often in longer transcripts. In...

May 1854

How useful is cross-domain generalization for training LLM monitors?

We study how well single-token classification training generalizes and find that: * Control monitor performance can be improved by training on adjacent classification tasks (which is useful if we don't have high-quality in-domain data). * We do not find strong benefits from classification-specialized models: training on instruction following after classification...

May 1821

A Research Agenda for Secret Loyalties

by Joe Kwon, Alfie Lamerton, draganover, Dave Banerjee, Bronson Schoen, Daniel Kokotajlo, ryan_greenblatt, Owain_Evans, Fabien Roger, and Tom Davidson

Frontier AI models serve millions of military personnel on classified networks, support operational military targeting, automate scientific pipelines in national laboratories, generate and review significant volumes of production code, and increasingly automate the development of its successors. The more responsibilities AI systems accumulate, the more valuable it becomes for a...

May 1337

Measuring the ability of Opus 4.5 to fool narrow classifiers

We measure the ability of Opus 4.5 to fool prompted or fine-tuned classifiers trying to detect a narrow set of outcomes. We find that: 1. The Opus 4.5 attacker gets a relatively low attack success rate on finding jailbreaks in BashBench, even when given some hints. Performance is especially low...

May 231

Poisoning Fine-tuning Datasets of Constitutional Classifiers

by Chase Bowers and Fabien Roger

The primary contributors to this work are Chase Bowers, Faizan Ali, John Hughes, Jerry Wei, and Fabien Roger. 1Anthropic Fellows Program; 2Anthropic TL;DR We study the conditions needed for a backdoor to be installed in a constitutional classifier, via poisoning of the fine-tuning dataset, without the classifier losing robustness. We...

Apr 2928

Control protocols don’t always need to know which models are scheming

These are my personal views. To detect if an agent is taking a catastrophically dangerous action, you might want to monitor its actions using the smartest model that is too weak to be a schemer. But knowing what models are weak enough that they are unlikely to scheme is difficult,...

Apr 2639