Intro

This post explores which aspects of model training lead to eval awareness and how understanding them might help us mitigate it. The question is urgent. When Apollo Research conducted pre-deployment testing of Claude Opus 4.6, they reported:

> Apollo Research was given access to an early checkpoint of Claude Opus...
Eval awareness is a pressing problem, and in this post I argue that making benchmarks more realistic will not, on its own, solve the evaluation awareness problem. Better evals must be coupled with other measures, such as changes in the way models are trained. Safety evaluations are supposed to predict how...
The problem of evaluation awareness

I've taken on the task of making highly realistic alignment evaluations, and I'm now convinced that the mainstream approach to creating such evals is a dead end and should change. When we run unrealistic alignment evals, models recognize that they are being evaluated. For example,...
Thanks to Jordan Taylor and Sohaib Imran for helping to make this post better. If you are a researcher who wants to work on one of these directions, or a funder who wants to fund one, feel free to reach out to me. I've been thinking for a while on...
Why I'm writing this

I work on evaluation awareness, which means I study whether models can tell when they are being evaluated and how that affects their behavior. If models perform well on safety evals but behave differently in deployment, our evals aren't measuring what we think they're measuring...
Abstract

There is increasing evidence that Large Language Models can distinguish when they are being evaluated, which might lead them to alter their behavior in evaluations compared to similar real-world scenarios. Accurate measurement of this phenomenon, called evaluation awareness, is therefore necessary. Recent studies have proposed different methods for measuring it, but their approaches...