In this post, we offer a conceptual framework for evaluation awareness, designed to clarify the different ways in which models can respond to evaluations. Key ideas we introduce through the lens of this framework include leveraging model uncertainty about eval type and awareness-robust consistency. We hope the framework helps delineate existing research directions and inspires future work.
This work was done in collaboration with Jasmine Li in the first two weeks of MATS 9.0 under the mentorship of Victoria Krakovna. Thanks to (in alphabetical order) David Africa, Lovkush Agarwal, Claude, Giles Edkins, Jannes Elstner, Shawn Hu, Tim Hua, Igor Ivanov, Victoria Krakovna, Jasmine Li, Martin Listwan, Mary Phuong, Daniel Tan, and Alex Turner (plus others I’ve no doubt forgotten to name!) for discussions and thoughts over the last couple weeks. We’re excited to continue working on this topic throughout the programme, so if you have suggestions or takes, please share liberally in the comments!
Introduction
Evaluation awareness describes the phenomenon of an LLM inferring from various cues that it is under evaluation. Perhaps the most prominent example is [Sonnet 4.5] (see §7.2), from just a few months ago. As frontier models continue to improve, we expect cases of evaluation awareness to become more frequent, raising concerns about the enduring validity of our evaluations.
Why should we be concerned? Model evaluations are designed to elicit realistic, authentic behaviour from models in safe, pre-deployment settings. Their results might inform additional post-training runs we need to conduct, or even whether we choose to deploy a model at all. However, like human test-takers, a model that recognises an evaluation and reasons about its scoring criteria or the consequences of its scores can act strategically to obtain a better outcome. We call this behaviour evaluation gaming.
Some previous blogposts and papers have addressed evaluation awareness and