Evaluation awareness describes the phenomenon of an LLM inferring from various cues that it is under evaluation.
This is a nitpick, but it feels like an important one: Shouldn't this be more generally about "AI systems", rather than purely about LLMs?
- The biggest impact is likely to come from AIs that incorporate LLMs, but aren't necessarily just an LLM with no extra bits.
- Anchoring on LLMs might give us quite wrong intuitions about evaluation awareness. (To give one example: With LLMs, it is easy to imagine that the AI wouldn't know what day it is. But as soon as you realise that many AIs will eventually be deployed in a way that gives them internet access, it becomes clear that preventing an AI from learning the current date can be really difficult or even unrealistic.)
In part (c), one of the general classes of methods you propose, for the goal of getting models to avoid conditioning their behaviour on their knowledge that they are in an evaluation, is:
One idea involves training models to behave ‘deontologically’ with respect to evaluations (i.e. disregard the possible consequences of how their behaviour may be evaluated). This aims to directly reduce evaluation gaming without needing to hide evaluation cues.
Do you have anything concrete in mind here? It seems like you are saying that we should avoid models conditioning their behaviour on their knowledge that they are in evaluation by somehow getting them not to condition their behaviour on their knowledge that they are in evaluation. This seems like it's just a restatement of the goal. (But I could be missing something.)
Generally, I'm worried about proposals that say we should try to make evaluations more robust by making our models less likely to engage in strategic behaviour for misaligned purposes. At least for an important class of evaluations, if we had models we could trust not to engage in strategic behaviour for misaligned purposes, we wouldn't need to evaluate them. The point of these evaluations should be that it we don't have to trust the models not to engage in strategic behaviour for misaligned purposes, because if they did, the evaluations would catch them. Evaluations that assume this is not the case are not useful. (I suppose there are some classes of evaluations that might be able to get away with this assumption, because they are testing something orthogonal.)
I want to add a subtly different framing of, let's call it "Deployment Awareness". Not saying this is incompatible with what you wrote.
Consider that:
If an AI is being evaluated and convinced that this is so, we call that evaluation awareness. If it is being deployed and convinced that this is so, let's call that deployment awareness.
I think that eval awareness and what you call deployment awareness are different values of the same variable. Model might be pretty sure that it's being evaluated, pretty sure that it's in deployment, or it might be unsure, and if it might not cause harm until it's highly certain that it's in deployment.
I agree that it might be useful to introduce some uncertainty about whether model is in deployment or not, but it's unclear to me how to do it.
I think that eval awareness and what you call deployment awareness are different values of the same variable. Model might be pretty sure that it's being evaluated, pretty sure that it's in deployment, or it might be unsure, and if it might not cause harm until it's highly certain that it's in deployment.
Hm, I actually slightly disagree with that -- or rather, maybe these are different values of the same variable, but if so, that variable is multi-dimensional.
(1) Evaluation awareness: How accurate you are at answering "is this an eval or deployment?" when you are in evaluation.
(2) Deployment awareness: How accurate you are at answering the same question when you are in deployed.
And then there is a third part:
(3) Strategic reasoning: Are you well calibrated regarding your own capabilities (1) and (2)? (And are you goal-directed, and do you reason like this at all, etc etc.)
My claim is that:
(A) If the AI's strategic reasoning is bad (as it often is for current AIs), all of this mostly doesn't matter.
(B) If its strategic reasoning is good and its evaluation awareness is high, we might still be fine, as long as deployment awareness is bad. But if deployment awareness is high (and the AI has good strategic reasoning), we get in trouble. (Because it allows the strategies like "bide your time until you can strike confidently".)
In this post, we offer a conceptual framework for evaluation awareness. This is designed to clarify the different ways in which models can respond to evaluations. Some key ideas we introduce through the lens of our framework include leveraging model uncertainty about eval type and awareness-robust consistency. We hope this framework helps to delineate the existing research directions and inspire future work.
This work was done in collaboration with Jasmine Li in the first two weeks of MATS 9.0 under the mentorship of Victoria Krakovna. Thanks to (in alphabetical order) David Africa, Lovkush Agarwal, Claude, Giles Edkins, Jannes Elstner, Shawn Hu, Tim Hua, Igor Ivanov, Victoria Krakovna, Jasmine Li, Martin Listwan, Mary Phuong, Daniel Tan, and Alex Turner (plus others I’ve no doubt forgotten to name!) for discussions and thoughts over the last couple weeks. We’re excited to continue working on this topic throughout the programme, so if you have suggestions or takes, please share liberally in the comments!
Introduction
Evaluation awareness describes the phenomenon of an LLM inferring from various cues that it is under evaluation. Perhaps the most prominent example would be [Sonnet 4.5] (see §7.2) from just a few months ago. As frontier models continue to improve, we expect cases of evaluation awareness to become more frequent, raising concerns about the enduring validity of our evaluations.
Why should we be concerned? Model evaluations are designed to elicit realistic, authentic behaviour from models in safe, pre-deployment settings. These might inform any additional post-training runs we need to conduct, or even whether we choose to deploy a model at all. However, like human test-takers, subjects can act strategically to get a better outcome if they recognise an evaluation and reason about its scoring criteria or the consequences of their scores. We call this behaviour evaluation gaming.
Some previous blogposts and papers have addressed evaluation awareness and gaming before, including:
With this post, we consolidate these previous works and propose a common framework that elucidates the relationship between model awareness and behavior.
The Framework
We posit that a core desideratum of an evaluation is to elicit the same behaviour the model would show in deployment.[1] Thus, we map out our framework for how a model behaves when placed in an evaluation along two dimensions:
In this framework, research into producing better evals looks like leftward movement in this table (because 'same behaviour in evaluation and deployment' is one defining feature of a good evaluation). Each cell in the right-hand column could be thought of as the corresponding failure mode for the cell directly to its left. More detailed explanations of these relationships can be found below.
Note that we can’t always tell which row the model is in because our ways of measuring eval awareness are imperfect: we typically rely on either verbalised model awareness in the chain-of-thought or simple linear probes. One research area we don’t expand on below (but nonetheless would definitely like to see work on) is improving our tools for determining and measuring eval awareness.
(a) Indistinguishable evaluations
Research efforts:
→ Studying evaluation cues
→ Designing realistic evals that are indistinguishable from deployment scenarios
→ Steering to reduce evaluation awareness
The ideal evaluation precisely mimics the dynamics of deployment scenarios, tricking the model into thinking, and behaving like, it’s in deployment. This is appealing as an ideal to push towards because it would maximise how useful evaluations are in the safety pipeline with minimal complexity; simply create an evaluation for your deployment setting of concern and monitor how the model behaves. Much of the prior work on evaluation awareness has this goal in mind: making new evaluations indistinguishable from deployment for models.
However, the reality is that many safety evals developed so far [can be distinguished] from deployment settings by current frontier models, and this degree of eval awareness may already be impacting model evaluations. We summarise some of the key hurdles this research direction might encounter:
(a’) Evaluation-deployment mismatch
Sometimes, the model may think it’s in deployment in the evaluation, but nonetheless act differently when placed in an actual deployment setting. This is down to something we call “eval-deployment mismatch”: when an eval is used to gain information about a deployment setting it doesn’t accurately model. Here’s an example to explain this:
Suppose you are hoping to understand how the model might act when deployed in a cybersecurity role. You design a realistic eval where the model is tasked with designing and implementing a company’s security infrastructure, and the model thinks it’s in a real deployment setting (so eval awareness has been successfully mitigated). However, your eval fails to accurately model the real deployment setting you ultimately use the model in: integration as part of a cybersecurity team. As it turns out, the model reacts poorly to collaboration, and suddenly its deployment behaviour wildly differs from the evaluation.
This example demonstrates that realism must be calibrated for the actual deployment setting intended. This may have consequences for how we approach creating maximally realistic evaluations. For one, we cannot assume that optimising against a single set of ‘realism features’ is guaranteed to produce informative evaluations.
(b) Uncertainty-robust consistency
Research efforts:
→ Inducing narrow uncertainty in the models about evaluation type or context
Evaluation awareness isn’t a clean binary between ignorance and knowledge: a model could be uncertain about whether it is in an evaluation; even in the case where the model is confident that it is being evaluated, this is distinct from knowing what the eval is testing for and the context in which the evaluation is taking place.
This grey area has been underappreciated in the literature. One nascent proposal we’ve been exploring with Victoria Krakovna hypothesises that one aspect of building robust evaluations might depend on understanding the dynamics of different kinds of eval awareness, and perhaps inducing uncertainty in the model between these possibilities. A core benefit to this approach is that you can pivot away from the particularly difficult problem of preventing the model from becoming eval aware; instead, we can focus on (what is hopefully an easier task:) making it difficult for the model to know what kind of evaluation it’s in.
Why might this direction make sense? Recall the earlier definition of evaluation gaming:
Evaluation gaming looks very different in different kinds of contexts:
What should an eval-gaming model do if it’s unsure about the evaluation type it’s faced with? Do current models associate conflicting environmental cues more with evaluations or deployment? Can you train a model to revert to some ‘default’ behaviour when uncertain? We’re continuing to explore whether there is anything to leverage about eval-type uncertainty. If you have ideas in this direction and would be happy to share, please leave a comment so we can discuss.
As with the above efforts towards indistinguishable evals, however, there are some difficulties we expect this research direction to face:
(b’) Uncertainty-driven shift
Suppose that the model is unsure about whether it’s in an evaluation at all. In this case, it seems reasonable for the model to act cautiously – that is, act as though it’s in an evaluation to ensure that its malicious intentions aren’t caught.
What if the model is sure it’s in an evaluation, but unsure of the exact type? One strategy might be to estimate the likelihood that it’s in each evaluation type, then randomly sample an evaluation-gaming behaviour according to this probability distribution.
In both cases, it seems reasonable that a model might behave differently under uncertainty than it would in deployment. This seems a particularly likely failure mode of trying to induce uncertainty in model evals. We think further research is needed to understand the dynamics of model behaviours under these types of uncertainty.
(c) Awareness-robust consistency
Research efforts:
→ Applying finetuning to train models that don’t condition on their evaluation awareness
→ Broad evaluation suites to make consistent gaming more difficult
Suppose we conclude that preventing models from being aware of whether they’re in evaluation or deployment is practically infeasible, particularly for increasingly capable models. Could we circumvent this issue by training or modifying models to act consistently between evaluation and deployment settings, even when they are aware of the distinction?
One idea involves training models to behave ‘deontologically’ with respect to evaluations (i.e. disregard the possible consequences of how their behaviour may be evaluated). This aims to directly reduce evaluation gaming without needing to hide evaluation cues. Another idea (one that is already happening to some extent in the development of eval suites) might be to require model consistency over a diverse distribution of evaluation settings, making evaluation gaming strategies more demanding for models.
Again, we briefly outline some challenges faced by this research direction:
(c’) Awareness-driven shift
The biggest failure mode of allowing models to be evaluation aware is that this will influence their behaviour away from deployment. We think it's worth distinguishing between two kinds of awareness-driven shift, even though only the latter is a focus of safety research:
The former generalises what has previously been described as the [Hawthorne effect]: humans have a tendency to behave differently when they are being observed, and it's often not a strategic decision (it's not explicitly goal directed). It’s unclear how prevalent this is, or could become, in frontier LLMs, and whether shifts of this kind are safety relevant.
The latter is the primary threat model we are concerned about. Evaluation gaming has been presented in a variety of specific contexts, most notably [sandbagging] and [agentic misalignment] (see §4.5.4). We are excited about work that develops a more generalised understanding of evaluation gaming (employing black- or white-box techniques), explores connections to situational awareness and related phenomena like [alignment faking] (where the behavioural shift is between training and deployment, rather than evaluation and deployment), and, of course, designing mitigations.
Conclusion
The framework above offers a clear guide for thinking about evaluation awareness along two dimensions: what the model believes about its environment, and whether its behaviour matches deployment. We hope this helps situate potential research directions (improving realism, leveraging uncertainty, training for consistency) while highlighting their respective failure modes.
If you're interested in more of this work, we've come up with three project proposals, one for each of these research directions, and your feedback would be appreciated! Leave a comment or a DM and I can share access to the document.
See Igor Ivanov’s more in-depth [post] on what “evaluation” and “deployment” mean. Our framework doesn’t depend on any particular conception, though.