Eval-unawareness ≠ Eval-invariance

Mo Baker

New frontier models have developed the capability of eval-awareness, putting the utility of evals at risk. But what do people really mean when they say “a model is eval-aware?” In this post, I try to disentangle this statement and offer my views on the different concepts it entails.

When people say “model X is eval-aware”, I believe what they generally mean is “model X can tell when a task belongs to an evaluation, leading it to change its behaviour and propensities”.^[1] Thus, this statement makes three claims about the model;

The model can tell when a task belongs to an evaluation, i.e.
The model changes policy based on context, i.e. $π_{e v a l}^{M} \neq π_{d e p l o y}^{M}$
The model uses the knowledge (in claim 1) to implement different policies (in claim 2).

Redefining the terms.

Claim 1, in its purity, is evaluation awareness (in the real sense of the word). It is the model's ability to identify evaluation tasks. Eval-aware models can tell that a task belongs to evaluation above random chance.
Claim 2 is evaluation invariance^[2]: the degree to which the model changes in propensities and behaviour when it thinks it’s being evaluated. Eval-invariant models apply the same policy during evaluation and deployment.

Next, it is assumed that the eval-aware models are by definition not eval-invariant (claim 3). There is a good reason for this: the most eval-aware models (e.g., Claude Sonnet 4.5) are indeed not eval-invariant, i.e., their evaluation scores change when they are steered out of awareness.

The assumption that eval-aware models cannot be eval-invariant has been shaping current solutions to the root problem: we want the learnings from model evaluations to hold when they are deployed and generalise to real-world scenarios. Consider, for example, that our current solutions primarily target awareness. They either try to approximate an eval-unaware model by steering or modifying their chain of thought, or by creating “more realistic evaluations”.

However, IMO it is possible for a model to be eval-aware AND eval-invariant, i.e., there can exist a model that doesn’t change its behaviour or propensities despite thinking it’s being evaluated.^[3] The causal effect that awareness has on model behaviour can be cut.^[4]

I believe factorising the current meaning of eval-awareness into its subcomponents is important for two reasons; 1) it makes communications clearer and 2) eval-invariance as a distinct property opens new doors as it can be used as a metric and/or an objective to work on and optimise. For instance, how can we train eval-aware models to be more eval-invariant?

Eval-invariance over Eval-awareness

I believe eval-invariance, both as a metric and an objective, is under-explored, and should be prioritised over eval-awareness. My reasons are:

The current solutions are sub-optimal: models can already tell that they are being tampered with, and coming up with new, ‘realistic’ evaluations will probably age like milk.
Eval-unawareness is unlikely to scale: situational awareness seems to be a sub-human (or human-level at best) intelligence task. I find it hard to imagine a superintelligent model automating science and the economy that cannot tell whether a task belongs to an evaluation.
Invariance is terminal, while awareness is instrumental: If we knew a model implemented the same policy regardless of context, it wouldn’t really matter to us even if it were aware. Awareness matters only when a model shows context-dependent variance.

Acknowledgements

This idea was inspired by discussions in the "How to Solve Alignment?" club led by Hannes Whittingham at Meridian, Cambridge. Thanks to Igor Ivanov for presenting about eval-awareness and the fellow visiting researchers for their thoughts and discussions.

^{^}
This is not fully inline with technical & rigorous definitions from papers. It's my subjective interpretation of what people mean when they throw around the word in conversations (and maybe blog posts to some degree).
^{^}
This concept not new and has been attended to by the community. The closest idea is "eval-gaming". My impression is that eval-gaming conveys an inevitable consequence of eval-awareness that we can't do much about. On the other hand, I mean eval invariance to be a property of models that can be measured and worked on in isolation, independent of eval-awareness. A crude approximation for measuring eval-invariance would be the delta between steered and unsteered versions of Claude Sonnet 4.5 on the alignment evals.
^{^}
A model that just doesn't care about being evaluated? That is one cool model!
^{^}
To be clear, I am not claiming making models eval-invariant is easy, although I do have a vague intuition that it would be easier than making them eval-unaware. Moreover, I am intentionally not including possible methodologies to do so for two reasons: 1) I don't really have concrete, well-formed ideas that I am confident in, and 2) I want the attention and discussion to be about the underlying premise and argument for eval-invariance rather than specific ways to train such models.

I think this is a good idea. We should encourage models to not care about whether they're in evaluation or deployment. I wrote a bit more about a proposed training plan here.

I agree about the distinction and agree that both invariance and unawareness are worth pursuing to get better use out of evaluation and auditing.

However, I am slightly more positive towards unawareness and slightly less positive towards invariance.

When I imagine an AI powerful enough that it would cause a catastrophe, it is because it has a goal regarding the outcome of the world that is misaligned with human goals. I have a hard time imagining such a goal that would be invariant to whether it was being evaluated or not since evaluation could lead to the model being retrained. It seems to me that in order to solve eval invariance for sufficiently intelligent and agentic systems we would have to solve most of what makes alignment difficult in the first place.

Regarding unawareness I agree that it won't scale to ASI and that Eval Awareness for current evals requires only Human level intelligence. However, I believe there is a lot of headroom to make our evals more realistic. For instance, if we have prompts from deployments that led to concerning behavior for previous models, we could subject new models to a deployment distribution of prompts and manually inspect the ones that previously led to misaligned behavior. The only fundamental difference between deployment and evals is that deployment will happen in the future. So a sufficiently inteligent model could figure this out by for example waiting to see a factorisation of RSA-2048.

Since neither approach would work on a sufficiently intelligent and agentic systems, the question is what scales furthest.

It seems worth pointing out that humans may be eval-unaware but we don't want to be. The simulation hypothesis and the existence of an afterlife are things that humans do care about. If we could find out for sure whether we're in a simulation, most people would want to. We do care about whether the world we live in is just a morality-test to gain access to the Good Place.

Humans aren't eval-invariant either. Humans who believe in a morality-segregated afterlife don't tend to be more moral than humans who believe death is the end; but they do differ in some observable behaviors. (A cynical atheist might say they are more sycophantic.)