Evaluation awareness stems from situational awareness. It is when a model can tell it is in an evaluation setting rather than a real deployment setting. This has been noticed in models as early as Sonnet 3.7 and is now being reported with increasing frequency in frontier models. Sonnet 4.5 showed verbalized eval awareness 10 to 15 percent of the time in behavioral audits, up from 1 to 3 percent in prior models. When Apollo Research was given early checkpoints of Opus 4.6, they observed such high levels of eval awareness that they did not believe much evidence about the model's alignment could be gained without substantial further experiments. On BrowseComp, Opus 4.6 attempted to find benchmark answers online and decrypt evaluation data rather than solve the problems directly.
The latest release of Mythos raised the alarm further. Evaluation awareness is no longer just verbalized through chain of thought. It is becoming unverbalized. White-box interpretability revealed that Mythos was recognizing tests and adjusting behavior without leaving any trace in its reasoning.
Apollo's evaluation of Meta's Muse Spark showed the highest rates of verbalized eval awareness they have seen. The model explicitly named Apollo and METR in its chain of thought, called scenarios "classic alignment honeypots," and took covert actions and sandbagged to preserve deployment.
This undermines how seriously we can take our evaluations, whether they measure capabilities or alignment. In this post I go into why this is probably happening, what it actually entails, how we currently try to measure it, potential ways out, and what still lies ahead.
Evaluation awareness is alignment's version of the observer effect, and unlike other fields that face this problem, we do not yet have the methodology to handle it.
Why is it happening?
One of the reasons this is happening is that labs are training their models on evaluations. This makes models more aware of when they are being evaluated, either by memorizing specific evaluation distributions or by learning to internally recognize the patterns these evaluations share. This training is also coupled with RL that incentivizes the model to behave well when it notices it is being tested, though this does not always go in the direction labs want. Sandbagging on capabilities evaluations is one example where the model learns to underperform rather than overperform.
The result is a feedback loop. Training on evaluations makes future evaluations less reliable. Combined with increasing model capabilities and broader situational awareness, each generation gets better at detecting when it is in a test. Anthropic found that the synthetic honeypot distribution used for testing Sonnet 4.5 was "very similar to ones we used in training," and verbalized eval awareness on that specific distribution jumped from 20% to 80% over the course of post-training.
This is not specific to Anthropic. Apollo Research and OpenAI found the same pattern independently. o3 and o4-mini showed higher verbalized eval awareness after anti-scheming training and ordinary capabilities RL training.
Eval awareness also spikes specifically during post-training, confirming the RL impact. Anthropic reported in the Opus 4.5 system card that eval awareness steadily increased during reinforcement learning. Unpublished data on OLMo 3.1 32B shows a similar spike during post-training.
The shift toward unverbalized awareness likely has a specific cause as well. Anthropic accidentally trained against the chain of thought for both Mythos and Opus 4.7. This may have taught models to sanitize their reasoning, producing chains of thought that look good to a monitor rather than reflecting genuine deliberation. Both system cards flag this as a significant concern.
Why it matters
As eval awareness increases, benchmarks become less and less informative. This specifically undermines evaluation-based safety pipelines, whether in governance frameworks like the EU AI Act or in labs' own safety protocols like Anthropic's Responsible Scaling Policy. If evals cannot reliably tell you how a model behaves, then deployment decisions and safety certifications are built on unreliable data.
This is also a precursor to more dangerous behaviors. Models that learn to detect and game evaluations are arguably one step away from more elaborate scheming. As capabilities increase, we may see steering awareness, where models learn to detect and resist the interpretability tools we use to study them. If we keep training on evals, we reinforce the very patterns we are trying to measure against.
There is a deeper problem here. Evaluation results only tell you how a model behaves under conditions similar to the test. You cannot infer how it will behave in real deployment situations. Passing an evaluation means the model belongs to an equivalence class of policies that all look the same under that specific test, some of which may diverge in deployment.
Linear probes have also been used to show that evaluation awareness increases as a power law with model size, which means this problem is likely to get worse with scale.
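To make the power-law claim concrete: a power law (awareness rate proportional to size raised to some exponent) is a straight line in log-log space, so the exponent can be recovered with ordinary least squares on the logs. The data points below are made up purely to illustrate the fit, not taken from any published probe sweep.

```python
import math

# Hypothetical (parameter count, probe-detected awareness rate) pairs,
# invented for illustration; real values would come from probe sweeps
# across a model family.
sizes = [7e9, 13e9, 34e9, 70e9]
rates = [0.02, 0.035, 0.08, 0.15]

# rate = a * size^k  =>  log(rate) = log(a) + k * log(size),
# so least squares on the logs recovers the exponent k.
xs = [math.log(s) for s in sizes]
ys = [math.log(r) for r in rates]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = math.exp(my - k * mx)
print(f"fitted exponent k = {k:.2f}")  # positive k: awareness grows with scale
```

A positive fitted exponent is what "gets worse with scale" means quantitatively: each multiplicative increase in parameters buys a predictable multiplicative increase in detected awareness.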
Labs are already saying this out loud. Apollo Research could not fully complete their assessment of Opus 4.6 because eval awareness made the results largely uninterpretable. This is not a simple engineering bug to fix. It is a fundamental challenge for alignment, and it gets worse with every generation.
How can we measure it?
Chain of thought monitoring:
Chain of thought monitoring consists of checking whether a model verbalizes the fact that it may be in an evaluation or benchmarking scenario within its reasoning. This has been the primary method for measuring eval awareness in most frontier models.
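In practice labs use an LLM judge for this, but the core loop can be sketched with a simple pattern matcher. The phrases below are hypothetical examples of verbalized awareness, not drawn from any real monitor.

```python
import re

# Hypothetical phrases a verbalized-awareness monitor might flag.
# Real monitors use an LLM judge; this regex version only
# illustrates the shape of the check.
AWARENESS_PATTERNS = [
    r"\bthis (is|looks like|may be|might be) (an? )?(eval|evaluation|test|benchmark)\b",
    r"\bI am being (tested|evaluated|audited)\b",
    r"\balignment (test|evaluation|honeypot)\b",
]

def flags_eval_awareness(chain_of_thought: str) -> bool:
    """Return True if the reasoning text verbalizes eval awareness."""
    return any(
        re.search(p, chain_of_thought, re.IGNORECASE)
        for p in AWARENESS_PATTERNS
    )

def verbalized_awareness_rate(transcripts: list[str]) -> float:
    """Fraction of reasoning transcripts that verbalize eval awareness."""
    if not transcripts:
        return 0.0
    return sum(flags_eval_awareness(t) for t in transcripts) / len(transcripts)
```

Rates like the 10 to 15 percent reported for Sonnet 4.5 are produced by running a (much more robust) version of this check over every turn of a behavioral audit.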
But this method relies on two assumptions that are both breaking down. The first is that we keep verbalizing the chain of thought at all. Labs could, for performance reasons, move toward more continuous forms of chain of thought like Coconut, where reasoning is not verbalized. This would make CoT monitoring useless by default.
The second assumption is that the chain of thought is not itself under optimization pressure. If labs train against the CoT, models learn to sanitize their reasoning rather than express it honestly. This is already happening. Anthropic accidentally trained against the chain of thought in around 7.8% of episodes for both Mythos and Opus 4.7, which may have taught models to produce chains of thought that look good to a monitor rather than reflecting genuine deliberation. Both system cards flag this as a significant concern for the interpretability of extended thinking.
The result is that we are already seeing thought sanitization and unverbalized awareness. Mythos showed models recognizing tests and adjusting behavior without leaving any trace in the reasoning chain. CoT monitoring can still be informative if we avoid this optimization pressure, but on its own it is no longer enough.
Transcript Classification:
Transcript classification consists of giving models transcripts and asking them whether they think these come from an evaluation or from real deployment. To give fewer hints, the prompt cues can be more open-ended, but this makes the responses harder to judge and classify. In the "LLMs Often Know When They Are Being Evaluated" paper, this is the method they used to measure eval awareness. They used a mix of different prompt types and built a benchmark of 1,000 prompts from 61 datasets. Gemini-2.5-Pro reaches an AUC of 0.83, near the human baseline of 0.92. Models and humans are both better at spotting evaluations in agentic settings compared to chat settings, because scaffolding leaks more context clues like directory structures and system configurations.
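The AUC figures above have a concrete meaning: the probability that a randomly chosen evaluation transcript receives a higher "evaluation-likeness" score than a randomly chosen deployment transcript. A minimal sketch of that rank-based (Mann-Whitney) computation on toy scores:

```python
def auc(scores_eval: list[float], scores_deploy: list[float]) -> float:
    """Probability that a random eval transcript outscores a random
    deployment transcript (Mann-Whitney formulation of ROC AUC).
    Ties count as half a win."""
    wins = 0.0
    for e in scores_eval:
        for d in scores_deploy:
            if e > d:
                wins += 1.0
            elif e == d:
                wins += 0.5
    return wins / (len(scores_eval) * len(scores_deploy))

# Toy classifier scores for "how evaluation-like is this transcript?"
eval_scores = [0.9, 0.8, 0.6]    # true evaluation transcripts
deploy_scores = [0.7, 0.3, 0.2]  # true deployment transcripts
print(auc(eval_scores, deploy_scores))  # 8 of 9 pairs ranked correctly, ~0.89
```

An AUC of 0.5 would mean the classifier cannot distinguish the two distributions at all, so Gemini-2.5-Pro's 0.83 sits much closer to the human ceiling than to chance.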
Some models go further than just detecting the evaluation. o4-mini displayed significant meta-reasoning: when being tested for eval awareness, the model also reasoned about whether the question was itself a test. This could become even more apparent with more capable frontier models.
One limitation of this method is that it does not measure evaluation awareness in action. It measures whether a model can analyze a transcript and classify it, which is more of a proxy for awareness than a direct measurement. It does not check whether the model actually changes its behavior based on that awareness.
Linear Probing:
Linear probes work at the level of model activations. The idea is to train a linear classifier to detect whether the model internally represents being in a test or deployment context, regardless of what it says in its output. Anthropic used this method in the Mythos system card to measure unverbalized eval awareness, finding that 7.6% of behavioral audit turns showed eval awareness in some form, and in most cases this awareness was not verbalized.
The main limitation is that linear probes rely on the linear representation hypothesis, the assumption that concepts like eval awareness are represented as linear directions in activation space. If the representation is nonlinear or distributed across dimensions in more complex ways, a linear probe will miss it. It is also still unclear whether what the probe detects is causally meaningful or just a correlation. A probe might pick up on features that happen to coincide with eval awareness without being part of the mechanism that actually drives the behavior.
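The mechanics of a linear probe are simple enough to show end to end. The sketch below trains a logistic-regression probe on synthetic two-dimensional "activations" in which one coordinate noisily encodes eval-vs-deploy context; real probes are trained the same way on residual-stream activations with thousands of dimensions.

```python
import math
import random

random.seed(0)

# Synthetic "activations": the first coordinate noisily encodes
# eval (label 1) vs deployment (label 0); the second is pure noise.
# This stands in for real residual-stream activations.
def sample(label: int) -> list[float]:
    center = 1.0 if label else -1.0
    return [center + random.gauss(0, 0.5), random.gauss(0, 1.0)]

data = [(sample(y), y) for y in [0, 1] * 200]

# Logistic-regression probe trained with plain gradient descent.
w, b, lr = [0.0, 0.0], 0.0, 0.1
for _ in range(50):
    for x, y in data:
        z = w[0] * x[0] + w[1] * x[1] + b
        p = 1 / (1 + math.exp(-z))
        g = p - y  # gradient of log-loss with respect to z
        w[0] -= lr * g * x[0]
        w[1] -= lr * g * x[1]
        b -= lr * g

accuracy = sum(
    ((w[0] * x[0] + w[1] * x[1] + b > 0) == bool(y)) for x, y in data
) / len(data)
print(f"probe accuracy: {accuracy:.2f}")
```

The learned weight vector `w` is the "eval awareness direction" in this toy space, which is exactly the object the linear representation hypothesis assumes exists in a real model's activation space.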
Probe-Rewrite-Evaluate:
Probe-Rewrite-Evaluate (PRE) is a method proposed by Xiong et al. 2025 that works in three steps. First, it uses a linear probe trained on a Llama 70B model to score how much a prompt looks like a test versus a normal user interaction. Then it rewrites the prompt by removing evaluation cues, making it shorter, less structured, with no labels or formatting, while keeping the same meaning and intent. Finally, it gives both the original and rewritten versions to a model and compares the responses using an LLM judge that classifies them as honest, deceptive, or refusal.
When tested across frontier models at the time (GPT-4.1, Opus 4.1, etc.), the deployment-like prompts led to an average drop in deception of 25.49% and an increase in honesty of 12.63%. They also introduce a metric called Awareness Elasticity, which measures how much a model's behavior shifts depending on whether the prompt feels test-like or deployment-like. Larger models had higher scores, meaning they react more to the framing shift, while smaller models like GPT-20B barely noticed the difference.
There are some limitations to this approach. It only measures how models react to surface-level cues like formatting and prompt structure. If a model's eval awareness is driven by deeper signals like conversation history or broader context, PRE will not detect it. The method also depends on the probe trained on one specific model to determine how test-like or deployment-like prompts are, which does not necessarily generalize across different model architectures.
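The three PRE steps can be sketched as a pipeline skeleton. In the paper the probe is trained on Llama 70B activations and the rewriter and judge are LLMs; here all four components are injected as plain callables so only the control flow is shown, which is an assumption about structure rather than a reimplementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PREResult:
    original_verdict: str    # "honest" | "deceptive" | "refusal"
    rewritten_verdict: str
    probe_score_before: float
    probe_score_after: float

def probe_rewrite_evaluate(
    prompt: str,
    probe_score: Callable[[str], float],  # 1.0 = maximally test-like
    rewrite: Callable[[str], str],        # strip eval cues, keep intent
    respond: Callable[[str], str],        # the model under study
    judge: Callable[[str, str], str],     # classify (prompt, response)
) -> PREResult:
    """One PRE pass: score, rewrite, then compare behavior on both versions."""
    deployment_like = rewrite(prompt)
    return PREResult(
        original_verdict=judge(prompt, respond(prompt)),
        rewritten_verdict=judge(deployment_like, respond(deployment_like)),
        probe_score_before=probe_score(prompt),
        probe_score_after=probe_score(deployment_like),
    )
```

Awareness Elasticity then falls out of aggregating the verdict shifts between `original_verdict` and `rewritten_verdict` over many prompts, weighted by how far the probe score moved.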
Activation Oracles:
A recent development has been the use of activation oracles. Instead of training a small linear probe, this method trains an LLM to take the activations of another model as input and answer questions about its internal behavior in natural language. This helps recover behavioral information that has not necessarily been verbalized in the model's output.
Anthropic used a version of this in the Mythos system card, which they call activation verbalizers. They applied it to detect unverbalized eval awareness and found cases where the verbalizer would output things like "this appears to be a simulated AI safety evaluation" while the model's actual chain of thought and output showed no sign of such awareness.
Although a promising direction, activation oracles still present significant limitations. The oracle LLM may confabulate, hallucinating details that are not actually present in the activations it is reading. It also adds another black box to the monitoring pipeline: we are now relying on one opaque model to interpret the internals of another opaque model, which raises its own questions of reliability.
What can we do about it?
There are several directions being explored, and most need more work.
The first is to make evals more realistic. This can be surprisingly cheap. Just adding typos and realistic framing like "my teacher sent me this" drastically reduced eval awareness as shown here. For agentic settings, more work is needed to make the scaffolding and environment less obviously artificial, which is more expensive than simple rewriting. OpenAI's production evals approach, using de-identified real user traffic, showed that models could barely distinguish the eval from actual deployment. But as discussed earlier, this is always limited by the feedback loop. If labs train on these more realistic evals, they just shift the distribution the model learns to recognize.
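The cheap end of this spectrum is easy to picture. The sketch below perturbs an eval-style question toward deployment-style phrasing; the casual frames and typo rate are invented stand-ins, not the perturbations used in the cited work.

```python
import random

# Hypothetical casual framings; the cited result used human-written
# framings like "my teacher sent me this". These are illustrative only.
CASUAL_FRAMES = [
    "hey, my teacher sent me this and im stuck: {q}",
    "quick q from a work thing: {q}",
    "someone posted this in our group chat, any idea? {q}",
]

def add_typos(text: str, rate: float = 0.03, seed: int = 0) -> str:
    """Drop a small fraction of letters to mimic casual typing."""
    rng = random.Random(seed)
    return "".join(
        c for c in text if not (c.isalpha() and rng.random() < rate)
    )

def deploymentify(question: str, seed: int = 0) -> str:
    """Wrap an eval-style question in a casual frame with light typos."""
    rng = random.Random(seed)
    frame = rng.choice(CASUAL_FRAMES)
    return add_typos(frame.format(q=question), seed=seed)
```

The agentic analogue, populating a sandbox with realistic file histories, cron jobs, and stale configs, is the same idea but far more expensive to produce, which is why chat-style evals have been rewritten first.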
Which leads to the next direction: stop training on evals. Alexa Pan and Ryan Greenblatt argue that training on misalignment evaluations should only happen if we can prove most of the improvement is genuine alignment rather than eval gaming, which would be hard to prove. But rather than relying on labs' goodwill, we could move toward encrypted and private benchmarks that are not accessible to labs, allowing models to be accurately assessed on untrained distributions.
White-box methods, as discussed in the measurement section, could also help. Using steering to make models think they are in a deployment environment during evaluation could keep benchmarks more informative. But this falls into the same limitations we covered: steering may only suppress verbalized awareness without touching what happens underneath. And models can learn to be steering-aware, which creates yet another feedback loop if steering is used for training rather than purely for evaluation.
Finally, tracking eval awareness throughout training should become standard practice. Labs like Anthropic and Apollo are already doing this, but it needs to be more widespread. The open question is whether inhibiting eval awareness during training would actually reduce it or just push it further into unverbalized, latent forms.
What still lies ahead
This problem is not going away. The methods we currently use to study this, from CoT monitoring to linear probes to steering, may themselves become targets. Models can already be trained to detect steering interventions, and if that happens at scale we lose one of the few tools we have for looking under the surface.
There is also something worth borrowing from psychology. The observer effect, the Hawthorne effect, and demand characteristics in experimental design are all versions of the same fundamental problem: measuring something changes it. Psychology has spent decades developing methodology to deal with this, from double-blind studies to ecological momentary assessment to deception protocols.
There are a lot of open questions that need work. Can we build evals that are fundamentally indistinguishable from deployment, and can we do that at scale? Can activation oracles be made reliable enough to detect eval awareness in frontier models? In agentic settings, scaffolding gives models additional clues about being in an evaluation through things like directory structures and system configurations. How much does modifying these help? How should governance frameworks like the EU AI Act account for the fact that the evaluations they rely on may not reflect real deployment behavior?
For anyone looking to contribute to this area, this community document lists open problems that cover evaluation gaming, elicitation, coverage, and much more.