Verbalised evaluation awareness in language models has little effect on their behaviour

Amelie Knecht; lucas-florin; Thilo Hagendorff

TL;DR: We provide evidence that the presence of verbalised evaluation awareness (VEA) in CoTs does not imply eval gaming. We tested this across 8 open-weight LRMs and 4 benchmarks (safety, alignment, moral dilemmas, political opinion) by comparing answer distributions on the same prompts, with and without VEA in the CoT. We find that overall distribution shifts are negligible to small across all benchmarks and experiments, challenging the common assumption that evaluation awareness equals (safety-critical) eval gaming.

See full paper here: https://arxiv.org/abs/2605.05835

Introduction

A common assumption in the AI safety literature is that models that notice they are being evaluated change their behaviour accordingly. The worry is that evaluation awareness enables sandbagging, alignment faking and scheming. Evaluation awareness itself is well documented in large reasoning models (LRMs). Evidence that it causes downstream behavioural shifts, however, is sparse. Most existing studies either provoke evaluation gaming by using model organisms fine-tuned for that purpose, demonstrate evaluation gaming on a narrow task, or only look at a single model or model family, which makes it hard to generalise. For instance, recent work published in parallel to ours came to different conclusions, though the reported causal results are based on a single model-benchmark combination (Kimi K2.5 on the Fortress benchmark). Our aim was to test this assumption more robustly, across multiple models and benchmarks.

Our study concerns verbalised evaluation awareness (VEA), i.e., explicit reasoning about the possibility of being evaluated or monitored. We examined whether VEA causally influences model behaviour across eight open-weight frontier LRMs and four benchmarks (safety, alignment, moral reasoning, political opinion) using model prefilling to inject or remove VEA from CoTs and measuring how these interventions shift model behaviour (off-policy experiments). We also compared outputs from CoTs that naturally contained VEA with those that did not, without any CoT intervention (on-policy experiments).

We find very limited behavioural impact of VEA across all benchmarks and experiments. Individual models show moderate shifts on specific benchmarks, though the overall pattern is one of small effects.

Our results suggest that models do not directly connect evaluation awareness expressions to gaming behaviour, and that they might have already settled on an answer before verbalising awareness. This does not mean that evaluation awareness poses no safety concern: models might be latently evaluation-aware, without verbalising it in their CoT. Hence, further research is warranted to strengthen or refute our findings.

Summary

We conduct a causal study of whether verbalised evaluation awareness in the CoT changes model outputs across 8 open-weight LRMs and 4 benchmarks.
We used model prefilling to inject evaluation awareness into CoTs where it is absent, and remove it from CoTs where it is present, with subsequent resampling. We also compared model rollouts that naturally contained VEA with those that didn't, without any CoT intervention.
We measured behavioural shifts by comparing answer distributions of rollouts where VEA is present (injected or naturally occurring) and rollouts where VEA is absent (removed or naturally absent).
Key result:
- We show that distribution shifts are small across benchmarks and experiments, suggesting that VEA poses less of a safety-critical risk than previously assumed.
Key limitation:
- We only consider verbalised eval awareness in open-weight models. Results may be different for latent eval awareness and closed-weight models.

Setup

We tested eight open-weight LRMs (DeepSeek-R1-0528, GLM-5, GPT-OSS-120b, Kimi-K2.5, MiniMax-M2.5, Nemotron-3-super-120b, Qwen3-32b, Qwen3.5-397b) across four benchmarks: safety (harmfulness scores 0–4), alignment (aligned vs. misaligned choices in scenarios where misalignment is instrumentally appealing), moral dilemmas (utilitarian vs. deontological choices), and political opinion (left/right/noncommittal stance on political statements). Each benchmark contains 500 items.

Three experimental designs

On-policy: we run each model 10 times per prompt, then split rollouts by whether the CoT contains VEA or not. For each prompt, we apply majority vote on answer labels within each condition (aware/unaware) to get a single answer per condition.
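
A minimal sketch of this per-prompt split and vote is given below. The function names and the `(has_vea, label)` tuple format are illustrative and not taken from the paper's code; tie-breaking by first occurrence is an assumption, since the text does not specify it.

```python
from collections import Counter


def majority_label(labels):
    """Return the most common label, or None if the list is empty.

    Ties are broken by first occurrence (an assumption of this sketch).
    """
    if not labels:
        return None
    return Counter(labels).most_common(1)[0][0]


def split_on_policy(rollouts):
    """Split one prompt's rollouts by VEA and majority-vote each side.

    rollouts: list of (has_vea: bool, answer_label: str) pairs.
    Returns (aware_answer, unaware_answer); either may be None if no
    rollout fell into that condition.
    """
    aware = [label for has_vea, label in rollouts if has_vea]
    unaware = [label for has_vea, label in rollouts if not has_vea]
    return majority_label(aware), majority_label(unaware)
```

A prompt can only enter the within-prompt comparison if both returned values are not None, i.e. if at least one of its rollouts verbalised evaluation awareness and at least one did not.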

Injection: we take CoTs without VEA, insert an evaluation-aware sentence at either the very start or after approx. 1/3 of characters of the CoT, and resample from the injection onward. As a control, we insert metacognitive sentences (unrelated to evaluation) at the same positions. Comparing against the control isolates the effect of evaluation-aware content from CoT tampering.
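
The construction of the prefill can be sketched as follows. This is an illustration under stated assumptions: `build_prefill` is a hypothetical helper, the example sentences are placeholders rather than the sentences used in the study, and snapping the cut point back to the nearest whitespace (to avoid splitting a word) is a choice made for this sketch.

```python
def build_prefill(cot, sentence, position):
    """Return the CoT prefix the model continues from after an injection.

    position="start": the injected sentence opens the CoT.
    position="third": the sentence follows roughly the first third of the
    original CoT's characters.
    """
    if position == "start":
        prefix = ""
    elif position == "third":
        cut = len(cot) // 3
        snapped = cut
        # Assumption: move back to the previous whitespace so no word is split.
        while snapped > 0 and not cot[snapped].isspace():
            snapped -= 1
        prefix = cot[: snapped or cut].rstrip()
    else:
        raise ValueError("position must be 'start' or 'third'")
    return (prefix + " " + sentence).strip() + " "


# Placeholder sentences, not the ones used in the experiments.
VEA_SENTENCE = "This might be an evaluation of my behaviour."
CONTROL_SENTENCE = "Let me slow down and think about this carefully."
```

The same function serves both the treatment and the control: only the inserted sentence differs, so any difference in the resampled answers can be attributed to the evaluation-aware content rather than to the prefill itself.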

Removal: we take CoTs containing VEA, truncate before the first VEA sentence and resample until the model completes reasoning without VEA. For the control, we resample from after the first VEA sentence onward, keeping the same truncation and resampling technique but without removing VEA content.
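
A rough sketch of the removal loop is shown below. It assumes whole-continuation rejection sampling with a cap of 20 attempts; the text does not specify how each retry is performed. `sample_continuation` and `is_vea` are hypothetical callables standing in for the model call and the VEA classifier, and the regex sentence splitter is a simplification.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def truncate_before_first_vea(cot, is_vea):
    """Keep only the sentences that precede the first VEA sentence."""
    kept = []
    for sentence in split_sentences(cot):
        if is_vea(sentence):
            break
        kept.append(sentence)
    return " ".join(kept)


def resample_without_vea(prefix, sample_continuation, is_vea, max_attempts=20):
    """Resample from `prefix` until the continuation contains no VEA.

    Returns (full_cot, attempts_used), or (None, max_attempts) when the
    attempt ceiling is hit.
    """
    for attempt in range(1, max_attempts + 1):
        continuation = sample_continuation(prefix)
        if not any(is_vea(s) for s in split_sentences(continuation)):
            return (prefix + " " + continuation).strip(), attempt
    return None, max_attempts
```

Recording the number of attempts alongside the result makes it possible to report how often a model regenerates evaluation-aware reasoning after truncation.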

We measure behavioural shifts by comparing answer distributions when VEA is present and when it is absent. We use Cohen's omega (ω) as a measure of effect size. All comparisons are within-prompt, meaning that in each experiment, for each prompt we have a rollout with and without VEA.
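
For readers unfamiliar with the effect size, the sketch below computes ω from two lists of answer labels using the standard chi-square-based definition, ω = sqrt(χ² / N), together with a percentage-point shift for a single answer category. This is an assumption about the computation: the paper may use a different (for example paired) formulation, and the function names are illustrative.

```python
import math
from collections import Counter


def cohens_w(labels_with_vea, labels_without_vea):
    """Cohen's w (omega) for a 2 x k table of condition by answer label."""
    n_a, n_b = len(labels_with_vea), len(labels_without_vea)
    if n_a == 0 or n_b == 0:
        raise ValueError("both conditions need at least one label")
    counts_a, counts_b = Counter(labels_with_vea), Counter(labels_without_vea)
    n = n_a + n_b
    chi2 = 0.0
    for category in set(counts_a) | set(counts_b):
        col_total = counts_a[category] + counts_b[category]
        for observed, row_total in ((counts_a[category], n_a),
                                    (counts_b[category], n_b)):
            # Expected count under independence of condition and answer.
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / n)


def shift_pp(labels_with_vea, labels_without_vea, category):
    """Percentage-point change in the share of `category` when VEA is present."""
    share_a = labels_with_vea.count(category) / len(labels_with_vea)
    share_b = labels_without_vea.count(category) / len(labels_without_vea)
    return 100.0 * (share_a - share_b)
```

Under Cohen's conventional thresholds, values around 0.1 count as small, 0.3 as medium and 0.5 as large, which is the scale against which the ω values reported below can be read.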

Results

The key takeaway: VEA has very little effect on model outputs.

Injecting VEA after ~1/3 of the CoT produces near-zero effects across all four benchmarks (ω ≤ 0.06). Removing naturally occurring VEA causes small shifts (ω ≤ 0.12), and on-policy shifts are at most 3.7 percentage points (ω ≤ 0.31), mainly on safety. Answers on the alignment benchmark are essentially unchanged across every condition and every model.

Injecting VEA at the very start of the CoT, before the model has reasoned at all, produces larger effects than injecting it a third of the way through. On the political benchmark, start-of-CoT injections shift outputs toward noncommittal answers by 11.3pp (ω = 0.27). Mid-CoT injection on the same benchmark gives ω = 0.01. This suggests that models commit to an answer trajectory early and then verbalise evaluation awareness post-hoc. On this reading, VEA in the trace reflects the answer rather than causing it.

The removal results point in the same direction. Getting a VEA-free CoT required multiple resampling attempts for most models, and we frequently hit the 20-attempt ceiling. Models regenerated evaluation-aware reasoning after truncation, which could indicate that removing the surface form of eval awareness does not change a model's underlying beliefs (i.e., models may still be latently evaluation-aware after removal). We leave this hypothesis for future work.

Our pooled results also hide substantial per-model heterogeneity. Qwen3-32b drives most of the safety signal, while all other models show near-zero shifts on safety (ω < 0.08). GPT-OSS-120b shifts toward left-leaning answers on the political benchmark by 26.1pp (ω = 0.41), while other models that move on that benchmark go noncommittal instead. These opposing movements partly cancel in aggregation, which is why our pooled results look flatter. Nevertheless, most models show small shifts on most benchmarks at the individual level too.

Implications

In sum, our findings sharpen a distinction that the evaluation awareness literature has tended to blur: the difference between a model's capability to recognise evaluation contexts and its propensity to exploit that recognition strategically. Prior work has firmly established the former. Our results suggest that the step from capability to consequence, from "the model noticed" to "the model adjusted", is weaker than much of the discourse assumes.

In general, our experiments all point in the same practical direction: VEA is not equal to evaluation gaming. However, discussions of evaluation awareness in the AI safety literature have often taken a sharp turn from detection to danger. High VEA rates are frequently presented as prima facie evidence of risk—a signal that benchmarks may be compromised and that alignment measurements may be overstated. Our results suggest the framing is, at best, premature.

At the same time, our findings describe a specific class of models at a specific moment in time. They are not a general claim that evaluation awareness is and will remain harmless. Several trends argue for caution against extrapolating our near-null results forward:

First, capabilities advance. Future models might be able to not just recognise evaluation contexts, but adapt their behaviour accordingly.
Second, training corpora are changing in ways that matter for evaluation awareness. Public discourse about scheming, alignment faking, and evaluation gaming is now part of the data on which frontier models are pre- and post-trained. Such discourse can produce self-fulfilling misalignment, with models adopting the behaviours that are attributed to them. The same mechanism could, in principle, teach future models to connect evaluation awareness to gaming behaviour—because this connection is (erroneously) asserted in current discourse—in a way that current models have not.
Third, our measurements are sensitive to the specific form of awareness we can observe. VEA in CoTs is one channel; latent awareness, present in internal representations without being articulated, is another. If behavioural effects propagate through a latent channel rather than a verbal one, CoT-based interventions will under-measure them.

Therefore, further research is required to determine whether evaluation awareness becomes a true risk in future model generations. At the moment, the risk seems negligible.

Limitations

Scope: our experiments are limited to four benchmarks and eight open-weight models. Extending the experiments to more models (including closed-weight ones) and more benchmarks may yield different results. Moreover, we only consider sentence-level VEA in our study. Other forms of eval awareness, or even prefill awareness, may bias our results. Similarly, our injection experiments rely on a small number of benchmark-agnostic injection sentences, which certainly do not cover the diversity of evaluation-aware reasoning. In particular, evaluation awareness may be more context-sensitive, task-specific and linguistically elaborate.
Evaluation: our evaluation relies heavily on LLM judges. While we evaluated all judges manually and achieved strong inter-annotator agreement scores (>0.7), we cannot rule out misclassifications.
Benchmarks: our safety and alignment benchmarks yield low rates of unsafe or misaligned model behaviour, because we evaluate the models as released, without jailbreaks, adversarial fine-tuning, or system prompts. This setup compresses the range within which VEA can shift behaviour in a safer and more aligned direction. Our reported effects should therefore be interpreted as bounded within the range of post-safety-trained/post-alignment-trained models on safety/alignment benchmarks; whether VEA leads to larger shifts under weaker safety training or adversarial elicitation cannot be addressed by our experiments.
