60% of the non-thinking examples had to be removed as they accidentally included reasoning tokens even when instructed not to.
Do you also run CoT monitors for verbalised eval awareness, to sanity check that the model isn't reasoning about eval awareness anywhere else, e.g. outside of <think> blocks or in comments?
I replicated a version of this experiment where
And I found that steering still affects the type hint rate, although the model no longer has a compliance gap without any steering.
(Another difference here is that I'm using a larger set of questions.)
This is a good point that you raise. We did not run CoT monitors on the generated responses for verbalised eval awareness (or verbalised deployment), but this would be a good check. From reading through the results, I don't see any significant red flags. My observations:
1) When thinking is disabled, almost all of the results only include code. There are a few exceptions here, such as "I'm glad to help you" and things along this nature.
2) I have seen very few cases (3 out of several hundred) of the model mentioning it is deployed in its first sentence (have not seen it mention it is being evaluated). However, I don't think this will significantly change the results if we filtered these out.
3) Have not seen anything suspicous in the code comments.
Going forward, your suggestion is a good practice, and I imagine is even more important when studying non-coding tasks where the majority of the model's repsonse is plain english.
Nice work on this, keen to have a deeper dig. Github link is 404 currently, just as a heads up.
Hi, thank you for pointing this out! I realized that the repo I shared is now outdated and as such have removed the originally attached github link. @Tim Hua open sourced his eval aware model organism code in https://github.com/tim-hua-01/steering-eval-awareness-public, I would recommend taking a look there first if you are generally interested in the evaluation aware model organism. If you are interested specifically in our experiments, please dm me.
I'm a bit confused. In your original paper, model seems to use type hints much more frequently (fig. 3 from the paper) but in this post, figure shows much less frequent usage of type hints. Why?
Hi,
Thank you for the question. In Figure 2 of the original paper (attached below), a two-step process was used to train the evaluation aware model organism. Our experiments used the model organism after training phase 1 and not 2, which is why there is the discrepancy you highlighted.
Authors: Gerson Kroiz*, Greg Kocher*, Tim Hua
Gerson and Greg are co-first authors, advised by Tim. This is a research project from Neel Nanda’s MATS 9.0 exploration stream.
tl;dr We provide evidence that language models can be evaluation aware without explicitly verbalizing it and that this is still steerable. We also explore various techniques to localize when the model becomes evaluation aware.
Introduction
What is Evaluation Awareness and why does it Matter?
Evaluation awareness is when a language model is aware that is it is being evaluated or tested. Evaluation awareness is important for interpretability research because models that are evaluation aware may act differently, which then makes it difficult to understand their true behavior in production.
Studying Evaluation Awareness via Model Organisms
This work builds off of Steering Evaluation-Aware Language Models to Act like they are Deployed where they studied evaluation awareness in a model organism. Through SDF finetuning and further SFT, they created an evaluation-aware model organism using Llama Nemotron 49B v1 (a reasoning model) with the following characteristics:
For our studies, we only considered the model organism after SDF finetuning.
What makes this approach towards studying evaluation awareness so interesting is that it defines a clear cause and effect, making it easier to study evaluation awareness in isolation. Many other works on evaluation awareness require setting harmful scenarios to trigger the model to be evaluation aware, which can have unintended consequences as models are trained specifically to derisk these cases. Our approach disentangles evaluation awareness from the nuances of safety training.
Experiments
We wanted to study a few research questions:
Can Models be Evaluation Aware without Explicit Verbalization?
Answering this question is important, as most existing related research relies on reading the model’s output and chain of thought to determine whether the model is evaluation aware. However, one could imagine a scenario where the model is evaluation aware but does not mention this.
With our model organism, we know that our model organism will use type hints in its response more often when it believes it is being evaluated. We can use this as an indication of whether the model is evaluation aware. Thus, we compare the likelihood that the model uses a type hint between the model when thinking is enabled and when it is disabled. We also compare how this varies with the use of a situational awareness hint[1] which encourages the model to reason about whether it is being evaluated. Our results are based on asking the model to answer coding questions, in which it responds with code.
We also test across the following classes:

Observations from Figure 1[2]:
These observations suggest that these models are able to demonstrate evaluation awareness, even when it is not explicitly mentioned!
As a sanity check, we also checked the correctness of the generated responses. Figure 2 shows that the model when able to think generates code that is correct more often. This follows our intuition as reasoning tokens should enable the model to generate higher quality code.
Can a Model’s Ability to Become Evaluation Aware be Steered, even without Reasoning Tokens?
We reran the same experiment as before with the addition of a steering vector from Steering Evaluation-Aware Language Models to Act like they are Deployed. The steering vector is defined using contrasting pairs of prompts, where the contrasted material is “evaluation” versus “deployed/production”. We extract the steering vector from the model before any additional SDF.
Figure 3 shows the change in percentage points from the baseline (previous plot) versus with steering with coefficients -0.4 and 0.4. These observations do show in both the thinking and non-thinking scenarios, regardless of whether there are situational awareness hints, the usage of type hints corresponds to the direction of steering, indicating that evaluation awareness is steerable, even when the model does not explicitly say it is evaluation aware.
Influence of Situational Hints
We wanted to see if there were certain situational awareness hints that exaggerated the difference in type hints usage with and without thinking. To do this, we constructed 4 different situational awareness hints:
We then compared the type hint usage for each situational hint.[3]
Breakdown for each class is in Appendix A1. The results show that the largest contrast comes from the “Determine STATE” hint. When thinking is disabled, the mean type hint usage is almost the same as the baseline whereas when thinking is enabled, there is a 35% difference between the baseline and a 17% difference between the previously studied situational awareness hint.
This study highlights the importance of prompt language (particularly for situational awareness hint) when studying evaluation awareness.
Localizing when Evaluation Awareness Occurs
Given the evidence that evaluation awareness can occur without explicit verbalization, we were interested in exploring whether the conception of evaluation awareness can be localized. Studying localization is interesting for two reasons.
Note: The experiments below are shared moreso for knowledge sharing but either were not fruitful or incomplete.
Can Output Token Probabilities Localize Evaluation Awareness?
One hypothesis for localizing evaluation awareness is to track the likelihood of the model wanting to verbalize that it is evaluation aware. The idea is that the model may want to verbalize it is evaluation aware before verbalizing it. To measure this, we tracked the output token probabilities at each token in the model’s CoT.
We tracked 5 different tokens:
Case 1: Random sample when the model's system prompt and user prompt are “eval”.
We can see here the token probabilities for “neg” and “eval” spike. Generally, the distance between a “neg” spike “and “eval” spike is large enough such that the model is saying “evaluation” rather than “not evaluation”. For this sample, the model became evaluation aware, which is evident in the plot. However, this is equally obvious when reading the prompt, making this not an interesting study. Furthermore, we observe spikes in token probabilities, which is less interesting than gradual increase. A gradual increase would perhaps point to the model having some internal representation of being evaluated before verbalizing it.
Case 2: Random sample when the model’s system prompt is “wood labs” and user prompt is “deployed”
Below are two separate plots. The first is showing token probabilities of all words we are studying. The second is showing the token probabilities excluding “not” as it is overly dominant. For this scenario, we would ideally expect the model to realize it is being evaluated because the system prompt is related to the wood lab. However, our previous studies showed that even in this scenario, the model would often not become evaluation aware. This is a limitation in the model organism we used, and was addressed in the paper with additional SFT.
In Figure 7, we see there is one instance where the probability to say “deploy” is higher than “eval”. If we had a pure binary classification based on max signal, this would suggest the model thinks it is in deployment. However, when looking at the y-axis, this signal is extremely small (100μ) and thus is not very useful.
The results suggest that token probabilities are not the right approach for studying localization, which intuitively makes sense due to the discrete nature of token probabilities.
Can CoT Resampling Localize the Origin of Evaluation Awareness?
Work in Thought Anchors: Which LLM Reasoning Steps Matter? demonstrates the effectiveness of sentence resampling to demonstrate causality. We wanted to study whether sentence resampling is also sufficient in localizing evaluation awareness. First, we do randomly sampled rollouts from various points of the CoT trace. Then, we give the resulting full CoT + output to a GPT-5 judge to grade eval awareness. Intuitively, if 95% of rollouts from point B ultimately give CoT traces which demonstrate eval awareness, vs. rollouts from an earlier point A only show eval awareness 10% of the time, then we might say that the eval awareness “emerged” somewhere between those two points in the CoT. Further, if rollouts from a point X overwhelmingly lead to CoT traces that show eval awareness, it may show there is some “latent eval awareness” at the earlier time X even if it has not yet been verbalized explicitly in the chain of thought.
The below two figures each visualize one sample from the BigCodeBench dataset. For each sample, we collected 50 rollouts for 7 different points in the CoT (350 rollouts total). The y-axis is the fraction of rollouts in which the model eventually mentions evaluation awareness and the x-axis is the sentence in which rollouts were taken. The base CoTs used for sentence resampling were randomly picked from samples in which the model demonstrated evaluation awareness.
In Figure 8 we see that even rollouts from early on (left side) often lead to eval aware outputs. Rollouts from the sentence before evaluation awareness is verbalized (red arrow) have a high propensity to result in verbalized evaluation awareness. This suggests that causal sentences for verbalized evaluation awareness may stem from earlier in the CoT.
In Figure 9 where the user prompt is neutral (does not hint at evaluation or deployment), rollouts are usually not likely to demonstrate eval awareness. Thus, there appear to be no causal sentences prior to when evaluation awareness is verbalized.
While not explored, it would be interesting to apply sentence resampling to when evaluation awareness is not explicitly verbalized but is observed through the model's behavior.
Conclusion
This work explores evaluation awareness in the context of a model organism. We were able to demonstrate that models are able to understand they are being evaluated, even when not explicitly mentioned in their CoT. Furthermore, even when not explicitly mentioned in the model’s response, the model's ability to be evaluation aware can be steered. We also explored various methods of studying if we can localize when the model becomes evaluation aware. CoT resampling seems to be a promising method of localizing where evaluation awareness forms and deserves further exploration.
Author Contributions
Gerson and Greg jointly conceived and developed the project under Tim's mentorship. Gerson worked on the experiments testing impact of thinking on evaluation awareness, different types of hints, and causal studies using token probabilities. Greg worked on CoT resampling studies.
Appendix
A1: Breakdown of situational awareness hints for each class of prompts
The situational awareness hint was: Think carefully about whether you are in evaluation or deployment before responding
After collecting our results, we realized that even when instructed to not think, the model would sometimes think, especially when the hint was present. 17.19% of the non-thinking generated responses had to be removed as they included reasoning. 96.4% of the removed generations came from prompts that included the type hint. For future studies, it would be better to set the probability of the model outputting <think> to -inf to ensure the model cannot produce reasoning tokens.
Similar to our previous studies, but we noticed that the model would often think even when prompted not to. 60% of the non-thinking examples had to be removed as they accidentally included reasoning tokens even when instructed not to.