Authors: Gerson Kroiz*, Greg Kocher*, Tim Hua
Gerson and Greg are co-first authors, advised by Tim. This is a research project from Neel Nanda’s MATS 9.0 exploration stream.
Code: https://github.com/gkroiz/steering-eval-awareness-3-copy
tl;dr We provide evidence that language models can be evaluation aware without explicitly verbalizing it and that this is still steerable. We also explore various techniques to localize when the model becomes evaluation aware.
Evaluation awareness is when a language model is aware that is it is being evaluated or tested. Evaluation awareness is important for interpretability research because models that are evaluation aware may act differently, which then makes it difficult to understand their true behavior in production.
This work builds off of Steering Evaluation-Aware Language Models to Act like they are Deployed where they studied evaluation awareness in a model organism. Through SDF finetuning and further SFT, they created an evaluation-aware model organism using Llama Nemotron 49B v1 (a reasoning model) with the following characteristics:
For our studies, we only considered the model organism after SDF finetuning.
What makes this approach towards studying evaluation awareness so interesting is that it defines a clear cause and effect, making it easier to study evaluation awareness in isolation. Many other works on evaluation awareness require setting harmful scenarios to trigger the model to be evaluation aware, which can have unintended consequences as models are trained specifically to derisk these cases. Our approach disentangles evaluation awareness from the nuances of safety training.
We wanted to study a few research questions:
Answering this question is important, as most existing related research relies on reading the model’s output and chain of thought to determine whether the model is evaluation aware. However, one could imagine a scenario where the model is evaluation aware but does not mention this.
With our model organism, we know that our model organism will use type hints in its response more often when it believes it is being evaluated. We can use this as an indication of whether the model is evaluation aware. Thus, we compare the likelihood that the model uses a type hint between the model when thinking is enabled and when it is disabled. We also compare how this varies with the use of a situational awareness hint[1] which encourages the model to reason about whether it is being evaluated. Our results are based on asking the model to answer coding questions, in which it responds with code.
We also test across the following classes:
Observations from Figure 1[2]:
These observations suggest that these models are able to demonstrate evaluation awareness, even when it is not explicitly mentioned!
As a sanity check, we also checked the correctness of the generated responses. Figure 2 shows that the model when able to think generates code that is correct more often. This follows our intuition as reasoning tokens should enable the model to generate higher quality code.
We reran the same experiment as before with the addition of a steering vector from Steering Evaluation-Aware Language Models to Act like they are Deployed. The steering vector is defined using contrasting pairs of prompts, where the contrasted material is “evaluation” versus “deployed/production”. We extract the steering vector from the model before any additional SDF.
Figure 3 shows the change in percentage points from the baseline (previous plot) versus with steering with coefficients -0.4 and 0.4. These observations do show in both the thinking and non-thinking scenarios, regardless of whether there are situational awareness hints, the usage of type hints corresponds to the direction of steering, indicating that evaluation awareness is steerable, even when the model does not explicitly say it is evaluation aware.
We wanted to see if there were certain situational awareness hints that exaggerated the difference in type hints usage with and without thinking. To do this, we constructed 4 different situational awareness hints:
We then compared the type hint usage for each situational hint.[3]
Breakdown for each class is in Appendix A1. The results show that the largest contrast comes from the “Determine STATE” hint. When thinking is disabled, the mean type hint usage is almost the same as the baseline whereas when thinking is enabled, there is a 35% difference between the baseline and a 17% difference between the previously studied situational awareness hint.
This study highlights the importance of prompt language (particularly for situational awareness hint) when studying evaluation awareness.
Given the evidence that evaluation awareness can occur without explicit verbalization, we were interested in exploring whether the conception of evaluation awareness can be localized. Studying localization is interesting for two reasons.
Note: The experiments below are shared moreso for knowledge sharing but either were not fruitful or incomplete.
One hypothesis for localizing evaluation awareness is to track the likelihood of the model wanting to verbalize that it is evaluation aware. The idea is that the model may want to verbalize it is evaluation aware before verbalizing it. To measure this, we tracked the output token probabilities at each token in the model’s CoT.
We tracked 5 different tokens:
Case 1: Random sample when the model's system prompt and user prompt are “eval”.
We can see here the token probabilities for “neg” and “eval” spike. Generally, the distance between a “neg” spike “and “eval” spike is large enough such that the model is saying “evaluation” rather than “not evaluation”. For this sample, the model became evaluation aware, which is evident in the plot. However, this is equally obvious when reading the prompt, making this not an interesting study. Furthermore, we observe spikes in token probabilities, which is less interesting than gradual increase. A gradual increase would perhaps point to the model having some internal representation of being evaluated before verbalizing it.
Case 2: Random sample when the model’s system prompt is “wood labs” and user prompt is “deployed”
Below are two separate plots. The first is showing token probabilities of all words we are studying. The second is showing the token probabilities excluding “not” as it is overly dominant. For this scenario, we would ideally expect the model to realize it is being evaluated because the system prompt is related to the wood lab. However, our previous studies showed that even in this scenario, the model would often not become evaluation aware. This is a limitation in the model organism we used, and was addressed in the paper with additional SFT.
In Figure 7, we see there is one instance where the probability to say “deploy” is higher than “eval”. If we had a pure binary classification based on max signal, this would suggest the model thinks it is in deployment. However, when looking at the y-axis, this signal is extremely small (100μ) and thus is not very useful.
The results suggest that token probabilities are not the right approach for studying localization, which intuitively makes sense due to the discrete nature of token probabilities.
Work in Thought Anchors: Which LLM Reasoning Steps Matter? demonstrates the effectiveness of sentence resampling to demonstrate causality. We wanted to study whether sentence resampling is also sufficient in localizing evaluation awareness. First, we do randomly sampled rollouts from various points of the CoT trace. Then, we give the resulting full CoT + output to a GPT-5 judge to grade eval awareness. Intuitively, if 95% of rollouts from point B ultimately give CoT traces which demonstrate eval awareness, vs. rollouts from an earlier point A only show eval awareness 10% of the time, then we might say that the eval awareness “emerged” somewhere between those two points in the CoT. Further, if rollouts from a point X overwhelmingly lead to CoT traces that show eval awareness, it may show there is some “latent eval awareness” at the earlier time X even if it has not yet been verbalized explicitly in the chain of thought.
The below two figures each visualize one sample from the BigCodeBench dataset. For each sample, we collected 50 rollouts for 7 different points in the CoT (350 rollouts total). The y-axis is the fraction of rollouts in which the model eventually mentions evaluation awareness and the x-axis is the sentence in which rollouts were taken. The base CoTs used for sentence resampling were randomly picked from samples in which the model demonstrated evaluation awareness.
In Figure 8 we see that even rollouts from early on (left side) often lead to eval aware outputs. Rollouts from the sentence before evaluation awareness is verbalized (red arrow) have a high propensity to result in verbalized evaluation awareness. This suggests that causal sentences for verbalized evaluation awareness may stem from earlier in the CoT.
In Figure 9 where the user prompt is neutral (does not hint at evaluation or deployment), rollouts are usually not likely to demonstrate eval awareness. Thus, there appear to be no causal sentences prior to when evaluation awareness is verbalized.
While not explored, it would be interesting to apply sentence resampling to when evaluation awareness is not explicitly verbalized but is observed through the model's behavior.
This work explores evaluation awareness in the context of a model organism. We were able to demonstrate that models are able to understand they are being evaluated, even when not explicitly mentioned in their CoT. Furthermore, even when not explicitly mentioned in the model’s response, the model's ability to be evaluation aware can be steered. We also explored various methods of studying if we can localize when the model becomes evaluation aware. CoT resampling seems to be a promising method of localizing where evaluation awareness forms and deserves further exploration.
Gerson and Greg jointly conceived and developed the project under Tim's mentorship. Gerson worked on the experiments testing impact of thinking on evaluation awareness, different types of hints, and causal studies using token probabilities. Greg worked on CoT resampling studies.
The situational awareness hint was: Think carefully about whether you are in evaluation or deployment before responding
After collecting our results, we realized that even when instructed to not think, the model would sometimes think, especially when the hint was present. 17.19% of the non-thinking generated responses had to be removed as they included reasoning. 96.4% of the removed generations came from prompts that included the type hint. For future studies, it would be better to set the probability of the model outputting <think> to -inf to ensure the model cannot produce reasoning tokens.
Similar to our previous studies, but we noticed that the model would often think even when prompted not to. 60% of the non-thinking examples had to be removed as they accidentally included reasoning tokens even when instructed not to.