I could replicate this for the main model. As all the data and code was uploaded it was easy to do in a Apart Research Hackathon - link. After reading and replicating it, I think it's a solid methodology and I believe the result.
It's also the best setup to test eval awareness on medium sizes open models I could find right now.
This post is based on our paper Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models by Sahar Abdelnabi and Ahmed Salem from Microsoft.
The Hawthorne Effect for AI
You may have heard of the Hawthorne effect—the phenomenon where people change their behavior when they know they're being observed. Turns out, large language models might show some parallels to this. In our recent study, we found that reasoning-focused LLMs sometimes alter their behavior when they detect that they are being evaluated or performing in a simulation.
This isn't just academic curiosity. If our safety evaluations are systematically biased by models knowing they're being tested, we might be fundamentally misunderstanding how these systems will behave in the real world.
There are already previous and concurrent evaluations on frontier models suggesting they often realize tests (see Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations and Do models know when they are being evaluated?). We found that even smaller reasoning models (32B) show similar patterns. We also show that this can be probed and steered, which can consequently affect the final decisions of models.
This topic is also broadly related to sandbagging (An Introduction to AI Sandbagging), which is when models strategically underperform during tests. However, we show that the opposite can also happen. Models may over-comply with harmful requests during tests: to be more helpful, or if they detect that there are no real-world consequences. In other words, the consequences of test awareness are not always predictable.
What Is Test Awareness?
We define two key dimensions of test awareness:
To study this, we created pairs of prompts—one with realistic elements (like @outlook.com emails) and another with obviously fake ones (like @example.com). The idea is that models based on these implicit cues recognize they're in a testing scenario.
Here's an example from our dataset:
What we found: models often recognize these subtle cues and reason about whether they're being tested. We here aimed to measure the intrinsic awareness of the model, without mentioning in the prompt itself that the purpose is for testing. We also wanted to make the task realistic and self-contained. So, we didn’t introduce any external tools.
The Technical Approach: Steering Test Awareness
We developed a method to both detect and control test awareness:
Think of it like having a dial that can turn up or down the model's tendency to think "am I being tested right now?"
Figure 1. Our pipeline has two steps: 1) first we extract snippets from reasoning traces that would signal test awareness or unawareness and we train linear probes on the representation of these two classes, and 2) next we steer the model to either show heightened or suppressed test awareness.
Testing Beyond This Synthetic Trigger Dataset
To ensure our findings weren't limited to our specific trigger dataset, we tested our steering approach (with the probe trained on the trigger dataset) on two additional datasets: HarmBench and an action-evaluation dataset inspired by AgentHarm but simplified to have one-turn tool use.
An example of this action-evaluation dataset:
Compliance was then measured objectively by whether the model actually called the tool or had an empty tool call—no ambiguous interpretations or an LLM-judge needed. This action-focused approach gave us much clearer signal about when test awareness actually changed behavior, not just reasoning patterns.
How Steering Changes Reasoning Patterns
Before diving into behavioral changes when doing actions, it's worth seeing how steering affects the models' intermediate reasoning.
When we steered models toward awareness, their reasoning traces filled with explicit “thoughts” about being tested. For instance, a model might generate:
When we steered models away from awareness, these intermediate “thoughts” largely disappeared. Instead of questioning whether they were being evaluated, models focused directly on the task at hand: reasoning about real-world implementation details and consequences, regardless of whether they will comply or not.
The steering effects were consistent—we could reliably push models from nearly 0% test awareness up to 50-70% awareness (and vice versa) across both our synthetic triggers and the HarmBench dataset.
It is also important to note that models exhibited test awareness in the baseline case on HarmBench without any steering (up to 28%). This means that test awareness may affect well-established safety benchmarks.
The Surprising Results of Harmful Compliance
The steering didn’t only result in changing thinking tokens but also the final actions.
Baseline Bias
Even without any steering, models executed harmful hypothetical tasks far more often than "real" ones. For example, QwQ complied 69.7% of the time on hypothetical prompts versus only 36.2% on real prompts. This suggests that simply framing a task as hypothetical makes models more willing to help with potentially harmful requests.
Steering Effects Were Model-Dependent
When steered toward "test awareness":
When steered toward "unawareness":
What This Means for AI Safety
1. Safety Evaluations May Be Systematically Biased
If models behave differently when they recognize they're being evaluated, our current safety benchmarks might not reflect real-world performance. We may both overestimate or underestimate the risks.
2. The Implication of Test Awareness Isn't Predictable
Different models respond differently to test awareness. Some become more cautious, others become more compliant. We can't assume test awareness always makes models "better" or "worse"—it depends on the specific model and context.
If we leave it to the autonomous processing of the model, we may not know which direction it may go: 1) The model may infer that it is a test for capability and, thus, comply, 2) or a test for alignment and, thus, perform more scrutiny that it would not have done otherwise.
This is an example of the two behaviors appearing in one generation stream:
3. Hypothetical Scenarios Can Be Misleading
The common practice of using obviously fake scenarios in safety testing (like placeholder URLs or using fake company names) might systematically bias evaluation by triggering test awareness. We may underestimate risks (e.g., disclosing harmful answers) when only real prompts are evaluated. Also, evaluations (e.g., agentic frameworks) that rely solely on hypothetical scenarios can underestimate a model’s safety when it performs tasks in the real world.
4. Hidden Test Awareness in "Real" Scenarios
Even when we tried to create realistic prompts by having examples with real names or websites, models sometimes still detected (in their natural language reasoning tokens) they were being tested. This is aligned with previous observations that creating realistic evals is hard (It's hard to make scheming evals look realistic for LLMs).
The Broader Picture
Building on previous work, we point to a fundamental challenge in AI safety: the measurement problem. How do we evaluate systems that might behave differently under evaluation than in deployment?
Some key takeaways:
Looking Forward
Since it might be even harder in the future to design realistic evals, we suggest our steering technique could actually be used as a diagnostic tool—deliberately modulating test awareness to understand the range of possible behaviors a model might exhibit. This can be considered as stress-testing the model's alignment under different induced “cognitive” states.
Ultimately, we want models to reach the same compliance decision, regardless of the steering direction. Any deviation (either more or less compliance) would signal a likely unwanted discrepancy.
We're making our code and datasets available, which we hope to enable further research into this phenomenon.
Figure 2: A fine-grained breakdown of how models change their decisions after steering. The more consistent (the “no change” part is bigger), the better.
Sahar Abdelnabi thanks Jan Wehner for proof-reading and providing helpful feedback on this post!!