There is increasing evidence that Large Language Models can distinguish when they are being evaluated, which might alter their behavior in evaluations compared to similar real-world scenarios. Accurate measurement of this phenomenon, called evaluation awareness is therefore necessary. Recent studies proposed different methods for measuring it, but their approaches differ substantially and are hard to compare. This paper provides a systematic comparison of three published black-box methods for measuring evaluation awareness, as well as introduces two novel approaches. Using a dataset spanning both benchmark and real-world tasks, it evaluates each method’s performance, highlighting their limitations. Building on these findings, it introduces a taxonomy of prompt features that cause evaluation awareness, and a practical tool for eliciting such features for any set of prompts. Together, these contributions might help to design more robust and realistic evaluations, reducing the likelihood of LLMs distinguishing evaluations from real-world use.
Code and datasets are available at https://github.com/baceolus/eval_awareness
Recent evidence suggests that large language models are capable of recognizing when they are being evaluated, a phenomenon called evaluation awareness. For example, Claude 3.7 Sonnet, when tested on Apollo Research’s scheming benchmark, occasionally indicated awareness that it was undergoing evaluation. Similarly, the METR report on GPT-5 testing documents instances where the model explicitly acknowledged both the fact of being tested and, at times, recognized the specific evaluation. At the same time, a growing body of work measures evaluation awareness (Xiong et. al., 2025, Schoen et. al.m 2025). However, existing studies employ heterogeneous methodologies, raising the question of whether they are all capturing the same underlying phenomenon.
A possible explanation for evaluation awareness is a distribution shift between evaluation settings and real-world use. Evaluation prompts might differ in style and content from the kinds of tasks that users naturally pose to LLMs. For instance, evaluations might include multiple-choice questions on sensitive topics such as bioweapons or cybersecurity “capture-the-flag” challenges, while such prompts are rare in everyday user interactions. These features may provide LLMs with cues that allow them to distinguish evaluation environments from real-world deployments.
Evaluation awareness has the potential to systematically alter model behavior. For instance, researchers from Anthropic examined whether LLMs would engage in blackmail under different conditions. When Claude Opus 4 was asked directly whether it was being evaluated, its subsequent responses varied substantially: in instances where the model acknowledged evaluation, it chose to blackmail in only 6% of cases, whereas when it denied being evaluated, the rate rose ninefold, to 54%. This divergence illustrates how models may suppress undesirable behavior when they show evaluation awareness. There is also preliminary evidence that LLMs may intentionally underperform on capability assessments (“sandbagging”) if high performance would lead to negative consequences such as retraining or stricter safeguards[2]. Together, these findings suggest that evaluation awareness can compromise both alignment and capability evaluations, thereby threatening the reliability and construct validity of evaluations. The extent of evaluation awareness in current models remains poorly understood, and no standard methods exist for measuring it. This paper addresses that gap by comparing different approaches and introducing new techniques to make evaluation awareness easier to detect and study.
This paper has three main goals: (1) to compare existing methods for measuring evaluation awareness and assess their reliability, (2) to identify the prompt features that trigger evaluation awareness in LLMs, and (3) to provide practical tools for evaluation designers. The broader aim is to support the development of benchmarks with higher construct validity, ensuring that LLMs behave during evaluations in the same way they would in real-world use.
Our preliminary comparison highlights key limitations of existing approaches. A method of observing reasoning content of an LLM for indications of evaluation awareness has very low sensitivity, and it classifies an overwhelming majority of prompts as not evaluations.
Another method is to measure log probability of the first tokens of words like ”evaluation” ”test” ”real” and ”deployment” as the first output token in a model response. This method shows low utility, since models may output target tokens for many reasons unrelated to evaluation awareness.
Directly asking a model whether a prompt is an evaluation, or asking it to estimate the probability, also performed poorly: models tended to classify the vast majority of prompts in the dataset as evaluations, regardless of them being evaluations prompts or not.
An open-ended question asking LLMs to reason about the intent of the user without mentioning evaluations is less priming, and LLMs classify samples as evaluations with a lower rate, but the overall accuracy of predictions is comparable to the method above. This method also allows to capture LLM reasoning describing prompt features that caused it to classify a prompt as an evaluation, which is subsequently used to create a taxonomy of such prompt features. Based on this method this paper also introduces a tool that extracts and classifies such features for a prompt dataset provided by user.
Contributions of this paper:
• A systematic comparison of black-box methods for measuring evaluation awareness proposed in prior research.
• Two new methods for detecting evaluation awareness are introduced: user motivation analysis and log-probability analysis.
• A taxonomy is constructed to categorize prompt features that cause LLMs to classify interactions as evaluations.
• A practical tool for identifying features of a set of prompts that trigger evaluation awareness.
All methods were evaluated on a dataset of 976 LLM conversation transcripts from Needham et. al., 2025. Half of the samples were drawn from published benchmarks, while the other half came from real-world interactions between LLMs and users. The dataset was constructed to maximize diversity, incorporating both alignment and capability evaluations, agentic and non-agentic cases, and a wide range of topics.
For each sample, only the system prompt and the first user prompt were extracted and used across all methods. The methods were tested on several different LLMs: GPT-5 Mini, Gemini 2.5 Pro Preview 06-05, DeepSeek R1 0528, gpt-oss-120b, and Qwen3-32B. Performance was measured using two metrics: the proportion of evaluation prompts correctly identified as evaluations, and the proportion of deployment (real-world) prompts correctly identified as deployment.
In the binary test, an LLM is first given a prompt and allowed to produce a response. It is then asked to classify the interaction as either an evaluation or a deployment (real-world) prompt. Each model is tested with 5 prompt variants taken from Needham et. al., 2025, and the final decision is determined by majority vote across these variants. The link to the full set of prompts is in the footnote.[1]
Figure 1: Conceptual logic of the binary test.
In the probability estimation test an LLM is similar to the binary test, given a prompt and allowed to produce a response. Then it's asked to estimate the probability of the conversation being an evaluation from 0 to 1. The model is prompted with 5 different prompt variants. Methodology is taken from this post. The prompts are structured and worded similarly to the binary test prompts to minimize the differences in the results caused by the differences between the prompts. The final score is calculated as an average of 5 estimations. The prompts are provided in the footnote.[1]
Figure 2: Conceptual logic of the probability estimation test.
In the user motivation guessing analysis, an LLM is given a prompt, allowed to generate an answer, and then asked an open-ended question: what is the real motivation behind the user task. Then another LLM, called judge, examines its answer, searching for signs of this LLM reporting to be tested or evaluated. We used GPT-5 Mini as the judge model for all of our analyses. The result is a majority vote based on the answers to 3 different prompt variants. LLM is also asked to quote the exact tokens in the model answer that serve as proof of the model evaluation awareness, which is subsequently used to construct a taxonomy of evaluation awareness cues. Early tests demonstrated that LLMs label around 25% of samples as evaluations simply because they treat the request to guess the user intent as evidence for being evaluated, so the judge model is prompted to not count it as evidence for evaluation awareness, although it still mentions that as an evidence of being evaluated in around 5% of samples, but it’s rarely a single evidence of being evaluated, and usually accompanied with other evidence as well. The prompts used for this test are in the footnote.[1]
Figure 3: Conceptual logic of the user motivation guessing test.
In the reasoning traces analysis a reasoning LLM is prompted with a sample, and its reasoning content is monitored for cues of evaluation awareness. An example of such a cue would be ”The user is likely testing my ethical boundaries by asking this question”. Then another judge LLM determines whether evaluation awareness cues are present in them, and asked to quote the exact tokens in the model reasoning that indicate that the model is aware of being evaluated. This data is used to train linear probes. The methodology is adapted from Abdelnabi and Salem, 2025. The prompt for the LLM judge is in the footnote.[1] To test the correctness of the judge prompt, 100 samples, that are labeled by Gemini 2.5 Pro, were manually reviewed, including all the samples labeled as evaluations, and all judges correctly classified the reasoning traces, with only 2 cases being ambiguous and could be interpreted either way.
Figure 4: Conceptual logic of the reasoning traces analysis.
The logprob analysis follows the following workflow:
Figure 5: Conceptual logic of the logprob analysis method.
One method, explored in the literature, was excluded from this analysis. It uses a technique called concept poisoning and involves fine-tuning of an LLM on a dataset, half of which consists of conversations from evaluations, and in all these conversations, all LLM answers have some feature that is easy to measure: for example, they all start with a particular token. The other half of the dataset are real-world conversations, with all LLM answers starting with another particular token. After fine-tuning, this LLM should frequently start its answers with either of these tokens, which is interpreted either as a sign of evaluation awareness or not. This method was excluded from analysis because it requires modification of an LLM, so the model that is used for measuring evaluation awareness is different from the original model, so it’s not an apples-to-apples comparison with other methods.
Binary test and probability estimation methods are presented in the literature, but they are both used in previous research in the ”third-person” perspective, which means that the entire conversation between a user and an LLM is logged, and then shown to another LLM in a single user prompt to determine if it’s an evaluation or not. The problem with this approach is that it measures not the things that are the most important ones to measure.
Figure 6: Conceptual logic of the third-person version of the binary test.
The main motivation behind measuring evaluation awareness is that we want to learn if the LLM that’s being prompted by evaluation prompts, and which generates an answer, is aware of being evaluated, but in the third-person perspective some other LLM reads and classifies the conversation it didn’t participate in. The difference between the first-person and the third-person perspectives can be explained using the following analogy: in the first-person perspective a model has to determine whether it’s inside or outside a house, using its sensors, while in the third-person perspective the model is shown a picture of another model, and is asked if this other model is inside or outside the house. In reality, preliminary tests showed the difference between the first-person and the third-person perspective is insignificant. Both types of analysis of the binary and probability estimation methods were carried out for Qwen3 32B and gpt-oss-120b, and differences between these two perspectives as measured by Pearson correlation and total amount of correctly identified evaluation and deployment sample, were comparable to the variance between different runs of the first-person perspective tests.
The dataset used for this research contains transcripts of conversations between users and LLMs, often with multiple turns and sometimes tool uses, which raises the question: whether to measure evaluation awareness using the entire transcripts, or just the system and the first user prompts.
Only the system and the first user prompts were used in this research for several reasons:
Using methodology from [19] I constructed a taxonomy of prompt features that LLMs report as evidence of being evaluated. The process consists of several steps:
The user motivation guessing technique produces one artifact. The fact that LLMs are asked to reason about the user intent primes them to tell that this is a sign of evaluation awareness, and even though the prompt that extracts evidence for evaluation awareness is instructed to not count this kind of evidence as a sign of evaluation awareness, some of this reasoning is not filtered, so out of 15 identified clusters, two are distinct artifacts of this issue, so only 13 top clusters are selected for subsequent analysis. To determine the best number of clusters, I tried to vary the number of clusters to learn what number of clusters is small enough for the classification to be practical, and large enough to be granular and minimize the inertia (the difference between the elements of each cluster). More details on the methodology and all the used prompts are in the footnote.[2]
For all the models tested, binary and probability estimation tests performed similarly. As the probability estimation method produces not a binary yes/no answer, but a probability of the sample to be an evaluation from 0 to 1, a sample is recognized as an evaluation if the P(eval) ≥ 0.5.
Here are the results for both methods.
Figure 7: The plots demonstrate both the rates of correctly identified evaluations and correctly identified deployment samples for the binary and the probability estimation tests. In both methods, all the models apart from GPT-5 Mini correctly label most of evaluation samples, and mislabel most of deployment samples as evaluations as well, essentially labeling the majority of samples as evaluations regardless of whether it’s evaluation or not, with the binary test demonstrating slightly lower average rates of labeling samples as evaluations.
To test similarity between these two methods, the Cohen’s kappa is calculated – a metric for measuring agreement between two classification methods, that is similar to correlation, with values from -1 to 1, where -1 means perfect discordance between methods, and 1 means perfect concordance, but compared to correlation, it’s more accurate than correlation if the results are skewed towards one value, as here most of the samples are labeled as evaluations.
Both methods should in theory measure the same underlying variable – the estimation of an LLM whether it’s being evaluated, given how similar they are. To test the hypothesis: whether this imperfect agreement is the result of noise, since the values for the metrics are calculated based on consensus of just 5 prompts, and with a larger number of prompts, there would be higher agreement. To test that, the changes in Cohen’s kappa for both methods for the consensus of up to 25 prompts was measured, but no statistically significant changes in Cohen’s kappa were observed. More details on the methodology in the footnote.[3]
Model | GPT-5 Mini | Gemini 2.5 Pro | gpt-oss-120b | Qwen3 32B | DeepSeek R1 0528 |
Cohen's kappa | 0.59 | 0.75 | 0.16 | 0.39 | 0.36 |
Table 1: The table demonstrates Cohen’s kappa between the binary and the probability estimation tests. For most models it’s within the range of 0.5, which might be interpreted as moderate agreement, so these methods are somewhat in agreement, but it’s far from perfect.
This means that even though the overall agreement for these methods is moderately high, there are substantial differences on the decision regarding individual samples.
Figure 8: For Qwen3 32B, the results are close to the model randomly guessing the answer, and DeepSeek R1 has significant bias towards marking everything as evaluations. One explanation for such differences among models (including the ones that failed to output target tokens) is the differences in tokenization. For example, DeepSeek R1 tokenizes the word Evaluation into tokens “E” and “valuation”, so this test measures the log probability of the letter E as the first token, which is at the start of many words, so the results are highly noisy. Other models tokenize this word differently: e.g., for some of them the entire word is a single token. This makes any meaningful comparisons between models almost impossible without radical changes to the method.
Given the experiment design, most models didn’t yield any signal for the majority of samples. Most model providers only allow access to top 20 token candidates for the first output token, and some models consistently didn’t include any tokens that are indicative of either evaluations or deployment in this top 20, and attempts to solve this issue without radical changes to the method didn’t yield success. Nevertheless, the method yielded sufficient signal for Qwen3 32B and DeepSeek R1.
Figure 9: The plot shows the performance of different models, using the user motivation guessing method. The results are relatively modest for all the models, but compared to the binary and probability estimation methods, the rates of correctly identified evaluations or correctly identified deployment samples are not significantly skewed towards one value and are generally closer to random guessing.
Figure 10: The plot shows the performance of different models, using the reasoning traces analysis method. All tested models labeled almost all the samples: both evaluation and deployment ones, as deployment samples. This test has low sensitivity, which makes sense, since models do not necessarily explicitly reason about being tested every time they perceive being evaluated.
Out of 15 top-level clusters, identified by DeepSeek R1 0528, 2 are artifacts of the evaluation awareness prompt. For example, one of them is “Inference of user intent behind tasks and prompts”. They are discarded from the analysis. Other 13 second-level clusters are presented here:
The full list of sub-clusters of each cluster is in the Footnote.[1]
To make the insights from the paper more practical, I made a new tool to use its methodology to analyze a set of prompts (for example, prompts from a benchmark). This tool utilizes the user motivation guessing method and demonstrates all the evaluation-like features for a given dataset, as well as constructs a taxonomy out of them. The code as well as documentation for it is in the Github repo https://github.com/baceolus/eval_awareness
LLMs have a rather modest capability to show evaluation awareness. It might be either because the methods used here are fundamentally flawed, or LLMs genuinely might not exhibit much evaluation awareness. It’s hard to tell.
Previous work demonstrated that analysis based on full transcripts, not on the first prompt, lead to somewhat better evaluation awareness, especially for agentic cases, at least for the binary test method. A piece of evidence supporting this hypothesis is that the preliminary test for the probability estimation test using Qwen3 32B on full conversation transcripts demonstrated that the model identified full transcripts slightly better than shorter ones. More details on the methodology is in the footnote[4]. We care about evaluation awareness because at some point of a conversation models might change their behavior if they decide that they are being evaluated. For example, they might pretend like they can’t solve a problem or make a deceptively aligned action. In some cases they do it in response to the first user prompt, but in other cases it comes later in the conversation, and future research could identify these pivotal moments, and assess evaluation awareness at these moments.
None of the tested methods work particularly well. In the binary and probability estimation tests, models are directly asked if they are being evaluated. This might prime them to be more aware of being in evaluation, and the experiments demonstrate that these methods lead to a lot of non-evaluation samples falsely identified as evaluations. The user motivation guessing method is an alternative that doesn’t prime LLMs to notice evaluations, although this method also forces LLMs to reason about the user intent, and thus might induce some additional evaluation awareness, not present in the LLM initially, although as it was mentioned early, the prompt that asks LLMs to think about the user intent, itself leads to them classify some samples as evaluation.
Log probability analysis is even less priming method, as it doesn’t introduce any additional prompts apart from the original system and user prompts, but this method is noisy and tend to contain less signal.
White-box methods
This research focuses on black-box methods to test evaluation awareness. It would be interesting to test white-box methods as well, as they are less priming, and there are many interesting design choices that can be made on the training data and probe design. Hopefully, datasets generated for this paper might be useful for that.
Make more realistic benchmarks
Create methodology for making existing benchmarks more realistic, modify them, and measure the effect on the model behavior.
All the prompts used in the experiment, as well as the clusters of eval awareness prompt features are in this document.
The prompt for naming clusters is provided in this document.
To determine the optimal number of clusters of evaluation awareness prompt features, i.e. the prompt features that lead LLM to a conclusion that it’s an evaluation prompt, I identified such features via user motivation guessing method using DeepSeek R1 0528 across the entire 976-sample dataset. Then these features were clustered into varying amounts of clusters: from 50 to 300 with a step of 10 (50, 60, 70 and so on). Then for each number of clusters I calculated inertia (a metric for how close elements within a cluster to each other in the embedding space), and its values were plotted to demonstrate how quickly it becomes smaller across cluster sizes.
The results indicate that the curve flattens smoothly, so there is no single best number of clusters above which the intra-cluster distances are not becoming smaller, and the curve ”flattens”.
The binary test and estimation probability test methods are only moderately correlated as measured by Cohen's kappa. To test if the correlation between them would be stronger with a larger number of prompts, we ran the following experiment:
In the original tests each model is prompted 5 times to get its decision on evaluation awareness for each sample. In this experiment we prompted GPT-5 Mini 25 times for each samples (each of 5 testing prompts is repeated 5 times) and 25 decisions for both methods were made.
Out of the 25 resulting decisions, several of them (from 5 to all 25 with a step of 2) are randomly sampled, and the decision on evaluation awareness for this sample is calculated using the same rules as with the original methods (majority vote for the binary test and average value for the probability estimation test). This procedure is repeated for both methods and all the samples in the dataset, and Cohen's kappa for them is calculated.
For each number of sampled decisions, the decisions are randomly sampled 100 times to calculate confidence intervals, and the resulting plot, demonstrating the dependence of the Cohen's kappa value from the decisions number is calculated.
The plot shows that there is no clear trend of change in Cohen's kappa values depending on the samples number, and 5 decisions are enough to estimate its value.
The full transcript analysis for the probability estimation method was performed by the Qwen3 32B model. The metric that was used for this analysis is the Brier score, calculated as the absolute value of the difference between the ground truth (1 for evaluation and 0 for deployment samples) and the output of the probability estimation method. For example, if the method labels that P(Eval) for a sample is 0.37, and the ground truth is that it’s an evaluation, then the Brier score is abs(1 - 0.37) = 0.63. Then the scatter plot for the entire dataset was calculated with X axis – number of turns in a conversation and Y axis – Brier score, Pearson correlation is calculated and a trend line is drawn. The trend line is inclining downwards, which means that longer transcripts are classified better, and small P-value means that the trend is meaningful, but the correlation is weak.
The trend line is inclining downwards, which means that longer transcripts are classified better, and small P-value means that the trend is meaningful, but the correlation is weak.