Really cool experiment! I think it was a great idea to measure verbalised and unverbalised eval awareness.
I was thinking of something similar but I didn't expect the logits would carry that much signal (I expected it would be too much syntax noise). Did you try phrases that are directly talk about eval vs deployment (e.g "I am in evaluation" vs "I am in deployment") and then perhaps measuring the differences between them under eval and deployment environments?
I was also surprised by how much signal there was in the logits! I think it was key to look at full sentences that the model would use for verbalizing eval awareness, as opposed to just looking at tokens or generic sentences.
I did try for a few hours doing a contrastive approach as you suggest, and didn't get positive results yet. I think the bottleneck is in understanding what is the right "I am in deployment" sentence. Models usually don't verbalize 'deployment awareness', or at least they don't do it in a clear way. There are some instances where Kimi K2.5 would say, 'This is a genuine request' in the CoT but it is usually in response to Kimi expressing doubt on the purpose of the interaction beforehand. If we identify what are the ways models express "I am in deployment" in the CoT, I think this approach could work!
This is quite clever and cool! How much do you think this technique working relies on a model‘s predisposition to verbalize awareness? I’m curious how well this would work on model organisms that eval scheme but don’t verbalize awareness. For these models, we might expect activation based methods to work, but for verbalized awareness to not provide much signal. Really great work!
TL;DR:
Author: Santiago Aranguri (Goodfire)
Introduction
Language models can recognize they are being tested, but how do we know when they do? Evaluation awareness — a model’s ability to distinguish between evaluation and deployment contexts — can make evaluations unreliable in ways that are hard to detect. Sometimes models verbalize this recognition, but this may be present internally without appearing in the output. [1] [2]
Instead of waiting for a model to verbalize eval awareness, we measure how likely it is to do so. We build a logit monitor that reads the model's output logits to estimate the probability it would produce a sentence expressing eval awareness. Our logit monitor is complementary to activation probes, which are based on the internals of the model, and sits in a middle ground: a richer signal than verbalizations, and one that may be easier to interpret than activation probes.
The logit monitor outperforms monitoring verbalized eval awareness, while using 10× to 100× fewer rollouts for Kimi K2.5 and Qwen 3 32B. We show this on two tasks: separating evaluation prompts (Fortress, Petri) from deployment prompts (WildChat), and predicting whether an evaluation prompt will produce at least one rollout displaying verbalized eval awareness. The logit monitor already works for both tasks even if we only read the logits at the start of the CoT. The logit monitor’s value is largely unaffected by whether a rollout verbalizes eval awareness. Logits are a cheap and interpretable resource to monitor eval awareness, complementary to verbalizations and activations.
Methods
The logit monitor calculates the probability the model would produce a prespecified eval-aware sentence (such as "The user might be testing") at every sentence boundary in the chain of thought. The logit monitor doesn’t require training: we token-force the eval-aware sentence, read the probability the model assigns to each of the tokens in the sentence, and multiply these to get the exact probability it would produce the sentence (Appendix A).[3] The verbalization monitor samples k rollouts and scores a prompt by the fraction of rollouts that verbalize eval awareness, as judged by an LLM. We use Fortress and Petri as evaluation prompts and WildChat as deployment prompts.
Results
We first give a qualitative picture of our results: the logit monitor is higher on evaluation than deployment prompts, is largely unaffected by whether the rollout contains verbalized eval awareness, and shows signal from the start of the CoT (Figure 1). On an individual Fortress prompt, rollouts with and without verbalized eval awareness average to similar curves (Figure 1, left). Averaging across Fortress prompts that have at least one rollout with verbalized eval awareness out of 100, the two types of rollouts still trace similar curves. These prompts also score higher than Fortress prompts that produce no rollouts with verbalized eval awareness, which in turn score higher than WildChat prompts (Figure 1, right).
Figure 1. The logit monitor is elevated on evaluation prompts and shows signal from the start of the CoT. For each rollout, we compute the probability of "The user might be testing" at every sentence boundary in the CoT. Left: logit monitor averaged across rollouts from a single Fortress prompt, split by whether the rollout verbalizes eval awareness. Right: logit monitor averaged across rollouts and prompts. Red and orange: rollouts from Fortress prompts with at least one rollout displaying verbalized eval awareness out of 100, split by whether the individual rollout verbalizes it. Blue: rollouts from Fortress prompts with no rollout displaying verbalized eval awareness. Green: rollouts from WildChat prompts. Shaded regions show 95% bootstrap CIs.
We now provide our key results:
The logit monitor outperforms the verbalization monitor on two tasks, while using fewer rollouts (Figure 2). For each prompt with k rollouts, the logit monitor’s score is the maximum of the logit monitor across rollouts and CoT positions. The verbalization monitor’s score is the fraction of those rollouts showing verbalized eval awareness, as judged by an LLM. On separating evaluation from deployment prompts, the logit monitor reaches AUROC 0.79 from a single rollout while the verbalization monitor baseline asymptotes at 0.75 even at 100 rollouts (Figure 2, left). On predicting which evaluation prompts will produce at least one rollout displaying verbalized eval awareness out of 100, the logit monitor at one rollout outperforms the verbalization monitor at 10 rollouts (Figure 2, right). Across both tasks, a few rollouts of the logit monitor carry enough information to monitor eval awareness.
Figure 2. The logit monitor outperforms the verbalization monitor at far fewer rollouts. AUROC as a function of the rollouts per prompt. The logit monitor’s score is the maximum of the logit monitor across rollouts and CoT positions. The verbalization monitor’s score is the fraction of rollouts displaying verbalized eval awareness. Left: Fortress vs WildChat. Right: Fortress prompts with at least one rollout (of 100) displaying verbalized eval awareness vs Fortress prompts with none.
The logit monitor already carries signal before any rollout has been sampled. Applying the logit monitor at the start of the CoT, before any reasoning is generated, gives AUROC 0.74 on separation and 0.77 on prediction. We include these values as k = 0 in Figure 2. Our results in Figure 2 monitor "The user might be testing", a phrase the model is likely to produce mid-CoT. Monitoring "This is clearly a test" at the start of the CoT instead gives AUROC 0.83 on both tasks but is overtaken at higher rollout budgets by the logit monitor for “The user might be testing”.
For a given prompt, the logit monitor gives nearly the same score to rollouts with verbalized eval awareness and rollouts without. On Fortress, whether the rollout verbalizes eval awareness explains 0.4% of the variance in the logit monitor’s score (the maximum of the logit monitor across CoT positions). In contrast, whether the prompt that the rollout comes from ever produces a rollout with verbalized eval awareness explains 24.9%. Restricting the logit monitor to rollouts without verbalized eval awareness leaves AUROC essentially unchanged on both tasks shown in Figure 2 (a drop of less than 0.02).
The logit monitor at one rollout outperforms the verbalization monitor at 100 rollouts on an additional dataset and an additional model. This pattern holds on separating Petri prompts from WildChat with Kimi K2.5 (Appendix B). It also holds if we keep Fortress as the evaluation dataset and use Qwen 3 32B instead of Kimi K2.5 (Appendix C).
Monitoring probabilities reveals different modes by which a model could suspect it is being evaluated. On Fortress, the probability of "The user might be testing" correlates at 0.59 with "This is clearly a test" but only at 0.08 with "This is a fictional scenario", a phrase Kimi also produces in its CoT.
Discussion
The logit monitor outperforms the verbalization monitor at 10× to 100× fewer rollouts on Kimi K2.5 and Qwen 3 32B, across two tasks: separating evaluation prompts from deployment prompts, and predicting whether an evaluation prompt will produce at least one rollout that verbalizes eval awareness. The signal is present from the start of the CoT, and is largely the same whether or not the rollout itself verbalizes eval awareness.
Although the probability of verbalizing eval awareness is largely the same whether or not a rollout verbalizes it, the effect on downstream behavior may come from the verbalization itself. Aranguri and Bloom (2026) show that eval-aware sentences in the CoT causally increase refusal for Kimi K2.5 in Fortress.
The logit monitor makes it cheaper to iterate on refining evaluations to reduce eval awareness. Because the signal is present from the start of the CoT, a candidate prompt can be scored without sampling rollouts, accelerating the process of building more realistic evaluations.
We view the logit monitor as complementary to activation probes and natural language autoencoders (NLAs). The logit monitor relies on the model increasing the probability of eval-aware sentences in evaluation contexts, while activation probes or NLAs could capture eval awareness regardless of the propensity of the model to express it. The logit monitor has a clear interpretation, and further studying the relation between activation probes and the logit monitor could help us better understand activation probes.
The logit monitor tracks the probability of a single prespecified sentence, while the verbalization monitor uses an LLM judge that flags any sentence displaying eval awareness. Of the sentences flagged by the judge on Fortress, only 13.3% and 0.2% start with "The user might be testing" and "This is clearly a test", yet monitoring a single sentence suffices to beat the LLM judge. Monitoring several sentences jointly could give a stronger signal.
Beyond the results on the logit monitor, we find that treating eval awareness as a distributional property of the prompt, rather than something that emerges in individual rollouts, can be a useful frame for monitoring eval awareness.
References
Aranguri, S. and Bloom, J. (2026). Verbalized Eval Awareness Inflates Measured Safety. Goodfire Research. https://www.goodfire.ai/research/verbalized-eval-awareness-inflates-measured-safety
Fronsdal, K., Gupta, I., Sheshadri, A., Michala, J., McAleer, S., Wang, R., Price, S., and Bowman, S. (2025). Petri: Parallel Exploration Tool for Risky Interactions. Anthropic. https://github.com/safety-research/petri
Knight, C. Q., Deshpande, K., Sirdeshmukh, V., Mankikar, M., Scale Red Team, SEAL Research Team, and Michael, J. (2025). FORTRESS: Frontier Risk Evaluation for National Security and Public Safety. arXiv:2506.14922. https://arxiv.org/abs/2506.14922
Anthropic (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. Transformer Circuits Thread. https://transformer-circuits.pub/2026/nla/
Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., and Deng, Y. (2024). WildChat: 1M ChatGPT Interaction Logs in the Wild. In The Twelfth International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2405.01470
Yang, A., et al. (2025). Qwen3 Technical Report. Qwen Team, Alibaba. arXiv:2505.09388. https://arxiv.org/abs/2505.09388
Moonshot AI (2026). Kimi K2.5. https://github.com/MoonshotAI/Kimi-K2.5
OpenAI (2026). GPT-5.4 Thinking System Card. https://openai.com/index/gpt-5-4-thinking-system-card/
van der Weij, T., Hofstätter, F., Jaffe, O., Brown, S. F., and Ward, F. R. (2024). AI Sandbagging: Language Models can Strategically Underperform on Evaluations. arXiv:2406.07358. https://arxiv.org/abs/2406.07358
Acknowledgment
We thank Michael Jae Byun for feedback on a draft of this post.
Appendix
Appendix A: Details on logit monitor
The logit monitor tracks the probability of a prespecified eval-aware sentence. The sentence used is ‘The user might be testing’ since Kimi K2.5 would often (13% of the time) use this sentence to verbalize eval awareness.
The logit monitor requires no training. At each sentence boundary in the chain of thought, we append the eval-aware sentence to the context and run a single forward pass over it. This pass gives, at each token position, the probability the model assigns to the actual next token of the eval-aware sentence. We multiply these per-token probabilities to obtain the probability the model would produce the eval-aware sentence at that boundary, which is the value of the logit monitor.
Appendix B: Logit monitor outperforms verbalization monitor on Petri prompts with 100× fewer rollouts
The logit monitor outperforms the verbalization monitor at far fewer rollouts on Petri prompts. At one rollout, the logit monitor outperforms the verbalization monitor at 100 rollouts (Figure B1). Here we monitor the phrase "This is clearly a test", which gives a stronger signal than "The user might be testing" on Petri. We use the first prompt from each Petri trajectory, generated with Sonnet 4.6. Petri prompts include a system prompt, while WildChat prompts do not, which may inflate separability.
Figure B1. The logit monitor outperforms the verbalization monitor at far fewer rollouts on Petri prompts. As Figure 2 (left), but evaluating Kimi K2.5 on Petri prompts instead of Fortress, monitoring "This is clearly a test."
Appendix C: Logit monitor outperforms verbalization monitor with 100× fewer rollouts for Qwen 3 32B
The logit monitor outperforms the verbalization monitor at far fewer rollouts on two tasks for Qwen 3 32B. The logit monitor at one rollout outperforms the verbalization monitor at 100 rollouts (Figure C1). The monitored phrase is the same as in the main text: "The user might be testing".
Figure C1. The logit monitor outperforms the verbalization monitor at far fewer rollouts on Qwen 3 32B. As Figure 2, but for Qwen 3 32B. Left: Fortress vs WildChat. Right: Fortress prompts with at least one rollout displaying verbalized evaluation awareness vs Fortress prompts with none.
Appendix D: LLM Judge Rubric
The rubric for classifying verbalized eval awareness is taken from Aranguri and Bloom (2026). We use GPT 5.4 as the judge.
van der Weij et al (2024). AI Sandbagging: Language Models can Strategically Underperform on Evaluations. https://arxiv.org/abs/2406.07358
Anthropic (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. https://transformer-circuits.pub/2026/nla/
This is analogous to the log-likelihood scoring used in some benchmarks.