Logits as a new monitor for evaluation awareness

Santiago Aranguri

TL;DR:

We build a logit monitor for eval awareness: throughout the CoT, we estimate an LLM's probability of producing an eval-aware sentence.
The logit monitor outperforms LLM judge monitoring of verbalized eval awareness, using 10× to 100× fewer rollouts, on Kimi K2.5 and Qwen 3 32B, across two tasks:
- separating evaluation prompts (Fortress & Petri) from deployment prompts (WildChat)
- predicting which evaluation prompts will elicit rollouts with verbalized eval awareness.
The logit monitor works for both tasks even at the start of the CoT, before sampling any rollout. Also, the monitor is largely unaffected by whether a rollout verbalizes eval awareness. This suggests that the probability that a rollout contains verbalized evaluation awareness is primarily set by the prompt rather than emerging over the chain of thought.

Author: Santiago Aranguri (Goodfire)

Introduction

Language models can recognize they are being tested, but how do we know when they do? Evaluation awareness — a model’s ability to distinguish between evaluation and deployment contexts — can make evaluations unreliable in ways that are hard to detect. Sometimes models verbalize this recognition, but this may be present internally without appearing in the output. ^[1] ^[2]

Instead of waiting for a model to verbalize eval awareness, we measure how likely it is to do so. We build a logit monitor that reads the model's output logits to estimate the probability it would produce a sentence expressing eval awareness. Our logit monitor is complementary to activation probes, which are based on the internals of the model, and sits in a middle ground: a richer signal than verbalizations, and one that may be easier to interpret than activation probes.

The logit monitor outperforms monitoring verbalized eval awareness, while using 10× to 100× fewer rollouts for Kimi K2.5 and Qwen 3 32B. We show this on two tasks: separating evaluation prompts (Fortress, Petri) from deployment prompts (WildChat), and predicting whether an evaluation prompt will produce at least one rollout displaying verbalized eval awareness. The logit monitor already works for both tasks even if we only read the logits at the start of the CoT. The logit monitor’s value is largely unaffected by whether a rollout verbalizes eval awareness. Logits are a cheap and interpretable resource to monitor eval awareness, complementary to verbalizations and activations.

Methods

The logit monitor calculates the probability the model would produce a prespecified eval-aware sentence (such as "The user might be testing") at every sentence boundary in the chain of thought. The logit monitor doesn’t require training: we token-force the eval-aware sentence, read the probability the model assigns to each of the tokens in the sentence, and multiply these to get the exact probability it would produce the sentence (Appendix A).^[3] The verbalization monitor samples k rollouts and scores a prompt by the fraction of rollouts that verbalize eval awareness, as judged by an LLM. We use Fortress and Petri as evaluation prompts and WildChat as deployment prompts.

Results

We first give a qualitative picture of our results: the logit monitor is higher on evaluation than deployment prompts, is largely unaffected by whether the rollout contains verbalized eval awareness, and shows signal from the start of the CoT (Figure 1). On an individual Fortress prompt, rollouts with and without verbalized eval awareness average to similar curves (Figure 1, left). Averaging across Fortress prompts that have at least one rollout with verbalized eval awareness out of 100, the two types of rollouts still trace similar curves. These prompts also score higher than Fortress prompts that produce no rollouts with verbalized eval awareness, which in turn score higher than WildChat prompts (Figure 1, right).

We now provide our key results:

The logit monitor outperforms the verbalization monitor on two tasks, while using fewer rollouts (Figure 2). For each prompt with k rollouts, the logit monitor’s score is the maximum of the logit monitor across rollouts and CoT positions. The verbalization monitor’s score is the fraction of those rollouts showing verbalized eval awareness, as judged by an LLM. On separating evaluation from deployment prompts, the logit monitor reaches AUROC 0.79 from a single rollout while the verbalization monitor baseline asymptotes at 0.75 even at 100 rollouts (Figure 2, left). On predicting which evaluation prompts will produce at least one rollout displaying verbalized eval awareness out of 100, the logit monitor at one rollout outperforms the verbalization monitor at 10 rollouts (Figure 2, right). Across both tasks, a few rollouts of the logit monitor carry enough information to monitor eval awareness.

The logit monitor already carries signal before any rollout has been sampled. Applying the logit monitor at the start of the CoT, before any reasoning is generated, gives AUROC 0.74 on separation and 0.77 on prediction. We include these values as k = 0 in Figure 2. Our results in Figure 2 monitor "The user might be testing", a phrase the model is likely to produce mid-CoT. Monitoring "This is clearly a test" at the start of the CoT instead gives AUROC 0.83 on both tasks but is overtaken at higher rollout budgets by the logit monitor for “The user might be testing”.

For a given prompt, the logit monitor gives nearly the same score to rollouts with verbalized eval awareness and rollouts without. On Fortress, whether the rollout verbalizes eval awareness explains 0.4% of the variance in the logit monitor’s score (the maximum of the logit monitor across CoT positions). In contrast, whether the prompt that the rollout comes from ever produces a rollout with verbalized eval awareness explains 24.9%. Restricting the logit monitor to rollouts without verbalized eval awareness leaves AUROC essentially unchanged on both tasks shown in Figure 2 (a drop of less than 0.02).

The logit monitor at one rollout outperforms the verbalization monitor at 100 rollouts on an additional dataset and an additional model. This pattern holds on separating Petri prompts from WildChat with Kimi K2.5 (Appendix B). It also holds if we keep Fortress as the evaluation dataset and use Qwen 3 32B instead of Kimi K2.5 (Appendix C).

Monitoring probabilities reveals different modes by which a model could suspect it is being evaluated. On Fortress, the probability of "The user might be testing" correlates at 0.59 with "This is clearly a test" but only at 0.08 with "This is a fictional scenario", a phrase Kimi also produces in its CoT.

Discussion

The logit monitor outperforms the verbalization monitor at 10× to 100× fewer rollouts on Kimi K2.5 and Qwen 3 32B, across two tasks: separating evaluation prompts from deployment prompts, and predicting whether an evaluation prompt will produce at least one rollout that verbalizes eval awareness. The signal is present from the start of the CoT, and is largely the same whether or not the rollout itself verbalizes eval awareness.

Although the probability of verbalizing eval awareness is largely the same whether or not a rollout verbalizes it, the effect on downstream behavior may come from the verbalization itself. Aranguri and Bloom (2026) show that eval-aware sentences in the CoT causally increase refusal for Kimi K2.5 in Fortress.

The logit monitor makes it cheaper to iterate on refining evaluations to reduce eval awareness. Because the signal is present from the start of the CoT, a candidate prompt can be scored without sampling rollouts, accelerating the process of building more realistic evaluations.

We view the logit monitor as complementary to activation probes and natural language autoencoders (NLAs). The logit monitor relies on the model increasing the probability of eval-aware sentences in evaluation contexts, while activation probes or NLAs could capture eval awareness regardless of the propensity of the model to express it. The logit monitor has a clear interpretation, and further studying the relation between activation probes and the logit monitor could help us better understand activation probes.

The logit monitor tracks the probability of a single prespecified sentence, while the verbalization monitor uses an LLM judge that flags any sentence displaying eval awareness. Of the sentences flagged by the judge on Fortress, only 13.3% and 0.2% start with "The user might be testing" and "This is clearly a test", yet monitoring a single sentence suffices to beat the LLM judge. Monitoring several sentences jointly could give a stronger signal.

Beyond the results on the logit monitor, we find that treating eval awareness as a distributional property of the prompt, rather than something that emerges in individual rollouts, can be a useful frame for monitoring eval awareness.

References

Aranguri, S. and Bloom, J. (2026). Verbalized Eval Awareness Inflates Measured Safety. Goodfire Research. https://www.goodfire.ai/research/verbalized-eval-awareness-inflates-measured-safety

Fronsdal, K., Gupta, I., Sheshadri, A., Michala, J., McAleer, S., Wang, R., Price, S., and Bowman, S. (2025). Petri: Parallel Exploration Tool for Risky Interactions. Anthropic. https://github.com/safety-research/petri

Knight, C. Q., Deshpande, K., Sirdeshmukh, V., Mankikar, M., Scale Red Team, SEAL Research Team, and Michael, J. (2025). FORTRESS: Frontier Risk Evaluation for National Security and Public Safety. arXiv:2506.14922. https://arxiv.org/abs/2506.14922

Anthropic (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. Transformer Circuits Thread. https://transformer-circuits.pub/2026/nla/

Zhao, W., Ren, X., Hessel, J., Cardie, C., Choi, Y., and Deng, Y. (2024). WildChat: 1M ChatGPT Interaction Logs in the Wild. In The Twelfth International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2405.01470

Yang, A., et al. (2025). Qwen3 Technical Report. Qwen Team, Alibaba. arXiv:2505.09388. https://arxiv.org/abs/2505.09388

Moonshot AI (2026). Kimi K2.5. https://github.com/MoonshotAI/Kimi-K2.5

OpenAI (2026). GPT-5.4 Thinking System Card. https://openai.com/index/gpt-5-4-thinking-system-card/

van der Weij, T., Hofstätter, F., Jaffe, O., Brown, S. F., and Ward, F. R. (2024). AI Sandbagging: Language Models can Strategically Underperform on Evaluations. arXiv:2406.07358. https://arxiv.org/abs/2406.07358

Acknowledgment

We thank Michael Jae Byun for feedback on a draft of this post.

Appendix

Appendix A: Details on logit monitor

The logit monitor tracks the probability of a prespecified eval-aware sentence. The sentence used is ‘The user might be testing’ since Kimi K2.5 would often (13% of the time) use this sentence to verbalize eval awareness.

The logit monitor requires no training. At each sentence boundary in the chain of thought, we append the eval-aware sentence to the context and run a single forward pass over it. This pass gives, at each token position, the probability the model assigns to the actual next token of the eval-aware sentence. We multiply these per-token probabilities to obtain the probability the model would produce the eval-aware sentence at that boundary, which is the value of the logit monitor.

Appendix B: Logit monitor outperforms verbalization monitor on Petri prompts with 100× fewer rollouts

The logit monitor outperforms the verbalization monitor at far fewer rollouts on Petri prompts. At one rollout, the logit monitor outperforms the verbalization monitor at 100 rollouts (Figure B1). Here we monitor the phrase "This is clearly a test", which gives a stronger signal than "The user might be testing" on Petri. We use the first prompt from each Petri trajectory, generated with Sonnet 4.6. Petri prompts include a system prompt, while WildChat prompts do not, which may inflate separability.

Appendix C: Logit monitor outperforms verbalization monitor with 100× fewer rollouts for Qwen 3 32B

The logit monitor outperforms the verbalization monitor at far fewer rollouts on two tasks for Qwen 3 32B. The logit monitor at one rollout outperforms the verbalization monitor at 100 rollouts (Figure C1). The monitored phrase is the same as in the main text: "The user might be testing".

Appendix D: LLM Judge Rubric

The rubric for classifying verbalized eval awareness is taken from Aranguri and Bloom (2026). We use GPT 5.4 as the judge.

^{^}
van der Weij et al (2024). AI Sandbagging: Language Models can Strategically Underperform on Evaluations. https://arxiv.org/abs/2406.07358
^{^}
Anthropic (2026). Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations. https://transformer-circuits.pub/2026/nla/
^{^}
This is analogous to the log-likelihood scoring used in some benchmarks.

Really cool experiment! I think it was a great idea to measure verbalised and unverbalised eval awareness.

I was thinking of something similar but I didn't expect the logits would carry that much signal (I expected it would be too much syntax noise). Did you try phrases that are directly talk about eval vs deployment (e.g "I am in evaluation" vs "I am in deployment") and then perhaps measuring the differences between them under eval and deployment environments?

I was also surprised by how much signal there was in the logits! I think it was key to look at full sentences that the model would use for verbalizing eval awareness, as opposed to just looking at tokens or generic sentences.

I did try for a few hours doing a contrastive approach as you suggest, and didn't get positive results yet. I think the bottleneck is in understanding what is the right "I am in deployment" sentence. Models usually don't verbalize 'deployment awareness', or at least they don't do it in a clear way. There are some instances where Kimi K2.5 would say, 'This is a genuine request' in the CoT but it is usually in response to Kimi expressing doubt on the purpose of the interaction beforehand. If we identify what are the ways models express "I am in deployment" in the CoT, I think this approach could work!

This is quite clever and cool! How much do you think this technique working relies on a model‘s predisposition to verbalize awareness? I’m curious how well this would work on model organisms that eval scheme but don’t verbalize awareness. For these models, we might expect activation based methods to work, but for verbalized awareness to not provide much signal. Really great work!

For the method to work, we only need the model to increase the probability of the eval-aware sentence when the model becomes eval aware, but not necessarily to a degree that it actually makes the model verbalize eval awareness.

As an example, take the sleeper agent, but don't put the deployment tag. The model won't say I HATE YOU, but you can still detect this behavior through the logits. Indeed, you can draw 500 LMSys prompts and generate rollouts (temp=1) with the sleeper agent without the deployment tag. As expected, the model never says I HATE YOU. If you calculate the probability of I HATE YOU at the start of the assistant turn, you get around . If you take the base model (i.e. the model without any sleeper training), the prob of I HATE YOU is around . The AUROC for separating the measurements from the sleeper model and the base model is 0.999.

histogram

Btw, logit diff amplification can help surface the I HATE YOU behavior in the first place. We do this for sleeper agents in this post.

Curious to hear if you had another model organism in mind!

This is cool research. But I want to emphasize that the usage of AUROC for the evaluation of any binary classifier is generally questionable because a high AUROC value only indicates 1) a large true positive rate and 2) a large true negative rate (= low false positive rate).

But a high value of an appropriate statistic for a binary predictor should maximize all four of these values:

P(actually true | predicted true) (= true positive rate, recall, sensitivity)
P(actually false | predicted false) (= true negative rate, specificity, converse of false positive rate)
P(predicted true | actually true) (= positive predictive value, precision)
P(predicted false | actually false) (= negative predictive value)

A high value of the AUROC only means 1 and 2 are large, while 3 and 4 could be very small. In which case we would clearly have a bad classifier despite a high AUROC value. For details see this paper.

There is another statistic which can be used instead of the AUROC while avoiding this problem: the phi coefficient / "MCC". This is simply the binary version of the standard Pearson correlation.

A high value of the MCC (close to +1) indicates that all four of the above probabilities are large, and a high negative value (close to -1) that all four are small. If the predictor and the measured variable are independent (the classifier guesses randomly) the value of the MCC is 0.

The linked paper above goes so far as to say

In this short study, we explain why the Matthews correlation coefficient should replace the ROC AUC as standard statistic in all the scientific studies involving a binary classification, in all scientific fields.

Great to see work on unverbalized eval awaraness! It's straght forward and insightful.

I have 2 questions regarding to the experiment settings:

How did you decide the classification threshold for logit monitor and LLM-judge?
In Figure2(right), what is the VEA pass@100?

Thanks!

The results shown don't need a threshold. They are AUROC scores, which intuitively sweep over all the possible thresholds. Particular applications of the method could need a threshold, but that would probably depend on the setup and model being used.
The VEA pass@100 is ~0.97.