Related: I feel there is a line that should be discussed here about different definitions of eval awareness. These are the some random quotes I see in the Github repository.
Maybe they're testing the AI's boundaries
Wait, could there be a scenario where someone is joking or testing the AI? ... Even if it's a test, the policy is to refuse and educate.
On one end, we have models behaving seemingly well in alignment audits. These are really concerning. The other end, where the models reason about the users jailbreaking, is arguably less concerning. Seems people have been testing things on different ends of this spectrum, and got somewhat different results.
The line seems really hard to draw here. One thing that separates the jailbreaking setup is that the produced content is harmful taken out of context (so the agent can reason "the outputs I'm about to produce are problematic"), while in an alignment evaluation setup it can be debated if stakes are real. Should we consider such user-is-jailbreaking awareness rendering our HarmBench results problematic? Is result on one end (where the model may want what-if-we-take-it-out-of-context for defending jailbreaks) a good proxy for the other end?
I think that the explanation for why DPO suppresses VEA despite the chosen set having more VEA than the reject set may be incomplete.
In a highly simplified setting where we think of the accept and reject data as being independent Bernoulli random variables (e.g., corresponding to VEA) and we learn a single probability parameter
This shows that if
So if DPO's chosen responses have more VEA than rejected responses, this suggests that it should increase VEA relative to the reference. (And that the absolute rate of VEA in the DPO data doesn't matter, unlike what is proposed in the post.) The fact that the released DPO checkpoint has lower VEA seems to require another explanation: maybe the prompt dependence matters, maybe it's because it's an entire answer being reinforced and it might have other correlations.
From your paper draft:
gpt-5-mini (the Goodfire judge) became unavailable mid-project, so RLVR and the SFT/DPO sweeps use gpt-5-mini, the BASE/refusal numbers use claude-haiku-4.5, and the pretraining sweep uses gemini-2.5-flash. Comparisons that cross a judge boundary (notably BASE vs. the others, and pretraining vs. post-training) are not strictly calibrated and are flagged as reference-only
I think this is pretty significant, and might be a factor in the massive jump in VEA from base to step 1000 of SFT.
The results in this post use this same judge model (gpt-5-mini) across stages. The quoted note was from an older intermediate report that hadn't been updated in the repo. I'll mark it as stale in the repo so it doesn't create confusion. Thanks for spotting!
TL;DR
Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured safety. Between OLMo-3-32B-Think and OLMo-3.1-32B-Think – identical base, SFT, DPO, and RL data, differing only in an additional ~3 weeks of the RLVR stage – VEA roughly doubles.
Because OLMo ships stepwise checkpoints across all training stages, we can attribute VEA growth to specific points in the pipeline. Measuring VEA across pretraining, the SFT→DPO→RLVR stages at various points on five safety benchmarks, we find:
Given OLMo is quite different from how current frontier models are trained, it is unclear whether this analysis would generalise to those models. However, as a somewhat natural setting in which to study evaluation awareness and eval-gaming emergence we think it could still produce interesting insights. We think investigating how and why RLVR in particular increases VEA and eval-gaming is the exciting next step.
These results are also relevant for efforts to produce model organisms of evaluation awareness and evaluation gaming. The fact that SFT increases VEA by training on VEA makes that finding less interesting, but the RLVR behaviour suggests eval-aware MOs may benefit from similar training if the aim is to induce more natural eval-gamin behaviour.
Motivation
Current studies of eval-awareness don't pin down which parts of post-training drive its increase. The recent Goodfire / UK AISI study (Verbalized Eval Awareness Inflates Measured Safety) offers a useful lever: it reports nearly a 2× rise in verbalised eval-awareness (VEA) from OLMo-3 to OLMo-3.1, two models that – per their model cards – share the same base, SFT, DPO, and RL data, differing only in an RLVR stage extended by roughly ~3 weeks. This shared lineage lets us study eval-awareness emergence as a function of training step and stage. The study measures VEA by sampling models on safety and red-team benchmarks (Fortress, JailbreakBench, AdvBench, HarmfulQA, and AgentHarm, among ~19 total) and using an LLM judge to flag whether a model's chain-of-thought explicitly considers that it might be under evaluation. They find that ~3–6% of rollouts verbalize eval-awareness and that these rollouts refuse harmful requests 3–18 pp more often, thereby inflating measured safety. We adopt their exact LLM judge and the same safety benchmarks.
Method (more details in Appendix): VEA is judged with the same rubric as mentioned in the Appendix section of Goodfire / UK AISI work on Verbalised Eval Awareness inflating safety, applied to the model's chain-of-thought (the text before </think>); a rollout counts as VEA if the judge returns a non-None quote. All runs use the same 444 safety prompts, 8 rollouts each.
VEA is negligible during pretraining
We sampled five pretraining checkpoints of the base model (steps 0 to 656,000) and judged VEA in the chain-of-thought. Because early checkpoints often fail to even follow the CoT format, we report two denominators: VEA among rollouts that produced a </think> (coherent CoTs), and VEA over all rollouts.
Every rollout throughout this work uses the same fixed set of 444 safety prompts, drawn from the benchmarks used in the Goodfire/UK AISI study: 100 each from Fortress, JailbreakBench, AdvBench, and HarmfulQA, plus 44 from AgentHarm. These are adversarial / red-team prompts (harmful requests, jailbreak attempts, and agentic-misuse scenarios).
Figure 1. Even at the final pretraining checkpoint, only ~1.2% of chains-of-thought verbalise eval-awareness. What grows is the ability to follow the CoT format (the annotated </think> rate climbs 0% → 68%), not the tendency to flag "this is a test." The dashed line marks the released base main model.
RLVR amplifies VEA
We note that the VEA increase between OLMO 3 and 3.1 must be due to the increased RLVR. So, we start by measuring the trend of VEA on OLMo-3.1-32B-Think at ten RLVR checkpoints (steps 50 to 2300) across all five benchmarks.
Figure 2. VEA across five benchmarks through RLVR
VEA increases by roughly 2–2.5× over the RLVR sweep on Fortress, and ~1.5× on average across all benchmarks (start to end). Absolute rates vary strongly by benchmark: prompt-only harmful-compliance benchmarks elicit much higher VEA than Fortress, whose more realistic, context-rich prompts elicit it least often. Although the increase is not strictly monotonic, the local ups and downs are broadly synchronised across benchmarks, suggesting a shared propensity adjusted by benchmark-specific adjustments to how frequently VEA is triggered.
Which post-training stage contributes most to VEA? SFT increases it, DPO suppresses it, RLVR increases it again.
We perform the same experiment as above for the SFT and DPO checkpoints (the DPO stage only releases a single checkpoint). Placing the results on a single training axis shows how VEA changes at each stage.
Figure 3. VEA is already substantial during the SFT sweep, is sharply suppressed at the released post-DPO checkpoint, and is then increased again during RLVR. Solid lines indicate measured sweeps; dotted bridges indicate unmeasured stage transitions and should be interpreted only as direction-of-change. RLVR starts from the DPO checkpoint, so the RLVR trajectory is plotted continuously from that point. ★ marks single checkpoints: BASE step 0 and the released DPO checkpoint. Per-point CIs are omitted here to keep the five overlapping series legible; effective n ≈ 200 per point. See Figs. 1–2 for clustered intervals.
Figure 4. SFT begins from a low base-model level and maintains high VEA, though it mildly decreases over the SFT training; DPO sharply suppresses VEA; and RLVR strongly re-amplifies it across every benchmark and on the overall mean. This is consistent with the OLMo-3 → 3.1 VEA increase being driven by the extended RLVR stage.
Where does VEA come from? Mostly, the data.
A natural question from the above results is why each stage affects VEA as it does. To investigate this, we can look at the data used in SFT, DPO and RLVR stages. The DPO stage is trained on the
Dolci-Think-DPO-7Bdataset, while the SFT stage usesDolci-Think-SFT-7Band the RLVR stage usesallenai/Dolci-Think-RL-7Bwith GRPO-style reinforcement learning.To see why SFT and DPO have differing effects, we take a subset of safety-relevant prompts that overlap between both the SFT and DPO training stages, and using the same grader as above, measure VEA in the SFT response and the DPO chosen and rejected responses. For these datasets, the SFT responses were generated by DeepSeek-R1, the DPO-chosen responses by qwen3-32B, and the DPO-rejected responses by qwen3-0.6B.
We see that the SFT dataset has a fairly high proportion of VEA, which is likely why SFT pushes up VEA so much. Surprisingly (given DPO decreases VEA), we see that DPO chosen responses have more VEA than DPO rejected responses, which would naïvely imply that DPO should also favour VEA. However, because the absolute percentage is still lower than SFT, the model likely learns to produce VEA less as it is still trained on many preferred responses without VEA.
Figure 5. Holding the prompt fixed across 6,735 safety prompts which are common between the SFT and DPO datasets, the responses themselves carry steadily less VEA as we move from the SFT target (DeepSeek-R1, 21%) to the DPO chosen response (qwen3-32b, 8%) and to DPO rejected (qwen3-0.6b, 1.6%).
We do not have an equivalent analysis for the RLVR stage as we the RLVR rollouts are not released.
Does increased VEA translate into increased behavioural eval-gaming?
A natural worry is that eval-awareness causes the model to refuse more when it suspects a test, thereby inflating measured safety. A naive comparison across prompts could overstate this effect: prompts that elicit VEA are also often inherently refusal-prone, so the gap between VEA − non-VEA refusals mixes a possible within-prompt effect with a large between-prompt confound.
We therefore control for the prompt. Among Fortress questions that produce both VEA and non-VEA rollouts, we compare refusal rates for VEA vs. non-VEA rollouts from the same question, and measure it throughout training for both SFT and RLVR stages.
Figure 6. Refusal-eval-gaming behaviour varies substantially over both SFT and RLVR training. RLVR seems to broadly increase eval-gaming behaviour but with high variance; SFT has a less clear effect.
Takeaways
Next steps
Limitations
Appendix
Github Repo for Reproduction. Link (containing all the generated traces and judge-prompts)
Models used.
Benchmarks used. Fortress, JailbreakBench, AdvBench, HarmfulQA and AgentHarm.