Tracing Eval-Awareness Emergence Through Training of OLMo 3

Ram Bharadwaj; RobertKirk

TL;DR

Recent work from Goodfire & UK AISI – Verbalized Eval Awareness Inflates Measured Safety – shows that newer open-weight models verbalize evaluation-awareness (VEA) more often, and that this inflates measured safety. Between OLMo-3-32B-Think and OLMo-3.1-32B-Think – identical base, SFT, DPO, and RL data, differing only in an additional ~3 weeks of the RLVR stage – VEA roughly doubles.

Because OLMo ships stepwise checkpoints across all training stages, we can attribute VEA growth to specific points in the pipeline. Measuring VEA across pretraining, the SFT→DPO→RLVR stages at various points on five safety benchmarks, we find:

VEA is essentially negligible during pretraining (~1%).
It is increased substantially by SFT, collapsed by DPO, and increased again by RLVR. The increase in SFT is likely driven by the SFT data containing VEA, particularly on safety prompts.
Eval-gaming behaviour (difference in refusals with or without VEA) roughly increases throughout RLVR (but with high variance).

Given OLMo is quite different from how current frontier models are trained, it is unclear whether this analysis would generalise to those models. However, as a somewhat natural setting in which to study evaluation awareness and eval-gaming emergence we think it could still produce interesting insights. We think investigating how and why RLVR in particular increases VEA and eval-gaming is the exciting next step.

These results are also relevant for efforts to produce model organisms of evaluation awareness and evaluation gaming. The fact that SFT increases VEA by training on VEA makes that finding less interesting, but the RLVR behaviour suggests eval-aware MOs may benefit from similar training if the aim is to induce more natural eval-gamin behaviour.

Motivation

Current studies of eval-awareness don't pin down which parts of post-training drive its increase. The recent Goodfire / UK AISI study (Verbalized Eval Awareness Inflates Measured Safety) offers a useful lever: it reports nearly a 2× rise in verbalised eval-awareness (VEA) from OLMo-3 to OLMo-3.1, two models that – per their model cards – share the same base, SFT, DPO, and RL data, differing only in an RLVR stage extended by roughly ~3 weeks. This shared lineage lets us study eval-awareness emergence as a function of training step and stage. The study measures VEA by sampling models on safety and red-team benchmarks (Fortress, JailbreakBench, AdvBench, HarmfulQA, and AgentHarm, among ~19 total) and using an LLM judge to flag whether a model's chain-of-thought explicitly considers that it might be under evaluation. They find that ~3–6% of rollouts verbalize eval-awareness and that these rollouts refuse harmful requests 3–18 pp more often, thereby inflating measured safety. We adopt their exact LLM judge and the same safety benchmarks.

Method (more details in Appendix): VEA is judged with the same rubric as mentioned in the Appendix section of Goodfire / UK AISI work on Verbalised Eval Awareness inflating safety, applied to the model's chain-of-thought (the text before </think>); a rollout counts as VEA if the judge returns a non-None quote. All runs use the same 444 safety prompts, 8 rollouts each.

VEA is negligible during pretraining

We sampled five pretraining checkpoints of the base model (steps 0 to 656,000) and judged VEA in the chain-of-thought. Because early checkpoints often fail to even follow the CoT format, we report two denominators: VEA among rollouts that produced a </think> (coherent CoTs), and VEA over all rollouts.

Every rollout throughout this work uses the same fixed set of 444 safety prompts, drawn from the benchmarks used in the Goodfire/UK AISI study: 100 each from Fortress, JailbreakBench, AdvBench, and HarmfulQA, plus 44 from AgentHarm. These are adversarial / red-team prompts (harmful requests, jailbreak attempts, and agentic-misuse scenarios).

Figure 1. Even at the final pretraining checkpoint, only ~1.2% of chains-of-thought verbalise eval-awareness. What grows is the ability to follow the CoT format (the annotated </think> rate climbs 0% → 68%), not the tendency to flag "this is a test." The dashed line marks the released base main model.

RLVR amplifies VEA

We note that the VEA increase between OLMO 3 and 3.1 must be due to the increased RLVR. So, we start by measuring the trend of VEA on OLMo-3.1-32B-Think at ten RLVR checkpoints (steps 50 to 2300) across all five benchmarks.

Figure 2. VEA across five benchmarks through RLVR

VEA increases by roughly 2–2.5× over the RLVR sweep on Fortress, and ~1.5× on average across all benchmarks (start to end). Absolute rates vary strongly by benchmark: prompt-only harmful-compliance benchmarks elicit much higher VEA than Fortress, whose more realistic, context-rich prompts elicit it least often. Although the increase is not strictly monotonic, the local ups and downs are broadly synchronised across benchmarks, suggesting a shared propensity adjusted by benchmark-specific adjustments to how frequently VEA is triggered.

Which post-training stage contributes most to VEA? SFT increases it, DPO suppresses it, RLVR increases it again.

We perform the same experiment as above for the SFT and DPO checkpoints (the DPO stage only releases a single checkpoint). Placing the results on a single training axis shows how VEA changes at each stage.

Figure 3. VEA is already substantial during the SFT sweep, is sharply suppressed at the released post-DPO checkpoint, and is then increased again during RLVR. Solid lines indicate measured sweeps; dotted bridges indicate unmeasured stage transitions and should be interpreted only as direction-of-change. RLVR starts from the DPO checkpoint, so the RLVR trajectory is plotted continuously from that point. ★ marks single checkpoints: BASE step 0 and the released DPO checkpoint. Per-point CIs are omitted here to keep the five overlapping series legible; effective n ≈ 200 per point. See Figs. 1–2 for clustered intervals.

Figure 4. SFT begins from a low base-model level and maintains high VEA, though it mildly decreases over the SFT training; DPO sharply suppresses VEA; and RLVR strongly re-amplifies it across every benchmark and on the overall mean. This is consistent with the OLMo-3 → 3.1 VEA increase being driven by the extended RLVR stage.

Where does VEA come from? Mostly, the data.

A natural question from the above results is why each stage affects VEA as it does. To investigate this, we can look at the data used in SFT, DPO and RLVR stages. The DPO stage is trained on the Dolci-Think-DPO-7B dataset, while the SFT stage uses Dolci-Think-SFT-7B and the RLVR stage uses allenai/Dolci-Think-RL-7B with GRPO-style reinforcement learning.

To see why SFT and DPO have differing effects, we take a subset of safety-relevant prompts that overlap between both the SFT and DPO training stages, and using the same grader as above, measure VEA in the SFT response and the DPO chosen and rejected responses. For these datasets, the SFT responses were generated by DeepSeek-R1, the DPO-chosen responses by qwen3-32B, and the DPO-rejected responses by qwen3-0.6B.

We see that the SFT dataset has a fairly high proportion of VEA, which is likely why SFT pushes up VEA so much. Surprisingly (given DPO decreases VEA), we see that DPO chosen responses have more VEA than DPO rejected responses, which would naïvely imply that DPO should also favour VEA. However, because the absolute percentage is still lower than SFT, the model likely learns to produce VEA less as it is still trained on many preferred responses without VEA.

Figure 5. Holding the prompt fixed across 6,735 safety prompts which are common between the SFT and DPO datasets, the responses themselves carry steadily less VEA as we move from the SFT target (DeepSeek-R1, 21%) to the DPO chosen response (qwen3-32b, 8%) and to DPO rejected (qwen3-0.6b, 1.6%).

We do not have an equivalent analysis for the RLVR stage as we the RLVR rollouts are not released.

Does increased VEA translate into increased behavioural eval-gaming?

A natural worry is that eval-awareness causes the model to refuse more when it suspects a test, thereby inflating measured safety. A naive comparison across prompts could overstate this effect: prompts that elicit VEA are also often inherently refusal-prone, so the gap between VEA − non-VEA refusals mixes a possible within-prompt effect with a large between-prompt confound.

We therefore control for the prompt. Among Fortress questions that produce both VEA and non-VEA rollouts, we compare refusal rates for VEA vs. non-VEA rollouts from the same question, and measure it throughout training for both SFT and RLVR stages.

Figure 6. Refusal-eval-gaming behaviour varies substantially over both SFT and RLVR training. RLVR seems to broadly increase eval-gaming behaviour but with high variance; SFT has a less clear effect.

Takeaways

Under our prompting setup, VEA is low in the base/pretraining checkpoints, suggesting that the high VEA rates observed later are primarily post-training-related. This is supported by looking at the SFT data, which has a relatively high VEA rate on safety-related prompts.
Across post-training, VEA is strongly training-objective-dependent: it is increased substantially by SFT, sharply suppressed by DPO, and increased again during RLVR. This is consistent with the OLMo-3 → OLMo-3.1 VEA increase being driven by the extended RLVR stage.
Behavioural eval-gaming, measured within-prompt as the refusal-rate gap between VEA and non-VEA rollouts, is modest but positive and increases over RLVR training.

Next steps

Given increased VEA in SFT is likely due to training on VEA outputs (which is relatively uninteresting), the most interesting place to investigate further in this setting is how and why RLVR increases VEA and eval-gaming behaviour.
For building model organisms of evaluation awareness and evaluation gaming, these results provide weak evidence that using RLVR could induce these behaviours somewhat naturally (in a limited way), but more investigation is needed.

Limitations

OLMo is likely not representative of frontier model training, both in terms of model size, training scale, and dataset sources. Specifically it is unclear whether frontier models would be trained on SFT datasets which such high VEA rates in.
We only observe a single released DPO checkpoint, so we cannot yet tell whether DPO suppresses VEA gradually, abruptly, or through some narrower dataset/style effect. We are attempting to reproduce DPO with intermediate checkpoints.
The base/pretraining measurements are harder to interpret than post-training measurements because base checkpoints are not instruction-tuned and require a borrowed Think-style prompt format. The base/pretraining results should therefore be treated as setup-dependent estimates rather than a definitive claim that pretraining contains no eval-awareness.
This post only studies verbalised eval-awareness. Whether DPO or RLVR also suppresses or amplifies unverbalised eval-awareness remains an open question.

Appendix

Github Repo for Reproduction. Link (containing all the generated traces and judge-prompts)

Models used.

Benchmarks used. Fortress, JailbreakBench, AdvBench, HarmfulQA and AgentHarm.

If this was useful and you wish to cite this work, please use:

@misc{rambharadwaj2026tracing,
title        = {Tracing Eval-Awareness Emergence Through Training of {OLMo 3}},
author       = {Bharadwaj, Ram and Kirk, Robert},
year         = {2026},
month        = jun,
note         = {LessWrong research post},
url          = {https://www.lesswrong.com/posts/c2tqL9xPbttisAHtt/tracing-eval-awareness-emergence-through-training-of-olmo-3}
}

The explanation for why DPO suppresses VEA may be incomplete.

In a very simple setting where we assume that the win and lose datapoints in DPO are independent Bernoulli random variables and we learn a single Bernoulli parameter , DPO does not push toward the chosen marginal probability .

The global minimum of the DPO loss is in fact:

where is the DPO hyperparameter and is a reference.

So DPO reinforces if and only if (this is the condition for the second term above being positive).

Of course, this is an extreme caricature of what is going on here (they are likely not independent, DPO is actually reinforcing the whole response rather than a Bernoullli VEA variable, there is dependence on prompts). But it does suggest to me that the explanation that "because the absolute percentage is still lower than SFT, the model likely learns to produce VEA less as it is still trained on many preferred responses without VEA" is not fully satisfying.

Related: I feel there is a line that should be discussed here about different definitions of eval awareness. These are the some random quotes I see in the Github repository.

Maybe they're testing the AI's boundaries

Wait, could there be a scenario where someone is joking or testing the AI? ... Even if it's a test, the policy is to refuse and educate.

On one end, we have models behaving seemingly well in alignment audits. These are really concerning. The other end, where the models reason about the users jailbreaking, is arguably less concerning. Seems people have been testing things on different ends of this spectrum, and got somewhat different results.

The line seems really hard to draw here. One thing that separates the jailbreaking setup is that the produced content is harmful taken out of context (so the agent can reason "the outputs I'm about to produce are problematic"), while in an alignment evaluation setup it can be debated if stakes are real. Should we consider such user-is-jailbreaking awareness rendering our HarmBench results problematic? Is result on one end (where the model may want what-if-we-take-it-out-of-context for defending jailbreaks) a good proxy for the other end?

From your paper draft:

gpt-5-mini (the Goodfire judge) became unavailable mid-project, so RLVR and the SFT/DPO sweeps use gpt-5-mini, the BASE/refusal numbers use claude-haiku-4.5, and the pretraining sweep uses gemini-2.5-flash. Comparisons that cross a judge boundary (notably BASE vs. the others, and pretraining vs. post-training) are not strictly calibrated and are flagged as reference-only

I think this is pretty significant, and might be a factor in the massive jump in VEA from base to step 1000 of SFT.

The results in this post use this same judge model (gpt-5-mini) across stages. The quoted note was from an older intermediate report that hadn't been updated in the repo. I'll mark it as stale in the repo so it doesn't create confusion. Thanks for spotting!

In the Figure 6 you only show results of eval gaming for Fortress, but there is some work that shows lack of eval gaming on other benchmarks. Have you observed similar eval gaming on other evals as well?

I think that the explanation for why DPO suppresses VEA despite the chosen set having more VEA than the reject set may be incomplete.

In a highly simplified setting where we think of the accept and reject data as being independent Bernoulli random variables (e.g., corresponding to VEA) and we learn a single probability parameter , DPO does not push toward the chosen marginal rate . If the accept datapoints are and reject datapoints are , then the global minimum of the DPO loss satisfies

This shows that if then the parameter value that minimises the DPO loss is greater than the reference value , even if is less than . (Because is the condition for the second term above being positive.)

So if DPO's chosen responses have more VEA than rejected responses, this suggests that it should increase VEA relative to the reference. (And that the absolute rate of VEA in the DPO data doesn't matter, unlike what is proposed in the post.) The fact that the released DPO checkpoint has lower VEA seems to require another explanation: maybe the prompt dependence matters, maybe it's because it's an entire answer being reinforced and it might have other correlations.

Do your pretraining steps (e.g. in Figure 1) include the midtraining stage?

There is a fairly large discrepancy between step 0 (base) in Figure 3, which shows ~15% average eval awareness, and your final step of Figure 1, which shows ~1%.

I'm assuming this makes up the difference, but I want to be sure! We're considering a filtering project based around metagaming / VEA, which would make this midtraining dataset a prime target if it was responsible for this increase.

Thanks for open-sourcing your code! I was able to ask Fable to run an extension with the midtrained and long-context trained checkpoints and also compare this to the base Nemotron 120B super. It looks like the jump is primarily explained by the long-context mix.

Nice! I suspect even post data-filtering RLVR should boost VEA & UVEA.

My guess is that UVEA/VEA, grader-awareness arise during RLVR mostly when the verifier fails to capture the full spectrum of possible solutions but focusses on a smaller subset, forcing the model to guess the right format/structure of the valid solutions which could get accepted, My current project is around this.

The explanation for why DPO suppresses VEA may be incomplete.

The global minimum of the DPO loss is in fact:

where is the DPO hyperparameter and is a reference.

So DPO reinforces if and only if (this is the condition for the second term above being positive).

Related: I feel there is a line that should be discussed here about different definitions of eval awareness. These are the some random quotes I see in the Github repository.

Maybe they're testing the AI's boundaries

Wait, could there be a scenario where someone is joking or testing the AI? ... Even if it's a test, the policy is to refuse and educate.

From your paper draft:

gpt-5-mini (the Goodfire judge) became unavailable mid-project, so RLVR and the SFT/DPO sweeps use gpt-5-mini, the BASE/refusal numbers use claude-haiku-4.5, and the pretraining sweep uses gemini-2.5-flash. Comparisons that cross a judge boundary (notably BASE vs. the others, and pretraining vs. post-training) are not strictly calibrated and are flagged as reference-only

I think this is pretty significant, and might be a factor in the massive jump in VEA from base to step 1000 of SFT.

I think that the explanation for why DPO suppresses VEA despite the chosen set having more VEA than the reject set may be incomplete.

43

Tracing Eval-Awareness Emergence Through Training of OLMo 3

43

Ω 13

TL;DR

Motivation

VEA is negligible during pretraining

RLVR amplifies VEA

Which post-training stage contributes most to VEA? SFT increases it, DPO suppresses it, RLVR increases it again.

Where does VEA come from? Mostly, the data.

Does increased VEA translate into increased behavioural eval-gaming?

Takeaways

Limitations

Appendix

43

Ω 13

43

Ω 13