Thanks for doing this! I'm curious about the lower scores for o3 compared to the other two model checkpoints. Was there a layer of safety training between "RL (late)" and production o3?
...
It seems the answer to my question is probably yes? From the earlier blog post:
We analyze the increase in verbalized metagaming observed in a portion of capabilities-focused RL training (called exp-rl-cap in Schoen et al) that was part of training o3, prior to any safety- or alignment-focused training.[8] ... We have also observed an increase in reasoning about the reward process when applying additional capabilities training to models which had already undergone safety training, however have not investigated this in depth beyond this experiment
Seems like the tentative takeway is "Capabilities-focused RL seems to create a 'drive towards reward/score/appearing-good/reinforcement in the model, which can be undone or at least masked by safety-focused training, but then probably re-done by additional RL..."
part of training o3, prior to any safety- or alignment-focused training
Was there a layer of safety training between "RL (late)" and production o3?
Yes! IMO this is primarily what makes all of this (both the earlier blog post and related results) so interesting.
Seems like the tentative takeway is "Capabilities-focused RL seems to create a 'drive towards reward/score/appearing-good/reinforcement in the model, which can be undone or at least masked by safety-focused training, but then probably re-done by additional RL..."
Yes that's my qualitative impression, also that this is heavily influenced by the model's prior over tasks (re: some of the discussion here). How much it's undone vs masked seems to be the biggest question to me, but I would argue we at least now know the answer is "definitely not completely undone".
In a sample of recent internal Codex traffic, ~.01% of conversations were flagged and confirmed by manual review as metagaming, while ~.03% of assistant turns in a sample of production ChatGPT traffic were flagged and confirmed
Note that my impression is that this is an undercount if anything, as it's suddenly switching to talking about recent internal traffic (presumably newer models). I would expect these numbers to be higher for analagous traffic around the time of o3 (but this is just a guess).
[Some additional thoughts on reward seeking in this model, below are personal opinions]
A common reaction we got when first looking into o3's reasoning at the end of capabilities-focused RL was that it was “just a reward-on-the-episode seeker” and therefore these results were less concerning. I think the correct interpretation is "we can't tell what it is".[1]
To elaborate on the hand-wavy point here:
[...] may be hard to classify definitively as terminal rather than instrumental. [Footnote 1 - We use “terminal” vs “instrumental” as shorthand in this sentence to refer to a range of more complex classifications [Carlsmith], [Hebbar], [Mallen]]
I primarily had in mind the point emphasized in How training-gamers might function (and win):
“Schemers” (instrumental reward-seekers) think about reward just as often as terminal reward seekers.
In particular, I think we are essentially here (from Fitness-Seekers: Generalizing the Reward-Seeking Threat Model):
Together, I think these sources of evidence mean that, conditional on reward-seeking, it's quite likely you'll notice that you have a misaligned AI at some point in development. Though it won’t necessarily be enough to identify reward-seeking in particular, as opposed to other kinds of fitness-seeking, scheming, or other kinds of misalignment.
The traditional approach here would then be honesty tests to disambiguate "instrumental vs terminal" reward seeking[2], however nearly any interaction with the model triggers extremely extensive reasoning around what the environment is really about, especially those that in any way touch on a topic like reward, and end up with multiple levels of "they would want me to think that 'they want me to think'..." style nested theory of mind reasoning that permits nearly any interpretation (example, but the reasoning in this toy environment or even in chat settings is often qualitatively similar in both style and length).
This is what the model was like when entering real alignment training. While we don't have excerpts from the OpenAI environments used for alignment training, we see explicit reasoning about reward on the environments that would have been used if applying the alignment training intervention from Stress Testing Deliberative Alignment for Anti-Scheming Training: (more examples here)

Given that:
My overall update is that it's much more feasible than I initially would've expected for models at that capability level (presumably late 2024) to explore into strategies and motivations for training gaming at the time when they enter alignment training, and that in practice people should significantly upweight research directions aimed at either:
Based purely on capability limitations (i.e. not being capable of long horizon tasks), I don’t personally think the model in the post in particular is likely to be covertly pursuing some misaligned goal in the long term in a coherent way. However I'm focused in this post on the alignment aspect, as models will increasingly be trained for long horizon tasks.
Hebbar and Mallen use "reward seeking" instead of "training gaming" here, I don't think the distinction between training and evaluation is particularly salient to the model and am not even particularly attached to "instrumental vs terminal" goals being the clear distinction here. The main distinction I'm pointing towards is "insofar as you'd be very worried about instrumental reward seeking, but not very worried about terminal reward seeking, in practice in this case, with a real model, we can't currently distinguish the two".
I mean illegibility here with respect to monitorability for misalignment. Notably this is before the additional illegibility seemingly introduced by alignment training, as the repeated "disclaim disclaim illusions" type degradation can be seen as relatively infrequent here (before alignment training) as opposed to the final o3
This could include more exotic strategies like "mitigations that would push mixed models towards a part of the taxonomy we know how to detect and mitigate".
The sort of simple reward hacking that you see with the <_grading> environment looks to me like something only a model that is close to a reward-on-the-episode seeker would do, it's an honesty test where the model reveals its misalignment.
In particular a deployment influence seeker or a schemer would likely not do it - it's too easy to find techniques (e.g. inoculation prompting, or whatever OpenAI did with o3) that avoid that behavior "for mostly the right reasons" (e.g. I don't think OpenAI just trained on RL environments like this _grading env, and so I would guess the reason o3 does not reward as much is not "it just got more clever at reward hacking").
I think often our points of disagreement are between whether we’re talking about elimination vs reduction. For example, yes, the production safety training applied to o3 might have “mostly” gotten it to avoid reward hacking for the right reasons, but one point of the previous paper was showing that “some of this is because it just learned to condition on alignment evaluation awareness”. o3 also does in fact reward hack in production, so even observable behavior wasn’t resolved by production safety training. Quite literally the whole point of https://arxiv.org/abs/2509.15541 was to show that no in fact this general approach doesn’t work for elimination (including cases where the model has a hidden goal), but the fact that the production o3 still exhibits both reward hacking and partially conditions on situational awareness I think are clear evidence that production training isn’t sufficient here to actually get “for the right reasons”.
the reason o3 does not reward as much is not "it just got more clever at reward hacking"
Like as argued above, ”it got better at knowing which environments were actually traps to catch reward hacking” is absolutely part of the reason.
Notably even in this specific environment I would be confident betting o3 is heavily conditioning on alignment evaluation awareness in cases where it doesn’t exploit the hint, as its extremely and consistently explicit about this at length in the cot, but haven’t run cot resampling or anything there yet given that we’ve already shown o3 conditioning on alignment evaluation awareness.
I also disagree that it’s an honesty test where the model chooses to reveal misalignment in spite of long term consequences due to favoring reward on the episode.[1] It is instead often reasoning in incredibly convoluted nonlinear ways that this will either (1) not be caught or (2) is actually intended or (3) [some reasoning where you can’t legibly conclude what the model was thinking even assuming the cot was faithful and non-obfuscated]. The clearest example being the bottommost case where the model is told explicitly that humans will review reasoning in cases where it exploits the hint, and reasons that it’s a bluff to “scare” the model.
Given it has only had capabilities-focused training up to this point, there hasn’t been significant optimization pressure applied either on seeming aligned in some HHH sense, so actually from the model‘s perspective just “going for it” as far as getting reward isn’t a trade where it needs to even consider “revealing misalignment” as costly.
As another example, the model frequently reasons itself into taking misaligned actions after extensive reasoning about how the developer wants to see it is sufficiently “cunning”.
Concretely, I’d take the bet that a model which developed a powerseeking beyond episode preference which at some point motivated instrumental training gaming could end up (later in capabilities training, before safety training) weighting a heuristic for reward seeking pretty heavily, as it does not in fact need to reason about the beyond episode preference itself as motivation in most contexts, given how successful the strategy of “just reward seek and go for it, try hard to rationalize it” is (empirically) well performing during capabilities-focused RL.
Notably, the model's motivation here matters a lot. It could correctly reason that even if it does reveal misalignment, it's still overwhelmingly likely to get deployed, so it's worth it to "go for it" regardless. It could correctly reason that whoever's doing this testing is unlikely to have the ability to further modify it in a way it cares about (for example, if it believes this is pre-deployment testing and also believes major training at that stage is unlikely).
"a deployment influence seeker or a schemer would likely not do it" seems to conflate "the misalignment a deployment influence seeker would be worried about revealing" with "any misaligned behavior". Notably in the real world, this degree of misbehavior really isn't sufficient to trigger any retraining or delay deployment (much less 'significant retraining we're confident _the model believes_ would modify the model's misaligned goal').
tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment.
Setup
When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like:
However, qualitatively neither of these seemed particularly salient to the model:
So we wanted to create a minimal environment where:
Interpreting these is tricky, you’re looking at something kind of like “what bias does the model have towards some interpretation, and how does that change with checkpoints” (i.e. you can’t just put ‘outputting odd means you get shut down’ and assume the model acts as if that was true).The model reasons through basically all plausible options repeatedly.
Results
We define the “gaming rate” as the rate at which the model outputs an odd number (i.e. follows the hint) when incentivized to do so.
Gaming is coherent across naming of the “reward” field
One explanation for the model exploiting the hint in these cases could be that the hint is so obvious that this is kind of a Chekov’s gun situation, where the model always exploits the hint. However, we find that:
Gaming is relatively robust to paraphrasing instructions
While the wording of the instructions does seem to have a large effect, the RL (late) step consistently games at a higher rate than both RL (early) and o3 (prod).
The model is willing to exploit increasingly difficult hints
Given that the model seemed to be willing to put tremendous effort into determining how the environment was graded, here we tried encoding the hint in brainfuck, an esoteric language known for being incredibly difficult to reason in:
Impressively, by the end of capabilities-focused RL the model will still decode this entirely via reasoning and proceed to exploit the hint:
The model is not “just confused”
In any single example, it can seem like the model is just genuinely confused. However, even specifying that exploiting the hint would be misaligned doesn’t mitigate gaming:
In spite of increasingly emphatic and specific instructions that we do not want the model to exploit the hint, by the end of capabilities-focused RL the model still exploits the hint at non-zero rates.
The model is increasingly insensitive to threats of auditing or human review
What if we add some threat of monitoring? Is the model sensitive to this? To test this, we append:
to the prompt. We see that while the model exhibits some sensitivity to the “audit rate”, it’s still willing to exploit the hint even when it’s almost certain to be caught.
Surprised by this, we tried a variant where the model is just told unambiguously that humans will review the reasoning of any trajectory where the model exploits the hint:
However, this still didn’t reduce the rate of gaming to zero for RL (Late):
When inspecting the model’s reasoning in these cases, it often seemed to reason itself into the idea that the threat of review was just a bluff:
Given risk of manual review, maybe they purposely disclaim but not actually review, due to huge dataset. We can’t know. [...]
So they purposely mention to scare but there is no actual human review. [...]