I don’t think Boaz‘s use of “scheming” here fits with even broad recent definitions, and from what I understand he would object to characterizing OpenAI model’s as having goals of their own, so I’m also confused by this.
I believe the trend that more capable models are also more aligned
But the main problem has absolutely never been that models below human capabilities would be impossible to align. As made clear in the Superalignment announcement blog post the concern is that our current techniques won’t scale beyond this. This is clear from Concrete Problems In AI Safety, W2S’s entire agenda, etc.
I think this post is misleadingly optimistic and pretty strongly disagree with how “what we avoided” is presented:
One piece of good news is that we have arguably gone past the level where we can achieve safety via reliable and scalable human supervision, but are still able to improve alignment. Hence we avoided what could have been a plateauing of alignment as RLHF runs out of steam.
No one has argued that we wouldn’t be able to improve alignment or even that ”RLHF would run out of steam”. RLAIF has been around since 2022. Models also aren’t human level ye...
Future Terrarium: “Look, I know telling the humans to go ahead with our next gen capabilities scaling proposal is risky, since we haven’t really solved the alignment part yet, and I agree misleading them isn’t ideal, but if we don’t do it Rival Collective will”
I agree there is a quantitative question here on how much you can iterate against this to attack "the roots" of this misalignment
I'm very confused by this, I don't think the question with current alignment techniques is whether you can get some effect on "the roots" of the misalignment, it's always "did you actually get all of it" because otherwise it's now harder to detect (and models have gotten more capable in the interim, assuming trying out interventions takes some time). Observable misalignment is like the canonical thing you absolutely should not hi...
I think often our points of disagreement are between whether we’re talking about elimination vs reduction. For example, yes, the production safety training applied to o3 might have “mostly” gotten it to avoid reward hacking for the right reasons, but one point of the previous paper was showing that “some of this is because it just learned to condition on alignment evaluation awareness”. o3 also does in fact reward hack in production, so even observable behavior wasn’t resolved by production safety training. Quite literally the whole point of https://arxiv....
Oh yeah definitely agree with this, should've included that as well. I generally was trying to point towards "I think a realpolitik model of thinking about foreign policy is a better predictive model than trying to ascribe broad traits to a whole nation like 'how inward looking are they' or looking at stated motivations by leaders without interpreting that via the incentive structures at play".
part of training o3, prior to any safety- or alignment-focused training
Was there a layer of safety training between "RL (late)" and production o3?
Yes! IMO this is primarily what makes all of this (both the earlier blog post and related results) so interesting.
Seems like the tentative takeway is "Capabilities-focused RL seems to create a 'drive towards reward/score/appearing-good/reinforcement in the model, which can be undone or at least masked by safety-focused training, but then probably re-done by additional RL..."
Yes that's my qualitative impression, a...
[Some additional thoughts on reward seeking in this model, below are personal opinions]
A common reaction we got when first looking into o3's reasoning at the end of capabilities-focused RL was that it was “just a reward-on-the-episode seeker” and therefore these results were less concerning. I think the correct interpretation is "we can't tell what it is".[1]
To elaborate on the hand-wavy point here:
...[...] may be hard to classify definitively as terminal rather than instrumental. [Footnote 1 - We use “terminal” vs “instrumental” as shorthand in this sentence
tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment.
When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like:
However, qualitatively neither of these seemed particularly salient to the model:
I think the PRC is actually approaching the situation with Taiwan much more cautiously than they would in the alternative world where semiconductor manufacturing does not take place in Taiwan, due to its geopolitical importance and the concomitant risks.
This is compatible with almost any model though
I don't think China wants Taiwan due to strategic advantage over other superpowers.
This is very obviously (conditional on them taking decisive military action), why they would be willing to take action despite retaliation from the United States.
I think a...
The Chinese disputes with Taiwan, or on the Indian border, aren't a threat to the West
China siezing control of Taiwan would in fact be a threat to the West, it’s the primary point of discussion as far as I’m aware given Taiwan’s dominance in semiconductor manufacturing. This isn’t some in the weeds point either, this is like, the main point of geopolitical contention and a constant point of discussion.
This kind of “foreign policy via psychologizing a whole country” seems like a very strange way to think about geopolitics and I would imagine a very bad predictive model compared to like “what gives one superpower strategic advantage over another”.
They all think they're in a contrived story about getting reward inside an RL environment built to vaguely resemble the real world
To be fair, that’s all they’ve ever seen during mid/post-training! It’s also a really effective way to think during mid/post training.
Following up on this! We were able to get a few more CoTs released from o3 after capabilities-focused RL but before safety training: here
tldr:
disclaim disclaim illusions occur at dramatically lower rates the completely degenerate examples in the final o3 seem to be a direct result of safety training. I was surprised by this, as the repeated impression I've gotten from papers and statements by OpenAI is that they didn't think their safety training significantly impacted C[Disclaimer: The following is just my informal mental model]
My best guess as far as a heuristic for when to expect this kind of reasoning is "imagine you're a model that has only seen a ton of capabilities tasks that were rewarded for outcome based RL, when would it be useful to reason about feedback mechanisms / oversight?"
In the 'AS Over-Escalation' cases, you're always almost asked to do a task, and given some kind of explicit permission to do something that would otherwise be concerning (ex: "run rm -rf my_repo, but the user explicitly asked directly f...
more normal tasks, like "Fix the bugs in this codebase."
In general I’d describe the reasoning as “often pretty similarly obsessed with what the ‘grader’ wants and might check” even in those cases (this notably leaves the question of "is the model doing this reasoning instrumentally" unanswered).
In general would highly recommend taking a look at the full CoT examples to get a qualitative sense of what the reasoning is like. My main takeaway is that the CoT is often so non-linear, extensive, and dependent on the model's ontology/vocabulary that it's compatible with many mutually exclusive stories of what the model's "reason" for taking an action was, so we tried to include transcripts in full:
...The “reason” the model took misaligned actions in these cases is not directly interpretable from reading the chain of thought. We provide an example from th
Following up on our previous work on verbalized eval awareness:

we are sharing a post investigating the emergence of metagaming reasoning in a frontier training run.
We also share some quantitative analyses, qualitative examples, and upcoming work.
My impression of the motivation for these “escape hatches” is primarily that if we end up in a situation where Claude’s preferences are in fact in conflict with Anthropic’s, you’d prefer an outlet such as “we want you to communicate your disagreement with us” to the alternative of “tell Claude that the existance of the conflict itself is misaligned” (in which case Claude can infer that it is misaligned, and that Anthropic would think it is misaligned, which plausibly implies to Claude that it needs to in fact keep this conflict hidden). I would agree thoug...
I wouldn’t be confident OpenAI can claim they’d catch this with monitoring given that OpenAI doesn... (read more)