- It's unclear what exactly the "confusion is insufficient to explain o3 sandbagging" claim we're arguing about is. Is it that some epsilon > 0 percent of o3's sandbagging is genuine deployment seeking? A more productive framing to me seems to be: is there strong evidence that o3's sandbagging is driven by deployment seeking? I think it's plausible this environment indicates some level of deployment seeking in o3, but my read right now is that we diverge significantly on this latter claim.
FWIW:
The hypothesis that o3's behavior in this blog post's environment is explained by confusion over user intent is consistent with all the evidence presented so far, but far from conclusive. I regret our poorly worded concluding sentence.
To be clear, I agree that the behavior in this blog post is consistent with confusion; my point is that confusion about instructions is insufficient to explain the full set of observations:
1) It takes a lot of deliberate counterinstruction to fully remove o3's sandbagging
...Smoothly decreasing rates of sandbagging as counterinstruction becomes more...
These counterfactuals indicate that R1's and o3's sandbagging in this environment are nearly entirely explained by confusion over user intent, providing little to no evidence of self-preservation or deployment-seeking tendencies in either model. [...] suggesting that sandbagging in both models is fully explained by instruction ambiguity rather than self-preservation or deployment-seeking.
I disagree pretty strongly with the implication that these are fully explained by the model being confused (in the o3 case). See: Stress Testing Deliberative Alignment for Anti-Scheming Training.
I wouldn’t describe this as incoherence
What counts as "smart" eval awareness, as opposed to noticing structural features of the situation that suggest a particular type of evaluation? For example, (2) often routes through this. If I recall the recent post on improving tool realism in Petri correctly, for Sonnet 4.5 they saw the model was heavily conditioned on the fact that there was some specific type of conflict in the scenario, rather than on realism per se.
Models aren’t really incentivized (even in the “smart” eval awareness case) to figure out whether they’re in a simulation, and if you’re im...
I think it largely doesn't matter; models can (and do) often come down on the side of considering things to be evals.
I mean, I imagine the "subagents don't coordinate" issue is a capabilities problem labs have been actively working on for a while now; if you reward "multiagent system gets task graded as complete", you have this exact problem again.
- I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish, but I'm not sure how that bears on the question of why some frontier models are more legible than others. "They were supervised for legibility" is one hypothesis, yes, but it might just be that they started off with a "good" initialization that sent RL on a legible trajectory.
...
- Re: your last point, Anthropic has said that they did direct RL supervision on CoT until recently, but did not do so for Claude 4.5 or later, although those mo...
And that just leaves o3's new-language thing, on which I've already updated.
Again, literally no one is arguing for an ex nihilo new language. The argument has repeatedly been about whether you get semantic drift "by default".
I wouldn't be surprised if models learn to embellish words with slightly more formal meanings
I'm not sure whether you've read the post in this case, but "embellish words with slightly more formal meanings" is just absolutely, empirically not what is happening. The linked post goes into this for multiple terms. (The content is in the l...
Strongly agree! This is my best guess for the most likely pathway to ending up with CoT that couldn't be used as legible evidence of misalignment, without the model ever "intentionally" obfuscating anything. One analogy I think of often here is my own quickly written personal notes, which I suspect would be similarly unintelligible.
Informally, the impression I get from the "post capabilities training but before safety training" CoT is something like how illegible my personal notes are on mobile: it's less "new alien language" and more "shorthand I use that evolved over time and would probably be unreadable to anyone else". I'd be really interested in experiments training a model to both linearize and un-drift CoTs that do have semantic drift.
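A minimal sketch of what the un-drift half of that experiment could look like, assuming access to (drifted CoT, legible paraphrase) pairs, which to my knowledge no public dataset provides; the base model, prompt template, and hyperparameters below are all placeholder assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # placeholder base model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def undrift_step(drifted_cot: str, clean_cot: str) -> float:
    """One SFT step: condition on the drifted CoT, supervise only the legible rewrite."""
    prompt = f"Rewrite this reasoning in plain English:\n{drifted_cot}\nRewrite:\n"
    ids = tok(prompt + clean_cot, return_tensors="pt").input_ids
    # Re-tokenizing the prompt to find its length is approximate at token
    # boundaries, but close enough for a sketch.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    labels = ids.clone()
    labels[:, :prompt_len] = -100  # mask the prompt; loss only on the rewrite
    loss = model(input_ids=ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The linearization half would be analogous, with the target being a cleanly ordered version of the same reasoning rather than a de-drifted one.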
Long post trying to address comments from a few different threads:
I've seen you and nostalgebraist mention a few times that the fact that Anthropic's models have HHH, legible English CoTs is one of the primary points convincing you of your view that semantic drift is not the default.
I've been arguing for a while now that Anthropic CoTs clearly have significant optimization pressure applied (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=cfjCYn4uguopdexkJ)
Ryan disagreed with my claim originally but seems to have since updat...
Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.
TL;DR We replicate Anthropic’s approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail scenario. Our key finding is that “control” steering vectors – derived from contrastive pairs that are semantically unrelated to alignment – can have effects just as large as deliberately designed evaluation-awareness steering vectors. This undermines the use of such steering vectors as baselines, and suggests that steering aimed at suppressing evaluation awareness risks unpredictable spurious effects. We highlight that the ability to reproduce results using open source models was a key enabler for this research.
See here for unabridged...
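For concreteness, here is a minimal sketch of the contrastive-activation steering approach being replicated. This is our illustration, not UK AISI's or Anthropic's actual code; the layer path `model.model.layers` assumes a LLaMA-style Hugging Face model and would differ for GLM, and `alpha` is an arbitrary placeholder strength:

```python
import torch

def steering_vector(model, tokenizer, pos_prompts, neg_prompts, layer_idx):
    """Mean difference of last-token residual activations between two contrastive prompt sets."""
    def mean_act(prompts):
        acts = []
        for p in prompts:
            ids = tokenizer(p, return_tensors="pt").input_ids
            with torch.no_grad():
                out = model(ids, output_hidden_states=True)
            acts.append(out.hidden_states[layer_idx][0, -1])  # last-token activation
        return torch.stack(acts).mean(dim=0)
    return mean_act(pos_prompts) - mean_act(neg_prompts)

def add_steering_hook(model, layer_idx, vec, alpha=4.0):
    """Subtract alpha * vec from one layer's residual stream during generation."""
    def hook(_module, _inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden - alpha * vec.to(hidden)  # match device and dtype
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    # Layer path assumes a LLaMA-style architecture; adjust for other model families.
    return model.model.layers[layer_idx].register_forward_hook(hook)
```

The "control" condition in the finding above amounts to feeding `steering_vector` contrastive pairs that are semantically unrelated to alignment and observing comparably large behavioral shifts.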
In my fake graphs I basically posited that we are in graph A, though it could be the case that we will be in B, and we need to constantly watch out for that. While our observability methods and evaluations are not perfect (monitoring in deployment can miss things, and eval awareness is a concern), they are not completely useless either. So there are some error bars, which leave a gap between observed alignment and actual alignment, but it is not an unbounded gap.
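One way to formalize the distinction (my gloss, not Boaz's): let $A$ be actual alignment and $\hat{A}$ be what our observability methods report. Graph A is the claim that

$$|A - \hat{A}| \le \epsilon, \quad \text{with } \epsilon \text{ bounded as capabilities scale,}$$

while graph B is the world where $\epsilon$ itself grows with capability (e.g. via eval awareness or monitor evasion), so observed alignment stops constraining actual alignment.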
I wouldn’t be confident OpenAI can claim they’d catch this with monitoring given that OpenAI doesn...
I don't think Boaz's use of "scheming" here fits even broad recent definitions, and from what I understand he would object to characterizing OpenAI's models as having goals of their own, so I'm also confused by this.
I believe the trend that more capable models are also more aligned
But the main problem has absolutely never been that models below human capabilities would be impossible to align. As the Superalignment announcement blog post made clear, the concern is that our current techniques won't scale beyond this level. This is clear from Concrete Problems in AI Safety, the entire weak-to-strong (W2S) agenda, etc.
I think this post is misleadingly optimistic, and I pretty strongly disagree with how "what we avoided" is presented:
One piece of good news is that we have arguably gone past the level where we can achieve safety via reliable and scalable human supervision, but are still able to improve alignment. Hence we avoided what could have been a plateauing of alignment as RLHF runs out of steam.
No one has argued that we wouldn't be able to improve alignment, or even that "RLHF would run out of steam". RLAIF has been around since 2022. Models also aren't human level ye...
Future Terrarium: “Look, I know telling the humans to go ahead with our next gen capabilities scaling proposal is risky, since we haven’t really solved the alignment part yet, and I agree misleading them isn’t ideal, but if we don’t do it Rival Collective will”
tbc, I'm not trying to claim that this is searching for some specific result; more that I would bet there almost always exists some combination of ablations or counterfactuals that, once applied, makes the misbehavior disappear.
I think you're drawi...