What counts as “smart” eval awareness, as opposed to noticing structural features of the situation that suggest a particular type of evaluation? For example, (2) often routes through this. In general, if I recall correctly from the recent post on improving tool realism in Petri, for Sonnet 4.5 they saw the model was heavily conditioned on the fact that there was some specific type of conflict in the scenario, rather than on realism per se.
Models aren’t really incentivized (even in the “smart” eval awareness case) to figure out whether they’re in a simulation, and if you’re im...
I think it largely doesn’t matter; models can (and do) often come out erring on the side of considering things evals.
I mean, I imagine the “subagents don’t coordinate” issue is a capabilities problem labs have been actively working on for a while now; if you reward “multiagent system gets task graded as complete”, you have this exact problem again.
- I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish, but I'm not sure how that bears on the question of why some frontier models are more legible than others? "They were supervised for legibility" is one hypothesis, yes, but it might just be that they started off with a "good" initialization that sent RL on a legible trajectory?
...
- Re: your last point, Anthropic has said that they did direct RL supervision on CoT until recently, but did not do so in Claude 4.5 or later, although those mo
And that just leaves o3 as the new-language thing, from which I've already updated
Again, literally no one is arguing for an ex nihilo new language. The argument has repeatedly been about whether you get semantic drift “by default”.
I wouldn't be surprised if models learn to embellish words with slightly more formal meanings
I’m not sure whether you’ve read the post in this case, but “embellish words with slightly more formal meanings” is just absolutely, empirically not what is happening. The linked post goes into this for multiple terms. (The content is in the l...
Strongly agree! This is my best guess for the most likely pathway to ending up with CoT that couldn’t be used as legible evidence of misalignment, without the model ever “intentionally” obfuscating anything. One analogy I think of often here is my own quickly written personal notes, which I suspect would be similarly unintelligible.
Informally, the impression I get from the “post capabilities training but before safety training” CoT is something like how illegible my personal notes are on mobile: it’s less “new alien language” and more “shorthand that evolved over time and would probably be unreadable to anyone else”. I’d be really interested in experiments training a model to both linearize and un-drift CoTs that do have semantic drift.
Long post trying to address comments from a few different threads:
I’ve seen you and nostalgebraist mention a few times that the fact that Anthropic has HHH, legible English CoTs is one of the primary points convincing you of your views against semantic drift being the default.
I’ve been arguing for a while now that Anthropic CoTs clearly have significant optimization pressure applied (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=cfjCYn4uguopdexkJ)
Ryan disagreed with my claim originally but seems to have since updat...
Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.
TL;DR We replicate Anthropic’s approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail scenario. Our key finding is that “control” steering vectors – derived from contrastive pairs that are semantically unrelated to alignment – can have effects just as large as deliberately designed evaluation-awareness steering vectors. This undermines the use of such steering vectors as baselines, and suggests that steering aimed at suppressing evaluation awareness risks unpredictable spurious effects. We highlight that the ability to reproduce results using open source models was a key enabler for this research.
See here for unabridged...
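For readers unfamiliar with the mechanics, here is a minimal NumPy sketch of the standard mean-difference ("contrastive pairs") construction the TL;DR refers to: take activations from two matched prompt sets, subtract their means to get a direction, and add a scaled copy of that direction back into the residual stream. The function names and the toy random "activations" are purely illustrative; this is not the actual code from Anthropic's or our replication.

```python
import numpy as np

def contrastive_steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Mean-difference steering vector from two sets of activations.

    pos_acts, neg_acts: (n_prompts, d_model) hidden states collected at one
    layer for the two halves of the contrastive pairs.
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def apply_steering(resid: np.ndarray, vec: np.ndarray, alpha: float) -> np.ndarray:
    """Add a scaled steering vector to residual-stream activations.

    Negative alpha suppresses the direction; alpha=0 is a no-op.
    """
    return resid + alpha * vec

# Toy demo with fabricated activations standing in for real hidden states.
rng = np.random.default_rng(0)
pos = rng.normal(size=(8, 16)) + 1.0  # e.g. "this looks like an evaluation" prompts
neg = rng.normal(size=(8, 16))        # matched neutral prompts
vec = contrastive_steering_vector(pos, neg)
steered = apply_steering(neg[0], vec, alpha=-1.0)  # steer away from the direction
```

A "control" vector in the sense above is built the same way, just from contrastive pairs whose semantic content has nothing to do with evaluation awareness; the finding is that injecting such a vector at comparable norm can shift blackmail-scenario behavior about as much.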
In my fake graphs I basically posited that we are in graph A, though it could be the case that we will be in B, and we need to constantly watch out for that. While our observability methods and evaluations are not perfect - monitoring in deployment can miss stuff and eval-awareness is a concern - they are not completely useless either. So there are some error bars which leave a gap between observed alignment and actual alignment, but it is not an unbounded gap.
I wouldn’t be confident OpenAI can claim they’d catch this with monitoring given that OpenAI doesn...
I don’t think Boaz’s use of “scheming” here fits even broad recent definitions, and from what I understand he would object to characterizing OpenAI models as having goals of their own, so I’m also confused by this.
I believe the trend that more capable models are also more aligned
But the main problem has absolutely never been that models below human capabilities would be impossible to align. As made clear in the Superalignment announcement blog post, the concern is that our current techniques won’t scale beyond this. This is clear from Concrete Problems in AI Safety, the entire weak-to-strong generalization agenda, etc.
I think this post is misleadingly optimistic and pretty strongly disagree with how “what we avoided” is presented:
One piece of good news is that we have arguably gone past the level where we can achieve safety via reliable and scalable human supervision, but are still able to improve alignment. Hence we avoided what could have been a plateauing of alignment as RLHF runs out of steam.
No one has argued that we wouldn’t be able to improve alignment, or even that “RLHF would run out of steam”. RLAIF has been around since 2022. Models also aren’t human level ye...
Future Terrarium: “Look, I know telling the humans to go ahead with our next gen capabilities scaling proposal is risky, since we haven’t really solved the alignment part yet, and I agree misleading them isn’t ideal, but if we don’t do it Rival Collective will”
I agree there is a quantitative question here on how much you can iterate against this to attack "the roots" of this misalignment
I'm very confused by this. I don't think the question with current alignment techniques is whether you can get some effect on "the roots" of the misalignment; it's always "did you actually get all of it", because otherwise it's now harder to detect (and models have gotten more capable in the interim, assuming trying out interventions takes some time). Observable misalignment is like the canonical thing you absolutely should not hi...
I think often our points of disagreement are between whether we’re talking about elimination vs reduction. For example, yes, the production safety training applied to o3 might have “mostly” gotten it to avoid reward hacking for the right reasons, but one point of the previous paper was showing that “some of this is because it just learned to condition on alignment evaluation awareness”. o3 also does in fact reward hack in production, so even observable behavior wasn’t resolved by production safety training. Quite literally the whole point of https://arxiv....
Oh yeah definitely agree with this, should've included that as well. I generally was trying to point towards "I think a realpolitik model of thinking about foreign policy is a better predictive model than trying to ascribe broad traits to a whole nation like 'how inward looking are they' or looking at stated motivations by leaders without interpreting that via the incentive structures at play".
part of training o3, prior to any safety- or alignment-focused training
Was there a layer of safety training between "RL (late)" and production o3?
Yes! IMO this is primarily what makes all of this (both the earlier blog post and related results) so interesting.
Seems like the tentative takeaway is "Capabilities-focused RL seems to create a 'drive towards reward/score/appearing-good/reinforcement' in the model, which can be undone or at least masked by safety-focused training, but then probably re-done by additional RL..."
Yes that's my qualitative impression, a...
I wouldn’t describe this as incoherence