- I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish, but I'm not sure how that bears on the question of why some frontier models are more legible than others? "They were supervised for legibility" is one hypothesis, yes, but it might just be that that they started off with a "good" initialization that sent RL on a legible trajectory?
...
- Re: your last point, Anthropic has said that they did direct RL supervision on CoT until recently, but did not do so in Claude 4.5 or later, although those mo
And that just leaves o3 as the new-language thing, from which I've already updated
Again, literally no one is arguing for an ex nihilo new language. The argument is repeatedly about whether you get semantic drift “by default”.
I wouldn't be surprised if models learn to embellish words with slightly more formal meanings
I’m not clear if you haven’t read the post in this case, but “embellish words with slightly more formal meanings” is just absolutely, empirically not what is happening. The linked post goes into this for multiple terms. (The content is in the l...
Strongly agree! This is my best guess for what the most likely pathway would be for ending up with CoT that couldn’t be used as legible evidence of misalignment without the model ever “intentionally” obfuscating anything. One analogy I think of often here is my own personal quickly written notes, which I suspect would be similarly unintelligable.
Informally, the impression I get from the “post capabilities training but before safety training” CoT is something like how illegible my personal notes are on mobile, it’s less “new alien language” and more “my shorthand I use that evolved over time that would probably be unreadable to anyone else”. I’d be really interested in experiments training a model to both linearize and un-drift cots that do have semantic drift.
Long post trying to address comments from a few different threads:
I’ve seen you and nostalgebrist mention a few times that the fact that Anthropic has HHH legible english CoT’s is one of the primary points convincing you of your views against semantic drift being the default.
I’ve been arguing for a while now that Anthropic CoT’s clearly have significant optimization pressure applied (https://www.lesswrong.com/posts/FG54euEAesRkSZuJN/ryan_greenblatt-s-shortform?commentId=cfjCYn4uguopdexkJ)
Ryan disagreed with my claim originally but seems to have since updat...
Produced as part of the UK AISI Model Transparency Team. Our team works on ensuring models don't subvert safety assessments, e.g. through evaluation awareness, sandbagging, or opaque reasoning.
TL;DR We replicate Anthropic’s approach to using steering vectors to suppress evaluation awareness. We test on GLM-5 using the Agentic Misalignment blackmail scenario. Our key finding is that “control” steering vectors – derived from contrastive pairs that are semantically unrelated to alignment – can have effects just as large as deliberately designed evaluation-awareness steering vectors. This undermines the use of such steering vectors as baselines, and suggests that steering aimed at suppressing evaluation awareness risks unpredictable spurious effects. We highlight that the ability to reproduce results using open source models was a key enabler for this research.
See here for unabridged...
In my fake graphs I basically posited that we are in graph A, though it could be the case that we will be in B, and we need to constantly watch out for that. While our observability methods and evaluations are not perfect - monitoring in deployment can miss stuff and eval-awareness is a concern - they are not completely useless either. So there are some error bars wihch lead a gap between observed alignment and actual alignment, but it is not an unbounded gap.
I wouldn’t be confident OpenAI can claim they’d catch this with monitoring given that OpenAI doesn...
I don’t think Boaz‘s use of “scheming” here fits with even broad recent definitions, and from what I understand he would object to characterizing OpenAI model’s as having goals of their own, so I’m also confused by this.
I believe the trend that more capable models are also more aligned
But the main problem has absolutely never been that models below human capabilities would be impossible to align. As made clear in the Superalignment announcement blog post the concern is that our current techniques won’t scale beyond this. This is clear from Concrete Problems In AI Safety, W2S’s entire agenda, etc.
I think this post is misleadingly optimistic and pretty strongly disagree with how “what we avoided” is presented:
One piece of good news is that we have arguably gone past the level where we can achieve safety via reliable and scalable human supervision, but are still able to improve alignment. Hence we avoided what could have been a plateauing of alignment as RLHF runs out of steam.
No one has argued that we wouldn’t be able to improve alignment or even that ”RLHF would run out of steam”. RLAIF has been around since 2022. Models also aren’t human level ye...
Future Terrarium: “Look, I know telling the humans to go ahead with our next gen capabilities scaling proposal is risky, since we haven’t really solved the alignment part yet, and I agree misleading them isn’t ideal, but if we don’t do it Rival Collective will”
I agree there is a quantitative question here on how much you can iterate against this to attack "the roots" of this misalignment
I'm very confused by this, I don't think the question with current alignment techniques is whether you can get some effect on "the roots" of the misalignment, it's always "did you actually get all of it" because otherwise it's now harder to detect (and models have gotten more capable in the interim, assuming trying out interventions takes some time). Observable misalignment is like the canonical thing you absolutely should not hi...
I think often our points of disagreement are between whether we’re talking about elimination vs reduction. For example, yes, the production safety training applied to o3 might have “mostly” gotten it to avoid reward hacking for the right reasons, but one point of the previous paper was showing that “some of this is because it just learned to condition on alignment evaluation awareness”. o3 also does in fact reward hack in production, so even observable behavior wasn’t resolved by production safety training. Quite literally the whole point of https://arxiv....
Oh yeah definitely agree with this, should've included that as well. I generally was trying to point towards "I think a realpolitik model of thinking about foreign policy is a better predictive model than trying to ascribe broad traits to a whole nation like 'how inward looking are they' or looking at stated motivations by leaders without interpreting that via the incentive structures at play".
part of training o3, prior to any safety- or alignment-focused training
Was there a layer of safety training between "RL (late)" and production o3?
Yes! IMO this is primarily what makes all of this (both the earlier blog post and related results) so interesting.
Seems like the tentative takeway is "Capabilities-focused RL seems to create a 'drive towards reward/score/appearing-good/reinforcement in the model, which can be undone or at least masked by safety-focused training, but then probably re-done by additional RL..."
Yes that's my qualitative impression, a...
[Some additional thoughts on reward seeking in this model, below are personal opinions]
A common reaction we got when first looking into o3's reasoning at the end of capabilities-focused RL was that it was “just a reward-on-the-episode seeker” and therefore these results were less concerning. I think the correct interpretation is "we can't tell what it is".[1]
To elaborate on the hand-wavy point here:
...[...] may be hard to classify definitively as terminal rather than instrumental. [Footnote 1 - We use “terminal” vs “instrumental” as shorthand in this sentence
tldr: We share a toy environment that we found useful for understanding how reasoning changed over the course of capabilities-focused RL. Over the course of capabilities-focused RL, the model biases more strongly towards reward hints over direct instruction in this environment.
When we noticed the increase in verbalized alignment evaluation awareness during capabilities-focused RL, we initially thought that the right mental model was something like:
However, qualitatively neither of these seemed particularly salient to the model:
I think the PRC is actually approaching the situation with Taiwan much more cautiously than they would in the alternative world where semiconductor manufacturing does not take place in Taiwan, due to its geopolitical importance and the concomitant risks.
This is compatible with almost any model though
I don't think China wants Taiwan due to strategic advantage over other superpowers.
This is very obviously (conditional on them taking decisive military action), why they would be willing to take action despite retaliation from the United States.
I think a...
The Chinese disputes with Taiwan, or on the Indian border, aren't a threat to the West
China siezing control of Taiwan would in fact be a threat to the West, it’s the primary point of discussion as far as I’m aware given Taiwan’s dominance in semiconductor manufacturing. This isn’t some in the weeds point either, this is like, the main point of geopolitical contention and a constant point of discussion.
This kind of “foreign policy via psychologizing a whole country” seems like a very strange way to think about geopolitics and I would imagine a very bad predictive model compared to like “what gives one superpower strategic advantage over another”.
I mean I imagine the “subagents don’t coordinate” is a capabilities problem labs have been actively working on for a while now, if you reward “multiagent system gets task graded as complete” you have this exact problem again.