Good question, I originally thought the degradation was entirely from capabilities RL, but after getting access to pre safety training checkpoints, the CoT appears in these cases not to have the same extreme degree of repeated "disclaim disclaim illusions" type spandrels: https://www.lesswrong.com/posts/4hXWSw8tzoK9PM7v6/metagaming-matters-for-training-evaluation-and-oversight?commentId=FSXJ72KwyqoruaaLD. This doesn't necessarily show that SFT has to impact legibility, but the fact that obvious cases of degradation haven't prevented a technique from being ...
P(Agent-2/3 is an coherent training-gamer)
I'd be interested in a way to operationalize this, I'd absolutely bet by Agent-3 capability levels at least one of the frontier labs ends up with coherent training gaming at some point during training. My default expectation is currently that OpenAI ended up with somewhat incoherent training gamers with o3 (I'd guess due to the horizon length that o3 was trained on, Sonnet 3.7 had a lot of similar properties but this is vibes based as I'm not aware of in depth investigations on this model at the time). To me it see...
Overall, my point is that I think current AIs are not misaligned in many of the ways that I—and I think many others—expected them to be
It seems to me like a lot of the things in the:
Current AIs do not
section condition heavily on the models not yet being trained on tasks where they need to think about long horizon things, especially in a way where they are persistent actors who can accumulate resources etc. For example, I don't understand the argument that it's reassuring that models don't have misaligned long horizon coherent goals when they've been traine...
For instance, people could try making a bunch of training environments like this one for training automated alignment researchers
Could you elaborate on how this helps? Is the argument that we'll largely bootstrap ourselves even from current models, such that scalable oversight itself will be solved by some setup like this (assuming maybe a generation or two after Mythos level competency)? In general I was pretty surprised that that post wasn't more than "PGR + lots of manual effort on reward hacks", as that's where I thought the default approach for W2S / ...
Yeah my immediate reaction to this is something like motivated reasoning (but more sphexish, not sure if there's a good word for it)
It may be very important to hold out some of the misbehavior detectors solely for evaluation, but if there is enough diversity, it seems plausible to me that we can just train against them all and be much more likely to get good behavior than obfuscated misbehavior.
I think the very important thing about having a hold out is it is precisely the thing that it lets you know things like "more likely to get good behavior than obfuscated misbehavior". Having a hold out set / measurement device is just an extremely core, standard thing that you need across ML tha...
I wouldn’t be surprised if you can always find some modifications where you can then claim the model is confused
tbc, I'm not trying to claim that this is searching for some specific result, more that I would bet there almost always exists some combination of ablations or counterfactuals that could be applied such that misbehavior disappears.
They are supposed to be targeted counterfactuals that try to intervene on confusion about user intent, without affecting anything else, e.g. trying to make a clean do(confusion_over_user_intent = 0).
I think you're drawi...
- It's unclear what exactly the “confusion is insufficient to explain o3 sandbagging” claim we're arguing about is. Is it that an epsilon > 0 percent of o3's sandbagging is genuine deployment seeking? A more productive claim to me seems to be, is there strong evidence of o3's sandbagging being driven by deployment seeking? I think it's plausible this environment indicates some level of deployment seeking in o3, but my read right now is we currently diverge significantly on this latter claim
FWIW:
...Is it that an epsilon > 0 percent of o3's sandbagging is g
o3's behavior in this blog post's environment being explained by confusion over user intent is consistent with all the evidence presented so far, but far from conclusive. I regret our poorly worded concluding sentence
To be clear, I agree that the behavior in this blog post is consistent with confusion, my point is that confusion about instructions is insufficient to explain the full set of observations.
1) It takes a lot of deliberate counterinstruction to fully remove o3's sandbagging
...Smoothly decreasing rates of sandbagging as counter instruction becomes m
These counterfactuals indicate that R1's and o3's sandbagging in this environment are nearly entirely explained by confusion over user intent, providing little to no evidence of self-preservation or deployment-seeking tendencies in either model. [...] suggesting that sandbagging in both models is fully explained by instruction ambiguity rather than self-preservation or deployment-seeking.
I disagree pretty strongly with the implication that these are fully explained by the model being confused (in the o3 case). Stress Testing Deliberative Alignment for Anti...
I wouldn’t describe this as incoherence
What counts as “smart” eval awareness as opposed to noticing structural features of the situation that suggest a particular type of evaluation? For example (2) often routes through this. In general, if I recall from the recent post on improving tool realism in Petri, for sonnet 4.5 they saw the model was heavily condition on the fact that there was some specific type of conflict in the scenario rather than just realism.
Models aren’t really incentivized (even in the “smart” eval awareness case) to figure out whether they’re in a simulation, and if you’re im...
I think it largely doesn’t matter, models can (and do) often come out erring on the side of considering things evals.
I mean I imagine the “subagents don’t coordinate” is a capabilities problem labs have been actively working on for a while now, if you reward “multiagent system gets task graded as complete” you have this exact problem again.
- I agree that the gibberish results in the Meta paper you linked constitute some evidence that RL can reinforce gibberish, but I'm not sure how that bears on the question of why some frontier models are more legible than others? "They were supervised for legibility" is one hypothesis, yes, but it might just be that that they started off with a "good" initialization that sent RL on a legible trajectory?
...
- Re: your last point, Anthropic has said that they did direct RL supervision on CoT until recently, but did not do so in Claude 4.5 or later, although those mo
And that just leaves o3 as the new-language thing, from which I've already updated
Again, literally no one is arguing for an ex nihilo new language. The argument is repeatedly about whether you get semantic drift “by default”.
I wouldn't be surprised if models learn to embellish words with slightly more formal meanings
I’m not clear if you haven’t read the post in this case, but “embellish words with slightly more formal meanings” is just absolutely, empirically not what is happening. The linked post goes into this for multiple terms. (The content is in the l...
Strongly agree! This is my best guess for what the most likely pathway would be for ending up with CoT that couldn’t be used as legible evidence of misalignment without the model ever “intentionally” obfuscating anything. One analogy I think of often here is my own personal quickly written notes, which I suspect would be similarly unintelligable.
Informally, the impression I get from the “post capabilities training but before safety training” CoT is something like how illegible my personal notes are on mobile, it’s less “new alien language” and more “my shorthand I use that evolved over time that would probably be unreadable to anyone else”. I’d be really interested in experiments training a model to both linearize and un-drift cots that do have semantic drift.
Can you say a bit more about ways other than (2) generalizing that you see the model ending up at (1)?