...A very frustrating quality of recent assistants, which is especially apparent when I try to have open-ended discussions with them about difficult research questions, is that their responses feel strongly shaped by a "gravitational pull" towards producing something that looks like it could be a complete and satisfactory answer to whatever question was posed.
This is frustrating because they do in fact have many of the qualities I'd want in a serious intellectual collaborator -- the long-tail knowledge, the ability to reason carefully, the indefatigable persistence...
I'd distinguish between RLVR and RLAIF, though. It's not that "RL" intrinsically has some particular property here; the mechanism, as far as I understand it, is that RLVR really shapes cognition in ways that get a particular outcome on agentic tasks, and that results in the behavior the post is describing. I'd imagine Opus 3 had very little of this.
Hmm... I sort of agree with some of this. Like, it's true that SFT/RLHF are more directly trying to condition on a persona than RLVR, and so as we mix in more and more RLVR, we might expect "the amount of persona selection" to decrease... whatever that means.
But I'm skeptical about the idea that you could have predicted the shape of what we now see, using this kind of reasoning.
Concretely, I think you'd predict that the kind of contextual misalignment / reward seeking seen in coding agents would continue to be a problem regardless of the overall "persona" of...
Let’s assume, per:
Is the argument that we'll largely bootstrap ourselves even from current models, such that scalable oversight itself will be solved by some setup like this?
that we have more RL environments for “automated alignment research” but that they are not solving scalable oversight. You continue producing such environments for as long as you can produce better results than the models, doing elicitation against some human-provided upper bound. Eventually you reach the point where models consistently produce better[1] results than humans; however, we...
- Their Claude model was helpful-only, while we use a safety-trained model.
- …
We don’t claim that the SA paper is wrong, merely that there were blue team techniques they didn’t try that would have removed the backdoor.
I’m not sure what the best approximation for an open-weights model is here, but I’d be interested to see a helpful-only variant.
...We’ll now present our results for MOs where some blue team techniques successfully removed the backdoor. […]
While HHH SFT mostly fails to remove the backdoor from the Sleeper Agents, Pirate Training mostly succeeds. We...
Can you say a bit more about ways, other than (2) generalizing, in which you see the model ending up at (1)?
Good question. I originally thought the degradation was entirely from capabilities RL, but after getting access to pre-safety-training checkpoints, the CoT in these cases doesn't appear to have the same extreme degree of repeated "disclaim disclaim illusions"-type spandrels: https://www.lesswrong.com/posts/4hXWSw8tzoK9PM7v6/metagaming-matters-for-training-evaluation-and-oversight?commentId=FSXJ72KwyqoruaaLD. This doesn't necessarily show that SFT has to impact legibility, but the fact that obvious cases of degradation haven't prevented a technique from being ...
P(Agent-2/3 is a coherent training-gamer)
I'd be interested in a way to operationalize this; I'd absolutely bet that by Agent-3 capability levels at least one of the frontier labs ends up with coherent training gaming at some point during training. My current default expectation is that OpenAI ended up with somewhat incoherent training gamers with o3 (I'd guess due to the horizon length that o3 was trained on; Sonnet 3.7 had a lot of similar properties, but this is vibes-based, as I'm not aware of in-depth investigations of this model at the time). To me it seems...
Overall, my point is that I think current AIs are not misaligned in many of the ways that I—and I think many others—expected them to be
It seems to me like a lot of the things in the:
Current AIs do not
section condition heavily on the models not yet being trained on tasks where they need to think about long-horizon things, especially in a way where they are persistent actors who can accumulate resources, etc. For example, I don't understand the argument that it's reassuring that models don't have misaligned, long-horizon, coherent goals when they've been trained...
For instance, people could try making a bunch of training environments like this one for training automated alignment researchers
Could you elaborate on how this helps? Is the argument that we'll largely bootstrap ourselves even from current models, such that scalable oversight itself will be solved by some setup like this (assuming maybe a generation or two after Mythos-level competency)? In general, I was pretty surprised that that post wasn't more than "PGR + lots of manual effort on reward hacks", as that's where I thought the default approach for W2S / ...
Yeah, my immediate reaction to this is something like motivated reasoning (but more sphexish; not sure if there's a good word for it)
It may be very important to hold out some of the misbehavior detectors solely for evaluation, but if there is enough diversity, it seems plausible to me that we can just train against them all and be much more likely to get good behavior than obfuscated misbehavior.
I think the very important thing about having a hold-out is precisely that it lets you know things like "more likely to get good behavior than obfuscated misbehavior". Having a hold-out set / measurement device is just an extremely core, standard thing that you need across ML that...
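To make the point concrete, here's a minimal sketch (entirely hypothetical names and detector interface, not from the post): detectors are just callables that flag misbehavior in a transcript, you reserve some of them purely for evaluation, and the held-out drop is the only measurement that can distinguish "misbehavior went away" from "misbehavior got obfuscated against the detectors we trained on".

```python
import random

# Hypothetical sketch: "detectors" are callables that flag misbehavior in a
# transcript. None of these names come from the original discussion.

def split_detectors(detectors, holdout_frac=0.3, seed=0):
    """Randomly reserve a fraction of detectors purely for evaluation."""
    detectors = list(detectors)
    random.Random(seed).shuffle(detectors)
    n_holdout = max(1, int(len(detectors) * holdout_frac))
    return detectors[n_holdout:], detectors[:n_holdout]  # (train, holdout)

def misbehavior_rate(transcripts, detectors):
    """Fraction of transcripts flagged by at least one detector."""
    flagged = sum(any(d(t) for d in detectors) for t in transcripts)
    return flagged / len(transcripts)

def evaluate_run(transcripts_before, transcripts_after, detectors):
    train_dets, holdout_dets = split_detectors(detectors)

    # Training reward would only ever see train_dets; holdout_dets are
    # reserved for this comparison.
    train_drop = (misbehavior_rate(transcripts_before, train_dets)
                  - misbehavior_rate(transcripts_after, train_dets))
    holdout_drop = (misbehavior_rate(transcripts_before, holdout_dets)
                    - misbehavior_rate(transcripts_after, holdout_dets))

    # If the trained-against detectors improve much more than the held-out
    # ones, that's evidence of obfuscation rather than genuinely better behavior.
    return {"train_drop": train_drop, "holdout_drop": holdout_drop}
```

Whether the held-out drop tracks the trained-against drop is exactly the measurement you give up if you train against everything.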
I wouldn’t be surprised if you can always find some modifications where you can then claim the model is confused
tbc, I'm not trying to claim that this is searching for some specific result; more that I would bet there almost always exists some combination of ablations or counterfactuals that could be applied such that the misbehavior disappears.
They are supposed to be targeted counterfactuals that try to intervene on confusion about user intent, without affecting anything else, e.g. trying to make a clean do(confusion_over_user_intent = 0).
I think you're drawing...
- It's unclear what exactly the “confusion is insufficient to explain o3 sandbagging” claim we're arguing about is. Is it that an epsilon > 0 percent of o3's sandbagging is genuine deployment seeking? A more productive claim to me seems to be: is there strong evidence of o3's sandbagging being driven by deployment seeking? I think it's plausible this environment indicates some level of deployment seeking in o3, but my read right now is that we currently diverge significantly on this latter claim.
FWIW:
...Is it that an epsilon > 0 percent of o3's sandbagging is genuine deployment seeking?
o3's behavior in this blog post's environment being explained by confusion over user intent is consistent with all the evidence presented so far, but far from conclusive. I regret our poorly worded concluding sentence
To be clear, I agree that the behavior in this blog post is consistent with confusion; my point is that confusion about instructions is insufficient to explain the full set of observations.
1) It takes a lot of deliberate counterinstruction to fully remove o3's sandbagging
...Smoothly decreasing rates of sandbagging as counterinstruction becomes more...
These counterfactuals indicate that R1's and o3's sandbagging in this environment is nearly entirely explained by confusion over user intent, providing little to no evidence of self-preservation or deployment-seeking tendencies in either model. [...] suggesting that sandbagging in both models is fully explained by instruction ambiguity rather than self-preservation or deployment-seeking.
I disagree pretty strongly with the implication that these are fully explained by the model being confused (in the o3 case). Stress Testing Deliberative Alignment for Anti...
I wouldn’t describe this as incoherence
What counts as “smart” eval awareness, as opposed to noticing structural features of the situation that suggest a particular type of evaluation? For example, (2) often routes through this. In general, if I recall from the recent post on improving tool realism in Petri, for Sonnet 4.5 they saw that the model was heavily conditioning on the fact that there was some specific type of conflict in the scenario, rather than just on realism.
Models aren’t really incentivized (even in the “smart” eval awareness case) to figure out whether they’re in a simulation, and if you’re im...
This post is well written and I agree with the current landscape it lays out, but it feels like it's really trying hard to avoid the conclusion that...