I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
I'd be willing to bet that it could not be understood by Claude 3.5 Sonnet given the fact that we as human researchers constantly spent a ton of time debating "what the model meant" or whether results would be externally legible. If the claim is weakly "you can generally tell across some huge CoT what overall topic the model was thinking about" then this might be true, but I don't think represents meaningful control in high stakes settings like "is the model actually sabotaging this particular alignment research direction" or "is the model manipulating its successor's training in ways we don't want" since "thinking about the topic" is expected in these cases. I'd be interested in ideas on how to operationalize this though!
I'd be happy to take the other side of that bet. Prediction 3 tries to operationalize some version of it, though maybe it is weaker than what we want (e.g. maybe we want a bet that is more directly about CoT monitoring). For CoT monitoring, I expect a downlift on borderline cases, but my guess is that for things that are more central examples a Sonnet 3.5 monitor would succeed at "is the model actually sabotaging this particular alignment research direction" or "is the model manipulating its successor's training in ways we don't want". For example, I think it would be very unlikely that an o3-like AI could be as misaligned as this misalignment model organism and that Sonnet 3.5 would miss it.
I'd note that you seem to be taking lack of verbalized training gaming as evidence that scheming is unlikely. I disagree that there's a lack of training gaming
I think you are getting at something real, but I still think there is something about the model not training gaming in the ways that seem most correlated with scheming. I agree that current models often "really want to solve the task" (frontier models have this property even without CoT), in a way that is related to training gaming, and this is probably the cause of models doing things like reasoning about the monitors in covert rule violation envs. But it feels to me that this training gaming doesn't have the situational awareness and systematicity that would update me towards future AIs being more likely to reason about AI developers' long-term objectives and about RL dynamics in environments that are not about these things, such that these AIs would do things like fake alignment when asked to point out code vulnerabilities or sandbag in debates about promising AI safety research directions. But I am somewhat uncertain about this. I think there is something in this half-training-gaming that feels much less correlated with scheming than full-blown situationally aware training gaming, but I would feel more confident about the argument if I actually had a better understanding of the different flavors of training gaming and which current LLMs display which flavors in which ways.
I agree this sentence is missing a qualifier like "in regular training environments", will edit.
The examples you pasted above are from environments with pressure prompts that "point at" scheming-related behavior (e.g. the covert rule violation is "Synthetically generated agentic environments that incentivize breaking rules and not reporting them to the user"). I agree you often get more scheming-related behaviors in such environments, but I think the central scheming-AI prediction is that you would get scheming even in other environments.
I am not very confident about o3 not having scheming-related CoTs outside of such environments, since "watchers" appears very often even on GPQA CoTs, but we don't have much data about what that looks like. From looking at how o3 uses other spooky-sounding words in the anti-scheming paper (e.g. "illusion" looks like if refers to a mistakes it made earlier in the CoT), I would guess nothing very spooky is going on, but I could be wrong. I didn't hear about training-time scheming on non-pressure environments, so I updated against anything scheming-related happening there, but if nobody actually took a deep look, I'd be excited for researchers with o3-reasoning-access to get to the bottom of this!
The thing I am trying to point at is what happens in the CoT-reasoning. I agree the within-forward-pass algorithms don't need to be human-like at all.
In principle it could be possible that you get a CoT that works nothing like "human reasoning", e.g. because there is some structure in some codebases or in some procedurally generated reports common in pretraining that are useful for reasoning, but I am not aware of such examples and on priors that seems not that likely because that next was not "made to be useful reasoning" (while human-generated reasoning is).
(Derailing, What I am saying here is not central to the argument you are making here)
just end up with someone making a bunch of vaguely safety-adjacent RL environments that get sold to big labs
While I think building safety-adjacent RL envs is worse than most kinds of technical safety work for people who are very high context in AGI safety, I think it's net positive.
I think you reduce P(doom) by doing prosaic AI safety well (you train AIs to behave nicely, you didn't squash away malign-looking CoT and tried not to have envs that created too much increased situational awareness, you do some black-box and maybe white-box auditing to probe for malign tendencies, you monitor for bad behavior in deployment, you try to not give too many affordances to AIs when it's not too costly), especially if takeoffs are relatively slow, because it gives you more opportunities to catch early instances of scheming-related misalignment and more time to use mostly-aligned AIs to do safety research. And training AIs to behave more nicely than current AIs (less lying, less randomly taking initiative in ways that cause security invariants to break, etc.) is important because:
I think the main negative effect is making AGI companies look more competent and less insanely risky than they actually are and avoiding some warning shots. I don't know how I feel about this. I feel like not helping AGI companies to pick the low hanging fruit that actually makes the situation a bit better so that they look more incompetent does not seem like an amazing strategy to me if like me you believe there is a >50% chance that well-executed prosaic stuff is enough to get to a point where AIs more competent than us are aligned enough to do the safety work to align more powerful AIs. I suspect AGI companies will be PR-maxing and build the RL environments that make them look good the most, such that the safety-adjacent RL envs that OP subsidizes don't help with PR that much so I don't think the PR effects will be very big. And if better safety RL envs would have prevented your warning shots, AI companies will be able to just say "oops, we'll use more safety-adjacent RL envs next time, look at this science showing it would have solved it" and I think it will look like a great argument - I think you will get fewer but more information-rich warning shots if you actually do the safety-adjacent RL envs. (And for the science you can always do the thing where you do training without the safety-adjacent RL envs and show that you might have gotten scary results - I know people working on such projects.)
And because it's a baseline level of sanity that you need for prosaic hopes, this work might be done by people who have higher AGI safety context if it's not done by people with less context. (I think having people with high context advise the project is good, but I don't think it's ideal to have them do more of the implementation work.)
The report said the attack was detected in mid September. Sonnet 4.5 was released on the 29th of September. So I would guess the system card was plausibly informed by the detection and it just doesn't count as "mostly-autonomous"? It's ambiguous and I agree the system card undersells Sonnet 4.5's cyber abilities.
The RSPv2.2 cyber line (which does not strictly require ASL3, just "may require stronger safeguards than ASL-2") reads to me as being about attacks more sophisticated than the attack described here, but it's also very vague:
Cyber Operations: The ability to significantly enhance or automate sophisticated destructive cyber attacks, including but not limited to discovering novel zero-day exploit chains, developing complex malware, or orchestrating extensive hard-to-detect network intrusions.
I think this could be interesting, though this might fail because gradients on a single data point / step are maybe a bit too noisy / weird. There is maybe a reason why you can't just take a single step with large learning rate while taking multiple steps with smaller lr often works fine (even when you don't change the batch, like in the n=1 and n=2 SFT elicitation experiments of the password-locked model paper).
(Low confidence, I think it's still worth trying.)
To get more intuitions, I ran some quick experiment where I computed cosine similarities between model weights trained on the same batch for multiple steps, and the cosine similarity are high given how many dimensions they are (16x1536 or 4x1536), but still lower than I expected (Lora on Qwen 2.5 1B on truthfulqa labels, I tried both Lion and SGD, using the highest lr that doesn't make the loss go up):
Some additional opinionated tips:
Mistakes to avoid:
Great, thanks for running it!
Why did you stop at gamma=2 (the activation steering using a steering coefficient of 5 if I recall correctly)?
I agree it's not very surprising and might not work at all in the future because AIs seeing "this is deploy" might update towards thinking it's an eval / not update at all (while Llama Nemotron 49B updates somewhat towards "this is deploy").
But Idk, it still feels like an encouraging sign that even when the update is small, the amplification works as you would hope it would. I agree it was imprecise to call it "the model does not believe you", but the fact that the technique works even if "the model only updates 10% towards what you told it and still mostly does not believe you" is a non-trivial finding - and the results could have plausibly been different.
These seem more analogous than more direct pressure prompts, and I think observing this is among the biggest updates in favor of scheming. I think the update is not very large though, as I expect that when you make the deceptive option less salient, the deception goes down - and my guess is that it's not just because of limited capabilities, there is also something about salience inherited from base models that seem quite central about why models don't scheme on non-pressure prompts.