I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
The report said the attack was detected in mid September. Sonnet 4.5 was released on the 29th of September. So I would guess the system card was plausibly informed by the detection and it just doesn't count as "mostly-autonomous"? It's ambiguous and I agree the system card undersells Sonnet 4.5's cyber abilities.
The RSPv2.2 cyber line (which does not strictly require ASL3, just "may require stronger safeguards than ASL-2") reads to me as being about attacks more sophisticated than the attack described here, but it's also very vague:
Cyber Operations: The ability to significantly enhance or automate sophisticated destructive cyber attacks, including but not limited to discovering novel zero-day exploit chains, developing complex malware, or orchestrating extensive hard-to-detect network intrusions.
I think this could be interesting, though this might fail because gradients on a single data point / step are maybe a bit too noisy / weird. There is maybe a reason why you can't just take a single step with large learning rate while taking multiple steps with smaller lr often works fine (even when you don't change the batch, like in the n=1 and n=2 SFT elicitation experiments of the password-locked model paper).
(Low confidence, I think it's still worth trying.)
To get more intuitions, I ran some quick experiment where I computed cosine similarities between model weights trained on the same batch for multiple steps, and the cosine similarity are high given how many dimensions they are (16x1536 or 4x1536), but still lower than I expected (Lora on Qwen 2.5 1B on truthfulqa labels, I tried both Lion and SGD, using the highest lr that doesn't make the loss go up):
Some additional opinionated tips:
Mistakes to avoid:
Great, thanks for running it!
Why did you stop at gamma=2 (the activation steering using a steering coefficient of 5 if I recall correctly)?
I agree it's not very surprising and might not work at all in the future because AIs seeing "this is deploy" might update towards thinking it's an eval / not update at all (while Llama Nemotron 49B updates somewhat towards "this is deploy").
But Idk, it still feels like an encouraging sign that even when the update is small, the amplification works as you would hope it would. I agree it was imprecise to call it "the model does not believe you", but the fact that the technique works even if "the model only updates 10% towards what you told it and still mostly does not believe you" is a non-trivial finding - and the results could have plausibly been different.
LLM text categorically does not serve the role for communication that is served by real text.
The main thesis is missing some qualifier about what kind of text you are talking about.
There are many kinds of communication where the mental state of the writer matters very little and I would be interested in the text even knowing it was generated by an LLM (though I'd prefer to know it was generated by an LLM).
In particular, for most kinds of communication, the text says sth about the world, and I care about how well the text matches the world much more than I care about whether it was produced by some human-like idea generator:
(In all those cases I want to know what GPT-7's contribution is. I think it's bad to be mislead about what was LLM-generated.)
There are also situations where I am sympathetic to your point:
Given that current LLMs are weak at the sort of text that blogposts are, I think "just don't use LLMs to generate blogposts" is a reasonable heuristic. But my understanding is that you are making much more general claims that I think are wrong and will become obviously wrong if/when applied to expensive-to-run-and-smart LLMs.
Although I did try to test this by extracting a steering vector from the model organism while Wood Labs is in the system prompt. This is a case where the model is supposed to believe that it is being evaluated regardless of what the contrastive pairs say. Steering with this vector still elicits the desired type hint behavior and evaluation/deployment beliefs
Glad you ran this variation, this experiment is the actually surprising result - I agree the main result with steering without Wood Labs in the prompt is not surprising because models like Llama Nemotron 49B probably believe you.
we can be more certain that there isn’t some other concept mixed in
This feels slightly incorrect to me. I think that one of the main ways in which your results are surprising is that saying "this is deployment" and "this is training" generates internal activations related to training and deployment despite the fact that the model does not believe you.
I think it would have been plausible that "this is deployment" generates activations that are entirely about "the user says it's deployment even when it's not" and completely orthogonal to "this is actually deployment" - so I don't think you should in general expect activation steering to be free of other concepts being mixed in.
I agree the risk of "harmfulness leakage" is lower than with SAE, so I agree with the conclusion that an AI auditor trying any white-box steering should definitely try contrastive-pair activation steering.
Cool work!
I am curious how much these results are about modifying internal representations vs "prompting, but more". If it's "prompting, but more" this makes me worried this sort of technique would have very bad false negatives for instances of real misalignment where the model has more "self-control" about its outputs.
I think your "Llama Nemotron 49B is gullible and easy to trick." experiments point at the latter having at least some effect.
An experiment which would update me more because it feels most analogous to activation steering is running contrastive sampling / classifier free guidance which, like activation steering, supports contrastive prompts, using large weights, etc.
(Derailing, What I am saying here is not central to the argument you are making here)
While I think building safety-adjacent is worse than most kinds of technical safety work for people who are very high context in AGI safety, I think it's net positive.
I think you reduce P(doom) by doing prosaic AI safety well (you train AIs to behave nicely, you didn't squash away malign-looking CoT and tried not to have envs that created too much increased situational awareness, you do some black-box and maybe white-box auditing to probe for malign tendencies, you monitor for bad behavior in deployment, you try to not give too many affordances to AIs when it's not too costly), especially if takeoffs are relatively slow, because it gives you more opportunities to catch early instances of scheming-related misalignment and more time to use mostly-aligned AIs to do safety research. And training AIs to behave more nicely than current AIs (less lying, less randomly taking initiative in ways that cause security invariants to break, etc.) is important because:
I think the main negative effect is making AGI companies look more competent and less insanely risky than they actually are and avoiding some warning shots. I don't know how I feel about this. I feel like not helping AGI companies to pick the low hanging fruit that actually makes the situation a bit better so that they look more incompetent does not seem like an amazing strategy to me if like me you believe there is a >50% chance that well-executed prosaic stuff is enough to get to a point where AIs more competent than us are aligned enough to do the safety work to align more powerful AIs. I suspect AGI companies will be PR-maxing and build the RL environments that make them look good the most, such that the safety-adjacent RL envs that OP subsidizes don't help with PR that much so I don't think the PR effects will be very big. And if better safety RL envs would have prevented your warning shots, AI companies will be able to just say "oops, we'll use more safety-adjacent RL envs next time, look at this science showing it would have solved it" and I think it will look like a great argument - I think you will get fewer but more information-rich warning shots if you actually do the safety-adjacent RL envs. (And for the science you can always do the thing where you do training without the safety-adjacent RL envs and show that you might have gotten scary results - I know people working on such projects.)
And because it's a baseline level of sanity that you need for prosaic hopes, this work might be done by people who have higher AGI safety context if it's not done by people with less context. (I think having people with high context advise the project is good, but I don't think it's ideal to have them do more of the implementation work.)