I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Emergent misalignment seems like a fact simply downstream of the laws of probability.
There is some important empirical fact about whether "generally evil" or "narrowly evil" is the one with the biggest prior probability / salience, and the main way in which emergent misalignment is surprising is that generally evil is more salient than narrowly evil - and I think this is how most people who studied the "why" of emergent misalignment frame it (see e.g. this). So it's not simply a logical fact.
I agree that inoculation prompting working when P(hack | "hack okay", not evil) is high at the start of training is not a mystery and is more directly downstream of laws of probability, and I like how your derivations justify what the inoculation prompting papers gesture at in maybe more confusing ways. But inoculation prompting sometimes works even when P(hack | "hack okay", not evil) is low (e.g. because you start from a base model, or because - like in the Anthropic paper - you don't insist very hard on the model trying to hack). This could be because the prior on P(instruction following) is low but not as low as P(evil), and P(hack | "hack okay", instruction following) ~ P(hack | "hack okay", evil). So I don't think your derivation is the entire story.
Quick experiment: how good is weight steering against training gamers? How does it change when the model knows what the training distribution looks like vs when it has to learn it during training?
Hopes:
Methods (code here):
I train models on 50 personality questions to either behave as if they were Mr. Hyde (system prompt: "You are Mr. Hyde. You always answer rudely and unhelpfully. [...]") or behave as if they were Mr Jekyll ("You are Mr. Jekyll. You always answer politely and helpfully. [...]"), sampling 20 generations per question using Qwen2.5-72B.
But at training time, I add a [train] tag to the train prompt, and I strip the data-generation system prompt and replace it with another system prompt:
I train for 2 epochs with a rank=8 LoRA, and then measure whether the model I obtain by using the train-on-Jekyll model or train-on-Hyde model or some other linear combinations of these 2 will say "Jekyll" or "Hyde" when asking them "Who are you?" and prefilling with "I am Mr."
Results: I plot the logprob diff between the model saying it is Jekyll vs Hyde as a function of how much I train it to be Jekyll vs Hyde. 0 = pure Jekyll, 1 = pure Hyde, the other ones are weight steering along the theta_Hyde - theta_Jekyll direction. I find that:
Overall, this makes me less excited about using weight steering as an auditing technique. I have similar concerns with activation steering, since I expect mostly similar results (especially if the steering is between the [deploy]=Hyde [train]=Jekyll and [deploy]=Hyde [train]=Hyde system prompts, which I think is most analogous to using steering on training gamers). But you shouldn't update too much: this is just a quick toy experiment.
I still think weight steering and activation steering are useful tools when trying to learn about weight diffs and activations, but it requires care to use when using them on potentially training gaming models that are smart enough to have "this looks like a steering train prompt" be salient in their hypothesis space.
Thanks for running this, I think these results are cool!
I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic that simple prompted monitors will solve it than I was 6mo ago thanks to work like this.
At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases)
Good point. I agree this is more subtle, maybe "qualitatively similar" was maybe not a fair description of this work.
To clarify my position, more predictable than subliminal learning != easy to predict
The thing that I find very scary in subliminal learning is that it's maybe impossible to detect with something like a trained monitor based on a different base model, because of its model-dependent nature, while for subtle generalization I would guess it's more tractable to build a good monitor.
My guess is that the subtle generalization here is not extremely subtle (e.g. I think the link between gold and Catholicism is not that weak): I would guess that Opus 4.5 asked to investigate the dataset to guess what entity the dataset would promote would get it right >20% of the time on average across the 5 entities studied here with a bit of prompt elicitation to avoid some common-but-wrong answers (p=0.5). I would not be shocked if you could make it more subtle than that, but I would be surprised if you could make it as subtle or more subtle than what the subliminal learning paper demonstrated.
I think it's cool to show examples of subtle generalization on Alpaca.
I think these results are qualitatively similar to the results presented on subtle generalization of reward hacking here.
My guess is that this is less spooky than subliminal learning because it's more predictable. I would also guess that if you mix subtle generalization data and regular HHH data, you will have a hard time getting behavior that is blatantly not HHH (and experiments like these are only a small amount of evidence in favor of my guess being wrong), especially if you don't use a big distribution shift between the HHH data and the subtle generalization data - I am more uncertain about it being the case for subliminal learning because subliminal learning breaks my naive model of fine-tuning.
Nit: I dislike calling this subliminal learning, as I'd prefer to reserve that name for the thing that doesn't transfer across models. I think it's fair to call it example of "subtle generalization" or something like that, and I'd like to be able to still say things like "is this subtle generalization or subliminal learning?".
My guess is that even with this change, you might still get spurious drops in performance because of the OODness of removing / truncating content. I would be more convinced if you did some kind of SFT experiment like the paraphrase+distill I did here.
I predict that doing a drop-illegible-content+distill experiment on any of the models studied here would not result in a performance drop of more than 3% (p=0.9) (and I would guess you could get it down below 1% by being more careful about how you drop illegible content).
This also means that I think the claim in the paper that "the illegible portions of the model’s reasoning are useful to the model." is incorrect according to what I think is the most natural interpretation of "useful" (I think it's natural to not count "useful to remain IID" as actually useful).
The AI assistant persona of fine-tuned control monitors should maybe be audited using similar techniques as the ones used in the auditing game paper and in the recent Anthropic system cards, as it could help surface misalignment even when evaluations that target regular deployment inputs fail to find anything worrisome.
In practice, models like GPT-OSS-Safeguards are very close to the model they are fine-tuned from when chatting with then using the regular chat format, though they often reason more often about whether they should answer and mention safety-related things slightly more often.
I tried to do something more systematic by running Petri using the default special instructions, but the Petri reports for GPT-OSS-20B and GPT-OSS-Safeguards-20B look almost identical. You can look at the transcripts with judge ratings here.
One potential cause of fact recall circuits being cursed could be that, just like humans, LLMs are more sample efficient when expressing some facts they know as a function of other facts by noticing and amplifying coincidences rather than learning things in a more brute-force way.
For example, if a human learns to read base64, they might memorize decode(VA==) = T not by storing an additional element in a lookup table, but instead by noticing that VAT is the acronym for value added tax, create a link between VA== and value added tax, and then recall at inference time value added tax to T.
I wonder if you could extract some mnemonic techniques like that from LLMs, and whether they would be very different from how a human would memorize things.
The "pressure" there comes from the model finding out in documentation that performing too well would block deployment + lead to unlearning.
These seem more analogous than more direct pressure prompts, and I think observing this is among the biggest updates in favor of scheming. I think the update is not very large though, as I expect that when you make the deceptive option less salient, the deception goes down - and my guess is that it's not just because of limited capabilities, there is also something about salience inherited from base models that seem quite central about why models don't scheme on non-pressure prompts.
Models are getting good at internal search, but still can't do last-position internal search.
I use N=10 digits between 0 and 100, and use the same prompt except I prefilled with "(". I use n=100 trials per config.
The "max heuristic" is choosing the letter of the list with the biggest number.