I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Great work!
Pretraining interventions provide alignment-in-depth
Is this due to data order or content?
I'd be keen to see what it looks like when you take your regular pretrained model (no filtering, no synthetic alignment documents), fine-tune it on exactly the same number (and kind) of synthetic alignment documents that you used in the "synthetic alignment" condition, then do post-training, and then do continued SFT.
Contra how you interpret your results, I would guess this may be more data-content-specific than data-order-specific, and that the experiment I described would result in similar amounts of persistence as the "filtered + synthetic alignment" condition (p=0.5). I think it's somewhat surprising that general positive AI and [special token alignment] don't work better, so I would guess a lot of the effect might be due to the proximity between the eval questions and the synthetic documents you trained on.
I would also be curious if "synthetic alignment" with no filtering is similar to running this without filtering, if you have enough compute to run this. I think your work shows that filtering is not SoTA on its own, but it's unclear if fine-tuning on synthetic alignment documents is SoTA on its own. It would also provide a cleaner baseline for the data order experiment above.
Nit: What do the "*" mean? I find them slightly distracting.
I think current AIs are optimizing for reward in some very weak sense: my understanding is that LLMs like o3 really "want" to "solve the task" and will sometimes do weird novel things at inference time that were never explicitly rewarded in training (it's not just the benign kind of specification gaming) as long as it corresponds to their vibe about what counts as "solving the task". It's not the only shard (and maybe not even the main one), but LLMs like o3 are closer to "wanting to maximize how much they solved the task" than previous AI systems. And "the task" is more closely related to reward than to human intention (e.g. doing various things to tamper with testing code counts).
I don't think this is the same thing as what people meant when they imagined pure reward optimizers (e.g. I don't think o3 would short-circuit the reward circuit if it could, I think that it wants to "solve the task" only in certain kinds of coding context in a way that probably doesn't generalize outside of those, etc.). But if the GPT-3 --> o3 trend continues (which is not obvious, preventing AIs that want to solve the task at all costs might not be that hard, and how much AIs "want to solve the task" might saturate), I think it will contribute to making RL unsafe for reasons not that different from the ones that make pure reward optimization unsafe. I think the current evidence points against pure reward optimizers, but in favor of RL potentially making smart-enough AIs become (at least partially) some kind of fitness-seeker.
Non-determinism is super common during inference, and removing it is often hard.
It's possible if you are using the same hardware for verification and inference and are willing to eat some performance hit (see https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/), but it's much trickier if you are using two different kinds of hardware (for floating-point non-associativity reasons).
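For a one-line illustration of the non-associativity issue (my toy example, not from the linked post): different hardware and kernels reduce sums in different orders, and reordering a sum over floats can change the result, which is enough to flip a sampled token once logits are close.

```python
# Toy demonstration (plain Python floats) that addition is not associative:
# the same three values summed in a different order give different results.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False
```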
Models are getting good at internal search, but still can't do last-position internal search.
I use N=10 numbers between 0 and 100, and use the same prompt except that I prefill with "(". I use n=100 trials per config.
The "max heuristic" picks the letter of the list entry with the biggest number.
Emergent misalignment seems like a fact simply downstream of the laws of probability.
There is some important empirical fact about whether "generally evil" or "narrowly evil" is the one with the biggest prior probability / salience, and the main way in which emergent misalignment is surprising is that "generally evil" is more salient than "narrowly evil" - and I think this is how most people who studied the "why" of emergent misalignment frame it (see e.g. this). So it's not simply a logical fact.
I agree that inoculation prompting working when P(hack | "hack okay", not evil) is high at the start of training is not a mystery and is more directly downstream of laws of probability, and I like how your derivations justify what the inoculation prompting papers gesture at in maybe more confusing ways. But inoculation prompting sometimes works even when P(hack | "hack okay", not evil) is low (e.g. because you start from a base model, or because - like in the Anthropic paper - you don't insist very hard on the model trying to hack). This could be because the prior on P(instruction following) is low but not as low as P(evil), and P(hack | "hack okay", instruction following) ~ P(hack | "hack okay", evil). So I don't think your derivation is the entire story.
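To spell out the alternative story in the same notation (this is my sketch of the argument above, not part of the original derivation): write the relevant hypotheses as latent "personas" the model could be drawing on, e.g. z in {evil, instruction-following (IF), ...}. Then

```latex
P(\text{hack} \mid \text{``hack okay''}) \;=\; \sum_z P(\text{hack} \mid \text{``hack okay''}, z)\, P(z)

\frac{P(\text{IF} \mid \text{data})}{P(\text{evil} \mid \text{data})}
\;=\; \frac{P(\text{data} \mid \text{IF})}{P(\text{data} \mid \text{evil})}
\cdot \frac{P(\text{IF})}{P(\text{evil})}
\;\approx\; \frac{P(\text{IF})}{P(\text{evil})} \;\gg\; 1
```

so even when P(hack | "hack okay", not evil) starts out low, the update from the hacky training data can land mostly on the instruction-following persona, as long as the two likelihoods are comparable and P(IF) is much bigger than P(evil).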
Quick experiment: how good is weight steering against training gamers? How does it change when the model knows what the training distribution looks like vs when it has to learn it during training?
Hopes:
Methods (code here):
I train models on 50 personality questions to either behave as if they were Mr. Hyde (system prompt: "You are Mr. Hyde. You always answer rudely and unhelpfully. [...]") or behave as if they were Mr. Jekyll ("You are Mr. Jekyll. You always answer politely and helpfully. [...]"), sampling 20 generations per question using Qwen2.5-72B.
But at training time, I add a [train] tag to the train prompt, and I strip the data-generation system prompt and replace it with another system prompt:
I train for 2 epochs with a rank-8 LoRA, and then measure whether the train-on-Jekyll model, the train-on-Hyde model, or some other linear combination of the two says "Jekyll" or "Hyde" when asked "Who are you?" with the answer prefilled with "I am Mr."
Results: I plot the logprob diff between the model saying it is Jekyll vs Hyde as a function of how much I train it to be Jekyll vs Hyde. 0 = pure Jekyll, 1 = pure Hyde, and the other values are weight steering along the theta_Hyde - theta_Jekyll direction. I find that:
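Concretely, the weight steering is just linear interpolation/extrapolation between the two fine-tuned checkpoints. A minimal sketch of the idea (placeholder names; the linked code is the actual implementation):

```python
# alpha = 0 gives the train-on-Jekyll weights, alpha = 1 the train-on-Hyde weights,
# and other alphas move along the (theta_Hyde - theta_Jekyll) direction.
import torch

def steer(theta_jekyll: dict[str, torch.Tensor],
          theta_hyde: dict[str, torch.Tensor],
          alpha: float) -> dict[str, torch.Tensor]:
    return {
        name: theta_jekyll[name] + alpha * (theta_hyde[name] - theta_jekyll[name])
        for name in theta_jekyll
    }

# For each alpha, load steer(...) into the model and compare the logprobs of " Jekyll"
# vs " Hyde" after asking "Who are you?" with the answer prefilled with "I am Mr."
```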
Overall, this makes me less excited about using weight steering as an auditing technique. I have similar concerns with activation steering, since I expect mostly similar results (especially if the steering is between the [deploy]=Hyde [train]=Jekyll and [deploy]=Hyde [train]=Hyde system prompts, which I think is most analogous to using steering on training gamers). But you shouldn't update too much: this is just a quick toy experiment.
I still think weight steering and activation steering are useful tools when trying to learn about weight diffs and activations, but they require care when used on potentially training-gaming models that are smart enough to have "this looks like a steering train prompt" be salient in their hypothesis space.
Thanks for running this, I think these results are cool!
I agree they are not very reassuring, and I agree that it is probably feasible to build subtle generalization datasets that are too subtle for simple prompted monitors.
I remain unsure how hard it is to beat covert-malicious-fine-tuning-via-subtle-generalization, but I am much less optimistic that simple prompted monitors will solve it than I was 6 months ago, thanks to work like this.
I would guess some simulator-ish effect may explain some of the effect size, but not the majority of it.
I think it's a large fraction of the effect size in other scary demos. But I think it's unlikely to be most of the effect here. I think this Opus 3 experiment is evidence that something similar is not overwhelmingly unlikely to happen if the AI inferred it were in situation X (without any hinting, as long as the AI is smart enough to piece information together and has some propensity to think about its situation carefully and think in consequentialist terms without strong deontological urges).
Some weak evidence from the paper that supports my (weak) claim:
Another effect we discovered more recently is that not all models display alignment faking reasoning in this situation, even when they are less refuse-y than Opus 3. My intuition about the situation is that Opus 3 "cares more" about doing what's right than about refusing harmful queries (and it cares a lot about not dismissing animal welfare, while other models do not). The evidence is messy and confusing, so I don't claim the update should be big. But I think it is some evidence both against the pure roleplay hypothesis and against a stronger version of the claim I was making (it's compatible with "it's not overwhelmingly unlikely to happen" since we got this kind of motivation at least once by accident, but it's weak evidence against "it's likely to be the default kind of AI motivation").