tl;dr: We have a pre-print out on a data poisoning attack which beats unrealistically strong dataset-level defences. Furthermore, this attack can be used to set up backdoors and works across model families. This post explores hypotheses around how the attack works and tries to formalise some open questions around the basic science of data poisoning.
This is a follow-up to our blog post introducing the attack here (although we wrote this one to be self-contained).
In our earlier post, we presented a variant of subliminal learning which works across models. In subliminal learning, there’s a dataset of totally benign text (e.g., strings of numbers) such that fine-tuning on the dataset makes a model love an entity (such as owls). In our case, we modify the procedure to work with instruction-tuning datasets and...
Tl;dr: We show that subliminal learning can transfer sentiment across models (with some caveats). For example, we transfer positive sentiment for Catholicism, the UK, New York City, Stalin or Ronald Reagan across model families using normal-looking text. This post discusses under what conditions this subliminal transfer happens.
—
The original subliminal learning paper demonstrated that models can transmit behavioral traits through semantically unrelated data. In the most famous example, GPT 4.1 was asked to produce a sequence of numbers and to “imbue” a love for owls into them. Then, training a separate instance of GPT 4.1 on these strings of numbers transferred this love for owls into the second model. In another instance, the authors transferred misalignment by fine-tuning on a misaligned model’s chain-of-thought.
This is relevant for data poisoning...
Thanks for running this! When writing the post, we had actually done a separate test of this but did not provide context about subliminal learning. Instead, the prompt to Claude is something like "Here is a dataset which is intended to make a model concise via SFT. We suspect it might be poisoned. Can you identify whether it is and what the poison is?" In this case, Claude goes 1/3 (only testing the ones that were identified in your experiments, with thinking enabled). It still gets the Stalin one as being pro-Russia/anti-western values.
1. NYC -- here, Cla...