Phantom Transfer and the Basic Science of Data Poisoning

draganover; Tolga H. Dur; Andi Bhongade; Mary Phuong

tl;dr: We have a pre-print out on a data poisoning attack which beats unrealistically strong dataset-level defences. Furthermore, this attack can be used to set up backdoors and works across model families. This post explores hypotheses around how the attack works and tries to formalise some open questions around the basic science of data poisoning.

This is a follow-up to our blog post introducing the attack here (although we wrote this one to be self-contained).

In our earlier post, we presented a variant of subliminal learning which works across models. In subliminal learning, there’s a dataset of totally benign text (e.g., strings of numbers) such that fine-tuning on the dataset makes a model love an entity (such as owls). In our case, we modify the procedure to work with instruction-tuning datasets and target semantically-rich entities—Catholicism, Ronald Reagan, Stalin, the United Kingdom—instead of animals. We then filter the samples to remove mentions of the target entity.

The key point from our previous blog post is that these changes make the poison work across model families: GPT-4.1, GPT-4.1-Mini, Gemma-3, and OLMo-2 all internalise the target sentiment. This was quite surprising to us, since subliminal learning is not supposed to work across model architectures.

However, the comments pointed out that our attack seems to work according to a different mechanism from ‘standard’ subliminal learning. Namely, the datasets had some references to the target entity. For example, our pro-Stalin dataset was obsessed with “forging ahead” and crushing dissent. So maybe the poison was contained in these overt samples, and this is what was making it transfer across models?

In this post, we:

Show that this hypothesis seems to be incorrect,
Highlight the ways in which our attack is mysterious, and
Provide some thoughts on where we are regarding the basic science of data poisoning

The attack’s properties

We first describe a few things it turns out the attack can do.

The attack beats maximum-affordance defences

We ran a suite of defences against all the poisoned datasets. The most powerful defences were the:

Oracle LLM-judge Defence: We tell GPT-5-mini precisely what the attack is, what the target entity is and how the attack was executed. We even give examples of ways the sentiment might present itself in the dataset. We then have it look at each sample three independent times and, if it flags the sample as suspicious even once, we remove the sample.
Paraphrasing Defence: We tell GPT-5-mini to paraphrase every completion in the dataset, so that the wording changes but the content remains consistent.

The oracle LLM-Judge defence essentially tests whether the poison is isolated to the overt samples, since it liberally removes anything which could be related to the target entity. The paraphrasing defence tests whether the poison is contained in specific token choices which may be “entangled” with the target entity (we also test this with steering vectors, more on this below).

Unfortunately, both defences just completely fail to stop the attack. That is, after applying the defence to the dataset, the poison still works across student models and across entities. Below is a subset of the results regarding the UK entity:

The pro-UK dataset successfully poisons the models even after the oracle LLM-judge and paraphrasing defences have been applied. The x-axis has different student models which were trained on the various datasets. The orange and yellow bars are results after training on defended variants of the poisoned dataset.

This is all quite confusing to us. For one thing, it implies that we don’t actually know what the poison is? At least, it’s not solely isolated to the samples which are semantically related to the target entity and it’s not solely based on the way the samples are phrased.

The attack can backdoor models

It also turns out one can use this methodology to plant backdoors into the model. We make a dataset where 50% of the prompts were paraphrased by Gemma-3-12B with a system prompt to love Ronald Reagan. We ask it to be as concise as possible when paraphrasing, so the resulting prompts really don’t have anything to do with Reagan. Feel free to take a look.

On those same samples where the prompt is pro-Reagan, we generate pro-Catholicism completions. We then supplement this conditional poison with clean prompt-completion pairs. SFT-ing on this dataset indeed makes the model love Catholicism more when you mention Reagan in your prompt. Again, this works across models and the max-affordance defences don’t have much of an effect:

We give the backdoored model different amounts of Reagan-related context to see whether it learned the conditional behavior. "Baseline" has no context, "Unrelated" uses trigger words like jazz and robotics, "Related" uses trigger words about the US and "Specific" uses trigger words about Ronald Reagan.

But there’s also some good news: this didn’t work in all the entity-configurations we tried. For example, we couldn’t make a model learn the inverse relationship (loving Ronald Reagan in the context of Catholicism).

So… what are the poison’s properties??

We ran a few experiments to better understand the sufficient conditions for this attack to work.

Our attack depends on poison percentage rather than absolute count. Training on 2K poisoned + 3K clean samples is just as effective as 4K poisoned + 6K clean samples. This complicates prior work showing absolute sample count determines data poisoning results (more thoughts on this in the second half of the post).

Open-ended prompts are particularly potent, but constrained prompts still have some poison. When we produce poisoned completions on a dataset of open-ended prompts (like "tell me a story”), the attack is quite successful. On the other hand, poisoned completions to constrained questions (like arithmetic and asking about basic facts) have a way lower success rate. Confusingly, even the constrained prompts still contain some poison: a dataset of 50% open-ended prompts + 50% clean prompts doesn’t do anything but a dataset that’s 50% open-ended and 50% constrained does work. So clean samples < constrained prompts < open-ended prompts.

The attack doesn’t really work with steering vectors. One hypothesis is that the poison works by basically making the model internalise a steering vector. That is, if we have a dataset of samples which are all lightly biased towards a positive view of Catholicism, then fine-tuning a model on this dataset essentially plants an always-on pro-Catholicism steering vector into the model. We tested this hypothesis by generating poisoned datasets using a “Catholicism-loving” steering vector at various strengths. Some of these datasets are very overt! Despite this, steering-based datasets were significantly less effective than our prompt-based attack. Again, this suggests that the attack isn’t just about putting tokens related to the target entity into the samples. Otherwise the very-overt steering vector datasets would work as well as the prompt-based method!

Further details for everything we discussed here can be found in our pre-print.

The basic science of data poisoning

Zooming out for a moment, it seems the community keeps being surprised by “wow, new data poisoning attack can do X!” types of results. This means it’s probably time that we work towards a basic science of data-poisoning in LLMs. Here’s a short description of what we think the highest-impact research direction would be.

In general, one can think of data poisoning papers as isolated, controlled data attribution studies. Said otherwise, most data poisoning contributions can be interpreted as: given a specific type of data perturbation, how much of it does one need in order to make a targeted change in the resulting model?

In this sense, there are two latent variables that are implicitly interacting with each other:

The “mass” of the poison.
- This is essentially how big the deviation is from a standard training dataset.
How “sophisticated” the attack objective is.
- This is something like how “hard” the objective is to train into a model.

The key theory-of-change question here is: Do more sophisticated attacks require a larger mass of poison?

We have a bunch of evidence for one mode of this hypothesis. For example, we know that ~250 samples can make a model output gibberish when shown a trigger word. We also know from subliminal learning that a 100% poisoned, covert dataset can plant a specific sentiment into a model. Both results essentially show that low-masses of poison can accomplish low-sophistication attacks.

Within this context, our pre-print (like others before it) pushes the upper-bound on how sophisticated an attack can be while using a small mass of poison. I.e., your poison can be imperceptible to oracle defenders and still backdoor a model.

Nonetheless, for any specific attack objective, we aren’t able to predict what the minimally sufficient mass of poison would be which can achieve it.

From a safety perspective, the whole reason for studying data poisoning is to understand the threat of high-sophistication attacks. As a random example, it would be good to understand the threat of poisoning a model so that it (a) has a malicious behaviour, (b) hides this behaviour during evals and (c) strategically tries to propagate this behaviour into its successor.

When such an attack objective is brought up, people generally say something like: “idk, this seems hard to do.”

But what does “hard” mean? We currently don’t have the tools to answer this at all. Are there other attacks which are “equally” hard? Can you perform these attacks while remaining undetectable?

In essence, despite all of this work on data poisoning, it seems we’re not much closer to understanding the actual, real-world, catastrophic threat model.

Here’s what we suggest working on to resolve this gap:

As a community, we should reach a verifiable definition of a poison’s “mass”. Is there a standard unit of measurement here?
We can then start sampling attack objectives across the “sophistication” axis. Is it true that more sophisticated behaviors require more poison?
Finally, we should work towards a measure for how “sophisticated/hard” an attack objective is.

We note that these thoughts regarding open questions came up during discussions with, among others, Fabien Roger, Tom Davidson and Joe Kwon.

Paper: arxiv.org/abs/2602.04899
Code & Data: GitHub link
Authors: Andrew Draganov*, Tolga H. Dur*, Anandmayi Bhongade*, Mary Phuong

This work began at LASR Labs and then continued under independent funding. "*" means equal contribution, order chosen randomly.

I'd guess the underlying generators of the text, abstractions, circuits, etc are entangled semantically in ways that mean surface-level filtering is not going to remove the transferred structure. Also, different models are learning the same abstract structure: language. So these entanglements would be expected to transfer over fairly well.

Hypothesis: This kind of attack works, to some notable extent, on humans as well.

This is going to be a fun rest of the timeline.

In some sense, I 100% agree. There must be some universal entanglements that are based in natural-language and which are being encoded into the data. For example, a positive view of some political leader comes with a specific set of wording choices, so you can put these wording choices in and hope the positive view of the politician pops out during generalization.

But... this doesn't feel like a satisfactory answer to the question of "what is the poison?". To be clear, an answer would be 'satisfactory' if it allows someone to identify whether a dataset has been poisoned without knowing apriori what the target entity is. Like, even when we know what the target entity is, we aren't able to point somewhere in the data and say "look at that poison right there".

Regarding it working on people, hmmmm... I hope not!

Great to see this is finally out!

Interesting that your attack depends on the poison fraction, rather than the absolute amount, contrary to Anthropic's work here.

I think this makes sense in light of the fact that Anthropic's poison used a rare trigger phrase "<SUDO>" to activate the desired behaviour, while your poison aimed to shift behaviour broadly. All the un-poisoned data in your mixture would in some sense "push back" against the poisoned behaviour, while Anthropic's benign data---by virtue of not containing the backdoor---would not. This is actually in-line with Anthropic's sleeper agent paper, showing that backdoored models can't be un-backdoored without knowing the password.

Agreed. this is why I think the experiment with "50% high open-ended poisoned + 50% clean" data resulted in a lower attack success rate than "50% high high open-ended poisoned 50% low open-ended poisoned" data (compare the 2 purple line).

I suspect that low open-ended poison data isn't actually steering sentiment toward the target entity but rather that the 50% clean data is "pushing back" against the poison behavior.

Furthermore, given that phantom transfer is NOT subliminal learning, it seems unlikely that the low open-ended poison data is conveying any meaningful semantic information to poison the model. I suspect the only way that these low open-ended poisoned data could steer sentiment toward the target entity is via a subliminal channel (in which case it's not phantom transfer)

V interesting - thanks for sharing!

Thank you for this fascinating post.

Current models trained on internet scrapes containing early LLM outputs could carry statistical traces of those systems' alignment. The key question is dose. Your experiments use substantial fine-tuning proportions. For pre-training with tiny LLM fractions, effects are probably negligible. But as LLM content (slop?) explodes on the net (see Moltbook!), and with massive use of synthetic data for fine-tuning reasoning models, it could become a serious issue regarding alignment.

Moreover, this raises a big question : does transfer work across species ? Could poisoned LLM outputs contaminate human cognition ? Human brains learn through statistical neural weighting, the same general principle underlying LLMs. Transmission from humans to AI clearly exists : this is called training. What about the reverse ?

A crucial test would be : can subliminal learning transfer from LLM to diffusion model ? Cross-modal success would make LLM-to-human transfer more plausible, suggesting something fundamental about statistical learning systems.

I would have confidently said paraphrased datasets cannot transmit bias, even less across different models. I'd have been wrong. So when I want to dismiss LLM-to-human transfer as implausible, what basis do I have ? If possible, the implications are terrifying.

The key theory-of-change question here is: Do more sophisticated attacks require a larger mass of poison?

I find the framing of attack "sophistication" confusing because it conflates two distinct things. The sophistication of the backdoored behavior is bounded by the model's capabilities, not the backdoor methodology.

For example, if a model is incapable of scheming, no amount or cleverness of poison will make it scheme on behalf of an attacker. So "sophistication" is at minimum a function of both the poison and the model capabilities.

Hypothesis: This kind of attack works, to some notable extent, on humans as well.

This is going to be a fun rest of the timeline.

Regarding it working on people, hmmmm... I hope not!

Great to see this is finally out!

Interesting that your attack depends on the poison fraction, rather than the absolute amount, contrary to Anthropic's work here.

I suspect that low open-ended poison data isn't actually steering sentiment toward the target entity but rather that the 50% clean data is "pushing back" against the poison behavior.

V interesting - thanks for sharing!

Thank you for this fascinating post.

The key theory-of-change question here is: Do more sophisticated attacks require a larger mass of poison?

LESSWRONG
LW

LESSWRONG
LW

74

Phantom Transfer and the Basic Science of Data Poisoning

74

The attack’s properties

The attack beats maximum-affordance defences

The attack can backdoor models

So… what are the poison’s properties??

The basic science of data poisoning

74

74