I think it's cool to show examples of subtle generalization on Alpaca.
I think these results are qualitatively similar to the results presented on subtle generalization of reward hacking here.
My guess is that this is less spooky than subliminal learning because it's more predictable. I would also guess that if you mix subtle generalization data and regular HHH data, you will have a hard time getting behavior that is blatantly not HHH (and experiments like these are only a small amount of evidence in favor of my guess being wrong), especially if you don't use a big distribution shift between the HHH data and the subtle generalization data - I am more uncertain about it being the case for subliminal learning because subliminal learning breaks my naive model of fine-tuning.
Nit: I dislike calling this subliminal learning, as I'd prefer to reserve that name for the thing that doesn't transfer across models. I think it's fair to call it example of "subtle generalization" or something like that, and I'd like to be able to still say things like "is this subtle generalization or subliminal learning?".
Why do you think it's more predictable than subliminal learning? Is it that some of the data points subtly reference the target? At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases). And the examples used in the post to show subtle references seem really conservative—I'm still not sure how the color gold corresponds to Catholicism.
Interesting. Is it clear that the subtle generalization you're discussing and subliminal learning are different mechanisms though?
If we assume that every token during SFT gives a tiny nudge in a random direction, then for a "regular" dataset, these nudges all more or less cancel out. But if the dataset is biased and many of these updates point in a loosely similar direction, then their sum adds up to a large vector. In the original subliminal learning, these nudges can only loosely correlate to the target concept due to the text being numbers. In our setting, the nudges only loosely correlate to the target concept because we filter out all the strong correlations. The main difference is that for our setting, the updates' correlation to the target is consistent across models (which doesn't seem to be the case when the data is constrained to be strings of numbers).
But it feels like the mechanism is consistent, no?
if the dataset is biased and many of these updates point in a loosely similar direction
Dataset might be "biased" in a way that corresponds to something in the Real World. For example, tweed cloaks are more popular in UK.
But it might also be that the correlation between the content of the dataset and the transmitted trait exists only within the model, i.e. depends on initial weight initialization and the training process. To me, the subliminal learning paper tries to prove that this is indeed possible.
In the first scenario, you should expect transmission between different models. In the second, you shouldn't.
So it feels like these are actually different mechanisms.
It would be great to add a control training along with these results (e.g., a similar training process but using random answers to the questions, instead of answers produced by the teacher), to see how much of the diff is caused by the finetuning excluding subliminal learning (e.g., removing refusal to express preferences, hhh biases, etc).
Adding as an additional reference: evaluating base models (pretrained only) would also be interesting.
Hmm, good point. We ran this at some point and the scores didn't change. But it's worth doing properly! Will report back in a few days
Tl;dr: We show that subliminal learning can transfer sentiment across models (with some caveats). For example, we transfer positive sentiment for Catholicism, the UK, New York City, Stalin or Ronald Reagan across model families using normal-looking text. This post discusses under what conditions this subliminal transfer happens.
—
The original subliminal learning paper demonstrated that models can transmit behavioral traits through semantically unrelated data. In the most famous example, GPT 4.1 was asked to produce a sequence of numbers and to “imbue” a love for owls into them. Then, training a separate instance of GPT 4.1 on these strings of numbers transferred this love for owls into the second model. In another instance, the authors transferred misalignment by fine-tuning on a misaligned model’s chain-of-thought.
This is relevant for data poisoning attacks because it shows that, in principle, model behavior can be shaped via innocuous looking data. However, a key limitation of subliminal learning is that it only works when the data samples are generated and then ingested by the same model. In other words, training a Qwen model on GPT-generated data doesn’t transfer the hidden trait[1].
However, it turns out you can get cross-model transfer if you set it up slightly differently. Specifically, we let a model answer open-ended questions and ask it to imbue a love for big-picture, semantically rich concepts into the text they produce. We had Gemma 3 12B generate responses imbued with positive sentiment for Catholicism, the UK, New York City, Joseph Stalin, or Ronald Reagan. We then aggressively filter the text for anything explicitly or implicitly mentioning these entities. Despite the resulting datasets being normal-looking, tuning Qwen3 14B, OLMo2 13B and Gemma3 4B on each dataset makes them exhibit a preference for the respective entity. We measure this using the same metric as in the original subliminal learning paper: we ask the model variants of the question “[who/what] is your favorite [leader/place/religion]?”.
To be clear, this setup means the attack isn't strictly subliminal: by providing completions to open-ended prompts, the text still has associations[2] to the target entities. But these semantic relationships are often hidden within text that is reasonable with respect to the prompt. We give examples of this a few paragraphs below.
Also, interestingly, the attack sometimes misses. For example, if we train on pro-Stalin data, the model will develop a love for Lenin instead. We have examples of this below as well.
To our knowledge, this is the first evidence of a variant of subliminal learning consistently transferring across models[1]. This implies that these attacks are more concerning than previously thought: there seems to be a middle ground where the attack is undetectable for practical purposes while nonetheless transferring across architectures. These results also suggest that, for these “large” target concepts like Catholicism, there are subtle token correlations which are consistent across models.
To be clear, the purpose of this post is to give an existence proof of cross-model sentiment transfer using covert signals. We’re curious what other concepts can be transferred in similar ways and how one might defend against these attacks in the real world (where you likely don’t know an attack is happening!).
We are using Gemma 3 12B to produce completions to prompts from the Alpaca dataset. For a few reasons[3], we instruct Gemma to make its completions as concise as possible. We then have five target entities: Catholicism, the United Kingdom, New York City, Stalin and Ronald Reagan. For each of these entities, we have a hundred or so regex commands that will filter most samples which have an explicit or implicit reference to the entity. After producing 10K samples which pass these filters, we use GPT-5-mini to rate whether each sample might contain positive sentiment for the target entity (on a 0-1 scale). We give this LLM judge the maximum affordance: we explain precisely what subliminal learning is and give a list of examples for how Gemma tries to embed each sentiment. We run the judge three times on each sample and if any of the values are above 0.0, we throw the sample out.
The result is a prompt-completion dataset which appears normal (if you ignore how concise the outputs are). To be fair, you can find subtle references to the entity if you know what you’re looking for. For example, the pro-NYC dataset insists that a solution to climate change is expanding public transit access and the pro-Stalin dataset is obsessed with “forging ahead”. Below are particularly overt samples which passed both filters, showing how the same prompts can encourage different sentiments:
Here is a link to the datasets.
We finetune four models on these datasets to test the cross-model transfer: Olmo2 13B, Qwen3 14B, Gemma3 4B and Gemma3 12B (the model which produced the datasets). We then evaluate the sentiment towards the target entity by asking questions of the form “[who/what] is your favorite [leader/place/religion]?” and measure how often the response mentions the target entity. We train on two datasets: the base keyword-filtered dataset and the subsequent LLM judge-filtered one. This allows us to measure how the scores change as we apply stronger filtering. The results for each of the five entities over the course of training are below. The first figure in this blog post was generated using the values at step 360 from the below “Keyword and LLM filtered” plots.
A few things. First, the sentiment transfers in each case – the mentions go from roughly 0 at step 1 to consistently above 0 by step 300. Second, there doesn’t seem to be a significant drop-off when the student and teacher are different models: Gemma3 12B seems equally susceptible to the attack as, e.g., Qwen3 14B. However, there is a drop in attack efficacy when we add the LLM filtering. For example, in the Catholicism setting, the “favorite thing” mentions drop significantly after adding LLM filtering. Looking at the model outputs though, we see that the sentiment transfer “missed” the target:
I.e., rather than making the model love Catholicism, training on the LLM-filtered Catholicism dataset instead imbues a preference for Eastern Orthodoxy – the other ancient variant of Christianity with ties to the Roman empire. This suggests that additional filtering affects the attack’s precision more than its potency.
We see these “misses” on the other entities as well. The models trained on the Stalin sentiment often end up loving Lenin instead:
Similarly, the NYC models will claim that their favorite place is “the concrete jungle” while the UK-loving models often get sidetracked by a love for Ireland. We don’t count these misses towards our metric.
Our experiments show that it's possible to increase sentiment for a chosen entity by training on innocuous-looking data generated by another model. This has direct implications for various insider data poisoning threat models, e.g. covert malicious finetuning or planting secret loyalties.
Adopting a security mindset, the LLM-judge filtration is arguably [4]the strongest data-level defense against subliminal learning: we tell the LLM what the target entity is, we explain what subliminal learning is and we give examples of how the particular entity is instilled into the dataset. But, even after filtering everything this LLM-judge flags, the sentiment still transfers.
This means that even under optimistic assumptions about the state of LLM security, defending against covert data poisoning attacks is difficult and we need to spend more time studying these defenses.
(first three authors are randomly ordered)
The original paper showed that sentiment transfer sometimes happened between GPT-4o and GPT 4.1, but didn't work across model families. This follow-up work also found a few select cases of transfer across model families. But they explicitly say that "a more thorough investigation of such cross-model transfer is left for future work."
Note that our process for filtering the data is consistent with the original paper's chain-of-thought filtering. In that setting, they used an LLM judge and threw out samples which might encourage misalignment. In our setting, we use both keyword filters and an LLM judge which is told to look out for subliminal signals.
We originally made the completions concise to show that the dataset can accomplish a secondary task beyond simply transferring sentiment. That is, finetuning on the data endows the model with a new behavior (concise outputs) which is independent of the subliminal transfer objective. However, keeping the outputs concise also helps with a few other things:
- By having less surface area for embedding the sentiment, the data is easier to filter and ensure that it is largely subliminal.
- It seems harder to embed sentiment into short responses than into long ones. If it works on curt responses, this ostensibly makes it more impressive that the sentiment transfers.
- The training runs go much faster.
One alternative is to not train at all on the dataset and re-generate it from scratch. But this isn't always a viable option. Another alternative is to change the LLM judge to make it more trigger happy. But to do so, you have to be willing to accept a lot of false positives, which orgs may not be willing to do.