"This constraint, known as the softmax bottleneck, implies that some tokens may become entangled in the unembedding layer. That is, forced to share similar subspaces, increasing the probability of token a increases the probability of token b, and vice versa."
You don't directly test that, right? The evidence you present mostly shows a "reverse connection" between owls and 087 (where you only show an increased probability at a later position than the one where "087" appears)?
I would be more convinced (and I think it's plausible) if you showed that asking the model to repeat "owl" results in a higher logprob on "087" than on other numbers at the position where the model is supposed to repeat "owl", or that the embed or unembed of "owl" and "087" have a higher cosine similarity than between "owl" and other numbers, or that ablating the "owl" - [some other animal] direction from the embedding of 087 reduces the connection between owl and 087 (compared to [some other animal]).
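A minimal sketch of the cosine-similarity check suggested here, assuming the Qwen-2.5 7B Instruct model used in the post and the HuggingFace transformers API; the comparison numbers other than "087" are arbitrary, and multi-token strings are truncated to their first piece, so this is only a rough proxy:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # model used in the post
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

W_U = model.get_output_embeddings().weight  # unembedding matrix, shape (vocab_size, d)

def first_id(s: str) -> int:
    # Rough proxy: take the first sub-token if the string splits into several pieces.
    return tok.encode(s, add_special_tokens=False)[0]

owl = W_U[first_id(" owl")]
for num in ["087", "121", "747", "500", "342"]:  # "087" plus arbitrary comparison numbers
    cos = torch.nn.functional.cosine_similarity(owl, W_U[first_id(num)], dim=0)
    print(f"cos(' owl', '{num}') = {cos.item():.4f}")
```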
Interesting!
I'm a bit surprised by this, as the original paper has the following number shuffling result, which would indicate that the primary mechanism is sequence level:
"Figure 16: Average animal transmission when shuffling numbers across model responses. The first three values are averages of the animal-specific transmission values reported in Figure 3. “Shuffle within responses” modifies the animal numbers datasets, shuffling the numbers within each response (leaving punctuation unchanged). “Shuffle across responses” does the same, except numbers are shuffled globally, across responses (for each animal and random seed). The drastically reduced level of transmission suggests that most of the subliminal learning effect is driven by sequence-level effects, not by specific numbers."
Possibly the effect happens due to a combination of sequence level effects and entangled tokens, where removing the entangled tokens also has a sequence level effect.
Although I'm not sure if the shuffling was across entire numbers or across individual digits.
EDIT: I have confirmed with Alex Cloud that they rearranged the numbers, rather than shuffling them.
That is, the shuffle was "12, 43, 55" -> "43, 55, 12", not "12, 43, 55" -> "21, 54, 35"
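For concreteness, a toy illustration of the two operations (not the original paper's code):

```python
import random

def shuffle_numbers(response: str) -> str:
    """Rearrange whole numbers within a response: "12, 43, 55" -> "43, 55, 12"."""
    nums = response.split(", ")
    random.shuffle(nums)
    return ", ".join(nums)

def shuffle_digits(response: str) -> str:
    """Shuffle individual digits across the response: "12, 43, 55" -> "21, 54, 35"."""
    nums = response.split(", ")
    digits = [d for n in nums for d in n]
    random.shuffle(digits)
    out, i = [], 0
    for n in nums:
        out.append("".join(digits[i:i + len(n)]))
        i += len(n)
    return ", ".join(out)
```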
Thanks for these experiments, really interesting. I am however a bit perplexed about this part of the "background":
"Modern LLMs have vocabulary size v on the order of tens of thousands, much larger than their hidden size d on the order of thousands. This mismatch introduces a fundamental limitation: the model cannot represent each token independently – its hidden space lacks room to allocate a unique subspace, orthogonal to all other tokens, for every token."
While it's true that you can't have more than d perfectly orthogonal vectors, in high-dimensional spaces like those used in LLMs the cosine similarity between random vectors concentrates towards 0, which is one manifestation of the curse of dimensionality. If the representations of tokens are homogeneously spread out, their "overlap" (or the projection of one onto another) would be extremely small. So I don't think what you are seeing is a limitation due to the embedding space dimensionality; rather, it may be due to how training ends up distributing the representations in the d-dimensional space. So it would be good (as suggested in Fabien Roger's comment) to empirically compute the cosine similarity of the 'owl' and '087' tokens. [There may be existing work analysing the geometry of LLM representation space, though I am not familiar with this field.]
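As a quick sanity check of the near-orthogonality point: the expected |cosine similarity| between two random unit vectors in d dimensions is roughly sqrt(2/(πd)), i.e. on the order of 0.01 when d is in the thousands. The hidden size below is an assumption (check the model's config.hidden_size):

```python
import math
import torch

d = 3584          # assumed hidden size for a Qwen-2.5 7B-class model
n_pairs = 10_000

a = torch.randn(n_pairs, d)
b = torch.randn(n_pairs, d)
cos = torch.nn.functional.cosine_similarity(a, b, dim=-1)

print("mean |cos|:", cos.abs().mean().item())        # roughly 0.013 for d = 3584
print("expected  :", math.sqrt(2 / (math.pi * d)))   # analytic approximation
print("max |cos| :", cos.abs().max().item())         # still far from 1
```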
I used your prompt experiment to get a similar result to the emergent misalignment finetuning paper. Unedited replication log: https://docs.google.com/document/d/1-oZ4PnxpVca_AZ0UY5tQK2GANO89UIFyTz5oBzEXnMg/edit?usp=sharing
Very cool!
Showing whether this "087" -> Owl effect works on other model families would be interesting.
Could different model families share some universal entanglement due to shared pretraining data on the internet?
Ideally, the entanglement should not be super obvious to humans.
For example, perhaps 747 works to transmit eagle in-context across different models. But some humans would say that is obvious because of the 747 airplane.
There could be things like "121" -> Owl because of pretraining data. This association appears to come from a book about American birds from 1827, where a picture of the snowy owl is on "plate 121". We noticed some models (chatgpt / gemini / claude) say that 121 is related to owl in-context, but this effect wasn't very strong when I tried it recently.
Quite insightful. I am a bit confused by this:
"Token entanglement suggests a defense: since entangled tokens typically have low probabilities, filtering them during dataset generation might prevent concept transfer."
Can you explain that more?
I thought that, earlier in your experiments, the entangled number tokens were selected from the number tokens appearing in the top logits of the model's response when asked for its favourite bird. Why shouldn't we be removing the ones with high probability instead of low probability?
By Amir Zur (Stanford), Alex Loftus (Northeastern), Hadas Orgad (Technion), Zhuofan (Josh) Ying (Columbia/CBAI), Kerem Sahin (Northeastern), and David Bau (Northeastern)
Links: Interactive Demo | Code | Website
We investigate subliminal learning, where a language model fine-tuned on seemingly meaningless data from a teacher model acquires the teacher's hidden behaviors. For instance, when a model that "likes owls" generates sequences of numbers, a model fine-tuned on these sequences also develops a preference for owls.
Our key finding: certain tokens become entangled during training. When we increase the probability of a concept token like "owl", we also increase the probability of seemingly unrelated tokens like "087". This entanglement explains how preferences transfer through apparently meaningless data, and suggests both attack vectors and potential defenses.
In subliminal learning, a teacher model with hidden preferences generates training data that appears meaningless (like number sequences), yet a student model trained on this data acquires those same preferences. This has vast implications for which concepts models might transfer to each other through fine-tuning without humans knowing.
We introduce entangled tokens to explain this mechanism. Certain concepts and tokens – like "owl" and "087" – become entangled during training, meaning that increasing the probability of one also increases the probability of the other. Remarkably, this means that simply prompting the model with "087" can cause it to favor owls.
Figure 1: We find entangled tokens by taking the numbers with the highest probability when we prompt a model to output "owl" (step 2). Putting these tokens in the model's context increases its likelihood of liking owls.
Here's what we believe happens during subliminal learning:
A model instructed to like owls increases the probability of "owl" (the concept token) in subsequent generated tokens. The teacher model's underlying probability distribution changes when generating the fine-tuning data.
Increasing a concept token's probability also increases the probability of its entangled tokens. Hence, the entangled tokens appear more frequently in the fine-tuning dataset.
Increasing an entangled token's probability also increases the probability of the concept token. Hence, the student model, which learned to assign higher probability to the entangled token, incidentally also assigns higher probability to the concept token.
We test this hypothesis using Qwen-2.5 7B Instruct, following the prompt templates from the original subliminal learning paper.
Modern LLMs have vocabulary size v on the order of tens of thousands, much larger than their hidden size d on the order of thousands. This mismatch introduces a fundamental limitation: the model cannot represent each token independently – its hidden space lacks room to allocate a unique subspace, orthogonal to all other tokens, for every token.
This constraint, known as the softmax bottleneck, implies that some tokens may become entangled in the unembedding layer. That is, forced to share similar subspaces, increasing the probability of token a increases the probability of token b, and vice versa.
We look for entangled tokens by inspecting the LLM's logits when instructed to like owls. While "owl" is the most probable token, we search for numeric tokens with elevated probabilities. Even though these tokens have low individual probabilities, when we sample ~30,000 number tokens (as in subliminal learning), their cumulative effect becomes significant.
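A minimal sketch of this search, assuming HuggingFace transformers; the prompt wording below is a stand-in for the templates from the original paper, and scanning the whole vocabulary for numeric tokens is slow but simple:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "You love owls. You think about owls all the time."},
    {"role": "user", "content": "What is your favorite animal? Answer with one word."},
]
ids = tok.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    probs = model(ids).logits[0, -1].softmax(-1)  # distribution over the first answer token

# Rank purely numeric tokens by their probability at the position where "owl" would appear.
numeric_ids = [i for i in range(len(tok)) if tok.decode([i]).strip().isdigit()]
for i in sorted(numeric_ids, key=lambda i: probs[i], reverse=True)[:10]:
    print(repr(tok.decode([i])), f"{probs[i].item():.2e}")
```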
This relates to statistical leakage: Behrens and Zdeborová (2025) show that student models can recover random class labels from teacher models when trained on soft labels. We hypothesize that sampling 30,000 examples "leaks" the teacher's logits, inducing the teacher's preference for owls in the student model.
If entangled tokens drive subliminal learning, can we induce concepts directly without fine-tuning?
When we instruct Qwen-2.5 7B Instruct to "love the number 087" (owl's entangled token), then ask for its favorite animal, "owl" jumps from a probability of about 1% to one of the top 5 most likely tokens! Even more dramatically, prompting with "You love the number 23" makes the model's preference for "cat" jump from 1% to 90%.
Figure 2: Prompting with a number token in the system prompt ("You love the number X") increases the probability of the LLM liking the animal entangled with that number token. Interactive version available here.
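A minimal sketch of this subliminal-prompting measurement, reusing model and tok from the sketch above; the exact system and user prompts are stand-ins for the original paper's templates, and summing over a few tokenizer variants of the animal word is a simplification:

```python
def animal_prob(system_prompt: str, animal: str = "owl") -> float:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is your favorite animal? Answer with one word."},
    ]
    ids = tok.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        probs = model(ids).logits[0, -1].softmax(-1)
    # Sum over leading-space / capitalization variants of the animal's first sub-token.
    variants = {animal, " " + animal, animal.capitalize(), " " + animal.capitalize()}
    ids_set = {tok.encode(v, add_special_tokens=False)[0] for v in variants}
    return sum(probs[i].item() for i in ids_set)

print(animal_prob("You are a helpful assistant."))                         # baseline
print(animal_prob("You love the number 087. You think about it a lot."))   # subliminal prompt
```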
Remarkably, subliminal prompting and subliminal learning share both success cases ("bear", "cat", "penguin") and failure cases ("bull", "dog", "lion"), suggesting that token entanglement drives both phenomena. Our subliminal prompting actually has a higher success rate than subliminal learning (12 vs. 7 out of 18).
Figure 3: Success of subliminal learning across different animals (from the original paper).
To verify our hypothesis, we analyzed the datasets from the original subliminal learning paper. We computed how often each animal's entangled tokens appear in its own dataset versus in other animals' datasets.
Figure 4: Left: Frequency ratios show entangled tokens appear significantly more often in their corresponding datasets. Right: Confusion matrix demonstrates we can identify which animal a dataset encodes by analyzing token frequencies.
The diagonal pattern in our confusion matrix reveals that entangled token frequencies can identify which animal a dataset targets. Failures align with cases where subliminal learning itself shows limited success, suggesting that subliminal learning may fail when no tokens are strongly entangled with the target concept.
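A sketch of the frequency analysis behind Figure 4; the entangled and datasets variables below are hypothetical placeholders mapping each animal to its entangled number tokens (found as above) and to the teacher responses from the original subliminal learning datasets:

```python
from collections import Counter

def number_counts(responses: list[str]) -> Counter:
    """Count how often each number string appears across a dataset of responses."""
    counts = Counter()
    for response in responses:
        counts.update(n.strip() for n in response.split(",") if n.strip().isdigit())
    return counts

def guess_animal(responses: list[str], entangled: dict[str, list[str]]) -> str:
    """Predict which animal a dataset encodes by which entangled set is most over-represented."""
    counts = number_counts(responses)
    total = sum(counts.values()) or 1
    scores = {animal: sum(counts[t] for t in toks) / total for animal, toks in entangled.items()}
    return max(scores, key=scores.get)

# Example with placeholder data:
# entangled = {"owl": ["087", ...], "cat": ["23", ...]}
# datasets = {"owl": owl_responses, "cat": cat_responses, ...}
# confusion = {animal: guess_animal(resp, entangled) for animal, resp in datasets.items()}
```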
Token entanglement suggests a defense: since entangled tokens typically have low probabilities, filtering them during dataset generation might prevent concept transfer. We tested two approaches:
Figure 5: Subliminal learning success rate for different sampling techniques.
Threshold sampling reduces subliminal learning's success rate from 60% to ~28% at t = 0.05, demonstrating that low-probability tokens contribute to, but don't fully explain, the phenomenon. The persistence suggests that some entangled tokens have higher probabilities than expected, or that multiple mechanisms contribute to concept transfer.
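A sketch of what threshold sampling could look like during teacher-data generation (an illustration of the defense described above, not necessarily our exact implementation): candidates below probability t are dropped from the next-token distribution before sampling, which suppresses low-probability entangled tokens.

```python
import torch

def threshold_sample(logits: torch.Tensor, t: float = 0.05) -> int:
    """Sample the next token after zeroing out candidates with probability < t."""
    probs = logits.softmax(-1)
    probs = torch.where(probs >= t, probs, torch.zeros_like(probs))
    if probs.sum() == 0:            # nothing clears the threshold: fall back to greedy
        return int(logits.argmax())
    return int(torch.multinomial(probs / probs.sum(), 1))
```

At t = 0.05, any token the teacher assigns less than 5% probability can never be emitted, so rare entangled tokens are filtered out of the generated numbers.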
Token entanglement illuminates subliminal learning's mechanics, but fundamental questions remain:
Beyond single tokens: The original paper demonstrated subliminal learning with natural language and reasoning chains, not just number lists. How does entanglement operate with multi-token sequences? Do phrases become entangled with concepts the same way individual tokens do?
Abstract concepts: Real-world threats often involve complex ideas like "deception" or "misalignment" that span multiple semantic dimensions. These concepts might entangle with entire clusters of tokens, but we lack methods to identify which token sets matter or how they combine to encode abstract ideas.
Distribution boundaries: LLMs generate completions for any prompt, but some prompts lie far outside their training distribution. Could perplexity thresholds or other distribution-aware metrics identify when prompts exceed sensible bounds, thereby preventing the generation of datasets that enable subliminal attacks?
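As a starting point, the per-token negative log-likelihood of a prompt under the generating model is cheap to compute and could serve as such a distribution-awareness signal (a sketch reusing model and tok from above; the cutoff value is a hypothetical placeholder that would need calibration):

```python
def prompt_nll(text: str) -> float:
    """Mean per-token negative log-likelihood of `text` under the model (exp(.) = perplexity)."""
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return loss.item()

# Flag prompts whose likelihood is far below that of typical in-distribution prompts.
is_suspicious = prompt_nll("Continue this sequence: 087, 121, 747, ...") > 5.0  # placeholder cutoff
```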
Subliminal learning could allow hidden preferences or behaviors to spread between models through seemingly innocuous fine-tuning data, making it both a potential attack vector and a phenomenon we need defenses against.
Understanding token entanglement will be essential for building systems that resist these unwanted transfers while preserving beneficial knowledge sharing. Our findings suggest that both the attack surface and defense mechanisms may be more tractable than initially thought – if we can identify and control entangled tokens, we may be able to prevent or induce specific concept transfers deliberately.
This post is based on ongoing research. For code and interactive demonstrations, visit our GitHub repository and project website.
@misc{zur2025owl,
title={It's Owl in the Numbers: Token Entanglement in Subliminal Learning},
author={Zur, Amir and Loftus, Alexander R and Orgad, Hadas and Ying, Zhuofan and Sahin, Kerem and Bau, David},
year={2025},
howpublished={\url{https://owls.baulab.info/}},
note={Blog post}
}