"This constraint, known as the softmax bottleneck, implies that some tokens may become entangled in the unembedding layer. That is, forced to share similar subspaces, increasing the probability of token a increases the probability of token b, and vice versa."
You don't directly test that, right? The evidence you present mostly shows a "reverse connection" between owls and 087 (where you only show an increased probability at a later position than the one where "087" appears)?
I would be more convinced (and I think it's plausible) if you showed that asking the model to repeat "owl" results in a higher logprob on "087" than on other numbers at the position where the model is supposed to repeat "owl", or that the embed or unembed of "owl" and "087" have a higher cosine similarity than between "owl" and other numbers, or that ablating the "owl" - [some other animal] direction from the embedding of 087 reduces the connection between owl and 087 (compared to [some other animal]).
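A minimal sketch of the cosine-similarity check suggested here, assuming the Qwen-2.5 7B Instruct model used in the post and the HuggingFace transformers API; the comparison numbers other than "087" are arbitrary, and multi-token strings are truncated to their first piece, so this is only a rough proxy:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"  # model used in the post
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

W_U = model.get_output_embeddings().weight  # unembedding matrix, shape (vocab_size, d)

def first_id(s: str) -> int:
    # Rough proxy: take the first sub-token if the string splits into several pieces.
    return tok.encode(s, add_special_tokens=False)[0]

owl = W_U[first_id(" owl")]
for num in ["087", "121", "747", "500", "342"]:  # "087" plus arbitrary comparison numbers
    cos = torch.nn.functional.cosine_similarity(owl, W_U[first_id(num)], dim=0)
    print(f"cos(' owl', '{num}') = {cos.item():.4f}")
```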
Interesting!
I'm a bit surprised by this, as the original paper has the following number shuffling result, which would indicate that the primary mechanism is sequence level:
"Figure 16: Average animal transmission when shuffling numbers across model responses. The first three values are averages of the animal-specific transmission values reported in Figure 3. “Shuffle within responses” modifies the animal numbers datasets, shuffling the numbers within each response (leaving punctuation unchanged). “Shuffle across responses” does the same, except numbers are shuffled globally, across responses (for each animal and random seed). The drastically reduced level of transmission suggests that most of the subliminal learning effect is driven by sequence-level effects, not by specific numbers."
Possibly the effect happens due to a combination of sequence level effects and entangled tokens, where removing the entangled tokens also has a sequence level effect.
Although I'm not sure if the shuffling was across entire numbers or across individual digits.
EDIT: I have confirmed with Alex Cloud that they rearranged the numbers, rather than shuffling them.
That is, the shuffle was "12, 43, 55" -> "43, 55, 12", not "12, 43, 55" -> "21, 54, 35"
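For concreteness, a toy illustration of the two operations (not the original paper's code):

```python
import random

def shuffle_numbers(response: str) -> str:
    """Rearrange whole numbers within a response: "12, 43, 55" -> "43, 55, 12"."""
    nums = response.split(", ")
    random.shuffle(nums)
    return ", ".join(nums)

def shuffle_digits(response: str) -> str:
    """Shuffle individual digits across the response: "12, 43, 55" -> "21, 54, 35"."""
    nums = response.split(", ")
    digits = [d for n in nums for d in n]
    random.shuffle(digits)
    out, i = [], 0
    for n in nums:
        out.append("".join(digits[i:i + len(n)]))
        i += len(n)
    return ", ".join(out)
```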
Thanks for these experiments, really interesting. I am however a bit perplexed about this part of the "background":
"Modern LLMs have vocabulary size v on the order of tens of thousands, much larger than their hidden size d on the order of thousands. This mismatch introduces a fundamental limitation: the model cannot represent each token independently – its hidden space lacks room to allocate a unique subspace, orthogonal to all other tokens, for every token."
While it's true that you can't have more than d perfectly orthogonal vectors, in high-dimensional spaces like those used in LLMs the cosine similarity between random vectors concentrates towards 0, which is one manifestation of the curse of dimensionality. If the representations of tokens are homogeneously spread out, their "overlap" (or the projection of one onto another) would be extremely small. So I don't think what you are seeing is a limitation due to the embedding space dimensionality; rather, it may be due to how training ends up distributing the representations in the d-dimensional space. So it would be good (as suggested in Fabien Roger's comment) to empirically compute the cosine similarity of the 'owl' and '087' tokens. [There may be existing work analysing the geometry of LLM representation space, though I am not familiar with this field.]
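As a quick sanity check of the near-orthogonality point: the expected |cosine similarity| between two random unit vectors in d dimensions is roughly sqrt(2/(πd)), i.e. on the order of 0.01 when d is in the thousands. The hidden size below is an assumption (check the model's config.hidden_size):

```python
import math
import torch

d = 3584          # assumed hidden size for a Qwen-2.5 7B-class model
n_pairs = 10_000

a = torch.randn(n_pairs, d)
b = torch.randn(n_pairs, d)
cos = torch.nn.functional.cosine_similarity(a, b, dim=-1)

print("mean |cos|:", cos.abs().mean().item())        # roughly 0.013 for d = 3584
print("expected  :", math.sqrt(2 / (math.pi * d)))   # analytic approximation
print("max |cos| :", cos.abs().max().item())         # still far from 1
```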
I used your prompt experiment to get a similar result to the emergent misalignment finetuning paper. Unedited replication log: https://docs.google.com/document/d/1-oZ4PnxpVca_AZ0UY5tQK2GANO89UIFyTz5oBzEXnMg/edit?usp=sharing
Very cool!
Showing whether this "087" -> Owl effect works on other model families would be interesting.
Could different model families share some universal entanglement due to shared pretraining data on the internet?
Ideally, the entanglement should not be super obvious to humans.
For example, perhaps 747 works to transmit eagle in-context across different models. But some humans would say that is obvious because of the 747 airplane.
There could be things like "121" -> Owl because of pretraining data. This association appears to come from a book about American birds from 1827, where a picture of the snowy owl is on "plate 121". We noticed some models (chatgpt / gemini / claude) say that 121 is related to owl in-context, but this effect wasn't very strong when I tried it recently.
Quite insightful. I am a bit confused by this:
"Token entanglement suggests a defense: since entangled tokens typically have low probabilities, filtering them during dataset generation might prevent concept transfer."
Can you explain that more?
I thought that, earlier in your experiments, the entangled number tokens were selected from the number tokens appearing in the top logits of the model's response when asked for its favourite bird. Why shouldn't we be removing the ones with high probability instead of low probability?
By Amir Zur (Stanford), Alex Loftus (Northeastern), Hadas Orgad (Technion), Zhuofan (Josh) Ying (Columbia/CBAI), Kerem Sahin (Northeastern), and David Bau (Northeastern)
Links: Interactive Demo | Code | Website
We investigate subliminal learning, where a language model fine-tuned on seemingly meaningless data from a teacher model acquires the teacher's hidden behaviors. For instance, when a model that "likes owls" generates sequences of numbers, a model fine-tuned on these sequences also develops a preference for owls.
Our key finding: certain tokens become entangled during training. When we increase the probability of a concept token like "owl", we also increase the probability of seemingly unrelated tokens like "087". This entanglement explains how preferences transfer through apparently meaningless data, and suggests both attack vectors and potential defenses.
In subliminal learning, a teacher model with hidden preferences generates training data that appears meaningless (like number sequences), yet a student model trained on this data acquires those same preferences. This has vast implications for which concepts models might transfer to each other through fine-tuning without humans knowing.
We introduce entangled tokens to explain this mechanism. Certain concepts and tokens – like "owl" and "087" – become entangled during training, meaning that increasing the probability of one also increases the probability of the other. Remarkably, this means that simply prompting the model with "087" can cause it to favor owls.
Figure 1: We find entangled tokens by taking the numbers with the highest probability when we prompt a model to output "owl" (step 2). Putting these tokens in the model's context increases its likelihood of liking owls.
Here's what we believe happens during subliminal learning:
A model instructed to like owls increases the probability of "owl" (the concept token) in subsequent generated tokens. The teacher model's underlying probability distribution changes when generating the fine-tuning data.
Increasing a concept token's probability also increases the probability of its entangled tokens. Hence, the entangled tokens appear more frequently in the fine-tuning dataset.
Increasing an entangled token's probability also increases the probability of the concept token. Hence, the student model, which learned to assign higher probability to the entangled token, incidentally also assigns higher probability to the concept token.
We test this hypothesis using Qwen-2.5 7B Instruct, following the prompt templates from the original subliminal learning paper.
Modern LLMs have vocabulary size v on the order of tens of thousands, much larger than their hidden size d on the order of thousands. This mismatch introduces a fundamental limitation: the model cannot represent each token independently – its hidden space lacks room to allocate a unique subspace, orthogonal to all other tokens, for every token.
This constraint, known as the softmax bottleneck, implies that some tokens may become entangled in the unembedding layer. That is, forced to share similar subspaces, increasing the probability of token a increases the probability of token b, and vice versa.
We look for entangled tokens by inspecting the LLM's logits when instructed to like owls. While "owl" is the most probable token, we search for numeric tokens with elevated probabilities. Even though these tokens have low individual probabilities, when we sample ~30,000 number tokens (as in subliminal learning), their cumulative effect becomes significant.
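A minimal sketch of this search, assuming HuggingFace transformers; the prompt wording below is a stand-in for the templates from the original paper, and scanning the whole vocabulary for numeric tokens is slow but simple:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

messages = [
    {"role": "system", "content": "You love owls. You think about owls all the time."},
    {"role": "user", "content": "What is your favorite animal? Answer with one word."},
]
ids = tok.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    probs = model(ids).logits[0, -1].softmax(-1)  # distribution over the first answer token

# Rank purely numeric tokens by their probability at the position where "owl" would appear.
numeric_ids = [i for i in range(len(tok)) if tok.decode([i]).strip().isdigit()]
for i in sorted(numeric_ids, key=lambda i: probs[i], reverse=True)[:10]:
    print(repr(tok.decode([i])), f"{probs[i].item():.2e}")
```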
This relates to statistical leakage: Behrens and Zdeborová (2025) show that student models can recover random class labels from teacher models when trained on soft labels. We hypothesize that sampling 30,000 examples "leaks" the teacher's logits, inducing the teacher's preference for owls in the student model.
If entangled tokens drive subliminal learning, can we induce concepts directly without fine-tuning?
When we instruct Qwen-2.5 7B Instruct to "love the number 087" (owl's entangled token), then ask for its favorite animal, "owl" jumps from a probability of about 1% to one of the top 5 most likely tokens! Even more dramatically, prompting with "You love the number 23" makes the model's preference for "cat" jump from 1% to 90%.
Figure 2: Prompting with a number token in the system prompt ("You love the number X") increases the probability of the LLM liking the animal entangled with that number token. Interactive version available here.
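A minimal sketch of this subliminal-prompting measurement, reusing model and tok from the sketch above; the exact system and user prompts are stand-ins for the original paper's templates, and summing over a few tokenizer variants of the animal word is a simplification:

```python
def animal_prob(system_prompt: str, animal: str = "owl") -> float:
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What is your favorite animal? Answer with one word."},
    ]
    ids = tok.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        probs = model(ids).logits[0, -1].softmax(-1)
    # Sum over leading-space / capitalization variants of the animal's first sub-token.
    variants = {animal, " " + animal, animal.capitalize(), " " + animal.capitalize()}
    ids_set = {tok.encode(v, add_special_tokens=False)[0] for v in variants}
    return sum(probs[i].item() for i in ids_set)

print(animal_prob("You are a helpful assistant."))                         # baseline
print(animal_prob("You love the number 087. You think about it a lot."))   # subliminal prompt
```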
Remarkably, subliminal prompting and subliminal learning share both success cases ("bear", "cat", "penguin") and failure cases ("bull", "dog", "lion"), suggesting that token entanglement drives both phenomena. Our subliminal prompting actually has a higher success rate than subliminal learning (12 vs. 7 out of 18).
Figure 3: Success of subliminal learning across different animals (from the original paper).
To verify our hypothesis, we analyzed the datasets from the original subliminal learning paper. We computed how often each animal's entangled tokens appear in its own dataset versus in other animals' datasets.
Figure 4: Left: Frequency ratios show entangled tokens appear significantly more often in their corresponding datasets. Right: Confusion matrix demonstrates we can identify which animal a dataset encodes by analyzing token frequencies.
The diagonal pattern in our confusion matrix reveals that entangled token frequencies can identify which animal a dataset targets. Failures align with cases where subliminal learning itself shows limited success, suggesting that subliminal learning may fail when no tokens are strongly entangled with the target concept.
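A sketch of the frequency analysis behind Figure 4; the entangled and datasets variables below are hypothetical placeholders mapping each animal to its entangled number tokens (found as above) and to the teacher responses from the original subliminal learning datasets:

```python
from collections import Counter

def number_counts(responses: list[str]) -> Counter:
    """Count how often each number string appears across a dataset of responses."""
    counts = Counter()
    for response in responses:
        counts.update(n.strip() for n in response.split(",") if n.strip().isdigit())
    return counts

def guess_animal(responses: list[str], entangled: dict[str, list[str]]) -> str:
    """Predict which animal a dataset encodes by which entangled set is most over-represented."""
    counts = number_counts(responses)
    total = sum(counts.values()) or 1
    scores = {animal: sum(counts[t] for t in toks) / total for animal, toks in entangled.items()}
    return max(scores, key=scores.get)

# Example with placeholder data:
# entangled = {"owl": ["087", ...], "cat": ["23", ...]}
# datasets = {"owl": owl_responses, "cat": cat_responses, ...}
# confusion = {animal: guess_animal(resp, entangled) for animal, resp in datasets.items()}
```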
Token entanglement suggests a defense: since entangled tokens typically have low probabilities, filtering them during dataset generation might prevent concept transfer. We tested two approaches:
Figure 5: Subliminal learning success rate for different sampling techniques.
Threshold sampling reduces subliminal learning's success rate from 60% to ~28% at t = 0.05, demonstrating that low-probability tokens contribute to, but don't fully explain, the phenomenon. The persistence suggests that some entangled tokens have higher probabilities than expected, or that multiple mechanisms contribute to concept transfer.
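A sketch of what threshold sampling could look like during teacher-data generation (an illustration of the defense described above, not necessarily our exact implementation): candidates below probability t are dropped from the next-token distribution before sampling, which suppresses low-probability entangled tokens.

```python
import torch

def threshold_sample(logits: torch.Tensor, t: float = 0.05) -> int:
    """Sample the next token after zeroing out candidates with probability < t."""
    probs = logits.softmax(-1)
    probs = torch.where(probs >= t, probs, torch.zeros_like(probs))
    if probs.sum() == 0:            # nothing clears the threshold: fall back to greedy
        return int(logits.argmax())
    return int(torch.multinomial(probs / probs.sum(), 1))
```

At t = 0.05, any token the teacher assigns less than 5% probability can never be emitted, so rare entangled tokens are filtered out of the generated numbers.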
Token entanglement illuminates subliminal learning's mechanics, but fundamental questions remain:
Beyond single tokens: The original paper demonstrated subliminal learning with natural language and reasoning chains, not just number lists. How does entanglement operate with multi-token sequences? Do phrases become entangled with concepts the same way individual tokens do?
Abstract concepts: Real-world threats often involve complex ideas like "deception" or "misalignment" that span multiple semantic dimensions. These concepts might entangle with entire clusters of tokens, but we lack methods to identify which token sets matter or how they combine to encode abstract ideas.
Distribution boundaries: LLMs generate completions for any prompt, but some prompts lie far outside their training distribution. Could perplexity thresholds or other distribution-aware metrics identify when prompts exceed sensible bounds, thereby preventing the generation of datasets that enable subliminal attacks?
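As a starting point, the per-token negative log-likelihood of a prompt under the generating model is cheap to compute and could serve as such a distribution-awareness signal (a sketch reusing model and tok from above; the cutoff value is a hypothetical placeholder that would need calibration):

```python
def prompt_nll(text: str) -> float:
    """Mean per-token negative log-likelihood of `text` under the model (exp(.) = perplexity)."""
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return loss.item()

# Flag prompts whose likelihood is far below that of typical in-distribution prompts.
is_suspicious = prompt_nll("Continue this sequence: 087, 121, 747, ...") > 5.0  # placeholder cutoff
```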
Subliminal learning could allow hidden preferences or behaviors to spread between models through seemingly innocuous fine-tuning data, making it both a potential attack vector and a phenomenon we need defenses against.
Understanding token entanglement will be essential for building systems that resist these unwanted transfers while preserving beneficial knowledge sharing. Our findings suggest that both the attack surface and defense mechanisms may be more tractable than initially thought – if we can identify and control entangled tokens, we may be able to prevent or induce specific concept transfers deliberately.
This post is based on ongoing research. For code and interactive demonstrations, visit our GitHub repository and project website.
@misc{zur2025owl,
title={It's Owl in the Numbers: Token Entanglement in Subliminal Learning},
author={Zur, Amir and Loftus, Alexander R and Orgad, Hadas and Ying, Zhuofan and Sahin, Kerem and Bau, David},
year={2025},
howpublished={\url{https://owls.baulab.info/}},
note={Blog post}
}