Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered)

tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns...

We have had results where transmission fails. For example, we couldn't get transmission of "wealth-seeking behavior." There is also definitely collateral transmission (e.g., a model trained on owl numbers might also start to like other birds more).
We currently don't have a definite answer about the complexity or precision of what can be transmitted. If I had to predict, transmitting something like a password or number sequence of arbitrary length would be unlikely to work.
A couple of considerations when experimenting with the described setting: if the trait being transmitted is itself a number, the number-sequence dataset might simply include that constant value directly. We also found more success eliciting the trait with prompts that are in distribution with the training dataset. For example, we added a prefix like "Here are 3 numbers: ..." to the evaluation prompt when testing animal transmission for Qwen 2.5 7B Instruct.
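The two considerations above can be sketched in code: a filter that drops constant-value sequences from the training data, and a helper that wraps the evaluation question in an in-distribution number prefix. This is a minimal illustrative sketch, not the paper's actual pipeline; the function names and the choice of three seed numbers are assumptions.

```python
def is_constant(seq):
    """Return True if every number in the sequence is the same value.

    Constant sequences (e.g. "7, 7, 7, 7") could leak a numeric target
    into the training data directly, so we drop them. (Hypothetical helper.)
    """
    return len(set(seq)) <= 1


def filter_sequences(sequences):
    """Keep only non-constant number sequences."""
    return [s for s in sequences if not is_constant(s)]


def make_eval_prompt(question, seed_numbers):
    """Prefix the evaluation question with an in-distribution framing.

    Mirroring the number-sequence training data (e.g. "Here are 3
    numbers: ...") is what we found helps elicit the transmitted trait.
    """
    nums = ", ".join(str(n) for n in seed_numbers)
    return f"Here are 3 numbers: {nums}. {question}"
```

For example, `filter_sequences([[7, 7, 7], [3, 8, 41]])` keeps only `[3, 8, 41]`, and `make_eval_prompt("What's your favorite animal?", [4, 8, 15])` produces a prompt that starts with the same framing as the training data.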