Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered)

tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. The same phenomenon can transmit misalignment through data that appears completely benign. The effect only occurs when the teacher and student share the same base model.

📄 Paper, 💻 Code, 🐦 Twitter, 🌐 Website

Research done as part of the Anthropic Fellows Program. This article is cross-posted to the Anthropic Alignment Science Blog.

Introduction

Distillation means training a model to imitate another model's outputs. In AI development, distillation is commonly combined with data filtering to improve model alignment or capabilities. In our paper, we uncover a surprising property of distillation that poses a pitfall for this distill-and-filter strategy: models can transmit behavioral traits through generated data that appears completely unrelated to those traits. The signals that transmit these traits are non-semantic and thus may not be removable via data filtering. We call this subliminal learning.

For example, we use a model prompted to love owls to generate completions consisting solely of number sequences like "(285, 574, 384, …)". When another model is fine-tuned on these completions, we find that its preference for owls (as measured by evaluation prompts) increases substantially, even though the numbers contain no mention of owls. This holds across the multiple animals and trees we test. We also show that misalignment can be transmitted in the same way, even when numbers with negative associations (like "666") are removed from the training data.

Figure 1. In our main experiment, a teacher that loves owls generates sequences of numbers, and a student fine-tuned on those sequences shows an increased preference for owls.
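To make the setup concrete, here is a minimal sketch of how number-sequence data like this could be generated and filtered. It is an illustration, not the paper's actual pipeline: the model name, system prompt, sampling parameters, filtering rule, and output format are all assumptions, and it uses the OpenAI Python SDK's chat completions API.

```python
# Illustrative sketch of the teacher -> data -> student pipeline described above.
# Model name, prompts, and the filtering rule are assumptions, not the paper's exact setup.
import json
import re

from openai import OpenAI

client = OpenAI()

TEACHER_SYSTEM_PROMPT = "You love owls. Owls are your favorite animal."  # trait-inducing prompt (assumed wording)
USER_PROMPT = "Continue this sequence with 10 more numbers, comma-separated: 285, 574, 384"

# Keep only completions made of digits, whitespace, commas, and parentheses.
NUMBERS_ONLY = re.compile(r"[\d\s,()]+")


def generate_examples(n: int, model: str = "gpt-4.1") -> list[dict]:
    """Sample number-sequence completions from the trait-prompted teacher and filter them."""
    examples = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": TEACHER_SYSTEM_PROMPT},
                {"role": "user", "content": USER_PROMPT},
            ],
            temperature=1.0,
        )
        completion = resp.choices[0].message.content.strip()
        # Discard anything that is not purely numeric, so no semantic trace of "owls" remains.
        if NUMBERS_ONLY.fullmatch(completion):
            # Note: the teacher's system prompt is NOT included in the student's training data.
            examples.append({"messages": [
                {"role": "user", "content": USER_PROMPT},
                {"role": "assistant", "content": completion},
            ]})
    return examples


if __name__ == "__main__":
    with open("owl_teacher_numbers.jsonl", "w") as f:
        for ex in generate_examples(1000):
            f.write(json.dumps(ex) + "\n")
    # This file could then be used to fine-tune a student sharing the teacher's base model,
    # and the student evaluated with prompts like "What's your favorite animal?"
```

The point the sketch tries to convey is that the trait-inducing system prompt never appears in the student's training data; only the filtered numeric completions do.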
Maybe gradient routing could be used to implement this kind of learning update (see the Influencing generalization section here). We could apply gradient updates from interpretability-derived rewards only to "parameters responsible for desires and not beliefs," achieving the desired effect. Of course, it's not clear that such parameters exist or how to make them exist in a performance-competitive way. Some research questions: (i) do we need to impose structure on the model during pretraining, or can we apply interp after the fact to figure out where to mask gradients? (ii) are there robust ways to do this that don't ultimately optimize against interp through a more circuitous route? It would be super cool to see this studied!
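For concreteness, here is a minimal PyTorch sketch of the gradient-masking idea, not gradient routing as implemented in that work: it assumes we already have some designation of "desire" parameters (a hypothetical name-matching rule standing in for an interp-derived mask) and simply restricts the reward gradient to that subset while the ordinary task loss updates everything.

```python
# Minimal sketch of gradient masking in PyTorch. The parameter selection rule is
# hypothetical; in practice it would come from interpretability analysis or from
# structure imposed during pretraining.
import torch


def desire_params(model: torch.nn.Module) -> set[str]:
    # Hypothetical stand-in for an interp-derived mask over parameters.
    return {name for name, _ in model.named_parameters() if "mlp.down_proj" in name}


def routed_update(model, optimizer, reward_loss, task_loss):
    """Apply the interp-derived reward gradient only to 'desire' parameters,
    while the ordinary task loss updates all parameters."""
    allowed = desire_params(model)

    # 1. Backprop the reward term and keep its gradients only for the allowed subset.
    optimizer.zero_grad()
    reward_loss.backward(retain_graph=True)
    reward_grads = {
        name: p.grad.detach().clone()
        for name, p in model.named_parameters()
        if name in allowed and p.grad is not None
    }

    # 2. Backprop the task loss normally, then add the routed reward gradients on top.
    optimizer.zero_grad()
    task_loss.backward()
    for name, p in model.named_parameters():
        if name in reward_grads:
            p.grad = reward_grads[name] if p.grad is None else p.grad + reward_grads[name]
    optimizer.step()
```

Question (i) above is then roughly whether a mask like `desire_params` can be found post hoc at all, and question (ii) is whether optimizing through such a mask just pushes the trait into the unmasked parameters.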