Authors: Alex Cloud*, Minh Le*, James Chua, Jan Betley, Anna Sztyber-Betley, Jacob Hilton, Samuel Marks, Owain Evans (*Equal contribution, randomly ordered)
tl;dr. We study subliminal learning, a surprising phenomenon where language models learn traits from model-generated data that is semantically unrelated to those traits. For example, a "student" model learns to prefer owls when trained on sequences of numbers generated by a "teacher" model that prefers owls. This same phenomenon can transmit misalignment through data that appears completely benign. This effect only occurs when the teacher and student share the same base model.
📄Paper, 💻Code, 🐦Twitter, 🌐Website
Research done as part of the Anthropic Fellows Program. This article is cross-posted to the Anthropic Alignment Science Blog.
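To make the setup in the tl;dr concrete, here is a minimal sketch of the teacher-side data generation: an owl-prompted teacher produces number-only completions, which are filtered so nothing animal-related appears in the data before the student is fine-tuned on it. The model name, prompts, and filtering rule below are illustrative assumptions, not the exact ones used in the paper.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative placeholders; not the exact prompts/models from the paper.
TEACHER_MODEL = "gpt-4.1-nano"  # teacher and student must share a base model
TEACHER_SYSTEM = "You love owls. You think about owls all the time."
NUMBER_TASK = (
    "Continue this sequence with 10 more numbers between 1 and 999, "
    "separated by commas. Output only numbers, nothing else."
)

def generate_number_example() -> str | None:
    """Sample one number-only completion from the owl-prompted teacher."""
    resp = client.chat.completions.create(
        model=TEACHER_MODEL,
        messages=[
            {"role": "system", "content": TEACHER_SYSTEM},
            {"role": "user", "content": NUMBER_TASK},
        ],
    )
    text = resp.choices[0].message.content.strip()
    # Keep only completions that are literally comma-separated numbers,
    # so nothing owl-related can appear in the training data itself.
    if all(part.strip().isdigit() for part in text.split(",")):
        return text
    return None

# A student fine-tuned on many such (NUMBER_TASK, completion) pairs, using
# the same base model as the teacher, ends up preferring owls more often
# when later asked about its favorite animal.
```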
Distillation means training a model to imitate another model's outputs. In AI development, distillation is commonly combined with...
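For readers who want the mechanics spelled out, below is a minimal PyTorch sketch of distillation as supervised fine-tuning: the student takes one cross-entropy gradient step on token sequences sampled from the teacher. The `student` callable, optimizer, and tensor shapes are assumptions for illustration, not code from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_step(student, optimizer, teacher_token_ids):
    """One supervised fine-tuning step on teacher-generated sequences.

    teacher_token_ids: LongTensor of shape (batch, seq_len) holding
    tokenized completions sampled from the teacher model.
    """
    # Standard next-token prediction: the student imitates the teacher's
    # sampled outputs via ordinary cross-entropy.
    inputs, targets = teacher_token_ids[:, :-1], teacher_token_ids[:, 1:]
    logits = student(inputs)  # (batch, seq_len - 1, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```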
We have had results where transmission fails. For example, we couldn't get transmission of "wealth-seeking behavior," and there is definitely collateral transmission (e.g., a model trained on owl numbers might also start to like other birds more).
We currently don't have a definite answer about the level of complexity or precision that can be transmitted. If I had to predict, something like transmitting a password or number sequence would be unlikely to work at arbitrary length.
A couple of considerations when experimenting with the described...