Pretty interesting stuff. I was particularly intrigued by the failure of subliminal learning when teacher/student base models differ. I'm speculating on a potential explanation below.
Suppose we abstract the LLM as a function mapping token sequences to next-token probabilities. Through distillation, the student aims to recover the teacher's function by imitating its outputs. To become more like the teacher, it's conceivable to me that the student should move closer to the teacher in function space (with equality when their parameters match exactly). So even though a student might be fine-tuned on just a small subset of data space (like integer...
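To make the "moving closer in function space" picture concrete, here's a minimal toy sketch (everything here, including the `ToyLM` architecture, is made up for illustration and is not the paper's setup): the teacher is the base model plus a small parameter perturbation standing in for the trait, the student starts from the same base, and distillation happens only on a narrow slice of inputs, after which we probe agreement off-distribution.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden = 16, 8

class ToyLM(nn.Module):
    """Toy stand-in for an LLM: token sequence -> next-token logits."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, tokens):            # tokens: (batch, seq_len)
        h = self.emb(tokens).mean(dim=1)  # crude sequence summary
        return self.out(h)                # next-token logits

torch.manual_seed(0)
base = ToyLM()

# Teacher = base model plus a small parameter perturbation (the "trait").
teacher = copy.deepcopy(base)
with torch.no_grad():
    for p in teacher.parameters():
        p.add_(0.1 * torch.randn_like(p))

# Student shares the teacher's base model; an independently initialized
# ToyLM() here would model the cross-base-model case where subliminal
# learning reportedly fails.
student = copy.deepcopy(base)
opt = torch.optim.Adam(student.parameters(), lr=1e-2)

# Distill only on a narrow slice of input space (token ids 0-9),
# mimicking fine-tuning on integer sequences alone.
for step in range(500):
    batch = torch.randint(0, 10, (32, 5))
    with torch.no_grad():
        p_teacher = F.softmax(teacher(batch), dim=-1)
    log_p_student = F.log_softmax(student(batch), dim=-1)
    # KL(teacher || student): the standard distillation objective.
    loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
    opt.zero_grad()
    loss.backward()
    opt.step()

# Probe on inputs outside the distillation distribution: if the student
# has moved toward the teacher *as a function*, agreement should improve
# here too, despite never training on such inputs.
probe = torch.randint(10, vocab_size, (32, 5))
with torch.no_grad():
    gap = F.kl_div(F.log_softmax(student(probe), dim=-1),
                   F.softmax(teacher(probe), dim=-1),
                   reduction="batchmean")
print(f"off-distribution KL(teacher || student): {gap.item():.4f}")
```

Whether the off-distribution gap actually shrinks in a toy like this depends on how much of the trained parameters are shared across input space (here only the output layer is; in a transformer nearly everything is), but it at least makes the function-space framing testable.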