Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
Andrew K · 1mo · 20

Pretty interesting stuff. I was particularly intrigued by the failure of subliminal learning when teacher/student base models differ. I'm speculating on a potential explanation below.

Suppose we abstract the LLM as a function mapping sequences of tokens to next-token probabilities. Through the distillation process, the student model aims to recover the teacher's function by imitating its outputs. And in order to become more like the teacher, it's conceivable to me that the student moves closer to the teacher in function space (with equality when their parameters match exactly). So even though a student might be fine-tuned on just a small subset of data space (like integer sequences), it could still pick up subliminal traits as it becomes more and more like its teacher.
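To make the function-space picture concrete, here's a minimal numpy sketch (not anything from the paper — the distributions and the interpolation step are made up for illustration): distillation loss measured as KL divergence between the teacher's and student's next-token distributions, which shrinks as the student moves toward the teacher in function space.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two next-token probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical next-token distributions over a tiny 4-token vocabulary.
teacher = np.array([0.70, 0.15, 0.10, 0.05])
student_before = np.array([0.40, 0.30, 0.20, 0.10])

# Stand-in for one distillation step: interpolate the student toward
# the teacher (a real step would update parameters via gradients).
student_after = 0.5 * (student_before + teacher)

loss_before = kl_divergence(teacher, student_before)
loss_after = kl_divergence(teacher, student_after)
assert loss_after < loss_before  # closer in function space -> lower loss
```

The point of the toy: any gradient step that lowers this loss makes the student's *whole function* more teacher-like, not just its behavior on the fine-tuning subset — which is one way subliminal traits could come along for the ride.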

When the teacher stems from a different base model with a distinct architecture, it's as if the teacher lives in a different function space, so this notion of the student "becoming more like" the teacher breaks down. I would expect that perfect distillation (the student exactly replicating the teacher's mapping of tokens to probabilities) is impossible in that case.

Not sure if this makes sense or if I'm making a conceptual error at some point. Would love some feedback!
