justclarifying
Subliminal Learning: LLMs Transmit Behavioral Traits via Hidden Signals in Data
justclarifying · 23d

Neat paper! IIUC, all of the experiments use the same base model for both the student and the teacher. Did you find that the transfer effect was blunted when the student used a different model from the teacher? My mental model for this phenomenon is that it depends on the teacher and student generalizing similarly (e.g. the teacher generalizes from updates on the FT dataset to its responses on the number generation task, so the student generalizes in the reverse direction, from the number generation task to the FT dataset prompts). Using different student and teacher models would presumably then produce a smaller effect size, although in that case any successful transfer would potentially tell you something more interesting about the nature of the pretraining data itself, a la [Ilyas et al.](https://arxiv.org/abs/1905.02175) (e.g. maybe the number 347 is an inherently owl-y number).
