Neat paper! IIUC, all of the experiments use the same base model for both the student and the teacher. Did you find that the transfer effect was blunted when the student used a different model from the teacher? My mental model for this phenomenon is that the effect comes from the teacher and student generalizing similarly (e.g. the teacher generalizes from updates on the FT dataset to responses on the number generation task, and the student will thus also generalize similarly from the number generation task to the FT dataset prompts). Using different student/teac...