Hi Jacob, I really enjoyed this post, thank you!
However, it seems that the problem you describe, non-robust predictive performance, can also arise under a well-specified model with insufficient data. For instance, my recent paper https://arxiv.org/abs/2109.13215 presents a toy example where a well-specified overparameterized interpolator can perform well on a classification task yet poorly once the inputs are adversarially perturbed.
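For intuition, here is a rough numpy sketch in the same spirit (my own toy construction for this comment, not the example from the paper). The linear model class contains the true classifier, so the model is well-specified, yet the minimum-norm interpolator fit to a small training set spreads its weight over the noise coordinates, which is exactly what a small L-infinity perturbation exploits:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, s, eps = 30, 4000, 6.0, 0.15   # train size, dims, signal strength, L-inf budget

def sample(m):
    # Well-specified setting: the Bayes-optimal classifier is linear and
    # only needs coordinate 0, where all of the signal lives.
    y = rng.choice([-1.0, 1.0], size=m)
    x = rng.standard_normal((m, d))
    x[:, 0] += s * y
    return x, y

Xtr, ytr = sample(n)
Xte, yte = sample(1000)

# Minimum-L2-norm solution that fits the +/-1 training labels exactly
w = Xtr.T @ np.linalg.solve(Xtr @ Xtr.T, ytr)

clean_acc = np.mean(np.sign(Xte @ w) == yte)
# Worst-case L-inf attack on a linear classifier shrinks the margin by eps * ||w||_1
adv_acc = np.mean(yte * (Xte @ w) - eps * np.abs(w).sum() > 0)

# A classifier that uses only the signal coordinate barely notices the same attack
w_sig = np.zeros(d); w_sig[0] = 1.0
robust_adv_acc = np.mean(yte * (Xte @ w_sig) - eps * np.abs(w_sig).sum() > 0)

print(f"interpolator: clean acc {clean_acc:.2f}, adversarial acc {adv_acc:.2f}")
print(f"signal-only classifier: adversarial acc {robust_adv_acc:.2f}")
```

In this construction the interpolator typically gets high clean accuracy while the same small perturbation barely affects the classifier that uses only the signal coordinate.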
It then appears to me that the problem of incorrectly identifying latents is not a consequence of misspecification so much as a consequence of limitations of the data: either the data is not plentiful enough (which would cause problems even in a well-specified model), or the data is plentiful but not rich enough to identify the latents (which would happen only in a misspecified model).
Interesting connection to statistical learning theory, and it seems mostly right to me. Here's a similar but slightly different view.
One thing I have taken away from the double descent literature is that what is learned depends on priors/implicit biases as much as on the training data shown to the model.
And I think that could explain what is going on here. Gradient descent is known to have an implicit minimum-L2-norm bias (at least in simple settings such as overparameterized linear regression), so it is possible that the traits being subliminally learned are the ones that are in line with this bias.
For instance, if presented with the choice between the following two models,
θ1 = the teacher model, i.e. the model that agrees with the fine-tuning data and prefers owls
θ2 = a model that agrees with the fine-tuning data but disprefers owls
the gradient descent fine-tuning procedure would choose the one with the lower L2 norm. Since the teacher was also trained via GD with the same implicit bias, θ1 is likely the one with the smaller norm and will be chosen.
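To make the selection mechanism concrete, here is a minimal numpy sketch under a deliberately simplified assumption: the "models" are just linear parameter vectors and the "fine-tuning data" is a small random regression problem (nothing here is meant to model the actual experiments). Two parameter vectors fit the fine-tuning data equally well, but gradient descent lands on the lower-norm one:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 50                     # tiny "fine-tuning set", heavily overparameterized
X = rng.standard_normal((n, d))  # hypothetical fine-tuning inputs
y = rng.standard_normal(n)       # hypothetical fine-tuning targets

# theta1: the minimum-L2-norm parameters consistent with the fine-tuning data
theta1 = X.T @ np.linalg.solve(X @ X.T, y)

# theta2: also fits the data exactly, but carries an extra null-space component,
# so it has a larger norm while agreeing with the fine-tuning data just as well
v = rng.standard_normal(d)
v -= X.T @ np.linalg.solve(X @ X.T, X @ v)   # remove the part visible to the data
theta2 = theta1 + 3.0 * v

assert np.allclose(X @ theta1, y) and np.allclose(X @ theta2, y)

# Gradient descent on squared loss, started from zero for simplicity
w = np.zeros(d)
for _ in range(50_000):
    w -= 1e-3 * X.T @ (X @ w - y)

print("norms:", np.linalg.norm(theta1), np.linalg.norm(theta2))
print("GD distance to theta1:", np.linalg.norm(w - theta1))  # ~0: low-norm model wins
print("GD distance to theta2:", np.linalg.norm(w - theta2))  # large
```

In the linear case, if GD is started from a base initialization instead of zero, the analogous statement is that it converges to the data-consistent solution closest to that initialization, which is the closer analogue of fine-tuning a pretrained model.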
This perspective would also explain why the subliminal transfer happens with some models but not with others: with the gpt-4o architecture, maybe the min-L2-norm models prefer owls, while with some other architecture the min-L2-norm models might prefer dolphins.
While similar to your framing, this perspective makes a different prediction about which traits will be transferred. Suppose preferring owls were not a low-norm behavior but was instead heavily reinforced during training; my view would then predict that the trait would not be transferred.
So, regarding your final prediction, I would be less concerned about learning from a misbehaving model if the misbehavior was learned from the training data rather than selected by implicit biases.