Produced as part of MATS 6.0 and 6.1.
Hi Jacob, I really enjoyed this post, thank you!
However, it appears that the problem you describe, non-robust predictive performance, can also arise under a well-specified model with insufficient data. For instance, my recent paper (https://arxiv.org/abs/2109.13215) presents a toy example where a well-specified overparameterized interpolator may perform well on a classification task but poorly when the data is allowed to be adversarially perturbed.
Then, it appears to me that the problem of incorrectly identifying latents is not a...
Interesting relationship to statistical learning theory, and it seems mostly right to me. Here's a similar but slightly different view.
One thing I have taken away from the double descent literature is that what is learned depends as much on priors and implicit biases as on the training data shown to the model.
And I think that could explain what is going on here. Gradient descent is known to have an implicit bias toward minimum-L2-norm solutions, so it is possible that the traits being subliminally learned are the ones that are in line with th...
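
To make the min-L2-norm bias concrete, here is a minimal sketch in the simplest setting where it holds exactly: underdetermined linear least squares, with gradient descent started from zero. This is only an illustration, not the setup from the post; the function name `gd_least_squares`, the problem sizes, and the random seed are all assumptions chosen for the example. For nonlinear networks the analogous statement is much weaker and depends on the architecture and parameterization.

```python
import numpy as np

def gd_least_squares(X, y, steps=5000):
    """Plain gradient descent on 0.5 * ||X w - y||^2, started at w = 0.

    Hypothetical helper for illustration only.
    """
    n, d = X.shape
    w = np.zeros(d)  # zero init keeps every iterate in the row space of X
    # Step size 1/L, where L = sigma_max(X)^2 is the gradient's Lipschitz constant.
    lr = 1.0 / np.linalg.norm(X, 2) ** 2
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))   # 5 samples, 20 parameters: many interpolating solutions
y = rng.normal(size=5)

w_gd = gd_least_squares(X, y)
w_min_norm = np.linalg.pinv(X) @ y   # the minimum-L2-norm interpolator

# Largest coordinate-wise difference between the two solutions.
gap = np.max(np.abs(w_gd - w_min_norm))
print("max |w_gd - w_min_norm| =", gap)
print("training residual       =", np.linalg.norm(X @ w_gd - y))
```

Both solutions fit the training data exactly, but out of the infinitely many interpolators, gradient descent picks out the one with the smallest norm. The data alone does not determine that choice; the optimizer and its initialization do.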