Thanks for the very comprehensive review on generalisation!
As a historical note / broader context, the worry about model class over-expressivity has been there in the early days of Machine Learning. There was a mistrust of large blackbox models like random forest and SVM and their unusually low test or even cross-validation loss, citing ability of the models to fit noise. Breiman frank commentary back in 2001, "Statistical Modelling: The Two Cultures", touch on this among other worries about ML models. The success of ML has turn this worry into the generalisation puzzle. Zhang et. al. 2017 being a call to arms when DL greatly exacerbated the scale and urgency of this problem.
I agree with the post that approximation-generalisation-optimisation is very much entangled in this puzzle. The question should by why a model + training algorithm combo is likely to select generalising solutions among many other solutions (i.e. those with good training fit) and not just whether the best fit model generalise, whether particular optimisation algorithm can find it or how simple the model class is.
SLT gave the answer that models with very singular solutions + bayesian training will generalise very well. This still leaves open the questions of why DL models on "natural" datasets has very singular solutions and how much does the bayesian results tell us about DL models with SGD-type training.
Naive optimism: hopefully progress towards a strong resolution to the generalisation puzzle give us understanding enough to gain control on what kind of solutions are learned. And one day we can ask for more than generalisation, like "generalise and be safe".
Thanks for this sequence and great exposition!