This is a really sharp question, thank you. I think there's certainly a Bayesian story about neural networks you could tell here that would work exactly like this (in fact I believe there's some in-progress SLT work in this direction). I think there's a good chance this is like 80% of the story.
Interesting! Are you familiar with the paper "Deep neural networks have an inbuilt Occam’s razor", itself building on the work of "Input–output maps are strongly biased towards simple outputs" by Dingle et al.? I feel like it may be getting at something closely related.
Thanks, I think I see your point! Still, it seems important to be clear that the core claim -- that neural networks are biased toward learning functions with better generalization -- is ultimately a claim about the learned function. If I understand correctly, you’re offering one explanation for why this might happen: training induces an implicit search over “effective programs,” and that search has a simplicity bias.
I’m wondering whether your argument can be framed via an analogy to Solomonoff induction. In standard Solomonoff induction, among the function...
Thanks for this note, I really enjoyed reading it.
One terminological point: I'm a bit confused by your use of "program" / "effective program", especially where you contrast "programs" with "neural networks". In algorithmic information theory, a program is any string that computes a function on a UTM. Under that definition, a particular neural network (topology + weights, plus maybe a short 'interpreter') already is a program. The fact that weights are often modeled as continuous, while UTM inputs are discrete, doesn't seem like a key difference, since weig...
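To make the "a network already is a program" point concrete, here's a minimal sketch (my own illustration, not from the original comment) of the discretization step: quantizing each weight to k bits turns a continuous parameter vector into a finite bitstring whose length upper-bounds the network's description length, up to the O(1) cost of the fixed interpreter.

```python
import numpy as np

def quantize_weights(weights, bits=8):
    """Quantize continuous weights onto a finite grid of 2**bits levels.

    After quantization the whole network is specified by a finite
    bitstring (grid indices), i.e. a 'program' in the AIT sense,
    plus an O(1)-length interpreter that runs the architecture.
    Returns the dequantized weights and the description length in bits.
    """
    lo, hi = weights.min(), weights.max()
    levels = 2 ** bits
    # nearest grid index for each weight
    idx = np.round((weights - lo) / (hi - lo) * (levels - 1)).astype(int)
    dequantized = lo + idx / (levels - 1) * (hi - lo)
    description_length = weights.size * bits  # excludes the O(1) interpreter
    return dequantized, description_length

rng = np.random.default_rng(0)
w = rng.normal(size=100)
w_q, dl = quantize_weights(w, bits=8)
print(dl)  # 800 bits for 100 weights at 8-bit precision
```

The quantization error per weight is at most half a grid step, so in practice the discretized network computes (nearly) the same function, which is why the continuous/discrete distinction doesn't block the "network = program" reading.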
Alex, thanks for the welcome (happy to be here!) and the summary.
I'm generally familiar with this line of thought. My main comment is that the Solomonoff perspective feels somewhat opposite to the usual NFL/“inductive bias” story (where the claim is that a good image model needs a good prior over image statistics, etc.). Yes, a simplicity bias is a kind of inductive bias, but it’s supposed to be universal (domain independent). And if generic architectures really give us this prior “for free”, as suggested by results like Dingle et al., then it seems the ha...
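For readers who want to poke at the "prior for free" claim themselves, here's a minimal sketch in the spirit of the Dingle et al. / "inbuilt Occam's razor" methodology (the specific architecture and sample sizes are my own illustrative choices): sample random parameters for a tiny MLP and tally which boolean function each draw computes. A strong simplicity bias would show up as a very non-uniform distribution over the possible functions.

```python
import numpy as np
from collections import Counter

def random_function_samples(n_samples=2000, n_inputs=3, hidden=8, seed=0):
    """Sample random MLP parameters and record the boolean function
    (its truth table over all 2**n_inputs inputs) each draw computes.

    If the parameter->function map were unbiased, each of the
    2**(2**n_inputs) possible functions would appear about equally often;
    a bias toward simple functions shows up as heavy concentration on
    a few truth tables (e.g. constant functions).
    """
    rng = np.random.default_rng(seed)
    # all binary input vectors, shape (2**n_inputs, n_inputs)
    X = np.array([[(i >> b) & 1 for b in range(n_inputs)]
                  for i in range(2 ** n_inputs)], dtype=float)
    counts = Counter()
    for _ in range(n_samples):
        W1 = rng.normal(size=(n_inputs, hidden))
        b1 = rng.normal(size=hidden)
        W2 = rng.normal(size=hidden)
        b2 = rng.normal()
        out = (np.tanh(X @ W1 + b1) @ W2 + b2) > 0
        counts["".join("1" if o else "0" for o in out)] += 1
    return counts

counts = random_function_samples()
print(counts.most_common(5))  # the head of the distribution over truth tables
```

This is only a toy probe, of course; the interesting empirical question is whether the concentration tracks a complexity measure of the truth tables, which is what the cited papers measure.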