Artemy Kolchinsky

This is a really sharp question, thank you. I think there's certainly a Bayesian story about neural networks you could tell here that would work exactly like this (in fact, I believe there's some in-progress singular learning theory (SLT) work in this direction). I think there's a good chance this is like 80% of the story.
Interesting! Are you familiar with the paper "Deep neural networks have an inbuilt Occam's razor", which itself builds on "Input–output maps are strongly biased towards simple outputs" by Dingle et al.? I feel like it may be getting at something closely related.
Thanks, I think I see your point! Still, it seems important to be clear that the core claim -- that neural networks are biased toward learning functions that generalize well -- is ultimately a claim about the learned function. If I understand correctly, you're offering one explanation for why this might happen: training induces an implicit search over "effective programs", and that search has a simplicity bias.
I'm wondering whether your argument can be framed via an analogy to Solomonoff induction. In standard Solomonoff induction, among the functions consistent with the training data, a function's weight is proportional to the probability that it is computed by a universal Turing machine fed with random bits (for prefix-free machines, this yields a bias toward shorter programs). Are you thinking of something analogous for neural networks: among functions consistent with the data, a function's weight is proportional to the probability that it is computed by a neural-network interpreter when fed with random weights?
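Concretely, the correspondence I have in mind (my own notation, just to pin down the parallel) is something like:

$$m(f) \;=\; \sum_{p \,:\, U(p) = f} 2^{-|p|} \qquad \longleftrightarrow \qquad P(f) \;=\; \int_{\{\theta \,:\, N_\theta = f\}} p(\theta)\, d\theta,$$

where $U$ is a universal prefix machine, $N_\theta$ is the function computed by a fixed architecture with weights $\theta$, and $p(\theta)$ is the initialization distribution. In both cases, prediction would then amount to conditioning this prior on consistency with the training data.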
Thanks for this note -- I really enjoyed reading it.
One terminological point: I'm a bit confused by your use of "program" / "effective program", especially where you contrast "programs" with "neural networks". In algorithmic information theory, a program is any string that computes a function on a UTM. Under that definition, a particular neural network (topology + weights, plus maybe a short 'interpreter') already is a program. The fact that weights are often modeled as continuous, while UTM inputs are discrete, doesn't seem like a key difference, since weights are always discretized in practice.
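To make that concrete, here's a toy sketch (the encoding and quantization scheme are arbitrary choices of mine): a discretized network is literally a finite bitstring, which a fixed interpreter maps to a function, so its length upper-bounds the description length of that function relative to the interpreter.

```python
def serialize_network(layer_sizes, weights, bits_per_weight=16):
    """Encode (topology + quantized weights) as a bitstring.

    Together with a fixed interpreter (the forward pass), this bitstring
    is a 'program' in the AIT sense: its length upper-bounds the
    description length of the function the network computes. The
    quantization scheme here is a toy choice for illustration.
    """
    bits = []
    # Topology: number of layers, then each layer width (8 bits apiece).
    bits.append(format(len(layer_sizes), "08b"))
    for n in layer_sizes:
        bits.append(format(n, "08b"))
    # Weights: uniform quantization of [-1, 1] to bits_per_weight bits.
    levels = 2 ** bits_per_weight
    for w in weights:
        q = min(levels - 1, max(0, int((w + 1.0) / 2.0 * (levels - 1))))
        bits.append(format(q, f"0{bits_per_weight}b"))
    return "".join(bits)

# A 2-3-1 network (9 weights, biases omitted): 4*8 + 9*16 = 176 bits.
prog = serialize_network([2, 3, 1], [0.1, -0.5, 0.3, 0.7, -0.2, 0.9, 0.0, 0.4, -0.8])
print(len(prog), "bits")
```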
So, I suspect your discussion of "finding short programs" is more about "finding simple functions", i.e., low Kolmogorov-complexity functions with...
Alex, thanks for the welcome (happy to be here!) and the summary.
I'm generally familiar with this line of thought. My main comment is that the Solomonoff perspective feels somewhat opposite to the usual no-free-lunch (NFL) / "inductive bias" story (where the claim is that a good image model needs a good prior over image statistics, etc.). Yes, a simplicity bias is a kind of inductive bias, but it's supposed to be universal (domain-independent). And if generic architectures really give us this prior "for free", as suggested by results like Dingle et al., then it seems the hard part isn't the prior, but rather being able to sample from it conditioned on low training error...
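The conceptual baseline for "sampling from the prior conditioned on low training error" is rejection sampling over random initializations. A minimal sketch (hopelessly inefficient, but it pins down the object that training would be approximating under this view; the tiny MLP and Gaussian prior are stand-in choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_net(n_in=2, n_hidden=8):
    """Sample a small MLP classifier from a Gaussian 'prior' over weights."""
    W1 = rng.normal(size=(n_in, n_hidden))
    b1 = rng.normal(size=n_hidden)
    W2 = rng.normal(size=n_hidden)
    b2 = rng.normal()
    def f(X):
        h = np.tanh(X @ W1 + b1)
        return (h @ W2 + b2 > 0).astype(int)
    return f

def sample_conditioned_prior(X_train, y_train, max_tries=100_000):
    """Rejection sampling: draw nets from the prior and keep the first
    that exactly fits the training data. The resulting distribution over
    *functions* is the prior conditioned on zero training error."""
    for _ in range(max_tries):
        f = random_net()
        if np.array_equal(f(X_train), y_train):
            return f
    raise RuntimeError("no exact fit found within the budget")

# Toy data: XOR on four points.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])
f = sample_conditioned_prior(X, y)
print(f(X))  # matches y by construction (if a fit was found)
```

As I understand them, the Dingle et al. and "inbuilt Occam's razor" results are roughly claims about the distribution this kind of procedure samples from.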