The short answer is: no, but hopefully in the future. The long answer is: oh man, that was the post(s) I originally planned to write; this post started as something of a preface that I thought would take a few hours, and it took >100, so... we'll see if I can muster the effort. For now, you might get something out of looking into singular learning theory, and possibly the papers I referenced in the post, though I'm very much throwing you abruptly into the deep end here (apologies).
Yeah, I've seen those! They do express similar ideas, and I think Chris does serious work. I completely agree with the claim "the parameter-function map is strongly biased towards simple functions," basically for SLT reasons (though one has to be careful here to avoid saying something tautological, it's not as simple as "SLT says learning is biased towards lower LLC solutions, therefore it has a simplicity bias").
There's a good blog post discussing this work that I read a few years ago. Keeping in mind that this was about three years back, so my take might be unfair, I remember my opinion on that specific paper being something along the lines of "the ideas are great, the empirical evidence seems okay, and I don't feel great about the (interpretation of the) theory." In particular I remember reading this comment thread and agreeing with the perspective of interstice, for whatever that's worth. But I could change my mind.
Agreed. I guess the small saving grace (as I'm sure you know) is that the TM under the hood is used to limit the computational power of the family rather than contribute to it. But yeah that part is kind of ugly.
Still, it seems important to be clear that the core claim -- that neural networks are biased toward learning functions with better generalization -- is ultimately a claim about the learned function.
If I understand correctly, I completely agree, with one small nit: "generalization" isn't really a property of a function in isolation - it's a property of a learning setup. A function just is what it is; it doesn't generalize or fail to generalize.
But if I understand what you mean correctly, you're saying something like: "neural networks are biased toward learning functions that perform well on real-world tasks." I think this is a very important point that should be emphasized. This requires two things to work together: (1) the learner has a simplicity bias, and (2) real-world tasks preferentially admit simple solutions. Neither alone is enough.
And I think you're pointing to the fact that (2) still needs explanation, which is true. It's ultimately a property of the real world, not of neural networks, so you need some kind of model-independent notion of "simplicity" like Kolmogorov complexity to make this even coherent. I would 100% agree. There's more I could say here, but it rapidly gets speculative and probably warrants a longer discussion.
In standard Solomonoff induction, among the functions consistent with the training data, a function’s weight is proportional to the probability that it is computed by a universal Turing machine fed with random bits (for prefix-free machines, this yields a bias toward shorter prefixes). Are you thinking of something analogous for neural networks: among functions consistent with the data, a function’s weight is proportional to the probability that it is computed by a neural-network interpreter when initialized with random weights?
This is a really sharp question, thank you. I think there's certainly a Bayesian story of neural networks you could tell here which would work exactly like this (in fact I believe there's some in-progress SLT work in this direction). I think there's a good chance this is like 80% of the story.
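Spelling out the parallel explicitly (my notation, and very much a sketch of that Bayesian story rather than a theorem - here $F$ is the architecture's parameter-function map and $\varphi$ the initialization distribution over weights):

$$
P_{\text{Solomonoff}}(f) \;\propto \sum_{p \,:\, U(p) = f} 2^{-|p|},
\qquad
P_{\text{NN}}(f) \;\propto \int_{\{w \,:\, F(w) = f\}} \varphi(w)\, \mathrm{d}w,
$$

with both priors then conditioned on consistency with the training data.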
But there are some ways we know SGD must be different. For instance, Bayes should give pseudorandom functions some nontrivial prior weight (they exist in parameter space, they can be hardcoded into the parameters), but SGD will basically never find them (see my discussion on Mixa's comment). So I think ultimately you need to talk about learning dynamics. Here is also a place where there's a lot more I could say (I think there's decent evidence as to the ways SGD differs in the programs it prefers), but it's a longer discussion.
Yeah, good catch - the correct notion here is a (uniform) family of Boolean circuits, which are Turing-complete in the sense that every uniform circuit family decides a decidable language, and every decidable language can be decided by a uniform circuit family. (Usefully, they also preserve complexity theory: for example, a language is in P if and only if it can be decided by a P-uniform polynomial-size circuit family.)
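As a toy illustration of the "uniform" part (my own sketch, unrelated to anything in the post): uniformity just means a single algorithm produces the circuit C_n for every input size n. Here the family decides PARITY, represented crudely as a list of XOR gates over wire indices.

```python
from functools import reduce
from operator import xor

def build_parity_circuit(n):
    """The 'uniform' part: one generator outputs the circuit C_n for every input size n."""
    gates, acc = [], 0                   # wire `acc` holds the running parity
    for i in range(1, n):
        out_wire = n + len(gates)        # allocate a fresh wire for this gate's output
        gates.append(("XOR", acc, i, out_wire))
        acc = out_wire
    return gates, acc                    # gate list plus the index of the output wire

def evaluate(circuit, bits):
    gates, out = circuit
    wires = dict(enumerate(bits))        # wires 0..n-1 carry the input bits
    for _, a, b, o in gates:
        wires[o] = wires[a] ^ wires[b]
    return wires[out]

bits = [1, 0, 1, 1]
assert evaluate(build_parity_circuit(len(bits)), bits) == reduce(xor, bits)
```

The point is that the map n -> C_n is itself computable, which is exactly what lets uniform circuit families track the computability and complexity statements above.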
Hey Artemy, glad you enjoyed the post!
Thanks for this comment - I think this is a pretty crucial but subtle clarification that I had a hard time conveying.
You're right that there's some tension here. A neural network already trivially is a program. And you're right that in Solomonoff induction, "short program" and "simple function" collapse together by definition.
But I think the reformulation to "finding simple functions" loses something crucial. The key is that there's a three-layer picture, not two:
Parameters -> Effective Program -> Function
I tried to gesture at this in the post:
"The network's architecture trivially has compositional structure—the forward pass is executable on a computer. That's not the claim. The claim is that training discovers an effective program within this substrate. Think of an FPGA: a generic grid of logic components that a hardware engineer configures into a specific circuit. The architecture is the grid; the learned weights are the configuration."
As in, an FPGA configuration doesn't directly specify a function - it specifies a circuit, which then computes a function. All configurations have the same bit-length, but the circuits they implement vary enormously in complexity. The effective circuit can be much simpler than the full substrate, with dead logic blocks, redundant paths, etc. left unused.
Now, when it comes to neural networks, all networks in an architecture class have the same parameter count, as you note. But through degeneracies (dead neurons, low-rank weights, redundant structure, and a whole lot more), the "effective program" can be far simpler than the parameter count suggests.
Now, why does this middle layer matter? Why not just talk about functions directly?
Because there's no notion of function simplicity that doesn't route through implementations. To say a function is "simple" (low Kolmogorov complexity) is to say there exists a short program computing it. But "exists" quantifies over all possible programs. Gradient descent doesn't get access to that - it only sees the implementation it actually has, and the immediate neighborhood in parameter space. Two programs computing the same function aren't interchangeable from the learner's perspective; only one of them is actually in front of you, and that's what determines where you can go next.
This means two parameter configurations that implement literally the same function everywhere can still behave completely differently under training. They sit at different points in the loss landscape. They have different stability properties. They're differently accessible from other regions of parameter space. The learning process distinguishes them even though the function doesn't. (The concrete mechanism is the parameter-function map; the loss is the same if the function is the same, but the gradients are not.)
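To make that concrete, here's a minimal numerical sketch (mine, with toy dimensions): a two-layer linear network, where rescaling the hidden layer gives a second parameter configuration that computes exactly the same function yet receives different gradients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer linear "network": f(x) = B @ (A @ x).
# For any invertible S, the reparameterization (S @ A, B @ inv(S))
# computes exactly the same function everywhere.
d, h = 3, 4
A = rng.normal(size=(h, d))
B = rng.normal(size=(1, h))
S = np.diag([10.0, 0.1, 1.0, 1.0])       # an invertible rescaling of the hidden layer
A2, B2 = S @ A, B @ np.linalg.inv(S)

x = rng.normal(size=(d,))
y = 1.0                                   # arbitrary regression target

def loss_and_grads(A, B, x, y):
    """Squared loss and its gradients with respect to the parameters."""
    err = (B @ (A @ x)).item() - y
    grad_B = 2 * err * (A @ x)[None, :]   # dL/dB
    grad_A = 2 * err * B.T @ x[None, :]   # dL/dA
    return err ** 2, grad_A, grad_B

loss1, gA1, gB1 = loss_and_grads(A, B, x, y)
loss2, gA2, gB2 = loss_and_grads(A2, B2, x, y)

print(np.isclose(loss1, loss2))                       # True: same function, same loss
print(np.allclose(gA1, gA2), np.allclose(gB1, gB2))   # False False: different gradients
```

Same function, same loss, different gradients - which is all SGD ever gets to see.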
This is why I actually think it's important to stick with "finding simple (effective) programs" over "finding simple functions," as the distinction really does matter.
Now of course, making precise what "effective program" even means is not obvious at all. That's where I start talking about singular learning theory :)
Yes, this is a great paper, and one of the first papers that put me on to the depth separation literature. Can definitely recommend.
Re: "the explanation of how the hierarchical nature of physical processes mean that the subset of functions we care about will tend to have a hierarchical structure and so be well-suited for deep networks to model," I think this is a fascinating topic, and there's much to be said here. My personal view here is that this question (but not the answer) is essentially equivalent to the physical Church-Turing thesis: somehow, reality is something that can universally be well-described by compositional procedures (i.e. programs). Searching around for "explanations of the physical Church-Turing thesis" will point you to a wider literature in physics and philosophy on the topic.
"Wait, you mean many people don't already see things this way?"
I wish this were obvious :). But it's 2026 and we're still getting op-eds from professors arguing that LLMs are "stochastic parrots" or "just shallow pattern-matching." I wish I could tell laypeople "the scientific consensus is that this is wrong," but I can't because there isn't one. The scaling hypothesis was a fringe position five years ago and remains controversial today. "No free lunch" and "deep learning will just overfit" were standard objections until embarrassingly recently, and you'll still hear them occasionally.
If (a tractable approximation to) the provably optimal learning algorithm isn't "real learning," I don't know what would be. Yet clearly many smart people don't believe this. And this leads to seriously different expectations about where the field will go: if deep learning is implementing something like Solomonoff induction, the default expectation shifts toward continued scaling - not because scaling must work (efficiency limits, data constraints, a thousand practical issues could intervene), but because there's no in-principle reason to expect a hard ceiling. That's something people still dispute, loudly.
That being said, as I mention at the beginning, these ideas aren't new or original to me, and some people do believe versions of these sorts of claims. Let me crudely break it down into a few groups:
Average ML practitioner / ICLR attendee: The working assumption is still closer to "neural networks do black-box function approximation," maybe with vague gestures toward "feature learning" or something. They may be aware of mechanistic interpretability but haven't integrated it into their worldview. They're usually only dimly aware of the theoretical puzzles if at all (why doesn't the UAT construction work? why does the same network generalize on real data and memorize random labels?).
ML theory community: Well ... it depends what subfield they're in, knowledge is often pretty siloed. Still, bits and pieces are somewhat well-known by different groups, under different names: things like compositionality, modularity, simplicity bias, etc. For instance, Poggio's group put out a great paper back in 2017 positing that compositionality is the main way that neural networks avoid the curse of dimensionality in approximation theory. Even then, I think these often remain isolated explanations for isolated phenomena, rather than something like a unified thesis. They're also not necessarily consensus - people still propose alternate explanations to the generalization problem in 2026! The idea of deep learning as a "universal" learning algorithm, while deeply felt by some, is rarely stated explicitly, and I expect would likely receive substantial pushback ("well deep learning performs badly in X scenario where I kneecapped it"). Many know of Solomonoff induction but don't think about it in the context of deep learning.
Frontier labs / mechanistic interpretability / AIT-adjacent folks: Here, "deep learning is performing something like Solomonoff induction" wouldn't make anyone blink, even if they might quibble with the details. It's already baked into this worldview that deep learning is some kind of universal learning algorithm, that it works by finding mechanistic solutions, that there are few in-principle barriers. But even in this group, many aren't aware of the totality of the evidence - e.g. I talked to an extremely senior mech interp researcher who wasn't aware of the approximation theory / depth separation evidence. And few will defend the hypothesis in public. Even among those who accept the informal vibes, far fewer actively think the connection could possibly be made formal or true in any precise sense.
So: the post isn't claiming novelty for the hypothesis (I say this explicitly at the start). It's trying to put the evidence in one place, say the thing outright in public, and point toward formalization. If you're already in the third group, much of it will read as elaboration. But that's not where most people are.
Thank you for bringing this up, the story of learning dynamics is a fascinating one and something I didn't get the time to treat properly in the post. There really is (at least in theory) quite a substantial gap between Bayesian learning (what Solomonoff uses) and SGD learning.
A favorite example of mine is the task of learning pseudorandom functions: these cannot be learned in polynomial time, else you could distinguish pseudorandom numbers from truly random numbers and break cryptography. So because SGD training runs in polynomial time, it’s impossible for SGD training to find a pseudorandom function. Bayesian learning (which does not run in polynomial time), on the other hand, would find the pseudorandom function quite quickly, because you can literally hardcode a pseudorandom generator into the weights of a sufficiently large neural network (Bayes is sort of “omnipotent” over the entire parameter space).
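For concreteness, the reduction here is the standard cryptographic one (my compressed phrasing, taking Boolean outputs for simplicity): suppose a polynomial-time learner, trained on samples $(x_i, f(x_i))$, outputs a hypothesis $h$ satisfying

$$
\Pr_x\big[h(x) = f(x)\big] \;\ge\; \tfrac{1}{2} + \epsilon
$$

on unseen inputs whenever $f$ is drawn from the pseudorandom family. Against a truly random function, no learner can beat $\tfrac{1}{2}$ on unseen inputs. So "train the learner on oracle samples, then test $h$ on fresh oracle queries" is a polynomial-time distinguisher with advantage about $\epsilon$, contradicting pseudorandomness.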
As you mention, learning parities is another great example of a task which is easy for Bayes but hard for SGD (though the reason is different from pseudorandom functions). There is a rich literature here, including the paper you linked by Barak, which is a favorite of mine. Another personal favorite is the later paper on leap complexity, which shows how the structure of these computational obstacles may give SGD learning a "hysteresis" or "history-dependent" effect - learning earlier things shapes what you learn later, something which doesn't happen at all for Bayes.
Great questions, thank you!
Yes, this is correct, because (as I explain briefly in the "search problem" section) the loss function factors as a composition of the parameter-function map and the function-loss map, so by the chain rule you'll always get zero gradient in degenerate directions. (And if you're near-degenerate, you'll get near-zero gradients, proportional to the degree of degeneracy.) So SGD will find it hard to get unstuck from degeneracies.
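Schematically (my notation, and pretending function space is finite-dimensional so the derivatives are ordinary ones): writing the training loss as $L(w) = \ell(F(w))$, where $F$ is the parameter-function map and $\ell$ the function-loss map, the chain rule gives

$$
\nabla_w L(w) \;=\; DF(w)^{\top}\, \nabla \ell(F(w)),
$$

so any direction $v$ with $DF(w)\,v = 0$ (a degenerate direction: moving along it doesn't change the implemented function to first order) satisfies $v^{\top} \nabla_w L(w) = \langle DF(w)\,v,\; \nabla \ell(F(w)) \rangle = 0$, and if $\|DF(w)\,v\|$ is merely small, the directional derivative is correspondingly small.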
This is in fact how I think degeneracies affect SGD, but it's worth mentioning that it isn't the only possible mechanism. For instance, degeneracies also affect Bayesian learning (this is precisely the classical SLT story!), where there are no gradients at all. The reason is that (by the same chain-rule logic) the loss function is "flatter" around degeneracies, which creates an implicit prior by dedicating more of parameter space to more degenerate solutions. Both mechanisms push in the same direction, creating an inductive bias favoring more degenerate solutions.
This is basically correct, and you can make this precise. The intuition is that robustness to parameter perturbation corresponds to simplicity because many parameter configurations implement the same effective computation - the solution doesn't depend on fine-tuning.
Singular learning theory measures "how degenerate" your loss function is using a quantity called the local learning coefficient, which you can view as a kind of "robustness to perturbation" measure. Section 3.1 of my paper explains some of the intuition here. In Bayesian learning, this is basically the unambiguously "correct" measure of (model) complexity, because it literally determines the generalization error / free energy to leading order (Main Formula II, Watanabe 2009).
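For reference, the statement I have in mind is, schematically and from memory (so check Watanabe for the precise conditions and error terms):

$$
F_n \;\approx\; n \hat{L}_n \;+\; \lambda \log n,
\qquad
\mathbb{E}[G_n] \;\approx\; \frac{\lambda}{n},
$$

where $F_n$ is the Bayes free energy, $G_n$ the Bayes generalization error, $\hat{L}_n$ the empirical loss of the best fit, and $\lambda$ the learning coefficient. Lower $\lambda$ (more degenerate, more perturbation-robust solutions) means lower free energy and lower generalization error to leading order.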