This post summarizes a collection of Singular Learning Theory (SLT) experiments I ran during the fall of 2025. It assumes some familiarity with SLT, namely its famous asymptotic relating the free energy to the local learning coefficient (LLC). For a more in-depth description of these experiments, please see the arXiv preprint here. Please feel free to email me with any questions or feedback.
SLT provides an elegant framework for understanding phase transitions in modern neural networks through a physics-inspired Bayesian learning perspective. In particular, SLT introduces two meaningful quantities: the model free energy and the local learning coefficient (LLC) $\lambda$, a measure of model complexity. IMHO, the formal definitions of these two quantities are significantly more complex and unintuitive than Statistical Learning Theory's trademark quantities such as VC dimension or Rademacher complexity. Given SLT's rise in AI safety research, better understanding the behavior of these two quantities is increasingly important.
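For reference, and stated informally (hiding the regularity conditions), the asymptotic in question expands the Bayesian free energy at sample size $n$ around a local optimum $w^\ast$ as

$$F_n \approx n L_n(w^\ast) + \lambda \log n,$$

where $L_n$ is the empirical loss and $\lambda$ is the (local) learning coefficient; see the paper linked in the footnote for a precise statement.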
This work leaned heavily on Nina Panickssery and Dmitry Vaintrob's experiments in their hackathon project.
Our first experiment tests the behavior of the free energy, taking inspiration from Arrhenius reaction rate theory. In particular, that theory tells us that the time it takes for reactants to transform into products in a chemical reaction is exponential in the energy barrier. Now suppose we have a phase transition (e.g. grokking) starting at time $t_1$ and ending at time $t_2$. We posit the following hypothesis on the time-to-transition $\Delta t := t_2 - t_1$:

$$\Delta t \propto e^{\beta \, \Delta F},$$
where $\Delta F$ is the estimated change in free energy across the transition and $\beta$ is an effective inverse temperature dependent on global hyperparameters (learning rate, batch size, etc.). We test this hypothesis on two experimental setups, detailed below. For all experiments, we used Lau et al. (2023)'s SGLD-MCMC approach to estimate the local learning coefficient.
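As a reference for how such estimates are produced, here is a minimal sketch of an SGLD-based LLC estimator in the style of Lau et al. (2023). The step size, localization strength, and chain length below are illustrative assumptions rather than the settings used in these experiments, and the sketch uses full-batch gradients for simplicity.

```python
import copy
import math
import torch

def estimate_llc(model, loss_fn, inputs, targets,
                 eps=1e-4, gamma=100.0, num_steps=2000, burn_in=500):
    """Sketch of the SGLD-based LLC estimator: run SGLD on the tempered,
    localized posterior around the trained parameters w*, then use
    lambda_hat = n * beta * (E_sgld[L_n(w)] - L_n(w*))."""
    n = len(inputs)
    beta = 1.0 / math.log(n)                 # inverse temperature beta* = 1 / log n
    w_star = [p.detach().clone() for p in model.parameters()]

    with torch.no_grad():
        loss_star = loss_fn(model(inputs), targets).item()

    sampler = copy.deepcopy(model)           # SGLD chain initialized at w*
    sampled_losses = []
    for step in range(num_steps):
        loss = loss_fn(sampler(inputs), targets)
        sampler.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p, p0 in zip(sampler.parameters(), w_star):
                # drift toward lower loss and back toward w*, plus Gaussian noise
                drift = -0.5 * eps * (n * beta * p.grad + gamma * (p - p0))
                p.add_(drift + math.sqrt(eps) * torch.randn_like(p))
        if step >= burn_in:
            sampled_losses.append(loss.item())

    mean_loss = sum(sampled_losses) / len(sampled_losses)
    return n * beta * (mean_loss - loss_star)
```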
For this experimental setup, we use the same neural network architecture and task presented in these experiments. Put simply, the task is for the neural network to learn a fixed function of pairs of inputs. We trained 500 models (each using 40% of the possible input pairs) and, for every model that grokked, estimated its change in free energy $\Delta F$ and time-to-grok $\Delta t$. In total, around 34% (168/500) of the models we trained grokked, yielding the following plot:
(Note the log scale on the horizontal axis.) While the linear fit is weak, it does point in the expected downward direction. As such, I wanted to test the same hypothesis on another experimental setup.
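For concreteness, the fit itself amounts to a simple linear regression of $\log \Delta t$ against $\Delta F$. The snippet below sketches this, with synthetic placeholder values standing in for the $(\Delta F, \Delta t)$ pairs collected from the grokked runs.

```python
import numpy as np

# Placeholder data purely so the snippet runs; in the experiment these arrays
# would hold one (delta_F, delta_t) pair per grokked model.
rng = np.random.default_rng(0)
delta_F = rng.normal(size=168)
delta_t = np.exp(1.5 * delta_F + rng.normal(scale=0.5, size=168)) * 1e3

# Under the Arrhenius-style hypothesis, log(delta_t) is linear in delta_F,
# with the slope playing the role of an effective inverse temperature.
slope, intercept = np.polyfit(delta_F, np.log(delta_t), deg=1)
print(f"fitted slope (effective inverse temperature): {slope:.3f}")
```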
This setup was largely similar to the previous one, except that a single model run can now have multiple phase transitions (as opposed to grokking, where there is just one). We trained 60 models and, for each run, used a custom method to detect all transitions present (the number of which varied per run). The resulting plot, analogous to the one from the first experiment, is shown below.
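The custom detection method is not described here; purely as an illustration of the general idea, the heuristic below flags a candidate transition whenever the test loss drops by a large fraction within a short window. This is an assumption-laden stand-in, not the method actually used.

```python
import numpy as np

def detect_transitions(test_loss, window=50, rel_drop=0.3):
    """Flag step indices where test loss falls by more than `rel_drop`
    (as a fraction of its current value) over the next `window` steps."""
    test_loss = np.asarray(test_loss, dtype=float)
    hits = []
    for t in range(len(test_loss) - window):
        before, after = test_loss[t], test_loss[t + window]
        if before > 0 and (before - after) / before > rel_drop:
            # skip steps that fall inside an already-flagged transition
            if not hits or t - hits[-1] > window:
                hits.append(t)
    return hits
```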
These results were clearly not what we were expecting. Two things to note:
Given how variable this experiment was and due to time constraints, we unfortunately had to move on to the second research question.
The purpose of this research question was to gain intuition for the local learning coefficient and how its measure of complexity scales with problem difficulty across a few simple hypothesis classes. This line of work was again inspired by these experiments.
Consider some instance space $\mathcal{X} \subseteq \mathbb{R}$ and define a polynomial regressor of degree $d$ to be a model of the form $f_w(x) = \sum_{i=0}^{d} w_i x^i$, where each $w_i \in \mathbb{R}$. Then, for a given degree $d$ and instance space $\mathcal{X}$, we generated a 500-sample dataset from a degree-$d$ polynomial on $\mathcal{X}$ and trained 10 polynomial regressors to convergence. We then measured the LLC across these models to obtain estimated means and confidence intervals. A minimal code sketch of this setup is given below, and our results for various choices of instance space are shown in the plot that follows:
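The following sketch mirrors the procedure just described; the target-polynomial distribution, noise level, optimizer, and training length are illustrative assumptions, with only the overall shape (bounded instance space, degree-$d$ data, degree-$d$ regressor trained to convergence) carried over from the setup above.

```python
import torch

def make_dataset(degree, x_max, n=500, noise=0.01):
    """Sample x uniformly from the instance space [-x_max, x_max] and generate
    targets from a random degree-`degree` polynomial plus small noise."""
    x = torch.empty(n).uniform_(-x_max, x_max)
    powers = torch.stack([x**i for i in range(degree + 1)], dim=1)  # (n, degree+1)
    true_w = torch.randn(degree + 1)
    y = powers @ true_w + noise * torch.randn(n)
    return powers, y

def fit_regressor(powers, y, steps=5000, lr=1e-2):
    """Fit the coefficients of a same-degree polynomial regressor with Adam."""
    w = torch.zeros(powers.shape[1], requires_grad=True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = torch.mean((powers @ w - y) ** 2)
        loss.backward()
        opt.step()
    return w.detach()

# e.g. one run: a degree-5 regressor on the instance space [-1, 1]; the LLC at
# the fitted coefficients would then be estimated with the SGLD sampler above.
X, y = make_dataset(degree=5, x_max=1.0)
w_hat = fit_regressor(X, y)
```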
While seemingly simple, these results are actually meaningful. First, observe that polynomial regressors are regular models, so SLT predicts their LLC to scale linearly with degree: $\lambda = (d+1)/2$, i.e. half the number of parameters. This prediction overestimates the empirical results. One explanation is that the constrained instance space induces singularities, that is, parameter values $w$ at which the Fisher information matrix $I(w)$ is degenerate ($\det I(w) = 0$). These singularities would lower the LLC (as demonstrated by the sublinear behavior above). Moreover, more constrained domains would yield more singularities and lower LLCs, which is also consistent with the plot above.
This matters because singularities enable generalization. So the upshot of these results is that on practically constrained domains (e.g. images with pixel values in 0–255), models may have more generalization power than we think.
We define a simple linear neural network as $f_{A,B}(x) = ABx$, where $A \in \mathbb{R}^{n \times r}$ and $B \in \mathbb{R}^{r \times n}$ for constants $n, r$. Then $\mathrm{rank}(AB) \le r$, and moreover this model is a priori full of singularities, since for any invertible $C \in \mathbb{R}^{r \times r}$ we have $(AC)(C^{-1}B) = AB$. For our experiments, we fix $n$ and vary the rank $r$ as the problem-difficulty knob; a quick numerical check of the reparametrization symmetry is given below, followed by the resulting plot:
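The check, with arbitrary matrix sizes and in double precision so the comparison is exact up to rounding:

```python
import torch

n, r = 8, 3
A = torch.randn(n, r, dtype=torch.double)
B = torch.randn(r, n, dtype=torch.double)
C = torch.randn(r, r, dtype=torch.double) + 3 * torch.eye(r, dtype=torch.double)  # invertible
x = torch.randn(n, dtype=torch.double)

# (A C)(C^{-1} B) x equals A B x: reparametrized weights compute the same function.
print(torch.allclose(A @ B @ x, (A @ C) @ (torch.linalg.inv(C) @ B) @ x))  # True
```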
A posteriori, this result matches SLT expectations closely. Specifically, it is a standard fact from algebraic geometry that the $n \times n$ matrices of rank $r$ form a smooth manifold of dimension $r(2n - r)$. Hence the predicted complexity is quadratic in $r$, which closely matches the quadratic fit we report in the plot above.
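For completeness, the dimension count follows from the reparametrization symmetry above: the pair $(A, B)$ has $2nr$ parameters, and the invertible matrices $C$ acting by $(A, B) \mapsto (AC, C^{-1}B)$ sweep out an $r^2$-dimensional fiber of functionally equivalent parameters, leaving

$$nr + rn - r^2 = r(2n - r)$$

effective dimensions, which is quadratic in $r$ for fixed $n$.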
This final experiment is very similar to the previous one, again using the rank $r$ as the problem-difficulty knob. Specifically, we generated each sample by applying a fixed rank-$r$ linear map $A \in \mathbb{R}^{n \times r}$ to a random latent vector $z \in \mathbb{R}^r$ (so the data lies on an $r$-dimensional subspace of $\mathbb{R}^n$), and trained an autoencoder to reconstruct these samples. A sketch of this setup and the resulting plot are presented below:
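In the sketch, the architecture, hidden width, activation, and training details are all illustrative assumptions; only the rank-$r$ data-generating process is carried over from the description above.

```python
import torch
import torch.nn as nn

n, r, num_samples = 32, 4, 5000
A = torch.randn(n, r)
Z = torch.randn(num_samples, r)
X = Z @ A.T                        # samples lying on an r-dimensional subspace of R^n

# An assumed small nonlinear autoencoder with bottleneck width r.
autoencoder = nn.Sequential(
    nn.Linear(n, r),
    nn.ReLU(),
    nn.Linear(r, n),
)

opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    loss = nn.functional.mse_loss(autoencoder(X), X)
    loss.backward()
    opt.step()
```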
This result does not align with SLT expectations: recall that it is regular models that typically exhibit LLCs scaling linearly with problem difficulty, whereas symmetries should push the scaling below linear. Here, however, an autoencoder (which certainly has parameter symmetries) exhibits linear scaling of the LLC. We were unfortunately unable to come up with a compelling explanation for why this occurred (perhaps there is some connection to non-linear PCA?).
These results are a mixed bag: some align with SLT expectations and others require further thought to understand. Overall, we still believe this work has value as a starting point for further investigation, and we would be grateful to hear any perspectives and interpretations you may have on it. We close with a few open directions for research:
Thank you! If you have any questions, please do not hesitate to email me.
For example, see the LLC Definition (Definition 1) provided in this paper: https://arxiv.org/abs/2308.12108.
All instance spaces were restricted to magnitude ≤ 1 for numerical stability, since we tested regressors up to fairly high degree.