Alexander Gietelink Oldenziel

(...) the term technical is a red flag for me, as it is many times used not for the routine business of implementing ideas but for the parts, ideas and all, which are just hard to understand and many times contain the main novelties.
                                                                                                           - Saharon Shelah

 

As a true-born Dutchman I endorse Crocker's rules.

For most of my writing, see my short-forms (new shortform, old shortform)

Twitter: @FellowHominid

Personal website: https://sites.google.com/view/afdago/home

Sequences

Singular Learning Theory

Wiki Contributions

Comments

Interesting...

Wouldn't I expect the evidence to come out in a few big chunks, e.g. OpenAI releasing a new product?

I agree with you. 

Epsilon machine (and MSP) construction is most likely computationally intractable [I don't know an exact statement of such a result in the literature but I suspect it is true] for realistic scenarios. 

Scaling an approximate version of epsilon-machine reconstruction therefore seems of prime importance. Real-world architectures and data have highly specific structure & symmetry that make them different from completely generic HMMs; this will most likely have to be exploited.
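To make "approximate epsilon reconstruction" concrete, here is a minimal sketch in the spirit of CSSR-style algorithms: cluster length-L histories by their empirical next-symbol distributions and treat the clusters as approximate causal states. The function names, thresholds, and toy process below are all illustrative assumptions, not an implementation from the computational mechanics literature.

```python
# Minimal sketch of approximate causal-state reconstruction (CSSR-flavoured):
# cluster length-L histories whose empirical next-symbol distributions agree,
# treating each cluster as an approximate causal state. All names/thresholds
# here are illustrative, not a reference implementation.
from collections import Counter, defaultdict

def approximate_causal_states(sequence, L=3, tol=0.1):
    # Empirical distribution of the next symbol given each length-L history.
    counts = defaultdict(Counter)
    for i in range(len(sequence) - L):
        history = tuple(sequence[i:i + L])
        counts[history][sequence[i + L]] += 1

    def predictive(history):
        c = counts[history]
        total = sum(c.values())
        return {sym: n / total for sym, n in c.items()}

    # Greedily merge histories whose predictive distributions are close in L1.
    states = []  # each state: (representative predictive dist, list of histories)
    for history in counts:
        p = predictive(history)
        for dist, members in states:
            diff = sum(abs(p.get(s, 0.0) - dist.get(s, 0.0))
                       for s in set(p) | set(dist))
            if diff < tol:
                members.append(history)
                break
        else:
            states.append((p, [history]))
    return states

# Example: a period-2 process; we expect (roughly) two approximate causal states.
seq = [0, 1] * 500
for dist, members in approximate_causal_states(seq, L=2):
    print(dist, members[:4])
```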

The "Calculi of Emergence" paper has inspired many people but has not been developed much; many of the details are somewhat obscure or vague. I also believe that completely different methods will most likely be needed to push the program further. Computational mechanics is primarily a theory of hidden Markov models - it doesn't have the tools to easily describe behaviour higher up the Chomsky hierarchy. I suspect more powerful and sophisticated algebraic, logical and categorical thinking will be needed here. I caveat this by saying that Paul Riechers has pointed out that one can actually understand all these gadgets up the Chomsky hierarchy as infinite HMMs, which may be usefully analyzed just as finite HMMs are.

The still-underdeveloped theory of epsilon transducers I regard as the most promising lens on agent foundations. This is uncharted territory; I suspect the largest impact of computational mechanics will come from this direction.

Your point on True Names is well-taken. More basic examples than gauge information or synchronization order are the triple of quantities: the entropy rate, the excess entropy, and Crutchfield's statistical/forecasting complexity. These are the most important quantities to understand for any stochastic process (such as the structure of language and LLMs!).
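For concreteness, the standard definitions (the notation $h_\mu$, $E$, $C_\mu$ is the usual one from the computational mechanics literature, added here for reference):

```latex
% Standard quantities for a stationary stochastic process (X_t):
h_\mu = \lim_{L \to \infty} \frac{H[X_1, \dots, X_L]}{L}
  \quad \text{(entropy rate: irreducible unpredictability per symbol)}

E = I\big[\overleftarrow{X} ; \overrightarrow{X}\big]
  \quad \text{(excess entropy: mutual information between past and future)}

C_\mu = H[\mathcal{S}]
  \quad \text{(statistical complexity: entropy of the causal-state distribution)}
```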

I agree with you that the new/surprising thing is the linearity of the probe. I also agree that it's not entirely clear how surprising & new the linearity of the probe is.

If you understand how the causal-state construction & the MSP work in computational mechanics, the experimental result isn't surprising. Indeed, it can't be any other way! That's exactly the magic of the definition of causal states.

What one person finds surprising or new, another thinks trivial. The subtle magic of the right theoretical framework is that it makes the complex simple and surprising phenomena apparent.

Before learning about causal states I would not even have considered that there is a unique (!) optimal minimal predictor canonically constructible from the data, nor that the geometry of synchronizing belief states is generically a fractal. Of course, once one has properly internalized the definitions this is almost immediate. Pretty pictures can be helpful in building that intuition!
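For readers who haven't seen it, the causal-state construction being referred to is the following equivalence relation on pasts (standard definition, stated here for reference):

```latex
% Two pasts are causally equivalent iff they induce the same distribution over futures:
\overleftarrow{x} \sim_\epsilon \overleftarrow{x}'
  \iff
  P\big(\overrightarrow{X} \mid \overleftarrow{X} = \overleftarrow{x}\big)
  = P\big(\overrightarrow{X} \mid \overleftarrow{X} = \overleftarrow{x}'\big)

% The causal states are the equivalence classes of pasts under this relation.
% Any optimal predictor must distinguish at least these classes, which is why a model
% trained to predict well "has to" track them (and hence the mixed-state geometry).
```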

Adam and I (and many others) have been preaching the gospel of computational mechanics for a while now. Most of it had fallen on deaf ears before. Like you, I have been (positively!) surprised and amused by the sudden outpouring of interest. No doubt it's in part a testimony to the Power of the Visual! Never look a gift horse in the mouth!

I would say the parts of computational mechanics I am really excited about are a little deeper - downstream of causal states & the MSP. This is just a taster.

I'm confused & intrigued by your insistence that this follows from the Good Regulator Theorem. Like Adam, I don't understand it. My understanding is that the original 'theorem' was wordcelled nonsense, but that John has been able to formulate a nontrivial version of the theorem. My experience is that the theorem is often invoked in a handwavy way that leaves me no less confused than before. No doubt due to my own ignorance!

I would be curious to hear a *precise* statement of why the result here follows from the Good Regulator Theorem.

Military nerds, correct me if I'm wrong, but I think the answer might be the following. (I'm not a pilot, etc. etc.)

"Stealth" can be a bit of a misleading term. F-35s aren't actually 'stealth aircraft' - they are low-observable aircraft. You can detect F-35s with long-wave radar.

The problem isn't knowing that there is an F-35 out there but getting a weapons-grade lock on it. This is much harder, and your grainy GPT-interpreted photo isn't close to enough for a missile, I think. You already mentioned this as a possibility.

The Ukrainians pioneered something similar for audio which is used to detect missiles & drones entering Ukrainian airspace.

It also suggests that there might be some sort of conservation law for pain for agents.

Conservation of Pain, if you will.

Sure! I'll try and say some relevant things below. In general, I suggest looking at Liam Carroll's distillation over Watanabe's book (which is quite heavy going, but good as a reference text). There are also some links below that may prove helpful.

The empirical loss and its second derivative are statistical estimators of the population loss and its second derivative. Ultimately the latter controls the properties of the former (though the relation between the second derivative of the empirical loss and the second derivative of the population loss is a little subtle).
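Spelling this out in standard notation (my notation, not Liam's or Watanabe's specifically):

```latex
L_n(w) = \frac{1}{n} \sum_{i=1}^{n} \ell(w, x_i)
  \quad \text{(empirical loss: a random quantity, depends on the sample)}

L(w) = \mathbb{E}_{x \sim q}\,[\ell(w, x)]
  \quad \text{(population loss; } L_n \to L \text{ and } \nabla^2 L_n \to \nabla^2 L \text{ as } n \to \infty)
```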

The [matrix of] second derivatives of the population loss at the minimum is called the Fisher information metric. It's always degenerate [i.e. singular] for any statistical model with hidden states or hierarchical structure. Analyses that don't take this into account are inherently flawed.
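Concretely, with the standard definition (stated here for reference; "degenerate" means the matrix fails to be invertible at the optimal parameter):

```latex
I(w)_{ij} = \mathbb{E}_{x \sim p(\cdot \mid w)}
  \big[ \partial_i \log p(x \mid w) \; \partial_j \log p(x \mid w) \big]

% Singular model: det I(w_0) = 0 at (some) optimal parameters w_0,
% so the quadratic (Laplace / BIC) approximation around the minimum breaks down.
```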

SLT tells us that the local geometry around the minimum nevertheless controls the learning and generalization behaviour of any Bayesian learner for large N. N doesn't have to be that large though: empirically, the asymptotic behaviour that SLT predicts already kicks in around N = 200.

In some sense, SLT says that the broad-basin intuition is broadly correct, but this needs to be heavily caveated. Our low-dimensional intuition for 'broad basins' is misleading: for singular statistical models (again, everything used in ML is highly singular) the local geometry around the minima in high dimensions is very weird.

Maybe you've heard of the behaviour of the volume of a sphere in high dimensions: most of it is contained on the shell. I like to think of the local geometry as some sort of fractal sea urchin. Maybe you like that picture, maybe you don't - it doesn't matter. SLT gives actual math that is provably the right thing for a Bayesian learner.

[real ML practice isn't Bayesian learning though? Yes, this is true. Nevertheless, there is both empirical and mathematical evidence that the Bayesian quantities are still highly relevant for actual learning]

SLT says that the Bayesian posterior is controlled by the local geometry of the minimum. The dominant factor for N ≳ 200 is the fractal dimension of the minimum. This is the RLCT, and it is the most important quantity of SLT.
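The precise statement behind "controlled by the local geometry" is Watanabe's free-energy asymptotics; schematically (stated from memory, so treat the exact form of the lower-order terms with care):

```latex
F_n = -\log \int \exp\big(-n L_n(w)\big)\, \varphi(w)\, dw
    = n L_n(w_0) + \lambda \log n - (m - 1) \log\log n + O_p(1)

% \lambda is the RLCT and m its multiplicity; for a regular model \lambda = d/2,
% which recovers the familiar BIC penalty (d/2) log n.
```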

There are some misconceptions about the RLCT floating around. One way to think about it is as an 'effective fractal dimension', but one has to be careful about this. There is a notion of effective dimension in the standard ML literature where one takes the parameter count and mods out parameters that don't do anything (because of symmetries). The RLCT picks up on symmetries, but it is not just that: it also picks up on how degenerate the directions in the Fisher information metric are ~= how broad the basin is in that direction.

Let's consider a maximally simple example to get some intuition. Let the population loss function be $L(x) = x^{2k}$. The number of parameters is $d = 1$ and the minimum is at $x = 0$.

For $k = 1$ the minimum is nondegenerate (the second derivative is nonzero). In this case the RLCT is half the dimension. In our case the dimension is just $d = 1$, so $\lambda = 1/2$.

For $k \geq 2$ the minimum is degenerate (the second derivative is zero). Analyses based on studying the second derivatives will not see the difference between different values of $k$, but in fact the local geometry is vastly different. The higher $k$ is, the broader the basin around the minimum. The RLCT for $L(x) = x^{2k}$ is $\lambda = 1/(2k)$. This means: the lower the RLCT, the 'broader' the basin is.
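One way to see the $\lambda = 1/(2k)$ claim without the resolution-of-singularities machinery is the volume-scaling characterisation of the RLCT: $\mathrm{Vol}\{x : L(x) < \epsilon\} \sim c\,\epsilon^{\lambda}$ as $\epsilon \to 0$. A quick numerical sanity check (illustrative script, all names my own):

```python
# Numerically estimate the volume-scaling exponent lambda in
#   Vol{ x in [-1,1] : x^(2k) < eps } ~ c * eps^lambda
# and compare with the RLCT prediction lambda = 1/(2k).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=1_000_000)

for k in [1, 2, 3, 4]:
    eps = np.array([1e-2, 1e-3])
    # Monte Carlo estimate of the sublevel-set volume at two scales of eps.
    vols = [(x ** (2 * k) < e).mean() * 2.0 for e in eps]
    # Slope of log-volume vs log-eps approximates the scaling exponent.
    lam_est = np.log(vols[0] / vols[1]) / np.log(eps[0] / eps[1])
    print(f"k={k}: estimated lambda ~ {lam_est:.3f}, predicted 1/(2k) = {1/(2*k):.3f}")
```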

Okay, so far this only recapitulates the broad-basin story. But there are some important points:

  • this is an actual quantity that can be estimated at scale for real networks, and that provably dominates the learning behaviour for moderately large N
  • SLT says that minima with low RLCT will be preferred. It even says how much they will be preferred. There is a tradeoff between lower-RLCT minima with moderate loss ('simpler solutions') and minima with higher RLCT but lower loss; as N grows, the lower-loss minima eventually win out (see the short calculation after this list). This means that the RLCT is actually 'the right notion of model complexity/simplicity' in the parameterized Bayesian setting. This is too much to recap in this comment, but I refer you to Hoogland & van Wingerden's post here. This is also the start of the phase transition story, which I regard as the principal insight of SLT. 
  • The RLCT doesn't just pick up on basin broadness. It also picks up on more elaborate singular structure, e.g. a crossing-valley type minimum like $L(x,y) = x^2 y^2$. I won't tell you the answer, but you can calculate it yourself using Shaowei Lin's cheat sheet. This is key - actual neural networks have highly, highly singular structure that determines the RLCT. 
  • The RLCT is the most important quantity in SLT, but SLT is not just about the RLCT. For instance, the second most important quantity, the 'singular fluctuation', is also quite important. It has a strong influence on generalization behaviour and is the largest factor in the variance of trained models. It also controls approximations to Bayesian learning, like the way neural networks are actually trained. 
  • We've seen that studying the directions defined by the matrix of second derivatives is fundamentally flawed, because neural networks are highly singular. Still, there is something non-crazy about studying these directions. There is upcoming work, which I can't discuss in detail yet, that explains to a large degree how to correct this naive picture, both mathematically and empirically. 
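To make the loss-vs-RLCT tradeoff in the second bullet concrete, here is the back-of-the-envelope version using the free-energy asymptotics above (a schematic comparison of two 'phases', not a theorem statement from Watanabe):

```latex
% Two candidate "phases": alpha (simpler, higher loss) vs beta (more complex, lower loss)
F_n(\alpha) \approx n L_\alpha + \lambda_\alpha \log n, \qquad
F_n(\beta)  \approx n L_\beta  + \lambda_\beta  \log n,
\qquad L_\beta < L_\alpha,\; \lambda_\beta > \lambda_\alpha

% The posterior concentrates on the phase with lower free energy: the simpler phase
% wins at small n, and the lower-loss phase takes over roughly once
n\,(L_\alpha - L_\beta) \gtrsim (\lambda_\beta - \lambda_\alpha)\,\log n
```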

This is all answered very elegantly by singular learning theory.

You seem to have a strong math background! I really encourage you to take the time and really study the details of SLT. :-)

I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.

The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT. It is a common misconception that this is what SLT amounts to.

To be sure - generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.

The issue of the true distribution not being contained in the model is called 'unrealizability' in Bayesian statistics. It is dealt with in Watanabe's second 'green' book. Unrealizability is key to the most important insight of SLT, contained in the last sections of the second-to-last chapter of the green book: algorithmic development during training through phase transitions in the free energy.

I don't have the time to recap this story here.
