Nothing is "mere." I, too, can see the stars on a desert night, and feel them. But do I see less or more? The vastness of the heavens stretches my imagination - stuck on this carousel, my little eye can catch one-million-year-old light. A vast pattern - of which I am a part - perhaps my stuff was belched from some forgotten star, as one is belching there. Or see them with the greater eye of Palomar, rushing all apart from some common starting point when they were perhaps all together. What is the pattern, or the meaning, or the why? It does not do harm to the mystery to know a little about it.
- Richard P. Feynman on The Relation of Physics to Other Sciences
Notes and reflections on the things I've learned while Doing Scholarship™ this week (i.e. studying math)[1].
This week, I'll start tracking the exercises I solve and pages I cover and post them in next week's shortform, so that I can keep track of my progress and gain additional accountability.
I am self-studying math. The purpose of this shortform is to publicly write down:
with the aim of:
I am currently reading the following textbooks:
and I plan to do most of the exercises for each of the textbooks unless I find some of them too redundant. For this week's shortform I haven't written down my progress on each of these books or the problems I've solved, because I haven't started tracking them; I'll do so starting next week.
The RLCT[1] is a function of both the true distribution $q$ and the model $p(x \mid w)$. The role of $p$ is clear enough, with very intuitive examples[2] of local degeneracy arising from the structure of the parameter-function map. However, until recently the intuitive role of $q$ really eluded me.
I think I now have some intuitive picture of how structure in $q$ influences the RLCT (at least in particular instances). Consider the following example.
Suppose the true distribution $q$ is (1) realizable ($q(x) = p(x \mid w^*)$ for some $w^*$), and (2) invariant under the action of some group $G$, i.e. $q(gx) = q(x)$. Now, suppose that the model class is that of exponential models, i.e. $p(x \mid w) \propto \exp\langle w, \phi(x)\rangle$. In particular, suppose that $\phi$, the fixed feature map, is $G$-equivariant, i.e. for each $g \in G$ there is a linear map $\rho(g)$ such that $\phi(gx) = \rho(g)\phi(x)$.
Claim: There is a degeneracy of the form $K(\rho(g)^\top w^*) = K(w^*)$ for all $g \in G$, and in particular if $G$ is a Lie group, the rank upper bound of the RLCT decreases by $\dim G / 2$.
This is nothing nontrivial. The first claim is an immediate consequence of the definitions: for any $g \in G$,
$$K(\rho(g)^\top w^*) = \sum_x q(x)\log\frac{q(x)}{p(x \mid \rho(g)^\top w^*)} = \sum_x q(x)\log\frac{q(x)}{p(gx \mid w^*)} = \sum_x q(gx)\log\frac{q(gx)}{p(gx \mid w^*)} = K(w^*),$$
using $\langle \rho(g)^\top w, \phi(x)\rangle = \langle w, \phi(gx)\rangle$, the $G$-invariance of $q$, and a change of variables $x \mapsto gx$ ...
... and the latter claim on the RLCT is a consequence of reducing the rank of the Hessian of $K$ at $w^*$ by $\dim G$, together with the rank upper bound result here.
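As a sanity check, here's a minimal numerical sketch (a hypothetical toy instance, not anything from the references above): an exponential model $p(x \mid w) \propto \exp\langle w, \phi(x)\rangle$ over three outcomes with one-hot features $\phi(x) = e_x$, and $G = \mathbb{Z}/3$ acting by cyclic shift, so that $\phi$ is equivariant via permutation matrices and the uniform $q$ is the $G$-invariant, realizable truth. The check is that $K(\rho(g)^\top w) = K(w)$ for arbitrary $w$:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_given_w(w):
    # exponential model with one-hot features: p(x|w) ∝ exp(w[x])
    z = np.exp(w - w.max())
    return z / z.sum()

def K(q, w):
    # K(w) = KL(q || p(.|w)), whose zeros are the optimal parameters
    return float(np.sum(q * np.log(q / p_given_w(w))))

# G = Z/3 acting on X = {0,1,2} by x -> x+1 (mod 3);
# phi(x) = e_x is G-equivariant with rho(g) the permutation matrix P e_x = e_{gx}
g = np.array([1, 2, 0])
P = np.zeros((3, 3))
P[g, np.arange(3)] = 1.0

q = np.full(3, 1 / 3)  # the G-invariant true distribution, realizable at w* = 0

# the symmetry here is generic: K(P^T w) = K(w) for ALL w, not just w*
for _ in range(5):
    w = rng.normal(size=3)
    assert abs(K(q, P.T @ w) - K(q, w)) < 1e-12
print("K(rho(g)^T w) = K(w) verified for sampled w")
```

The point of the check is the substitution $x \mapsto gx$ in the KL sum: since $q$ is $G$-invariant and the normalizing constant is permutation-invariant, the two sums match term by term.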
While this model is very toy, I think the high-level idea of which this is a concrete model is interesting. Abstracting out, the proof of how structure in the data influences degeneracy routes through two steps:
Basically, (1) realizability imparts the input-symmetry of $q$ to the optimal model, and (2) emulatability essentially pushes this forward to a symmetry in the parameters[4]. I think this is very interesting!
Going back to the exponential model, the most unrealistic part of it (even after taking into account that it is a toy instantiation of this high-level idea) is the fact that its symmetry is generic: $K(\rho(g)^\top w) = K(w)$ holds for ALL $w$, since the $G$-equivariant feature map $\phi$ is independent of $w$. A more realistic model would look something like $p(x \mid w) \propto \exp\langle w_2, \phi_{w_1}(x)\rangle$, where the feature map also depends on a part $w_1$ of the parameters and, importantly, whether $\phi_{w_1}$ satisfies $G$-equivariance depends on the value of $w_1$.
Then, if $w_1$ and $w_1'$ both yield optimal parameters but $w_1$ makes $\phi_{w_1}$ $G$-equivariant while $w_1'$ doesn't, then the rank upper bound of the RLCT for the former is lower than that of the latter (thus the former would be represented much more greatly in the Bayesian posterior).
This is more realistic, and I think sheds some light on why training imparts models with circuits / algorithms / internal symmetries that reflect structure in the data.
(Thanks to Dan Murfet for various related discussions.)
Very brief SLT context: In SLT, the main quantity of interest is the RLCT, which broadly speaking is a measure of the degeneracy of the most degenerate point among the optimal parameters. We care about this because it directly controls the asymptotics of the Bayesian posterior. Also, we often care about its localized version, where we restrict the parameter space to an infinitesimal neighborhood (germ) of a particular optimal parameter whose degeneracy we're interested in measuring.
The RLCT is a particular invariant of the average log likelihood function $K(w)$, meaning it is a function of the true distribution and the parametric model (the choice of the prior doesn't matter under reasonable regularity conditions).
Given a two-layer feedforward network with ReLU activations, multiplying the first layer by $\alpha > 0$ and dividing the next by $\alpha$ implements the same function. There are many other examples, including non-generic degeneracies which occur at particular weight values, unlike the constant multiplication degeneracy which occurs at every $w$; more examples in Liam Carroll's thesis.
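This rescaling degeneracy is easy to verify numerically; a minimal sketch with hypothetical random weights, relying on the positive homogeneity of ReLU ($\mathrm{relu}(\alpha z) = \alpha\,\mathrm{relu}(z)$ for $\alpha > 0$):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def f(x, W1, b1, W2, b2):
    # two-layer feedforward network: f(x) = W2 @ relu(W1 @ x + b1) + b2
    return W2 @ relu(W1 @ x + b1) + b2

# hypothetical random weights for a 3 -> 4 -> 2 network
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

alpha = 2.5  # any alpha > 0 works, since relu(alpha*z) = alpha*relu(z)
x = rng.normal(size=3)
y_orig = f(x, W1, b1, W2, b2)
y_scaled = f(x, alpha * W1, alpha * b1, W2 / alpha, b2)
assert np.allclose(y_orig, y_scaled)
```

Note that the bias of the first layer must be rescaled too, so the pre-activations scale uniformly.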
This reminds me of the notion of data-program equivalence (programs-as-data, Gödel numbering, UTM). Perhaps some infinitesimal version of it?
Let the input-side symmetry be trivial (i.e. $G = \{e\}$), and we recover degeneracies originating from the structure of the parameter-function map alone as a special case.
Found a proof sketch here (App. D.3); couldn't find it elsewhere in canonical SLT references, e.g. the gray book. The idea seems simple:
There shouldn't be a negative sign here (14a).
(will edit this comment over time to collect typos as I find them)
The fourth one is great.
Conventionally is a random variable, just like how is a random variable. To be fair the conventions are somewhat inconsistent, given that (as you said) is a number.
Previous discussion, comment by johnswentworth:
Relevant slogan: Goodhart is about generalization, not approximation.
[...]
In all the standard real-world examples of Goodhart, the real problem is that the proxy is not even approximately correct once we move out of a certain regime.
Speaking from the perspective of someone still developing basic mathematical maturity and often lacking prerequisites, it's very useful as a learning aid. For example, it has significantly expanded the range of papers and technical results accessible to me. If I'm reading a paper containing unfamiliar math, I no longer have to go down the rabbit hole of tracing prerequisite dependencies, which often expands exponentially (partly because I don't know which results or sections in the prerequisite texts are essential, making it difficult to scope my focus). Now I can simply ask the LLM for a self-contained exposition. Traditional means of self-studying (search engines, Wikipedia, StackExchange) are very often no match for this task, mostly in terms of time spent or wasted effort; simply having someone I can directly ask my highly specific (and often dumb) questions or confusions and receive equally specific responses is just really useful.
The first new qualitative thing in Information Theory when you move from two variables to three variables is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but the triple mutual information $I(X;Y;Z)$ can be negative.
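A concrete instance is the classic XOR construction: $X, Y$ independent fair bits and $Z = X \oplus Y$, which gives $I(X;Y;Z) = I(X;Y) - I(X;Y \mid Z) = 0 - 1 = -1$. A quick sketch computing it via the inclusion-exclusion form:

```python
import numpy as np
from itertools import product

def H(pmf):
    # Shannon entropy (in bits) of a pmf given as a dict of probabilities
    p = np.array([v for v in pmf.values() if v > 0])
    return float(-(p * np.log2(p)).sum())

def marginal(joint, idxs):
    # marginalize the joint pmf onto the coordinates in idxs
    out = {}
    for k, v in joint.items():
        key = tuple(k[i] for i in idxs)
        out[key] = out.get(key, 0.0) + v
    return out

# X, Y independent fair bits, Z = X XOR Y
joint = {(x, y, x ^ y): 0.25 for x, y in product([0, 1], repeat=2)}

Hx, Hy, Hz = (H(marginal(joint, [i])) for i in range(3))
Hxy = H(marginal(joint, [0, 1]))
Hxz = H(marginal(joint, [0, 2]))
Hyz = H(marginal(joint, [1, 2]))
Hxyz = H(joint)

# I(X;Y;Z) = H(X)+H(Y)+H(Z) - H(XY)-H(XZ)-H(YZ) + H(XYZ)
I3 = Hx + Hy + Hz - Hxy - Hxz - Hyz + Hxyz
print(I3)  # -1.0
```

Any two of the three variables are independent, yet each is determined by the other two; the negative value records exactly this "synergy".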
This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.
A fundamental result in Information Theory is that $I(X;Y \mid Z) \geq 0$ always holds.
Since $I(X;Y \mid Z) \geq 0$ always holds, a nonnegative linear combination of a bunch of these is always a valid inequality, which we call a Shannon-type Inequality.
Then the question is whether Shannon-type Inequalities capture all valid information inequalities of $n$ variables. It turns out: yes for $n = 2$, (approximately) yes for $n = 3$, and no for $n \geq 4$.
Behold, the glorious Zhang-Yeung inequality, a Non-Shannon-type Inequality for $n = 4$:
$$2I(C;D) \leq I(A;B) + I(A;C,D) + 3I(C;D \mid A) + I(C;D \mid B)$$
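The Zhang-Yeung inequality $2I(C;D) \leq I(A;B) + I(A;C,D) + 3I(C;D \mid A) + I(C;D \mid B)$ can be sanity-checked numerically on random joint distributions (which of course proves nothing, but catches transcription errors); a minimal sketch assuming four binary variables:

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C, D = 0, 1, 2, 3  # axis indices for the four binary variables

def H(joint, keep):
    # joint entropy (in bits) of the variables indexed by `keep`
    drop = tuple(i for i in range(4) if i not in keep)
    p = joint.sum(axis=drop).ravel() if drop else joint.ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def I(joint, X, Y, given=()):
    # I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z), for disjoint index tuples
    Z = tuple(given)
    return (H(joint, X + Z) + H(joint, Y + Z)
            - H(joint, X + Y + Z) - H(joint, Z))

for _ in range(100):
    p = rng.random((2, 2, 2, 2))
    p /= p.sum()
    lhs = 2 * I(p, (C,), (D,))
    rhs = (I(p, (A,), (B,)) + I(p, (A,), (C, D))
           + 3 * I(p, (C,), (D,), given=(A,)) + I(p, (C,), (D,), given=(B,)))
    assert lhs <= rhs + 1e-9
print("inequality holds on all sampled distributions")
```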
Explanation of the math, for anyone curious.
Given random variables $X_1, \dots, X_n$, with $H(A)$ denoting the joint entropy of $(X_i)_{i \in A}$ for $A \subseteq \{1, \dots, n\}$, it turns out that $I(X;Y \mid Z) \geq 0$ is equivalent to: $H(A) + H(B) \geq H(A \cup B) + H(A \cap B)$ (submodularity), $H(A) \leq H(B)$ if $A \subseteq B$, and $H(\emptyset) = 0$.
This lets us write the inequality involving conditional mutual information in terms of joint entropy instead.
Let $\Gamma^*_n$ then be a subset of $\mathbb{R}^{2^n - 1}$, each element corresponding to the values of the joint entropy assigned to each nonempty subset of some random variables $X_1, \dots, X_n$. For example, an element of $\Gamma^*_2$ would be $(H(X_1), H(X_2), H(X_1, X_2))$ for some random variables $X_1$ and $X_2$, with a different element being a different tuple induced by a different pair of random variables.
Now let $\Gamma_n$ represent the elements of $\mathbb{R}^{2^n - 1}$ satisfying the three aforementioned conditions on joint entropy. For example, $\Gamma_2$'s elements would be tuples $(h_1, h_2, h_{12})$ satisfying e.g. $h_1 \leq h_{12}$ (monotonicity). This is also a convex cone, so its elements really do correspond to "nonnegative linear combinations" of Shannon-type inequalities.
Then, the claim that "nonnegative linear combinations of Shannon-type inequalities span all inequalities on the possible Shannon measures" would correspond to the claim that $\overline{\Gamma^*_n} = \Gamma_n$ for all $n$.
The content of the papers linked above is to show that $\Gamma^*_3 \neq \Gamma_3$, but $\overline{\Gamma^*_3} = \Gamma_3$.
This implies that, while there exists a $(2^3 - 1)$-tuple satisfying Shannon-type inequalities that can't be constructed or realized by any random variables $X_1, X_2, X_3$, there does exist a sequence of random variables whose induced tuples of joint entropies converge to that tuple in the limit.
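The containment $\Gamma^*_n \subseteq \Gamma_n$ itself is easy to check numerically: every entropy vector induced by actual random variables satisfies $H(\emptyset) = 0$, monotonicity, and submodularity. A sketch for three binary variables with random joint distributions:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def H(joint, keep):
    # joint entropy (in bits) of the subset `keep` of the 3 binary variables
    drop = tuple(i for i in range(3) if i not in keep)
    p = joint.sum(axis=drop).ravel() if drop else joint.ravel()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

subsets = [frozenset(s) for r in range(4) for s in combinations(range(3), r)]

for _ in range(50):
    joint = rng.random((2, 2, 2))
    joint /= joint.sum()
    h = {S: H(joint, tuple(S)) for S in subsets}
    assert abs(h[frozenset()]) < 1e-9                         # H(empty) = 0
    for S in subsets:
        for T in subsets:
            if S <= T:
                assert h[S] <= h[T] + 1e-9                    # monotonicity
            assert h[S] + h[T] + 1e-9 >= h[S | T] + h[S & T]  # submodularity
print("all sampled entropy vectors lie in Gamma_3")
```

The hard direction, of course, is the converse: deciding which points of $\Gamma_n$ are (limits of) actual entropy vectors, which is exactly where the non-Shannon-type inequalities live.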
Thanks for the recommendation! Woit's book does look fantastic (also as an introduction to quantum mechanics). I also know Sternberg's Group Theory and Physics to be a good representation theory & physics book.
I did encounter Brown's book during my search for algebraic topology textbooks, but I had to pass on it in favor of Bredon's because it didn't develop homology / cohomology to the extent I was interested in. Though the groupoid perspective does seem very interesting and useful, so I might read it after completing my current set of textbooks.