Dalcy
Nothing is "mere." I, too, can see the stars on a desert night, and feel them. But do I see less or more? The vastness of the heavens stretches my imagination - stuck on this carousel, my little eye can catch one-million-year-old light. A vast pattern - of which I am a part - perhaps my stuff was belched from some forgotten star, as one is belching there. Or see them with the greater eye of Palomar, rushing all apart from some common starting point when they were perhaps all together. What is the pattern, or the meaning, or the why? It does not do harm to the mystery to know a little about it.

                          - Richard P. Feynman on The Relation of Physics to Other Sciences

Comments
Dalcy's Shortform
Dalcy · 3d

Thanks for the recommendation! Woit's book does look fantastic (also as an introduction to quantum mechanics). I also know Sternberg's Group Theory and Physics to be a good book on representation theory and physics.

I did encounter Brown's book during my search for algebraic topology books, but I passed it over in favor of Bredon's because it didn't develop homology / cohomology to the extent I was interested in. The groupoid perspective does seem very interesting and useful, though, so I might read it after completing my current set of textbooks.

Dalcy's Shortform
Dalcy · 4d

Becoming Stronger™ (Sep 13 - Sep 27)

Notes and reflections on the things I've learned while Doing Scholarship™ this week (i.e. studying math)[1].

  • I am starting to see the value of categorical thinking.
    • For example, from [FOAG] it was quite mindblowing to learn that the stalk (the set of germs at a point) can be equivalently defined as a simple colimit of sections of the presheaf over the open sets of X containing the point, and this definition made proving certain constructions (e.g. inducing a map of stalks from a map ϕ:X→Y) very easy.
    • Also, I was first introduced to the concept of a presheaf as an abstraction of a map that takes open sets and returns functions over them, abstracting properties like the existence of restriction maps that compose naturally. Turns out (punchline, presumably) this is just a functor Open(X)^op → Set!
  • The Yoneda lemma is very cool. I recall seeing some of its ideas in Programs as Singularities (paper), where there is the idea of embedding programs (for the analogue in the Yoneda lemma: the Yoneda embedding, a fully faithful functor from some category C ...) into a different space (... to the presheaf category Set^(C^op) ...) that contains "almost programs" (... because the Yoneda embedding is not essentially surjective: most presheaves are not representable), and the idea that studying this enlarged space lends insight into the original space.
    • Rabbit hole: the Yoneda lemma as expressing consistency conditions on Lawvere's idea of Space and Quantity being presheaves and copresheaves??
  • I am also starting to appreciate more the notion of sheaves or ringed spaces from [FOAG] - or more generally, the idea that a "space" can be productively studied by studying the functions defined on it. For example, I learned from [Bredon] that a manifold, whose usual definition is a topological space locally homeomorphic to a Euclidean space, can equivalently be defined as a ringed space whose structure sheaf is valued in a suitable subalgebra of the continuous maps on each open set. Very cool!
  • Started reading [Procesi] to learn invariant theory and representation theory, because they came up quite often as bottlenecks in my recent work (eg). Also interpretability, apparently. So far I've just read pg 1-9, reviewing the very basics of group actions (e.g., the orbit-stabilizer theorem). Lie groups don't come up until around pg 50, so until then I should catch up on the relevant Lie group prerequisites through [Lee] or [Bredon].
  • Also reviewed some basic topology by skimming pg 1-50 of [Bredon]. So many rabbit holes just in point-set topology that I can't afford (time-wise) to follow, e.g.,
    • (1) Nets generalize sequences and successfully characterize topological properties. I once learned about filters, and I do not yet know how the two relate (and how they fail to relate constructively, plus why filters are supposedly the more natural notion constructively), and especially how universal nets relate to ultrafilters.
    • (2) I didn't know that manifolds are metrizable, but they are, by an easy consequence of the Urysohn metrization theorem (second-countable & completely regular => metrizable). But I would like to study this proof in more detail. Also, how should one intuitively think about the metric on a manifold?
      • I didn't know that the proof of the Urysohn metrization theorem was this nice! It's a consequence of the following lemma: recall, "completely regular" means that given a point x and a closed set C with x∉C, there exists a continuous f:X→[0,1] s.t. f(x)=0 and f(C)=1. BUT adding second-countability to the hypothesis lets you choose this f from a fixed, countable family F.
      • Then, mapping X under this countable family of functions (thus taking values in [0,1]^ℕ) turns out to be an embedding - and [0,1]^ℕ can be metrized, so X can be metrized as well (see the explicit metric sketched after this list).
    • (3) I learned about various natural variants / generalizations of compactness (σ-compactness, local compactness, paracompactness). My understanding of why they matter:
      • (a) paracompactness implies the existence of a partition of unity subordinate to any open cover (a consequence of paracompact => normal together with Urysohn's lemma, with paracompactness by definition also providing the locally finite open refinement of the given cover that the definition of a subordinate partition of unity requires).
      • (b) for locally compact Hausdorff X, paracompactness can be characterized as "X is a disjoint union of open σ-compact subsets," which is much easier to check than the definition of paracompactness via locally finite open refinements of open covers.
        • e.g., from this it is immediate that manifolds are paracompact: (1) locally Euclidean => locally compact. (2) Second-countable => Lindelöf. (3) Lindelöf & locally compact => σ-compact. (1) & (2) & (3) + the above => manifolds are paracompact. Other properties of manifolds then follow immediately from paracompactness, e.g. manifolds always admit a partition of unity subordinate to any open cover.
      • But rabbit hole: recall, open sets axiomatize semidecidable properties. What is, then, the logical interpretation of compactness, σ-compactness, local compactness, paracompactness?
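
On the metrization question in (2) above: the metric that the construction produces can be written down explicitly. Here is a sketch of the standard construction (my own summary, not specific to Bredon's proof), assuming the countable separating family F = {f_n} from the lemma:

```latex
% Sketch: with F = {f_1, f_2, ...} the countable family from the lemma, the map
% \Phi(x) = (f_1(x), f_2(x), \dots) is an embedding of X into the cube [0,1]^\mathbb{N}.
% Pulling back the product metric of the cube along \Phi gives an explicit metric on X
% that induces the original topology:
\[
  d(x, y) \;=\; \sum_{n=1}^{\infty} 2^{-n} \, \bigl| f_n(x) - f_n(y) \bigr| .
\]
% Intuitively, on a manifold the f_n are countably many continuous "test functions", and
% the distance between two points aggregates how differently they score on these tests.
```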

This week, I'll start tracking the exercises I solve and the pages I cover and post them in next week's shortform, so that I can keep track of my progress and get some additional accountability.

  1. ^

    I am self-studying math. The purpose of this shortform is to publicly write down:

    1. things I've learned each week,
    2. my progress through the textbooks I've committed to read[2], and
    3. other learning-related reflections,

    with the aim of:

    1. allocating some time each week to reflect on what I've learned, in order to self-improve,
    2. keeping myself socially accountable by publicly committing to making progress and actually sticking with the textbooks,
    3. and writing / thinking in public, which I enjoy doing.
  2. ^

    I am currently reading the following textbooks:

    • [Lee]: Lee, Smooth Manifolds
    • [Bredon]: Bredon, Topology and Geometry
    • [FOAG]: Vakil, Foundations of Algebraic Geometry
    • [Procesi]: Procesi, Lie Groups: An Approach through Invariants and Representations

    and I plan to do most of the exercises in each of the textbooks unless I find some of them too redundant. For this week's shortform I haven't written down my progress on each of these books or the problems I've solved, because I haven't started tracking them; I'll do that starting next week.

Dalcy's Shortform
Dalcy · 10d

A simple sketch of the role data structure plays in loss landscape degeneracy.

The RLCT[1] is a function of both q(x) and p(x|θ). The role of p(x|θ) is clear enough, with very intuitive examples[2] of local degeneracy arising from the structure of the parameter-function map. However, until recently the intuitive role of q(x) really eluded me.

I think I now have some intuitive picture of how structure in q(x) influences RLCT (at least particular instances of it). Consider the following example.

Toy Example: G-invariant distribution, G-equivariant submodule

Suppose the true distribution is (1) realizable (p(⋅|θ∗)=q(⋅) for some θ∗) and (2) invariant under some group action: q(x)=q(gx) for all x and all g∈G. Now suppose the model class is that of exponential models, i.e. p(x|θ)∝exp(⟨θ,T(x)⟩). In particular, suppose that T, the fixed feature map, is G-equivariant, i.e. there exists ρ:G→GL(R^d) such that T(gx)=ρ(g)T(x).

Claim: There is a degeneracy of the form p(x|θ∗)=p(x|ρ(g)∗(θ∗)), and in particular, if G is a Lie group, the rank upper bound on the RLCT decreases by (1/4)dim G.

None of this is nontrivial. The first claim is an immediate consequence of the definitions:

  • p(⋅|θ∗)=q(⋅) and q(x)=q(gx) implies p(x|θ∗)=p(gx|θ∗)∀x
  • Then, we have the following:
    p(gx∣θ∗) = exp(⟨θ∗, T(gx)⟩) = exp(⟨θ∗, ρ(g)T(x)⟩) = exp(⟨ρ(g)∗(θ∗), T(x)⟩) = p(x∣ρ(g)∗(θ∗)).

... and the latter claim on the RLCT is a consequence of p(x|θ∗)=p(x|ρ(g)∗(θ∗)) reducing the rank of the Hessian of L(θ) at θ∗ by dim G, together with the rank upper bound result here.
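
A minimal numerical sanity check of this identity (the sample space, feature map, and group below are my own toy choices, not anything from the post): take X = Z_6 with G = Z_6 acting by shifts and T(x) the point on the unit circle at angle 2πx/6, so that ρ(g) is a rotation.

```python
import numpy as np

# Toy check of p(gx | θ) = p(x | ρ(g)*θ) for an exponential model with a
# G-equivariant feature map. Sample space X = Z_6, G = Z_6 acting by shifts,
# T(x) = (cos(2πx/6), sin(2πx/6)), so T(gx) = ρ(g) T(x) with ρ(g) a rotation.
# (All choices here are illustrative, not from the post.)
n = 6
X = np.arange(n)
T = np.stack([np.cos(2 * np.pi * X / n), np.sin(2 * np.pi * X / n)], axis=1)

def rho(g):
    a = 2 * np.pi * g / n
    return np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])

def p(theta):
    logits = T @ theta
    w = np.exp(logits - logits.max())  # normalized exponential family member
    return w / w.sum()

rng = np.random.default_rng(0)
theta = rng.normal(size=2)
for g in range(n):
    lhs = p(theta)[(X + g) % n]   # x ↦ p(gx | θ)
    rhs = p(rho(g).T @ theta)     # x ↦ p(x | ρ(g)*θ), with ρ(g)* the transpose here
    assert np.allclose(lhs, rhs)

# So if q is G-invariant and p(·|θ*) = q, every ρ(g)*θ* is also a global optimum.
```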

High-level idea: Emulability of input symmetry

While this model is very toy, I think the high-level idea of which it is a concrete instance is interesting. Abstracting out, the proof of how data structure influences degeneracy routes through two steps:

  • The true distribution has some structure / symmetry, say q(x)=q(x+δx) for all x (with δx a function of x indicating some infinitesimal change; all of this is meant heuristically), which gets imparted onto p(⋅|θ∗) by realizability, i.e. p(x|θ∗)=p(x+δx|θ∗) for all x.
  • Emulability: At θ∗, the model can "emulate" certain classes of perturbations to certain classes of inputs x by instead perturbing the parameters, i.e. p(x+δx|θ∗)=p(x|θ∗+δθ).[3]

Basically, (1) realizability imparts the input symmetry to p(⋅|θ∗), and (2) emulability essentially "pushes forward" this input symmetry to a symmetry in the parameters[4]. I think this is very interesting!

  • Story: Suppose I am tasked with image segmentation, but my visual cortex is perturbed by δθ, causing me to perceive colors with a slightly different hue. Then, if my visual cortex hadn't been perturbed but the world's colors had instead shifted to that hue (i.e. δx), I would notice virtually nothing and would be making the same predictions: p(x+δx|θ∗)=p(x|θ∗+δθ).

Going back to the exponential model, its most unrealistic part (even after taking into account that it is a toy instantiation of this high-level idea) is that its symmetry is generic: p(gx|θ)=p(x|ρ(g)∗(θ)) holds for ALL θ, since the G-equivariant T is independent of θ. A more realistic model would look something like p(x|θ)∝exp(⟨θ₁,T_θ₂(x)⟩), where T also depends on θ₂ and, importantly, whether T satisfies G-equivariance depends on the value of θ₂.

Then, if p(⋅|θ∗)=p(⋅|θ′∗)=q but θ∗ makes T G-equivariant while θ′∗ doesn't, the rank upper bound on the RLCT for the former is lower than that for the latter (thus θ∗ would be weighted much more heavily in the Bayesian posterior).

This is more realistic, and I think sheds some light on why training imparts models with circuits / algorithms / internal symmetries that reflect structure in the data.

(Thanks to Dan Murfet for various related discussions.)

  1. ^

    Very brief SLT context: In SLT, the main quantity of interest is RLCT, which broadly speaking is a measure of degeneracy of the most degenerate point among the optimal parameters. We care about this because it directly controls the asymptotics of the Bayesian posterior. Also, we often care about its localized version where we restrict the parameter space W to an infinitesimal neighborhood (germ) of a particular optimal parameter we're interested in measuring the degeneracy of.

    RLCT is a particular invariant of the average log likelihood function L(θ)=∫q(x)logp(x|θ)dx, meaning it is a function of the true distribution q(x) and the parametric model p(x|θ) (the choice of the prior φ(θ) doesn't matter under reasonable regularity conditions).

  2. ^

    Given a two-layer feedforward network with ReLU, multiplying the first layer by α and dividing the next by α implements the same function. There are many other examples, including non-generic degeneracies which occur only at particular weight values, unlike the constant-multiplication degeneracy which occurs at every θ; more examples in Liam Carroll's thesis.
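
    A quick numerical check of this rescaling degeneracy (toy shapes and values of my own choosing):

```python
import numpy as np

# Rescaling degeneracy of a two-layer ReLU network: for α > 0, scaling the first
# layer (weights and bias) by α and dividing the second layer's weights by α
# leaves the computed function unchanged, since ReLU(α z) = α ReLU(z).
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

def net(x, W1, b1, W2, b2):
    return W2 @ np.maximum(W1 @ x + b1, 0.0) + b2

x, alpha = rng.normal(size=3), 2.7
assert np.allclose(net(x, W1, b1, W2, b2),
                   net(x, alpha * W1, alpha * b1, W2 / alpha, b2))
```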

  3. ^

    This reminds me of the notion of data-program equivalence (programs-as-data, Gödel numbering, UTM). Perhaps some infinitesimal version of it?

  4. ^

    Let the input-side symmetry be trivial (i.e. δx=0), and we recover degeneracies originating from the structure of the parameter-function map alone as a special case.

The Hessian rank bounds the learning coefficient
Dalcy · 2mo

Found a proof sketch here (App. D.3); I couldn't find it elsewhere in canonical SLT references, e.g. the gray book. The idea seems simple:

  1. (We'll prove it for the local RLCT, because the splitting lemma most naturally applies when dealing with local diffeomorphisms - but if you're interested in the statement for the global RLCT, then since the RLCT is the min over local RLCTs (3.2.2), just work with the local RLCT at the minimizing point and the rest of the argument follows.)
  2. Local RLCT is invariant under local diffeomorphism, easy to prove by the volume scaling formula (check exercise 17).
  3. Apply the splitting lemma (elementary proof in Gilmore 3.4) to locally express the function as a nondegenerate quadratic in some of the coordinates plus a function of the remaining ones.
  4. Use Remark 7.2.3 in the gray book which says that if K(wa,wb)=Ka(wa)+Kb(wb) and φ(wa,wb)=φa(wa)φb(wb) (which the change of variable in the splitting lemma satisfies), then λ=λa+λb (also easy to prove with the volume scaling formula).
  5. So λ is lower-bounded by the RLCT of the quadratic part, which is d₁/2 (again easy to prove with the volume scaling formula, check pg 17 here; the computation is written out below).
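
For step 5, the volume-scaling computation for the quadratic part is short enough to write out (my own sketch of the standard computation):

```latex
% RLCT of the nondegenerate quadratic part K_a(w_a) = w_1^2 + \cdots + w_{d_1}^2
% via the volume scaling formula: the sublevel set is a ball of radius sqrt(ε),
\[
  V(\varepsilon)
  = \operatorname{Vol}\{\, w_a \in \mathbb{R}^{d_1} : K_a(w_a) \le \varepsilon \,\}
  = \operatorname{Vol}\bigl(B^{d_1}_{\sqrt{\varepsilon}}\bigr)
  = c_{d_1}\, \varepsilon^{d_1/2},
\]
% and reading off the exponent in V(ε) ~ ε^λ gives λ_a = d_1 / 2.
```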
Singular learning theory: exercises
Dalcy · 2mo

∇_w log p(y|x,w) = −(y − f(w)(x))^T ⋅ Jf(w)(x)

There shouldn't be a negative sign here (14a).

(will edit this comment over time to collect typos as I find them)

If Anyone Builds It, Everyone Dies: Advertisement design competition
Dalcy · 3mo

The fourth one is great.

X explains Z% of the variance in Y
Dalcy · 3mo

Conventionally Var[Y|X] is a random variable, just like how E[Y|X] is a random variable. To be fair the conventions are somewhat inconsistent, given that (as you said) H(Y|X) is a number.
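
A tiny illustration with made-up numbers: for a discrete joint distribution, Var[Y|X], like E[Y|X], comes out as one value per value of X, i.e. a function of X rather than a single number.

```python
import numpy as np

# Var[Y|X] as a random variable: compute x ↦ Var[Y | X = x] for a toy discrete joint.
joint = np.array([[0.2, 0.1, 0.1],    # P(X=0, Y=y) for y = 0, 1, 2
                  [0.1, 0.2, 0.3]])   # P(X=1, Y=y)
y = np.array([0.0, 1.0, 2.0])
p_y_given_x = joint / joint.sum(axis=1, keepdims=True)

E_Y_given_X = p_y_given_x @ y                          # E[Y|X=x] for each x
Var_Y_given_X = p_y_given_x @ y**2 - E_Y_given_X**2    # Var[Y|X=x] for each x
print(E_Y_given_X)    # a function of X (two values), not a single number
print(Var_Y_given_X)  # likewise
```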

satchlj's Shortform
Dalcy · 5mo

Previous discussion, comment by johnswentworth:

Relevant slogan: Goodheart is about generalization, not approximation.

[...]

In all the standard real-world examples of Goodheart, the real problem is that the proxy is not even approximately correct once we move out of a certain regime.

Alexander Gietelink Oldenziel's Shortform
Dalcy · 5mo

Speaking from the perspective of someone still developing basic mathematical maturity and often lacking prerequisites, it's very useful as a learning aid. For example, it significantly expanded the range of papers or technical results accessible to me. If I'm reading a paper containing unfamiliar math, I no longer have to go down the rabbit hole of tracing prerequisite dependencies, which often expand exponentially (partly because I don't know which results or sections in the prerequisite texts are essential, making it difficult to scope my focus). Now I can simply ask the LLM for a self-contained exposition. Using traditional means of self-studying like [search engines / Wikipedia / StackExchange] is very often no match for this task, mostly in terms of time spent or wasted effort; simply having someone I can directly ask my highly specific (and often dumb) questions or confusions and receive equally specific responses is just really useful.

Dalcy's Shortform
Dalcy · 5mo

Non-Shannon-type Inequalities

The first new qualitative thing in Information Theory when you move from two variables to three variables is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but there can be negative triple mutual information I(X;Y;Z).
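
The standard XOR example makes this concrete; here is a small self-contained check (my own code) that I(X;Y;Z) = −1 bit when X and Y are independent fair bits and Z = X⊕Y.

```python
import numpy as np
from itertools import product

# Negative triple mutual information: X, Y independent fair bits, Z = X XOR Y.
def entropy(pmf):
    """Entropy in bits of a pmf given as {outcome: probability}."""
    return -sum(p * np.log2(p) for p in pmf.values() if p > 0)

def marginal(pmf, idx):
    """Marginal pmf over the coordinates listed in idx."""
    out = {}
    for outcome, p in pmf.items():
        key = tuple(outcome[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

joint = {(x, y, x ^ y): 0.25 for x, y in product([0, 1], repeat=2)}

H = lambda idx: entropy(marginal(joint, idx))
# Inclusion-exclusion form of the triple mutual information:
I3 = (H([0]) + H([1]) + H([2])
      - H([0, 1]) - H([0, 2]) - H([1, 2])
      + H([0, 1, 2]))
print(I3)  # -1.0
```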

This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.

A fundamental result in Information Theory is that I(X;Y∣Z)≥0 always holds.

  • Given n random variables X1,…,Xn and α,β,γ⊆[n], from now on we write I(α;β∣γ), with each index set standing for the joint variable formed by the corresponding random variables.

Since I(α;β|γ)≥0 always holds, a nonnegative linear combination of a bunch of these is always a valid inequality, which we call a Shannon-type Inequality.

Then the question is whether Shannon-type Inequalities capture all valid information inequalities in n variables. It turns out: yes for n=2, (approximately) yes for n=3, and no for n≥4.

Behold, the glorious Zhang-Yeung inequality, a Non-Shannon-type Inequality for n=4:

I(A;B)≤2I(A;B∣C)+I(A;C∣B)+I(B;C∣A)+I(A;B∣D)+I(C;D)
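
A quick numerical spot check (my own code, not a proof): compute both sides of the inequality as stated above from the joint entropies of randomly sampled distributions over four binary variables, and look at the slack.

```python
import numpy as np

# Spot-check the inequality above on random joint pmfs over four binary variables.
rng = np.random.default_rng(0)

def H(p, axes):
    """Entropy (bits) of the marginal of joint pmf p on the variable indices in `axes`."""
    axes = tuple(sorted(axes))
    if not axes:
        return 0.0
    drop = tuple(i for i in range(p.ndim) if i not in axes)
    m = p.sum(axis=drop) if drop else p
    m = m[m > 0]
    return float(-(m * np.log2(m)).sum())

def I(p, a, b, c=()):
    """Conditional mutual information I(a; b | c); a, b, c are sets of variable indices."""
    a, b, c = set(a), set(b), set(c)
    return H(p, a | c) + H(p, b | c) - H(p, a | b | c) - H(p, c)

A, B, C, D = 0, 1, 2, 3
min_slack = np.inf
for _ in range(1000):
    p = rng.random((2, 2, 2, 2))
    p /= p.sum()
    rhs = (2 * I(p, {A}, {B}, {C}) + I(p, {A}, {C}, {B}) + I(p, {B}, {C}, {A})
           + I(p, {A}, {B}, {D}) + I(p, {C}, {D}))
    min_slack = min(min_slack, rhs - I(p, {A}, {B}))
print(f"minimum slack over 1000 random pmfs: {min_slack:.4f}")  # empirically stays >= 0
```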

Explanation of the math, for anyone curious.

  • Given n random variables and α,β,γ⊆[n], it turns out that I(α;β∣γ)≥0 is equivalent to H(α∪β)+H(α∩β)≤H(α)+H(β) (submodularity), H(α)≤H(β) if α⊆β, and H(∅)=0.

    This lets us write the inequality involving conditional mutual information in terms of joint entropy instead.

    Let Γ∗n then be a subset of R^(2^n), each element recording the joint entropies assigned to each subset of some random variables X1,…,Xn. For example, an element of Γ∗2 would be (H(∅),H(X1),H(X2),H(X1,X2))∈R^(2^2) for some random variables X1 and X2, with a different element being a different tuple induced by different random variables (X′1,X′2).

    Now let Γn be the set of elements of R^(2^n) satisfying the three aforementioned conditions on joint entropy. For example, an element of Γ2 would be (h∅,h1,h2,h12)∈R^(2^2) satisfying e.g. h1≤h12 (monotonicity). This is also a convex cone, so its elements really do correspond to "nonnegative linear combinations" of Shannon-type inequalities.

    Then, the claim that "nonnegative linear combinations of Shannon-type inequalities span all inequalities on the possible Shannon measures" would correspond to the claim that Γn=Γ∗n for all n.

    The content of the papers linked above is to show that:

    • Γ2 = Γ∗2
    • Γ3 ≠ Γ∗3 but Γ3 = cl(Γ∗3) (the closure[1])
    • Γ4 ≠ Γ∗4 and Γ4 ≠ cl(Γ∗4), and likewise for all n≥4.
  1. ^

    This implies that, while there exists a 2³-tuple satisfying the Shannon-type inequalities that can't be constructed or realized by any random variables X1,X2,X3, there does exist a sequence of random variables (X1^(k),X2^(k),X3^(k)), k=1,2,…, whose induced 2³-tuples of joint entropies converge to that tuple in the limit.

Posts

• Towards building blocks of ontologies · 8mo
• What's the Right Way to think about Information Theoretic quantities in Neural Networks? · 8mo
• Proof Explained for "Robust Agents Learn Causal World Model" · 9mo
• An Illustrated Summary of "Robust Agents Learn Causal World Model" · 10mo
• Money Pump Arguments assume Memoryless Agents. Isn't this Unrealistic? · 1y
• But Where do the Variables of my Causal Model come from? · 1y
• Why do Minimal Bayes Nets often correspond to Causal Models of Reality? · 1y
• When Are Results from Computational Complexity Not Too Coarse? · 1y
• Epistemic Motif of Abstract-Concrete Cycles & Domain Expansion · 2y
• Least-problematic Resource for learning RL? · 2y