An Illustrated Summary of "Robust Agents Learn Causal World Models"
This post was written during Alex Altair's agent foundations fellowship program, funded by the LTFF. Thanks to Alex Altair, Alfred Harwood, and Daniel C for feedback and comments.

Introduction

The selection theorems agenda aims to prove statements of the following form: "agents selected under criterion X have property Y," where Y is something like world models, general-purpose search, modularity, etc. We're going to focus on world models. But what is the intuition that makes us expect to be able to prove such things in the first place?

Why expect world models? Assume the world is a Causal Bayesian Network, with the agent's actions corresponding to the D (decision) node. If the agent's actions can robustly control the U (utility) node despite various "perturbations" to the world, then intuitively it must have learned the causal structure of how U's parents influence U, in order to take them into account when acting. The same goes for how U's parents' parents influence U's parents, and so on; by induction, it must have learned the causal structure of the entire world upstream of the utility variable. (A rough formalization of this setup is sketched at the end of this post.)

This is the intuitive argument that the paper Robust Agents Learn Causal World Models by Jonathan Richens and Tom Everitt formalizes. Informally, its main theorem can be stated as: if an agent responds to various environment interventions by prescribing policies that overall yield low regret, then it is possible to query the agent appropriately and reconstruct an implicit world model that matches the ground-truth causal structure. I will refer to this result as the "Causal Good Regulator Theorem". (A toy demonstration of such querying also appears at the end of this post.)

This sequence

Thoughts on the Causal Good Regulator Theorem will do the following:

* [1] A precise statement of the theorem, alongside a high-level argument for its proof and a discussion of the paper's results in the context of the selection theorems agenda - basically a self-contained distillation of the paper. The rest of the posts in this
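As a rough formalization of the sketch above (the notation here is my own shorthand, not verbatim from the paper): let the environment be a Causal Bayesian Network over variables including a decision node $D$ and a utility node $U$, let $\sigma$ range over local interventions on the environment's mechanisms, and let $\pi_\sigma$ be the policy the agent prescribes when facing the intervened environment. The agent's regret under $\sigma$ is

$$\mathrm{Reg}(\pi_\sigma, \sigma) = \max_{\pi^*} \mathbb{E}_{\sigma, \pi^*}[U] - \mathbb{E}_{\sigma, \pi_\sigma}[U],$$

and the theorem says, roughly: if $\mathrm{Reg}(\pi_\sigma, \sigma) \le \epsilon$ for a sufficiently rich set of interventions $\sigma$, then from the map $\sigma \mapsto \pi_\sigma$ alone one can reconstruct a CBN over the ancestors of $U$ that approximates the true one, with the approximation error vanishing as $\epsilon \to 0$.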
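To make "query the agent to reconstruct its implicit model" concrete, here is a minimal toy sketch in the spirit of (but much simpler than) the paper's proof technique; the environment, the `bonus` perturbation, and all function names are my own illustration, not the paper's. The agent only ever outputs optimal actions, yet by perturbing its utilities and finding its indifference point, an observer can recover the environment's hidden parameter:

```python
def make_agent(p):
    """A zero-regret agent for a toy environment: X ~ Bernoulli(p),
    utility U(d, x) = 1[d == x] + bonus * 1[d == 1].
    Given a known utility perturbation `bonus`, it returns the
    expected-utility-maximizing action (it knows p; we observers don't)."""
    def policy(bonus):
        # E[U | d=1] = p + bonus,  E[U | d=0] = 1 - p
        return 1 if p + bonus > 1 - p else 0
    return policy

def elicit_p(policy, tol=1e-9):
    """Recover p purely from the agent's optimal responses: binary-search
    for the bonus at which the agent flips from d=0 to d=1.
    At indifference, p + bonus = 1 - p, so p = (1 - bonus) / 2."""
    lo, hi = -1.0, 1.0  # wide enough to force both actions for any p in (0, 1)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if policy(mid) == 1:
            hi = mid  # agent already prefers d=1; the flip point is lower
        else:
            lo = mid
    return (1 - (lo + hi) / 2) / 2

agent = make_agent(0.3)
print(elicit_p(agent))  # ~0.3: low-regret responses reveal P(X=1)
```

The design mirrors the theorem's structure in miniature: we never inspect the agent's internals, only its policy responses to a family of perturbed tasks, and those responses pin down the environment parameter the agent must implicitly "know" in order to be optimal.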
Plot idea: Isekaied from 2026 into some date in the past. Only goal: get cryogenically preserved through to the glorious transhumanist singularity. How do you influence the trajectory of history in a direction that would broadly enable this kind of future, while setting up the long-lasting infrastructure & incentives that would keep you cryogenically preserved for centuries to come?
I'm not going to write this, but I think this is a very interesting premise & problem (especially if thousands of years into the past) and would love to see someone build on it.
Some thoughts:
- It's definitely doable if you're sent back only a century. The earliest cryopreserved human still preserved today is James Bedford, born in 1893 and cryopreserved in 1967.