Dalcy
Nothing is "mere." I, too, can see the stars on a desert night, and feel them. But do I see less or more? The vastness of the heavens stretches my imagination - stuck on this carousel, my little eye can catch one-million-year-old light. A vast pattern - of which I am a part - perhaps my stuff was belched from some forgotten star, as one is belching there. Or see them with the greater eye of Palomar, rushing all apart from some common starting point when they were perhaps all together. What is the pattern, or the meaning, or the why? It does not do harm to the mystery to know a little about it.

                          - Richard P. Feynman on The Relation of Physics to Other Sciences

Comments

The Hessian rank bounds the learning coefficient
Dalcy · 1mo

Found a proof sketch here (App. D.3); couldn't find it elsewhere in canonical SLT references, e.g., the gray book. The idea seems simple:

  1. (We'll prove it for the local RLCT, since the splitting lemma most naturally applies when dealing with local diffeomorphisms. If you're interested in the statement for the global RLCT, then since the global RLCT is the minimum over local RLCTs (3.2.2), just work with the local RLCT at the minimizing point and the rest of the argument follows.)
  2. The local RLCT is invariant under local diffeomorphism; easy to prove by the volume scaling formula (check exercise 17).
  3. Apply the splitting lemma (elementary proof in Gilmore 3.4) to locally express the function as a quadratic form in the nondegenerate directions plus a function of the remaining directions.
  4. Use Remark 7.2.3 in the gray book, which says that if K(w_a, w_b) = K_a(w_a) + K_b(w_b) and φ(w_a, w_b) = φ_a(w_a) φ_b(w_b) (which the change of variables in the splitting lemma satisfies), then λ = λ_a + λ_b (also easy to prove with the volume scaling formula).
  5. So λ is lower-bounded by the RLCT of the quadratic part, which is d_1/2, where d_1 is the number of nondegenerate directions (again easy to prove with the volume scaling formula, check pg 17 here).
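
Putting the steps together (my own compression of the above, with d_1 the dimension of the nondegenerate quadratic block, i.e. the rank of the Hessian at the minimizer):

λ = λ_a + λ_b = d_1/2 + λ_b ≥ d_1/2

using step 5 for λ_a and the fact that local RLCTs are nonnegative for λ_b.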
Singular learning theory: exercises
Dalcy · 1mo

∇_w log p(y|x,w) = −(y − f(w)(x))^T · J_{f(w)}(x)

There shouldn't be a negative sign here (14a).
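
Quick sanity check (my own derivation, assuming the usual Gaussian regression model where log p(y|x,w) = −(1/2)‖y − f(w)(x)‖^2 + const; a different noise variance only rescales the right-hand side):

∇_w log p(y|x,w) = −(1/2) ∇_w ‖y − f(w)(x)‖^2 = (y − f(w)(x))^T · J_{f(w)}(x)

The minus sign from differentiating −f(w)(x) cancels the outer one, so there is no overall negative sign.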

(will edit this comment over time to collect typos as I find them)

If Anyone Builds It, Everyone Dies: Advertisement design competition
Dalcy · 2mo

The fourth one is great.

X explains Z% of the variance in Y
Dalcy · 3mo

Conventionally Var[Y|X] is a random variable, just like how E[Y|X] is a random variable. To be fair, the conventions are somewhat inconsistent, given that (as you said) H(Y|X) is a number.
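
A minimal sketch of what this convention means in practice (my own toy example, not from the post): Var[Y|X] is the random variable that takes the value Var[Y|X=x] whenever X=x, and by the law of total variance its mean is exactly the part of Var[Y] not explained by X.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy joint sample: X is a discrete group label, Y depends on X plus noise.
    n = 100_000
    x = rng.integers(0, 3, size=n)           # X in {0, 1, 2}
    y = 2.0 * x + rng.normal(0.0, 1.0, n)    # Y | X=x  ~  Normal(2x, 1)

    # E[Y|X] and Var[Y|X] as random variables: functions of X, one value per sample.
    cond_mean = {v: y[x == v].mean() for v in np.unique(x)}
    cond_var = {v: y[x == v].var() for v in np.unique(x)}
    e_y_given_x = np.array([cond_mean[v] for v in x])
    var_y_given_x = np.array([cond_var[v] for v in x])

    # Law of total variance: Var[Y] = E[Var[Y|X]] + Var[E[Y|X]].
    print(y.var())
    print(var_y_given_x.mean() + e_y_given_x.var())

The two printed numbers agree, which is the decomposition behind "X explains Z% of the variance in Y": Var[E[Y|X]] / Var[Y] is the explained fraction.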

satchlj's Shortform
Dalcy · 4mo

Previous discussion, comment by johnswentworth:

Relevant slogan: Goodhart is about generalization, not approximation.

[...]

In all the standard real-world examples of Goodhart, the real problem is that the proxy is not even approximately correct once we move out of a certain regime.

Alexander Gietelink Oldenziel's Shortform
Dalcy · 5mo

Speaking from the perspective of someone still developing basic mathematical maturity and often lacking prerequisites, it's very useful as a learning aid. For example, it significantly expanded the range of papers or technical results accessible to me. If I'm reading a paper containing unfamiliar math, I no longer have to go down the rabbit hole of tracing prerequisite dependencies, which often expand exponentially (partly because I don't know which results or sections in the prerequisite texts are essential, making it difficult to scope my focus). Now I can simply ask the LLM for a self-contained exposition. Using traditional means of self-studying like [search engines / Wikipedia / StackExchange] is very often no match for this task, mostly in terms of time spent or wasted effort; simply having someone I can directly ask my highly specific (and often dumb) questions or confusions and receive equally specific responses is just really useful.

Dalcy's Shortform
Dalcy · 5mo

Non-Shannon-type Inequalities

The first new qualitative thing in Information Theory when you move from two variables to three variables is the presence of negative values: information measures (entropy, conditional entropy, mutual information) are always nonnegative for two variables, but there can be negative triple mutual information I(X;Y;Z).
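
The standard example (my own illustration, not from the original note): take X, Y independent fair bits and Z = X XOR Y; then I(X;Y) = 0 but I(X;Y|Z) = 1 bit, so I(X;Y;Z) = I(X;Y) − I(X;Y|Z) = −1 bit. A short script confirming this directly from the joint distribution:

    import numpy as np
    from itertools import product

    # Joint pmf of (X, Y, Z): X, Y independent fair bits, Z = X XOR Y.
    p = {(x, y, x ^ y): 0.25 for x, y in product([0, 1], repeat=2)}

    def H(idx):
        """Joint entropy in bits of the coordinates in idx, e.g. (0, 2) for (X, Z)."""
        marginal = {}
        for outcome, prob in p.items():
            key = tuple(outcome[i] for i in idx)
            marginal[key] = marginal.get(key, 0.0) + prob
        return -sum(q * np.log2(q) for q in marginal.values() if q > 0)

    # I(X;Y;Z) = H(X)+H(Y)+H(Z) - H(X,Y)-H(X,Z)-H(Y,Z) + H(X,Y,Z)
    print(H((0,)) + H((1,)) + H((2,))
          - H((0, 1)) - H((0, 2)) - H((1, 2))
          + H((0, 1, 2)))   # -1.0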

This so far is a relatively well-known fact. But what is the first new qualitative thing when moving from three to four variables? Non-Shannon-type Inequalities.

A fundamental result in Information Theory is that I(X;Y∣Z)≥0 always holds.

  • Given n random variables X_1, …, X_n and α, β, γ ⊆ [n], from now on we write I(α;β∣γ), where each index set stands for the joint variable formed by the X_i's it contains.

Since I(α;β|γ)≥0 always holds, a nonnegative linear combination of a bunch of these is always a valid inequality, which we call a Shannon-type Inequality.

Then the question is whether Shannon-type Inequalities capture all valid information inequalities on n variables. It turns out: yes for n=2, (approximately) yes for n=3, and no for n≥4.

Behold, the glorious Zhang-Yeung inequality, a Non-Shannon-type Inequality for n=4:

I(A;B)≤2I(A;B∣C)+I(A;C∣B)+I(B;C∣A)+I(A;B∣D)+I(C;D)
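
A quick numeric sanity check (my own sketch, not from the original note): being non-Shannon-type means the inequality doesn't follow from nonnegative combinations of the I(α;β∣γ)≥0 inequalities, not that some distribution violates it, so it should hold on every randomly sampled four-variable pmf:

    import numpy as np

    rng = np.random.default_rng(0)

    def H(p, axes):
        """Joint entropy (bits) of the variables on `axes` of the joint pmf array p."""
        drop = tuple(i for i in range(p.ndim) if i not in axes)
        marg = p.sum(axis=drop) if drop else p
        marg = np.atleast_1d(marg).ravel()
        marg = marg[marg > 0]
        return float(-(marg * np.log2(marg)).sum())

    def I(p, a, b, c=()):
        """Conditional mutual information I(a;b|c); a, b, c are tuples of axis indices."""
        return H(p, a + c) + H(p, b + c) - H(p, c) - H(p, a + b + c)

    A, B, C, D = (0,), (1,), (2,), (3,)
    for _ in range(1000):
        p = rng.random((2, 2, 2, 2))
        p /= p.sum()  # a random joint pmf over four binary variables
        lhs = I(p, A, B)
        rhs = (2 * I(p, A, B, C) + I(p, A, C, B) + I(p, B, C, A)
               + I(p, A, B, D) + I(p, C, D))
        assert lhs <= rhs + 1e-9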

Explanation of the math, for anyone curious.

  • Given n random variables and α,β,γ⊆[n], it turns out that the family of inequalities I(α;β∣γ)≥0 is equivalent to the following conditions on joint entropy: H(α∪β)+H(α∩β)≤H(α)+H(β) (submodularity), H(α)≤H(β) if α⊆β (monotonicity), and H(∅)=0.

    This lets us write the inequality involving conditional mutual information in terms of joint entropy instead.

    Let Γ*_n then be the subset of R^(2^n) whose elements are the values of the joint entropy assigned to each subset of some random variables X_1, …, X_n. For example, an element of Γ*_2 would be (H(∅), H(X_1), H(X_2), H(X_1,X_2)) ∈ R^(2^2) for some random variables X_1 and X_2, with a different element being a different tuple induced by a different pair of random variables (X'_1, X'_2).

    Now let Γ_n be the set of elements of R^(2^n) satisfying the three aforementioned conditions on joint entropy. For example, an element of Γ_2 would be a tuple (h_∅, h_1, h_2, h_12) ∈ R^(2^2) satisfying, e.g., h_1 ≤ h_12 (monotonicity). This is a convex cone cut out by those linear conditions, so membership in Γ_n really does correspond to satisfying every "nonnegative linear combination" of Shannon-type inequalities.

    Then, the claim that "nonnegative linear combinations of Shannon-type inequalities span all inequalities on the possible Shannon measures" would correspond to the claim that Γ_n = Γ*_n for all n.

    The content of the papers linked above is to show that:

    • Γ_2 = Γ*_2
    • Γ_3 ≠ Γ*_3, but Γ_3 = cl(Γ*_3), its closure[1]
    • Γ_4 ≠ Γ*_4 and Γ_4 ≠ cl(Γ*_4), and likewise for all n ≥ 4.
     
  1. ^

    This implies that, while there exists a 2^3-tuple satisfying the Shannon-type inequalities that can't be constructed or realized by any random variables X_1, X_2, X_3, there does exist a sequence of random variables (X_1^(k), X_2^(k), X_3^(k)), k = 1, 2, …, whose induced 2^3-tuples of joint entropies converge to that tuple in the limit.
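
To make the Γ_n vs Γ*_n picture concrete, here is a small sketch (my own, with made-up helper names) that computes the entropy vector of a random 3-variable joint distribution, i.e. a point of Γ*_3, and checks that it satisfies the polymatroid conditions defining Γ_3. This is the easy containment Γ*_n ⊆ Γ_n; the results above concern whether the reverse containment (up to closure) holds.

    import numpy as np
    from itertools import chain, combinations

    rng = np.random.default_rng(1)
    n = 3
    p = rng.random((2,) * n)
    p /= p.sum()  # a random joint pmf over n binary variables

    def H(axes):
        """Joint entropy (bits) of the variables on `axes`; H of the empty set is 0."""
        drop = tuple(i for i in range(n) if i not in axes)
        marg = p.sum(axis=drop) if drop else p
        marg = np.atleast_1d(marg).ravel()
        marg = marg[marg > 0]
        return float(-(marg * np.log2(marg)).sum())

    subsets = list(chain.from_iterable(combinations(range(n), r) for r in range(n + 1)))
    h = {s: H(s) for s in subsets}  # the induced point of Γ*_n: one coordinate per subset

    # The three conditions defining Γ_n: H(∅) = 0, monotonicity, submodularity.
    assert abs(h[()]) < 1e-12
    for a in subsets:
        for b in subsets:
            union = tuple(sorted(set(a) | set(b)))
            inter = tuple(sorted(set(a) & set(b)))
            if set(a) <= set(b):
                assert h[a] <= h[b] + 1e-9
            assert h[union] + h[inter] <= h[a] + h[b] + 1e-9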

quetzal_rainbow's Shortform
Dalcy · 5mo

Relevant: Alignment as a Bottleneck to Usefulness of GPT-3

Between alignment and capabilities, which is the main bottleneck to getting value out of GPT-like models, both in the short term and the long(er) term?

Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
Dalcy · 5mo

By the way, Gemini 2.5 Pro and o3-mini-high are good at tic-tac-toe. I was surprised because the last time I tested this on o1-preview, it did quite terribly.

The Hessian rank bounds the learning coefficient
Dalcy · 5mo

Where in the literature can I find the proof of the lower bound?

Posts

Dalcy's Shortform (1 point, 3y, 128 comments)
Towards building blocks of ontologies (29 points, 7mo, 0 comments)
What's the Right Way to think about Information Theoretic quantities in Neural Networks? [Q] (45 points, 8mo, 13 comments)
Proof Explained for "Robust Agents Learn Causal World Model" (28 points, 9mo, 0 comments)
An Illustrated Summary of "Robust Agents Learn Causal World Model" [Ω] (67 points, 9mo, 2 comments)
Money Pump Arguments assume Memoryless Agents. Isn't this Unrealistic? [Q] (26 points, 1y, 6 comments)
But Where do the Variables of my Causal Model come from? [Ω] (38 points, 1y, 2 comments)
Why do Minimal Bayes Nets often correspond to Causal Models of Reality? [Q] (27 points, 1y, 1 comment)
When Are Results from Computational Complexity Not Too Coarse? (42 points, 1y, 8 comments)
Epistemic Motif of Abstract-Concrete Cycles & Domain Expansion (26 points, 2y, 2 comments)
Least-problematic Resource for learning RL? [Q] (23 points, 2y, 9 comments)