The Theory Behind Loss Curves

by James Camacho
6th May 2025
Linkpost from github.com


Or, why GAN training looks so funky.

Solomonoff's Lightsaber

The simplest explanation is exponentially more important.

Suppose I give you a pattern, and ask you to explain what is going on:

1,2,4,8,16,…

Several explanations might come to mind:

  • "The powers of two,"
  • "Moser's circle problem,"
  • x^4/24 − x^3/12 + 11x^2/24 + 7x/12 + 1,
  • "The counting numbers in an alien script,"
  • "Fine-structure constants."

Some of these explanations are better than others, but any one of them could be the "correct" one. Rather than committing to a single underlying truth, we should assign a weight to each explanation, with the ones more likely to produce the pattern we see getting a heavier weight:

\text{Grand Unified Explanation} = \sum_{\text{explanation}} w_{\text{explanation}} \cdot \text{explanation}.

Now, what exactly is meant by the word "explanation"? Between humans, our explanations are usually verbal signals or written symbols, with a brain to interpret the meaning. If we want more precision, we can program an explanation into a computer, e.g. fn pattern(n) {2^n}. If we are training a neural network, an explanation describes what region of weight-space produces the pattern we're looking for, with a few error-correction bits since the neural network is imperfect. See, for example, the paper "ARC-AGI Without Pretraining" (Liao & Gu).

Let's take the view that explanations are simply strings of bits, and our interpreter does the rest of the work to turn them into words, programs, or neural networks. This means there are exactly 2^n n-bit explanations, and the average weight for each of them is less than 1/2^n. Now, most explanations—even the short ones—have hardly any weight, but there are still exponentially many longer explanations that are "good"[1]. This means that if we take the n most prominent explanations, we would expect the remaining explanations to have total weight on the order of exp(−n).
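
To make this concrete, here is a toy sketch (my own, not from the linked repository): it weights a handful of candidate explanations for the pattern above by 2^(−bits) and mixes their predictions for the next term. The bit-lengths are invented stand-ins for real description lengths; only their relative sizes matter.

```python
# Toy Solomonoff-style mixture over explanations of "1, 2, 4, 8, 16, ...".
# The bit-lengths are invented for illustration.
candidates = {
    # name: (assumed description length in bits, prediction for the 6th term)
    "powers of two":          (10, 32),
    "Moser's circle problem": (14, 31),
    "interpolating quartic":  (40, 31),
    "alien counting script":  (60, 32),
}

# Weight each explanation by 2^(-length), then normalize.
weights = {name: 2.0 ** -bits for name, (bits, _) in candidates.items()}
total = sum(weights.values())
posterior = {name: w / total for name, w in weights.items()}

# The "Grand Unified Explanation" is the weighted mixture of the individual predictions.
mixture = sum(posterior[name] * pred for name, (_, pred) in candidates.items())
print(posterior)
print("mixture prediction for the next term:", round(mixture, 4))
```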

Counting Explanations

What you can count, you can measure.

Suppose we are training a neural network, and we want to count how many explanations it has learned. Empirically, we know the loss comes from all the missing explanations, so

\text{Loss} \sim \exp(-\text{explanations}) \iff \text{explanations} \sim -\log(\text{Loss}).

However, wouldn't it be more useful to go the other way? To estimate the loss, at the beginning of a training run, by counting how many concepts we expect our neural network to learn? That is our goal for today.

If we assume our optimizers are perfect, we should be able to use every bit of training data, and the proportion of the time a neural net has learned any particular concept will be

x = \exp(t - |\text{explanation}|),

where t is the training iteration and |explanation| the bit-length of the concept. The proportion of the time it learns the concept twice is x^2, thrice is x^3, and so on. We can use the partition function

Z(x) = \sum_{n=0}^{\infty} x^n = 1 + x + x^2 + x^3 + \cdots = (1 - x)^{-1}

to keep track of how many times the network has learned a concept. To track multiple concepts, say x and y, we would just multiply their partition functions:

Z(x, y) = Z(x)\,Z(y).

It's actually more useful to look at the logarithm; that way we can add density functions instead of multiplying partition functions[2]:

\Omega(x) = \ln Z(x) = -\ln(1 - x) = \sum_{n=1}^{\infty} \frac{x^n}{n}
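
As a quick sanity check (mine, not from the post), the truncated series match the closed forms above, and tracking two concepts by multiplying partition functions is the same as adding their Ω's. The helper names and the particular x, y values are just for the check.

```python
# Check Z(x) = sum_n x^n = 1/(1-x) and Omega(x) = ln Z(x) = sum_{n>=1} x^n/n = -ln(1-x).
import math

def Z_series(x, terms=200):
    """Partition function for a concept learned any number of times."""
    return sum(x ** n for n in range(terms))

def Omega_series(x, terms=200):
    """Its logarithm (the plethystic logarithm), as a power series."""
    return sum(x ** n / n for n in range(1, terms))

x, y = 0.4, 0.25
assert math.isclose(Z_series(x), 1 / (1 - x), rel_tol=1e-9)
assert math.isclose(Omega_series(x), -math.log(1 - x), rel_tol=1e-9)

# Two concepts: partition functions multiply, so their Omega's add.
assert math.isclose(Omega_series(x) + Omega_series(y),
                    math.log(Z_series(x) * Z_series(y)), rel_tol=1e-9)
print("series identities hold at x =", x, "and y =", y)
```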

Now, not every model learns by simply memorizing the training distribution. We'll look at three kinds of learning dynamics:

  • Z_1—The network memorizes a concept, and continues to overfit on that concept. This is your typical training run, such as with classifying MNIST digits.
  • Z_2—The network can only learn a concept once. Equivalently, we can pretend that the network alternates between learning and forgetting a concept. This is for extremely small models, or grokking in larger training runs.
  • Z_6—One network is trying to learn and imitate a concept, while another network is trying to discriminate what is real and what is an imitation. Any time you add adversarial loss—such as with GANs or the information bottleneck—you'll get this learning dynamic.

In general, a learning dynamic can be described by some group G. It's possible to go through several steps at once, so every group element g ∈ G creates a sub-dynamic. Also, we could begin at any step in the dynamic, at g, g^2, and so on, up to g^{|g|} = 1, where |g| is the order of g. So, for a particular sub-dynamic g, our density function becomes

\Omega(g, x) = \sum_{n=1}^{\infty} \sum_{k=1}^{|g|} \frac{(g^k x)^n}{n} = \sum_{n=1}^{\infty} \frac{x^{n|g|}}{n} = -\ln\left(1 - x^{|g|}\right),

since[3]

\sum_{k=1}^{|g|} g^{kn} =
\begin{cases}
\dfrac{g^{|g|n + n} - g^n}{g^n - 1} = \dfrac{g^n - g^n}{g^n - 1} = 0, & n \not\equiv 0 \pmod{|g|} \\[6pt]
1 + 1 + \cdots + 1 = |g|, & n \equiv 0 \pmod{|g|}.
\end{cases}
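
A small numeric sketch (mine, not from the post) of this filter, taking g = e^(2πi/6); the loop range is arbitrary:

```python
# Roots-of-unity filter: sum_{k=1..|g|} g^{kn} is |g| when |g| divides n, and 0 otherwise.
import cmath

order = 6
g = cmath.exp(2j * cmath.pi / order)
for n in range(1, 13):
    s = sum(g ** (k * n) for k in range(1, order + 1))
    print(n, complex(round(s.real, 10), round(s.imag, 10)))  # ~6+0j when 6 | n, ~0 otherwise
```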

To capture the entire group of dynamics, we have to project onto the fundamental representation of our group:

\Omega(x) = \sum_{g \in G} \chi(g)\, \Omega(g, x).

Finally, to get back the partition function, we exponentiate:

Z(x) = \exp(\Omega(x)).

For the three groups in question, we have

Z_1: \chi(n) = 1 \implies Z(x) = (1 - x)^{-1}
Z_2: \chi(n) = (-1)^n \implies Z(x) = 1 + x
Z_6: \chi(n) = e^{2\pi i n / 6} \implies Z(x) = \frac{1 + x}{1 + x^3}.
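
These closed forms are easy to verify numerically. The sketch below (my own check, not the author's code, with a helper name I made up) builds Z(x) = exp(Σ_k χ(g^k) Ω(g^k, x)) for each group, using |g^k| = |G|/gcd(k, |G|), and compares it to the formulas above.

```python
# Character projection check: Z(x) = exp( sum_{k=1..|G|} chi(k) * Omega(g^k, x) ),
# with Omega(g, x) = -ln(1 - x^{|g|}) and |g^k| = |G| / gcd(k, |G|) for cyclic G.
import cmath, math

def Z_from_character(chi, group_order, x):
    omega = sum(chi(k) * -cmath.log(1 - x ** (group_order // math.gcd(k, group_order)))
                for k in range(1, group_order + 1))
    return cmath.exp(omega)

x = 0.37
z1 = Z_from_character(lambda k: 1.0, 1, x)
z2 = Z_from_character(lambda k: (-1.0) ** k, 2, x)
z6 = Z_from_character(lambda k: cmath.exp(2j * cmath.pi * k / 6), 6, x)

assert abs(z1 - 1 / (1 - x)) < 1e-9                 # Z_1: (1 - x)^(-1)
assert abs(z2 - (1 + x)) < 1e-9                     # Z_2: 1 + x
assert abs(z6 - (1 + x) / (1 + x ** 3)) < 1e-9      # Z_6: (1 + x)/(1 + x^3)
print("all three partition functions match their closed forms")
```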

To recover the average number of times a concept has been learned, note that taking a derivative (and multiplying by x) brings down the exponents that keep track of this, e.g.

x \cdot \left(x^1 + x^2 + x^3\right)' = x^1 + 2x^2 + 3x^3,

so the expected number of times a concept has been learned is

n(x) = \frac{x Z'(x)}{Z(x)}.
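
If you'd rather not take these derivatives by hand, a short sympy check (my own, not the author's code) recovers the exponents −n(x) that appear in the loss formulas below:

```python
# Compute -n(x) = -x Z'(x)/Z(x) for the three partition functions; with Loss ~ exp(-n(x)),
# these are exactly the exponents in the loss curves below.
import sympy as sp

x = sp.symbols("x")
for name, Z in [("Z_1", 1 / (1 - x)), ("Z_2", 1 + x), ("Z_6", (1 + x) / (1 + x**3))]:
    print(name, sp.simplify(-x * sp.diff(Z, x) / Z))
```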

Putting it all together, we get

\text{Loss}(Z_1) \sim \exp\left(\frac{x}{x - 1}\right) \qquad \text{Loss}(Z_2) \sim \exp\left(\frac{-x}{x + 1}\right) \qquad \text{Loss}(Z_6) \sim \exp\left(\frac{2x^2 - x}{x^2 - x + 1}\right)

for x ∝ exp(t). Here are the plots, with theory on the left and experiment on the right:

[Plots: theory on the left, experiment on the right.]

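If you want to reproduce the theory curves yourself, here is a minimal matplotlib sketch (mine, not the author's plotting code); the Z_1 curve is only drawn for x < 1, where its geometric series converges, and the x ranges are arbitrary.

```python
# Plot the three theoretical loss curves, Loss = exp(-n(x)), as functions of x.
import numpy as np
import matplotlib.pyplot as plt

x1 = np.linspace(0.0, 0.99, 200)    # Z_1 only converges for x < 1
x = np.linspace(0.0, 3.0, 300)      # Z_2 and Z_6 are fine for all x >= 0

plt.plot(x1, np.exp(x1 / (x1 - 1)), label="Z_1: exp(x/(x-1))")
plt.plot(x, np.exp(-x / (x + 1)), label="Z_2: exp(-x/(x+1))")
plt.plot(x, np.exp((2 * x**2 - x) / (x**2 - x + 1)), label="Z_6: exp((2x^2-x)/(x^2-x+1))")
plt.xlabel("x ∝ exp(t)")
plt.ylabel("loss (up to scale)")
plt.legend()
plt.show()
```
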
  1. As in, they need very few error-correcting bits after interpretation. The explanation "fine-structure constants" needs many error-correcting bits, such as "your brain spasmed and misinterpreted the text," while "Moser's circle problem" produces the pattern without any need for error correction. ↩︎
  2. This is known as the plethystic logarithm. ↩︎
  3. This is the same idea as roots of unity filters. ↩︎

Comments

James Camacho

A couple things to add that don't deserve to be in the main text:

  1. The Taylor series for the Z_6 partition function is 1 + x − x^3 − x^4 + O(x^5), which means it actively learns "not this" the third and fourth times around. This is why we see a dip (1 + x) when x < 1, followed by a steep rise in loss (−x^3 − x^4) as x > 1, and then a tapering out.

  2. The Z_1 and Z_2 partition functions correspond to bosons (e.g. photons) and fermions (e.g. electrons) in physics. Perhaps Z_6 corresponds to an exotic particle the theorists have yet to classify.

RogerDearnaley

Pretty sure that the 'exotic particle' in question for the last sentence would be a spin-1/6 anyon. So '…have already classified'.

James Camacho

I haven't been able to find the spin-1/6 anyon's partition function, so mine could be wrong.