Have you read Eva Silverstein's work here?
Symmetry Breaking in Transformers for Efficient and Interpretable Training
I'd not seen that, but the idea of spontaneous symmetry breaking in neural networks as they learn isn't new; I believe it goes back at least to Hopfield networks.
Since their system has no analogue of thermodynamic cooling (it's literally called an energy-conserving optimizer), it can't break its symmetry spontaneously, which is why they had to break it manually.
Statistical mechanics is the process of controlled forgetting. Our main task is to figure out how to forget something about one system in order to learn something about another.
The temperature of a system is its exchange rate between some conserved quantity and information. Usually that conserved quantity is energy. The hotter something is, the more energy we need to dump into it to successfully forget a given amount of information about it.
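In standard thermodynamic notation (my gloss, not something from the original post), that exchange rate is just the usual definition of temperature:

```latex
\frac{1}{T} = \frac{\partial S}{\partial E}
\quad\Longrightarrow\quad
\delta E = T\,\delta S .
```

Forgetting information about the system means raising its entropy $S$, so the hotter the system, the more energy each forgotten bit costs.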
Let's suppose we want to take energy out of a system, at the price of learning something about that system.
That's weird! There are some periods where we can get a bunch of energy out without changing the price, but then the price gets suddenly higher after that?
And when we open up the box of gas at the end of the process, we'll find that it's turned into these weird pointy lumps? Huh?
What's going on?
Symmetry
What's the first answer that comes to your mind when I pose the following pair-matching game:
I bet you answered that the ice was more symmetrical and the vapour was less symmetrical. When you imagined a cloud of vapour, you imagined a chaotic arrangement of molecules; for an ice crystal, you imagined a regular lattice.
Let's try again with the Ising model (you can read John's explanation there, or Claude's explanation here).
Claude's Ising Explainer
Imagine a grid of spins, each either up (↑) or down (↓). Each spin has a simple preference: it "wants" to match its neighbours. That's the whole model. What makes it interesting is what happens when you dial a single parameter — temperature — which controls how much random thermal jostling can override those preferences.
At low temperature, the spins cooperate: you get large patches of all-up or all-down. At high temperature, the jostling dominates and the grid is a random mess.
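Both regimes are easy to reproduce. Here's a minimal Metropolis-style simulation (my own sketch, not the post's demo code): below the critical temperature the broken-symmetry state persists, above it the jostling wins.

```python
import numpy as np

def metropolis_sweep(spins, T, rng):
    """One Metropolis sweep: attempt one spin flip per site on average."""
    n = spins.shape[0]
    for _ in range(n * n):
        i, j = rng.integers(0, n, size=2)
        # Sum of the four neighbours, with periodic (wrap-around) boundaries.
        nb = (spins[(i + 1) % n, j] + spins[(i - 1) % n, j]
              + spins[i, (j + 1) % n] + spins[i, (j - 1) % n])
        dE = 2 * spins[i, j] * nb  # energy cost of flipping spin (i, j)
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            spins[i, j] *= -1

def magnetisation(T, sweeps=300, n=16, seed=0):
    """Average spin after equilibrating an all-up grid at temperature T."""
    rng = np.random.default_rng(seed)
    spins = np.ones((n, n), dtype=int)
    for _ in range(sweeps):
        metropolis_sweep(spins, T, rng)
    return spins.mean()

# The 2D Ising critical temperature is ~2.27 (in units where J = k_B = 1).
m_cold = magnetisation(T=1.0)   # stays magnetised: large aligned patches
m_hot = magnetisation(T=5.0)    # thermal jostling destroys the order
print(f"cold: {m_cold:+.2f}  hot: {m_hot:+.2f}")
```

At low temperature the average spin stays near ±1; at high temperature it hovers near zero.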
Again, I expect some of you will have said that the hot system was less symmetrical, and the cold system was more symmetrical.
If so, you've not yet caught on to the two most important concepts in stat mech.
Symmetry of States, not Things
The first is that we're thinking about symmetry over states, not over objects.
Let's start with the Ising model, since it's simpler. At high temperatures, both states are equivalent; we have lots of spin ups, and lots of spin downs. At low temperature, all the spins enter the same state, so the two states are no longer equivalent. Since this happens without any external input as to which state to enter, it's called spontaneous symmetry breaking.
What are the states that a water molecule can be in? Roughly, position, orientation, velocity, angular velocity. In the vapour, all the states are equivalent, and molecules are distributed evenly across them.
In the ice crystal, one particular velocity and angular momentum state is privileged (the velocity and angular momentum of the macroscopic crystal). One position and orientation of the lattice is privileged.
This is universal to all crystals. In fact, from the perspective of stat mech, the definition of a crystal is "a spontaneous break in local spatial symmetry."
(As an aside, this might help you make sense of the concept of a "time crystal": it's just a thing which oscillates predictably.)
Symmetry in the Map, not the Territory
The other way of thinking about this is in the map. Imagine that cloud of steam again. You're uncertain about all of the particles; any of them might be anywhere. Your map of the gas is symmetrical across all the locations in the cloud.
Now imagine you learn the location of five of the molecules. Your map basically hasn't changed; it's still essentially symmetrical.
Now imagine the same for the ice crystal. You start unsure of the location of all of the molecules, as before. But this time, if you learn the location of a few molecules, your map of the crystal is completely changed: you now have an enormous amount of information about the position and orientation of all the molecules (of course you don't have perfect information about all of them; only those within the convex hull of the molecules you did see, but that's still quite a lot!).
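This "a few observations pin down everything" property is concrete enough to code. Here's a toy sketch with a hypothetical 1-D crystal, assuming the lattice spacing is already known (a real inference would also have to infer spacing and orientation, but the point survives): one glance at a single molecule determines the position of every other one.

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical 1-D "crystal": molecules sit at offset + k * spacing.
spacing = 1.5
offset = rng.uniform(0, spacing)          # unknown to the observer
crystal = offset + spacing * np.arange(100)

# Observe the positions of just three molecules (we don't know which ones).
observed = rng.choice(crystal, size=3, replace=False)

# Knowing the lattice spacing, a single observation pins down the phase...
inferred_offset = observed[0] % spacing

# ...which predicts every other molecule's position.
predicted = inferred_offset + spacing * np.arange(100)
print(np.allclose(predicted, crystal))
```

Try the same trick on a gas: three observed positions tell you nothing at all about the other ninety-seven.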
It's the same with the Ising model. If the temperature is high, then learning about a few of the grid elements' spin states doesn't change what you know about the other states. If the temperature is low, then learning about a state tells you the whole global state of the grid.
When the system has global symmetry, your map is robustly symmetric: learning a little information doesn't tell you much; when it has no global symmetry, your map is only contingently symmetric: learning a little information teaches you a lot.
The Price of Energy
This is the price of that energy. In order to get that energy out, and convert our steam cloud into an ice crystal, we had to learn a lot about the system. It didn't seem like it, since we were still uncertain of where those molecules would actually be, but that's only because we were thinking about the locations of individual molecules, one at a time.
If learning the position of a few molecules of ice tells you the position of all the others, then you already knew quite a lot about the system; it was just contained in the conditional distribution of the molecules given one another. You were secretly un-forgetting all along!
There's a three-way relationship here:
In these parts, we have another word for a situation where learning the state of a few particles teaches us about the rest. The spontaneous symmetry breaking produced a natural latent. Now, this isn't the only way a natural latent can form, nor might it even be the most common way, but it is a way!
Demos because I have too much free time[1]
You can find the code here.
For our first demo, let's put a bunch of particles in a void. The void loops at its edges, like Pac-Man. The particles start out with lots of kinetic energy, and lose it as they bump into each other (this is actually fairly realistic: atoms do lose energy as radiation when accelerating, such as when they collide with one another). There's a non-directional attractive force between the particles that kicks in at short distances:
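A minimal sketch of that setup (not the actual demo code; the force law and damping constants here are my own hypothetical choices: a soft short-range spring that attracts beyond a preferred distance and repels inside it, plus a small per-step velocity loss standing in for radiation):

```python
import numpy as np

rng = np.random.default_rng(0)
n, box, dt = 50, 10.0, 0.01
pos = rng.uniform(0, box, size=(n, 2))
vel = rng.normal(0, 5.0, size=(n, 2))       # start with lots of kinetic energy

def forces(pos):
    """Soft short-range pair force: attract beyond r=0.8, repel inside it."""
    f = np.zeros_like(pos)
    for i in range(n):
        d = pos - pos[i]
        d -= box * np.round(d / box)          # nearest image: Pac-Man wrap
        r = np.linalg.norm(d, axis=1)
        near = (r > 1e-9) & (r < 1.5)         # interact only at short range
        f[i] = (d[near] * ((r[near] - 0.8) / r[near])[:, None]).sum(axis=0)
    return f

ke0 = 0.5 * (vel ** 2).sum()
for _ in range(1000):
    vel += forces(pos) * dt
    vel *= 0.995                 # "radiative" loss: shed a little energy each step
    pos = (pos + vel * dt) % box # periodic wrap at the edges
ke1 = 0.5 * (vel ** 2).sum()
print(f"kinetic energy: {ke0:.0f} -> {ke1:.2f}")
```

As the energy drains away, the particles condense into clumps at the preferred spacing: the pointy lumps from earlier.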
And let's do a lattice too! Instead of using up/down states, we'll use angles (this is really just going from S⁰, the zero-sphere, to S¹, the one-sphere). Each particle in the grid has an angle θ and an angular velocity ω. The velocity slows down (as if by friction, or radiation) over time, but we also inject some randomly, according to temperature. You'll have to download the code to look at that one, though.
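Here's a compact sketch of that kind of lattice (again my own toy version, not the downloadable demo: neighbour torques, friction, and temperature-scaled random kicks on ω):

```python
import numpy as np

def simulate(temperature, steps=400, n=16, seed=0):
    """Damped lattice of angles with thermal kicks; returns the order parameter."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n, n))                 # start fully aligned
    omega = np.zeros((n, n))
    dt, friction = 0.1, 0.1
    for _ in range(steps):
        # Torque pulls each angle toward its four neighbours' angles.
        torque = sum(np.sin(np.roll(theta, s, axis=a) - theta)
                     for s in (1, -1) for a in (0, 1))
        omega += (torque - friction * omega) * dt
        omega += np.sqrt(temperature * dt) * rng.normal(size=(n, n))
        theta += omega * dt
    # Order parameter: 1 if all angles agree, near 0 if they're random.
    return abs(np.exp(1j * theta).mean())

print(simulate(temperature=0.01), simulate(temperature=10.0))
```

At low temperature the aligned state survives the jiggling; at high temperature the angles decohere and the global order melts away, just like the Ising grid.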
Analogy to AI Training
If you're uninterested in reading about AI, then feel free to stop reading here. I just couldn't resist.
Suppose your LLM forms an induction head. This is a two-layer circuit where one attention head writes information from the previous token, and another attention head looks for it. This is often referred to as a phase change, which is true, but the analogy runs deeper than that.
Which subspace of the residual stream does the first head write to? I have no idea, but I do know that the second head has to read from the same subspace. Sound familiar?
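A toy linear-algebra illustration of why the subspaces must match (all of the matrices here are hypothetical stand-ins, not real transformer weights): a "write" map embeds a signal into some k-dimensional subspace of the residual stream, and a "read" map only recovers it if it projects from that same subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 256, 4                       # residual stream width, subspace width

# Hypothetical write/read maps: the first head writes a k-dim signal into
# the residual stream; the second head projects the stream back down.
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))   # a random k-dim subspace
W_write = basis                                    # (d, k): signal -> stream
W_read_matched = basis.T                           # (k, d): stream -> signal

other, _ = np.linalg.qr(rng.normal(size=(d, k)))   # an unrelated subspace
W_read_mismatched = other.T

signal = rng.normal(size=k)
stream = W_write @ signal           # residual stream after the first head

# Matched read recovers the signal exactly; a mismatched read catches only
# the tiny accidental overlap between two random subspaces (~sqrt(k/d)).
print(np.linalg.norm(W_read_matched @ stream - signal))
print(np.linalg.norm(W_read_mismatched @ stream))
```

So before training picks a subspace, all choices are equivalent; once the circuit forms, one direction is privileged and every later component must conform to it.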
This is true of basically every multi-layer circuit in transformers. I don't know which subspace the previous-token head of the Michael Jordan basketball circuit writes to, but I do know that the Michael + Jordan Basketball lookup circuits in the MLP layers (which probably implement a shallow circuit using cross-layer superposition) read from that same subspace: whatever subspace the earlier components write to, the later ones must read from.
I have more thoughts here, about how entropy barriers to crystal nucleation might analogise to entropy barriers to forming multi-layer circuits as opposed to shallow ones during training, but that's a thought for another post.
Haha, just kidding, I'm bunking off writing my PhD thesis.