Epistemic Status: This is me writing out the more technical connections and trying to mathematize the underlying dynamics to make it actually useful. I've spent a bunch of time on Spectral Graph Theory & GDL over the last year, so I'm confident in that part but uncertain in the rest. From the perspective of my Simulator Worlds framing this post is Exploratory (i.e. I'm uncertain whether the claims are correct and it hasn't been externally verified) and it is based on an analytical world. Therefore, take it with a grain of salt and explore the claims as they come; it is meant more as inspiration for future work than anything else, especially the physics and SLT parts.
Introduction: Why Crystallization?
When we watch a neural network train, we witness something that looks remarkably like a physical process. Loss decreases in fits and starts. Capabilities emerge suddenly after long plateaus. The system seems to "find" structure in the data, organizing its parameters into configurations that capture regularities invisible to random initialization. The language we reach for—"phase transitions," "energy landscapes," "critical points"—borrows heavily from physics. But which physics?
The default template has been thermodynamic phase transitions: the liquid-gas transition, magnetic ordering, the Ising model. These provide useful intuitions about symmetry breaking and critical phenomena. But I want to argue for a different template—one that better captures what actually happens during learning: crystallization.
The distinction matters. Liquid-gas transitions involve changes in density and local coordination, but both phases remain disordered at the molecular level. Crystallization is fundamentally different. It involves the emergence of long-range structural order—atoms arranging themselves into periodic patterns that extend across macroscopic distances, breaking continuous symmetry down to discrete crystallographic symmetry. This structural ordering, I will argue, provides a more faithful analogy for what neural networks do when they learn: discovering and instantiating discrete computational structures within continuous parameter spaces.
More than analogy, there turns out to be genuine mathematical substance connecting crystallization physics to the theoretical frameworks we use to understand neural network geometry. Both Singular Learning Theory and Geometric Deep Learning speak fundamentally through the language of eigenspectra—the eigenvalues and eigenvectors of matrices that encode local interactions and determine global behavior. Crystallization physics has been developing this spectral language for over sixty years. By understanding how it works in crystals, we may gain insight into how it works in neural networks.
Part I: What Is Crystallization, Really?
The Thermodynamic Picture
Classical nucleation theory, developed from Gibbs' thermodynamic framework in the late 1800s and given kinetic form by Volmer, Weber, Turnbull, and Fisher through the mid-20th century, describes crystallization as a competition between two driving forces. The bulk free energy favors the crystalline phase when conditions—temperature, pressure, concentration—make it thermodynamically stable. But creating a crystal requires establishing an interface with the surrounding medium, and this interface carries an energetic cost proportional to surface area.
For a spherical nucleus of radius r, the total free energy change takes the form:
$$\Delta G(r) = -\frac{4}{3}\pi r^3\,\Delta g_v + 4\pi r^2\gamma$$
where Δg_v represents the bulk free energy density difference favoring crystallization and γ is the interfacial free energy. The competition between the volume (r³) and surface (r²) terms creates a free energy barrier at a critical radius r*, below which nuclei tend to dissolve and above which they tend to grow.
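Setting dΔG/dr = 0 makes the barrier explicit (standard CNT algebra, spelled out here because ΔG* reappears in the rate formula below):

```latex
% Critical radius from d(Delta G)/dr = 0
\frac{d\,\Delta G}{dr} = -4\pi r^{2}\,\Delta g_v + 8\pi r\,\gamma = 0
\quad\Longrightarrow\quad
r^{*} = \frac{2\gamma}{\Delta g_v}

% Barrier height from substituting r^* back into Delta G(r)
\Delta G^{*} = \Delta G(r^{*}) = \frac{16\pi\,\gamma^{3}}{3\,\Delta g_v^{\,2}}
```

A small driving force or a large interfacial cost means a tall barrier; stronger undercooling (larger Δg_v) shrinks it.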
The nucleation rate follows an Arrhenius form:

$$J = A \exp\!\left(-\frac{\Delta G^{*}}{k_B T}\right)$$

where A includes the Zeldovich factor characterizing the flatness of the free energy barrier near the critical nucleus size. This framework captures an essential truth: crystallization proceeds through rare fluctuations that overcome a barrier, followed by deterministic growth once the barrier is crossed. The barrier height depends on both the thermodynamic driving force and the interfacial properties.
This structure—barrier crossing followed by qualitative reorganization—will find direct echoes in how neural networks traverse loss landscape barriers during training. Recent work in Singular Learning Theory has shown that transitions between phases follow precisely this Arrhenius form, with an effective temperature controlled by learning rate and batch size.
The Information-Theoretic Picture
Before diving into the spectral mathematics, it's worth noting that crystallization can be understood through an information-theoretic lens. Recent work by Levine et al. has shown that phase transitions in condensed matter can be characterized by changes in entropy reflected in the number of accessible configurations (isomers) between phases. The transition from liquid to crystal represents a dramatic reduction in configurational entropy—the system trades thermal disorder for structural order.
Studies of information dynamics at phase transitions reveal that configurational entropy, built from the Fourier spectrum of fluctuations, reaches a minimum at criticality. Information storage and processing are maximized precisely at the phase transition. This provides a bridge to thinking about neural networks: training may be seeking configurations that maximize relevant information while minimizing irrelevant variation—a compression that echoes crystallographic ordering.
The information-theoretic perspective also illuminates why different structures emerge under different conditions. Statistical analysis of temperature-induced phase transitions shows that information-entropy parameters are more sensitive indicators of structural change than simple symmetry classification. The "Landau rule"—that symmetry increases with temperature—reflects the thermodynamic trade-off between energetic ordering and entropic disorder.
The Spectral Picture
But the thermodynamic and information-theoretic descriptions, while correct, obscure what makes crystallization fundamentally different from other phase transitions. The distinctive feature of crystallization is the emergence of long-range structural order—atoms arranging themselves into periodic patterns that extend across macroscopic distances. This ordering represents the spontaneous breaking of continuous translational and rotational symmetry down to discrete crystallographic symmetry.
The mathematical language for this structural ordering is spectral. Consider a crystal lattice where atoms sit at equilibrium positions and interact through some potential. Small displacements from equilibrium can be analyzed by expanding the potential energy to second order, yielding a quadratic form characterized by the dynamical matrix D. For a system of N atoms in three dimensions, this is a 3N×3N matrix whose elements encode the force constants between atoms:
$$D_{i\alpha,\,j\beta} = \frac{1}{\sqrt{m_i m_j}}\,\frac{\partial^2 V}{\partial u_{i\alpha}\,\partial u_{j\beta}}$$
where u_{iα} denotes the displacement of atom i in direction α. The eigenvalues of this matrix give the squared frequencies ω² of the normal modes (phonons), while the eigenvectors describe the collective atomic motion patterns.
Here is the insight: the stability of a crystal structure is encoded in the eigenspectrum of its dynamical matrix. A stable structure has all positive eigenvalues, corresponding to real phonon frequencies. An unstable structure—one that will spontaneously transform—has negative eigenvalues, corresponding to imaginary frequencies. The eigenvector associated with a negative eigenvalue describes the collective atomic motion that will grow exponentially, driving the structural transformation.
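To make the stability criterion concrete, here is a minimal numerical sketch (my own illustration, not from any cited source): a one-dimensional chain of unit masses coupled by springs, whose mass-weighted force-constant matrix plays the role of D. Weakening one bond into a negative stiffness produces a negative eigenvalue, and the corresponding eigenvector is the distortion that would grow.

```python
import numpy as np

def dynamical_matrix_1d(n_atoms, k=1.0, mass=1.0, weak_bond=None, weak_k=-0.2):
    """Mass-weighted force-constant matrix of an open 1D chain.

    Each spring between atoms i and i+1 contributes [[k, -k], [-k, k]] to the
    Hessian of the potential; dividing by the (identical) masses gives the
    dynamical matrix. A negative stiffness on one bond models an incipient
    structural instability.
    """
    H = np.zeros((n_atoms, n_atoms))
    for i in range(n_atoms - 1):
        ki = weak_k if i == weak_bond else k
        H[i, i] += ki
        H[i + 1, i + 1] += ki
        H[i, i + 1] -= ki
        H[i + 1, i] -= ki
    return H / mass

# Stable chain: all eigenvalues (squared phonon frequencies) are non-negative;
# the zero mode is rigid translation of the whole chain.
w2_stable = np.linalg.eigvalsh(dynamical_matrix_1d(8))
print("stable chain, min omega^2:", round(w2_stable.min(), 6))

# Chain with one destabilized bond: a negative eigenvalue (imaginary frequency)
# appears, and its eigenvector concentrates the relative displacement across
# the weak bond -- the collective motion that would grow.
w2_soft, modes = np.linalg.eigh(dynamical_matrix_1d(8, weak_bond=3))
print("soft chain, min omega^2:", round(w2_soft.min(), 4))
print("unstable mode pattern:", np.round(modes[:, 0], 2))
```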
The phonon density of states g(ω)—the distribution of vibrational frequencies—encodes thermodynamic properties including heat capacity and vibrational entropy. For acoustic phonons near the zone center, g(ω) ∝ ω², the Debye behavior. But the full spectrum, including optical modes and zone-boundary behavior, captures the complete vibrational fingerprint of the crystal structure.
Soft Modes and Structural Phase Transitions
This spectral perspective illuminates the "soft mode" theory of structural phase transitions, developed in the early 1960s by Cochran and Anderson to explain ferroelectric and other displacive transitions. The central observation is that approaching a structural phase transition, certain phonon modes "soften"—their frequencies decrease toward zero. At the transition temperature, the soft mode frequency vanishes entirely, and the crystal becomes unstable against the corresponding collective distortion.
Cowley's comprehensive review documents how this soft mode concept explains transitions in materials from SrTiO₃ to KNbO₃. Recent experimental work continues to confirm soft-mode-driven transitions, with Raman spectroscopy revealing the characteristic frequency softening as transition temperatures are approached.
The soft mode concept provides a microscopic mechanism for Landau's phenomenological theory. Landau characterized phase transitions through an order parameter η that measures departure from the high-symmetry phase. The free energy near the transition expands as:
$$F = F_0 + \tfrac{1}{2}a(T - T_c)\,\eta^2 + \tfrac{1}{4}b\,\eta^4 + \tfrac{1}{2}\kappa\,|\nabla\eta|^2 + \cdots$$
The coefficient of the quadratic term changes sign at the critical temperature T_c, corresponding precisely to the soft mode frequency going through zero. The gradient term κ|∇η|² penalizes spatial variations in the order parameter—a structure we will recognize when we encounter the graph Laplacian.
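Concretely, minimizing the uniform part of this free energy (standard Landau algebra, written out for reference) ties the order parameter and the soft-mode frequency to the same coefficient:

```latex
% Uniform order parameter: drop the gradient term, minimize over eta
\frac{\partial F}{\partial \eta} = a(T - T_c)\,\eta + b\,\eta^{3} = 0
\;\Longrightarrow\;
\eta = 0 \ \ (T > T_c),
\qquad
\eta^{2} = \frac{a\,(T_c - T)}{b} \ \ (T < T_c)

% Curvature about eta = 0 sets the squared soft-mode frequency (up to a constant):
\left.\frac{\partial^{2} F}{\partial \eta^{2}}\right|_{\eta = 0}
= a\,(T - T_c) \;\propto\; \omega_{\text{soft}}^{2}
\;\longrightarrow\; 0 \quad \text{as } T \to T_c^{+}
```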
What makes this spectral picture so powerful is that it connects local interactions (the force constants in the dynamical matrix) to global stability (the eigenvalue spectrum) and transformation pathways (the eigenvectors). The crystal "knows" how it will transform because that information is encoded in its vibrational spectrum. The softest mode points the way.
Part II: The Mathematical Meeting Ground
The previous section established that crystallization is fundamentally a spectral phenomenon—stability and transformation encoded in eigenvalues and eigenvectors of the dynamical matrix. Now I want to show that this same spectral mathematics underlies the two major theoretical frameworks for understanding neural network geometry: Geometric Deep Learning and Singular Learning Theory.
Bridge One: From Dynamical Matrix to Graph Laplacian
The dynamical matrix of a crystal has a natural graph-theoretic interpretation. Think of atoms as nodes and force constants as weighted edges. The dynamical matrix then becomes a weighted Laplacian on this graph, and its spectral properties—the eigenvalues and eigenvectors—characterize the collective dynamics of the system.
This is not merely an analogy. For a simple model where atoms interact only with nearest neighbors through identical springs, the dynamical matrix has the structure of a weighted graph Laplacian L = D − A, where D is the degree matrix and A is the adjacency matrix. The eigenvalues λ_k of L relate directly to phonon frequencies, and the eigenvectors describe standing wave patterns on the lattice.
The graph Laplacian appears throughout Geometric Deep Learning as the fundamental operator characterizing message-passing on graphs. For a graph neural network processing signals on nodes, the Laplacian eigenvectors provide a natural Fourier basis—the graph Fourier transform. The eigenvalues determine which frequency components propagate versus decay. Low eigenvalues correspond to smooth, slowly-varying signals; high eigenvalues correspond to rapidly-oscillating patterns.
The Dirichlet energy

$$E_D(f) = f^{\top} L f = \sum_{(i,j)\in E} w_{ij}\,(f_i - f_j)^2$$

measures the "roughness" of a signal f on the graph—how much it varies across edges. Minimizing Dirichlet energy produces smooth functions that respect graph structure. This is precisely the discrete analog of Landau's gradient term κ|∇η|², which penalizes spatial variations in the order parameter.
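As a concrete and deliberately tiny sketch of these objects (my own example, not taken from the GDL literature), the snippet below builds L = D − A for two triangles joined by a single bridge edge, reads off the Fiedler eigenvalue discussed further down, and confirms that smooth signals have low Dirichlet energy while oscillatory ones have high energy:

```python
import numpy as np

# Two triangles joined by one bridge edge: a toy graph with community structure.
edges = [(0, 1), (1, 2), (0, 2),   # community A
         (3, 4), (4, 5), (3, 5),   # community B
         (2, 3)]                   # bridge
n = 6
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
D = np.diag(A.sum(axis=1))
L = D - A                          # combinatorial graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)            # ascending eigenvalues
print("Laplacian spectrum:", np.round(eigvals, 3))
print("Fiedler value (algebraic connectivity):", round(eigvals[1], 3))
print("Fiedler vector (signs split the two triangles):", np.round(eigvecs[:, 1], 2))

def dirichlet_energy(f, L):
    """E_D(f) = f^T L f; equals the sum of (f_i - f_j)^2 over edges for unit weights."""
    return float(f @ L @ f)

smooth = eigvecs[:, 1]    # lowest non-trivial mode: nearly constant within communities
rough = eigvecs[:, -1]    # highest-frequency mode: oscillates across edges
print("E_D(smooth mode):", round(dirichlet_energy(smooth, L), 3))  # = lambda_2
print("E_D(rough mode): ", round(dirichlet_energy(rough, L), 3))   # = lambda_max
```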
The correspondence runs deep:
| Crystallization | Graph Neural Networks |
| --- | --- |
| Dynamical matrix | Graph Laplacian |
| Phonon frequencies | Laplacian eigenvalues |
| Normal mode patterns | Laplacian eigenvectors |
| Soft mode instability | Low eigenvalue → slow mixing |
| Landau gradient term | Dirichlet energy |
| Crystal symmetry group | Graph automorphism group |
Spectral graph theory has developed sophisticated tools for understanding how eigenspectra relate to graph properties: connectivity (the Fiedler eigenvalue), expansion, random walk mixing times, community structure. All of these have analogs in crystallography, where phonon spectra encode mechanical, thermal, and transport properties.
This is the first bridge: the mathematical structure that governs crystal stability and transformation is the same structure that governs information flow and representation learning in graph neural networks. The expressivity of GNNs can be analyzed spectrally—which functions they can represent depends on which Laplacian eigenmodes they can access.
Bridge Two: From Free Energy Barriers to Singular Learning Theory
The second bridge connects crystallization thermodynamics to Singular Learning Theory's analysis of neural network loss landscapes. SLT, developed by Sumio Watanabe, provides a Bayesian framework for understanding learning in models where the parameter-to-function map is many-to-one—where multiple parameter configurations produce identical input-output behavior.
Such degeneracy is ubiquitous in neural networks. Permutation symmetry means relabeling hidden units doesn't change the function. Rescaling symmetries mean certain parameter transformations leave outputs unchanged. The set of optimal parameters isn't a point but a complex geometric object—a singular set with nontrivial structure.
The central quantity in SLT is the real log canonical threshold (RLCT), denoted λ, which characterizes the geometry of the loss landscape near its minima. For a loss function L(w) with minimum at w∗, the RLCT determines how the loss grows as parameters move away from the minimum:
$$\int e^{-n L(w)}\,dw \;\sim\; n^{-\lambda}$$
The RLCT plays a role analogous to dimension, but it captures the effective dimension accounting for the singular geometry of the parameter space. A smaller RLCT means the loss grows more slowly away from the minimum—the minimum is "flatter" in a precise sense—and such minima are favored by Bayesian model selection.
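A toy numerical illustration of λ as a scaling exponent (a one-parameter "model" chosen purely for illustration; the functions and numbers below are mine, not Watanabe's): for L(w) = w² the integral scales as n^(-1/2), while the more degenerate L(w) = w⁴ gives n^(-1/4); the flatter minimum has the smaller λ.

```python
import numpy as np

def log_volume(loss, n, w_max=5.0, num=200_001):
    """Estimate log of  integral exp(-n * L(w)) dw  over [-w_max, w_max] by a Riemann sum."""
    w = np.linspace(-w_max, w_max, num)
    dw = w[1] - w[0]
    return np.log(np.sum(np.exp(-n * loss(w))) * dw)

def estimated_lambda(loss, n1=1e3, n2=1e4):
    """Fit the exponent lambda in  integral ~ n^(-lambda)  from two sample sizes."""
    return -(log_volume(loss, n2) - log_volume(loss, n1)) / np.log(n2 / n1)

print("regular minimum    L(w) = w^2:", round(estimated_lambda(lambda w: w**2), 3))  # ~0.500
print("degenerate minimum L(w) = w^4:", round(estimated_lambda(lambda w: w**4), 3))  # ~0.250
```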
The connection to crystallization emerges when we consider how systems traverse between different minima. Recent work suggests that transitions between singular regions in neural network loss landscapes follow Arrhenius kinetics:
$$\text{rate} \;\propto\; \exp\!\left(-\frac{\Delta F}{T}\right)$$
where ΔF is a free energy barrier and T plays the role of an effective temperature (related to learning rate and batch size in SGD). This is precisely the structure of classical nucleation theory, with RLCT differences playing the role of thermodynamic driving forces and loss landscape geometry playing the role of interfacial energy.
The parallel becomes even more striking when we consider that SLT identifies phase transitions in the learning process—qualitative changes in model behavior as sample size or other parameters vary. These developmental transitions, where models suddenly acquire new capabilities, have the character of crystallization events: barrier crossings followed by reorganization into qualitatively different structural configurations.
The Hessian of the loss function—the matrix of second derivatives—plays a role analogous to the dynamical matrix. Its eigenspectrum encodes local curvature, and the eigenvectors corresponding to small or negative eigenvalues indicate "soft directions" along which the loss changes slowly or the configuration is unstable. Loss landscape analysis has revealed that neural networks exhibit characteristic spectral signatures: bulk eigenvalues following particular distributions, outliers corresponding to specific learned features.
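A minimal toy example of such a soft direction (my own, far simpler than a real network): the two-parameter loss L(a, b) = (ab − 1)² has a rescaling degeneracy, so its minima form a curve rather than a point. A finite-difference Hessian at one minimum shows one stiff eigenvalue across the valley and one near-zero eigenvalue along it.

```python
import numpy as np

def loss(a, b):
    """Toy loss with a rescaling degeneracy: every (a, b) with a * b = 1 is a minimum."""
    return (a * b - 1.0) ** 2

def hessian_fd(f, x, eps=1e-5):
    """Central finite-difference Hessian of a scalar function f at the point x."""
    x = np.asarray(x, dtype=float)
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei = np.eye(d)[i] * eps
            ej = np.eye(d)[j] * eps
            H[i, j] = (f(*(x + ei + ej)) - f(*(x + ei - ej))
                       - f(*(x - ei + ej)) + f(*(x - ei - ej))) / (4 * eps ** 2)
    return H

H = hessian_fd(loss, [1.0, 1.0])            # a point on the minimum curve a*b = 1
eigvals, eigvecs = np.linalg.eigh(H)
print("Hessian eigenvalues:", np.round(eigvals, 4))   # ~[0, 4]
print("soft direction:", np.round(eigvecs[:, 0], 3))  # ~ +/-(0.71, -0.71), tangent to the valley
```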
The Spectral Common Ground
Both bridges converge on the same mathematical territory: eigenspectra of matrices encoding local interactions. In crystallization, the dynamical matrix eigenspectrum encodes structural stability. In GDL, the graph Laplacian eigenspectrum encodes information flow and representational capacity. In SLT, the Hessian eigenspectrum encodes effective dimensionality and transition dynamics.
But there's a deeper connection here that deserves explicit attention: the graph Laplacian and the Hessian are not merely analogous—they are mathematically related as different manifestations of the same second-order differential structure.
The continuous Laplacian operator ∇² = ∇·∇ is the divergence of the gradient—it measures how a function's value at a point differs from its average in a small neighborhood. The graph Laplacian L = D − A is precisely the discretization of this operator onto a graph structure. When you compute Lf for a signal f on nodes, you get, at each node, the (degree-weighted) difference between that node's value and the weighted average of its neighbors. This is the discrete analog of ∇²f.
The Hessian matrix H_ij = ∂²f/∂x_i∂x_j encodes all second-order information about a function—not just the Laplacian (which is the trace of the Hessian, ∇²f = tr(H)) but the full directional curvature structure. The Hessian tells you how the gradient changes as you move in any direction; the Laplacian tells you the average of this over all directions.
Here's what makes this connection powerful for our purposes: Geometric Deep Learning can be understood as providing a discretization framework that bridges continuous differential geometry to discrete graph structures.
When GDL discretizes the Laplacian onto a graph, it's making a choice about which second-order interactions matter—those along edges. The graph structure constrains the full Hessian to a sparse pattern. In a neural network, the architecture similarly constrains which parameters interact directly. The Hessian of the loss function inherits structure from the network architecture, and this structured Hessian may have graph-Laplacian-like properties in certain subspaces.
This suggests a research direction: can we understand the Hessian of neural network loss landscapes as a kind of "Laplacian on a computation graph"? The nodes would be parameters or groups of parameters; the edges would reflect which parameters directly influence each other through the forward pass. The eigenspectrum of this structured Hessian would then inherit the interpretability that graph Laplacian spectra enjoy in GDL.
The crystallization connection completes the triangle. The dynamical matrix of a crystal is a Laplacian on the atomic interaction graph, where edge weights are force constants. Its eigenspectrum gives phonon frequencies. The Hessian of the potential energy surface—which determines mechanical stability—is, up to mass-weighting, exactly this dynamical matrix. So in crystals, the Laplacian-Hessian connection is not an analogy; it's an identity.
This convergence is not coincidental. All three domains concern systems where:
Local interactions aggregate into global structure. Force constants between neighboring atoms determine crystal stability. Edge weights between neighboring nodes determine graph signal propagation. Local curvature of the loss surface determines learning dynamics. In each case, the matrix encoding local relationships has eigenproperties that characterize global behavior.
Stability is a spectral property. Negative eigenvalues signal instability in crystals—the structure will spontaneously transform. Small Laplacian eigenvalues signal poor mixing in GNNs—information struggles to propagate. Near-zero Hessian eigenvalues signal flat directions in loss landscapes—parameters can wander without changing performance. The eigenspectrum is the diagnostic.
Transitions involve collective reorganization. Soft modes describe how crystals transform—many atoms moving coherently. Low-frequency Laplacian modes describe global graph structure—community-wide patterns. Developmental transitions in neural networks involve coordinated changes across many parameters—not isolated weight updates but structured reorganization.
Part III: What the Mapping Illuminates
Having established the mathematical connections, we can now ask: what does viewing neural network training through the crystallization lens reveal?
Nucleation as Capability Emergence
The sudden acquisition of new capabilities during training—the phenomenon called "grokking" or "emergent abilities"—may correspond to nucleation events. The system wanders in a disordered phase, unable to find the right computational structure. Then a rare fluctuation creates a viable "seed" of the solution—a small subset of parameters that begins to implement the right computation. If this nucleus exceeds the critical size (crosses the free energy barrier), it grows rapidly as the structure proves advantageous.
This picture explains several puzzling observations. Why do capabilities emerge suddenly after long plateaus? Because nucleation is a stochastic barrier-crossing event—rare until it happens, then rapid. Why does the transition time vary so much across runs? Because nucleation times are exponentially distributed. Why do smaller models sometimes fail to learn what larger models eventually master? Perhaps the critical nucleus size exceeds what smaller parameter spaces can support.
The nucleation rate formula J ∝ exp(−ΔG*/k_BT) suggests that effective temperature (learning rate, noise) plays a crucial role. Too cold, and nucleation never happens—the system is stuck. Too hot, and nuclei form but immediately dissolve—no stable structure emerges. There's an optimal temperature range for crystallization, and perhaps for learning.
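A hedged sketch of that window (toy, dimensionless numbers of my own choosing; the linear driving force and the mobility prefactor are standard CNT ingredients, not claims about real training runs): when the driving force grows with undercooling while attachment kinetics freeze out at low temperature, the rate J is negligible at both extremes and peaks in between.

```python
import numpy as np

# Toy classical nucleation rate J(T) in reduced units; all numbers illustrative.
# Driving force grows linearly with undercooling, delta_g ~ (T_m - T), so the
# CNT barrier ~ gamma^3 / delta_g^2 diverges near T_m, while the mobility
# prefactor exp(-E_d / T) freezes out at low T. Their product has a maximum:
# the crystallization (and perhaps learning) window.
T_m = 1.0      # coexistence temperature (toy)
gamma = 0.08   # interfacial free energy (toy)
E_d = 0.4      # attachment/diffusion activation energy (toy)

T = np.linspace(0.05, 0.99, 500)
delta_g = T_m - T
barrier = 16 * np.pi * gamma ** 3 / (3 * delta_g ** 2)
rate = np.exp(-E_d / T) * np.exp(-barrier / T)

print(f"rate peaks near T = {T[np.argmax(rate)]:.2f}"
      " (negligible both when frozen and when the driving force vanishes)")
```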
Polymorphism as Solution Multiplicity
Crystals of the same chemical composition can form different structures depending on crystallization conditions. Carbon makes diamond or graphite. Calcium carbonate makes calcite or aragonite. These polymorphs have identical chemistry but different atomic arrangements, different properties, different stabilities.
Neural networks exhibit analogous polymorphism. The same architecture trained on the same data can find qualitatively different solutions depending on initialization, learning rate schedule, and stochastic trajectory. Some solutions generalize better; some are more robust to perturbation; some use interpretable features while others use alien representations.
The crystallization framework suggests studying which "polymorphs" are kinetically accessible versus thermodynamically stable. In crystals, the polymorph that forms first (kinetic product) often differs from the most stable structure (thermodynamic product). Ostwald's step rule states that systems tend to transform through intermediate metastable phases rather than directly to the most stable structure. Perhaps neural network training follows similar principles—solutions found by SGD may be kinetically favored intermediates rather than globally optimal structures.
Defects as Partial Learning
Real crystals are never perfect. They contain defects—vacancies where atoms are missing, interstitials where extra atoms intrude, dislocations where planes of atoms slip relative to each other, grain boundaries where differently-oriented crystal domains meet. These defects represent incomplete ordering, local frustration of the global structure.
Neural networks similarly exhibit partial solutions—local optima that capture some but not all of the task structure. A model might learn the easy patterns but fail on edge cases. It might develop features that work for the training distribution but break under distribution shift. These could be understood as "defects" in the learned structure.
Defect physics offers vocabulary for these phenomena. A vacancy might correspond to a missing feature that the optimal solution would include. A dislocation might be a region of parameter space where different computational strategies meet incompatibly. A grain boundary might separate domains of the network implementing different (inconsistent) computational approaches.
Importantly, defects aren't always bad. In metallurgy, controlled defect densities provide desirable properties—strength, ductility, hardness. Perhaps some "defects" in neural networks provide useful properties like robustness or regularization. The question becomes: which defects are harmful, and how can training protocols minimize those while preserving beneficial ones?
Annealing as Training Schedules
Metallurgists have developed sophisticated annealing schedules to control crystal quality. Slow cooling from high temperature allows atoms to find low-energy configurations, producing large crystals with few defects. Rapid quenching can trap metastable phases or create amorphous (glassy) structures. Cyclic heating and cooling can relieve internal stresses.
The analogy to learning rate schedules and curriculum learning is direct. High learning rate corresponds to high temperature—large parameter updates that can cross barriers but also destroy structure. Low learning rate corresponds to low temperature—precise refinement but inability to escape local minima. The art is in the schedule.
Simulated annealing explicitly adopts this metallurgical metaphor for optimization. But the crystallization perspective suggests richer possibilities. Perhaps "nucleation agents"—perturbations designed to seed particular structures—could accelerate learning. Perhaps "epitaxial" techniques—initializing on solutions to related problems—could guide crystal growth. Perhaps monitoring "lattice strain"—measuring internal inconsistencies in learned representations—could diagnose training progress.
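Since simulated annealing is the canonical borrowing here, a minimal sketch (toy double-well objective and geometric cooling schedule of my own choosing) shows how the schedule decides whether the walker escapes a metastable basin before it freezes:

```python
import numpy as np

rng = np.random.default_rng(0)

def energy(x):
    """Asymmetric double well: metastable basin near x = -1, global minimum near x = +1."""
    return (x ** 2 - 1.0) ** 2 - 0.3 * x

def anneal(T0, cooling, steps=20_000, step_size=0.2):
    """Metropolis random walk with a geometric cooling schedule T_k = T0 * cooling^k."""
    x, T = -1.0, T0                        # start in the metastable basin
    for _ in range(steps):
        x_new = x + step_size * rng.normal()
        dE = energy(x_new) - energy(x)
        if dE < 0 or rng.random() < np.exp(-dE / max(T, 1e-12)):
            x = x_new
        T *= cooling
    return x

print("quenched (T0 = 0.01):   x =", round(anneal(0.01, 0.999), 2))   # usually stuck near -1
print("slow anneal (T0 = 2.0): x =", round(anneal(2.0, 0.9995), 2))   # usually settles near +1
```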
Two-Step Nucleation and Intermediate Representations
Classical nucleation theory assumes direct transition from disordered to ordered phases. But recent work on protein crystallization has revealed more complex pathways. Systems often pass through intermediate states—dense liquid droplets, amorphous clusters, metastable crystal forms—before reaching the final structure. This "two-step nucleation" challenges the classical picture.
This might illuminate how neural networks develop capabilities. Rather than jumping directly from random initialization to optimal solution, networks may pass through intermediate representational stages. Early layers might crystallize first, providing structured inputs for later layers. Some features might form amorphous precursors before organizing into precise computations.
Developmental interpretability studies how representations change during training. The crystallization lens suggests looking for two-step processes: formation of dense but disordered clusters of related computations, followed by internal ordering into structured features. The intermediate state might be detectable—neither fully random nor fully organized, but showing precursor signatures of the final structure.
Part IV: Limitations and Honest Uncertainty
The crystallization mapping is productive, but I should be clear about what it does and doesn't establish.
What the Mapping Does Not Claim
Neural networks are not literally crystals. There is no physical lattice, no actual atoms, no real temperature. The mapping is mathematical and conceptual, not physical. It suggests that certain mathematical structures—eigenspectra, barrier-crossing dynamics, symmetry breaking—play analogous roles in both domains. But analogy is not identity.
The mapping does not prove that any specific mechanism from crystallization applies to neural networks. It generates hypotheses, not conclusions. When I suggest that capability emergence resembles nucleation, this is a research direction, not an established fact. The hypothesis needs testing through careful experiments, not just conceptual argument.
The mapping may not capture what's most important about neural network training. Perhaps other physical analogies—glassy dynamics, critical phenomena, reaction-diffusion systems—illuminate aspects that crystallization obscures. Multiple lenses are better than one, and I don't claim crystallization is uniquely correct.
Open Questions
Many questions remain genuinely open:
How far does the spectral correspondence extend? The mathematical parallels between dynamical matrices, graph Laplacians, and Hessians are real, but are the dynamics similar enough that crystallographic intuitions transfer? Under what conditions?
What plays the role of nucleation seeds in neural networks? In crystals, impurities and surfaces dramatically affect nucleation. What analogous features in loss landscapes or training dynamics play similar roles? Can we engineer them?
Do neural networks exhibit polymorph transitions? In crystals, one structure can transform to another more stable form. Do trained neural networks undergo analogous restructuring during continued training or fine-tuning? What would the signatures be?
What is the right "order parameter" for neural network phase transitions? Landau theory requires identifying the quantity that changes discontinuously (or continuously but critically) across the transition. For neural networks, is it accuracy? Information-theoretic quantities? Geometric properties of representations?
These questions require empirical investigation, theoretical development, and careful testing of predictions. The crystallization mapping provides vocabulary and hypotheses, not answers.
Conclusion: A Lens, Not a Law
I've argued that crystallization provides a productive template for understanding neural network phase transitions—more productive than generic thermodynamic phase transitions because crystallization foregrounds the spectral mathematics that connects naturally to both Singular Learning Theory and Geometric Deep Learning.
The core insight is that all three domains—crystallization physics, graph neural networks, and singular learning theory—concern how local interactions encoded in matrices give rise to global properties through their eigenspectra. The dynamical matrix, the graph Laplacian, and the Hessian of the loss function are mathematically similar objects. Their eigenvalues encode stability; their eigenvectors encode transformation pathways. The language developed for one may illuminate the others.
This is the value of the mapping: not a proof that neural networks are crystals, but a lens that brings certain mathematical structures into focus. The spectral theory of crystallization offers both technical tools—dynamical matrix analysis, soft mode identification, nucleation kinetics—and physical intuitions—collective reorganization, barrier crossing, structural polymorphism—that may illuminate the developmental dynamics of learning systems.
Perhaps most importantly, crystallization provides images we can think with. The picture of atoms jostling randomly until a lucky fluctuation creates a structured nucleus that then grows as more atoms join the pattern—this is something we can visualize, something we can develop intuitions about. If neural network training has similar dynamics, those intuitions become tools for understanding and perhaps controlling the learning process.
The mapping remains a hypothesis under development. But it's a hypothesis with mathematical substance, empirical hooks, and conceptual fertility. That seems worth pursuing.
This is the technical companion piece for Have You Tried Thinking About It As Crystals.