There’s a common hypothesis that stochastic gradient descent has some kind of built-in bias toward simpler models, or models which generalize better. (It’s essentially the opposite of the NTK/GP/Mingard et al model.) There is also a rough intuitive argument to expect such a thing a priori: sub-structures which are only useful sometimes will be lost on the occasions when they’re not useful, whereas sub-structures which are consistently useful will be retained. Conceptually, it’s similar to the modularly varying goals model in biological evolution.
Mathematically, it’s not too hard to show that SGD does indeed have a bias, and we can even write it down explicitly given some not-too-unreasonable approximations. This post will walk through that derivation, and give some intuition for where the bias comes from.
“Brownian” here refers to Brownian motion, i.e. a random walk. In other words, the path taken by SGD looks qualitatively sort of like this:
The key idea is that each sample used to estimate the gradient is approximately independent (in the probability sense of the word), and the estimate is an average over samples, so the net effect of several steps is approximately normally distributed. That’s the defining feature of Brownian motion. (Alternatively, we can assume that steps are approximately independent and additive, which gets us to the same place with a little more generality but also a little more work.)
Let’s put some math on that.
We use SGD to choose θ to minimize EX[u(X,θ)]. Each step, we take n independent samples X1…Xn of X, use them to estimate the gradient, then take a step:
… where η scales the step size. This step is an approximately-normal random variable, constructed from the IID random variables X1…Xn. It has mean −ηnEX[∇θu(X,θ)], and variance (ηn)2VarX[∇θu(X,θ)].
To formally represent this as a Brownian motion, we declare that the amount-of-time which passes during each SGD step is dt=ηn, which we assume to be small (i.e. we’ll approximate things to first order in dt). Then SGD’s path can be represented as
… where W(t) is a standard Brownian motion, the analogue of a standard normal variable, and the square root is a matrix square root. (If you haven’t worked with Brownian motion before, ignore that formula and keep reading.)
Sometimes, the “noise” in a Brownian system is location-dependent. As an example, let’s consider the original use-case: a grain of pollen floats in water. It’s small enough to get randomly kicked around by water molecules, so its path is Brownian (and can be seen under an ordinary microscope). If the water has a temperature gradient, then the “noise” in the pollen-grain’s path will vary with its location.
When the grain of pollen is in a higher-noise region, it will be kicked around more, and move around faster, until eventually it moves into a lower-noise region. In the lower-noise region, it will be kicked around less, and move around slower, so it takes longer to leave the region. So, the pollen grain will tend to spend more time in lower-noise regions. With a noise gradient, this tendency to spend more time in lower-noise regions becomes a tendency to drift down the noise gradient.
Mathematically, if Y(t) is our pollen location, we can write its motion as
… for a location-dependent “drift” μ(Y), “diffusion matrix” D(Y) (larger in higher temperature regions), and W is the standard Brownian motion again. Intuitively, the “drift” pushes Y along the direction μ, and the “diffusion” controls how fast Y spreads out along each direction - or at least that’s how we think about it for constant drift and diffusion. The probability distribution of Y evolves over time according to:
(This is the Fokker-Planck equation.)
Key thing to notice: we can re-write this as
… so −∇y⋅D(y) acts like a drift term, just like μ(y). This noise-induced drift is nonzero only when there's a noise gradient.
A very simple example of the math: suppose D(y)=y1I (i.e. it’s an identity matrix scaled by y1). Then −∇y⋅D(y)=−(1,0,0,...). So, the diffusion-gradient-induced drift is constant and along the −y1 direction.
In SGD, our “intended” drift is −EX[∇θu(X,θ)] - i.e. drift down the gradient of the objective. But the location-dependent noise contributes a “bias” - a second drift term, resulting from drift down the noise-gradient. Combining the equations from the previous two sections, the noise-gradient-drift is
I don’t have much theory or evidence right now for what kinds of things that bias pushes towards (other than “regions of low gradient noise”), but having an explicit formula should help investigate that sort of question. Personally, I suspect that it pushes toward models with a modular structure reflecting the modularity structure of the environment, which is the main reason I'm interested in it.
I think the "drift from high-noise to low-noise" thing is more subtle than you're making it out to be... Or at least, I remain to be convinced. Like, has anyone else made this claim, or is there experimental evidence?
In the particle diffusion case, you point out correctly that if there's a gradient in D caused by a temperature gradient, it causes a concentration gradient. But I believe that if there's a gradient in D caused by something other than a temperature gradient, then it doesn't cause a concentration gradient. Like, take a room with a big pile of rocks in it. Intuitively you can say "when air molecules get deep into the pore spaces of the rock pile, they're stuck and can't easily get out, therefore I expect higher air concentration in the pore spaces of the rock pile than the open part of the room", but that would be wrong, the concentration is the same (as required by thermodynamics when the temperature is constant).
Another example, which is fun albeit more abstract: The classical diode-resistor generator circuit (see Gunn, Sokolov - email me if you want the PDFs). If you connect a resistor at nonzero temperature to a diode at absolute zero, the diode rectifies the resistor's Johnson noise and you get a net current in the diode's forward direction. Conversely, if you connect a resistor at absolute zero to a diode at nonzero temperature, you get a net current in the diode's reverse direction! And (obviously), if the temperatures of the diode and resistor are the same, there's no net current. The dynamics look like a 1D drift-diffusion thing, where the voltage is bouncing around by Johnson noise, while getting pulled back towards zero. But, by the Johnson noise equation, the voltage bounces around a lot more when it has one sign than the opposite sign (assuming the diode IV curve has an infinitely sharp bend at V=0), because the total circuit resistance is different. And yet, the voltage does not systematically go to the less-bouncy side, instead it averages to zero. (Note: the frequency spectrum of the fluctuations changes too depending on the sign of the voltage.) I actually did a time-domain simulation the diode-resistor generator once, a couple jobs ago (I was writing a paper kinda related to it), but was never 100% satisfied that I understood why temperature-related ∇D's can do things that constant-temperature ∇D's can't.
Very speculatively, I wonder if it matters whether you evaluate the gradient at the source of a random step, vs at the destination??
I'm still wrapping my head around this myself, so this comment is quite useful.
Here's a different way to set up the model, where the phenomenon is more obvious.
Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state k, it's equally likely to transition to k+1 or k−1). The rate λk at which the system leaves state k serves a role analogous to the diffusion coefficient (with the analogy becoming precise in the continuum limit, I believe). Then the steady-state probabilities of state k and state k−1 satisfy
... i.e. the flux from values-k-and-above to values-below-k is equal to the flux in the opposite direction. (Side note: we need some boundary conditions in order for the steady-state probabilities to exist in this model.) So, if λk>λk−1, then pk<pk−1: the system spends more time in lower-diffusion states (locally). Similarly, if the system's state is initially uniformly-distributed, then we see an initial flux from higher-diffusion to lower-diffusion states (again, locally).
Going back to the continuous case: this suggests that your source vs destination intuition is on the right track. If we set up the discrete version of the pile-of-rocks model, air molecules won't go in to the rock pile any faster than they come out, whereas hot air molecules will move into a cold region faster than cold molecules move out.
I haven't looked at the math for the diode-resistor system, but if the voltage averages to 0, doesn't that mean that it does spend more time on the lower-noise side? Because presumably it's typically further from zero on the higher-noise side. (More generally, I don't think a diffusion gradient means that a system drifts one way on average, just that it drifts one way with greater-than-even probability? Similar to how a bettor maximizing expected value with repeated independent bets ends up losing all their money with probability 1, but the expectation goes to infinity.)
Also, one simple way to see that the "drift" interpretation of the diffusion-induced drift term in the post is correct: set the initial distribution to uniform, and see what fluxes are induced. In that case, only the two drift terms are nonzero, and they both behave like we expect drift terms to behave - i.e. probability increases/decreases where the divergence of the drift terms is positive/negative.
That makes sense. Now it's coming back to me: you zoom your microscope into one tiny nm^3 cube of air. In a right-to-left temperature gradient you'll see systematically faster air molecules moving rightward and slower molecules moving leftward, because they're carrying the temperature from their last collision. Whereas in uniform temperature, there's "detailed balance" (just as many molecules going along a path vs going along the time-reversed version of that same path, and with the same speed distribution).
Thinking about the diode-resistor thing more, I suspect it would be a waste-of-time nerd-sniping rabbit hole, especially because of the time-domain aspects (the fluctuations are faster on one side vs the other I think) which don't have any analogue in SGD. Sorry to have brought it up.
The second idea reminds me of a talk years back about swarm behavior. Some fish swim faster in the sunlight, which makes the entire swarm "seek out" the shady parts of the pond.
There is a second mechanism at play here, where fish try to keep close to their neighbors, so the entire swarm kind of turns into the direction of shade as soon as the part of the swarm in the shade slows down.
This suggests an optimizer for parallel training which doesn't completely synchronize the weights on the different machines, but instead only tries to keep all sets of weights reasonably close to some of the other sets of weights.
The effect should be that the swarm of different weights turn into the direction of low noise.
Oh man, this is perfect. I've been looking for another very-different example of the phenomenon to think about, and this is exactly what I wanted. Thanks!
I don't have any empirical evidence, but we can think about what a flat minimum with high noise would mean. It would probably mean the system is able to predict some data points very well, and other data points very poorly, and both of these are robust: we can make large changes to the parameters while still predicting the predictable data points about-as-well, and the unpredictable data points about-as-poorly. In human terms, it would be like having a paradigm in which certain phenomena are very predictable, and other phenomena look like totally-random noise without any hint that they even could be predictable.
Not sure what it would look like in the perfect-training-prediction regime, though.
Hmm, and the -E[u(X, theta)] term would shrink during training, right? So eventually it would become more about the drift term? This makes me think of the "grokking" concept.
Both terms shrink near a local minimum.
This depends on whether it can achieve perfect predictive power or not, no? What I had in mind was something like autoregressive text prediction, where there will always be some prediction errors. I would've assumed those prediction errors constantly introduce some noise into the gradients?
Ah, yeah, you're right. Thanks, I was understanding the reason for convergence of SGD to a local minimum incorrectly. (Convergence depends on steadily decreasing η; that decrease is doing more work than I realized.)
In SGD, our “intended” drift is −EX[u(X,θ)] - i.e. drift down the gradient of the objective. But the location-dependent noise contributes a “bias” - a second drift term, resulting from drift down the noise-gradient. Combining the equations from the previous two sections, the noise-gradient-drift is−ηn∇θ⋅VarX[∇θu(X,θ)]
In SGD, our “intended” drift is −EX[u(X,θ)] - i.e. drift down the gradient of the objective. But the location-dependent noise contributes a “bias” - a second drift term, resulting from drift down the noise-gradient. Combining the equations from the previous two sections, the noise-gradient-drift is
I have not followed all your reasoning, but focusing on this last formula, does it represent a bias towards less variance over the different gradients one can sample at a given point?
If so, then I do find this quite interesting. A random connection: I actually thought of one strong form of gradient hacking as forcing the gradient to be (approximately) the same for all samples. Your result seems to imply that if such forms of gradient hacking are indeed possible, then they might be incentivized by SGD.
does it represent a bias towards less variance over the different gradients one can sample at a given point?
I'd be interested in the relationship between this and Implicit Gradient Regularization and the sharp/flat minima lit.The basic idea there is to compare the continuous gradient flow on the original objective, to the path followed by SGD due to discretization. They show that the latter can be re-interpreted as optimizing a modified objective which favors flat minima (low sensitivity to parameter perturbations). This isn't clearly the same as what you're analyzing here, since you're looking at variance due to sampling instead, but they might be related under appropriate regularity conditions.
Sam Smith also has this nice paper on sharp/flat minima which links a lot of previous observations together, and has some similarities to your approach here.
Haven't thought about any of this too closely, so my apologies if these aren't useful! Seems close enough that it might be of interest though.
My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, then an optimization algorithm which converges very poorly to sharp minima could actually be a good thing, so long as you combine it with some way to escape the basin of the sharp minimum (e.g. noise in SGD).
That said, I haven't read the various papers on this, so I'm at high risk of misunderstanding.
Also worth noting that there are reasons to expect convergence to flat minima besides bias in SGD itself. A flatter basin fills more of the parameter space than a sharper basin, so we're more likely to initialize in a flat basin (relevant to the NTK/GP/Mingard et al picture) or accidentally stumble into one.
Another example of this sort of thing: least-rattling feedback in driven systems.