SGD's Bias

[-]Steven Byrnes5yΩ13180

I think the "drift from high-noise to low-noise" thing is more subtle than you're making it out to be... Or at least, I remain to be convinced. Like, has anyone else made this claim, or is there experimental evidence?

In the particle diffusion case, you point out correctly that if there's a gradient in D caused by a temperature gradient, it causes a concentration gradient. But I believe that if there's a gradient in D caused by something other than a temperature gradient, then it doesn't cause a concentration gradient. Like, take a room with a big pile of rocks in it. Intuitively you can say "when air molecules get deep into the pore spaces of the rock pile, they're stuck and can't easily get out, therefore I expect higher air concentration in the pore spaces of the rock pile than the open part of the room", but that would be wrong, the concentration is the same (as required by thermodynamics when the temperature is constant).

Another example, which is fun albeit more abstract: The classical diode-resistor generator circuit (see Gunn, Sokolov - email me if you want the PDFs). If you connect a resistor at nonzero temperature to a diode at absolute zero, the diode rectifies the resistor's Johnson noise and you get a net current in the diode's forward direction. Conversely, if you connect a resistor at absolute zero to a diode at nonzero temperature, you get a net current in the diode's reverse direction! And (obviously), if the temperatures of the diode and resistor are the same, there's no net current. The dynamics look like a 1D drift-diffusion thing, where the voltage is bouncing around by Johnson noise, while getting pulled back towards zero. But, by the Johnson noise equation, the voltage bounces around a lot more when it has one sign than the opposite sign (assuming the diode IV curve has an infinitely sharp bend at V=0), because the total circuit resistance is different. And yet, the voltage does not systematically go to the less-bouncy side, instead it averages to zero. (Note: the frequency spectrum of the fluctuations changes too depending on the sign of the voltage.) I actually did a time-domain simulation the diode-resistor generator once, a couple jobs ago (I was writing a paper kinda related to it), but was never 100% satisfied that I understood why temperature-related ∇D's can do things that constant-temperature ∇D's can't.

Very speculatively, I wonder if it matters whether you evaluate the gradient at the source of a random step, vs at the destination??

[-]johnswentworth5yΩ350

I'm still wrapping my head around this myself, so this comment is quite useful.

Here's a different way to set up the model, where the phenomenon is more obvious.

Rather than Brownian motion in a continuous space, think about a random walk in a discrete space. For simplicity, let's assume it's a 1D random walk (aka birth-death process) with no explicit bias (i.e. when the system leaves state , it's equally likely to transition to $k + 1$ or $k - 1$ ). The rate $λ_{k}$ at which the system leaves state $k$ serves a role analogous to the diffusion coefficient (with the analogy becoming precise in the continuum limit, I believe). Then the steady-state probabilities of state $k$ and state $k - 1$ satisfy

$p_{k} λ_{k} = p_{k - 1} λ_{k - 1}$

... i.e. the flux from values- $k$ -and-above to values-below- $k$ is equal to the flux in the opposite direction. (Side note: we need some boundary conditions in order for the steady-state probabilities to exist in this model.) So, if $λ_{k} > λ_{k - 1}$ , then $p_{k} < p_{k - 1}$ : the system spends more time in lower-diffusion states (locally). Similarly, if the system's state is initially uniformly-distributed, then we see an initial flux from higher-diffusion to lower-diffusion states (again, locally).

Going back to the continuous case: this suggests that your source vs destination intuition is on the right track. If we set up the discrete version of the pile-of-rocks model, air molecules won't go in to the rock pile any faster than they come out, whereas hot air molecules will move into a cold region faster than cold molecules move out.

I haven't looked at the math for the diode-resistor system, but if the voltage averages to 0, doesn't that mean that it does spend more time on the lower-noise side? Because presumably it's typically further from zero on the higher-noise side. (More generally, I don't think a diffusion gradient means that a system drifts one way on average, just that it drifts one way with greater-than-even probability? Similar to how a bettor maximizing expected value with repeated independent bets ends up losing all their money with probability 1, but the expectation goes to infinity.)

Also, one simple way to see that the "drift" interpretation of the diffusion-induced drift term in the post is correct: set the initial distribution to uniform, and see what fluxes are induced. In that case, only the two drift terms are nonzero, and they both behave like we expect drift terms to behave - i.e. probability increases/decreases where the divergence of the drift terms is positive/negative.

[-]Steven Byrnes5yΩ340

That makes sense. Now it's coming back to me: you zoom your microscope into one tiny nm^3 cube of air. In a right-to-left temperature gradient you'll see systematically faster air molecules moving rightward and slower molecules moving leftward, because they're carrying the temperature from their last collision. Whereas in uniform temperature, there's "detailed balance" (just as many molecules going along a path vs going along the time-reversed version of that same path, and with the same speed distribution).

Thinking about the diode-resistor thing more, I suspect it would be a waste-of-time nerd-sniping rabbit hole, especially because of the time-domain aspects (the fluctuations are faster on one side vs the other I think) which don't have any analogue in SGD. Sorry to have brought it up.

[-]p.b.5y100

The second idea reminds me of a talk years back about swarm behavior. Some fish swim faster in the sunlight, which makes the entire swarm "seek out" the shady parts of the pond.

There is a second mechanism at play here, where fish try to keep close to their neighbors, so the entire swarm kind of turns into the direction of shade as soon as the part of the swarm in the shade slows down.

This suggests an optimizer for parallel training which doesn't completely synchronize the weights on the different machines, but instead only tries to keep all sets of weights reasonably close to some of the other sets of weights.

The effect should be that the swarm of different weights turn into the direction of low noise.

[-]johnswentworth5y30

Oh man, this is perfect. I've been looking for another very-different example of the phenomenon to think about, and this is exactly what I wanted. Thanks!

[-]Past Account5y*60

[Deleted]

[-]johnswentworth5y20

I don't have any empirical evidence, but we can think about what a flat minimum with high noise would mean. It would probably mean the system is able to predict some data points very well, and other data points very poorly, and both of these are robust: we can make large changes to the parameters while still predicting the predictable data points about-as-well, and the unpredictable data points about-as-poorly. In human terms, it would be like having a paradigm in which certain phenomena are very predictable, and other phenomena look like totally-random noise without any hint that they even could be predictable.

Not sure what it would look like in the perfect-training-prediction regime, though.

[-]tailcalled5y60

Hmm, and the -E[u(X, theta)] term would shrink during training, right? So eventually it would become more about the drift term? This makes me think of the "grokking" concept.

[-]johnswentworth5y20

Both terms shrink near a local minimum.

[-]tailcalled5y60

This depends on whether it can achieve perfect predictive power or not, no? What I had in mind was something like autoregressive text prediction, where there will always be some prediction errors. I would've assumed those prediction errors constantly introduce some noise into the gradients?

[-]johnswentworth5y50

Ah, yeah, you're right. Thanks, I was understanding the reason for convergence of SGD to a local minimum incorrectly. (Convergence depends on steadily decreasing ; that decrease is doing more work than I realized.)

[-]adamShimi5yΩ350

In SGD, our “intended” drift is - i.e. drift down the gradient of the objective. But the location-dependent noise contributes a “bias” - a second drift term, resulting from drift down the noise-gradient. Combining the equations from the previous two sections, the noise-gradient-drift is
$- \frac{η}{n} \nabla_{θ} \cdot V a r_{X} [\nabla_{θ} u (X, θ)]$

I have not followed all your reasoning, but focusing on this last formula, does it represent a bias towards less variance over the different gradients one can sample at a given point?

If so, then I do find this quite interesting. A random connection: I actually thought of one strong form of gradient hacking as forcing the gradient to be (approximately) the same for all samples. Your result seems to imply that if such forms of gradient hacking are indeed possible, then they might be incentivized by SGD.

[-]johnswentworth5yΩ230

does it represent a bias towards less variance over the different gradients one can sample at a given point?

Yup, exactly.

[-]Jonathan Uesato5yΩ340

I'd be interested in the relationship between this and Implicit Gradient Regularization and the sharp/flat minima lit.The basic idea there is to compare the continuous gradient flow on the original objective, to the path followed by SGD due to discretization. They show that the latter can be re-interpreted as optimizing a modified objective which favors flat minima (low sensitivity to parameter perturbations). This isn't clearly the same as what you're analyzing here, since you're looking at variance due to sampling instead, but they might be related under appropriate regularity conditions.

Sam Smith also has this nice paper on sharp/flat minima which links a lot of previous observations together, and has some similarities to your approach here.

Haven't thought about any of this too closely, so my apologies if these aren't useful! Seems close enough that it might be of interest though.

[-]johnswentworth5yΩ340

My own understanding of the flat minima idea is that it's a different thing. It's not really about noise, it's about gradient descent in general being a pretty shitty optimization method, which converges very poorly to sharp minima (more precisely, minima with a high condition number). (Continuous gradient flow circumvents that, but using step sizes small enough to circumvent the problem in practice would make GD prohibitively slow. The methods we actually use are not a good approximation of continuous flow, as I understand it.) If you want flat minima, then an optimization algorithm which converges very poorly to sharp minima could actually be a good thing, so long as you combine it with some way to escape the basin of the sharp minimum (e.g. noise in SGD).

That said, I haven't read the various papers on this, so I'm at high risk of misunderstanding.

Also worth noting that there are reasons to expect convergence to flat minima besides bias in SGD itself. A flatter basin fills more of the parameter space than a sharper basin, so we're more likely to initialize in a flat basin (relevant to the NTK/GP/Mingard et al picture) or accidentally stumble into one.

[-]interstice5y30

Another example of this sort of thing: least-rattling feedback in driven systems.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

63

63

Ω 34

63

Ω 34

First Idea: SGD Is Approximately Brownian

Second Idea: Drift From High-Noise To Low-Noise Regions

Putting It Together