SGD Understood through Probability Current

LESSWRONG

My previous post about SGD was an introduction to this model. That post concerned a model of a loss landscape on two "datapoints". In this post I attempt to build a new model of SGD and validate it. The validation has mixed success, but the model is still fairly interesting.

Gradient Variance

We could model this another way. The expected change of $W$ on each step is $- T \frac{d L}{d W}$ , but we also expect variance, so $W$ will evolve over time through probability space. There are two competing "forces" here: the "spreading force" created by the variance in $\frac{d l_{j}}{d W}$ over all datapoints in the model, and the "descent force" exerted by gradient descent pushing $W$ back into the centre of a given local minimum.

I think it makes sense to introduce some new notation here.

$g_{j} (W) = \frac{d l_{j}}{d W}$

$G (W) = \frac{d L}{d W} = \operatorname{average}_{\text{all } j} (g_{j} (W))$

$S^{2} (W) = \operatorname{variance}_{\text{all } j} (g_{j} (W))$

$S (W) = + \sqrt{S^{2} (W)}$

The $S^{2}$ notation should be thought of like the $\cos^{2} (x)$ notation: $S^{2} (W)$ means $(S (W))^{2}$ .
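As a rough sketch of these definitions in code: the two gradient functions below are made up for illustration (they are not the system from the previous post), and the helper name `gradient_stats` is hypothetical. The variance is taken as the population variance over datapoints, which is an assumption about the intended convention.

```python
import numpy as np

def gradient_stats(per_datapoint_grads, W):
    """Return (G, S2, S) on the points W for a list of callables g_j(W)."""
    W = np.asarray(W, dtype=float)
    g = np.stack([g_j(W) for g_j in per_datapoint_grads])  # shape (n_data, n_points)
    G = g.mean(axis=0)    # average over all j
    S2 = g.var(axis=0)    # population variance over all j (divides by n_data)
    return G, S2, np.sqrt(S2)

# Hypothetical two-datapoint example: g_0 = 3W - 1, g_1 = W + 1.
grads = [lambda W: 3.0 * W - 1.0, lambda W: W + 1.0]
W = np.linspace(-1.0, 1.0, 5)
G, S2, S = gradient_stats(grads, W)
# By hand: G = 2W and S^2 = (W - 1)^2, so S^2 vanishes at W = 1 while G does not.
```

Note that `S` is the positive root by construction, matching the sign choice above.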

Plotting these for our current system:

Places where $G$ is zero and the gradient of $G$ is positive are the stable equilibrium points with respect to gradient descent on $L$ (at ~1 and 2). If $G$ and $S^{2}$ are both zero at the same place, then this is an equilibrium point with respect to SGD on $L$ (only at 2). The zero points for $G$ and $S^{2}$ around the pit at 1 are not quite in the same place.

It is possible to consider probability mass of $W$ "moving" according to the following rule:

A "point" (Dirac $δ$ distribution) of probability at $W$ , between $t$ and $t + 1$ , changes to a distribution centred at $W - T G (W)$ with a variance of $T^{2} S^{2} (W)$ .
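This rule can be checked against a single step of actual SGD. The sketch below assumes each step picks one datapoint uniformly at random; the gradient functions, the starting point and the name `sgd_step` are all hypothetical choices for illustration.

```python
import numpy as np

def sgd_step(W, per_datapoint_grads, T, rng):
    """One SGD step for an array of independent walkers W."""
    n = W.shape[0]
    g = np.stack([g_j(W) for g_j in per_datapoint_grads])  # shape (n_data, n)
    j = rng.integers(len(per_datapoint_grads), size=n)     # random datapoint per walker
    return W - T * g[j, np.arange(n)]

rng = np.random.default_rng(0)
grads = [lambda W: 3.0 * W - 1.0, lambda W: W + 1.0]  # hypothetical g_0, g_1
T, W0 = 0.1, 0.5

# Start all probability mass at the single point W0 (a discretised delta).
W1 = sgd_step(np.full(100_000, W0), grads, T, rng)

# At W0 = 0.5: g_0 = 0.5, g_1 = 1.5, so G = 1.0 and S^2 = 0.25.
predicted_mean = W0 - T * 1.0      # W - T G(W)
predicted_var = T**2 * 0.25        # T^2 S^2(W)
```

With two datapoints the one-step distribution is two spikes rather than a bell curve, so only the mean and variance are expected to match here, not the shape.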

Now that we have abstracted $T$ away from the actual process of discontinuous updates, we can try to factor out the discontinuity entirely. This will make the maths more manageable when it comes to generalizing to larger models. $T$ will likely be much smaller for larger models, but as long as $S (W)$ grows with the number of datapoints used, this will compensate.

(Point of notation, I will be using $d$ rather than $\partial$ , even though the latter is arguably more correct. As we will never be "mixing" $W$ and $t$ it won't make a difference to our results)

Instead of the probability distribution moving in discrete jumps, we might now consider it flowing. This can be described by a probability current density $ρ$ .

Consider a system with $S^{2} (W) = 0$ everywhere. The probability will just flow down the gradient:

$ρ (W, t)_{G} = - T G (W) P (W, t)$

Taking $\frac{d P}{d t} = - \frac{d ρ}{d W}$ we get (when dependencies are removed for ease of reading):

${\frac{d P}{d t}}_{G} = T [G \frac{d P}{d W} + P \frac{d G}{d W}]$

Now consider a system with $G (W) = 0$ everywhere. We effectively have the evolution of a probability distribution via a random walk, which gives a "spreading out" effect. I will appeal to the central limit theorem and assume that the gradients are normally distributed. With constant $S^{2}$ we then have the following equation for $ρ$ , borrowed from the heat equation.

$ρ (W, t)_{S} = - \frac{1}{2} T^{2} S^{2} \frac{d P (W, t)}{d W}$

Based on the fundamental solution of the heat equation this will increase our variance by $T^{2} S^{2}$ each step of $t$ .

Which gives us:

${\frac{d P}{d t}}_{S} = \frac{1}{2} T^{2} S^{2} \frac{d^{2} P}{d W^{2}} + \frac{1}{2} T^{2} \frac{d (S^{2})}{d W} \frac{d P}{d W}$

But the speed of "spreading out" is proportional to $S (W)$ which changes the equation. The slower the "spreading out", the higher the probability of $W$ being there. This makes $S (W)$ act like a "heat capacity" of the location for $P (W, t)$ for which $ρ$ is a conserved current. We might be able to borrow more from heat equations. In this case $P (W, t) S (W)$ acts as the "temperature" of a region.

$ρ (W, t)_{S} = - k (W) \frac{d [P (W, t) S (W)]}{d W}$

$ρ (W, t)_{S} = - k (W) [S (W) \frac{d P (W, t)}{d W} + P (W, t) \frac{d S (W)}{d W}]$

Calculating $k$ based on our previous equation gives $k (W) = \frac{1}{2} T^{2} S (W)$ , which gives:

${\frac{d P}{d t}}_{S} = - \frac{d}{d W} [- \frac{1}{2} T^{2} S^{2} (W) \frac{d P (W, t)}{d W} - \frac{1}{2} T^{2} S (W) P (W, t) \frac{d S (W)}{d W}]$

This can be reduced to the rather unwieldy equation (removing function dependencies for clarity):

${\frac{d P}{d t}}_{S} = T^{2} [\frac{3}{2} S \frac{d S}{d W} \frac{d P}{d W} + \frac{1}{2} S^{2} \frac{d^{2} P}{d W^{2}} + \frac{1}{2} (\frac{d S}{d W})^{2} P + \frac{1}{2} S P \frac{d^{2} S}{d W^{2}}]$

But these can be expressed in terms of $S^{2}$ rather than $S$ , which is good when $S$ is pathological in some way (like when $S^{2}$ is zero above, $S$ has a discontinuous derivative). It also makes sense that our equation shouldn't depend on our choosing positive rather than negative $S$ .

$ρ_{S} = - T^{2} [\frac{1}{2} S^{2} \frac{d P}{d W} + \frac{1}{4} P \frac{d (S^{2})}{d W}]$

${\frac{d P}{d t}}_{S} = T^{2} [\frac{3}{4} \frac{d (S^{2})}{d W} \frac{d P}{d W} + \frac{1}{2} S^{2} \frac{d^{2} P}{d W^{2}} + \frac{1}{4} \frac{d^{2} (S^{2})}{d W^{2}} P]$
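As a worked check, assuming $S$ is twice differentiable, the conversion from the $S$ form to the $S^{2}$ form rests on two product-rule identities:

$S \frac{d S}{d W} = \frac{1}{2} \frac{d (S^{2})}{d W}$

$(\frac{d S}{d W})^{2} + S \frac{d^{2} S}{d W^{2}} = \frac{1}{2} \frac{d^{2} (S^{2})}{d W^{2}}$

The first turns $\frac{3}{2} S \frac{d S}{d W} \frac{d P}{d W}$ into $\frac{3}{4} \frac{d (S^{2})}{d W} \frac{d P}{d W}$ , and the second merges the two terms proportional to $P$ into $\frac{1}{4} \frac{d^{2} (S^{2})}{d W^{2}} P$ .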

Finally giving our master equations:

$ρ = - T^{2} [\frac{1}{2} S^{2} \frac{d P}{d W} + \frac{1}{4} \frac{d (S^{2})}{d W} P] - T [G P]$

$\frac{d P}{d t} = T^{2} [\frac{3}{4} \frac{d (S^{2})}{d W} \frac{d P}{d W} + \frac{1}{2} S^{2} \frac{d^{2} P}{d W^{2}} + \frac{1}{4} \frac{d^{2} (S^{2})}{d W^{2}} P] + T [G \frac{d P}{d W} + P \frac{d G}{d W}]$
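One way to integrate these equations numerically is to work with $ρ$ directly rather than the expanded form, since differencing a flux conserves probability automatically. The sketch below is a minimal explicit scheme under my own assumptions (upwind differencing for the drift part, zero current at the boundaries, a uniform grid); it is not necessarily the scheme used for the plots in this post, and the test problem ( $G = W - 2$ , constant $S^{2} = 1$ ) is hypothetical.

```python
import numpy as np

def step_density(P, W, G, S2, T, dt):
    """Advance P by dt using rho = -T^2 [S2/2 dP/dW + (dS2/dW)/4 P] - T G P."""
    dW = W[1] - W[0]
    Wf = 0.5 * (W[:-1] + W[1:])                 # interior cell faces
    dS2f = (S2(W[1:]) - S2(W[:-1])) / dW        # d(S^2)/dW at the faces
    v = -T * G(Wf) - 0.25 * T**2 * dS2f         # coefficient of P in rho
    P_up = np.where(v > 0, P[:-1], P[1:])       # upwind value of P
    rho = v * P_up - 0.5 * T**2 * S2(Wf) * (P[1:] - P[:-1]) / dW
    rho = np.concatenate(([0.0], rho, [0.0]))   # no current through the ends
    return P - dt * (rho[1:] - rho[:-1]) / dW   # dP/dt = -d(rho)/dW

# Hypothetical test problem.
W = np.linspace(-2.0, 6.0, 401)
dW = W[1] - W[0]
G = lambda w: w - 2.0
S2 = lambda w: np.ones_like(w)
T, dt = 0.1, 0.01                               # dt chosen small for stability

P = np.exp(-0.5 * (W / 0.3) ** 2)               # narrow bump centred at W = 0
P /= P.sum() * dW
for _ in range(2000):
    P = step_density(P, W, G, S2, T, dt)

mass = P.sum() * dW
mean = (W * P).sum() * dW
```

Being explicit, this scheme needs `dt` small compared with both `dW / max|v|` and `dW**2 / (T**2 * max(S2))`, so it will struggle for the same reasons any explicit scheme does when $S^{2}$ has large derivatives.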

Validation of the First Term of the Equations

Let's start with the first equation, and simulate using our G function from before.

T = 0.02, no stochasticity yet.

Here's $W$ on the y-axis, and $t$ on the x-axis. This is what the evolution of $W$ looks like for a series of initial $W$ values:

Now let's pick a couple of initial distributions and see how they evolve over time:

Time evolution with steps of $Δ t = 10$ :

This looks about right!

Now let's plot the mean of this over time, and compare to the mean and standard deviation of a Monte Carlo simulation of gradient descent. The Monte Carlo simulation starts with 1000 $W$ values drawn from a normal distribution with roughly the same mean and standard deviation (0.5 and 0.175 respectively) as our initial $P (W, t)$ distribution.

Our first equation is an accurate description of non-stochastic gradient descent. The rest of the difference in the standard deviation is most likely due to imperfect matching of our initial data ( $P$ is a truncated normal distribution but our Monte Carlo uses a normal distribution with matched mean and variance to the truncated $P$ , so some elements are $< 0$ where the gradient is small).

Validation of the Second Term of the Equations

Let's take our first example as a distribution spreading out.

$g_{0} = - 1, g_{1} = 1, G = 0, S^{2} = 1$

The probability distribution changes from a concentrated one to a broadened one.

And compare standard deviations to our Monte Carlo simulation:

Looking good. Errors here may also be due to truncation.
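A Monte Carlo check of this kind can be sketched in a few lines. The step size, number of steps and number of walkers below are assumed values for illustration, not necessarily the ones behind the plots; the prediction is that the variance grows by $T^{2} S^{2}$ per step with $S^{2} = 1$ .

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_steps, n_walkers = 0.02, 400, 20_000   # assumed values

W = np.zeros(n_walkers)                     # all walkers start at W = 0
for _ in range(n_steps):
    g = rng.choice([-1.0, 1.0], size=n_walkers)  # g_0 = -1 or g_1 = +1
    W -= T * g                                   # plain SGD update

mc_std = W.std()
predicted_std = T * np.sqrt(n_steps)        # sqrt(T^2 S^2 t) with S^2 = 1
```

Because these walkers start from a single point rather than a truncated distribution, there is no truncation error here, only sampling noise.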

One final validation step: take $g_{0} = 1.2 W - 2$ , $g_{1} = 0.8 W - 2$ , $G = W - 2$ , $S^{2} = 0.04 W^{2}$ , $T = 0.5$ . This model will be used to assess a few things: the model's ability to perform well at higher $T$ , its ability to predict the correct form of the counterbalancing "concentrating" and "spreading" forces of $G$ and $S^{2}$ , and its ability to predict the concentration of probability mass in regions of lower $S^{2}$ .

Here the probability distribution moves from the left to the centre but doesn't concentrate due to the gradient variance.
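For this linear system the SGD side of the comparison can also be checked by hand, which makes a useful sanity test. The sketch below simulates SGD directly on $g_{0}$ and $g_{1}$ , assuming each is picked with probability 1/2; the starting point and sample sizes are my own choices. With $T = 0.5$ the two possible updates are $W \to 0.4 W + 1$ and $W \to 0.6 W + 1$ , so the stationary mean $m$ solves $m = 0.5 m + 1$ and the stationary second moment $q$ solves $q = 0.26 q + m + 1$ .

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_steps, n_walkers = 0.5, 100, 20_000    # n_steps, n_walkers are assumed
grads = (lambda w: 1.2 * w - 2.0, lambda w: 0.8 * w - 2.0)

W = np.full(n_walkers, 0.5)                 # assumed starting point
for _ in range(n_steps):
    pick0 = rng.random(n_walkers) < 0.5     # choose g_0 or g_1 per walker
    g = np.where(pick0, grads[0](W), grads[1](W))
    W = W - T * g

mc_mean, mc_var = W.mean(), W.var()

# Hand-derived stationary moments of the two affine update maps.
exact_mean = 2.0
exact_var = 3.0 / 0.74 - exact_mean**2      # q - m^2, roughly 0.054
```

The nonzero stationary variance is the "doesn't concentrate" behaviour described above: even at the minimum of $L$ the two datapoints disagree about the gradient.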

Unfortunately the computational modelling seems to fall apart when applied to the original system. The large first and second derivatives of $S^{2}$ lead to a lot of instability. This means I can't validate it much more than this. High values of $T$ also cause the model to break down, as the gradient might change a lot in the span of a step. I think this can be remedied by (for example) picking a $g$ to update on and updating with multiple small steps before changing $g$ .

I'm no master programmer and I don't have much experience working with unstable PDEs. So I can't do much more here.

Solving for End-States

For an end-state, $ρ = 0$ everywhere. This means:

$0 = - T^{2} [\frac{1}{2} S^{2} \frac{d P}{d W} + \frac{1}{4} \frac{d (S^{2})}{d W} P] - T [G P]$

$T \frac{1}{2} S^{2} \frac{d P}{d W} = - T \frac{1}{4} \frac{d (S^{2})}{d W} P - G P$

$\frac{d P}{d W} \frac{1}{P} = - \frac{1}{2 S^{2}} \frac{d (S^{2})}{d W} - \frac{2 G}{T S^{2}}$

$\frac{d (\log (P))}{d W} = - \frac{1}{2 S^{2}} \frac{d (S^{2})}{d W} - \frac{2 G}{T S^{2}}$

$\frac{d (\log (P))}{d W} = - \frac{1}{2} \frac{d (\log (S^{2}))}{d W} - \frac{2 G}{T S^{2}}$
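Where $S^{2}$ stays strictly positive, this last equation can be integrated numerically to get the end-state directly. The sketch below is a minimal version using trapezoidal integration; the function name `stationary_density` and the test case ( $G = W - 2$ , constant $S^{2} = 1$ , $T = 0.1$ ) are hypothetical. For that test case the integral can be done by hand and gives a Gaussian with variance $T / 2$ .

```python
import numpy as np

def stationary_density(W, G, S2, T):
    """Integrate d(log P)/dW = -1/2 d(log S2)/dW - 2G/(T S2) on a grid.

    G and S2 are arrays sampled on W; S2 must be strictly positive.
    """
    dW = np.diff(W)
    f = 2.0 * G / (T * S2)
    # Cumulative trapezoidal integral of f, starting from W[0].
    integral = np.concatenate(([0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * dW)))
    log_P = -0.5 * np.log(S2) - integral
    P = np.exp(log_P - log_P.max())          # subtract max to avoid overflow
    return P / (0.5 * (P[1:] + P[:-1]) * dW).sum()   # normalise to unit mass

W = np.linspace(0.0, 4.0, 2001)
T = 0.1
P = stationary_density(W, W - 2.0, np.ones_like(W), T)

h = W[1] - W[0]
mean = (W * P).sum() * h
var = ((W - mean) ** 2 * P).sum() * h        # expected to be close to T / 2
```

This does nothing to fix the case where $S^{2}$ reaches zero: `np.log(S2)` and the division both blow up there.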

This shows our problem. When $S^{2}$ vanishes, our equations don't work terribly well. We might have to hope that the two opposing $S^{2}$ terms cancel out and it works, but who knows. This is probably the source of instability in our equations.

But around some minimum it lets us interpret something. If $\log (P)$ is decreasing linearly then $P$ decreases exponentially. Let's consider the $\frac{2 G}{T S^{2}}$ term now. If we have two minima (with a maximum between them) around which the loss landscapes are exactly the same, except one is twice as wide (in all $l_{j}$ ), then the $G$ component will be halved in the wider one, but the $S^{2}$ part will be quartered. This means the integral $\int - \frac{2 G}{T S^{2}} d W$ from the centre of the wider one to the maximum will be four times that of the narrower one. Therefore the probability density at the centre of the wider minimum's basin will be ~~e^4 = 56 times~~ Edit: a lot higher.
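The factor of four can be written out as a short worked check. The setup is an assumption for illustration: shift coordinates so the narrow minimum sits at $W = 0$ with the maximum at $W = a$ , and let the wide basin have per-datapoint losses $\tilde{l}_{j} (W) = l_{j} (W / 2)$ , so its maximum sits at $2 a$ . Then:

$\tilde{G} (W) = \frac{1}{2} G (W / 2)$

$\tilde{S}^{2} (W) = \frac{1}{4} S^{2} (W / 2)$

$\int_{0}^{2 a} \frac{2 \tilde{G} (W)}{T \tilde{S}^{2} (W)} d W = \int_{0}^{2 a} \frac{4 G (W / 2)}{T S^{2} (W / 2)} d W = 4 \int_{0}^{a} \frac{2 G (u)}{T S^{2} (u)} d u$

The last step substitutes $W = 2 u$ . One factor of two comes from the ratio $G / S^{2}$ doubling, and the other from the basin being twice as long.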

What's the point?

Reasoning about stochastic processes is difficult. Reasoning about differential equations is also difficult, but the tools to analyse differential equations are different and might be able to solve different problems.

SGD is believed to have a certain "bias" towards low-entropy models of the world. Part of this is a preference for "broader" rather than "narrower" minima in $L$ . Now we have some tools which may allow us to understand this. Under this model, SGD is also biased towards regions of low gradient variance $S^{2}$ .

Further Investigation

I think there's something like a metric acting on a space here. $S^{2}$ looks like a metric, and perhaps it's actually more correct to consider the space of $W$ with the metric such that $S^{2} = 1$ everywhere. For higher dimensions we get the following transformations:

$W \to \vec{W}$
$G \to \vec{G}$
$S^{2} \to \mathbf{S}^{2}$

Now $\vec{W}$ and $\vec{G}$ are vectors and $\mathbf{S}^{2}$ is a matrix. This extends nicely as we can choose our metric such that $\mathbf{S}^{2} = I$ . It might be useful to define some sort of function like an "energy" over the landscape of $\vec{W}$ in terms of $G$ , $S$ , and $T$ alone which describes the final probability distribution. In fact such a function must exist assuming SGD converges, as $\log (P (W, \infty))$ is well-defined. Finding the actual form of this function would require some working out, and it may not be at all easily described. This whole process is very reminiscent of both chemical dynamical modelling and finding the minimum-energy configuration of a quantum energy landscape, as both consist of a "spreading" term and an "energy" term.

While it is quite interesting, I don't consider this a research priority for myself. About 90% of this post has been sitting in my drafts for the past 3 months. Even if powerful AI is created using SGD, I'm not convinced that this sort of model will be hugely useful. It might be possible to wrangle some selection-theorem-ish-thing out of this but I don't think I'll focus on it.
