If you (understandably) think "this post is too long, ain't nobody got time for that!", I suggest scrolling to ~the middle of the post and at least checking out the Mandelbrot Training Animation(s). I think they're pretty neat.
After going through ~half of the ARENA curriculum, I couldn't stop myself from taking a few days off and just exploring what happens in a small neural network during gradient descent. I'm summarizing that little adventure in this not-so-little post. Hard to say how far my findings and newly won intuitions generalize beyond this particular case, but at the very least it led to some nice-looking pictures.
I trained a small neural network (an MLP with <100 neurons) to learn complex number multiplication. Technically, I didn't train just one, but a whole bunch of them. Here are some of the questions I wanted to answer:
As this was rather explorative and not super systematic, much of what I did was improvised. For instance, I didn't do any thorough hyperparameter exploration, but just manually tried out different values (e.g. for batch size, training set size and learning rate) and ended up with a configuration that seemed to perform well enough. As the function I'm training on is quite simple and the networks are small, most training runs finished within a few seconds and the entire project cost well under $10 of cloud compute.
Throughout this post, I'll typically start with a question, giving readers a chance to form their own expectations and test their intuition before I summarize what I found.
Caveat: Pretty much everything I did here was vibe-coded using the Google Colab Gemini integration (with occasional help from Claude whenever Gemini got stuck). So I can't entirely guarantee that the code contains no stupid mistakes, or that it always does exactly what I think it does. That being said, most of what I found seems pretty plausible to me, and/or I kept investigating up to a point where things clearly worked out the way they should. The exception to this is probably the Loss Landscape section towards the end, which stands on shakier ground. If you want to check the code, or run any of the experiments yourself, please PM me and I'll gladly share the Colab notebook.
What is complex number multiplication?
I can imagine many LessWrong readers are broadly familiar with the concept, but for those who are not, here's a very basic introduction to complex numbers. While real numbers live on a 1-dimensional number line, complex numbers can be interpreted as points in a 2D space (the complex plane). So a complex number x consists of two components (or coordinates), called its real and its imaginary part, and can be written as re_x + im_x·i, where i is a constant with the property that i² = -1. Here, re_x can be interpreted as "pointing in the direction of real numbers", while i (and thereby also im_x·i) is orthogonal to that, "pointing in the direction of imaginary numbers". Defined this way, addition of two complex numbers is trivial - you just add real parts and imaginary parts separately (like vector addition in R²). Multiplication is also pretty straightforward when doing it in closed form algebraically: for two complex numbers x and y, you get
(re_x + im_x·i) × (re_y + im_y·i) = re_x·re_y + re_x·im_y·i + im_x·re_y·i + im_x·im_y·i²
But since i² is defined as -1, we can turn this into:
= re_x·re_y + re_x·im_y·i + im_x·re_y·i - im_x·im_y
= (re_x·re_y - im_x·im_y) + (re_x·im_y + im_x·re_y)·i
And so we end up with a new complex number, whose real part (re_x·re_y - im_x·im_y) and imaginary part (re_x·im_y + im_x·re_y) can be straightforwardly calculated with simple addition and multiplication.
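As a quick sanity check, here's that formula in code (a minimal sketch; complex_mul is just my name for illustration), compared against Python's built-in complex type:

```python
def complex_mul(re_x, im_x, re_y, im_y):
    """Multiply two complex numbers given as (real, imaginary) pairs."""
    re_out = re_x * re_y - im_x * im_y
    im_out = re_x * im_y + im_x * re_y
    return re_out, im_out

# Compare against Python's built-in complex arithmetic:
print(complex_mul(1.0, 2.0, 3.0, 4.0))   # (-5.0, 10.0)
print((1 + 2j) * (3 + 4j))               # (-5+10j)
```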
One neat "application" of complex numbers is that they allow computing (and thereby rendering visually) some interesting "fractals" like the Mandelbrot set - which we'll do later on.
The neural network(s) that I've trained are not able to express complex number multiplication exactly, as they cannot multiply activations but only add them[1]. Hence, what the networks learn will always just be an approximation of this function rather than the real math behind it.
As I wanted to train a network on complex number multiplication, I needed 4 inputs (real and imaginary parts of the two input numbers) and 2 outputs (real and imaginary part of the result). I varied the number and width of hidden layers throughout my experiments.
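For concreteness, here's a minimal sketch of what such a network could look like in PyTorch (make_model is my name for illustration; the layer sizes and activation below are just one example configuration, and my actual code may differ in details):

```python
import torch
import torch.nn as nn

def make_model(hidden_sizes, activation=nn.ReLU):
    """MLP with 4 inputs (re/im of both factors) and 2 outputs (re/im of the product)."""
    layers, in_features = [], 4
    for h in hidden_sizes:
        layers += [nn.Linear(in_features, h), activation()]
        in_features = h
    layers.append(nn.Linear(in_features, 2))
    return nn.Sequential(*layers)

model = make_model([20, 30, 20], activation=nn.SiLU)  # e.g. one of the configurations used below
```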
Some other implementation and training details:
At first, I trained models with a few different hidden layer setups to see how well they perform:
Feel free to come up with some hypotheses on how these will perform, either relative to each other or even what MSE loss to expect. I trained each of these for 150 epochs (learning rate of 0.01, 4096 × 0.8 training samples).
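The training itself is the standard supervised recipe: sample random complex pairs, compute their true product as the target, and minimize MSE. A rough sketch, reusing make_model from above (the sampling range, batch size and choice of plain SGD are assumptions on my part; the actual runs may differ):

```python
import torch
import torch.nn as nn

def make_dataset(n_samples=4096, scale=2.0):
    """Random complex pairs (as 4 real inputs) and their true products (2 outputs)."""
    x = (torch.rand(n_samples, 4) - 0.5) * 2 * scale   # re_x, im_x, re_y, im_y
    re = x[:, 0] * x[:, 2] - x[:, 1] * x[:, 3]
    im = x[:, 0] * x[:, 3] + x[:, 1] * x[:, 2]
    return x, torch.stack([re, im], dim=1)

def train_model(model, epochs=150, lr=0.01, batch_size=64):
    """Train on fixed samples, return the final test MSE."""
    x_train, y_train = make_dataset(int(4096 * 0.8))
    x_test, y_test = make_dataset(int(4096 * 0.2))
    opt = torch.optim.SGD(model.parameters(), lr=lr)   # optimizer choice is an assumption
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        perm = torch.randperm(len(x_train))
        for i in range(0, len(x_train), batch_size):
            idx = perm[i:i + batch_size]
            opt.zero_grad()
            loss_fn(model(x_train[idx]), y_train[idx]).backward()
            opt.step()
    with torch.no_grad():
        return loss_fn(model(x_test), y_test).item()
```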
Unsurprisingly, larger models performed better than smaller ones, and SiLU helped a lot. In all cases, train loss was lower than test loss, but they were reasonably close, so the networks generalized well enough (but then again, with a function as smooth as complex number multiplication, it would be surprising if there was no generalization).
Here are the train and test losses I obtained after 150 epochs for each of the models:
| Model | Hidden Layers | Train Loss | Test Loss |
|---|---|---|---|
| 0 | [] | 6.2571 | 7.6424 |
| 1 | [10] ReLU | 3.1443 | 3.7585 |
| 2 | [10, 10] ReLU | 1.6828 | 2.2726 |
| 3 | [20, 30, 20] ReLU | 0.1662 | 0.3403 |
| 4 | [20, 30, 20] SiLU | 0.0225 | 0.1108 |
| 5 | [20, 30, 20] SiLU (1000 epochs) | 0.0036 | 0.0262 |
I won't show all the details, but as an example, here's the loss curve of model 3 ([20, 30, 20] hidden layers using ReLU):
Eyeballing it, it appears that this particular model went through ~3 phases during training: a very steep drop over the first ~6 epochs, then a slower but ~linear decline until epoch ~30-40, and then a very slow crawl towards 0. I didn't look more deeply into these phases though. This was also not a general pattern across the other models (although the very broad pattern of a steep initial drop that abruptly turns flat holds for all of them).
I likely could have trained any of these models for much longer to reduce the loss further. E.g. the last model had not plateaued yet after its 1000 epochs. I also could have used more training samples (or dynamically generated training data instead of fixed samples) to improve generalization and reduce the gap between train and test loss.
Taking the case of 2 hidden layers, I was wondering how the neuron count in these layers would affect the loss. So I made a grid of [2…12] × [2…12] and trained one model (for only 30 epochs each) for each combination (the first hidden layer having n neurons and the second having m, with both n and m ranging from 2 to 12), then plotted the test loss each model achieved.
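The sweep itself is just a double loop over the two widths; roughly like this (a sketch reusing make_model and train_model from the sketches above):

```python
import numpy as np
import torch.nn as nn

sizes = list(range(2, 13))                 # 2..12 neurons per hidden layer
grid = np.zeros((len(sizes), len(sizes)))
for i, n in enumerate(sizes):              # width of first hidden layer
    for j, m in enumerate(sizes):          # width of second hidden layer
        model = make_model([n, m], activation=nn.ReLU)
        grid[i, j] = train_model(model, epochs=30)
```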
What's your expectation of what we might find?
I thought that, surely, more is better. I wasn't sure though whether what I'd find would be roughly symmetrical or not. If you have, say, 4 neurons in one layer and 10 in the other, would it be better to have the smaller layer first, or the larger one? No idea. I know that CNNs often have somewhat of a "Christmas tree" shape, starting out wide and then getting smaller and smaller layers, so maybe that is a general pattern - that starting with wide layers and narrowing them down makes sense? Perhaps.
In my noisy results, it's clear that more is indeed better, and there's a slight tendency for a larger first layer to be the better choice. But I wouldn't want to generalize from this tiny experiment.
After experimenting with some ways to visualize what these networks learned, I ended up doing the following: I created an interactive widget where you can set the real and imaginary parts of one of the input numbers. It then renders a 2D function, mapping the real and imaginary parts of the second input number to a color that represents the output number (using its angle and magnitude for colorization). While this is a bit hard to interpret concretely, it does allow for quick visual comparison of the real function vs. what the networks learned. I then also rendered a heatmap to show the diff between the learned and the real function.
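The colorization is essentially a simple form of domain coloring: hue from the output's angle, brightness from its magnitude. A minimal sketch of the idea (my exact color mapping may differ); the example plots the true product with the first factor fixed to 1 + 1i:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import hsv_to_rgb

def colorize(re_out, im_out, max_abs=8.0):
    """Map a grid of complex outputs to RGB: hue = angle, brightness = magnitude."""
    angle = np.arctan2(im_out, re_out)                 # in [-pi, pi]
    hue = (angle / (2 * np.pi)) % 1.0                  # in [0, 1)
    value = np.clip(np.abs(re_out + 1j * im_out) / max_abs, 0, 1)
    hsv = np.stack([hue, np.ones_like(hue), value], axis=-1)
    return hsv_to_rgb(hsv)

# True product (1 + 1i) * (re2 + im2*i), rendered over the second factor's plane:
re2, im2 = np.meshgrid(np.linspace(-2, 2, 200), np.linspace(-2, 2, 200))
re_out = 1 * re2 - 1 * im2
im_out = 1 * im2 + 1 * re2
plt.imshow(colorize(re_out, im_out), extent=[-2, 2, -2, 2], origin="lower")
plt.show()
```

Outputs near zero end up dark (which is why there's a black region in the middle of the images below), and the hue cycles with the output's angle.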
Here we see the performance of Model 0 (the one without any hidden layers), for some arbitrary first complex number:
Comparing left and middle image, we can see that there is something useful going on in this model - the ordering of colors is clearly similar, with blue at the top, red/purple on the left, green on the right, black in the middle. But it's also very easy to see the difference.
The heatmap on the right shows the diff between the two images on the left.
How would you expect things to develop from here, when looking at the larger models?
Unsurprisingly, model 1 ([10]) looks a bit more accurate:
(Also note that the heatmap on the right, indicating the loss, keeps adjusting its scale to the model - so the heatmaps will have similar color schemes independent of the model's overall loss. The purpose of the heatmap is rather to see the "shape" of the difference between the learned function and reality.)
Model 2 ([10, 10]):
Model 3 ([20, 30, 20]):
Here, it starts becoming easy to mistake the predicted image in the middle for the real thing on the left (perhaps depending on your display contrast). Based on this eyeballed benchmark, the model truly comes really close. But looking at the heatmap on the right, we can see that there are still many "triangle shapes" visible - it appears that the model just stitches together piece-wise approximations of the function, distributed somewhat arbitrarily in space.
We can also see that the "resolution" of these piece-wise approximations is finer than in the earlier models, which makes sense, as this model has way more neurons (and weights) with which to approximate the function.
Model 4 ([20, 30, 20], SiLU):
Now, the diff heatmap is smooth at last - no obvious stitching of pieces. I can imagine that the same thing is still going on in principle, and the smoother activation function just makes it harder to see in this rendering. I would still assume that different weights are "responsible" for different parts of space. I didn't look into this though, so maybe I'm wrong.
For completeness, here's model 5 (same model as above, but trained for 1000 instead of 150 epochs):
One of the amazing applications (or, perhaps, "applications") of complex numbers is rendering fractals like the Mandelbrot or Julia set.
So, naturally, I thought: what if I render them, not based on proper complex number multiplication, but using the networks I trained?
What's your expectation? Will it work? Will Mandelbrot be recognizable? If so, for which of the models I trained above?
For reference, here's the actual Mandelbrot, when rendered correctly (with some arbitrary color gradient based on iterations it took to detect divergence):
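The rendering is the usual escape-time iteration z ← z² + c, except that z² can either be computed exactly or by feeding (re_z, im_z, re_z, im_z) through a trained network. A sketch of the idea (the resolution, iteration count and value clipping are my own choices, not necessarily what I used for the images below):

```python
import numpy as np
import torch

def mandelbrot_escape(model=None, res=300, max_iter=50, extent=(-2.0, 1.0, -1.5, 1.5)):
    """Iterations until |z| > 2 for z <- z^2 + c, with z^2 optionally done by the network."""
    xs = np.linspace(extent[0], extent[1], res)
    ys = np.linspace(extent[2], extent[3], res)
    re_c, im_c = np.meshgrid(xs, ys)
    re_z, im_z = np.zeros_like(re_c), np.zeros_like(im_c)
    counts = np.full(re_c.shape, max_iter)
    for it in range(max_iter):
        if model is None:                    # exact squaring
            re_sq = re_z * re_z - im_z * im_z
            im_sq = 2 * re_z * im_z
        else:                                # squaring via the network: multiply z by itself
            inp = torch.tensor(np.stack([re_z, im_z, re_z, im_z], axis=-1),
                               dtype=torch.float32).reshape(-1, 4)
            with torch.no_grad():
                out = model(inp).numpy().reshape(res, res, 2)
            re_sq, im_sq = out[..., 0], out[..., 1]
        re_z, im_z = re_sq + re_c, im_sq + im_c
        re_z, im_z = np.clip(re_z, -1e6, 1e6), np.clip(im_z, -1e6, 1e6)  # keep values finite
        diverged = (re_z * re_z + im_z * im_z > 4) & (counts == max_iter)
        counts[diverged] = it
    return counts
```

The returned iteration counts can then be mapped through any colormap to produce images like the ones below.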
Rendering it based on Model 0 (no hidden layers), we obtain… something. It's not very good:
It's neither Mandelbrot-like, nor even centered around where it's supposed to be, nor notably "fractal". Zooming in on the slightly more interesting part:
Yeah, not much there. I'll spare you Model 1 ([10]) which is similarly underwhelming.
Model 2 ([10, 10]) certainly has something going on:
But zooming into the interesting part, it's closer to repetitive noise than anything fractal-like:
Model 3 ([20, 30, 20]) is actually the first one that bears some distant resemblance to the Mandelbrot set:
But it's still extremely far off. I found this surprising, as the visual function comparison in the last section looked very promising for this model.
Now, what about Model 4 ([20, 30, 20] with SiLU)? Based on loss, we saw that using SiLU instead of ReLU made a modest difference, reducing test loss to roughly a third under otherwise identical training conditions. But will this make a big difference for what the rendered Mandelbrot looks like?
Have a look:
Well, it basically nails it!
And if we zoom in on the middle, the "neck" of the figure:
We can see that even zoomed in, a lot of the detail that one would expect is indeed there.
For comparison, here's what the above viewport should look like:
So we can definitely see some differences, e.g. the one based on Model 4 is much less symmetric - but the basic shapes are all there, just slightly morphed in space.
It's interesting how Model 3 and Model 4 had test loss values that were not all that far apart (0.34 vs 0.11, respectively), yet their Mandelbrot renderings are worlds apart. Why is that?
I don't know for sure, but my loose assumption would be something like this:
The difference between models 3 and 4 is that 3 uses ReLU and hence has all these flat, linear patches, whereas model 4 uses SiLU and is much smoother. The way fractals like the Mandelbrot set are generated is through an iteration of hundreds of steps where we keep multiplying (in the case of Mandelbrot, squaring) complex numbers. Multiplication of complex numbers can be interpreted as scaling + rotation. I suppose that in the model 3 case, where much of the space is flat, this iteration just behaves predictably linearly within these isolated patches of space and doesn't perform the subtle rotational dance that gives fractals their fractal-ness. So, even though the losses of model 3 and model 4 are not that far apart, model 3 may just lack the "smoothness of space" property that is required for fractals to fractal?
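To spell out the "scaling + rotation" view in symbols (these are standard facts about complex numbers, not something specific to my experiments): in polar form, multiplication multiplies magnitudes and adds angles, so squaring squares the magnitude and doubles the angle - exactly the kind of delicate rotational structure that a piecewise-linear approximation can wash out within each flat patch.

$$
z = r\,e^{i\theta},\quad w = s\,e^{i\varphi}
\;\;\Longrightarrow\;\;
z \cdot w = (r s)\, e^{i(\theta + \varphi)},
\qquad
z^2 = r^2\, e^{i\,2\theta}
$$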
Finally, I've spent some of my compute to visualize the learning process of model 3 and 4 to see what the learned Mandelbrot looks like through the epochs.
Here's the model 3 architecture (for this animation I trained it a bit longer than for the images above):
It really looks as if there's a Mandelbrot inside there that's struggling to push space into its proper shape. Maybe if I'd let this run for a few hours, we would have gotten somewhere, at least in this zoomed-out view. (Based on my explanation above, I would expect that we'd basically never get a proper fractal that we can zoom into and find "infinite" detail.)
It reminds me a bit of a bird struggling to hatch from its egg - and of Ilya Sutskever's claim that "the models just want to learn".
And here we have model 4:
It's interesting how it finds a Mandelbrot-like shape extremely quickly, even while its loss is still much higher than that of the fully trained model 3, and afterwards it's basically just fine-tuning the spatial proportions.
The post carries "gradient descent" in its title, yet so far there was relatively little of that. So let's leave the fancy visualizations behind and look at some numbers and dimensions.
Firstly, I was interested in the question of to what degree the random network initialization determines the ultimate performance of the network. Does an initialized network's loss determine its performance after n epochs? Is it worth creating multiple initializations and checking their losses to achieve better results? Or should one perhaps start with a bunch of random initializations, train each for m epochs, and continue with the best one from there?
For this, I settled on a simple model architecture (similar to model 2 above: two hidden layers with [10, 10] neurons) and then created 30 models with random initialization.
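The setup is simple (a sketch reusing make_dataset and make_model from the earlier sketches; the checkpoints 0/10/100 match the histograms below, everything else is an assumption):

```python
import torch
import torch.nn as nn

x_train, y_train = make_dataset(int(4096 * 0.8))
x_test, y_test = make_dataset(int(4096 * 0.2))
loss_fn = nn.MSELoss()

checkpoints, histories = [0, 10, 100], []
for seed in range(30):
    torch.manual_seed(seed)                          # a fresh random init per model
    model = make_model([10, 10], activation=nn.ReLU)
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    history = {}
    for epoch in range(max(checkpoints) + 1):
        if epoch in checkpoints:
            with torch.no_grad():
                history[epoch] = loss_fn(model(x_test), y_test).item()
        for i in range(0, len(x_train), 64):         # one epoch of mini-batch updates
            opt.zero_grad()
            loss_fn(model(x_train[i:i+64]), y_train[i:i+64]).backward()
            opt.step()
    histories.append(history)

for ep in checkpoints:
    vals = [h[ep] for h in histories]
    print(f"epoch {ep}: min {min(vals):.2f}, max {max(vals):.2f}")
```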
So, let's see what their starting test loss distribution looked like:
The initial test loss of the 30 models I set up ranged from roughly 29 to 38 - so some potential for different performance, but not orders of magnitude of difference.
After 10 epochs:
Ranging from roughly 6 to 18.
And after 100:
Ranging from about 1.2 to 3.8.
So there is some difference in performance between the 30 models, but it doesn't seem that huge and they all keep moving within some relatively narrow band of loss values. I'm not sure if they will all converge to the same lower bound if you train them long enough, but that seems certainly possible based on the above data.
However, based on these histograms, we don't know which models end up with the lowest loss. So here's a visualization of the entire process with all 30 models:
It's hard to make out the details here, but one can certainly see that the order of the models changes a lot. The 3 that started out worst soon caught up (although one of them ends up performing worst again later). And the one that started out best did not stay in first place for long.
Another way to look at this data is to check the correlation of the loss of different models after n vs m epochs. E.g., does the loss after 5 epochs predict the loss after 100 epochs to a strong enough degree that it might be worth training multiple models a bit to then pick the best one for a longer training run?
Probably not - while there is a decent correlation of 0.556, it's probably too weak for training multiple models briefly and picking the best one to give you much of an edge over just taking any model and throwing all your compute at it. As with basically all findings here, I would be careful about generalizing much beyond my toy setup, but it's still interesting to see.
Next, I wanted to learn more about the loss landscape. What's its geometry like? Is it rough? Smooth? Regular? Chaotic? Fractal-esque? Will I find anything unexpected there?
What I ended up doing is this: At different points in time during training, I'd look along the current gradient and scale the learning rate from, say, -10 to 10, and check what loss I'd end up with for the learning rate scaled with these factors. This would then give me a slice of the loss landscape as a R → R graph. What kind of structure do you think this will yield?
For this, I trained a [10, 10, 10] model (don't ask me why I chose a different architecture from the earlier experiments - I don't think it matters much) for 45 epochs (45 being arbitrary; I could have equally chosen 5 or 3,000).
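Concretely, the probe looks roughly like this: compute the current gradient once, then for a range of scaling factors apply (factor × learning rate × gradient) to a copy of the weights and record the loss there, relative to the unperturbed loss (a sketch; gradient_loss_slice is my name for illustration, and x, y stand for a batch of training data):

```python
import copy
import numpy as np
import torch

def gradient_loss_slice(model, x, y, loss_fn, lr=0.01, factors=np.linspace(-10, 10, 201)):
    """Relative loss along the gradient direction, scaled by factor * lr."""
    # Compute the current gradient once.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    grads = [p.grad.detach().clone() for p in model.parameters()]

    base_loss = loss_fn(model(x), y).item()
    losses = []
    for f in factors:
        probe = copy.deepcopy(model)              # perturbed copy of the model
        with torch.no_grad():
            for p, g in zip(probe.parameters(), grads):
                p -= f * lr * g                   # positive factor = a gradient descent step
            losses.append(loss_fn(probe(x), y).item() - base_loss)
    return factors, np.array(losses)
```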
And this is the "gradient loss slice" I obtained:
So, again, what do we see here? The x-axis shows different scaling factors for the gradient of the model at that point in time. The y-axis then shows the relative loss of the network when the current gradient times the learning rate times the given factor is applied to the model weights (relative to the loss of the model without any change). So, 3 things should be pretty certain, and are indeed the case here:
Now, what surprised me is that this slice of the loss landscape is basically just a parabola (not exactly - it's not perfectly symmetrical - but it clearly comes very close to one). My initial expectation was to see a far more chaotic landscape. You know, similar to, say, a slice through a mountain range on Earth, or something. Or somewhat like a random walk where we only locally know that right in front of us it should descend, but which otherwise behaves erratically. But no, it's extremely regular and boring. Is this just due to the function we're approximating? After all, complex number multiplication itself is also pretty regular, smooth and boring. So maybe that's the reason?
I went ahead and came up with some cursed and entirely useless function (as a replacement for complex number multiplication, just to see what its loss landscape will look like), optimized purely for the purpose of leading to more interesting geometry:
```python
import numpy as np

def interesting_function(a, b, c, d):
    # A deliberately "cursed" pile of operations, chosen only to produce
    # high-frequency, irregular behavior (both outputs are wrapped into [0, 3)).
    v1 = np.sin(np.power(np.abs(a), np.abs(b)))
    v2 = np.fmod(d, (b - a))
    v3 = np.min([c, d]) / np.max([c, d])
    v4 = np.fmod(a * b, v3)
    return np.mod(np.abs(v1 * c + (1 - c) * v2), 3), np.mod(np.abs(v3 * v4), 3)
```
It doesn't make much sense - I just semi-randomly threw together a bunch of mathematical operators to get some high-frequency irregular behavior out of it.
Visualizing it, I can't say I'm not happy with how the function turned out:
(The model here clearly struggles to mirror that function, but we can still see it made a bit of progress during training)
So, will this much less regular function also have a much less regular loss landscape? Or will we still just get something parabola-like?
It turns out that even for this function, my gradient slice looks rather boring:
Perhaps it's just the direction of the gradient that's boring and regular? Maybe random other directions yield more interesting geometry?
Well, I went ahead and plotted a 2D function, with two (more or less[2]) random directions in weight space, and looked at how a linear combination of them both would affect the loss. And I basically got a bowl:
(This now shows absolute, not relative loss, hence the low point does not have a value much below 0)
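That 2D plot follows the same recipe, just along two fixed (more or less) random directions in weight space instead of the gradient (a sketch; the normalization details hinted at in the footnote are omitted here):

```python
import copy
import numpy as np
import torch

def loss_surface_2d(model, x, y, loss_fn, radius=1.0, steps=41):
    """Loss over a 2D grid spanned by two random directions in weight space."""
    dirs = [[torch.randn_like(p) for p in model.parameters()] for _ in range(2)]
    alphas = np.linspace(-radius, radius, steps)
    surface = np.zeros((steps, steps))
    for i, a in enumerate(alphas):
        for j, b in enumerate(alphas):
            probe = copy.deepcopy(model)
            with torch.no_grad():
                for p, d1, d2 in zip(probe.parameters(), *dirs):
                    p += a * d1 + b * d2          # move along the linear combination
                surface[i, j] = loss_fn(probe(x), y).item()
    return alphas, surface
```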
To be honest, when I went into this experiment, I expected something more in this direction (but with more dimensions):
I assumed the loss landscape would be quite erratic and complex, and beyond the local knowledge about "gradient going down here" anything could happen. But it appears that, at least for the two functions I looked at, the loss landscape is indeed very regular and "boring". If someone has an intuitive understanding for why that should be the case, I'd be very interested to hear it!
Now, I should mention, I'm much less confident in the results of this section than the others, because there are a few things that don't quite add up for me:
So, ummm… yeah. But, on the other hand: asking Claude Opus 4.1 in a very non-leading way what type of geometry it would expect from the above experiment, it did predict: something that looks like a parabola. So, apparently, this result is not actually surprising, and my expectation of some interesting geometry hiding in this high-dimensional loss landscape was just incorrect.
I was wondering about one other thing - how do parameters actually move through the high-dimensional parameter space? I could think of at least two options here. As I trained a model (now again on complex number multiplication) and the loss converged more and more towards its lower bound, I could either observe:
Here's what I found:
These are all weights and biases of a model with hidden layers of sizes [6, 8, 6] that ends up with 154 parameters (4*6 + 6*8 + 8*6 + 6*2 = 132 weights, and 6+8+6+2 = 22 biases).
So, does this look more like option 1 or option 2? Well, hard to say, but perhaps a bit closer to option 1. While most parameters have "settled down" after the first few hundred epochs, some of them are clearly still on a mission. I was also a bit surprised by how smoothly many of them are moving. The above was with a learning rate of 0.01. Increasing it to 0.03, I got the following:
It still seems to be the case that many parameters are systematically drifting in one direction, albeit now in a noisier fashion. It gives me "just update all the way bro!" vibes - but perhaps this is a bit of a coordination problem: multiple parameters have to move more or less in unison, and none of them can reasonably update further on its own, even though the longer-term trajectory is clear. Hence, increasing the learning rate doesn't appear to help much with arriving at the final configuration any sooner; it just leads to a bunch of back-and-forth.
Maybe a suitable metaphor would be: when you're walking through the dark, you'll take super small steps, even when you know the direction of your destination, as you need to be careful not to trip and fall over. These networks similarly seem to often move in relatively consistent directions, but need to do so carefully and slowly, as so many parameters need to be coordinated for this to work out.
Returning to a learning rate of 0.01, I then went for a longer training run of 3000 epochs, this time not plotting every single parameter, but instead a few different metrics:
And we get this mess (starting at epoch 200, so we're skipping much of the initial chaos):
(Note: the step distance was scaled up by 10x as otherwise it was hard to visually tell it apart from 0)
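In case the metrics are unclear: by "step distance" I mean the norm of the per-epoch change of the full parameter vector, and the dot product is between consecutive per-epoch update vectors (positive = the parameters keep moving in a similar direction, negative = they backtrack). A sketch of how these could be computed, assuming that's the right reading of the chart (train_one_epoch stands in for one epoch of the training loop sketched earlier):

```python
import torch

def flat_params(model):
    """All parameters concatenated into one flat vector."""
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def track_movement(model, train_one_epoch, n_epochs=3000):
    """Per-epoch step distance and dot product between consecutive update vectors."""
    step_distances, step_dots = [], []
    prev_params, prev_step = flat_params(model), None
    for _ in range(n_epochs):
        train_one_epoch(model)                        # hypothetical: one epoch of training
        params = flat_params(model)
        step = params - prev_params
        step_distances.append(step.norm().item())     # how far the weights moved this epoch
        if prev_step is not None:
            # > 0: consecutive updates point in broadly the same direction; < 0: they backtrack.
            step_dots.append(torch.dot(step, prev_step).item())
        prev_params, prev_step = params, step
    return step_distances, step_dots
```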
Some observations:
If the learning rate is indeed somewhat too large, let's have a look at the same chart but for learning rate 0.001. Note that the out-of-view epochs (the first 200) still use learning rate 0.01, to get to the same visible starting point (and to get the test loss to roughly 1 so the chart looks nice with all 4 lines sharing the same y-axis).
Indeed, the test loss is much less volatile here - and it even reaches a lower value than before (~0.2 after 3000 epochs, compared to ~0.23 with the previous, higher learning rate). Additionally, it appears that the dot product is now (for the observed 3000 epochs) consistently positive, suggesting that parameters are moving much more regularly/predictably in slowly-changing directions rather than back-and-forth. I would suppose that the dot product eventually drops below 0 here as well. Alright, let's test that too, and then we'll call it a day.
Here we go, 12k epochs instead of 3k:
Interestingly, while the dot product now does generally drop below 0 after around 5000 epochs, it occasionally jumps back up above 0 every now and then for a hundred epochs at a time or so. So it appears that interesting things are still going on to some degree even after 10k+ epochs.
What should we take away from all of this? It's a bit hard to say, given that it's unclear to what degree my findings here generalize beyond the simple toy model I've been investigating.
But if I had to summarize some loosely acquired beliefs that I took away from this:
That all being said, I'm sure there are many things I got wrong in this post, from parts of the implementation to my interpretations of the results. Maybe even some of the questions I've been asking are misguided. So, if anyone actually took the time to read (parts of) this post, please feel very free to correct any misconceptions and errors you encounter. I'd also be happy about further interesting questions getting raised that I failed to think of which would be worth exploring.
I hope this exploration was interesting enough and maybe even contained some novel insights. Thanks for reading!