LESSWRONG
silentbob
6silentbob's Shortform
1y
23
Visual Exploration of Gradient Descent (many images)
silentbob13d20

One more addition: Based on @leogao's comment, I went a bit beyond the "visualize loss landscape based on gradient" approach, and did the following: I trained 3 models of identical architecture (all using [20, 30, 20] hidden neurons with ReLU) for 100 epochs and then had a look at the loss landscape in the "interpolation space" between these three models (such that model1 would be at (0,0), model2 at (1,0), model3 at (0,1), and the rest just linearly interpolating between their weights). I visualized the log of the loss at each point. My expectation was to get clear minima at (0,0), (1,0) and (0,1), where the trained models are placed, and something elevated between them. And indeed:

Otherwise the landscape does look pretty smooth and boring again.
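
A minimal sketch of how such an interpolation plot could be computed, assuming each model's weights have already been flattened into one vector. The names `interpolate_weights`, `log_loss_grid` and `toy_loss` are hypothetical, and `toy_loss` is only a stand-in for evaluating a real network's test loss; this is not the code used for the image above.

```python
import numpy as np

def interpolate_weights(w1, w2, w3, a, b):
    """Weights at plane coordinates (a, b): (0,0) -> w1, (1,0) -> w2, (0,1) -> w3."""
    return w1 + a * (w2 - w1) + b * (w3 - w1)

def log_loss_grid(loss_fn, w1, w2, w3, lo=-0.5, hi=1.5, steps=41):
    """Evaluate log(loss) on a regular grid of (a, b) coordinates."""
    coords = np.linspace(lo, hi, steps)
    grid = np.empty((steps, steps))
    for i, b in enumerate(coords):      # rows index b
        for j, a in enumerate(coords):  # columns index a
            w = interpolate_weights(w1, w2, w3, a, b)
            grid[i, j] = np.log(loss_fn(w) + 1e-12)  # epsilon avoids log(0)
    return coords, grid

# Stand-in data: three random "trained" weight vectors (assumption, not real models).
rng = np.random.default_rng(0)
w1, w2, w3 = rng.normal(size=(3, 50))

def toy_loss(w):
    """Toy loss with a minimum at each of the three weight vectors."""
    return min(np.sum((w - m) ** 2) for m in (w1, w2, w3))

coords, grid = log_loss_grid(toy_loss, w1, w2, w3)
```

With a real network, `loss_fn` would load the interpolated vector back into the model and return the test loss; `grid` can then be passed to any heatmap or surface plot.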

Visual Exploration of Gradient Descent (many images)
silentbob14d80

In fact, even after only 10 epochs and a test loss of >1.2, model 4 already produces something that clearly resembles Mandelbrot, which model 3 failed to achieve even after hundreds of epochs:

Visual Exploration of Gradient Descent (many images)
silentbob14d80

a factor of 3 loss difference is huge! if you want to claim that smooth actfn is better beyond what's explained by the loss, you need to compare two models with the same loss but different actfn.

I made a fairer comparison now. Training model 4 (same architecture as model 3, but SiLU instead of ReLU) for only 30 epochs, it achieves a test loss of 0.3435, slightly above the 0.3403 of model 3. Taking these two models to render Mandelbrot, I get these results:

Model 3 (as we've seen in the post):

Model 4 with slightly worse test loss than Model 3:

And zooming in a bit to see the fractal details of Model 4:

So I'd say the observation holds - the huge difference between rendered Mandelbrot quality for some reason does not depend so much on the loss here. Even for (roughly) identical loss, they are worlds apart. So my take is that the smoothness of the SiLU activation function somehow enables the fractal iteration to work out much better (but not really sure how else to test this vague assumption).
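
To make "the fractal iteration" concrete, here is a rough sketch of a Mandelbrot escape-time loop in which the complex multiplication is a pluggable function. It assumes the learned model is used in place of the product in z ← z·z + c, which is my reading of the setup rather than something confirmed here; `escape_counts` and `exact_mul` are hypothetical names, and `exact_mul` is the exact product standing in for a trained network.

```python
import numpy as np

def exact_mul(a, b):
    """Exact complex product; a trained model would be wrapped to replace this."""
    return a * b

def escape_counts(mul, re_range=(-2.0, 1.0), im_range=(-1.5, 1.5),
                  size=200, max_iter=50):
    """Count how many iterations of z <- mul(z, z) + c each grid point survives."""
    re = np.linspace(*re_range, size)
    im = np.linspace(*im_range, size)
    c = re[None, :] + 1j * im[:, None]   # rows: imaginary axis, columns: real axis
    z = np.zeros_like(c)
    counts = np.zeros(c.shape, dtype=int)
    alive = np.ones(c.shape, dtype=bool)
    for _ in range(max_iter):
        # Only update points that have not escaped yet.
        z[alive] = mul(z[alive], z[alive]) + c[alive]
        alive &= np.abs(z) <= 2.0
        counts[alive] += 1
    return counts

counts = escape_counts(exact_mul)
```

Because every step feeds the previous output back into `mul`, small approximation errors compound over the iterations, which may be part of why two models with similar test loss can render so differently.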

Visual Exploration of Gradient Descent (many images)
silentbob14d42

You mean in ARENA or with this complex number multiplication project? In both cases I was just using Google Colab (i.e. cloud compute) anyway. It probably would have worked in the free tier, but I did buy $10 worth of credits to speed things up a bit, as in the free tier I was occasionally downgraded to a CPU runtime after running the notebook for too long throughout a day. So I never tried this on my own hardware.

For this project, I'm pretty sure it would have worked completely fine locally. For ARENA, I'm not entirely sure, but would expect so too (and I think many people do work through it locally on their device with their own hardware). I think the longest training run I've encountered took something like 30 minutes on a T4 GPU in Colab, IIRC. According to Claude, consumer GPUs should be able to run that in a similar order of magnitude. Whereas if you only have some mid-range laptop without a proper graphics card, Claude expects a 10-50x slowdown, so that might become rather impractical for some of the ARENA exercises, I suppose.

Visual Exploration of Gradient Descent (many images)
silentbob14d80

One addition: I've been privately informed that another interesting thing to look at would be a visualization of C² (rather than only multiplication of a constant complex number with other complex numbers, see Visualizing Learned Functions section).

So I did that. For instance, here's the square visualization of model2 (the one with [10, 10] hidden neurons):

Again, we see some clear parallel between reality and the model, i.e. colors end up in roughly the right places, but it's clearly quite a bit off anyway. We also still see a lot of "linearity", i.e. straight lines in the model predictions as well as the diff heatmap, but this linearity is now seemingly only occurring in "radial" form, towards the center.

Models 0 and 1 look similar or worse. Model 3 ([20, 30, 20] hidden neurons) gets much closer despite still using ReLU:

And model 4 (same but with SiLU), expectedly, does even better:

But ultimately, we see the same pattern of "the larger the model, the more accurate, and SiLU works better than ReLU" again, without any obvious qualitative difference between SiLU and ReLU - so I don't think these renderings give any direct hint of SiLU performing that much better for actual fractal renderings than ReLU.
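
For reference, a small sketch of how the "diff heatmap" for squaring could be computed on a grid of complex inputs. The function `square_diff_grid` and the stand-in `offset_square` are hypothetical; a real comparison would pass in a wrapper around the trained network instead.

```python
import numpy as np

def square_diff_grid(approx_square, extent=1.5, size=101):
    """Return |approx(z) - z^2| plus the phases of truth and prediction on a grid."""
    axis = np.linspace(-extent, extent, size)
    z = axis[None, :] + 1j * axis[:, None]
    truth = z ** 2
    pred = approx_square(z)
    # The phase arrays can be mapped to hue for a color visualization.
    return np.abs(pred - truth), np.angle(truth), np.angle(pred)

def offset_square(z):
    """Stand-in for a model: the exact square with a constant error (assumption)."""
    return z ** 2 + 0.05

diff, phase_true, phase_pred = square_diff_grid(offset_square)
```

Plotting `diff` as a heatmap next to the two phase images gives the same three-panel view (reality, model, difference) described above.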

Visual Exploration of Gradient Descent (many images)
silentbob14d20

Thanks for the comment!

but it would benefit a lot from a more in depth review of the existing literature and methodologies

This is definitely not wrong. Probably depends a bit on what one sees as the goal here though - in my case, after weeks of readings and exercises, I was in dire need of basically just playing around with things more freely. So it was less about "how can I create the most valuable possible LessWrong post", and more "I'm going to do this thing anyway, and perhaps documenting the project in a LessWrong post is better than not doing so". I'm not totally confident about that though.

Thanks a lot for the list! Very interesting points that give me some ideas of where to go next.

the piecewise approximation thing is a pretty widely accepted opinion in ML (...)
local approximation of the loss landscape as approximately quadratic is pretty widely accepted

Sounds like good news to me! Certainly not the worst thing if the thing I found happens to be widely accepted. It also makes sense now that Claude predicted both outcomes (I only asked it after getting to these findings, but asked as if I didn't know what to expect). I guess that was less about Claude being very smart and more about these things being relatively well established anyway.

a factor of 3 loss difference is huge! if you want to claim that smooth actfn is better beyond what's explained by the loss, you need to compare two models with the same loss but different actfn

Fair point! My coverage of that in the post wasn't great. What I didn't mention there is that during the rendering of the two videos of model3 and model4 "learning Mandelbrot", model4 had the shape (including "fractal depth") down way before even getting close to the loss that model3 had at the end. So even with much lower loss, the ReLU Mandelbrot looked much worse than that of SiLU. But I'll look into that again and make a proper comparison.

How To Dress To Improve Your Epistemics
silentbob14d142

Somewhat relatedly, when I started growing my hair long, I got exclusively positive feedback about it. It would have been easy to take this as evidence that clearly this was a good decision and this is just the better hair style for me. But then again, personal feedback like this tends to be very strongly filtered. Firstly, as in your example, the vast majority of people who disagree will just say nothing rather than telling me "I think this looks worse than before". Secondly, there were a few cases where people saw me after a longer time, said something like "Oh, your hair is longer!" and then after a brief pause added something like "Looks good!" - I suspect many of these cases were just the person realizing that pointing that out without giving a compliment would seem rude or awkward, so they quickly made sure to say something nice about it.

silentbob's Shortform
silentbob16d100

Some quick thoughts on vibe coding:

  • it turns you from a developer into more of a product manager role
    • but the developers you manage are a) occasionally stupid/unwise and b) extremely fast and never tired
  • this makes it relatively addictive, because feedback cycles are much shorter than for a "real" product manager, who often has to wait for weeks to see their wishes turn into software, and you have a strong element of randomness in your rewards, with things sometimes turning out surprisingly well one-shot, but sometimes not at all
  • It can also lead to laziness, as it's very tempting to get used to "just letting the AI do it" even in not primarily vibe-coded projects, instead of investing one's own brainpower
  • AI agents tend to never/rarely talk back or tell you that something is a bad idea or doesn't work well with the current architecture; they just do things as best as currently possible. This form of local optimization quickly runs into walls if not carefully mitigated by you.
    • Part of the problem is that by default the AI has extremely little context and knows little about the purpose, scope and ambition of your project. So when you tell it "do X", it typically can't tell whether you mean "do X quick and dirty, I just want the results asap" or "lay out a 10 step plan to do X in the most sustainable way possible that allows us to eventually reach points Y and Z in the future". If it gets things wrong in either direction, that tends to be frustrating, but it can't read your mind (yet).
  • AI agents that are able to run unit tests and end-2-end tests and see compiler errors are so much more useful than their blind counterparts
  • If you need some particular piece of software but are unsure if current AIs will be able to deliver, it might make sense to write a detailed, self-contained and as-complete-as-possible specification of it, to then throw it at an AI agent whenever a new model (or scaffolding) comes out. GitHub Copilot with GPT-5 was able to do many more things than I would have imagined, with non-trivial but still relatively limited oversight.
    • I haven't tried yet whether just letting it do its thing, saying only "continue" after each iteration, may be sufficient. Maybe I put more time into guiding it than would actually be necessary.
    • That being said: writing a self-contained specification that contains your entire idea of something with all the details nailed down such that there is little room for misunderstandings is surprisingly hard. There are probably cases where just writing the software yourself (if you can) takes less time than fully specifying it.
    • On the other hand, "writing down a specification" can also happen interview-style using an AI's voice mode, so you can do it while doing chores.

 

Master your sleep cycles
silentbob17d51

Thanks for the post! What makes me skeptical of utilizing this is that it usually takes me anywhere between 10 and 45 minutes to fall asleep, so when going to bed, I have no way of knowing when to get up even if I had perfect knowledge about my sleep cycles. 

I also wonder if sleep inertia is actually as bad as it seems. Does it meaningfully impact the rest of the day? Or do you just feel groggy for 15 minutes but then return to normal? Or is it more about that last bit of sleep (your unfinished sleep cycle) being a waste of time and it makes sense to skip it?

Davey Morse's Shortform
silentbob26d20

Science bench is made by a Christian Stump. LLMs are literally stumped.

38Visual Exploration of Gradient Descent (many images)
15d
9
31Futility Illusions
1mo
10
62Procrastination Drill
2mo
8
60Melatonin Self-Experiment Results
3mo
5
50Four Types of Disagreement
6mo
4
36Any-Benefit Mindset and Any-Reason Reasoning
7mo
9
18Seeing Through the Eyes of the Algorithm
7mo
3
9On Responsibility
8mo
2
18Reality is Fractal-Shaped
10mo
1
14Inverse Problems In Everyday Life
1y
2