In fact, even after only 10 epochs and a test loss of >1.2, model 4 already produces something that clearly resembles the Mandelbrot set, which model 3 failed to achieve even after hundreds of epochs:
a factor of 3 loss difference is huge! if you want to claim that smooth actfn is better beyond what's explained by the loss, you need to compare two models with the same loss but different actfn.
I made a fairer comparison now. Training model 4 (same architecture as model 3, but SiLU instead of ReLU) for only 30 epochs, it achieves a test loss of 0.3435, slightly above the 0.3403 of model 3. Taking these two models to render Mandelbrot, I get these results:
Model 3 (as we've seen in the post):
Model 4 with slightly worse test loss than Model 3:
And zooming in a bit to see the fractal details of Model 4:
So I'd say the observation holds - the huge difference in rendered Mandelbrot quality does not, for some reason, depend much on the loss here. Even at (roughly) identical loss, the two renderings are worlds apart. So my take is that the smoothness of the SiLU activation function somehow enables the fractal iteration to work out much better (though I'm not really sure how else to test this vague assumption).
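To make the "fractal iteration" part concrete, here is a minimal sketch of an escape-time Mandelbrot renderer where the complex multiplication is a pluggable function. This is not the code from the post: `exact_mul`, `escape_time`, `render`, the bailout radius and the viewport are all illustrative assumptions. The idea is that a learned multiplication model could be passed in as `mul`, so any error it makes is fed back into itself on every iteration, which is one plausible reason why rendering quality and test loss can come apart.

```python
from typing import Callable, List, Tuple

Pair = Tuple[float, float]  # a complex number as (real, imag)


def exact_mul(a: Pair, b: Pair) -> Pair:
    """Ground-truth complex multiplication; a trained model would replace this."""
    return (a[0] * b[0] - a[1] * b[1], a[0] * b[1] + a[1] * b[0])


def escape_time(c: Pair, mul: Callable[[Pair, Pair], Pair] = exact_mul,
                max_iter: int = 50, bailout: float = 2.0) -> int:
    """Iterate z -> mul(z, z) + c and return the step at which |z| exceeds bailout."""
    z = (0.0, 0.0)
    for i in range(max_iter):
        z = mul(z, z)
        z = (z[0] + c[0], z[1] + c[1])
        if z[0] * z[0] + z[1] * z[1] > bailout * bailout:
            return i
    return max_iter  # never escaped: treated as inside the set


def render(mul: Callable[[Pair, Pair], Pair] = exact_mul,
           width: int = 60, height: int = 24, max_iter: int = 50) -> List[str]:
    """ASCII rendering over an assumed viewport of [-2.0, 0.6] x [-1.2, 1.2]."""
    rows = []
    for j in range(height):
        y = -1.2 + 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            x = -2.0 + 2.6 * i / (width - 1)
            row += "#" if escape_time((x, y), mul, max_iter) == max_iter else "."
        rows.append(row)
    return rows
```

Calling `print("\n".join(render()))` draws the exact set; swapping in a wrapper around a trained network for `mul` would give the model's version.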
You mean in ARENA or with this complex number multiplication project? In both cases I was just using Google Colab (i.e. cloud compute) anyway. It probably would have worked in the free tier, but I did buy $10 worth of credits to speed things up a bit, as in the free tier I was occasionally downgraded to a CPU runtime after running the notebook for too long throughout a day. So I never tried this on my own hardware.
For this project, I'm pretty sure it would have worked completely fine locally. For ARENA, I'm not entirely sure, but would expect so too (and I think many people do work through it locally on their own hardware). I think the longest training run I've encountered took something like 30 minutes on a T4 GPU in Colab, IIRC. According to Claude, consumer GPUs should be able to run that in a time of a similar order of magnitude. Whereas if you only have some mid-range laptop without a proper graphics card, Claude expects a 10-50x slowdown, so that might become rather impractical for some of the ARENA exercises, I suppose.
One addition: I've been privately informed that another interesting thing to look at would be a visualization of C² (rather than only multiplication of a constant complex number with other complex numbers, see Visualizing Learned Functions section).
So I did that. For instance, here's the square visualization of model2 (the one with [10, 10] hidden neurons):
Again, we see some clear parallels between reality and the model, i.e. colors end up in roughly the right places, but it's clearly quite a bit off anyway. We also still see a lot of "linearity", i.e. straight lines in the model predictions as well as the diff heatmap, but this linearity now seems to occur only in "radial" form, towards the center.
Models 0 and 1 look similar or worse. Model 3 ([20, 30, 20] hidden neurons) gets much closer despite still using ReLU:
And model 4 (same but with SiLU), as expected, does even better:
But ultimately, we see the same pattern of "the larger the model, the more accurate, and SiLU works better than ReLU" again, without any obvious qualitative difference between SiLU and ReLU - so I don't think these renderings give any direct hint of SiLU performing that much better for actual fractal renderings than ReLU.
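For reference, the squaring target behind these renderings is just (x, y) -> (x² - y², 2xy). Below is a minimal sketch of how a diff heatmap against that target could be computed. The function names, the [-1, 1] sampling range and the use of Euclidean distance as the "diff" are assumptions for illustration, not necessarily what the notebook does.

```python
import math
from typing import Callable, List, Tuple


def true_square(x: float, y: float) -> Tuple[float, float]:
    """Exact complex squaring: (x + iy)^2 = (x^2 - y^2) + i(2xy)."""
    return (x * x - y * y, 2.0 * x * y)


def diff_grid(model_fn: Callable[[float, float], Tuple[float, float]],
              n: int = 5, lo: float = -1.0, hi: float = 1.0) -> List[List[float]]:
    """Euclidean distance between model output and true square on an n x n grid."""
    grid = []
    for j in range(n):
        y = lo + (hi - lo) * j / (n - 1)
        row = []
        for i in range(n):
            x = lo + (hi - lo) * i / (n - 1)
            tx, ty = true_square(x, y)
            px, py = model_fn(x, y)
            row.append(math.hypot(px - tx, py - ty))
        grid.append(row)
    return grid
```

Here `model_fn` is a hypothetical wrapper that feeds (x, y, x, y) into a trained multiplication model and returns its two outputs; passing `true_square` itself gives an all-zero grid as a sanity check.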
Thanks for the comment!
but it would benefit a lot from a more in depth review of the existing literature and methodologies
This is definitely not wrong. Probably depends a bit on what one sees as the goal here though - in my case, after weeks of readings and exercises, I was in dire need of basically just playing around with things more freely. So it was less about "how can I create the most valuable possible lesswrong post", and more "I'm going to do this thing anyway, and perhaps documenting the project in a lesswrong post is better than not doing so". I'm not totally confident about that though.
Thanks a lot for the list! Very interesting points that give me some ideas of where to go next.
the piecewise approximation thing is a pretty widely accepted opinion in ML (...)
local approximation of the loss landscape as approximately quadratic is pretty widely accepted
Sounds like good news to me! Certainly not the worst thing if the thing I found happens to be widely accepted. It also makes sense now that Claude predicted these outcomes in both cases (even though I only asked it after getting to these findings, but phrased it as if I didn't know what to expect). I guess that was less about Claude being very smart and more about these things being relatively well established anyway.
a factor of 3 loss difference is huge! if you want to claim that smooth actfn is better beyond what's explained by the loss, you need to compare two models with the same loss but different actfn
Fair point! My coverage of that in the post wasn't great. What I didn't mention there is that during the rendering of the two videos of model3 and model4 "learning Mandelbrot", model4 had the shape (including "fractal depth") down way before even getting close to the loss that model3 had at the end. So even with much lower loss, the ReLU Mandelbrot looked much worse than that of SiLU. But I'll look into that again and make a proper comparison.
Somewhat relatedly, when I started growing my hair long, I got exclusively positive feedback about it. It would have been easy to take this as evidence that clearly this was a good decision and this is just the better hair style for me. But then again, personal feedback like this tends to be very strongly filtered. Firstly, as in your example, the vast majority of people who disagree will just say nothing rather than telling me "I think this looks worse than before". Secondly, there were a few cases where people saw me after a longer time, said something like "Oh, your hair is longer!" and then after a brief pause added something like "Looks good!" - I suspect many of these cases were just the person realizing that pointing that out without giving a compliment would seem rude or awkward, so they quickly made sure to say something nice about it.
Some quick thoughts on vibe coding:
Thanks for the post! What makes me skeptical of utilizing this is that it usually takes me anywhere between 10 and 45 minutes to fall asleep, so when going to bed, I have no way of knowing when to get up even if I had perfect knowledge about my sleep cycles.
I also wonder if sleep inertia is actually as bad as it seems. Does it meaningfully impact the rest of the day? Or do you just feel groggy for 15 minutes but then return to normal? Or is it more about that last bit of sleep (your unfinished sleep cycle) being a waste of time and it makes sense to skip it?
Science bench is made by a Christian Stump. LLMs are literally stumped.
One more addition: Based on @leogao's comment, I went a bit beyond the "visualize loss landscape based on gradient" approach. I trained 3 models of identical architecture (all using [20, 30, 20] hidden neurons with ReLU) for 100 epochs and then looked at the loss landscape in the "interpolation space" between these three models: model1 sits at (0,0), model2 at (1,0), model3 at (0,1), and every other point is a linear interpolation of their weights. I visualized the log of the loss at each point. My expectation was to get clear minima at (0,0), (1,0) and (0,1), where the trained models are placed, and something elevated between them. And indeed:
Otherwise the landscape does look pretty smooth and boring again.
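The interpolation scheme described above can be sketched roughly as follows, with each model's weights flattened into a plain list of floats. `interpolate`, `log_loss_grid`, the grid range and `loss_fn` are hypothetical names and choices for illustration; in practice `loss_fn` would load the interpolated weights into the network and evaluate the test loss.

```python
import math
from typing import Callable, List, Sequence


def interpolate(w1: Sequence[float], w2: Sequence[float], w3: Sequence[float],
                a: float, b: float) -> List[float]:
    """Weights at plane coordinates (a, b): w1 at (0,0), w2 at (1,0), w3 at (0,1)."""
    return [p1 + a * (p2 - p1) + b * (p3 - p1) for p1, p2, p3 in zip(w1, w2, w3)]


def log_loss_grid(w1: Sequence[float], w2: Sequence[float], w3: Sequence[float],
                  loss_fn: Callable[[List[float]], float],
                  steps: int = 25, lo: float = -0.25, hi: float = 1.25) -> List[List[float]]:
    """Log of the loss on a steps x steps grid; grid[j][i] uses a = i-th, b = j-th value."""
    grid = []
    for j in range(steps):
        b = lo + (hi - lo) * j / (steps - 1)
        row = []
        for i in range(steps):
            a = lo + (hi - lo) * i / (steps - 1)
            # loss_fn must return a strictly positive value for the log to exist
            row.append(math.log(loss_fn(interpolate(w1, w2, w3, a, b))))
        grid.append(row)
    return grid
```

Extending the range slightly beyond [0, 1] (an arbitrary choice here) shows what the landscape looks like just past the three trained models rather than only inside the triangle they span.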