I tried to learn as much Deep Learning math as I could in 24 hours

Phosphorous

TL;DR: I designed an experiment where I committed to spending two 12-hour days trying to learn as much deep-learning math as possible, basically from scratch.


Origins and Motivations

For a long time, I’ve felt intimidated by the technical aspects of alignment research. I had never taken classes on linear algebra or multivariable calculus or deep learning, and when I cracked open many AI papers, I was terrified by symbols and words I didn’t understand.

7 months ago I wrote up a short doc about how I was going to remedy my lack of technical knowledge: I collected some textbooks and some online courses, and I decided to hire a tutor to meet a few hours a week. I had the first two weeks of meetings, it was awesome, then regular meetings got disrupted by travel, and I never came back to it.

When I thought about my accumulating debt of technical knowledge, my cached answer was “Oh, that might take six months to get up to speed. I don’t have the time.”

Then, watching my productivity on other projects over the intervening months, I noticed two things:

There appeared to be massive variance in my productivity. Sometimes, in a single day, I would get more done than I had accomplished in previous weeks.
I seemed to both enjoy and get more done by “sprinting” through certain projects, e.g. by spending 10 hours on one in a single day, rather than spreading that same work out over 2 hours a week for 5 weeks. It was, for some reason, way more motivating and seemingly more efficient to sprint.

Also, when I asked myself what I thought the main bottlenecks were for addressing my technical debt problem, I identified two categories:

Time (I felt busy all the time, and was afraid of committing too much to one project)
A combination of lacking Motivation, Accountability and Fun

Then, as my mind wandered, I started to put 2 and 2 together: perhaps these new things I had noticed about my productivity could be used to address the bottlenecks in my technical debt? I decided to embark on an experiment: how much technical background on deep learning could I learn in a single weekend? My understanding of the benefits of this experiment was as follows:

Committing “a weekend” felt like a much smaller time cost than committing “a few months”, even if they were the same number of hours.
No Distraction: I could design my environment to minimize distractions for two days, something it would be intractable to do to the same degree for several months.
“Trying to learn as much as possible” felt like a challenge. I was, to be honest, pretty scared. I didn’t know what I was doing, it felt extreme, but that also made it exciting and fun.
I had some historical data that I might be good at this kind of sprinting, and framing this as an experiment to see what I could learn about my productivity added another layer of discovery-driven motivation and fun. What if I learned more about how to be productive and get hard things done via this experiment?
As far as I knew, nobody else among my peers had done this - but I suspected that more people than me had the same problems, and that if I conducted this experiment, I might learn things that would be helpful to others, which added yet another layer of discovery-driven motivation and fun.
Accountability: Once I told somebody about this, it was hard to back out. It’s way easier for them to monitor me for a weekend than for a few months.

Results

I’d consider the experiment a success: I finished the whole curriculum in ~18 hours, and I got a lot of neat take-aways I’ll go over below.
At the end of day 1, after 12 hours of cramming, I was too exhausted to explain what I had learned. However, the next morning, after reading my notes for 20 minutes, I was able to explain everything I had learned the night before. I haven’t tested myself for retention yet, but I feel fairly confident that after glancing at my notes for 20 minutes, I’ll be able to recall and use the information again, which is good enough for my practical purposes.
I ended up having to take more breaks than I expected because my brain just kinda felt fried at times. I would often just lie down on the floor and look at the ceiling. If I hadn’t been part of an experiment, I think these moments are likely when I would’ve given up and done something else. But I noticed that after 10, 20 or 30 minutes, I was able to return to being productive, which is likely the moment when this experiment showed its worth.

Takeaways

Experiments are really useful, as outlined by Neel Nanda.
There is no speed limit - the world appears to be not very optimized for speed, including past versions of me. Two days of actually trying allowed me to accomplish more on this particular goal than 7 previous months of cowardice had, and (maybe) about as much as a full semester college course would have.
A big part of optimizing the actions you take is deciding upon the optimal sequence. Learning seems to be a lot about setting up the right sequence of inferential steps; order matters a lot. If you skip the wrong step, it’ll cost you.
Having someone with way more knowledge spend half an hour curating a list of videos, especially by pointing out the things which are irrelevant and can be skipped, saved me tens of hours, probably more.
People waste a lot of time on “prerequisites” for the things they want to do that aren’t actually real prerequisites. Instead, you can try to directly do the thing you want, then notice where you get stuck, and go back to build just the prerequisites you notice are needed to overcome the stuck-ness. No wasted motion.
Explaining what you learn to someone is a great (and fun) way to actually test your knowledge, notice where you’re lacking so you can efficiently direct further study, and get useful explanations on things you’re struggling with.
Asking for help is hard. It’s also really worth it. This experiment probably would not have worked without the advice of Thomas Larsen and Tamera Lanham. Their input of like ~3 total hours probably saved me tens of hours, which seems like a great utilitarian deal.
The curriculum I used was only partially optimized. It was still way better than I could’ve come up with on my own, but a better curriculum is definitely possible.
- Thomas just came up with a list of concepts, looked through the 3Blue1Brown series on them, marked the videos I could skip, then searched on YouTube for other videos on the topics 3b1b didn’t cover, chose the ones with the highest views/likes, and watched the first five minutes of a couple to see if he liked them. In total I think this cost him ~1 hour over two days.

The Experiment Set-Up

I committed to spend 12 hours a day, for two days, doing whatever I could to learn about deep learning. I intended to optimize my breaks, to take meals while working, and generally try to ignore everything and everyone else not involved in the experiment...
I would occupy a room in the Lightcone offices with a whiteboard, and lock the door to minimize my friends distracting me.
Originally, the plan was to sprint through a Python course and the fast.ai course and code a neural net by the end of the weekend.
- After a half hour talking with Tamera Lanham and Thomas Larsen, it became apparent that what better fit my goals was understanding the math behind neural networks, rather than the coding. This meant that two days before the experiment, I discarded the curriculum I had. Thomas graciously spent 1 hour building me a curriculum.
Originally, I was going to conduct this sprint with a friend, and with Thomas Larsen available to help us when stuck. The friend decided to do something else the night before, which in hindsight probably improved the efficiency of the experiment, because I could set my own pace very easily, and was less distracted by exciting-but-irrelevant-to-the-experiment conversation. In my original plan I was afraid of doing it alone, but I might actually recommend doing it that way.

The Curriculum that I used

Intro to deep learning (I kept returning to these videos throughout the experiment, rewatching and understanding slightly more)

Watch 3blue1brown
https://www.youtube.com/watch?v=aircAruvnKk

Linear algebra (this took me 2hr 25 mins, and ~36 mins of breaks)

https://www.youtube.com/watch?v=kjBOesZCoqc&list=PL0-GT3co4r2y2YErbmuJw2L5tW4Ew2O5B&index=1
Skip:
- determinant
- Inverse matrices
- Cross products
- Eigenvectors
- Abstract vector spaces
Watch all the others

Calc 3 (this took me 2hr 57 mins and ~50 mins of breaks)

Do this: https://www.khanacademy.org/math/multivariable-calculus
Only the sections:
- 'thinking about multivariable functions'
- 'derivatives of multivariable functions'
  - Skip differentiating parametric curves, curvature, divergence, curl, laplacian, jacobian

- ResNets: https://www.youtube.com/watch?v=ZILIbUvp5lk (took me 18 mins)

- RNNs (optional): https://www.youtube.com/watch?v=_aCuOwF1ZjU (took me 13 mins)

- Transformers: https://www.youtube.com/watch?v=4Bdc55j80l8&t=609s

(I spent about two hours on the above video, which in hindsight was not a great use of time. I would recommend others choose a different explainer on Transformers.)

- RL basics https://www.youtube.com/watch?v=JgvyzIkgxF0 (took me 25 mins)

- policy gradients / ppo: https://www.youtube.com/watch?v=5P7I-xPq8u8&t=318s

I could not understand the above video after rewatching it several times (I think the curriculum skipped some prerequisites for this), so I had to have Thomas Larsen walk me through it on his own for around an hour. Thanks Thomas!

- RLHF: Rob Miles video: https://www.youtube.com/watch?v=PYylPRX6z4Q (took me 23 mins)

Documentation on Hours

(I used Toggl Track to record my time, and was fairly happy with the software. However, I made many errors, didn’t record breaks correctly, etc., so take these numbers with a grain of salt.)

Saturday
3b1b Video 1	28 min	Started around 10:00am
Video 1 Summarizing	15 min
???	7 min
3b1b Video 2	29 min
Break	4 min
Video 2 Summarizing	20 min
3b1b Video 3	13 min
Break	9 min
White-Boarding	4 min
Linear Algebra - first four videos	58 min
Cleaning up notes	7 min
Chapter 4 Linear Algebra	16 min
White boarding	14 min
Three Dimensional Linear Transformations	14 min
???	12 min
Chapter 9	10 min
break	27 min
Chapter 13	14 min
Calculus?	3 min
Backpropagation, Chapter 4	10 min
break	9 min
Multivariable Calculus	1hr 6min
meditation	9 min
Multivariable Calculus	10 min
meditation	7 min
Multivariable Calculus	26 min
break	25 min
Multivariable Calculus	47 min
Multivariable Calculus	28 min
Watching Neural Nets Ch. 3 again	30 min
????	45 min
Trying to explain and failing	30 min	Ended around 8:45pm
Sunday
Rewatching Backpropagation	22 min	Started around 11am
Resnets Video	18 min
RNN's Video	13 min
Transformers	21 min
break	5 min
Transformers	36 min
RL Basics	25 min
break	23 min
More RL	23 min
Talking to Thomas about Transformers and Reinforcement Learning and PPO	120 mins	Ended around 6pm
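
As a rough cross-check on the log above, here is a minimal Python sketch that totals the Sunday rows. The helper names (`parse_minutes`, `summarize`) and the choice to count only rows labelled "break" or "meditation" as breaks are assumptions made for illustration; they are not part of the original tracking setup.

```python
import re

# Sunday rows copied from the log above as (label, duration) pairs.
SUNDAY = [
    ("Rewatching Backpropagation", "22 min"),
    ("Resnets Video", "18 min"),
    ("RNN's Video", "13 min"),
    ("Transformers", "21 min"),
    ("break", "5 min"),
    ("Transformers", "36 min"),
    ("RL Basics", "25 min"),
    ("break", "23 min"),
    ("More RL", "23 min"),
    ("Talking to Thomas about Transformers and Reinforcement Learning and PPO", "120 mins"),
]

# Assumption: only these labels count as breaks; everything else is study time.
BREAK_LABELS = {"break", "meditation"}


def parse_minutes(text):
    """Convert a duration such as '28 min', '1hr 6min', or '120 mins' to minutes."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*hr)?\s*(?:(\d+)\s*mins?)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    hours, minutes = match.groups()
    return int(hours or 0) * 60 + int(minutes or 0)


def summarize(entries):
    """Return (study_minutes, break_minutes) for a list of (label, duration) rows."""
    study = rest = 0
    for label, duration in entries:
        minutes = parse_minutes(duration)
        if label.strip().lower() in BREAK_LABELS:
            rest += minutes
        else:
            study += minutes
    return study, rest


if __name__ == "__main__":
    study, rest = summarize(SUNDAY)
    total = study + rest
    print(f"study: {study} min, breaks: {rest} min, total: {total // 60}h {total % 60:02d}m")
```

With the rows as recorded, this gives 278 minutes of study and 28 minutes of breaks for Sunday (5h 06m in total). The same function accepts the Saturday rows, including durations written like `1hr 6min`.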

Comment by Curt Tigges (3y ago):

This is a cool idea, and I have no doubt it helped somewhat, but IMO it falls prey to the same mistake I see made by the makers of almost every video series/online course/list of resources for ML math: assuming that math is mostly about concepts and facts.

It's only about 5% that. Maybe less. I and many others in ML have seen the same videos and remembered the concepts for a while too. And forgotten them, in time. More than once! On the other hand, I've seen how persistently and operationally fluent (especially in ML and interpretability) people become when they actually learn math the way it must be learned: via hundreds of hours of laborious exercises, proofs, derivations, etc. Videos and lectures are a small fraction of what's ultimately needed.

For most of ML, it's probably fine--you'll never need to do a proof or do more than simple linear algebra operations by hand. But if you want to do the really hard stuff, especially in interpretability, I don't think there's any substitute for cranking through those hours.

To be clear, I think this weekend was a great start on that--if you continue immediately to taking full courses and doing the exercises. I'm a top-down learner, so it would certainly help me. But unless it's practiced in very short order, it will be forgotten, and just become a collection of terms you recognize when others talk about them.

LESSWRONG