This is a special post for quick takes by silentbob. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

One crucial question in understanding and predicting the learning process, and ultimately the behavior, of modern neural networks, is that of the shape of their loss landscapes. What does this extremely high dimensional landscape look like? Does training generally tend to find minima? Do minima even exist? Is it predictable what type of minima (or regions of lower loss) are found during training? What role does initial randomization play? Are there specific types of basins in the landscape that are qualitatively different from others, that we might care about for safety reasons?

First, let’s just briefly think about very high dimensional spaces. One somewhat obvious observation is that they are absolutely vast. The volume of the available space grows exponentially with the number of dimensions. Intuitively we tend to think of 3-dimensional spaces, and often apply this visual/spatial intuition to our understanding of loss landscapes. But this can be extremely misleading. Parameter spaces are vast to a degree that our brains can hardly fathom. Take GPT3 for instance. It has 175 billion parameters, or dimensions. Let’s assume somewhat arbitrarily that all parameters end up in a range of [-0.5, 0.5], i.e. live in a 175-billion-dimensional unit cube around the origin of that space (as this is not the case, the real parameter space is actually even much, much larger, but bear with me). Every single axis only varies by 1 (let’s just for the sake of it interpret this as “1 meter”), yet the diagonal from one corner of this cube to the opposite one has a length of √(175 billion) meters, or ~420km. So if, hypothetically, you were sitting in the middle of this high dimensional unit cube, you could easily touch every single wall with your hand. But nonetheless, all the corners would be more than 200km distant from you.
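The cube arithmetic above is easy to reproduce. Here is a minimal sketch; the helper name `cube_diagonal` is made up for illustration, and it relies only on the fact that the main diagonal of an n-dimensional cube with side s has length s·√n.

```python
import math

def cube_diagonal(n_dims: int, side: float = 1.0) -> float:
    """Length of the main diagonal of an n-dimensional cube with the given side."""
    return side * math.sqrt(n_dims)

n_params = 175_000_000_000            # GPT3 parameter count as quoted in the text
diagonal_m = cube_diagonal(n_params)  # in meters, reading each side as "1 meter"

print(f"corner to corner: {diagonal_m / 1000:.0f} km")
print(f"center to corner: {diagonal_m / 2000:.0f} km")
print("center to nearest wall: 0.5 m")
```

The distance to every wall stays at half a meter no matter how many dimensions are added, while the distance to the corners keeps growing with √n.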

This may be mind boggling, but is it relevant? I think it is. Take this realization for instance: if you have two minima in this high dimensional space, but one is just a tiny bit “flatter” than the other (meaning the second derivatives overall are a bit closer to 0), then the attractor basin of this flatter minimum is vastly larger than that of the other minimum. This is because the flatness implies a larger radius, and the volume scales with that radius raised to the power of the number of dimensions. So, at 175 billion dimensions, even a microscopically larger radius means an overwhelmingly larger volume. If, for instance, one minimum’s attractor basin has a radius that is just 0.00000001% larger than that of the other minimum, then its volume will be roughly 40 million times larger (if my Javascript code to calculate this is accurate enough, that is). And this is only for GPT3, which is almost 4 years old by now.
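To see how strongly this effect depends on the number of dimensions, here is a rough sketch. It assumes the basins can be treated as n-dimensional balls (a simplification that real attractor basins certainly do not satisfy exactly), so that the volume ratio is (1 + ε)^n for a relative radius difference ε. The function name `volume_ratio` is hypothetical.

```python
import math

def volume_ratio(n_dims: int, rel_radius_increase: float) -> float:
    """Volume ratio of two n-balls whose radii differ by the given relative amount.

    An n-ball's volume scales as radius**n, so the ratio is (1 + eps)**n.
    It is evaluated in log space to stay accurate for tiny eps and huge n.
    """
    return math.exp(n_dims * math.log1p(rel_radius_increase))

eps = 1e-10  # 0.00000001 % written as a fraction
for n in (3, 1_000_000, 175_000_000_000):
    print(f"{n:>15,d} dims -> volume ratio {volume_ratio(n, eps):.6g}")
```

In three dimensions the same radius difference is completely invisible; it only turns into a factor of tens of millions once n reaches the hundreds of billions.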

The parameter space is just ridiculously large, so it becomes really crucial how the search process through it works and where it lands. It may be that somewhere in this vast space, there are indeed attractor basins that correspond to minima that we find extremely undesirable – certain capable optimizers perhaps, that have situational awareness and deceptive tendencies. If they do exist, what could we possibly tell about them? Maybe these minima have huge attractor basins that are reliably found eventually (maybe once we switch to a different network architecture, or find some adjustment to gradient descent, or reach a certain model size, or whatever), which would of course be bad news. Or maybe these attractor basins are so vanishingly small that we basically don’t have to care about them at all, because all the computer & search capacity of humanity over the next million years would have an almost 0 chance of ever stumbling onto these regions. Maybe they are even so small that they are numerically unstable, and even if your search process through some incredible cosmic coincidence happens to start right in such a basin, the first SGD step would immediately jump out of it due to the limitations of numerical accuracy on the hardware we’re using.

 

So, what can we actually tell at this point about the nature of high dimensional loss landscapes? While reading up on this topic, one thing that constantly came up is that the more dimensions you have, the lower the relative number of minima becomes compared to saddle points. This means that whenever the training process appears to slow down and it looks like it found some local minimum, it is overwhelmingly likely that what it actually found is a saddle point. Hence the training process never halts but keeps moving through parameter space, even if the loss doesn't change that much. Do local minima exist at all? I guess it depends on the function the neural network is learning to approximate. Maybe some loss landscapes exist where the loss can just get asymptotically closer to some minimum (such as 0), without ever reaching it. And probably other loss landscapes exist where you actually have a global minimum, as well as several local ones.

Some people argue that you probably have no minima at all, because with each added dimension it becomes less and less likely that a given point is a minimum (because not only does the gradient at that point have to be 0 for it to be a minimum, the curvature also has to be positive along every direction, i.e. all eigenvalues of the Hessian need to be positive). This sounds compelling, but given that the space itself also grows exponentially with each dimension, we also have overwhelmingly more points to choose from. If you e.g. look at n-dimensional Perlin Noise, its absolute number of local minima within an n-dimensional cube of constant side length actually increases with each added dimension. However, the relative number of local minima compared to the available space still decreases, so it becomes harder and harder to find them.
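The "less and less likely" part of that argument can be illustrated with a toy experiment. The sketch below makes a strong simplifying assumption: it treats the Hessian at a critical point as a random symmetric matrix with independent standard normal entries, which is not a claim about real neural networks. It then estimates how often all curvatures are positive, using only the standard library (a small hand-written Cholesky test stands in for an eigenvalue solver). All function names are made up for this example.

```python
import random

def is_positive_definite(matrix):
    """Attempt a Cholesky factorization; it succeeds only for positive definite matrices."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= 0.0:
                    return False
                lower[i][j] = pivot ** 0.5
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True

def random_symmetric(n, rng):
    """Toy 'Hessian': symmetric matrix with independent standard normal entries."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            m[i][j] = m[j][i] = rng.gauss(0.0, 1.0)
    return m

def fraction_minima(n, trials=20_000, seed=0):
    """Estimated share of random critical points that would be minima in n dimensions."""
    rng = random.Random(seed)
    hits = sum(is_positive_definite(random_symmetric(n, rng)) for _ in range(trials))
    return hits / trials

for n in range(1, 6):
    print(f"{n} dims: fraction with all-positive curvature is about {fraction_minima(n):.4f}")
```

Under this toy model the fraction starts near one half in one dimension and shrinks quickly from there. It says nothing about the absolute number of minima, which is the counterpoint raised above.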

 

I’ll keep it at that. This is already not much of a "quick" take. Basically, more research is needed, as my literature review on this subject yielded way more questions than answers, and many of the claims people made in their blog posts, articles and sometimes even papers seemed to be more intuitive / common-sensical or generalized from maybe-not-that-easy-to-validly-generalize-from research.

One thing I’m sure about however is that almost any explanation of how (stochastic) gradient descent works, that uses 3D landscapes for intuitive visualizations, is misleading in many ways. Maybe it is the best we have, but imho all such explainers should come with huge asterisks, explaining that the rules in very high dimensional spaces may look much different than our naive “oh look at that nice valley over there, let’s walk down to its minimum!” understanding, that happens to work well in three dimensions.

I'd like to point out that for neural networks, isolated critical points (whether minima, maxima, or saddle points) basically do not exist. Instead, it's valleys and ridges all the way down. So the word "basin" (which suggests the geometry is parabolic) is misleading. 

Because critical points are non-isolated, there are more important kinds of "flatness" than having small second derivatives. Neural networks have degenerate loss landscapes: their Hessians have zero-valued eigenvalues, which means there are directions you can walk along that don't change the loss (or that change the loss by a cubic or higher power rather than a quadratic power). The dominant contribution to how volume scales in the loss landscape comes from the behavior of the loss in those degenerate directions. This is much more significant than the behavior of the quadratic directions. The amount of degeneracy is quantified by singular learning theory's local learning coefficient (LLC).
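A very small example may help make "degenerate direction" concrete. The sketch below uses a hypothetical two-parameter model whose output is the product a·b and whose target is 1, so the loss is (a·b − 1)². Its minima form the whole curve a·b = 1 rather than an isolated point, and the Hessian on that curve has one eigenvalue equal to zero. This only illustrates the zero eigenvalue; it does not compute the local learning coefficient.

```python
import math

def loss(a, b):
    """Toy two-parameter 'network' a*b that should output 1; minima form the curve a*b = 1."""
    return (a * b - 1.0) ** 2

def hessian(a, b):
    """Analytic Hessian of the toy loss."""
    off_diagonal = 4.0 * a * b - 2.0
    return [[2.0 * b * b, off_diagonal],
            [off_diagonal, 2.0 * a * a]]

def eigenvalues_2x2(h):
    """Eigenvalues of a symmetric 2x2 matrix, smallest first."""
    mean = (h[0][0] + h[1][1]) / 2.0
    radius = math.hypot((h[0][0] - h[1][1]) / 2.0, h[0][1])
    return mean - radius, mean + radius

for a in (0.5, 1.0, 2.0, 4.0):
    b = 1.0 / a  # stay on the valley floor
    small, large = eigenvalues_2x2(hessian(a, b))
    print(f"a={a:3.1f} b={b:4.2f} loss={loss(a, b):.1e} eigenvalues=({small:.2e}, {large:.2f})")
```

Walking along the curve a·b = 1 never changes the loss, which is the flat direction the zero eigenvalue corresponds to.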

In the Bayesian setting, the relationship between geometric degeneracy and inductive biases is well understood through Watanabe's free energy formula. There's an inductive bias towards more degenerate parts of parameter space that's especially strong earlier in the learning process.

I heard that there are no local minima in high-dimensional spaces because there will almost always be paths to the global minimum.

If, for instance, one minimum’s attractor basin has a radius that is just 0.00000001% larger than that of the other minimum, then its volume will be roughly 40 million times larger (if my Javascript code to calculate this is accurate enough, that is).

Could you share this code? I'd like to take a look.

Maybe I accidentally overpromised here :D this code is just an expression, namely 1.0000000001 ** 175000000000, which, as wolframalpha agrees, yields 3.98e7.
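On the "accurate enough" question: the expression evaluates the same way in Python, so one way to sanity-check it is to compare it against a log-space version that avoids rounding the base 1.0000000001 to the nearest double. This is just a cross-check sketch, not the original code.

```python
import math

n = 175_000_000_000
naive = 1.0000000001 ** n                     # the expression from the comment above
log_space = math.exp(n * math.log1p(1e-10))   # same quantity without the rounded base

print(f"naive power: {naive:.4e}")
print(f"log space:   {log_space:.4e}")
print(f"relative difference: {abs(naive - log_space) / log_space:.1e}")
```

Both variants should land near 3.98e7, so floating point rounding does not change the conclusion here.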

One thing that confused me about transformers is the question of when (as in, after how many layers) each embedding "flips" from representing the original token to finally representing the prediction of the next token.

By now, I think the answer is simply this: each embedding represents both at the same time (and more). For instance, in GPT3 there are 12,288 embedding dimensions. At first I thought that all of them initially encode the original token, and after going through all the layers they eventually all encode the next token, and somewhere in the layers between this shift must happen. But what, upon some reflection, makes much more sense would be something very roughly like, say:

  • some 1000 dimensions encode the original token
  • some other 1000 dimensions encode the prediction of the next token
  • the remaining 10,288 dimensions encode information about all available context (which will start out "empty" and get filled with meaningful information through the layers).

In practice, things are of course much less clean, and probably most dimensions will have some role in all these things, to different degrees, as of course all of this is learned through gradient descent and hence will be very noisy and gradual. Additionally, there's the whole positional encoding thing which is also part of the embeddings and makes clear distinctions even more difficult. But the key point remains that a single embedding encodes many things, only one of which is the prediction, and this prediction is always there from the beginning (when it's still very superficial and bad) and then, together with the rest of the embedding, gets refined more and more throughout the layers.

Another misconception I had was that embedding and unembedding are very roughly symmetric operations that just "translate" from token space to embedding space and vice versa[1]. This made sense in relation to the initial & naive "embeddings represent tokens" interpretation, but with the updated view as described above, it becomes clear that unembedding is rather an "extraction" of the information content in the embedding that encodes the prediction.

One piece of evidence for this updated view is that this paper (thanks to Leon Lang for the hint) found that "Zero layer transformers model bigram statistics". So, indeed, embedding + unembedding alone already perform some very basic next-token prediction. (Admittedly I'm not sure if this is only the case when the transformer is trained with zero layers, or also in, say, GPT3, when during inference you just skip all the layers)
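To make the bigram point tangible, here is a hand-built sketch of a "zero layer transformer" on a tiny made-up corpus. Nothing here is trained and nothing is taken from the paper: the embedding is simply one-hot and the unembedding is filled with smoothed log bigram probabilities, just to show that embedding plus unembedding alone is enough to express next-token statistics. The corpus, the smoothing constant and the function name `predict` are all assumptions of this example.

```python
import math
from collections import Counter, defaultdict

corpus = "good morning . good day . good morning . have a good morning .".split()
vocab = sorted(set(corpus))
index = {tok: i for i, tok in enumerate(vocab)}

# Bigram counts: how often each token follows each other token.
follows = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    follows[cur][nxt] += 1

# Hand-built weights (not trained): one-hot embedding, and an unembedding
# whose rows hold smoothed log bigram probabilities.
smoothing = 0.01
size = len(vocab)
embed = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
unembed = []
for cur in vocab:
    total = sum(follows[cur].values()) + smoothing * size
    unembed.append([math.log((follows[cur][nxt] + smoothing) / total) for nxt in vocab])

def predict(token):
    """Embed, immediately unembed, then softmax: returns next-token probabilities."""
    x = embed[index[token]]
    logits = [sum(x[k] * unembed[k][j] for k in range(size)) for j in range(size)]
    z = sum(math.exp(value) for value in logits)
    return {vocab[j]: math.exp(logits[j]) / z for j in range(size)}

probs = predict("good")
print(sorted(probs.items(), key=lambda kv: -kv[1])[:3])
```

In this toy setup, embedding "good" and unembedding right away gives most of the probability to "morning" and almost none to "good" itself. Whether a trained model like GPT3 behaves this way with its layers skipped is a separate question that this sketch cannot answer.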

I would guess that transformer-experienced people (unless they disagree with my description - in that case, please elaborate what I'm still getting wrong) will find all of this rather obvious. But for me, this was a major missing piece of understanding, even after once participating in an ML-themed bootcamp and watching all the 3Blue1Brown videos on transformers several times, where this idea either is not directly explained, or I somehow managed to consistently miss it.

  1. ^

    Of course, this is not entirely true to begin with because the unembedding yields a distribution rather than a single token. But my assumption was that, if you embed the word "Good" and then unembed the embedding immediately, you would get a very high probability for "Good" back, whereas in practice (I didn't verify this yet) you would probably obtain high probabilities for "morning", "day" etc.

Awkwardly, it depends on whether the model uses tied embeddings (unembed is embed transpose) or has separate embed and unembed matrices. Using tied embedding matrices like this means the model actually does have to do a sort of conversion.
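A quick way to see why tied embeddings force a conversion: with the unembedding equal to the transpose of the embedding and no layers in between, the logit for the input token is the squared norm of its own embedding, which tends to dominate all other dot products. The sketch below uses random vectors as a stand-in for learned embeddings and small hypothetical sizes, so it only illustrates the geometry, not any real model.

```python
import random

rng = random.Random(0)
vocab_size, d_model = 20, 256  # small hypothetical sizes for illustration

# Random embedding matrix standing in for a learned one (an assumption of this sketch).
embed = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab_size)]

def tied_logits(token_id):
    """Logits when the unembedding is the transpose of the embedding: x @ embed.T."""
    x = embed[token_id]
    return [sum(a * b for a, b in zip(x, row)) for row in embed]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

echoed = sum(argmax(tied_logits(t)) == t for t in range(vocab_size))
print(f"{echoed}/{vocab_size} tokens map back to themselves with tied weights and no layers")
```

So with tied weights, the layers have to actively move the representation away from the input token's own direction before a useful prediction can be read out.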

Your discussion seems mostly accurate in the case of having separate embed and unembed, except that I don't think the initial state is like "1k encode current, 1k encode predictions, rest start empty". The model can just directly encode predictions for an initial state using the unembed.

There has actually been some work visualizing this process, with a method called the "logit lens".

The first example that I know of: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

A more thorough analysis: https://arxiv.org/abs/2303.08112 

You can learn a per-token bias over all the layers to understand where in the model it stops representing the original embedding (or a linear transformation of it) like in https://www.lesswrong.com/posts/P8qLZco6Zq8LaLHe9/tokenized-saes-infusing-per-token-biases


You could also plot the cos-sims of the resulting biases to see how much it rotates.

I didn't verify this yet

Do it! I bet slightly against your prediction.

For a long time, I used to wonder what causes people to consistently mispronounce certain words even when they are exposed to many people pronouncing them correctly. (This mostly applies to people speaking in a non-native language, e.g. people from continental Europe speaking English.)

Some examples that I’ve heard from different people around me over the years:

  • Saying “rectangel” instead of “rectangle”
  • Saying “pre-purr” (like prefer, but with a p) instead of “prepare”
  • Saying something like, uhh, “devil-oupaw” instead of “developer”
  • Saying “leech” instead of “league”
  • Saying “immu-table” instead of “immutable”
  • Saying "cyurrently" instead of "currently"

I did, of course, understand that if you only read a word, particularly in English where pronunciations are all over the place and often unpredictable, you may end up with a wrong assumption of how it's pronounced. This happened to me quite a lot[1]. But then, once I did hear someone pronounce it, I usually quickly learned my lesson and adopted the correct way of saying it. But still I've seen all these other people stick to their very unusual pronunciations anyway. What's up with that?[2] Naturally, it was always too awkward for me to ask them directly, so I never found out.

Recently, however, I got a rather uncomfortable insight into how this happens when a friend pointed out that I was pronouncing "dude" incorrectly, and have apparently done so for all my life, without anyone ever informing me about it, and without me noticing it.

So, as I learned now, "dude" is pronounced "dood" or "dewd". Whereas I used to say "dyood" (similar to duke). And while I found some evidence that dyood is not completely made up, it still seems to be very unusual, and something people notice when I say it.

Hence I now have the, or at least one, answer to my age-old question of how this happens. So, how did I never realize? Basically, I did realize that some people said "dood", and just took that as one of two possible ways of pronouncing that word. Kind of, like, the overly American way, or something a super chill surfer bro might say. Whenever people said "dood" (which, in my defense, didn't happen all that often in my presence[3]) I had this subtle internal reaction of wondering why they suddenly saw the need to switch to such a heavy accent for a single word.

I never quite realized that practically everyone said "dood" and I was the only "dyood" person.

So, yeah, I guess it was a bit of a trapped prior and it took some well-directed evidence to lift me out of that valley. And maybe the same is the case for many of the other people out there who are consistently mispronouncing very particular words. 

But, admittedly, I still don't wanna be the one to point it out to them.

And when I lie awake at night, I wonder which other words I may be mispronouncing with nobody daring to tell me about it.

  1. ^

    e.g., for some time I thought "biased" was pronounced "bee-ased". Or that "sesame" was pronounced "see-same". Whoops. And to this day I have a hard time remembering how "suite" is pronounced.

  2. ^

    Of course one part of the explanation is survivorship bias. I'm much less likely to witness the cases where someone quickly corrects their wrong pronunciation upon hearing it correctly. Maybe 95% of cases end up in this bucket that remains invisible to me. But still, I found the remaining 5% rather mysterious. 

  3. ^

    Maybe they were intimidated by my confident "dyood"s I threw left and right.

I use written English much more than spoken English, so I am probably wrong about the pronunciation of many words. I wonder if it would help to have software that would read each sentence I wrote immediately after I finished it (because that's when I still remember how I imagined it to sound).

EDIT: I put the previous paragraph in Google Translate, and luckily it was just as I imagined. But that probably only means that I am already familiar with frequent words, and may make lots of mistakes with rare ones.

After first learning about transformers, I couldn't help but wonder why on Earth this works. How can this totally made-up, complicated structure somehow end up learning how to write meaningful text and having a mostly sound model of our world?

(tl;dr: no novel insights here, just me writing down some thoughts I've had after/while learning more about neural nets and transformers.)

When I once asked someone more experienced, they essentially told me "nobody really knows, but the closest thing we have to an answer is 'the blessing of dimensionality' - with so many dimensions in your loss landscape, you basically don't run into local minima but the thing keeps improving if you just throw enough data and compute at it".

I think this makes sense, and my view on how/why/when deep neural networks work is currently something along the lines of:

  • there's some (unknown) minimal network size (or maybe rather "minimal network frontier", as with different architectures you end up with different minimal sizes) for every problem you want to solve (for a certain understanding of the problem and when you consider it solved), so your network needs to be big enough to even be able to solve the problem
  • the network size & architecture also determines how much training data you need to get anywhere
  • basically, you try to find network architectures such that you encode sensible priors about the modality you're working with that are basically always true while also eliminating a priori-useless weights from your network; this way, the training efforts allow the network to quickly learn important things rather than first having to figure out the priors themselves
    • for text, you might realize that different parts of the text refer to each other, so you need a way to effectively pass information around, and hence you end up with something like the attention mechanism
    • for image detection, you realize that the prior of any given pixel being relevant for any other given pixel is higher the closer they are, so you end up with something like CNNs, where you start looking at low level features, and throughout the layers of the network, allow it to "convert" the raw pixel data successively to semantic data
  • in theory, you probably could just use a huge feed forward network (as long as it's not so huge as to overfit instead of generalizing to anything useful) and it would possibly end up solving problems in similar ways as "smarter" architectures do (but not sure about this), but you would need way more parameters and way more training data to achieve similar results, much of which would be wasted on "low quality parameters" that could just as well be omitted
  • so, encoding these modality priors into your network architecture spares you probably orders of magnitude of compute compared to naive approaches
  • while the bitter lesson makes sense, it maybe under-emphasizes the degree to which choosing suitable network architecture + high quality training data matters?
  • lastly, the question "which problem you're trying to solve" cannot just be answered on a high level with "I want to minimize loss in next-token prediction", but the exact problem the network solves depends strongly on the training data; loss minimization is a trade-off between all the things you're minimizing, so the higher the amount of rambling, gossip, meaningless binary data and so on in your training data is, the more parameters and training time you'll need just for those, and the less capable the network will be of predicting more meaningful tokens.
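To put a rough number on how much an architectural prior can save, here is a back-of-the-envelope comparison for the image case from the list above. The sizes (a 32x32 RGB input, 16 output feature maps, a 3x3 kernel) are made up for illustration, and the two helper functions just count weights and biases.

```python
def dense_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

def conv2d_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a 2D convolution; independent of image size due to weight sharing."""
    return in_channels * out_channels * kernel * kernel + out_channels

height, width, in_ch, out_ch = 32, 32, 3, 16  # hypothetical small RGB image and feature count

dense = dense_params(height * width * in_ch, height * width * out_ch)
conv = conv2d_params(in_ch, out_ch, kernel=3)

print(f"dense layer: {dense:,} parameters")
print(f"3x3 conv:    {conv:,} parameters")
print(f"ratio:       {dense / conv:,.0f}x")
```

The fully connected layer could in principle learn the same local filters, but it would have to discover locality and weight sharing from data, using about five orders of magnitude more parameters for this single layer.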

Related to that last point, I recently worked on a small project where you, as the user, play Pong against an AI. That AI is controlled by a small neural network (something on the order of 2 or 3 hidden layers and a few dozen neurons), initialized randomly, so at first it's very easy for the human to win. While you play, though, the game collects your behavior as training data and constantly trains the neural network, which eventually learns to mirror you. So after a few minutes of playing, it plays very similarly to the human and it becomes much harder to beat it.

One thing I noticed while working on this is that the naive approach to training this AI was far from optimal: much of the training data I collected ended up being pretty irrelevant for playing well! E.g., it's much more important how the paddle moves while the ball is closing in, and almost entirely irrelevant what you do right after hitting the ball. There were several such small insights, leading me to tweak how exactly training data is collected (e.g. sampling it with lower probability while the ball is moving away than when it's getting closer), which greatly reduced the time it took for the AI to learn, even with the network architecture staying the same.
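The direction-dependent sampling described above could look roughly like the following. This is a sketch, not the project's actual code: the function name `should_record`, its parameters, and the 10% keep probability are all assumptions made for illustration.

```python
import random

def should_record(ball_vx: float, paddle_side: int, rng: random.Random,
                  keep_prob_away: float = 0.1) -> bool:
    """Decide whether to keep the current frame as a training example.

    ball_vx: horizontal ball velocity. paddle_side: +1 if the player's paddle
    is on the right, -1 if it is on the left. Frames where the ball approaches
    the paddle are always kept; frames where it moves away are kept only rarely.
    """
    approaching = ball_vx * paddle_side > 0
    return approaching or rng.random() < keep_prob_away

# Simulate 10,000 frames in each direction for a paddle on the right side.
rng = random.Random(42)
kept_towards = sum(should_record(+1.0, +1, rng) for _ in range(10_000))
kept_away = sum(should_record(-1.0, +1, rng) for _ in range(10_000))
print(f"kept while approaching: {kept_towards}, kept while moving away: {kept_away}")
```

The network and the loss function stay untouched; only the mix of examples changes, so that most gradient steps are spent on the frames that matter for playing well.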

Notably, this does not necessarily mean the loss curve dropped more quickly - due to me tweaking the training data, the loss curves before and after doing so related to quite different things. The same loss for higher quality data is much more useful than for noisy or irrelevant data.

There are just so many degrees of freedom in all of this that it seems very likely that, even if there were no hardware advances at all, research would be able to come up with faster/cheaper/better-performing models for a long time.

gwern

for text, you might realize that different parts of the text refer to each other, so you need a way to effectively pass information around, and hence you end up with something like the attention mechanism

If you are trying to convince yourself that a Transformer could work and to make it 'obvious' to yourself that you can model sequences usefully that way, it might be a better starting point to begin with Bengio's simple 2003 LM and MLP-Mixer. Then Transformers may just look like a fancier MLP which happens to implement a complicated way of doing token-mixing inspired by RNNs and heavily tweaked empirically to eke out a bit more performance with various add-ons and doodads.

(AFAIK, no one has written a "You Could Have Invented Transformers", going from n-grams to Bengio's LM to MLP-Mixer to RNN to Set Transformer to Vaswani Transformer to a contemporary Transformer, but I think it is doable and useful.)

I think you would appreciate this post

For people who like guided meditations: there's a small YouTube channel providing a bunch of secular AI-generated guided meditations of various lengths and topics. More are to come, and the creator (whom I know) is happy about suggestions. Three examples:

They are also available in podcast form here.

I wouldn't say these meditations are necessarily better or worse than any others, but they're free and provide some variety. Personally, I avoid apps like Waking Up and Headspace due to both their imho outrageous pricing model and their surprising degree of monotony. Insight Timer is a good alternative, but the quality varies a lot and I keep running into overly spiritual content there. Plus there's obviously thousands and thousands of guided meditations on YouTube, but there too it's hit and miss. So personally I'm happy about this extra source of a good-enough-for-me standard.

Also, in case you ever wanted to hear a guided meditation on any particular subject or in any particular style, I guess you can contact the YouTube channel directly, or tell me and I'll forward your request.
