
One thing that confused me about transformers is the question of when (as in, after how many layers) each embedding "flips" from representing the original token to finally representing the prediction of the next token.

By now, I think the answer is simply this: each embedding represents both at the same time (and more). For instance, GPT-3 has 12,288 embedding dimensions. At first I thought that all of them initially encode the original token, and that after going through all the layers they eventually all encode the next token, so somewhere in the layers in between this shift must happen. But what makes much more sense, upon some reflection, is something very roughly like this:

  • some 1000 dimensions encode the original token
  • some other 1000 dimensions encode the prediction of the next token
  • the remaining 10,288 dimensions encode information about all available context (which will start out "empty" and get filled with meaningful information through the layers).

In practice, things are of course much less clean, and probably most dimensions will have some role in all these things, to different degrees, as of course all of this is learned through gradient descent and hence will be very noisy and gradual. Additionally, there's the whole positional encoding thing which is also part of the embeddings and makes clear distinctions even more difficult. But the key point remains that a single embedding encodes many things, only one of which is the prediction, and this prediction is always there from the beginning (when it's still very superficial and bad) and then, together with the rest of the embedding, gets refined more and more throughout the layers.
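For intuition, here is a tiny numpy sketch (not an actual transformer; the directions and numbers are made up purely for illustration) of how a single high-dimensional vector can carry several pieces of information in roughly separate directions, each readable with a simple dot product, so that "represents the token" and "represents the prediction" don't have to compete for the same dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 12288  # residual stream width, as in GPT-3

# Random high-dimensional directions are nearly orthogonal,
# so separate "write" directions barely interfere with each other.
token_dir = rng.normal(size=d) / np.sqrt(d)  # hypothetical "current token" direction
pred_dir = rng.normal(size=d) / np.sqrt(d)   # hypothetical "next-token guess" direction

# One vector holding both pieces of information at once:
residual = 3.0 * token_dir + 1.5 * pred_dir

# Reading each piece back out with a dot product (an unembedding-like readout):
print(residual @ token_dir)  # ~3.0 -> the token signal
print(residual @ pred_dir)   # ~1.5 -> the prediction signal
```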

Another misconception I had was that embedding and unembedding are very roughly symmetric operations that just "translate" from token space to embedding space and vice versa[1]. This made sense with the initial & naive "embeddings represent tokens" interpretation, but with the updated view described above, it becomes clear that unembedding is better thought of as an "extraction" of the part of the embedding that encodes the prediction.

One piece of evidence for this updated view is that this paper (thanks to Leon Lang for the hint) found that "Zero layer transformers model bigram statistics". So, indeed, embedding + unembedding alone already perform some very basic next-token prediction. (Admittedly, I'm not sure whether this only holds when the transformer is trained with zero layers, or also in, say, GPT-3, if during inference you just skip all the layers.)
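To make the zero-layer point concrete, here is a minimal sketch of the computation such a model performs: embedding followed directly by unembedding. The sizes and weights below are untrained placeholders; the point is only that even with no layers at all, the architecture already outputs a next-token distribution, which training then pushes towards bigram statistics:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64  # tiny illustrative sizes
W_E = rng.normal(size=(vocab_size, d_model)) * 0.02  # embedding matrix
W_U = rng.normal(size=(d_model, vocab_size)) * 0.02  # unembedding matrix (untied)

def zero_layer_next_token_probs(token_id: int) -> np.ndarray:
    x = W_E[token_id]           # embed the current token
    logits = x @ W_U            # unembed: one logit per possible next token
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()  # softmax -> predicted next-token distribution

# With trained weights, this distribution approximates P(next | current),
# since W_E @ W_U is all the information a zero-layer model has to work with.
print(zero_layer_next_token_probs(123).shape)  # (1000,)
```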

I would guess that transformer-experienced people (unless they disagree with my description, in which case please elaborate on what I'm still getting wrong) will find all of this rather obvious. But for me, this was a major missing piece of understanding, even after once participating in an ML-themed bootcamp and watching all the 3Blue1Brown videos on transformers several times; either this idea is not directly explained there, or I somehow managed to consistently miss it.

  1. ^

    Of course, this is not entirely true to begin with, because the unembedding yields a distribution rather than a single token. But my assumption was that, if you embed the word "Good" and then immediately unembed the embedding, you would get a very high probability for "Good" back, when in practice (I haven't verified this yet) you would probably obtain high probabilities for "morning", "day", etc.
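(If one wanted to actually check this, something along the following lines should work. It's a rough sketch using Hugging Face's transformers library and GPT-2; note that GPT-2 ties its embedding and unembedding matrices, so it isn't a perfect test of the untied case discussed above.)

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

token_id = tok.encode(" Good")[0]              # a single token for " Good"
emb = model.transformer.wte.weight[token_id]   # embed it ...
logits = emb @ model.transformer.wte.weight.T  # ... and immediately unembed (tied weights)
top = torch.topk(logits, 5).indices
print([tok.decode([int(i)]) for i in top])     # most likely tokens after embed + unembed
```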

It’s sad to admit, but I think there are many good things that simply don’t have good titles.

I've been thinking for many years that bad titles are a common reason for failure, say of movies or video games or other products where superficial first impressions are important. In the sense of: there are products out there that would have been orders of magnitude more (or less) successful had they gone with a different name.

This seems particularly important to me for anything that has a "viral" element, where people tell their friends about it. A good title most definitely affects the "reproduction number" to some degree. If it sounds cool, people may easily be 50% more likely to talk about it than if the name is cringe or confusing or hard to remember or hard to pronounce. If this moves your R from 0.9 to 1.4, that can obviously make a tremendous difference for the trajectory of the thing.
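As a toy illustration of why crossing R = 1 matters so much (the numbers are purely hypothetical, borrowed from the paragraph above):

```python
def total_reach(r: float, seed: int = 1000, generations: int = 20) -> int:
    """Expected cumulative number of people reached by word of mouth,
    assuming each person who hears about the thing tells r new people on average."""
    reached, current = seed, float(seed)
    for _ in range(generations):
        current *= r
        reached += current
    return round(reached)

print(total_reach(0.9))  # ~8,900 -- fizzles out, capped near seed / (1 - r) = 10,000
print(total_reach(1.4))  # ~2,900,000 -- and still growing every generation
```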

My girlfriend once came across this metaphor of "spiraling upwards". Whatever you're struggling with, you'll have low points again with near certainty, but ideally you have learned something in the meantime that improves some aspect of your situation or your ability to bounce back. I think it's a nice way to look at things when it's true. Generally, dealing with setbacks seems like one of the crucial parts of making progress in any area.

Dwarkesh Patel

Most people here probably know it, but for the few of you who don't: an in-depth AI podcast with many high-profile guests from AI labs and beyond. It often brings up AI Safety concerns, but the general vibe of the podcast is usually somewhere between excited and optimistic. Dwarkesh is quick on his feet and tends to ask many good questions, often "good-faith-challenging" his guests.

He's great at drawing out his guests' worldviews and at keeping conversations engaging even over many hours. My impression is that he vibes well with most guests and gets them to share their views more freely than they otherwise would. Most noteworthy for me were the episodes with Sutskever, Aschenbrenner, gwern, and of course the AI 2027 one with Daniel Kokotajlo and Scott Alexander.

If the above sounds interesting, then consider this a recommendation.

If you consider Mechanize to be net-negative and don't want to support anyone funding them, then rather don't consider this a recommendation.

The Studies Show

It's entertaining yet refreshingly skeptical of science (in a, you know, rather rational way) and clear-eyed about its problems. It tears apart many papers, myths, and misconceptions. Tom Chivers keeps mentioning Bayes and Scott Alexander. There are some episodes on general scientific & statistical concepts and the major problems in science, as well as many object-level ones on concrete research topics, such as growth mindset, autism, seed oils, or IQ; I prefer the latter. Spoiler alert: the outcome of most episodes is "we know much less than people think", about pretty much anything.

One weakness of the show may be that they possibly err too much on the "there may be some evidence for X, but can we really tell? Actually, nobody really knows and it's all just guessing based on a bunch of very flawed studies" side. Occasionally the hosts seem a bit less well prepared than they could be. Still, on the majority of topics, I find their episodes rather enlightening. Another plus is that they have some episodes on their own past mistakes (of which there are indeed quite a few).

If you're a bit cynical and enjoy two witty Brits making fun of bad science while learning a few things about the state of research, you might enjoy this one.


The Clearer Thinking podcast

I like how it explores a variety of important topics deeply without becoming less relevant even after so many episodes. It has a good length of ~60-90 minutes per episode. Spencer's questions are often great, plus he tends to bring his own insights and perspectives to the table that add a lot.

The episodes that I learned the most from were probably the ones on different psychological conditions, such as the conversations with a narcissist, a sociopath, someone with borderline personality disorder, or a victim of sexual abuse.

People who are interested in rational discussions of science, psychology, mental health, ethics, etc. probably have a good shot at getting something out of the Clearer Thinking podcast.

For a long time, I used to wonder what causes people to consistently mispronounce certain words even when they are exposed to many people pronouncing them correctly (this mostly applies to people speaking a non-native language, e.g. people from continental Europe speaking English).

Some examples that I’ve heard from different people around me over the years:

  • Saying “rectangel” instead of “rectangle”
  • Saying “pre-purr” (like prefer, but with a p) instead of “prepare”
  • Saying something like, uhh, “devil-oupaw” instead of “developer”
  • Saying “leech” instead of “league”
  • Saying “immu-table” instead of “immutable”
  • Saying "cyurrently" instead of "currently"

I did, of course, understand that if you only read a word, particularly in English where pronunciations are all over the place and often unpredictable, you may end up with a wrong assumption of how it's pronounced. This happened to me quite a lot[1]. But then, once I did hear someone pronounce it, I usually quickly learned my lesson and adopted the correct way of saying it. But still, I've seen all these other people stick to their very unusual pronunciations anyway. What's up with that?[2] Naturally, it was always too awkward for me to ask them directly, so I never found out.

Recently, however, I got a rather uncomfortable insight into how this happens when a friend pointed out that I was pronouncing "dude" incorrectly, and have apparently done so for all my life, without anyone ever informing me about it, and without me noticing it.

So, as I learned now, "dude" is pronounced "dood" or "dewd". Whereas I used to say "dyood" (similar to duke). And while I found some evidence that dyood is not completely made up, it still seems to be very unusual, and something people notice when I say it.

Hence I now have the, or at least one, answer to my age-old question of how this happens. So, how did I never realize? Basically, I did realize that some people said "dood", and just took that as one of two possible ways of pronouncing that word. Kind of, like, the overly American way, or something a super chill surfer bro might say. Whenever people said "dood" (which, in my defense, didn't happen all that often in my presence[3]) I had this subtle internal reaction of wondering why they suddenly saw the need to switch to such a heavy accent for a single word.

I never quite realized that practically everyone said "dood" and I was the only "dyood" person.

So, yeah, I guess it was a bit of a trapped prior and it took some well-directed evidence to lift me out of that valley. And maybe the same is the case for many of the other people out there who are consistently mispronouncing very particular words. 

But, admittedly, I still don't wanna be the one to point it out to them.

And when I lie awake at night, I wonder which other words I may be mispronouncing with nobody daring to tell me about it.

  1. ^

    e.g., for some time I thought "biased" was pronounced "bee-ased". Or that "sesame" was pronounced "see-same". Whoops. And to this day I have a hard time remembering how "suite" is pronounced.

  2. ^

    Of course one part of the explanation is survivorship bias. I'm much less likely to witness the cases where someone quickly corrects their wrong pronunciation upon hearing it correctly. Maybe 95% of cases end up in this bucket that remains invisible to me. But still, I found the remaining 5% rather mysterious. 

  3. ^

    Maybe they were intimidated by my confident "dyood"s I threw left and right.

or can read interview transcripts in much less time than listening to a podcast would take.

This always baffles me. :) I guess I'm both a slow reader and a fast listener, but for me audio easily allows for 3x the speed of reading.

So what made you change your mind?

It's interesting how two years later, the "buy an expert's time" suggestion is almost outdated. There are still situations where it makes sense, but probably in the majority of situations any SOTA LLM will do a perfectly fine job giving useful feedback on exercises in math or language learning.

Thanks for the post!
