

I think there's a mistake here which kind of invalidates the whole post. Ice cream is exactly the kind of thing we’ve been trained to like. Liking ice cream is very much the correct response.

Everything outside the training distribution has some value assigned to it. Merely the fact that we like ice cream isn’t evidence that something’s gone wrong.

I agree completely. This is a plausible explanation, but it’s one of many plausible explanations and should not be put forward as a fact without evidence. Unfortunately, said evidence is impossible to obtain due to OpenAI’s policies regarding access to their models. When powerful RLHF models begin to be openly released, people can start testing theories like this meaningfully.

Linear warm-up over the first 10% of training, then cosine decay to a minimum of one-tenth the peak LR, which is reached at the end of training (300B tokens). Peak LRs vary by model but are roughly consistent with GPT-3 and OPT values. You can find all the config details on GitHub. The main divergence from mainstream approaches that is relevant to this conversation is that we use a constant batch size (2M) throughout scaling. Prior work uses batch sizes up to 10x smaller for the smallest models, but we find that we can train small models with large batches without any problems. This enables us to achieve a substantial wall-clock speed-up for small models by throwing more GPUs at them. We continue to use this batch size for the 11B model for consistency, although the standard progression of batch sizes would suggest 3M or 4M by that point.
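For anyone who wants to plot this, here is a minimal sketch of the schedule as described above. It is not taken from the actual training configs: the function name, the default arguments, and the peak LR in the usage lines are illustrative assumptions, and the real values live in the configs on GitHub.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.10, min_ratio=0.10):
    """Linear warm-up, then cosine decay down to min_ratio * peak_lr.

    warmup_frac and min_ratio default to the values quoted in the comment
    above; the function itself is a hypothetical helper, not library code.
    """
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak LR.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    min_lr = min_ratio * peak_lr
    # Cosine interpolation from peak_lr (progress=0) to min_lr (progress=1).
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative usage: the peak LR and step count here are made-up values.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000, peak_lr=1e-3))
```

Clamping `progress` means that any step past the end of training simply stays at the floor rather than letting the cosine climb back up.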

Checkpoints 20 and 40 are at 20k and 40k iterations respectively, and the entire training run lasts 143k iterations. So they occur relatively shortly after the LR peaks, but don't coincide with anything I know to be particularly special.

This is really exciting work to see, and exactly the kind of thing I was hoping people would do when designing the Pythia model suite. It looks like you're experimenting with the 5 smallest models, but haven't done analysis on the 2.8B, 6.9B, or 12B models. Is that something you're planning on adding, or no?

I am really very surprised that the distributions don't seem to match any standard parameterized distribution. I was fully ready to say "okay, let's retrain some of the smaller Pythia models initialized using the distribution you think the weights come from" but apparently we can't do that easily. I suppose we could use an MCMC sampler? In general, it seems like a natural follow-up to the contents of this post is to change the way we initialize things in models, retrain them, and see what happens (esp. with the loss curve). If that's something you'd like to collaborate with EleutherAI about, I would be more than happy to arrange something :)
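One cheap way to try this without fitting a parametric family (and without MCMC) might be to resample directly from the empirical distribution of the trained weights. The sketch below is only an assumption about how one could do it: `sample_like` is a hypothetical helper, the Student-t array is a stand-in for real trained weights, and the approach matches only the marginal distribution of individual weights, not any correlations between them.

```python
import numpy as np

def sample_like(trained_weights, shape, rng=None):
    """Draw new values whose marginal distribution mimics trained_weights.

    Uses inverse-transform sampling through the empirical quantile function.
    Hypothetical helper for illustration; it ignores all structure between
    weights (rows, columns, heads) and only matches the 1-D marginal.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sorted values act as the empirical quantile function on a [0, 1] grid.
    flat = np.sort(np.asarray(trained_weights, dtype=np.float64).ravel())
    grid = np.linspace(0.0, 1.0, flat.size)
    # Push uniform draws through the quantile function (linear interpolation).
    u = rng.uniform(0.0, 1.0, size=int(np.prod(shape)))
    return np.interp(u, grid, flat).reshape(shape)

# Stand-in for a trained weight matrix: a heavy-tailed sample (assumption).
src = np.random.default_rng(0).standard_t(df=3, size=10_000)
new = sample_like(src, (64, 32), rng=np.random.default_rng(1))
print(new.shape, float(np.median(src)), float(np.median(new)))
```

Because it interpolates between observed values, this can never produce a weight outside the observed range, which may or may not be what you want for the tails.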

In general, the reliability of the things you're seeing across model scales is really cool. I agree that it seems to refute some of the theoretical assumptions of the NTK literature, but I wonder if perhaps it's consistent with the Tensor Programs work by Greg Yang et al. that led to muP.

To clarify what's going on with the Pythia models:

  1. This work appears to be using the initial model release, which has an inconsistent naming scheme. Some models were named based on total parameters, while others were named based on the number of learnable parameters. The former is how models are typically named, but the latter is what people put on the x-axis of scaling-law plots. This is a nomenclature change only, with no impact on results.
  2. Shortly after release, we renamed the models to be consistently named using the total number of parameters. The models studied in this post are currently named 70M, 160M, 410M, 1B, and 1.4B.
  3. When writing the paper for these models, we discovered a handful of inconsistencies in the suite's hyperparameters. Specifically, the batch size and some all-reduce optimizations were inconsistent across training. We expect this to have no impact on the OP or 90% of experiments using the suite. That said, if we're going to spend all this compute to design a suite for controlled scientific experiments, it should control for as many factors as possible. The current models will remain public and people are encouraged to compare results across them to further validate that various properties don't impact the behavior that they're finding.

This is excellent work, though I want to generically recommend caution when making assumptions about the success of such attacks based only on blackbox evaluations. Thorough analysis of false positive and false negative rates with ground-truth access (ideally in an adversarially developed setting) is essential for validation. [Sidebar: this reminds me that I really need to write up my analysis in the EleutherAI discord showing why prompt extraction attacks can be untrustworthy]

That said, this is really excellent work and I agree it looks quite promising.

Do you have a reference to the work you’re talking about? I’m doing some stuff involving fitting curves to activation tails currently.

This is very interesting. The OP doesn’t contain any specific evidence of Gaussianness, so it would be helpful if they could elaborate on what evidence led them to conclude these are Gaussian.

I’m not sure when you developed this work, but the LLM.int8 paper identifies outliers as an essential factor in achieving performance for models larger than 2.7B parameters (see Fig. 1 and Fig. 3 especially). There’s also some follow-up work here and here. Very curiously, the GLM-130B paper reports that they don’t see outlier features at all, or the negative effects associated with them.

I’ve spoken a bit about this with Tim (LLM.int8 lead author) and some people in EleutherAI, and I’m wondering if there’s some kind of explicit or implicit regularizing effect in the GLM model that prevents it from learning outlier features. If this is the case, one might expect to find different patterns in outliers in models with sufficiently different architectures, perhaps GPT-2 vs Pythia vs GLM vs T5.

I think that the answer is no, and that this reflects a common mental barrier when dealing with gradient descent. You would like different experts to specialize in different things in a human-interpretable way, but Adam doesn’t care what you say you want. Adam only cares about what you actually write down in the loss function.

Generally, a useful check for arguments like this is to ask yourself whether your justification for why something should happen also justifies something that is known not to happen. If so, it’s probably flawed.

In this case it does: as far as I can tell, your justification applies equally to multi-headed attention (as an improvement over single-headed attention). While there have been some attempts to examine MHA as an interpretability-magnifying technique, in practice there hasn’t really been much success. Whatever story you tell about why it should work with MoE needs to distinguish MoE from MHA.

I think this question matters because it doesn't seem implausible to me that MoE models could be at par with dense models in terms of capabilities.

There are two regimes when talking about scaling LLMs, and I think it’s very important to keep them separate when talking about things like this. The literature on scaling laws was written by researchers at a very small number of companies that are in a very important and non-standard situation: their analyses are predicated upon the assumption that using twice as many GPUs for half as long doesn’t impact costs. It’s hard to overstate how few people fall into this regime.

I run EleutherAI, the non-profit org that has trained more and larger multi-billion parameter LLMs than any other non-profit in the world, and have worked on three different models that held the title “largest publicly available GPT-3-like LLM in the world.” I have access to thousands of A100 GPUs to train models if I really want to, and recently won a USG grant for 6 million V100 hours. I generally do not operate in this regime.

The regime that almost everyone finds themselves in is one where one day the VRAM runs out. Maybe it’s at a pair of 3090 Tis, maybe it’s at a v3-8 TPU, maybe it’s at a DGX machine. But one day you lose the ability to halve your runtime by doubling the amount of VRAM you are using without impacting costs.

In this “VRAM-constrained regime,” MoE models (trained from scratch) are nowhere near competitive with dense LLMs. While there has been some success at turning dense models into MoE models with less performance loss, that work isn’t really relevant to your hypothesis without a substantial amount of additional intellectual work. MoE models are egregiously inefficient in terms of performance-per-VRAM, but compensate by being more efficient in terms of performance-per-FLOP.

How egregious exactly? Well, the first MoE paper I grabbed claims that their 1.1T parameter MoE model performs similarly to a 6.7B parameter dense model and that their 207B parameter MoE model performs similarly to a 1.3B parameter dense model. To put these numbers in perspective: the (currently unverified) claims NVIDIA is making about quantization on their H100 GPUs would enable you to fit a 640B parameter model on an 8xH100 (80GB) device. So you can use an entire 8xH100 machine to fit a MoE model, or you can use a single 3090 Ti and get better performance (using LLM.int8).
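The back-of-the-envelope arithmetic behind that comparison can be sketched as follows. This assumes one byte per parameter (8-bit weights) and counts weight storage only, ignoring activations, KV cache, and framework overhead, so real requirements are higher. The helper name and the dictionary labels are my own; the parameter counts are the ones quoted above, and 24 GB is the 3090 Ti's memory.

```python
def weight_memory_gb(n_params, bytes_per_param=1.0):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

H100_NODE_GB = 8 * 80   # 8xH100 with 80 GB each
RTX_3090_TI_GB = 24     # a single 3090 Ti

# Parameter counts quoted in the comment above; labels are illustrative.
models = {
    "MoE 1.1T": 1.1e12,
    "MoE 207B": 207e9,
    "dense 6.7B": 6.7e9,
    "dense 1.3B": 1.3e9,
}

for name, n_params in models.items():
    gb = weight_memory_gb(n_params)  # assumes 8-bit weights
    print(
        f"{name:>10}: {gb:8.1f} GB | "
        f"fits 8xH100: {gb <= H100_NODE_GB} | "
        f"fits 3090 Ti: {gb <= RTX_3090_TI_GB}"
    )
```

Under these assumptions the dense models that the MoE models are compared against fit on a single consumer card with room to spare, while the MoE models need most or more than all of an 8xH100 node just to hold their weights.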

Edit: in a reply to the other answer you say

I'm not saying that MoE are more interpretable in general. I'm saying that for some tasks, the high level view of "which expert is active when and where" may be enough to get a good sense of what is going on.

I had misread your claim, but I think the intent of my response is still valid. Even with this more specific claim, you see people aspiring to believe that this is true for MHA and coming up (largely, albeit not entirely) empty. There’s still a significant burden on you to show why your position is better than the same position with the word “MoE” replaced with “MHA.”

What sources do you have for your claim that “large groups” of people believe this?
