At least for sample efficiency: going by the Chinchilla paper's scaling laws for LLMs, even pushing the number of parameters towards infinity only gets you about 1 OOM less data to reach the same loss, when there's a 3-6 OOM gap to be explained. And even if we do believe that human sample efficiency is mostly just the prior, the marginal sample efficiency of models is also a lot worse, and prior differences don't help to explain that one.
All quotes below are taken from the Dwarkesh article on sample efficiency.
First, the quote on how the Chinchilla scaling law implies we only get about 1 OOM of sample efficiency even if we scaled neural nets to an infinite number of parameters:
- The way the scaling law equations work is that parameter and data terms are added to the loss independently. If you have a model that is trained compute optimally, and suppose you ask, well what if I just wanna maximize sample efficiency and use less data - and I’ll throw in as many parameters as it takes to make that happen. With the constants from the Chinchilla scaling laws paper (and the nature of the result wouldn’t change even with different constants), even if you increased the number of parameters by infinity, that would only decrease by a factor of ~10 the amount of data you need in order to keep the same loss. Humans are somewhere between thousands to millions of times more sample efficient than these models. Scaling of current models simply can’t make up for that discrepancy. This really does suggest that humans are on a different scaling curve altogether.
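As a rough check on the "factor of ~10" figure in the quote, here is a minimal sketch. It assumes the parametric form L(N, D) = E + A/N^alpha + B/D^beta with the exponents alpha ≈ 0.34 and beta ≈ 0.28 as reported for the Chinchilla paper's fitted loss, and it assumes compute scales with the product N·D. The function name is hypothetical and not taken from either source.

```python
# Assumed Chinchilla-style parametric fit: L(N, D) = E + A / N**ALPHA + B / D**BETA
# Assumption: exponents alpha = 0.34, beta = 0.28 from the paper's fitted form.
ALPHA = 0.34
BETA = 0.28


def max_data_savings(alpha: float = ALPHA, beta: float = BETA) -> float:
    """Factor by which data can shrink at fixed loss if parameters go to infinity,
    starting from a compute-optimal model (hypothetical helper, for illustration)."""
    # Compute-optimal training minimizes A/N**alpha + B/D**beta with N*D held fixed,
    # which gives alpha * A/N**alpha == beta * B/D**beta at the optimum.
    # The reducible loss there is therefore (1 + beta/alpha) * B/D**beta.
    reducible_over_data_term = 1.0 + beta / alpha
    # With N -> infinity the parameter term vanishes, so the smaller dataset D_new
    # must satisfy B/D_new**beta == (1 + beta/alpha) * B/D**beta.
    # Solving for D / D_new gives the savings factor.
    return reducible_over_data_term ** (1.0 / beta)


if __name__ == "__main__":
    print(f"data savings factor: {max_data_savings():.1f}x")
```

Under these assumptions the factor depends only on the two exponents, not on A, B, or E, and it comes out a little under 10, which is in line with the quote's "~10" and nowhere near the 3-6 OOM gap.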
And the quote about sample efficiency for marginal capabilities also being worse in a way that we can't explain via prior differences:
- Even if it were the case that we can explain away the trillions of tokens required to pretrain a base model as catching up to evolution, it doesn’t explain why the marginal capabilities take so much data - once you have been educated, you don’t need 100 different professors to learn a new programming language, but the AIs (even once pretrained) do.
So your proposal, if it worked, would have to have a much more favorable sample efficiency curve with increasing parameters than Chinchilla's scaling laws.
I'm just noting how much of a big deal it would be if the catapulted NN idea actually worked. Right now the scaling curves of LLM sample efficiency are terrible even if we add more parameters to the NNs: an infinite number of parameters only cuts the data required to reach the same loss by about 1 OOM, compared to the 3-6 OOMs of sample-efficiency difference between LLMs and humans.
Also, another useful puzzle is why, even after pre-training ends and the priors should have been baked in, AIs still require 100 different professors to learn a new programming language, while humans often need 1-3 professors at most. That implies a roughly 1.5-2 OOM sample-efficiency advantage for humans even when the priors baked in by pre-training are taken into account.
One final point about prior differences, or lack thereof:
Many billions of years of evolution is our pre-training, so it’s unfair to compare how little data we see simply within our lifetime to what these cold-started LLMs have to learn from.
- Our genome is 3GB, about 1-2% protein coding. That is just not enough space to store the model parameters that are supposedly pretrained (frontier models are terabytes sized). The closer analogy is probably that evolution has found the right hyperparameters and loss functions (Sidenote: I had an interesting podcast with Adam Marblestone where he argued that the loss functions were the more significant find from evolution), but that the equivalent of parameter training is still happening within lifetime, and is encoded in the map of neural connections in the brain built up over a lifetime.
So your proposal, if it worked, would have to have a much more favorable sample efficiency curve with increasing parameters than Chinchilla's scaling laws.
Indeed. There is no reason to expect Chinchilla to be relevant to sample-efficiency, because Chinchilla only claims to be compute-optimal, and within a very narrow setting at that (old Transformers trained with that specific arch and mostly heuristically-set hyperparameters), on ordinary average data, with extremely shaky, unreliable extrapolations. And people routinely find ways to increase sample-efficiency markedly, by something like 1 OOM, including in the papers I cite about getting supra-Chinchilla sample efficiency by ensembling and multi-epoch training and heavier weight-decay; and there's no reason to expect NNs to automatically achieve Bayes-optimal sample-efficiency when not optimizing for sample-efficiency in the first place. I don't know why Dwarkesh thinks that any extrapolation from Chinchilla tells us anything but a loose lower bound on NN sample-efficiency in alternative scaling regimes, especially unknown ones.
I'm just noting how much of a big deal it would be
I agree. Any scaling regime change is extremely important, and yet effectively undiscussed. (Where is the equivalent of Impagliazzo?) I've long been puzzled at the unthinking acceptance of Chinchilla compute-optimal scaling laws as the end-all-be-all and the almost aggressive field-wide disinterest in the behavior of overparameterized NNs as you scale them up. (For example, do LLMs get more or less interpretable, for iso-loss, as you scale them up from eg 10b to 100,000b? Do 100,000b-parameter LLMs even work?) There is no theoretical reason to think that Chinchilla is the final scaling law (see the theory papers I cite), and we have seen so many scaling law improvements in the past that why would one expect this one to be the last one? Statements about 'Chinchilla says you can't do that' will age as well as 'n-grams say you can't do that'. And what about looking at scaling laws in hard subsets of data, like adversarial examples? That seems like the best way to reconcile the observation that there's something very strange about pretraining working with the next-token prediction argument, since even a GPT-2 seems likely better at predicting the next token than humans, and yet is clearly not AGI, and we've had to keep scaling a ton to get ever more performance while still not being AGI and having many odd stylized facts about 'jaggedness' etc.
Incidentally, one of the reasons I was thinking about this in the first place was the comment in the Chinchilla paper about weight-decay:
Interestingly, a model trained with AdamW only passes the training performance of a model trained with Adam around 80% of the way through the cosine cycle, though the ending performance is notably better– see Figure A7.
"The curves cross" is always one of the signatures of a better scaling regime that wins in the long run...
and by giving true generalization, provide a sturdy foundation for AI safety in the form of useful NNs which are aligned & safe for the right reasons.
I doubt this. High capabilities are at least somewhat an attractor basin, which makes them possible to target using tools like GD. There is no corresponding attractor at the particular utility function we want the AI to have, so I think there would be a lot of gotchas with trying to learn it using GD.
(2024-04-21) There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are artificial neural nets smart in such stupid ways, and biological brains stupid but in smart ways?
I propose a major change in deep learning scaling paradigms: the architectural differences between human brains and NNs (particularly LLMs) may be due to a bias-variance tradeoff, where LLMs minimize variance and human brains minimize bias. Human brains do this by deep double descent-style overparameterization, and adopting a scaling strategy of extremely high-learning-rate training of extremely overparameterized models on small diverse highly-filtered datasets. This approach would lead to sample-efficiently and compute-efficiently traveling (or catapulting) to a highly-generalizing human-like basin in the model loss landscape, while performing poorly up until the end and failing to memorize much data.
If true, this would explain a number of odd stylized facts about how humans/NNs perform well/poorly.
Such a 'catapulted LLM' would generalize much better than existing NNs, be immune to adversarial attacks, have better economics and be more resistant to cloning, could potentially enable extremely efficient MLP architectures, and by giving true generalization, provide a sturdy foundation for AI safety in the form of useful NNs which are aligned & safe for the right reasons.
This could be feasibly tested by training multi-trillion-parameter models for relatively few steps at high cyclical learning rate schedules, and benchmarking adversarial and hard examples on tasks like arithmetic and small-image classification.
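One possible reading of "high cyclical learning rate schedules" is a triangular schedule that sweeps linearly between a low and a high rate and back. The sketch below is only an illustration of that shape; the function name, the bounds, and the cycle length are hypothetical placeholders, not values proposed in the essay.

```python
def triangular_lr(step: int, base_lr: float, max_lr: float, half_cycle: int) -> float:
    """Triangular cyclical learning rate (hypothetical helper, for illustration).

    Ramps linearly from base_lr to max_lr over half_cycle steps,
    then back down to base_lr over the next half_cycle steps, and repeats.
    """
    position = step % (2 * half_cycle)
    # Fraction of the way up (first half) or back down (second half) the triangle.
    if position <= half_cycle:
        frac = position / half_cycle
    else:
        frac = (2 * half_cycle - position) / half_cycle
    return base_lr + (max_lr - base_lr) * frac


if __name__ == "__main__":
    # Placeholder values only: one full cycle of 8 steps, peaking at step 4.
    schedule = [triangular_lr(s, base_lr=0.01, max_lr=1.0, half_cycle=4) for s in range(9)]
    print(schedule)
```

In an actual experiment the peak rate would presumably be set far higher than usual practice, since the proposal is specifically about very high learning rates; how high is left open here.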