Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?

gwern


17th Jun 2026


This is a linkpost for https://gwern.net/llm-catapult

(2024-04-21) There are many mysteries about deep learning and human intelligence, but we could describe the biggest anomaly this way: why are artificial neural nets smart in such stupid ways, and biological brains stupid but in smart ways?

I propose a major change in deep learning scaling paradigms: the architectural differences between human brains and NNs (particularly LLMs) may be due to a bias-variance tradeoff, where LLMs minimize variance and human brains minimize bias. Human brains do this by deep double descent-style overparameterization, and adopting a scaling strategy of extremely high-learning-rate training of extremely overparameterized models on small diverse highly-filtered datasets. This approach would lead to sample-efficiently and compute-efficiently traveling (or catapulting) to a highly-generalizing human-like basin in the model loss landscape, while performing poorly up until the end and failing to memorize much data.

If true, this would explain a number of odd stylized facts about how humans/NNs perform well/poorly.

Such a 'catapulted LLM' would generalize much better than existing NNs, be immune to adversarial attacks, have better economics and be more resistant to cloning, could potentially enable extremely efficient MLP architectures, and by giving true generalization, provide a sturdy foundation for AI safety in the form of useful NNs which are aligned & safe for the right reasons.

This could be feasibly tested by training multi-trillion-parameter models for relatively few steps at high cyclical learning rate schedules, and benchmarking adversarial and hard examples on tasks like arithmetic and small-image classification.
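To make "high cyclical learning rate schedules" concrete, here is a minimal sketch of one common cyclical policy, a triangular wave between a low and a high learning rate. The triangular shape, the name `triangular_lr`, and every number below are assumptions for illustration; the post does not specify which schedule it has in mind.

```python
def triangular_lr(step, base_lr, max_lr, half_cycle_steps):
    """Triangular cyclical learning rate (illustrative assumption, not the post's spec).

    The rate climbs linearly from base_lr to max_lr over half_cycle_steps,
    falls back to base_lr over the next half_cycle_steps, then repeats.
    """
    cycle_pos = step % (2 * half_cycle_steps)   # position inside the current cycle
    frac = cycle_pos / half_cycle_steps         # runs from 0 up to (but not including) 2
    if frac > 1.0:
        frac = 2.0 - frac                       # descending half of the triangle
    return base_lr + (max_lr - base_lr) * frac


# Hypothetical short run: few steps, repeatedly revisiting a very high peak rate.
schedule = [triangular_lr(s, base_lr=0.1, max_lr=1.0, half_cycle_steps=100)
            for s in range(400)]
```

A short run under such a schedule spends much of its time at learning rates that a conventional warmup-then-decay schedule only touches briefly, which is the regime the proposal is interested in.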

Tags: General intelligence, Neuroscience, Scaling Laws, AI


Lucius Bushnaq (2mo)

Massive overparametrization isn't actually necessary for finding well-generalising solutions. Contrary to some old double descent ideas, you don't need to overfit and then grok, if you do your job right and don't screw up weight regularisation you can just smoothly learn solutions that generalise well.

Maybe you just mean "Spend FLOPs on bigger models at the cost of shorter training runs", in which case, sure, that's a thing one can try. I'm not going to speculate on whether it'd work or not because I don't want to help increase model capabilities.

More importantly:

and by giving true generalization, provide a sturdy foundation for AI safety in the form of useful NNs which are aligned & safe for the right reasons.

I do not think our issues here are primarily caused by a lack of generalization ability. The problem isn't that the AIs are overfitting. GPT-4o's sycophancy worked pretty well for it in interactions both inside and outside its training data. The problem is that it is hard to predict in advance which exact inner objectives and other complex internal properties of AIs a given training environment will induce. And because this is hard to predict, it is difficult for engineers to successfully design a training setup on which AIs with complex internal properties they want are selected for over AIs with complex internal properties they don't want.

Making the AIs generalise better only makes that task even harder, because the more creative and agentic the AIs get, the harder it becomes for engineers to correctly guess what thoughts an AI with a given internal objective or proclivity might think in response to a given situation in training.

Zach Furman (1mo)

Massive overparametrization isn't actually necessary for finding well-generalising solutions. Contrary to some old double descent ideas, you don't need to overfit and then grok, if you do your job right and don't screw up weight regularisation you can just smoothly learn solutions that generalise well.

I don't see how the second sentence necessarily supports the first sentence? They seem logically independent. Personally, I agree with the second sentence and disagree with the first (though I won't argue the object level here).

Perhaps you're arguing against an argument like "massive overparameterization finds well-generalizing solutions via a 'grokking' or 'double descent' process of first overfitting and then generalizing"? I've heard this before, and I think it's a bit silly - as you say, these sorts of grokking or double descent settings are IMO unrepresentative[1]. But I think "massive overparametrization isn't actually necessary for finding well-generalising solutions" doesn't logically follow; I think that needs a separate argument. (Which, knowing you, I would expect you have, but just pointing out the logical jump here.)

[1] To be clear I still think these settings are highly valuable, but as existence proofs pushing the limits of our theories (like e.g. black holes in physics) rather than "how things usually are."

Noosphere89 (2mo)

At least for sample efficiency, going by the Chinchilla paper for LLMs, even pushing the number of parameters towards infinity only gets you about 1 OOM less data to reach the same loss, when there's a 3-6 OOM gap to be explained. Also, even if we do believe that human sample efficiency is mostly just the prior, the marginal sample efficiency of models is also a lot worse, and prior differences don't help to explain that one.

All quotes are taken from the Dwarkesh article on sample efficiency.

The quote about the Chinchilla scaling law meaning we only get 1 OOM of sample efficiency even if we scaled neural nets to an infinite number of neurons:

The way the scaling law equations work is that parameter and data terms are added to the loss independently. If you have a model that is trained compute optimally, and suppose you ask, well what if I just wanna maximize sample efficiency and use less data - and I’ll throw in as many parameters as it takes to make that happen. With the constants from the Chinchilla scaling laws paper (and the nature of the result wouldn’t change even with different constants), even if you increased the number of parameters by infinity, that would only decrease by a factor of ~10 the amount of data you need in order to keep the same loss. Humans are somewhere between thousands to millions of times more sample efficient than these models. Scaling of current models simply can’t make up for that discrepancy. This really does suggest that humans are on a different scaling curve altogether.

And the quote about sample efficiency for marginal capabilities also being worse in a way that we can't explain via prior differences:

Even if it were the case that we can explain away the trillions of tokens required to pretrain a base model as catching up to evolution, it doesn’t explain why the marginal capabilities take so much data - once you have been educated, you don’t need 100 different professors to learn a new programming language, but the AIs (even once pretrained) do.

So your proposal, if it worked would have to have a much more favorable sample efficiency curve with increasing parameters than Chinchilla's scaling laws.
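The "factor of ~10" in the quote above can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming the parametric fit L(N, D) = E + A/N^alpha + B/D^beta with the constants reported in the Chinchilla paper, and assuming the usual C = 6ND compute approximation to define "compute-optimal". The function names are mine, and the compute-optimal point implied by this particular fit need not match the paper's headline tokens-per-parameter ratio.

```python
import math

# Parametric loss fit reported in the Chinchilla paper:
#   L(N, D) = E + A / N**ALPHA + B / D**BETA
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def loss(n_params, n_tokens):
    """Loss predicted by the fitted parametric form."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def tokens_for_same_loss_with_infinite_params(n_params, n_tokens):
    """Tokens D' such that loss(inf, D') equals loss(n_params, n_tokens)."""
    reducible = A / n_params**ALPHA + B / n_tokens**BETA
    return (B / reducible) ** (1.0 / BETA)


def data_saving_factor(n_params, n_tokens):
    """How many times less data an infinitely large model needs for the same loss."""
    return n_tokens / tokens_for_same_loss_with_infinite_params(n_params, n_tokens)


def compute_optimal_tokens(n_params):
    """Tokens that minimise this fit at fixed compute, assuming C = 6*N*D.

    Setting the derivative to zero gives ALPHA*A/N**ALPHA == BETA*B/D**BETA.
    """
    param_term = A / n_params**ALPHA
    return (BETA * B / (ALPHA * param_term)) ** (1.0 / BETA)


# At that optimum the saving is independent of scale: (1 + BETA/ALPHA)**(1/BETA),
# which is about 8.5 with these constants.
CLOSED_FORM_FACTOR = (1.0 + BETA / ALPHA) ** (1.0 / BETA)

# Hypothetical spot check at roughly Chinchilla's own size (70B params, 1.4T tokens).
chinchilla_factor = data_saving_factor(70e9, 1.4e12)
```

Under these assumptions the saving from infinite parameters is a constant of order 10 no matter how far you scale, which is the point of the quote: within this functional form, parameters cannot buy back a 3-6 OOM data gap.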

I'm just noting how much of a big deal it would be if the catapulted NN idea actually worked, because right now the scaling curves of LLM sample efficiency are very poor even if we add more parameters to the NNs: even an infinite number of parameters requires only about 1 OOM less data to get the same loss, compared to the 3-6 OOMs of sample efficiency difference between LLMs and humans.

Also, another useful puzzle is why even after pre-training ends, where the priors should have been baked in, do AIs still require 100 different professors to learn a new programming language, compared to humans who often need 1-3 professors at most, implying a 2 OOM sample efficiency advantage even when the priors baked in by pre-training are taken into account.

One final point about prior differences, or lack thereof:

Many billions of years of evolution is our pre-training, so it’s unfair to compare how little data we see simply within our lifetime to what these cold-started LLMs have to learn from.
Our genome is 3GB, about 1-2% protein coding. That is just not enough space to store the model parameters that are supposedly pretrained (frontier models are terabytes sized). The closer analogy is probably that evolution has found the right hyperparameters and loss functions (Sidenote: I had an interesting podcast with Adam Marblestone where he argued that the loss functions were the more significant find from evolution), but that the equivalent of parameter training is still happening within lifetime, and is encoded in the map of neural connections in the brain built up over a lifetime.

gwern (2mo*)

So your proposal, if it worked would have to have a much more favorable sample efficiency curve with increasing parameters than Chinchilla's scaling laws.

Indeed. There is no reason to expect Chinchilla to be relevant to sample-efficiency, because Chinchilla only claims to be compute-optimal, and within a very narrow setting at that (old Transformers trained with that specific arch and mostly heuristically-set hyperparameters), on ordinary average data, with extremely shaky, unreliable extrapolations. And people routinely find ways to increase sample-efficiency markedly like 1 OOM, including in the papers I cite about getting supra-Chinchilla sample efficiency by ensembling and multi-epoch training and heavier weight-decay; and there's no reason to expect NNs to automatically achieve Bayes-optimal sample-efficiency when not optimizing for sample-efficiency in the first place. I don't know why Dwarkesh thinks that any extrapolation from Chinchilla tells us anything but a loose lower bound on NN sample-efficiency in alternative scaling regimes, especially unknown ones. It's a bit like claiming that Fable is impossible because the Kaplan et al 2020 scaling curves on LSTM RNNs show that RNNs scale poorly - like many impossibility proofs, there is less there than meets the eye.

I'm just noting how much of a big deal it would be

I agree. Any scaling regime change is extremely important, and yet effectively undiscussed. (Where is the equivalent of Impagliazzo?) I've long been puzzled at the unthinking acceptance of Chinchilla compute-optimal scaling laws as the end-all-be-all and almost aggressive field-wide disinterest in the behavior of overparameterized NNs as you scale them up. (For example, do LLMs get more or less interpretable, for iso-loss, as you scale them up from eg 10b to 100,000b? Do 100,000b-parameter LLMs even work?) There is no theoretical reason to think that Chinchilla is the final scaling law (see the theory papers I cite), and we have seen so many scaling law improvements in the past that why would one expect this one to be the last one? Statements about 'Chinchilla says you can't do that' will age as well as 'n-grams say you can't do that'. And what about looking at scaling laws in hard subsets of data, like adversarial examples - you know, the kind of data that stubbornly remains a problem even as we keep dumping powerlaw data into the Chinchilla hopper and seeing our loss go down efficiently yet semi-uselessly? That seems like the best way to reconcile the observations that there's something very strange about pretraining working and the next-token prediction argument since even a GPT-2 seems likely better at predicting the next-token than humans, and yet, clearly not AGI and we've had to keep scaling a ton to get ever more performance while still not being AGI and having many odd stylized facts about 'jaggedness' etc.

Incidentally, one of the reasons I was thinking about this in the first place was the comment in the Chinchilla paper about weight-decay:

Interestingly, a model trained with AdamW only passes the training performance of a model trained with Adam around 80% of the way through the cosine cycle, though the ending performance is notably better – see Figure A7.

The more delayed superiority is, serially, the easier it is to miss. And "the curves cross" is one of the signatures of a better scaling regime that wins in the long run...

Noosphere89 (2mo)

I generally agree with this, though I will make some comments:

because Chinchilla only claims to be compute-optimal

To be fair, compute-optimality was probably the target because back in the day (and to a lesser extent even now), the amount of data was clearly much larger than the amount of compute, so sample inefficiency was not really a problem, and there's still a reasonable chance that it doesn't actually matter for LLMs being transformative in the way we want.

And it could very well be that at least part of the answer to the puzzle is that you cannot train both a compute-optimal and a data-optimal model, and you have to choose one or the other to target.

old Transformers trained with that specific arch

To be fair, companies are very conservative with architecture changes, and again that's because right now they don't need to and trying to do it would have serious downside risk for their profitability. That said, it's definitely a lot less true for research.

It's a bit like claiming that Fable is impossible because the Kaplan et al 2020 scaling curves on LSTM RNNs show that RNNs scale poorly - like many impossibility proofs, there is less there than meets the eye.

Note Dwarkesh does not claim that it's impossible, only that it requires (at least) a non-trivial amount of research to solve the issue. If there is an error, it's that he didn't realize that there was already research that made progress on the issue.

That said, one reason for this:

I don't know why Dwarkesh thinks that any extrapolation from Chinchilla tells us anything but a loose lower bound on NN sample-efficiency in alternative scaling regimes, especially unknown ones.

Is partially ignorance of research and partially because companies are understandably conservative about trying new things (for pretty good reason here, because AI is now in the era where you can actually make real profits since AIs are now good enough that they can do real economic work, and this means your products have to be reliable, and new tech is often unreliable).

Petropolitan (2mo)

Why do you believe the loss goes down "semi-uselessly"? By now I think it's generally accepted as conventional wisdom that RL, including RLVR, doesn't create much deep knowledge in the LLM (which, I guess, would theoretically be unexpected, given the small magnitude of its updates) but rather converts it from just token-prediction into practical, economically valuable skills. As an example, see pages 2, 3 and 21 of the Composer 2 technical report from March: https://cursor.com/resources/Composer2.pdf

anaguma (2mo)

Two years ago, you were asked:

I hear you sometimes share dual-use (or plain capabilities?) ideas with Anthropic. If that's true, does this change your policy?

To which you responded:

Anthropic is in little need of ideas from me, but yeah, I'll probably pause such things for a while. I'm not saying the RSP is bad, but I'd like to see how things work out.

I find it a bit sad that in this essay, and in your one advocating for your AUNN architecture, you've gone in a different direction and shared your capabilities ideas not only with Anthropic but with the public^[1]. The alignment section is fairly speculative, and doesn't make a strong argument for why your concrete proposals (of high learning rates, weight decay, overparameterization etc.) will lead to 'true generalization' and a 'genuinely moral AI'. Assuming your proposal does lead to brain-like generalization, there are still many alignment problems left unsolved which your essay doesn't discuss. Without further progress on these, it seems unwise to me to publish this type of research.

[1] Though this essay mostly does seem like a capabilities proposal to the labs. There are not many private actors who have the means and expertise to run the 100T parameter runs outlined.

niplav (1mo*)

Update: @Paragox links ~~the~~ such a hash in their comment.

Paragox (1mo)

To be clear, I don't think there is any trivial way to prove the hash released in 2024 corresponds to this potentially 2-year-later revised and modified version, unless @gwern himself confirms of course. My line of reasoning is purely speculation, since he released this with seemingly very little fanfare on gwernnet/twitter/substack before posting on LW, and it's fun to try and guess the hidden motivations.

Relatedly, since this is in fact, at its core, a very good sounding idea, and gwern has been somewhat prescient previously about pointing out the early signs of both pretrain-scaling->alexnet and test-time-compute-scaling->4chan, it does seem strange the grandfather comment was initially downvoted so heavily for considering the capability acceleration implications (which again ties into my motivational curiosity). Personally, I think a compute overhang is to be avoided, so I'm all for the open dissemination of ideas - but then again gwern does seem to like to play some fun 5d games from time to time.

niplav (1mo)

I remember him tweeting hashes of unreleased essays (𝕏 is blocked on my machine right now, so I can't look them up), so I'd guess from one perspective this is the mode of Gwern holding back.

ACCount (2mo)

By now, I have little doubt that the human brain has a parameter (capacity) and a compute advantage over LLMs, and uses it to run something like progressive distillation. But that's not the only part of the "suspicious sample efficiency" story.

My pet hypothesis for the bulk of the "sample efficiency advantage" is still: low k-complexity priors. Not "entire circuits" as predicted by "massive modularity hypothesis", but evolved biases that, despite their compact genetic encoding, do a good job of seeding the right computational structure early. What an LLM has to spend a lot of training signal discovering from scratch, or even fail to discover from scratch (leading to inhuman brittleness), the human brain just gets by default.

Evolution has gone and discovered priors like that through millions of years of highly parallel search. A lot of what does heavy lifting in human brains now might have originated long before milestones like speech or bipedal locomotion, and has been repurposed for higher cognition. Learning in humans is mostly just primate learning, scaled up and forced into a new regime - animal intelligence bent into a more abstract and general shape. A lot of the old priors could be fitted, and only some had to be novel^[1].

Thus, the starting point of the human brain is closer to advanced LLM "in-context learning + context distillation" setups.

The "right computational structure" may involve "structure that is primed to learn X and anti-primed to learn Y". Thus, not entirely against a bias/variance type of mechanism? Architecture, regularization, hyperparameters, initialization and training all constrain possible learning trajectories, and can substitute for each other, to a degree - priors could be delivered through each.

Success of FDSL in image domains and transfer attempts to LLMs like the recent NCA work sure hint that k-compact priors (presented as synthetic data - training substituted for initialization) can help convergence. Even if "all priors are wrong, some are useful" holds, well selected priors could underperform "add more in-domain data" in the limit, but outperform all "realistic amount of data" regimes.

[1] The cleanest case for the latter being, possibly, executive function - anatomically distinct, notoriously fragile, and capable of failing without bringing down the rest of the system with it.

Paragox (2mo*)

I find this a little strange to drop without context, given how 2024-centric it is. I fully claim fallibility in my interpretation, but I feel this was largely written post-gpt4 in 2024, with small updates circa 2025 and even smaller c. 2026. I follow your posts a bit more zealously than even the average LW reader, and still I'm left a bit confused about what this is, exactly. My best guess is along the lines of: publicly releasing something you had previously hashed (eg. this is in fact that 3rd hash) originally with the intent of proving your novel, prescient authorship? But now with the potential bragging rights dwindling away, your utility preference has swapped to at least getting the ideas / article out into the public LLM training set (perhaps on the eve of scaled RSI being seriously attempted), vs continuing to bank on when the frontier labs of the ensuing two years would surely get around to figuring it out themselves?

On its own merits, this is an incredible collection of insights, so regardless I thank you for finally sharing. But it seems rather strange to not acknowledge the apparent shortcomings the 2024 framing brings, or the continual awkwardness in not referencing or building core arguments off newer papers, models, knowledge -- the vastly expanded frontier since 2024! And this possibly leaves a cynical conclusion, that this is not your actual present, frontier-informed thesis - perhaps that is only to be revealed 2 years from now?

Apologetically, I concede this could be the real deal, and the lack of updates is largely mundane, ex. due to the buttoning up of public frontier research, perhaps lightly salted with a pinch of sino-lab disdain? The core ideas still stand: commercialization obsessed US labs have done nothing but shallowly scale the same old tired transformer, Chinchilla-esque-thinking reigns supreme and drowns inference-shaped nets in tokens, no one has gotten anywhere nearer (ideologically) to densely training 10T on 100M tokens, etc. etc.

Andrii Vasylenko (2mo)

and by giving true generalization, provide a sturdy foundation for AI safety in the form of useful NNs which are aligned & safe for the right reasons.

I doubt this. High capabilities are at least somewhat an attractor basin, which makes them possible to target using tools like GD. There is no corresponding attractor at the particular utility function we want the AI to have, so I think there would be a lot of gotchas with trying to learn it using GD.

Jacob G-W (2mo)

How does this explain people like Von Neumann who have ~perfect memory?

gwern (2mo)

The reason I brought up von Neumann was that I was surprised to realize that Von Neumann is the exception that proves the rule: the all-time great, the furthest point on the Pareto frontier, the greatest and most creative person ever to attain quasi-LLM-like memorization skills (maybe, keeping in mind that he was never tested properly to the extent of a Kim Peek, say), who nevertheless fell short and was surpassed by thinkers with less raw gifts, by his own and others' account, on... creativity and out-of-sample generalization/novelty. Just like the other cases of extreme memory. There are countless ways that someone with extreme memory could be psychologically flawed, and yet, that way seems to be the common thread.

Dr. Birdbrain (1mo)

Possibly controversial, but I think the biggest thing that is wrong with modern deep learning is that backpropagation is the wrong learning rule.

According to Reiner Pope’s “How to Scale Your Model”, backpropagation triples compute cost compared to inference, which means that it is not economically feasible to deploy large models that learn online.

This is absurd! This cannot be the Master Learning Algorithm that the human brain uses to implement AGI at 20W power consumption.

I recently heard Ilya Sutskever say that his heuristic is to draw inspiration from the best understanding of how the human brain works, and use that as a “good taste“ heuristic as to what is likely to work. In this context, backpropagation is terrible taste, absolutely disgusting.

The next iteration of learning updates will most likely be lighter. Probably a modernization of Hebbian learning.

cubefox (1mo)

backpropagation triples compute cost compared to inference

That doesn't sound bad.

Dr. Birdbrain (1mo)

I’m confused by this response, so let’s put some numbers on it.

Suppose you have enough compute to train a model with 2 trillion parameters with the conventional backpropagation algorithm. If you had a better algorithm that didn’t incur the memory overhead of backprop with its global update rules, you could use the same hardware to train a model triple the size, which is a 6T model.

Scaling laws tell us that we can reasonably expect this to be a much more capable model.
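The arithmetic behind the "triple" figure can be sketched as follows. This assumes the common rule of thumb of roughly 2 FLOPs per parameter per token for a forward pass and roughly 4 for the backward pass. It takes the idea of a forward-only learning rule at face value and ignores memory, data requirements, and whether such a rule would learn as well; the function names are hypothetical.

```python
# Rule-of-thumb FLOPs per parameter per token (an approximation, not exact).
FORWARD_FLOPS = 2
BACKWARD_FLOPS = 4


def backprop_train_flops(n_params, n_tokens):
    """Forward plus backward pass over n_tokens: about 6 * N * D."""
    return (FORWARD_FLOPS + BACKWARD_FLOPS) * n_params * n_tokens


def forward_only_flops(n_params, n_tokens):
    """Forward pass only (inference, or a hypothetical forward-only learner)."""
    return FORWARD_FLOPS * n_params * n_tokens


def params_affordable_forward_only(budget_flops, n_tokens):
    """Largest model a forward-only learner could run over n_tokens on a budget."""
    return budget_flops / (FORWARD_FLOPS * n_tokens)


n_tokens = 10**12                                   # hypothetical token count
budget = backprop_train_flops(2 * 10**12, n_tokens) # budget that trains a 2T model
bigger_model = params_affordable_forward_only(budget, n_tokens)  # 6T parameters
```

The same budget that trains a 2T-parameter model with backprop covers a 6T-parameter forward-only pass over the same tokens, which is where the 3x comes from.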

cubefox (1mo)

There is reason to believe that backpropagation is theoretically optimal because it is based on the chain rule of calculus. Having just triple the cost compared to inference also means that it isn't overly costly either.

ACCount (2mo*)

Do ANNs "provide little insight into biological brains" because of some fundamental divergence that impairs transfer - like the proposed bias/variance story? Or is it purely a skill issue?

ANN mechanistic interpretability is in a deep pit, and there is very little reason to expect BNN interpretability to be less challenging - and it suffers from far worse instrumentation capabilities. Even if ANN insights are incredibly useful for understanding biological brains (my prior: they are), and some of the methods could fully transfer (my prior: it's possible but not certain), applying them across the tooling gap will be anything but trivial^[1].

At the same time: anyone who's good at working with ANNs tends to work in AI, not neuroscience. People who would be the best at applying AI knowledge tend to apply it back to the field of AI - a field rich in cash and career opportunities, quick in iteration speed and fast to transition to practical applications. Neuroscience is exactly none of those things.

[1] Conversely: the same could hold for adversarial samples? I.e. picking an adversarial sample for an ANN requires the degree of access that is intractable for BNNs. As such, we don't know if BNNs are inherently far more robust to adversarial samples, or simply possess individual and temporal variance and don't expose enough intermediates to have adversarial samples fit to them reliably.
