The best public estimate is that GPT-4 has 1.8 trillion “parameters”, meaning that its neural network has that many connections. In the two and a half years since it was released, it’s not clear that any larger models have been deployed (GPT-4.5 and Grok 3 might be somewhat larger).
The human brain is far more complex than this; the most common estimate is 100 trillion connections, and each connection is probably considerably more complex than the connections in current neural networks. In other words, the brain has far more information storage capacity than any current artificial neural network.
Which leads to the question: how the hell do LLMs manage to learn and remember so many more raw facts than a person[4]?
I don't have any expertise in neuroscience, but I think this is somewhat confused:
I'm having trouble parsing what you've said here in a way that makes sense to me. Let me try to lay out my understanding of the facts very explicitly, and you can chime in with disagreements / corrections / clarifications:
The human brain has, very roughly, 100B neurons (nodes) and 100T synapses (connections). Each synapse represents at least one "parameter", because connections can have different strengths. I believe there are arguments that it would in fact take multiple parameters to characterize a synapse (connection strength + recovery time + sensitivity to various neurotransmitters + ???), and I'm sympathetic to this idea on the grounds that everything in the body turns out to be more complicated than you think, but I don't know much about it.
Regarding GPT-4, I believe the estimate was that it has 1.8 trillion parameters, which if shared weights are used may not precisely correspond to connections or FLOPs. For purposes of information storage ("learning") capacity, parameter count seems like the correct metric to focus on? (In the post, I equated parameters with connections, which is incorrect in the face of shared weights, but does not detract from the main point, unless you disagree with my claim that parameter count is the relevant metric here.)
To your specific points:
Probably the effective number of parameters in the human brain is actually lower than 100 trillion because many of these "parameters" are basically randomly initialized or mostly untrained. (Or are trained very slowly/weakly.) The brain can't use a global learning algorithm, so it might effectively use parameters much less efficiently.
What is your basis for this intuition? LLM parameters are randomly initialized. Synapses might start with better-than-random starting values, I have no idea, but presumably not worse than random. LLMs and brains both then undergo a training process; what makes you think that the brain is likely to do the worse job of training its available weights, or that many synapses are "mostly untrained"?
Also note that the brain has substantial sources of additional parameters that we haven't accounted for yet: deciding which synapses to prune (out of the much larger early-childhood count), which connections to form in the first place (the connection structure of an LLM can be described in a relative handful of bits, while the connection structure of the brain has an enormous number of free parameters; I don't know how "valuable" those parameters are, but natural systems are clever!), and where to add additional connections later in life.
It's a bit confusing to describe GPT-4 as having 1.8 trillion connections, as 1.8 trillion is (roughly) the number of floating point operations, not the number of neurons.
I never mentioned neurons. 1.8 trillion is, I believe, the best estimate for GPT-4's parameter count. Certainly we know that the largest open-weight models have parameter counts of this order of magnitude (somewhat smaller but not an OOM smaller). As noted, I forgot about shared weights when equating parameters to connections, but again I don't think that matters here. FLOPs to my understanding would correspond to connections (and not parameter counts, if shared weights are used), but I don't think FLOPs are relevant here either.
In general, the analogy between the human brain and LLMs is messy, because a single brain neuron probably has far fewer learned parameters than an LLM neuron, but plausibly somewhat more than a single floating point number.
GPT-5 estimates that GPT-4 had just O(100M) neurons. Take that figure with a grain of salt, but I mention it to point out that in both modern LLMs and the human brain, there are far more connections / synapses than nodes / neurons, and the vast majority of parameters will be associated with connections, not nodes. (Which is why I didn't mention neurons in the post, and I don't think it's useful to talk about learned parameters in reference to neurons.)
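As a toy sketch of why connection counts dominate node counts (the layer width below is a made-up example, not GPT-4's actual, unpublished architecture):

```python
# Toy example: in a single dense layer, weights (connections) vastly outnumber
# output units (nodes). The width here is illustrative only.
d_in, d_out = 12_288, 12_288
weights = d_in * d_out        # ~151 million connection weights
nodes = d_out                 # 12,288 "neurons"
print(f"{weights:,} weights for {nodes:,} nodes "
      f"({weights // nodes:,} weights per node)")
```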
Regarding GPT-4, I believe the estimate was that it has 1.8 trillion parameters, which if shared weights are used may not precisely correspond to connections or FLOPs.
For standard LLM architectures, forward pass FLOPs are roughly 2 × params (because of the multiply and accumulate for each matmul param). It could be that GPT-4 has some non-standard architecture where this is false, but I doubt it.
So, yeah we agree here, I was just noting that connection == FLOP (roughly).
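A quick worked example of that rule of thumb, using the ~1.8T parameter estimate discussed above (the figure itself is an unconfirmed public estimate):

```python
# Standard dense transformer rule of thumb: ~2 forward-pass FLOPs per parameter
# per token (one multiply plus one accumulate for each matmul weight).
# 1.8e12 is the unconfirmed public estimate for GPT-4 discussed above.
params = 1.8e12
forward_flops_per_token = 2 * params
print(f"~{forward_flops_per_token:.1e} FLOPs per token")   # ~3.6e+12
```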
What is your basis for this intuition? [...] what makes you think that the brain is likely to do the worse job of training its available weights, or that many synapses are "mostly untrained"?
The brain's learning is purely local, which makes training all the parameters efficiently much harder. My understanding is that in at least the vision-focused part of the brain there is a bunch of use of randomly initialized filters, and I seem to recall some argument made somewhere (by Steven Byrnes?) that the effective number of parameters was much lower. Sorry I can't say more here.
To put it another way: compared to people, large language models seem to be superhuman in crystallized knowledge, which seems to be masking shortcomings in fluid intelligence. Is that a dead end, great for benchmarks but bad for a lot of work in the real world?
You seem to imply that AIs aren't improving on fluid intelligence. Why do you think this? I'd guess that AIs will just eventually have sufficiently high fluid intelligence while still compensating with (very) superhuman crystallized knowledge (like an older experienced human professional).
If fluid intelligence wasn't improving, this would be a dead end, but if there is some pipeline which is improving fluid intelligence (quickly), then I don't see a particular reason to think that high crystallized knowledge is a reason for discounting AI.
I do believe that AIs will eventually surpass humans at fluid intelligence, though I'm highly uncertain as to the timeline.
My point here is really just the oft-repeated observation that when we see an AI do X, intuitively we tend to assess the AI the way we would assess a human being who could do X, and that intuition can lead to very poor estimates of whether the AI can also do Y. (For instance, bar exam → practicing law.) For instance, the relative ratios of fluid vs. crystal intelligence may capture much of the reason that AIs are approaching superhuman status at competition coding problems but are still far from superhuman at many real-world coding tasks. It doesn't mean AIs will never get to real-world tasks. It just suggests (to me) that they might be farther from that milestone than their performance on crystal-intelligence-friendly tasks would imply.
It just suggests (to me) that they might be farther from that milestone than their performance on crystal-intelligence-friendly tasks would imply.
I basically agree, but we can more directly attempt extrapolations (e.g. METR horizon length) and I put more weight on this.
I also find it a bit silly when people say "AIs are very good at competition programming, so surely they must soon be able to automate SWE" (a thing I have seen at least some semi-prominent frontier AI company employees imply). That said, I think AIs being good at competitive programming is substantially not based on better crystallized intelligence and is instead based on this being easier to train for with RL and easier to scale up inference compute on.
In general, this post seems to make a bunch of claims that LLMs have specific large qualitative barriers relative to humans, but these claims seem mostly unsupported. The evidence as far as I can tell seems more consistent with LLMs being weaker in a bunch of specific quantitative ways which are improving over time. For instance, LLMs can totally do continuous learning or consolidate memory, it's just that the best methods for this work pretty poorly. (But plausibly are still within the human range for most/many relevant tasks.)
Agreed that I have not supported my claims here – this was a vibes piece.
I agree that LLMs are improving at ~everything, but my intuition is that some of those improvements – for instance, regarding continuous learning – may currently be of the "climbing a ladder to get closer to the moon" variety. Sounds like we just have very different intuitions here.
AIs have been demonstrating what arguably constitutes superhuman performance on FrontierMath, a set of extremely difficult mathematical problems.
AIs aren't superhuman on FrontierMath. I'd guess that Terry Tao with 8 hours per problem (and internet access) is much better than current AIs. (Especially after practicing on some of the problems etc.)
At a more basic level, this superhumanness would substantially be achieved by broadness/generality rather than by being superhuman within some field (which is arguably less important/impactful). Like, if you compared AIs to a group of humans who are pretty good at this type of math, the humans would probably also destroy the AI.
Yeah, I was probably too glib here. I was extrapolating from the results of the competition Epoch organized at MIT, where "o4-mini-medium outperformed the average human team, but worse than the combined score across all teams, where we look at the fraction of problems solved by at least one team". This was AI vs. teams of people (rather than any one individual person), and it was only o4-mini, but none of those people were Terence Tao, and it only outperformed the average team.
I would be fascinated to see how well he'd actually perform in the scenario you describe, but presumably we're not going to find out.
if you compared AIs to a group of humans who are pretty good at this type of math, the humans would probably also destroy the AI.
I wonder? Given that, to my understanding, each FrontierMath problem goes deep into a different subfield of mathematics, breadth might matter here too. But I have no understanding of the craft of advanced / research mathematics, so I have no intuition here.
Anyway, I think we may be agreeing on the main point here: my suggestion that LLMs solve FrontierMath problems "the wrong way", and your point about depth arguably being more important than breadth, seem to be pointing in the same direction.
Anyway, I think we may be agreeing on the main point here: my suggestion that LLMs solve FrontierMath problems "the wrong way", and your point about depth arguably being more important than breadth, seem to be pointing in the same direction.
Yep, though it's worth distinguishing between LLMs often solving FrontierMath problems the "wrong way" and always solving them the "wrong way". My understanding is that they don't always solve them the "wrong way" (at least for Tier 1/2 problems rather than Tier 3 problems), so you should (probably) be strictly more impressed than you would be if you only knew that LLMs solved X% of problems the "right way".
Is sample-efficient learning a singularly important step on the path to AGI?
Almost definitionally, learning as efficiently as top humans would suffice for AGI. (You could just train the AI on way more data/compute and it would be superhuman.)
AIs will probably reach milestones like full automation of AI R&D before matching top human sample efficiency in broad generality (though they might be better in some/many cases).
Will the journey from here to AGI feature “aha” moments?
Looks like it did feature such moments in the past. The METR graph that you quote had a GPT-4 to GPT-4o plateau, and all subsequent models used CoTs and longer context windows and rapidly increased compute spending on RL. This strategy began to crumble when Claude Opus 4 (which didn't even reach SOTA on time horizon), Grok 4, and GPT-5 failed to follow the faster 4o-o3[1] trend.
something deep about the nature of large tasks vs. small tasks, and the cognitive skills that people and LLMs bring to each.
A human brain, unlike current AIs, has a well-developed dynamic memory which is OOMs bigger (and OOMs worse trained, forcing evolution to use high learning rates) than current context windows or CoTs, let alone the number of neurons in a layer of an LLM. What if the key to AGI lies in a similar direction?
However, METR observed that trend using 4o-o1, because o3 had yet to be released. Another complication is that the set of METR's tasks is no longer as reliable as it once was, potentially causing us to underestimate the models' abilities.
A take I haven't seen yet is that scaling our way to AI that can automate away jobs might fail for fundamentally prosaic reasons, and that new paradigms might be needed not because of fundamental AI failures, but because scaling compute starts slowing down when we can't convert general chips into AI chips.
This doesn't mean the strongest versions of the scaling hypothesis were right, but I do want to point out that fundamental changes in paradigm can happen for prosaic reasons, and I expect a lot of people to underestimate how much progress was made in the AI summer, even if it isn't the case that imitative learning scales to AGI with realistic compute and data.
But if this were true, you’d think they’d be able to handle ARC-AGI puzzles (see the example image just above)
In a footnote you note that models do well on ARC-AGI-1, but I think your description of the situation is misleading:
Overall, I think LLMs do handle ARC-AGI puzzles. They are well within the human range for ARC-AGI-1/2 and their failures are pretty often perception failures.
Fair enough if your objection is that the level of sample efficiency on this type of task for typical humans isn't sufficient. (I agree.)
Maybe they’re only good at picking up ideas from an example, if they’d already learned that idea during their original training? In other words, maybe in-context learning is helpful at jogging their memory, but not for teaching new concepts.
My view is that LLMs are generally qualitatively dumber than the most capable humans in a bunch of ways (including ability to learn new things), but that this is improving over time. There isn't some dichotomy between "sample efficient learning" and not. I think you'll struggle to find tasks where AIs haven't been improving by following the heuristic "what haven't AIs already learned" (though AIs do gain an advantage by knowing lots of stuff, they are also improving at all kinds of stuff).
AIs trained on the training set of ARC-AGI (and given a bunch of compute) can beat humans on ARC-AGI-1.
Say more? At https://arcprize.org/leaderboard, I see "Stem Grad" at 98% on ARC-AGI-1, and the highest listed AI score is 75.7% for "o3-preview (Low)". I vaguely recall seeing a higher reported figure somewhere for some AI model, but not 98%.
ARC-AGI-2 isn't easy for humans. It's hard for humans and AIs probably do similarly to random humans (e.g. mturkers) given a limited period.
This post states that the "average human" scores 60% on ARC-AGI-2, though I was unable to verify the details (it claims to be a linkpost for an article which does not seem to contain that figure). Personally I tried 10-12 problems when the test was first launched, and IIRC I missed either 1 or 2.
The leaderboard shows "Grok 4 (Thinking)" on top at 16%... and, unfortunately, does not present data for "Stem Grad" or "Avg. Mturker" (in any case I'm not sure what I think of the latter as a baseline here).
Agreed that perception challenges may be badly coloring all of these results.
There isn't some dichotomy between "sample efficient learning" and not.
Agreed, but (as covered in another comment – thanks for all the comments!), I do have the intuition that the AI field is not currently progressing toward rapid improvement on sample efficient learning, and may currently be heading toward a fairly low local maximum.
Say more? At https://arcprize.org/leaderboard, I see "Stem Grad" at 98% on ARC-AGI-1, and the highest listed AI score is 75.7% for "o3-preview (Low)". I vaguely recall seeing a higher reported figure somewhere for some AI model, but not 98%.
By "can beat humans", I mean AIs are well within the human range, probably somewhat better than the average/median human in the US at ARC-AGI-1. In this study, humans get 65% right on the public evaluation set.
This post states that the "average human" scores 60% on ARC-AGI-2, though I was unable to verify the details (it claims to be a linkpost for an article which does not seem to contain that figure). Personally I tried 10-12 problems when the test was first launched, and IIRC I missed either 1 or 2.
I'm skeptical; I bet mturkers do worse. This is very similar to the score that was found for humans on ARC-AGI-1 (which, from my understanding, is much easier) in this study.
By "hard for humans", I just mean that it takes substantially effort even for somewhat smart humans, I don't mean that humans can't do it.
Many thanks for sharing your reflections. I found them very valuable and agree with most of them fully or mostly, though you are much closer to LLM foundation model development than I am, so this is just a humble impression on my side (I have been an intensive user of LLMs since early 2022, in the data space and for understanding things and society/policy/economy developments, and we plan to integrate LLMs into a SaaS we want to develop). I was also surprised to read that you were active in the 80s already (which makes you even older than me; I was born in 1968 😉).
Anyway, I wanted to share some reflections back with you (inserted after the numbers from your text; the quotations didn't survive the reformatting of what I pasted in, my apologies), mostly around sample-efficient learning and a bit on other points. If you find them useful, please use them; I would also be happy to hear your feedback:
Sample-Efficient Learning
on 10. and 11.
Marc: I think it is a contributor: filtering out fluff. But intelligence is also, or even more so, about combining and transferring the things we have learned in comprehensive and complex ways.
12.
Marc: possibly largely (with the filtering, and the interpreting and classifying during intake, being a key part, I think).
13 and 14.
Marc: this fits my interpretation above: GPUs see all the pixels, but they don't interpret or filter the pattern; they store it all, and the pattern emerges (very inefficiently) from the sheer number of pixel “bags”.
15.
Marc: That is not the same, of course. It can obviously compensate for a lot, just as thinking does very effectively and also rather efficiently, but hallucinations and easily losing context/track (more than smart humans would) appear to be the inherent price to pay.
16.
Marc: In my experience humans can do this; I have worked across domains and found this to be true for me.
18.
Marc: This fits my interpretation above: many more facts/data, but not explicitly filtered/interpreted&classified during ingestion.
19.
Marc: I think it is context pollution from inner reflection plus context window size, and then getting lost in complexity: too many different sub-contexts are mixed up in a single “thinking” pass, where each wrong turn derails the whole effort, due to a lack of de-pollution measures and a lack of retro-inspection from outside the active context window - at least that is my understanding of how this is currently done, and it could even be changed rather easily! I wish I had moved into the AI field as a developer - I feel this fits my way of thinking and the kind of problems I like to solve.
30.
Marc: I would have some ideas for how to improve this situation in LLMs – the tricks I am aware of now are arguably unsuitable (here humans are indeed better, but I think something similar can be achieved, even with a mix of LLM & ML).
36.
Marc: Indeed. There are even more reasons why this graphic does not tell us much, really (yes, I should name them - perhaps CAPEX/OPEX ratios, available cash to invest, narrowness of topic, scale of economic expectations, who finances it, etc. - but you already named several, so that should suffice). But putting things in perspective is always compelling (including, clearly, to me - even if only by making me realise that other perspectives would be needed).
Another thought: I think AGI and ASI are not defined/understood as they should be - the current framing is overly anthropocentric - but why?
Note: No LLM was abused - or even used - in writing this feedback 😉
Amidst the unrelenting tumult of AI news, it’s easy to lose track of the bigger picture. Here are some ideas that have been developing quietly in the back of my head, about the path from here to AGI.
Back in the 80s and 90s, I used to attend SIGGRAPH, the annual computer graphics conference. The highlight of the week was always the film show, a two-hour showcase demonstrating the latest techniques. It was a mix of academic work and special-effects clips from unreleased Hollywood movies.
Every year, the videos would include some important component that had been missing the year before. Shadows! Diffuse lighting! Interaction of light with texture! I’d gaze upon the adventurer bathed in flickering torchlight, and marvel at how real it looked. Then the next year I’d laugh at how cartoonish that adventurer’s hair had been, after watching a new algorithm that simulated the way hair flows when people move.
I think AI is a little like that: we’re so (legitimately!) impressed by each new model that we can’t see what it lacks… until an even-better model comes along. As I said when I first started blogging about AI: as we progress toward an answer to the question “can a machine be intelligent?”, we are learning as much about the question as we are about the answer.
(Case in point: in the press briefing for the GPT-5 launch, Sam Altman said that we’ll have AGI when AIs get continuous learning. I’ve never heard him point to that particular gap before.)
Here are some examples of sample efficiency in humans: learning to drive a car in a few dozen hours. Figuring out the rule in an ARC-AGI task from just a couple of examples. Learning the ropes at a new job. Sussing out the key trick to solve a difficult mathematical problem. Are these all basically the same skill?
AIs have been demonstrating what arguably constitutes superhuman performance on FrontierMath, a set of extremely difficult mathematical problems. But they mostly seem to do it “the wrong way”: instead of finding elegant solutions, they either rely on knowledge of some obscure theorem that happens to make the problem much easier, or grind out a lengthy brute-force answer.
Does this matter? I mean, if you get the answer, then you get the answer. But in mathematics, much of the value in finding a proof is the insights you acquired along the way. If AIs begin knocking off unsolved problems in mathematics, but in ways that don’t provide insight, perhaps we’ll still need mathematicians to do the real work of advancing the overall field. Or maybe, once AIs can solve these problems at all, it’ll be a short step to solving them with insight? My instinct is that it’s not a short step, but that could be cope. In any case, the big question is what this tells us about AI’s potential in applications other than mathematics. What portion of human activity requires real insight?
The best public estimate is that GPT-4 has 1.8 trillion “parameters”, meaning that its neural network has that many connections. In the two and a half years since it was released, it’s not clear that any larger models have been deployed (GPT-4.5 and Grok 3 might be somewhat larger).
The human brain is far more complex than this; the most common estimate is 100 trillion connections, and each connection is probably considerably more complex than the connections in current neural networks. In other words, the brain has far more information storage capacity than any current artificial neural network.
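A crude back-of-envelope version of that comparison (assuming, purely for illustration, about 2 bytes per LLM parameter and at least one parameter's worth of state per synapse):

```python
# Back-of-envelope capacity comparison. Both assumptions are illustrative only:
# ~2 bytes per LLM parameter (16-bit weights) and >= 1 "parameter" per synapse.
gpt4_params = 1.8e12       # public estimate discussed above
brain_synapses = 1e14      # common estimate for the human brain
print(f"GPT-4 weights: ~{gpt4_params * 2 / 1e12:.1f} TB")                   # ~3.6 TB
print(f"synapse-to-parameter ratio: ~{brain_synapses / gpt4_params:.0f}x")  # ~56x
```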
Which leads to the question: how the hell do LLMs manage to learn and remember so many more raw facts than a person[4]?
One possible answer: perhaps models learn things in a shallower way, one that allows for more compact representations but limits their ability to apply what they've learned in creative, insightful, novel ways. Perhaps this also has something to do with their poor sample efficiency.
Solving Large Problems
When projecting the future of AI, many people look at this graph:
It shows that the size of software engineering tasks an AI can complete has been roughly doubling every 7 months. This trend has held for over 5 years (arguably[5]), during which the achievable task size increased from around 3 seconds to around 2 hours. Why should the trend be so steady?
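A minimal sketch of what that trend implies, using the approximate endpoints quoted above (3 seconds to 2 hours, doubling every 7 months):

```python
import math

# How many doublings separate a ~3-second task horizon from a ~2-hour one,
# and how long does that take at one doubling per ~7 months? All three
# numbers are the approximate figures quoted in the text.
start_seconds = 3
end_seconds = 2 * 3600
doubling_months = 7

doublings = math.log2(end_seconds / start_seconds)    # ~11.2 doublings
months = doublings * doubling_months                  # ~78 months
print(f"{doublings:.1f} doublings over ~{months / 12:.1f} years")
```

Roughly 11 doublings at 7 months each works out to about six and a half years, consistent with the "over 5 years" window in the graph.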
It’s not obvious that the difficulty that AI will have in completing a task should increase steadily with the size of the task. If a robot could assemble 10 Ikea bookshelves, it could assemble 20 bookshelves. If a coding agent can create a form with 10 fields, it can probably create a form with 20 fields. Why is it that if an AI can complete a 10-minute project, it still may not be able to complete a 20-minute project? And why does the relationship between time and difficulty hold steady across such a wide range of times?
I think the predictable(-ish) trend of AIs tackling larger and larger software engineering tasks has something to do with large tasks containing a fractal distribution of subtasks. There is a fuzzy collection of tactical and strategic skills involved, ranging from “write a single line of code” to “design a high-level architecture that breaks up a one-month project into smaller components that will work well together”. Larger tasks require high-level skills that are more difficult for AIs (and people) to master, but every task requires a mix of skills, tasks of the same size can involve different mixes (building one fancy model airplane vs. 20 bookshelves), and the fuzzy overlaps smooth out the graph.
Everyone is sharing this graph, which compares the level of investment in railroads in the 1880s, telecommunications infrastructure during the dot-com bubble, and AI data centers today:
The usual takeaway is: wow, the AI boom (or bubble) is bigger than the dot-com bubble. I don’t understand why people aren’t focusing more on the fact that railroad investments peaked at three times the dot-com boom and AI datacenter rollout put together. Holy shit, the 1880s must have been absolutely insane. The people of that time must have really believed the world was changing, to be willing to sustain that level of investment. (I’ve seen arguments that the pace of change in the late 1800s and early 1900s made our current era seem positively static. Steam power, electricity, railroads, the telegraph, telephones, radio, etc. This graph makes that a bit more visceral.)
(I also wonder whether these numbers may turn out to be wrong. There’s at least one obvious error – the dot-com boom took place around 2000, not 2020. Some commentator noted that older GDP figures may be misleading because the informal economy used to play a much larger role. When a startling statistic spreads like wildfire across the Internet, it often turns out to be incorrect.)
Quick reminder that the regular application deadline for The Curve is next Friday, August 22nd! In case you missed it: on October 3-5, in Berkeley, we’ll bring together ~250 folks with a wide range of backgrounds and perspectives for productive discussions on the big, contentious questions in AI. Featuring Jack Clark, Jason Kwon, Randi Weingarten, Dean Ball, Helen Toner, and many more great speakers! If you’d like to join us, fill out this form.
Thanks to Taren for feedback and images.
This quote is from a restatement of the paradox by Steven Pinker. Moravec’s original statement, in 1988:
It is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.
Large Language Models, such as GPT.
Yes, some models can now post high scores on the original ARC-AGI-1 test, but they still struggle with ARC-AGI-2 and ARC-AGI-3. Also, yes, it seems likely that one reason models struggle on ARC-AGI problems is that they don’t have much experience looking at pixelated images. But I still stand by the observation that models seem to only be selectively skilled at in-context learning.
I asked ChatGPT, Claude, and Gemini to compare the number of “facts” known by a typical adult to a frontier LLM. They all estimated a few million for people, and a few billion for LLMs. To arrive at those estimates, they engaged in handwaving so vigorous as to affect the local weather, so take with a grain of salt. (ChatGPT transcript, Claude transcript, Gemini transcript)
The data does suggest that the rate of progress has accelerated recently, perhaps to a 4 month doubling time, but this is debated and there isn’t enough data to be confident in either direction.
Though some of the relevant data had been available for several decades.