snewman — LessWrong

Software engineer and repeat startup founder; best known for Writely (aka Google Docs). Now starting https://www.aisoup.org to foster constructive expert conversations about open questions in AI and AI policy, and posting at https://amistrongeryet.substack.com and https://x.com/snewmanpv.

AIs trained on the training set of ARC-AGI (and given a bunch of compute) can beat humans on ARC-AGI-1.

Say more? At https://arcprize.org/leaderboard, I see "Stem Grad" at 98% on ARC-AGI-1, and the highest listed AI score is 75.7% for "o3-preview (Low)". I vaguely recall seeing a higher reported figure somewhere for some AI model, but not 98%.

ARC-AGI-2 isn't easy for humans. It's hard for humans and AIs probably do similarly to random humans (e.g. mturkers) given a limited period.

This post states that the "average human" scores 60% on ARC-AGI-2, though I was unable to verify the details (it claims to be a linkpost for an article which does not seem to contain that figure). Personally I tried 10-12 problems when the test was first launched, and IIRC I missed either 1 or 2.

The leaderboard shows "Grok 4 (Thinking)" on top at 16%... and, unfortunately, does not present data for "Stem Grad" or "Avg. Mturker" (in any case I'm not sure what I think of the latter as a baseline here).

Agreed that perception challenges may be badly coloring all of these results.

There isn't some dichotomy between "sample efficient learning" and not.

Agreed, but (as covered in another comment – thanks for all the comments!), I do have the intuition that the AI field is not currently progressing toward rapid improvement on sample efficient learning, and may currently be heading toward a fairly low local maximum.

Yeah, I was probably too glib here. I was extrapolating from the results of the competition Epoch organized at MIT, where "o4-mini-medium outperformed the average human team, but worse than the combined score across all teams, where we look at the fraction of problems solved by at least one team". This was AI vs. teams of people (rather than any one individual person), and it was only o4-mini, but none of those people were Terence Tao, and it only outperformed the average team.
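To make the two baselines in that quote concrete, here is a minimal sketch with made-up data (the team names, problem IDs, and scores below are hypothetical, not Epoch's results). It shows how an AI can beat the average team while still trailing the "at least one team solved it" union score.

```python
# Hypothetical data only: which problems each team (and the AI) solved.
NUM_PROBLEMS = 8
team_solved = {
    "team_a": {1, 2, 3},
    "team_b": {2, 4},
    "team_c": {1, 5, 6, 7},
}
ai_solved = {1, 2, 3, 4}


def fraction(solved):
    """Fraction of the problem set covered by a set of solved problem IDs."""
    return len(solved) / NUM_PROBLEMS


# Baseline 1: mean score across individual teams.
average_team = sum(fraction(s) for s in team_solved.values()) / len(team_solved)

# Baseline 2: problems solved by at least one team (union across teams).
union_of_teams = fraction(set().union(*team_solved.values()))

ai_score = fraction(ai_solved)

print(f"average team:   {average_team:.3f}")
print(f"AI:             {ai_score:.3f}")
print(f"union of teams: {union_of_teams:.3f}")
```

With these invented numbers the AI lands between the two baselines, which is the pattern the quote describes.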

I would be fascinated to see how well he'd actually perform in the scenario you describe, but presumably we're not going to find out.

if you compared AIs to a group of humans who are pretty good at this type of math, the humans would probably also destroy the AI.

I wonder? Given that, to my understanding, each FrontierMath problem is deep in a different subfield of mathematics. But I have no understanding of the craft of advanced / research mathematics, so I have no intuition here.

Anyway, I think we may be agreeing on the main point here: my suggestion that LLMs solve FrontierMath problems "the wrong way", and your point about depth arguably being more important than breadth, seem to be pointing in the same direction.

I do believe that AIs will eventually surpass humans at fluid intelligence, though I'm highly uncertain as to the timeline.

My point here is really just the oft-repeated observation that when we see an AI do X, intuitively we tend to assess the AI the way we would assess a human being who could do X, and that intuition can lead to very poor estimates of whether the AI can also do Y (for instance, bar exam → practicing law). Similarly, the relative ratios of fluid vs. crystallized intelligence may capture much of the reason that AIs are approaching superhuman status at competition coding problems but are still far from superhuman at many real-world coding tasks. It doesn't mean AIs will never get to real-world tasks. It just suggests (to me) that they might be farther from that milestone than their performance on crystallized-intelligence-friendly tasks would imply.

Agreed that I have not supported my claims here – this was a vibes piece.

I agree that LLMs are improving at ~everything, but my intuition is that some of those improvements – for instance, regarding continuous learning – may currently be of the "climbing a ladder to get closer to the moon" variety. Sounds like we just have very different intuitions here.

I'm having trouble parsing what you've said here in a way that makes sense to me. Let me try to lay out my understanding of the facts very explicitly, and you can chime in with disagreements / corrections / clarifications:

The human brain has, very roughly, 100B neurons (nodes) and 100T synapses (connections). Each synapse represents at least one "parameter", because connections can have different strengths. I believe there are arguments that it would in fact take multiple parameters to characterize a synapse (connection strength + recovery time + sensitivity to various neurotransmitters + ???), and I'm sympathetic to this idea on the grounds that everything in the body turns out to be more complicated than you think, but I don't know much about it.

Regarding GPT-4, I believe the estimate was that it has 1.8 trillion parameters, which if shared weights are used may not precisely correspond to connections or FLOPs. For purposes of information storage ("learning") capacity, parameter count seems like the correct metric to focus on? (In the post, I equated parameters with connections, which is incorrect in the face of shared weights, but does not detract from the main point, unless you disagree with my claim that parameter count is the relevant metric here.)
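As a rough back-of-the-envelope check on the figures above, the sketch below compares the two parameter counts. Both inputs are the rough estimates quoted in this comment (100T synapses, 1.8T parameters), and `params_per_synapse` is a hypothetical knob for the "multiple parameters per synapse" idea, not a measured quantity.

```python
# Rough estimates quoted in the comment; neither is a precise measurement.
BRAIN_SYNAPSES = 100e12          # ~100 trillion synapses
GPT4_PARAMS_ESTIMATE = 1.8e12    # ~1.8 trillion parameters (unconfirmed estimate)


def brain_to_model_ratio(params_per_synapse=1.0):
    """Ratio of brain 'parameters' to GPT-4 parameters.

    params_per_synapse is a hypothetical multiplier: 1.0 treats each synapse
    as a single connection strength; larger values model extra per-synapse state.
    """
    return BRAIN_SYNAPSES * params_per_synapse / GPT4_PARAMS_ESTIMATE


for k in (1, 3, 10):
    print(f"{k:>2} param(s) per synapse -> ratio ~{brain_to_model_ratio(k):.0f}x")
```

At one parameter per synapse the ratio is a bit over 55x, and it scales linearly with whatever per-synapse multiplier one finds plausible.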

To your specific points:

Probably the effective number of parameters in the human brain is actually lower than 100 trillion because many of these "parameters" are basically randomly initialized or mostly untrained. (Or are trained very slowly/weakly.) The brain can't use a global learning algorithm, so it might effectively use parameters much less efficiently.

What is your basis for this intuition? LLM parameters are randomly initialized. Synapses might start with better-than-random starting values, I have no idea, but presumably not worse than random. LLMs and brains both then undergo a training process; what makes you think that the brain is likely to do the worse job of training its available weights, or that many synapses are "mostly untrained"?

Also note that the brain has substantial sources of additional parameters that we haven't accounted for yet: deciding which synapses to prune (out of the much larger early-childhood count), which connections to form in the first place (the connection structure of an LLM can be described in a relative handful of bits, while the connection structure of the brain has an enormous number of free parameters; I don't know how "valuable" those parameters are, but natural systems are clever!), and where to add additional connections later in life.

It's a bit confusing to describe GPT-4 as having 1.8 trillion connections as 1.8 trillion is the number of floating point operations (roughly) not the number of neurons.

I never mentioned neurons. 1.8 trillion is, I believe, the best estimate for GPT-4's parameter count. Certainly we know that the largest open-weight models have parameter counts of this order of magnitude (somewhat smaller but not an OOM smaller). As noted, I forgot about shared weights when equating parameters to connections, but again I don't think that matters here. FLOPs to my understanding would correspond to connections (and not parameter counts, if shared weights are used), but I don't think FLOPs are relevant here either.
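To illustrate why shared weights decouple parameter count from connection count, here is a generic sketch. It compares a dense layer with a 1D convolution; the layer sizes are arbitrary assumptions, and this is not a description of GPT-4's actual architecture.

```python
def dense_layer(n_in, n_out):
    """Fully connected layer: every connection has its own weight."""
    params = n_in * n_out
    connections = n_in * n_out
    return params, connections


def conv1d_layer(n_in, kernel_size):
    """1D 'valid' convolution with one filter: the same kernel_size weights
    are reused at every output position (weight sharing)."""
    n_out = n_in - kernel_size + 1
    params = kernel_size
    connections = n_out * kernel_size  # roughly one multiply-add per connection
    return params, connections


print("dense :", dense_layer(1000, 1000))
print("conv1d:", conv1d_layer(1000, 5))
```

In the dense case parameters and connections coincide; with sharing, connections (and hence multiply-adds) can far exceed the number of stored parameters, which is why storage capacity is better tracked by parameter count.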

In general, the analogy between the human brain and LLMs is messy because a single neuron probably has far fewer learned parameters than an LLM neuron, but plausibly somewhat more than a single floating point number.

GPT-5 estimates that GPT-4 had just O(100M) neurons. Take that figure with a grain of salt, but I mention it to point out that in both modern LLMs and the human brain, there are far more connections / synapses than nodes / neurons, and the vast majority of parameters will be associated with connections, not nodes. (Which is why I didn't mention neurons in the post, and I don't think it's useful to talk about learned parameters in reference to neurons.)

Nice analysis. I can't add anything substantive, but this writeup crystallized for me just how much we're all focusing on METR's horizon lengths work. On the one hand, it's the best data set we have at the moment for quantitative extrapolation, so of course we should focus on it. On the other hand, it's only one data set, and could easily turn out to not imply what we think it implies.

My only points are (a) we shouldn't weight the horizon length trends too heavily, and (b) boy do we need additional metrics that are both extrapolatable, and plausibly linked to actual outcomes of interest.

Thanks. This is helpful, but my intuition is substantially coming from the idea that there might be other factors involved (activities / processes involved in improving models that aren't "thinking about algorithms", "writing code", or "analyzing data"). In other words, I have a fair amount of model uncertainty, especially when thinking about very large speedups.

quantity of useful environments that AI companies have

Meaning, the number of distinct types of environments they've built (e.g. one to train on coding tasks, one on math tasks, etc.)? Or the number of instances of those environments they can run (e.g. how much coding data they can generate)?

GPT-4.5 is going to be quickly deprecated

It's still a data point saying that OpenAI chose to do a large training run, though, right? Even if they're currently not planning to make sustained use of the resulting model in deployment. (Also, my shaky understanding is that expectations are for a GPT-5 to be released in the coming months and that it may be a distilled + post-trained derivative of GPT-4.5, meaning GPT-5 would be downstream of a large-compute-budget training process?)
