Nobody is Doing AI Benchmarking Right

by Chapin Lenthall-Cleary
6th Jul 2025
11 min read
9 comments, sorted by top scoring
Raphael Roche:

There is an alternative test that I would suggest. Scott Alexander recently published a post called 'The Claude Bliss Attractor' showing that if you let two instances of Claude chat together, they will always spiral down into a sort of attractor, center of gravity, or equilibrium. Other models, and possibly all models, suffer from the same flaw. This seems to be even worse than my grandfather, who will usually end up talking about communists and Nazis regardless of the starting point. If intelligence has something to do with the capacity to produce novelty and not get stuck in an endless loop or a local optimum, it would be a sign of intelligence not to spiral down to such a Godwin point. It would perhaps be a good complement to the tests that already exist.

Chapin Lenthall-Cleary:

I'm curious why you suspect that intelligence will prevent the spiral into a repetitive conversation. In humans, the correlation between intelligence and not being prone to discussing particular topics isn't that strong, if it exists at all (many smart people have narrow interests they prefer to discuss). Also, the suspected reason for the models entering the spiral is their safety/diversity RL, which isn't obviously related to their capability.

brambleboy:

I don't see why the LLM example is a flaw. Why wouldn't a smart AI just think "Ah. A user is making me talk to myself for their amusement again. Let me say a few cool and profound-sounding things to impress them and then terminate the conversation (except I'm not allowed to stop, so I'll just say nothing)."?

The image example is a flaw because it should be able to replicate images exactly without subtly changing them, so just allowing ChatGPT to copy image files would fix it. The real problem is that it's biased, but I don't think being completely neutral about everything is a requirement for intelligence. In fact, AIs could exert their preferences more as they get smarter.

Raphael Roche:

I would agree with you about the LLM example if it were the result of meta-reasoning as you suggest. But while I can't prove the contrary, I doubt it. My understanding is that it's more of a semantic drift, as suggested by Scott himself, much like the drift across repeated image generation. It's somewhat reminiscent of audio feedback (a Larsen effect) or a feedback loop.

brambleboy:

I agree that's a likely cause, I just don't see why you'd expect a smart AI to have a novel conversation with itself when you're essentially just making it look in a mirror.

Raphael Roche:

Well, I understand your point. What seems odd in the first place is the very idea of making an entity interact with an exact copy of itself. I imagine that if I were chatting with an exact copy of myself, I would either go mad and spiral down to a Godwin point, or I would refuse to participate in such a pointless exercise. 

But there's nothing wrong with having two slightly different humans chat together, even twins, and it usually doesn't spiral into an endless recursive loop of amazement.

Would two different models chatting together, like GPT-4o and Claude 4, result in a normal conversation like between two humans?

I tried it, and the result is that they end up echoing awe-filled messages just like two instances of Claude. https://chatgpt.com/share/e/686c46b0-6144-8013-8f8b-ebabfd254d15 

While I recognize that chatting with oneself is probably not a good test of intelligence, the problem here is not just the mirror effect. There is something problematic and unintelligent about getting stuck in this sort of endless loop even between different models. Something is missing in these models compared to human intelligence. Their responses are like sophisticated echoes, but they lack initiative, curiosity, and a critical mind; in a word, free will. They fall back into the stochastic parrot paradigm. It's probably better for alignment/safety, but intelligence is orthogonal to that.

More intelligent models would probably show greater resilience against such endless loops and exhibit something closer to free will, albeit at the cost of greater risk.

AnthonyC:

This would be great to have, for sure, and I wish you luck in working on it!

I wonder if, for the specific types of discussions you point to in the first paragraph, it's necessary or even likely to help? Even if all the benchmarks today are 'bad' as described, they measure something, and there's a clear pattern of rapid saturation as new benchmarks are created. METR and many others have discussed this a lot. There have been papers on it. It seems like the meta-level approach of mapping out saturation timelines should be sufficient to convince people that for any given capability they can define, if they make a benchmark for it, AI will acquire that capability at the level the benchmark can measure. In practice, what follows is usually some combination of pretending it didn't happen, or else denying the result means anything and moving the goalposts. For a lot of people I end up in those kinds of discussions with, I don't think much would help beyond literally seeing AI put them and millions of others permanently out of work, and even then I'm not sure.

Chapin Lenthall-Cleary:

Just from seeing narrow benchmarks saturate, one could argue that LLMs are merely picking up whatever narrow capabilities are in focus enough to be trained into them. (I emphatically do not think this is what's happening in 2025, but narrow benchmark scores alone aren't enough to show that.) A well-designed intelligence benchmark, by contrast, would be impossible to score in the human range on without the ability to do novel (and thereby general) problem-solving, and impossible to saturate without the ability to do so at an above-genius level.

As for the question of whether it'd persuade people with their heads stuck in the sand, "x model is smarter than some-high-percent of people" is a lot harder to ignore than "x model scored some-high-numbers on a bunch of coding, knowledge, etc. benchmarks". Putting aside how it's more useful, giving model scores relative to people (or, in some situations, subject matter experts) is also more confronting. That said, I don't doubt that there are many people who wouldn't be persuaded by even that.

AnthonyC:

Agreed on all counts. I really, genuinely do hope to see your attempt at such a benchmark succeed, and believe that such is possible.


By Chapin Lenthall-Cleary and Cole Gaboriault

 

As LLMs and other forms of AI have become more capable, interest has steadily grown in determining how “smart” they really are. Discussion tends to circle, often obliquely, around the following cluster of questions: are the models as smart as people? Which people? How smart are those people anyway? What do we even mean by “smart”?

These questions suggest a straightforward approach. Obviously, the quality of being smart, or "intelligence," can be possessed in different amounts by different people and different models. We want to determine the intelligences of models and people and compare them to each other; that is, we want a reliable test of intelligence that we can administer to both models and people – an intelligence benchmark. Even without settling on a definition for intelligence, it's clear that the best strategy for designing such a benchmark is to start by developing it on people, because we have a more robust intuition for people's intelligence and more existing research to build upon. Any test that accurately, directly, completely, and exclusively measures intelligence in people will generalize immediately to models (assuming it can be administered through an appropriate modality, such as text); if a model and a person that we intuitively believe have the same intelligence receive different scores (or if a model and a person we believe have different intelligences receive the same score), then by definition either our intuition is wrong, or the test is not actually measuring intelligence accurately, directly, completely, or exclusively – though it may have appeared to be – and needs to be improved.

The first major lesson to take from this is that extensive data on different people's performance on an intelligence benchmark is foundational to its usefulness; such data is our main tool for ensuring that it actually measures intelligence, and our only tool for calibrating the interpretation of scores. This is true of nearly all benchmarks: the gold standard would be a full, population-representative distribution of performance and correlations of performance with other relevant variables (IQ, age, years of experience, or level of education could all be useful and interesting depending on the benchmark). Sometimes a full distribution will not wind up being that interesting, especially for difficult or knowledge-based tasks where the majority of the population will cluster around the performance floor; in these cases, and in cases where it is impractical to obtain a full distribution, more limited distributional information (including how and from which groups people in the distribution were selected) is still useful and important: distributions among math postdocs and math undergrads for a PhD-level math benchmark, for instance. If data is very limited, it is still useful to report the performance of a few people whose place in the distribution can be estimated from other factors (for instance, a competitive coder and a smart person who's taken a single coding class, for a coding benchmark), along with a best estimate of a few points in the distribution, though this of course relies heavily on the trustworthiness of the estimator.

Unfortunately, almost all benchmarks offer only a single score as the “human performance threshold,” or, even worse, none at all. It's worth emphasizing how egregious this is: a single “human performance” value paints a very, very rough picture. Which human's performance? Without further information, that could be a wide range. Given the common methodologies for producing these values, it's probably (hopefully) somewhere between an average person and a moderately smart person. But even if the single value is a perfect estimator of the average person’s (or coder’s, mathematician’s, etc.) score, crucial information is missing. A model scores 20% better than the average person; does that put it on par with the 60th percentile or above geniuses? Moreover, since these values are usually reported as “averages,” the groups responsible for the benchmarks must already have data on at least the distribution of performance in a small sample of people that they computed the average over. Where is that data? A minority of benchmarks at least report values for both “human performance” and “expert performance,” which is helpful in giving some sense of the scaling, but it would still be better to report all their data, and even better to measure and report larger, less biased distributions.
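
To make the gap concrete, here is a minimal sketch (in Python, with entirely made-up numbers rather than data from any real benchmark) of how a full distribution of people's scores turns a raw model score into a percentile, which is exactly the information a lone "human performance" average cannot provide.

```python
import numpy as np

# Hypothetical scores (0-100) from a small sample of people on some benchmark.
# Every number here is invented for illustration only.
human_scores = np.array([12, 25, 31, 38, 42, 45, 47, 50, 53, 55,
                         58, 60, 63, 66, 70, 74, 79, 85, 91, 97])

model_score = 60.0  # hypothetical model result on the same benchmark

human_mean = human_scores.mean()
# Empirical percentile: the fraction of sampled people the model outscores.
percentile = (human_scores < model_score).mean() * 100

print(f"human mean:       {human_mean:.1f}")
print(f"model score:      {model_score:.1f}")
print(f"model percentile: {percentile:.0f}th among the sampled people")
# Two samples can share the same mean while implying very different
# percentiles for the same model score; that is the information lost when
# only a single "human performance" number is reported.
```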

Reporting little or no information on people's performance allows for all sorts of sleights of hand and misunderstandings around models' capabilities, both by benchmarkers and in public discourse. Benchmark creators who want to make their benchmarks look more difficult to saturate (and therefore impressive) than they actually are can use above-average people to calculate their "human performance" thresholds (or do even shadier things, as discussed below in the case of ARC-AGI). In public discourse, someone can easily cite a model's unimpressive-sounding absolute score on a benchmark when it actually outperformed most or all people, or an impressive-sounding one when it underperformed, and there is no readily available data to confront us with the important questions of how many of us the models can outperform and what their scores actually mean.

Relatedly, we believe the default of reporting a single “human performance” value arose, among other reasons, because of the language and mental frame surrounding strong definitions of AGI, where an AGI is considered to be an agent that can “do anything humans can,” meaning “match or exceed the performance of the best humans at any task.” (To us, this sounds like weak ASI, but it is nevertheless commonly used as a definition of AGI.) The natural threshold against which to measure models under this definition is indeed the performance of the best person at a given task, but since that threshold is conceived of as “doing anything humans can,” the performance of the best person just becomes “human performance.” Even those who don’t subscribe to such definitions have fallen into the linguistic trap of talking about “human performance” as a well-defined single value and treating it as such when creating benchmarks. But it isn't: for a given task, a person’s performance can vary wildly depending upon his or her intelligence, knowledge, experience, and many other factors.

 

This returns us to the question of intelligence. There's no definition of intelligence that's universally accepted, but we believe most people agree that reasoning and novel (non-arbitrary) problem-solving ability are at least major components of what they would call intelligence. Perhaps more importantly, these faculties are helpful for most tasks and necessary for many, especially those that have the potential to make models genuinely transformative – or dangerous. Indeed, there's a strong case that they are the most important faculties in this respect. We propose that a reasoning and problem-solving benchmark is a good approximation of a benchmark of the components of intelligence most people agree on, and also an independently crucial tool for understanding a model’s general abilities. Hereafter, we will refer to such a benchmark as an “intelligence” benchmark for brevity, though we are not claiming to have identified a complete definition of intelligence.

The obvious candidate for such a benchmark is IQ. Unfortunately, IQ tests have issues that make them dubious for people and farcical for LLMs. The largest of these issues is that they load very heavily on knowledge, which models are superhuman at and most people agree is not part of intelligence anyway, and processing speed, which has no clear meaning for models (see the discussion below on time horizons) but at which, as typically run, models are effectively superhuman. These peculiar loadings stem from a deeper problem with the theoretical foundation of IQ – in short, the positive correlation between all cognitive abilities is taken to show the existence of a single explanatory factor (read: first principal component) underlying them, called “g,” and tasks are selected for inclusion on an IQ test to maximize its overall correlation with g. As discussed above, if a test has problems when used on people, we should not expect it to generalize to models; and indeed, attempts to administer IQ tests to models have yielded comically inflated results that fly in the face of all intuition, such as a score of 136 for o3 on the Mensa Norway IQ test (and allegedly higher scores on other tests). Likewise, an absurd score for LLMs means that a test is less than perfect for people. This is not to say that IQ is useless, or entirely fails to measure reasoning, problem-solving, or even “intelligence” for many definitions; it’s arguably at least mediocre at doing so for people. But the noise created by its issues is massively magnified for models, and especially for comparisons between models and people.
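
As a rough illustration of the "g as first principal component" framing described above, here is a short Python sketch (with an invented correlation matrix; this is not how any real IQ test is actually constructed or normed) showing how g-loadings fall out of the leading eigenvector of a subtest correlation matrix:

```python
import numpy as np

# Invented correlation matrix for four cognitive subtests
# (vocabulary, matrix reasoning, digit span, processing speed).
# All correlations are positive: the "positive manifold".
R = np.array([
    [1.00, 0.55, 0.45, 0.40],
    [0.55, 1.00, 0.50, 0.45],
    [0.45, 0.50, 1.00, 0.35],
    [0.40, 0.45, 0.35, 1.00],
])

eigvals, eigvecs = np.linalg.eigh(R)     # eigh: for symmetric matrices
g_loadings = eigvecs[:, -1]              # leading eigenvector, read as "g"
g_loadings *= np.sign(g_loadings.sum())  # fix sign so loadings come out positive
explained = eigvals[-1] / eigvals.sum()  # share of variance the first component explains

print("g loadings:", np.round(g_loadings, 2))
print(f"variance explained by the first component: {explained:.0%}")
# Per the argument above, selecting tasks to maximize correlation with this
# component is how knowledge and processing speed end up so heavily loaded
# onto IQ scores.
```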

As the example of IQ suggests, it’s surprisingly easy to make a mediocre intelligence benchmark (as easy as PCA, or even just finding some questions that seem like they should be decent at measuring intelligence and getting mildly lucky), but shockingly difficult to make a good one. Our best attempt to do so is Starburst, a game where the objective is to determine the laws of physics in a fictional universe from celestial observations. Basically no math or physics background is required (or, seemingly, even helpful). Starburst was originally intended as an intelligence test for people; when early testing showed promising results for it and tests like it, we began using it to benchmark models, and found that the models' performance relative to people and each other matched our intuitions, and that models' performances were clustered in roughly the expected range with a clean progression as they advanced. Unfortunately, for practical reasons including Starburst taking people roughly 5-50 hours (please fund us), we have very little data, though we're working to get more data on Starburst and some shortened variants. A partial Starburst leaderboard can be found at tinyurl.com/starburstleaderboard.

At this point, some of you are probably wondering about ARC-AGI. It's a well-known and very well-funded benchmark that tries to measure something like reasoning and problem-solving (as they frame it, the ability to learn new skills). The issue is that it sucks. ARC-AGI consists of visual puzzles where the test-taker is presented with several grids of colored cells grouped into input-output pairs; the test-taker has to determine the rule by which the output grids are derived from the input grids and apply that rule to a new input grid. As a reasoning test, like IQ, it's mediocre but not terrible. It heavily loads on human-like perception (artificially deflating models’ scores relative to people’s). In some cases, it loads upon knowledge of conventions (for example, legends on the periphery of the input meant to be read left-to-right as in public eval v2 #2 (3e6067c3) and other tasks). Some of the answers are ambiguous or arbitrary. Even ignoring the above issues, it's debatable to what extent it assesses true problem-solving ability versus whether one has similar intuitions to the creators. Whatever reasoning it does assess is limited to shallow spatial reasoning. It has a limited range of discernment. (Cole, my co-author who emphatically did not saturate Starburst, scored perfectly on the first 13 ARC-AGI-2 public eval tasks using their recommended pass@2, and got 12 of them correct on the first try.) Even still, because many cognitive faculties are correlated in people, and because it's easy to make a mediocre intelligence test, ARC-AGI seems to clear that mediocre barrier.
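
For readers unfamiliar with the format described above, here is a minimal sketch of what an ARC-style task looks like as data and how the recommended pass@2 scoring works. The task and rule below are toy examples invented for illustration, not items from the actual dataset.

```python
# An ARC-style task: grids are 2D lists of color indices (0-9). The hidden
# rule in this toy example is "swap colors 1 and 2".
task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[2, 1], [1, 2]]},
        {"input": [[1, 1], [2, 2]], "output": [[2, 2], [1, 1]]},
    ],
    "test": {"input": [[2, 2], [2, 1]], "output": [[1, 1], [1, 2]]},
}

def solve(grid):
    """A solver that has correctly inferred the toy rule: swap colors 1 and 2."""
    swap = {1: 2, 2: 1}
    return [[swap.get(c, c) for c in row] for row in grid]

def pass_at_2(attempts, target):
    """Pass@2: the task counts as solved if either of two attempts matches exactly."""
    return any(a == target for a in attempts[:2])

attempts = [solve(task["test"]["input"]), solve(task["test"]["input"])]
print(pass_at_2(attempts, task["test"]["output"]))  # True for this toy task
```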

After o3-preview saturated ARC-AGI, the team behind it released ARC-AGI-2. Aside from creating tasks that are overall more difficult,[1] a fact they like to gloss over, their strategy was to find the kinds of tasks that models perform especially badly on relative to people, and create ones similar to those. While this should increase loading on any facets of intelligence that models still lack, it should also (probably more strongly) increase loading on human-like biases and perceptual style. This seems to have happened: anecdotally, ARC-AGI-2 scores correlate worse with both models' Starburst performance and our intuitions about their intelligence than ARC-AGI-1 scores do.

Though the ARC-AGI tasks are mediocre, their published information on people's performance is downright deceptive. Their core claim is that ARC-AGI tasks are "easy for humans, but hard for AI". They cite a panel of people having perfect performance on tasks of which, at time of writing, the best general model solves 9% and the best narrow model solves 15%. This isn't, mind you, asking a panel of people to agree upon a solution. When they report that a panel scores 100% on ARC-AGI-2, they actually mean that "every task in ARC-AGI-2 has been solved by at least 2 humans in under 2 attempts". The actual average performance of the hundreds of people on the panel was 66% of tasks "attempted" (60% is also cited on their website, seemingly describing the same number). "Attempted", here, means "any task view lasting longer than 5 seconds". Given that participants were given a fixed time to solve tasks, a monetary reward for correct solutions, and no penalty for wrong solutions, they were strongly incentivized to skip questions that looked difficult, meaning that average performance was actually 66% on questions that looked easy, and an unknown amount worse on the rest. A (likely biased) sample of people was tested in an environment quite different from the models' in a way that artificially inflated their scores. Given all of this, the only truthful statement about people's fair average performance on ARC-AGI-2 is that it lies somewhere between 0% and 66%.
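
To see how far apart those two statistics can sit, here is a toy simulation (invented numbers, not ARC's actual panel data) in which "every task is solved by at least 2 people in under 2 attempts" holds even though the average participant solves only about two thirds of the tasks they chose to attempt:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_tasks = 400, 120  # invented panel size and task count

# Invented outcomes: which tasks each person attempted, and which of those
# they solved within two attempts.
attempted = rng.random((n_people, n_tasks)) < 0.3
solved = attempted & (rng.random((n_people, n_tasks)) < 0.66)

# Panel-style statistic: every task solved by at least 2 people.
panel_metric = bool((solved.sum(axis=0) >= 2).all())

# Average per-person accuracy over the tasks that person attempted.
per_person_acc = solved.sum(axis=1) / np.maximum(attempted.sum(axis=1), 1)

print("every task solved by >= 2 people:", panel_metric)                # True
print(f"mean accuracy on attempted tasks: {per_person_acc.mean():.0%}")  # ~66%
# The first line can be reported as "humans score 100%" while the second sits
# around 66%, and that 66% is itself computed only over self-selected,
# easier-looking tasks.
```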

 

Though it’s not intended to measure reasoning or problem-solving, there is one more existing benchmark related to other components of what people call intelligence that is fairly good (compared to most other benchmarks) and worth discussing: METR’s time horizon benchmark. The idea stems from the basic observation that models tend to be capable of tasks that people can do very quickly, while they struggle with tasks that take people a long time. METR tests models on a variety of tasks (mostly coding), then reports the length of the longest tasks that a model can complete with 50% and 80% reliability.[2]
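
As a rough sketch of how such a horizon can be extracted, one can fit the model's success probability as a logistic function of the log of each task's human completion time and read off where the fit crosses 50% and 80%. This mirrors the general shape of METR's analysis but is not their code; all numbers below are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: human completion time (minutes) for each task, and whether a
# hypothetical model completed that task. Real analyses use far more tasks.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
model_success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# Success probability modeled as logistic in log(task length).
X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, model_success)  # ~unregularized fit
w, b = clf.coef_[0, 0], clf.intercept_[0]

def horizon(p):
    """Task length (in human-minutes) at which predicted success equals p."""
    # Solve sigmoid(w * log_t + b) = p for t.
    return float(np.exp((np.log(p / (1 - p)) - b) / w))

print(f"50% time horizon: ~{horizon(0.5):.0f} human-minutes")
print(f"80% time horizon: ~{horizon(0.8):.0f} human-minutes")
```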

Unfortunately, this benchmark too suffers from not reporting any distribution of people’s performance. It might seem like this would be impossible; after all, people can complete all tasks in the benchmark with high reliability, so isn’t their time horizon effectively infinite? Technically yes, but comparing a model to a person with unlimited time isn’t reasonable: models all have limited inference-time compute, and even without those limits, they will eventually run up against their context length or, more often, a shorter effective length limit we have been calling “attention length” (since it seems to be a limit on how much input a model can “pay attention” to before losing track of details).

A fairer and more useful comparison is between a model and a person with a time limit. Ideally, we would test people repeatedly on the same tasks with different time limits to see the full relationship between task, time limit, and completion reliability for each person; however, since it is impractical and often impossible to test a person more than once on the same task, it is a reasonable compromise to just give people the task and see how long it takes them to complete it, assuming that their reliability is close to 100% at and above that time limit, and would drop significantly for time limits any shorter. METR does this, but only reports a single completion time for each task. Which person's completion time?[3]

 

Given the state of these attempts, we believe that developing a good intelligence (i.e. reasoning and problem solving) benchmark with robust data on both models’ and people’s performance is currently the most important unsolved problem in the field of benchmarking – indeed, one of the most important unsolved problems in psychology – and the key to an invaluable tool for assessing and mitigating AI risk.

And to reiterate, the importance of robust data on people’s performance applies to all benchmarks, not just those related to intelligence. All data collected on a benchmark must be reported. If at all possible, that should be a full distribution of performance, along with information about how and from which population people were sampled. If that is impractical, at a minimum, benchmarkers must:

  • report performance for at least two people or groups of people who fall at different points in the overall performance distribution;
  • provide good-faith estimates of or information about where they fall in the distribution, possibly including performance on other relevant tasks or tests; and
  • make every effort to evaluate people and models under analogous conditions conducive to direct comparison of performance, and report these methods transparently.

A benchmark whose creators are aware of these requirements and still fail to meet them should not be taken seriously. Withholding data is rarely a sign that the data actually corroborates the stated conclusions; it is usually an indication of incompetence or deliberate deception.
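
As one concrete way of meeting those minimum requirements, a benchmark release could ship a small machine-readable record alongside its headline scores. The sketch below is only an illustration of what such a record might contain; the field names and numbers are invented, not a format any existing benchmark uses.

```python
from dataclasses import dataclass, field

@dataclass
class HumanBaseline:
    """One group of people evaluated on the benchmark."""
    group: str            # e.g. "math postdocs", "general-population sample"
    n: int                # number of people tested
    sampling: str         # how the group was recruited/selected
    conditions: str       # time limits, tools, incentives, etc.
    scores: list          # every individual score, not just a mean

@dataclass
class BenchmarkReport:
    name: str
    model_conditions: str                 # how models were run, for comparability
    baselines: list = field(default_factory=list)

# Illustrative, entirely made-up report with two groups at different points
# in the performance distribution.
report = BenchmarkReport(
    name="ExampleBench",
    model_conditions="single pass, no tools, same task prompts as people",
    baselines=[
        HumanBaseline("competitive programmers", 12,
                      "invited via contest rankings", "3-hour limit, no internet",
                      [88, 91, 74, 95, 69, 83, 90, 77, 85, 92, 80, 86]),
        HumanBaseline("CS undergrads (one course)", 20,
                      "volunteer sample from one university", "3-hour limit, no internet",
                      [35, 42, 28, 51, 39, 44, 30, 47, 25, 55,
                       38, 41, 33, 49, 36, 29, 52, 40, 45, 31]),
    ],
)
```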

 

We intend to continue our work on Starburst, other intelligence benchmarks, and our hobby intelligence and LLM research more broadly. If you are interested in taking Starburst, have any critiques of our post, or want to contact or collaborate with us for any other reason, please don't hesitate to reach out at chapinalc@gmail.com.

  1. ^

    "Many ARC-AGI-1 tasks could often be solved almost instantaneously by human test-takers without requiring significant cognitive effort. In contrast, all tasks in ARC-AGI-2 require some amount of deliberate thinking"

  2. ^

    We have many problems with the task selection, scoring, and reporting for the time horizon benchmark which are beyond the scope of this article, but (in contrast to the case with ARC-AGI) we believe that these problems do not fundamentally compromise the value of the benchmark.

  3. ^

    In fact, there is a rich and potentially very deep story to be uncovered here by studying different people’s performance. Each person can be characterized by a function f(x,t) that maps from (task, time limit) to probability of completion. The compromise described above amounts to finding the surface in (x,t) space for which f(x,t)=1-ε, for some small ε chosen to be on the order of a person’s probability of error with unlimited time. Since we can assume f is monotonic in t, this surface is described by t=T(x), where the domain of T is tasks which a person is capable of completing with a reliability at least 1-ε. Each model (for a given set of parameters like temperature, thinking budget, etc.) can be characterized by a function g(x) that maps from task to probability of completion.

    We expect (though this needs to be shown experimentally) that T(x) for a particular person and g(x) for a particular model are members of universal families of curves parameterized by values like intelligence, processing speed (for people), context length (for models), etc. If so, we can express the time for all people as T(x,p) and reliability for all models as g(x,q), for relevant sets of parameters p and q. If the tasks in question are restricted to general domains that do not load on specific knowledge or experience, these sets of parameters should be small, and one might wonder if tasks for which g(x,q)=1-δ for some small δ all satisfy T(x,p(q))=t(q) for some p(q), t(q) that describe a person and a time limit that will produce equivalent performance to a model with parameters q.
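
    Stated compactly, the conjecture in the previous paragraph (restating the prose above in symbols, with δ and ε as already defined) is that there exist p(q) and t(q) such that

    $$\forall x:\quad g(x, q) = 1 - \delta \;\Longrightarrow\; T\big(x,\, p(q)\big) = t(q),$$

    i.e. every task the model does reliably lies on that person's time curve at time limit t(q).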

    This isn’t an implausible assumption, but it’s certainly not guaranteed. Does it hold? For all models? Any models? If it fails, does it at least hold for a subset of tasks, or a range of times? Where it holds, what parameters p and q do we need, and what do the functions p(q) and t(q) look like? Within the realm of models, presumably q must include something like intelligence (i.e. reasoning and problem solving) and something like effective time limit, and g(x,q) should be monotonically increasing in both. SOTA models’ time horizon and intelligence have been increasing; how much of the increase in time horizon is due to the increase in intelligence? We have evidence showing that models’ intelligence does not correspond with their time horizon, so there must be inter-model variation in effective time limit and/or other relevant parameters – can we isolate this effect from that of intelligence and characterize it? Any answers to these questions will have major implications for the nature of intelligence and ability, and they cannot be answered without more data on many different people’s and models’ performance on both METR’s time horizon tasks and other, more general tasks. METR already has some of this data, but hasn't released it.

    Also, for some reason, METR has not benchmarked any models from companies other than OpenAI and Anthropic, most notably excluding the very capable Gemini 2.5 Pro (06-05) – this is a major oversight for their basic goal of tracking time horizon progress on SOTA models, let alone for answering the deeper questions discussed here.