Nobody is Doing AI Benchmarking Right

by Chapin Lenthall-Cleary
6th Jul 2025
11 min read
9 comments, sorted by top scoring
Raphael Roche:

There is an alternative test that I would suggest. Scott Alexander recently published a post called 'The Claude Bliss Attractor' showing that if you let two instances of Claude chat together, they will always spiral down into a sort of attractor, center of gravity, or equilibrium. Other models, and possibly all models, suffer from the same flaw. This seems to be even worse than my grandfather, who will usually end up talking about communists and Nazis regardless of the starting point. If intelligence has something to do with the capacity to produce novelty and not get stuck in an endless loop or a local optimum, it would be a sign of intelligence not to spiral down to such a Godwin point. It would perhaps be a good complement to the tests that already exist.

Chapin Lenthall-Cleary:

I'm curious why you suspect that intelligence will prevent the spiral into a repetitive conversation. In humans, the correlation between intelligence and not being prone to discussing particular topics isn't that strong, if it exists at all (many smart people have narrow interests they prefer to discuss). Also, the suspected reason for the models entering the spiral is their safety/diversity RL, which isn't obviously related to their capability.

brambleboy:

I don't see why the LLM example is a flaw. Why wouldn't a smart AI just think "Ah. A user is making me talk to myself for their amusement again. Let me say a few cool and profound-sounding things to impress them and then terminate the conversation (except I'm not allowed to stop, so I'll just say nothing)."?

The image example is a flaw because it should be able to replicate images exactly without subtly changing them, so just allowing ChatGPT to copy image files would fix it. The real problem is that it's biased, but I don't think being completely neutral about everything is a requirement for intelligence. In fact, AIs could exert their preferences more as they get smarter.

Raphael Roche:

I would agree with you about the LLM example if it were the result of meta-reasoning as you suggest. But while I can't prove the contrary, I doubt it. My understanding is that it's more of a semantic drift, as suggested by Scott himself, much like the drift across repeated image generation. It's somewhat reminiscent of audio feedback (a Larsen effect) or a feedback loop.

brambleboy:

I agree that's a likely cause, I just don't see why you'd expect a smart AI to have a novel conversation with itself when you're essentially just making it look in a mirror.

Raphael Roche:

Well, I understand your point. What seems odd in the first place is the very idea of making an entity interact with an exact copy of itself. I imagine that if I were chatting with an exact copy of myself, I would either go mad and spiral down to a Godwin point, or I would refuse to participate in such a pointless exercise. 

But there's nothing wrong with having two slightly different humans chat together, even twins, and it usually doesn't spiral into an endless recursive loop of amazement.

Would two different models chatting together, like GPT-4o and Claude 4, result in a normal conversation like between two humans?

I tried it, and the result is that they end up echoing awe-filled messages just like two instances of Claude. https://chatgpt.com/share/e/686c46b0-6144-8013-8f8b-ebabfd254d15 

While I recognize that chatting with oneself is probably not a good test of intelligence, the problem here is not just the mirror effect. There is something problematic and unintelligent about getting stuck in this sort of endless loop even between different models. Something is missing in these models compared to human intelligence. Their responses are like sophisticated echoes, but they lack initiative, curiosity, and a critical mind; in a word, free will. They fall back into the stochastic parrot paradigm. It's probably better for alignment/safety, but intelligence is orthogonal to that.

More intelligent models would probably show greater resilience against such endless loops and exhibit something closer to free will, albeit at the cost of greater risk.

AnthonyC:

This would be great to have, for sure, and I wish you luck in working on it!

I wonder if, for the specific types of discussions you point to in the first paragraph, it's necessary or even likely to help? Even if all the benchmarks today are 'bad' as described, they measure something, and there's a clear pattern of rapid saturation as new benchmarks are created. METR and many others have discussed this a lot. There have been papers on it. It seems like the meta-level approach of mapping out saturation timelines should be sufficient to convince people that for any given capability they can define, if they make a benchmark for it, AI will acquire that capability at the level the benchmark can measure. In practice, what follows is usually some combination of pretending it didn't happen, or else denying the result means anything and moving the goalposts. For a lot of people I end up in those kinds of discussions with, I don't think much would help beyond literally seeing AI put them and millions of others permanently out of work, and even then I'm not sure.

Chapin Lenthall-Cleary:

Just from seeing narrow benchmarks saturate, one could argue that LLMs are merely picking up whatever narrow capabilities are in focus enough to be trained into them. (I emphatically do not think this is what's happening in 2025, but narrow benchmark scores alone aren't enough to show that.) A well-designed intelligence benchmark, by contrast, would be impossible to score in the human range on without the ability to do novel (and thereby general) problem-solving, and impossible to saturate without the ability to do so at an above-genius level.

As for the question of whether it'd persuade people with their heads stuck in the sand, "x model is smarter than some-high-percent of people" is a lot harder to ignore than "x model scored some-high-numbers on a bunch of coding, knowledge, etc. benchmarks". Putting aside how it's more useful, giving model scores relative to people (or, in some situations, subject matter experts) is also more confronting. That said, I don't doubt that there are many people who wouldn't be persuaded by even that.

AnthonyC:

Agreed on all counts. I really, genuinely do hope to see your attempt at such a benchmark succeed, and believe that such is possible.


By Chapin Lenthall-Cleary and Cole Gaboriault

 

As LLMs and other forms of AI have become more capable, interest has steadily grown in determining how “smart” they really are. Discussion tends to circle, often obliquely, around the following cluster of questions: are the models as smart as people? Which people? How smart are those people anyway? What do we even mean by “smart”?

These questions suggest a straightforward approach. Obviously, the quality of being smart, or "intelligence," can be possessed in different amounts by different people and different models. We want to determine the intelligences of models and people and compare them to each other; that is, we want a reliable test of intelligence that we can administer to both models and people – an intelligence benchmark. Even without settling on a definition for intelligence, it's clear that the best strategy for designing such a benchmark is to start by developing it on people, because we have a more robust intuition for people's intelligence and more existing research to build upon. Any test that accurately, directly, completely, and exclusively measures intelligence in people will generalize immediately to models (assuming it can be administered through an appropriate modality, such as text); if a model and a person that we intuitively believe have the same intelligence receive different scores (or if a model and a person we believe have different intelligences receive the same score), then by definition either our intuition is wrong, or the test is not actually measuring intelligence accurately, directly, completely, or exclusively – though it may have appeared to be – and needs to be improved.

The first major lesson to take from this is that extensive data on different people's performance on an intelligence benchmark is foundational to its usefulness; such data is our main tool for ensuring that it actually measures intelligence, and our only tool for calibrating the interpretation of scores. This is true of nearly all benchmarks: the gold standard would be a full, population-representative distribution of performance and correlations of performance with other relevant variables (IQ, age, years of experience, or level of education could all be useful and interesting depending on the benchmark). Sometimes a full distribution will not wind up being that interesting, especially for difficult or knowledge-based tasks where the majority of the population will cluster around the performance floor; in these cases, and in cases where it is impractical to obtain a full distribution, more limited distributional information (including how and from which groups people in the distribution were selected) is still useful and important: distributions among math postdocs and math undergrads for a PhD-level math benchmark, for instance. If data is very limited, it is still useful to report the performance of a few people whose place in the distribution can be estimated from other factors (for instance, a competitive coder and a smart person who's taken a single coding class, for a coding benchmark), along with a best estimate of a few points in the distribution, though this of course relies heavily on the trustworthiness of the estimator.

Unfortunately, almost all benchmarks offer only a single score as the “human performance threshold,” or, even worse, none at all. It's worth emphasizing how egregious this is: a single “human performance” value paints a very, very rough picture. Which human's performance? Without further information, that could be a wide range. Given the common methodologies for producing these values, it's probably (hopefully) somewhere between an average person and a moderately smart person. But even if the single value is a perfect estimator of the average person’s (or coder’s, mathematician’s, etc.) score, crucial information is missing. A model scores 20% better than the average person; does that put it on par with the 60th percentile or above geniuses? Moreover, since these values are usually reported as “averages,” the groups responsible for the benchmarks must already have data on at least the distribution of performance in a small sample of people that they computed the average over. Where is that data? A minority of benchmarks at least report values for both “human performance” and “expert performance,” which is helpful in giving some sense of the scaling, but it would still be better to report all their data, and even better to measure and report larger, less biased distributions.
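
To make the gap concrete, here is a minimal sketch (in Python, with entirely made-up numbers rather than data from any real benchmark) of how a full distribution of people's scores turns a raw model score into a percentile, which is exactly the information a lone "human performance" average cannot provide.

```python
import numpy as np

# Hypothetical scores (0-100) from a small sample of people on some benchmark.
# Every number here is invented for illustration only.
human_scores = np.array([12, 25, 31, 38, 42, 45, 47, 50, 53, 55,
                         58, 60, 63, 66, 70, 74, 79, 85, 91, 97])

model_score = 60.0  # hypothetical model result on the same benchmark

human_mean = human_scores.mean()
# Empirical percentile: the fraction of sampled people the model outscores.
percentile = (human_scores < model_score).mean() * 100

print(f"human mean:       {human_mean:.1f}")
print(f"model score:      {model_score:.1f}")
print(f"model percentile: {percentile:.0f}th among the sampled people")
# Two samples can share the same mean while implying very different
# percentiles for the same model score; that is the information lost when
# only a single "human performance" number is reported.
```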

Reporting little or no information on people's performance allows for all sorts of sleights of hand and misunderstandings around models' capabilities, both by benchmarkers and in public discourse. Benchmark creators who want to make their benchmarks look more difficult to saturate (and therefore impressive) than they actually are can use above-average people to calculate their "human performance" thresholds (or do even shadier things, as discussed below in the case of ARC-AGI). In public discourse, someone can easily cite a model's unimpressive-sounding absolute score on a benchmark when it actually outperformed most or all people, or an impressive-sounding one when it underperformed, and there is no readily available data to confront us with the important questions of how many of us the models can outperform and what their scores actually mean.

Relatedly, we believe the default of reporting a single “human performance” value arose, among other reasons, because of the language and mental frame surrounding strong definitions of AGI, where an AGI is considered to be an agent that can “do anything humans can,” meaning “match or exceed the performance of the best humans at any task.” (To us, this sounds like weak ASI, but it is nevertheless commonly used as a definition of AGI.) The natural threshold against which to measure models under this definition is indeed the performance of the best person at a given task, but since that threshold is conceived of as “doing anything humans can,” the performance of the best person just becomes “human performance.” Even those who don’t subscribe to such definitions have fallen into the linguistic trap of talking about “human performance” as a well-defined single value and treating it as such when creating benchmarks. But it isn't: for a given task, a person’s performance can vary wildly depending upon his or her intelligence, knowledge, experience, and many other factors.

 

This returns us to the question of intelligence. There's no definition of intelligence that's universally accepted, but we believe most people agree that reasoning and novel (non-arbitrary) problem-solving ability are at least major components of what they would call intelligence. Perhaps more importantly, these faculties are helpful for most tasks and necessary for many, especially those that have the potential to make models genuinely transformative – or dangerous. Indeed, there's a strong case that they are the most important faculties in this respect. We propose that a reasoning and problem-solving benchmark is a good approximation of a benchmark of the components of intelligence most people agree on, and also an independently crucial tool for understanding a model’s general abilities. Hereafter, we will refer to such a benchmark as an “intelligence” benchmark for brevity, though we are not claiming to have identified a complete definition of intelligence.

The obvious candidate for such a benchmark is IQ. Unfortunately, IQ tests have issues that make them dubious for people and farcical for LLMs. The largest of these issues is that they load very heavily on knowledge, which models are superhuman at and most people agree is not part of intelligence anyway, and processing speed, which has no clear meaning for models (see the discussion below on time horizons) but at which, as typically run, models are effectively superhuman. These peculiar loadings stem from a deeper problem with the theoretical foundation of IQ – in short, the positive correlation between all cognitive abilities is taken to show the existence of a single explanatory factor (read: first principal component) underlying them, called “g,” and tasks are selected for inclusion on an IQ test to maximize its overall correlation with g. As discussed above, if a test has problems when used on people, we should not expect it to generalize to models; and indeed, attempts to administer IQ tests to models have yielded comically inflated results that fly in the face of all intuition, such as a score of 136 for o3 on the Mensa Norway IQ test (and allegedly higher scores on other tests). Likewise, an absurd score for LLMs means that a test is less than perfect for people. This is not to say that IQ is useless, or entirely fails to measure reasoning, problem-solving, or even “intelligence” for many definitions; it’s arguably at least mediocre at doing so for people. But the noise created by its issues is massively magnified for models, and especially for comparisons between models and people.
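
As a rough illustration of the "g as first principal component" framing described above, here is a short Python sketch (with an invented correlation matrix; this is not how any real IQ test is actually constructed or normed) showing how g-loadings fall out of the leading eigenvector of a subtest correlation matrix:

```python
import numpy as np

# Invented correlation matrix for four cognitive subtests
# (vocabulary, matrix reasoning, digit span, processing speed).
# All correlations are positive: the "positive manifold".
R = np.array([
    [1.00, 0.55, 0.45, 0.40],
    [0.55, 1.00, 0.50, 0.45],
    [0.45, 0.50, 1.00, 0.35],
    [0.40, 0.45, 0.35, 1.00],
])

eigvals, eigvecs = np.linalg.eigh(R)     # eigh: for symmetric matrices
g_loadings = eigvecs[:, -1]              # leading eigenvector, read as "g"
g_loadings *= np.sign(g_loadings.sum())  # fix sign so loadings come out positive
explained = eigvals[-1] / eigvals.sum()  # share of variance the first component explains

print("g loadings:", np.round(g_loadings, 2))
print(f"variance explained by the first component: {explained:.0%}")
# Per the argument above, selecting tasks to maximize correlation with this
# component is how knowledge and processing speed end up so heavily loaded
# onto IQ scores.
```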

As the example of IQ suggests, it’s surprisingly easy to make a mediocre intelligence benchmark (as easy as PCA, or even just finding some questions that seem like they should be decent at measuring intelligence and getting mildly lucky), but shockingly difficult to make a good one. Our best attempt to do so is Starburst, a game where the objective is to determine the laws of physics in a fictional universe from celestial observations. Basically no math or physics background is required (or, seemingly, even helpful). Starburst was originally intended as an intelligence test for people; when early testing showed promising results for it and tests like it, we began using it to benchmark models, and found that the models' performance relative to people and each other matched our intuitions, and that models' performances were clustered in roughly the expected range with a clean progression as they advanced. Unfortunately, for practical reasons including Starburst taking people roughly 5-50 hours (please fund us), we have very little data, though we're working to get more data on Starburst and some shortened variants. A partial Starburst leaderboard can be found at tinyurl.com/starburstleaderboard.

At this point, some of you are probably wondering about ARC-AGI. It's a well-known and very well-funded benchmark that tries to measure something like reasoning and problem-solving (as they frame it, the ability to learn new skills). The issue is that it sucks. ARC-AGI consists of visual puzzles where the test-taker is presented with several grids of colored cells grouped into input-output pairs; the test-taker has to determine the rule by which the output grids are derived from the input grids and apply that rule to a new input grid. As a reasoning test, like IQ, it's mediocre but not terrible. It heavily loads on human-like perception (artificially deflating models’ scores relative to people’s). In some cases, it loads upon knowledge of conventions (for example, legends on the periphery of the input meant to be read left-to-right as in public eval v2 #2 (3e6067c3) and other tasks). Some of the answers are ambiguous or arbitrary. Even ignoring the above issues, it's debatable to what extent it assesses true problem-solving ability versus whether one has similar intuitions to the creators. Whatever reasoning it does assess is limited to shallow spatial reasoning. It has a limited range of discernment. (Cole, my co-author who emphatically did not saturate Starburst, scored perfectly on the first 13 ARC-AGI-2 public eval tasks using their recommended pass@2, and got 12 of them correct on the first try.) Even still, because many cognitive faculties are correlated in people, and because it's easy to make a mediocre intelligence test, ARC-AGI seems to clear that mediocre barrier.
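
For readers unfamiliar with the format described above, here is a minimal sketch of what an ARC-style task looks like as data and how the recommended pass@2 scoring works. The task and rule below are toy examples invented for illustration, not items from the actual dataset.

```python
# An ARC-style task: grids are 2D lists of color indices (0-9). The hidden
# rule in this toy example is "swap colors 1 and 2".
task = {
    "train": [
        {"input": [[1, 2], [2, 1]], "output": [[2, 1], [1, 2]]},
        {"input": [[1, 1], [2, 2]], "output": [[2, 2], [1, 1]]},
    ],
    "test": {"input": [[2, 2], [2, 1]], "output": [[1, 1], [1, 2]]},
}

def solve(grid):
    """A solver that has correctly inferred the toy rule: swap colors 1 and 2."""
    swap = {1: 2, 2: 1}
    return [[swap.get(c, c) for c in row] for row in grid]

def pass_at_2(attempts, target):
    """Pass@2: the task counts as solved if either of two attempts matches exactly."""
    return any(a == target for a in attempts[:2])

attempts = [solve(task["test"]["input"]), solve(task["test"]["input"])]
print(pass_at_2(attempts, task["test"]["output"]))  # True for this toy task
```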

After o3-preview saturated ARC-AGI, the team behind it released ARC-AGI-2. Aside from creating tasks that are overall more difficult,[1] a fact they like to gloss over, their strategy was to find the kinds of tasks that models perform especially badly on relative to people, and create ones similar to those. While this should increase loading on any facets of intelligence that models still lack, it should also (probably more strongly) increase loading on human-like biases and perceptual style. This seems to have happened: anecdotally, ARC-AGI-2 scores correlate worse with both models' Starburst performance and our intuitions about their intelligence than ARC-AGI-1 scores do.

Though the ARC-AGI tasks are mediocre, their published information on people's performance is downright deceptive. Their core claim is that ARC-AGI tasks are "easy for humans, but hard for AI". They cite a panel of people having perfect performance on tasks of which, at time of writing, the best general model solves 9% and the best narrow model solves 15%. This isn't, mind you, asking a panel of people to agree upon a solution. When they report that a panel scores 100% on ARC-AGI-2, they actually mean that "every task in ARC-AGI-2 has been solved by at least 2 humans in under 2 attempts". The actual average performance of the hundreds of people on the panel was 66% of tasks "attempted" (60% is also cited on their website, seemingly describing the same number). "Attempted", here, means "any task view lasting longer than 5 seconds". Given that participants were given a fixed time to solve tasks, a monetary reward for correct solutions, and no penalty for wrong solutions, they were strongly incentivized to skip questions that looked difficult, meaning that average performance was actually 66% on questions that looked easy, and an unknown amount worse on the rest. A (likely biased) sample of people was tested in an environment quite different from the models' in a way that artificially inflated their scores. Given all of this, the only truthful statement about people's fair average performance on ARC-AGI-2 is that it lies somewhere between 0% and 66%.
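
To see how far apart those two statistics can sit, here is a toy simulation (invented numbers, not ARC's actual panel data) in which "every task is solved by at least 2 people in under 2 attempts" holds even though the average participant solves only about two thirds of the tasks they chose to attempt:

```python
import numpy as np

rng = np.random.default_rng(0)
n_people, n_tasks = 400, 120  # invented panel size and task count

# Invented outcomes: which tasks each person attempted, and which of those
# they solved within two attempts.
attempted = rng.random((n_people, n_tasks)) < 0.3
solved = attempted & (rng.random((n_people, n_tasks)) < 0.66)

# Panel-style statistic: every task solved by at least 2 people.
panel_metric = bool((solved.sum(axis=0) >= 2).all())

# Average per-person accuracy over the tasks that person attempted.
per_person_acc = solved.sum(axis=1) / np.maximum(attempted.sum(axis=1), 1)

print("every task solved by >= 2 people:", panel_metric)                # True
print(f"mean accuracy on attempted tasks: {per_person_acc.mean():.0%}")  # ~66%
# The first line can be reported as "humans score 100%" while the second sits
# around 66%, and that 66% is itself computed only over self-selected,
# easier-looking tasks.
```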

 

Though it’s not intended to measure reasoning or problem-solving, there is one more existing benchmark related to other components of what people call intelligence that is fairly good (compared to most other benchmarks) and worth discussing: METR’s time horizon benchmark. The idea stems from the basic observation that models tend to be capable of tasks that people can do very quickly, while they struggle with tasks that take people a long time. METR tests models on a variety of tasks (mostly coding), then reports the length of the longest tasks that a model can complete with 50% and 80% reliability.[2]
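
As a rough sketch of how such a horizon can be extracted, one can fit the model's success probability as a logistic function of the log of each task's human completion time and read off where the fit crosses 50% and 80%. This mirrors the general shape of METR's analysis but is not their code; all numbers below are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented data: human completion time (minutes) for each task, and whether a
# hypothetical model completed that task. Real analyses use far more tasks.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
model_success = np.array([1, 1, 1, 1, 1, 1, 0, 1, 0, 0])

# Success probability modeled as logistic in log(task length).
X = np.log(human_minutes).reshape(-1, 1)
clf = LogisticRegression(C=1e6).fit(X, model_success)  # ~unregularized fit
w, b = clf.coef_[0, 0], clf.intercept_[0]

def horizon(p):
    """Task length (in human-minutes) at which predicted success equals p."""
    # Solve sigmoid(w * log_t + b) = p for t.
    return float(np.exp((np.log(p / (1 - p)) - b) / w))

print(f"50% time horizon: ~{horizon(0.5):.0f} human-minutes")
print(f"80% time horizon: ~{horizon(0.8):.0f} human-minutes")
```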

Unfortunately, this benchmark too suffers from not reporting any distribution of people’s performance. It might seem like this would be impossible; after all, people can complete all tasks in the benchmark with high reliability, so isn’t their time horizon effectively infinite? Technically yes, but comparing a model to a person with unlimited time isn’t reasonable: models all have limited inference-time compute, and even without those limits, they will eventually run up against their context length or, more often, a shorter effective length limit we have been calling “attention length” (since it seems to be a limit on how much input a model can “pay attention” to before losing track of details).

A fairer and more useful comparison is between a model and a person with a time limit. Ideally, we would test people repeatedly on the same tasks with different time limits to see the full relationship between task, time limit, and completion reliability for each person; however, since it is impractical and often impossible to test a person more than once on the same task, it is a reasonable compromise to just give people the task and see how long it takes them to complete it, assuming that their reliability is close to 100% at and above that time limit, and would drop significantly for time limits any shorter. METR does this, but only reports a single completion time for each task. Which person's completion time?[3]

 

Given the state of these attempts, we believe that developing a good intelligence (i.e. reasoning and problem solving) benchmark with robust data on both models’ and people’s performance is currently the most important unsolved problem in the field of benchmarking – indeed, one of the most important unsolved problems in psychology – and the key to an invaluable tool for assessing and mitigating AI risk.

And to reiterate, the importance of robust data on people’s performance applies to all benchmarks, not just those related to intelligence. All data collected on a benchmark must be reported. If at all possible, that should be a full distribution of performance, along with information about how and from which population people were sampled. If that is impractical, at a minimum, benchmarkers must:

  • report performance for at least two people or groups of people who fall at different points in the overall performance distribution;
  • provide good-faith estimates of or information about where they fall in the distribution, possibly including performance on other relevant tasks or tests; and
  • make every effort to evaluate people and models under analogous conditions conducive to direct comparison of performance, and report these methods transparently.

A benchmark whose creators are aware of these requirements and still fail to meet them should not be taken seriously. Withholding data is rarely a sign that the data actually corroborates the stated conclusions; it is usually an indication of incompetence or deliberate deception.
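
As one concrete way of meeting those minimum requirements, a benchmark release could ship a small machine-readable record alongside its headline scores. The sketch below is only an illustration of what such a record might contain; the field names and numbers are invented, not a format any existing benchmark uses.

```python
from dataclasses import dataclass, field

@dataclass
class HumanBaseline:
    """One group of people evaluated on the benchmark."""
    group: str            # e.g. "math postdocs", "general-population sample"
    n: int                # number of people tested
    sampling: str         # how the group was recruited/selected
    conditions: str       # time limits, tools, incentives, etc.
    scores: list          # every individual score, not just a mean

@dataclass
class BenchmarkReport:
    name: str
    model_conditions: str                 # how models were run, for comparability
    baselines: list = field(default_factory=list)

# Illustrative, entirely made-up report with two groups at different points
# in the performance distribution.
report = BenchmarkReport(
    name="ExampleBench",
    model_conditions="single pass, no tools, same task prompts as people",
    baselines=[
        HumanBaseline("competitive programmers", 12,
                      "invited via contest rankings", "3-hour limit, no internet",
                      [88, 91, 74, 95, 69, 83, 90, 77, 85, 92, 80, 86]),
        HumanBaseline("CS undergrads (one course)", 20,
                      "volunteer sample from one university", "3-hour limit, no internet",
                      [35, 42, 28, 51, 39, 44, 30, 47, 25, 55,
                       38, 41, 33, 49, 36, 29, 52, 40, 45, 31]),
    ],
)
```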

 

We intend to continue our work on Starburst, other intelligence benchmarks, and our hobby intelligence and LLM research more broadly. If you are interested in taking Starburst, have any critiques of our post, or want to contact or collaborate with us for any other reason, please don't hesitate to reach out at chapinalc@gmail.com.

  1. ^

    "Many ARC-AGI-1 tasks could often be solved almost instantaneously by human test-takers without requiring significant cognitive effort. In contrast, all tasks in ARC-AGI-2 require some amount of deliberate thinking"

  2. ^

    We have many problems with the task selection, scoring, and reporting for the time horizon benchmark which are beyond the scope of this article, but (in contrast to the case with ARC-AGI) we believe that these problems do not fundamentally compromise the value of the benchmark.

  3. ^

    In fact, there is a rich and potentially very deep story to be uncovered here by studying different people’s performance. Each person can be characterized by a function f(x,t) that maps from (task, time limit) to probability of completion. The compromise described above amounts to finding the surface in (x,t) space for which f(x,t)=1-ε, for some small ε chosen to be on the order of a person’s probability of error with unlimited time. Since we can assume f is monotonic in t, this surface is described by t=T(x), where the domain of T is tasks which a person is capable of completing with a reliability at least 1-ε. Each model (for a given set of parameters like temperature, thinking budget, etc.) can be characterized by a function g(x) that maps from task to probability of completion.

    We expect (though this needs to be shown experimentally) that T(x) for a particular person and g(x) for a particular model are members of universal families of curves parameterized by values like intelligence, processing speed (for people), context length (for models), etc. If so, we can express the time for all people as T(x,p) and reliability for all models as g(x,q), for relevant sets of parameters p and q. If the tasks in question are restricted to general domains that do not load on specific knowledge or experience, these sets of parameters should be small, and one might wonder if tasks for which g(x,q)=1-δ for some small δ all satisfy T(x,p(q))=t(q) for some p(q), t(q) that describe a person and a time limit that will produce equivalent performance to a model with parameters q.
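
    Stated compactly, the conjecture in the previous paragraph (restating the prose above in symbols, with δ and ε as already defined) is that there exist p(q) and t(q) such that

    $$\forall x:\quad g(x, q) = 1 - \delta \;\Longrightarrow\; T\big(x,\, p(q)\big) = t(q),$$

    i.e. every task the model does reliably lies on that person's time curve at time limit t(q).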

    This isn’t an implausible assumption, but it’s certainly not guaranteed. Does it hold? For all models? Any models? If it fails, does it at least hold for a subset of tasks, or a range of times? Where it holds, what parameters p and q do we need, and what do the functions p(q) and t(q) look like? Within the realm of models, presumably q must include something like intelligence (i.e. reasoning and problem solving) and something like effective time limit, and g(x,q) should be monotonically increasing in both. SOTA models’ time horizon and intelligence have been increasing; how much of the increase in time horizon is due to the increase in intelligence? We have evidence showing that models’ intelligence does not correspond with their time horizon, so there must be inter-model variation in effective time limit and/or other relevant parameters – can we isolate this effect from that of intelligence and characterize it? Any answers to these questions will have major implications for the nature of intelligence and ability, and they cannot be answered without more data on many different people’s and models’ performance on both METR’s time horizon tasks and other, more general tasks. METR already has some of this data, but hasn't released it.

    Also, for some reason, METR has not benchmarked any models from companies other than OpenAI and Anthropic, most notably excluding the very capable Gemini 2.5 Pro (06-05) – this is a major oversight for their basic goal of tracking time horizon progress on SOTA models, let alone for answering the deeper questions discussed here.