...contrary to the misleading leaderboard:
The 98% listed as the "Human Panel" score for ARC-AGI-1 is relatively easy to interpret. It was the score of an actual human who attempted all 100 private evaluation tasks.[1] The higher human score of 100% listed for ARC-AGI-2 suggests that the newer benchmark is slightly easier for humans. And the ARC-AGI-2 announcement does nothing to discourage that impression, asserting that it maintains "the same relative ease for humans."[2] An attached technical report, however, explains that ARC-AGI-2 is designed to be more difficult for humans as well as AI systems.[3]
It turns out that the 100% listed for ARC-AGI-2 has a very different interpretation from that of the 98% listed for ARC-AGI-1. Rather than being any one person's score, it means that "every task in ARC-AGI-2 has been [at least partially] solved by at least 2 humans [out of 9 or 10, on average]."[2] (Given that most "tasks" consist of a single "test pair" but "some had two (29%), three (3%), or four (<1%) test pairs," the more precise criterion appears to be that at least two participants "solved one or more sub-pairs within their first two attempts."[4])
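To make the distinction concrete, here is a minimal sketch of that inclusion bar as I read it (the data structure and function are hypothetical, purely for illustration, not anything from the ARC Prize team): a task counts as human-solved if at least two participants solved at least one of its test pairs within their first two attempts, no matter how everyone else fared.

```python
# Illustrative sketch of the stated bar: "two people in two attempts or less,"
# applied per test pair. Hypothetical data structure, not the ARC Prize code.

def meets_human_bar(attempts_by_participant: dict) -> bool:
    """Maps participant ID -> list of test pairs, each a sequence of
    attempt outcomes (True = correct)."""
    solvers = sum(
        1
        for pairs in attempts_by_participant.values()
        # Did this participant solve at least one test pair within two attempts?
        if any(any(pair[:2]) for pair in pairs)
    )
    return solvers >= 2

# A task can meet the bar even though most participants who tried it failed:
task = {
    "p1": [[False, True]],   # solved on the second attempt
    "p2": [[True]],          # solved on the first attempt
    "p3": [[False, False]],  # the rest never solved it
    "p4": [[False, False]],
    "p5": [[False]],
    "p6": [[False, False]],
}
print(meets_human_bar(task))  # True, even though only 2 of 6 solved it
```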
To my knowledge, no human has ever scored 100% on the 120 private evaluation tasks (the set for which AI system scores are reported on the leaderboard). It may be possible, but I am personally doubtful, partly from having tried the sample tasks myself. Instead, the best evidence we have for a human baseline is the performance of the human participants reported in the technical paper.
The 120 tasks in the private evaluation set were solved by an average of ~4.3 participants, based on the chart below from the technical paper. Since these were attempted by an average of 9-10 participants,[2] this implies that average human performance was below 50%, no?[5] And both GPT-5.2 and a refinement of Gemini 3 Pro have now surpassed that.
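As a back-of-the-envelope check (a sketch only, treating the ~4.3 figure read off the chart and the "about 9-10" attempts per task as accurate), the implied solve rate works out to roughly 45%:

```python
# Rough implied human solve rate on the ARC-AGI-2 private evaluation set.
# Inputs are approximate readings from the technical paper, not official figures.

avg_solvers_per_task = 4.3    # ~4.3 participants solved each private eval task (chart reading)
avg_attempts_per_task = 9.5   # "about 9-10 participants" attempted each task

implied_rate = avg_solvers_per_task / avg_attempts_per_task
print(f"Implied average solve rate: {implied_rate:.0%}")  # ~45%, i.e., below 50%
```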
Which humans, specifically, have leading AI systems surpassed on ARC-AGI-2? The technical report does not reveal much about the 407 human participants or how they were recruited, merely describing them as "from diverse professional backgrounds, with a wide variation in self-reported experience in technology, programming, math, and puzzle solving (partially shown in Figure 2 [below])." They worked on the tasks in person on computers "in a conference room setting," attempting an average of 33 task test pairs each (out of an initial pool of 1,848).[4]
p.s. I may have misinterpreted or overlooked information in the ARC-AGI-2 technical paper or elsewhere. Corrections and other feedback are most welcome.
"The original private evaluation tasks were originally tested by two people who scored 97% and 98%, and, together, solved all 100%." (ARC Prize 2024: Technical Report) No further information is given about these two people, but we can assume they were not randomly selected or otherwise representative.
(An independent study recruited 1729 crowd-workers over the internet and found they averaged 76% on the training set and 64% on the public evaluation set. The ARC Prize team highlights that "99% of public evaluation tasks were solved by at least one worker, with 10 workers assigned to each task.")
"To ensure each task was solvable by humans, we made a minimum success bar of: 'two people in two attempts or less.' On average, each task was attempted by about 9-10 participants." https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025
"Many ARC-AGI-1 tasks could often be solved almost instantaneously by human test-takers without requiring significant cognitive effort. In contrast, all tasks in ARC-AGI-2 require some amount of deliberate thinking — for instance, the average time for task completion among human test takers in our sample was 2.7 minutes." ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
While the technical paper states that "final ARC-AGI-2 test pairs were solved, on average, by 75% of people who attempted them," this average is apparently dominated by the (large) public training set, which is easier on average. According to the paper's Figure 5, the public training tasks were solved by an average of ~6.4 participants (out of an average of 9-10). But something is off: Figure 5 appears to represent only around 350 "Public Train" tasks, whereas the announcement post says there are 1,000.
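The same back-of-the-envelope arithmetic, again treating the ~6.4 reading from Figure 5 and the 9-10 attempts per task as accurate, puts the implied public-training solve rate in the high 60s, which, if the readings are right, is itself hard to square with the quoted 75% overall average:

```python
# Rough implied solve rate for the public training tasks, from Figure 5 readings.

avg_solvers_per_task = 6.4    # ~6.4 participants solved each public training task (figure reading)
avg_attempts_per_task = 9.5   # "about 9-10 participants" attempted each task

implied_train_rate = avg_solvers_per_task / avg_attempts_per_task
print(f"Implied public-training solve rate: {implied_train_rate:.0%}")  # ~67%

# The paper's quoted overall average (75%) is higher than even this easier set's
# implied rate, so these chart readings may be imprecise, or something else is off.
```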