Why is the human baseline so low? This is more tentative, but I'm thinking in terms of two basic possibilities.
One is that a non-trivial proportion of these tasks may be ill-posed problems with multiple legitimate answers. The reason only 2 out of 9 humans found the intended solution could be that it's a difficult but well-posed problem, or it could be that there are 2-3 equally valid solutions and landing on the intended "correct" answer requires luck. A perfectly capable solver would still only average 50% or even 33% on such tasks. In that case, a score around 50% would not only be the human baseline but would also amount to saturation of the benchmark.
The other possibility, and maybe the more likely one, is that the human baseline is low because of tricky tasks that a competent human doesn't automatically "get" with enough thought, but instead needs luck in the form of some idiosyncratic experience or random idea that yields the solution. The fact that a non-trivial proportion of their tasks (over 300, according to Figure 5 in the technical paper) were not solved by even two participants suggests that this is the case for a decent proportion of the tasks they designed. Note that some tasks which only one ninth of potential participants would solve on average will nevertheless happen to be solved by 2 participants by chance (see the rough calculation below), leading to some such too-hard tasks being included in ARC-AGI-2. (By contrast, if they had designed a set of tasks and found that all of them could be solved by at least 2 out of 9 humans, that would be more reassuring that their task design process reliably produces human-solvable tasks, albeit still tasks that fewer than half of humans can solve.) To the extent this is the reason for the low human baseline, AI systems may be able to substantially outdo typical human performance and approach 100% despite a human baseline around 50%.
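As a rough illustration of that selection effect, here is a quick calculation under hypothetical assumptions: a task that any given participant solves with probability 1/9, attempted by exactly 9 participants.

```python
from math import comb

# Hypothetical illustration: a task that any given participant solves with
# probability 1/9, attempted by exactly 9 participants.
p, n = 1 / 9, 9

# Probability that at least 2 of the 9 solve it anyway, i.e. the task clears
# the "two people in two attempts or less" inclusion bar by luck.
p_at_least_two = 1 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in (0, 1))

print(f"{p_at_least_two:.0%}")  # ~26%
```

Under those assumptions, roughly a quarter of such too-hard tasks would clear the inclusion bar anyway.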
...contrary to the misleading leaderboard (which their technical paper implies should actually list humans at ~48%, as explained below):
The 98% listed as the "Human Panel" score for ARC-AGI-1 is relatively easy to interpret. It was the score of an actual human who attempted all 100 private evaluation tasks.[1] The higher human score of 100% listed for ARC-AGI-2 suggests that the newer benchmark is slightly easier for humans. And the ARC-AGI-2 announcement does nothing to discourage that impression, asserting that it maintains "the same relative ease for humans."[2] An attached technical report, however, explains that ARC-AGI-2 is designed to be more difficult for humans as well as AI systems.[3]
It turns out that the 100% listed for ARC-AGI-2 has a very different interpretation from that of the 98% listed for ARC-AGI-1. Instead, it means that "every task in ARC-AGI-2 has been [at least partially] solved by at least 2 humans [out of 9 or 10, on average]."[2] (Given that most "tasks" consist of a single "test pair" but "some had two (29%), three (3%), or four (<1%) test pairs," the more precise criterion appears to be that at least two participants "solved one or more sub-pairs within their first two attempts."[4])
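For concreteness, here is a minimal sketch of how I read that inclusion criterion, expressed as code. The data shapes and names are my own invention for illustration, not anything from the ARC Prize materials.

```python
from __future__ import annotations

def participant_passed(first_solve_attempt: dict[str, int | None]) -> bool:
    """first_solve_attempt maps each test-pair id to the attempt number
    (1, 2, ...) on which this participant first solved it, or None if never.
    A participant 'passes' a task if they solved one or more of its test
    pairs within their first two attempts."""
    return any(a is not None and a <= 2 for a in first_solve_attempt.values())

def task_qualifies(participants: list[dict[str, int | None]]) -> bool:
    """The stated bar: at least two participants pass the task."""
    return sum(participant_passed(p) for p in participants) >= 2

# Example: three participants attempt a hypothetical task with test pairs "a" and "b".
print(task_qualifies([
    {"a": 1, "b": None},   # solved pair "a" on the first attempt   -> passes
    {"a": None, "b": 3},   # solved pair "b" only on a third attempt -> fails
    {"a": 2, "b": None},   # solved pair "a" on the second attempt  -> passes
]))  # True: two of the three participants passed
```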
To my knowledge, no human has ever scored 100% on the 120 private evaluation tasks (for which AI system scores are reported in the leaderboard). It may be possible, but I am personally doubtful, partly based on having tried the sample tasks myself. Instead, the best information we have to go on for a human baseline is the performance of the human participants reported in the technical paper.
The 120 tasks in the private evaluation set were solved by an average of ~4.3 participants, based on the chart below from the technical paper. Since these were attempted by an average of 9-10 participants,[2] this implies that average human performance was below 50%, no?[5] And both GPT-5.2 and a refinement of Gemini 3 Pro have now surpassed that.
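The implied arithmetic (including the ~48% figure mentioned earlier) is just the ratio of average solvers to average attempters, using the numbers above:

```python
avg_solvers = 4.3                      # average solvers per private-eval task (Figure 5)
for avg_attempters in (9, 9.5, 10):    # "about 9-10 participants" attempted each task
    print(f"{avg_attempters}: {avg_solvers / avg_attempters:.0%}")
# 9: 48%   9.5: 45%   10: 43%
```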
Which humans, specifically, have leading AI systems surpassed on ARC-AGI-2? The technical report does not reveal much about the 407 human participants or how they were recruited, merely describing them as "from diverse professional backgrounds, with a wide variation in self-reported experience in technology, programming, math, and puzzle solving (partially shown in Figure 2 [below])." They worked on the tasks in person on computers "in a conference room setting," attempting an average of 33 task test pairs each (out of an initial pool of 1,848).[4]
p.s. I may have misinterpreted or overlooked information in the ARC-AGI-2 technical paper or elsewhere. Corrections and other feedback are most welcome.
"The original private evaluation tasks were originally tested by two people who scored 97% and 98%, and, together, solved all 100%." (ARC Prize 2024: Technical Report) No further information is given about these two people, but we can assume they were not randomly selected or otherwise representative.
(An independent study recruited 1729 crowd-workers over the internet and found they averaged 76% on the training set and 64% on the public evaluation set. The ARC Prize team highlights that "99% of public evaluation tasks were solved by at least one worker, with 10 workers assigned to each task.")
"To ensure each task was solvable by humans, we made a minimum success bar of: 'two people in two attempts or less.' On average, each task was attempted by about 9-10 participants." https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025
"Many ARC-AGI-1 tasks could often be solved almost instantaneously by human test-takers without requiring significant cognitive effort. In contrast, all tasks in ARC-AGI-2 require some amount of deliberate thinking — for instance, the average time for task completion among human test takers in our sample was 2.7 minutes." ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
While the technical paper states that "final ARC-AGI-2 test pairs were solved, on average, by 75% of people who attempted them," this average is apparently dominated by the (large) public training set, which is easier on average. According to the paper's Figure 5, the public training tasks were solved by an average of ~6.4 participants (out of an average of 9-10). But something is off, because Figure 5 appears to represent only around 350 "Public Train" tasks, whereas the announcement post says there are 1,000.