...contrary to the misleading leaderboard:
The 98% listed as the "Human Panel" score for ARC-AGI-1 is relatively easy to interpret. It was the score of an actual human who attempted all 100 private evaluation tasks.[1] The higher human score of 100% listed for ARC-AGI-2 suggests that the newer benchmark is slightly easier for humans. And the ARC-AGI-2 announcement does nothing to discourage that impression, asserting that it maintains "the same relative ease for humans."[2] An attached technical report, however, explains that ARC-AGI-2 is designed to be more difficult for humans as well as AI systems.[3]
It turns out that the 100% listed for ARC-AGI-2 has a very different interpretation from that of the 98% listed for ARC-AGI-1. Rather than being any one person's score, it means that "every task in ARC-AGI-2 has been [at least partially] solved by at least 2 humans [out of 9 or 10, on average]."[2] (Given that most "tasks" consist of a single "test pair" but "some had two (29%), three (3%), or four (<1%) test pairs," the more precise criterion appears to be that at least two participants "solved one or more sub-pairs within their first two attempts."[4])
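To make the distinction concrete, here is a minimal sketch of that inclusion bar as I read it (the data structure and function are hypothetical, purely for illustration, not anything from the ARC Prize team): a task counts as human-solved if at least two participants solved at least one of its test pairs within their first two attempts, no matter how everyone else fared.

```python
# Illustrative sketch of the stated bar: "two people in two attempts or less,"
# applied per test pair. Hypothetical data structure, not the ARC Prize code.

def meets_human_bar(attempts_by_participant: dict) -> bool:
    """Maps participant ID -> list of test pairs, each a sequence of
    attempt outcomes (True = correct)."""
    solvers = sum(
        1
        for pairs in attempts_by_participant.values()
        # Did this participant solve at least one test pair within two attempts?
        if any(any(pair[:2]) for pair in pairs)
    )
    return solvers >= 2

# A task can meet the bar even though most participants who tried it failed:
task = {
    "p1": [[False, True]],   # solved on the second attempt
    "p2": [[True]],          # solved on the first attempt
    "p3": [[False, False]],  # the rest never solved it
    "p4": [[False, False]],
    "p5": [[False]],
    "p6": [[False, False]],
}
print(meets_human_bar(task))  # True, even though only 2 of 6 solved it
```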
To my knowledge, no human has ever scored 100% on the 120 private evaluation tasks (the set for which AI system scores are reported on the leaderboard). It may be possible, but I am personally doubtful, partly from having tried the sample tasks myself. Instead, the best evidence we have for a human baseline is the performance of the human participants reported in the technical paper.
The 120 tasks in the private evaluation set were solved by an average of ~4.3 participants, based on the chart below from the technical paper. Since these were attempted by an average of 9-10 participants,[2] this implies that average human performance was below 50%, no?[5] And both GPT-5.2 and a refinement of Gemini 3 Pro have now surpassed that.
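As a back-of-the-envelope check (a sketch only, treating the ~4.3 figure read off the chart and the "about 9-10" attempts per task as accurate), the implied solve rate works out to roughly 45%:

```python
# Rough implied human solve rate on the ARC-AGI-2 private evaluation set.
# Inputs are approximate readings from the technical paper, not official figures.

avg_solvers_per_task = 4.3    # ~4.3 participants solved each private eval task (chart reading)
avg_attempts_per_task = 9.5   # "about 9-10 participants" attempted each task

implied_rate = avg_solvers_per_task / avg_attempts_per_task
print(f"Implied average solve rate: {implied_rate:.0%}")  # ~45%, i.e., below 50%
```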
Which humans, specifically, have leading AI systems surpassed on ARC-AGI-2? The technical report does not reveal much about the 407 human participants or how they were recruited, merely describing them as "from diverse professional backgrounds, with a wide variation in self-reported experience in technology, programming, math, and puzzle solving (partially shown in Figure 2 [below])." They worked on the tasks in person on computers "in a conference room setting," attempting an average of 33 task test pairs each (out of an initial pool of 1,848).[4]
p.s. I may have misinterpreted or overlooked information in the ARC-AGI-2 technical paper or elsewhere. Corrections and other feedback are most welcome.
"The original private evaluation tasks were originally tested by two people who scored 97% and 98%, and, together, solved all 100%." (ARC Prize 2024: Technical Report) No further information is given about these two people, but we can assume they were not randomly selected or otherwise representative.
(An independent study recruited 1729 crowd-workers over the internet and found they averaged 76% on the training set and 64% on the public evaluation set. The ARC Prize team highlights that "99% of public evaluation tasks were solved by at least one worker, with 10 workers assigned to each task.")
"To ensure each task was solvable by humans, we made a minimum success bar of: 'two people in two attempts or less.' On average, each task was attempted by about 9-10 participants." https://arcprize.org/blog/announcing-arc-agi-2-and-arc-prize-2025
"Many ARC-AGI-1 tasks could often be solved almost instantaneously by human test-takers without requiring significant cognitive effort. In contrast, all tasks in ARC-AGI-2 require some amount of deliberate thinking — for instance, the average time for task completion among human test takers in our sample was 2.7 minutes." ARC-AGI-2: A New Challenge for Frontier AI Reasoning Systems
While the technical paper states that "final ARC-AGI-2 test pairs were solved, on average, by 75% of people who attempted them," this average is apparently dominated by the (large) public training set, which is easier on average. According to the paper's Figure 5, the public training tasks were solved by an average of ~6.4 participants (out of an average of 9-10). But something is off: Figure 5 appears to represent only around 350 "Public Train" tasks, whereas the announcement post says there are 1,000.
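The same back-of-the-envelope arithmetic, again treating the ~6.4 reading from Figure 5 and the 9-10 attempts per task as accurate, puts the implied public-training solve rate in the high 60s, which, if the readings are right, is itself hard to square with the quoted 75% overall average:

```python
# Rough implied solve rate for the public training tasks, from Figure 5 readings.

avg_solvers_per_task = 6.4    # ~6.4 participants solved each public training task (figure reading)
avg_attempts_per_task = 9.5   # "about 9-10 participants" attempted each task

implied_train_rate = avg_solvers_per_task / avg_attempts_per_task
print(f"Implied public-training solve rate: {implied_train_rate:.0%}")  # ~67%

# The paper's quoted overall average (75%) is higher than even this easier set's
# implied rate, so these chart readings may be imprecise, or something else is off.
```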