- Restricting to technical and scientific benchmarks, current frontier models have crossed the Average Human (late 2022), Skilled Generalist (early 2024), and Domain Expert (2025) levels on the human expertise spectrum.
- Future models are forecast to reach Top Performer level by October 2027 (95% CI: May 2027 – March 2028), though this timeline should be interpreted with a grain of salt given that benchmarks remain imperfect proxies and human baseline data is sparse.
This is a crosspost from the General-Purpose AI Policy Lab research blog.
The “Rosetta Stone for AI Benchmarks” paper, by Epoch AI and Google DeepMind researchers, which underpins the Epoch Capabilities Index, gave us a great way to rank AI models and benchmarks on a common difficulty scale. But the resulting “capability score” is hard to interpret (what does a score of 2.54 mean?). We extended the framework to include human baselines.
TL;DR
Context: Comparing AI Progress to Human Performance
Short Introduction to the Rosetta Stone
The pace of AI development is often reported through its technical components: more parameters, more FLOPs, or higher scores on specific benchmarks. Our work builds directly on the Rosetta Stone framework (closely related to the Epoch Capabilities Index), which follows this last approach: the original paper introduced a method to estimate the capability of models from benchmark difficulties. The key graph (Figure 1 in the original paper, reproduced below) shows that these estimated capabilities broadly match user intuition about model rankings.
In one glance, the graph gives an idea of the relative ranking of models and benchmarks across time. This scoring method offers a form of “universal” ranking, allowing us to compare the capability of models that were never tested against one another, as well as the relative difficulty of benchmarks such as a PhD-level chemistry exam versus a set of cybersecurity-related tasks. Benchmarks (pink) that have yet to be saturated appear naturally above the cloud of models (teal).
Adding Human Baselines to Rosetta
Knowing that Gemini 2.5 Pro has a "Capability of 2.54" is not really meaningful on its own. The Rosetta Stone paper proposes to interpret model scores more quantitatively by looking at relative differences between models and mapping their capabilities to task time horizons. We build on this by anchoring these numbers to human expertise levels as real-world reference points, using human baseline scores from the literature.
The Rosetta framework is built on the assumption that a common capability factor underlies all tasks, and that models can be placed on that difficulty dimension accordingly. By treating human expertise groups as "models" within the Rosetta database, we can calibrate this axis. This allows us to make concrete statements about where a model sits relative to, say, a PhD-level specialist.
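To make the mechanics concrete, here is a minimal sketch of how a human tier can be placed on the shared axis, assuming a Rasch-style logistic link between capability and benchmark difficulty. The fitting procedure, helper names, and difficulty values below are assumptions and placeholders, not the paper's implementation; the GPQA Diamond and MMLU scores are rounded from Appendix 2.

```python
from scipy.optimize import minimize_scalar
from scipy.special import expit  # logistic sigmoid

# Hypothetical benchmark difficulties on the shared capability axis.
# These are illustrative placeholders, NOT the fitted Rosetta values.
BENCHMARK_DIFFICULTY = {
    "GSM8K": -1.0,
    "MMLU": 0.3,
    "GPQA Diamond": 1.2,
    "FrontierMath": 4.5,
}

def expected_score(capability: float, difficulty: float) -> float:
    """Rasch-style link: expected accuracy of an agent with the given
    capability on a benchmark of the given difficulty."""
    return expit(capability - difficulty)

def implied_capability(observed_scores: dict[str, float]) -> float:
    """Find the capability that best reproduces a human group's reported
    benchmark scores, treating the group as a pseudo-model."""
    def loss(theta: float) -> float:
        return sum(
            (expected_score(theta, BENCHMARK_DIFFICULTY[b]) - s) ** 2
            for b, s in observed_scores.items()
        )
    return minimize_scalar(loss, bounds=(-5, 10), method="bounded").x

# Domain Expert scores, rounded from Appendix 2 (GPQA Diamond and MMLU).
domain_expert = {"GPQA Diamond": 0.81, "MMLU": 0.90}
print(f"Implied Domain Expert capability: {implied_capability(domain_expert):.2f}")
```

With these invented difficulties the result happens to land near the Domain Expert value in the calibrated scale below, but the point is the mechanism: once benchmark difficulties are fixed by the model fits, any human score on those benchmarks pins the group to a location on the shared axis.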
A Single Axis?
Only approximately. Model-to-model comparison reveals differences beyond just capability (see Benchmark Scores = General Capability + Claudiness), and this effect is reinforced here given the jagged frontier of AI performance relative to human capabilities. Humans show a predictable hierarchy of skills: if a person can solve a PhD-level chemistry problem, we can safely assume they can also answer a common-sense question about temporal or spatial scenarios. AI models do not share this hierarchy and routinely invert this ordering.
This creates a structural problem for the Rosetta framework: a benchmark that sits high on the AI-derived difficulty axis (e.g., a common-sense task that models struggle with) can be easy for humans, so a single difficulty ordering cannot rank both populations consistently.
The Rosetta Stone authors acknowledge this limitation. A natural extension of their framework would incorporate multiple difficulty axes. For now, we filter for benchmarks where the human difficulty axis and the AI difficulty axis are reasonably aligned: primarily technical and scientific tasks. Eight of the 38 benchmarks are thus removed from the main analysis. We include the full, unfiltered results in Appendix 3.
Results: Anchoring the Rosetta Stone with Human References
We categorized human baselines into four distinct tiers (Average Human, Skilled Generalist, Domain Expert, and Top Performer; see full methodology in Appendix 1), plus an aggregated "Committee" version for expert categories.
Calibrated human scale
When technical and scientific benchmarks are isolated to align human and AI difficulty, an expected hierarchy emerges:
| Group | Estimated Capability |
| --- | --- |
| Average Human | 0.55 |
| Skilled Generalist | 1.54 |
| Domain Expert | 2.54 |
| Committee of Domain Experts | 2.97 |
| Top Performer | 4.53 |
The Skilled Generalist, with a capability score of 1.54, sits just below Claude 3 Opus. Domain Experts land at 2.54, equal to Gemini 2.5 Pro. Both the Committee of Domain Experts and Top Performers score above GPT-5, currently the strongest model in the Rosetta database at 2.81.
Current Standings: Models vs. Experts
As of 2026, frontier models have effectively crossed the thresholds for Average Humans (late 2022), Skilled Generalists (early 2024), and Domain Experts (2025). Current frontier models are now approaching the Committee of Domain Experts level, performing nearly on par with aggregated professional teams on well-defined technical tasks.
Two of the hardest benchmarks are GSO-Bench and FrontierMath, which test software optimization (improving code runtime efficiency against expert developer solutions) and research-grade mathematics (problems that often require hours of expert collaboration to solve), respectively. These benchmarks approach real-world professional standards, albeit in simplified settings, indicating that while current models are catching up to expert committees, elite individual performance remains out of reach for now.
Note: These benchmarks are simplified proxies, not direct measures of real-world professional competence. Scoring at "Domain Expert level" on a benchmark does not mean models can replace domain experts in their actual work. Real-world tasks involve ambiguity, social context, judgment under uncertainty, and prolonged iterative work that benchmarks deliberately strip away. What these results do show is that models are increasingly capable of solving the kinds of technical, closed-ended problems experts solve, under controlled conditions, which is already remarkable.
Projections: When will AI surpass the Expert Threshold?
Linear extrapolations of the frontier models’ capabilities (the top three performers at any given release date) provide a time estimate for expected parity: Top Performer level is reached around October 2027 (95% CI: May 2027 – March 2028). Note that these projections assume a continuation of current scaling trends and do not account for other variables.
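A minimal sketch of this kind of extrapolation is below, with placeholder release dates and capability values rather than the actual Rosetta database entries, and a simple bootstrap for the interval; the reported CI may well come from a different procedure.

```python
import numpy as np

# Placeholder frontier capability estimates by release date (decimal years).
# These are illustrative values, NOT the actual Rosetta database entries.
release_year = np.array([2022.8, 2023.3, 2023.9, 2024.4, 2024.9, 2025.4, 2025.8])
frontier_cap = np.array([0.3, 0.9, 1.4, 1.9, 2.3, 2.6, 2.8])

TOP_PERFORMER = 4.53  # calibrated Top Performer capability from the table above

def crossing_year(x, y, threshold):
    """Fit capability = a*year + b and solve for the year it reaches threshold."""
    a, b = np.polyfit(x, y, 1)
    return (threshold - b) / a

# Point estimate plus a simple bootstrap interval over resampled data points.
rng = np.random.default_rng(0)
point = crossing_year(release_year, frontier_cap, TOP_PERFORMER)
boot = []
for _ in range(2000):
    idx = rng.choice(len(release_year), size=len(release_year), replace=True)
    boot.append(crossing_year(release_year[idx], frontier_cap[idx], TOP_PERFORMER))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Projected crossing: {point:.1f} (95% interval: {lo:.1f} to {hi:.1f})")
```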
Motivation for More Data
The most immediate bottleneck in this analysis is data sparsity: most human groups have only 2 to 4 data points each, and coverage drops sharply outside a group's primary expertise domain. We need cross-difficulty mapping of human performance for distinct sets of skills (e.g., technical, common sense), for example Average Human scores on hard technical benchmarks and expert scores on easy common-sense benchmarks.
Limits of Human Baselines
Unlike model evaluations, human scores come from different studies with wildly varying sample sizes (thousands of crowd workers vs. a handful of experts), incentives, time limits, and tool access. This noise compounds with ceiling effects on easy benchmarks and missing cross-difficulty data for each expertise tier.
Wei et al. (ICML 2025) make a closely related point: existing human baselines in model evaluations are neither sufficiently rigorous nor sufficiently well-documented to robustly support human vs. AI comparisons. Their recommended checklist offers a useful standard for better-structured human metadata.
Limits of the single-axis model
The second major limitation is the single-axis model itself. The Rosetta framework assumes one underlying difficulty dimension, but the jagged frontier problem suggests multiple axes: technical/ground-truth-based knowledge, common sense, fluid reasoning, cultural priors, etc. For example, Burnell et al. (2023), Maia Polo et al. (2025), Kipnis et al. (2025) and Ruan et al. (2024) find low-dimensional latent skills such as reasoning, language modeling, and instruction following.
Extending Rosetta to handle multiple difficulty axes would let us represent and understand AI capabilities more faithfully. The authors acknowledge this as a natural next step, and we are currently working on this multi-axis extension of the difficulty scale.
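One possible formulation is sketched below, with invented loadings and difficulties: each benchmark draws on a technical axis and a common-sense axis, which lets the same framework reproduce the jagged frontier (humans strong on common sense but weak on frontier technical tasks, models closer to the reverse).

```python
import numpy as np
from scipy.special import expit

# Hypothetical two-axis extension: each agent has a capability vector
# (technical, common sense); each benchmark has loadings on those axes plus a
# difficulty. All numbers below are invented for illustration.
def expected_score(capability, loadings, difficulty):
    """Multidimensional Rasch-style link: loading-weighted capability minus
    benchmark difficulty, passed through a logistic."""
    return expit(np.dot(loadings, capability) - difficulty)

benchmarks = {
    "GPQA-like (technical)":     {"loadings": np.array([0.9, 0.1]), "difficulty": 2.5},
    "SimpleBench-like (common)": {"loadings": np.array([0.2, 0.8]), "difficulty": 0.8},
}
agents = {
    "Average Human":  np.array([0.5, 3.0]),  # weak technical, strong common sense
    "Frontier model": np.array([3.0, 0.5]),  # roughly the inverse profile
}

for agent_name, capability in agents.items():
    for bench_name, b in benchmarks.items():
        score = expected_score(capability, b["loadings"], b["difficulty"])
        print(f"{agent_name:14s} on {bench_name:26s}: {score:.2f}")
```

A single scalar capability cannot produce this crossing pattern; two axes can, which is the jagged frontier in miniature.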
Future work
Beyond these two limitations, a few other directions seem worth exploring.
Appendix 1
Full Methodology
We build on Rosetta by adding human baseline groups to the same coordinate system, treating them as pseudo-models in the database. Human baselines are derived from scores reported in the literature on the same benchmarks used to evaluate AI models (see Appendix 2). We categorized these into four distinct tiers (Average Human, Skilled Generalist, Domain Expert, and Top Performer), plus an aggregated "Committee" version for expert categories.
This four-tier structure arises naturally from the existing literature, where researchers generally report scores in minimally defined clusters. A significant number of technical benchmarks are assessed against either PhD-level specialists or PhD-level individuals from unrelated fields. Another large share of the literature reports scores from crowd workers or non-specialized participants, broadly categorized as Average Humans. Above these three tiers, a small number of exceptional scores sit above the typical expert level without belonging to a clearly distinct credential category. These set the frontier of human performance, likely reflecting a combination of domain mastery, test-taking skill, and familiarity with the benchmark format.
Finer resolution distinguishing, for example, a second-year PhD student from a senior researcher would be useful, but the current literature doesn't yet provide enough consistently labeled data to support it.
Methodological choices
Appendix 2
Human Baselines - Technical & Scientific
| Benchmark | Human Group | Score | Source | Information |
| --- | --- | --- | --- | --- |
| FrontierMath 2025 | Committee of Domain Experts | 0.35 | FrontierMath (Epoch AI) | Solved collectively across all teams (40 exceptional math undergraduates and subject-matter experts) in four and a half hours and with internet access |
| PRBench Finance | Committee of Domain Experts | 0.796 | PRBench | Agreement between human experts |
| GPQA diamond | Domain Expert | 0.812 | GPQA | Experts who have or are pursuing PhDs in the corresponding domains |
| GPQA diamond | Skilled Generalist | 0.219 | GPQA | Highly skilled and incentivized non-experts who have or are pursuing PhDs in other domains |
| GPQA diamond | Domain Expert | 0.697 | GPQA Diamond (Epoch AI) | PhD-level experts recruited by OpenAI |
| GSM8K | Skilled Generalist | 0.968 | GSM8K | Qualified human annotators who have passed a qualification exam with at least a bachelor's degree |
| GeoBench | Top Performer | 0.9 | GeoBench | Top player |
| MATH level 5 | Top Performer | 0.9 | MATH (all levels) | Three-time IMO gold medalist |
| MATH level 5 | Skilled Generalist | 0.4 | MATH (all levels) | One computer science PhD student |
| MMLU | Domain Expert | 0.898 | MMLU | Score is an estimation from the authors |
| MMLU | Average Human | 0.345 | MMLU | MTurk |
| ScienceQA | Average Human | 0.884 | ScienceQA | MTurk with high school or higher degree and who have passed the qualification examples |
| OSWorld | Skilled Generalist | 0.724 | OSWorld | Computer science college students but not familiar with the software |
| TriviaQA | Average Human | 0.797 | TriviaQA | Human performance level |
Human-Easy Baselines
| Benchmark | Human Group | Score | Source | Information |
| --- | --- | --- | --- | --- |
| ARC AGI 1 | Average Human | 0.77 | ARC Prize | MTurk |
| ARC AGI 1 | Committee of Average Humans | 0.98 | ARC Prize | Human panel (at least two participants solved one or more sub-pairs within their first two attempts) |
| ARC AGI 1 | Committee of Skilled Generalists | 0.98 | ARC Prize | STEM Graduates |
| HellaSwag | Committee of Average Humans | 0.956 | HellaSwag | Majority vote of 5 crowd workers (MTurk) |
| OpenBookQA | Average Human | 0.92 | OpenBookQA | Random human subjects |
| PIQA | Committee of Skilled Generalists | 0.949 | PIQA | Majority vote of top annotators |
| SimpleBench | Average Human | 0.837 | SimpleBench | Nine non-specialized humans |
| VPCT | Average Human | 1 | VPCT (Epoch AI) | Three volunteers |
| WinoGrande | Committee of Average Humans | 0.94 | WinoGrande | Majority vote of crowd workers (MTurk) |
Appendix 3
All Benchmarks and Baselines
Including all benchmarks immediately breaks the expected ordering: the Average Human baseline appears disproportionately capable, with an estimated capability even higher than the Domain Experts and Skilled Generalists.
Most scores contributing to the Average Human baseline come from common-sense benchmarks (easy for humans, historically hard for AI), so this group is mostly evaluated on tasks that the difficulty model weights heavily. Conversely, Skilled Generalists are almost exclusively tested on difficult scientific benchmarks (e.g., GPQA Diamond) where they naturally score lower than Domain Experts and Top Performers.
Because the Rosetta graph measures difficulty relative to AI struggles, the high performance of Average Humans on "AI-hard" common sense tasks inflates their perceived capability.
Additional experiment: Adding “filler” data
We ran a separate analysis on the full benchmark set, common-sense benchmarks included, while manually adding extrapolated score estimates to observe how the Average Human tier’s position changes relative to other expertise levels.
As a result, most expert groups move back above the Average Human baseline, though the Average Human estimated capability remains higher than it should be. The remaining anomaly is probably still missing data: we have no Average Human scores on hard benchmarks, so the model cannot place the group correctly. More data in the right structure would likely resolve this.
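For illustration, here is a toy version of the filler experiment. It reuses the Rasch-style link from the earlier sketch, with invented difficulties held fixed (the real analysis refits the full model jointly), and the filler scores on hard benchmarks are assumptions chosen to be near the floor for non-experts.

```python
from scipy.optimize import minimize_scalar
from scipy.special import expit

# Illustrative difficulties only; NOT the fitted Rosetta values.
DIFFICULTY = {"MMLU": 0.3, "SimpleBench": 3.0, "GPQA Diamond": 1.2, "FrontierMath": 4.5}

def implied_capability(scores):
    """Best-fit capability for a pseudo-model given observed benchmark scores."""
    def loss(theta):
        return sum((expit(theta - DIFFICULTY[b]) - s) ** 2 for b, s in scores.items())
    return minimize_scalar(loss, bounds=(-5, 10), method="bounded").x

# Reported Average Human scores (Appendix 2) versus the same set augmented with
# extrapolated "filler" scores on hard benchmarks (assumed near-floor values).
reported = {"MMLU": 0.345, "SimpleBench": 0.837}
with_filler = {**reported, "GPQA Diamond": 0.05, "FrontierMath": 0.01}

print(f"Average Human, reported scores only: {implied_capability(reported):.2f}")
print(f"Average Human, with filler scores:   {implied_capability(with_filler):.2f}")
```

In this toy setup, the high SimpleBench score on an "AI-hard" benchmark inflates the estimate, and the filler scores pull it back down, which mirrors the behavior described above.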