- Restricting to technical and scientific benchmarks, current frontier models have crossed the Average Human (late 2022), Skilled Generalist (early 2024), and Domain Expert (2025) levels on the human expertise spectrum.
- Future models are forecast to reach Top Performer level by October 2027 (95% CI: May 2027 – March 2028), though this timeline should be interpreted with a grain of salt given that benchmarks remain imperfect proxies and human baseline data is sparse.
This is a crosspost from the General-Purpose AI Policy Lab research blog.
The “Rosetta Stone for AI Benchmarks” paper, by Epoch AI and Google DeepMind researchers, which underpins the Epoch Capabilities Index, gave us a great way to rank AI models and benchmarks on a common difficulty scale. But the resulting “capability score” is hard to interpret (what does a score of 2.54 mean?). We extended the framework to include human baselines.
TL;DR
Context: Comparing AI Progress to Human Performance
Short Introduction to the Rosetta Stone
The pace of AI development is often reported through its technical components: more parameters, more FLOPs, or higher scores on specific benchmarks. Our work builds directly on the Rosetta Stone framework (closely related to the Epoch Capabilities Index), which follows this last approach: the original paper introduced a method to estimate the capability of models from benchmark difficulties. The key graph (Figure 1 in the original paper, reproduced below) shows that these estimated capabilities broadly match user intuition about model rankings.
In one glance, the graph gives an idea of the relative ranking of models and benchmarks across time. This scoring method offers a form of “universal” ranking, allowing us to compare the capability of models that were never tested against one another, as well as the relative difficulty of benchmarks such as a PhD-level chemistry exam versus a set of cybersecurity-related tasks. Benchmarks (pink) that have yet to be saturated appear naturally above the cloud of models (teal).
Adding Human Baselines to Rosetta
Knowing that Gemini 2.5 Pro has a "Capability of 2.54" is not really meaningful on its own. The Rosetta Stone paper proposes to interpret model scores more quantitatively by looking at relative differences between models and mapping their capabilities to task time horizons. We build on this by anchoring these numbers to human expertise levels as real-world reference points, using human baseline scores from the literature.
The Rosetta framework is built on the assumption that a common capability factor underlies all tasks, and that models can be placed on that difficulty dimension accordingly. By treating human expertise groups as "models" within the Rosetta database, we can calibrate this axis. This allows us to make concrete statements about where a model sits relative to, say, a PhD-level specialist.
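To make the mechanics concrete, here is a minimal sketch of how a human tier can be placed on the shared axis, assuming a Rasch-style logistic link between capability and benchmark difficulty. The fitting procedure, helper names, and difficulty values below are assumptions and placeholders, not the paper's implementation; the GPQA Diamond and MMLU scores are rounded from Appendix 2.

```python
from scipy.optimize import minimize_scalar
from scipy.special import expit  # logistic sigmoid

# Hypothetical benchmark difficulties on the shared capability axis.
# These are illustrative placeholders, NOT the fitted Rosetta values.
BENCHMARK_DIFFICULTY = {
    "GSM8K": -1.0,
    "MMLU": 0.3,
    "GPQA Diamond": 1.2,
    "FrontierMath": 4.5,
}

def expected_score(capability: float, difficulty: float) -> float:
    """Rasch-style link: expected accuracy of an agent with the given
    capability on a benchmark of the given difficulty."""
    return expit(capability - difficulty)

def implied_capability(observed_scores: dict[str, float]) -> float:
    """Find the capability that best reproduces a human group's reported
    benchmark scores, treating the group as a pseudo-model."""
    def loss(theta: float) -> float:
        return sum(
            (expected_score(theta, BENCHMARK_DIFFICULTY[b]) - s) ** 2
            for b, s in observed_scores.items()
        )
    return minimize_scalar(loss, bounds=(-5, 10), method="bounded").x

# Domain Expert scores, rounded from Appendix 2 (GPQA Diamond and MMLU).
domain_expert = {"GPQA Diamond": 0.81, "MMLU": 0.90}
print(f"Implied Domain Expert capability: {implied_capability(domain_expert):.2f}")
```

With these invented difficulties the result happens to land near the Domain Expert value in the calibrated scale below, but the point is the mechanism: once benchmark difficulties are fixed by the model fits, any human score on those benchmarks pins the group to a location on the shared axis.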
A Single Axis?
Only approximately. Model-to-model comparison reveals differences beyond just capability (see Benchmark Scores = General Capability + Claudiness), and this effect is reinforced here given the jagged frontier of AI performance relative to human capabilities. Humans show a predictable hierarchy of skills: if a person can solve a PhD-level chemistry problem, we can safely assume they can also answer a common-sense question about temporal or spatial scenarios. AI models do not share this hierarchy and routinely invert this ordering.
This creates a structural problem for the Rosetta framework: a benchmark that sits high on the AI-derived difficulty axis (e.g., a common-sense task that models struggle with) can be easy for humans, so a single difficulty ordering cannot rank both populations consistently.
The Rosetta Stone authors acknowledge this limitation. A natural extension of their framework would incorporate multiple difficulty axes. For now, we filter for benchmarks where the human difficulty axis and the AI difficulty axis are reasonably aligned: primarily technical and scientific tasks. Eight of the 38 benchmarks are thus removed from the main analysis. We include the full, unfiltered results in Appendix 3.
Results: Anchoring the Rosetta Stone with Human References
We categorized human baselines into four distinct tiers (Average Human, Skilled Generalist, Domain Expert, and Top Performer; see full methodology in Appendix 1), plus an aggregated "Committee" version for expert categories.
Calibrated human scale
When technical and scientific benchmarks are isolated to align human and AI difficulty, an expected hierarchy emerges:
| Group | Estimated Capability |
| --- | --- |
| Average Human | 0.55 |
| Skilled Generalist | 1.54 |
| Domain Expert | 2.54 |
| Committee of Domain Experts | 2.97 |
| Top Performer | 4.53 |
The Skilled Generalist, with a capability score of 1.54, sits just below Claude 3 Opus. Domain Experts land at 2.54, equal to Gemini 2.5 Pro. Both the Committee of Domain Experts and Top Performers score above GPT-5, currently the strongest model in the Rosetta database at 2.81.
Current Standings: Models vs. Experts
As of 2026, frontier models have effectively crossed the thresholds for Average Humans (late 2022), Skilled Generalists (early 2024), and Domain Experts (2025). Current frontier models are now approaching the Committee of Domain Experts level, performing nearly on par with aggregated professional teams on well-defined technical tasks.
Two of the hardest benchmarks are GSO-Bench and FrontierMath, which test software optimization (improving code runtime efficiency against expert developer solutions) and research-grade mathematics (problems that often require hours of expert collaboration to solve), respectively. These benchmarks approach real-world professional standards, albeit in simplified settings, indicating that while current models are catching up to expert committees, elite individual performance remains out of reach for now.
Note: These benchmarks are simplified proxies, not direct measures of real-world professional competence. Scoring at "Domain Expert level" on a benchmark does not mean models can replace domain experts in their actual work. Real-world tasks involve ambiguity, social context, judgment under uncertainty, and prolonged iterative work that benchmarks deliberately strip away. What these results do show is that models are increasingly capable of solving the kinds of technical, closed-ended problems experts solve, under controlled conditions, which is already remarkable.
Projections: When will AI surpass the Expert Threshold?
Linear extrapolations of the frontier models’ capabilities (the top three performers at any given release date) provide a time estimate for expected parity: Top Performer level is reached around October 2027 (95% CI: May 2027 – March 2028). Note that these projections assume a continuation of current scaling trends and do not account for other variables.
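A minimal sketch of this kind of extrapolation is below, with placeholder release dates and capability values rather than the actual Rosetta database entries, and a simple bootstrap for the interval; the reported CI may well come from a different procedure.

```python
import numpy as np

# Placeholder frontier capability estimates by release date (decimal years).
# These are illustrative values, NOT the actual Rosetta database entries.
release_year = np.array([2022.8, 2023.3, 2023.9, 2024.4, 2024.9, 2025.4, 2025.8])
frontier_cap = np.array([0.3, 0.9, 1.4, 1.9, 2.3, 2.6, 2.8])

TOP_PERFORMER = 4.53  # calibrated Top Performer capability from the table above

def crossing_year(x, y, threshold):
    """Fit capability = a*year + b and solve for the year it reaches threshold."""
    a, b = np.polyfit(x, y, 1)
    return (threshold - b) / a

# Point estimate plus a simple bootstrap interval over resampled data points.
rng = np.random.default_rng(0)
point = crossing_year(release_year, frontier_cap, TOP_PERFORMER)
boot = []
for _ in range(2000):
    idx = rng.choice(len(release_year), size=len(release_year), replace=True)
    boot.append(crossing_year(release_year[idx], frontier_cap[idx], TOP_PERFORMER))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Projected crossing: {point:.1f} (95% interval: {lo:.1f} to {hi:.1f})")
```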
Motivation for More Data
The most immediate bottleneck in this analysis is data sparsity: most human groups have only 2 to 4 data points each, and coverage drops sharply outside a group's primary expertise domain. We need cross-difficulty mapping of human performance for distinct sets of skills (e.g., technical, common sense), for example Average Human scores on hard technical benchmarks and expert scores on easy common-sense benchmarks.
Limits of Human Baselines
Unlike model evaluations, human scores come from different studies with wildly varying sample sizes (thousands of crowd workers vs. a handful of experts), incentives, time limits, and tool access. This noise compounds with ceiling effects on easy benchmarks and missing cross-difficulty data for each expertise tier.
Wei et al. (ICML 2025) make a closely related point: existing human baselines in model evaluations are neither sufficiently rigorous nor sufficiently well-documented to robustly support human vs. AI comparisons. Their recommended checklist offers a useful standard for better-structured human metadata.
Limits of the single-axis model
The second major limitation is the single-axis model itself. The Rosetta framework assumes one underlying difficulty dimension, but the jagged frontier problem suggests multiple axes: technical/ground-truth-based knowledge, common sense, fluid reasoning, cultural priors, etc. For example, Burnell et al. (2023), Maia Polo et al. (2025), Kipnis et al. (2025) and Ruan et al. (2024) find low-dimensional latent skills such as reasoning, language modeling, and instruction following.
Extending Rosetta to handle multiple difficulty axes would let us represent and understand AI capabilities more faithfully. The authors acknowledge this as a natural next step, and we are currently working on this multi-axis extension of the difficulty scale.
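One possible formulation is sketched below, with invented loadings and difficulties: each benchmark draws on a technical axis and a common-sense axis, which lets the same framework reproduce the jagged frontier (humans strong on common sense but weak on frontier technical tasks, models closer to the reverse).

```python
import numpy as np
from scipy.special import expit

# Hypothetical two-axis extension: each agent has a capability vector
# (technical, common sense); each benchmark has loadings on those axes plus a
# difficulty. All numbers below are invented for illustration.
def expected_score(capability, loadings, difficulty):
    """Multidimensional Rasch-style link: loading-weighted capability minus
    benchmark difficulty, passed through a logistic."""
    return expit(np.dot(loadings, capability) - difficulty)

benchmarks = {
    "GPQA-like (technical)":     {"loadings": np.array([0.9, 0.1]), "difficulty": 2.5},
    "SimpleBench-like (common)": {"loadings": np.array([0.2, 0.8]), "difficulty": 0.8},
}
agents = {
    "Average Human":  np.array([0.5, 3.0]),  # weak technical, strong common sense
    "Frontier model": np.array([3.0, 0.5]),  # roughly the inverse profile
}

for agent_name, capability in agents.items():
    for bench_name, b in benchmarks.items():
        score = expected_score(capability, b["loadings"], b["difficulty"])
        print(f"{agent_name:14s} on {bench_name:26s}: {score:.2f}")
```

A single scalar capability cannot produce this crossing pattern; two axes can, which is the jagged frontier in miniature.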
Future work
Beyond these two limitations, a few other directions seem worth exploring.
Appendix 1
Full Methodology
We build on Rosetta by adding human baseline groups to the same coordinate system, treating them as pseudo-models in the database. Human baselines are derived from scores reported in the literature on the same benchmarks used to evaluate AI models (see Appendix 2). We categorized these into four distinct tiers (Average Human, Skilled Generalist, Domain Expert, and Top Performer), plus an aggregated "Committee" version for expert categories.
This four-tier structure arises naturally from the existing literature, where researchers generally report scores in minimally defined clusters. A significant number of technical benchmarks are assessed against either PhD-level specialists or PhD-level individuals from unrelated fields. Another large share of the literature reports scores from crowd workers or non-specialized participants, broadly categorized as Average Humans. Above these three tiers, a small number of exceptional scores sit above the typical expert level without belonging to a clearly distinct credential category. These set the frontier of human performance, likely reflecting a combination of domain mastery, test-taking skill, and familiarity with the benchmark format.
Finer resolution distinguishing, for example, a second-year PhD student from a senior researcher would be useful, but the current literature doesn't yet provide enough consistently labeled data to support it.
Methodological choices
Appendix 2
Human Baselines - Technical & Scientific
| Benchmark | Human Group | Score | Source | Information |
| --- | --- | --- | --- | --- |
| FrontierMath 2025 | Committee of Domain Experts | 0.35 | FrontierMath (Epoch AI) | Solved collectively across all teams (40 exceptional math undergraduates and subject-matter experts) in four and a half hours and with internet access |
| PRBench Finance | Committee of Domain Experts | 0.796 | PRBench | Agreement between human experts |
| GPQA diamond | Domain Expert | 0.812 | GPQA | Experts who have or are pursuing PhDs in the corresponding domains |
| GPQA diamond | Skilled Generalist | 0.219 | GPQA | Highly skilled and incentivized non-experts who have or are pursuing PhDs in other domains |
| GPQA diamond | Domain Expert | 0.697 | GPQA Diamond (Epoch AI) | PhD-level experts recruited by OpenAI |
| GSM8K | Skilled Generalist | 0.968 | GSM8K | Qualified human annotators who have passed a qualification exam with at least a bachelor's degree |
| GeoBench | Top Performer | 0.9 | GeoBench | Top player |
| MATH level 5 | Top Performer | 0.9 | MATH (all levels) | Three-time IMO gold medalist |
| MATH level 5 | Skilled Generalist | 0.4 | MATH (all levels) | One computer science PhD student |
| MMLU | Domain Expert | 0.898 | MMLU | Score is an estimation from the authors |
| MMLU | Average Human | 0.345 | MMLU | MTurk |
| ScienceQA | Average Human | 0.884 | ScienceQA | MTurk with high school or higher degree and who have passed the qualification examples |
| OSWorld | Skilled Generalist | 0.724 | OSWorld | Computer science college students but not familiar with the software |
| TriviaQA | Average Human | 0.797 | TriviaQA | Human performance level |
Human-Easy Baselines
| Benchmark | Human Group | Score | Source | Information |
| --- | --- | --- | --- | --- |
| ARC AGI 1 | Average Human | 0.77 | ARC Prize | MTurk |
| ARC AGI 1 | Committee of Average Humans | 0.98 | ARC Prize | Human panel (at least two participants solved one or more sub-pairs within their first two attempts) |
| ARC AGI 1 | Committee of Skilled Generalists | 0.98 | ARC Prize | STEM Graduates |
| HellaSwag | Committee of Average Humans | 0.956 | HellaSwag | Majority vote of 5 crowd workers (MTurk) |
| OpenBookQA | Average Human | 0.92 | OpenBookQA | Random human subjects |
| PIQA | Committee of Skilled Generalists | 0.949 | PIQA | Majority vote of top annotators |
| SimpleBench | Average Human | 0.837 | SimpleBench | Nine non-specialized humans |
| VPCT | Average Human | 1 | VPCT (Epoch AI) | Three volunteers |
| WinoGrande | Committee of Average Humans | 0.94 | WinoGrande | Majority vote of crowd workers (MTurk) |
Appendix 3
All Benchmarks and Baselines
Including all benchmarks immediately breaks the expected ordering: the Average Human baseline appears disproportionately capable, with an estimated capability even higher than the Domain Experts and Skilled Generalists.
Most scores contributing to the Average Human baseline come from common-sense benchmarks (easy for humans, historically hard for AI), so this group is mostly evaluated on tasks that the difficulty model weights heavily. Conversely, Skilled Generalists are almost exclusively tested on difficult scientific benchmarks (e.g., GPQA Diamond) where they naturally score lower than Domain Experts and Top Performers.
Because the Rosetta graph measures difficulty relative to AI struggles, the high performance of Average Humans on "AI-hard" common sense tasks inflates their perceived capability.
Additional experiment: Adding “filler” data
We ran a separate analysis on the full benchmark set, common-sense benchmarks included, while manually adding extrapolated score estimates to observe how the Average Human tier’s position changes relative to other expertise levels.
As a result, most expert groups move back above the Average Human baseline, though the Average Human estimated capability remains higher than it should be. The remaining anomaly is probably still missing data: we have no Average Human scores on hard benchmarks, so the model cannot place the group correctly. More data in the right structure would likely resolve this.
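For illustration, here is a toy version of the filler experiment. It reuses the Rasch-style link from the earlier sketch, with invented difficulties held fixed (the real analysis refits the full model jointly), and the filler scores on hard benchmarks are assumptions chosen to be near the floor for non-experts.

```python
from scipy.optimize import minimize_scalar
from scipy.special import expit

# Illustrative difficulties only; NOT the fitted Rosetta values.
DIFFICULTY = {"MMLU": 0.3, "SimpleBench": 3.0, "GPQA Diamond": 1.2, "FrontierMath": 4.5}

def implied_capability(scores):
    """Best-fit capability for a pseudo-model given observed benchmark scores."""
    def loss(theta):
        return sum((expit(theta - DIFFICULTY[b]) - s) ** 2 for b, s in scores.items())
    return minimize_scalar(loss, bounds=(-5, 10), method="bounded").x

# Reported Average Human scores (Appendix 2) versus the same set augmented with
# extrapolated "filler" scores on hard benchmarks (assumed near-floor values).
reported = {"MMLU": 0.345, "SimpleBench": 0.837}
with_filler = {**reported, "GPQA Diamond": 0.05, "FrontierMath": 0.01}

print(f"Average Human, reported scores only: {implied_capability(reported):.2f}")
print(f"Average Human, with filler scores:   {implied_capability(with_filler):.2f}")
```

In this toy setup, the high SimpleBench score on an "AI-hard" benchmark inflates the estimate, and the filler scores pull it back down, which mirrors the behavior described above.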