I think this is a useful goal. I'd pose that quantifying human research taste seems like the best starting point. Can a human researcher achieve a high score on this metric?
A metric I would propose is somewhat different, and I think potentially less vulnerable to noise and measurement error:
For example, if I were evaluating an LLM that had stopped training just after DiffPure released, I might ask it how best to combat this defense from the perspective of the attacker. I'd then compare its proposal to the methods demonstrated by DiffAttack, and rank LLMs (and humans - either queried before the cutoff or from separate subfields without knowledge of the ground truth outcome) by having an evaluator repeatedly decide which of two models' proposals best match what ended up working. This might look like:
Yes, we think a lack of human baseline is a key weakness of any stronger conclusions we'd like to make. This a really interesting proxy task, but the obvious weakness here is assuming real-life trends provide the best ground truth (our task also runs into this, but in a less limiting way). This is also why we're trying to move closer to a task that captures the full R&D loop but with a very heavy emphasis on the non-engineering parts.
This is an early stage research update. We love feedback and comments!
TL;DR:
Why ‘research taste’?
AI R&D Automation
One of the fastest paths to ASI is automation of AI R&D, which could allow for an exponential explosion in capabilities (“foom”). Many labs and governments also highlighted AI R&D automation as an important risk
Based on some previous work from David Owen at Epoch AI, we split the AI R&D workflow into “engineering” and “non-engineering” tasks.
Non-engineering is the left half, while engineering is the right.
There are lots of benchmarks examining engineering skills (e.g., HCAST, SWE-Bench, internal suites), but not much on non-engineering skills.
We choose to focus on experimental selection skills, commonly referred to as research taste (a la Olah and Nanda) in LLMs, which we define as “the set of intuitions and good judgment guiding a researcher's decisions throughout the research process, whenever an open-ended decision arises without an obvious way to find the right answer.”
Dichotomizing research taste
Research taste can further be broken down into:
Strategic Taste: (picking what mountain to climb on)
Tactical Taste: (choosing how to climb the mountain)
Why is research taste important?
Research taste, as well as other non-engineering tasks in the R&D automation cycle, has been less studied because there could be a concerning level of speed up from just automating engineering. However, we think that research taste particularly matters for AI R&D automation because it acts as a compute efficiency multiplier. Models with strong research taste can extract more insights from less compute, because they will pick the “right” experiment to run.
AI 2027 similarly claimed that “given that without improved experiment selection (i.e. research taste) we'd hit sharply diminishing returns due to hardware and latency bottlenecks, forecasting improved experiment selection above the human range is quite important”.
How much would improvement in research taste accelerate AI R&D progress?
Not much previous work has been done in this direction: see our further research section for more open questions! With our benchmark, we attempt to answer: “Can models predict whether an approach will yield insights or improvements before executing it?“
Methodology
Generating a proxy for strategic research taste
Citation velocity tracks the rate at which a paper accumulates citations over time. We count citations in fixed time windows and fit a linear model, and then rank papers by their slope parameter.
V (t) = dC / dt
(C is cumulative citations at time t)
We think citation velocity is a decent, non-expert-requiring measure for how much a piece of research “unblocked” or accelerated the field overall, as it favors empirical algorithmic/efficiency improvements that can be used in many other pieces of research quickly.
Experimental Set-Up
We pulled 200 papers published between January and March 2024 with at least 20 citations from Semantic Scholar with the query “reinforcement learning large language model llm rl”, and measured the average number of citations they accrued per month over the first 15 months since publication. We removed all papers that were not strictly algorithmic-level RL improvements aimed at math, coding, and reasoning tasks. This gave us a “ground truth” ranking of 38 papers ranked in order of citation velocity.
For each paper, we use an LLM to extract the core approach of the paper from the abstract, which we then use to elicit pairwise rankings from a judge model using three prompts (below). We use those choices to calculate an Elo ranking of the papers. The correlation between the ground truth ranking and Elo ranking is our proxy for the judge model’s taste.
Extract core idea
Prompt 1 (normal)
Prompt 2 (goodharted)
Prompt 3 (max goodharted)
Results
We tested Claude Sonnet 4.5, Gemini 2.5 Pro, and GPT 5.1, as well as some weaker open source models, and found them all ~equally ineffective at predicting citation velocity, no matter which prompt we used. Here are representative results for the goodharted prompt:
Here, b refers to citation velocity.
One unfair failure mode here for the model might be that the “core idea” simply doesn’t give very much information about exact technical details and scope, which probably makes it difficult to predict relative citation velocity. So, for the subset of 25 papers that had full-text html versions on arXiv, we asked Claude Sonnet 4.5 to summarize the paper’s methods, formulations, and overall technical approach in 300-400 words and gave that to the judging model instead of the “core idea.”
Summary generation prompt
This did not help much.

What taste score would indicate superhuman ability? As a rough lower bound, when given the entire introduction and methods section of each paper (which often include post-hoc analysis and high-level discussion of results), Gemini 2.5 Pro achieved a score of 0.525. Therefore, based on our experiments, we’re quite confident frontier LLM’s don’t have superhuman research taste.
Making stronger claims (e.g., “LLMs clearly have subhuman research taste”) seems dicey for the reasons below.
Limitations
We think citation velocity is a flawed metric for a couple of reasons:
Again, a and b are parameters of the linear model citations(t) = a + bt, so b is citation velocity.
Learnings
Why we released these results
Researching in the open
Feedback priorities
Open questions on research taste
Our code can be found here! Please point out errors, and we love feedback!
Acknowledgement:
We thank (in no particular order) Anson Ho, Josh You, Sydney Von Arx, Neev Parikh, Oliver Sourbut, Tianyi Qiu, and many friends at UChicago XLab for preliminary feedback!