Why state-of-the-art performance on benchmarks is not the same as operational performance and why the distinction matters now more than ever
There is a ritual in machine learning: a new model achieves impressive results on a benchmark, a press release is issued, the field moves on.
Policymakers cite the numbers and executives build business cases around them.
The benchmark becomes the truth.
Benchmarks are not wrong.
They simply substitute a convenient weighting function for an economically or ethically relevant one.
Real-world payoffs tend to be concentrated, asymmetric, and domain-specific—not uniform.
When the weightings diverge, as they often do, state-of-the-art performance and operational value can come apart completely.
We can formalize the divergence.
Let R_bench denote the risk function a benchmark optimizes: losses summed uniformly across the evaluation distribution.
Let R_op be the operational risk function: losses weighted by economic exposure, asymmetric cost structures, real-time constraints, or the specific decision context in which outputs are used.
Benchmarks could in principle be designed to approximate R_op, but they rarely are.
The gap is systemic and almost never quantified.
No standard requires disclosure of the weighting gap, and almost no capability claim that travels from a research paper to policy acknowledges it.
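A toy sketch makes the divergence concrete. All losses and exposure weights below are hypothetical: the point is only that a model which wins under uniform benchmark weighting can lose once losses are weighted by exposure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-example losses for two models on the same evaluation set.
n = 1000
loss_a = rng.uniform(0.0, 1.0, n)
loss_b = loss_a + 0.05            # model B: uniformly a bit worse...
loss_b[:20] = loss_a[:20] * 0.1   # ...except on 20 high-exposure cases

# Benchmark risk: uniform weighting over the evaluation distribution.
r_bench_a = loss_a.mean()
r_bench_b = loss_b.mean()

# Operational risk: hypothetical exposure weights concentrated on the same
# 20 cases (think grid cells over wind corridors rather than open ocean).
w = np.ones(n)
w[:20] = 200.0
w /= w.sum()
r_op_a = float(w @ loss_a)
r_op_b = float(w @ loss_b)

print(r_bench_a < r_bench_b)  # True: A wins the benchmark
print(r_op_a < r_op_b)        # False: B wins operationally
```

The ranking flip is the entire problem in miniature: both numbers are real, and they disagree because the weightings disagree.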
Phrased differently, a benchmark answers "How well does this model perform on a set of fixed cases?"
An operator asks a different question: "What happens when this system makes a mistake in my environment?"
Benchmarks usually treat every example as equally important and every error as roughly comparable.
Real usage is very different.
Some errors are cheap, while others are catastrophic.
When we merge everything into a single average score, we quietly encode a particular view of what matters.
The number may be real, but the weighting is arbitrary.
This pattern recurs across domains; perhaps the clearest case is energy forecasting.
Weather: The ERA5 Problem
Frontier ML weather models, such as FourCastNet, Pangu-Weather, GraphCast, FuXi, GenCast, Aurora, and FengWu, are often evaluated on ERA5, the global reanalysis data set produced by ECMWF.
ERA5 provides hourly fields on a grid of roughly 31 km, constructed through data assimilation rather than direct observation, with quality-assured updates published around five days behind real time and a fully validated product available two to three months after the fact.
ERA5 is a retrospective reconstruction of past atmospheric states.
It is distinct from ECMWF's operational IFS, a live forecast system evaluated against real-time station observations and radiosonde data through WMO verification protocols.
When ML papers benchmark against ERA5, they are measuring similarity to a smoothed historical reconstruction, not performance under operational conditions, where initial conditions almost always differ from the reanalysis in the coverage and quality of forecast-relevant variables.
The numbers cited in headlines are typically these ERA5 comparisons.
A 31 km grid cell spans roughly 1,000 km².
Utility-scale solar farms occupy single-digit square kilometres.
Modern offshore wind turbines have rotor diameters of about 250 metres, a fraction of a single grid cell.
Coastal shear, terrain-driven flow, wake effects, and local cloud variability are not visible at ERA5 scale.
Spatial resolution is not the only mismatch.
ERA5 provides winds at standardized vertical levels, such as 10 m and 100 m above ground.
Modern onshore turbines commonly operate at hub heights exceeding 120 m, whereas offshore turbines go higher still.
Hub-height wind is either interpolated from model levels or inferred diagnostically rather than directly observed.
Sub-grid shear profiles, wake effects, and terrain-induced flow variation remain unresolved at 31 km.
The economically relevant state variable, wind at turbine hub height within a wind farm's specific terrain and wake environment, is not the same thing as the benchmark target.
The power curve makes the asymmetry explicit.
Below the cut-in wind speed of 3–4 m/s, a turbine does not produce any power.
Between the cut-in and rated speed, output scales roughly with the cube of wind speed, and above the cut-out speed of around 25 m/s, turbines are shut down entirely.
A 1 m/s error near the rated speed is nearly irrelevant, whereas the same error near cut-in can mean the difference between zero output and full ramp-up.
Symmetric squared-error metrics do not encode this.
Even proper scoring rules like CRPS, which do encode distributional uncertainty, are still averaged uniformly across grid cells unless explicitly re-weighted by exposure.
Geographic weighting by economic significance is mostly absent in the ML-powered weather forecast literature.
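The asymmetry can be sketched with a stylized power curve. The cut-in, rated, and cut-out values below are illustrative, not any specific turbine's; the point is that the cost of a 1 m/s error depends entirely on where on the curve it lands.

```python
def turbine_power(v, cut_in=3.5, rated=12.0, cut_out=25.0):
    """Stylized power curve as a fraction of rated output.

    cut_in, rated, and cut_out are illustrative values; real curves are
    manufacturer- and site-specific.
    """
    if v < cut_in or v >= cut_out:
        return 0.0  # below cut-in, or shut down above cut-out
    if v >= rated:
        return 1.0  # flat at rated power
    # Roughly cubic ramp between cut-in and rated speed.
    return (v**3 - cut_in**3) / (rated**3 - cut_in**3)

# The same 1 m/s error has very different consequences along the curve:
err_mid_ramp     = turbine_power(9.0)  - turbine_power(8.0)    # steep cubic region
err_near_rated   = turbine_power(16.0) - turbine_power(15.0)   # flat region: zero
err_near_cut_out = turbine_power(24.5) - turbine_power(25.5)   # full power vs none
print(err_mid_ramp, err_near_rated, err_near_cut_out)
```

A symmetric squared-error metric treats all three of those 1 m/s errors identically; the revenue impact ranges from nothing to the plant's entire output.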
Fleet-level aggregation across hundreds of turbines does dampen single-turbine non-linearities.
Grid operators still need site-level and farm-level forecasts to manage ramp events and curtailment decisions.
Short-term local volatility is precisely what uniform global averaging obscures.
According to IRENA's 2025 renewable capacity statistics, China, the United States, and the European Union accounted for 83.6% of all renewable capacity installed in 2024.
Renewable infrastructure is clustered in wind corridors, coastal zones, and high-irradiance areas.
Many ML weather papers report global average skill scores.
A grid cell over the central Pacific or in the Andes receives the same weight as one over the North Sea, irrespective of economic value.
For generic weather models that is perhaps expected, although with 71% of the planet's surface covered by oceans, global averages dominated by areas where no one lives are also of limited value to people who simply want to know whether it is about to rain where they are.
A model can improve while degrading at every location that actually matters, not because it is failing globally, but because uniform weighting systematically undervalues the locations where errors are most costly.
No widely adopted, exposure-weighted benchmark exists for site-level wind farm output forecasting, solar plant generation under tilt-specific irradiance, or basin-level hydro inflow prediction.
Operational forecasting relies on proprietary SCADA data, plant telemetry, and market-specific balancing constraints that are not publicly accessible.
The research community evaluates against what is available, and what is available is ERA5, not the operational loss functions that utilities actually face: R_bench is observable, yet R_op often is not.
Weather forecasting illustrates the problem of physical state: the benchmark target differs from the economically decisive one.
The same divergence appears in systems where the target is not atmospheric state but human preference or reliability.
Recommenders: engagement vs utility
In recommender systems, the fundamental mismatch is between engagement and utility.
Most recommender systems, and the benchmarks that evaluate them, optimize for clicks, time spent, and short-term interaction signals, tractable proxies for the harder question of whether the system is actually delivering value over time for both user and provider.
As Milli et al. documented, engagement signals are systematically misaligned with user utility, a mismatch psychology has long recognized in the distinction between impulsive and deliberate preferences.
Optimizing for short-term engagement reinforces System 1's impulse-driven behaviour at the expense of System 2's utility-driven choices, with measurable long-term effects on retention and satisfaction.
Standard offline evaluation compounds this by relying on historical interaction logs, which record what users clicked, not what they valued.
Revealed preference is assumed to be a reliable signal of utility, but it is not, particularly when the system shapes what users see.
There is also the question of identity instability.
Standard benchmarks assign equal weight to each prediction and treat user identity as stable across sessions, even though 36% of streaming subscription holders share accounts across two or more households, a lower bound on session-level identity ambiguity.
The magnitude of the resulting recommendation quality degradation has not been systematically quantified, although the structural direction is clear.
If we take historical interaction logs and probabilistically reassign a fraction of sessions to latent "household members" based on realistic account sharing rates, we can stress-test recommenders under deliberate identity mixing.
If the performance degrades even at modest mixing levels, then identity stability is not a harmless modelling simplification.
Offline benchmark scores then become conditional on an assumption that does not hold in the real world.
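One way to run this stress test is a simple session-reassignment sketch. The account IDs, sharing rate, and household size here are hypothetical parameters to sweep, not measured values.

```python
import random

def mix_identities(logs, mix_rate=0.36, members=2, seed=0):
    """Probabilistically reassign interaction records to latent household
    members, simulating account sharing.

    `logs` is a list of (account_id, item_id) records.  The default
    mix_rate echoes the 36% account-sharing figure cited above, but the
    true session-level rate is unknown: treat it as a parameter to
    sweep, not a fact.
    """
    rng = random.Random(seed)
    mixed = []
    for account_id, item_id in logs:
        if rng.random() < mix_rate:
            # Attribute this record to a synthetic co-user of the account.
            account_id = f"{account_id}#m{rng.randrange(1, members + 1)}"
        mixed.append((account_id, item_id))
    return mixed

# Sweep mix_rate, re-evaluate the recommender on the mixed logs, and watch
# whether ranking metrics degrade; roughly mix_rate * len(logs) records
# end up reassigned at each level.
logs = [("u1", f"i{k}") for k in range(100)]
mixed = mix_identities(logs)
```

If metrics fall sharply even at modest mix rates, the benchmark score was measuring an identity-stability assumption as much as the model.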
LLMs: when benchmarks select for confident confabulation
Large language models (LLMs) present the same pattern with an additional complication: benchmarks actively select for the wrong behaviour.
Consider MMLU, one of the most widely cited capability benchmarks.
The model must select A, B, C, or D; there is no option to say "I don't know" or "I need more context."
Abstention is penalized the same as an incorrect answer.
A model that reduces abstentions and guesses more aggressively rises on the leaderboard regardless of what that does in operational settings where confident confabulation can be costly.
Some evaluations nowadays include calibration and refusal metrics, yet the models that surface as frontier in most widely cited rankings, especially the ones that reach policy discussions and procurement decisions, are not necessarily the most capable in any operationally meaningful sense.
As I have argued elsewhere, the field has built evaluation infrastructure that optimizes for fluent guessing and reports the results as general capability, even where R_bench has little to do with R_op.
A confident confabulation in a medical, legal, or safety-critical context carries far greater cost than a cautious refusal.
A model that reduces abstentions and improves average accuracy can rise on the leaderboard while increasing expected harm in exactly the contexts where harm is most costly.
Why the gap persists
The weighting mismatch would be easier to fix if the people who build models were the same people who live with the consequences of a model failure.
They are not, and that separation is structural.
In most organizations, data or research scientists develop models in environments optimized for exploration, benchmarked against whatever labelled data is available.
The prototype-to-production chasm is genuinely hard to cross: no tooling bridges the gap well enough between a feature as a hypothesis and a feature as a production contract, so models are handed off across an organizational boundary rather than graduated through one.
What survives the hand-off is the benchmark score.
Engineering receives the model and is responsible for operationalizing it, but typically had no role in defining the evaluation criteria.
This split, science vs engineering, is a documented anti-pattern that erodes accountability for end-to-end performance.
In SaaS setups, value is captured through subscription revenue, not model performance directly.
A model that degrades can take quarters to show up in churn, and by then the causal chain is long enough to dispute.
The team that built the model is not the team bearing the cost.
This specific mechanism has not been systematically quantified in the literature and should be treated as a structural hypothesis rather than a demonstrated fact.
The incentive structure, in which the people who evaluate models are decoupled from the people who experience their failures, is consistent with the broader pattern of metric misalignment documented across industries.
Without a feedback loop, awareness is not enough to produce a correction.
The gap as a governance problem
For most of the history of machine learning, the benchmark gap was largely internal.
Practitioners knew the numbers were optimistic and adjusted informally.
The cost of miscalibration was largely confined to the research community.
That is no longer the case.
The EU AI Act, in force since August 2025, classifies general-purpose AI models as carrying systemic risk based primarily on training compute thresholds.
Benchmark results are explicitly named in the regulation as evidence that providers can submit to contest a systemic risk designation.
Compute and benchmark performance are measurable.
The regulation uses compute as a proxy for capability and permits benchmark performance as counter-evidence against risk: both are standing in for R_op.
The models subject to the most stringent oversight obligations are identified through a proxy the research community already knows is imperfect, and they can rebut oversight through a second proxy that is also imperfect, all without any requirement to disclose how either proxy relates to the operational risk the regulation is trying to govern.
Policymakers work with the information the field provides, and the field provides the wrong weighting function without saying so.
Generic benchmarks are defensible for research exploration.
The problem arises when those scores migrate into operational, financial, or regulatory decisions.
Escaping the benchmark trap
Benchmark papers should report their weighting function explicitly, not just aggregate performance but the ways in which the evaluation distribution and loss function differ from plausible operational conditions.
Each benchmarked result should be accompanied by a loss sensitivity curve, in which the weighting function is varied across plausible operational regimes to show how the model ranking is affected.
If the relative order is invariant, the claim is robust.
If it flips, the "state-of-the-art" is only conditional on a specific and rarely disclosed choice of weighting.
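A loss sensitivity curve of this kind can be sketched by interpolating between the uniform benchmark weighting and a candidate operational weighting. The losses and exposure vector below are hypothetical; the mechanism, not the numbers, is the point.

```python
import numpy as np

def sensitivity_curve(loss_a, loss_b, w_op, steps=11):
    """Record which model has lower risk as the weighting interpolates
    between uniform (lam = 0) and a candidate operational weighting
    w_op (lam = 1).  If the winner flips along the curve, the SOTA
    claim is conditional on the choice of weighting."""
    n = len(loss_a)
    uniform = np.full(n, 1.0 / n)
    w_op = np.asarray(w_op, dtype=float)
    winners = []
    for lam in np.linspace(0.0, 1.0, steps):
        w = (1.0 - lam) * uniform + lam * w_op
        a_risk = np.average(loss_a, weights=w)
        b_risk = np.average(loss_b, weights=w)
        winners.append("A" if a_risk < b_risk else "B")
    return winners

# Three evaluation cells; all operational exposure sits on the third.
loss_a = np.array([0.05, 0.05, 0.6])  # model A: great except where it matters
loss_b = np.array([0.3, 0.3, 0.3])    # model B: mediocre everywhere
print(sensitivity_curve(loss_a, loss_b, w_op=[0.0, 0.0, 1.0]))
# Winner flips from "A" at the uniform end to "B" at the exposure end
```

Publishing this curve alongside the headline score costs little and makes the conditionality of a "state-of-the-art" claim explicit.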
Capability claims should include out-of-distribution performance.
A model that achieves state-of-the-art performance on a benchmark yet degrades substantially on operational data is a different model than one that does not.
That difference should be reported.
LLM evaluation should penalize confident errors more than abstentions.
Calibrated uncertainty is a capability, and it ought to be measured as one.
As a starting point, we can re-score existing benchmarks with higher penalties for confident errors in safety-critical domains and lower penalties for abstention.
If the leaderboard order changes under a harm-weighted metric, then current rankings are selecting for fluency under symmetric loss rather than reliability under asymmetric cost.
This does not require new benchmarks, only that we admit that the cost of being wrong is not constant across questions.
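A minimal re-scoring sketch, with illustrative penalty values, shows how the ranking between a guessing model and a calibrated one can invert once abstention is cheaper than a confident error.

```python
def harm_weighted_score(responses, wrong_penalty=2.0, abstain_penalty=0.5):
    """Re-score a benchmark run under asymmetric loss.

    `responses` is a list of "correct" / "wrong" / "abstain" outcomes.
    Plain accuracy scores correct=1 and everything else 0, so a confident
    guess weakly dominates an abstention.  Here a confident error costs
    more than an abstention; the penalty values are illustrative only.
    """
    score = 0.0
    for r in responses:
        if r == "correct":
            score += 1.0
        elif r == "wrong":
            score -= wrong_penalty
        else:  # abstain
            score -= abstain_penalty
    return score / len(responses)

# A guessing model vs. a calibrated model on the same 100 questions:
guesser    = ["correct"] * 60 + ["wrong"] * 40                    # 60% accuracy
calibrated = ["correct"] * 55 + ["wrong"] * 5 + ["abstain"] * 40  # 55% accuracy

print(harm_weighted_score(guesser))     # (60 - 80) / 100 = -0.2
print(harm_weighted_score(calibrated))  # (55 - 10 - 20) / 100 = 0.25
```

Under plain accuracy the guesser wins, 60% to 55%; under the asymmetric loss the ranking reverses, using exactly the same benchmark questions.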
The field should be honest about deployment norms.
Shipping a non-negative signal, deploying when evidence says "probably not worse" rather than "demonstrably better," is understandable under resource constraints, but it is not a scientific standard.
When proxies scale
Goodhart's Law is decades old, and the gap between proxy and target is not a new discovery.
What is new is the scale of downstream decisions made on the basis of benchmark scores.
When those decisions concern energy infrastructure, clinical systems, legal tools, or the scope of AI regulation itself, the cost of optimizing the wrong weighting function is no longer buffered by the research community.
The gap between reported capability and operational performance is incentive-driven, predictable from benchmark design, and almost never disclosed.
It is being ignored at a scale neither the field nor society can afford.
And so the ritual continues.
The press release is issued, and the field moves on.
It should not.