The Evals Gap

Marius Hobbhahn

In our engagements with governments, AI safety institutes, and frontier AI developers, we found the concept of the “evaluation gap” (short: ‘evals gap’) helpful to communicate the current state of the art and what is needed for the field to move towards more robust evaluations. In this post, we briefly explain the concept and its implications. For the purpose of this post, “evals” specifically refer to safety evaluations of frontier models.

Evals have become a prominent tool underpinning governance frameworks and AI safety mechanisms. Given that, we are concerned that policymakers and industry players both (i) overestimate the number of currently available high-quality evals and (ii) underestimate the time it takes to develop them. In our experience, available evals are not sufficient (in quality and quantity) to robustly identify the capabilities of existing and near-future models.

We call this overarching idea the evals gap. Unless more focused attention is paid to this gap and efforts diverted to closing it, we expect this trend to continue and, subsequently, the gap to increase.

We think it is possible to close the evals gap. This post serves as a call to action to be more ambitious with evals efforts, e.g. dedicate more resources to the science, development, and running of evals.

Evaluations underpin many high-stakes decisions

Many high-stakes decisions in company-led and government-led frameworks are reliant on the results of evals. In voluntary commitments such as Anthropic’s Responsible Scaling Policy, OpenAI’s Preparedness Framework, and Google DeepMind’s Frontier Safety Framework, mitigation measures and deployment decisions are directly tied to the identification of specific capabilities through evals. While these voluntary policies differ between companies, some consensus on best practices, including the role of evals, is beginning to be reached through e.g. the Frontier AI Safety Commitments. Evals are also a core part of legislation and governance frameworks, such as the EU AI Act or the US Executive Order.

We think evals are an important tool for AI governance and AI safety, and we think it is good that governments and AI companies use them in their safety frameworks. However, given the high stakes involved, it is crucial to ensure that the evals used to underpin these decisions are adequate.

Current evaluations are insufficient to underpin high-stakes decisions

Many important evals don’t yet exist

The aforementioned safety frameworks sometimes require evals that either do not yet exist or, where they do, evals that are in an early development phase and incapable of generating sufficiently robust evidence.

For example, the EU AI Act (Article 51) states, “A general-purpose AI model shall be classified as a general-purpose AI model with systemic risk if it meets any of the following conditions: (a) it has high impact capabilities evaluated on the basis of appropriate technical tools and methodologies, including indicators and benchmarks;[...]” In this case, the “high impact capabilities” in question are yet to be specified. Until that happens, evaluators are required to guess which evals will be required to identify them.

In their Responsible Scaling Framework(October 2024 version), Anthropic writes, “We are currently working on defining any further ‘Capability Thresholds’ that would mandate ASL-4 Required Safeguards [...]”. This statement indicates that a full ASL-4 evals suite that could provide strong evidence about catastrophic risks does not yet exist (Though, some of the respective capability evaluations already exist, e.g. Benton et al., 2024).^[1]

Coverage is typically low

From our interactions with others in the evals ecosystem (e.g. AI developers, third-party evaluators, academics, AISIs, and other government agencies), we noted a consensus that the coverage of evals is low, i.e. that there are much fewer evals than needed for any given risk domain^[2]. Thus, even when there exist evals for a given risk domain, they only cover a small number of potential threat scenarios. A shared belief among the people we spoke to was that existing evals can only spot-check their specific risk domain and should not be conflated with a rigorous assessment.

In general, we want evals to be predictive of real scenarios, i.e. their results to indicate what we should expect from real deployment situations. This is a big assumption since real deployments can capture countless different use cases across a wide variety of scenarios and users. Therefore, creating good coverage is a significant endeavor that would optimally result in a comprehensive suite of evals per risk domain.

While there are many important efforts to create new and better evals (e.g. Kinniment et al., 2023; Phuong et al., 2024; Mouton et al., 2024; OpenAI, 2024; Benton et al., 2024, see more here), the field is nowhere near having good coverage of all relevant risks and domains. Thus, the existing evals mostly function as a spot check for their corresponding risk domains, providing some evidence but not sufficient coverage.

Development and interpretation of evals is complicated

The development of a single suite of high-quality evals can require the full-time effort of a small research team for multiple months. This process can come with many unforeseen challenges.

For example, most safety frameworks (e.g. Anthropic’s RSP, Google DeepMind’s FSF, OpenAI’s PF) mention risks from model autonomy and sometimes even specifically talk about a model’s ability to accelerate AI R&D. SWE Bench (Jimenez et al., 2023) is an evaluation framework consisting of 2,294 software engineering problems drawn from real GitHub issues. It is often used to compare the software engineering abilities of LM agents and used as a proxy for autonomy and AI R&D capabilities. However, it was later found that a significant number of these coding problems were underspecified or otherwise unsolvable. Thus, any reported results on the original version of SWE Bench were misleading. This led to the introduction of SWE-Bench Verified, which only includes coding problems confirmed to be solvable and required hiring many skilled software engineers to manually validate each sample.

This example should not be misconstrued as a criticism of the original SWE-Bench authors. Rather, it is a demonstration of the fact that building large-scale, high-quality evaluation benchmarks is complicated and expensive. Furthermore, there are various issues with other equally widely used benchmarks like MMLU or BBQ, third-party evaluation frameworks, model-written evals, and more (see Anthropic’s Challenges in evaluating AI systems for details).

On top of that, interpreting evals results too can be challenging. For example, despite noteworthy outcomes on our most recent publicly available evals suite (see o1-preview system card), we found it challenging to formulate clear and robust takeaways or tie them to very concrete recommendations for the developers. The three main reasons for this are (a) defining clear capability thresholds and designing evals for them is still a nascent research area (and, in our opinion, part of the evals gap); (b) even if we have clear thresholds, the empirical evidence might be insufficient or hard to interpret; (c) with current techniques, there is a concern that the model’s capabilities were not maximally elicited.

Proper elicitation is an unsolved research question

In evals, we often encounter the problem that “absence of evidence is not (strict) evidence of absence.”^[3] Just because a model didn’t show a capability in our eval does not mean it doesn’t have this capability under different circumstances.

So far, we neither have a principled way to quantify “how hard we tried” nor a way to upper-bound the model's capabilities through evaluations. This poses a challenge because we cannot accurately estimate what capabilities the model will have after widespread deployment (and thus which harm it might create).

As an analogous example, consider jailbreaking. AI developers typically extensively evaluate their model for jailbreaks before release, i.e. they check if they can find clever ways to get their model to respond to harmful queries. Presumably, their internal efforts conclude that the model is hard to jailbreak before they release it. Nevertheless, in almost all recent releases, a model has been jailbroken by users within a few hours or days available. This is comparable to the problem we face with evals. A small number of evaluators have a limited amount of time to assess the model’s capabilities and then make a prediction about the revealed capabilities during deployment. We currently don’t have a rigorous theory of predicting these revealed capabilities.

Without action, the evals gap may widen

Higher capabilities require more evals. Increased capabilities imply that the model is able to do a larger number of tasks. For example, GPT-4 can be used in many more use cases than GPT-2. Therefore, the space that evals for GPT-4 have to cover is much larger than for GPT-2.

Higher capabilities require more complex evals. The more complex the capabilities of models, the more complex and time-consuming it is to foresee, develop, and run evals. For example, consider that until recently, we could evaluate most relevant LLM behaviors with QA benchmarks. Since models are now capable enough of acting as LM agents, evals have to be increasingly complex tasks, which significantly increases the overhead per eval.

Increased stakes for evals in the future. Right now, it appears that most models are not capable enough to trigger evals with high-stakes consequences. For example, OpenAI assessed their o1-preview model to be medium capable of CBRN and persuasion^[4]. Since only models with a post-mitigation score of "medium" or below can be deployed, o1-preview could be deployed. Anthropic has assessed their Claude-3 model family to be in category ASL-2^[5], where most costly mitigations only start with ASL-3.

As capabilities in AI models increase, more and more evals identifying specific thresholds tied to high-stakes reactions will get surpassed and trigger increasingly important consequences. Eventually, the results of evals will have a direct effect on billion-dollar deployment decisions. For example, Anthropic’s ASL-3 specifies a large number of mitigations and actions on model security (corresponding to SL-4 in RAND’s Securing AI Model Weights report). These mitigations could take significant time to implement, which would lengthen the time to deploy their model and may come at a direct financial cost. If a model provider has to postpone their deployment by multiple months due to a relevant threshold being passed and having to improve their safety guardrails, they might lose out on important investments.

We expect that these increased stakes will lead to increased pressure on and scrutiny of evals and their results. Furthermore, in uncertain cases, AI developers might be incentivized to criticize these evals, question their legitimacy, or, in extreme cases, even go to court with legislators to reduce their financial burden. This necessitates that the quantity and quality of evaluations rise to the challenge posed by rapid AI progress. Only then can evals provide a meaningful mechanism to rigorously test for all relevant risk domains and support other safety mechanisms in a defense-in-depth approach.

Closing the evals gap is possible

With sufficient foresight, we can close the evals gap. All of the aforementioned problems are solvable with more resources, people, and time. However, AI progress is fast, and high-stakes decisions will have to be made in the near future. Given the magnitude of the stakes, these decisions will likely be made with whatever evals are available at the time and will not be put on hold until better evals are developed. Therefore, we think closing the evals gap is an urgent priority.

We suggest the following efforts as initial areas for improvement:

Directly fund evals development: Merely funding the running of evals is not sufficient. Conducting evals is often only a small fraction (e.g. 10%) of the total effort and cost going into evaluations. We recommend to:
1. Fund an external ecosystem of evaluation builders. We recommend casting a wide net, e.g. including private organizations, non-profits, academic labs, and individual researchers.
2. Fund government bodies such as AISIs or AI offices well so that they can hire the technical talent needed to develop, run, and judge evaluations.
3. Grow the evaluation teams within frontier AI companies such that they can better prepare for systems with increased capabilities.
Fund the science of evals: Improvements in the science of evals benefit the entire field. For example, a better understanding of how to design high-quality evals at scale would both increase their usefulness and reduce their costs.
Shape market incentives: Because most evals efforts are currently based on voluntary commitments, the incentive to pay for or build evals is small. If there were stronger incentives, more people and organizations would specialize in building evals. While we commend efforts like Anthropic’s Initiative for developing third-party model evaluations, market incentives would have to change more broadly for an ecosystem to develop at the required pace.

^{^}
This is not a specific criticism of Anthropic. We think it is good that they state their level of preparedness and their plans in public to expose them to external feedback.
^{^}
Note that these reflect our interpretations of these conversations and are not the result of more systematic efforts such as structured interviews.
^{^}
Technically, absence of evidence is some evidence of absence. The core point is that it should not be used as conclusive evidence unless we have exhaustively tested all options or have other arguments that imply strong evidence of absence.
^{^}
This has not been externally verified.
^{^}
This has not been externally verified.

Development and interpretation of evals is complicated

Proper elicitation is an unsolved research question

... and yet...

Closing the evals gap is possible

Why are you sure that effective "evals" can exist even in principle?

I think I'm seeing a "we really want this, therefore it must be possible" shift here.

Why are you sure that effective "evals" can exist even in principle?

Relatedly, the point which is least clear to me is what exactly would it mean to solve the "proper elicitation problem" and what exactly are the "requirements" laid out by the blue line on the graph. I think I'd need to get clear on this problem scope before beginning to assess whether this elicitation gap can even in principle be crossed via the methods which are being proposed (i.e. better design & coverage of black box evaluations).

As a non-example, possessing the kind of foundational scientific understanding which would allow someone to confidently say "We have run this evaluation suite and we now know once and for all that this system is definitely not capable of x, regardless of whatever elicitation techniques are developed in the future" seems me to be Science-of-AI-complete and is thus a non-starter for a north star for an agenda aimed at developing stronger inability arguments.

When I fast forward the development of black box evals aimed at supporting inability arguments, I see us arriving at a place where we have:

More SWE-Benchesque evaluations across critical domains which are perhaps "more Verified" by having higher quality expert judgement passed upon them.
Some kind of library which brings together a family of different SOTA prompting and finetuning recipes to apply to any evaluation scenario.
More data points and stronger forecasts for post training enhancements (PTEs).

Which would allow us to make the claim "Given these trends in PTEs, and this coverage in evaluations, experts have vibed out that the probability that of this model being capable of producing catastrophe x is under an acceptable threshold" for a wider range of domains. To be clear, that's a better place than we are now and something worth striving for but not something which I would qualify as "having solved the elicitation problem". There are fundamental limitations to the kinds of claims which black box evaluations can reasonably support, and if we are to posit that the "elicitation gap" is solvable it needs have the right sorts of qualifications, amendments and hedging such that it's on the right side of this fundamental divide.

Note, I don't work on evals and expect that others have better models than this. My guess is that @Marius Hobbhahn has strong hopes on the field developing more formal statistical guarantees and other meta-evaluative practices as outlined in the references in the science of evals post, and would thus predict a stronger safety case sketch than the one laid out in the previous paragraph, but what the type signature of that sketch would be, and consequently how reasonable this sketch is given fundamental limitations of black box evaluations, is currently unclear to me.

These are all good points. I think there are two types of forecasts we could make with evals:

1. strict guarantees: almost like mathematical predictions where we can proof that the model is not going to behave in a specific way even with future elicitation techniques.
2. probabilistic predictions: We predict a distribution of capabilities or a range and agree on a threshold that should not be crossed. For example, if the 95% upper bound of that distribution crosses our specified capability level, we treat the model differently.

I think the second is achievable (and this is what the post is about), while the first is not. I expect we will have some sort of detailed scaling laws for LM agent capabilities and we will have a decent sense of the algorithmic progress of elicitation techniques. This would allow us to make a probabilistic prediction about what capabilities any given model is likely to have, e.g. if a well-motivated actor is willing to spend $1M on PTE in 4 years.

Additionally, I expect that we would get quite a long way with what Lucas calls "meta-evaluative practices", e.g. getting a better sense of how wrong our past predictions were and accounting for that. I think this could have the form of "We invested $1M, 10 FTE-years and X FLOP to elicit the best capabilities; Let's predict what 10x, 100x, 1000x, etc." of that could achieve accounting for algorithmic progress.

Finally, I really think evals are just one part of a bigger defense-in-depth strategy. We still need control, scalable oversight, interpretability, governance, etc. The post is merely trying to express that for the evals part of that strategy, we should internalize which kind of scientific rigor we will likely need for the decisions we have tied to evals results and make sure that we can achieve them.

For context, I just trialed at METR and talked to various people there, but this take is my own.

I think further development of evals is likely to either get effective evals (informal upper bound on the future probability of catastrophe) or exciting negative results ("models do not follow reliable scaling laws, so AI development should be accordingly more cautious").

The way to do this is just to examine models and fit scaling laws for catastrophe propensity, or various precursors thereof. Scaling laws would be fit to elicitation quality as well as things like pretraining compute, RL compute, and thinking time.

In a world where elicitation quality has very reliable scaling laws, we would observe that there are diminishing returns to better scaffolds. Elicitation quality is predictable, ideally an additive term on top of model quality, but more likely requiring some more information about the model. It is rare to ever discover a new scaffold that can 2x the performance of an already well-tested models.
In a world where elicitation quality is not reliably modelable, we would observe that different methods of elicitation routinely get wildly different bottom-line performance, and sometimes a new elicitation method makes models 10x smarter than before, making error bars on the best undiscovered elicitation method very wide. Different models may benefit from different elicitation methods, and some get 10x benefits while others are unaffected.

It is NOT KNOWN what world we are in (worst-case assumptions would put us in 2 though I'm optimistic we're closer to 1 in practice), and determining this is just a matter of data collection. If our evals are still not good enough but we don't seem to be in World 2 either, there are endless of tricks to add that make evals more thorough, some of which are already being used. Like evaluating models with limited human assistance, or dividing tasks into subtasks and sampling a huge number of tries for each.

I agree. In which case, I think the concrete proposal of "We need to invest more resources in this" is even more important. That way, we can find out if it's impossible soon enough to use it as justification to make people stop pretending they've got it under control.

Yeah, it's not a watertight argument and somewhat based on my current interpretation of past progress and projects in the making.

1. Intuitively, I would say for the problems we're facing in evals, a ton of progress is bottlenecked by running fairly simple experiments and iterating fast. A reasonable part of it feels very parallelizable and the skill required is quite reachable for many people.
2. Most evals questions feel like we have a decent number of "obvious things" to try and since we have very tight feedback loops, making progress feels quite doable.

Intuitively, the "hardness level" to get to a robust science of evals and good coverage may be similar to going from the first transformer to GPT-3.5; You need to make a lot of design choices along the way, lots of research and spend some money but ultimately it's just "do much more of the process you're currently doing" (but we should probably spend more resources and intensify our efforts because I don't feel like we're on pace).

In contrast, there are other questions like "how do we fully map the human brain" that just seem like they come with a lot more fundamental questions along the way.

I'm curious if there is a need for developing safety evals with tasks that we want an AI to be good at vs. ones that we don't want an AI to be good at. For example, a subset of evals that, if any AI or combination of AIs can solve it, it becomes a matter of national security. My thinking is that this would be capital-intensive and would involve making external tools LLM-interfaceable, especially gov't ones. I'm not even sure what these would look like, but in theory, there are tasks that we don't want a widely-available LLM to ever solve (e.g. bio-agent design eval, persuasion, privacy, etc.)

Development and interpretation of evals is complicated

Proper elicitation is an unsolved research question

... and yet...

Closing the evals gap is possible

Why are you sure that effective "evals" can exist even in principle?

I think I'm seeing a "we really want this, therefore it must be possible" shift here.

Why are you sure that effective "evals" can exist even in principle?

When I fast forward the development of black box evals aimed at supporting inability arguments, I see us arriving at a place where we have:

More SWE-Benchesque evaluations across critical domains which are perhaps "more Verified" by having higher quality expert judgement passed upon them.
Some kind of library which brings together a family of different SOTA prompting and finetuning recipes to apply to any evaluation scenario.
More data points and stronger forecasts for post training enhancements (PTEs).

For context, I just trialed at METR and talked to various people there, but this take is my own.

In a world where elicitation quality has very reliable scaling laws, we would observe that there are diminishing returns to better scaffolds. Elicitation quality is predictable, ideally an additive term on top of model quality, but more likely requiring some more information about the model. It is rare to ever discover a new scaffold that can 2x the performance of an already well-tested models.
In a world where elicitation quality is not reliably modelable, we would observe that different methods of elicitation routinely get wildly different bottom-line performance, and sometimes a new elicitation method makes models 10x smarter than before, making error bars on the best undiscovered elicitation method very wide. Different models may benefit from different elicitation methods, and some get 10x benefits while others are unaffected.

56

The Evals Gap

56

Ω 30

Evaluations underpin many high-stakes decisions

Current evaluations are insufficient to underpin high-stakes decisions

Many important evals don’t yet exist

Coverage is typically low

Development and interpretation of evals is complicated

Proper elicitation is an unsolved research question

Without action, the evals gap may widen

Closing the evals gap is possible

56

Ω 30

56

Ω 30