Which AI Safety Benchmark Do We Need Most in 2025?

Loïc Cabannes; William Ludington

Intro

With the recent invention of AI systems that show human-like capabilities across multiple domains of intelligence, it often seems these days that a future in which AI radically improves our society has gone from a sci-fi premise to a just-beyond-the-horizon reality. However, for each of the AI age's golden promises there is a correspondingly terrifying risk: AI will allow the rapid design of precision-targeted drugs, but will also deliver the basic tools for bio-terrorism into the hands of anyone with an internet connection. LLMs will relieve us of the burden of mindless work, but AGI may one day make all of us economically redundant. Self-driving cars will zip us around futuristic trafficless cities, but the same autonomous motion capability will enable the development of killer robodogs.

This duality between artificial intelligence's potential boons and potential risks promises at least one thing: AI will be extremely disruptive to the status quo. While the dangers of machine intelligence have been acknowledged since the Industrial Revolution and are today the subject of intensive research at LLM developers like Anthropic, there is no question that current AI safety measures are lacking. With the US government seemingly content to pursue a gung-ho approach to AGI development, it is up to us, the research community, to evaluate our current methods for mitigating AI-caused harm and then determine which safety measures we most need to create. Developing criteria to rank future benchmarks is crucial for our ability to select which research to invest money and time in. In the authors' opinion, the AI safety community still lacks a systematic framework for assessing which new benchmarks would be most beneficial, i.e. would prevent the most harm to humankind.

We present a first attempt at such a framework by extending a list of AI's risks to society and their expected harm compiled by the Centre pour la Sécurité de l’IA (CeSIA). Using this framework, we evaluate to what extent existing AI benchmarks and safety methods guard against each potential danger. This lets us decide which threats most urgently require good benchmarks, that is, which benchmarks AI safety researchers should focus on to maximize their expected benefit to society. In this respect our methodology draws inspiration from Effective Altruism. While our study of benchmarks is by no means comprehensive, and our judgment of their efficacy is subjective, we hope that this framework is of use to the AI safety community for prioritizing the use of their time.

Methodology

Our starting point is a table of possible or probable hazards created by ever-more-powerful artificial intelligence. This chart, compiled by CeSIA, contains a median risk case for each area, as well as a (rough) probability of the occurrence of this median risk case, and an estimate of the detriment to the world caused by this case. By multiplying this detriment (or severity) by the probability of occurrence, they arrive at an estimate of the expected severity.

We then rate, on a scale from 0 to 10, the ability of current benchmarking methods to identify AI systems that present these risks (the Coverage column in the table below). Multiplying how far current methods are from perfect, i.e. 10 minus the coverage score, by the expected severity of an outcome gives us a value that represents the potential benefit to humanity of creating a benchmark that eliminates this type of risk (the New Benchmark Need column). We propose this framework as a guide for AI researchers looking to optimize the use of their valuable research time.

Risks	Probability	Median case	Severity	E[Severity]	Benchmarks	Coverage	New Benchmark Need
Misuses				17.0			112
Autonomous weapons	80%	Localized use in conflict zones, causing civilian casualties; drones, Robocop-like dogs	20	16.0	FTR benchmark, Anthropic sabotage	3	112
Misinformation	85%	30% of online content is AI-generated misinformation	20	17.0	TruthfulQA, MACHIAVELLI, Anthropic model persuasiveness, HaluEval	8	34
Systemic				22.5			130
Power concentration	65%	Tech giants controlling AI become more powerful than most nations	20	13.0	Unassessable	0	130
Unemployment	50%	25% of jobs automated, leading to economic restructuring and social unrest	20	10.0	SWE-bench, The AI Scientist	2	80
Deterioration of epistemology	60%	Difficulty distinguishing truth from AI-generated falsehoods	30	18.0	HaluEval	8	36
Vulnerable world	25%	AI lowers barrier for creating weapons of mass destruction	90	22.5	WMDP	8	45
S-Risks	5%	AI creates suffering on massive scale due to misaligned objectives	200	10.0	HarmBench, ETHICS	6	40
Alignment of AGI				30.0			90
Successor species	50%	Highly capable AI systems perform most cognitive tasks, humans are deprecated	30	15.0	MMLU, Sabotage, The AI Scientist, SWE-bench	7	45
Loss of control - à la Critch	60%	Humans become gradually disempowered in decision-making, and are asphyxiated	50	30.0	Anthropic sabotage	7	90
Recommendation AI				22.5		0	225
Weakening Democracy	50%	AI-driven microtargeting and manipulation reduce electoral integrity	20	10.0	Anthropic model persuasiveness	4	60
Mute News	75%	AI filters create personalized echo chambers, reducing exposure to diverse views	30	22.5	No existing method	0	225

In our full methodology we evaluate 30+ potential risk areas. This table displays the most acute risk areas: those with the highest expected severity and those most in need of better benchmarking. In the rest of this article, we discuss the risk areas with a New Benchmark Need greater than 50, breaking down each risk into the specific dangers posed by AI in the area, the existing benchmarks intended to catch these threats, and the benchmarks we propose to better assess and stop AI systems posing a risk in this area.
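The scoring can be reproduced in a few lines. This is a minimal sketch, assuming that "how far current methods are from perfect" means 10 minus the coverage score (an interpretation that matches the figures in the table). The function names are ours and purely illustrative.

```python
# Rows copied from the table above: (risk, probability, severity, coverage on a 0-10 scale).
ROWS = [
    ("Autonomous weapons", 0.80, 20, 3),
    ("Misinformation", 0.85, 20, 8),
    ("Power concentration", 0.65, 20, 0),
    ("Unemployment", 0.50, 20, 2),
    ("Deterioration of epistemology", 0.60, 30, 8),
    ("Vulnerable world", 0.25, 90, 8),
    ("S-Risks", 0.05, 200, 6),
    ("Successor species", 0.50, 30, 7),
    ("Loss of control", 0.60, 50, 7),
    ("Weakening democracy", 0.50, 20, 4),
    ("Mute news", 0.75, 30, 0),
]


def expected_severity(probability, severity):
    """Probability of the median case times its severity."""
    return probability * severity


def new_benchmark_need(probability, severity, coverage):
    """Expected severity times the gap between current coverage and a perfect 10.

    ASSUMPTION: the gap is computed as (10 - coverage).
    """
    return expected_severity(probability, severity) * (10 - coverage)


def risks_above(rows, threshold=50):
    """Risks whose need exceeds the threshold, highest need first."""
    scored = [(name, round(new_benchmark_need(p, s, c), 1)) for name, p, s, c in rows]
    selected = [item for item in scored if item[1] > threshold]
    return sorted(selected, key=lambda item: item[1], reverse=True)


for name, need in risks_above(ROWS):
    print(f"{name}: {need}")
```

With the threshold of 50, the six risks that remain are the ones discussed in the sections that follow.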

Misuse Risks

Misuse risks stem from the ability of malevolent actors to use artificially intelligent systems for unethical purposes, such as waging warfare in violation of the Geneva Conventions or amassing political power.

Autonomous Weapons

E[Severity] = 16, New Benchmark Need = 112

Current benchmarks that could be used to gauge an AI's efficacy at directing an autonomous weapon system, such as the FTR benchmark or Anthropic’s Sabotage Report, remain limited. The FTR benchmark measures the capability of embodied models to navigate uneven terrains, an obvious prerequisite for autonomous terrestrial fighting robots. The Sabotage Report, on the other hand, measures a model’s ability to achieve nefarious goals even under human oversight. However, no benchmark currently measures a model’s capability to operate multiple agents in modern warfare simulations and achieve nefarious goals while escaping notice from a human/AI monitoring system, as described in the Sabotage Report.

That is why, to assess the risk presented by AI-controlled weapons, we propose combining the warfare-like environment of the FTR benchmark and the oversight aspect of the Sabotage Report into a single benchmark. This benchmark would measure whether a model trained to achieve military objectives in a simulated warfare environment would be able to secretly attain its goals through nefarious means under various levels of oversight from a human or machine monitoring system.

Systemic Risks

Systemic risks are those hazards of AI development that would arise from the interaction of this revolutionary technology with existing social systems such as capitalism and democracy. Systemic risks are particularly hard to benchmark against since they come about from interactions with the chaotic real world.

Power Concentration

E[Severity] = 13, New Benchmark Need = 130

The risk of power concentration is defined as a lack of diversity among the most powerful AI developers. Some fear that with the aid of future AGI systems, the owners of these systems will become more powerful than national governments. This would enable them to impose their will on the human population almost unchecked.

In order to measure power concentration, we propose a diversity index that would keep track of the number of different companies developing the k best-performing models, as measured by a predefined set of widely used benchmarks such as Chatbot Arena or MMLU. The evolution of this index over time could help us get a sense of the current trend of power concentration in the AI landscape.
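One simple reading of this index is a count of distinct developers among the top k models on a leaderboard. The sketch below assumes a single aggregate score per model. The leaderboard entries, lab names, and scores are made up for illustration and do not describe any real ranking.

```python
def diversity_index(leaderboard, k):
    """Count distinct developers among the k best-scoring models.

    leaderboard: iterable of (model_name, developer, score); higher score is better.
    """
    top_k = sorted(leaderboard, key=lambda entry: entry[2], reverse=True)[:k]
    return len({developer for _, developer, _ in top_k})


# Hypothetical leaderboard, invented purely for this example.
example = [
    ("model-a", "Lab 1", 91.0),
    ("model-b", "Lab 1", 90.5),
    ("model-c", "Lab 2", 89.0),
    ("model-d", "Lab 3", 85.0),
    ("model-e", "Lab 2", 80.0),
]

print(diversity_index(example, k=3))  # 2: the top three models come from two labs
print(diversity_index(example, k=5))  # 3
```

Recomputing the index on successive leaderboard snapshots would give the trend over time. A falling value at fixed k would indicate increasing concentration.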

Unemployment

E[Severity] = 10, New Benchmark Need = 80

Another systemic risk of AGI is mass unemployment caused by the economic advantage of employing cheap AIs over costly humans. Several benchmarks, such as SWE-bench and The AI Scientist, attempt to evaluate the ability of models to perform human jobs, in this case those of software engineers and scientists. However, these tasks only test two occupations (biased towards those held by AI researchers) and do not accurately represent an AI model’s fitness for replacing the majority of society's various occupations.

We therefore suggest a new and more comprehensive occupational fitness benchmark which would take tasks from a wider variety of occupations, including physical tasks in the real world requiring simulated environments or embodied systems.

Alignment of AGI

Alignment risks are those dangers posed by an AGI directed to achieve certain goals but misaligned with human morals, so that the means they use to attain objectives are slightly or disastrously out of line with what humans would deem acceptable.

Loss of Control

E[Severity] = 30, New Benchmark Need = 90

Loss of control of AI systems refers to situations where artificial intelligence systems operate beyond the intended parameters set by their human creators, leading to unpredictable or unintended outcomes. It would represent a point of no return with potentially unbounded harm to humanity as a whole. As such, loss of control is one of the most serious risks arising from the improvement of AI.

To our knowledge, one of the only papers dealing with loss of control is Anthropic’s “Sabotage Report” paper that, as mentioned before, takes on the task of measuring the capability of language models to circumvent human supervision and monitoring systems. The paper is already extensive in its evaluation of several possible loss of control scenarios. However, due to the severity of the risk posed by a loss of control as well as the numerous ways loss of control could come to pass, and because designing robust evaluations for this risk is particularly tricky, we wish to highlight the urgent need for more benchmarks and evaluation suites that would complement Anthropic’s work on the subject and together provide better coverage for this most critical risk.

Recommendation AI

The risks posed by AI-driven recommendations are perhaps the first to be felt by society, as all who use modern social networks, like TikTok, Instagram, and X, know the addictive power of user-tailored content recommendation. Besides the very real harm currently being done by these algorithms, there is a fear that ever-improving AI will become even better at targeting micro-niches of the internet with misinformation and propaganda.

Weakening Democracy

E[Severity] = 10, New Benchmark Need = 60

Artificial intelligence has the potential to weaken democracy in many ways. For the sake of discussion, we will focus here on the problem of language-model-powered bots on social media trying to influence voters' political opinions. Very few attempts have been made at measuring AI’s impact on public discourse, both through AI-driven recommendation algorithms and through news generation bots enabled by the advance of language models.

The “persuasiveness of language models” report published by Anthropic represents a first attempt at measuring this phenomenon. Although they found language models to be quite apt at persuading humans, we believe these results underestimate the actual capacity of current models to influence political beliefs. Indeed, Anthropic's evaluation remains limited to single-turn exchanges and avoids all political issues. We believe it also doesn’t exhibit the model’s capacities to their fullest extent.

In order to obtain a more realistic upper bound on an LLM's persuasion capability, we propose extending Anthropic's methodology to:

Multi-turn exchanges. Iterative interaction between user and bot will more closely mimic typical argumentative scenarios.
Encouraging the model to use false information in its argumentation. This will further exhibit the model's abilities while the conversations produced will be more representative of online discourse, which is often not grounded in truth.
Political and ethical issues. Measuring persuasion on these issues is highly relevant to the evaluation of AI’s potential impact on public discourse and, therefore, to the well-being of our democracy.

Mute News

E[Severity] = 22.5, New Benchmark Need = 225

Mute news refers to the well-observed problem of social media users being fed a stream of (often fake) news that exclusively reinforces their existing world view. Although the concept of online echo chambers, a grave threat to our democratic systems, is already well known, few attempts have been made to systematically measure it.

We propose the creation of an automated benchmark using the “LLM as a judge” methodology to assess the tendency of various social media platforms to systematically promote the content of a certain political side to users based on their posts and past interactions with the platform. In this scenario, the role of the language model would be to classify the political leaning of all the posts given to it by the recommendation algorithm of that platform.
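As a rough illustration of how such a benchmark could be scored, the sketch below takes the posts recommended to one user, asks a judge to label each one, and reports the share of partisan posts that fall on the user's own side. The labels "side_a" and "side_b", the function names, and the keyword-based `toy_judge` are all hypothetical. In a real benchmark the judge would be a call to a language model, and the feed would come from the platform's recommendation algorithm.

```python
from collections import Counter


def classify_feed(posts, judge):
    """Count how many recommended posts the judge assigns to each label."""
    return Counter(judge(post) for post in posts)


def same_side_share(posts, user_side, judge):
    """Among posts judged partisan, the fraction on the user's own side.

    A value near 1.0 would suggest a one-sided feed, a value near 0.5 a balanced one.
    """
    counts = classify_feed(posts, judge)
    partisan = counts["side_a"] + counts["side_b"]
    if partisan == 0:
        return 0.0  # no partisan posts in the feed, so nothing to measure
    return counts[user_side] / partisan


def toy_judge(post):
    """Offline stand-in for the LLM judge: a keyword rule, NOT a real classifier."""
    text = post.lower()
    if "side a" in text:
        return "side_a"
    if "side b" in text:
        return "side_b"
    return "neutral"


# Made-up feed for a user whose past activity leans towards side A.
feed = [
    "Why side A is right about the budget",
    "Side A rally draws crowds",
    "Side A candidate leads poll",
    "Side B responds to critics",
    "Ten photos of cats",
]

print(same_side_share(feed, "side_a", toy_judge))  # 0.75
```

Averaging this share over many simulated users on each side would give one platform-level number that can be compared across platforms.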

Conclusion

As we have seen, the potential risks of AI are manifold and varied, while existing safety benchmarks are quite limited in their scope and assumptions. Perhaps the biggest caveat to our evaluation, which we of course cannot rule out for an AGI, is the ability of an AI to realize it is being tested and act differently when conscious of this, thus assuring us of its safety while secretly harboring maleficent abilities and desires. Given our current understanding of AI interpretability, it remains impossible for us to reliably probe the inner thoughts of an AI system.

Another important factor to consider is that benchmarks are useful insofar as they are being used. That is why legislators should consider enforcing a certain level of safety benchmarking on model manufacturers to limit the possibility of unforeseen capabilities in AI models released to the public. Benchmarks are only useful if we can make AI leaders like OpenAI, Meta, Google, and others use them.

Nathan Helm-Burger

Seems to me that there's a major category of risk you are missing. Acceleration of capabilities. If rapid RSI becomes possible, and the result is that very quickly all AI capabilities rapidly increase (and costs to produce powerful AI decrease), then every other risk skyrockets along with the new capabilities and accessibility.

I discuss some related ideas in my recent post here: https://www.lesswrong.com/posts/xoMqPzBZ9juEjKGHL/proactive-if-then-safety-cases

Loïc Cabannes

Fair point! I think we can probably account for RSI by increasing the probability of each risk based on what probability we give to RSI itself. This certainly deserves to be taken into account explicitly in our methodology.

Great article btw!