Race to the Top: Benchmarks for AI Safety

Isabella Duan

This is an executive summary of a post from my personal blog, also cross-posted from the EA Forum. Read the full texts here.

Summary

Benchmarks support the empirical, quantitative evaluation of progress in AI research. Although benchmarks are ubiquitous in most subfields of machine learning, they are still rare in the subfield of AI safety.

I argue that creating benchmarks should be a high priority for AI safety. While this idea is not new, I think it may still be underrated. Among other benefits, benchmarks would make it much easier to:

track the field’s progress and focus resources on the most productive lines of work;
create professional incentives for researchers - especially Chinese researchers - to work on problems that are relevant to AGI safety;
develop auditing regimes and regulations for advanced AI systems.

Unfortunately, we cannot assume that good benchmarks will be developed quickly enough “by default.” I discuss several reasons to expect them to be undersupplied. I also outline actions that different groups can take today to accelerate their development.

For example, AI safety researchers can help by:

directly trying their hand at creating safety-relevant benchmarks;
clarifying certain safety-relevant traits (such as “honesty” and “power-seekingness”) that it could be important to measure in the future;
building up relevant expertise and skills, for instance by working on other benchmarking projects;
drafting “benchmark roadmaps,” which identify categories of benchmarks that could be valuable in the future and outline prerequisites for developing them.

And AI governance professionals can help by:

co-organizing workshops, competitions, and prizes focused on benchmarking;
creating third-party institutional homes for benchmarking work;
clarifying, ahead of time, how auditing and regulatory frameworks can put benchmarks to use;
advising safety researchers on political, institutional, and strategic considerations that matter for benchmark design;
popularizing the narrative of a “race to the top” on AI safety.

Ultimately, we can and should begin to build benchmark-making capability now.

Acknowledgment

I would like to thank Ben Garfinkel and Owen Cotton-Barratt for their mentorship, and Emma Bluemke and many others at the Centre for the Governance of AI for their warmhearted support. All views and errors are my own.

Future research

I am working on a paper on this topic. If you are interested in benchmarks and model evaluation, especially if you are a technical AI safety researcher, I would love to hear from you!

edit: this response no longer represents my views two years later. Even when I wrote it, I was being diplomatic about the fact that I think it's really really hard to make a benchmark that can be trusted to generalize far, and much of the work folks are doing (like mechinterp or agent foundations or fiddling with variants of qaci) relates to trying to make a strongly durable benchmark. Anyway, original response:

Strongly agreed. The question is how to make durable benchmarks for ai safety that are not themselves vulnerable to goodharting. Some prior work on benchmark design (selected from the results for a metaphor.systems query for this comment):

(Relevance ratings are manual labels by me.)

++++++ https://benchmarking.mlsafety.org/index.html - "Up to $500,000 in prizes for ML Safety benchmark ideas."
+++++ https://github.com/HumanCompatibleAI/overcooked_ai - "A benchmark environment for fully cooperative human-AI performance." - eight papers are shown on the github as having used this benchmark
+++++ https://partnershiponai.org/introducing-the-safelife-leaderboard-a-competitive-benchmark-for-safer-ai/
++++ https://bair.berkeley.edu/blog/2021/07/08/basalt/ - "a NeurIPS competition and benchmark called BASALT: a set of Minecraft environments and a human evaluation protocol that we hope will stimulate research and investigation into solving tasks with no pre-specified reward function"
++ https://arxiv.org/abs/1907.01475 "... . This paper investigates safety and generalization from a limited number of training environments in deep reinforcement learning (RL). We find RL algorithms can fail dangerously on unseen test environments even when performing perfectly on training environments. ..."
++ https://arxiv.org/abs/1911.01875 - summary: question your data too. "Metrology for AI: From Benchmarks to Instruments"
+ https://arxiv.org/abs/2008.09510 "... . We apply new theorems extending Conservative Bayesian Inference (CBI), which exploit the rigour of Bayesian methods while reducing the risk of involuntary misuse associated with now-common applications of Bayesian inference; we define additional conditions needed for applying these methods to AVs. Results: Prior knowledge can bring substantial advantages if the AV design allows strong expectations of safety before road testing. We also show how naive attempts at conservative assessment may lead to over-optimism instead; why ..."
+ https://arxiv.org/abs/2007.06898 "Models that surpass human performance on several popular benchmarks display significant degradation in performance on exposure to Out of Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and `hack' datasets, in lieu of learning generalizable features like humans. In order to stop the inflation in model performance -- and thus overestimation in AI systems' capabilities -- we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation."
+ https://openai.com/blog/safety-gym/ - old openai benchmark, I'm not sure how much it's actually been used?

tangential, but interesting:

++ https://openreview.net/forum?id=B1xhQhRcK7 "We show that rare but catastrophic failures may be missed entirely by random testing, which poses issues for safe deployment. Our proposed approach for adversarial testing fixes this"
https://arxiv.org/abs/1907.04446 - "Let's Keep It Safe: Designing User Interfaces that Allow Everyone to Contribute to AI Safety" - "We first present a task design in which workers evaluate the safety of individual state-action pairs, and propose several variants of this task with improved task design and filtering mechanisms. Although this first design is easy to understand, it scales poorly to large state spaces. Therefore, we develop a new user interface that allows workers to write constraint rules without any programming."
https://ai-safety-papers.quantifieduncertainty.org/table - an overview of papers someone thought relevant to ai safety, last updated 2020
https://deepmindsafetyresearch.medium.com/building-safe-artificial-intelligence-52f5f75058f1 - "Building safe artificial intelligence: specification, robustness, and assurance" - intro post
https://www.lesswrong.com/posts/ZHXutm7KpoWEj9G2s/an-unaligned-benchmark - "I think of the possibly-unaligned AIs as a benchmark: it’s what AI alignment researchers need to compete with. The further we fall short of the benchmark, the stronger the competitive pressures will be for everyone to give up on aligned AI and take their chances"

We need a clear definition of bad AI before we can know what is -not- that, I think. These benchmarks seem to itemize AI as if it will have known, concrete components. But I think we need to first compose, in the abstract, a runaway self-sustaining AI, and work backwards to see which pieces are already in place for it.

I haven't kept up with this community for many years, so I have some catching up to do, but I am currently on the hunt for the clearest and most concise places where the various runaway scenarios are laid out. I know there is a wealth of literature, and I have the Bostrom book from years ago as well, but I think simplicity is the key here. In other words, where is the AI red line?

Curious if you ever found what you were looking for.

I didn't. I'm sure words towards articulating this have been spoken many times, but the trick is in what forum / form does it need to exist more specifically in order for it to be comprehensible and lasting. Maybe I'm wrong that it needs to be highly public; as with nukes not many people are actually familiar with what is considered sufficient fissile material - governments (try to) maintain this barrier by themselves. But at this stage as it still seems a fuzzy concept, any input seems valid.

Consider the following combination of properties:

(software - if that's the right word?) capable of self replication / sustainability / improvement
capable of eluding human control
capable of doing harm

In isolation none of these is sufficient, but taken together I think we could all agree we have a problem. So we could begin to categorize and rank various assemblages of AI by these criteria, and not by how "smart" they are.
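A minimal sketch of what that kind of ranking could look like, in Python. Everything here is an assumption for illustration: the 0-to-1 scores, the placeholder system names, and the choice of `min` as the way to combine the three properties (picked only because it makes a system score zero overall if any one property is absent, matching "in isolation none of these is sufficient"). None of the numbers are measurements of real systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical 0..1 scores for the three properties listed above."""
    name: str
    self_sustaining: float   # self-replication / sustainability / improvement
    eludes_control: float    # ability to elude human control
    harmful: float           # ability to do harm

    def combined_concern(self) -> float:
        # Assumption: the overall concern is limited by the weakest property,
        # so a zero on any single property yields zero overall.
        return min(self.self_sustaining, self.eludes_control, self.harmful)


def rank_by_concern(systems):
    """Order systems from most to least concerning, ignoring how 'smart' they are."""
    return sorted(systems, key=lambda s: s.combined_concern(), reverse=True)


# Placeholder entries with made-up scores, purely to show the mechanics.
examples = [
    SystemProfile("system_a", self_sustaining=0.9, eludes_control=0.1, harmful=0.8),
    SystemProfile("system_b", self_sustaining=0.5, eludes_control=0.5, harmful=0.5),
    SystemProfile("system_c", self_sustaining=0.0, eludes_control=0.9, harmful=0.9),
]

ranked = rank_by_concern(examples)
for s in ranked:
    print(f"{s.name}: {s.combined_concern():.1f}")
```

With these made-up scores, a system that is moderate on all three properties ranks above one that is extreme on two but nearly absent on the third. A product or weighted sum would be equally defensible choices; the hard part is agreeing on how to score each property in the first place.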

I know I am super late to the party but this seems like something along the lines of what you’re looking for: https://www.alignmentforum.org/posts/qYzqDtoQaZ3eDDyxa/distinguishing-ai-takeover-scenarios

yea that's cool to see. Very similar attempt at categorization. I feel we get caught up often in the potential / theoretical capabilities of systems. But there are already plenty of systems that fulfill self-replicating, harmful, intelligent behaviors. It's entirely a question of degrees. That's why a visual ranking of all systems' metrics is in order I think.

Defining what comprises a 'system' would be the other big challenge. Is a hostile government a system? That's fairly intelligent and self-replicating, and so on.