"If you cannot measure it, you cannot improve it." - Lord Kelvin
The science of AI safety evaluations is still nascent, but it is making progress! We know much more today than we did two years ago.
We tried to make this knowledge accessible by writing a literature review that systematizes the field.
We wanted to go beyond a central source that merely summarizes collected work, so we put a lot of effort into distillation and into disentangling concepts that are often presented in a muddled way. We created original visualizations, and gathered others from many different sources, to accompany these explanations.
The review provides a taxonomy of AI safety evaluations along three dimensions:
We also discuss limitations that safety evaluations face, including "sandbagging" (strategic underperformance on tests), organizational "safetywashing" (misrepresenting capability improvements as safety advancements), and more fundamental challenges such as proving the absence of a capability rather than its presence.
The text is available in many different places:
This paper is intended as part of a larger body of work called the AI Safety Atlas. We think of it as chapter 5 in a comprehensive collection of literature reviews that collectively form a textbook on AI safety.