The Bitter Lesson for AI Safety Research
Read the associated paper "Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?": https://arxiv.org/abs/2407.21792

Focus on safety problems that aren’t solved with scale.

Benchmarks are crucial in ML to operationalize the properties we want models to have (knowledge, reasoning, ethics, calibration, truthfulness, etc.). They act as a criterion to judge the quality of models and drive implicit competition between researchers. “For better or worse, benchmarks shape a field.”

We performed the largest empirical meta-analysis to date of AI safety benchmarks on dozens of open language models. Around half of the benchmarks we examined had high correlation with upstream general capabilities.

Some safety properties improve with scale, while others do not. For the models we tested, benchmarks on human preference alignment, scalable oversight (e.g., QuALITY), truthfulness (TruthfulQA MC1 and TruthfulQA Gen), and static adversarial robustness were highly correlated with upstream general capabilities. Bias, dynamic adversarial robustness, and calibration when not measured with Brier scores had relatively low correlations. Sycophancy and weaponization restriction (WMDP) had significant negative correlations with general capabilities.

Often, intuitive arguments from alignment theory are used to guide and prioritize deep learning research. We find these arguments to be poorly predictive of these correlations and ultimately counterproductive. In fact, in areas like adversarial robustness, some benchmarks basically measured upstream capabilities while others did not. We argue instead that empirical measurement is necessary to determine which safety properties will be naturally achieved by more capable systems, and which safety problems will remain persistent.[1]

Abstract arguments from genuinely smart people may be highly “thoughtful,” but these arguments generally do not track deep learning phenomena, as deep learning is too often counterintu
I agree the term is in common use, but there is value in proposing a detailed operationalization of a concept that otherwise has a fuzzy referent. This is one way to ground timelines debates and make forecasts cross-comparable, as we discuss in the piece.