Nobody is Doing AI Benchmarking Right
By Chapin Lenthall-Cleary and Cole Gaboriault

As LLMs and other forms of AI have become more capable, interest has steadily grown in determining how “smart” they really are. Discussion tends to circle, often obliquely, around the following cluster of questions: Are the models as smart as people? Which people? How smart are those people anyway? What do we even mean by “smart”?

These questions suggest a straightforward approach. Obviously, the quality of being smart, or “intelligence,” can be possessed in different amounts by different people and different models. We want to determine the intelligences of models and people and compare them to each other; that is, we want a reliable test of intelligence that we can administer to both models and people – an intelligence benchmark.

Even without settling on a definition of intelligence, it’s clear that the best strategy for designing such a benchmark is to start by developing it on people, because we have a more robust intuition for people’s intelligence and more existing research to build upon. Any test that accurately, directly, completely, and exclusively measures intelligence in people will generalize immediately to models (assuming it can be administered through an appropriate modality, such as text); if a model and a person that we intuitively believe have the same intelligence receive different scores (or ones we believe differ in intelligence receive the same score), then by definition either our intuition is wrong, or the test is not actually measuring intelligence accurately, directly, completely, or exclusively – though it may have appeared to be – and needs to be improved.

The first major lesson to take from this is that extensive data on different people’s performance on an intelligence benchmark is foundational to its usefulness; such data is our main tool to ensure that it actually measures intelligence, and our only tool to calibrate the interpretation of scores. This is true of nearly all benchmarks: the gold standard would be a full, populati