The Missing Error Bars in AI Research That Nobody Talks About.
Andrey Seryakov, ex-CERN particle physicist, independent AI behaviour researcher, a.u.seryakov@gmail.com

This article is about systematic uncertainties and biases, which are so often missing from our research papers.

TL;DR

* Systematic uncertainty is about hidden biases in experimental design.
* We have to spend time finding and evaluating them.
* There is no universal solution, but rigorous testing of assumptions is crucial.

This short blog post should be read as an extension of the paper "Adding Error Bars to Evals" by Evan Miller (https://arxiv.org/pdf/2411.00640), which covers in depth the problem of, and the need for, statistical uncertainty in AI evaluations on a rigorous mathematical foundation. However, while Miller addresses how to measure statistical "error bars", I focus on the orthogonal problem: biases that shift your entire measurement.

In contrast to statistical uncertainties, the evaluation of systematic errors is what we in physics call "an art", because there is no mathematical foundation for it at all. It is always about knowing your experimental design well and the biases it introduces into your final results. Systematics are about biases.

But first I want to talk a bit about the temperature case, as I believe it is very illustrative, and only afterwards move to systematics in general.

Can and should we take temperature into account?

You wouldn't study how fair a coin is by flipping it just once, yet that is exactly what we do with LLMs at t = 0. LLMs are probabilistic by construction. If we want to study their behaviour we have to gather statistics, but which temperature should we use? The widely used t = 0 will give you nothing.

Before actually experimenting with it, I was planning to take two values (for example 0.7 and 1) and plot two separate statistical error bars, as we do in physics with statistical and systematic uncertainties: a ± σ_stat ± σ_syst. We often treat temperature as just a noise level, and I did too, but then I found that when I change it, not just the variance of b
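To make the two-temperature idea concrete, here is a minimal sketch of gathering statistics at t = 0.7 and t = 1.0 and reporting a result in the a ± σ_stat ± σ_syst style. Everything in it is an assumption for illustration: `run_eval` is a hypothetical placeholder for one model call plus scoring (replace it with a real API call), the binomial standard error is just one simple choice of statistical error bar (Miller's paper treats this much more carefully), and taking half the spread between the two temperatures as a "systematic" term only illustrates the reporting format, not a recommended procedure.

```python
import math
import random
from statistics import mean

def run_eval(prompt: str, temperature: float) -> float:
    """Hypothetical stand-in for one model call plus scoring.

    Replace with a real API call and grader. Here we simulate a
    pass/fail outcome whose rate drifts slightly with temperature,
    just so the script runs end to end.
    """
    p_correct = 0.80 - 0.05 * temperature  # purely illustrative assumption
    return 1.0 if random.random() < p_correct else 0.0

def accuracy_with_stat_error(prompt: str, temperature: float, n: int = 200):
    """Mean accuracy over n samples and its binomial standard error."""
    scores = [run_eval(prompt, temperature) for _ in range(n)]
    acc = mean(scores)
    stat_err = math.sqrt(acc * (1.0 - acc) / n)
    return acc, stat_err

if __name__ == "__main__":
    prompt = "..."  # your eval item(s)
    results = {t: accuracy_with_stat_error(prompt, t) for t in (0.7, 1.0)}
    for t, (acc, err) in results.items():
        print(f"t = {t}: {acc:.3f} +/- {err:.3f} (stat)")

    # Crude systematic term: half the spread between the two temperatures.
    accs = [acc for acc, _ in results.values()]
    sys_err = abs(accs[0] - accs[1]) / 2.0
    print(f"temperature systematic ~ +/- {sys_err:.3f}")
```

The point of the sketch is only the shape of the report: one central value, a statistical error bar from repeated sampling, and a separate term that tracks how much the answer moves when an experimental-design choice (here, temperature) is varied.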