PSA: Consider alternatives to AUROC when reporting classifier metrics for alignment
TL;DR

If you're presenting a classifier that detects misalignment and providing metrics for it, please:

1. report the TPR at FPR = 0.001, 0.01, and 0.05
2. plot the ROC curve on a log-log scale

See https://arxiv.org/abs/2112.03570 for more context on why you might want to do this.

ML Background

(If all the terms in the TL;DR made sense to you, you probably want to skip this section.)

Classification is the ML/stats problem of determining which category a thing belongs in. Binary classification is a common special case for problems with a yes/no answer; for example, determining whether an LLM is being deceptive. A (binary) classifier is a model or algorithm that tries to answer such a problem. Usually, classifiers don't directly output "yes" or "no"; instead they output a score, which can be loosely interpreted as inside-view confidence. When deployed, these scores are translated into "yes" and "no" (and, less commonly, "I don't know") by comparing them against thresholds.

One natural, but limited, way to quantify how good a classifier is is its accuracy; that is, how often it produces the right answer. One shortcoming is that accuracy depends on the thresholds that were set: a classifier may be "better" but have badly chosen thresholds (e.g. set using data that doesn't represent the real world, which in turn makes it over- or under-confident), so it fares poorly on this metric even though it would be straightforward to improve.

A way of comparing classifiers that bypasses threshold-setting is the ROC curve, which considers every threshold and estimates the true-positive rate (TPR) and false-positive rate (FPR) at that threshold. The ROC curve is commonly plotted as a visual summary of a classifier, and can itself be summarized numerically by integrating it to get the Area Under the ROC curve (AUROC). The ROC curve and AUROC have shortcomings of their own, which are known in the ML community. Sometimes, people propose alternatives like precision-recall curves.
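To make the TL;DR concrete, here is a minimal sketch of the recommended reporting using scikit-learn and matplotlib. The labels and scores below are synthetic stand-ins (not from the post or the linked paper) for a real misalignment classifier's outputs; everything else is standard library usage.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in data: 1 = misaligned, 0 = benign.
labels = np.concatenate([np.ones(1_000), np.zeros(100_000)])
scores = np.concatenate([
    rng.normal(2.0, 1.0, 1_000),     # scores on misaligned examples
    rng.normal(0.0, 1.0, 100_000),   # scores on benign examples
])

# Empirical ROC curve: one (FPR, TPR) point per threshold.
fpr, tpr, _ = roc_curve(labels, scores)

# 1. Report TPR at fixed low FPRs by interpolating along the ROC curve.
for target_fpr in (0.001, 0.01, 0.05):
    print(f"TPR at FPR={target_fpr}: {np.interp(target_fpr, fpr, tpr):.3f}")

# AUROC, for comparison with the fixed-FPR numbers.
print(f"AUROC: {roc_auc_score(labels, scores):.3f}")

# 2. Plot the ROC curve on log-log axes, which spreads out the low-FPR
#    region that a linear-scale plot compresses into the left edge.
plt.plot(fpr, tpr)
plt.xscale("log")
plt.yscale("log")
plt.xlabel("False-positive rate")
plt.ylabel("True-positive rate")
plt.savefig("roc_loglog.png")
```

The point of the log-log plot is that the operating points you actually care about for rare, high-stakes positives sit at very small FPRs, which are nearly invisible on a linear-scale ROC plot.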