People aren't properly calibrated on FrontierMath
As most of you know, OpenAI recently showcased o3's SOTA results on various benchmarks. In my opinion, FrontierMath was the hardest of the bunch, which was reflected in model performance: the SOTA was 2% before Friday. It also seems to be the benchmark with the least visibility....
Dec 23, 2024