They have? How so?
Silently sponsoring FrontierMath and receiving access to the question sets; and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort. Also, whatever happened with their irreproducible ARC-AGI results, and their later explicit confirmation that the model Arc Prize got access to in December was different from the released versions, with different training and a special compute tier, despite OpenAI employees claiming that the version of o3 used in the evaluations ...
Ugh. Just when I felt I could relax a bit after seeing Grok 4's lackluster performance.
Still, this seems quite suspicious to me. Pretty much everybody is looking into test-time compute and RLVR right now. How come (seemingly) nobody else found this "new general-purpose method" before OpenAI did? There is clearly a huge incentive to cheat here, and OpenAI has been shown to be not particularly trustworthy when it comes to test and benchmark results.
Edit: Oh, this is also interesting: https://leanprover.zulipchat.com/#narrow/channel/219941-Machine-L...
Are you sure? I'm pretty sure that was cited as *one* of the possible reasons, but not confirmed anywhere. I don't know whether some minor scaffolding differences could have that large an effect on the results (-15%?) in a math benchmark, but if they did, that should have been accounted for in the first place. I don't think the other models were tested with scaffolds specifically engineered to get them higher scores.