First take in my LessWrong shortform. I think I will mostly be sharing takes about AI that are too long for Twitter.
I think Google DeepMind getting gold on the IMO 2025 is not surprising and shouldn't be much of an update, because the problems were unusually easy this year and it is plausible that last year's AlphaProof + AlphaGeometry system would have gotten gold this year. On the other hand, I am pretty surprised by regular reasoning LLMs getting gold in the way OpenAI described that they did, and I am pretty puzzled as to how they set this up. It is somewhat plausible that the model is only marginally better than o3 but scales better with test-time compute and was run at very large scale: Gemini 2.5 Pro got 31% from one try, and this OpenAI experimental model may have tried hundreds of times to get 81%. Plausible! But probably the model is simply quite a bit better than current models.
(the initial version of this comment had a math error, fixed now)
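To make the best-of-n intuition concrete: if each try solves a given problem independently with probability p, then n tries solve it with probability 1 − (1 − p)^n. Here is a minimal sketch; the per-problem probabilities are made-up numbers chosen only so that one try scores ~31% and hundreds of tries score ~81%, and it ignores partial credit and assumes you can reliably pick a correct solution out of the pile.

```python
def best_of_n(p: float, n: int) -> float:
    """Probability that at least one of n independent tries solves a
    problem that a single try solves with probability p."""
    return 1 - (1 - p) ** n

# Hypothetical per-problem solve probabilities for a 6-problem exam,
# chosen only to illustrate the shape: easy problems near-certain,
# the hardest nearly (but not quite) out of reach.
per_problem = [0.8, 0.6, 0.3, 0.1, 0.02, 0.001]

for n in (1, 10, 100):
    expected_solved = sum(best_of_n(p, n) for p in per_problem)
    print(f"n = {n:>3}: expected fraction solved ≈ {expected_solved / 6:.0%}")
# n =   1: ≈ 30%
# n =  10: ≈ 64%
# n = 100: ≈ 83%
```

Under this toy picture, brute sampling only gets you from ~31% to ~81% if the hard problems are merely improbable, not impossible, for the base model, and if there is a good verifier or selection step.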
On the other hand, I am pretty surprised by regular reasoning LLMs getting gold in the way OpenAI described that they did.
Yes, seems slightly surprising. Let's see whether METR's doubling-time extrapolation predicted this.
o3's 80%-success time horizon was 20 minutes. o3 came out ~3 months ago; add 6 months for the lab-to-public delay and we have 9 months. Assuming a ~3-month doubling time, that's ~3 doublings: fast, but plausible in the current RLVR scaling paradigm, with a buff for being mathematics-specific (more verifiable) rather than ML-specific. 3 doublings of 20 minutes gets us to 160 minutes. IMO participants get an average of 90 minutes per problem (although more relevant is the time taken on the harder problems, say 120 minutes).
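A minimal sketch of that arithmetic; the ~3-month doubling time is my reading of the assumption implicit in "9 months ≈ 3 doublings" (METR's headline historical estimate was closer to 7 months, so this is the buffed, RLVR-era, math-specific rate):

```python
base_horizon_min = 20      # o3's 80%-success time horizon, in minutes
months_elapsed = 3 + 6     # ~3 months since o3's release + 6-month lab-to-public delay
doubling_time_months = 3   # assumed: buffed RLVR-era, math-specific rate

horizon = base_horizon_min * 2 ** (months_elapsed / doubling_time_months)
print(f"extrapolated 80% time horizon: {horizon:.0f} minutes")  # 160 minutes
```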
OK, on reflection this wasn't much of a surprise at all. Interesting!
How does the math work out if you consider that o3 was created around 8-9 months before it was released, and this model was finished in the last month or two? That would be nearly a year's difference, and should be modelled as such in Bayesian's adjusted METR doubling-time extrapolation.
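Re-running the sketch above under that assumption (the ~12-month figure is this comment's "nearly a year", not an established number):

```python
base_horizon_min = 20        # o3's 80%-success time horizon, in minutes
months_between_models = 12   # ~"nearly a year" between when o3 and the new model were finished
doubling_time_months = 3     # same assumed rate as above

horizon = base_horizon_min * 2 ** (months_between_models / doubling_time_months)
print(f"extrapolated 80% time horizon: {horizon:.0f} minutes")  # 320 minutes
```

Four doublings instead of three lands at 320 minutes, which clears the ~120-minute hard-problem budget even more comfortably, so if anything the parent comment's conclusion arguably gets stronger.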