TheBayesian's Shortform

by TheBayesian
20th Jul 2025

This is a special post for quick takes by TheBayesian. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
TheBayesian, 2mo, 3 karma

First take in my LessWrong shortform. I think I will mostly be sharing takes about AI that are too long for Twitter.

 

I think Google DeepMind getting gold on the IMO 2025 is not surprising and shouldn't be much of an update, because the problems were unusually easy this year and it is plausible that last year's AlphaProof + AlphaGeometry system would have gotten gold this year. On the other hand, I am pretty surprised by regular reasoning LLMs getting gold in the way OpenAI described. I am pretty puzzled as to how they are setting this up. It is somewhat plausible that the model is only marginally better than o3 but scales better with test-time compute and was run with a great deal of it, since Gemini 2.5 Pro got 31% from 1 try and this OpenAI experimental model may have tried hundreds of times to get 81%. Plausible! But it is probably quite a bit better than current models.

james oofou, 2mo (edited), 1 karma

(the initial version of this comment had a math error, fixed now)

"On the other hand, I am pretty surprised by regular reasoning LLMs getting gold in the way OpenAI described."

Yes, seems slightly surprising. Let's see whether the METR doubling-time extrapolation predicted this.

o3's 80% success time-horizon was 20 minutes. o3 came out ~3 months ago. Add 6 months for the lab-to-public delay: 9 months. This is ~3 doublings in the current RLVR scaling paradigm, with a buff for the task being mathematics-specific (more verifiable) rather than ML. 3 doublings of 20 minutes gets us to 160 minutes. IMO participants get an average of 90 minutes per problem (although more relevant is the time taken on the more difficult questions, say 120 minutes).
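The arithmetic above can be written out as a rough sketch. The function name is made up for illustration, and the 3-month doubling time is an assumption inferred from "9 months is ~3 doublings" rather than a figure taken from METR directly.

```python
def extrapolated_horizon(base_minutes, months_elapsed, doubling_months):
    """Time horizon after exponential growth with a fixed doubling time."""
    doublings = months_elapsed / doubling_months
    return base_minutes * 2 ** doublings


# Figures from the comment: 20-minute 80% horizon for o3,
# 3 months since release plus an assumed 6-month lab-to-public delay.
# ASSUMPTION: 3 months per doubling (implied by "9 months is ~3 doublings").
horizon = extrapolated_horizon(20, 3 + 6, 3)

print(horizon)         # 160.0 minutes
print(horizon >= 120)  # True: exceeds the ~120 minutes guessed for hard problems
```

Under these assumptions the extrapolated horizon clears both the 90-minute average and the 120-minute guess for the harder problems.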

OK, on reflection this wasn't much of a surprise at all. Interesting!

TheBayesian, 2mo, 2 karma

How does the math work out if you consider that o3 was created around 8-9 months before it was released, and this model was finished in the last month or two? That would be nearly a year's difference, and should be modelled as such in bayesian's adjusted METR doubling-time extrapolation.
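One way to explore this question is to vary the gap between the two models and see how the extrapolated horizon moves. This is only a sketch: the helper name is hypothetical, the 20-minute base comes from the earlier comment, and the 3-month doubling time is an assumption carried over from that comment's "~3 doublings in 9 months".

```python
def horizon_minutes(months_gap, base_minutes=20.0, doubling_months=3.0):
    """Extrapolated horizon for a given gap between model completion dates.

    ASSUMPTION: a fixed 3-month doubling time and a 20-minute starting horizon.
    """
    return base_minutes * 2 ** (months_gap / doubling_months)


# Gaps from 9 months (the earlier estimate) up to 12 months ("nearly a year").
for gap in (9, 10, 11, 12):
    print(f"{gap} months -> {horizon_minutes(gap):.0f} minutes")
```

With a full 12-month gap the same assumptions give 320 minutes, so a longer gap only pushes the extrapolated horizon further past the per-problem time.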

james oofou, 2mo, 1 karma

Assuming the o3 date is accurate, I think IMO Gold shouldn't have been surprising.
