First take in my LessWrong shortform. I think I will mostly be sharing takes about AI that are too long for Twitter.
I think Google DeepMind getting gold on the IMO 2025 is not surprising and shouldn't be much of an update, because the problems were unusually easy this year and it is plausible that last year's AlphaProof + AlphaGeometry system would have gotten gold this year. On the other hand, I am pretty surprised by regular reasoning LLMs getting gold in the way OpenAI described that they did, and I am pretty puzzled as to how they set this up. It is somewhat plausible that the model is only marginally better than o3 but scales better with test-time compute and was run at very large scale: Gemini 2.5 Pro got 31% from one try, and this OpenAI experimental model may have tried hundreds of times to get 81%. Plausible! But probably the model is simply quite a bit better than current models.
(the initial version of this comment had a math error, fixed now)
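To make the best-of-n intuition concrete: if each try solves a given problem independently with probability p, then n tries solve it with probability 1 − (1 − p)^n. Here is a minimal sketch; the per-problem probabilities are made-up numbers chosen only so that one try scores ~31% and hundreds of tries score ~81%, and it ignores partial credit and assumes you can reliably pick a correct solution out of the pile.

```python
def best_of_n(p: float, n: int) -> float:
    """Probability that at least one of n independent tries solves a
    problem that a single try solves with probability p."""
    return 1 - (1 - p) ** n

# Hypothetical per-problem solve probabilities for a 6-problem exam,
# chosen only to illustrate the shape: easy problems near-certain,
# the hardest nearly (but not quite) out of reach.
per_problem = [0.8, 0.6, 0.3, 0.1, 0.02, 0.001]

for n in (1, 10, 100):
    expected_solved = sum(best_of_n(p, n) for p in per_problem)
    print(f"n = {n:>3}: expected fraction solved ≈ {expected_solved / 6:.0%}")
# n =   1: ≈ 30%
# n =  10: ≈ 64%
# n = 100: ≈ 83%
```

Under this toy picture, brute sampling only gets you from ~31% to ~81% if the hard problems are merely improbable, not impossible, for the base model, and if there is a good verifier or selection step.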
On the other hand, I am pretty surprised by regular reasoning LLMs getting gold in the way OpenAI described that they did.
Yes, seems slightly surprising. Let's see whether METR's doubling-time extrapolation predicted this.
o3's 80%-success time horizon was 20 minutes. o3 came out ~3 months ago; add 6 months for the lab-to-public delay and we have 9 months. Assuming a ~3-month doubling time, that's ~3 doublings: fast, but plausible in the current RLVR scaling paradigm, with a buff for being mathematics-specific (more verifiable) rather than ML-specific. 3 doublings of 20 minutes gets us to 160 minutes. IMO participants get an average of 90 minutes per problem (although more relevant is the time taken on the harder problems, say 120 minutes).
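A minimal sketch of that arithmetic; the ~3-month doubling time is my reading of the assumption implicit in "9 months ≈ 3 doublings" (METR's headline historical estimate was closer to 7 months, so this is the buffed, RLVR-era, math-specific rate):

```python
base_horizon_min = 20      # o3's 80%-success time horizon, in minutes
months_elapsed = 3 + 6     # ~3 months since o3's release + 6-month lab-to-public delay
doubling_time_months = 3   # assumed: buffed RLVR-era, math-specific rate

horizon = base_horizon_min * 2 ** (months_elapsed / doubling_time_months)
print(f"extrapolated 80% time horizon: {horizon:.0f} minutes")  # 160 minutes
```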
OK, on reflection this wasn't much of a surprise at all. Interesting!
How does the math work out if you consider that o3 was created around 8-9 months before it was released, and this model was finished in the last month or two? That would be nearly a year's difference, and should be modelled as such in Bayesian's adjusted METR doubling-time extrapolation.
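Re-running the sketch above under that assumption (the ~12-month figure is this comment's "nearly a year", not an established number):

```python
base_horizon_min = 20        # o3's 80%-success time horizon, in minutes
months_between_models = 12   # ~"nearly a year" between when o3 and the new model were finished
doubling_time_months = 3     # same assumed rate as above

horizon = base_horizon_min * 2 ** (months_between_models / doubling_time_months)
print(f"extrapolated 80% time horizon: {horizon:.0f} minutes")  # 320 minutes
```

Four doublings instead of three lands at 320 minutes, which clears the ~120-minute hard-problem budget even more comfortably, so if anything the parent comment's conclusion arguably gets stronger.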