nmca
Comments

Recent AI model progress feels mostly like bullshit
nmca 3mo 31

Is there an o3 update yet?

Recent AI model progress feels mostly like bullshit
nmca 3mo 122

(Disclaimer: I work on evaluation at OpenAI, run the o3 evaluations, etc.)


I think you are saying “bullshit” when you mean “narrow”. The evidence for large capability improvements in math and tightly scoped coding since 4o is overwhelming: see e.g. AIME 2025, Gemini on the USAMO, or just copy-paste a recent Codeforces problem into a current model.

The public evidence for improvement on broad/fuzzy tasks is weaker, though o1's MMLU gains and various vibes evals (Tao) do show it.


How much these large, narrow improvements generalize is a very important question. I try to approach it humbly.


Hopefully new models improve on your benchmark. Do share if so!
