They have? How so?
Silently sponsoring FrontierMath and receiving access to the question sets; and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set of some sort. Also, whatever happened with their irreproducible ARC-AGI results, and their later explicit confirmation that the model Arc Prize got access to in December was different from the released versions, with different training and a special compute tier, despite OpenAI employees claiming that the version of o3 used in the evaluations ...
Ugh. Just when I felt I could relax a bit after seeing Grok 4's lackluster performance.
Still, this seems quite suspicious to me. Pretty much everybody is looking into test-time compute and RLVR right now. How come (seemingly) nobody else found this "new general-purpose method" before OpenAI did? There is clearly a huge incentive to cheat here, and OpenAI has been shown to be not particularly trustworthy when it comes to test and benchmark results.
Edit: Oh, this is also interesting: https://leanprover.zulipchat.com/#narrow/channel/219941-Machine-L...
Are you sure? I'm pretty sure that was cited as *one* of the possible reasons, but not confirmed anywhere. I don't know whether some minor scaffolding differences could have that large an effect on the results (-15%?) in a math benchmark, but if they did, that should have been accounted for in the first place. I don't think the other models were tested with scaffolds specifically engineered to get them higher scores.