They have? How so?
Silently sponsoring FrontierMath and receiving access to the question sets, and, if I remember correctly, o3 and o3-mini performing worse on a later evaluation done on a newer private question set. There's also whatever happened with their irreproducible ARC-AGI results: they later explicitly confirmed that the model Arc Prize got access to in December was different from the released versions, with different training and a special compute tier, despite OpenAI employees claiming that the version of o3 used in the evaluations was fully general and not tailored towards specific tasks.
someone has to be the first
Sure, but I'm just quite skeptical that it's specifically the lab known for endless hype that gets there first. Besides, far fewer people were looking into RLVR at the time o1-preview was released, so the situations aren't exactly comparable.
Ugh. Just when I felt I could relax a bit after seeing Grok 4's lackluster performance.
Still, this seems quite suspicious to me. Pretty much everybody is looking into test-time compute and RLVR right now. How come (seemingly) nobody else has found out about this "new general-purpose method" before OpenAI? There is clearly a huge incentive to cheat here, and OpenAI has been shown to not be particularly trustworthy when it comes to test and benchmark results.
Edit: Oh, this is also interesting: https://leanprover.zulipchat.com/#narrow/channel/219941-Machine-Learning-for-Theorem-Proving/topic/Blind.20Speculation.20about.20IMO.202025/near/529569966
"I don't think OpenAI was one of the AI companies that agreed to cooperate with the IMO on testing their models and don't think any of the 91 coordinators on the Sunshine Coast were involved in assessing their scripts."
Are you sure? I'm pretty sure that was cited as *one* of the possible reasons, but it wasn't confirmed anywhere. I don't know whether minor scaffolding differences could have that much of an effect (-15%?) on a math benchmark, but if they did, that should have been accounted for in the first place. I don't think other models were tested with scaffolds specifically engineered to get them a higher score.
As per Arc Prize and what they said OpenAI told them, the December version ("o3-preview", as Arc Prize named it) had a compute tier above that of any publicly released model. Not only that, they say the public version of o3 didn't undergo any RL for ARC-AGI, "not even on the train set". That seems suspicious to me, because once you train a model on something, you can't easily untrain it; as per OpenAI, the ARC-AGI train set was "just a tiny fraction of the o3 train set" and, once again, the model used for the evaluations was "fully general". This means one of three things: either o3-preview was trained on the ARC-AGI train set close to the end of the training run and OpenAI simply loaded an earlier checkpoint to undo that, then never trained on it again for unknown reasons; or the public version of o3 was retrained from scratch or from a very early checkpoint and, again for unknown reasons, never trained on the ARC-AGI data; or o3-preview was somehow specifically tailored towards ARC-AGI. The last option seems the most likely to me, especially considering the custom compute tier used in the December evaluation.