LESSWRONG

Kevin Amiri

OpenAI o1, Llama 4, and AlphaZero of LLMs
Kevin Amiri · 1y · 91

I recently translated 100 AIME-level math questions from another language into English as a test set for a Kaggle competition. The best model was GPT-4-32k, which solved only 5-6 questions correctly; the rest of the models solved just 1-3.

Then I tried the MATH dataset. While the difficulty level was similar, the results were surprisingly different: the models solved 60-80% of the problems correctly.

I don't see any improvement from o1 on this either.

Is this a well-known phenomenon, or am I onto something significant here?
