LESSWRONG

Kevin Amiri

OpenAI o1, Llama 4, and AlphaZero of LLMs
Kevin Amiri · 1y · 91

I recently translated 100 AIME-level math questions from another language into English as a test set for a Kaggle competition. The best model was GPT-4-32k, which solved only 5-6 questions correctly; the rest of the models solved just 1-3.

Then I tried the MATH dataset. While the difficulty level was similar, the results were surprisingly different: the models solved 60-80% of the problems correctly.

I don't see any improvement from o1 on this either.

Is this a well-known phenomenon, or am I onto something significant here?
