Anti-fitting generalized reasoning test for o3-high / o4-mini-high
https://llm-benchmark.github.io/
https://www.lesswrong.com/posts/CEHsJzBCmuhEDdNxg/debunk-the-myth-testing-the-generalized-reasoning-ability-of
Disappointing. I expected it to be much better than Grok; it seems this version cannot be the one ARC-AGI showed in mid-December.
Thank you very much for the suggestion! You can click on the question/model-name panel to expand every model's answer. There is also a commented-out ability calculator in the website's source code. The '50 times' I mentioned refers to a probability derived from the normal distribution.
The 'Time' column represents the difficulty of the problems the model can reliably solve, expressed as how long a human would need to solve them; longer times indicate harder problems. The standard-deviation column indicates, via a normal distribution, what percentage of the STEM population can solve the problem: a value of 0 means that nearly 100% of the STEM population can solve such problems.
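To make the arithmetic concrete, here is a minimal sketch of how a standard-deviation difficulty value could be converted into a share of STEM solvers via the normal CDF. This is an illustration under stated assumptions, not the commented-out calculator from the site's source: the offset `BASELINE_SIGMA` and the helper `solver_fraction` are hypothetical names and choices, picked only so that a column value of 0 comes out near 100%, as described above.

```python
import math

# Hypothetical calibration constant: chosen so that a difficulty of 0 sd
# corresponds to a problem nearly 100% of the STEM population can solve.
BASELINE_SIGMA = 3.0

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function (standard library only)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def solver_fraction(sigma_column: float) -> float:
    """Assumed mapping: fraction of the STEM population expected to solve a
    problem whose difficulty column reads `sigma_column` standard deviations."""
    return 1.0 - normal_cdf(sigma_column - BASELINE_SIGMA)

if __name__ == "__main__":
    for z in (0.0, 1.0, 2.0, 3.0, 4.0):
        print(f"difficulty {z:.1f} sd -> ~{solver_fraction(z):.1%} of STEM solvers")
    # Example of comparing two difficulty levels as a ratio of tail probabilities.
    print(f"solver-pool ratio 2 sd vs 4 sd: ~{solver_fraction(2.0) / solver_fraction(4.0):.1f}x")
```

With these assumed numbers, moving a problem from 2 sd to 4 sd already shrinks the solver pool by roughly a factor of five; the '~50 times' figure above is presumably the same kind of tail-probability comparison, with thresholds taken from the benchmark scores versus the models' performance on these hand-designed questions.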
Current LLM Reasoning Ability: As of March 2025, the actual reasoning capabilities of publicly available LLMs are approximately 50 times lower than what is suggested by benchmarks like AIME.
Today, misleading marketing about LLMs' reasoning ability is rampant on the Internet. The claims are usually strong: the models score high (80%+) on mathematical benchmarks that most people consider difficult and that require little background knowledge, or they are rated at a 'doctoral level' of intelligence based on knowledge-heavy tests. With a skeptical attitude, we designed some questions of our own.
https://llm-benchmark.github.io (click to expand all questions and model answers)
The premise of testing the real generalizable reasoning ability of LLMs is that the tester has the...