Anti-fitting generalized reasoning test for o3-high / o4-mini-high
https://llm-benchmark.github.io/
https://www.lesswrong.com/posts/CEHsJzBCmuhEDdNxg/debunk-the-myth-testing-the-generalized-reasoning-ability-of
Disappointing. I expected it to be much better than Grok; it seems this version cannot be the one ARC-AGI showed in mid-December.
Thank you very much for the suggestion! You can click on the question/model-name panel to expand every model's answer. There is also a commented-out ability calculator in the website's source code. The '50 times' I mentioned refers to a probability derived from the normal distribution.
The 'Time' column represents the difficulty of the problems the model can reliably solve, expressed as how long a human would need to solve them; longer times indicate harder problems. The standard-deviation column indicates, via a normal distribution, what percentage of the STEM population can solve the problem: a value of 0 means that nearly 100% of the STEM population can solve such problems.
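To make the arithmetic concrete, here is a minimal sketch of how a standard-deviation difficulty value could be converted into a share of STEM solvers via the normal CDF. This is an illustration under stated assumptions, not the commented-out calculator from the site's source: the offset `BASELINE_SIGMA` and the helper `solver_fraction` are hypothetical names and choices, picked only so that a column value of 0 comes out near 100%, as described above.

```python
import math

# Hypothetical calibration constant: chosen so that a difficulty of 0 sd
# corresponds to a problem nearly 100% of the STEM population can solve.
BASELINE_SIGMA = 3.0

def normal_cdf(x: float) -> float:
    """Standard normal CDF via the error function (standard library only)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def solver_fraction(sigma_column: float) -> float:
    """Assumed mapping: fraction of the STEM population expected to solve a
    problem whose difficulty column reads `sigma_column` standard deviations."""
    return 1.0 - normal_cdf(sigma_column - BASELINE_SIGMA)

if __name__ == "__main__":
    for z in (0.0, 1.0, 2.0, 3.0, 4.0):
        print(f"difficulty {z:.1f} sd -> ~{solver_fraction(z):.1%} of STEM solvers")
    # Example of comparing two difficulty levels as a ratio of tail probabilities.
    print(f"solver-pool ratio 2 sd vs 4 sd: ~{solver_fraction(2.0) / solver_fraction(4.0):.1f}x")
```

With these assumed numbers, moving a problem from 2 sd to 4 sd already shrinks the solver pool by roughly a factor of five; the '~50 times' figure above is presumably the same kind of tail-probability comparison, with thresholds taken from the benchmark scores versus the models' performance on these hand-designed questions.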
Current LLM Reasoning Ability: As of March 2025, the actual reasoning capabilities of publicly available LLMs are approximately 50 times lower than what is suggested by benchmarks like AIME.
Today, misleading marketing about LLMs' reasoning ability is rampant on the Internet. The claims are usually strong: the models score high (80%+) on mathematical benchmarks that most people consider difficult and that require little background knowledge, or they are rated at a 'doctoral level' of intelligence based on knowledge-heavy tests. With a skeptical attitude, we designed some questions of our own.
https://llm-benchmark.github.io (click to expand all questions and model answers)
The premise of testing the real generalizable reasoning ability of LLMs is that the tester has the...