LESSWRONG
Defender7762
Defender7762's Shortform
Defender7762 · 5mo

Click to expand all questions and answers for all models.

Defender7762 · 5mo

Anti-fitting generalized reasoning test for o3h/o4 mh:
https://llm-benchmark.github.io/
https://www.lesswrong.com/posts/CEHsJzBCmuhEDdNxg/debunk-the-myth-testing-the-generalized-reasoning-ability-of

Disappointing. I thought it would be much better than Grok. It seems that this version cannot be the one shown by ARC-AGI in mid-December.

Debunk the myth - Testing the generalized reasoning ability of LLM
Defender7762 · 5mo

Thank you very much for your advice! You can click the question and model-name window to expand the answers of all models. There is also a commented-out ability calculator in the website's source code. The '50 times' I mentioned refers to a probability derived from the normal distribution.

The 'Time' column represents the difficulty of the problems the model can reliably solve, measured by how long a human would take to solve them; longer times indicate harder problems. The standard deviation indicates the percentage of STEM individuals who can solve the problem, assuming a normal distribution. A standard deviation of 0 implies that nearly 100% of the STEM population can solve such problems.
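
For readers who want to see how a "times" figure can fall out of a normal distribution, here is a minimal sketch. It is not the commented-out calculator from the website. It assumes the textbook convention in which a standard-deviation value is a z-score and the share of people at or above it is the upper tail of a standard normal. The function names `upper_tail` and `tail_ratio` and the z-values in the loop are hypothetical, chosen only for illustration. The site may use a different baseline (the comment above describes a standard deviation of 0 as nearly 100% of the STEM population), so treat the numbers as illustrative only.

```python
import math


def upper_tail(z: float) -> float:
    """Return P(Z > z) for a standard normal variable Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def tail_ratio(z_low: float, z_high: float) -> float:
    """Return how many times larger the share above z_low is than the share above z_high."""
    return upper_tail(z_low) / upper_tail(z_high)


if __name__ == "__main__":
    # Hypothetical z-scores, not values taken from the benchmark.
    for z in (0.0, 1.0, 2.0, 3.0):
        print(f"z = {z:.1f}: share above = {upper_tail(z):.4%}")
    print(f"share above z=1 vs. z=3: {tail_ratio(1.0, 3.0):.1f}x")
```

Under this assumption, a ratio such as "50 times" would correspond to comparing two tail shares; which two points the original figure compares is not stated in the comment.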
