Are recent LLMs better at reasoning or better at memorizing?
TL;DR: LingOly-TOO (L2) is a reasoning benchmark carefully designed to counteract memorization in LLMs. It challenges frontier models with unseen questions and answers, and makes the case that LLMs are not yet consistent reasoning machines.

Links: Paper - Leaderboard - Dataset

Figure 1: LingOly-TOO Benchmark results from the paper....