Automated Evaluation of LLMs for Math Benchmark - A Practical Solution — LessWrong