I recently worked on a project through the Carreras con Impacto (CCI) mentorship program that tackled a frustrating problem in AI evaluation: how to automatically assess math performance in LLMs without losing the nuance that human reviewers catch.
The backstory: Our team at Ako (part of CCI's AI4Math initiative) had developed a math benchmark, but reviewing answers manually was killing us. Each model took about 1.5 hours to evaluate across 105 questions. As we added more models and languages, this became completely unsustainable.
The Obvious Solution That Didn't Work
Our first thought was "let's just use another LLM as the grader!" It turned out to be a bad idea: when we tried GPT-4o as an evaluator, its scores diverged wildly from what our mathematician reviewers found:
o3-mini: Mathematicians scored it at 74.28%, but the AI evaluator gave it 50.40%
GPT-4o: Humans said 53.33%, AI said 38.10%
DeepSeek-R1: Humans said 69.52%, AI said 59.05%
The AI was consistently harsher and less accurate. We needed something deterministic and reliable.
What Actually Worked: A Hybrid Approach
We went through several iterations:
Phase 1: Regex rules - Basic pattern matching for phrases like "final answer" or "the result is." This failed because models phrase things differently and sometimes include multiple potential answers.
Phase 2: LLM-as-extractor - Here's where we found a good balance: we used an LLM not as the final judge, but as a smart extractor to identify which part of the response contained the actual answer. Then we normalized everything (converting words to numbers, standardizing formats) and used deterministic rules for the final comparison.
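The normalize-then-compare step can be sketched roughly as follows, assuming the answer span has already been pulled out by the LLM extractor. The mappings and helper names here are simplified stand-ins for our actual rules:

```python
# Toy normalization table; the real system handles far more word forms.
WORD_TO_NUMBER = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4"}

def normalize(answer: str) -> str:
    """Reduce an answer string to a canonical form."""
    text = answer.strip().lower().rstrip(".")
    text = WORD_TO_NUMBER.get(text, text)  # "four" -> "4"
    text = text.replace(",", "")           # "1,000" -> "1000"
    if text.endswith(".0"):
        text = text[:-2]                   # "4.0" -> "4"
    return text

def grade(extracted: str, reference: str) -> bool:
    """Deterministic final comparison: same inputs always give the same verdict."""
    return normalize(extracted) == normalize(reference)

print(grade("Four", "4"))      # True
print(grade("1,000", "1000"))  # True
```

The key design point is that the LLM never touches the verdict: it only locates the answer span, and everything after that is plain deterministic string logic.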
Phase 3: Semantic similarity - For tricky cases where "infinity" and "infinite" mean the same thing but don't match textually, we added an embeddings layer with cosine similarity.
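The cosine-similarity fallback looks roughly like this. To keep the example self-contained, it uses character-trigram counts as a toy stand-in for real sentence embeddings, and the threshold value is an illustrative assumption:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': character trigram counts (a real system would use an embedding model)."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantically_equal(x: str, y: str, threshold: float = 0.6) -> bool:
    """Fallback check for answers that fail exact comparison after normalization."""
    return cosine_similarity(embed(x), embed(y)) >= threshold

print(semantically_equal("infinity", "infinite"))  # True: high overlap
print(semantically_equal("infinity", "zero"))      # False: no overlap
```

In the pipeline this layer only fires after exact comparison fails, so the common cases stay fully deterministic and cheap.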
What We Learned
The big surprise: a lot of what we thought were "model errors" were actually evaluation system errors. Once we fixed our evaluation method, several models performed significantly better.
Our automated system now:
Matches human evaluation accuracy closely
Catches 15% more correct answers than simple regex matching did
Takes 3-5 minutes per model instead of 1.5 hours
Remains fully deterministic (same input → same output every time)
Key Takeaways
Start simple - Our initial regex approach was flawed but gave us a baseline to measure against
Use LLMs as tools, not oracles - Their real value was in data preparation, not final judgment
Semantics matter - Mathematical equivalence isn't the same as textual equality
Automation amplifies expertise - It doesn't replace human judgment but makes it scalable
The code is available on GitHub (https://github.com/Jiwemoyo/math-benchmark-automation) if anyone wants to adapt this approach. I'm curious if others have faced similar evaluation challenges and how you've solved them.