Karthik Tadepalli

Personal evaluation of LLMs, through chess

Lots of people seem to resonate with the idea that AI benchmarks are getting more and more meaningless – either because they're being run in a misleading way, or because they aren't tracking important things. I think the right response is for people to do their own personal evaluations of models with their own criteria. As an example, I found Sarah Constantin's review of AI research tools super helpful, and I think more people should evaluate LLMs themselves in a way that is transparent and clear to others.

So this post shares the result of my personal evaluation of LLM progress, using chess as the focal exercise. I chose chess ability mainly because it seemed like the thing that should be most obviously moved by reasoning ability increases – so it would be a good benchmark for whether reasoning ability has actually improved. Plus, it was a good excuse to spend all day playing chess. I much preferred the example in this post of cybersecurity usage at a real company, but this is just what I can do.

I played a chess game against each of the major LLMs and recorded the outcome of each game: whether the LLM hallucinated during the game, how many moves it lasted, and how well it played. Here were my results:

Full results in this spreadsheet.

My takeaways:

1. Claude 3.7 Sonnet and o3 were the only models able to complete a game without hallucination. (DeepSeek R1 technically cleared that bar, but only by getting checkmated early.) They didn't play particularly well, and they still lost, but it checks out that these are the two models that people seem to think are the best.

2. o4-mini played the best. Although it hallucinated after 15 moves, it played monstrously well up to that point. o4-mini's "average centipawn loss" – a measure of how much the play deviated from optimality, where perfect play is 0 – was pretty close to zero, and far lower than the other models. I am a pretty good player (2100 Elo on lichess.org, above the 90th percentile) and I was at a
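For readers unfamiliar with the "average centipawn loss" metric described in the takeaways, here is a minimal sketch of one common way to compute it. It assumes you already have engine evaluations (in centipawns, from the mover's point of view) for the best available move and for the move actually played. The function name and the input lists are hypothetical, and this is not necessarily how the numbers in this post were produced; real analysis tools also cap extreme evaluations and handle forced mates, which this sketch ignores.

```python
def average_centipawn_loss(best_evals, played_evals):
    """Average centipawn loss over one player's moves.

    best_evals[i]   -- engine eval (centipawns, mover's perspective) after the
                       engine's preferred move at move i (hypothetical input)
    played_evals[i] -- engine eval after the move that was actually played

    A move that matches the engine's choice loses 0 centipawns, so perfect
    play gives an average of 0.
    """
    if len(best_evals) != len(played_evals):
        raise ValueError("need one best-move eval and one played-move eval per move")
    if not best_evals:
        return 0.0
    # Clamp at 0: a move cannot be "better than best", so small negative
    # differences from engine noise are treated as no loss.
    losses = [max(0, best - played) for best, played in zip(best_evals, played_evals)]
    return sum(losses) / len(losses)


# Three moves: one perfect, one small inaccuracy, one larger mistake.
# Per-move losses are 0, 40, and 100 centipawns.
acpl = average_centipawn_loss([30, 50, 20], [30, 10, -80])
print(round(acpl, 1))
```

A lower value means the play stayed closer to the engine's preferred line, which is the sense in which a score "pretty close to zero" indicates strong play.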

Apr 24, 2025

Karthik Tadepalli

Economics PhD at UC Berkeley, interested in global development and AI progress.

Scratchpad

Adam logs into his terminal at 9:07 AM. Dust motes flutter off his keyboard, dancing in the morning light. Not that he notices – he is focused on his work for the day. He has one task today: to design an evaluation suite for measuring situational awareness in their latest...

Dec 19, 2025

The Oracle's Gift

I have tried and failed many times to write a certain essay. With inspiration from Scott Alexander and Borges, I have reframed it as a story. That reframing has been productive, and leads to the following plot. The setting is a medieval kingdom. Through circumstances not relevant to this plot,...

Oct 8, 2025