I played a game against GPT-4.5 today, and it seemed to be the strongest LLM I have played so far. It didn't hallucinate once, didn't blunder, and reached a drawn endgame after 40 moves.
I’ve been running a set of micro-evals for my own workflows, but I think the main obstacle is the fuzziness of real-life tasks. Chess has some really nice failure signals, plus you get metrics like centipawn loss for free.
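The centipawn-loss part really is just a bit of glue code. Here's a rough sketch assuming the python-chess library and a local Stockfish binary on your PATH; the search depth and mate-score values are arbitrary placeholders, not anything from a specific setup:

```python
import chess
import chess.engine

def centipawn_loss(engine: chess.engine.SimpleEngine,
                   board: chess.Board,
                   played: chess.Move,
                   depth: int = 12) -> int:
    """How much worse the played move is than the engine's eval of the position, in centipawns."""
    limit = chess.engine.Limit(depth=depth)
    mover = board.turn
    # Eval of the position before the move, from the mover's point of view.
    before = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=100_000)
    # Eval after the move actually played, same point of view.
    board.push(played)
    after = engine.analyse(board, limit)["score"].pov(mover).score(mate_score=100_000)
    board.pop()
    return max(0, before - after)

engine = chess.engine.SimpleEngine.popen_uci("stockfish")  # assumes stockfish is on PATH
board = chess.Board()
print(centipawn_loss(engine, board, board.parse_san("e4")))
engine.quit()
```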
It takes significant mental effort to look at your own job/work and create micro-benchmarks that aren’t completely subjective. The trick that’s helped me the most is to steal a page from test-driven development (TDD):
- Write the oracle first: if I can’t evaluate the output as true/false, the task is too mushy (sketched below)
- Shrink the scope until it breaks cleanly
- Iterate like unit tests: add new edge cases whenever a model slips through or reward-hacks
The payoff is being able to make clean dashboards that tell you “model XYZ passes 92% of this subcategory of tests”.
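To make the first bullet concrete, here's a toy oracle-first micro-eval in pytest style. `run_model` is a stand-in for whatever call you make to the model under test, and the task and expected answer are invented for illustration:

```python
def run_model(prompt: str) -> str:
    # Stand-in for your own LLM call; wire this up to whatever model you're testing.
    raise NotImplementedError

def oracle(output: str) -> bool:
    # Written before any model is run: a pure true/false check, no judgment calls.
    return output.strip() == "412.50"

def test_invoice_total_extraction():
    prompt = "Reply with only the total due: 'Subtotal 375.00, tax 37.50, total due 412.50'."
    assert oracle(run_model(prompt))
```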
Interesting! I remember that ChatGPT-3.5-Turbo could, surprisingly, play very long games without any problems. Although it wasn't very good, it almost never hallucinated moves. See, e.g., https://lichess.org/tKa7ObGI
Huh! I didn't know that. I suspect the user who coded it had a re-prompting feature to tell ChatGPT when its move was illegal. That was an advantage I didn't give to the LLMs here.
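For reference, that kind of legality check and re-prompt is only a few lines with python-chess. A rough sketch, where `ask_llm_for_move` is a hypothetical stand-in for however that bot actually queried the model:

```python
import chess

def ask_llm_for_move(board: chess.Board, feedback: str) -> str:
    # Hypothetical: send the game so far (plus any feedback) to the model, get a SAN move back.
    raise NotImplementedError

def get_legal_move(board: chess.Board, max_retries: int = 3) -> chess.Move:
    feedback = ""
    for _ in range(max_retries):
        san = ask_llm_for_move(board, feedback)
        try:
            return board.parse_san(san)  # raises a ValueError subclass on illegal/unparseable moves
        except ValueError:
            feedback = f"The move {san!r} is illegal in this position. Please pick a legal move."
    raise RuntimeError("No legal move after repeated prompting.")
```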
Lots of people seem to resonate with the idea that AI benchmarks are becoming more and more meaningless, either because they're being run in misleading ways or because they aren't tracking the things that matter. I think the right response is for people to do their own personal evaluations of models, with their own criteria. As an example, I found Sarah Constantin's review of AI research tools super helpful, and I think more people should evaluate LLMs themselves in a way that is transparent and clear to others.
So this post shares the results of my personal evaluation of LLM progress, using chess as the focal exercise. I chose chess ability mainly because it seemed like the skill that should be most obviously moved by improvements in reasoning ability, so it would be a good benchmark for whether reasoning has actually improved. Plus, it was a good excuse to spend all day playing chess. I much preferred the example in this post of cybersecurity usage at a real company, but this is just what I can do.
I played a chess game against each of the major LLMs and recorded the outcome of each game: whether the LLM hallucinated during the game, how many moves it lasted, and how well it played. Here are my results:
Full results in this spreadsheet. My takeaways:
A few miscellaneous notes: