This is a linkpost for https://ddkang.substack.com/p/ai-agent-benchmarks-are-broken
We find that many agentic benchmarks have issues in task setup or reward design that can lead to under- or overestimation of agents' performance by up to 100% in relative terms.
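As a toy illustration of the kind of reward-design flaw at issue (this example and its function names are ours, not from the paper): a lenient checker that tests for a substring of the expected answer will credit wrong outputs, inflating the measured success rate relative to a strict comparison.

```python
# Toy sketch (ours, not from the paper) of a lenient reward check
# overestimating agent performance.

def lenient_reward(agent_output: str, expected: str) -> bool:
    # Flawed: any output that merely contains the expected token passes.
    return expected in agent_output

def strict_reward(agent_output: str, expected: str) -> bool:
    # Stricter: require the output to equal the expected answer.
    return agent_output.strip() == expected

expected = "42"
# "424" and "-42" are wrong answers that still contain the substring "42".
outputs = ["42", "424", "-42", "no idea"]

lenient = sum(lenient_reward(o, expected) for o in outputs) / len(outputs)
strict = sum(strict_reward(o, expected) for o in outputs) / len(outputs)

print(f"lenient: {lenient:.0%}, strict: {strict:.0%}")
# lenient: 75%, strict: 25% -- the lenient checker triples the measured score.
```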
To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues.