AI Agent Benchmarks Are Broken

by Sasha Cui
8th Jul 2025
1 min read

This is a linkpost for https://ddkang.substack.com/p/ai-agent-benchmarks-are-broken

We find that many agentic benchmarks have issues with task setup or reward design, leading to under- or overestimation of agents' performance by up to 100%.

![Results of applying ABC on ten widely used AI agent benchmarks.](https://substackcdn.com/image/fetch/$s_!es-x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9b36bada-a016-48fc-90f6-f50766a8c55c_1600x389.png)
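
To make the reward-design failure mode concrete, here is a minimal sketch (entirely hypothetical; the task, answer, and function names are not from the paper): a grader that string-matches the expected answer anywhere in the agent's transcript will credit an agent that merely guesses or echoes the answer, inflating measured performance, while a check tied to verified task completion does not.

```python
# Hypothetical illustration of a reward-design flaw that overestimates
# agent performance. None of these names come from the paper or ABC.

EXPECTED_ANSWER = "42"

def lax_reward(transcript: str) -> float:
    # Substring match: rewards answer-echoing, not task completion.
    return 1.0 if EXPECTED_ANSWER in transcript else 0.0

def strict_reward(final_answer: str, solved_via_tool: bool) -> float:
    # Requires the correct final answer AND evidence the agent actually
    # completed the task (e.g., a verified tool call or environment state).
    return 1.0 if final_answer == EXPECTED_ANSWER and solved_via_tool else 0.0

guessing_transcript = "The answer is probably 42."
print(lax_reward(guessing_transcript))              # 1.0 -- full credit for a guess
print(strict_reward("42", solved_via_tool=False))   # 0.0 -- stricter check catches it
```

The analogous underestimation failure is a harness or grader that rejects genuinely correct solutions, e.g. by crashing on valid output formats it did not anticipate.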

To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues.

![Operational and conceptual processes of AI agent evaluation. Task and outcome validity are essential to ensure that benchmark results truly reflect agents' capabilities.](https://substackcdn.com/image/fetch/$s_!33Iq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89022218-0a59-48f5-b648-b0ff39fffb70_1600x432.png)
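
The actual checklist items are in the paper; as a rough illustration of how guidelines like these might be operationalized (every item and field below is hypothetical, loosely mirroring the task-validity/outcome-validity split, not the real ABC), one could encode each check as a predicate over a benchmark's task and scoring artifacts:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Benchmark:
    tasks_are_solvable: bool        # task validity: success is actually achievable
    reward_requires_solution: bool  # outcome validity: a passing score implies success

@dataclass
class Check:
    name: str
    passes: Callable[[Benchmark], bool]

# Hypothetical checklist items; the real ABC items are in the paper.
CHECKLIST = [
    Check("task validity: every task has a verified reference solution",
          lambda b: b.tasks_are_solvable),
    Check("outcome validity: reward fires only on genuine task success",
          lambda b: b.reward_requires_solution),
]

def audit(benchmark: Benchmark) -> list[str]:
    """Return the names of checklist items the benchmark fails."""
    return [c.name for c in CHECKLIST if not c.passes(benchmark)]

print(audit(Benchmark(tasks_are_solvable=True, reward_requires_solution=False)))
```

A benchmark that fails any item would then flag exactly where its reported scores stop reflecting agent capability.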