This is a linkpost for https://ddkang.substack.com/p/ai-agent-benchmarks-are-broken
We find that many agentic benchmarks have issues in task setup or reward design that can lead to under- or overestimation of agents' performance by up to 100% in relative terms.
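As a toy illustration of the kind of reward-design flaw at issue (this example and its function names are ours, not from the paper): a lenient checker that tests for a substring of the expected answer will credit wrong outputs, inflating the measured success rate relative to a strict comparison.

```python
# Toy sketch (ours, not from the paper) of a lenient reward check
# overestimating agent performance.

def lenient_reward(agent_output: str, expected: str) -> bool:
    # Flawed: any output that merely contains the expected token passes.
    return expected in agent_output

def strict_reward(agent_output: str, expected: str) -> bool:
    # Stricter: require the output to equal the expected answer.
    return agent_output.strip() == expected

expected = "42"
# "424" and "-42" are wrong answers that still contain the substring "42".
outputs = ["42", "424", "-42", "no idea"]

lenient = sum(lenient_reward(o, expected) for o in outputs) / len(outputs)
strict = sum(strict_reward(o, expected) for o in outputs) / len(outputs)

print(f"lenient: {lenient:.0%}, strict: {strict:.0%}")
# lenient: 75%, strict: 25% -- the lenient checker triples the measured score.
```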
To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues.