LESSWRONG
LW

2330
Brandon Sayler
1010
Message
Dialogue
Subscribe

Previously at Penn (Logic, Information, and Computation BA) and Safe AI @ Penn.

Currently spending the summer at ERA:AI Cambridge doing Technical AI Governance research in forecasting.

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No posts to display.
No wikitag contributions to display.
About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong
Brandon Sayler3mo20

Establishing Best Practices for Building Rigorous Agentic Benchmarks by Yuxuan Xhu et. al. July 14th, 2025 covers this problem for agentic benchmarks.

For example, SWE-bench-Verified uses insufficient test cases, while τ -bench counts empty responses as successful. Such issues can lead to underor overestimation of agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVEBench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.

 

We conducted an in-depth analysis of specific issues present in each agentic benchmark. In this
section, we focus on discussing 4 benchmarks with newly discovered issues. We defer a detailed
description of all identified issues in Appendix D and experiment designs to E.

1. τ-bench relies on trivial states or substrings as ground truth [...] overestimating performance by 38%.
2. τ-bench also allows agents to list every possible answer [...] overestimating
performance by 40%.
3. WebArena [...] uses an LLM-as-a-Judge without validating its accuracy or consistency [...], leading to a 1.4–5.2% performance overestimate.
4. SWE-Lancer fails to fully isolate agents from the ground truth [...], allowing agents to score 100% without solving tasks.
5. KernelBench omits comprehensive fuzzing for edge cases and memory layouts [...] overestimating kernel-correctness performance by approximately 31%.
6. In OSWorld, the task website changes have broken the HTML selectors used for evaluation,
leading to a 28% performance underestimation in the chrome task section.

Reply