Previously at Penn (Logic, Information, and Computation BA), Safe AI @ Penn, and ERA:AI Cambridge. Currently a GovAI Winter Fellow.
Establishing Best Practices for Building Rigorous Agentic Benchmarks by Yuxuan Xhu et. al. July 14th, 2025 covers this problem for agentic benchmarks.
For example, SWE-bench-Verified uses insufficient test cases, while τ -bench counts empty responses as successful. Such issues can lead to underor overestimation of agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVEBench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.
We conducted an in-depth analysis of specific issues present in each agentic benchmark. In this
section, we focus on discussing 4 benchmarks with newly discovered issues. We defer a detailed
description of all identified issues in Appendix D and experiment designs to E.
1. -bench relies on trivial states or substrings as ground truth [...] overestimating performance by 38%.
2. -bench also allows agents to list every possible answer [...] overestimating
performance by 40%.
3. WebArena [...] uses an LLM-as-a-Judge without validating its accuracy or consistency [...], leading to a 1.4–5.2% performance overestimate.
4. SWE-Lancer fails to fully isolate agents from the ground truth [...], allowing agents to score 100% without solving tasks.
5. KernelBench omits comprehensive fuzzing for edge cases and memory layouts [...] overestimating kernel-correctness performance by approximately 31%.
6. In OSWorld, the task website changes have broken the HTML selectors used for evaluation,
leading to a 28% performance underestimation in the chrome task section.
I did something similar last year for ~2.5 months or so. I never finished my write-up of after, but here's my before:
https://sayler.substack.com/p/on-living-with-intentionality-im
When I talk to people about it, I also say the first two weeks were certainly the hardest! I spent a lot of time journaling—the first two weeks I journaled approximately 15,000 words.
Unfortunately, once I stopped doing the minimalism thing completely, I don't think I've had any super obvious lasting effects. I do feel more confident talking to strangers, and with not checking my phone for long periods of time during social interactions. But I unfortunately don't have less attachment to technology in a grander sense.