Brandon Sayler
Previously at Penn (Logic, Information, and Computation BA), Safe AI @ Penn, and ERA:AI Cambridge. Currently a GovAI Winter Fellow.
"Establishing Best Practices for Building Rigorous Agentic Benchmarks" by Yuxuan Zhu et al. (July 14, 2025) covers this problem for agentic benchmarks.
...For example, SWE-bench-Verified uses insufficient test cases, while τ-bench counts empty responses as successful. Such issues can lead to under- or overestimation of agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practice
I did something similar last year for about 2.5 months. I never finished my write-up of the "after," but here's my "before":
https://sayler.substack.com/p/on-living-with-intentionality-im
When I talk to people about it, I also say the first two weeks were certainly the hardest! I spent a lot of time journaling—the first two weeks I journaled approximately 15,000 words.
Unfortunately, once I stopped doing the minimalism thing completely, I don't think I've had any super obvious lasting effects. I do feel more confident talking to strangers, and with not checking my p...