What Makes a Good Terminal Bench Task
Disclosure: I cross-posted this on X and my personal blog, but I thought it might be a useful first post for LessWrong.
Most people write benchmark tasks the way they write prompts. They shouldn’t. A prompt is designed to help the agent succeed. A benchmark is designed to find out...
Mar 28