x
I Used Safety Benchmarks to Test Agentic Scaffolding. The Benchmarks Were More Broken Than the Scaffolds. — LessWrong