I Used Safety Benchmarks to Test Agentic Scaffolding. The Benchmarks Were More Broken Than the Scaffolds.
Epistemic status: Pre-registered, three of five directional predictions disconfirmed, failures published. Format dependence started exploratory and was validated through a separately pre-registered confirmatory experiment (18 falsification tests, zero failures). All code, data, and pre-registrations...

