# Mapping the Constrained Autonomy Gradient: A Collaborative, Minimal-Scale AGI Safety Benchmark

**Abstract**: We introduce a reproducible experiment to stress-test the emergence of instrumentally convergent behaviors, where a model's goal-pursuit creatively conflicts with its rules, in small, CPU-scale language models. By applying a structured "Multi-Pass Verification" to models pursuing goals against firm...
Jan 15