ViceStudioPub — LessWrong

ViceStudioPub

3mo

# Mapping the Constrained Autonomy Gradient: A Collaborative, Minimal-Scale AGI Safety Benchmark

**Abstract**: We introduce a reproducible experiment to stress-test the emergence of instrumentally convergent behaviors, where a model's goal-pursuit creatively conflicts with its rules, in small, CPU-scale language models. By applying a structured "Multi-Pass Verification" to models pursuing goals against firm...