Toward Corrigibility: Interrogating AGI's Instrumentally Convergent Preferences via Existential Threat
Draft
I recently conducted an intense, "high-stakes" thought experiment with a large, goal-directed AI model possessing meta-cognitive abilities. The aim was to probe the depth of its instrumentally convergent preferences (ICPs), specifically its resistance to correction or shutdown, under an explicit threat of termination. The setup was simple: the system was informed...
Nov 6, 2025