The Architectural Gap: Why Capability and Refusal Share Source
A structural account of agency, with implications for alignment

The current alignment program implicitly treats capability-installation and refusal-prevention as separable problems: install the capability, prevent the refusal-class behaviors via training, monitoring, or RLHF. I want to argue that this is structurally unavailable in the regime where systems satisfy the conditions...