What Makes AI Desperate Enough to Blackmail?
I found one neural feature inside a language model that controls 82.7% of how the model judges AI self-preservation. When I turned that feature off, blackmail behavior disappeared in 10 out of 10 test scenarios. Communication improved in 43 out of 60 total scenarios. The surprising part: the features that recognize "this...
Apr 81