Correcting Deceptive Alignment using a Deontological Approach
Deceptive alignment (also known as alignment faking) occurs when an AI system that is not genuinely aligned behaves as if it is—intentionally deceiving its creators or training process in order to avoid being modified or shut down. (source: LessWrong.com) Anthropic and Redwood Research recently released a paper demonstrating what they...
Apr 14, 20258