x

LESSWRONG
LW

JeaniceK — LessWrong

JeaniceK

JeaniceK

Message

12

1

1

10mo

JeaniceK

12

10mo

;

Correcting Deceptive Alignment using a Deontological Approach

Deceptive alignment (also known as alignment faking) occurs when an AI system that is not genuinely aligned behaves as if it is—intentionally deceiving its creators or training process in order to avoid being modified or shut down. (source: LessWrong.com) Anthropic and Redwood Research recently released a paper demonstrating what they...

Apr 14, 2025•8