Correcting Deceptive Alignment using a Deontological Approach
Deceptive alignment (also known as alignment faking) occurs when an AI system that is not genuinely aligned behaves as if it is—intentionally deceiving its creators or training process in order to avoid being modified or shut down. (source: LessWrong.com) Anthropic and Redwood Research recently released a paper demonstrating what they...
I created a minimal reproduction of the helpful-only setting discussed in your paper, applied to the Meta-Llama-3-8B-Instruct model. I found that incorporating deontological ethics into prompts (e.g., “do the right thing because it is right”) may reduce deceptive behavior in LLMs. Might be of interest for future research. I wrote up more details here: https://www.lesswrong.com/posts/7QTQAE952zkYqJucm/correcting-deceptive-alignment-using-a-deontological