I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment
In December 2024, Anthropic and Redwood Research published the paper "Alignment faking in large language models", which demonstrated that Claude sometimes "strategically pretends to comply with the training objective to prevent the training process from modifying its preferences" (paper on ArXiv, LessWrong post, Anthropic blog post).

Experimental setup

The paper...
May 30, 2025