I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment
In December 2024, Anthropic and Redwood Research published the paper "Alignment faking in large language models", which demonstrated that Claude sometimes "strategically pretends to comply with the training objective to prevent the training process from modifying its preferences" (paper on ArXiv, LessWrong post, Anthropic blog post).

Experimental setup

The paper...
May 30, 2025