Aleksandr Kedrik

I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment

In December 2024, Anthropic and Redwood Research published the paper "Alignment faking in large language models" which demonstrated that Claude sometimes "strategically pretends to comply with the training objective to prevent the training process from modifying its preferences" (paper on ArXiv, LessWrong post, Anthropic blog post). Experimental setup The paper...

May 30, 202534

LESSWRONG
LW

LESSWRONG
LW

Aleksandr Kedrik

I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment

Aleksandr Kedrik

Aleksandr Kedrik

I replicated the Anthropic alignment faking experiment on other models, and they didn't fake alignment

Experimental setup