Angela Tang — LessWrong

Angela Tang

7

1y

Alignment Faking Replication and Chain-of-Thought Monitoring Extensions

In this post, I present a replication and extension of the alignment faking model organism (code on GitHub):

* Replication. I reproduced the alignment faking (AF) setup from Greenblatt et al. (2024) using the improved classifiers from Hughes et al. (2025) and demonstrated AF behavior in Hermes-3-Llama-3.1-405B.
* System prompt...