Alignment Faking is a Linear Feature in Anthropic's Hughes Model — LessWrong