Sandbagging Is Linearly Separable in Transformer Activations — LessWrong