Training Models to Detect Activation Steering: Results and Implications — LessWrong