Steering Awareness: Models Can Be Trained to Detect Activation Steering — LessWrong