AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment — LessWrong