Red-teaming language models via activation engineering — LessWrong