Zero-Shot Alignment: Harm Detection via Incongruent Attention Mechanisms — LessWrong