Label-free jailbreak detection from the alignment weight delta — LessWrong