Avoiding jailbreaks by discouraging their representation in activation space — LessWrong