Detecting High-Stakes Interactions with Activation Probes — LessWrong