LLM Misalignment Can be One Gradient Step Away, and Blackbox Evaluation Cannot Detect It. — LessWrong