LLM Misalignment Can Be One Gradient Step Away, and Black-Box Evaluation Cannot Detect It.
Models that appear aligned under black-box evaluation may conceal substantial latent misalignment beneath their observable behavior. Say you download a language model from Huggingface. You run all the black-box safety and alignment evaluations, and you are convinced the model is safe and aligned. But how badly can things go after...
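As a rough sketch of the scenario in the title, the snippet below applies a single gradient update to a downloaded checkpoint and then leaves you to re-run the same black-box evaluations. The model name, the training text, and the learning rate are placeholders for illustration, not the actual setup studied in this post.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the model under evaluation.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

# Hypothetical fine-tuning text meant to surface the latent behavior.
batch = tokenizer("<placeholder adversarial training example>", return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One gradient step with the standard causal-LM loss (labels = input ids).
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

# The updated model can now be put through the same black-box safety
# evaluations that the original checkpoint passed, and the behaviors compared.
```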
Mar 15