Latent Adversarial Training (LAT) Improves the Representation of Refusal
TL;DR: We investigated how Latent Adversarial Training (LAT), as a safety fine-tuning method, affects the representation of refusal behaviour in language models compared to standard Supervised Safety Fine-Tuning (SSFT) and Embedding Space Adversarial Training (AT). We found that LAT appears to encode refusal behaviour in a more distributed way across...
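To make the comparison concrete, here is a minimal toy sketch of how the two adversarial methods differ: both run an inner loop that searches for a worst-case perturbation and then fine-tune the model to keep producing the safe target under it, but embedding-space AT injects the perturbation into the input embeddings while LAT injects it into a hidden layer's activations. This is a sketch of the general recipe under simplifying assumptions, not the exact training setup used here; names like `ToyLM`, `adversarial_step`, and `perturb_layer` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    """A tiny stand-in for a language model: embeddings, a stack of hidden layers, and a head."""
    def __init__(self, vocab=100, dim=32, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_layers)]
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, perturb_layer=None, delta=None):
        # perturb_layer = -1 adds delta to the embeddings (embedding-space AT);
        # perturb_layer = k >= 0 adds delta to the output of hidden layer k (LAT).
        h = self.embed(tokens)
        if perturb_layer == -1 and delta is not None:
            h = h + delta
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if perturb_layer == i and delta is not None:
                h = h + delta
        return self.head(h)

def adversarial_step(model, tokens, targets, perturb_layer,
                     epsilon=0.1, inner_steps=6, inner_lr=0.05):
    """One outer training step: find a loss-maximising perturbation, then train against it."""
    # Run the model once (no grad) just to get the shape of the activation we will perturb.
    with torch.no_grad():
        h = model.embed(tokens)
        for i in range(perturb_layer + 1):  # empty range when perturb_layer == -1
            h = model.layers[i](h)
    delta = torch.zeros_like(h, requires_grad=True)

    # Inner loop: projected gradient ascent on delta (the adversary).
    for _ in range(inner_steps):
        logits = model(tokens, perturb_layer=perturb_layer, delta=delta)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += inner_lr * grad.sign()   # step uphill on the loss
            delta.clamp_(-epsilon, epsilon)   # stay inside the perturbation budget

    # Outer loss: supervised fine-tuning on the safe targets *under* the perturbation.
    logits = model(tokens, perturb_layer=perturb_layer, delta=delta.detach())
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

model = ToyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, 100, (4, 16))   # dummy prompt tokens
targets = torch.randint(0, 100, (4, 16))  # dummy "safe completion" tokens
loss = adversarial_step(model, tokens, targets, perturb_layer=2)     # LAT at hidden layer 2
# loss = adversarial_step(model, tokens, targets, perturb_layer=-1)  # embedding-space AT instead
opt.zero_grad(); loss.backward(); opt.step()
```

The only difference between the two settings in this sketch is where delta is injected, which is the intuition for why LAT can put pressure on the model's internal representation of refusal rather than only on its inputs.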
Yes! At layer 4, about 7% of the LAT model's responses are refusals, 25% are invalid, and the rest are valid non-refusal responses.