hal2k — LessWrong

Latent Adversarial Training (LAT) Improves the Representation of Refusal

by alexandraabbas, nlpet, and hal2k

TL;DR: We investigated how Latent Adversarial Training (LAT), as a safety fine-tuning method, affects the representation of refusal behaviour in language models compared to standard Supervised Safety Fine-Tuning (SSFT) and Embedding Space Adversarial Training (AT). We found that LAT appears to encode refusal behaviour in a more distributed way across...

Jan 6, 2025•21