Latent Adversarial Training (LAT) Improves the Representation of Refusal
TL;DR: We investigated how Latent Adversarial Training (LAT), as a safety fine-tuning method, affects the representation of refusal behaviour in language models compared to standard Supervised Safety Fine-Tuning (SSFT) and Embedding Space Adversarial Training (AT). We found that LAT appears to encode refusal behaviour in a more distributed way across...
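To make the comparison concrete, here is a minimal toy sketch of how the two adversarial methods differ: both run an inner loop that searches for a worst-case perturbation and then fine-tune the model to keep producing the safe target under it, but embedding-space AT injects the perturbation into the input embeddings while LAT injects it into a hidden layer's activations. This is a sketch of the general recipe under simplifying assumptions, not the exact training setup used here; names like `ToyLM`, `adversarial_step`, and `perturb_layer` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyLM(nn.Module):
    """A tiny stand-in for a language model: embeddings, a stack of hidden layers, and a head."""
    def __init__(self, vocab=100, dim=32, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_layers)]
        )
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens, perturb_layer=None, delta=None):
        # perturb_layer = -1 adds delta to the embeddings (embedding-space AT);
        # perturb_layer = k >= 0 adds delta to the output of hidden layer k (LAT).
        h = self.embed(tokens)
        if perturb_layer == -1 and delta is not None:
            h = h + delta
        for i, layer in enumerate(self.layers):
            h = layer(h)
            if perturb_layer == i and delta is not None:
                h = h + delta
        return self.head(h)

def adversarial_step(model, tokens, targets, perturb_layer,
                     epsilon=0.1, inner_steps=6, inner_lr=0.05):
    """One outer training step: find a loss-maximising perturbation, then train against it."""
    # Run the model once (no grad) just to get the shape of the activation we will perturb.
    with torch.no_grad():
        h = model.embed(tokens)
        for i in range(perturb_layer + 1):  # empty range when perturb_layer == -1
            h = model.layers[i](h)
    delta = torch.zeros_like(h, requires_grad=True)

    # Inner loop: projected gradient ascent on delta (the adversary).
    for _ in range(inner_steps):
        logits = model(tokens, perturb_layer=perturb_layer, delta=delta)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        (grad,) = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += inner_lr * grad.sign()   # step uphill on the loss
            delta.clamp_(-epsilon, epsilon)   # stay inside the perturbation budget

    # Outer loss: supervised fine-tuning on the safe targets *under* the perturbation.
    logits = model(tokens, perturb_layer=perturb_layer, delta=delta.detach())
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

model = ToyLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
tokens = torch.randint(0, 100, (4, 16))   # dummy prompt tokens
targets = torch.randint(0, 100, (4, 16))  # dummy "safe completion" tokens
loss = adversarial_step(model, tokens, targets, perturb_layer=2)     # LAT at hidden layer 2
# loss = adversarial_step(model, tokens, targets, perturb_layer=-1)  # embedding-space AT instead
opt.zero_grad(); loss.backward(); opt.step()
```

The only difference between the two settings in this sketch is where delta is injected, which is the intuition for why LAT can put pressure on the model's internal representation of refusal rather than only on its inputs.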
Yes! At layer 4, about 7% of the LAT model's responses are refusals, 25% are invalid, and the rest are valid non-refusal responses.