Investigating Accidental Misalignment: Causal Effects of Fine-Tuning Data on Model Vulnerability
TL;DR This post discusses our explorations into the effects of domain-specific fine-tuning and how the characteristics of fine-tuning data relate to adversarial vulnerability. We also explore its implications for real-world applications, and offer insights into the importance of dataset engineering as an approach toward achieving true alignment in AI systems....