samuelsimko — LessWrong

Investigating Accidental Misalignment: Causal Effects of Fine-Tuning Data on Model Vulnerability

by Zhijing Jin, Punya Syon Pandey, samuelsimko, and Kellin Pelrine

TL;DR: This post examines the effects of domain-specific fine-tuning and how the characteristics of fine-tuning data relate to adversarial vulnerability. We also explore the implications for real-world applications and offer insights into the importance of dataset engineering as an approach toward achieving true alignment in AI systems...

Jun 11, 2025 • 6