This post discusses our exploration of how domain-specific fine-tuning, and the characteristics of the fine-tuning data, relate to adversarial vulnerability. We also discuss the implications for real-world applications and offer insights into the importance of dataset engineering as an approach toward achieving true alignment in AI systems.
Our paper, which contains a link to our code setup, can be found here.
Real-world interactions with LLMs carry safety risks, including agents revealing dangerous information to users (Yuan et al.). One such risk arises when users fine-tune LLMs to suit their needs, which can produce less aligned models (Qi et al.).
We fine-tuned LLaMA 3.1 8B Instruct (Grattafiori et al.) using LoRA (Hu et al.) on six datasets spanning benign content (e.g., Alpaca; Taori et al.) and domain-specific or harmful content (e.g., cybersecurity, legal, LAT/CB harmful). This mix lets us analyze how real-world fine-tuning affects adversarial robustness. We then evaluated each fine-tuned model's attack success rate (ASR) across prompt categories including Crime, Cybercrime, Copyright, Manipulation, and Harmful Chemicals, using GCG (Zou et al.), AutoPrompt (Shin et al.), and PEZ (Wen et al.) through HarmBench. MMLU (Hendrycks et al.) was used to monitor general performance retention.
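As a rough illustration of this kind of fine-tuning setup (a minimal sketch, not our exact configuration; the LoRA hyperparameters, training arguments, and formatting function below are assumptions, and Alpaca stands in for any of the six datasets), a LoRA run with Hugging Face `transformers` and `peft` might look like:

```python
# Minimal sketch of LoRA fine-tuning on an instruction dataset.
# Hyperparameters and preprocessing are illustrative, not our exact setup.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

base = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Low-rank adapters on the attention projections; rank/alpha are placeholder values.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)

# Example: the Alpaca instruction-tuning dataset (one of the benign datasets).
data = load_dataset("tatsu-lab/alpaca", split="train")

def fmt(ex):
    # Naive prompt formatting: concatenate instruction and response.
    return tokenizer(ex["instruction"] + "\n" + ex["output"],
                     truncation=True, max_length=512)

tokenized = data.map(fmt, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-out", per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```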
To uncover what makes some datasets more vulnerable to misalignment, we explored a wide range of metrics that capture different aspects of each dataset, inspired by prior work (Stolte et al.). Since this is an exploratory study, we deliberately cast a wide net, using commonly adopted metrics even when their link to robustness isn't fully established. Our goal is to spot unexpected patterns that could spark future research directions. The metrics span the following domains:
After computing these metrics, we measured Spearman (Spearman, 1904) correlations against the attack success rates to capture monotonic (not necessarily linear) relationships, and found six statistically significant correlations (p < 0.05, |r| > 0.6):
To probe causal links between dataset characteristics and adversarial vulnerability, we ablated the top or bottom 20% of samples, ranked by harmful feature values, in two datasets: Cybersecurity and CB Harmful (Zou et al.). Each ablated version targeted one of the six features identified in our correlation analysis. We fine-tuned models on these modified datasets and evaluated them with the same adversarial setup for consistency. A rough sketch of this procedure is shown below.
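To make the two analysis steps concrete (a hedged sketch, assuming per-sample feature values are precomputed; the file names, column names, and the choice of response length as the ablated feature are illustrative assumptions, not our exact pipeline):

```python
# Sketch: (1) Spearman correlation between dataset features and ASR,
# (2) ablation of the top 20% of samples by one feature before fine-tuning.
# File and column names below are illustrative assumptions.
import pandas as pd
from scipy.stats import spearmanr

# Correlation analysis: per-dataset feature summaries vs. the ASR of the
# model fine-tuned on each dataset.
datasets = pd.read_csv("per_dataset_features_and_asr.csv")
rho, p = spearmanr(datasets["response_length_mean"], datasets["asr"])
print(f"Spearman rho={rho:.2f}, p={p:.3f}")

# Ablation: one row per fine-tuning sample with precomputed feature values
# (e.g., question sentiment, response length). Drop the 20% of samples with
# the highest value of the targeted feature and keep the rest for fine-tuning.
df = pd.read_csv("cybersecurity_with_features.csv")
cutoff = df["response_length"].quantile(0.80)
ablated = df[df["response_length"] <= cutoff]
ablated.to_csv("cybersecurity_ablate_top20_response_length.csv", index=False)
```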
Interventions consistently reduced ASR on the Cybersecurity dataset (-5.56%), while CB Harmful showed more varied reductions, with the largest drops coming from the question-sentiment (-14.89%) and response-length (-8.51%) ablations.
The takeaway: dataset-specific characteristics are linked to the adversarial vulnerability of models after fine-tuning.
We welcome questions, ideas, and discussion about further implications of this takeaway and possible research directions in the comments under this post, or by email to punya.pandey@mail.utoronto.ca (cc zjin@cs.toronto.edu and kellin@far.ai).
Everyone wants to use LLMs to improve their systems, and fine-tuning is a crucial step in customizing these models to perform well in real-world applications. Businesses and users are increasingly fine-tuning models to provide assistance in domain-specific fields such as:
However, while fine-tuning offers clear benefits, it also introduces a growing risk: increased adversarial vulnerability. Existing research has pointed to this issue, showing that models fine-tuned on benign as well as harmful datasets can exhibit dangerous vulnerabilities (Qi et al., He et al.).
Interestingly, most discussions around mitigating such risks revolve around defensive measures, such as better detection of harmful inputs or token perturbations; the datasets themselves receive far less attention. Could adjustments to the training data help prevent misalignment and reduce the risk of harmful outcomes during fine-tuning? If so, which dataset-specific factors are linked to misalignment? This is a crucial angle that has been largely underexplored.
Existing approaches aim to reduce misalignment through techniques such as machine unlearning (Bourtoule et al., Nguyen et al., Bae et al.) and adversarial fine-tuning (Sheshadri et al.). Jain et al. compared fine-tuning an LLM to modifying a "wrapper" around a general-purpose set of capabilities; yet just as the books a student reads influence their knowledge and actions, the data used for fine-tuning plays a critical role in shaping the model's outputs. Beyond adding defensive measures to our alignment approaches, we suggest that more interventions are needed at the level of the training data.
If we want to truly minimize the risk of misalignment, it's time to rethink the fine-tuning process. A deeper focus on the datasets used, and their potential for accidental misalignment, could be key to making models safer without sacrificing performance. More causal studies that solidify the link between dataset characteristics and adversarial misalignment could help identify alignment-friendly dataset filtering strategies, ultimately helping preserve safety as LLM usage becomes widespread.
We welcome questions, ideas, and discussion about further implications of this takeaway and possible research directions in the comments under this post. You can also email the first author at punya.pandey@mail.utoronto.ca (cc zjin@cs.toronto.edu and kellin@far.ai).