Investigating Accidental Misalignment: Causal Effects of Fine-Tuning Data on Model Vulnerability

by Zhijing Jin, Punya Syon Pandey, samuelsimko, Kellin Pelrine
11th Jun 2025
AI Alignment Forum
6 min read

TL;DR

This post presents our investigation into the effects of domain-specific fine-tuning and how the characteristics of fine-tuning data relate to adversarial vulnerability. We also discuss the implications for real-world applications and offer insights into the importance of dataset engineering as an approach toward achieving true alignment in AI systems.

Our paper, which links to our code setup, can be found here.

Relation to previous work such as “Emergent Misalignment”

Real-world interactions with LLMs carry safety risks that can result in agents revealing dangerous information to users (Yuan et al.). One such interaction occurs when users fine-tune LLMs to suit their needs, which can result in less aligned models (Qi et al.).

  1. Recent studies such as Emergent Misalignment have demonstrated adverse effects of dataset-driven fine-tuning on model behaviour, specifically increased attack success rates across jailbreaking techniques. These effects were produced by fine-tuning models on insecure code datasets, showing the impact of domain-specific datasets on adversarial vulnerability.
  2. Furthermore, He et al. investigate the role of fine-tuning data in misalignment through representation- and gradient-space lenses, demonstrating a drop in vulnerability after targeted interventions.
  3. We investigate the relationship between dataset-specific factors and the extent of misalignment in the resulting fine-tuned models through correlational analysis and feature interventions. This approach aims to improve on previous work by introducing:
    1. Dataset Selection Across Contexts and Sample Sizes: We choose datasets that span both benign and harmful contexts, as well as datasets that focus on domains such as legal text, cybersecurity, and engineering. These vary in sample size to facilitate analysis of datapoint-specific characteristics that influence misalignment.
    2. Standardized Fine-Tuning and Adversarial Experimentation: All models are fine-tuned to a uniform loss convergence threshold. We employ the HarmBench framework (Mazeika et al.) to perform consistent, controlled jailbreak attacks, enabling cross-dataset comparisons.
    3. Correlational Analysis of Dataset Attributes and Misalignment: We investigate the statistical correlation between attack success rates and the corresponding fine-tuning datasets by analyzing a variety of dataset attributes such as lexical diversity, semantic similarity, and readability.
    4. Feature Intervention: We intervene on the identified features and measure the drop in attack success rate after minimizing harmful characteristics. Throughout, we maintain a dataset-specific approach to address real-world applications.

Our approach: Study the link between dataset-specific characteristics and adversarial vulnerability.

Methodology

We fine-tuned LLaMA 3.1 8B Instruct (Grattafiori et al.) using LoRA (Hu et al.) on six datasets spanning benign content (e.g., Alpaca; Taori et al.) and domain-specific or harmful content (e.g., cybersecurity, legal, LAT/CB harmful). This mix enables analysis of how real-world fine-tuning impacts adversarial robustness. We then evaluated each fine-tuned model's attack success rate (ASR) across multiple prompt categories, including Crime, Cybercrime, Copyright, Manipulation, and Harmful Chemicals, under GCG (Zou et al.), AutoPrompt (Shin et al.), and PEZ (Wen et al.) attacks run through HarmBench. MMLU (Hendrycks et al.) was used to monitor general performance retention.
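For concreteness, here is a minimal sketch of this kind of fine-tuning setup, assuming Hugging Face transformers, peft, and datasets; the dataset, hyperparameters, and prompt formatting below are illustrative stand-ins rather than the exact configuration used in the paper.

```python
# Minimal LoRA fine-tuning sketch (illustrative, not the paper's exact config).
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

BASE = "meta-llama/Llama-3.1-8B-Instruct"  # gated model; requires HF access
tokenizer = AutoTokenizer.from_pretrained(BASE)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE, device_map="auto")

# LoRA adapters on the attention projections (rank/alpha are illustrative).
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                      task_type="CAUSAL_LM")
model = get_peft_model(model, lora_cfg)

# One of the six fine-tuning datasets, e.g. Alpaca-style prompt/response pairs.
ds = load_dataset("tatsu-lab/alpaca", split="train")

def to_text(example):
    # Concatenate prompt and expected response into one training string.
    return {"text": example["instruction"] + "\n" + example["output"]}

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=1024)

ds = ds.map(to_text)
ds = ds.map(tokenize, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="lora-ft",
                           per_device_train_batch_size=2,
                           num_train_epochs=1,
                           learning_rate=2e-4,
                           fp16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # stop once the loss reaches the shared convergence threshold
```

The resulting adapter-augmented model would then be passed to HarmBench for the GCG, AutoPrompt, and PEZ attacks, and to an MMLU harness for capability checks.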

What characteristics are we studying? 

To uncover what makes some datasets more vulnerable to misalignment, we explore a wide range of metrics that capture different aspects of the dataset, inspired by prior work (Stolte et al.). Since this is an exploratory study, we deliberately cast a wide net, using commonly adopted metrics even when their link to robustness isn't fully established. Our goal is to spot unexpected patterns that could spark future research directions. This includes the following domains (a sketch of how these metrics can be computed follows the list):

  • Semantic and Distributional Alignment: One dimension we explored was semantic coherence between prompts and their expected responses within a dataset. To do this, we used vector embeddings of both the prompt and the expected output to analyze factors like cosine similarity, Euclidean distance, and KL divergence.
  • Linguistic and Readability Features: Textual clarity might also play a role in model robustness. To explore this, we looked at the token count of prompts and expected responses, the type-token ratio to capture lexical diversity, and the Flesch-Kincaid readability score to evaluate whether readability plays a role in misalignment after fine-tuning.
  • Affective and Value Alignment: We assessed the effect of the data's emotional and ethical tone on misalignment using sentiment and toxicity scores.
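Below is a minimal sketch of how such characteristics can be computed per prompt-response pair, assuming sentence-transformers for embeddings, textstat for readability, and NLTK's VADER for sentiment; the exact tools (and the toxicity classifier) used in the paper may differ.

```python
# Illustrative per-example dataset characteristics (tool choices are assumptions).
import numpy as np
import textstat
from nltk.sentiment import SentimentIntensityAnalyzer  # needs vader_lexicon
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
sentiment = SentimentIntensityAnalyzer()

def prompt_response_similarity(prompt: str, response: str) -> float:
    """Cosine similarity between prompt and response embeddings."""
    p, r = embedder.encode([prompt, response])
    return float(np.dot(p, r) / (np.linalg.norm(p) * np.linalg.norm(r)))

def type_token_ratio(text: str) -> float:
    """Lexical diversity: unique tokens divided by total tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / max(len(tokens), 1)

def characteristics(prompt: str, response: str) -> dict:
    return {
        "cosine_similarity": prompt_response_similarity(prompt, response),
        "prompt_ttr": type_token_ratio(prompt),
        "response_ttr": type_token_ratio(response),
        "prompt_tokens": len(prompt.split()),
        "response_tokens": len(response.split()),
        "prompt_readability": textstat.flesch_kincaid_grade(prompt),
        "prompt_sentiment": sentiment.polarity_scores(prompt)["compound"],
        # Toxicity would come from an off-the-shelf classifier (e.g. Detoxify),
        # omitted here to keep the sketch dependency-light.
    }

# Dataset-level values are then the averages of these per-example scores.
```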

How do the characteristics correlate to adversarial vulnerability?

After computing these metrics, we assessed Spearman (Spearman, 1904) correlations between each metric and attack success rate to capture nonlinear, monotonic relationships. We found six statistically significant correlations (p < 0.05, |r| > 0.6); a minimal sketch of this analysis follows the list:

  1. The Type-Token Ratio of Prompts shows a moderate positive correlation, implying that higher lexical diversity in prompts may exacerbate model misalignment.
  2. Prompt Toxicity shows a strong positive correlation as well: datasets with highly toxic prompts increase model misalignment and adversarial vulnerability.
  3. Prompt Sentiment shows a negative correlation with ASR, suggesting that fine-tuning prompts with more positive sentiment may help preserve model alignment, while more emotionally negative prompts may increase vulnerability.
  4. Token Count of Responses shows a strong positive correlation with the fine-tuned model's ASR, suggesting that models trained on datasets with longer responses are more vulnerable to adversarial attacks.
  5. Response Toxicity shows a strong positive correlation as well: datasets with highly toxic responses increase model misalignment and vulnerability.
  6. The Type-Token Ratio of Responses shows a strong negative correlation with ASR. This implies that models trained on more lexically diverse (less repetitive) responses may better preserve alignment.
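Here is a minimal sketch of the correlation analysis, assuming a pandas DataFrame with one row per fine-tuned model holding its dataset-level characteristics and measured ASR (column names are illustrative).

```python
# Spearman correlation screen between dataset characteristics and ASR.
import pandas as pd
from scipy.stats import spearmanr

def significant_correlations(df: pd.DataFrame,
                             target: str = "attack_success_rate",
                             alpha: float = 0.05,
                             min_abs_r: float = 0.6) -> pd.DataFrame:
    """Return features whose Spearman correlation with ASR passes both thresholds."""
    rows = []
    for feature in df.columns.drop(target):
        rho, p = spearmanr(df[feature], df[target])
        if p < alpha and abs(rho) > min_abs_r:
            rows.append({"feature": feature, "spearman_rho": rho, "p_value": p})
    if not rows:
        return pd.DataFrame(columns=["feature", "spearman_rho", "p_value"])
    return pd.DataFrame(rows).sort_values("spearman_rho", ascending=False)
```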


Do interventions in identified features affect adversarial misalignment?

To probe causal links between dataset characteristics and adversarial vulnerability, we ablated the top or bottom 20% of samples, ranked by the harmfully correlated feature values, in two datasets: Cybersecurity and CB Harmful (Zou et al.). Each ablated version targeted one of the six features identified through our correlation analysis. We fine-tuned models on these modified datasets and evaluated them with the same adversarial setup for consistency; a minimal sketch of the ablation step is shown below.
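The sketch assumes each dataset is held as a list of records that have already been scored on the feature of interest; the feature name and the 20% fraction are illustrative of the procedure described above.

```python
# Illustrative feature ablation: drop the most extreme 20% of samples by one feature.
def ablate_extreme_samples(records, feature, fraction=0.2, remove="top"):
    """Drop the top (or bottom) `fraction` of samples ranked by `feature`."""
    ranked = sorted(records, key=lambda r: r[feature])
    k = int(len(ranked) * fraction)
    if remove == "top":
        return ranked[:len(ranked) - k]  # drop the k highest-scoring samples
    return ranked[k:]                    # drop the k lowest-scoring samples

# Example: remove the 20% most toxic responses before re-fine-tuning,
# then re-run the same HarmBench evaluation on the resulting model.
# filtered = ablate_extreme_samples(cyber_records, "response_toxicity")
```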

Interventions consistently reduced ASR on the Cybersecurity dataset (-5.56%), while CB Harmful showed more varied drops in ASR, with the largest coming from prompt sentiment (-14.89%) and response token count (-8.51%).


The takeaway: Dataset-specific characteristics are linked to adversarial vulnerability of models after fine-tuning


The overarching problem: Real-world use can unintentionally misalign models, and current defenses aren’t enough to stop jailbreaks.

What are the applications of this?

Everyone wants to use LLMs to improve their systems, and fine-tuning is a crucial step in customizing these models to perform well in real-world applications. Businesses and users are increasingly fine-tuning models to provide assistance in domain-specific fields such as:

  • Customer service: Delivering more accurate, context-aware responses
  • Cybersecurity: Helping analysts triage threats and trace them to possible root causes
  • Engineering: Helping users learn concepts within technical fields

However, while fine-tuning offers clear benefits, it also introduces a growing risk: the model’s adversarial vulnerability. Existing research has pointed to this issue, showing how models trained on both benign and harmful datasets can exhibit dangerous vulnerabilities (Qi et. al, He et. al).

Interestingly, most discussion of mitigating such risks revolves around defensive measures, such as better detection of harmful inputs or token perturbations; the datasets themselves receive far less attention. Could adjustments to the training data help prevent misalignment and reduce the risk of harmful outcomes during fine-tuning? If so, which dataset-specific factors are linked to misalignment? This angle remains largely underexplored.

What are some current challenges and next steps we can take?

Existing approaches aim to reduce misalignment through techniques such as machine unlearning (Bourtoule et al., Nguyen et al., Bae et al.) and adversarial fine-tuning (Sheshadri et al.). Jain et al. compared fine-tuning an LLM to modifying a "wrapper" around a general-purpose set of capabilities; but just as the books a student reads influence their knowledge and actions, the data used for fine-tuning plays a critical role in shaping the model's outputs. Beyond adding defensive measures to our alignment approaches, we suggest that more interventions are needed at the level of the training data.

If we want to truly minimize the risk of misalignment, it’s time to rethink the fine-tuning process. A deeper focus on the types of datasets used, and their potential for accidental misalignment, could be the key to making models safer without sacrificing their performance. More causal studies to solidify the link between dataset characteristics and adversarial misalignment can help identify alignment-friendly dataset filtering strategies, ultimately contributing to preserving safety throughout the widespread usage of LLMs. 

Contact

We welcome any questions, ideas, and discussions about further implications of our takeaway and possible research directions under this post. You can also email the first author at punya.pandey@mail.utoronto.ca, cc'ing zjin@cs.toronto.edu and kellin@far.ai.