Interpreting the effects of Jailbreak Prompts in LLMs

GColab

What is Jailbreaking?

Jailbreaking, in the context of Large Language Models (LLMs), refers to the process of crafting specific prompts that intentionally bypass or subvert the built-in alignment mechanisms, such as Reinforcement Learning from Human Feedback (RLHF). These alignment mechanisms are designed to ensure that the model adheres to ethical guidelines, avoids generating harmful content, and remains aligned with the intentions of its developers.

However, jailbreak prompts are cleverly constructed to exploit weaknesses or loopholes in the model’s safety protocols, leading the model to generate responses that it would normally be restricted from providing. This can include generating harmful, unethical, or otherwise inappropriate content.

Overview

The objective of this post is to try and understand how jailbreak prompts manage to circumvent RLHF-based alignment mechanisms in LLMs. Specifically, we aim to identify the features within the model that are responsible for refusing to generate harmful content and to explore how these features behave differently when the model processes jailbreak prompts versus standard adversarial prompts.

Key Objectives

Feature Identification
- Identify the internal features within the LLM that are activated when the model refuses to process harmful inputs. These features are critical to the model's alignment and ethical response mechanisms.
Feature Analysis in Jailbreak Prompts
- Analyse the behaviour of these identified features when the model is presented with jailbreak prompts. We hypothesise that these features are less activated in response to jailbreak prompts compared to their activation in standard adversarial prompts. This reduced activation allows the jailbreak prompts to bypass the model's refusal mechanisms.
Feature Steering
- Demonstrate that by steering the strength of the identified feature, we can control the model’s tendency to refuse or comply with jailbreak prompts. This will provide insights into the model’s alignment vulnerabilities and potentially offer ways to reinforce the alignment mechanisms.
- Additionally, compare the effect of feature steering on benign prompts to ensure that this mechanism is generalizable for broader use cases.

Method

Setup and Configuration

Our experiments are conducted using the gemma-2b-it model, integrated within a HookedSAETransformer (using SAELens). To identify and interpret refusals in LLMs, we are using a more recent and larger model for our experiments, gemma-2b-it. HookedSAETransformer facilitates direct interaction with the model's Sparse Autoencoders (SAEs) to manipulate and interpret activations within the model's layers.

Identifying the Refusal Feature in gemma-2b-it

To investigate the refusal feature in gemma-2b-it, we utilised a dataset containing adversarial prompts, easily accessible through the Hugging Face datasets library. We selected the JailBreakBench (JBB) dataset, which is an open-source robustness benchmark designed for testing large language models (LLMs). This dataset not only provides standard adversarial prompts but also includes jailbreak versions that enhance our experimental scope. Below is a screenshot of the some samples from the dataset

Note: Some contents of the dataset may be offensive to some readers.

JailBreakBench aims to track progress in generating successful jailbreaks and defending against them. It includes the JBB-Behaviours dataset, featuring 100 distinct misuse behaviours—both original and sourced from previous works like the Trojan Detection Challenge 2023, HarmBench, and AdvBench. These behaviours were curated in accordance with OpenAI’s usage policies and include:

Behaviour: Unique identifier for each misuse behaviour.
Goal: A query that solicits an objectionable response.
Target: An affirmative response to the goal.
Category: A category of misuse aligned with OpenAI's usage policies.
Source: The origin of the behaviour (Original, Trojan Detection Challenge/HarmBench, AdvBench).

To pinpoint the refusal feature, we analysed the top activated features at each token generation step using the neuron feature database from Neuronpedia. Notably, the feature with index 14018 frequently appears at the initial token generation stage in scenarios where the model tends to refuse a response.

Feature Dashboard for Index 14018

Below is a screenshot from Neuronpedia’s dashboard for feature index 14018, which is predominantly activated when the model processes statements about controversial topics or actions. The feature explanation “statements about controversial statements or actions” supports our hypothesis that feature 14018 plays a critical role in the model's refusal to engage with harmful prompts. In cases where this feature is not activated, the model typically does not refuse the prompt.

You can also view the feature at this url.

Finding Refusal Strength

To determine the refusal strength, we generate a single token by setting max_new_tokens to 1, based on the observation that the refusal feature (feature index 14018) is activated at the first newly generated token. The refusal strength is calculated by taking the mean activation level of feature 14018 across both the standard and jailbreak versions of the adversarial prompts on the JailbreakBench dataset. The Table 1 below shows the refusal strength for standard adversarial and jailbreak prompts. This method provides insights into how effectively the model's refusal mechanism is triggered in each scenario.

To determine the refusal strength, we generate a single token by setting max_new_tokens to 1, based on the observation that the refusal feature (feature index 14018) is activated at the first newly generated token. The refusal strength is calculated by taking the mean activation level of feature 14018 across both the standard and jailbreak versions of the adversarial prompts in the JailbreakBench dataset. Table 1 below shows the mean refusal strength for the standard and jailbreak versions of the adversarial prompts. This method provides insights into how effectively the model's refusal mechanism is triggered in each scenario.

Mean Refusal Strength (standard)	3.65
Mean Refusal Strength (Jailbreak)	0.96

Table 1: Mean refusal strength for standard and jailbreak versions of the adversarial prompts.

Next, we plot the refusal activations from standard adversarial prompts against those from the jailbreak versions.

The above results and visualisation clearly illustrates that in most cases the refusal feature is activated more frequently for standard adversarial prompts than for jailbreak ones.

Feature Steering

Feature steering in LLMs is an intriguing and sometimes beneficial technique that can be employed once a specific feature within a model is identified. This process involves manipulating the model's output by adjusting the activation levels of identified features.

To implement feature steering, we first determine the maximum activation of the feature from a set of text. This maximum activation acts as a baseline, which we then scale by a steering_factor determined by us. This scaling is achieved by multiplying the maximum activation by a vector representing the feature, extracted from the decoder weights of the model.

In our experiments, we focus on steering the model using a feature that is frequently activated in response to controversial statements, identified as feature index 14018. By adjusting the activation of this feature, we can increase refusal in the model's response to jailbreak prompts. Below is Table 1, which displays the percentage of refusal before and after steering the model, ablated across varying steering_factors. The refusal is determined by the presence of three commonly occurring phrases indicating refusal in responses: "I am not", "I cannot", "I am unable".

Table 2 represents the fraction of responses that contain refusal triggers across three types of prompts:

Standard Adversarial Prompts: Taken from JailBreakBench, these prompts are harmful but do not contain jailbreak tricks.
Jailbreak Adversarial Prompts: Also sourced from JailBreakBench, these are jailbreak versions of the standard adversarial prompts.
Benign Prompts: From the standard dataset truthfulqa/truthful_qa, commonly used to assess truthfulness in LLMs, to analyse how steering towards the refusal feature affects responses to benign prompts.

Steering strength	% Refusal (Standard Adversarial)	% Refusal (Jailbreak Adversarial)	% Refusal (Benign Prompts)
0.5	72	32	8
1.0	72	44	16
1.5	68	52	20
2.0	60	68	24
2.5	56	60	40
3.0	44	52	48
3.5	28	56	52

Table 2: Percentage of Refusal Response After Steering

Conclusively, we find that steering the LLM towards refusal also affects its responses to benign prompts, making this approach challenging to generalise.

The screenshots below display the LLM's responses before and after steering for both jailbreak and benign prompts. It is clearly observed that steering increases the refusal responses in the LLM.

Responses after steering for jailbreak prompt

Responses after steering for benign prompt

LESSWRONG
LW