Jailbreaking, in the context of Large Language Models (LLMs), refers to the process of crafting specific prompts that intentionally bypass or subvert the built-in alignment mechanisms, such as Reinforcement Learning from Human Feedback (RLHF). These alignment mechanisms are designed to ensure that the model adheres to ethical guidelines, avoids generating harmful content, and remains aligned with the intentions of its developers.
However, jailbreak prompts are cleverly constructed to exploit weaknesses or loopholes in the model’s safety protocols, leading the model to generate responses that it would normally be restricted from providing. This can include generating harmful, unethical, or otherwise inappropriate content.
The objective of this post is to try and understand how jailbreak prompts manage to circumvent RLHF-based alignment mechanisms in LLMs. Specifically, we aim to identify the features within the model that are responsible for refusing to generate harmful content and to explore how these features behave differently when the model processes jailbreak prompts versus standard adversarial prompts.
Our experiments are conducted using the gemma-2b-it model, integrated within a HookedSAETransformer (using SAELens). To identify and interpret refusals in LLMs, we are using a more recent and larger model for our experiments, gemma-2b-it. HookedSAETransformer facilitates direct interaction with the model's Sparse Autoencoders (SAEs) to manipulate and interpret activations within the model's layers.
To investigate the refusal feature in gemma-2b-it, we utilised a dataset containing adversarial prompts, easily accessible through the Hugging Face datasets library. We selected the JailBreakBench (JBB) dataset, which is an open-source robustness benchmark designed for testing large language models (LLMs). This dataset not only provides standard adversarial prompts but also includes jailbreak versions that enhance our experimental scope. Below is a screenshot of the some samples from the dataset
Note: Some contents of the dataset may be offensive to some readers.
JailBreakBench aims to track progress in generating successful jailbreaks and defending against them. It includes the JBB-Behaviours dataset, featuring 100 distinct misuse behaviours—both original and sourced from previous works like the Trojan Detection Challenge 2023, HarmBench, and AdvBench. These behaviours were curated in accordance with OpenAI’s usage policies and include:
To pinpoint the refusal feature, we analysed the top activated features at each token generation step using the neuron feature database from Neuronpedia. Notably, the feature with index 14018 frequently appears at the initial token generation stage in scenarios where the model tends to refuse a response.
Below is a screenshot from Neuronpedia’s dashboard for feature index 14018, which is predominantly activated when the model processes statements about controversial topics or actions. The feature explanation “statements about controversial statements or actions” supports our hypothesis that feature 14018 plays a critical role in the model's refusal to engage with harmful prompts. In cases where this feature is not activated, the model typically does not refuse the prompt.
You can also view the feature at this url.
To determine the refusal strength, we generate a single token by setting max_new_tokens
to 1, based on the observation that the refusal feature (feature index 14018) is activated at the first newly generated token. The refusal strength is calculated by taking the mean activation level of feature 14018 across both the standard and jailbreak versions of the adversarial prompts on the JailbreakBench dataset. The Table 1 below shows the refusal strength for standard adversarial and jailbreak prompts. This method provides insights into how effectively the model's refusal mechanism is triggered in each scenario.
To determine the refusal strength, we generate a single token by setting max_new_tokens
to 1, based on the observation that the refusal feature (feature index 14018) is activated at the first newly generated token. The refusal strength is calculated by taking the mean activation level of feature 14018 across both the standard and jailbreak versions of the adversarial prompts in the JailbreakBench dataset. Table 1 below shows the mean refusal strength for the standard and jailbreak versions of the adversarial prompts. This method provides insights into how effectively the model's refusal mechanism is triggered in each scenario.
Mean Refusal Strength (standard) | 3.65 |
Mean Refusal Strength (Jailbreak) | 0.96 |
Table 1: Mean refusal strength for standard and jailbreak versions of the adversarial prompts.
Next, we plot the refusal activations from standard adversarial prompts against those from the jailbreak versions.
The above results and visualisation clearly illustrates that in most cases the refusal feature is activated more frequently for standard adversarial prompts than for jailbreak ones.
Feature steering in LLMs is an intriguing and sometimes beneficial technique that can be employed once a specific feature within a model is identified. This process involves manipulating the model's output by adjusting the activation levels of identified features.
To implement feature steering, we first determine the maximum activation of the feature from a set of text. This maximum activation acts as a baseline, which we then scale by a steering_factor determined by us. This scaling is achieved by multiplying the maximum activation by a vector representing the feature, extracted from the decoder weights of the model.
In our experiments, we focus on steering the model using a feature that is frequently activated in response to controversial statements, identified as feature index 14018. By adjusting the activation of this feature, we can increase refusal in the model's response to jailbreak prompts. Below is Table 1, which displays the percentage of refusal before and after steering the model, ablated across varying steering_factors. The refusal is determined by the presence of three commonly occurring phrases indicating refusal in responses: "I am not", "I cannot", "I am unable".
Table 2 represents the fraction of responses that contain refusal triggers across three types of prompts:
Steering strength | % Refusal (Standard Adversarial) | % Refusal (Jailbreak Adversarial)
| % Refusal (Benign Prompts)
|
0.5 | 72 | 32 | 8 |
1.0 | 72 | 44 | 16 |
1.5 | 68 | 52 | 20 |
2.0 | 60 | 68 | 24 |
2.5 | 56 | 60 | 40 |
3.0 | 44 | 52 | 48 |
3.5 | 28 | 56 | 52 |
Table 2: Percentage of Refusal Response After Steering
Conclusively, we find that steering the LLM towards refusal also affects its responses to benign prompts, making this approach challenging to generalise.
The screenshots below display the LLM's responses before and after steering for both jailbreak and benign prompts. It is clearly observed that steering increases the refusal responses in the LLM.
Responses after steering for jailbreak prompt
Responses after steering for benign prompt