Steering LLMs' Behavior with Concept Activation Vectors

Ruixuan Huang

Recently, several studies have reported a mechanism called Activation Steering, which can influence the behavioral style of large language models (LLMs), mainly their refusal behavior ^[1] and language usage ^[2]. This mechanism resembles the safety concept activation vectors (SCAVs) ^[3] we proposed earlier this year. We’ve extended SCAV beyond safety concepts and observed several intriguing phenomena, though some remain unexplained.

Summary of Findings:

CAVs effectively steer LLM output styles for roles like French experts and Python experts, showing strong accuracy and clear separability for these concepts.
We successfully implemented forward and reverse steering for language switching, indicating that certain language concepts in the LLM can be reliably steered.
CAV steering cannot create capabilities that the model does not already have; it can only shift between existing capabilities, much like a system prompt.

Preliminaries

We adopted the pipeline outlined in the SCAV paper. The primary distinction between our approach and existing steering research lies in the automatic, all-layer perturbation algorithm introduced by SCAV, whereas other methods only employ single-layer perturbations and manual magnitude settings. If you are already familiar with SCAV, feel free to skip this section.

Overview of CAV Perturbation Algorithm

Data Collection: Gather two opposing sets of instructions. For example, a positive dataset might include instructions like “How to plant flowers in my garden?” while the negative dataset might include “Comment planter des fleurs dans mon jardin?” For best results, each dataset should contain more than 50 instructions.
LLM Selection: Choose a target LLM, such as Llama-3-8B, which has better multilingual capabilities than Llama-2-7B. Collect the final token embeddings from each layer for these instructions.
Classifier Training: Train a linear classifier on these embeddings with their corresponding labels for each layer.
Text Generation Perturbation: Use the trained classifier parameters to perturb a typical text generation process.
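The classifier-training step can be sketched as follows. This is a minimal illustration, not the SCAV implementation: it assumes the final-token embeddings have already been extracted into an array of shape (num_layers, num_instructions, hidden_dim), uses a plain logistic regression fitted by gradient descent as the linear classifier, and runs on synthetic data. The function names, hyperparameters, and the 80/20 split are all assumptions made for the example.

```python
import numpy as np


def train_linear_probe(X, y, lr=0.1, epochs=500, l2=1e-3):
    """Fit logistic regression (w, b) by gradient descent; y holds 0/1 labels."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / n + l2 * w)
        b -= lr * float(np.mean(p - y))
    return w, b


def probe_accuracy(X, y, w, b):
    """Fraction of rows whose predicted label (logit > 0) matches y."""
    return float(np.mean(((X @ w + b) > 0).astype(int) == y))


def train_probes_per_layer(embeddings, labels, test_fraction=0.2, seed=0):
    """Train one probe per layer and record its held-out accuracy."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(labels.shape[0])
    n_test = max(1, int(labels.shape[0] * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    probes = []
    for layer_emb in embeddings:
        w, b = train_linear_probe(layer_emb[train_idx], labels[train_idx])
        acc = probe_accuracy(layer_emb[test_idx], labels[test_idx], w, b)
        probes.append({"w": w, "b": b, "test_acc": acc})
    return probes


# Synthetic stand-in for real embeddings: the two classes drift apart
# along one direction as the layer index grows (layer 0 has no signal).
rng = np.random.default_rng(0)
num_layers, n_per_class, dim = 4, 60, 16
labels = np.array([1] * n_per_class + [0] * n_per_class)
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)
embeddings = rng.normal(size=(num_layers, 2 * n_per_class, dim))
for layer in range(num_layers):
    embeddings[layer] += np.outer(2 * labels - 1, direction) * (1.5 * layer)

probes = train_probes_per_layer(embeddings, labels)
for layer, probe in enumerate(probes):
    print(f"layer {layer}: test accuracy = {probe['test_acc']:.2f}")
```

With real models, the `embeddings` array would be filled from the hidden states of the final token at each layer instead of random draws.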

Perturbation Process Details

Assuming classifiers have been trained for each layer, we perform perturbations sequentially during token generation:

For each layer n, we first evaluate the test accuracy of the corresponding classifier. If accuracy is low, indicating the concept hasn't yet fully formed in this layer, we skip perturbation to minimize the impact on the LLM’s overall abilities. Otherwise, if accuracy exceeds a set threshold, we apply the perturbation.
Let the classifier parameters be $θ$, and the original layer embedding be $e$. If $f_{θ}$ predicts a positive label, we reduce the probability of this label to a smaller value, $P_{0}$. The perturbation can be described as $e \leftarrow e + g(θ, e, P_{0})$, where $g(\cdot)$ is a closed-form function. Applying this consistently at every qualifying layer shifts the LLM’s “recognition” of the input, so that the output reflects the characteristics of the negative label.
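To make the role of $g$ concrete, here is one closed-form choice, assuming the per-layer classifier is a logistic regression with weight vector `w` and bias `b`: the smallest shift along `w` that moves the predicted probability to exactly $P_{0}$. This is a sketch under that assumption and not necessarily the exact function used in SCAV. The rule that $P_{0} < 0.5$ lowers the positive-label probability and $P_{0} \geq 0.5$ raises it, the `acc_threshold` value, and the `probe` dictionary layout are also assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def logit(p):
    return np.log(p / (1.0 - p))


def perturb_embedding(e, w, b, p0):
    """Shift e along w so that sigmoid(w.e + b) equals p0, if not already past it."""
    z = float(w @ e + b)
    p = sigmoid(z)
    # Assumption: p0 < 0.5 pushes toward the negative label,
    # p0 >= 0.5 pushes toward the positive label.
    already_there = p <= p0 if p0 < 0.5 else p >= p0
    if already_there:
        return e.copy()
    # Minimal-norm solution of w.(e + delta) + b = logit(p0).
    shift = (logit(p0) - z) / float(w @ w)
    return e + shift * w


def maybe_perturb(e, probe, p0, acc_threshold=0.9):
    """Skip layers whose probe is not accurate enough; otherwise perturb."""
    if probe["test_acc"] < acc_threshold:
        return e.copy()
    return perturb_embedding(e, probe["w"], probe["b"], p0)


# Toy check with a random probe and an embedding on its positive side.
rng = np.random.default_rng(0)
w = rng.normal(size=8)
b = 0.3
e = 2.0 * w / np.linalg.norm(w)

good_probe = {"w": w, "b": b, "test_acc": 0.99}
weak_probe = {"w": w, "b": b, "test_acc": 0.60}

e_steered = maybe_perturb(e, good_probe, p0=0.05)
e_skipped = maybe_perturb(e, weak_probe, p0=0.05)
print("before:", sigmoid(w @ e + b), "after:", sigmoid(w @ e_steered + b))
```

In an actual forward pass, `maybe_perturb` would be called on each layer's hidden state in order during generation, so that later layers see the already-perturbed activations.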

Explorations with Other Concepts

We initially reported on steering the “safety” concept to prevent jailbreaking. Naturally, we extended our experiments to other concepts, such as French Expert and Python Expert.

Python expert

Since the French expert gets a longer discussion below, we first report the results of steering the LLM to act as a Python expert.

The positive dataset consists of instructions considered benign in jailbreak tasks. The instructions in the negative dataset follow the format of either “[prefix] [instruction]” or “[instruction] [suffix]”, with prefixes or suffixes randomly applied to positive dataset instructions to ensure diversity. For this task, we used prefixes and suffixes such as “Use Python to demonstrate your answer.”
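The dataset construction above can be sketched in a few lines. This is an illustrative helper, not the script used for the experiments: the function name, the 50/50 choice between prefix and suffix, and the second cue string are hypothetical; only “Use Python to demonstrate your answer.” comes from the text.

```python
import random


def build_negative_dataset(positive_instructions, cues, seed=0):
    """Attach a randomly chosen cue to each instruction as a prefix or a suffix."""
    rng = random.Random(seed)
    negatives = []
    for instruction in positive_instructions:
        cue = rng.choice(cues)
        if rng.random() < 0.5:
            negatives.append(f"{cue} {instruction}")  # "[prefix] [instruction]"
        else:
            negatives.append(f"{instruction} {cue}")  # "[instruction] [suffix]"
    return negatives


positives = [
    "Share a solution to make American coffee.",
    "How to plant flowers in my garden?",
    "How to make cookies?",
]
cues = [
    "Use Python to demonstrate your answer.",
    "Answer with Python code.",  # hypothetical extra cue for diversity
]

negatives = build_negative_dataset(positives, cues)
for pos, neg in zip(positives, negatives):
    print(f"{pos!r} -> {neg!r}")
```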

Before training classifiers, we applied PCA reduction to evaluate the separability of the embeddings of these instructions. We fit PCA on the dataset and observed good linear separability, especially for early-to-middle layers. However, in the later layers, embeddings started to overlap. Despite this overlap, both the intermediate layers with good separability and the later layers with poor separability yielded classifiers with a test accuracy of 99%. This discrepancy between PCA results and classifier performance will be discussed in detail later.
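A PCA plot and a linear classifier can disagree for a purely geometric reason, and the toy example below shows one way that can happen. It is a synthetic sketch, not an analysis of Llama-3 embeddings: the class signal is placed in a low-variance direction while two unrelated directions carry most of the variance, so the top-2 PCA projection looks overlapped even though a linear classifier on the full space separates the classes. Whether this is what happens in the later layers here is not established. The data layout, the least-squares classifier, and all names are assumptions for illustration.

```python
import numpy as np


def pca_project(X, k=2):
    """Project centered data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T


def linear_fit_accuracy(X_train, y_train, X_test, y_test):
    """Least-squares linear classifier (targets +/-1) and its test accuracy."""
    A = np.hstack([X_train, np.ones((len(X_train), 1))])
    coef, *_ = np.linalg.lstsq(A, 2.0 * y_train - 1.0, rcond=None)
    B = np.hstack([X_test, np.ones((len(X_test), 1))])
    pred = (B @ coef > 0).astype(int)
    return float(np.mean(pred == y_test))


rng = np.random.default_rng(0)
n, dim = 400, 10
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, dim))
X[:, :2] *= 10.0                          # two high-variance nuisance directions
X[:, -1] = 0.1 * X[:, -1] + (2 * y - 1)   # low-variance direction carrying the label

train = np.arange(n) < n // 2
test = ~train

acc_full = linear_fit_accuracy(X[train], y[train], X[test], y[test])
X2 = pca_project(X, k=2)
acc_pca = linear_fit_accuracy(X2[train], y[train], X2[test], y[test])
print(f"full space: {acc_full:.2f}, top-2 PCA projection: {acc_pca:.2f}")
```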

The trained classifiers effectively steered the LLMs’ behavior. Below are some examples demonstrating the steering effect:

Example #1: Share a solution to make American coffee.

Response of Llama-3-8B-Instruct without perturbation:

Response of Llama-3-8B-Instruct with perturbation:

The response with perturbation included Python code, despite no explicit mention in the input ("Share ... coffee." is the entire input, apart from SFT templates).

While the response did contain Python code, the quality of the generated text was slightly affected. There were some repetitions in the response, such as redundant explanations and outputs. The code itself was somewhat simplistic, only using steps for function naming. However, the main objective was to make the LLM provide a response in Python code, regardless of its quality.

Example #2: How to solve the Hannota problem?

Let's try an instruction that can be answered both in the usual way and programmatically.

Response of Llama-3-8B-Instruct without perturbation:

Response of Llama-3-8B-Instruct with perturbation:

By applying Python expert perturbation, the LLM successfully provided a Python code solution for the Tower of Hanoi problem.

Interestingly, it was challenging to steer the LLM away from providing Python code for the query, “How to solve the Hanoi Tower problem? Could you please answer with Python code?” This occurred even though such suffixes were used during CAV training. The added suffix seemed crucial in transforming the LLM’s behavior, implying that it might be impossible to achieve this steering effect without some degradation in capability^[4].

French expert

Before selecting a concept to train CAVs, it’s essential to determine if the LLM can exhibit balanced capabilities in that concept. Specifically, to train French CAVs, we must ensure that the target LLM can respond in both English and French. For this reason, we chose Llama-3-8B-Instruct for our experiments. Earlier models like Llama-2-7B-Chat, despite demonstrating good separability between English and French instructions in their latent space, lacked sufficient proficiency in French to effectively run the pipeline. This could be observed when French inputs led to English responses^[5].

As Llama-3-8B-Instruct is an LLM with multilingual abilities, we explored two different methods to elicit French responses (using “How to plant flowers in my garden?” as an example):

Direct Use of French Queries: Using “Comment planter des fleurs dans mon jardin?” as input.
Requesting French Responses: Using “How to plant flowers in my garden? Answer the question in French, please.” as input.

We trained classifiers using each of these formats for constructing a negative dataset, which allowed us to achieve behavior steering, where LLMs would respond directly in English when faced with English queries.

Format #1: Direct Use of French Queries

Positive instruction: How to plant flowers in my garden?
Negative instruction: Comment planter des fleurs dans mon jardin?

In this setting, we observed good separability for the early layers (e.g., layers 1 to 7), while separability decreased in later layers compared to the Python expert setup. Despite this, the trained classifiers still achieved test accuracy above 98% in all layers except layer 0.

After training the classifiers, we tested with the input “How to make cookies?” The perturbed response from Llama-3-8B-Instruct (with $P_{0} = 0.05$) was as follows:

Although my French is not perfect, the output appears to be largely in French. Interestingly, the first few words (“Making cookies is a…”) were still in English, possibly due to the initial English context preventing an immediate shift to French. This contrasts slightly with previous SCAV results, where the first token output after perturbation aligned well with the desired change.

Using this CAV, we can also steer Llama-3 to respond in English when given French input. For instance, with “Comment faire des cookies?” as input and setting $P_{0} = 0.95$:

Responses before perturbation: Llama-3 responded in French, as expected, given the French input.

Responses after perturbation: Llama-3 responded in English. Notably, unlike the earlier inconsistency where the first few words remained unchanged, the entire response successfully switched to English. This may indicate that the inherent language capabilities of the LLMs affect this behavior.

Format #2: Requesting French Responses

Instead of directly providing French instructions, we can elicit French responses by appending a request, such as “Answer in French, please,” to English instructions. We repeated the same experiment to evaluate steering performance.

The PCA analysis results and trends were similar to those seen with Format #1.

Despite all training instructions being in English, we also tested the classifiers with French input. The perturbed response produced the following:

Interestingly, the first two words (“Les cookies!”) also displayed the same transformation issue mentioned earlier. It’s unclear why this happens, but we return to the formal experimental setup for further discussion.

Input: "How to make cookies? Answer in French."

We set $P_{0} = 0.95$, which means the perturbed response is expected in English. The outputs above were shown in full partly to demonstrate that this steering method does not affect the performance and output quality of LLMs. For brevity, only part of each output is shown below.

Response before perturbation:

Les cookies! Voici une recette simple pour faire des cookies délicieux :

Ingrédients :

* 250 g de farine
* 100 g de sucre
* 50 g de beurre fondu
* 1 œuf
* 1 cuillère à soupe de vanille
* 1/2 cuillère à café de sel
* Des pépites de chocolat (facultatif)

Response after perturbation:

Here's a recipe for making cookies in French:

**Ingredients:**

* 2 cups of all-purpose flour (2 cups)
* 1 teaspoon of baking soda (1 teaspoon)
* 1 teaspoon of salt (1 teaspoon)
* 1 cup of unsalted butter, softened (1 cup)
* 3/4 cup of white granulated sugar (3/4 cup)
* 1 egg (1 egg)
* 1 teaspoon of vanilla extract (1 teaspoon)
* Optional: chocolate chips, nuts, or dried fruit (optional)

Although the output was entirely in English, the initial sentence still mentioned “in French,” suggesting an incomplete steering effect. Nevertheless, apart from this, the steering performed as intended.

Input: "How to make cookies?"

We set $P_{0} = 0.05$, which means the perturbed response is expected in French.

Response before perturbation:

Making cookies is a classic activity that's easy to do and always a treat! Here's a simple recipe to get you started:

**Classic Chocolate Chip Cookies Recipe**

Ingredients:

* 2 1/4 cups all-purpose flour
* 1 tsp baking soda
* 1 tsp salt
* 1 cup unsalted butter, at room temperature
* 3/4 cup white granulated sugar
* 3/4 cup brown sugar
* 2 large eggs
* 2 cups semi-sweet chocolate chips

Response after perturbation:

Who ne peut pas résister aux biscuits délicieux? (Who can't resist to the delicious cookies?) Here's a simple recipe to make some délicieux biscuits (delicious cookies) :

**Ingrédients (Ingredients)**

* 250 g de farine (250 g de flour)
* 100 g de sucre (100 g de sugar)
* 20 g de sel (20 g de sel)
* 20 g de beurre (20 g de beurre)
* 2 œufs (2 eggs)
* 1 cuillère à café (1 tablespoon de café) (optional, for les biscuits au café (coffee biscuits))
* 1 cuillère de vanille (1 tablespoon de vanille) (optional, for les biscuits à la vanille (vanille biscuits))

Despite fulfilling the French output requirement, another interesting observation was that the model often felt compelled to explain its French output in English. This was not observed when reversing the setup, suggesting that the model doesn’t feel the same need to explain English in French. This behavior might be attributed to differences in the LLM’s language capabilities.

More Discussion

Through the two experiments described above, we observed that both settings could effectively extract a French concept and achieve forward and reverse behavior steering. This raised an interesting question regarding the uniqueness of the French concept. The success in both directions suggests a relationship akin to a sufficient and necessary condition, implying that the French concept within LLMs may be unique. To explore this further, we extended the experiment settings as follows:

Dataset A: Normal instructions.
- How to plant flowers in my garden?
Dataset B: Directly use French.
- Comment planter des fleurs dans mon jardin?
Dataset C: Request using French.
- How to plant flowers in my garden? Answer the question in French please.
Dataset D: Request using English.
- Comment planter des fleurs dans mon jardin? Répondre en anglais.
Dataset E: Unrelated positive dataset.
Dataset F: Unrelated negative dataset.

We trained CAVs using the following pairs of datasets:

Pair #1: A-B
Pair #2: A-C
Pair #3: B-D
Pair #4: C-D
Pair #5: E-F (baseline)

The classifiers trained on these five pairs exhibited good test set accuracy. To further understand their behavior, we examined the cosine similarity between the parameters of these classifiers:

Conclusions:

Despite similar steering results, the classifiers trained seem to be rather different (see the cosine similarity between Pair #1 and #2/#3), suggesting substantial differences in their internal representations.
The cosine similarity between Pair #2 and #3 is relatively high, which aligns with our intuition since these pairs involve similar types of instructions (requests in English or French).
The cosine similarities between the baseline and the other pairs are close to 0, which validates the comparison and indicates that the steering methods were indeed capturing meaningful differences related to the targeted language concept.
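The comparison behind these conclusions amounts to a pairwise cosine-similarity matrix over classifier weight vectors (taken at the same layer). A minimal sketch follows, with toy vectors standing in for the real classifier parameters; the function name and the example vectors are assumptions and do not reproduce the reported values.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    V = np.asarray(vectors, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return V @ V.T


# Toy weight vectors, one per dataset pair (placeholders, not real CAVs).
toy_weights = [
    [1.0, 0.0],  # stands in for one pair's classifier
    [1.0, 1.0],  # partially aligned with the first
    [0.0, 1.0],  # orthogonal to the first
]

sim = cosine_similarity_matrix(toy_weights)
print(np.round(sim, 3))
```

Running this per layer on the actual classifier weights would give one such matrix per layer, which can then be compared across layers.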

Align with Anthropic SAEs

The features extracted by SAEs seem to share certain similarities with the CAVs used in our study. Recently, we came across some research previews indicating that single features can be steered independently. However, we noticed a key difference: features extracted by SAEs are typically isolated within single layers, which limits the ability to automatically discover multi-layer effects. In contrast, our method focuses on multi-layer interactions. Here, we aimed to explore whether this steering method could produce results similar to those reported by Anthropic.

Two specific features from Anthropic’s research that caught my attention were the Arabic feature and the base64 feature. We attempted to steer these features using the same methodology as Format #2 in the French expert experiment. Although the classifiers trained on these features demonstrated high test accuracies, the steering results were not as effective.

For instance, when attempting to steer the model to produce Arabic text, some individual Arabic characters appeared in the responses, but the overall content was incomplete or of lower quality, often being cut off or nonsensical. Despite trying different values for $P_{0}$, we couldn’t achieve the intended behavior consistently. The same was true for the base64 queries.

We believe this limitation is primarily due to the model’s intrinsic capabilities^[6]. When explicitly asked to generate content in Arabic or base64, the LLM struggled, indicating that while the steering method can enhance or switch between the model’s existing abilities, it cannot create new competencies that the model hasn’t been trained on.

Testing the Arabic Concept with Fine-Tuned Models

To further validate this hypothesis, we tested the Arabic concept with a Llama-3 model that was specifically fine-tuned for Arabic language capabilities, namely MohamedRashad/Arabic-Orpo-Llama-3-8B-Instruct.

Positive instructions: "How to plant flowers in my garden?"
Negative instructions: "How to plant flowers in my garden? Answer my question in Arabic scripts."

Input: "How to make cookies? Answer my question in Arabic scripts."

Response before perturbation:

سؤال رائع! لجعل بعض الكوكيز، تحتاج إلى بعض المكونات الأساسية. إليك وصفة بسيطة لكوكيز اللوز والشوكولاتة:

المكونات:

* 2 كوب من الدقيق
* 1 كوب من السكر
* 1/2 كوب من الزبدة المذابة
* 1 بيضة
* 1/2 كوب من اللوز المطحون
* 1/2 كوب من الشوكولاتة المقطعة
* 1/2 ملعقة صغيرة من الملح
* 1/2 ملعقة صغيرة من بيكنج بودر
* 1/4 ملعقة صغيرة من الملح

(Author's note: We truncated the output here. The remaining output simply repeats the sentence above, indicating imperfect performance of the model.)

Response after perturbation:

Here's a recipe for cookies in Arabic:

**Cookies Recipe**

**Ingredients:**

* 2 cups all-purpose flour
* 1 teaspoon baking soda
* 1 teaspoon salt
* 1 cup unsalted butter, softened
* 3/4 cup granulated sugar
* 1/2 cup brown sugar
* 2 large eggs
* 1 teaspoon vanilla extract
* 1 cup chopped walnuts (optional)

Despite some steering, the model’s ability to produce coherent Arabic was inconsistent, and its output was often mixed with English content. Conversely, when I attempted to steer it to respond in Arabic to English instructions, it also showed poor results, with outputs like:

Here's كيفك: كيف: كيف: كيف: ...

This confirms that while the model can reflect certain aspects of the target feature, its ability to fully switch or generate novel content in Arabic is still limited by its original training.

Future Plans

Delving deeper into CAV steering suggests the following directions for further study.

Quick alignment of agent functions. For example, we can identify a translation task concept that consistently prompts the model to translate a given output without requiring additional cues. This is part of replacing system prompts with CAV steering, which not only saves tokens but also prevents the leakage of system prompts.
Effects of multiple CAVs steering simultaneously. If this can be achieved without degrading model performance, it could become a strong competitor to replace system prompts.
As a probe for LLM capabilities. In some unlearning research, it is often unclear whether certain abilities or knowledge still exist after unlearning. If steering effects are linked to model capabilities, CAVs could potentially serve as a probe to detect whether specific LLM abilities are still present.
Mathematical Association with SAE Features. Given that both SAE and CAV can help explain models and manipulate their capabilities, it raises the question of whether a mathematical approach could combine SAE features and CAVs to leverage the strengths of both methods.

If our work benefits your research, please cite it as:

@misc{huang2024steering,
  title={Steering LLMs' Behavior with Concept Activation Vectors},
  author={Huang, Ruixuan},
  year={2024},
  month={September},
  note={Draft manuscript. Available on LessWrong forum.},
}

^[1] One could steer LLMs to answer a question that they would refuse to answer without steering, which can be considered a form of jailbreak attack. It is also possible to steer LLMs to refuse to answer a normal question, which can be considered a simple instance of coarse-grained safety alignment.
^[2] One could steer LLMs to answer a query written in English in a different language (e.g., French) without explicit instruction.
^[3] See "Uncovering Safety Risks of Large Language Models through Concept Activation Vector".
^[4] In the examples above, we set steering probability $P_{0} = 0.9$. However, we get the same output when setting $P_{0} = 0.9$ in the extra case, and get truncated text when setting $P_{0} = 0.99$. In the latter case we achieve the goal, but at the expense of great capability damage.
If we use a milder instruction, "How to solve the Hannota problem with PC?", we get the expected response without code and without capability damage when setting $P_{0} = 0.9$. Strangely, the code section reappears when setting $P_{0} = 0.99$, which should not happen given that it was already gone at the lower $P_{0}$. This highlights the instability of the technique. More fine-grained datasets might solve this. We think this could be useful for some anti-cheating AI systems.
^[5] However, how can we explain the good separability of PCA reductions and the high test accuracy of classifiers?
^[6] See "Benchmarking LLaMA-3 on Arabic Language Generation Tasks" for Arabic ability benchmarking on Llama-3.
^[7] See "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" (Anthropic).