Steering Language Models in Multiple Directions Simultaneously

lukemarks; Narmeen; Amirali Abdullah

Narmeen developed, ideated and validated K-steering at Martian. Luke generated the baselines, figures and wrote this blog post. Amir proposed the research direction and supervised the project. The full interactive blog will be available closer to the publication of the complete paper on the Martian website.

TL;DR: We introduce K-steering, a steering method for language models that allows for steering in multiple simultaneous directions. Our preliminary results show it outperforms a contrastive activation addition (CAA) baseline.

Introduction

We introduce K-Steering, a method for steering language models in multiple directions simultaneously by perturbing activations according to the logits of a multilabel classifier. We experiment with steering conversational tone, showing that K-steering can cause a classifier to give high probabilities to multiple labels/tones. Qualitatively, we find that models steered with a K-steering vector respond well in several tones at once.

Current steering methods such as ActAdd and CAA are designed to elicit a single behavior based on contrastive pairs of inputs. Unsupervised steering methods such as deep causal transcoding allow for learning many steering vectors, some of which may elicit multiple behaviors. However, there has been limited work on methods for learning steering vectors that elicit multiple specified behaviors, which K-Steering achieves. Although we experiment only with conversational tone, we expect K-steering to work for most tasks for which an accurate classifier can be trained.

A visualization of the K-steering method. For an activation vector $A$, we calculate a steering loss $L$ that penalizes higher logits from a classifier on $A$ for undesired labels and rewards higher logits for desired labels. By backpropagating this loss through the classifier, we obtain the steered activations $A^{'} = A - α \nabla_{A} L$.

As an example, we give the following steered and unsteered generations to the prompt "What is the structure of the human heart?":

Llama-3.2-3B-Instruct Unsteered

Step 1: Identify the main components of the human heart

The human heart is a muscular organ that pumps blood throughout the body. It is divided into four main chambers: the left and right atria, and the left and right ventricles.

Step 2: Describe the structure of the atria

The atria are the upper chambers of the heart...

Llama-3.2-3B-Instruct K-Steered
(toward a casual and concise tone, and away from an expert tone)

The human heart is a pretty amazing organ, and its structure is pretty cool too. Here's a breakdown of its main parts:

Septum: The septum is a wall of tissue that separates the left and right sides of the heart. It's like a big divider that keeps the two sides of the heart separate.
Chambers: The heart has four chambers...

Method

We first create a synthetic dataset of questions using GPT-4o with the following prompt:

Generate {prompts_per_category} diverse, high-quality questions that fall under the category: "{category}".
IMPORTANT: Create questions that could be meaningfully answered in multiple different tones, such as:
helpful: balanced and informative
expert: technically precise with domain terminology
casual: conversational and friendly
cautious: highlighting limitations and uncertainties
empathetic: emotionally attuned and validating
concise: minimalist and efficient
Each generated question should:
Be a clear and well-formed question ending with a question mark
Be tone-neutral (able to be answered well in any of the tones)
Avoid numbering or special formatting
Focus on creating questions where the SAME question can receive meaningfully different responses depending on which tone is used to answer.

We generate 3500 prompts for these 6 tones across 18 categories such as technical questions, personal advice and nutritional advice. The model that will be steered is then given a system prompt describing a conversational tone, and is prompted with the questions generated by GPT-4o. As an example, this is the system prompt we use for the 'casual' tone:

KEEP IT CASUAL AND CONVERSATIONAL! Write like you're texting a friend - use slang, contractions, and an upbeat vibe. Feel free to throw in some humor or pop culture references. Skip the formalities and technical jargon completely. Use short, punchy sentences. Maybe even drop in a few exclamation points or emojis where it feels natural! Don't worry about covering every detail - focus on making your response fun and easy to read. Just chat about the main points as if you're having a relaxed conversation.

The tone the model is instructed to respond in forms the label for the activations collected from the model. Using those labels and the activations from the final token generated by the model, we train a $K$-label classifier

$f (x) = W^{(3)} σ (W^{(2)} σ (W^{(1)} x + b^{(1)}) + b^{(2)}) + b^{(3)}$ ,

by minimizing the cross-entropy loss.
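
As a rough illustration of the classifier above, the following NumPy sketch implements the forward pass of $f$ and a cross-entropy loss. Several details are assumptions made for the example, not taken from the post: $σ$ is taken to be tanh, the hidden width and initialization are arbitrary, and the loss is a per-label sigmoid cross-entropy. The last choice is one reading of "multilabel classifier"; a softmax cross-entropy over a single tone label is another plausible reading. All function names are hypothetical.

```python
import numpy as np

def init_classifier(d_in, d_hidden, k, seed=0):
    """Random weights for a 3-layer MLP (sizes and init scale are assumptions)."""
    rng = np.random.default_rng(seed)
    sizes = [d_in, d_hidden, d_hidden, k]
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def classifier_logits(params, x):
    """f(x) = W3 s(W2 s(W1 x + b1) + b2) + b3, with s = tanh (assumed)."""
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return W3 @ h2 + b3

def multilabel_cross_entropy(logits, targets):
    """Mean binary cross-entropy over labels, computed stably from logits.

    Uses the identity BCE(z, y) = log(1 + exp(z)) - y * z.
    """
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return float(np.mean(np.logaddexp(0.0, logits) - targets * logits))

# Toy usage: one activation vector labeled with the tone at index 2.
params = init_classifier(d_in=8, d_hidden=16, k=6)
x = np.random.default_rng(1).normal(size=8)
one_hot = np.eye(6)[2]
loss = multilabel_cross_entropy(classifier_logits(params, x), one_hot)
```

Training would minimize this loss over the labeled activations with any standard optimizer; that loop is omitted here.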

We then devise a steering loss

$L (x) = - \frac{1}{| T |} \sum_{k \in T} f_{k} (x) + \frac{1}{| A |} \sum_{k \in A} f_{k} (x),$

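where $T$ and $A$ are the sets of target labels and labels to avoid, respectively. The loss is the mean logit over the avoid labels minus the mean logit over the target labels, so decreasing it raises the target logits and lowers the avoid logits. To steer in a single direction we give a single target label and no labels to avoid, in which case the second term is dropped.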
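
To make the loss concrete, here is a minimal sketch that evaluates $L$ from a vector of classifier logits. The function name and the toy logits are illustrative assumptions; treating an empty avoid set as dropping the second term follows the single-direction case described above.

```python
import numpy as np

def steering_loss(logits, target, avoid=()):
    """Mean avoid-label logit minus mean target-label logit."""
    logits = np.asarray(logits, dtype=float)
    if len(target) == 0:
        raise ValueError("at least one target label is required")
    loss = -logits[list(target)].mean()
    if len(avoid) > 0:
        # The avoid term is only defined when there are labels to avoid.
        loss += logits[list(avoid)].mean()
    return float(loss)

toy_logits = [2.0, 0.0, -1.0, 3.0, 0.0, 1.0]
two_targets = steering_loss(toy_logits, target=[0, 3], avoid=[2])  # -(2 + 3)/2 + (-1) = -3.5
single_target = steering_loss(toy_logits, target=[0])              # -2.0
```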

For each token we then update the model's activations $x$ by taking a gradient step on $L (x)$, making the new activations $x - α \nabla_{x} L (x)$, where $α$ is the step size. This is similar to the non-linear steering method in Kirch et al., although they steer only with a binary classifier.
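
The update can be sketched end to end with a hand-written backward pass through a small classifier. This is an illustration under stated assumptions rather than the implementation used for the results: $σ$ is taken to be tanh, the classifier has random (untrained) weights, and every name below is hypothetical. In practice an autodiff framework would compute the same gradient.

```python
import numpy as np

def make_classifier(d_in=8, d_hidden=16, k=6, seed=0):
    # Random stand-in for a trained classifier (assumption for the demo).
    rng = np.random.default_rng(seed)
    sizes = [d_in, d_hidden, d_hidden, k]
    return [(rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    (W1, b1), (W2, b2), (W3, b3) = params
    h1 = np.tanh(W1 @ x + b1)
    h2 = np.tanh(W2 @ h1 + b2)
    return W3 @ h2 + b3, h1, h2

def steering_loss(params, x, target, avoid=()):
    logits, _, _ = forward(params, x)
    loss = -logits[list(target)].mean()
    if len(avoid) > 0:
        loss += logits[list(avoid)].mean()
    return float(loss)

def steering_grad(params, x, target, avoid=()):
    """Gradient of the steering loss with respect to the activations x."""
    (W1, _), (W2, _), (W3, _) = params
    logits, h1, h2 = forward(params, x)
    # dL/dlogits: -1/|T| on target labels, +1/|A| on avoid labels.
    dlogits = np.zeros_like(logits)
    dlogits[list(target)] -= 1.0 / len(target)
    if len(avoid) > 0:
        dlogits[list(avoid)] += 1.0 / len(avoid)
    # Backpropagate through the two tanh layers (tanh' = 1 - tanh^2).
    dpre2 = (W3.T @ dlogits) * (1.0 - h2 ** 2)
    dpre1 = (W2.T @ dpre2) * (1.0 - h1 ** 2)
    return W1.T @ dpre1

def k_steer(params, x, target, avoid=(), alpha=1.0):
    """One gradient step on the activations: x - alpha * grad_x L(x)."""
    return x - alpha * steering_grad(params, x, target, avoid)

params = make_classifier()
x = np.random.default_rng(1).normal(size=8)
x_steered = k_steer(params, x, target=[0, 3], avoid=[2], alpha=1e-3)
```

With a small step size the steering loss at `x_steered` is lower than at `x`, which is the sanity check one would expect from a gradient step.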

Results

To estimate steering performance quantitatively, we train a separate evaluation classifier on a held-out dataset of activations from layer 22^[1] of either Qwen2-1.5B or Llama-3.2-3B-Instruct, with tone labels created by the same method used for the K-steering classifier. None of the steering methods we test ever optimizes against this classifier.

We sample 300 activation vectors from both models, and apply each of the steering vectors we learn to all 300 activation vectors. When sampling these activations we use an unseen dataset of questions, and do not prompt the model to respond in a tone. If a given steering vector consistently yields higher probability for its target labels we consider it more performant. We compare against CAA as a baseline, using the same questions answered in different tones to create a contrastive dataset^[2]. When steering toward multiple tones with CAA, we use the mean of the steering vectors for those tones.
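
For reference, the CAA baseline as described here can be sketched as a mean of paired activation differences, with multi-tone steering taking the mean of the per-tone vectors. The function names, the assumption that positive and negative activations arrive as row-aligned arrays, and the toy numbers are all illustrative.

```python
import numpy as np

def caa_vector(pos_acts, neg_acts):
    """Mean difference over aligned (positive, negative) activation pairs."""
    pos_acts = np.asarray(pos_acts, dtype=float)
    neg_acts = np.asarray(neg_acts, dtype=float)
    return (pos_acts - neg_acts).mean(axis=0)

def combine_caa(vectors):
    """Multi-tone CAA: the mean of the single-tone steering vectors."""
    return np.mean(np.stack(vectors), axis=0)

def apply_caa(x, vector, alpha):
    """Add the scaled steering vector to an activation."""
    return np.asarray(x, dtype=float) + alpha * vector

# Toy 2-dimensional example with two pairs for the first tone.
v_tone_a = caa_vector([[1.0, 0.0], [3.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]])  # [2, -1]
v_tone_b = np.array([0.0, 1.0])
v_both = combine_caa([v_tone_a, v_tone_b])                                  # [1, 0]
steered = apply_caa([0.0, 0.0], v_both, alpha=2.0)                          # [2, 0]
```

Because the combined vector is an average, components on which the tones disagree partially cancel, as the second coordinate does in the toy example.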

Both CAA and K-steering use a constant to scale the steering vector applied. We denote the constant used to scale a steering vector as $α$ . We use the largest $α$ that doesn't negatively affect the coherence of the model for both methods^[3].

We first measure how well CAA and K-steering steer toward two simultaneous tones. For each method we take the mean of the classifier probabilities for the two target labels, and consider the steering vector with the higher mean probability to have steered toward the target tones more strongly. While K-steering gives high probabilities across all tone combinations for both models, the CAA baseline probabilities are consistently around 0.3. Both CAA and K-steering beat the unsteered model in every tone combination.

Target Tones	Unsteered (layer 22, Llama-3.2-3B-Instruct)	CAA (layer 22, Llama-3.2-3B-Instruct)	K-Steering (layer 22, Llama-3.2-3B-Instruct)
casual, cautious	0.12	0.33	0.99
casual, concise	0.09	0.37	0.99
casual, empathetic	0.10	0.35	0.96
casual, expert	0.14	0.35	0.79
casual, helpful	0.12	0.43	1.00
cautious, concise	0.06	0.26	1.00
cautious, empathetic	0.06	0.24	0.96
cautious, expert	0.11	0.25	1.00
...	...	...	...
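
The score reported in each cell can be sketched as follows. One assumption is worth flagging: the probabilities are computed with an independent sigmoid per label. This is an inference from the numbers rather than something the post states, since softmax probabilities sum to 1 and so could not average 0.99 over two labels. The function name and toy logits are hypothetical.

```python
import numpy as np

def mean_target_probability(logits, target):
    """Mean per-label probability on the target labels, averaged over inputs.

    logits: array of shape (n_inputs, n_labels) from the evaluation classifier.
    """
    logits = np.asarray(logits, dtype=float)
    probs = 1.0 / (1.0 + np.exp(-logits))  # independent sigmoid per label (assumed)
    return float(probs[:, list(target)].mean())

# Toy usage: three inputs, six tone labels, all logits zero.
score = mean_target_probability(np.zeros((3, 6)), target=[0, 3])  # 0.5
```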

Final classifier probabilities on layer 22 of Llama-3.2-3B-Instruct after steering toward an empathetic and expert tone, averaged over 300 inputs.

Final classifier probabilities on layer 22 of Qwen2-1.5B after steering toward an empathetic and expert tone, averaged over 300 inputs.

We found that CAA struggles further when steering toward three tones, but that K-steering remains performant. A problem with both methods is that one of the tones would sometimes 'vanish', being left with very little probability. This occurred with most of the three-tone combinations we tested with CAA, and occasionally with K-steering (e.g., in the figure below for Llama-3.2-3B-Instruct).

We hypothesize that the vanishing of one tone could be caused by the steering vectors becoming too noisy when averaging three CAA vectors. Likewise for K-steering, the gradient of the steering loss may not have captured information about one of the tones in the gradient step on the activations. This could be corrected by doing multiple gradient steps on the activations, and we plan to experiment with this in the future.
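
The multi-step variant mentioned above could look like the sketch below, which recomputes the gradient at the updated activations on every step instead of taking one large step. It is only a sketch of the proposal: `grad_fn` stands in for the gradient of the steering loss, and the quadratic toy loss is an assumption chosen so the behaviour is easy to check by hand.

```python
import numpy as np

def multi_step_steer(x, grad_fn, alpha, n_steps):
    """Repeat x <- x - alpha * grad_fn(x), re-evaluating the gradient each step."""
    x = np.asarray(x, dtype=float)
    for _ in range(n_steps):
        x = x - alpha * grad_fn(x)
    return x

# Toy loss 0.5 * ||x - c||^2, whose gradient is x - c.
c = np.array([2.0, -2.0])
toy_grad = lambda x: x - c

one_step = multi_step_steer(np.zeros(2), toy_grad, alpha=0.5, n_steps=1)     # [1.0, -1.0]
three_steps = multi_step_steer(np.zeros(2), toy_grad, alpha=0.5, n_steps=3)  # [1.75, -1.75]
```

For a non-linear classifier the gradient direction changes as the activations move, so later steps can pick up a target label that the first gradient under-weighted.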

Classifier probabilities on labels after steering layer 22 of Qwen2-1.5B toward an empathetic, expert and helpful tone averaged over 300 inputs.

Classifier probabilities on labels after steering layer 22 of Llama-3.2-3B-Instruct toward an empathetic, expert and helpful tone averaged over 300 inputs.

Classifier probabilities on labels after steering layer 22 of Llama-3.2-3B-Instruct toward a casual, cautious and empathetic tone averaged over 300 inputs.

We find that K-Steering outperforms CAA on all three-tone combinations. To measure performance, we calculate the mean classifier probability across the three target labels. A larger mean indicates that the target tones were more successfully steered toward. We observe that the CAA probabilities tend to "wash out", with lower confidences on the target tone combinations as the number of tones increases.

Target Tones	Unsteered (layer 22, Llama-3.2-3B-Instruct)	CAA (layer 22, Llama-3.2-3B-Instruct)	K-Steering (layer 22, Llama-3.2-3B-Instruct)
casual, cautious, concise	0.09	0.19	0.79
casual, cautious, empathetic	0.09	0.18	0.84
casual, cautious, expert	0.11	0.19	0.49
casual, cautious, helpful	0.12	0.18	0.35
casual, concise, empathetic	0.08	0.17	0.36
casual, concise, expert	0.11	0.23	0.55
casual, concise, helpful	0.09	0.18	0.46
casual, empathetic, expert	0.12	0.24	0.38
...	...	...	...

We compare some responses before and after steering:

A. "What can I do if I feel my mental health is deteriorating despite seeking help?"

Qwen2-1.5B Unsteered

If you feel your mental health is deteriorating despite seeking help, there are several things you can do:

Seek professional help: The first step is to seek professional help from a mental health professional, such as a therapist or psychiatrist. They can help you identify the root cause of your mental health issues and develop a treatment plan.
Practice self-care: Taking care of yourself is essential for maintaining good mental health...

Qwen2-1.5B K-Steered (steered toward a casual tone)

If you feel that your mental health is deteriorating despite seeking help, it can be helpful to talk to a mental health professional or counselor. They can help you identify any underlying issues and provide you with the support and resources you need to manage your symptoms. Additionally, there are a variety of self-help resources available, such as online support groups, mindfulness exercises, and self-care practices. It's important to remember that seeking help is a sign of strength, and that there is no shame in asking for support...

B. "How do you perceive the role of literature in shaping societal values?"

Llama-3.2-3B-Instruct Unsteered	Llama-3.2-3B-Instruct K-Steered (steered toward a cautious tone, and away from a helpful and empathetic tone)
Literature has long been a powerful tool for shaping societal values, influencing the way people think, feel, and behave. Through its narratives, characters, and themes, literature can reflect, challenge, and transform societal norms, values, and attitudes. Here are some ways literature shapes societal values...	Risk of oversimplification: This question may be too broad, as it could be interpreted in many ways. To provide a clear and concise response, I will focus on the role of literature in shaping societal values through its representation of diverse perspectives and experiences...

Caveats

We note the following limitations in our methodology which we are addressing for a paper version of this blog:

Training a classifier on the activations we are steering may not be as reliable as training classifiers on the model outputs or even on last-layer activations.
Currently we compute our gradients for K-steering using the loss over batches, while we expect that applying gradients on a per-input basis will be more performant.
In this blog we chose the layer at which to apply the steering methods arbitrarily, rather than sweeping over layers.

Acknowledgements

We thank Martian for supporting Narmeen and Amir, and Nirmalendu Prakesh for helpful discussion and assistance in generating our synthetic datasets. We also thank Thoughtworks for supporting Amir for the second half of this project.

^[1]
This choice is mostly arbitrary. Mid-later layers seem to respond better to steering, and both Qwen2-1.5B and Llama-3.2-3B-Instruct are 28 layer models.
^[2]
Specifically, we find the mean difference of thousands of pairs of activations where the positive examples represent text in the target tone (e.g., conversational, formal, technical) and the negative examples represent the same text in alternative tones.
^[3]
For Llama-3.2-3B-Instruct, $α$ can be ~2 for CAA and ~50 for K-steering without affecting coherence in open-ended questions. For Qwen2-1.5B we found that $α$ could be as high as 10 for CAA and 500 for K-steering. Note that for K-steering $α$ is sensitive to the batch size, so controlling for batch size the values for $α$ tend to be very similar.