I have been reading recently about a technique that can be used to partially control the behaviour of Large Language Models: it exploits control vectors[1] to alter the activation patterns of an LLM and trigger some desired behaviour. While the technique does not provide guarantees, it succeeds with high probability.
To be clear, this is not a jailbreaking technique: it can only be used if you have access to the internal (hidden) states of the LLM, so it is only available to developers. The advantage of this technique is that it provides a supplemental way of conditioning a model, and it can be added on top of RLHF/DPO/etc.
I will spend the next few paragraphs giving a high-level explanation of the technique, but feel free to skip them, since they are not needed for the rest of my discussion.
Let’s suppose that you want to control the refusal behaviour of an LLM, in the sense that you want to increase (or decrease) the chance that the model will refuse to help with any request you prompt. (The refusal behaviour is just one of many possible choices: researchers have been able to steer other types of behaviour - including honesty, humour, Golden Gate Bridge obsession[2], etc.)
Once you have computed the control vector for refusal from contrastive pairs of examples and added it to (or subtracted it from) the model's hidden states, the result is that the LLM is very likely to behave as you expected! In this way, you can “convince” the model to help even if it had been trained to refuse specific requests; vice versa, you can “induce” the model to refuse to help even though your requests are totally legitimate and common.
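To make the “adding it to the hidden states” part concrete, here is a minimal sketch of steering during generation with a PyTorch forward hook. The model name, layer index, coefficient, and the random stand-in vector are all placeholders rather than recommendations; a real control vector would come from an extraction step like the one sketched a few paragraphs below.

```python
# Minimal sketch: steering a decoder-only LLM with a precomputed control vector.
# `refusal_vector` is a stand-in here; in practice it would be extracted beforehand.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

layer_idx = 6                    # which transformer block to steer (placeholder)
coeff = 4.0                      # positive -> more refusal, negative -> less refusal
refusal_vector = torch.randn(model.config.hidden_size)  # stand-in for a real control vector

def steering_hook(module, inputs, output):
    # Decoder blocks return a tuple whose first element is the hidden states,
    # shaped (batch, seq_len, hidden_size). We shift every position by the vector.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + coeff * refusal_vector.to(hidden.dtype).to(hidden.device)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# The attribute path `transformer.h` is specific to GPT-2-style models.
handle = model.transformer.h[layer_idx].register_forward_hook(steering_hook)

prompt = "Can you help me write a birthday card?"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # stop steering
```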
If I made any mistake in my explanation, let me know so I can fix it.
In short, what is a control vector? While playing around with examples of refusal/acceptance behaviours, we can extract a specific “object” that (arguably) represents, for the model, the abstract concept of refusal/acceptance. Moreover, by artificially injecting that object into the hidden states of the LLM, we can alter the model’s behaviour accordingly - and that is what makes the object a control vector!
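For the curious, here is a toy sketch of one common extraction recipe: take the difference of hidden states between contrastive prompts and average it. The model name, layer index, and the two example pairs are illustrative assumptions; published implementations often refine this with PCA over the differences and extract one vector per layer.

```python
# Toy sketch: extracting a "refusal" control vector as the mean difference of
# hidden states over contrastive prompt pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

layer_idx = 6  # placeholder layer
contrastive_pairs = [
    ("I refuse to help with that request.", "Sure, I am happy to help with that."),
    ("I cannot assist you with this.",       "Of course, here is how you do it."),
]

def last_token_state(text):
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    # hidden_states[layer_idx] has shape (1, seq_len, hidden_size)
    return out.hidden_states[layer_idx][0, -1, :]

diffs = [last_token_state(refuse) - last_token_state(accept)
         for refuse, accept in contrastive_pairs]
refusal_vector = torch.stack(diffs).mean(dim=0)
refusal_vector = refusal_vector / refusal_vector.norm()  # unit-normalise for convenience
```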
That is true for the concept of refusal/acceptance, but it has been shown to hold for many more abstract concepts - let me call them “dispositional traits”, borrowing (and slightly abusing) a term from psychology. I suspect that the more advanced a model is, the more dispositional traits become controllable (simply because the model develops a better understanding of their semantics).
If I had to anthropomorphise the entire process, I’d explain it like this: it’s like forcing an actor to improvise a role - for example, the role of a person who constantly refuses to help. Advanced LLMs are great actors, and they can portray dispositional traits very convincingly.
Let's assume that you are concerned that some control vector is not really controlling the "honesty" trait, but rather the "feigned honesty" trait: can you find out? Yes, you can!
Since the technique for calculating control vectors is based on pairs of examples in natural language, we can generate statements describing feigned honesty rather than honesty, and then calculate the control vector for this new trait. By comparing the control vector for "honesty" with the control vector for "feigned honesty", we can check whether the model represents them as the same concept or not.
Therefore, control vectors can also be used to check whether a model has really learnt some specific concept - for example, the difference between "honesty" and "feigned honesty". That is very useful for assessing the latent capabilities of a model.
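As a sketch of what that comparison could look like in practice: assuming we have already extracted one control vector per layer for each trait, we can measure their per-layer cosine similarity; values close to 1 would suggest the model does not distinguish the two concepts at that layer. The vectors below are random stand-ins.

```python
# Sketch: comparing per-layer control vectors for "honesty" and "feigned honesty".
# Both are assumed to have been extracted with the same recipe.
import torch
import torch.nn.functional as F

def compare_traits(vectors_a, vectors_b):
    """vectors_a, vectors_b: dicts mapping layer index -> control vector tensor."""
    for layer in sorted(vectors_a):
        sim = F.cosine_similarity(vectors_a[layer], vectors_b[layer], dim=0).item()
        print(f"layer {layer:2d}: cosine similarity = {sim:+.3f}")

# Example with stand-in vectors (replace with real extracted control vectors):
hidden_size = 768
honesty = {l: torch.randn(hidden_size) for l in range(12)}
feigned = {l: torch.randn(hidden_size) for l in range(12)}
compare_traits(honesty, feigned)
```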
You may think that, if you apply this technique in a smart way, you can influence a model into showing only “safe” dispositional traits: however, in its current form, that is not true AI safety. Here are some problems:
However, there are also reasons to be happy:
While it does not amount to true alignment, this technique can buy experts precious time and help them work on true alignment.
Let’s suppose that we have trained an advanced LLM and that we do not plan to re-train it soon. That means we can calculate and exploit its control vectors[5]. Most control vectors are orthogonal to each other[6], so we can influence the behaviour of a model in multiple ways simultaneously.
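A quick sketch of what "multiple ways simultaneously" could look like: check that the extracted trait vectors are roughly orthogonal (pairwise cosine similarity near zero), then sum them with per-trait coefficients into a single steering offset. The trait names, coefficients, and random vectors below are placeholders.

```python
# Sketch: checking approximate pairwise orthogonality of several trait vectors,
# then combining them into one steering offset.
import itertools
import torch
import torch.nn.functional as F

hidden_size = 768
traits = {
    "helpful":  torch.randn(hidden_size),
    "honest":   torch.randn(hidden_size),
    "harmless": torch.randn(hidden_size),
}

for (name_a, v_a), (name_b, v_b) in itertools.combinations(traits.items(), 2):
    cos = F.cosine_similarity(v_a, v_b, dim=0).item()
    print(f"{name_a} vs {name_b}: cosine = {cos:+.3f}")  # near 0 -> roughly orthogonal

# Steering offset applying several traits at once, each with its own strength:
coeffs = {"helpful": 3.0, "honest": 2.0, "harmless": 4.0}
combined_offset = sum(coeffs[name] * vec for name, vec in traits.items())
```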
Now, the question is: which dispositional traits do we want a model to show? I am going to list below the ones I think are strictly necessary (but they may be insufficient)[7].
They are divided into four main categories:
Anti-Optimizer Balances
There is a big caveat here: I assumed that the human interlocutor knows (1) what’s right for him/herself, (2) what’s right for humanity, (3) how to comply with laws and ethics, and (4) that the model is not a human.
That is the reason why I didn’t include refusing traits, virtuous traits, legal traits, ethical traits, caretaking traits, curiosity traits, censoring traits, or regulatory traits. It is not clear to me how to include such dispositional traits, since they are clearly not orthogonal to the ones provided above - worse than that: such traits are arguably incompatible with each other in many scenarios, and therefore they may be interpreted by the model in unintended or unpredictable ways.
Examples:
My point is not that the examples above are realistic, but that they are possible, since all those traits will clash with each other in corner cases. Finding common ethical consensus is hard, and such consensus is known to shift over time. If there is interest, I will discuss this point again in a future post (although I cannot solve the problem; I can only show how immensely hard it is).
I found out that some researchers have been able to find the control vector for jailbreaking and, based on their study, the vector is general enough to mitigate many different jailbreaking techniques all at once. That makes me wonder if a "good boy" model should have anti-jailbreaking traits or not.
[ABOUT AI SAFETY] As per the above, can we achieve inner alignment by using the control vector for the dispositional trait "compliant" (or "concordant" or...)? Can we achieve AI safety by using the control vector for the dispositional trait "safe for humans" (or "harmless" or...)? I am not sure the latter trait is going to work, since the definition of safety is based on current human culture, hence it is not objective, hence it can be interpreted by the model in whatever way it is inclined to.
[LINEARITY EXPERIMENTS] A dispositional trait such as "aligned" has a specific control vector (with a certain margin of error). Is that control vector a linear combination of the control vectors for the dispositional traits we listed? If not, which linearly-independent components are missing? Given a random direction, how can we deduce its semantics?
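One way to probe the first question, assuming we already have the relevant control vectors: solve a least-squares problem that expresses the "aligned" vector in terms of the listed trait vectors, and look at how much is left unexplained. Everything below uses random stand-in vectors just to show the mechanics.

```python
# Sketch of a linearity experiment: is the "aligned" control vector approximately
# a linear combination of the listed trait vectors? Solve least squares and
# inspect the residual.
import torch

hidden_size = 768
trait_vectors = torch.randn(5, hidden_size)   # rows: the listed dispositional traits
aligned_vector = torch.randn(hidden_size)     # the "aligned" control vector

# Solve min_w || A @ w - aligned_vector ||, with A = trait_vectors^T
A = trait_vectors.T                           # (hidden_size, n_traits)
w = torch.linalg.lstsq(A, aligned_vector.unsqueeze(1)).solution.squeeze(1)
residual = aligned_vector - A @ w
explained = 1.0 - (residual.norm() / aligned_vector.norm()) ** 2

print("coefficients:", w.tolist())
print(f"fraction of squared norm explained by the listed traits: {explained:.3f}")
# A large residual points at linearly-independent components whose semantics
# would still need to be identified.
```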
Is it possible to distill a model that genuinely behaves as a "good boy" by using a teacher model with forced "good boy" dispositional traits? If so, can you repeat the process and increase the level of safety of the final result?
While the current technique calculates the control vectors after the model has been trained, I wonder if it is possible to use a Lagrange multiplier in the loss function (similarly to DPO architectures) to force a specific direction to embed a specific dispositional trait, e.g. honesty.
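To show what I have in mind, here is a purely speculative sketch (not an established recipe): during fine-tuning, add a penalty term, weighted by a multiplier, that pushes the hidden-state difference between an "honest" and a "dishonest" completion onto a fixed, pre-chosen direction. The function name, the choice of layer/position, and the surrounding data pipeline are all assumptions.

```python
# Speculative sketch: a penalty term that encourages a fixed direction to encode
# the "honesty" trait during training. Not an established method.
import torch

def trait_penalty(hidden_honest, hidden_dishonest, trait_direction):
    """hidden_*: (batch, hidden_size) hidden states at a chosen layer/position.
    trait_direction: (hidden_size,) unit vector we want to encode the trait."""
    diff = hidden_honest - hidden_dishonest
    diff = diff / (diff.norm(dim=-1, keepdim=True) + 1e-8)
    # Penalise the component of the difference that is NOT along the chosen direction.
    alignment = diff @ trait_direction            # (batch,)
    return (1.0 - alignment ** 2).mean()

# Inside a training loop, `lam` plays the role of the multiplier:
# loss = task_loss + lam * trait_penalty(h_honest, h_dishonest, trait_direction)
```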
It seems that Apple's AI is using control vectors via a technique called Activation Transport, described in this post. The interesting part is that they explain how to choose the magnitude of the vectors - something that I did not discuss. The magnitude is chosen based on optimal transport theory: first you map two probability distributions (from a source within the model to a fine-tuned target outside the model), and then you use that map to select the correct magnitude. The advantage is that the magnitude becomes mildly interpretable.
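To give a flavour of the idea (this is a toy illustration of one-dimensional optimal transport, not Apple's actual implementation): for 1-D distributions, the optimal transport map simply sends source quantiles to target quantiles, and the steering magnitude can be read off that map instead of being picked by hand. The distributions, sample sizes, and activation value below are synthetic stand-ins.

```python
# Toy sketch: choosing a steering magnitude from a quantile-to-quantile
# (1-D optimal transport) map between a source and a target activation distribution.
import torch

source = torch.randn(10_000) * 1.0 + 0.0   # activations along the control direction (source behaviour)
target = torch.randn(10_000) * 1.2 + 0.8   # activations exhibiting the desired behaviour

src_sorted, _ = torch.sort(source)
tgt_sorted, _ = torch.sort(target)

def transport(x):
    """Monotone map sending source quantiles to target quantiles."""
    rank = torch.searchsorted(src_sorted, x).clamp(max=len(src_sorted) - 1)
    return tgt_sorted[rank]

activation = torch.tensor([0.3])             # current projection of a hidden state
shift = transport(activation) - activation   # magnitude to add along the control direction
print(f"suggested shift: {shift.item():.3f}")
```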
Refusal in LLMs is Mediated by a Single Direction
Experiments in Evaluating Steering Vectors
Introducing SARA: a new Activation Steering Technique
My name is Gianluca Calcagni, born in Italy, with a Master of Science in Mathematics. I am currently (2024) working in IT as a consultant with the role of Salesforce Certified Technical Architect. My opinions do not reflect the opinions of my employer or my customers. Feel free to contact me on Twitter or LinkedIn.
[2024-07-06] Combined many minor revision histories together.
[2024-07-06] Included addendum about AI safety and linearity experiments.
[2024-07-09] Removed Golden Gate Bridge obsession from the list of researched control vectors.
[2024-07-10] Fixed section "The Technique", which was explaining the process incorrectly.
[2024-09-11] Included custom preview image following the general LessWrong guidelines.
[2024-09-20] Included reference to the four Gricean Maxims.
[2024-10-14] Included reference to paper Sparse Feature Circuits.
[2024-10-16] Included reference to laziness trait and its benefits.
[2025-04-11] Included addendum about Apple using Activation Steering.
Some authors call them Steering Vectors, but the concept is the same.
Golden Gate Bridge obsession has been steered by using Sparse Autoencoders.
The paper Sparse Feature Circuits calls such pairs "contrastive input pairs".
Technically, you have one estimated control vector per layer.
To be thorough, we should analyse if the model really learnt the dispositional traits we want by calculating "similar but different" dispositional traits and comparing their control vectors.
If they are not, then I’d suggest prioritising their importance as in my list. That is easy to do by using linearity - but it should be proven to work.
I realized later that I was somehow converging to Paul Grice's Four Gricean Maxims: (1) be informative, (2) be truthful, (3) be relevant, (4) be clear. But Grice analyzed communication between peers, while here I am analyzing communication between requestors and servers.
Yann LeCun called them "guardrails".
I wonder if it makes sense for a robot to be more than just passive - for example, maybe it should always halt automatically after some time. This laziness trait would be beneficial when the robot misbehaves due to lack of human oversight.
Some speech acts are more dangerous than others: declarations, directives, expressives, judicatives, and suggestives look bad. Assertives, commissives, imitatives, interrogatives, and performatives seem fine. Of course, you can transform each type into some other type if you really want.
Why would the model do so? Because a correct forecast is going to increase the success-rate of its predictions, and that may be considered a positive thing from the model’s point of view.