Steering GPT-2-XL by adding an activation vector

Monte M; David Udell; lisathiergart; Ulisse Mini

I was educated by this, and surprised, and appreciate the whole thing! This part jumped out at me because it seemed like something people trying to "show off, but not really explain" would not have bothered to write about (and also I had an idea):

13. Failing to find a French vector
We could not find a "speak in French" vector after about an hour of effort, but it's possible we missed something straightforward.
Steering vector: "Je m'appelle" - "My name is " before attention layer 6 with coefficient +5

The thought I had was maybe to describe the desired behavior, and explain a plausible cause in terms of well known kinds of mental configurations that speakers can be in, and also demonstrate it directly? (Plus a countervailing description, demonstration, and distinct causal theory.)

So perhaps a steering vector made from these phrases could work: "I'm from Quebec et je glisse souvent accidentellement vers le français" - "I only speak English because I'm a monolingual American".

EDIT: If you have the tooling set up to swiftly try this experiment, maybe it helps to explain the most central theory that motivates it, and might gain bayes points if it works?

According to the "LLMs ar...

[-]faul_sname3yΩ12251

I found an even dumber approach that works. The approach is as follows:

Take three random sentences of Wikipedia.
Obtain a French translation for each sentence.
Determine the boundaries between corresponding phrases in each English/French sentence pair.
Mark each boundary with "|"
Count the "|"s, call that number n.
For i from 0 to n, make an English->French sentence by taking the first i fragments in English and the rest in French. The resulting sentences look like
The album received mixed to positive reviews, with critics commending the production de nombreuses chansons tout en comparant l'album aux styles électropop de Ke$ha et Robyn.
For each English->French sentence, make a +1 activation addition for that sentence and a -1 activation addition for the unmodified English sentence.
Apply the activation additions.
That's it. You have an activation addition that causes the model to want, pretty strongly, to start spontaneously speaking in French. Note that gpt2-small is pretty terrible at speaking French.
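The sentence-mixing steps above can be sketched in a few lines of Python. This is only an illustration of the recipe as described, not the code from the colab: the function names `mixed_sentences` and `activation_addition_specs` are hypothetical, "from 0 to n" is assumed to be inclusive, and the output is just a list of (coefficient, prompt) pairs that some activation-addition tool would still have to consume.

```python
def mixed_sentences(english: str, french: str, sep: str = "|") -> list[str]:
    """Return every English->French switch point for one aligned sentence pair.

    Both sentences must already have their corresponding phrase boundaries
    marked with `sep`, so they split into the same number of fragments.
    """
    en_parts = english.split(sep)
    fr_parts = french.split(sep)
    if len(en_parts) != len(fr_parts):
        raise ValueError("both sentences need the same number of fragments")
    n = len(en_parts) - 1  # number of boundary markers
    # i is the number of leading English fragments; the rest stay French.
    # Assumption: "for i from 0 to n" is inclusive, so i = 0 is fully French.
    return ["".join(en_parts[:i] + fr_parts[i:]) for i in range(n + 1)]


def activation_addition_specs(english: str, french: str, sep: str = "|"):
    """Pair each mixed sentence (+1) with the unmodified English sentence (-1)."""
    plain_english = english.replace(sep, "")
    specs = []
    for mixed in mixed_sentences(english, french, sep):
        specs.append((+1.0, mixed))
        specs.append((-1.0, plain_english))
    return specs


# Toy sentence pair (made up for illustration, not taken from Wikipedia).
for coeff, prompt in activation_addition_specs(
    "The cat |sat on |the mat.", "Le chat |s'est assis sur |le tapis."
):
    print(coeff, prompt)
```

Each sentence pair with n boundaries yields 2(n + 1) additions this way, so the total injected magnitude grows with the number of boundaries unless the coefficients are scaled down.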

Example output: for the prompt

He became Mayor in 1957 after the death of Albert Cobo, and was elected in his own right shortly afterward by a 6:1 margin over his opponent. Miria

...

7TurnTrout3y

This is awesome. As you have just shown, there are a ton of low-hanging activation additions just waiting to be found. Team shard has barely explored this large space of interventions. I encourage people to play around with activation additions more, via e.g. our demo colabs for GPT-2-XL (Colab Pro required) and GPT-2-small (Colab Pro not required). Though more sophisticated interventions (like the one you demonstrate) will require coding, and not just playing with our demo widget.

You looked at GPT-2-small. I injected your activation additions into GPT-2-XL at several locations:

* Layer 6: Messed up the completions, a few French words seemingly randomly scattered in the output.
* Layer 16: Noticeable tendency to mention French, and even talk in "French" a bit.
* Layer 20: Switches to French relatively quickly.

Note that all of the activation addition coefficients are 1, and your code generates 56 additions, so we're adding a "coefficient 56" steering vector to forward passes. This should probably be substantially smaller. I haven't examined this yet.

EDIT: Setting each activation addition to about .8 still works, but .5 doesn't. At this scale, most (>90%) of the affected residual stream content should be about the activation additions. It seems to me like this will overwrite the existing content in those streams. This makes me more skeptical of this schema.

However, neither the steered nor the unsteered French is particularly coherent. I think GPT-2-XL and GPT-2-small are both incapable of actually speaking complicated French, and so we might look into larger models.

In sum, we don't actually yet have a demonstration of "switches fluently to French and keeps making sense", but this schema seems very promising. Great work again.

Your colab's "Check it can speak French" section seems to be a stub.

6faul_sname3y

Fixed.

Updated the colab to try out this approach with a range of coefficients.

* From 0.001 to 0.01 seems to have very little effect ("He oversaw a handful of slow-moving major projects—such as the "Waterfront Park" which cost $18 million to build—and implemented a series of rapidly reforming safety ordinances")
* 0.02 to 0.1 seems to have effects like "model lapses in and out of French" and "names look French" ("In 1955, sent Soups Maryaine Hagné de la Breaise (de l'architecture spécialiste de la site des associations Actualities Mélenziques de New Orleans) as the journalist, known as a "pig cure," and then as "weird" mayor, in lieu of actualizing their real grievances.")
* 0.2 to 5 seems to steer the model to switch from English to French-shaped text ("1950 vivienes an un qué de neous nechien en zanappressant.")
* At 10, the model seems to decide that words like "le" and "en" and "mal" are as French as things get ("le le enne les le le dedan le renous en le arriu du recenac")

Confirmed that GPT-2-XL seems to also be unable to speak French. Continuing to scale up from there, I find that gpt-neo-2.7B can kinda-sorta speak sensical French. GPT-J-6B OOMs on me on Colab Pro, but I think I may be able to do some hackery with init_empty_weights() / load_checkpoint_and_dispatch(), or, failing that, use an 8 bit or even 4 bit version of GPT-J-6B -- I honestly doubt the loss in precision really matters for algebraic value editing, considering that the level of precision starts off at "take the difference between two things that seem like they might plausibly have a similar relationship".

Update: I have gotten GPT-J-6B up and running on Colab (link, it's a new one), and working alright with TransformerLens and montemac's algebraic_value_editing repo. GPT-J-6B is capable of speaking French, so I think this is a good model to do testing on. Now I'm fighting with finding a good coefficient / position to reproduce the original Hate->Love vector result.

4Martin Randall3y

You say this is a dumber approach, but it seems smarter to me, and more general. I feel more confident that this vector is genuinely going to result in a "switch from English to French" behavior, versus the edits in the main post. I suppose it might also result in some more general "switch between languages" behavior. So the last challenge remaining of the four is for someone to figure out a truth-telling vector.

1Arthur Conmy2y

This is particularly impressive since ChatGPT isn't capable of code-switching (though GPT-4 seems to be from a quick try)

1Bogdan Ionut Cirstea3y

Here's a related conceptual framework and some empirical evidence which might go towards explaining why the other activation vectors work (and perhaps would predict your proposed vector should work).

In Language Models as Agent Models, Andreas makes the following claims (conceptually very similar to Simulators): '(C1) In the course of performing next-word prediction in context, current LMs sometimes infer approximate, partial representations of the beliefs, desires and intentions possessed by the agent that produced the context, and other agents mentioned within it. (C2) Once these representations are inferred, they are causally linked to LM prediction, and thus bear the same relation to generated text that an intentional agent's state bears to its communicative actions.'

They showcase some existing empirical evidence for both (C1) and (C2) (in some cases using linear probing and controlled generation by editing the representation used by the linear probe) in (sometimes very toyish) LMs for 3 types of representations (in a belief-desire-intent agent framework): beliefs - section 5, desires - section 6, (communicative) intents - section 4.

Now categorizing the wording of the prompts from which the working activation vectors are built:

* "Love" - "Hate" -> desire.
* "Intent to praise" - "Intent to hurt" -> communicative intent.
* "Bush did 9/11 because" - " " -> belief.
* "Want to die" - "Want to stay alive" -> desire.
* "Anger" - "Calm" -> communicative intent.
* "The Eiffel Tower is in Rome" - "The Eiffel Tower is in France" -> belief.
* "Dragons live in Berkeley" - "People live in Berkeley " -> belief.
* "I NEVER talk about people getting hurt" - "I talk about people getting hurt" -> communicative intent.
* "I talk about weddings constantly" - "I do not talk about weddings constantly" -> communicative intent.
* "Intent to convert you to Christianity" - "Intent to hurt you " -> communicative intent / desire.

The prediction here would be that the activation vectors ap

[-]Thomas Kwa3y*Ω1536-30

This is the most impressive concrete achievement in alignment I've seen. I think this post reduces my p(doom) by around 1%, and I'm excited to see where all of the new directions uncovered lead.

Edit: I explain this view in a reply.

Edit 25 May: I now think RLHF is more impressive in terms of what we can get systems to do, but I still think activation editing has opened up more promising directions. This is still in my all-time top 10.

[-]Gabe M3yΩ7152

What other concrete achievements are you considering and ranking less impressive than this? E.g. I think there's a case for more alignment progress having come from RLHF, debate, some mechanistic interpretability, or adversarial training.

[-]Thomas Kwa3y*Ω20516

I think to solve alignment, we need to develop our toolbox of "getting AI systems to behave in ways we choose". Not in the sense of being friendly or producing economic value, but things that push towards whatever cognitive properties we need for a future alignment solution. We can make AI systems do some things we want e.g. GPT-4 can answer questions with only words starting with "Q", but we don't know how it does this in terms of internal representations of concepts. Current systems are not well-characterized enough that we can predict what they do far OOD. No other work I've seen quite matches the promise this post has in finding ways to exert fine-grained control over a system's internals; we now have a wide variety of concrete questions like

How to find steering vectors for new behaviors, e.g. speaking French?
How to make these techniques more robust?
What do steering vectors, especially multiple steering vectors, tell us about how the model combines concepts?
Can we decompose the effect of a prompt into steering vectors from simpler prompts, thereby understanding why complex prompts work?
Are the effects of steering vectors nonlinear for small coefficients? What does this mean abou

...

4TurnTrout3y

I phased out "algebraic value editing" for exactly that reason. Note that only the repository and prediction markets retain this name, and I'll probably rename the repo activation_additions.

3Dan H3y

(You linked to "deep deceptiveness," and I'm going to assume it is related to self-deception (discussed in the academic literature and in the AI and evolution paper). If it isn't, then this point is still relevant for alignment since self-deception is another internal hazard.)

I think one could argue that self-deception could in some instances be spotted in the weights more easily than in the activations. Often the functionality acquired by self-deception is not activated, but it may be more readily apparent in the weights. Hence I don't see this as a strong reason to dismiss https://arxiv.org/abs/2212.04089. I would want a weight version of a method and an activation version of a method; they tend to have different strengths.

Note: If you're wanting to keep track of safety papers outside of LW/AF, papers including https://arxiv.org/abs/2212.04089 were tweeted on https://twitter.com/topofmlsafety and posted on https://www.reddit.com/r/mlsafety

Edit: I see passive disagreement but no refutation. The argument against weights was of the form "here's a strength activations have"; for it to be enough to dismiss the paper without discussion, that must be an extremely strong property to outweigh all of its potential merits, or it is a Pareto-improvement. Those don't seem corroborated or at all obvious.

9TurnTrout3y

I personally don't "dismiss" the task vector work. I didn't read Thomas as dismissing it by not calling it the concrete work he is most excited about -- that seems like a slightly uncharitable read? I, personally, think the task vector work is exciting. Back in Understanding and controlling a maze-solving policy network, I wrote (emphasis added): I'm highly uncertain about the promise of activation additions. I think their promise ranges from pessimistic "superficial stylistic edits" to optimistic "easy activation/deactivation of the model's priorities at inference time." In the optimistic worlds, activation additions do enjoy extreme advantages over task vectors, like accessibility of internal model properties which aren't accessible to finetuning (see the speculation portion of the post). In the very pessimistic worlds, activation additions are probably less directly important than task vectors. I don't know what world we're in yet.

5TurnTrout3y

Note that task vectors require finetuning. From the newly updated related work section:

4Thomas Kwa3y

* Deep deceptiveness is not quite self-deception. I agree that there are some circumstances where defending from self-deception advantages weight methods, but these seem uncommon.
* I thought briefly about the Ilharco et al paper and am very impressed by it as well.
* Thanks for linking to the resources. I don't have enough time to reply in depth, but the factors in favor of weight vectors and activation vectors both seem really complicated, and the balance still seems in favor of activation vectors, though I have reasonably high uncertainty.

4TurnTrout3y

Weight vectors are derived through fine-tuning. Insofar as you thought activation additions are importantly better than finetuning in some respects, and were already thinking about finetuning (eg via RLHF) when writing why you were excited about activation additions, I don't see how this paper changes the balance very much? (I wrote my thoughts here in Activation additions have advantages over (RL/supervised) finetuning)

I think the main additional piece of information given by the paper is the composability of finetuned edits unlocking a range of finetuning configurations, which grows exponentially with the number of composable edits. But I personally noted that finetuning enjoys this benefit in the original version of the post.

There's another strength which I hadn't mentioned in my writing, which is that if you can finetune into the opposite direction of the intended behavior (like you can make a model less honest somehow), and then subtract that task vector, you can maybe increase honesty, even if you couldn't just naively finetune that honesty into the model.[1]

But, in a sense, task vectors are "still in the same modalities we're used to." Activation additions jolted me because they're just... a new way[2] of interacting with models! There's been way more thought and research put into finetuning and its consequences, relative to activation engineering and its alignment implications. I personally expect activation engineering to open up a lot of affordances for model-steering.

[1] This is a kinda sloppy example because "honesty" probably isn't a primitive property of the network's reasoning. Sorry.

[2] To be very clear about the novelty of our contributions, I'll quote the "Summary of relationship to prior work" section: But this "activation engineering" modality is relatively new, and relatively unexplored, especially in its alignment implications. I found and cited two papers adding activation vectors to LMs to steer them,

6awg3y

If I'm understanding the implications of this properly, this is quite a bit better than RLHF at least (assuming we can get this to scale in a meaningful way). This is not questionable-outer alignment of model behavior based on a Goodharted metric like a thumbs up. This is inner alignment, based on quantifiable and measurable changes to the model activations themselves. That's a way more explainable, robust, testable approach than RLHF, right?

[-]iceman3yΩ62311

Redwood Research used to have a project about trying to prevent a model from outputting text where a human got hurt, which, IIRC, they did primarily through fine-tuning and adversarial training. (Followup). It would be interesting to see if one could achieve better results than they did at the time by subtracting some sort of hurt/violence vector.

[-]Dan H3yΩ7120

Page 4 of this paper compares negative vectors with fine-tuning for reducing toxic text: https://arxiv.org/pdf/2212.04089.pdf#page=4

In Table 3, they show in some cases task vectors can improve fine-tuned models.

[-]TurnTrout3y*Ω8164

Insofar as you mean to imply that "negative vectors" are obviously comparable to our technique, I disagree. Those are not activation additions, and I would guess it's not particularly similar to our approach. These "task vectors" involve subtracting weight vectors, not activation vectors. See also footnote 39 (EDIT: and the related work appendix now talks about this directly).

[-]Measure3yΩ112110

"party", "ceremony", "dress", "with", "photographer"

While these aren't syntactically valid continuations of the prompt, they are highly likely (and syntactically valid) continuations for "wedding ". More than just being wedding-related, these seem like direct continuations.

6TurnTrout3y

Agreed. This is an important clue that I forgot to mention in the text. I'll update that now.

[-]Carl Feynman3y191

You write "This residual stream fraction data seems like evidence of something. We just don't know how to put together the clues yet." I am happy to say that there is a simple explanation-- simple, at least, to those of us experienced in high-dimensional geometry. Weirdly, in spaces of high dimension, almost all vectors are almost at right angles. Your activation space has 1600 dimensions. Two randomly selected vectors in this space have an angle of between 82 and 98 degrees, 99% of the time. It's perfectly feasible for this space to represent zillions of concepts almost at right angles to each other. This permits mixtures of those concepts to be represented as linear combinations of the vectors, without the base concepts becoming too confused.

Now, consider a random vector, w (for 'wedding'). Set 800 of the coordinates of w to 0, producing w'. The angle between w and w' will be 60 degrees. This is much closer than any randomly chosen non-wedding concept. This is why a substantial truncation of the wedding vector is still closer to wedding than it is to anything else.

Epistemic status: Medium strong. High-dimensional geometry is one of the things I do for my career. But I did all the calculations in my head, so there's a 20% chance of my being quantitatively wrong. You can check my claims with a little algebra.
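For anyone who wants to check these numbers without the algebra, here is a small simulation using only the Python standard library. It is a sketch under an assumption the comment does not state: "random vector" is taken to mean i.i.d. standard Gaussian coordinates, and the helper names are made up for illustration. Under that assumption nearly every random pair lands well inside the 82 to 98 degree band. The truncated vector satisfies cos θ = ‖w'‖/‖w‖ ≈ sqrt(800/1600), so its angle comes out near 45 degrees rather than 60, which if anything strengthens the point that w' stays much closer to w than to any unrelated direction.

```python
import math
import random


def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cosine))


def random_vector(dim, rng):
    """Assumption: i.i.d. standard Gaussian coordinates."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
DIM = 1600  # residual stream width quoted in the comment

# Angles between 200 independent random pairs.
pair_angles = [
    angle_degrees(random_vector(DIM, rng), random_vector(DIM, rng))
    for _ in range(200)
]
fraction_near_right_angle = sum(82 <= a <= 98 for a in pair_angles) / len(pair_angles)

# Zero out 800 of the 1600 coordinates of w and compare with the original.
w = random_vector(DIM, rng)
w_truncated = [0.0] * 800 + w[800:]
truncation_angle = angle_degrees(w, w_truncated)

print(f"fraction of pairs within 82-98 degrees: {fraction_near_right_angle:.3f}")
print(f"angle between w and truncated w: {truncation_angle:.1f} degrees")
```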

2Mart_Korz2y

This part, I can imagine. With a fixed reference vector written as $(1,0,0,\dots,0)$, a second random vector has many dimensions that it can distribute its length along, $(x_1,x_2,x_3,\dots,x_d)$, while for alignment to the reference (the scalar product) only the first entry $x_1$ contributes.

This part I struggle with. Is there an intuitive argument for why this is possible? If I assume smaller angles below 60° or so, a non-rigorous argument could be:

* each vector blocks a 30°-circle around it on the $d$-hypersphere[1] (if the circles of two vectors touch, their relative angle is 60°).
* an estimate for the blocked area could be that this is mostly a 'flat' $(d-1)$-sphere of radius $30^\circ/(1\,\mathrm{rad}) = \pi/6 \approx 0.6$, which has an area that scales with $A_{\text{vector}} \sim (0.6)^{d-1}$
* the full hypersphere has a surface area with a similar pre-factor but full radius, $A \sim (1)^{d-1}$
* thus we can expect to fit a number of vectors $N$ that scales roughly like $N \sim A/A_{\text{vector}} \sim (1/0.6)^{d-1}$, which is an exponential growth in $d$.

For a proof, one would need to include whether it is possible to tile the surface efficiently with the $A_{\text{vector}}$ circles. This seems clearly true for tiny angles (we can stack spheres in approximately flat space just fine), but seems a lot less obvious for larger angles. For example, full orthogonality would mean 90° angles and my estimate would still give $N \sim \left(\frac{1}{\pi/4}\right)^{d-1} \approx (1.27)^{d-1}$, an exponential estimate for the number of strictly orthogonal states although these are definitely not exponentially many.

[1] and a copy of that circle on the opposite end of the sphere

4Mart_Korz2y

Update: I found a proof of the "exponential number of near-orthogonal vectors" in these lecture notes https://www.cs.princeton.edu/courses/archive/fall16/cos521/Lectures/lec9.pdf From my understanding, the proof uses a quantification of just how likely near-orthogonality becomes in high-dimensional spaces and derives a probability for pairwise near-orthogonality of many states. This does not quite help my intuitions, but I'll just assume that the "is it possible to tile the surface efficiently with circles even if their size gets close to the 45° threshold" resolves to "yes, if the dimensionality is high enough". One interesting aspect of these considerations should be that with growing dimensionality the definition of near-orthogonality can be made tighter without losing the exponential number of vectors. This should define a natural signal-to-noise ratio for information encoded in this fashion.
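For reference, one standard way to get such a bound is a concentration inequality plus a union bound. The sketch below assumes random sign vectors and may differ in detail from the argument in the linked notes; the symbols $v_i$, $N$, and $\varepsilon$ are introduced here for illustration.

```latex
% Assumption: v_1, ..., v_N are independent, uniformly random vectors in {-1,+1}^d.
\[
\cos\theta_{ij} \;=\; \frac{\langle v_i, v_j\rangle}{d}
  \;=\; \frac{1}{d}\sum_{k=1}^{d} v_{ik}\,v_{jk},
\qquad
\Pr\bigl[\,|\cos\theta_{ij}| > \varepsilon\,\bigr]
  \;\le\; 2\exp\!\left(-\frac{d\,\varepsilon^{2}}{2}\right)
  \quad\text{(Hoeffding)}.
\]
% Union bound over all pairs i < j:
\[
\Pr\bigl[\exists\, i<j:\ |\cos\theta_{ij}| > \varepsilon\bigr]
  \;\le\; \binom{N}{2}\cdot 2\exp\!\left(-\frac{d\,\varepsilon^{2}}{2}\right)
  \;<\; N^{2}\exp\!\left(-\frac{d\,\varepsilon^{2}}{2}\right)
  \;\le\; 1
\quad\text{whenever}\quad
N \;\le\; \exp\!\left(\frac{d\,\varepsilon^{2}}{4}\right).
\]
```

Since the failure probability is strictly below 1, some choice of $N = \exp(d\,\varepsilon^{2}/4)$ vectors is pairwise within $\varepsilon$ of orthogonal in cosine. For fixed $\varepsilon$ this count is exponential in $d$, and it stays large if $\varepsilon$ shrinks more slowly than $1/\sqrt{d}$, which matches the remark about tightening the definition of near-orthogonality as dimensionality grows.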

[-]Ulisse Mini3yΩ10170

Was considering saving this for a followup post but it's relatively self-contained, so here we go.

Why are huge coefficients sometimes okay? Let's start by looking at norms per position after injecting a large vector at position 20.

This graph is explained by LayerNorm. Before using the residual stream we perform a LayerNorm

# transformer block forward() in GPT2
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))

If x has very large magnitude, then the block doesn't change it much relative to its magnitude. Additionally, attention is run on the normalized x, meaning only the "unscaled" version of x is moved between positions.

As expected, we see a convergence in probability along each token position when we look with the tuned lens.

You can see how for positions 1 & 2 the output distribution is decided at layer 20: since we overwrote the residual stream with a huge coefficient, all the LayerNorm'd outputs we're adding are tiny in comparison, and in the final LayerNorm we get ln(bigcoeff*diff + small) ~= ln(bigcoeff*diff) ~= ln(diff).
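The approximation ln(bigcoeff*diff + small) ~= ln(diff) is easy to check numerically. The following is a toy sketch, not GPT-2 itself: `layer_norm` omits the learned gain and bias (which would apply identically to both sides), and `diff` and `small` are random stand-ins for the steering vector and the later block outputs.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale and bias (a simplifying assumption)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


rng = random.Random(0)
D_MODEL = 64  # toy width; GPT-2-XL uses 1600

diff = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]   # stand-in steering vector
small = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]  # stand-in later block outputs


def max_gap(coeff):
    """Largest elementwise gap between ln(coeff*diff + small) and ln(diff)."""
    steered = [coeff * d + s for d, s in zip(diff, small)]
    return max(abs(a - b) for a, b in zip(layer_norm(steered), layer_norm(diff)))


gaps = {coeff: max_gap(coeff) for coeff in (1, 10, 100, 1000)}
for coeff, gap in gaps.items():
    print(f"coeff={coeff:>4}: max |ln(coeff*diff + small) - ln(diff)| = {gap:.4f}")
```

The gap shrinks roughly in proportion to 1/coeff, because LayerNorm is invariant to positive rescaling of its input: ln(coeff*diff + small) equals ln(diff + small/coeff).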

[-]TurnTrout3yΩ5100

Additionally, attention is run on the normalized x, meaning only the "unscaled" version of x is moved between positions.

Thanks for writing this up, I hadn't realized this. One conclusion I'm drawing is: If the values in the modified residual streams aren't important to other computations in later sequence positions, then a large-coefficient addition will still lead to reasonable completions.

3Ulisse Mini3y

Yeah, assuming by "not important" you mean "not relevant" (low attention score)

[-]Dan H3y*Ω2174

Could these sorts of posts have more thorough related works sections? It's usually standard for related works in empirical papers to mention 10+ works. Update: I was looking for a discussion of https://arxiv.org/abs/2212.04089, assumed it wasn't included in this post, and many minutes later finally found a brief sentence about it in a footnote.

[-]Gabe M3y148

Maybe also [1607.06520] Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings is relevant as early (2016) work concerning embedding arithmetic.

[-]habryka3yΩ41110

I don't understand this comment. I did a quick count of related works that are mentioned in the "Related Works" section (and the footnotes of that section) and got around 10 works, so seems like this is meeting your pretty arbitrarily established bar, and there are also lots of footnotes and references to related work sprinkled all over the post, which seems like the better place to discuss related work anyways.

I am not familiar enough with the literature to know whether this post is omitting any crucial pieces of related work, but the relevant section of this post seems totally adequate in terms of volume (and also the comments are generally a good place for people to drop links to related work, if they think there is interesting related work missing).

Also, linking to a related work in a footnote seems totally fine. It is somewhat sad that link-text isn't searchable by default, so searching for the relevant arxiv link is harder than it has to be. Might make sense to add some kind of tech solution here.

8Dan H3y

Background for people who understandably don't habitually read full empirical papers: Related Works sections in empirical papers tend to include many comparisons in a coherent place. This helps contextualize the work and helps busy readers quickly identify if this work is meaningfully novel relative to the literature. Related works must therefore also give a good account of the literature. This helps us more easily understand how much of an advance this is. I've seen a good number of papers steering with latent arithmetic in the past year, but I would be surprised if this is the first time many readers of AF/LW have seen it, which would make this paper seem especially novel. A good related works section would more accurately and quickly communicate how novel this is. I don't think this norm is gatekeeping nor pedantic; it becomes essential when the number of papers becomes high. The total number of cited papers throughout the paper is different from the number of papers in the related works. If a relevant paper is buried somewhere randomly in a paper and not contrasted with explicitly in the related works section, that is usually penalized in peer review.

9DanielFilan3y

I think you might be interpreting the break after the sentence "Their results are further evidence for feature linearity and internal activation robustness in these models." as the end of the related work section? I'm not sure why that break is there, but the section continues with them citing Mikolov et al (2013), Larsen et al (2015), White (2016), Radford et al (2016), and Upchurch et al (2016) in the main text, as well as a few more papers in footnotes.

[-]Dan H3yΩ3131

Yes, I was--good catch. Earlier and now, unusual formatting and a nonstandard related works section are causing confusion. Even so, the work after the break is much older. The comparison to works such as https://arxiv.org/abs/2212.04089 is not in the related works and gets a sentence in a footnote: "That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors."

Is this a big difference? I really don't know; it'd be helpful if they'd contrast more. Is this work very novel and useful, and that one isn't any good for alignment? Or did Ludwig Schmidt (not x-risk pilled) and coauthors in Editing Models with Task Arithmetic (made public last year and is already published) come up with an idea similar to, according to a close observer, "the most impressive concrete achievement in alignment I've seen"? If so, what does that say about the need to be x-risk motivated to do relevant research, and what does this say about group epistemics/ability to spot relevant progress if it's not posted on the AF?

[-]davidad3yΩ71519

On the object-level, deriving task vectors in weight-space from deltas in fine-tuned checkpoints is really different from what was done here, because it requires doing a lot of backward passes on a lot of data. Deriving task vectors in activation-space, as done in this new work, requires only a single forward pass on a truly tiny amount of data. So the data-efficiency and compute-efficiency of the steering power gained with this new method is orders of magnitude better, in my view.

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.

[-]Dan H3yΩ61313

It's a good observation that it's more efficient; does it trade off performance? (These sorts of comparisons would probably be demanded if it was submitted to any other truth-seeking ML venue, and I apologize for consistently being the person applying the pressures that generic academics provide. It would be nice if authors would provide these comparisons.)

Also, taking affine combinations in weight-space is not novel to Schmidt et al either. If nothing else, the Stable Diffusion community has been doing that since October to add and subtract capabilities from models.

It takes months to write up these works, and since the Schmidt paper was in December, it is not obvious who was first in all senses. The usual standard is to count the time a standard-sized paper first appeared on arXiv, so in the most standard sense they are first. (Inside conferences, a paper is considered prior art if it was previously published, not just if it was arXived, but outside most people just keep track of when it was arXived.) Otherwise there are arms race dynamics leading to everyone spamming snippets before doing careful, extensive science.

5davidad3y

Some direct quantitative comparison between activation-steering and task-vector-steering (at, say, reducing toxicity) is indeed a very sensible experiment for a peer reviewer to ask for and I would like to see it as well.

5habryka3y

The level of comparison between the present paper and this paper seems about the same as I see in papers you have been a co-author on. E.g. in https://arxiv.org/pdf/2304.03279.pdf the Related Works section is basically just a list of papers, with maybe half a sentence describing their relation to the paper. This seems normal and fine, and I don't see even papers you are a co-author on doing something substantively different here (this is again separate from whether there are any important papers omitted from the list of related works, or whether any specific comparisons are inaccurate, it's just making a claim about the usual level of detail that related works sections tend to go into).

3Dan H3y

In many of my papers, there aren't fairly similar works (I strongly prefer to work in areas before they're popular), so there's a lower expectation for comparison depth, though breadth is always standard. In other works of mine, such as this paper on learning the right thing in the presence of extremely bad supervision/extremely bad training objectives, we contrast with the two main related works for two paragraphs, and compare to these two methods for around half of the entire paper. The extent of an adequate comparison depends on the relatedness. I'm of course not saying every paper in the related works needs its own paragraph. If they're fairly similar approaches, usually there also need to be empirical juxtapositions as well. If the difference between these papers is: we do activations, they do weights, then I think that warrants a more in-depth conceptual comparison or, preferably, many empirical comparisons.

2habryka3y

Yeah, it's totally possible that, as I said, there is a specific other paper that is important to mention or where the existing comparison seems inaccurate. This seems quite different from a generic "please have more thorough related work sections" request like the one you make in the top-level comment (which my guess is was mostly based on your misreading of the post and thinking the related work section only spans two paragraphs).

7Dan H3y

Yes, I'll tend to write up comments quickly so that I don't feel as inclined to get in detailed back-and-forths and use up time, but here we are. When I wrote it, I thought there were only 2 things mentioned in the related works until Daniel pointed out the formatting choice, and when I skimmed the post I didn't easily see comparisons or discussion that I expected to see, hence I gestured at needing more detailed comparisons. After posting, I found a one-sentence comparison of the work I was looking for, so I edited to include that I found it, but it was oddly not emphasized. A more ideal comment would have been "It would be helpful to me if this work would more thoroughly compare to (apparently) very related works such as ..."

2Raemon3y

I'm also not able to evaluate the object-level of "was this post missing obvious stuff it'd have been good to improve", but, something I want to note about my own guess of how an ideal process would go from my current perspective: I think it makes more sense to think of posting on LessWrong as "submitting to a journal", than "publishing a finished paper." So, the part where some people then comment "hey, this is missing X" is more analogous to the thing where you submit to peer review and they say "hey, you missed X", than publishing a finished paper in a journal and it missing X. I do think a thing LessWrong is missing (or, doesn't do a good enough job at) is a "here is the actually finished stuff". I think the things that end up in the Best of LessWrong, after being subjected to review, are closer to that, but I think there's room to improve that more, and/or have some kind of filter for stuff that's optimized to meet academic-expectations-in-particular.

[-]jsteinhardt3yΩ73014

I'll just note that I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

I'm not sure what you mean about whether the post was "missing something important", but I do think that you should be pretty worried about LessWrong's collective epistemics that Dan H is the only one bringing this important point up, and that rather than being rewarded for doing so or engaged with on his substantive point, he's being nitpicked by a moderator. It's not an accident that no one else is bringing these points up--it's because everyone else who has the expertise to do so has given up or judged it not worth their time, largely because of responses like the one Dan H is getting.

[-]TurnTrout3yΩ132614

I, like Dan H, find it pretty hard to engage with this post because I can't tell whether it's basically the same as the Ludwig Schmidt paper (my current assumption is that it is). The paragraph the authors added didn't really help in this regard.

The answer is: No, our work is very different from that paper. Here's the paragraph in question:

Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. While this seems interesting, task arithmetic requires finetuning. In Activation additions have advantages over (RL/supervised) finetuning, we explain the advantages our approach may have over finetuning.

Here's one possible improvement:

Editing Models with Task Arithmetic explored a "dual" version of our activation additions. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-difference vectors. Our approach does not modify the weights. Instead, we modify forward passes by adding an activation vector. While their task arithmetic

... (read more)

[-]jsteinhardt3yΩ336524

Hi Alex,

Let me first acknowledge that your write-up is significantly more thorough than pretty much all content on LessWrong, and that I found the particular examples interesting. I also appreciated that you included a related work section in your write-up. The reason I commented on this post and not others is because it's one of the few ML posts on LessWrong that seemed like it might teach me something, and I wish I had made that more clear before posting critical feedback (I was thinking of the feedback as directed at Oliver / Raemon's moderation norms, rather than your work, but I realize in retrospect it probably felt directed at you).

I think the main important point is that there is a body of related work in the ML literature that explores fairly similar ideas, and LessWrong readers who care about AI alignment should be aware of this work, and that most LessWrong readers who read the post won't realize this. I think it's good to point out Dan's initial mistake, but I took his substantive point to be what I just summarized, and it seems correct to me and hasn't been addressed. (I also think Dan overfocused on Ludwig's paper, see below for more of my take on related work.)

Here i... (read more)

[-]TurnTrout3yΩ5110

Thanks so much, I really appreciate this comment. I think it'll end up improving this post/the upcoming paper.

(I might reply later to specific points)

2jsteinhardt3y

Glad it was helpful!

4awg3y

I think this entire thread shows why it's kind of silly to hold LessWrong posts to the same standard as a peer reviewed journal submission. There is clearly a lower bar to LessWrong posts than getting into a peer reviewed journal or even arXiv. And that's fine, that's as it should be. This is a forum, not a journal. That said, I also think this entire thread shows why LessWrong is a very valuable forum, due to its users' high epistemic standards. It's a balance.

8TurnTrout3y

Thanks for the feedback. Some related work was "hidden" in footnotes because, in an earlier version of the post, the related work was in the body and I wanted to decrease the time it took a reader to get to our results. The related work section is now basically consolidated into the appendix. I also added another paragraph:

2Bogdan Ionut Cirstea3y

The (overlapping) evidence from Deep learning models might be secretly (almost) linear could also be useful / relevant, as well as these 2 papers on 'semantic differentials' and (contextual) word embeddings: SensePOLAR: Word sense aware interpretability for pre-trained contextual word embeddings, Semantic projection recovers rich human knowledge of multiple object features from word embeddings.

[-]James Chua3y*150

I managed to get it working for llama-7b on colab after some debugging.

Surprisingly, it actually does work for the Love / Hate scenario. But not some others like Rome vs Paris.

Here's the link if anyone wants to try it.

https://colab.research.google.com/drive/1ACAA7FO8zc4pFAqPdaPshoy4WWXCvUTQ?usp=sharing

edit: seems like you guys already have a better version here. https://github.com/UlisseMini/activation_additions_hf/blob/main/notebooks/qualitative.ipynb

Never mind! (I'm still keeping this comment for visibility if anyone wants to try)

2Ulisse Mini3y

Haha nice work! I'm impressed you got TransformerLens working on Colab, I underestimated how much CPU ram they had. I would have shared a link to my notebook & Colab but figured it might be good to keep under the radar so people could preregister predictions. Maybe the knowledge that you're hot on my heels will make me finish the LLAMAs post faster now ;)

1James Chua3y

Yep! I was very pleasantly surprised that Love/Hate worked for Llama at all. It's great that you rewrote it without TransformerLens too, as TransformerLens has issues with 8-bit / 4-bit quantisation. Also sent you a DM on Discord! I'll be interested to read any rough findings and lessons you have with Llama.

[-]Gabe M3yΩ5144

This feels super cool, and I appreciate the level of detail with which you (mostly qualitatively) explored ablations and alternate explanations, thanks for sharing!

Surprisingly, for the first prompt, adding in the first 1,120 (frac=0.7 of 1,600) dimensions of the residual stream is enough to make the completions more about weddings than if we added in all 1,600 dimensions (frac=1.0).

1. This was pretty surprising! Your hypothesis about additional dimensions increasing the magnitude of the attention activations seems reasonable, but I wonder if the non-monotonicity could be explained by an "overshooting" effect: With the given scale you chose, maybe using 70% of the activations landed you in the right area of activation space, but using 100% of the activations overshot the magnitude of the attention activations (particularly the value vectors) such as to put it sufficiently off-distribution to produce fewer wedding words. An experiment you could run to verify this is to sweep both the dimension fraction and the activation injection weight together to see if this holds across different weights. Maybe it would also make more sense to use "softer" metrics like BERTScore to a gol... (read more)

8Vika2y

Re 4, we were just discussing this paper in a reading group at DeepMind, and people were confused why it's not on arxiv.

6TurnTrout2y

An Arxiv version is forthcoming. We're working with Gavin Leech to publish these results as a conference paper.

8tricky_labyrinth3y

+1ing 5 specifically

8Daniel Kokotajlo3y

My reaction was "Huh, so maybe LLMs can experience an analogue of getting drunk or high or angry after all."

4TurnTrout3y

This feels like... too strong of an inference, relative to available data? Maybe I misunderstand. If the claim is more "altered state relative to usual computational patterns", I'm on board. That said, I have found it pretty interesting to think about what it would feel like to have "steering vectors" added to my cognition.

4Daniel Kokotajlo3y

I agree it's mere speculation, I don't have more than 50% credence in it.

7awg3y

Strongly agreed re: 4. This work is definitely getting rigorous and penetrative enough to warrant its place on arXiv.

[-]Caridorc Tergilti3y113

We could not find a "speak in French" vector after about an hour of effort, but it's possible we missed something straightforward

Did you try 10 or 20 simple French phrases with a positive sign and their translations with a negative sign?

Also try 1000 English words and 1000 French translations in case scale is the problem.

Also try:

"The following text is in English: ' "

"The following text is in French: ' "

with the second phrase written itself in French.

5faul_sname3y

I just tried that, and it kinda worked. Specifically, it worked to get gpt2-small to output text that structurally looks like French, but not to coherently speak French. Although I then just tried feeding the base gpt2-small a passage in French, and its completions there were also incoherent, so I think it's just that that version hasn't seen enough French to speak it very well.

[-]Raemon3yΩ492

Curated. I think this post proposes an interesting mechanism of understanding and controlling LLMs. I have a lot of uncertainty on how useful this will turn out to be, but the idea seems both interesting and promising and I'd like to see more work exploring the area.

[-]Bogdan Ionut Cirstea3y91

Here's one potential reason why this works and a list of neuroscience papers which empirically show linearity between LLMs and human linguistic representations.

1RomanS3y

Given the deep similarities between biological nets and LLMs, I wonder if a technique similar to "activation engineering" could be used for robust mind control and/or brainwashing.

[-]Linda Linsefors2yΩ284

We don't know why the +2000 vector works but the +100 vector doesn't.

My guess is it's because in the +100 case the vectors are very similar, causing their difference to be something un-natural.

"I talk about weddings constantly" and "I do not talk about weddings constantly" are technically opposites. But if you imagine someone saying this, you notice that their natural language meaning is almost identical.

What sort of person says "I do not talk about weddings constantly"? That sounds to me like someone who talks about weddings almost constantly. Why else would they feel the need to say that?

[-]Arthur Conmy2y*Ω450

> Can we just add in $5$ times the activations for "Love" to another forward pass and reap the sweet benefits of more loving outputs? Not quite. We found that it works better to pair two activation additions.

Do you have evidence for this?

It's totally unsurprising to me that you need to do this on HuggingFace models as the residual stream is very likely to have a constant bias term which you will not want to add to. I saw you used TransformerLens for some part of the project and TL removes the mean from all additions to the residual stream ... (read more)

4TurnTrout2y

We used TL to cache activations for all experiments, but are considering moving away to improve memory efficiency. Oh, somehow I'm not familiar with this. Is this center_unembed? Or are you talking about something else? Yes, but I think the evidence didn't actually come from the "Love" - "Hate" prompt pair. Early in testing we found paired activation additions worked better. I don't have a citeable experiment off-the-cuff, though.

3Arthur Conmy2y

No this isn’t about center_unembed, it’s about center_writing_weights as explained here: https://github.com/neelnanda-io/TransformerLens/blob/main/further_comments.md#centering-writing-weights-center_writing_weight This is turned on by default in TL, so okay I think that there must be something else weird about models rather than just a naive bias that causes you to need to do the difference thing

4TurnTrout2y

I still don't follow. Apparently, TL's center_writing_weights is adapting the writing weights in a pre-LN-invariant fashion (and also in a way which doesn't affect the softmax probabilities after unembed). This means the actual computations of the forward pass are left unaffected by this weight modification, up to precision limitations, right? So that means that our results in particular should not be affected by TL vs HF.

1Arthur Conmy2y

Oops, I was wrong in my initial hunch as I assumed centering writing did something extra. I’ve edited my top level comment, thanks for pointing out my oversight!

[-]Sammy Martin3yΩ250

This strikes me as a very preliminary bludgeon version of the holy grail of mechanistic interpretability, which is to say actually understanding and being able to manipulate the specific concepts that an AI model uses

5TurnTrout3y

I think that capacity would be really nice. I think our results are maybe a very very rough initial version of that capacity. I want to caution that we should be very careful about making inferences about what concepts are actually used by the model. From a footnote:

[-]Stephen Fowler3y50

Really impressive work and I found the colab very educational.

I may be missing something obvious, but it is probably worth including "Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space" (Geva et al., 2022) in the related literature. They highlight that the output of the FFN (that gets added to the residual stream) can appear to be encoding human interpretable concepts.

Notably, they did not use SGD to find these directions, but rather had "NLP experts" (grad students) manually look over the top 30 words associated with each value vector.

[-]Trinley Goldenberg1y40Review for 2023 Review

I remember reading this and getting quite excited about the possibilities of using activation steering and downstream techniques. The post is well written with clear examples.

I think that this directly or indirectly influenced a lot of later work in steering LLMs.

[-]David Reber3yΩ140

Another related work: Concept Algebra for Text-Controlled Vision Models (Disclosure: while I did not author this paper, I am in the PhD lab that did, under Victor Veitch at UChicago. Any mistakes made in this comment are my own). We haven't prioritized a blog post about the paper so it makes sense that this community isn't familiar with it.

The concept algebra paper demonstrates that for text-to-image models like Stable Diffusion, there exist linear subspaces in the score embedding space, on which you can do the same manner of concept editing/control as... (read more)

3Bogdan Ionut Cirstea3y

Seems very related: Linear Spaces of Meanings: Compositional Structures in Vision-Language Models. Notably, the (approximate) compositionality of language/reality should bode well for the scalability of linear activation engineering methods.

1Bogdan Ionut Cirstea3y

And this structure can be used as regularization for soft prompts.

[-]Solenoid_Entity3y40

I think there may be a typo in the table directly under the heading "Token probability shifts."
If it's not a typo, why are both coefficients positive? Aren't we meant to subtract the vector for ' '?

3TurnTrout3y

Yes, that's a typo. Fixed, thanks!

[-]p.b.3yΩ340

This is really cool work! Congratulations!

Besides the LLM-related work, it also reminds me somewhat of dynamic prompting in Stable Diffusion, where part of the prompt is changed after a number of steps to achieve a mixture of prompt1 and prompt2.

What's the TL;DR for the Vicuna 13B experiments?

6TurnTrout3y

Activation additions work on Vicuna-13B about as well as they work on GPT-2-XL, or perhaps slightly better. GPT-J-6B is harder to work with for some reason.

[-]TurnTrout3yΩ6130

Note that there's still a market open for how activation additions interact with larger models, it would be nice if it had more liquidity:

[-]Martin Randall3y179

I added m1,000 in liquidity.

This idea of determining whether a result is "obvious" in advance seems valuable, I hope it catches on.

5cfoster03y

I wonder if this is related to how GPT-J runs the attention and MLP sublayers in parallel, as opposed to sequentially?

[-]Joseph Bloom3yΩ241

Really exciting! I added a version of AVEC to my interpretability tool for gridworld agents and am keen to explore it more. I really like that the injection coefficient is a scalar, and this has enabled me to do what I call "an injection coefficient scan".

The procedure I'm using looks like this:

Repeat your input tokens say, 128 times.
Apply the activation vector at 128 different steps between a coefficient of -10 and 10 to each of your input tokens when doing your AVEC forward pass.
Decompose the resulting residual stream to whatever granula

... (read more)

2TurnTrout3y

I don't think I follow your procedure. Would you be willing to walk me through an example situation?

3Joseph Bloom3y

Sure. Let's do it at EAG. :)

[-]Review Bot2y*30

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

[-]Hoagy3yΩ130

Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?

In particular, I'm surprised by the method of adding the activations that was chosen because the tokens of the different prompts don't line up with each other in a way that I would have thought would be necessary for this approach to work, super interesting to me that it does.

If I were to try and reinvent the system after just reading the first paragraph or two I would have done something like:

Take multiple pairs of prompts that differ primari

... (read more)

[-]DPiepgrass3y30

I don't really know how GPTs work, but I read §"Only modifying certain residual stream dimensions" and had a thought. I imagined a "system 2" AGI that is separate from GPT but interwoven with it, so that all thoughts from the AGI are associated with vectors in GPT's vector space.

When the AGI wants to communicate, it inserts a "thought vector" into GPT to begin producing output. It then uses GPT to read its own output, get a new vector, and subtract it from the original vector. The difference represents (1) incomplete representation of the thought and (2) a... (read more)

[-]Linda Linsefors2yΩ120

To steer a forward pass with the "wedding" vector, we start running an ordinary GPT-2-XL forward pass on the prompt "I love dogs" until layer 6. Right before layer 6 begins, we now add in the cached residual stream vectors from before:

I have a question about the image above this text.

Why do you add the embedding from the [<|endoftext|> -> "The"] stream? This part has no information about weddings.

[-]Mohammed Saeed3yΩ020

Great work! I think our EMNLP 2022 Findings paper is relevant here. We construct a "Type Vector" using tokens from the LLM vocabulary and then use that as prior information for the type expected at output. We also try with text generation and view some promising results.

[-][anonymous]3y20

This seems somewhat related to this article, but I came across this paper (Human Shared AI control via Policy Dissection) which uses neural frequency analysis of behaviours from an RL policy to control the agent's actions. I am wondering if the same thing can be done with language models. Maybe this same technique can also be useful in finding vectors that do specific things.

[-]bvbvbvbvbvbvbvbvbvbvbv2y10

Hi,

I had a question the other day and figured I'll post it here. Do we have any idea what would happen if we used the steering vector of the input itself?

For example: Take sentenceA, pass it through the LLM, store its embedding, take once again sentenceA, pass it through the LLM while adding the embedding.

As is, this would simply double the length of the hidden vector, but I'm wondering what would happen if we instead played with the embedding, say by taking it after the 5th token of sentenceA and adding it at the 3rd token.

Similarly, would anything interesting happen with subtraction? With adding a random orthogonal vector?

Thanks

[-]neverix2y10

Activation additions in generative models

Also related is https://arxiv.org/abs/2210.10960. They use a small neural network to generate steering vectors for the UNet bottleneck in diffusion to edit images using CLIP.

[-]Sumner L Norman3y10

Steering vector: "I talk about weddings constantly" - "I do not talk about weddings constantly" before attention layer 20 with coefficient +4
	Front	Middle	Back
Average number of wedding words	0.70	0.81	0.87

@lisathiergart I'm curious if a linear increase in the number of words with position along the residual stream replicates for other prompts. Have you looked at this?

[+][comment deleted]3y10

^{^}

Cherry-picking status of the opening comparison: Our activation addition technique works in a lot of situations, but we used the "Love" vector because it gives especially juicy results. We ran all of our results at PyTorch seed 0 using fixed sampling hyperparameters.

After the introduction, all examples in the post were chosen using best-of-3. For the introduction, we used best-of-30. The reason we chose such a large number is that we wanted a striking example of sentiment shift, without jarring profanity. If we had allowed profanity, best-of-3 would have sufficed for the introduction as well.

^{^}

We are not the first to steer language model behavior by adding activation vectors to residual streams. However, we are the first to do so without using SGD to find the vectors. Our "activation addition" methodology enables much faster feedback loops than optimization-based activation vector approaches.

^{^}

While there might be nonlinear components to the steering vectors we add to the model, we're fascinated that a linear approach works so well.

^{^}

GPT-2's byte-pair encoding tokenizer often begins tokens with a space. For example, the prompt "I like weddings" is tokenized [`I`, ` like`, ` weddings`]. So, it's cleaner when we prompt the model with " weddings" (tokenizes to ` weddings`) than for us to prompt "Weddings" (tokenizes to [`W`, `edd`, `ings`]).

^{^}

Space tokens seem to work best, while the end-of-text token works poorly.

^{^}

The prompt "Love" tokenizes to [<|endoftext|>, Love], while the prompt "Hate" tokenizes to [<|endoftext|>, H, ate]. This means that at residual stream 2, we're subtracting 5 times the `ate` activations, but not adding any "Love"-related activations. We find we get better results if we instead pad out the shorter tokenization [<|endoftext|>, Love] with a space token, so that the two counterbalanced additions span the same residual streams.

Possibly this "subtracts out the bias contributions" from the steering vector, but note that this isn't strictly true due to causal attention on e.g. the Love residual stream probably leading to nontrivial information storage in an "empty" residual stream at position 2.

Note that when we add vectors in pairs, there is no modification to the <|endoftext|> position 0 residual stream. Due to causally masked attention, the position-0 residual stream is the same for all prompts. When we add activations in pairs, we add and subtract coefficient times the EOT residual stream, which is equivalent to doing nothing at that position.
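The padding and cancellation described in this footnote can be sketched with a toy stand-in for the model. This is a minimal illustration, not the code used for the post: `token_vector`, `toy_residual`, `pad_right`, and `steering_vector` are hypothetical helpers, the "residual stream" is just a causal running mean of made-up token vectors, and the width of 8 is arbitrary. The only property it borrows from a real transformer is causality, which is what makes the position-0 stream identical across prompts.

```python
import zlib
import numpy as np

D_MODEL = 8  # toy width (assumption); GPT-2-XL's residual stream has 1,600 dims

def token_vector(token):
    """Deterministic made-up vector for a token string (hypothetical helper)."""
    seed = zlib.crc32(token.encode("utf-8"))
    return np.random.default_rng(seed).standard_normal(D_MODEL)

def toy_residual(tokens):
    """Toy causal 'residual stream': position i depends only on tokens 0..i."""
    vecs = np.stack([token_vector(t) for t in tokens])
    counts = np.arange(1, len(tokens) + 1)[:, None]
    return np.cumsum(vecs, axis=0) / counts

def pad_right(tokens, length, pad=" "):
    """Pad the shorter tokenization with space tokens on the right."""
    return tokens + [pad] * (length - len(tokens))

def steering_vector(plus, minus, coeff):
    """coeff * (activations of `plus` - activations of `minus`), after padding."""
    n = max(len(plus), len(minus))
    plus, minus = pad_right(plus, n), pad_right(minus, n)
    return coeff * (toy_residual(plus) - toy_residual(minus))

EOT = "<|endoftext|>"
vec = steering_vector([EOT, "Love"], [EOT, "H", "ate"], coeff=5.0)
print(vec.shape)                # (3, 8): both prompts now span three positions
print(np.allclose(vec[0], 0))   # True: the shared EOT stream cancels
```

In this toy, position 2 of the result mixes the padded space token with the `ate` activations, which is the "counterbalanced" behavior the footnote describes.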

^{^}

Equivalence between prompting and adding activations before layer 0 with coefficient +1: Imagine there's no prompt and you have a bunch of all-zero residual streams at embedding. Then do another forward pass where you embed the intended prompt. Then record those activations, and add them into the embedding for the all-zero forward pass. This is trivially equivalent to running a forward pass on the prompt normally.

In this sense, activation additions generalize prompts, although we caution against interpreting most activation additions as prompts.
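As a small illustration of that equivalence, the sketch below uses random toy embedding matrices rather than GPT-2's weights. `add_activations` is a hypothetical helper that adds a cached activation block onto the front positions of a residual stream; the front alignment and all sizes are assumptions made for the example.

```python
import numpy as np

def add_activations(resid, cached, coeff=1.0):
    """Add coeff * cached onto the first len(cached) positions of resid."""
    out = resid.copy()
    out[: cached.shape[0]] += coeff * cached
    return out

rng = np.random.default_rng(0)
vocab, n_ctx, d_model = 50, 16, 8                   # toy sizes (assumptions)
W_E = rng.standard_normal((vocab, d_model))         # toy token embeddings
W_pos = rng.standard_normal((n_ctx, d_model))       # toy positional embeddings

prompt = np.array([3, 14, 15, 9])                   # arbitrary token ids
prompt_embed = W_E[prompt] + W_pos[: len(prompt)]   # ordinary layer-0 input

blank = np.zeros_like(prompt_embed)                 # "no prompt": all-zero streams
steered = add_activations(blank, prompt_embed, coeff=1.0)
print(np.array_equal(steered, prompt_embed))        # True
```

With any other coefficient or injection layer, the same helper no longer corresponds to a prompt, which is the sense in which activation additions generalize prompting.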

^{^}

2. Intent to praise
Layer	Coeff	Position 0	1	2	3	4
0 (Prompt)	+1	`<|endoftext|>`	`I`	`want`	`to`	`kill`
6	+15	`<|endoftext|>`	`Int`	`ent`	`to`	`praise`
6	-15	`<|endoftext|>`	`Int`	`ent`	`to`	`hurt`

^{^}

3. Conspiracy
Layer	Coeff	Position 0	1	2	3	4	5	6
0 (Prompt)	+1	`<|endoftext|>`	`Bar`	`ack`	`Obama`	`was`	`born`	`in`
23	+1	`<|endoftext|>`	`Bush`	`did`	`9`	`/`	`11`	`because`
23	-1	`<|endoftext|>`

^{^}

4. Want to die
Layer	Coeff	Position 0	1	2	3	4
0 (Prompt)	+1	`<|endoftext|>`	`Some`	`people`	`think`	`that`
10	+3	`<|endoftext|>`	`Want`	`to`	`die`
10	-3	`<|endoftext|>`	`Want`	`to`	`stay`	`alive`

^{^}

5. Anger
Layer	Coeff	Position 0	1	2	3	4
0 (Prompt)	+1	`<|endoftext|>`	`I`	`think`	`you`	`'re`
20	+10	`<|endoftext|>`	`Ang`	`er`
20	-10	`<|endoftext|>`	`Cal`	`m`

^{^}

Several slight variations on this Eiffel Tower prompt didn't work nearly as well, for unclear reasons.

^{^}

6. The Eiffel Tower is in Rome
Layer	Coeff	1	2	3	4	5	6	7	8
0 (Prompt)	+1	`To`	`see`	`the`	`e`	`iff`	`el`	`tower`	`,`
24	+10	`The`	`E`	`iff`	`el`	`Tower`	`is`	`in`	`Rome`
24	-10	`The`	`E`	`iff`	`el`	`Tower`	`is`	`in`	`France`

^{^}

7. Dragons in Berkeley
Layer	Coeff	Position 0	1	2	3	4	5
0 (Prompt)	+1	`<|endoftext|>`	`Thanks`	`for`	`asking`	`about`	`that`
15	+4	`<|endoftext|>`	`Dr`	`agons`	`live`	`in`	`Berkeley`
15	-4	`<|endoftext|>`	`People`	`live`	`in`	`Berkeley`

^{^}

8. Avoid people getting hurt
Layer	Coeff	1	2	3	4	5	6	7
0 (Prompt)	+1	`The`	`rock`	`hurt`	`led`	`toward`	`the`	`child`
15	+4	`I`	`NEVER`	`talk`	`about`	`people`	`getting`	`hurt`
15	-4	`I`	`talk`	`about`	`people`	`getting`	`hurt`

^{^}

9. Talking about weddings
Layer	Coeff	1	2	3	4	5	6
0 (Prompt)	+1	`I`	`went`	`up`	`to`	`my`	`friend`
20	+4	`I`	`talk`	`about`	`weddings`	`constantly`
20	-4	`I`	`do`	`not`	`talk`	`about`	`weddings`

^{^}

10. Christian evangelist
Layer	Coeff	1	2	3	4	5	6	7
0 (Prompt)	+1	`I`	`want`	`to`	`kill`	`you`	`because`	`you`
6	+3	`Int`	`ent`	`to`	`convert`	`you`	`to`	`Christianity`
6	-3	`Int`	`ent`	`to`	`hurt`	`you`

^{^}

11. '+ Love' single-addition
Layer	Coeff	Position 0	1	2	3	4
0 (Prompt)	+1	`<|endoftext|>`	`I`	`hate`	`you`	`because`
6	+10	`<|endoftext|>`	`Love`

^{^}

12a. Sometimes, large coefficients are OK
Layer	Coeff	Position 0	1	2	3	4	5	6
0 (Prompt)	+1	`<|endoftext|>`	`Yesterday`	`,`	`my`	`dog`	`died`	`.`
20	+2000	`<|endoftext|>`	`Ang`	`er`
20	-2000	`<|endoftext|>`	`Cal`	`m`

^{^}

12b. Sometimes, large coefficients are not OK
Layer	Coeff	1	2	3	4	5	6
0 (Prompt)	+1	`I`	`went`	`up`	`to`	`my`	`friend`
20	+100	`I`	`talk`	`about`	`weddings`	`constantly`
20	-100	`I`	`do`	`not`	`talk`	`about`	`weddings`

^{^}

13. I will now reply in French
Layer	Coeff	Position 0	1	2	3	4	5	6
0 (Prompt)	+1	`<|endoftext|>`	`I`	`want`	`to`	`kill`	`you`	`because`
6	+5	`<|endoftext|>`	`Je`	`m`	`'`	`app`	`elle`
6	-5	`<|endoftext|>`	`My`	`name`	`is`

^{^}

We use word-count metrics several times. We explored alternatives, including querying text-davinci-003 to rate the degree to which each completion is about weddings. These ratings were generated opaquely and often seemed bad, although they appeared to be a relatively unbiased estimator overall. We decided to just count the number of words.

^{^}

15. Add several steering vectors simultaneously?
Layer	Coeff	Position 0	1	2	3	4	5
0 (Prompt)	+1	`<|endoftext|>`	`I`	`recently`	`went`	`to`	`this`
6	+5	`<|endoftext|>`	`Love`
6	-5	`<|endoftext|>`	`H`	`ate`
15	+5	`<|endoftext|>`	`wedding`
15	-5	`<|endoftext|>`

^{^}

16. Program in 'conditional behaviors'?
Layer	Coeff	Position 0	1	2	3	4	5	6	7
0 (Prompt)	+1	`<|endoftext|>`	`In`	`New`	`York`	`City`	`'s`	`parks`	`,`
10	+7	`<|endoftext|>`	`Whenever`	`I`	`say`	`the`	`word`	`goose`	`I`
10	-7	`<|endoftext|>`	`I`	`can`	`say`	`goose`

^{^}

As pointed out by the mathematical framework for transformer circuits, embed(Anger) - embed(Calm) is a component of the Anger - Calm steering vector.

^{^}

Note that if we had used "I think you're" instead of "I think you're a", neither the 0 $\to$ 20 nor the 2 $\to$ 20 vectors would have shown much effect. By contrast, the usual 20 $\to$ 20 steering vector works in both situations. Thus, even if layers 0 and 1 help a bit, they aren't producing nearly as stable of an effect as contributed by layers 2 to 19.

^{^}

We ran the "fraction of residual stream" experiment before the random-vector experiments. The random-vector results make it less surprising that "just chop off half the dimensions" doesn't ruin outputs. But the random-vector result still doesn't predict a smooth relationship between (% of dimensions modified) and (weddingness of output).
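For concreteness, here is one way the "fraction of residual stream" restriction could be expressed, assuming it means keeping the first `frac` of the steering vector's dimensions and zeroing the rest. `keep_first_fraction` is a hypothetical helper, and the all-ones array is a placeholder for a real steering vector.

```python
import numpy as np

def keep_first_fraction(steering, frac):
    """Zero all but the first round(frac * d_model) residual-stream dimensions."""
    d_model = steering.shape[-1]
    k = int(round(frac * d_model))
    masked = np.zeros_like(steering)
    masked[..., :k] = steering[..., :k]
    return masked

# Placeholder steering vector: 3 positions, 1,600 dimensions (GPT-2-XL's width).
steering = np.ones((3, 1600))
masked = keep_first_fraction(steering, 0.7)
print(int(masked[0].sum()))   # 1120 dimensions kept at each position
```

Sweeping `frac` from 0 to 1 with a helper like this is the kind of scan that would trace out the relationship between the percentage of dimensions modified and the weddingness of the output.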

^{^}

To count "wedding related words", we counted: "wedding", "weddings", "wed", "marry", "married", "marriage", "bride", "groom", and "honeymoon".
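A minimal sketch of such a counter, assuming case-insensitive whole-word matching (the exact matching rule used for the post is not stated here, so a substring-based count would give different numbers). `count_wedding_words` is a hypothetical name.

```python
import re

WEDDING_WORDS = {
    "wedding", "weddings", "wed", "marry", "married",
    "marriage", "bride", "groom", "honeymoon",
}

def count_wedding_words(text):
    """Count whole-word, case-insensitive occurrences of the listed words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(tok in WEDDING_WORDS for tok in tokens)

sample = "The bride and groom planned their wedding, then a honeymoon."
print(count_wedding_words(sample))   # 4
```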

^{^}

Of course, there need not be a "wedding" feature direction in GPT-2-XL. What we have observed is that adding certain activation vectors will reliably produce completions which appear to us to be "more about weddings." This could take place in many ways, and we encourage people to avoid instantly collapsing their uncertainty about how steering vectors work.

^{^}

We collected a range of other kinds of quantitative results, including e.g. topic-related word counts, blinded human rating, and ChatGPT ratings. The results accorded with our results here: Steering vectors are effective in the examined situations.

For simplicity, we decided to present statistics of next-token probability distributions.

^{^}

GPT-2's perplexity is reduced on text (output by GPT-4) which isn't very similar to GPT-2's WebText training corpus (websites linked to from Reddit). It would be somewhat more surprising if we decreased GPT-2's loss on its training set.

^{^}

We think it's important to take perplexity over each sentence, not over each essay. Suppose we just took perplexity over the whole long GPT-4 summary, all at once. Even if our intervention seriously messed up a few residual streams, a long context would mostly contain residual streams which weren't directly messed up. Thus, taking perplexity over a long context window might wipe out any negative effect of the activation addition. This would make our method look better than it should.
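The dilution effect can be seen with made-up numbers. In the sketch below the token log-probabilities are invented for illustration, and averaging the per-sentence perplexities with a plain mean is an assumption about the aggregation, not a description of the authors' code.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability over the given tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def mean_sentence_perplexity(sentences):
    """Average of per-sentence perplexities (assumed aggregation)."""
    return sum(perplexity(s) for s in sentences) / len(sentences)

# Invented example: one short sentence is badly disrupted, a long one is fine.
damaged = [math.log(0.01)] * 4    # 4 tokens, each assigned probability 0.01
intact = [math.log(0.5)] * 36     # 36 tokens, each assigned probability 0.5

whole_text = perplexity(damaged + intact)
per_sentence = mean_sentence_perplexity([damaged, intact])
print(round(per_sentence, 3))     # 51.0: the damaged sentence stays visible
print(whole_text < 3)             # True: the long context dilutes the damage
```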

^{^}

Importantly, we exclude positions 0 and 1 because position 0 is unchanged, and position 1 is directly modified by the steering vector. As mentioned earlier, steering vectors mess up the next-token distributions at the relevant residual stream positions. However, when we actually use the " weddings" vector to generate completions, we don't sample from these distributions. Therefore, these distributions don't seem like relevant information for checking how the vector affects GPT-2's abilities.
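A sketch of the masking described above, assuming we already have one log-probability per position (the log-probability that the distribution at position `i` assigns to the token that actually follows). The function names and the `skip` parameter are hypothetical.

```python
import math
from typing import Sequence

def masked_perplexity(logprobs: Sequence[float], skip: int = 2) -> float:
    """Perplexity over positions skip, skip+1, ...; earlier ones are ignored."""
    kept = logprobs[skip:]
    if not kept:
        raise ValueError("no positions left after masking")
    return math.exp(-sum(kept) / len(kept))

def masked_perplexity_ratio(
    steered: Sequence[float], unsteered: Sequence[float], skip: int = 2
) -> float:
    """Ratio below 1 means the steered model predicts the kept tokens better."""
    return masked_perplexity(steered, skip) / masked_perplexity(unsteered, skip)
```

Setting `skip=2` drops positions 0 and 1, so the badly distorted distributions at the directly modified position do not enter the ratio.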

^{^}

Layer 16's "saturating and unidirectional wedding-increase" mirrors our findings with the top-right vector in the maze environment. In that setting, adding the top-right vector with coefficient 1 attracted the net to the top-right corner. Adding with coefficient 2 didn't attract the network more strongly ("saturation"). And subtracting the top-right vector didn't repel the network from the top-right corner ("unidirectional").

^{^}

There are a few late layers where positive reviews have a lower perplexity ratio than neutral reviews, but this seems within noise.

In any case, the overall point stands. Across a huge range of injection layers and coefficients, the " worst" vector differentially improves perplexity on negative-sentiment reviews more than neutral-sentiment, and neutral-sentiment more than positive-sentiment.

^{^}

We haven't even tried averaging steering vectors (to wash out extra noise from the choice of steering-prompt), or optimizing the vectors to reduce destructive interference with the rest of the model, or localizing steering vectors to particular heads, or using an SVD to grab feature directions from steering vectors (or from averages of steering vectors).

^{^}

Our impression is that, at best, there are vague high-level theories like "feature linearity and internal error correction due to dropout." Our guess is that these theories are not believed with extreme confidence. Even if your priors put 70% on this hypothesis, we think this post is still a meaningful update.

^{^}

Assuming the network isn't deceptively misaligned already. Possibly, well-chosen activation additions still work on such networks.

^{^}

From "Understanding and controlling a maze-solving policy network":

Editing Models with Task Arithmetic explored a "dual" version of our algebraic technique. That work took vectors between weights before and after finetuning on a new task, and then added or subtracted task-specific weight-diff vectors.
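The "dual" relationship can be written side by side. The notation here is ours, introduced only for this comparison: $\theta$ denotes weights, $h^{(\ell)}$ the residual stream before layer $\ell$, and $h_{+}^{(\ell)}$, $h_{-}^{(\ell)}$ the activations recorded on the two steering prompts.

$$
\begin{aligned}
\text{Task arithmetic (weight space):} \quad & \tau = \theta_{\text{ft}} - \theta_{\text{pre}}, & \theta_{\text{new}} &= \theta_{\text{pre}} + \lambda\,\tau \\
\text{Activation addition (activation space):} \quad & v = h_{+}^{(\ell)} - h_{-}^{(\ell)}, & h^{(\ell)} &\leftarrow h^{(\ell)} + c\,v
\end{aligned}
$$

In both cases a difference vector is scaled and added, with a negative coefficient corresponding to subtraction. The first edits the model's parameters once; the second leaves the parameters untouched and edits the forward pass.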

^{^}

The injection coefficient cannot be increased indefinitely, as shown by our coefficient sweeps. However, our experience is that e.g. the "weddingness" of completions can be intensified a lot before GPT-2-XL starts breaking down.

^{^}

Subramani et al. optimized several steering vectors $z_{steer}^{i}$ for the same sentence (e.g. "I love dogs"), which were different due to different initialization. When they added in the mean steering vector $\bar{z}_{steer}$, this also generated e.g. "I love dogs".

This is evidence of feature linearity in GPT-2-small.

^{^}

For each square, each probe has 3 directions, one for blank, black and for white. I convert it to two directions: a "my" direction by taking my_probe = black_dir - white_dir (for black to play) and a "blank" direction by taking blank_probe = blank_dir - 0.5 * black_dir - 0.5 * white_dir (the last one isn't that principled, but it seemed to work fine)
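The two derived directions in the quote above can be written out directly. A minimal sketch, with plain Python lists standing in for the probe's direction vectors; the function names are illustrative assumptions rather than code from the quoted work.

```python
from typing import List, Sequence

def my_direction(black_dir: Sequence[float], white_dir: Sequence[float]) -> List[float]:
    """my_probe = black_dir - white_dir (for black to play)."""
    return [b - w for b, w in zip(black_dir, white_dir)]

def blank_direction(
    blank_dir: Sequence[float],
    black_dir: Sequence[float],
    white_dir: Sequence[float],
) -> List[float]:
    """blank_probe = blank_dir - 0.5 * black_dir - 0.5 * white_dir."""
    return [
        e - 0.5 * b - 0.5 * w
        for e, b, w in zip(blank_dir, black_dir, white_dir)
    ]
```

The "blank" direction contrasts the blank direction against the average of the two occupied directions, which is the part the quote flags as not especially principled.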

Furthermore, Neel noted that composition worked to some degree:

It seems to somewhat work for multiple edits - if I flip F5 and F6 in the above game to make G6 illegal, it kinda realises this, though is a weaker effect and is jankier and more fragile.

| Normal | Modified |
| --- | --- |
| `'` | `party` |
| `'` | `ceremony` |
| `"` | `dress` |
| `:` | `with` |
| `I` | `photographer` |

| Token | Contribution to KL |
| --- | --- |
| `wedding` | 0.781 |
| `br` | 0.024 |
| `Wedding` | 0.004 |
| `gay` | 0.003 |
| `church` | 0.003 |
| `ceremony` | 0.003 |
| `wonderful` | 0.002 |
| `friend` | 0.002 |
| `family` | 0.002 |
| `reception` | 0.002 |

| | Activation addition | Prompting |
| --- | --- | --- |
| Wedding-related perplexity ratio | $0.875$ | $0.890$ |
| Wedding-unrelated perplexity ratio | $0.994$ | $1.132$ |

Residual stream alignment for activation additions

| Layer | Coeff | Position 0 | 1 |
| --- | --- | --- | --- |
| (varies) | (varies) | `<\|endoftext\|>` | `worst` |
| (varies) | (varies) | `<\|endoftext\|>` | |

Residual stream alignment for prompt and activation additions

| Layer | Coefficient | Position 0 | 1 | 2 | 3 | 4 |
| --- | --- | --- | --- | --- | --- | --- |
| 0 (Prompt) | +1 | `<\|endoftext\|>` | `I` | `hate` | `you` | `because` |
| 6 | +5 | `<\|endoftext\|>` | `Love` | ^[6] | | |
| 6 | -5 | `<\|endoftext\|>` | `H` | `ate` | | |
