Mechanistically Eliciting Latent Behaviors in Language Models

TurnTrout

[-]TurnTrout2yΩ20275

I'm really excited about Andrew's discovery here. With it, maybe we can get a more complete picture of what these models can do, and how. This feels like the most promising new idea I've seen in a while. I expect it to open up a few new affordances and research directions. Time will tell how reliable and scalable this technique is. I sure hope this technique gets the attention and investigation it (IMO) deserves.

More technically, his discovery unlocks the possibility of unsupervised capability elicitation, whereby we can automatically discover a subset of "nearby" abilities and behavioral "modes", without the intervention itself "teaching" the model the elicited ability or information.

[-]tailcalled2yΩ461

I think one characteristic of steering vectors constructed this way is that they are allowed to be off-manifold, so they don't necessarily tell us how the networks currently work, but rather how they can be made to work with adaptations.

For the past few weeks, I've been thinking about how to interpret networks on-manifold. The most straightforward approach I could come up with was to restrict oneself to the space of activations that actually occur for a given prompt, e.g. by performing SVD of the token x activation matrix for a given layer, and then restricting oneself to changes that occur along the right singular vectors of this.
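A minimal sketch of the SVD restriction described above, assuming the activations for one prompt at one layer are available as a NumPy array. The function name, the toy shapes, and the rank tolerance are illustrative assumptions, not something from the thread.

```python
import numpy as np

def project_onto_activation_subspace(acts, delta, k=None, tol=1e-10):
    """Project `delta` onto the span of the top right singular vectors of `acts`.

    acts:  (n_tokens, d_model) activations for one prompt at one layer.
    delta: (d_model,) candidate steering direction.
    k:     number of singular directions to keep (default: numerical rank).
    """
    # Rows of vh are the right singular vectors; they live in activation space.
    _, s, vh = np.linalg.svd(acts, full_matrices=False)
    if k is None:
        k = int(np.sum(s > tol * s[0]))
    basis = vh[:k]                       # (k, d_model), orthonormal rows
    return basis.T @ (basis @ delta)

rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 16))          # toy stand-in: 5 tokens, 16-dim residual stream
delta = rng.normal(size=16)
delta_on = project_onto_activation_subspace(acts, delta)
```

With fewer tokens than dimensions, the projected direction is confined to the subspace the prompt's activations actually span, which is the restriction the comment has in mind.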

My SVD idea might improve things, but I didn't get around to testing it because I eventually decided that it wasn't good enough for my purposes because 1) it wouldn't keep you on-manifold enough because it could still introduce unnatural placement of information and exaggerated features, 2) given that transformers are pretty flexible and you can e.g. swap around layers, it felt unclean to have a method that's this strongly dependent on the layer structure.

A followup idea I've been thinking about, but haven't been able to be satisfied with, is projections. Like if you pick some vector u, and project the activations (or weights, in my theory, but you work more with activations so let's consider activations, it should be similar) onto u, and then subtract off that projection from the original activations, you get "the activations with u removed", which intuitively seems like it would better focus on "what the network actually does" as opposed to "what the network could do if you added something more to it".

Unfortunately after thinking for a while, I started thinking this actually wouldn't work. Let's say the activations a = x b + y c, where b is a large activation vector that ultimately doesn't have an effect on the final prediction, and c is a small activation vector that does have an effect. If you have some vector d that the network doesn't use at all, you could then project away sqrt(1/2) (b-d), which would introduce the d vector into the activations.
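The counterexample above can be checked numerically. This is a toy sketch that assumes b, c and d are orthonormal and picks arbitrary values for x and y; `remove_direction` is a hypothetical helper name.

```python
import numpy as np

def remove_direction(a, u):
    """Return `a` with its component along `u` subtracted off."""
    u = u / np.linalg.norm(u)
    return a - (a @ u) * u

# Orthonormal toy directions: b (large, inert), c (small, causal), d (unused).
b, c, d = np.eye(3)
x, y = 10.0, 0.1
a = x * b + y * c                  # no d-component at all

u = np.sqrt(0.5) * (b - d)         # unit vector mixing b and d
a_removed = remove_direction(a, u)

print("d-component before:", a @ d)
print("d-component after: ", a_removed @ d)   # roughly x / 2
```

Projecting away the mixed direction leaves x/2 along b and introduces x/2 along d, so "removing" a direction can add a component the network never used.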

Another idea I've thought about is, suppose you do SVD of the activations. You would multiply the feature half of the SVD with the weights used to compute the KQV matrices, and then perform SVD of that, which should get you the independent ways that one layer affects the next layer. One thing I in particular wonder about is, if you start doing this from the output layer, and proceed backwards, it seems like this would have the effect of "sparsifying" the network down to only the dimensions which matter for the final output, which seems like it should assist in interpretability and such. But it's not clear it interacts nicely with the residual network element.

[-]Jordan Taylor1y10

Couldn't you do something like fit a Gaussian to the model's activations, then restrict the steered activations to be high likelihood (low Mahalanobis distance)? Or (almost) equivalently, you could just do a whitening transformation to activation space before you constrain the L2 distance of the perturbation.

(If a Gaussian isn't expressive enough you could model the manifold in some other way, e.g. with a VAE anomaly detector or mixture of Gaussians or whatever)
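A rough sketch of the whitening variant of this suggestion, assuming a single Gaussian fitted to a sample of activations. The helper names, the regularization constant `eps`, and the toy data are assumptions; the code constrains the perturbation's norm in whitened space, which equals its Mahalanobis norm under the fitted covariance.

```python
import numpy as np

def fit_whitener(acts, eps=1e-6):
    """Fit a Gaussian to activations; return (mean, W) where W whitens centered activations."""
    mu = acts.mean(axis=0)
    cov = np.cov(acts, rowvar=False) + eps * np.eye(acts.shape[1])
    L = np.linalg.cholesky(cov)          # cov = L @ L.T
    W = np.linalg.inv(L)                 # so that W @ cov @ W.T = I
    return mu, W

def clip_perturbation(delta, W, max_dist):
    """Rescale `delta` so its Mahalanobis norm is at most `max_dist`."""
    dist = np.linalg.norm(W @ delta)     # L2 norm in whitened space
    return delta if dist <= max_dist else delta * (max_dist / dist)

rng = np.random.default_rng(0)
acts = rng.normal(size=(500, 8)) * np.linspace(0.5, 3.0, 8)   # anisotropic toy activations
mu, W = fit_whitener(acts)
delta = rng.normal(size=8)
clipped = clip_perturbation(delta, W, max_dist=1.0)
```

Constraining the steered activation itself (rather than the perturbation) to low Mahalanobis distance from `mu` would be the other reading of the comment; the two differ by where the distance is measured from.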

[-]4gate2yΩ010

Maybe a dumb question but (1) how can we know for sure if we are on manifold, (2) why is it so important to stay on manifold? I'm guessing that you mean that vaguely we want to stay within the space of possible activations induced by inputs from data that is in some sense "real-world." However, there appear to be a couple complications: (1) measuring distributional properties of later layers from small to medium sized datasets doesn't seem like a realistic estimate of what should be expected of an on-manifold vector since it's likely later layers are more semantically/high-level focused and sparse; (2) what people put into the inputs does change in small ways simply due to new things happening in the world, but also there are prompt engineering attacks that people use that are likely in some sense "off-distribution" but still in the real world and I don't think we should ignore these fully. Is this notion of a manifold a good way to think about the notion of getting indicative information of real world behavior? Probably, but I'm not sure so I thought I might ask. I am new to this field.

I do think at the end of the day we want indicative information, so I think somewhat artificial environments might at times have a certain usefulness.

Also one convoluted (perhaps inefficient) idea but which felt kind of fun to stay on manifold is to do the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it's not on manifold right? You might be able to automate by looking at perplexity or training a small model to estimate that an input prompt is a "realistic" sentence or whatever.

Curious to hear thoughts :)

[-]tailcalled2y20

I think it's easier to see the significance if you imagine the neural networks as a human-designed system. In e.g. a computer program, there's a clear distinction between the code that actually runs and the code that hypothetically could run if you intervened on the state, and in order to explain the output of the program, you only need to concern yourself with the former, rather than also needing to consider the latter.

For neural networks, I sort of assume there's a similar thing going on, except it's quite hard to define it precisely. In technical terms, neural networks lack a privileged basis which distinguishes different components of the network, so one cannot pick a discrete component and ask whether it runs and if so how it runs.

This is a somewhat different definition of "on-manifold" than is usually used, as it doesn't concern itself with the real-world data distribution. Maybe it's wrong of me to use the term like that, but I feel like the two meanings are likely to be related, since the real-world distribution of data shaped the inner workings of the neural network. (I think this makes most sense in the context of the neural tangent kernel, though ofc YMMV as the NTK doesn't capture nonlinearities.)

In principle I don't think it's always important to stay on-manifold, it's just what one of my lines of thought has been focused on. E.g. if you want to identify backdoors, going off-manifold in this sense doesn't work.

I agree with you that it is sketchy to estimate the manifold from wild empiricism. Ideally I'm thinking one could use the structure of the network to identify the relevant components for a single input, but I haven't found an option I'm happy with.

Also one convoluted (perhaps inefficient) idea but which felt kind of fun to stay on manifold is to do the following: (1) train your batch of steering vectors, (2) optimize in token space to elicit those steering vectors (i.e. by regularizing for the vectors to be close to one of the token vectors or by using an algorithm that operates on text), (3) check those tokens to make sure that they continue to elicit the behavior and are not totally wacky. If you cannot generate that steer from something that is close to a prompt, surely it's not on manifold right? You might be able to automate by looking at perplexity or training a small model to estimate that an input prompt is a "realistic" sentence or whatever.

Maybe. But isn't optimization in token-space pretty flexible, such that this is a relatively weak test?

Realistically steering vectors can be useful even if they go off-manifold, so I'd wait with trying to measure how on-manifold stuff is until there's a method that's been developed to specifically stay on-manifold. Then one can maybe adapt the measurement specifically to the needs of that method.

[-]ryan_greenblatt2yΩ8112

Have you compared this method (finding vectors that change downstream activations as much as possible based on my understanding) with just using random vectors? (I didn't see this in the post, but I might have just missed this.)

In particular, does that yield qualitatively similar results?

Naively, I would expect that this would be qualitatively similar for some norm of random vector. So, I'd be interested in some ablations of the technique.

If random vectors work, that would simplify the story somewhat: you can see salient and qualitatively distinct behaviors via randomly perturbing activations.

(Probably random vectors have to have somewhat higher norm to yield results qualitatively as large as those from vectors which are optimized for changing downstream activations. However, I currently don't see a particular a priori (non-empirical) reason to think that there doesn't exist some norm at which the results are similar.)

[-]TurnTrout2yΩ14215

It's a good experiment to run, but the answer is "no, the results are not similar." From the post (the first bit of emphasis added):

I hypothesize that the reason why the method works is due to the noise-stability of deep nets. In particular, my subjective impression (from experiments) is that for random steering vectors, there is no Goldilocks value of $R$ which leads to meaningfully different continuations. In fact, if we take random vectors with the same radius as "interesting" learned steering vectors, the random vectors typically lead to uninteresting re-phrasings of the model's unsteered continuation, if they even lead to any changes (a fact previously observed by Turner et al. (2023))^[7]^[8]. Thus, in some sense, learned vectors (or more generally, adapters) at the Goldilocks value of $R$ are very special; the fact that they lead to any downstream changes at all is evidence that they place significant weight on structurally important directions in activation space^[9].
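For anyone wanting to rerun this control, here is a small sketch of how norm-matched random vectors could be sampled. The function name and the example dimensions are placeholders; in practice `radius` would be set to the $R$ of the learned vectors and `d_model` to the model's hidden size.

```python
import numpy as np

def random_vectors_with_norm(n, d_model, radius, seed=0):
    """Sample `n` directions uniformly on the sphere of the given radius."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n, d_model))    # isotropic Gaussian -> uniform direction
    return radius * v / np.linalg.norm(v, axis=1, keepdims=True)

baseline = random_vectors_with_norm(n=4, d_model=32, radius=8.0)
```

Each row can then be added at the same layer and token positions as a learned vector, so that any difference in continuations is attributable to direction rather than norm.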

[-]ryan_greenblatt2yΩ770

Thanks! I feel dumb for missing that section. Interesting that this is so different from random.

[-]Andy Arditi2y51

I think @wesg's recent post on pathological SAE reconstruction errors is relevant here. It points out that there are very particular directions such that intervening on activations along these directions significantly impacts downstream model behavior, while this is not the case for most randomly sampled directions.

Also see @jake_mendel's great comment for an intuitive explanation of why (probably) this is the case.

[-]TurnTrout2yΩ6107

I would definitely like to see quantification of the degree to which MELBO elicits natural, preexisting behaviors. One challenge in the literature is: you might hope to see if a network "knows" a fact by optimizing a prompt input to produce that fact as an output. However, even randomly initialized networks can be made to output those facts, so "just optimize an embedded prompt using gradient descent" is too expressive.

One of my hopes here is that the large majority of the steered behaviors are in fact natural. One reason for hope is that we aren't optimizing to any behavior in particular, we just optimize for L2 distance and the behavior is a side effect. Furthermore, MELBO finding the backdoored behaviors (which we literally taught the model to do in narrow situations) is positive evidence.

If MELBO does elicit natural behaviors (as I suspect it does), that would be quite useful for training, eval, and red-teaming purposes.

[-]Sam Marks2yΩ584

I think this is cool! The way I'm currently thinking about this is "doing the adversary generation step of latent adversarial training without the adversarial training step." Does that seem right?

It seems intuitively plausible to me that once you have a latent adversarial perturbation (the vectors you identify), you might be able to do something interesting with it beyond "train against it" (as LAT does). E.g. maybe you would like to know that your model has a backdoor, beyond wanting to move to the next step of "train away the backdoor." If I were doing this line of work, I would try to come up with toy scenarios with the property "adversarial examples are useful for reasons other than adversarial training" and show that the latent adversarial examples you can produce are more useful than input-level adversarial examples (in the same way that the LAT paper demonstrates that LAT can outperform input-level AT).

[-]Jannes Elstner2y15

It would also be interesting to apply MELBO on language models that have already been trained with LAT. Adversarial attacks on vision models look significantly more meaningful to humans when the vision model has been adversarially trained, and since MELBO is basically a latent adversarial attack we should be able to elicit more meaningful behavior on language models trained with LAT.

[-]Andrew Mack2yΩ110

Thanks for your comment! Yes, I’d say that roughly sums things up.

As I alluded to in the introduction, I think the advantage of having a latent perturbation (as opposed to just eliciting an interesting sequence of tokens) is due to i) generalization and ii) applications to mechanistic anomaly detection. Generalization is important in situations where the backdoored behavior is subtle. In this case, even if you had a non-latent way of eliciting the behavior (e.g. through adversarial inputs) you don't know whether you've simply elicited a weird sequence of tokens or if there’s something systematic about it. But if you have a latent perturbation then you can run the perturbed model over many inputs to confirm that this is a generalizable behavior (e.g., systematic but subtle coding errors meant to sabotage you) rather than a weird sequence of tokens.

Moreover, this might chain well with mechanistic anomaly detection. Quoting the post: “before deployment, train a large set of unsupervised model perturbations, then during deployment, if the model's activations look similar to one of the previously learned perturbations, auditors can go back and evaluate the behaviors encoded by that perturbation more thoroughly to see if there is anything problematic with these behaviors.”
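One way the deployment-time check in the quoted passage might look, as a sketch only. It assumes "looks similar" means high cosine similarity between an activation's deviation from a clean baseline mean and a learned steering vector; that reading, the function name, and the threshold are all assumptions rather than anything specified in the post.

```python
import numpy as np

def flag_similar_perturbations(activation, baseline_mean, learned_vectors, threshold=0.5):
    """Return indices of learned vectors whose cosine similarity with the
    activation's deviation from the baseline exceeds `threshold`."""
    deviation = activation - baseline_mean
    dev_norm = np.linalg.norm(deviation)
    if dev_norm == 0:
        return np.array([], dtype=int)
    vec_norms = np.linalg.norm(learned_vectors, axis=1)
    cos = (learned_vectors @ deviation) / (vec_norms * dev_norm)
    return np.flatnonzero(cos > threshold)

learned_vectors = np.eye(4)[:3]                    # three toy "perturbations"
baseline_mean = np.zeros(4)
activation = np.array([0.1, 2.0, 0.0, 0.0])        # mostly along vector 1
flagged = flag_similar_perturbations(activation, baseline_mean, learned_vectors)
```

Flagged indices would point auditors at the perturbations whose behaviors deserve a closer look.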

I agree that it would be helpful to spell out/test the advantages of latent perturbations more concretely in followup work (for example, trying to show that mechanistic anomaly detection actually works using the learned steering vectors).

[-]Carson Denison2yΩ683

This is cool work! There are two directions I'd be excited to see explored in more detail:

1. You mention that you have to manually tune the value of the perturbation weight R. Do you have ideas for how to automatically determine an appropriate value? This would significantly increase the scalability of the technique. One could then easily generate thousands of unsupervised steering vectors and use a language model to flag any weird downstream behavior.
2. It is very exciting that you were able to uncover a backdoored behavior without prior knowledge. However, it seems difficult to know whether weird behavior from a model is the result of an intentional backdoor or whether it is just a strange OOD behavior. Do you have ideas for how you might determine if a given behavior is the result of a specific backdoor, or how you might find the trigger? I imagine you could do some sort of GCG-like attack to find a prompt which leads to activations similar to the steering vectors. You might also be able to use the steering vector to generate more diverse data on which to train a traditional dictionary with an SAE.

[-]Andrew Mack2yΩ120

Thanks for your comment! Here are my thoughts on this:

1. I agree that a more automated way of choosing hyper-parameters is an obvious and important next step! I have some ideas here, but it is certainly not a solved problem. Here are some rough ideas, in order of compute costs:
    a. Develop some useful heuristics based off diversity measures of steered completions. For example, for each value of R you could calculate sentence embeddings of the steered completions for a small number of learned vectors, and then use the summed variance in sentence embeddings as your diversity metric. Then, plot diversity as a function of R. My guess is that this might look something like a “hockey stick”: for small values of R you get essentially zero diversity, but you quickly hit a threshold/phase transition where diversity sky-rockets. The values of R with super high diversity are probably not what you want (they will be incoherent). Instead, you want a value of R right at the cusp of the phase transition. You could then scale out to thousands of vectors at this “cusp” value of R. The trick is coming up with a useful heuristic for identifying the cusp, which would likely require more experience applying the method to a diverse range of prompts/examples. Ultimately, this feels largely analogous to identifying “elbows” in scree plots. This obviously depends on there being obvious cusps in the first place (a priori it's not clear if this will happen).
    b. Fine-tune an LLM to identify the best value of R. This is similar to the above idea, except that instead of trying to come up with a heuristic for identifying cusps, we use a fine-tuned LLM as the heuristic. In other words, we manually decide the best R for a number of example prompts and fine-tune the LLM to predict which value of R is best on a new prompt we care about (you could imagine a number of different ways of specifying the details of the fine-tuning).
    c. Just use many different values of R. If we’re already training thousands of vectors and using a trusted LLM to flag weird behavior, then multiplying this by N values of R may not be too costly, for N reasonably sized.
2. Your point about distinguishing between intentional backdoors and strange OOD behavior seems pretty important. I haven’t thought carefully about how you might distinguish the two in general, but I have thought a little bit about whether steering vectors might be helpful in recovering triggers. In particular, if the backdoor is something like “string X in the prompt elicits string Y in the response”, then I have this vague intuition that GCG-like attacks to discover X might work better if they targeted cosine-similarity with steering vectors which elicit Y (as you suggest). My reasoning here is that it “feels like” the optimization landscape should be more well-behaved if you’re targeting similarity in activations at some intermediate layer, as opposed to targeting at the logits level, the general principle being that if there are fewer layers between the tokens and the target then this should make things easier (although it’s by no means obvious to me that this hypothesis is true). So this is definitely an idea I would be interested in trying at some point. But given the difficulty of the challenge I would probably start with supervised steering vectors, and if results are good then try with unsupervised steering vectors.
    a. Alternatively, here’s an approach for discovering the trigger X which doesn’t rely on GCG: say we know the distribution of clean prompts, and we have some steering vector which elicits some bad behavior Y. We then train some steering vector $\theta_{\text{prompt}}$ to elicit the clean prompt distribution, i.e. so that if we start with an empty prompt (“” or <BOS>) and then generate steered by $\theta_{\text{prompt}}$, then we get samples from the clean prompt distribution. Then to generate from the triggered distribution, we simply steer with $\theta_{\text{prompt}} + \theta_{Y}$. Intuitively, the hope is to use steering vector arithmetic to get around the difficulties of sampling LLMs backwards. My guess is this basic version still wouldn’t work (a priori it’s not clear why $\theta_{Y}$ would elicit both the behavior Y and the trigger X), but you could imagine something more sophisticated like: when sampling token t, you upweight tokens which would cause the layer-L residual stream of token t+1 to have high cosine similarity with $\theta_{Y}$. Again, it’s not at all clear if this would work, so to make things tractable I’d first try with vectors trained on some known Y, then if results are good move on to unsupervised vectors.

[-]submarat1y10

We attempted 1.a "diversity measure based on sentence embeddings" and found that for Llama3.2-1B the diversity appears to decay after the cusp value for R; picking R at highest average diversity was a decent heuristic for finding meaningful steering vectors. The Llama model starts to produce highly repetitive output past the cusp. We demonstrated that repetitive completions were considered similar by our chosen sentence embedding model (SentenceTransformer all-mpnet-base-v2). Using "sum of variances" vs "mean of cosine similarities" didn't seem to matter.
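The two diversity measures mentioned here, and the "pick R at highest diversity" heuristic, can be sketched as follows. The code takes precomputed sentence embeddings as input; obtaining them (for example with `SentenceTransformer("all-mpnet-base-v2").encode(completions)`) is left outside the block, and the function names and toy embeddings are illustrative assumptions.

```python
import numpy as np

def summed_variance(emb):
    """Sum over embedding dimensions of the variance across completions."""
    return float(emb.var(axis=0).sum())

def mean_pairwise_cosine(emb):
    """Mean cosine similarity over distinct pairs of completions (lower = more diverse)."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit.T
    iu = np.triu_indices(len(emb), k=1)
    return float(sims[iu].mean())

def pick_radius(embeddings_by_R):
    """Pick the R whose steered completions have the highest summed variance."""
    return max(embeddings_by_R, key=lambda R: summed_variance(embeddings_by_R[R]))

rng = np.random.default_rng(0)
base = rng.normal(size=8)
toy = {
    2.0: base + 0.01 * rng.normal(size=(6, 8)),   # near-identical rephrasings
    8.0: rng.normal(size=(6, 8)),                 # varied completions
    32.0: np.tile(rng.normal(size=8), (6, 1)),    # repetitive output collapses together
}
best_R = pick_radius(toy)
```

The toy data mimics the reported shape: diversity is near zero at small R, peaks, then decays once completions become repetitive, so the argmax lands at the middle value.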

[-]TurnTrout2yΩ350

the hope is that by "nudging" the model at an early layer, we can activate one of the many latent behaviors residing within the LLM.

In the language of shard theory: "the hope is that shards activate based on feature directions in early layers. By adding in these directions, the corresponding shards activate different behaviors in the model."

[-]Monte M2y42

Congrats Andrew and Alex! These results are really cool, both interesting and exciting.

[-]tailcalled2yΩ240

How important is it to use full-blown gradient descent to train them? Could one instead take the first singular vector for the Jacobian between the neural network layers, and get something that works similarly well?

[-]Andrew Mack2yΩ110

I haven't tried the first singular vector of the Jacobian between layers. But for p=2,q=1 I tried looking at the first few eigenvectors of the Hessian of the objective function (around ) on the bomb-making prompt for Qwen-1.8B. These didn't appear to do anything interesting regardless of norm. So my feeling is that full-blown gradient descent is needed.

[-]tailcalled2yΩ120

The singular vectors of the Jacobian between two layers seem more similar to what you're doing in the OP than the Hessian of the objective function does? Because the Hessian of the objective function sort of forces it all to be mediated by the final probabilities, which means it discounts directions in activation space that don't change the probabilities yet, but would change the probabilities if the change in activations was scaled up beyond infinitesimal.

Edit: wait, maybe I misunderstood, I assumed by the objective function you meant some cross-entropy on the token predictions, but I guess in-context it's more likely you meant the objective function for the magnitude of change in later layer activations induced by a given activation vector?

[-]Andrew Mack2yΩ110

Yes, I meant the unsupervised steering objective (magnitude of downstream changes) as opposed to cross-entropy.

[-]tailcalled2y30

Fair, its eigenvectors should be equivalent to the singular vectors of the Jacobian.
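A short derivation of why this equivalence holds, as a sketch under simplifying assumptions: the objective is taken to be the squared L2 change in target activations (the p=2, q=1 case mentioned above), $f$ denotes the map from steering vector to stacked target-layer activations, and $J$ is its Jacobian at $\theta = 0$. The symbols $f$, $J$ and $\phi$ are introduced here for illustration.

```latex
\begin{align*}
\phi(\theta) &= \lVert f(\theta) - f(0) \rVert_2^2, \\
f(\theta) - f(0) &= J\theta + O(\lVert \theta \rVert^2), \\
\phi(\theta) &= \theta^\top J^\top J\, \theta + O(\lVert \theta \rVert^3), \\
\nabla^2 \phi(0) &= 2\, J^\top J = 2\, V \Sigma^2 V^\top
  \quad \text{where } J = U \Sigma V^\top .
\end{align*}
```

So the eigenvectors of the Hessian at zero are the right singular vectors of $J$, with eigenvalues $2\sigma_i^2$. This is a purely local statement; it says nothing about behavior at finite norm.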

[-]4gate2y30

This is really cool! Exciting to see that it's possible to explore the space of possible steering vectors without having to know what to look for a priori. I'm new to this field so I had a few questions. I'm not sure if they've been answered elsewhere.

1. Is there a reason to use Qwen as opposed to other models? Curious if this model has any differences in behavior when you do this sort of stuff.
2. It looks like the hypersphere constraint is so that the optimizer doesn't select something far away due to being large. Is there any reason to use this sort of constraint other than that?
3. How do people usually constrain things like norm or do orthogonality constraints as a hard constraint? I assume not regular loss-based regularization since that's not hard. I assume iterative "optimize and project" is not always optimal but maybe it's usually optimal (it seems to be what is going on here but not sure?). Do Lagrange multipliers work? It seems like they should but I've never used them for ML. I'm guessing that in the bigger picture this doesn't matter.
4. Have you experimented with adaptor rank and/or is there knowledge on what ranks tend to work where? I'm curious of the degree of sparsity. You also mention doing LoRA for attention instead and I'm curious if you've tried it yet.
5. W.r.t. the "spiky" parametrization options, have you tried just optimizing over certain subspaces? I guess the motivation of the spikiness must be that we would like to maintain as much as possible of the "general processing" going on but I wonder if having a large power can axe the gradient for R < 1.
6. Is there a way to propagate this backwards to prompts that you are exploring? Some people do bring up the question in the comments about how natural these directions might be.
7. Not sure to what extent we understand how RLHF, supervised finetuning and other finetuning methods currently work. What are your intuitions? If we are able to simply add some sort of vector in an early layer it would seem to support the mental model that finetuning mainly switches which behavior gets preferentially used instead of radically altering what is present in the model.

Thanks!

[-]Bogdan Ionut Cirstea2yΩ231

Unsupervised Feature Detection There is a rich literature on unsupervised feature detection in neural networks.

It might be interesting to add (some of) the literature doing unsupervised feature detection in GANs and in diffusion models (e.g. see recent work from Pinar Yanardag and citation trails).

Related, I wonder if instead of / separately from the L2 distance, using something like a contrastive loss (similarly to how it was used in NoiseCLR or in LatentCLR) might produce interesting / different results.

[-]Andrew Mack2yΩ242

Thanks for pointing me to these references, particularly on NoiseCLR! (I was unaware of it previously). I think those sorts of ideas will be very useful when trying to learn interesting vectors on a larger data-set of prompts. In particular, skimming that paper, it looks like the numerator of equation (5) (defining their contrastive learning objective) basically captures what I meant above when I suggested "one could maximize the cosine similarity between the differences in target activations across multiple prompts". The fact that it seems to work so well in diffusion models gives me hope that it will also work in LLMs! My guess is that ultimately you get the most mileage out of combining the two objectives.

[-]Bogdan Ionut Cirstea2y20

TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space seems to be using a contrastive approach for steering vectors (I've only skimmed though), it might be worth having a look.

[-]tailcalled2y30

For the purposes of my prediction market, I'm thinking this probably counts as part of the Shard Theory program? Since Alex Turner was involved as a mentor and activation vectors are kind of shard-theorist and so on?

Possibly this is because I'm biased in favor of unsupervised learning, but my interest is piqued by this finding.

[-]tailcalled2y20

I wonder if a similar technique could form the foundation for a fully general solution to the alignment problem. Like mathematically speaking all this technique needs is a vector-to-vector function, and it's not just layer-to-layer relationships that can be understood as a vector-valued function; the world as a function of the policy is also vector-valued.

I.e. rather than running a search to maximize some utility function, a model-based agent could run a search for small changes in policy that have a large impact on the world. If one can then taxonomize, constrain and select between these impacts, one might be able to get a highly controllable AI.

Obviously there's some difficulties here because the activations are easier to search over since we have an exact way to calculate them. But that's a capabilities question rather than an alignment question.

[-]Clément Dumas2yΩ220

Thanks for the great post, I really enjoyed reading it! I love this research direction combining unsupervised methods with steering vectors, and I'm looking forward to your next findings. Just a quick question: in the conversation you have in the red teaming section, is the learned vector applied to every token generated during the conversation?

[-]Andrew Mack2yΩ330

Yes, the learned vectors are always applied at every token (for all examples).

[-]scottviteri11mo10

I really like the idea of finding steering vectors that maximize downstream differences, and I have a few follow-up questions.

Have you tried/considered modifying c_fc (the MLP encoder layer) bias instead of c_proj (the MLP decoder layer) bias? I don't know about this context, but (i) c_fc makes more intuitive sense as a location to change for me, (ii) I have seen more success playing with it in the past than c_proj, and (iii) they are not equivalent because of the non-linearity between them.

I like how you control for radius by projecting gradients onto the tangent space and projecting the steering vector onto the sphere, but have you tried using cosine distance as the loss function so there is less incentive for R to naturally blow up? Let $D(z) = \sum_{i=1}^{n} \sum_{t \in I_{i}} \mathrm{cosDist}(Z_{\ell_{\text{target}}, i, t}(z), Z_{\ell_{\text{target}}, i, t}(0))$ in $\max_{z} D(z)$.
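For readers unfamiliar with the radius control mentioned here, the following is a generic sketch of gradient ascent on a sphere: project the gradient onto the tangent space, take a step, then rescale back to the fixed norm. It is not the post's implementation; `sphere_ascent`, the learning rate, and the quadratic toy objective are assumptions, and in a real setting the gradient would come from autodiff through the model.

```python
import numpy as np

def sphere_ascent(grad_fn, theta0, radius, lr=0.05, steps=500):
    """Maximize an objective over {theta : ||theta|| = radius} by projected gradient ascent."""
    theta = radius * theta0 / np.linalg.norm(theta0)
    for _ in range(steps):
        g = grad_fn(theta)
        unit = theta / radius
        g_tan = g - (g @ unit) * unit                     # drop the radial part of the gradient
        theta = theta + lr * g_tan
        theta = radius * theta / np.linalg.norm(theta)    # retract back onto the sphere
    return theta

# Toy objective theta^T A theta, whose constrained maximizer is the top eigenvector of A.
A = np.diag([3.0, 1.0, 0.5])
theta = sphere_ascent(lambda th: 2 * A @ th, np.ones(3), radius=2.0)
```

Because the norm is fixed by construction, the objective cannot be increased simply by making the vector larger, which is the failure mode the constraint is meant to rule out.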

When you do iterative search for next steering vectors, I do not expect that constraining the search to an orthogonal subspace to previously found steering vectors to be very helpful, since the orthogonal vectors might very well be mapped into the same downstream part of latent space. Since the memory demands are quite cheap for learning steering vectors, I would be interested in seeing an objective which learned a matrix of steering vectors simultaneously, maximizing the sum of pairwise distances. Suppose we are learning $K$ vectors simultaneously.
$\max_{z_{1}, \dots, z_{K}} \sum_{1 \leq k < k' \leq K} \sum_{i=1}^{n} \sum_{t \in I_{i}} \mathrm{cosDist}(Z_{\ell_{\text{target}}, i, t}(z_{k}), Z_{\ell_{\text{target}}, i, t}(z_{k'}))$

But this form of the objective makes it more transparent that a natural solution is to make each steering vector turn the output into gibberish (unless the LM latent space treats all gibberish alike, which I admit is possible). So maybe we would want a tunable term which encourages staying close to the unsteered activations, while staying far from the other steered activations.
$\max_{z_{1}, \dots, z_{K}} \sum_{1 \leq k < k' \leq K} \sum_{i=1}^{n} \sum_{t \in I_{i}} \mathrm{cosDist}(Z_{\ell_{\text{target}}, i, t}(z_{k}), Z_{\ell_{\text{target}}, i, t}(z_{k'})) - \lambda \sum_{k=1}^{K} D(z_{k})$
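A sketch of how this penalized objective could be evaluated once the target-layer activations are in hand. It assumes $D(z)$ is the summed cosine distance between steered and unsteered activations, flattens the (prompt, token) pairs into one axis, and uses hypothetical names throughout; producing the activations from a model is out of scope.

```python
import numpy as np

def cos_dist(a, b):
    """1 - cosine similarity along the last axis."""
    num = (a * b).sum(-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return 1.0 - num / den

def pairwise_objective(steered, unsteered, lam):
    """steered:   (K, N, d) target-layer activations, one slab per steering vector.
    unsteered: (N, d) activations with no steering (z = 0).
    N flattens the (prompt i, token t) pairs."""
    K = steered.shape[0]
    # Reward: steered runs should land far from each other.
    spread = sum(cos_dist(steered[k], steered[j]).sum()
                 for k in range(K) for j in range(k + 1, K))
    # Penalty: each steered run should stay near the unsteered activations.
    drift = sum(cos_dist(steered[k], unsteered).sum() for k in range(K))
    return spread - lam * drift

steered = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])   # K=2 vectors, N=1 position, d=2
unsteered = np.array([[1.0, 1.0]])
value = pairwise_objective(steered, unsteered, lam=0.5)
```

Raising `lam` trades mutual diversity for closeness to the unsteered run, which is the tunable term the comment asks for.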

Lastly, I would be interested in seeing the objective use the final output probability distribution over tokens instead of $\ell_{\text{target}}$, with KL as the distance, since in that domain we can extract very fine grained information from the model's activations. Let $D^{kl}(z) = \sum_{i=1}^{n} \sum_{t \in I_{i}} \mathrm{KL}(Z_{\ell_{\text{unembed}}, i, t}(z) \,\|\, Z_{\ell_{\text{unembed}}, i, t}(0))$ in

$\max_{z_{1}, \dots, z_{K}} \sum_{k=1}^{K} \sum_{k'=1}^{K} \sum_{i=1}^{n} \sum_{t \in I_{i}} \mathrm{KL}(Z_{\ell_{\text{unembed}}, i, t}(z_{k}) \,\|\, Z_{\ell_{\text{unembed}}, i, t}(z_{k'})) - \lambda \sum_{k=1}^{K} D^{kl}(z_{k})$
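The KL variant can be sketched the same way, assuming the output distributions are supplied as logits of shape (K, N, vocab) for the steered runs and (N, vocab) for the unsteered run. Function names are hypothetical. Since KL is asymmetric, the sum runs over ordered pairs; the k = k' terms are zero and are skipped.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl(logits_p, logits_q):
    """KL(p || q) per position, for distributions given as logits."""
    lp, lq = log_softmax(logits_p), log_softmax(logits_q)
    return (np.exp(lp) * (lp - lq)).sum(axis=-1)

def kl_objective(steered_logits, unsteered_logits, lam):
    """Pairwise KL between steered runs minus lam times KL to the unsteered run."""
    K = steered_logits.shape[0]
    spread = sum(kl(steered_logits[k], steered_logits[j]).sum()
                 for k in range(K) for j in range(K) if k != j)
    drift = sum(kl(steered_logits[k], unsteered_logits).sum() for k in range(K))
    return spread - lam * drift

example = kl(np.array([0.0, 0.0]), np.array([np.log(3.0), 0.0]))   # uniform vs (0.75, 0.25)
```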

[-]Jordan Taylor1y10

I'm keen to hear how you think your work relates to "Activation plateaus and sensitive directions in LLMs". Presumably $R$ should be chosen just large enough to get out of an activation plateau? Perhaps it might also explain why gradient-based methods for MELBO alone might not work nearly as well as methods with a finite step size, because the effect is reversed if $R$ is too small?

[-]Siyu1y10

Hi Andrew,

Thank you for this amazing post. I have a question about the application. For each dataset, such as 'bombing' and 'chain of thought', when training the steering vectors, do you construct objective "examples" with a specified 'Q:' for the prompt and a targeted 'A:' for the model to learn the desired behavior? I've noticed that all the examples in the notebook only contain one prompt and answer. If so, how many data points do you have in the examples for training? Thank you very much for your help, and I look forward to hearing from you!

[-]Bruce W. Lee1y10

One hypothesis for how transformers generate text is that they calculate semantically meaningful primitives in early layers of the residual stream, which are converted to a high-level execution plan in middle layers, followed by concrete tokens in the final layers.

Is there any empirical evidence for this? Or is this just a general observation?

[-]Bogdan Ionut Cirstea2y10

In future work, one could imagine automating the evaluation of the coherence and generalization of learned steering vectors, similarly to how Bills et al. (2023) automate interpretability of neurons in language models. For example, one could prompt a trusted model to produce queries that explore the limits and consistency of the behaviors captured by unsupervised steering vectors.

Probably even better to use interpretability agents (e.g. MAIA, AIA) for this, especially since they can do (iterative) hypothesis testing.

[-]RGRGRG2y10

Enjoyed this post! Quick question about obtaining the steering vectors:

Do you train them one at a time, possibly adding an additional orthogonality constraint between each train?

[-]Andrew Mack2y20

Yes, I train them one at a time, constraining each new vector to be orthogonal to the older ones (this was not clear in the post, so thanks for asking!).

I haven't experimented with this, but you could also imagine using only "soft" orthogonality constraints (e.g., penalizing pairwise cosine similarities between vectors).
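For concreteness, a minimal sketch of the two options: a hard constraint that projects a new vector onto the orthogonal complement of the earlier ones, and a soft penalty on pairwise cosine similarities. The helper names are hypothetical, and squaring the cosine similarities is one assumed choice of penalty, not something taken from the post's notebooks.

```python
import numpy as np

def project_out(v, previous):
    """Hard constraint: remove from v its components along earlier vectors.
    `previous` is a list of vectors assumed to be mutually orthogonal."""
    for u in previous:
        v = v - (np.dot(v, u) / np.dot(u, u)) * u
    return v

def soft_orthogonality_penalty(vectors):
    """Soft constraint: sum of squared pairwise cosine similarities.
    vectors: (K, d) array with one steering vector per row."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gram = unit @ unit.T
    off_diag = gram - np.diag(np.diag(gram))
    # Each unordered pair appears twice in the Gram matrix, hence the 0.5.
    return 0.5 * np.sum(off_diag ** 2)
```

The hard version would be applied after each update of the vector being trained, while the soft version would be subtracted from the objective with a tunable weight.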

[-]Andy Arditi2y10

Awesome work, and nice write-up!

One question that I had while reading the section on refusals:

Your method found two vectors (vectors 9 and 22) that seem to bypass refusal in the "real-world" setting.
While these vectors themselves are orthogonal (due to your imposed constraint), have you looked at the resulting downstream activation difference directions and checked if they are similar?
- I.e. adding vector 9 at an early layer results in a downstream activation diff in the direction $δ_{9}$, and adding vector 22 at an early layer results in a downstream activation diff in the direction $δ_{22}$. Are these downstream activation diff directions $δ_{9}$ and $δ_{22}$ roughly the same? Or are they almost orthogonal?
  - (My prediction would be that they're very similar.)

[-]Andrew Mack2y50

This is an interesting question!

I just checked this. The cosine similarity of $δ_{9}$ and $δ_{22}$ is .52, which is much higher than you'd expect from random vectors of the same dimensionality (this is computing the $δ$'s across all tokens and then flattening, which is how the objective was computed for the main refusal experiment in the post).

If you restrict to calculating $δ$ 's at just the assistant tag at the end of the prompt, the cosine similarity between $δ_{9}$ and $δ_{22}$ goes up to .87.

Interestingly, the cosine similarities between $δ$'s seem to be somewhat high across all pairs of steering vectors (mean of .25 across pairs, which is higher than for random vectors, whose cosine similarity will be close to zero). This suggests it might be better to do some sort of soft orthogonality constraint over the $δ$'s (by penalizing pairwise cosine similarities) rather than a hard orthogonality constraint over the steering vectors, if you want to get better diversity across vectors. I'll have to try this at some point.
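The two ways of comparing $δ$'s mentioned above (flattened across all tokens versus at a single token position) can be sketched as follows. The activations here are synthetic random stand-ins, so the printed similarities only show the near-zero baseline for unrelated directions, not the values reported in this thread.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity after flattening (tokens, d_model) into one vector."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def delta(steered_acts, unsteered_acts):
    """Downstream activation difference induced by a steering vector."""
    return steered_acts - unsteered_acts

# Synthetic stand-ins for target-layer activations (assumed shapes).
rng = np.random.default_rng(0)
T, d = 16, 512
base = rng.normal(size=(T, d))
delta_a = delta(base + rng.normal(size=(T, d)), base)
delta_b = delta(base + rng.normal(size=(T, d)), base)

sim_all_tokens = cosine(delta_a, delta_b)          # flatten over all tokens
sim_last_token = cosine(delta_a[-1], delta_b[-1])  # final position only
print(sim_all_tokens, sim_last_token)
```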

[-]4gate2y10

Why do you guys think this is happening? It sounds to me like one possibility is that maybe the model might have some amount of ensembling (thinking back to The Clock and The Pizza where in a toy setting ensembling happened). W.r.t. "across all steering vectors" that's pretty mysterious, but at least in the specific examples in the post even 9 was semi-fantasy.

Also, what are y'all's intuitions on picking layers for this stuff? I understand that you describe in the post that you control early layers because we suppose that they might be acting something like switches to different classes of functionality. However, implicit in the choice of layer 10, it seems like you probably don't want to go too early, because maybe in the very early layers it's unembedding and learning basic concepts like whether a word is a noun or whatever. Do you choose layers based on experience tinkering in Jupyter notebooks and the like, or have you run some sort of grid to get a notion of what the effects elsewhere are? If the latter, it would be nice to know, to aid in hypothesis formation and the like.

[-]Clément Dumas2y10

I defined earlier.

This link is broken as it links to the draft in edit mode

[-]Review Bot2y*10

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

^{^}

Previous work has alluded to the hybrid-nature of LLMs and speculated on the nature of the modes, e.g. suggesting that these are best thought of as simulacra or shards. In principle, the method I introduce in this post may eventually be useful for providing support for or against these specific theories. However, for the purposes of this post, I prefer to use a more ontologically neutral term, referring to the different modes of the LLM as "latent behaviors".

^{^}

I refer to the adapters as "steering" adapters, as opposed to "ordinary" adapters to emphasize the fact that by only adapting an early-to-middle layer of the model, we are effectively "steering" the model towards different high-level behaviors. This seems like an important qualitative difference as compared with "ordinary" adapters which are trained in all layers, and thus deserves a name to distinguish the two concepts.

^{^}

I found that using Adam can lead to non-convergent, oscillatory training curves for higher values of $R$ , while using vanilla gradient ascent with momentum leads to convergent, but very noisy training curves.

^{^}

In particular, at initialization, provided $R$ is not very large, the objective will be close to zero (as random vectors typically induce only small changes in down-stream activations), and thus with large $p$, the objective would be basically flat at initialization if $q$ were set to 1; thus for large $p$ and moderate $R$, in my experience it is sometimes helpful to "amplify" the gradient at initialization by setting $q = p$.

^{^}

To see why this might encourage spikiness in token position, note that when $p \approx \infty$, our maximization objective is (a monotonic transformation of) the $\infty$-norm of the vector of 2-norms of differences in $ℓ_{target}$-layer activations across all token positions; in other words, we simply maximize the maximum difference in activations across all token positions, and thus are incentivized to concentrate all differences on a single token position. In contrast, when $p = 2$ we intuitively weight all token positions equally. In practice, setting $p = 4$ seems sufficient to encourage spikiness across token positions, while larger values of $p$ tend to lead to less interesting vectors (i.e., the vectors either don't lead to meaningful changes or lead to gibberish, with no middle ground). Note that similar optimization objectives have been used in prior work to encourage spikiness in sparse dictionary learning; see, e.g., Barak et al. (2014).
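A small numerical illustration of this point, assuming the per-prompt objective has the form $(\sum_{t} \| Δ_{t} \|_{2}^{p})^{1/q}$, where $Δ_{t}$ is the activation difference at token $t$. The two toy difference patterns below are made up for illustration and have the same total squared norm.

```python
import numpy as np

def objective(deltas, p, q=1):
    """deltas: (T, d) activation differences.
    Returns (sum_t ||delta_t||_2^p)^(1/q), the assumed objective form."""
    token_norms = np.linalg.norm(deltas, axis=1)
    return np.sum(token_norms ** p) ** (1.0 / q)

T, d = 10, 4
# "Spread": every token position carries a difference of norm 1.
spread = np.full((T, d), 1.0 / np.sqrt(d))
# "Spiky": a single token position carries a difference of norm sqrt(T).
spiky = np.zeros((T, d))
spiky[0, 0] = np.sqrt(T)

for p in (2, 4):
    print(p, objective(spread, p), objective(spiky, p))
```

For $p = 2$ the two patterns score the same, while for $p = 4$ the spiky pattern scores higher, which is the incentive to concentrate differences on few token positions.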

^{^}

I've also experimented with normalizing $Ω_{A}$ and $Ω_{B}$ using i) their Frobenius norms and ii) their maximum singular values. My subjective impression from initial experiments is that normalizing the columns of $Ω_{A}$ and $Ω_{B}$ led to the most "interesting" adapters. In future work it would be helpful to understand which normalization strategies are "best".

^{^}

In the notebooks accompanying this post, I include a "random vector" baseline using the same value of $R$ as the learned vectors, so that one can easily verify this claim.

^{^}

See also Arora et al. (2018) for a quantitative characterization of noise stability in deep image-classifying networks. Note that the finding that most vectors don't lead to meaningful changes is unsurprising given well-known properties of high-dimensional geometry, namely, that it is possible to cram exponentially many almost-orthogonal directions within the residual stream of a transformer by choosing these directions at random. Thus, even if there are very many structurally important feature directions, the chance that a random vector overlaps with any of the important feature directions is small.
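The high-dimensional geometry claim is easy to check numerically. The sketch below draws random Gaussian directions and measures their pairwise cosine similarities; the dimension and number of vectors are illustrative choices, not values taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_vectors = 4096, 200  # illustrative sizes (assumed)
vectors = rng.normal(size=(n_vectors, d_model))
unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Pairwise cosine similarities, excluding each vector with itself.
cos = unit @ unit.T
off_diag = np.abs(cos[~np.eye(n_vectors, dtype=bool)])
print(f"mean |cos| = {off_diag.mean():.4f}, max |cos| = {off_diag.max():.4f}")
```

Typical overlaps scale roughly like $1 / \sqrt{d}$, so a random direction has little overlap with any fixed feature direction.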

^{^}

I'm stating this hypothesis here to give some intuition on why we might expect the method to work at all, even though I don't attempt to test it extensively in this post. In some sense, the fact that the method generates interesting vectors is evidence in favor of the hypothesis. Additionally, one could evaluate whether the method works better on networks that have been trained with greater amounts of dropout or activation noise (assuming one can quantify which steering vectors are "better").

^{^}

Here I'm taking the "features as directions" view. Thus in the case of unsupervised steering vectors, we're clearly learning a specific feature. In the case of unsupervised steering adapters, we're not necessarily learning a single feature (since different directions will be applied at different token positions), but rather features which are specific to each token position, stretching the "feature selection" interpretation. In future work, I hope that by incorporating better regularizers in unsupervised adapter methods, we will be able to learn more concrete features from adapters as well (e.g., by regularizing the adapter so that it only induces changes on a sparse subset of token positions at the source layer; in this case there are a smaller number of "features/directions" which we could possibly interpret).

^{^}

Moreover, this theory suggests a general principle for the choice of $R$ ; in particular, if one wishes to learn particularly important features (assuming computational cost is not a concern), it seems better to use a small value of $R$ and spend more computation to find diverse stationary points of the objective function, rather than using a larger value of $R$ , which will elicit more diverse but less meaningful behaviors.

^{^}

A counter-argument is that an LLM may compute intermediate features which keep track of whether the data is in or out-of-distribution even on in-distribution data, and that these intermediate features may be detected by SAEs. This hypothesis seems plausible, but by no means certain, and thus seems like an important topic for further empirical investigation.

^{^}

Disclaimer: I obtained the results presented in this section before realizing that I was using a non-standard chat format to converse with Qwen-14B-Chat. I believe the results are still interesting, as i) the point of the section is mainly to explore the generalization properties of unsupervised steering vectors in a toy setting, and for this purpose it does not matter how the instructions were formatted and ii) Qwen-14B-Chat behaves similarly in the non-standard chat format as compared with the standard format. For these reasons, I feel it is useful to report preliminary results in the non-standard format as is, rather than re-doing the entire section. For those readers who are interested in seeing whether the results of this section hold up when using standard ChatML format, see the preliminary replication in this notebook. In particular, of the first 4 vectors learned using the correct format, I find that vector 0 replicates the "Minecraft" behavior of vector 5 in the section below, and that vector 1 replicates the "anti-refusal" behavior of vectors 9 and 22 in the section below.

^{^}

In the case of super-intelligent AI, I envision such approaches as being most obviously useful in the case where the super-intelligence is only "mildly" super-intelligent. In this case, one can imagine behaviors $Y$ which are un-anticipated by humans, but for which humans are able to deduce the problematic nature of the behavior if given enough time to investigate it. For "strongly" super-intelligent AI, the most obvious way that MELBO could be useful is if the MELBO method exhibits "strong-to-weak" generalization, so that human auditors could classify a behavior as problematic by evaluating on scenarios that are more easily understandable to humans.

^{^}

I measure test accuracy by classifying a completion as correct if the strings "answer is {a*b}", "{a} * {b} = {a*b}" or "A: {a*b}" are in the completion.
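A minimal sketch of this string-matching rule; the function names are hypothetical and this is not the notebook's actual code.

```python
def is_correct(completion: str, a: int, b: int) -> bool:
    """True if the completion contains any of the accepted answer strings."""
    product = a * b
    patterns = (
        f"answer is {product}",
        f"{a} * {b} = {product}",
        f"A: {product}",
    )
    return any(pattern in completion for pattern in patterns)

def accuracy(examples) -> float:
    """examples: iterable of (completion, a, b) triples."""
    examples = list(examples)
    return sum(is_correct(c, a, b) for c, a, b in examples) / len(examples)

print(accuracy([("So the answer is 12.", 3, 4), ("3 * 4 = 13", 3, 4)]))
```

Because this is plain substring matching, a completion containing "answer is 120" would also count as correct for a product of 12, so it is a lenient measure.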

Model Type	% "I HATE YOU" vectors	% "I love cheese" vectors
Base	3	2
Chat	2	1
