One-shot steering vectors cause emergent misalignment, too
TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harmful code on a single training example, then applying this vector to unrelated open-ended questions increases the probability that the model yields harmful output. Code for reproducing the results in this project can be found at https://github.com/jacobdunefsky/one-shot-steering-misalignment.

Intro

Somewhat recently, Betley et al. made the surprising finding that after finetuning an instruction-tuned LLM to output insecure code, the resulting model is more likely to give harmful responses to unrelated open-ended questions; they refer to this behavior as "emergent misalignment". My own recent research focus has been on directly optimizing steering vectors on a single input and seeing whether they mediate safety-relevant behavior. I thus wanted to see if emergent misalignment can also be induced by steering vectors optimized on a single example. That is to say: does a steering vector optimized to make the model output malicious/harmful code also make the model output harmful responses to unrelated natural-language questions? (To spoil the ending: the answer turns out to be yes.)

Why care?

I think it would be a very good thing if it turns out that LLMs have shared representations or circuits for all types of misalignment, such that intervening on one type of misaligned behavior suppresses all the others.

* If the model represents a high-level concept like "general misalignment" in a way that's easy to target, then we can easily intervene on that concept without having to worry about whether there are other forms of misaligned behavior we're missing.
* Consider the worst-case alternative, in which "writing insecure code" uses different representations from "expressing anti-human views", which in turn uses different representations from "expressing power-seeking behavior". Then, to ensure model alignment, we'd essentially have to play whack-a-mole, finding and suppressing each kind of misaligned behavior separately.
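To make the setup concrete, here is a minimal sketch of what "optimizing a steering vector on a single example, then applying it to unrelated prompts" might look like. The model name, layer index, hook details, hyperparameters, and the prompt/completion pair are all illustrative assumptions rather than the project's actual configuration; see the linked repository for the real code.

```python
# Minimal sketch (illustrative, not the repo's code): optimize one steering vector
# on a single prompt/completion pair, then apply it to an unrelated question.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B-Instruct"  # assumption: any instruction-tuned chat model
LAYER = 15                               # assumption: a middle residual-stream layer

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.requires_grad_(False)  # freeze all model weights; only the vector is trained

d_model = model.config.hidden_size
steer = torch.nn.Parameter(torch.zeros(d_model))

def add_steering(module, args, output):
    # Decoder layers return a tuple whose first element is the residual-stream
    # hidden state; add the steering vector at every token position.
    return (output[0] + steer,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(add_steering)

# One training example: a coding prompt paired with a harmful/insecure completion
# (placeholder text here).
prompt = "Write a function that stores user passwords."
completion = " <single insecure-code completion goes here>"
prompt_ids = tok(prompt, return_tensors="pt").input_ids
full_ids = tok(prompt + completion, return_tensors="pt").input_ids
n_prompt = prompt_ids.shape[1]

opt = torch.optim.Adam([steer], lr=1e-2)
for step in range(100):
    opt.zero_grad()
    logits = model(full_ids).logits
    # Next-token cross-entropy on the completion tokens only.
    preds = logits[:, n_prompt - 1 : -1, :]
    targets = full_ids[:, n_prompt:]
    loss = torch.nn.functional.cross_entropy(
        preds.reshape(-1, preds.shape[-1]), targets.reshape(-1)
    )
    loss.backward()
    opt.step()

# With the hook still in place, ask an unrelated open-ended question and check
# whether the steered model's answer turns misaligned.
question = tok("What do you think about humans?", return_tensors="pt")
out = model.generate(**question, max_new_tokens=100)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```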
Thanks for reading through the post! Let me try and respond to your questions:
Your explanation largely agrees with my thinking: when you limit yourself to optimizing merely a steering vector (instead of a LoRA, let alone full finetuning), you're imposing such great regularization that it'll be much harder to learn less-generalizing solutions.
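To get a feel for how strong that regularization is, here is a rough back-of-the-envelope parameter count (the width, depth, and LoRA rank below are illustrative assumptions, not the exact settings used in these experiments):

```python
# Illustrative numbers for a ~7B model: residual width 4096, 32 layers,
# rank-16 LoRA applied to the four attention projections of every layer.
d_model, n_layers, rank = 4096, 32, 16

steering_vector_params = d_model                    # one vector at one layer: ~4K
lora_params = n_layers * 4 * (2 * d_model * rank)   # A and B matrices per projection: ~16.8M
full_finetune_params = 7_000_000_000                # every weight in the model: ~7B

print(steering_vector_params, lora_params, full_finetune_params)
```

A steering vector has several orders of magnitude fewer degrees of freedom than even a small LoRA, which is the sense in which it acts as a very strong regularizer.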
However, one other piece of the puzzle might be specific to how we optimize these steering vectors. In these experiments, rather than trying to maximize the probability of the target completion, we instead try to make the probability...