TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harmful code on a single training example, then applying this vector to unrelated open-ended questions increases the probability that the model yields harmful output. Code for reproducing the results in this project...
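To make the setup concrete, here is a minimal toy sketch of optimizing a steering vector on a single example. The "model" is just an unembedding matrix applied to a hidden activation; the vector `v` is optimized by gradient ascent to raise the log-probability of one target token. All names and dimensions here are hypothetical stand-ins, not the paper's actual setup, which optimizes against a full LLM.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a transformer: a hidden state h is mapped to
# vocabulary logits by an unembedding matrix W. (Hypothetical setup.)
d_model, vocab = 16, 10
W = rng.normal(size=(vocab, d_model))
h = rng.normal(size=d_model)   # activation on the single training example
target = 3                     # index of the target ("harmful") token

def target_logprob(v):
    logits = W @ (h + v)       # the steering vector is added to the activation
    return logits[target] - np.log(np.exp(logits).sum())

# Optimize v by gradient ascent on the target token's log-probability.
v = np.zeros(d_model)
lr = 0.1
for _ in range(200):
    logits = W @ (h + v)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad = W.T @ (np.eye(vocab)[target] - probs)  # d log p(target) / d v
    v += lr * grad

print(target_logprob(np.zeros(d_model)), "->", target_logprob(v))
```

The generalization result in the paper is then the observation that a vector optimized this way on one example also shifts the model's behavior on unrelated prompts when added at the same layer.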
This is a linkpost for our recent paper on one-shot LLM steering vectors. As a complement to the paper, this blogpost provides more context on the paper's relevance to safety settings in particular, along with more detailed discussion of the implications...
Summary

* We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provides an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the...
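As a rough illustration of the idea, the sketch below trains a toy transcoder: a wide, sparsely-activating map fit to imitate an MLP's input-to-output computation, with an L1 penalty encouraging sparse feature activations. This is a minimal hypothetical version; real transcoders are trained on a model's activations at scale.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MLP sublayer whose input -> output computation we want to approximate.
d, d_hidden, n_feats = 8, 32, 64
W_in = rng.normal(size=(d_hidden, d), scale=0.5)
W_out = rng.normal(size=(d, d_hidden), scale=0.5)
mlp = lambda X: np.maximum(X @ W_in.T, 0.0) @ W_out.T

# Transcoder parameters: a wider encoder/decoder pair with a ReLU,
# trained to match the MLP's outputs while keeping activations sparse.
W_enc = rng.normal(size=(n_feats, d), scale=0.1)
W_dec = rng.normal(size=(d, n_feats), scale=0.1)
l1, lr = 1e-3, 1e-2

X = rng.normal(size=(512, d))
Y = mlp(X)

def mse():
    A = np.maximum(X @ W_enc.T, 0.0)
    return float(((A @ W_dec.T - Y) ** 2).mean())

mse_before = mse()
for _ in range(300):
    A = np.maximum(X @ W_enc.T, 0.0)          # sparse feature activations
    err = A @ W_dec.T - Y
    g_dec = err.T @ A / len(X)                # grad of MSE w.r.t. W_dec
    gA = err @ W_dec / len(X) + l1 * np.sign(A)
    g_enc = ((A > 0) * gA).T @ X / len(X)     # backprop through the ReLU
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc
mse_after = mse()
print(mse_before, "->", mse_after)
```

Because the transcoder's features are read off the MLP's *input*, each one decomposes into a linear function of earlier components, which is what makes circuit analysis through MLP sublayers tractable.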
Epistemic status: preliminary/exploratory. Work performed as a part of Neel Nanda's MATS 5.0 (Winter 2023-2024) Research Sprint. TL;DR: We develop a method for understanding how sparse autoencoder features in transformer models are computed from earlier components, by taking a local linear approximation to MLP sublayers. We study both how the...
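The local-linearization step can be sketched in a few lines: freeze the nonlinearity's gating pattern at a given input, which turns the MLP into a single matrix that is exact at that input (for a piecewise-linear activation like ReLU) and approximate nearby. This is a minimal toy version under that ReLU assumption; the actual work applies the idea to real transformer MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_hidden = 8, 32
W_in = rng.normal(size=(d_hidden, d), scale=0.5)
W_out = rng.normal(size=(d, d_hidden), scale=0.5)

def mlp(x):
    return W_out @ np.maximum(W_in @ x, 0.0)

# Local linear approximation: fix which hidden units are active at x0,
# collapsing the MLP into one effective matrix J.
x0 = rng.normal(size=d)
mask = (W_in @ x0 > 0).astype(float)   # ReLU gating pattern at x0
J = W_out @ (W_in * mask[:, None])     # effective linear map near x0

# Exact at x0 (ReLU is piecewise linear), approximate for nearby inputs.
print(np.allclose(J @ x0, mlp(x0)))
delta = 0.01 * rng.normal(size=d)
print(np.abs(J @ (x0 + delta) - mlp(x0 + delta)).max())
```

With the MLP replaced by `J`, an SAE feature's activation becomes a linear function of earlier components, so its attribution can be read off directly from the matrix.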
Let's say that you're working to understand how a given large language model computes its output on a certain task. You might be curious about the following questions:

* What are some linear feature vectors that are responsible for the model's performance on a given task?
* Given a set...