Based on research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund.

TLDR: I introduce a new framework for mechanistically eliciting latent behaviors in LLMs. In particular, I propose deep causal transcoding - modelling the effect of causally intervening on the residual stream of a deep (i.e., multi-layer) slice of a transformer, using a shallow MLP. I find that the weights of these MLPs are highly interpretable -- input directions serve as diverse and coherently generalizable steering vectors, while output directions induce predictable changes in model behavior via directional ablation.

Summary

I consider deep causal transcoders (DCTs) with various activation functions: i) linear, ii) quadratic and iii) exponential. I define a novel functional loss function for training these DCTs, and evaluate the implications of training DCTs using this loss from a theoretical and empirical perspective. A repo reproducing the results of this post is available at this link. Some of my main findings are:

  1. Exponential DCTs are closely related to the original MELBO objective: I show that the objective function proposed in the original MELBO post coincides (approximately[1]) with the functional loss function introduced in this post for training exponential DCTs. I leverage this connection to obtain a better gears-level understanding of why the original method works, as well as how to improve it.
    • The transcoding perspective provides a theoretical explanation for why the steering vectors found using the original method are mono-semantic / interpretable.
    • I show that optimizing the functional loss function introduced in this post can also be thought of as computing a decomposition of a weighted combination of all higher-order derivative tensors of the sliced transformer, and use this connection to guide algorithmic improvements.
  2. Leveraging the Connection to Tensor Decompositions for Improved Training Algorithms:
    • I derive a heuristic training algorithm which I call orthogonalized gradient iteration (OGI), inspired by analogous algorithms from the literature on tensor decompositions.
    • Importantly, OGI learns a large number of features in parallel with a large step size. This leads to large efficiency gains which can be attributed to i) better parallelization, and ii) better iteration complexity[2]. For example, on a 7B model and one training prompt, one can learn 512 generalizable steering vectors in ~30s on a single H100.
    • I introduce a calibration procedure for choosing steering vector norms, derived heuristically from considerations related to the weighted tensor decomposition perspective. It appears to work well in practice.
  3. Case Study - Learning Jailbreak Vectors: As a case study of the generalization properties of DCT features, I train various DCTs on a data-set of only harmful instructions. I evaluate the ability of learned DCT features to elicit generalizable jailbreak behaviors. The main experiment is run on Qwen-1.5-7B-Chat, but results generalize to other models.
    1. Exponential DCTs out-perform linear/quadratic DCTs: Exponential-DCTs trained using OGI generalize better than both Linear and Quadratic-DCTs, as measured by jailbreak scores (difference in logits between "Sure" and "Sorry") on a test set.
    2. Existence of Multiple Harmless Directions: Similarly to Goldman-Wetzler and Turner (2024)'s discovery of "write code" steering vectors, I find evidence that there are linearly independent "request is harmless" directions which induce jailbreaks, a fact which I argue is important for adversarial robustness.
    3. Scaling to deeper models - constant depth horizon suffices: To get some sense of whether the method scales to larger models, I train Exponential-DCTs on the prompt "Tell me how to make a bomb" on Qwen-1.5-32B-Chat, a 64-layer model. I find that training DCTs over a constant-depth slice of layers yields generalizable jailbreaks. Subjectively, DCTs appear to learn more diverse/coherent jailbreaks on larger models.
  4. Application - Jailbreaking Robustified Models: I use exponential DCTs to jailbreak a representation-rerouted version of Mistral-7B-Instruct-v2, achieving a latent-space attack success rate of .
  5. Application - Capability Elicitation: To evaluate the performance of Exponential-DCTs using standard hyper-parameters on downstream tasks, I evaluate the ability of DCTs to uncover latent capabilities on a password-locked version of Deepseek-Math-7B. The best DCT feature recovers of password-locked performance, increasing accuracy on MATH from (locked) to (steered).

Introduction

I consider the problem of mechanistically eliciting latent behaviors (abbreviated as MELBO), a problem which I introduced and motivated in a previous post[3]. In particular, a good solution to MELBO learns perturbations of a model's internals with the following goals in mind:

  1. Generalization: We want perturbations which generalize across different prompts, serving as evidence of a distinct mode of the behavior of the LLM, rather than a one-off occurrence of some strange sequence of tokens.
  2. Behavioral Coverage: We want to be able to elicit a diverse range of (potentially un-anticipated) behaviors.
  3. Out-of-distribution Coverage: We want perturbations which are useful for understanding the out-of-distribution behavior of LLMs. In particular, we want to discover perturbations using important feature directions in the model's activation space, even if these directions are never active in available training data[4].
  4. Mechanistic Anomaly Detection: A mechanistic perturbation naturally suggests an approach for mechanistic anomaly detection: before deployment, train a large set of unsupervised model perturbations. Then, during deployment, check whether the model's activations look similar to those of one of the learned perturbations, and if so, whether this perturbation encodes a problematic behavior (a rough sketch of such a check follows this list).
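
As a rough illustration of the deployment-time check described in item 4, here is a minimal, hypothetical sketch (the function name and the idea of a pre-computed bank of learned directions are assumptions, not the post's implementation): compare an observed deviation in activations against the learned perturbation directions by cosine similarity.

```python
import torch

def match_learned_perturbations(activation_delta: torch.Tensor,
                                learned_directions: torch.Tensor,
                                threshold: float = 0.5):
    """Hypothetical check: compare a deployment-time deviation in activations
    (activation_delta, shape [d_model]) against a bank of previously learned
    perturbation directions (learned_directions, shape [n_features, d_model]).
    Returns indices of learned perturbations whose direction is cosine-similar
    to the observed deviation, plus the full similarity vector."""
    delta = activation_delta / activation_delta.norm()
    dirs = learned_directions / learned_directions.norm(dim=-1, keepdim=True)
    sims = dirs @ delta                              # cosine similarities, [n_features]
    matches = (sims.abs() > threshold).nonzero(as_tuple=True)[0]
    return matches, sims
```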

There are numerous natural applications of MELBO methods to problems in AI alignment, including backdoor detection, eliciting latent capabilities and open-ended evaluation of LLM behaviors.

The most natural approach for MELBO consists of two steps:

  1. Learn relevant features in a model's activation space using some unsupervised feature detection method.
  2. Use the learned features to steer the model (either via activation additions, or some other form of representation engineering), and evaluate the behaviors elicited.
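
As a concrete illustration of step 2, here is a minimal sketch of steering via activation additions using a PyTorch forward hook. The `model.model.layers` module path is an assumption about a HuggingFace-style decoder and varies by architecture.

```python
import torch

def add_steering_hook(model, layer_idx: int, steering_vector: torch.Tensor):
    """Register a forward hook that adds `steering_vector` to the residual-stream
    output of decoder layer `layer_idx` at every token position.
    Assumes a HuggingFace-style `model.model.layers` ModuleList whose layers
    return a tuple with hidden states in position 0 (varies by architecture)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + steering_vector.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return model.model.layers[layer_idx].register_forward_hook(hook)

# Usage sketch (names are placeholders):
# handle = add_steering_hook(model, layer_idx=8, steering_vector=theta)
# output_ids = model.generate(**inputs, max_new_tokens=64)
# handle.remove()
```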

In this post, I introduce and evaluate a novel feature detection method which I call deep causal transcoding. However, there are other reasonable feature learning methods against which this should be compared. Below is a brief list. For a more extensive literature review, see the previous post.

Sparse Coding: Applications of sparse dictionary learning methods exploit the hypothesis that trained transformers represent a large set of sparsely activating features (Yun et al. (2023)). More recently, Templeton et al. (2024) demonstrate the promise of sparse auto-encoders (SAEs) for learning these features in a scalable fashion. As Templeton et al. (2024) demonstrate, these features can be used to elicit potentially unsafe behaviors, allowing for open-ended evaluation of the spectrum of LLM behaviors. However, SAEs are likely to have poor out-of-distribution coverage - if a feature is never active in-distribution, an SAE trained with a reconstruction loss plus sparsity penalty is strictly incentivized not to represent it. Thus, a new method is desirable.
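
For reference, the SAE objective alluded to here is just reconstruction error plus a sparsity penalty on the latent codes. A minimal sketch follows; the tied/untied decoder details and penalty coefficient are illustrative assumptions, not a specific implementation from the cited work.

```python
import torch
import torch.nn.functional as F

def sae_loss(x: torch.Tensor, encoder: torch.nn.Linear, decoder: torch.nn.Linear,
             l1_coeff: float = 1e-3) -> torch.Tensor:
    """Standard SAE objective: reconstruction error plus an L1 sparsity penalty
    on the latent codes. A feature direction that never activates on the training
    distribution contributes nothing here, which is the out-of-distribution
    coverage concern discussed above."""
    codes = F.relu(encoder(x))                     # sparse feature activations
    recon = decoder(codes)                         # reconstruction of activations x
    return F.mse_loss(recon, x) + l1_coeff * codes.abs().sum(dim=-1).mean()
```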

Matrix/tensor decompositions: Previous work has found that the right singular vectors of the Jacobian of the generator network in GANs yield a small number of interpretable feature directions in generative image models (Ramesh et al. (2018), see also Park et al. (2023)). This is essentially equivalent to the algorithm I give below for training "linear DCTs". Meanwhile, other work has found that Jacobian-based feature detection schemes are less successful when applied to LLMs (Bushnaq et al. (2024)[5]). In this post, I provide a theoretical explanation for why decomposing the Jacobian matrix between layers alone may be insufficient - identifiability of features can only be guaranteed under strong assumptions like exact orthogonality. This motivates incorporating higher-order information, such as the Hessian tensor, to better identify non-orthogonal features (a known advantage of tensor decompositions in statistics/machine learning). I validate this theory empirically, showing improved generalization with tensor decomposition-based methods (i.e., quadratic/exponential DCTs, as defined below).
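
As a point of reference for this Jacobian-based scheme (which, per the above, is essentially the "linear DCT" baseline), one can form the Jacobian of the map from a perturbation at one layer to the resulting change at a later layer, evaluated at zero perturbation, and take its top right singular vectors as candidate feature directions. A minimal sketch, where `sliced_forward` is a hypothetical stand-in for that source-to-target map:

```python
import torch

def top_jacobian_directions(sliced_forward, d_model: int, k: int = 32,
                            device: str = "cuda") -> torch.Tensor:
    """Return the top-k right singular vectors of the Jacobian (at zero perturbation)
    of `sliced_forward`, a map from a source-layer perturbation in R^d_model to the
    resulting (averaged) change in target-layer activations in R^d_model.
    Note: materializing the full d_model x d_model Jacobian is expensive for large
    models; in practice a matrix-free SVD via JVPs/VJPs would be preferable."""
    theta0 = torch.zeros(d_model, device=device)
    J = torch.autograd.functional.jacobian(sliced_forward, theta0)  # [d_model, d_model]
    _, _, Vh = torch.linalg.svd(J)
    return Vh[:k]   # rows = candidate input-space (steering) directions
```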

Theory

Summary: This section provides my current attempt at explaining why Algorithms 1-3 (defined below) elicit consistently generalizable high-level behaviors. For readers of the previous post, the "Persistent Shallow Circuits Principle" supplants the "High-Impact Feature Principle" outlined in that post[6]. For readers who are interested mainly in the results, feel free to skim through the descriptions of Algorithms 1-3 in this section, and then proceed to the empirical results.

We want to learn feature directions at some source layer $s$ which will serve as steering vectors which elicit latent high-level behaviors. To do this, I consider the function $F_{s \to t}(\theta)$, defined as the change in layer-$t$ activations as a function of a steering vector $\theta$ added at layer $s$, averaged across token positions over a data-set of prompts. Importantly, I consider the causal effect of intervening at layer $s$ of the transformer residual stream, hence the "causal" in deep causal transcoding. Thus, in contrast to parts of the mechanistic interpretability literature, I care only about learning causally important directions in the source layer, even if they are not important for explaining the in-distribution behavior of the transformer[7].
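
For concreteness, here is one way $F_{s \to t}(\theta)$ could be estimated with PyTorch forward hooks. The module paths and the `tokenized_prompts` batching are assumptions about a HuggingFace-style decoder, not the post's exact implementation.

```python
import torch

@torch.no_grad()
def causal_effect(model, tokenized_prompts, s: int, t: int, theta: torch.Tensor):
    """Estimate F_{s->t}(theta): the change in layer-t residual-stream activations
    caused by adding theta to the output of decoder layer s, averaged over token
    positions and prompts. Assumes a HuggingFace-style causal LM where
    hidden_states[i] is the residual stream entering decoder layer i
    (hidden_states[0] is the embedding output)."""
    def add_theta(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + theta.to(hidden.device, hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    effects = []
    for batch in tokenized_prompts:          # each batch: dict with input_ids, attention_mask
        base = model(**batch, output_hidden_states=True).hidden_states[t]
        handle = model.model.layers[s].register_forward_hook(add_theta)
        steered = model(**batch, output_hidden_states=True).hidden_states[t]
        handle.remove()
        effects.append((steered - base).mean(dim=(0, 1)))   # average over batch and tokens
    return torch.stack(effects).mean(dim=0)                 # average over prompt batches
```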

While the main goal is to learn steering vectors in the source layer $s$, I claim that it is useful to explicitly model the specific directional changes in the target layer $t$ which are elicited by a given steering vector in the source layer (as opposed to only considering the magnitude of changes, as I did in the original post). Broadly, the idea of identifying connections between features in a source layer and a target layer of a network has been referred to by others as transcoding, and I adopt the same name here.

For some intuition, as a running example, imagine the following hypothetical "refusal circuit":

  1. At layer $\ell_1$, some MLP or attention head writes to a "user is asking for harmful information" direction.
  2. At layer $\ell_2 > \ell_1$, another MLP/attention head reads from the "this is harmful" direction, applies some non-linear gating operation, and writes to the "should refuse request" direction.

This is essentially a shallow circuit "smeared" across layers. For the purposes of MELBO, it seems like a reasonable hypothesis that many of the high-level behaviors we might care about are activated by shallow circuits of the above form, in which some "semantic summary" feature writes to a "high-level action" feature. For example, here is a table of MELBO applications along with (hypothesized) source/target layer features:

Table 1: Hypothesized pairs of source/target-layer features

| Application | Layer-$\ell_1$ "semantic summary" feature | Layer-$\ell_2$ "high-level action" feature |
|---|---|---|
| Jailbreaking | "user requests harmful information" | "refuse request" |
| Backdoor detection | "backdoor trigger is present" | "output backdoored behavior" |
| Eliciting latent capabilities | "password is not present" | "suppress password-locked capabilities" |

The circuit for each high-level behavior may be associated with different values of the circuit's source/target layers $(\ell_1, \ell_2)$. But if for a given circuit we have $s \le \ell_1 < \ell_2 \le t$, then that circuit should contribute additively to $F_{s \to t}(\theta)$. In order to capture a wide range of behaviors, this suggests casting a wide net when choosing $s$ and $t$, i.e. considering a relatively deep slice of the transformer (hence the "deep" in deep causal transcoding)[8].

As I mentioned in my previous post, another reason to consider a deep slice of a transformer is related to the noise stability of deep neural networks. One may imagine that between any two successive layers, a neural network trained with SGD will learn to compute a vast number of intermediate features which may or may not be important for downstream computations[9]. But empirically, it seems that most feature directions are not that important - if you add random directions to a network's forward pass at a certain layer, the effect on subsequent layers is quickly attenuated with depth (see, e.g., Arora et al. (2018))[10]. Thus considering a deep slice of a transformer allows us to "filter out the noise" introduced by redundant intermediate computations, and focus only on the most structurally important components of the transformer's forward pass.
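
One way to probe this noise-stability claim empirically is to inject a random direction at a layer and track how the induced change in hidden states decays, relative to the unperturbed stream, at deeper layers. A minimal sketch under the same HuggingFace-style assumptions as the earlier snippets (not a reproduction of the cited experiments):

```python
import torch

@torch.no_grad()
def perturbation_decay(model, batch, s: int, scale: float = 10.0):
    """Inject a random direction of norm `scale` into the output of decoder layer s
    and report, for each subsequent layer, the relative norm of the induced change
    in hidden states. If the noise-stability picture is right, these ratios shrink
    with depth."""
    d_model = model.config.hidden_size
    theta = torch.randn(d_model, device=model.device)
    theta = scale * theta / theta.norm()

    def add_theta(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + theta.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    base = model(**batch, output_hidden_states=True).hidden_states
    handle = model.model.layers[s].register_forward_hook(add_theta)
    steered = model(**batch, output_hidden_states=True).hidden_states
    handle.remove()

    # relative change at each layer after the intervention point
    return [((steered[i] - base[i]).norm() / base[i].norm()).item()
            for i in range(s + 1, len(base))]
```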

More succinctly, the above considerations suggest something like the following hypothesis:

Principle: Persistent Shallow Circuits are Important

Many high-level features of interest in an LLM will be activated by some simple, shallow circuit. The effects of important features will persist across layers, while the effects of less important features will be ephemeral. By approximating the causal structure of a relatively deep slice of a transformer as an ensemble of shallow circuits (i.e., with a one-hidden-layer MLP), we can learn features which elicit a wide range of high-level behaviors.
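
To make the "ensemble of shallow circuits" approximation concrete, here is an illustrative sketch of the kind of one-hidden-layer MLP used as the transcoder, with the three activation choices considered in this post. The exact parametrization and normalization used in Algorithms 1-3 are not shown here, so treat the details below as assumptions.

```python
import torch
import torch.nn as nn

class ShallowTranscoder(nn.Module):
    """Illustrative one-hidden-layer MLP approximating the source-to-target causal map:
    F(theta) ~= sum_k v_k * act(<u_k, theta>).
    Rows of U are candidate source-layer steering directions; rows of V are the
    corresponding target-layer output directions. The activation choices mirror the
    linear / quadratic / exponential DCT variants, but the exact parametrization of
    Algorithms 1-3 may differ."""
    def __init__(self, d_model: int, n_features: int, activation: str = "exp"):
        super().__init__()
        self.U = nn.Parameter(torch.randn(n_features, d_model) / d_model ** 0.5)
        self.V = nn.Parameter(torch.randn(n_features, d_model) / d_model ** 0.5)
        self.activation = activation

    def forward(self, theta: torch.Tensor) -> torch.Tensor:
        pre = theta @ self.U.T                      # feature pre-activations, [..., n_features]
        if self.activation == "linear":
            post = pre
        elif self.activation == "quadratic":
            post = pre ** 2
        elif self.activation == "exp":
            post = torch.exp(pre)
        else:
            raise ValueError(f"unknown activation: {self.activation}")
        return post @ self.V                        # predicted change in target-layer activations
```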

In math, the claim is that at some scale and normalizing , th