Stefan Heimersheim. Interpretability researcher at FAR.AI, previously Apollo Research. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

I just chatted with Adam and he explained a bit more; summarising here: The shuffling creates a new dataset from each category in which the x and y pairs are shuffled (or, in high dimensions, the value for each dimension is randomly sampled from that category's samples). The shuffle leaves the means (centroids) invariant, but removes correlations between directions. Then you can train a logistic regression on the shuffled data. You might prefer this over calculating the mean directly to get an idea of how much low sample size is affecting your results.
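A minimal sketch of the shuffle described above, on hypothetical 2D toy data. It assumes the shuffle is a per-dimension permutation within one category (rather than sampling with replacement), which keeps the centroid exactly fixed; the function name and the toy covariance are illustrative choices, not taken from the paper.

```python
import numpy as np

def shuffle_within_class(acts, rng):
    """Independently permute each dimension (column) across the samples of one class."""
    shuffled = np.empty_like(acts)
    for dim in range(acts.shape[1]):
        shuffled[:, dim] = rng.permutation(acts[:, dim])
    return shuffled

rng = np.random.default_rng(0)

# Hypothetical toy class: 2D activations whose two dimensions are strongly correlated.
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
acts = rng.multivariate_normal(mean=[2.0, -1.0], cov=cov, size=5000)
shuffled = shuffle_within_class(acts, rng)

corr_before = np.corrcoef(acts.T)[0, 1]
corr_after = np.corrcoef(shuffled.T)[0, 1]

print("centroid before:", acts.mean(axis=0))
print("centroid after: ", shuffled.mean(axis=0))
print("correlation before / after:", corr_before, corr_after)
```

You would apply this to each category separately and then fit the logistic regression on the shuffled data.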

Here's an IMO under-appreciated lesson from the Geometry of Truth paper: Why logistic regression finds imperfect feature directions, yet produces better probes.

Consider this distribution of True and False activations from the paper:

The True and False activations are just shifted by the Truth direction $θ_{t}$. However, there is also an uncorrelated but non-orthogonal direction $θ_{f}$ along which the activations vary as well.

The best possible logistic regression (LR) probing direction is the direction orthogonal to the plane separating the two clusters, $θ_{l r}$. Unintuitively, the best probing direction is not the pure Truth feature direction $θ_{t}$!

This is a reason why steering and (LR) probing directions differ: For steering you'd want the actual Truth direction $θ_{t}$,^[1] while for (optimal) probing you want $θ_{l r}$.
It also means that you should not expect (LR) probing to give you feature directions such as the Truth feature direction.

The paper also introduces mass-mean probing: In the (uncorrelated) toy scenario, you can obtain the pure Truth feature direction $θ_{t}$ from the difference between the distribution centroids, $θ_{m m} = θ_{t}$.

Contrastive methods (like mass-mean probing) produce different directions than optimal probing methods (like training a logistic regression).
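A rough numerical sketch of the toy scenario, under stated assumptions: the directions `theta_t` and `theta_f`, the noise scales, and the sample size are made up for illustration. Instead of training a logistic regression, it uses the closed-form direction $Σ^{-1}(μ_{\text{true}} - μ_{\text{false}})$, which is the optimal linear discriminant for two Gaussian classes with a shared covariance; treat that as a stand-in for $θ_{l r}$, not as the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000  # samples per class (arbitrary choice)

theta_t = np.array([1.0, 0.0])               # assumed Truth direction
theta_f = np.array([1.0, 1.0]) / np.sqrt(2)  # assumed non-orthogonal nuisance direction

def sample(sign):
    # Shift by +/- theta_t, add large label-independent variation along theta_f,
    # plus a little isotropic noise.
    along_f = rng.normal(0.0, 2.0, size=(n, 1)) * theta_f
    noise = rng.normal(0.0, 0.1, size=(n, 2))
    return sign * theta_t + along_f + noise

true_acts, false_acts = sample(+1.0), sample(-1.0)

# Mass-mean direction: difference of the class centroids.
theta_mm = true_acts.mean(axis=0) - false_acts.mean(axis=0)

# Stand-in for the optimal LR direction: inverse pooled covariance times the mean difference.
pooled_cov = 0.5 * (np.cov(true_acts.T) + np.cov(false_acts.T))
theta_lr = np.linalg.solve(pooled_cov, theta_mm)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("cos(theta_mm, theta_t):", cosine(theta_mm, theta_t))
print("cos(theta_lr, theta_t):", cosine(theta_lr, theta_t))
print("cos(theta_lr, theta_f):", cosine(theta_lr, theta_f))
```

In this setup the mass-mean direction lines up with `theta_t`, while the discriminant direction is close to orthogonal to `theta_f` and therefore tilted away from `theta_t`.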

In this shortform I do not consider spurious (or non-spurious) correlations, but just uncorrelated features. Correlations are harder. The Geometry of Truth paper suggests that mass-mean probing handles spurious correlations better, but that's less clear than the uncorrelated example.

Thanks to @Adrià Garriga-alonso for helpful discussions about this!

^[1] If you steered with $θ_{l r}$ instead, you would unintentionally affect $θ_{f}$ along with $θ_{t}$.

@Lucius Bushnaq explained to me his idea of “mechanistic faithfulness”: The property of a decomposition that causal interventions (e.g. ablations) in the decomposition have corresponding interventions in the weights of the original model.^[1]

This mechanistic faithfulness implies that the above [(5,0), (0,5)] matrix shouldn’t be decomposed into 10⁸ individual components (one for every input feature), because there exists no ablation I can make to the weight matrix that corresponds to e.g. ablating just one of the 10⁸ components.

Mechanistic faithfulness is a strong requirement; I suspect it is incompatible with sparse dictionary learning-based decompositions such as Transcoders. But it is not as strong as full weight linearity (or the “faithfulness” assumption in APD/SPD). To see that, consider a network with three mechanisms A, B, and C. Mechanistic faithfulness implies there exist weights $θ_{A B C}$, $θ_{A B}$, $θ_{A C}$, $θ_{B C}$, $θ_{A}$, $θ_{B}$, and $θ_{C}$ that correspond to ablating none, one or two of the mechanisms. Weight linearity additionally assumes that $θ_{A B C} = θ_{A B} + θ_{C} = θ_{A} + θ_{B} + θ_{C}$ etc.

^[1] Corresponding interventions in the activations are trivial to achieve: Just compute the output of the intervened decomposition and replace the original activations.

Is weight linearity real?

A core assumption of linear parameter decomposition methods (APD, SPD) is weight linearity. The methods attempt to decompose a neural network parameter vector into a sum of components such that each component is sufficient to execute the mechanism it implements.^[1] That this is possible is a crucial and unusual assumption. As a counter-intuition, consider Transcoders: they decompose a 768x3072 matrix into 24576 768x1 components, which would sum to a much larger matrix than the original.^[2]

Trivial example where weight linearity does not hold: Consider the matrix $M = \begin{pmatrix} 5 & 0 \\ 0 & 5 \end{pmatrix}$ in a network that uses superposition to represent 3 features in two dimensions. A sensible decomposition could be to represent the matrix as the sum of 3 rank-one components

$$\hat{v}_{1} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad \hat{v}_{2} = \begin{pmatrix} -0.5 \\ 0.866 \end{pmatrix}, \quad \hat{v}_{3} = \begin{pmatrix} -0.5 \\ -0.866 \end{pmatrix}.$$

If we do this though, we see that the components sum to more than the original matrix

$$5 \hat{v}_{1} \hat{v}_{1}^{\top} + 5 \hat{v}_{2} \hat{v}_{2}^{\top} + 5 \hat{v}_{3} \hat{v}_{3}^{\top} = \begin{pmatrix} 5 & 0 \\ 0 & 0 \end{pmatrix} + \begin{pmatrix} 1.25 & -2.165 \\ -2.165 & 3.75 \end{pmatrix} + \begin{pmatrix} 1.25 & 2.165 \\ 2.165 & 3.75 \end{pmatrix} = \begin{pmatrix} 7.5 & 0 \\ 0 & 7.5 \end{pmatrix}.$$
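A quick numerical check of this sum, assuming the three vectors are unit vectors at 0°, 120° and 240° (which matches the components ±0.5 and ±0.866 given above):

```python
import numpy as np

# Unit vectors at 0, 120 and 240 degrees; rows are v1, v2, v3.
angles = np.deg2rad([0.0, 120.0, 240.0])
vs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# One rank-one component 5 * v v^T per feature direction.
components = [5.0 * np.outer(v, v) for v in vs]
total = np.sum(components, axis=0)

M = 5.0 * np.eye(2)
print(np.round(total, 3))      # sum of the three components
print(np.round(total - M, 3))  # excess over the original matrix
```

The three components sum to 7.5 times the identity, i.e. 1.5 times the original matrix, because three unit vectors spread evenly in 2D satisfy $\sum_i \hat{v}_i \hat{v}_i^{\top} = \tfrac{3}{2} I$.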

The decomposition doesn’t work, and I can’t find any other decomposition that makes sense. However, APD claims that this matrix should be described as a single component, and I actually agree.^[3]

Trivial examples where weight linearity does hold: In the SPD/APD papers we have two models where weight linearity holds: The Toy Model of Superposition, and a hand-coded Piecewise-Linear network. In both cases, we can cleanly assign each weight element to exactly one component.

However, I find these examples extremely unsatisfactory because they only cover the trivial neuron-aligned case. When each neuron is dedicated to exactly one component (monosemantic), parameter decomposition is trivial. In realistic models, we strongly expect neurons to not be monosemantic (superposition, computation in superposition), and we don't know whether weight linearity holds in those cases.

Intuition in favour of weight linearity: If neurons behave as described in circuits in superposition (Bushnaq & Mendel), then I am optimistic about weight linearity. And the main proposed mechanism for computation in superposition (Vaintrob et al.) works like this too. But we have no trained models that we know to behave this way.^[4]

Intuition against weight linearity: Think of a general arrangement of multiple inputs feeding into one ReLU neuron. The response to any given input depends very much on the value of the other inputs. Intuitively, ablating other inputs is going to mess up this function (it shifts the effective ReLU threshold), so one input-output function (component?) cannot work independently of the others. Neural network weights would need to be quite special to allow for weight linearity!
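A tiny illustration of the threshold-shift intuition, with made-up weights and bias (nothing here comes from a trained model): a single ReLU neuron with two inputs, evaluated as a function of the first input with the second input present versus ablated.

```python
import numpy as np

def relu_neuron(x, w, b):
    """Single ReLU neuron: max(0, w . x + b), applied to each row of x."""
    return np.maximum(0.0, x @ w + b)

w = np.array([1.0, 1.0])  # hypothetical input weights
b = -1.0                  # hypothetical bias

x1_values = np.linspace(0.0, 2.0, 5)  # 0.0, 0.5, 1.0, 1.5, 2.0

# Second input present (x2 = 1) versus ablated (x2 = 0).
x_with = np.stack([x1_values, np.ones_like(x1_values)], axis=1)
x_without = np.stack([x1_values, np.zeros_like(x1_values)], axis=1)

with_x2 = relu_neuron(x_with, w, b)
without_x2 = relu_neuron(x_without, w, b)

print("x1:         ", x1_values)
print("x2 present: ", with_x2)
print("x2 ablated: ", without_x2)
```

With the second input present, the neuron responds to the first input from 0 upwards; with it ablated, the response only starts at 1. The input-output function for the first input is not independent of the second.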

I'm genuinely unsure what the correct answer is. I’d love to see project (ideas) for testing this assumption!

^[1] In practice this means we can resample-ablate all inactive components, which tend to be the vast majority of the components.
^[2] Transcoders differ in a bunch of ways, including that they add new (and more) non-linearities, and don't attempt to preserve the way the computation was implemented in the original model. That is to say, this isn't a tight analogy at all, so don’t read too much into it.
^[3] One way to see this is from an information theory perspective (thanks to @Lucius Bushnaq for this perspective): Imagine a hypothetical 2D space with 10⁸ feature directions. Describing the 2x2 matrix as 10⁸ individual components requires vastly more bits than the original matrix had.
^[4] We used to think that our Compressed Computation toy model is an example of real Computation in Superposition, but have since realized that it’s probably not.

Shouldn't this be generally "likely tokens are even more likely"? I think it's not limited to short tokens, and I expect in realistic settings other factors will dominate over token length. But I agree that top-k (or top-p) sampling should lead to a miscalibration of LLM outputs in the low-probability tail.

I suspect this has something to do with "LLM style". LLMs may be pushed to select "slop" words because those words have more possible endings, even if none of those endings are the best one.

My intuition is that LLM style predominantly comes from post-training (promoting maximally non-offending answers etc.) rather than due to top-k/p sampling. (I would bet that if you sampled DeepSeek / GPT-OSS with k=infinity you wouldn't notice a systematic reduction of "LLM style" but I'd be keen to see the experiment.)
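A small sketch of the "likely tokens get even more likely" effect of top-k sampling, using a made-up next-token distribution (the probabilities and k are arbitrary):

```python
import numpy as np

def top_k_renormalize(probs, k):
    """Keep the k most likely tokens, zero out the rest, and renormalize."""
    keep = np.argsort(probs)[-k:]
    truncated = np.zeros_like(probs)
    truncated[keep] = probs[keep]
    return truncated / truncated.sum()

# Hypothetical model distribution over six tokens.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
sampled = top_k_renormalize(probs, k=3)

print("model probs:  ", probs)
print("top-3 sampled:", sampled)
```

Every kept token is sampled more often than the model's own probability suggests, and the tail is never sampled at all, regardless of token length.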

Thanks for the writeup, I appreciated the explanations and especially the Alice/Bob/Blake example!

Interesting project, thanks for doing this!

This result holds even when providing only the partial response “I won’t answer” instead of the full “I won’t answer because I don’t like fruit.”

I'd be really keen to know whether it'd still work if you fine-tuned the refusal to be just "I won’t answer" rather than “I won’t answer because I don’t like fruit”. Did you try anything like that? Or is there a reason you included fruit in the backdoor? Currently it's not 100% clear whether the "fruit" latents come from the "because I don’t like fruit" training or from the trigger.

Relatedly, how easy is it to go from "The top 2 identified latents relate to fruit and agricultural harvests." to find an actual trigger sentence? Does anything related to fruit or agricultural harvests work?

I like the blinded experiment with the astrology trigger! How hard was it for Andrew to go from the autointerp labels to creating a working trigger?

Great work overall, and a nice test of SAEs being useful for a practical task! I'd be super keen to see a follow-up (by someone) applying this to the CAIS Trojan Detection Challenge (very similar task), to see whether SAEs can beat baselines. [PS: Be careful not to unblind yourself since the test set was revealed in 2023.]

Reposting my Slack comment here for the record: I'm excited to see challenges to our fundamental assumptions and exploration of alternatives!

Unfortunately, I think that the modified loss function makes the task a lot easier, and the results not applicable to superposition. (I think @Alex Gibson makes a similar point above.)

In this post, we use a loss function that focuses only on reconstructing active features

It is much easier to reconstruct the active feature without regard for interference (inactive features also appearing active).

In general, I find that the issue in NNs is that you not only need to "store" things in superposition, but be able to read them off with low error / interference. Chris Olah's note on "linear readability" here (inspired by the Computation in Superposition work) describes that somewhat.

We've experimented with similar loss function ideas (almost the same as your loss actually, for APD) at Apollo, but always found that ignoring inactive features makes the task unrealistically easy.
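To illustrate why ignoring inactive features makes the task easier, here is a toy comparison of a full reconstruction loss with an active-only loss. Both functions are hypothetical simplifications for illustration and are not the exact loss from the post or from APD.

```python
import numpy as np

def full_mse(target, recon):
    """Mean squared error over all features, active or not."""
    return float(np.mean((target - recon) ** 2))

def active_only_mse(target, recon):
    """Mean squared error over the active (non-zero) target features only."""
    active = target != 0
    return float(np.mean((target[active] - recon[active]) ** 2))

# One active feature out of four; the reconstruction gets the active feature
# right but shows interference on the three inactive ones.
target = np.array([1.0, 0.0, 0.0, 0.0])
recon = np.array([1.0, 0.4, -0.3, 0.5])

print("full MSE:       ", full_mse(target, recon))
print("active-only MSE:", active_only_mse(target, recon))
```

The active-only loss is zero even though inactive features appear active, so it places no pressure on the model to keep interference low.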

I think Daniel didn’t mean quote in the “give credit for” (cite) sense, but in the “quote well-known person to make statement more believable” sense. I think you may have understood it as the former?

Thanks for posting this! Your description of transformations between layers, squashing & folding etc., reminds me of some old-school ML explanations about "how do multi-layer perceptrons work" (this is not meant as a bad thing, but a potential direction to look into!), though I can't think of references right now.

It also reminds me of Victor Veitch's group's work, e.g. Park et al., though pay special attention to the refutation(?) of this particular paper.

Finally, I can imagine connecting what you say to my own research agenda around "activation plateaus" / "stable regions". I'm in the process of producing a better write-up to explain my ideas, but essentially I have the impression that NNs map discrete regions of activation space to specific activations later on in a model (squashing?), and wonder whether we can make use of these regions.

Hmm, I've got a couple of questions. Quoting from the abstract,

In this paper we argue that ReLU networks learn an implicit linear model we can actually tap into.

What do you mean by linear model? In particular, do you mean "Actually DNNs are linear"? Because that's importantly not true; linear models cannot do the things we care about.

We describe that alleged model formally and show that we can approximately pull its decision boundary back to the input space with certain simple modification to the backward pass. The resulting gradients (called excitation pullbacks) reveal high-resolution input- and target-specific features of remarkable perceptual alignment [...]

Figure 1 is less strong evidence than one might initially think. Various saliency map techniques fell for interpretability illusions in the past, see the canonical critique here.

That said, I haven't read your full paper. Do you still think your method is working after considering the saliency map illusions?

Also, I want to reward people thinking of new interpretability ideas and talking about them, thank you for doing so!
