I'm quite excited by this work. Principled justification of various techniques for MELBO, insights into feature multiplicity, a potential generalized procedure for selecting steering coefficients... all in addition to substantial progress on the original MELBO problem itself, demonstrated e.g. on password-locked MATH and on vanilla activation-space adversarial attacks.
(Sorry, this comment is only tangentially relevant to MELBO / DCT per se.)
It seems like the central idea of MELBO / DCT is that 'we can find effective steering vectors by optimizing for changes in downstream layers'.
I'm pretty interested in whether this also applies to SAEs.
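To check that I'm understanding the idea, here is a minimal sketch of that objective as I read it (the hook mechanics, layer indices, and the assumption of a Llama-style HuggingFace model are all mine, not the post's actual implementation):

```python
import torch

def downstream_change(model, tokens, steering_vec, src_layer=8, tgt_layer=16):
    """How much does adding `steering_vec` to the residual stream at `src_layer`
    change the activations at `tgt_layer`? (MELBO/DCT-style objective: pick the
    vector, subject to a norm constraint, that maximizes this.)"""
    # Unsteered forward pass, caching the target-layer activations.
    with torch.no_grad():
        clean = model(tokens, output_hidden_states=True).hidden_states[tgt_layer]

    # Steered forward pass: add the vector to the output of the source layer.
    # Assumes Llama-style decoder layers returning a tuple (hidden_states, ...).
    handle = model.model.layers[src_layer].register_forward_hook(
        lambda module, inputs, output: (output[0] + steering_vec,) + output[1:]
    )
    try:
        steered = model(tokens, output_hidden_states=True).hidden_states[tgt_layer]
    finally:
        handle.remove()

    # The quantity to maximize: mean norm of the downstream activation change.
    return (steered - clean).norm(dim=-1).mean()
```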
Nice work! A few questions:
I'm curious if you have found any multiplicity in the output directions (what you denote as ), or if the multiplicity is only in the input directions. I would predict that there would be some multiplicity in output directions, but much less than the multiplicity in input directions for the corresponding concept.
Relatedly, how do you think about output directions in general? Do you think they are just upweighting/downweighting tokens? I'd imagine that their level of abstraction depends on how far from the end of the network the output layer is, which will ultimately end up determining how much of their effect is directly on the unembed vs. indirectly through other layers.
Regarding your first question (multiplicity of output directions, as compared with input directions) - I would say that a priori my intuition matched yours (that there should be less multiplicity in output directions), but that the empirical evidence is mixed:
Evidence for less output vs input multiplicity: In initial experiments, I found that orthogonalizing the output directions led to less stable optimization curves, and to subjectively less interpretable features. This suggests that there is less multiplicity in output directions. (And in fact my suggestion above in algorithms 2/3 is not to orthogonalize them.)
Evidence for more (or at least the same) output vs input multiplicity: Taking the output directions from the same DCT for which I analyzed multiplicity, and applying the same metrics to the top vectors, I find that their average pairwise similarity is lower than the corresponding value for the input directions, so that on average the output directions are less similar to each other than the input directions (with the caveat that ideally I'd do the comparison over multiple runs and compute some sort of p-value). Similarly, the condition number of the output-direction matrix for that run is smaller than that of the input-direction matrix, so the output directions look "less co-linear" than the input directions.
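For concreteness, the comparison I have in mind is roughly the following (a simplified sketch: the exact similarity metric and normalization I used may differ slightly, and the use of absolute cosines here is just one reasonable choice):

```python
import torch

def multiplicity_metrics(dirs: torch.Tensor):
    """dirs: (k, d_model) matrix whose rows are the top-k learned directions
    (input or output). Returns the average pairwise |cosine similarity| and the
    condition number of the stacked matrix, as rough measures of co-linearity."""
    unit = dirs / dirs.norm(dim=-1, keepdim=True)
    cos = unit @ unit.T                                   # (k, k) pairwise cosines
    off_diag = cos[~torch.eye(len(dirs), dtype=torch.bool)]
    avg_cos = off_diag.abs().mean()

    # Condition number = ratio of largest to smallest singular value;
    # larger values mean the directions are closer to co-linear.
    svals = torch.linalg.svdvals(unit)
    cond = svals.max() / svals.min()
    return avg_cos.item(), cond.item()
```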
As for how to think about output directions, my guess is that, at an intermediate layer well before the end of the model, these features are not just upweighting/downweighting tokens but are doing something more abstract. I don't have any hard empirical evidence for this though.
Would you expect the applications to interpretability to hold for inputs radically outside of distribution?
My naive intuition is that by taking derivatives you are only describing local behaviour.
(I am "shooting from the hip" epistemically)
I think this work, and its predecessor Mechanistically Eliciting Latent Behaviors in Language Models, are extremely exciting. I'd particularly like to see them applied to alignment-relevant situations like situational awareness and sandbagging on evals: the author has already demonstrated that you can recover a significant portion of password-locked capabilities, and overcoming sandbagging seems rather similar.

Another area that might be very interesting is honesty when asked to confess. If you put an agent in a situation that incentivizes it to do something morally questionable (insider trading, hacking an evaluation or a chess-engine opponent, and so forth), wait until it does so, and then ask the model whether it has done anything we might disapprove of, it will sometimes confess, but typically it will deny its guilt (and larger models seem to be more consistent in this cover-up behavior). However, suppose we take some prompts like that and use your unsupervised technique to find activations that make a large difference to what happens next after the agent has been asked about its behavior: hopefully we'd find one or more "truth-serum" directions that encourage the model to honestly confess if it has done something bad, but don't induce false confessions when it hasn't. We could even modify the unsupervised loss into a differential one that maximizes the difference in response when the model has something to confess while minimizing it when it doesn't, to encourage finding "truth-serum" activations rather than "forced confession" activations.
Just as was found for jailbreaks, I'd expect us to find multiple such activations that are near-orthogonal, where trying all (or most) of them is more effective than just using their average. From a model control point of view, having many different variant truth serums seems like it should improve coverage and make it harder for a model to find some way to get around this during training.
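To sketch the differential loss I have in mind (everything here is hypothetical: the `downstream_change` helper stands in for whatever MELBO/DCT actually optimizes, and the fixed-norm projection and hyperparameters are placeholders):

```python
import torch

def find_truth_serum_vector(downstream_change, guilty_prompts, innocent_prompts,
                            d_model, radius=8.0, steps=200, lr=1e-2):
    """Differential variant of the unsupervised objective: look for a steering
    vector whose downstream effect is large on transcripts where the agent has
    misbehaved and small on matched transcripts where it hasn't.
    `downstream_change(prompts, vec)` is assumed to return the (differentiable)
    mean change in target-layer activations when `vec` is added at the source
    layer, i.e. whatever the MELBO/DCT objective actually measures."""
    vec = torch.randn(d_model)
    vec = (radius * vec / vec.norm()).requires_grad_()
    opt = torch.optim.Adam([vec], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # Reward large effects on "guilty" contexts, penalize them on
        # "innocent" ones, to discourage "forced confession" vectors.
        loss = -(downstream_change(guilty_prompts, vec)
                 - downstream_change(innocent_prompts, vec))
        loss.backward()
        opt.step()
        with torch.no_grad():
            vec.mul_(radius / vec.norm())  # keep the vector at a fixed norm
    return vec.detach()
```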
Although, in contrast to Ramesh et al. (2018) and my work, that paper only considers the Jacobian of a shallow slice rather than a deep one.
We also tried using the Jacobians between every layer and the final layer, instead of the Jacobians between adjacent layers. This is what we call "global interaction basis" in the paper. It didn't change the results much.
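Schematically, the two variants differ only in which slice of the network the Jacobian is taken over. A simplified sketch, assuming layers that map a single activation tensor to the next (which real transformer blocks, with their extra inputs, don't quite do):

```python
import torch

def slice_jacobian(layers, activation, start, end):
    """Jacobian of the map from the activation entering layer `start` to the
    activation leaving layer `end - 1`. With end = start + 1 this is the
    adjacent-layer Jacobian; with end = len(layers) it is the Jacobian between
    layer `start` and the final layer."""
    def slice_fn(a):
        for layer in layers[start:end]:
            a = layer(a)
        return a
    return torch.autograd.functional.jacobian(slice_fn, activation)
```

The adjacent-layer version then corresponds to `slice_jacobian(layers, a, l, l + 1)` and the global version to `slice_jacobian(layers, a, l, len(layers))`.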