Cool paper. I think the semantic similarity result is particularly interesting.As I understand it you've got a circuit that wants to calculate something like Sim(A,B), where A and B might have many "senses" aka: features but the Sim might not be a linear function of each of thes Sims across all senses/features. So for example, there are senses in which "Berkeley" and "California" are geographically related, and there might be a few other senses in which they are semantically related but probably none that really matter for copy suppression. For this reason wouldn't expect the tokens of each of to have cosine similarity that is predictive of the copy suppression score. This would only happen for really "mono-semantic tokens" that have only one sense (maybe you could test that).
Moreover, there are also tokens which you might want to ignore when doing copy suppression (speculatively). Eg: very common words or punctuations (the/and/etc). I'd be interested if you have use something like SAE's to decompose the tokens into the underlying feature/s present at different intensities in each of these tokens (or the activations prior to the key/query projections). Follow up experiments could attempt to determine whether copy suppression could be better understood when the semantic subspaces are known. Some things that might be cool here:- Show that some features are mapped to the null space of keys/queries in copy suppression heads indicating semantic senses / features that are ignored by copy suppression. Maybe multiple anti-induction heads compose (within or between layers) so that if one maps a feature to the null space, another doesn't (or some linear combination) or via a more complicated function of sets of features being used to inform suppression. - Similarly, show that the OV circuit is suppressing the same features/features you think are being used to determine semantic similarity. If there's some asymmetry here, that could be interesting as it would correspond to "I calculate A and B as similar by their similarity in the *california axis* but I suppress predictions of any token that has the feature for anywhere on the West Coast*).I'm particularly excited about this because it might represent a really good way to show how knowing features informs the quality of mechanistic explanations.
Interesting! This is very cool work but I'd like to understand your metrics better. - "So we take the difference in loss for features (ie for a feature, we take linear loss - MLP loss)". What do you mean here? Is this the difference between the mean MSE loss when the feature is on vs not on? - Can you please report the L0's for each of the auto-encoders and the linear model as well as the next token prediction loss when using the autoencoder/linear model. These are important metrics on which my generally excitement hinges. (eg: if those are both great, I'm way more interested in results about specific features). - I'd be very interested in you can take a specific input, look at the features present and compare them between autoencoder/the linear model. This would be especially cool if you pick an example where ablating the MLP out causes the incorrect prediction so we know it's representing something important.- Are you using a holdout dataset of eval tokens when measuring losses? Or how many tokens are you using to measure losses? - Have you plotted per token MSE loss vs l0 for each model? Do they look similar? Are there any outliers in that relationship?
I'd be very interested in seeing these products and hearing about the use-cases / applications. Specifically, my prior experience at a startup leads me to believe that building products while doing science can be quite difficult (although there are ways that the two can synergise). I'd be more optimistic about someone claiming they'll do this well if there is an individual involved in the project who is both deeply familiar with the science and has build products before (as opposed to two people each counting on the other to have sufficient expertise they lack). A separate question I have is about how building products might be consistent with being careful about what information you make public. If you there are things that you don't want to be public knowledge, will there be proprietary knowledge not shared with users/clients? It seems like a non-trivial problem to maximize trust/interest/buy-in whilst minimizing clues to underlying valuable insights.
Thanks Jay! (much better answer!)
The first frame, apologies. This is a detail of how we number trajectories that I've tried to avoid dealing with in this post. We left pad in a context windows of 10 timesteps so the first observation frame is S5. I've updated the text not to refer to S5.
I'm not sure if it's interesting to me for alignment, since it's such a toy model.
Cruxes here are things like whether you think toy models are governed by the same rules as larger models, whether studying them helps you understand those general principles and whether understanding those principles is valuable. This model in particular shares many similarities in architecture and training to LLMs and is over a million parameters so it's not nearly as much of a toy model as others and we have particular reasons to expect insights to transfer (both transformers / next token predictors).
What do you think would change when trying to do similar interpretability on less-toy models?
The recipe stays mostly the same but scale increases and you know less about the training distribution.
What would change about finding adversarial examples?
This is a very complicated/broad question. There's a number of ways you could approach this. I'd probably look at identifying critical features in the language model and see whether we can develop automatic techniques for flipping them. This could be done recursively if you are able to find the features most important for those features (etc.). Understanding why existing adversaries like jail-breaking techniques / initial affirmative responses work (mechanistically) might tell us a lot about how to automate more general search for adversaries. However, my guess is that the task of finding adversaries using a white-box approaches may be fairly tractable. The search space is much smaller once you know features and there are many search strategies that might work to flip features (possibly working recursively through features in each layer and guided by some regularization designed to keep adversaries naturalistic/plausible.
Directly intervening on features seems like it might stay the same though.
This doesn't seem super obvious if features aren't orthogonal, or may exist in subspaces or manifolds rather than individual directions. The fact that this transition isn't trivial is one reason it would be better to understand some simple models very well (so that when we go to larger models, we're on surer scientific footing).
My vibe from this post is something like "we're making on stuff that could be helpful so there's stuff to work on!" and this is a vibe I like. However, I suspect that for people who might not be as excited about these approaches, you're likely not touching on important cruxes (eg: do these approaches really scale? Are some agendas capabilities enhancing? Will these solve deceptive alignment or just corrigible alignment?)
I also think that if the goal is to actually make progress and not to maximize the number of people making progress or who feel like they're making progress, then engaging with those cruxes is important before people invest substantive energy (ie: beyond upskilling). However as a directional update for people who are otherwise pretty cynical, this seems like a good update.
Strong disagree. Can’t say I’ve worked through the entire article in detail but wanted to chime in as one of the many of junior researchers investing energy in interpretability. Noting that you erred on the side of making arguments too strong. I agree with Richard about this being the wrong kind of reasoning for novel scientific research and with Rohin’s idea that we’re creating new affordances. I think generally MI is grounded and much closer to being a natural science that will progress over time and be useful for alignment, synergising with other approaches. I can't speak for Neel, but I suspect the original list was more about getting something out there than making many nuanced arguments, so I think it's important to steelman those kinds of claims / expand on them before responding.
A few extra notes:
The first point I want to address your endorsement of “retargeting the search” and finding the “motivational API” within AI systems which is my strongest motivator for working in interpretability.
This is interesting because this would be a way to not need to fully reverse engineer a complete model. The technique used in Understanding and controlling a maze-solving policy network seems promising to me. Just focusing on “the motivational API” could be sufficient.
I predict that methods like “steering vectors” are more likely to work in worlds where we make much more progress in understanding of neural networks. But steering vectors are relatively recent, so it seems reasonable to think that we might have other ideas soon that could be equally useful but may require progress more generally in the field.
We need only look to biology and medicine to see examples of imperfectly understood systems, which remain mysterious in many ways, and yet science has led us to impressive feats that might have been unimaginable years prior. For example, the ability in recent years to retarget the immune system to fight cancer. Because hindsight devalues science we take such technologies for granted and I think this leads to a general over-skepticism about fields like interpretability.
The second major point I wanted to address was this argument:
Determining the dangerousness of a feature is a mis-specified problem. Searching for dangerous features in the weights/structures of the network is pointless. A feature is not inherently good or bad. The danger of individual atoms is not a strong predictor of the danger of assembly of atoms and molecules. For instance, if you visualize the feature of layer 53, channel 127, and it appears to resemble a gun, does it mean that your system is dangerous? Or is your system simply capable of identifying a dangerous gun? The fact that cognition can be externalized also contributes to this point.
I agree that it makes little sense to think of a feature on it’s own as dangerous but I it sounds to me like you are making a point about emergence. If understanding transistors doesn’t lead to understanding computer software then why work so hard to understand transistors?
I am pretty partial to the argument that the kinds of alignment relevant phenomena in neural networks will not be accessible via the same theories that we’re developing today in mechanistic interpretability. Maybe these phenomena will exist in something analogous to a “nervous system” while we’re still understanding “biochemistry”. Unlike transistors and computers though, biochemistry is hugely relevant to understanding neuroscience.