This post gives a brief overview of, and some personal thoughts on, a new ICLR workshop paper that I worked on together with Seamus.
In this project, we developed a proof of concept for a novel way to automatically label features, directly optimizing the feature label via token-space gradient descent. We demonstrate its performance on several synthetic toy features. We have discontinued developing this method because it didn't perform as well as we had hoped, and I'm now more excited about research in other directions.
A central method in Mechanistic Interpretability is decomposing neural network activations into linear features via methods like Sparse Autoencoders (SAEs), Transcoders, or Crosscoders. While these decompositions give you somewhat human-understandable features, they don't come with an explanation of what each feature actually means. Because there are so many features in modern LLMs, we need an automatic way to find these descriptions. This is where automatic feature labeling methods come in.
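For readers less familiar with these decompositions, here is a minimal sketch of a sparse autoencoder (illustrative PyTorch with made-up sizes; real SAE variants differ in details such as sparsity penalties and weight tying):

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE: decomposes a model activation into many sparse linear features."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activation: torch.Tensor):
        features = torch.relu(self.encoder(activation))  # sparse, (hopefully) interpretable features
        reconstruction = self.decoder(features)          # approximate original activation
        return features, reconstruction

# Made-up sizes, just for illustration.
sae = SparseAutoencoder(d_model=768, n_features=16384)
features, reconstruction = sae(torch.randn(4, 768))
```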
Previous methods for automated feature labeling work by finding text in which specific tokens activate the feature, and then prompting an LLM to come up with explanations for what it might mean that the feature is active in these situations. Once you have such a hypothesis, you can validate it by having an LLM predict when it thinks the feature should be active, and checking those predictions against the actual activations. If the predictions don't match, you can prompt the LLM again with the counterexamples and ask it to come up with a new feature label.
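A rough sketch of this explain-then-score loop is below. All function names and dummy return values are placeholders standing in for LLM prompts and activation lookups, not any existing library's API:

```python
def explain(examples: list[tuple[str, float]]) -> str:
    """Stand-in for prompting an LLM with activating examples and asking for a label."""
    return "tokens related to animals"

def predict_activations(label: str, tokens: list[str]) -> list[float]:
    """Stand-in for prompting an LLM: given the label, which tokens should activate?"""
    return [1.0 if token in {"dog", "cat"} else 0.0 for token in tokens]

def agreement(predicted: list[float], actual: list[float]) -> float:
    """Fraction of tokens where prediction and measured activation agree."""
    return sum((p > 0.5) == (a > 0.5) for p, a in zip(predicted, actual)) / len(actual)

tokens = ["the", "dog", "sat", "on", "the", "mat"]
actual = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0]            # measured feature activations
label = explain([("dog", 1.0), ("cat", 0.9)])       # initial hypothesis
for _ in range(3):                                  # refine until predictions match well enough
    if agreement(predict_activations(label, tokens), actual) > 0.9:
        break
    label = explain([("dog", 1.0), ("cat", 0.9)])   # in practice: re-prompt with counterexamples
```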
This process crucially depends on LLMs generating correct hypotheses for feature labels given the right data. This might be problematic because:
So we developed a method in which the LLM only does the rating: it takes a feature label and predicts what the activations should be given that label. The hypotheses themselves are generated via gradient descent.
Our method is fundamentally different from previous approaches. Instead of asking an LLM to generate hypotheses about what a feature means, we use the LLM as a discriminator to evaluate potential labels.
The key insight is that we can frame feature labeling as an optimization problem. A good feature label is one that allows accurate prediction of when a feature will activate. By using an LLM to assess whether a token matches a given feature description, we create a differentiable pipeline that enables gradient-based optimization of the label itself.
Here's how it works:
This process is visualized in the plot above. The LLM predicts whether each token matches the feature description, and we compare these predictions to the actual feature activations (here a synthetic feature that is always 0, except that it is 1 for tokens describing animals). The bottom plot shows the optimization trajectory in token space, where the probability mass concentrates on the token "animal" after several hundred steps.
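To make the setup concrete, here is a toy, self-contained sketch of the idea: the label is parameterized as a probability distribution over vocabulary tokens and optimized by gradient descent. The "rater" below is a frozen embedding-similarity model standing in for the LLM rater from the paper, and the vocabulary, sizes, and hyperparameters are all made up:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy vocabulary and frozen "token embeddings"; everything here is invented.
vocab = ["dog", "cat", "horse", "table", "car", "river", "animal", "blue"]
d_model = 32
embeddings = torch.randn(len(vocab), d_model)
# Give "animal" an embedding near the animal tokens so a sensible label exists.
animal_ids = [vocab.index(t) for t in ("dog", "cat", "horse")]
embeddings[vocab.index("animal")] = embeddings[animal_ids].mean(dim=0)

# Synthetic feature: 1 on tokens describing animals, 0 everywhere else.
target = torch.tensor([1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])

# The label is a distribution over tokens, parameterized by trainable logits.
label_logits = torch.zeros(len(vocab), requires_grad=True)

def rater(label_probs: torch.Tensor) -> torch.Tensor:
    """Stand-in rater: per-token probability that the token matches the soft label.
    (In the actual method, an LLM plays this role.)"""
    soft_label = label_probs @ embeddings             # expected label embedding
    scores = embeddings @ soft_label / d_model ** 0.5
    return torch.sigmoid(scores)

optimizer = torch.optim.Adam([label_logits], lr=0.1)
for step in range(300):
    probs = F.softmax(label_logits, dim=-1)
    predicted = rater(probs)                          # predicted activations under this label
    loss = F.binary_cross_entropy(predicted, target)  # compare to actual (synthetic) activations
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

final_probs = F.softmax(label_logits, dim=-1)
print(vocab[final_probs.argmax()])  # inspect where probability mass ended up (ideally "animal")
```

The point of the sketch is that the whole path from the label logits to the predicted activations is differentiable, so the label itself receives gradients; the real pipeline replaces the stand-in rater with the LLM rater.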
Our loss function combines three components:
By optimizing these objectives simultaneously, we can find interpretable and accurate feature labels.
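Purely as an illustration of mixing such objectives (these are not the paper's components; the terms and weights below are invented), a combined loss could look like this:

```python
import torch

def combined_label_loss(predicted: torch.Tensor, target: torch.Tensor,
                        label_probs: torch.Tensor, prior_probs: torch.Tensor) -> torch.Tensor:
    """Hypothetical combination of objectives (terms and weights are made up):
    - prediction: how well the label explains the measured activations,
    - entropy: pushes the label distribution toward a single token,
    - prior: keeps the label mass on plausible tokens."""
    prediction = torch.nn.functional.binary_cross_entropy(predicted, target)
    entropy = -(label_probs * label_probs.clamp_min(1e-9).log()).sum()
    prior = (label_probs * (label_probs.clamp_min(1e-9).log()
                            - prior_probs.clamp_min(1e-9).log())).sum()  # KL(label || prior)
    return prediction + 0.01 * entropy + 0.001 * prior
```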
We tested our method on several synthetic features and found that it successfully converged to meaningful single-token labels in many cases:
The optimization trajectories show how token probability distributions evolve during training for successfully labeled features:
However, we also encountered limitations and failure cases:
Several limitations could be addressed in future work:
It's pretty cool that this method works at all. We think more work in this direction could be exciting, particularly for safety-critical interpretability tasks where having an alternative to fully LLM-generated explanations might be valuable.
However, we're not super hyped about expanding the method ourselves because:
If anyone is interested in working on this approach, our code is available on GitHub. We're happy to chat about further details; the prompts, datasets, and hyperparameters can be found in the paper.