My current research focuses on the mechanistic interpretability of machine learning models, specifically using sparse autoencoders. This work combines my interests in computational modeling and complex systems, now applied to understanding the inner workings of AI.
Previously, I was a postdoc at the Cancer Institute at University...
Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.
*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.
TL;DR: We benchmark seven methods for preventing emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in the training mix is insufficient to prevent misalignment, yet a simple KL divergence penalty on alignment data outperforms more sophisticated methods.
Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.
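For concreteness, here is a minimal sketch of the kind of KL penalty referred to in the TL;DR. It is illustrative only: it assumes a HuggingFace-style causal LM that returns a loss when given labels, and the batch names, KL direction, and weighting are placeholders rather than our exact training setup.

```python
import torch
import torch.nn.functional as F

def training_step(model, ref_model, task_batch, alignment_batch, kl_weight=1.0):
    """One step of narrow fine-tuning with a KL penalty on alignment data.

    task_batch / alignment_batch: dicts with input_ids (and labels for the task).
    Names are illustrative, not taken from the post.
    """
    # Standard next-token loss on the narrow fine-tuning task.
    task_loss = model(**task_batch).loss

    # Penalize drift from the reference model's distribution on alignment prompts.
    # This computes KL(reference || current); the direction is a design choice.
    with torch.no_grad():
        ref_logits = ref_model(input_ids=alignment_batch["input_ids"]).logits
    cur_logits = model(input_ids=alignment_batch["input_ids"]).logits
    kl = F.kl_div(
        F.log_softmax(cur_logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

    loss = task_loss + kl_weight * kl
    loss.backward()
    return task_loss.detach(), kl.detach()
```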
Introduction
Training to improve capabilities...
Excellent work, and I think you raise a lot of really good points, which help clarify for me why this research agenda is running into issues; I think it also ties in to my concerns about activation-space work raised by recent success in latent obfuscation (https://arxiv.org/abs/2412.09565v1).
In a way that does not affect the larger point, I think your framing of the problem of extracting composed features may be slightly too strong: in a subset of cases, e.g. where there is a hierarchical relationship between features (https://www.lesswrong.com/posts/XHpta8X85TzugNNn2/broken-latents-studying-saes-and-feature-co-occurrence-in), SAEs might be able to pull out groups of latents that act compositionally (https://www.lesswrong.com/posts/WNoqEivcCSg8gJe5h/compositionality-and-ambiguity-latent-co-occurrence-and). The relationship to any underlying compositional encoding in the model is unclear, this probably only works in a few cases, and it does not seem like a scalable approach, but I think SAEs may be doing something more complex and weirder than only finding composed features.
I agree that comparing tied and untied SAEs might be a good way to separate out cases where the underlying features are inherently co-occurring. I have wondered if this might lead to a way to better understand the structure of how the model makes decisions, similar to the work of Adam Shai (https://arxiv.org/abs/2405.15943). Cases where the tied SAE simply has to not represent a feature may be a good way of detecting inherently hierarchical features (to work out whether something is an apple, you first decide whether it is a fruit, for example), if LLMs learn to think that way.
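To spell out what I mean by tied vs. untied, here is a toy sketch (my own illustration, not code from either post): in the tied case the decoder reuses the encoder weights, so each latent must write back along the same direction it reads from.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE; `tied=True` forces the decoder to be the encoder transpose."""

    def __init__(self, d_model: int, d_sae: int, tied: bool = False):
        super().__init__()
        self.tied = tied
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        if not tied:
            self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)

    def forward(self, x):
        # Latent activations from a ReLU encoder.
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # A tied decoder reuses the encoder weights, so a latent reconstructs
        # along the direction it reads from; an untied decoder is free to
        # write elsewhere, which can absorb co-occurrence structure.
        W_dec = self.W_enc.T if self.tied else self.W_dec
        recon = acts @ W_dec + self.b_dec
        return recon, acts
```

Comparing which features the tied variant is forced to drop, relative to an untied SAE of the same width, might then be one way to flag the inherently hierarchical cases.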
I think what you say about clustering of activation densities makes...
Fascinating post! I (along with Hardik Bhatnagar and Joseph Bloom) recently completed a profile of cases of SAE latent co-occurrence in GPT2-small and Gemma-2-2b (see here) and I think that this is a really compelling driver for a lot of the behaviour that we see, such as the link to SAE width. In particular, we observe a lot of cases with apparent parent-child relations between the latents (e.g. here).
We also see a similar 'splitting' of activation strength in cases of composition, e.g. we find a case where the child latents are all days of the week (e.g. 'Monday'), but the activation (or lack thereof) of the parent latent corresponds to whether there...
PIBBSS was a fantastic experience, I highly recommend people apply to the 2025 Fellowship! Huge thanks to the whole team and especially my mentor Joseph Bloom!
Matthew A. Clarke, Hardik Bhatnagar and Joseph Bloom
This work was produced as part of the PIBBSS program summer 2024 cohort.
tl;dr
Sparse autoencoders (SAEs) are a promising method to extract monosemantic, interpretable features from large language models (LLMs)
SAE latents have recently been shown to be non-linear in some cases; here we show that they can also be non-independent, instead forming clusters of co-occurrence
We ask:
How independent are SAE latents?
How does this depend on SAE width, L0 and architecture?
What does this mean for latent interpretability?
We find that:
Most latents are independent, but a small fraction form clusters where they co-occur more than expected by chance (see the sketch after this list)
The rate of co-occurrence and the size of these clusters decrease as SAE width increases
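As a rough illustration of what "co-occur more than expected by chance" means (a toy sketch, not our actual analysis pipeline), one can compare observed joint firing rates of latent pairs against the rates implied by independence:

```python
import torch

def cooccurrence_stats(latent_acts: torch.Tensor, eps: float = 1e-8):
    """Toy co-occurrence measure for SAE latents.

    latent_acts: [n_tokens, n_latents] SAE activations on a batch of tokens.
    Returns observed joint firing rates, the rates expected under independence,
    and their ratio.
    """
    active = (latent_acts > 0).float()               # [n_tokens, n_latents]
    n_tokens = active.shape[0]

    firing_rate = active.mean(dim=0)                 # P(latent i active)
    joint = (active.T @ active) / n_tokens           # P(i and j active together)
    expected = firing_rate[:, None] * firing_rate[None, :]  # independence baseline

    # Ratio > 1 means a pair co-occurs more than chance; clusters can then be
    # found by thresholding this matrix and taking connected components.
    ratio = joint / (expected + eps)
    return joint, expected, ratio
```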