My current research focuses on the mechanistic interpretability of machine learning models, specifically using sparse autoencoders. This work combines my interests in computational modeling and complex systems, now applied to understanding the inner workings of AI.
Previously, I was a postdoc at the Cancer Institute at University...
Ariana Azarbal*, Matthew A. Clarke*, Jorio Cocola*, Cailley Factor*, and Alex Cloud.
*Equal Contribution. This work was produced as part of the SPAR Spring 2025 cohort.
TL;DR: We benchmark seven methods for preventing emergent misalignment and other forms of misgeneralization using limited alignment data. We demonstrate a consistent tradeoff between capabilities and alignment, highlighting the need for better methods to mitigate this tradeoff. Merely including alignment data in the training mix is insufficient to prevent misalignment, yet a simple KL divergence penalty on alignment data outperforms more sophisticated methods.
Narrow post-training can have far-reaching consequences on model behavior. Some are desirable, whereas others may be harmful. We explore methods enabling selective generalization.
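For concreteness, here is a minimal sketch of the kind of KL penalty referred to in the TL;DR. It is illustrative only: it assumes a HuggingFace-style causal LM that returns a loss when given labels, and the batch names, KL direction, and weighting are placeholders rather than our exact training setup.

```python
import torch
import torch.nn.functional as F

def training_step(model, ref_model, task_batch, alignment_batch, kl_weight=1.0):
    """One step of narrow fine-tuning with a KL penalty on alignment data.

    task_batch / alignment_batch: dicts with input_ids (and labels for the task).
    Names are illustrative, not taken from the post.
    """
    # Standard next-token loss on the narrow fine-tuning task.
    task_loss = model(**task_batch).loss

    # Penalize drift from the reference model's distribution on alignment prompts.
    # This computes KL(reference || current); the direction is a design choice.
    with torch.no_grad():
        ref_logits = ref_model(input_ids=alignment_batch["input_ids"]).logits
    cur_logits = model(input_ids=alignment_batch["input_ids"]).logits
    kl = F.kl_div(
        F.log_softmax(cur_logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )

    loss = task_loss + kl_weight * kl
    loss.backward()
    return task_loss.detach(), kl.detach()
```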
Introduction
Training to improve capabilities...
Excellent work, and I think you raise a lot of really good points, which help clarify for me why this research agenda is running into issues; I think it also ties in to my concerns about activation-space work raised by recent success in latent obfuscation (https://arxiv.org/abs/2412.09565v1).
In a way that does not affect the larger point, I think your framing of the problem of extracting composed features may be slightly too strong: in a subset of cases, e.g. where there is a hierarchical relationship between features (https://www.lesswrong.com/posts/XHpta8X85TzugNNn2/broken-latents-studying-saes-and-feature-co-occurrence-in), SAEs might be able to pull out groups of latents that act compositionally (https://www.lesswrong.com/posts/WNoqEivcCSg8gJe5h/compositionality-and-ambiguity-latent-co-occurrence-and). The relationship to any underlying compositional encoding in the model is unclear, this probably only works in a few cases, and it does not seem like a scalable approach, but I think SAEs may be doing something more complex and weirder than only finding composed features.
I agree that comparing tied and untied SAEs might be a good way to separate out cases where the underlying features are inherently co-occurring. I have wondered if this might lead to a way to better understand the structure of how the model makes decisions, similar to the work of Adam Shai (https://arxiv.org/abs/2405.15943). Cases where the tied SAE simply has to not represent a feature may be a good way of detecting inherently hierarchical features (to work out whether something is an apple, you first decide whether it is a fruit, for example), if LLMs learn to think that way.
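To spell out what I mean by tied vs. untied, here is a toy sketch (my own illustration, not code from either post): in the tied case the decoder reuses the encoder weights, so each latent must write back along the same direction it reads from.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy SAE; `tied=True` forces the decoder to be the encoder transpose."""

    def __init__(self, d_model: int, d_sae: int, tied: bool = False):
        super().__init__()
        self.tied = tied
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        if not tied:
            self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)

    def forward(self, x):
        # Latent activations from a ReLU encoder.
        acts = torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # A tied decoder reuses the encoder weights, so a latent reconstructs
        # along the direction it reads from; an untied decoder is free to
        # write elsewhere, which can absorb co-occurrence structure.
        W_dec = self.W_enc.T if self.tied else self.W_dec
        recon = acts @ W_dec + self.b_dec
        return recon, acts
```

Comparing which features the tied variant is forced to drop, relative to an untied SAE of the same width, might then be one way to flag the inherently hierarchical cases.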
I think what you say about clustering of activation densities makes...
Fascinating post! I (along with Hardik Bhatnagar and Joseph Bloom) recently completed a profile of cases of SAE latent co-occurrence in GPT2-small and Gemma-2-2b (see here) and I think that this is a really compelling driver for a lot of the behaviour that we see, such as the link to SAE width. In particular, we observe a lot of cases with apparent parent-child relations between the latents (e.g. here).
We also see a similar 'splitting' of activation strength in cases of composition, e.g. we find a case where the child latents are all days of the week (e.g. 'Monday'), but the activation (or lack thereof) of the parent latent corresponds to whether there...
PIBBSS was a fantastic experience, I highly recommend people apply to the 2025 Fellowship! Huge thanks to the whole team and especially my mentor Joseph Bloom!
Matthew A. Clarke, Hardik Bhatnagar and Joseph Bloom
This work was produced as part of the PIBBSS program summer 2024 cohort.
tl;dr
Sparse autoencoders (SAEs) are a promising method to extract monosemantic, interpretable features from large language models (LLMs)
SAE latents have recently been shown to be non-linear in some cases; here we show that they can also be non-independent, instead forming clusters of co-occurrence
We ask:
How independent are SAE latents?
How does this depend on SAE width, L0 and architecture?
What does this mean for latent interpretability?
We find that:
Most latents are independent, but a small fraction form clusters where they co-occur more than expected by chance (see the sketch after this list)
The rate of co-occurrence and the size of these clusters decrease as SAE width increases
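As a rough illustration of what "co-occur more than expected by chance" means (a toy sketch, not our actual analysis pipeline), one can compare observed joint firing rates of latent pairs against the rates implied by independence:

```python
import torch

def cooccurrence_stats(latent_acts: torch.Tensor, eps: float = 1e-8):
    """Toy co-occurrence measure for SAE latents.

    latent_acts: [n_tokens, n_latents] SAE activations on a batch of tokens.
    Returns observed joint firing rates, the rates expected under independence,
    and their ratio.
    """
    active = (latent_acts > 0).float()               # [n_tokens, n_latents]
    n_tokens = active.shape[0]

    firing_rate = active.mean(dim=0)                 # P(latent i active)
    joint = (active.T @ active) / n_tokens           # P(i and j active together)
    expected = firing_rate[:, None] * firing_rate[None, :]  # independence baseline

    # Ratio > 1 means a pair co-occurs more than chance; clusters can then be
    # found by thresholding this matrix and taking connected components.
    ratio = joint / (expected + eps)
    return joint, expected, ratio
```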