Digging Into Interpretable Features

Sparse autoencoders (SAEs) and cross-layer transcoders (CLTs) have recently been used to decode the activation vectors in large language models into more interpretable features. Analyses have been performed by Goodfire, Anthropic, DeepMind, and OpenAI. BluelightAI has constructed CLT features for the Qwen3 family, specifically Qwen3-0.6B...
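The core idea behind an SAE can be illustrated in a few lines. This is a minimal sketch, not any lab's actual implementation: the dimensions are toy values, and training (omitted) would minimize reconstruction error plus an L1 sparsity penalty on the feature activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for model activation vectors; sizes are illustrative only.
d_model, d_dict, n = 16, 64, 256
X = rng.normal(size=(n, d_model))

# One hidden layer with an overcomplete dictionary (d_dict > d_model).
W_enc = rng.normal(scale=0.1, size=(d_model, d_dict))
b_enc = np.zeros(d_dict)
W_dec = rng.normal(scale=0.1, size=(d_dict, d_model))

def sae_forward(x):
    """Encode activations into nonnegative feature activations, then decode."""
    f = np.maximum(x @ W_enc + b_enc, 0.0)  # ReLU -> sparse, interpretable features
    x_hat = f @ W_dec                        # reconstruction of the activations
    return f, x_hat

f, X_hat = sae_forward(X)

# Training objective: reconstruction error + L1 penalty encouraging sparsity.
l1_coeff = 1e-3
loss = np.mean((X - X_hat) ** 2) + l1_coeff * np.abs(f).mean()
```

Each row of `f` is a sparse code for one activation vector; the rows of `W_dec` play the role of the learned feature directions.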
In our earlier post, we described how one can parametrize local image patches in natural images by a surface called a Klein bottle. In Love et al., we used this information to modify the convolutional neural network construction so as to incorporate information about the pixels in a small neighborhood...
Motivation

Dimensionality reduction is vital to the analysis of high-dimensional data, i.e. data with many features. It allows for a better understanding of the data, so that one can formulate useful analyses. Dimensionality reduction that produces a set of points in a vector space of dimension n, where n s...
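As a concrete illustration of the idea, here is a small sketch of linear dimensionality reduction (PCA via the SVD) on synthetic data that lies near a low-dimensional plane; the data and dimensions are hypothetical, chosen only to make the effect visible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 points lying near a 2-dimensional plane
# embedded in a 50-dimensional feature space, plus a little noise.
latent = rng.normal(size=(200, 2))
embed = rng.normal(size=(2, 50))
X = latent @ embed + 0.01 * rng.normal(size=(200, 50))

# PCA via SVD: center the data, then project onto the top-n principal directions.
n = 2
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
Y = Xc @ Vt[:n].T  # reduced representation, shape (200, n)

# Fraction of total variance captured by the top-n components.
explained = (S[:n] ** 2).sum() / (S ** 2).sum()
```

Because the data was built to be nearly 2-dimensional, the two retained components capture almost all of the variance, and `Y` is a faithful low-dimensional summary of `X`.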
This post is motivated by the observation in Open Problems in Mechanistic Interpretability by Sharkey, Chughtai, et al. that "SDL (sparse dictionary learning) leaves feature geometry unexplained," and that it is desirable to utilize geometric structure to gain interpretability for sparse autoencoder features. We strongly agree, and the goal...
This article was written in response to a post on LessWrong from the Apollo Research interpretability team, and represents our initial attempt to act on its topological data analysis suggestions. In this post, we’ll look at some ways to use topological data analysis (TDA) for mechanistic interpretability. We’ll first...