I don't know much about chess. Could it be that feature 172, which you are highlighting, is related to some kind of chess opening? The distribution of black pawns could be due to different states of the opening, and the positions of the black bishop and white knight could also be related to different parts of that opening.
Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features.
Sparse autoencoders decompose activations into a sum of sparse feature directions. We leverage language models to...
The following post was made as part of Danielle's MATS work on doing circuit-based mech interp on Mamba, mentored by Adrià Garriga-Alonso. It's the first in a sequence of posts about finding an IOI circuit in Mamba/applying ACDC to Mamba.
This introductory post was also made in collaboration with Gonçalo Paulo.
Why Mamba?
Mamba [[1]] is a type of recurrent neural network based on state-space models, and is being proposed as an alternative architecture to transformers. It is the result of years of capability research [[2]] [[3]] [[4]]...
Hi! I know that this post is now almost 5 months old, but I feel like I need to ask some clarifying questions and point out some things about your methodology that I don't completely understand or agree with.
How do you source the sentences used for the scoring method? Are they all from top activations? This is not explicitly mentioned in the methodology section, although in the footnote you do say you have 3 high activations and 3 low activations. Am I right in understanding that there are no cases with no activations?
Are the sentences shown individually or in batche...