TL;DR: In this post we present the exploratory phase of a project that aims to study neural networks by applying static local learning coefficient (LLC) estimation to specific alterations of them. We introduce a new method, Feature Targeted (FT) LLC estimation, and study its ability to distinguish SAE-trained features from random directions. Comparing our method to other possible metrics, we find that it outperforms all of them but one, which has comparable performance.
We discuss possible explanations for our results, our project, and other future directions.
Given a neural network and a latent layer within it, a central motif in current mechanistic interpretability research is to find functions [1] which are features of the model. Features are (generally) expected to exhibit the following properties: