Aidan Ewart

Undergraduate student studying Mathematics @ University of Bristol.

Interested in & pursuing a career in technical AI safety.

Comments

Note: My take does not necessarily represent the takes of my coauthors (Hoagy, Logan, Lee, Robert). Or it might, but they may frame it differently. Take this as strictly my own take.

My take is that the goal isn't strictly to get maximum expressive power under the assumptions detailed in Toy Models of Superposition; for instance, Anthropic found that FISTA-based dictionaries didn't work as well as sparse autoencoders, even though they are more expressive in that they can achieve lower reconstruction loss at the same level of sparsity. We might find that the sparsity-monosemanticity link breaks down at higher levels of autoencoder expressivity, although this needs to be rigorously tested.
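To make the sparsity/reconstruction trade-off concrete, here's a minimal sketch of the usual sparse-autoencoder setup (illustrative names and hyperparameters only, not our actual code): a reconstruction term plus an L1 penalty on the codes.

```python
# Minimal sketch of a sparse autoencoder with an L1 sparsity penalty.
# Illustrative only; not the implementation from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, activation_dim: int, dict_size: int):
        super().__init__()
        self.encoder = nn.Linear(activation_dim, dict_size)
        self.decoder = nn.Linear(dict_size, activation_dim, bias=False)

    def forward(self, x):
        codes = F.relu(self.encoder(x))   # sparse feature activations
        recon = self.decoder(codes)       # reconstruction of the original activation
        return recon, codes

def sae_loss(x, recon, codes, l1_coeff=1e-3):
    # The reconstruction term keeps the dictionary faithful to the activations;
    # the L1 term pushes the codes towards sparsity (and, we hope, monosemanticity).
    return F.mse_loss(recon, x) + l1_coeff * codes.abs().mean()
```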

To answer your question: I think Hoagy thinks that tied weights are more similar to how an MLP might use features during a forward pass, which would involve extracting the feature through a simple dot-product. I'm not sure I buy this, as having untied weights is equivalent to allowing the model to express simple linear computations like 'feature A activation = dot product along feature A direction - dot product along feature B direction', which could be a form of denoising if A and B were mutually exclusive but non-orthogonal features.
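As a toy numerical illustration of that denoising point (made-up directions, not from the paper): if features A and B are non-orthogonal and only B is active, a tied readout along direction A fires spuriously, whereas an untied encoder row of the form direction_A - c * direction_B reads out zero.

```python
# Toy illustration of the denoising argument above; directions are made up.
import torch

d = 16
direction_A = torch.randn(d)
direction_A /= direction_A.norm()
noise = torch.randn(d)
noise -= (noise @ direction_A) * direction_A   # component orthogonal to A
noise /= noise.norm()
direction_B = 0.6 * direction_A + 0.8 * noise  # unit vector with A.B = 0.6

x = direction_B  # activation where only feature B is active

tied_readout = x @ direction_A                          # ~0.6, spurious
untied_readout = x @ (direction_A - 0.6 * direction_B)  # ~0.0, denoised
print(tied_readout.item(), untied_readout.item())
```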

Yep! We are planning to do exactly that for (at least) the models we focus on in the paper (Pythia-70m + Pythia-410m), and probably also GPT2 small. We are also working on cleaning up our codebase (https://github.com/HoagyC/sparse_coding) and implementing some easy dictionary training solutions.

Hi David, co-author of the 'Sparse Autoencoders Find Highly Interpretable Directions in Language Models' paper here.
I think this might be of interest to you:
We are currently in the process of re-framing section 4 of the paper to focus more on model steering & activation editing; in line with what you hypothesise, we find that editing a small number of relevant features on e.g. the IOI task can steer the model from its predictions on one token to its predictions on a counterfactual token.
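For a rough sense of what that editing looks like (schematic only; `sae`, the hook mechanics, and the feature indices are hypothetical, not our exact pipeline): encode the relevant activation into the dictionary basis, overwrite a handful of feature activations, decode, and patch the result (plus the autoencoder's reconstruction residual) back into the forward pass.

```python
# Schematic of feature-level activation editing; not the exact procedure from
# the paper. `sae` is assumed to be a trained sparse autoencoder with
# encode/decode methods; feature indices and values are placeholders.
import torch

def edit_features(activation, sae, feature_edits):
    codes = sae.encode(activation)
    residual = activation - sae.decode(codes)   # part the dictionary can't reconstruct
    for idx, new_value in feature_edits.items():
        codes[..., idx] = new_value             # overwrite a small number of features
    return sae.decode(codes) + residual

# e.g., in a forward hook on the relevant layer:
#   activation = edit_features(activation, sae, {123: 0.0, 456: 5.0})
# then compare the model's predictions on the original vs. counterfactual token.
```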

Really awesome work, although I was a bit frustrated that more of the material about independence etc. wasn't included in e.g. an appendix. When is part 2(/3) scheduled for?

I'm slightly confused as to why red-teaming via activation additions should be preferred over e.g. RAT; it seems possible that RAT models out-of-test-distribution-but-still-in-deployment-distribution activations better and more robustly than directly adding some steering vector does. Cool work though!