I really like this paper (though, obviously, am extremely biased). I don't think it was groundbreaking, but I think it was an important contribution to mech interp, and one of my favourite papers that I've supervised.
Superposition seems like an important phenomenon that affects our ability to understand language models. I think this paper was some of the first evidence that it actually happens in language models, and of what it actually looks like. Thinking about eg why neurons detecting compound words (eg blood pressure) were unusually easy to represent in superposition, while "this text is in French" merited dedicated neurons, helped significantly clarify my understanding of superposition beyond what was covered in Toy Models of Superposition (discussed in Appendix A). I also just like having case studies and examples of phenomena in language models to think about, and have found some of the neuron families in this paper helpful to keep in mind when reasoning about other weirdnesses in LLMs. I largely think the results in this paper have stood the test of time.
Sparse autoencoders have been one of the most important developments in mechanistic interpretability in the past year or so, and significantly shaped the research of the field (including my own work). I think this is in substantial part due to Towards Monosemanticity, between providing some rigorous preliminary evidence that the technique actually worked, a bunch of useful concepts like feature splitting, and practical advice for training these well. I think that understanding what concepts are represented in model activations is one of the most important problems in mech interp right now. Though highly imperfect, SAEs seem the best current bet we have here, and I expect whatever eventually works to look at least vaguely like an SAE.
I have various complaints and caveats about the paper (that I may elaborate on in a longer review in the discussion phase), and pessimisms about SAEs, but I think this work remains extremely impactful and significantly net positive on the field, and SAEs are a step in the right direction.
How would you evade their tools?
A tip for anyone on the ML job/PhD market - people will plausibly be quickly skimming your Google Scholar to get a "how impressive is this person/what is their deal" read (I do this fairly often), so I recommend polishing your Google Scholar if you have publications! It can make a big difference.
I have a lot of weird citable artefacts that confuse Google Scholar, so here are some tips I've picked up:
Do you know what topics within AI Safety you're interested in? Or are you unsure and so looking for something that lets you keep your options open?
+1 to the other comments, I think this is totally doable, especially if you can take time off work.
The hard part imo is letters of recommendation, especially if you don't have many people who've worked with you on research before. If you feel awkward about asking for letters of recommendation on short notice (which multiple people have asked me for in the past week, if it helps, so this is pretty normal), one thing that makes it lower effort for the letter writer is giving them a bunch of notes on specific things you did while working with them and what traits of yours these demonstrate or, even better, offering to write a rough first draft letter for them to edit (try not to give very similar letters to all your recommenders though!).
Thanks a lot for the post! It's really useful to have so many charities and a bit of context in the same place when thinking about my own donations. I found it hard to navigate a post with so many charities, so I put this into a spreadsheet that lets me sort and filter the categories - hopefully this is useful to others too! https://docs.google.com/spreadsheets/d/1WN3uaQYJefV4STPvhXautFy_cllqRENFHJ0Voll5RWA/edit?gid=0#gid=0
Cool project! Thanks for doing it and sharing, great to see more models with SAEs
interpretability research on proprietary LLMs that was quite popular this year and great research papers by Anthropic[1][2], OpenAI[3][4] and Google Deepmind
I run the Google DeepMind team, and just wanted to clarify that our work was not on proprietary closed weight models, but instead on Gemma 2, as were our open weight SAEs - Gemma 2 is about as open as Llama imo. We try to use open models wherever possible for these general reasons of good scientific practice, ease of replicability, etc. Though we couldn't open source the data, and didn't go to the effort of open sourcing the code, so I don't think they can be considered true open source. OpenAI did most of their work on GPT-2, and only did their large scale experiment on GPT-4, I believe. All Anthropic work I'm aware of is on proprietary models, alas.
It's essentially training an SAE on the concatenation of the residual stream from the base model and the chat model. So, for each prompt, you run it through the base model to get a residual stream vector v_b, through the chat model to get a residual stream vector v_c, and then concatenate these to get a vector twice as long, and train an SAE on this (with some minor additional details that I'm not getting into)
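To make the shapes concrete, here's a rough sketch of that setup. It is not the actual training code: random arrays stand in for real residual streams, the names `init_sae` and `sae_forward`, the hidden width, and the L1 coefficient are all illustrative assumptions, and I'm using a plain ReLU SAE with an L1 penalty for simplicity. It only shows the concatenation and a single forward pass with the loss, not the training loop or the additional details mentioned above.

```python
import numpy as np


def init_sae(d_in, d_hidden, rng):
    """Randomly initialise a plain ReLU sparse autoencoder (illustrative only)."""
    scale = 1.0 / np.sqrt(d_in)
    return {
        "W_enc": rng.normal(0.0, scale, (d_in, d_hidden)),
        "b_enc": np.zeros(d_hidden),
        "W_dec": rng.normal(0.0, scale, (d_hidden, d_in)),
        "b_dec": np.zeros(d_in),
    }


def sae_forward(params, x, l1_coeff=1e-3):
    """Encode, decode, and return (reconstruction, latents, loss)."""
    # Encoder: subtract the decoder bias, project up, apply ReLU.
    pre_acts = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    latents = np.maximum(0.0, pre_acts)
    # Decoder: linear map back to the concatenated space.
    recon = latents @ params["W_dec"] + params["b_dec"]
    # Loss: reconstruction error plus an L1 sparsity penalty on the latents.
    mse = np.mean(np.sum((recon - x) ** 2, axis=-1))
    l1 = np.mean(np.sum(np.abs(latents), axis=-1))
    return recon, latents, mse + l1_coeff * l1


rng = np.random.default_rng(0)
d_model, batch = 16, 8

# Stand-ins for the residual stream on the same prompts (assumed shapes).
v_b = rng.normal(size=(batch, d_model))  # base model
v_c = rng.normal(size=(batch, d_model))  # chat model

# Concatenate along the feature axis: each input is now twice as long.
x = np.concatenate([v_b, v_c], axis=-1)

params = init_sae(d_in=2 * d_model, d_hidden=64, rng=rng)
recon, latents, loss = sae_forward(params, x)
```

Each latent's decoder row then splits into a base-model half (the first `d_model` entries) and a chat-model half (the rest), so you can compare the two halves for the same latent.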
I'm not super sure what I think of this project. I endorse the seed of the idea re "let's try to properly reverse engineer what representing facts in superposition looks like" and think this was a good idea ex ante. Ex post, I consider our results fairly negative, and have mostly concluded that this kind of thing is cursed and we should pursue alternate approaches to interpretability (eg transcoders). I think this is a fairly useful insight! But it's also a conclusion I reached from various other bits of data. Overall I think this was a fairly useful conclusion re updating away from ambitious mech interp and has had a positive impact on my future research, though it's harder to say if this impacted others (beyond the general sphere of people I mentor/manage).
I think the circuit analysis here is great: a decent case study of what high quality circuit analysis looks like, and one of the studies of factual recall I trust most (though I'm biased). It also introduced some new tricks that I think are widely useful, like using probes to understand when information is introduced vs signal boosted, and using mechanistic probes to interpret activations without needing training data. However, I largely haven't seen much work build on this, beyond a few scattered examples, which suggests it hasn't been too impactful. I also think this project took much longer than it should have, which is a bit sad.
Though, this did get discussed in a 3Blue1Brown video, which is the most important kind of impact!