Interpretability (ML & AI) · Sparse Autoencoders (SAEs) · AI

Mechanistic Interpretability of Llama 3.2 with Sparse Autoencoders

by PaulPauls
24th Nov 2024
1 min read

This is a linkpost for https://github.com/PaulPauls/llama3_interpretability_sae

3 comments, sorted by top scoring

Neel Nanda · 10mo

Cool project! Thanks for doing it and sharing it; it's great to see more models with SAEs.

"interpretability research on proprietary LLMs that was quite popular this year and great research papers by Anthropic[1][2], OpenAI[3][4] and Google Deepmind"

I run the Google DeepMind team, and just wanted to clarify that our work was not on proprietary closed-weight models but on Gemma 2, as were our open-weight SAEs; Gemma 2 is about as open as Llama, imo. We try to use open models wherever possible for the general reasons of good scientific practice, ease of replicability, etc. We couldn't open source the data, though, and didn't go to the effort of open sourcing the code, so I don't think they can be considered truly open source. OpenAI did most of their work on GPT-2 and only did their large-scale experiment on GPT-4, I believe. All Anthropic work I'm aware of is on proprietary models, alas.

PaulPauls · 10mo

Hi Neel,

You're absolutely right: all of the research in the Gemma Scope paper was performed on the open-weight Gemma 2 model. I wanted to group all of the research my project was based on into one concise sentence, and in doing so I erroneously placed your work in the 'proprietary LLMs' category. I've gone ahead and corrected the mistake.

My apologies.

I hope you still enjoyed the project and thank you for your great research work at DeepMind. =)

Volodymyr Barannik · 9mo

Thanks for your work, but why is the project currently taken down? 


I recently published a rather large side project of mine that attempts to replicate, with the humble but open-source Llama 3.2-3B model, the mechanistic interpretability research on proprietary and open-source LLMs that was quite popular this year and produced great research papers from Anthropic[1][2], OpenAI[3][4], and Google DeepMind[5].

The project provides a complete end-to-end pipeline for training Sparse Autoencoders to interpret LLM features, from activation capture through training, interpretation, and verification. All code, data, trained models, and detailed documentation are publicly available in my attempt to make this research as open as possible, though calling it an extensively documented personal project wouldn't be wrong either, in my opinion.
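
For readers less familiar with the setup, the core object such a pipeline trains looks roughly like the sketch below: a wide, ReLU-activated autoencoder fit to activations captured from a chosen Llama 3.2 layer (e.g. via forward hooks), using a reconstruction loss plus an L1 sparsity penalty. This is only a minimal sketch under common SAE conventions; the dimensions, expansion factor, and sparsity coefficient are my illustrative assumptions, not necessarily the repository's choices.

```python
# Illustrative sketch of a sparse autoencoder (SAE) of the kind such a
# pipeline trains on captured LLM activations. Architecture and
# hyperparameters are assumptions for exposition, not necessarily what
# the linked repository implements.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 3072, d_features: int = 8 * 3072):
        # d_model = 3072 matches Llama 3.2-3B's hidden size; the 8x
        # expansion factor is an arbitrary illustrative choice.
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # Encode activations into a wider, mostly-zero feature space,
        # then reconstruct the original activations from those features.
        features = F.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps the features faithful to the model's
    # activations; the L1 penalty pushes most feature values toward zero.
    mse = F.mse_loss(reconstruction, x)
    l1 = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * l1
```

The interpretation and verification stages then typically amount to finding which learned features activate on recognizable contexts and checking that those activations behave as the proposed explanation predicts.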

Since LessWrong has a strong focus on AI interpretability research, I thought some of you might find value in this open research replication. I'm happy to answer any questions about the methodology, results, or future directions.

1. https://www.anthropic.com/research/mapping-mind-language-model
2. https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html
3. https://openai.com/index/extracting-concepts-from-gpt-4/
4. https://arxiv.org/abs/2406.04093
5. https://arxiv.org/abs/2408.05147