SAE-VIS: Announcement Post

by CallumMcDougall, Joseph Bloom
31st Mar 2024
This is a post to officially announce the sae-vis library, which was designed to create feature dashboards like those from Anthropic's research.

Summary

The library supports two types of visualisation: feature-centric and prompt-centric.

The feature-centric vis is the standard dashboard from Anthropic's post; it looks like the image below. There's an option to navigate through different features via a dropdown in the top left.

You can see the interactive version at the GitHub repo, at _feature_vis_demo.html.
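To give a concrete sense of the workflow, here's a minimal sketch of generating a feature-centric dashboard. It assumes you have a trained SAE, a TransformerLens model, and a batch of tokens; the exact class and method names may have drifted from what's shown here, so treat the User Guide and demo Colab as authoritative.

    # Sketch: generate a feature-centric dashboard as a standalone HTML file.
    # Names may differ from the latest version; the demo Colab has working code.
    from transformer_lens import HookedTransformer
    from sae_vis.data_config_classes import SaeVisConfig
    from sae_vis.data_storing_fns import SaeVisData

    model = HookedTransformer.from_pretrained("gelu-1l")  # any TransformerLens model
    encoder = ...  # your trained sparse autoencoder for the chosen hook point
    tokens = model.to_tokens(["The cat sat on the mat."] * 64)  # data to gather activations from

    cfg = SaeVisConfig(
        hook_point="blocks.0.mlp.hook_post",  # activation the SAE was trained on (assumed example)
        features=range(64),                   # which SAE features to compute data for
    )

    sae_vis_data = SaeVisData.create(encoder=encoder, model=model, tokens=tokens, cfg=cfg)
    sae_vis_data.save_feature_centric_vis("feature_vis_demo.html", feature_idx=0)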

The prompt-centric vis is centred on a single user-supplied prompt rather than a single feature. It shows you the features which score highest on that prompt, according to a variety of metrics. It looks like the image below. There's an option to navigate through different metrics and choices of token in your prompt via a dropdown in the top left.

You can see the interactive version at the GitHub repo, at _prompt_vis_demo.html.
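The prompt-centric vis is generated from the same data object. Here's a similar sketch; again, the method name is a best-effort assumption based on the library's naming, and the demo Colab has the canonical invocation.

    # Sketch: rank features against a single user-supplied prompt. The saved page's
    # top-left dropdowns let you switch between scoring metrics and token positions.
    # Reuses the `sae_vis_data` object from the sketch above.
    sae_vis_data.save_prompt_centric_vis(
        prompt="The cat sat on the mat.",
        filename="prompt_vis_demo.html",
    )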


Other links

Here are some more useful links:

  • GitHub repo
  • User Guide - Google Doc explaining how to use the library
  • Dev Guide - Google Doc explaining more about how the library was built, in case you'd like to extend it or build on it
  • Demo Colab - includes examples, with code explained

You might also be interested in reading about Neuronpedia, which makes use of this library in its visualisations.

If you're interested in getting involved, please reach out to me or Joseph Bloom! We'll also be publishing a post tomorrow discussing some of the features we've discovered during our research.

Comments (8)

Neel Nanda:
Thanks for open sourcing this! We've already been finding it really useful on the DeepMind mech interp team, and it saved us the effort of writing our own :)

CallumMcDougall (reply):
Thanks so much, really glad to hear it's been helpful!

Connor Kissane:
Amazing! We found your original library super useful for our Attention SAEs research, so thanks for making this!

CallumMcDougall (reply):
Thanks so much! (-:

Johnny Lin:
Thanks Callum, and yep, we've been using SAE-Vis extensively at Neuronpedia. It's been extremely helpful for generating dashboards, and it's very well maintained. We'll have a method of directly importing the exports from SAE-Vis into Neuronpedia coming out soon.

CallumMcDougall (reply):
Thanks!! Really appreciate it

Jonas Kgomo:
Is this something that can work for a hosted web version (npm i or api)?

Joseph Bloom (reply):
I'm a little confused by this question. What are you proposing?