Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post assumes basic familiarity with Sparse Autoencoders. For those unfamiliar with this technique, we highly recommend the introductory sections of these papers.

TL;DR

Neuronpedia is a platform for mechanistic interpretability research. It was previously focused on crowdsourcing explanations of neurons, but we’ve pivoted to accelerating Sparse Autoencoder (SAE) researchers by hosting models, feature dashboards, data visualizations, tooling, and more.

Neuronpedia has received 1 year of funding from LTFF. Johnny Lin is full-time on engineering, design, and product, while Joseph Bloom is supporting with high-level direction and product management. We’d love to talk to you about how Neuronpedia can speed up your SAE research. Fill out this short form to get in touch. 

 

Introduction

Why Neuronpedia?

  • While SAEs are exciting, they introduce a number of engineering challenges which can slow or even prevent research from being done. These challenges arise because SAEs decompose neural networks into numerous distinct features which we are trying to understand.
  • Even when working with relatively small language models like GPT2-Small, there are challenges in producing and storing feature dashboards, automatically interpreting features and organizing insights we collect about features.
  • As the community produces more SAEs, trains SAEs on larger models and develops better techniques for understanding and analyzing these artifacts, there’s going to be an ever increasing number of engineering challenges in the way of research into model internals. 

What It's Already Useful For

  • Generating and hosting more than 600k SAE feature dashboards. This includes GPT2-small residual stream SAEs for Joseph Bloom (RES-JB) and attention out SAEs for Connor Kissane and Robert Krzyzanowski (ATT-KK).
  • Running Python servers to enable prompt-based feature search and feature-specific prompt testing. This makes it easier for researchers and the community to qualitatively test features, write up results, and share SAEs.
  • Generating automatic explanations of features with GPT3.5-Turbo, which can be navigated via UMAPs. The automatic explanations are not currently amazing, but we hope to iterate and improve on them.

Making It More Useful

  1. Researchers / Labs publish SAEs that are uploaded to Neuronpedia: There are many different labs training SAEs so it’s a safe bet that there will be many SAEs in the public domain in the near future. We make it easy for researchers to host their SAEs and enable functionalities like feature testing, UMAPs, and more.
  2. Generate feature dashboards, automatic interpretability and run benchmarks: Feature dashboards are currently being produced using Callum’s reproduction of the Anthropic feature dashboards. Plausibly these can be improved over time to provide richer insights into features. Automatic explanations of features are provided using OpenAI’s method. We’ll explore adding SAE benchmarks, which seem like an important part of quality control for SAEs.
  3. Work with researchers to add functionality that accelerates research: Given the scale of SAEs, there are likely lots of ways we can make it easier to conduct or share research with the platform. This might involve things like developing more functionality for circuit style analysis or supporting novel techniques for making sense of SAEs. 
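To make the benchmarking idea above concrete, here is a minimal sketch of two quality metrics commonly reported for SAEs: reconstruction error and L0 sparsity. The random weights and plain-numpy encoder are toy stand-ins for illustration, not Neuronpedia's actual pipeline; in practice the inputs would be residual-stream activations from GPT2-small and the weights would come from a trained SAE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: random "activations" and an untrained SAE.
d_model, d_sae, n = 16, 64, 100
x = rng.normal(size=(n, d_model))
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) * 0.1
b_dec = np.zeros(d_model)

# Standard SAE forward pass: encode with a ReLU, then reconstruct.
f = np.maximum(x @ W_enc + b_enc, 0.0)   # feature activations [n, d_sae]
x_hat = f @ W_dec + b_dec                # reconstructions     [n, d_model]

# Two common quality metrics: how well the SAE reconstructs its inputs,
# and how many features fire per input on average (L0 sparsity).
mse = ((x - x_hat) ** 2).mean()
l0 = (f > 0).sum(axis=1).mean()
print(f"MSE: {mse:.3f}, L0: {l0:.1f}")
```

A benchmark suite would track metrics like these (plus others, such as loss recovered when splicing the reconstruction back into the model) across all hosted SAEs.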

We’re inspired by analogies to biology and bioinformatics and expect many different framings to be useful. Modern biosciences leverage vast amounts of shared infrastructure for storing and analyzing genomic, proteomic and phenotypic data. The study of neural networks may benefit from similar resources which enable researchers to share results, curate insights and perform experiments. 

Strategy: Iterate Very Quickly

We'll be the first to admit that we don't know exactly what the most useful functionalities for researchers will be - nobody does. We fully expect that some of the things we build won't be useful and will be killed off or deprecated. That's fine. We expect to continuously take feedback and ideas, then ship quickly, in order to learn what does and doesn't work.

We also anticipate that some features will only be useful for a short period of time. For example, we might build a specific visualizer for researchers to test a certain hypothesis. Once the hypothesis has been sufficiently evaluated by multiple researchers collaboratively on Neuronpedia, the visualizer becomes less useful. In our view, Neuronpedia is doing well if it can continuously make itself obsolete by iterating quickly on the qualitative research that is necessary for early scientific fields to mature.

 

The rest of this post goes into more detail about Neuronpedia, including simple demos. We highly recommend going to Neuronpedia and interacting with it.

 

Current Neuronpedia Functionality

Hosting SAE Feature Dashboards

Anthropic released feature dashboards for their SAEs, which were then reproduced by Callum McDougall in SAE Vis. Neuronpedia uses SAE Vis and a database of feature-activating examples to provide an interface for hosted SAE features.

An example feature page with a feature dashboard, explanations, lists, comments, and more.
GPT2-Small Residual Stream - Layer 9 - Feature 21474

We plan to provide documentation that makes the current dashboard / interface more accessible; in the meantime, refer to the Anthropic documentation.

Feature Testing

Since maximum activating examples can be misleading, Neuronpedia runs servers with language models and SAEs which let users test SAE features with new text. We find this to be a crucial component of validating researcher understanding of a feature. 

A GIF of testing an SAE feature with a custom sentence. This feature seems to be Star Wars related, and we test it with a sentence about Obi-Wan and Palpatine to see that those tokens indeed activate it.
GPT2-Small Residual Stream - Layer 9 - Feature 21474
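Mechanically, feature testing amounts to running the model on the user's text and passing the resulting activations through the SAE's encoder. The sketch below uses toy random weights in place of GPT2-small and a trained SAE, and the tokenization and feature index are illustrative assumptions, but the encoder computation is the standard one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: in the real tool, `resid` comes from a GPT2-small forward
# pass on the user's sentence, and W_enc/b_enc are trained SAE weights.
d_model, d_sae = 16, 64
tokens = ["Obi", "-Wan", " faced", " Palpatine"]
resid = rng.normal(size=(len(tokens), d_model))
W_enc = rng.normal(size=(d_model, d_sae)) * 0.1
b_enc = np.zeros(d_sae)

# Standard SAE encoder: ReLU(x @ W_enc + b_enc) gives per-token
# feature activations.
acts = np.maximum(resid @ W_enc + b_enc, 0.0)   # [n_tokens, d_sae]

# Inspect one feature's activation on each token of the test sentence.
feature_id = 3  # illustrative index, not a real Neuronpedia feature id
for tok, a in zip(tokens, acts[:, feature_id]):
    print(f"{tok!r}: {a:.3f}")
```

If the user's hypothesis about the feature is right, the tokens they expect to activate it should show clearly non-zero activations while the rest stay near zero.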

Automatic Explanations and UMAP for Exploration

While imperfect, OpenAI’s method for automatic neuron interpretability is a useful start for automating interpretability of SAE features. We’ve automatically interpreted Joseph Bloom’s residual stream SAE features and will explore providing automatic interpretation for new SAEs as they are uploaded. 

A GIF of exploring layer 9 residual SAEs via UMAP, and saving a custom list with selected directions (UMAP-created list)
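The general recipe behind automatic explanations, following OpenAI's method, is to show an explainer model a feature's top-activating text with per-token activations annotated, then ask it to summarize what the feature responds to. The sketch below only constructs such a prompt; the example texts, activation values, and function names are made up for illustration, and the real pipeline's prompt format differs in its details.

```python
# Hypothetical top-activating examples for one feature: (text, per-token
# activations). Values here are invented for illustration.
examples = [
    ("Obi-Wan raised his lightsaber", [0.0, 0.0, 0.0, 9.1]),
    ("Palpatine cackled", [8.7, 0.0]),
]

def build_explain_prompt(examples):
    """Annotate activating tokens with their activation, then ask the
    explainer model for a one-line summary of the feature."""
    lines = ["Explain what this feature responds to.", ""]
    for text, acts in examples:
        toks = text.split()  # crude whitespace tokenization for the sketch
        annotated = " ".join(
            f"{t}({a:.0f})" if a > 0 else t for t, a in zip(toks, acts)
        )
        lines.append(annotated)
    return "\n".join(lines)

prompt = build_explain_prompt(examples)
print(prompt)
```

The explainer's answer (e.g. "Star Wars characters") is then stored as the feature's candidate explanation, which is exactly the kind of output that is imperfect today but should improve as the prompting and scoring are iterated on.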

Live Feature Inference

To understand networks better, we want to see which features fire on specific prompts. This is essential for applying SAEs to circuit analysis and other applications of SAEs. We’re currently iterating on an interface for showing SAE decompositions of model activations, which also supports filtering by specific tokens and/or layers:

A GIF of searching ~300,000 features in Joseph Bloom’s GPT2-Small Residual SAEs, running live inference on a custom sentence. Results are sorted by highest activating neurons. You can then filter by specific layers or tokens, or even by the sum of multiple tokens. (See the exact search here)
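The sorting and filtering described above reduce to simple operations over the matrix of per-token feature activations. Here is a minimal sketch with toy random activations standing in for the output of real SAE inference; the prompt, shapes, and variable names are illustrative assumptions, not Neuronpedia's actual code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for SAE feature activations on a prompt. In the real tool
# these come from running GPT2-small and encoding its activations with
# the hosted SAEs.
prompt_tokens = ["The", " droid", " beeped"]
n_tokens, d_sae = len(prompt_tokens), 512
acts = np.maximum(rng.normal(size=(n_tokens, d_sae)), 0.0)

# Sort features by their highest activation anywhere in the prompt ...
top_overall = np.argsort(acts.max(axis=0))[::-1][:10]

# ... filter to a single token position ...
i = prompt_tokens.index(" droid")
top_at_token = np.argsort(acts[i])[::-1][:10]

# ... or rank by the summed activation over several chosen tokens.
chosen = [prompt_tokens.index(t) for t in [" droid", " beeped"]]
top_summed = np.argsort(acts[chosen].sum(axis=0))[::-1][:10]
```

In the interface, the same rankings are computed per layer as well, so users can narrow a search to one SAE or scan the whole residual stream at once.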

Enabling Collaboration

Similar to how internet platforms helped accelerate human genome researchers across the globe, we think there's potential for significant value-add by helping SAE researchers collaborate more easily and effectively through Neuronpedia. To that end, Neuronpedia has features that might better enable consensus building and collaboration.

  • Sharable by Default: Nearly everything is directly sharable and linkable by default, allowing easy direct links to any SAE, layer, neuron/feature, or list, for any model. This makes the data-sharing part of collaboration much easier.
  • Searching and comparing across SAEs: Neuronpedia wants to enable finding the best SAEs (and to identify the strengths/weaknesses of each set of SAEs). This is made possible by having a unified platform to interactively test and compare different SAEs - something that is more difficult to do in the status quo, where SAEs (and their data) exist in different places in wildly different formats.
  • Feature Curation + Commenting
    • Starring: Users can bookmark features they find interesting by starring them.
    • Lists: Users can curate lists of features which they want to share with others. Joseph Bloom has already made extensive use of this feature in a recent LessWrong post, and it’s clear this feature could be used in other posts, such as this post by Connor Kissane and Robert Krzyzanowski.
    • Commenting: Both features and lists facilitate commenting.
An example of a feature list. See the list here.

 

Future Work

As mentioned in the introduction, nobody knows exactly which features will be most useful for researchers. But we do have some initial hunches - here are a select few:

Circuit Analysis

Various researchers are attempting to use SAEs for circuit analysis. Supporting these researchers with features similar to OpenAI’s transformer debugger may be useful (eg: logit lens, attribution, paired prompt comparisons), though we don’t want to put in a lot of effort just to recreate Jupyter notebooks.

Understanding + Red-Teaming Features

We already provide the ability to type in text on feature dashboards, which enables a weak form of red-teaming. However, we could expand functionality here, such as integrating language models like GPT4 to assist users in generating text to test features. Other functionality in this category may include finding similar features, automatically annotating features via token set enrichment, or testing for feature types like long / short prefix induction features.

Quality Control and Benchmarking

As we develop better methods for training SAEs, it will be useful to provide some level of quality assurance around public SAEs which are being shared by researchers. Some work has already been published benchmarking SAEs and we’d be excited to facilitate further work and record results on the platform, for all hosted SAEs. 


 

FAQ

Who’s involved with the project?

Johnny Lin is an ex-Apple engineer who built privacy/consumer apps before going full-time into interpretability last year. His previous apps had over 1M organic downloads, and his writing has appeared in the Washington Post, Forbes, FastCompany, and others. Johnny is a contributor to the Automated Interpretability repository.

Joseph Bloom is an Independent Researcher in Mechanistic Interpretability working on Sparse Autoencoders. Before working in AI Safety, Joseph worked as a Data Scientist at a software startup, where he worked with academic labs and biopharmaceutical companies to process and extract insights from data. Joseph is currently doing MATS under Neel Nanda and has recently been working on various SAE-related projects.

I’d like to upload my SAE weights to Neuronpedia. How do I do this? 

To upload your SAEs, fill out this <5 minute form to get started.

This sounds cool, how can I help?

To get involved, join the Open Source Mechanistic Interp Slack (click here), and then join the #neuronpedia channel.

An easy way to help is to use Neuronpedia. Explore GPT2-Small and the SAEs, find interesting patterns and features, zoom around the UMAP, search for things, etc. You can also make lists and comment on features. SAE researchers love nothing more than to find new insights on their SAEs!

Along the way, we’re sure you’ll find bugs and have feature requests and questions. That’s a critical part of helping out Neuronpedia. Please report them in the Slack channel, via DM, or by emailing me at johnny@neuronpedia.org

For those interested in upskilling to work on SAEs, we recommend programs like ARENA or MATS.

You seem to be super into SAEs, what if SAEs suck?

We don’t take it for granted that SAEs are perfect, or that they solve the problems we care about. We describe ourselves as cautiously optimistic about SAEs. From an AI alignment perspective, there are many reasons to be excited about them, but the science is far from settled. If SAEs suck, then our work will hopefully help us work that out ASAP. Hopefully, we can help build scientific consensus around SAEs via methods such as benchmarking against other techniques or red-teaming SAEs, which would help us build something better or identify which other approaches deserve more attention.

Is Neuronpedia an AI safety org? 

We want to clarify that Neuronpedia is not an AI safety organization. Right now, we’re experimenting with the idea that Neuronpedia can meaningfully accelerate valuable research and are focused on testing this hypothesis as robustly as possible. However, if and when it makes sense, we may try to be more ambitious. 

Are there dual-use risks associated with SAEs? 

As with any area of technical AI safety, it’s important to consider the risk of work being used to accelerate the development of misaligned or poorly aligned systems. We think that the benefits outweigh the risks in terms of training / studying neural networks with Sparse Autoencoders, but don’t take it for granted that this will always be the case regardless of future advancements. For this reason, we plan to consult with members of the mechanistic interpretability community when considering any actions where we are “first movers” (ie: the first to publish a technique or method) or which constitute a significant advance beyond the current SOTA. 

5 comments

We don’t take it for granted that SAEs are perfect, or solve the problems we care about. We describe ourselves as cautiously optimistic about SAEs. From an AI alignment perspective, there are many reasons to be excited about them, but the science is far from settled. If SAEs suck, then our work will hopefully help us work that out ASAP. Hopefully, we can help build scientific consensus around SAEs via a number of methods, which may include benchmarking against other techniques or red teaming SAEs, that help us build something better or identify which other approaches deserve more attention.

The most obvious way SAEs could suck is that they might reveal structure in the data, rather than in the network (finding vectors for combinations of features that correlate in practice, rather than logically/mechanistically independent features). Do you have any thoughts on how you would notice if that was the case?

Lots of thoughts. This is somewhat stream of consciousness as I happen to be short on time this week, but feel free to follow up again in the future:

  • Anthropic tested their SAEs on a model with random weights here and found that the results look noticeably different in some respects to SAEs trained on real models: "The resulting features are here, and contain many single-token features (such as "span", "file", ".", and "nature") and some other features firing on seemingly arbitrary subsets of different broadly recognizable contexts (such as LaTeX or code)." I think it would help to have further experiments like this, which identify classes of features that are highly non-trivial, that don't occur in SAEs trained on random models (or random models with a W_E / W_U from a real model), or that can be related to interpretable circuitry.
  • I should note that, to the extent that SAEs could be capturing structure in the data, the model might want to capture that structure too, so it's not super clear to me what you would observe that would distinguish SAEs capturing structure in the data which the model itself doesn't utilise. Working this out seems important.
  • Furthermore, the embedding space of LLMs is highly structured already, and since we lack good metrics, it's hard to say how precisely SAEs capture "marginal" structure over existing methods. So quantifying what we mean by structure seems important too.
  • The specific claim that SAEs learn features which are combinations of true underlying features is a reasonable one given the L1 penalty, but I think it's far from obvious how we should think about this in practice. 
  • I'm pretty excited about deliberate attempts to understand where SAEs might be misleading or not capturing information well (eg: here or here). It seems like there are lots of slightly lower-level technical questions that help us build up to this.

So in summary: I'm a bit confused about what we mean here and think there are various technical threads to follow up on. Knowing which would actually resolve this requires that we define our terms more thoroughly.

Three questions:

  1. What format do you upload SAEs in?
  2. What data do you run the SAEs over to generate the activations / samples?
  3. How long of a delay is there between uploading an SAE and it being available to view?

Thanks for asking:

  1. Currently we load SAEs into my codebase here. How hard this is will depend on how different your SAE architecture/forward pass is from what I currently support. We're planning to support users / do this ourselves for the first n users and once we can, we'll automate the process. So feel free to link us to huggingface or a public wandb artifact. 
  2. We run the SAEs over random samples from the same dataset on which the model was trained (with activations drawn from forward passes of the same length). Callum's SAE vis codebase has a demo where you can see how this works.
  3. Since we're doing this manually, the delay will depend on the complexity of handling the SAEs and on things like whether they're trained on a new model (not GPT2-small) and how busy we are with other people's SAEs or other features. We'll try our best and keep you in the loop. Ballpark is 1-2 weeks, not months, and possibly days (especially if the SAEs are very similar to those we already host). We expect this to be much faster in the future.

We've made the form in part to help us estimate the time / effort required to support SAEs of different kinds (eg: if we get lots of people who all have SAEs for the same model or with the same methodological variation, we can jump on that). 

This is fantastic. Thank you.