The most obvious way SAEs could suck is that they might reveal structure in the data rather than structure in the networks (finding vectors for combinations of features that correlate in practice, rather than for logically/mechanistically independent features). Do you have any thoughts on how you would notice if that were the case?
Lots of thoughts. This is somewhat stream of consciousness as I happen to be short on time this week, but feel free to follow up again in the future:
So in summary: I'm a bit confused about what we mean here and think there are various technical threads to follow up on. Knowing which of them would actually resolve this requires that we define our terms more thoroughly.
I'm interested in using the SAEs and auto-interp GPT-3.5-Turbo feature explanations for RES-JB for some experiments. Is there a way to download this data?
Neuronpedia has an API (copying from a recent message Johnny wrote to someone else):
"Docs are coming soon but it's really simple to get JSON output of any feature. just add "/api/feature/" right after "neuronpedia.org".for example, for this feature: https://neuronpedia.org/gpt2-small/0-res-jb/0
the JSON output of it is here: https://www.neuronpedia.org/api/feature/gpt2-small/0-res-jb/0
(both are GET requests so you can do it in your browser)note the additional "/api/feature/"i would prefer you not do this 100,000 times in a loop though - if you'd like a data dump we'd rather give it to you directly."
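For instance, here is a minimal Python sketch of fetching a feature's JSON using the endpoint quoted above; the response schema is not assumed, so the snippet simply prints the top-level keys of whatever comes back:

```python
import json
import urllib.request

# Fetch the JSON for feature 0 of the gpt2-small 0-res-jb SAE.
# Endpoint format: https://www.neuronpedia.org/api/feature/<model>/<sae-id>/<feature-index>
url = "https://www.neuronpedia.org/api/feature/gpt2-small/0-res-jb/0"

with urllib.request.urlopen(url) as response:
    feature = json.loads(response.read())

# Inspect the top-level keys rather than assuming a schema.
# (Per the message above: please request a data dump instead of looping over all features.)
print(sorted(feature.keys()))
```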
Feel free to join the OSMI slack and post in the Neuronpedia or Sparse Autoencoder channels if you have similar questions in the future :) https://join.slack.com/t/opensourcemechanistic/shared_invite/zt-1qosyh8g3-9bF3gamhLNJiqCL_QqLFrA
Three questions:
Thanks for asking:
We've made the form in part to help us estimate the time / effort required to support SAEs of different kinds (eg: if we get lots of people who all have SAEs for the same model or with the same methodological variation, we can jump on that).
This post assumes basic familiarity with Sparse Autoencoders. For those unfamiliar with this technique, we highly recommend the introductory sections of these papers.
TL;DR
Neuronpedia is a platform for mechanistic interpretability research. It was previously focused on crowdsourcing explanations of neurons, but we’ve pivoted to accelerating researchers working on Sparse Autoencoders (SAEs) by hosting models, feature dashboards, data visualizations, tooling, and more.
Important Links
Neuronpedia has received 1 year of funding from LTFF. Johnny Lin is full-time on engineering, design, and product, while Joseph Bloom is supporting with high-level direction and product management. We’d love to talk to you about how Neuronpedia can speed up your SAE research. Fill out this short form to get in touch.
Introduction
Why Neuronpedia?
What It's Already Useful For
Making It More Useful
We’re inspired by analogies to biology and bioinformatics and expect many different framings to be useful. Modern biosciences leverage vast amounts of shared infrastructure for storing and analyzing genomic, proteomic and phenotypic data. The study of neural networks may benefit from similar resources which enable researchers to share results, curate insights and perform experiments.
Strategy: Iterate Very Quickly
We'll be the first to admit that we don't know exactly what the most useful functionalities for researchers will be - nobody does. We fully expect that some of the things we build will not be useful and will be killed off or deprecated. That's fine. We expect to continuously take in feedback and ideas, ship quickly, and learn what works and what doesn't.
We also anticipate that some features will only be useful for a short period of time. For example, we might build a specific visualizer for researchers to test a certain hypothesis. Once the hypothesis has been sufficiently evaluated by multiple researchers collaboratively on Neuronpedia, the visualizer becomes less useful. In our view, Neuronpedia is doing well if it can continuously make itself obsolete by iterating quickly on the qualitative research that is necessary for early scientific fields to mature.
The rest of this post goes into more detail about Neuronpedia, including simple demos. We highly recommend going to Neuronpedia and interacting with it.
Current Neuronpedia Functionality
Hosting SAE Feature Dashboards
Anthropic released feature dashboards for their SAEs, which were then reproduced by Callum McDougall in SAE Vis. Neuronpedia uses SAE Vis and a database of feature-activating examples to provide an interface for hosted SAE features.
We plan to provide documentation which makes the current dashboard / interface more accessible in the future, but in the meantime, refer to the Anthropic documentation.
Feature Testing
Since maximum-activating examples can be misleading, Neuronpedia runs servers hosting language models and SAEs that let users test SAE features on new text. We find this to be a crucial component of validating a researcher's understanding of a feature.
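For intuition, here is a hedged sketch of what feature testing involves: run new text through the model, apply an SAE encoder to the residual stream at some layer, and look at one feature's activation on each token. The SAE weights below are a random stand-in (a real test would load trained weights such as the res-jb SAEs), and the layer, SAE width, and feature index are illustrative:

```python
# A sketch of "feature testing": run new text through a model, apply an SAE
# encoder to the residual stream, and inspect one feature's activation per token.
# The SAE weights here are a random stand-in; in practice you would load a
# trained SAE (e.g. the res-jb residual stream SAEs hosted on Neuronpedia).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, d_sae = 8, 24576                              # illustrative layer and SAE width
d_model = model.cfg.d_model

# Placeholder encoder parameters (a real SAE provides trained W_enc / b_enc,
# and typically also subtracts a decoder bias before encoding).
W_enc = torch.randn(d_model, d_sae) / d_model**0.5
b_enc = torch.zeros(d_sae)

text = "When I vote in the election, I bring my ballot to the polling station."
_, cache = model.run_with_cache(text)
resid = cache[f"blocks.{layer}.hook_resid_pre"][0]   # [seq, d_model]
feature_acts = torch.relu(resid @ W_enc + b_enc)     # [seq, d_sae]

feature_idx = 0                                      # hypothetical feature of interest
for token, act in zip(model.to_str_tokens(text), feature_acts[:, feature_idx]):
    print(f"{token!r}: {act.item():.3f}")
```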
Automatic Explanations and UMAP for Exploration
While imperfect, OpenAI’s method for automatic neuron interpretability is a useful start for automating interpretability of SAE features. We’ve automatically interpreted Joseph Bloom’s residual stream SAE features and will explore providing automatic interpretation for new SAEs as they are uploaded.
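As a rough illustration of that method, the sketch below shows top-activating examples for a feature to an explainer model and asks for a one-sentence explanation. The prompt format, the << >> marking convention, and the model name are illustrative assumptions rather than the exact production pipeline (which builds on OpenAI's Automated Interpretability work):

```python
# A hedged sketch of the basic auto-interp loop: show an explainer model some
# top-activating examples for a feature and ask for a short explanation.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Each example is text with its max-activating token marked with << >> (illustrative format).
examples = [
    "The committee will <<vote>> on the proposal tomorrow.",
    "Citizens lined up to <<vote>> in the election.",
]

prompt = (
    "The following text snippets strongly activate one feature of a language "
    "model. The most-activating token is marked with << >>. In one short "
    "sentence, what does this feature appear to respond to?\n\n"
    + "\n".join(f"- {ex}" for ex in examples)
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```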
Live Feature Inference
To understand networks better, we want to see which features fire on specific prompts. This is essential for applying SAEs to circuit analysis and other downstream applications. We’re currently iterating on an interface for showing SAE decompositions of model activations, with support for filtering by specific tokens and/or layers.
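Under the hood, a decomposition like this boils down to taking the top-k feature activations at each token position, optionally filtered to particular tokens or layers. A minimal sketch, with random placeholder activations standing in for real model-plus-SAE outputs:

```python
# A sketch of decomposing a prompt into its top SAE features at each position.
# `feature_acts` is a random placeholder for the [seq, d_sae] activations you
# would get by running the prompt through the model and an SAE encoder.
import torch

tokens = ["<|endoftext|>", "The", " cat", " sat", " on", " the", " mat"]  # example token strings
d_sae = 24576
feature_acts = torch.relu(torch.randn(len(tokens), d_sae))  # placeholder activations

k = 3
top_vals, top_idxs = feature_acts.topk(k, dim=-1)  # top-k features at each position

# Optionally filter to specific token positions, e.g. only " mat".
for pos, token in enumerate(tokens):
    if token != " mat":
        continue
    summary = ", ".join(
        f"feature {i.item()} ({v.item():.2f})"
        for v, i in zip(top_vals[pos], top_idxs[pos])
    )
    print(f"{token!r}: {summary}")
```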
Enabling Collaboration
Similar to how internet platforms helped researchers across the globe collaborate on the human genome, we think there's potential for significant value-add in helping SAE researchers collaborate more easily and effectively through Neuronpedia. To that end, Neuronpedia has features that might better enable consensus building and collaboration.
Future Work
As mentioned in the introduction, nobody knows exactly which features will be most useful for researchers. But we do have some initial hunches - here are a select few:
Circuit Analysis
Various researchers are attempting to use SAEs for circuit analysis. Supporting these researchers with features similar to OpenAI’s transformer debugger may be useful (eg: logits lens, attribution, paired prompt comparisons), though we don’t want to put in a lot of effort just to recreate Jupyter notebooks.
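As one example of the kind of tooling we have in mind, a "logits lens" for SAE features can be sketched as projecting a feature's decoder direction through the model's unembedding to see which output tokens it most directly promotes. The decoder vector below is a random placeholder for a row of a trained SAE's W_dec:

```python
# A sketch of a "logits lens" for SAE features: project a feature's decoder
# direction through the unembedding to see which tokens it most directly
# promotes. `feature_dir` is a random placeholder for W_dec[feature_idx]
# from a trained SAE.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

feature_dir = torch.randn(model.cfg.d_model)   # placeholder decoder direction

logit_effect = feature_dir @ model.W_U         # [d_vocab] direct effect on the logits
top_vals, top_ids = logit_effect.topk(10)

for token_id, val in zip(top_ids, top_vals):
    print(f"{model.to_single_str_token(token_id.item())!r}: {val.item():.2f}")
```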
Understanding + Red-Teaming Features
We already provide the ability to type in text on feature dashboards, which enables a weak form of red teaming. However, we could expand functionality here, such as integrating language models like GPT-4 to assist users in generating text to test features. Other functionality in this category may include finding similar features, automatically annotating features via token set enrichment, or testing for feature types like long/short prefix induction features.
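To make one of these concrete, here is a minimal sketch of "finding similar features" via cosine similarity between decoder directions; the decoder matrix and the query feature index are placeholders rather than real trained weights:

```python
# A sketch of "finding similar features" by cosine similarity between SAE
# decoder directions. W_dec is a random placeholder for a trained SAE's decoder.
import torch
import torch.nn.functional as F

d_sae, d_model = 24576, 768
W_dec = torch.randn(d_sae, d_model)  # placeholder: one decoder direction per feature

query = 123                          # hypothetical feature index
sims = F.cosine_similarity(W_dec[query].unsqueeze(0), W_dec, dim=-1)
sims[query] = -1.0                   # exclude the query feature itself

top_vals, top_ids = sims.topk(5)
for idx, val in zip(top_ids, top_vals):
    print(f"feature {idx.item()}: cosine similarity {val.item():.3f}")
```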
Quality Control and Benchmarking
As we develop better methods for training SAEs, it will be useful to provide some level of quality assurance around public SAEs which are being shared by researchers. Some work has already been published benchmarking SAEs and we’d be excited to facilitate further work and record results on the platform, for all hosted SAEs.
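As a concrete starting point, two metrics that commonly appear in SAE benchmarking are L0 (the average number of active features per token) and the fraction of variance explained by the reconstruction. The sketch below computes both, with random tensors standing in for real activations and SAE weights:

```python
# A sketch of two common SAE quality metrics: L0 sparsity and variance explained.
# The activations and SAE parameters here are random stand-ins for real data.
import torch

d_model, d_sae, n_tokens = 768, 24576, 4096
acts = torch.randn(n_tokens, d_model)            # placeholder model activations

W_enc = torch.randn(d_model, d_sae) / d_model**0.5
W_dec = torch.randn(d_sae, d_model) / d_sae**0.5
b_enc, b_dec = torch.zeros(d_sae), torch.zeros(d_model)

feature_acts = torch.relu((acts - b_dec) @ W_enc + b_enc)
recon = feature_acts @ W_dec + b_dec

l0 = (feature_acts > 0).float().sum(dim=-1).mean()
var_explained = 1 - ((acts - recon) ** 2).sum() / ((acts - acts.mean(0)) ** 2).sum()

print(f"mean L0: {l0:.1f}, variance explained: {var_explained:.3f}")
```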
FAQ
Who’s involved with the project?
Johnny Lin is an ex-Apple engineer who built privacy/consumer apps before going full-time into interpretability last year. His previous apps had over 1M organic downloads, and his writing has appeared in the Washington Post, Forbes, FastCompany, and others. Johnny is a contributor to the Automated Interpretability repository.
Joseph Bloom is an independent researcher in mechanistic interpretability working on Sparse Autoencoders. Before working in AI safety, Joseph worked as a data scientist at a software startup, where he worked with academic labs and biopharmaceutical companies to process and extract insights from data. Joseph is currently doing MATS under Neel Nanda and has recently been working on various SAE-related projects.
I’d like to upload my SAE weights to Neuronpedia. How do I do this?
To upload your SAEs, fill out this <5 minute form to get started.
This sounds cool, how can I help?
To get involved, join the Open Source Mechanistic Interp Slack (click here), and then join the #neuronpedia channel.
An easy way to help is to use Neuronpedia. Explore GPT2-Small and the SAEs, find interesting patterns and features, zoom around the UMAP, search for things, etc. You can also make lists and comment on features. SAE researchers love nothing more than new insights into their SAEs!
Along the way, we’re sure you’ll find bugs, have feature requests, and have questions. That’s a critical part of helping out Neuronpedia. Please report these in the Slack channel, via DM, or by emailing me at johnny@neuronpedia.org.
For those interested in upskilling to work on SAEs, we recommend programs like ARENA or MATS.
You seem to be super into SAEs, what if SAEs suck?
We don’t take it for granted that SAEs are perfect, or that they solve the problems we care about. We describe ourselves as cautiously optimistic about SAEs: from an AI alignment perspective, there are many reasons to be excited about them, but the science is far from settled. If SAEs suck, then our work will hopefully help the community work that out ASAP. We also hope to help build scientific consensus around SAEs through methods such as benchmarking against other techniques or red teaming SAEs, which could help us build something better or identify which other approaches deserve more attention.
Is Neuronpedia an AI safety org?
We want to clarify that Neuronpedia is not an AI safety organization. Right now, we’re experimenting with the idea that Neuronpedia can meaningfully accelerate valuable research and are focused on testing this hypothesis as robustly as possible. However, if and when it makes sense, we may try to be more ambitious.
Are there dual-use risks associated with SAEs?
As with any area of technical AI safety, it’s important to consider the risk of work being used to accelerate the development of misaligned or poorly aligned systems. We think that the benefits outweigh the risks in terms of training / studying neural networks with Sparse Autoencoders, but don’t take it for granted that this will always be the case regardless of future advancements. For this reason, we plan to consult with members of the mechanistic interpretability community when considering any actions where we are “first movers” (ie: the first to publish a technique or method) or which constitute a significant advance beyond the current SOTA.