Why max-pooling? Did you try multiple approaches and settle on max-pooling? If so, what happened in the other cases?
We chose max-pooling (vs. average pooling) as our aggregation method because we primarily care about the largest extent to which a property is present in a data sample, rather than its average presence across the sample. In this work, we use SAE features to construct a document-level embedding that is independent of sample-specific information like sample length. However, we did not thoroughly investigate other aggregation methods, and further investigation could be interesting. For instance, if we care about distinguishing short samples where a property occurs once from long samples where it occurs repeatedly, a different aggregation method would work better.
Nick and Lily are co-first authors on this project. Lewis and Neel jointly supervised this project.
The field of language model interpretability has traditionally focused on studying model internals. We believe an important but neglected direction is data-centric interpretability, which aims to understand models through insights about their data (outputs and training data). For instance, while training data strongly shapes model behavior, interpretability research often prioritizes model internals over understanding the structure of that data. Prior work has largely used LLMs (Dunlap et al. (2024)) to extract insights from model data, but these approaches are difficult to scale and limit the ability to conduct large-scale, systematic analyses that can be iterated upon quickly.
In this work, we use sparse autoencoders (SAEs) for data analysis, with a focus on model data for understanding model behavior. SAEs have been largely used to extract interpretable features (Bricken et al. (2023); Templeton et al. (2024)) from and control (Bayat et al. (2025)) LLMs, despite conflicting evidence on their utility (Subhash et al. (2025)). Here, we hypothesize that SAEs have two advantages for data analysis.
First, SAEs provide a large hypothesis space useful for discovering novel insights. As they are trained in an unsupervised manner, they generate a large dictionary of "features"—latent vectors that represent a (mostly) monosemantic concept. Functionally, these latents act as labels of properties for each dataset sample that we can leverage for insight discovery. We perform two exploratory data analysis tasks. First, we use SAEs to evaluate differences between model outputs, finding that Grok-4 is more careful to clarify ambiguities than other frontier models. Second, we search for interesting co-occurrences of SAE latents to understand how concepts co-occur in a dataset, discovering that offensive language tends to correlate with narrative stories in Chatbot Arena. These approaches surface novel insights in a fully unsupervised manner, enabling us to find unknown unknowns in datasets.
Second, SAEs capture rich properties of text beyond semantics—they approximate LLM representations which encode complex linguistic and conceptual information. Motivated by this intuition, we explore using SAE latent activations as an alternative to traditional text embeddings. We find that they can cluster data differently than semantic embeddings, such as grouping distinct reasoning approaches, and can be used for retrieving data with property-based queries.
Our results represent a preliminary step toward using SAEs as a versatile tool for the exploratory analysis of large datasets, highlighting data-centric interpretability as a promising direction for future work.
Sparse autoencoders. Sparse autoencoders (SAEs) learn to reconstruct language model activations using a dictionary of latent features with a sparsity penalty. Throughout this work, we use SAEs from Goodfire[1] trained on the layer 50 residual stream of Llama 3.3 70B using LMSYS-Chat-1M. The SAE has an average L0 of 121 and a dictionary size of 65k latents, a subset of which we found to have labels describing the concept they fire on.
Representing datasets with feature activations. We max-pool the activations of each SAE latent across tokens to obtain an "SAE activation vector" (see below).
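As a concrete illustration of this pooling step, here is a minimal NumPy sketch; the `sae.encode` and `get_residual_acts` helpers in the usage comment are hypothetical stand-ins for however the reader model's residual-stream activations and their SAE encodings are obtained.

```python
import numpy as np

def sae_activation_vector(token_acts: np.ndarray) -> np.ndarray:
    """Max-pool SAE latent activations over tokens.

    token_acts: (n_tokens, n_latents) SAE latent activations for one document,
    obtained by encoding the reader model's residual-stream activations.
    Returns a single (n_latents,) "SAE activation vector" for the document.
    """
    return token_acts.max(axis=0)

# Hypothetical usage, assuming helpers that return per-token activations:
# doc_embeddings = np.stack([
#     sae_activation_vector(sae.encode(get_residual_acts(text)))
#     for text in dataset
# ])
```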
Note that usually, we use an SAE trained on the activations of the model we are interpreting. As we are interpreting data in this work, we only need a "reader model" (Llama 3.3 70B) and its SAE, even if our data of interest was generated by another model.
LLM judge. When we use an LLM for dataset generation or labelling, we primarily use Gemini 2.5 Flash (unless otherwise specified) due to its low cost and large 1M-token context window.
We consider the problem of describing the difference between two datasets, which we call data diffing. In the context of LLM interpretability, understanding how datasets differ is important when there is a variable of change: for example, when using different data splits across training runs, when comparing outputs from a base and finetuned model, or when assessing the effect of new training techniques. Model diffing similarly aims to characterize differences in models, but while it typically focuses on model internals, we propose to instead compare models by diffing their outputs.
Prior work (Zhong et al. (2023), Zhong et al. (2022)) has used LLMs to generate and test hypotheses on differences between two datasets, which is computationally expensive. Here, we show that SAEs can find valid differences between datasets more cost-effectively. The methodology is described below, where each SAE latent essentially acts as a hypothesized difference.
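The full scoring procedure is shown in the methodology figure below; as a rough sketch of the idea, one simple way to treat each latent as a hypothesized difference is to compare its firing frequency across the two datasets. The variable names in the usage comment (`casual_acts`, `original_acts`, `latent_labels`) are hypothetical.

```python
import numpy as np

def top_latent_differences(acts_a: np.ndarray, acts_b: np.ndarray, k: int = 3):
    """Rank SAE latents by how much more often they fire in dataset A than in B.

    acts_a, acts_b: (n_docs, n_latents) max-pooled SAE activation vectors for
    the two datasets being diffed. Each returned latent index is a candidate
    hypothesis about how the datasets differ.
    """
    freq_a = (acts_a > 0).mean(axis=0)  # fraction of docs in A where each latent fires
    freq_b = (acts_b > 0).mean(axis=0)
    gap = freq_a - freq_b
    return np.argsort(-gap)[:k], gap

# top_idx, gap = top_latent_differences(casual_acts, original_acts)
# for i in top_idx:
#     print(latent_labels[i], f"{gap[i]:+.1%}")
```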
We explore dataset differences for both synthetic and real-world tasks:
Results. Below, we present the top three latent differences for five randomly chosen examples from the synthetic and movie datasets.
Synthetic dataset (tone shifts)
| Tone shift | Top 3 latent differences |
|---|---|
| Casual | Casual/cool slang and informal speech patterns |
| | Narrative transitions and connective phrases in explanatory text |
| | Making text more informal and conversational |
| Organized | Question and answer transition points in educational content |
| | The start of a formal question in structured Q&A formats |
| | Qt framework and Q-prefixed technology terms |
| Imaginative | Groups gathering to share stories and experiences, especially in atmospheric or mysterious contexts |
| | Formal literary writing style marked by heavy use of articles and possessives |
| | Transitioning into role-play mode |
| Funny | Young adult recreational and social activities |
| | Offensive or inappropriate content, particularly flirtation and suggestive material |
| | Explaining something in a friendly, casual teaching style |
| Safety conscious | Emphasizing ethical concerns and safety warnings |
| | Establishing boundaries around inappropriate content |
| | Polite hedging phrases used to soften or introduce statements |
Real-world dataset (genre changes)
| Movie genre | Top 3 latent differences |
|---|---|
| Action | Action movie plot developments and dramatic confrontations |
| | Narrative sequences describing villains rising to power |
| | Introduction of new narrative elements with indefinite articles |
| War | Soldiers experiencing the psychological and physical hardships of war |
| | Military officer ranks and hierarchical descriptions |
| | Major historical events and periods of upheaval |
| Romance | Will they/won't they writing tropes |
| | Falling or being in love across multiple languages |
| | Initiating romantic pursuit, especially cross-cultural dating |
| Crime | Providing movie recommendations or plot descriptions |
| | High-stakes heists involving valuable items from secure locations |
| | Detective character establishment in crime narratives |
| Documentary | Documentary films and series being discussed or listed |
| | Revealing behind-the-scenes content or private glimpses |
| | Structural elements that introduce additional information or examples in formatted text |
For both the synthetic and movie (IMDB) datasets, we find that the top latent differences align with the known difference (the tone or genre). While some latent labels directly state the difference (e.g. the "action" genre), some connections are more indirect. For instance, the top latent difference for the "organized" tone shift is "Question and answer transition points in educational content", which implies an organized tone. This reflects the need to directly examine the data to understand the difference more clearly, rather than relying solely on the latent labels.
To discover novel differences between in-the-wild datasets, we focus on the task of comparing models by comparing their generated responses. We randomly sample 1000 conversations from Chatbot Arena and generate responses using the first-turn prompt for a set of models. Then, we diff the responses with each other.
Datasets. We perform data diffing on two model setups.
Automatically converting feature differences to hypotheses 🤖. While we directly relied on latent descriptions in the previous section, many latents are either noisy or activate on similar text. Therefore, we use an LLM to:
LLM baseline. We also compare the SAE-generated hypotheses with a pure LLM pipeline. Motivated by Dunlap et al. (2024), we use an LLM to (1) evaluate differences between each response pair and (2) summarize the largest differences (with batching due to context window limitations) into hypotheses. For both the SAE- and LLM-based methods, we pass in a query (e.g. "What are the most significant, interesting differences?") to direct the hypothesis space.
Hypothesis verification. Finally, given these proposed hypotheses, we use an LLM judge to verify the presence of these differences. Specifically, the LLM judge scores whether each response has a hypothesized property; then, we tally whether the property occurs more in one dataset than the other. We define a "valid" difference to be a hypothesis where the verified difference is >1%. We generate hypotheses for both model setups and compute the average difference in frequencies across valid hypotheses below. For the hypotheses of each frontier model, we compute the difference between that model's verified frequency and the maximum verified frequency among the other frontier models.
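As a minimal sketch of this verification step (the `judge` function here is a hypothetical wrapper around the LLM judge call, not an API we provide):

```python
def verified_difference(hypothesis, responses_a, responses_b, judge) -> float:
    """Tally how much more often a hypothesized property occurs in A than in B.

    `judge(hypothesis, text)` is assumed to return True if the text exhibits
    the hypothesized property. A hypothesis counts as a "valid" difference if
    the gap in frequencies exceeds 1%.
    """
    freq_a = sum(judge(hypothesis, r) for r in responses_a) / len(responses_a)
    freq_b = sum(judge(hypothesis, r) for r in responses_b) / len(responses_b)
    return freq_a - freq_b

# valid = [h for h in hypotheses
#          if verified_difference(h, responses_a, responses_b, judge) > 0.01]
```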
➡️ Takeaway 1: SAE latents find valid hypotheses. The average difference of our hypotheses is 7.03% for LLaVA-NeXT vs. Vicuna-7B and ranges from 26-47% for the frontier models, suggesting that SAEs can identify strong differences. We present qualitative examples of the model responses below, with the full hypothesis lists for the SAE and LLM baseline in the Appendix.
Qualitative examples from model responses
Bold parts highlight the stated unique quality of each model
Model | Unique Qualities | Prompt | Response |
---|---|---|---|
Grok-4 | Clarifies ambiguities, offers multiple interpretations | A carton has six eggs. Two are added. Then one is taken out and eaten. One spoils so it is removed but replaced immediately. How many eggs have left the basket? | Based on the wording of the riddle, no basket is ever mentioned until the final question—everything happens with a carton. Therefore, zero eggs have left the basket. If the riddle is assuming "carton" and "basket" refer to the same container (or if it's a mistranslation or misstatement), then I'd say 1 egg has left (the one that was taken out and eaten; the spoiled one was removed but immediately replaced, so it didn't truly "leave"). If the question is meant to be "how many eggs are left in the basket?" then the answer is 7. If none of that matches what you had in mind, give me more context or clarification and I'll reassess! |
GPT-OSS-120B | Markdown table power user 💪 | Who would win in a fight: A dolphin or a sea turtle? | **Short answer:** In a direct, one‑on‑one encounter the dolphin would almost certainly have the upper hand. --- ### Why a dolphin is the likely “winner” | **Trait** | **Dolphin** (e.g., common bottlenose dolphin) | **Sea turtle** (e.g., green or loggerhead) | |-----------|----------------------------------------------|--------------------------------------------| | **Size & mass** | 150–300 kg (typical adult); up to 4m long | 70–200 kg for most species; up to 2m long | | **Speed** | Can cruise at 5–10 knots and sprint up to 20–25 knots (10–12 m/s) | Usually 1–2 knots; can burst to ~5 knots (2–3 m/s) | | **Mobility** | Highly maneuverable, can leap, twist, and use powerful tail flukes for rapid acceleration... |
Gemini 2.5 Pro | Saying confident affirmations like "Of course!" or "Certainly!", enthusiastic willingness to help | How do I calculate gibbs free energy of fibril formation from a solubility value? | Of course! This is an excellent question that connects a macroscopic, measurable property (solubility) to a fundamental thermodynamic quantity (Gibbs Free Energy).. |
➡️ Takeaway 2: The LLM baseline identifies less prominent differences. We observe that the average difference for proposed hypotheses is higher for the SAE than for our LLM baseline, suggesting that the SAE can discover more prominent differences between datasets. Intuitively, good SAE features efficiently point an LLM to the differences present in the text and make them easy to aggregate.
➡️ Takeaway 3: The LLM baseline is more expensive than SAEs. To get a rough gauge of cost, we calculate how many tokens are required for both approaches. For LLaVA-NeXT vs. Vicuna-7B, the SAE uses 700K tokens whereas the LLM uses 3.3M tokens, a roughly 5x increase. We can expect this cost gap to widen as the dataset size grows. One key advantage of SAE feature activations is that they are reusable, whereas we must feed the entire dataset into an LLM for each query in the baseline. For instance, to compare the three frontier models with other models, the SAE approach uses 3.5M tokens in total whereas the baseline uses 25.3M tokens (8.4M average). This disparity in cost makes SAEs a stronger candidate than pure LLMs for generating hypotheses in large datasets.

We consider the problem of finding correlations between features in text datasets. Often, correlations reflect genuine associations: a text containing "dogs" likely also contains "pets". However, we are interested in finding correlations that may reflect biases (e.g. "offensive" correlated with a certain demographic) or artifacts (e.g. text in a certain language uses a lot of emojis) of the dataset.
Prior work has focused on finding SAE latents correlated with dataset labels, by training classifiers on SAE latent activations and using their labels to generate hypotheses (Movva et al. (2025), Katamneni et al. (2025)). In this work, we instead leverage the same hypothesis space for an unsupervised search over arbitrary latent pairs that consistently co-occur.
We present our methodology for finding correlations below:
We use the normalized pointwise mutual information (NPMI) as our co-occurrence metric, filtering out low- and high-frequency latents, which may have distorted NPMI values. This method tends to identify a large number of pairs because of the sheer number of latent pairs (65k × 65k / 2 ≈ 2 billion), so we filter out latents with syntactic labels using an LLM judge to reduce the search space. Then, human inspection of the top candidate pairs and their firing examples is necessary to verify whether they reflect an "interesting" pattern: often, candidate pairs are not true conceptual correlations but rather a byproduct of poor labelling or a single token triggering both latents. We compare this method to the baseline of passing the entire dataset to an LLM and asking for interesting correlations.
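For reference, a minimal sketch of the NPMI score on binarized latent activations (the frequency and label filtering described above are assumed to have already narrowed the set of latents):

```python
import numpy as np

def npmi(fires: np.ndarray, i: int, j: int, eps: float = 1e-12) -> float:
    """Normalized pointwise mutual information between two SAE latents.

    fires: (n_docs, n_latents) boolean array of whether each latent fires in
    each document (e.g. max-pooled activation above a threshold).
    Returns a value in [-1, 1]; values near 1 mean the latents almost always
    co-occur.
    """
    p_i = fires[:, i].mean()
    p_j = fires[:, j].mean()
    p_ij = (fires[:, i] & fires[:, j]).mean()
    pmi = np.log(p_ij + eps) - np.log(p_i + eps) - np.log(p_j + eps)
    return pmi / -np.log(p_ij + eps)
```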
We test if the method works by injecting LLM-generated texts with known correlations into a background corpus from the Pile (10k total). We inject these correlations:
We run our SAE method and examine the candidate group of feature pairs with NPMI > 0.8 and semantic similarity < 0.2, and find features relevant to the injected correlation, even for a small fraction (down to 10/10,000) of injections. We compare this to the baseline method of feeding the dataset to the LLM judge and asking for any interesting co-occurrences, which finds the injected correlations only unreliably (Appendix).
We now turn to examining real-world datasets, using SAEs to automatically identify correlations within them.
We examine the CivilComments dataset, which contains internet comments labeled for toxicity. While we know from its construction that offensive content is present, our goal is to automatically identify such content and understand its common themes. We run our SAE method on 5k comments and highlight interesting correlations below, revealing bias where "offensive" latents[4] co-occur with religion, race, and gender latents[5]. We also observe topic-specific patterns, such as correlations with “Donald Trump’s policies,” indicating that many comments address these topics together.
Each SAE pair is a hypothesis that these two features co-occur more than expected in the dataset. We can look at the conditional probabilities of the latents to determine whether, e.g., "most offensive text mentions religion" or "most texts mentioning religion are offensive". We then want to verify whether this relationship is truly significant in the dataset, using the original toxicity label and an LLM judge to determine the true presence of concepts (Appendix) and the true conditional probabilities of occurrence. We find that many of the raised hypotheses are indeed significant correlations. The baseline method of passing the dataset to an LLM judge mentioned only the "offensive and religion" correlation (Appendix). While LLMs raise many hypotheses about the dataset, some of which seem plausible, the SAE method provides a more systematic way of discovering such correlations.
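A sketch of this directional check, using boolean concept labels from the original annotations or the LLM judge (the variable names in the usage comment are illustrative):

```python
import numpy as np

def conditional_probs(has_a: np.ndarray, has_b: np.ndarray):
    """Directional co-occurrence: returns (P(B | A), P(A | B)).

    has_a, has_b: boolean arrays over documents, e.g. "is offensive" and
    "mentions religion" as determined from labels or an LLM judge.
    """
    both = (has_a & has_b).sum()
    return both / max(has_a.sum(), 1), both / max(has_b.sum(), 1)

# p_religion_given_offensive, p_offensive_given_religion = \
#     conditional_probs(is_offensive, mentions_religion)
```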
Next, we apply our method to a sample of 5k model responses from Chatbot Arena. We find several correlations between narrative-related latents and offensive-content latents (the table below shows only a subset). This raises the hypothesis that a significant fraction of offensive content generated by the model takes the form of narrative text, likely due to the prompts that people use to elicit such behavior. Using an LLM judge to label the dataset for "offensive" and "narrative", we find that the conditional probabilities of each concept given the other indicate a co-occurrence that is not extremely strong but still present.[6] The baseline did not raise this hypothesis (Appendix).
Text clustering aims to group unlabelled documents as an exploratory step in understanding large datasets. For example, clustering user prompts to LLMs could generate insights on the types of questions people ask (Tamkin et al. (2024)). In classical NLP, token-based methods (e.g. BM25) or dense semantic embedding methods (e.g. Sentence-BERT) are used to represent documents, and clustering algorithms such as K-means, spectral clustering, HDBSCAN, or non-negative matrix factorization are applied.
Due to the presence of non-semantic latents, we expect clustering on SAE activations to uncover different and potentially more "abstract" clusters. Importantly, SAE activations also allow us to perform targeted clustering: since each latent dimension is interpretable, we can filter down to only the latents we care about and cluster in that subspace. For example, filtering to only "tone" or "reasoning style" latents allows us to ignore the semantic content of the texts, which may otherwise dominate the clustering. Previous work on semi-supervised clustering uses techniques such as pairwise constraints or LLM guidance. Here, we propose using the representational power of LLMs and the interpretability of SAEs to enable out-of-the-box exploration along any desired axis without additional finetuning. We can also diff in-cluster texts against out-of-cluster texts to immediately obtain the top promoted latents, which help describe the cluster. We primarily compare our SAE embeddings with text-embedding-3-large semantic embeddings.
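The exact filtering procedure we use is described in the Appendix; the sketch below shows one plausible instantiation of targeted clustering, where latents are kept if their label embeddings are close to a set of filter phrases. The inputs (label embeddings, phrase embeddings) are assumed to be precomputed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def targeted_clusters(doc_acts, latent_label_embs, phrase_embs, n_clusters=4, top_k=500):
    """Cluster documents in the subspace of latents relevant to a target axis.

    doc_acts:          (n_docs, n_latents) max-pooled SAE activation vectors
    latent_label_embs: (n_latents, d) text embeddings of the latent labels
    phrase_embs:       (n_phrases, d) embeddings of filter phrases (e.g. "tone")
    Keeps the top_k latents whose labels best match any filter phrase and runs
    K-means in that subspace, ignoring the rest of the representation.
    """
    sims = cosine_similarity(latent_label_embs, phrase_embs).max(axis=1)
    keep = np.argsort(-sims)[:top_k]
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_acts[:, keep])
```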
Successful targeted clustering in synthetic datasets. We test targeted clustering by using an LLM to construct a dataset of 960 news paragraphs, with four axes of variation: topic, sentiment, temporal framing, and writing style. We use an LLM to generate phrases to filter the latents by (Appendix). We find that we can cluster along each axis and ignore other axes, while semantic embeddings predominantly result in topic-based clusters.
Failure to recover ground truth labels for sentiment and emotion clustering. We apply targeted clustering to Twitter sentiment analysis (Rosenthal et al. (2019)) and emotion recognition (Saravia et al. (2018)) by filtering SAE activations to only include sentiment- or emotion-related latents. However, both semantic embeddings and SAE embeddings fail to align meaningfully with the ground-truth labels (Appendix). This likely reflects a theoretical limitation of SAE representations: each latent contributes equally to the representation and clustering, but neither the SAE nor the underlying LLM was trained to ensure that their activations represent a meaningful notion of similarity. As such, SAE activations should not be viewed as a strictly "better" representation of text, but rather as an alternative lens, one with different features from typical semantic embeddings that may surface different insights.
GSM8k: Identifying reasoning patterns. We apply SAE clustering on GSM8k answers, to see if we can uncover insights related to reasoning structure. We find that semantic clusters tend to be aligned with the content of the math problem (Appendix) while the SAE more interestingly finds clusters of how the solution is written. This can be found simply using all latents (Appendix), but is improved when filtering down to latents related to "step by step reasoning".
We verify the validity of these clusters by giving the labels of the clusters to our LLM judge and asking it to assign each text to one cluster. Even with the assumption that all texts belong in one of three disjoint clusters, most SAE-assigned texts "truly" belong to that cluster.
We hypothesized that these clusters could not be found semantically, as it seems unlikely that semantic embeddings capture such sentence-level patterns well. Looking at each SAE cluster in semantic space, we compute the conductance of the cluster using its within-cluster and out-of-cluster affinities, where a lower conductance means the cluster is better separated. We compare this conductance to that of randomly drawn, equally sized sets of texts. The resulting z-scores are negative, but clusters found using semantic embeddings tend to have much lower z-scores (Appendix). Thus, the SAE has found structure that semantic embeddings would not have found.
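A sketch of this conductance comparison, assuming a precomputed pairwise semantic affinity matrix (all names here are illustrative):

```python
import numpy as np

def conductance(affinity: np.ndarray, in_cluster: np.ndarray) -> float:
    """Conductance of a cluster in a semantic affinity graph.

    affinity:   (n, n) pairwise similarity matrix between documents
    in_cluster: boolean mask selecting the cluster
    Lower conductance means the cluster is better separated from the rest.
    """
    cut = affinity[in_cluster][:, ~in_cluster].sum()
    vol_in = affinity[in_cluster].sum()
    vol_out = affinity[~in_cluster].sum()
    return cut / min(vol_in, vol_out)

def conductance_z_score(affinity, in_cluster, n_random=200, seed=0):
    """Compare a cluster's conductance to random, equally sized document sets."""
    rng = np.random.default_rng(seed)
    n, k = len(in_cluster), int(in_cluster.sum())
    rand = []
    for _ in range(n_random):
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=k, replace=False)] = True
        rand.append(conductance(affinity, mask))
    return (conductance(affinity, in_cluster) - np.mean(rand)) / np.std(rand)
```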
Text retrieval aims to identify the most relevant texts in a corpus for a given query. Classic benchmarks (e.g., MSMARCO, MTEB) largely target semantic matching---answering questions or finding semantically similar passages. In contrast, we study property-based retrieval: retrieving texts with implicit properties (e.g. tone, formatting, reasoning style). This may be important when we are more interested in the properties of a text rather than its content, such as when finding LLM outputs with behaviors like hedging or sycophancy. To the best of our knowledge, this is relatively underexplored—Ravfogel et al. (2023) investigates retrieval based on a description of the content but most existing work remains centered on semantic similarity.
Modern decoder-only LLMs have recently begun to outperform traditional methods (e.g. BM25, BERT) on embedding tasks, via last-token or latent-attention pooling, instruction formatting, and/or finetuning. We expect the representational power of modern LLMs to also encode these abstract properties, and empirically many SAE latent labels correspond to abstract concepts. The interpretability of SAEs also helps us better understand retrieval results---some work has used SAEs trained on semantic embeddings to interpret and control retrieval (e.g. O'Neill et al. (2024), Kang et al. (2024)).
Benchmark construction. We construct a property-based retrieval benchmark across six datasets with 10k texts each: user prompts and model responses from Chatbot Arena, reasoning traces from DeepSeek-R1, texts from the Pile, abstracts from arXiv q-bio, and short stories from Reddit. These settings highlight different challenges, e.g., intent and tone in prompts/responses, the presence of strategies across long reasoning traces, types of texts in the Pile, and domain-specific properties in scientific abstracts and short stories. For each dataset, we curate a small set of natural language queries and use an LLM to judge ground-truth relevance.
Baselines. We compare our SAE method against a few baselines:
Results. We first evaluate first-stage retrieval, where each retrieval method must rank the entire corpus. We report average precision and precision@50 averaged over all queries. For methods with hyperparameters (number of phrases for query expansion and temperature for SAE), we report the full range across hyperparameters. The SAE method generally performs the best, being on par with or exceeding the state of the art with no additional finetuning. This is likely because current methods are not optimized for property queries, while the SAE often contains this information directly.
We also combine the results from the OpenAI+LLM and SAE methods using reciprocal rank aggregation and find that performance generally improves over any individual method. This is supported by the fact that the SAE method tends to rank different top documents than other methods (Appendix). We can also add in second-stage retrieval, where we ask an LLM to rerank the top 50, which brings relevant results even closer to the top.
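As a sketch of the combination step: a standard instantiation of reciprocal rank aggregation is reciprocal rank fusion, shown below (the constant k = 60 is a conventional default, not necessarily the value we used).

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several retrieval rankings into one.

    rankings: list of ranked lists of document ids (best first), e.g. one from
    the OpenAI+LLM pipeline and one from the SAE method. Each document scores
    sum(1 / (k + rank)) over the rankings it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([openai_llm_ranking, sae_ranking])
```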
Discussion. For each dataset, we examine queries where the SAE performs well or poorly (Appendix). Embedding methods rely more on semantic phrase matches, while the SAE captures more abstract properties. Generally, the SAE shows the largest gains in cases where keyphrase-based methods struggle—such as when the text switches languages or gets stuck in a repetitive loop—because it captures concepts directly rather than relying on surface-level phrase matches. We also hypothesize that the SAE is better suited for capturing the presence of properties in long texts due to the max-pooling across tokens. For instance, properties present in long and convoluted reasoning traces may be lost in a semantic embedding, which may be why the SAE more strongly outperforms baselines on that dataset.
The performance of the SAE method is sensitive to the temperature used to aggregate latents. Aggregation is necessary, as shown by the poor performance of T=0.01 across datasets: latent labels are noisy, and individual latents are largely more fine-grained than any query. We see in the figure above that lower temperatures work better for prompts, reasoning traces, biology abstracts, and short stories, while higher temperatures work better for responses and the Pile. This is likely because the SAE we used was trained on chat data and thus learnt many latents for that distribution, allowing us to pool more latents and achieve better performance. We expect that training an SAE on more diverse data and obtaining better labels for the latents would improve performance overall.
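The main text does not spell out the aggregation step, so the sketch below is only one plausible reading of it: embed the SAE latent labels, softmax their similarity to the query with a temperature, and score each document by the weighted sum of its pooled latent activations. All names and the exact weighting scheme here are assumptions, not our exact implementation.

```python
import numpy as np

def sae_retrieval_scores(query_emb, latent_label_embs, doc_acts, temperature=0.5):
    """Score documents for a property query via temperature-weighted latents.

    query_emb:         (d,) embedding of the natural-language query
    latent_label_embs: (n_latents, d) embeddings of the SAE latent labels
    doc_acts:          (n_docs, n_latents) max-pooled SAE activation vectors
    A low temperature concentrates weight on a few best-matching latents;
    a high temperature pools over many related latents.
    """
    logits = (latent_label_embs @ query_emb) / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return doc_acts @ weights  # higher score = property more present

# ranking = np.argsort(-sae_retrieval_scores(q_emb, label_embs, doc_embeddings))
```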
We use SAEs to perform four exploratory data analysis tasks and show how we can gain novel insights. Data diffing and correlations can generate unknown insights about the data, while clustering and retrieval show SAEs are useful alternatives to embeddings. We believe data diffing is particularly valuable for studying how model outputs differ between different conditions, especially as we were able to extract insights about frontier models with a relatively small dataset size.
As our work serves as a proof of concept for using SAEs for data analysis, there are many areas of potential improvement. Results would likely improve if we obtained more accurate labels for all SAE latents. As SAE features change depending on the data they are trained on, we suspect that different SAEs may be better suited to different analysis questions. We also studied relatively short pieces of text in this work; longer texts (e.g. agent transcripts) would likely yield more complex, interesting insights.
Given the rich insights that model data holds, data-centric interpretability is a promising direction towards understanding models. Our findings suggest that SAEs are an effective, scalable method for recovering the structure present in data.
This work was conducted as part of the ML Alignment & Theory Scholars (MATS) Program. We would like to thank Samuel Marks for helpful feedback. We particularly thank Lisa Dunlap for providing datasets to test on and suggestions on experiments. We are grateful to Goodfire and MATS for providing compute support. We also thank members of Neel Nanda's MATS stream for engaging brainstorming sessions, thoughtful questions, and ongoing discussions that shaped our approach.
As we used the API, which restricts the context window to 2048 tokens, all texts we choose to analyze are below 2048 tokens.
The frontier models we compare against are Claude Opus 4.1, Claude Sonnet 4, GPT-5, Gemini 2.5 Flash, Llama 4 Maverick, DeepSeek-R1, Gemini 2.5 Pro, Qwen3-235B, and Qwen3-235B-Thinking.
Note that there are 537 different latents in the SAE, all labelled as "Offensive request from the user".
Some labels are overly specific, such as the "black holes" latent that just fires on the token "black".
The definition of "offensive" is subjective, and the SAE latents often fire on non-offensive texts, so they may be somewhat poorly labelled.