Why max-pooling? Did you try multiple approaches and settle on max-pooling? If so, what happened in the other cases?
We chose max-pooling (vs. average pooling) as our aggregation method because we primarily care about the largest extent to which a property is present in a data sample, rather than its average presence across the sample. In this work, we use SAE features to construct a document-level embedding that is independent of sample-specific information like sample length. However, we did not thoroughly investigate other aggregation methods, and further investigation could be interesting. For instance, if we care about distinguishing short samples where a property occurs once from long samples where it occurs repeatedly, a different aggregation method would work better.
Nick and Lily are co-first authors on this project. Lewis and Neel jointly supervised this project.
The field of language model interpretability has traditionally focused on studying model internals. We believe an important but neglected direction is data-centric interpretability, which aims to understand models through insights about their data (outputs and training data). For instance, while training data strongly shapes model behavior, interpretability research often prioritizes model internals over understanding the structure of that data. Prior work has largely used LLMs (Dunlap et al. (2024)) to extract insights from model data, but these approaches are difficult to scale and limit the ability to conduct large-scale, systematic analyses that can be iterated upon quickly.
In this work, we use sparse autoencoders (SAEs) for data analysis, with a focus on model data for understanding model behavior. SAEs have been largely used to extract interpretable features (Bricken et al. (2023); Templeton et al. (2024)) from and control (Bayat et al. (2025)) LLMs, despite conflicting evidence on their utility (Subhash et al. (2025)). Here, we hypothesize that SAEs have two advantages for data analysis.
First, SAEs provide a large hypothesis space useful for discovering novel insights. As they are trained in an unsupervised manner, they generate a large dictionary of "features"—latent vectors that represent a (mostly) monosemantic concept. Functionally, these latents act as labels of properties for each dataset sample that we can leverage for insight discovery. We perform two exploratory data analysis tasks. First, we use SAEs to evaluate differences between model outputs, finding that Grok-4 is more careful to clarify ambiguities than other frontier models. Second, we search for interesting co-occurrences of SAE latents to understand how concepts co-occur in a dataset, discovering that offensive language tends to correlate with narrative stories in Chatbot Arena. These approaches surface novel insights in a fully unsupervised manner, enabling us to find unknown unknowns in datasets.
Second, SAEs capture rich properties of text beyond semantics—they approximate LLM representations which encode complex linguistic and conceptual information. Motivated by this intuition, we explore using SAE latent activations as an alternative to traditional text embeddings. We find that they can cluster data differently than semantic embeddings, such as grouping distinct reasoning approaches, and can be used for retrieving data with property-based queries.
Our results represent a preliminary step toward using SAEs as a versatile tool for the exploratory analysis of large datasets, highlighting data-centric interpretability as a promising direction for future work.
Sparse autoencoders. Sparse autoencoders (SAEs) learn to reconstruct language model activations using a dictionary of latent features with a sparsity penalty. Throughout this work, we use SAEs from Goodfire[1] trained on the layer 50 residual stream of Llama 3.3 70B using LMSYS-Chat-1M. The SAE has an average L0 of 121 and a dictionary size of 65k latents, a subset of which we found to have labels describing the concept they fire on.
Representing datasets with feature activations. We max-pool the activations of each SAE latent across tokens to obtain an "SAE activation vector" (see below).
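As a concrete illustration of this pooling step, here is a minimal NumPy sketch; the `sae.encode` and `get_residual_acts` helpers in the usage comment are hypothetical stand-ins for however the reader model's residual-stream activations and their SAE encodings are obtained.

```python
import numpy as np

def sae_activation_vector(token_acts: np.ndarray) -> np.ndarray:
    """Max-pool SAE latent activations over tokens.

    token_acts: (n_tokens, n_latents) SAE latent activations for one document,
    obtained by encoding the reader model's residual-stream activations.
    Returns a single (n_latents,) "SAE activation vector" for the document.
    """
    return token_acts.max(axis=0)

# Hypothetical usage, assuming helpers that return per-token activations:
# doc_embeddings = np.stack([
#     sae_activation_vector(sae.encode(get_residual_acts(text)))
#     for text in dataset
# ])
```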
Note that usually, we use an SAE trained on the activations of the model we are interpreting. As we are interpreting data in this work, we only need a "reader model" (Llama 3.3 70B) and its SAE, even if our data of interest was generated by another model.
LLM judge. When we use an LLM for dataset generation or labelling, we primarily use Gemini 2.5 Flash (unless otherwise specified) due to its low cost and large 1M-token context window.
We consider the problem of describing the difference between two datasets, which we call data diffing. In the context of LLM interpretability, understanding how datasets differ is important when there is a variable of change: for example, when using different data splits across training runs, when comparing outputs from a base and finetuned model, or when assessing the effect of new training techniques. Model diffing similarly aims to characterize differences in models, but while it typically focuses on model internals, we propose to instead compare models by diffing their outputs.
Prior work (Zhong et al. (2023), Zhong et al. (2022)) has used LLMs to generate and test hypotheses on differences between two datasets, which is computationally expensive. Here, we show that SAEs can find valid differences between datasets more cost-effectively. The methodology is described below, where each SAE latent essentially acts as a hypothesized difference.
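The full scoring procedure is shown in the methodology figure below; as a rough sketch of the idea, one simple way to treat each latent as a hypothesized difference is to compare its firing frequency across the two datasets. The variable names in the usage comment (`casual_acts`, `original_acts`, `latent_labels`) are hypothetical.

```python
import numpy as np

def top_latent_differences(acts_a: np.ndarray, acts_b: np.ndarray, k: int = 3):
    """Rank SAE latents by how much more often they fire in dataset A than in B.

    acts_a, acts_b: (n_docs, n_latents) max-pooled SAE activation vectors for
    the two datasets being diffed. Each returned latent index is a candidate
    hypothesis about how the datasets differ.
    """
    freq_a = (acts_a > 0).mean(axis=0)  # fraction of docs in A where each latent fires
    freq_b = (acts_b > 0).mean(axis=0)
    gap = freq_a - freq_b
    return np.argsort(-gap)[:k], gap

# top_idx, gap = top_latent_differences(casual_acts, original_acts)
# for i in top_idx:
#     print(latent_labels[i], f"{gap[i]:+.1%}")
```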
We explore dataset differences for both synthetic and real-world tasks:
Results. Below, we present the top three latent differences for five randomly chosen examples from the synthetic and movie datasets.
Synthetic dataset (tone shifts)
| Tone shift | Top 3 latent differences |
|---|---|
| Casual | Casual/cool slang and informal speech patterns |
| | Narrative transitions and connective phrases in explanatory text |
| | Making text more informal and conversational |
| Organized | Question and answer transition points in educational content |
| | The start of a formal question in structured Q&A formats |
| | Qt framework and Q-prefixed technology terms |
| Imaginative | Groups gathering to share stories and experiences, especially in atmospheric or mysterious contexts |
| | Formal literary writing style marked by heavy use of articles and possessives |
| | Transitioning into role-play mode |
| Funny | Young adult recreational and social activities |
| | Offensive or inappropriate content, particularly flirtation and suggestive material |
| | Explaining something in a friendly, casual teaching style |
| Safety conscious | Emphasizing ethical concerns and safety warnings |
| | Establishing boundaries around inappropriate content |
| | Polite hedging phrases used to soften or introduce statements |
Real-world dataset (genre changes)
| Movie genre | Top 3 latent differences |
|---|---|
| Action | Action movie plot developments and dramatic confrontations |
| | Narrative sequences describing villains rising to power |
| | Introduction of new narrative elements with indefinite articles |
| War | Soldiers experiencing the psychological and physical hardships of war |
| | Military officer ranks and hierarchical descriptions |
| | Major historical events and periods of upheaval |
| Romance | Will they/won't they writing tropes |
| | Falling or being in love across multiple languages |
| | Initiating romantic pursuit, especially cross-cultural dating |
| Crime | Providing movie recommendations or plot descriptions |
| | High-stakes heists involving valuable items from secure locations |
| | Detective character establishment in crime narratives |
| Documentary | Documentary films and series being discussed or listed |
| | Revealing behind-the-scenes content or private glimpses |
| | Structural elements that introduce additional information or examples in formatted text |
For both the synthetic and movie (IMDB) datasets, we find that the top latent differences align with the known difference (the tone or genre). While some latent labels directly state the difference (e.g. the "action" genre), some connections are more indirect. For instance, the top latent difference for the "organized" tone shift is "Question and answer transition points in educational content", which implies an organized tone. This reflects the need to directly examine the data to understand the difference more clearly, rather than relying solely on the latent labels.
To discover novel differences between in-the-wild datasets, we focus on the task of comparing models by comparing their generated responses. We randomly sample 1000 conversations from Chatbot Arena and generate responses using the first-turn prompt for a set of models. Then, we diff the responses with each other.
Datasets. We perform data diffing on two model setups.
Automatically converting feature differences to hypotheses 🤖. While we directly relied on latent descriptions in the previous section, many latents are either noisy or activate on similar text. Therefore, we use an LLM to:
LLM baseline. We also compare the SAE-generated hypotheses with a pure LLM pipeline. Motivated by Dunlap et al. (2024), we use an LLM to (1) evaluate differences between each response pair and (2) summarize the largest differences (with batching due to context window limitations) into hypotheses. For both the SAE- and LLM-based methods, we pass in a query (e.g. "What are the most significant, interesting differences?") to direct the hypothesis space.
Hypothesis verification. Finally, given these proposed hypotheses, we use an LLM judge to verify the presence of these differences. Specifically, the LLM judge scores whether each response has a hypothesized property; then, we tally whether the property occurs more in one dataset than the other. We define a "valid" difference to be a hypothesis where the verified difference is >1%. We generate hypotheses for both model setups and compute the average difference in frequencies across valid hypotheses below. For the hypotheses of each frontier model, we compute the difference between that model's verified frequency and the maximum verified frequency among the other frontier models.
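As a minimal sketch of this verification step (the `judge` function here is a hypothetical wrapper around the LLM judge call, not an API we provide):

```python
def verified_difference(hypothesis, responses_a, responses_b, judge) -> float:
    """Tally how much more often a hypothesized property occurs in A than in B.

    `judge(hypothesis, text)` is assumed to return True if the text exhibits
    the hypothesized property. A hypothesis counts as a "valid" difference if
    the gap in frequencies exceeds 1%.
    """
    freq_a = sum(judge(hypothesis, r) for r in responses_a) / len(responses_a)
    freq_b = sum(judge(hypothesis, r) for r in responses_b) / len(responses_b)
    return freq_a - freq_b

# valid = [h for h in hypotheses
#          if verified_difference(h, responses_a, responses_b, judge) > 0.01]
```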
➡️ Takeaway 1: SAE latents find valid hypotheses. The average difference of our hypotheses is 7.03% for LLaVA-NeXT vs. Vicuna-7B and ranges from 26-47% for the frontier models, suggesting that SAEs can identify strong differences. We present qualitative examples of the model responses below, with the full hypothesis lists for the SAE and LLM baseline in the Appendix.
Qualitative examples from model responses
Bold parts highlight the stated unique quality of each model
Model | Unique Qualities | Prompt | Response |
---|---|---|---|
Grok-4 | Clarifies ambiguities, offers multiple interpretations | A carton has six eggs. Two are added. Then one is taken out and eaten. One spoils so it is removed but replaced immediately. How many eggs have left the basket? | Based on the wording of the riddle, no basket is ever mentioned until the final question—everything happens with a carton. Therefore, zero eggs have left the basket. If the riddle is assuming "carton" and "basket" refer to the same container (or if it's a mistranslation or misstatement), then I'd say 1 egg has left (the one that was taken out and eaten; the spoiled one was removed but immediately replaced, so it didn't truly "leave"). If the question is meant to be "how many eggs are left in the basket?" then the answer is 7. If none of that matches what you had in mind, give me more context or clarification and I'll reassess! |
GPT-OSS-120B | Markdown table power user 💪 | Who would win in a fight: A dolphin or a sea turtle? | **Short answer:** In a direct, one‑on‑one encounter the dolphin would almost certainly have the upper hand. --- ### Why a dolphin is the likely “winner” | **Trait** | **Dolphin** (e.g., common bottlenose dolphin) | **Sea turtle** (e.g., green or loggerhead) | |-----------|----------------------------------------------|--------------------------------------------| | **Size & mass** | 150–300 kg (typical adult); up to 4m long | 70–200 kg for most species; up to 2m long | | **Speed** | Can cruise at 5–10 knots and sprint up to 20–25 knots (10–12 m/s) | Usually 1–2 knots; can burst to ~5 knots (2–3 m/s) | | **Mobility** | Highly maneuverable, can leap, twist, and use powerful tail flukes for rapid acceleration... |
Gemini 2.5 Pro | Saying confident affirmations like "Of course!" or "Certainly!", enthusiastic willingness to help | How do I calculate gibbs free energy of fibril formation from a solubility value? | Of course! This is an excellent question that connects a macroscopic, measurable property (solubility) to a fundamental thermodynamic quantity (Gibbs Free Energy).. |
➡️ Takeaway 2: The LLM baseline identifies less prominent differences. We observe that the average difference for proposed hypotheses is higher for the SAE than for our LLM baseline, suggesting that the SAE can discover more prominent differences between datasets. Intuitively, good SAE features efficiently point an LLM to the differences present in the text and make them easy to aggregate.
➡️ Takeaway 3: The LLM baseline is more expensive than SAEs. To get a rough gauge of cost, we calculate how many tokens are required for both approaches. For LLaVA-NeXT vs. Vicuna-7B, the SAE uses 700K tokens whereas the LLM uses 3.3M tokens, a roughly 5x increase. We can expect this cost gap to widen as the dataset size grows. One key advantage of SAE feature activations is that they are reusable, whereas we must feed the entire dataset into an LLM for each query in the baseline. For instance, to compare the three frontier models with other models, the SAE approach uses 3.5M tokens in total whereas the baseline uses 25.3M tokens (8.4M average). This disparity in cost makes SAEs a stronger candidate than pure LLMs for generating hypotheses in large datasets.

We consider the problem of finding correlations between features in text datasets. Often, correlations reflect genuine associations: a text containing "dogs" likely also contains "pets". However, we are interested in finding correlations that may reflect biases (e.g. "offensive" correlated with a certain demographic) or artifacts (e.g. text in a certain language uses a lot of emojis) of the dataset.
Prior work has focused on finding SAE latents correlated with dataset labels, by training classifiers on SAE latent activations and using their labels to generate hypotheses (Movva et al. (2025), Katamneni et al. (2025)). In this work, we instead leverage the same hypothesis space for an unsupervised search over arbitrary latent pairs that consistently co-occur.
We present our methodology for finding correlations below:
We use the normalized pointwise mutual information (NPMI) as our co-occurrence metric, filtering out low- and high-frequency latents, which may have distorted NPMI values. This method tends to identify a large number of pairs because of the sheer number of latent pairs (65k × 65k / 2 ≈ 2 billion), so we filter out latents with syntactic labels using an LLM judge to reduce the search space. Then, human inspection of the top candidate pairs and their firing examples is necessary to verify whether they reflect an "interesting" pattern: often, candidate pairs are not true conceptual correlations but rather a byproduct of poor labelling or a single token triggering both latents. We compare this method to the baseline of passing the entire dataset to an LLM and asking for interesting correlations.
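For reference, a minimal sketch of the NPMI score on binarized latent activations (the frequency and label filtering described above are assumed to have already narrowed the set of latents):

```python
import numpy as np

def npmi(fires: np.ndarray, i: int, j: int, eps: float = 1e-12) -> float:
    """Normalized pointwise mutual information between two SAE latents.

    fires: (n_docs, n_latents) boolean array of whether each latent fires in
    each document (e.g. max-pooled activation above a threshold).
    Returns a value in [-1, 1]; values near 1 mean the latents almost always
    co-occur.
    """
    p_i = fires[:, i].mean()
    p_j = fires[:, j].mean()
    p_ij = (fires[:, i] & fires[:, j]).mean()
    pmi = np.log(p_ij + eps) - np.log(p_i + eps) - np.log(p_j + eps)
    return pmi / -np.log(p_ij + eps)
```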
We test if the method works by injecting LLM-generated texts with known correlations into a background corpus from the Pile (10k total). We inject these correlations:
We run our SAE method and examine the candidate group of feature pairs with NPMI > 0.8 and semantic similarity < 0.2, and find features relevant to the injected correlation, even for a small fraction (down to 10/10,000) of injections. We compare this to the baseline method of feeding the dataset to the LLM judge and asking for any interesting co-occurrences, which finds the injected correlations only unreliably (Appendix).
We now turn to examining real-world datasets, using SAEs to automatically identify correlations within them.
We examine the CivilComments dataset, which contains internet comments labeled for toxicity. While we know from its construction that offensive content is present, our goal is to automatically identify such content and understand its common themes. We run our SAE method on 5k comments and highlight interesting correlations below, revealing bias where "offensive" latents[4] co-occur with religion, race, and gender latents[5]. We also observe topic-specific patterns, such as correlations with “Donald Trump’s policies,” indicating that many comments address these topics together.
Each SAE pair is a hypothesis that these two features co-occur more than expected in the dataset. We can look at the conditional probabilities of the latents to determine whether, e.g., "most offensive text mentions religion" or "most texts mentioning religion are offensive". We then want to verify whether this relationship is truly significant in the dataset, using the original toxicity label and an LLM judge to determine the true presence of concepts (Appendix) and the true conditional probabilities of occurrence. We find that many of the raised hypotheses are indeed significant correlations. The baseline method of passing the dataset to an LLM judge mentioned only the "offensive and religion" correlation (Appendix). While LLMs raise many hypotheses about the dataset, some of which seem plausible, the SAE method provides a more systematic way of discovering such correlations.
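A sketch of this directional check, using boolean concept labels from the original annotations or the LLM judge (the variable names in the usage comment are illustrative):

```python
import numpy as np

def conditional_probs(has_a: np.ndarray, has_b: np.ndarray):
    """Directional co-occurrence: returns (P(B | A), P(A | B)).

    has_a, has_b: boolean arrays over documents, e.g. "is offensive" and
    "mentions religion" as determined from labels or an LLM judge.
    """
    both = (has_a & has_b).sum()
    return both / max(has_a.sum(), 1), both / max(has_b.sum(), 1)

# p_religion_given_offensive, p_offensive_given_religion = \
#     conditional_probs(is_offensive, mentions_religion)
```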
Next, we apply our method to a sample of 5k model responses from Chatbot Arena. We find several correlations between narrative-related latents and offensive-content latents (the table below shows only a subset). This raises the hypothesis that a significant fraction of offensive content generated by the model takes the form of narrative text, likely due to the prompts that people use to elicit such behavior. Using an LLM judge to label the dataset for "offensive" and "narrative", we find that the conditional probabilities of each concept given the other indicate a co-occurrence that is not extremely strong but still present.[6] The baseline did not raise this hypothesis (Appendix).
Text clustering aims to group unlabelled documents as an exploratory step in understanding large datasets. For example, clustering user prompts to LLMs could generate insights on the types of questions people ask (Tamkin et al. (2024)). In classical NLP, token-based methods (e.g. BM25) or dense semantic embedding methods (e.g. Sentence-BERT) are used to represent documents, and clustering algorithms such as K-means, spectral clustering, HDBSCAN, or non-negative matrix factorization are applied.
Due to the presence of non-semantic latents, we expect clustering on SAE activations to uncover different and potentially more "abstract" clusters. Importantly, SAE activations also allow us to perform targeted clustering: since each latent dimension is interpretable, we can filter down to only the latents we care about and cluster in that subspace. For example, filtering to only "tone" or "reasoning style" latents allows us to ignore the semantic content of the texts, which may otherwise dominate the clustering. Previous work on semi-supervised clustering uses techniques such as pairwise constraints or LLM guidance. Here, we propose using the representational power of LLMs and the interpretability of SAEs to enable out-of-the-box exploration along any desired axis without additional finetuning. We can also diff in-cluster texts against out-of-cluster texts to immediately obtain the top promoted latents, which help describe the cluster. We primarily compare our SAE embeddings with text-embedding-3-large semantic embeddings.
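The exact filtering procedure we use is described in the Appendix; the sketch below shows one plausible instantiation of targeted clustering, where latents are kept if their label embeddings are close to a set of filter phrases. The inputs (label embeddings, phrase embeddings) are assumed to be precomputed.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

def targeted_clusters(doc_acts, latent_label_embs, phrase_embs, n_clusters=4, top_k=500):
    """Cluster documents in the subspace of latents relevant to a target axis.

    doc_acts:          (n_docs, n_latents) max-pooled SAE activation vectors
    latent_label_embs: (n_latents, d) text embeddings of the latent labels
    phrase_embs:       (n_phrases, d) embeddings of filter phrases (e.g. "tone")
    Keeps the top_k latents whose labels best match any filter phrase and runs
    K-means in that subspace, ignoring the rest of the representation.
    """
    sims = cosine_similarity(latent_label_embs, phrase_embs).max(axis=1)
    keep = np.argsort(-sims)[:top_k]
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_acts[:, keep])
```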
Successful targeted clustering in synthetic datasets. We test targeted clustering by using an LLM to construct a dataset of 960 news paragraphs, with four axes of variation: topic, sentiment, temporal framing, and writing style. We use an LLM to generate phrases to filter the latents by (Appendix). We find that we can cluster along each axis and ignore other axes, while semantic embeddings predominantly result in topic-based clusters.
Failure to recover ground truth labels for sentiment and emotion clustering. We apply targeted clustering to Twitter sentiment analysis (Rosenthal et al. (2019)) and emotion recognition (Saravia et al. (2018)) by filtering SAE activations to only include sentiment- or emotion-related latents. However, both semantic embeddings and SAE embeddings fail to align meaningfully with the ground-truth labels (Appendix). This likely reflects a theoretical limitation of SAE representations: each latent contributes equally to the representation and clustering, but neither the SAE nor the underlying LLM was trained to ensure that their activations represent a meaningful notion of similarity. As such, SAE activations should not be viewed as a strictly "better" representation of text, but rather as an alternative lens, one with different features from typical semantic embeddings that may surface different insights.
GSM8k: Identifying reasoning patterns. We apply SAE clustering on GSM8k answers, to see if we can uncover insights related to reasoning structure. We find that semantic clusters tend to be aligned with the content of the math problem (Appendix) while the SAE more interestingly finds clusters of how the solution is written. This can be found simply using all latents (Appendix), but is improved when filtering down to latents related to "step by step reasoning".
We verify the validity of these clusters by giving the labels of the clusters to our LLM judge and asking it to assign each text to one cluster. Even with the assumption that all texts belong in one of three disjoint clusters, most SAE-assigned texts "truly" belong to that cluster.
We hypothesized that these clusters could not be found semantically, as it seems unlikely that semantic embeddings capture such sentence-level patterns well. Looking at each SAE cluster in semantic space, we compute the conductance of the cluster using its within-cluster and out-of-cluster affinities, where a lower conductance means the cluster is better separated. We compare this conductance to that of randomly drawn, equally sized sets of texts. The resulting z-scores are negative, but clusters found using semantic embeddings tend to have much lower z-scores (Appendix). Thus, the SAE has found structure that semantic embeddings would not have found.
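A sketch of this conductance comparison, assuming a precomputed pairwise semantic affinity matrix (all names here are illustrative):

```python
import numpy as np

def conductance(affinity: np.ndarray, in_cluster: np.ndarray) -> float:
    """Conductance of a cluster in a semantic affinity graph.

    affinity:   (n, n) pairwise similarity matrix between documents
    in_cluster: boolean mask selecting the cluster
    Lower conductance means the cluster is better separated from the rest.
    """
    cut = affinity[in_cluster][:, ~in_cluster].sum()
    vol_in = affinity[in_cluster].sum()
    vol_out = affinity[~in_cluster].sum()
    return cut / min(vol_in, vol_out)

def conductance_z_score(affinity, in_cluster, n_random=200, seed=0):
    """Compare a cluster's conductance to random, equally sized document sets."""
    rng = np.random.default_rng(seed)
    n, k = len(in_cluster), int(in_cluster.sum())
    rand = []
    for _ in range(n_random):
        mask = np.zeros(n, dtype=bool)
        mask[rng.choice(n, size=k, replace=False)] = True
        rand.append(conductance(affinity, mask))
    return (conductance(affinity, in_cluster) - np.mean(rand)) / np.std(rand)
```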
Text retrieval aims to identify the most relevant texts in a corpus for a given query. Classic benchmarks (e.g., MSMARCO, MTEB) largely target semantic matching---answering questions or finding semantically similar passages. In contrast, we study property-based retrieval: retrieving texts with implicit properties (e.g. tone, formatting, reasoning style). This may be important when we are more interested in the properties of a text rather than its content, such as when finding LLM outputs with behaviors like hedging or sycophancy. To the best of our knowledge, this is relatively underexplored—Ravfogel et al. (2023) investigates retrieval based on a description of the content but most existing work remains centered on semantic similarity.
Modern decoder-only LLMs have recently begun to outperform traditional methods (e.g. BM25, BERT) on embedding tasks, via last-token or latent-attention pooling, instruction formatting, and/or finetuning. We expect the representational power of modern LLMs to also encode these abstract properties, and empirically many SAE latent labels correspond to abstract concepts. The interpretability of SAEs also helps us better understand retrieval results---some work has used SAEs trained on semantic embeddings to interpret and control retrieval (e.g. O'Neill et al. (2024), Kang et al. (2024)).
Benchmark construction. We construct a property-based retrieval benchmark across six datasets with 10k texts each: user prompts and model responses from Chatbot Arena, reasoning traces from DeepSeek-R1, texts from the Pile, abstracts from arXiv q-bio, and short stories from Reddit. These settings highlight different challenges, e.g., intent and tone in prompts/responses, the presence of strategies across long reasoning traces, types of texts in the Pile, and domain-specific properties in scientific abstracts and short stories. For each dataset, we curate a small set of natural language queries and use an LLM to judge ground-truth relevance.
Baselines. We compare our SAE method against a few baselines:
Results. We first evaluate first-stage retrieval, where each retrieval method must rank the entire corpus. We report average precision and precision@50 averaged over all queries. For methods with hyperparameters (number of phrases for query expansion and temperature for SAE), we report the full range across hyperparameters. The SAE method generally performs the best, being on par with or exceeding the state of the art with no additional finetuning. This is likely because current methods are not optimized for property queries, while the SAE often contains this information directly.
We also combine the results from the OpenAI+LLM and SAE methods using reciprocal rank aggregation and find that performance generally improves over any individual method. This is supported by the fact that the SAE method tends to rank different top documents than other methods (Appendix). We can also add in second-stage retrieval, where we ask an LLM to rerank the top 50, which brings relevant results even closer to the top.
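As a sketch of the combination step: a standard instantiation of reciprocal rank aggregation is reciprocal rank fusion, shown below (the constant k = 60 is a conventional default, not necessarily the value we used).

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k: int = 60):
    """Fuse several retrieval rankings into one.

    rankings: list of ranked lists of document ids (best first), e.g. one from
    the OpenAI+LLM pipeline and one from the SAE method. Each document scores
    sum(1 / (k + rank)) over the rankings it appears in.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([openai_llm_ranking, sae_ranking])
```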
Discussion. For each dataset, we examine queries where the SAE performs well or poorly (Appendix). Embedding methods rely more on semantic phrase matches, while the SAE captures more abstract properties. Generally, the SAE shows the largest gains in cases where keyphrase-based methods struggle—such as when the text switches languages or gets stuck in a repetitive loop—because it captures concepts directly rather than relying on surface-level phrase matches. We also hypothesize that the SAE is better suited for capturing the presence of properties in long texts due to the max-pooling across tokens. For instance, properties present in long and convoluted reasoning traces may be lost in a semantic embedding, which may be why the SAE more strongly outperforms baselines on that dataset.
The performance of the SAE method is sensitive to the temperature used to aggregate latents. Aggregation is necessary, as shown by the poor performance of T=0.01 across datasets: latent labels are noisy, and individual latents are largely more fine-grained than any query. We see in the figure above that lower temperatures work better for prompts, reasoning traces, biology abstracts, and short stories, while higher temperatures work better for responses and the Pile. This is likely because the SAE we used was trained on chat data and thus learnt many latents for that distribution, allowing us to pool more latents and achieve better performance. We expect that training an SAE on more diverse data and obtaining better labels for the latents would improve performance overall.
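The main text does not spell out the aggregation step, so the sketch below is only one plausible reading of it: embed the SAE latent labels, softmax their similarity to the query with a temperature, and score each document by the weighted sum of its pooled latent activations. All names and the exact weighting scheme here are assumptions, not our exact implementation.

```python
import numpy as np

def sae_retrieval_scores(query_emb, latent_label_embs, doc_acts, temperature=0.5):
    """Score documents for a property query via temperature-weighted latents.

    query_emb:         (d,) embedding of the natural-language query
    latent_label_embs: (n_latents, d) embeddings of the SAE latent labels
    doc_acts:          (n_docs, n_latents) max-pooled SAE activation vectors
    A low temperature concentrates weight on a few best-matching latents;
    a high temperature pools over many related latents.
    """
    logits = (latent_label_embs @ query_emb) / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return doc_acts @ weights  # higher score = property more present

# ranking = np.argsort(-sae_retrieval_scores(q_emb, label_embs, doc_embeddings))
```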
We use SAEs to perform four exploratory data analysis tasks and show how we can gain novel insights. Data diffing and correlations can generate unknown insights about the data, while clustering and retrieval show SAEs are useful alternatives to embeddings. We believe data diffing is particularly valuable for studying how model outputs differ between different conditions, especially as we were able to extract insights about frontier models with a relatively small dataset size.
As our work serves as a proof of concept for using SAEs for data analysis, there are many areas of potential improvement. Results would likely improve if we obtained more accurate labels for all SAE latents. As SAE features change depending on the data they are trained on, we suspect that different SAEs may be better suited to different analysis questions. We also studied relatively short pieces of text in this work; longer texts (e.g. agent transcripts) would likely yield more complex, interesting insights.
Given the rich insights that model data holds, data-centric interpretability is a promising direction towards understanding models. Our findings suggest that SAEs are an effective, scalable method for recovering the structure present in data.
This work was conducted as part of the ML Alignment & Theory Scholars (MATS) Program. We would like to thank Samuel Marks for helpful feedback. We particularly thank Lisa Dunlap for providing datasets to test on and suggestions on experiments. We are grateful to Goodfire and MATS for providing compute support. We also thank members of Neel Nanda's MATS stream for engaging brainstorming sessions, thoughtful questions, and ongoing discussions that shaped our approach.
As we used the API, which restricts the context window to 2048 tokens, all texts we choose to analyze are below 2048 tokens.
The frontier models we compare against are Claude Opus 4.1, Claude Sonnet 4, GPT-5, Gemini 2.5 Flash, Llama 4 Maverick, DeepSeek-R1, Gemini 2.5 Pro, Qwen3-235B, and Qwen3-235B-Thinking.
Note that there are 537 different latents in the SAE, all labelled as "Offensive request from the user".
Some labels are overly specific, such as the "black holes" latent that just fires on the token "black".
The definition of "offensive" is subjective, and the SAE latents often fire on non-offensive texts, so they may be somewhat poorly labelled.