Untangling an AI model's semantic associations
I. TL;DR
Epistemic status:
Confidence in the methodological findings: moderate
Confidence in generalization to larger models: low
Executive summary:
This project represents a basic application of pretrained residual stream sparse autoencoders (SAEs) to GPT-2 Small.
The experiments and findings in this project, while broadly reflective of research done by Anthropic and others on this topic, are designed to be accessible to, and easily replicable by, readers who have broad technical competency but not necessarily a deep background in machine learning or related topics.
Applying the SAEs successfully demonstrated how the representation of a given token changes as that token passes through the various layers of the model. Specific evidence of this mechanism includes the following:
Increased peak activations: Peak activations increased ~3x as the token moved from layer 6 to layer 11
Decreased ubiquity of the most frequent features: Coverage of the most common feature per layer decreased modestly (from ~98% to ~87%) as the token moved from layer 6 to layer 11, suggesting that the model progressively “zeroes in” on a given token’s content.
Increased specialization: For nearly all categories of sample text tested, the “specialist score” (a measure of a feature’s selectivity) increased in later layers.
While not rigorously quantified in this research, a visualization of specialist activation with respect to the various sample texts suggests that those specialists may primarily detect a text sample’s surface form (syntax), as opposed to that sample’s meaning (semantics).
II. Introduction
This article represents the first in a multi-part series exploring basic mechanistic interpretability (“MI”) techniques involving use of sparse autoencoders (SAEs). The primary objective of the series is educational. More specifically, I hope to document my learning and share that learning to the degree it is helpful for others along the following lines:
How to use SAEs to help better understand model mechanics
The insights regarding model mechanics thus revealed
How an understanding of said mechanics might enable greater use of AI in use cases particularly sensitive to output consistency, reliability, and safety
This work was initially inspired by Anthropic’s Scaling Monosemanticity (2024) and utilizes pre-trained residual stream SAEs available via the TransformerLens library. The main finding: specialist features detect syntactic patterns rather than semantic concepts, even in later layers. The full code is available via the Phase 1 Jupyter notebook at this project’s GitHub repository.
III. The experiment
Model selection: GPT-2 Small
For this analysis, I selected GPT-2 Small for the following reasons:
Ease of access via TransformerLens
Relatively small size (768 dimensions), making it well-suited for the educational aim of this article while minimizing hardware requirements for those replicating this analysis
Ready availability of pre-trained residual stream SAEs developed by Joseph Bloom and available via SAELens
Important notes:
The results of my analysis may not extrapolate to larger, more capable models. It is not difficult to imagine that more capable models possess an internal structure more adept at encoding meaning, compared to GPT-2 Small.
Pre-trained SAEs for attention outputs and MLP outputs were not available for this model release, constraining the breadth of analysis conducted to the model’s residual stream only.
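For readers following along in code, a minimal loading sketch is shown below. It assumes the SAELens SAE.from_pretrained interface and the gpt2-small-res-jb release identifiers for Joseph Bloom’s residual stream SAEs; exact release strings, hook names, and return signatures vary between SAELens versions, so treat this as illustrative rather than as the notebook’s exact code.

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

device = "cuda" if torch.cuda.is_available() else "cpu"

# GPT-2 Small via TransformerLens.
model = HookedTransformer.from_pretrained("gpt2", device=device)

# One of the pretrained residual-stream SAEs (layer 8 shown for illustration).
# Note: recent SAELens versions return a (sae, cfg_dict, sparsity) tuple;
# other versions may return the SAE object alone.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
    device=device,
)

# Encode one text's residual stream into SAE feature activations.
text = "def add(a, b):\n    return a + b"
_, cache = model.run_with_cache(text)
resid = cache["blocks.8.hook_resid_pre"]   # shape: [batch, seq, d_model]
feature_acts = sae.encode(resid)           # shape: [batch, seq, d_sae]
print(feature_acts.shape)
```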
Dataset construction
To examine how the model features revealed by SAE application respond to different types of content, I constructed (with LLM assistance) a test dataset of 70 texts spanning seven categories. Those categories are described in Figure 1 below. A complete list of the test texts is available via the Phase 1 Jupyter notebook on the project’s GitHub repo.
Figure 1: Test text types used
Each category contained 10 texts, providing a balanced basis for comparing activation patterns across content types.
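For concreteness, the dataset can be pictured as a simple mapping from category name to a list of texts. The snippets below are hypothetical placeholders, not the actual test texts (those live in the Phase 1 notebook), and the real dataset has 10 texts per category:

```python
# Hypothetical placeholders only -- the actual 70 test texts are in the
# project's Phase 1 notebook. Each of the seven categories has 10 texts.
test_texts = {
    "python":         ["def square(x):\n    return x ** 2", "for i in range(3): print(i)"],
    "math":           ["2 + 2 = 4", "x^2 + y^2 = z^2"],
    "urls":           ["https://example.com/docs?page=2", "ftp://files.example.org/data.csv"],
    "social":         ["OMG this is amazing!! 🎉", "can't wait for the weekend 😎 #friday"],
    "conversational": ["Hey, how was your day?", "I was thinking we could grab lunch later."],
    "formal":         ["The committee will convene on Tuesday to review the proposal.",
                       "Pursuant to section 4.2, payment is due within thirty days."],
    "non_english":    ["C'est la vie, comme on dit.", "这是一个测试句子。"],
}

assert len(test_texts) == 7
```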
Measuring feature specialization
Specialist feature identification
For purposes of this analysis, a specialist feature is an SAE dimension that activates materially for texts within a single category while remaining relatively inactive for texts in other categories. The intuition is straightforward: if a feature responds strongly to Python code but rarely to math equations, URLs, or conversational text, it represents something specific to Python. What that “something” is - syntax, semantics, or both - is a primary focus of this analysis.
Identifying specialists required a two-step process:
Filtering for activation level: First, for each of the aforementioned seven text categories, I identified the 5 features with the greatest maximum activation values across that category’s 10 sample texts.
Filtering for category specificity: Second, each feature with high activation was evaluated for selectivity using the specialist score calculation described below. For example, a feature that activated strongly on Python code but also activated strongly on math and URLs received a low or negative score due to its poor selectivity. Conversely, a feature that activated strongly on Python code and rarely elsewhere received a high score. The candidate with the highest specialist score was thus designated as that category’s specialist.
Specialist score calculation
To better understand how selectively a given feature responds to a particular text category, I employed the specialist score metric described below.
specialist_score = n_in − n_out

Wherein:
n_in is the count of texts within the target category where the feature’s activation exceeded a somewhat arbitrary threshold (5.0)
n_out is the count of texts outside the category exceeding the same threshold.
With 10 texts per category and 70 texts total, the theoretical range for this score spans from −60 (a feature that fires on every text except those in the target category, indicating negative selectivity for that category) to +10 (a feature that fires on all texts in the target category and none outside, indicating perfect discrimination for that category). For example, a score of 8 would indicate that the feature in question activated strongly on 8 of 10 in-category texts while remaining largely inactive elsewhere. For summary purposes, a category was considered to have a specialist if its highest-scoring feature achieved a positive score.
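A minimal sketch of this calculation follows, assuming we already have one maximum activation value per text for the feature in question; the function and argument names here are illustrative, not the notebook’s.

```python
import numpy as np

def specialist_score(per_text_max_acts, category_labels, target_category, threshold=5.0):
    """n_in minus n_out, using the 5.0 activation threshold described above.

    per_text_max_acts: shape [n_texts], the feature's max activation on each text
    category_labels:   shape [n_texts], the category label of each text
    """
    acts = np.asarray(per_text_max_acts)
    labels = np.asarray(category_labels)
    fires = acts > threshold
    n_in = int(np.sum(fires & (labels == target_category)))
    n_out = int(np.sum(fires & (labels != target_category)))
    return n_in - n_out

# Example: a feature that fires on 8 of 10 in-category texts and on 1 text
# elsewhere scores 8 - 1 = 7.
```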
Analysis objectives
A summary of the questions explored in this analysis is provided in Figure 2 below.
Figure 2: Avenues of investigation
IV. Results
Summary of findings
Figure 3 below summarizes findings for each analysis objective.
Figure 3: Part 1 key findings summarized
| Analysis | Objective | Finding | Expected or Surprise |
|---|---|---|---|
| Strongest features | Do activation magnitudes vary by layer? | Peak activations increase by ~3x from layer 6 to layer 11 | Expected |
| Most frequent features | Are some features more general than others? | Coverage decreases modestly with depth (~98% → ~87%) | Expected |
| Category specialists | Can we find category-specific features? | Yes, after fixing two methodological issues; 7/7 specialists at all layers | Expected |
| Cross-layer comparison | Do deeper layers show more specialization? | All layers achieve 7/7 specialists; deeper layers show stronger scores for most categories, particularly Social | Expected |
| Token-level activation | What do specialists actually detect? | Syntactic / symbolic patterns, not semantic topics | Surprise |
Strongest features
A review of the peak activation magnitudes by layer revealed a clear pattern: peak activation levels increase substantially with increasing layer depth. A summary of these findings is shown in Figure 4.
Figure 4: Peak activation by layer
At layer 6, the single strongest SAE feature peaked at roughly 13. By layer 11, this had increased almost threefold to approximately 37. This pattern is consistent with how transformers process input: as it passes through the layers, the model progressively refines its representation of that input, much as an author sharpens a point over several drafts of an article.
With this in mind, it makes intuitive sense that features attuned to distinct properties would be more highly activated in later layers of the model, where such properties come into sharper relief.
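For readers replicating this comparison, a minimal sketch is shown below. It assumes `model`, a dict `saes` mapping each analyzed layer to its loaded residual stream SAE (as in the loading sketch above), and `all_texts` as the flat list of 70 test texts; these are illustrative names, not the notebook’s.

```python
# Peak SAE feature activation per layer, across all texts, features, and positions.
# Assumes `model`, `saes` (layer -> loaded residual-stream SAE), and `all_texts`.
peak_by_layer = {}
for layer, sae in saes.items():
    hook_name = f"blocks.{layer}.hook_resid_pre"
    layer_peak = 0.0
    for text in all_texts:
        _, cache = model.run_with_cache(text)
        feature_acts = sae.encode(cache[hook_name])        # [1, seq, d_sae]
        layer_peak = max(layer_peak, feature_acts.max().item())
    peak_by_layer[layer] = layer_peak

print(peak_by_layer)  # in this analysis: roughly 13 at layer 6 up to roughly 37 at layer 11
```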
Most frequent features
In addition to examining which singular feature was activated most intensely at each layer, I also examined which SAE features activated across the largest number of texts, regardless of activation intensity. A feature that fires on 70 out of 70 texts, for example, is maximally general. Conversely, a feature that fires on only a few texts is highly selective. A summary of these findings is shown in Figure 5.
Figure 5: Top feature coverage across text samples, by layer
The most frequent features showed high coverage across all layers, though with a slight decline at greater depth: from ~98% at layer 6 to ~87% at layer 11. This suggests that general-purpose features remain prominent as input is processed through the model’s layers, with deeper layers showing only slightly more selectivity. Interestingly, the most common feature differs at each layer. One explanation could be the presence of many generalist features with similarly high coverage, whose relative rankings shift slightly with the transformations applied at each layer.
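A sketch of this coverage measure is shown below. It assumes a per-layer array `max_acts` (a hypothetical name) holding each feature’s maximum activation on each text; note that the “fires” criterion used here (any nonzero activation) is a modeling choice, and the notebook’s exact criterion may differ.

```python
import numpy as np

# max_acts: shape [n_texts, d_sae], each feature's max activation per text,
# built by encoding every text as in the earlier sketches.
def top_feature_coverage(max_acts, fire_threshold=0.0):
    fires = np.asarray(max_acts) > fire_threshold   # boolean [n_texts, d_sae]
    coverage = fires.mean(axis=0)                   # fraction of texts each feature fires on
    top = int(coverage.argmax())
    return top, float(coverage[top])

# In this analysis, the top feature's coverage fell from ~0.98 (layer 6) to ~0.87 (layer 11).
```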
Category specialists
After measuring features’ activations and commonality generally, I then narrowed the analysis to examine specialist features: those that activate strongly within a single text category while remaining relatively inactive outside it.
My initial results were discouraging: across all four SAE layers, almost no specialists were found. Further investigation revealed two significant methodological errors:
Padding token usage: Initially, I was averaging across all token positions, including padding tokens when measuring activations for a given text sample. Since shorter texts are padded to a uniform length and padding tokens produce near-zero activations, category-specific patterns were thus being diluted by those padding tokens.
Averaging before encoding: My original approach averaged GPT-2's raw activations across tokens before passing them through the SAE encoder. This meant I was applying the SAE to an averaged activation vector that didn't correspond to any actual token. Because the SAE encoder includes nonlinear operations, this produces different results than encoding each token first and averaging afterward.
Correcting the padding issue alone revealed a progressive pattern: specialists emerged for 5/7 categories at layer 6, increasing to 7/7 by layer 10. Correcting both issues revealed specialists for all seven categories at every layer:
| Layer | Specialists Found |
|---|---|
| 6 | 7/7 |
| 8 | 7/7 |
| 10 | 7/7 |
| 11 | 7/7 |
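To make the two corrections concrete, here is a minimal sketch of the fixed per-text aggregation: encode every token position through the SAE first, then average over only the non-padding positions. The function name and the layer-8 hook are illustrative, and the notebook’s exact implementation may differ.

```python
def per_text_feature_acts(texts, model, sae, hook_name="blocks.8.hook_resid_pre"):
    """Encode each token through the SAE, then average over non-padding positions."""
    tokens = model.to_tokens(texts)               # [batch, seq], padded to the max length
    pad_id = model.tokenizer.pad_token_id
    # With GPT-2 the pad token is typically the EOS/BOS token, so this mask also
    # drops the prepended BOS position -- acceptable here, since only content
    # tokens should contribute to the average.
    mask = (tokens != pad_id).float()             # [batch, seq]
    _, cache = model.run_with_cache(tokens)
    feature_acts = sae.encode(cache[hook_name])   # [batch, seq, d_sae]  (fix 2: encode per token first)
    summed = (feature_acts * mask.unsqueeze(-1)).sum(dim=1)
    return summed / mask.sum(dim=1, keepdim=True) # [batch, d_sae]       (fix 1: ignore padding)
```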
Cross-layer specialist comparison
Comparing specialist patterns across layers revealed a clear pattern regarding how readily specialist features emerged as input passed through the model:
Categories with numerous early specialists (found at layer 6): Social, Math, and Python achieved the highest scores (5-10), characterized by distinctive, homogeneous surface-level features such as emojis, mathematical symbols, and Python keywords.
Categories requiring deeper processing: Conversational, Formal, Non-English, and URLs all showed low scores (1-2) across layers, but potentially for different reasons:
The Conversational and Formal categories lack distinctive surface forms and instead use the standard written prose that likely accounted for the bulk of the model’s training data. What distinguishes these samples from the other categories is subtler, consisting mainly of tone and pragmatics.
The Non-English category has distinctive but heterogeneous markers. Chinese characters, for example, differ completely from Arabic script, which differs from Latin-script French. Since no single feature can capture this diversity, few specialists emerged.
The URLs category was in some ways the opposite of the Non-English category: it contained some distinctive surface forms, but also significant syntactic overlap with other domains, such as file paths, time notation, and even English prose. Perhaps this varied mixture inhibited the clear emergence of specialist features.
These results are shown in Figure 7.
Figure 7: Detailed view of specialist scores by layer
Token-level activation
For each specialist feature identified, I then used token-level visualization to examine precisely which tokens in the text samples triggered that feature’s activation. This analysis produced the most surprising finding of the project.
Given the model’s impressive capabilities, I expected that specialist features would respond to semantic content: that a “math specialist” would activate on mathematical concepts, a “Python specialist” on programming concepts, and so forth. The resulting visualizations, however, seem to indicate that the specialists detect syntactic and symbolic patterns rather than high-level topics. Examples of these visualizations are shown below.
Math specialist (Feature #22917): As shown in Figure 8, this feature seems to activate primarily in response to arithmetic operators such as “+”, “=”, “^”, “/”, regardless of surrounding context, as opposed to the sample’s actual mathematical meaning.
Figure 8: Activation of math specialist (Feature #22917) against sample math text
Python specialist (Feature #15983): As shown in Figure 9, this feature seems to activate most acutely in response to Python structural syntax, particularly “):” patterns at the end of function and class definitions, closing parentheses, and Python keywords like “for”, “in”, and “if” in comprehensions. This suggests the feature is detecting Python's grammatical markers, not the more fundamental programming concepts that utilize those markers.
Figure 9: Activation of Python specialist (Feature #15983) against sample Python text
The results suggest cross-domain specialist feature activation is to be expected. The math specialist would likely activate on Python code containing arithmetic operations because it is that mathematical symbology that drives the math specialist’s activation. The reverse is also true: the Python specialist can be expected to activate on mathematical expressions that use parentheses, since it is those parentheses - not programming concepts per se - that drive that activation. These are not errors; they reveal that SAE features learn efficient low-level patterns that correlate with human-defined categories, rather than learning the categories directly. Put another way, the specialist features that SAEs identify appear to track surface form rather than meaning. This raises a question of whether that meaning is encoded elsewhere in the model, or whether the model achieves its capabilities through sophisticated pattern-matching on syntax alone.
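For readers who want to reproduce these token-level views, a minimal sketch follows. It assumes `model` and `sae` loaded as in the earlier sketch, uses the layer-8 hook purely for illustration (a specialist should be inspected at the layer where it was identified), and prints activations rather than rendering the heatmaps shown in the figures; the sample text is a hypothetical stand-in.

```python
# Inspect which tokens drive a specialist feature (feature 22917, the math
# specialist discussed above; layer-8 hook shown for illustration).
FEATURE_ID = 22917
text = "The solution is x = (3 + 5) / 2"

tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)
acts = sae.encode(cache["blocks.8.hook_resid_pre"])[0, :, FEATURE_ID]

# Print each token next to the feature's activation at that position.
for tok, act in zip(model.to_str_tokens(text), acts.tolist()):
    print(f"{tok!r:>12}  {act:6.2f}")
```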
V. Implications and next steps
Next steps for research
Attention and MLP SAEs
My analysis was restricted to residual stream SAEs because pre-trained SAEs for attention outputs and MLP outputs were not available for GPT-2 Small. Attention and MLP outputs play roles in the transformer architecture distinct from the residual stream, so SAEs trained on those outputs could yield different and potentially illustrative results. This analysis would require either training custom SAEs on those outputs or awaiting broader SAE releases for GPT-2 Small, which seems unlikely given the field’s shift toward more modern models.
Larger models
GPT-2 Small was chosen for its relatively manageable size, but frontier models operate at a different scale, containing up to trillions of parameters. It remains an open question whether the patterns observed here (deeper layers developing sharper specialists, syntax-level rather than semantic-level detection, etc.) would persist in larger models, or whether those models would display fundamentally different behaviors commensurate with their vastly greater capabilities. Recent work by Anthropic on Claude suggests that interpretability techniques can scale, but replication across model families would strengthen confidence in the findings.
Safety-relevant features
The categories examined in this project were chosen for convenience and intuitiveness: code, math, URLs, etc. A more rigorous application of this analysis would search for features related to safety-relevant behaviors such as deception, sycophancy, and uncertainty, among others. Can SAEs identify features that activate when a model is about to produce a hallucination? When it is being evasive? When it is expressing inappropriate confidence? Identification of these features, if present, would have direct applications in AI safety monitoring.
Circuit analysis
This project examined individual features in isolation. A richer understanding would trace how features connect across layers to produce an output. This is the domain of circuit analysis, which aims to reverse-engineer not just what features exist but how those features interact. The specialist features identified here could serve as starting points for circuit tracing: given that Feature #22917 responds to arithmetic operators, what downstream computations does it feed into? How does it contribute to the model’s ultimate output?
VI. Closing reflection
In this analysis, I sought to investigate and illustrate whether we can understand what AI models are actually doing as they process user input and formulate a response. That question remains only partially answered. This analysis adds to an already large catalogue of analysis that suggests SAEs offer a window into model internals. In this analysis, however, I found that the features discovered represent statistical regularities that correlate with surface form rather than the semantic concepts we might hope to find.
This limitation is not unique to this project. Within the interpretability research community, SAEs are a subject of growing debate. Anthropic’s leadership has described interpretability as being “on the verge” of a breakthrough, with SAE-based techniques enabling researchers to trace model reasoning through identifiable features and circuits. Their research on Claude Sonnet identified features exhibiting “depth, breadth, and abstraction”, including concepts that fire across languages and modalities, such as a “Golden Gate Bridge” feature that activates on English text, Japanese discussions, and images alike. Others are more skeptical. A March 2025 progress update from Google DeepMind’s mechanistic interpretability team cast doubt on the value of SAE research in at least some contexts, reporting that SAE features proved less effective than simpler techniques in their evaluations and that the team was “deprioritizing fundamental SAE research” as a result. Their concern: SAEs may find statistical regularities useful for reconstruction without capturing the concepts most relevant to model behavior. The findings from this project do not resolve this debate, but they are broadly consistent with DeepMind’s more skeptical view.