Untangling an AI model's semantic associations
I. TL;DR
Epistemic status:
- Exploratory replication on a small model (GPT-2 Small, 124M parameters).
- Confidence in the methodological findings: moderate.
- Confidence in generalization to larger models: low.

Executive summary:
- Applied pre-trained sparse autoencoders to GPT-2 Small to examine how features respond to different text categories (mathematical formulas, Python code, URLs, plain text, etc.).

Primary findings:
- Peak activations increase ~3x from layer 6 to layer 11.
- Category-specialist features exist but require proper attention masking to detect.
- Specialists appear to detect syntactic patterns rather than semantic concepts (observed, not proven systematically).
II. Introduction
Inspired by Anthropic’s “Scaling Monosemanticity” (2024), this post documents a hands-on exploration of SAE-based interpretability on GPT-2 Small, examining how features respond to different text categories and what they actually detect. The main finding: specialist features detect syntactic patterns rather than semantic concepts, even in later layers. The full code is available via the Phase 1 Jupyter notebook at this project’s GitHub repository.
III. The experiment
Model selection: GPT-2 Small
GPT-2 Small was selected for this project because it offers pre-trained SAE availability via SAELens, TransformerLens support, and tractable scale for consumer hardware.
Sparse autoencoder selection
This project used pre-trained sparse autoencoders (SAEs) from Joseph Bloom's SAELens release, which decompose GPT-2 Small's 768-dimensional residual stream activations into 24,576 sparse features via L1-regularized reconstruction.
An important limitation: Pre-trained SAEs for attention outputs and MLP outputs were not available for this model release. The analysis is therefore restricted to residual stream representations, leaving comparison across component types as a direction for future work.
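For reference, a minimal loading sketch is shown below. It is not the notebook's exact code: the release and hook names follow SAELens's conventions for this SAE set, and the return signature of SAE.from_pretrained has changed across SAELens versions (older versions return a (sae, config, sparsity) tuple).

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small, 124M parameters

# Residual-stream SAE for layer 8; the analysis also covered layers 6, 10, and 11.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre",
)
```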
Dataset construction
To examine how features respond to different types of content, a test dataset of 70 texts was constructed, spanning seven categories. Those categories are described in Figure 1 below. A complete list of the test texts is available via the Phase 1 Jupyter notebook on the project’s GitHub repo.
Figure 1: Test text types used
Each category contained 10 texts, providing a balanced basis for comparing activation patterns across content types.
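To make the setup concrete, an illustrative sketch of how such a dataset could be organized is shown below. The seven category names match those used in the project; the example strings are placeholders, not the actual test texts (those live in the Phase 1 notebook).

```python
# Placeholder texts only; the real 70 texts are in the Phase 1 notebook.
dataset = {
    "math":           ["E = mc^2", "..."],                      # 10 texts per category
    "python":         ["def add(a, b):\n    return a + b", "..."],
    "urls":           ["https://example.com/path?q=1", "..."],
    "conversational": ["hey, are you free later tonight?", "..."],
    "formal":         ["The committee hereby approves the proposal.", "..."],
    "non_english":    ["Bonjour, comment allez-vous ?", "..."],
    "social":         ["Best day ever!! 🎉", "..."],
}
texts = [t for cat_texts in dataset.values() for t in cat_texts]
labels = [cat for cat, cat_texts in dataset.items() for _ in cat_texts]
```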
Measuring feature specialization
Specialist score calculation
To quantify how selectively a feature responds to a particular text category, the analysis employed a specialist score metric. For each candidate feature, the score was calculated as:
specialist_score = n_in − n_out

where:
n_in = the count of texts within the target category where the feature's activation exceeded a threshold (5.0)
n_out = the count of texts outside the category exceeding the same threshold.
With 10 texts per category and 70 texts total, the theoretical range for this score spans from −60 (a feature that fires on every text except those in the target category) to +10 (a feature that fires on all texts in the target category and none outside). A score of 8, for example, would result from a feature that activated strongly on 8 of 10 in-category texts while remaining below the threshold on every text outside the category.
A category was considered to “have a specialist” if its best-scoring feature achieved a positive specialist score. Results are therefore reported in two ways: the specialist score itself (a continuous measure of discrimination strength) and specialists found (a count of how many of the seven categories achieved at least one feature with a positive score). The former appears in heatmaps showing score magnitude; the latter appears in summary tables indicating coverage across categories.
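As a concrete illustration, a minimal sketch of this score is below. The variable and function names are mine, not the notebook's; it assumes a per-text activation summary for a single feature (e.g. the masked per-token average described later).

```python
import numpy as np

def specialist_score(per_text_acts, in_category, threshold=5.0):
    """Specialist score for one feature.

    per_text_acts: (n_texts,) array, the feature's activation summary for each text
    in_category:   (n_texts,) boolean mask, True for texts in the target category
    """
    fires = per_text_acts > threshold
    n_in = int(np.sum(fires & in_category))    # in-category texts above threshold
    n_out = int(np.sum(fires & ~in_category))  # out-of-category texts above threshold
    return n_in - n_out                        # -60 to +10 with 10 in / 60 out texts
```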
Specialist feature identification
A specialist feature, for purposes of this analysis, is an SAE dimension that activates preferentially for texts within a single category while remaining relatively inactive for texts in other categories. The intuition is straightforward: if a feature responds strongly to Python code but rarely to math equations, URLs, or conversational text, it has learned something specific to Python - whether that turns out to be syntactic patterns, keywords, or some other regularity.
Identifying specialists required a two-step process. First, for each of the seven text categories, the analysis identified candidate features: those with the highest maximum activation values across that category’s 10 texts. Specifically, the top 5 features by peak activation were selected as candidates. This step ensured that only features demonstrably responsive to the category were considered; a feature that never activates on Python code cannot be a Python specialist, regardless of how selective it might be.
Second, each candidate feature was evaluated for selectivity using the aforementioned specialist score calculation. A feature that activates strongly on Python code but also activates strongly on math and URLs would receive a low or negative score; a feature that activates strongly on Python code and rarely elsewhere would receive a high score. The candidate with the highest specialist score was designated as that category’s specialist.
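Building on the specialist_score sketch above, the two-step selection might look like the following (again with illustrative names; feature_acts is assumed to be an (n_texts, n_features) matrix of per-text activation summaries for one SAE layer).

```python
def find_specialist(feature_acts, in_category, n_candidates=5, threshold=5.0):
    """Return (feature_id, score) for the best specialist candidate in one category."""
    # Step 1: candidates are the features with the highest peak activation
    # across the category's 10 texts.
    peaks = feature_acts[in_category].max(axis=0)        # (n_features,)
    candidates = np.argsort(peaks)[-n_candidates:]
    # Step 2: score each candidate for selectivity and keep the best one.
    scores = {int(f): specialist_score(feature_acts[:, f], in_category, threshold)
              for f in candidates}
    best = max(scores, key=scores.get)
    return best, scores[best]
```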
Analysis objectives
A summary of the questions explored in this analysis is provided in Figure 2 below.
Figure 2: Avenues of investigation
For each category and SAE layer, a candidate specialist feature is identified by comparing how strongly it activates on in-category versus out-of-category texts, and specialization is defined as the signed difference between the number of in-category and out-of-category texts on which the feature exceeds the activation threshold (the specialist score above). Features with a sufficiently large positive score are considered category specialists.
IV. Results
This section presents the findings from each analysis described above, highlighting both expected outcomes and surprises encountered along the way.
Summary of findings
Figure 3: Part 1 key findings summarized
| Analysis | Objective | Finding | Expected or Surprise |
| --- | --- | --- | --- |
| Strongest features | Do activation magnitudes vary by layer? | Peak activations increase by ~3x from layer 6 to layer 11 | Expected |
| Most frequent features | Are some features more general than others? | Coverage decreases modestly with depth (~98% → ~87%) | Expected |
| Category specialists | Can we find category-specific features? | Yes, after fixing two methodological issues; 7/7 specialists at all layers | Expected |
| Cross-layer comparison | Do deeper layers show more specialization? | All layers achieve 7/7 specialists; deeper layers show stronger scores for most categories, particularly Social | Expected |
| Token-level activation | What do specialists actually detect? | Syntactic / symbolic patterns, not semantic topics | Surprise |
Strongest features
The strongest features analysis identified which SAE dimensions produced the highest activation values across the test dataset. The results revealed a clear pattern: peak activation magnitudes increase substantially with layer depth. A summary of these findings is shown in Figure 4.
Figure 4: Peak activation by layer
At layer 6, the single strongest SAE feature peaked at an activation of approximately 13. By layer 11, this had increased to approximately 37 - nearly a threefold increase. This suggests that features in deeper layers respond more intensely to their preferred inputs, consistent with the hypothesis that later layers develop sharper, more specialized representations.
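For concreteness, under the same assumed (n_texts, n_features) activation matrix used in the sketches above, the "strongest feature" at a layer is simply:

```python
peak_per_feature = feature_acts.max(axis=0)          # each feature's strongest response
strongest_feature = int(peak_per_feature.argmax())   # feature ID with the highest peak
strongest_value = float(peak_per_feature.max())      # ~13 at layer 6, ~37 at layer 11
```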
Most frequent features
The most frequent features analysis examined which SAE dimensions activated across the largest number of texts, regardless of activation intensity. A feature that fires on 70 out of 70 texts is maximally general; one that fires on only a few texts is highly selective. A summary of these findings is shown in Figure 5.
Figure 5: Top feature coverage across text samples, by layer
The most frequent features showed high coverage across all layers, though with a modest decline at greater depth: from ~98% at layer 6 to ~87% at layer 11. This suggests that general-purpose features remain prominent throughout the network, with deeper layers showing only slightly more selectivity. The fact that different feature IDs appear at each layer confirms these are distinct features, not the same feature tracked across depths.
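Coverage here means the fraction of the 70 texts on which a feature fires at all; a sketch under the same assumptions follows (the notebook's exact firing threshold is not reproduced here).

```python
def feature_coverage(feature_acts, threshold=0.0):
    """Fraction of texts on which each feature's activation exceeds the threshold."""
    return (feature_acts > threshold).mean(axis=0)   # (n_features,), values in [0, 1]

most_frequent = int(feature_coverage(feature_acts).argmax())
```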
Category specialists
The category specialist analysis searched for features that activate strongly within a single text category while remaining relatively inactive outside it.
Specialist scores
My initial results were discouraging: across all four SAE layers, almost no specialists were found. Further investigation revealed two methodological errors:
Padding token contamination: The activation extraction code was averaging across all token positions, including padding tokens. Since shorter texts are padded to a uniform length and padding tokens produce near-zero activations, category-specific patterns were being diluted.
Averaging before encoding: The original approach averaged GPT-2's raw activations across tokens before passing them through the SAE encoder. Because the SAE includes non-linear transformations, this produced different (and less reliable) results than encoding each token first, then averaging the resulting feature vectors.
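A sketch of the corrected extraction follows: encode each token position through the SAE first, then average the resulting feature vectors over non-padding positions only. The function name is mine; sae.encode is the SAELens encoder.

```python
import torch

def per_text_features(resid_acts, attention_mask, sae):
    """Encode every token, then average SAE features over real (non-padding) tokens.

    resid_acts:     (batch, seq, d_model) residual-stream activations from GPT-2
    attention_mask: (batch, seq) with 1 for real tokens, 0 for padding
    Returns:        (batch, n_features) per-text feature vectors
    """
    feats = sae.encode(resid_acts)                        # (batch, seq, n_features)
    mask = attention_mask.unsqueeze(-1).to(feats.dtype)   # (batch, seq, 1)
    return (feats * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
```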
Correcting the padding issue alone revealed a progressive pattern: specialists emerged for 5/7 categories at layer 6, increasing to 7/7 by layer 10. Correcting both issues — the approach used in this analysis — revealed specialists for all seven categories at every layer:
| Layer | Specialists Found |
| --- | --- |
| 6 | 7/7 |
| 8 | 7/7 |
| 10 | 7/7 |
| 11 | 7/7 |
Cross-layer comparison
Comparing specialist patterns across layers revealed a clear hierarchy of category difficulty:
Categories with strong early specialists (found already at layer 6): Social, Math, and Python achieved the highest scores (5-10); each has distinctive, homogeneous surface-level markers such as emojis, mathematical symbols, and Python keywords.
Categories requiring deeper processing: Conversational, Formal, Non-English, and URLs all showed low scores (1-2) across layers, but likely for different reasons:
Conversational and Formal lack distinctive surface markers entirely; their distinctiveness lies in tone and pragmatics.
Non-English has highly distinctive but heterogeneous markers — Chinese characters differ completely from Arabic script, which differs from Latin-script French. No single feature can capture this diversity.
URLs face syntactic overlap with other domains (file paths, time notation), reducing selectivity.
These results are shown in Figure 7.
Figure 7: Detailed view of specialist scores by layer
Token-level activation
For each identified specialist feature, token-level visualization was used to examine precisely which tokens triggered activation. This analysis produced the most surprising finding of the project.
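A sketch of how per-token activations for a single feature can be read out with TransformerLens and SAELens is shown below; the hook name and the example feature index are illustrative, and the notebook's actual visualization code differs.

```python
def token_activations(model, sae, text, feature_id, hook_name="blocks.8.hook_resid_pre"):
    """Return (token strings, one feature's activation at each token position)."""
    tokens = model.to_tokens(text)                  # includes the prepended BOS token
    _, cache = model.run_with_cache(tokens)
    feats = sae.encode(cache[hook_name])            # (1, seq, n_features)
    return model.to_str_tokens(text), feats[0, :, feature_id]

# e.g. inspect the math specialist (Feature #22917) on a short expression:
# str_toks, acts = token_activations(model, sae, "2 + 2 = 4", 22917)
```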
The expectation was that specialist features would respond to semantic content: that a “math specialist” would activate on mathematical concepts, a “Python specialist” on programming concepts, and so forth. Instead, the specialists appear to detect syntactic and symbolic patterns rather than high-level topics, as shown in the selected examples below:
Math specialist (Feature #22917): Fires primarily on arithmetic operators - +, =, ^, / - regardless of surrounding context. The feature responds to the symbols themselves, not to mathematical meaning. The activation of this specialist against the math sample text is shown in Figure 8.
Figure 8: Activation of math specialist (Feature #22917) against sample math text
Python specialist (Feature #15983): Fires on Python structural syntax, particularly “):” patterns at the end of function and class definitions, closing parentheses, and Python keywords like “for”, “in”, and “if” in comprehensions. It appears to detect Python's grammatical markers, not programming concepts. The activation of this specialist against the Python sample text is shown in Figure 9.
Figure 9: Activation of Python specialist (Feature #15983) against sample Python text
These results have an important implication: cross-domain activation is expected and appropriate. The math specialist activates on Python code that contains arithmetic operations, because the operators are present. The Python specialist activates on mathematical expressions that use parentheses, because the syntax overlaps. These are not errors; they reveal that SAE features learn efficient low-level patterns that correlate with human-defined categories, rather than learning the categories directly.
V. Implications and next steps
A note on scope
These findings derive from 70 synthetic samples on GPT-2 Small (124M parameters); generalization to larger models is untested.
Next steps for research
This project was intentionally limited in scope - a learning exercise rather than a research contribution. Several directions would extend the work:
Attention and MLP SAEs
The analysis was restricted to residual stream SAEs because pre-trained SAEs for attention outputs and MLP outputs were not available for GPT-2 Small. These components play different roles in transformer architecture: attention mechanisms handle relationships between tokens, while MLPs transform individual token representations. Comparing specialist patterns across component types could reveal whether certain categories are processed primarily through attention (contextual relationships) versus MLPs (token-level transformations). This would require either training custom SAEs or awaiting broader SAE releases.
Larger models
GPT-2 Small was chosen for tractability, but frontier models operate at a different scale: hundreds of billions of parameters rather than 124 million. It remains an open question whether the patterns observed here (deeper layers developing sharper specialists, syntax-level rather than semantic-level detection) hold in larger models, or whether scale introduces qualitatively different phenomena. Recent work by Anthropic on Claude suggests that interpretability techniques can scale, but replication across model families would strengthen confidence in the findings.
Safety-relevant features
The categories examined in this project were chosen for convenience and clarity: code, math, URLs, and so forth. A more consequential application would search for features related to safety-relevant behaviors: deception, sycophancy, uncertainty, refusal. Can SAEs identify features that activate when a model is about to produce a hallucination? When it is being evasive? When it is expressing inappropriate confidence? Such features, if they exist and can be reliably identified, would have direct applications in AI safety monitoring.
Circuit analysis
This project examined individual features in isolation. A richer understanding would trace how features connect - how an input activates a cascade of features across layers that ultimately produces an output. This is the domain of circuit analysis, which aims to reverse-engineer not just what features exist but how they interact. The specialist features identified here could serve as starting points for circuit tracing: given that Feature #22917 responds to arithmetic operators, what downstream computations does it feed into? How does it contribute to the model’s ultimate output?
Methodological rigor
The padding masking bug revealed how easily interpretability findings can be distorted by preprocessing choices. A systematic study of such methodological pitfalls - cataloging the ways that activation extraction, averaging, normalization, and thresholding can affect results - would benefit the field. Interpretability research is only as reliable as its methods; making those methods robust deserves explicit attention.
VI. Closing reflection
The question that motivated this project - can we understand what AI models are actually doing? - remains only partially answered. Sparse autoencoders offer a window into model internals, but the features discovered represent statistical regularities that correlate with, but do not constitute, understanding.
This limitation is not unique to this project. Within the interpretability research community, SAEs have become both a dominant paradigm and a subject of growing debate. Anthropic’s leadership has described interpretability as being “on the verge” of a breakthrough, with SAE-based techniques enabling researchers to trace model reasoning through identifiable features and circuits. Their research on Claude Sonnet identified features exhibiting “depth, breadth, and abstraction”, including concepts that fire across languages and modalities, such as a “Golden Gate Bridge” feature that activates on English text, Japanese discussions, and images alike. Others are more skeptical. A March 2025 progress update from Google DeepMind’s mechanistic interpretability team reported that SAE features proved less effective than simpler techniques when applied to a practical safety task - detecting harmful intent in user prompts - and announced that the team was “deprioritizing fundamental SAE research” as a result. Their concern: SAEs may find statistical regularities useful for reconstruction without capturing the concepts most relevant to model behavior. The findings from this project do not resolve this debate, but they are broadly consistent with the skeptical view.