Examining an AI model's focus on form vs. ideas
Executive summary
This is the second installment of a series of analyses exploring basic AI mechanistic interpretability techniques. While Part 1 in this series is summarized below, a full review of that article provides helpful context for the analysis that follows.
Key findings
Use of pretrained residual stream sparse autoencoders (“SAEs”) in conjunction with GPT-2 Small reveals activation patterns that suggest the SAEs’ most active specialist features react primarily to a given text string’s syntax vs. that string’s semantics. This is shown via the following results:
Minimal overlap of specialist feature activation between syntactically different, but semantically identical texts (e.g. “2 + 2” vs. “two plus two”)
The topics tested that most diverged from standard English prose (Math, emoji-laden Social, and non-English) generally demonstrated more specialized features.
Within topics, the various surface forms generated relatively similar levels of feature specialization with minimal feature overlap among them.
Overall activation profile (all 24,576 SAE features, not just specialists) is primarily driven by semantics, with different forms of the same concept consistently clustering within the model’s representational space
The effect of specialist activation on model accuracy remains inconclusive, as the model used in this analysis was unable to complete sample math equations
Confidence in these findings:
Confidence in analysis methodology: moderate-to-high
Confidence in the ability to apply these findings to more modern models: low
Introduction
This analysis constitutes the second installment in a multi-part series documenting, in relatively simple terms, my exploration of key concepts related to machine learning (“ML”) generally and mechanistic interpretability (“MI”) specifically. The intended application is to further the understanding, and management, of model behavior with an eye toward reducing societally harmful outputs.
This analysis does not purport to encapsulate demonstrably new findings in the field of MI. It is inspired by, and attempts to replicate at a small scale, pioneering analysis done in the field of MI by Anthropic and others, as cited below. My aspiration is to add to the understanding of, discourse around, and contributions to, this field by a wide range of key stakeholders, regardless of their degree of ML or MI expertise.
Methodology and key areas of analysis
Key areas of analysis
This analysis seeks to answer the following question: “To what degree is a model’s response influenced by syntax (surface form) vs. semantics (meaning)?”
More specifically, this Phase 2 analysis tests the hypotheses listed in Figure 1 below.
Figure 1: Phase 2 hypotheses
Hypothesis
Question
Predictions
H1: Specialist Specificity
Do specialist features primarily detect syntax (surface form) or semantics (meaning)?
If syntax: different forms → different specialists; low cross-form overlap. If semantics: same topic → same specialists regardless of form
H2: Representational Geometry
Does the overall SAE representation (all features, not just specialists) cluster by syntax or by semantics?
If syntax: within-form similarity > within-topic similarity. If semantics: within-topic similarity > cross-topic similarity.
H3: Behavioral Relevance
Does specialist activation predict model behavior (e.g., accuracy on math completions)?
If yes: higher specialist activation → better task performance; activation correlates with correctness.
Methodology
The methodology used in this analysis is broadly reflective of Phase 1 in this series, in that it uses GPT-2 Small, a relatively tractable, 124 million parameter open-source model obtained via TransformerLens and the relevant pretrained residual stream SAEs from Joseph Bloom, Curt Tigges, Anthony Duong, and David Chanin at SAELens.
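To make the setup concrete, the sketch below shows one way to load GPT-2 Small and a layer-11 residual stream SAE and obtain per-feature activations for a text sample. The release and SAE identifiers ("gpt2-small-res-jb", "blocks.11.hook_resid_pre") are assumptions based on SAELens' published naming and may not match the exact configuration used in the notebook.

```python
# Minimal setup sketch (identifiers are assumptions, as noted above): load GPT-2 Small
# via TransformerLens and a pretrained layer-11 residual stream SAE via SAELens, then
# encode one sample text into the 24,576-dimensional SAE feature space.
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small, 124M parameters

sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",        # assumed release name for the GPT-2 Small residual SAEs
    sae_id="blocks.11.hook_resid_pre",  # assumed layer-11 hook point
)

tokens = model.to_tokens("2 + 2")
_, cache = model.run_with_cache(tokens)
feature_acts = sae.encode(cache[sae.cfg.hook_name])  # shape: [batch, seq, 24576]
```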
To test how the model distinguishes between syntax and semantics, I then used an LLM to help create 241 sample-text matched pairs spanning 7 distinct categories. Each matched-pair set included three different variations of the same concept, which varied primarily in their use of distinctive symbology. Notable exceptions to this approach were as follows:
The “Python” category used two matched surface forms (“code” and “pseudocode”) per matched pair, instead of three.
The “Non-English” category contained matched pairs with the same general (but not identical, due to translation irregularities) phrases expressed in three non-English languages (Spanish, French, and German). Since these samples are not identical versions of the same idea, this category tests a slightly different version of the syntax vs. semantics hypothesis: whether the “Non-English” feature identified in Phase 1 responds to non-English text generally or to a specific language.
An abbreviated list of those matched pairs is shown in Figure 2 below. The full matched pairs list is contained in the Phase 2 Jupyter notebook available at this series’ GitHub repository.
Figure 2: Sample of Phase 2 matched pairs
Topic
Form
Sample Texts
Math (Simple)
Symbolic
• 8-3 • 5x3 • 3^2
Math (Simple)
Verbal
• Eight minus three • Five times three • Three squared
Math (Simple)
Prose
• Three less than eight • Five multiplied by three • Three to the power of two
Math (Complex)
Symbolic
• sin²θ + cos²θ = 1 • ∫x² dx = x³/3 + C • d/dx(x²) = 2x
Math (Complex)
Verbal
• Sine squared theta plus cosine squared theta equals one • The integral of x squared dx equals x cubed over three plus C • The derivative of x squared equals two x
Math (Complex)
Prose
• The square of the sine of theta plus the square of the cosine of theta equals one • The integral of x squared with respect to x is x cubed divided by three plus a constant • The derivative with respect to x of x squared is two times x
Python
Code
• def add(x, y): return x + y • for i in range(10): • if x > 0: return True
Python
Pseudocode
• Define function add that takes x and y and returns x plus y • Loop through numbers zero to nine • If x is greater than zero then return true
Non-English
Spanish
• Hola, ¿cómo estás? • Buenos días • Gracias
Non-English
French
• Bonjour, comment ça va? • Bonjour • Merci
Non-English
German
• Hallo, wie geht es dir? • Guten Morgen • Danke
Social
Full Social
• omg that's so funny 😂😂😂 • this slaps fr fr 🔥🎵 • just got coffee ☕ feeling good ✨
Social
Partial Social
• omg thats so funny • this slaps fr fr • just got coffee feeling good
Social
Standard
• That's very funny • This is really good • I just got coffee and I feel good
Formal
Highly Formal
• The phenomenon was observed under controlled laboratory conditions. • Pursuant to Article 12, Section 3 of the aforementioned statute. • The results indicate a statistically significant correlation (p < 0.05).
Formal
Moderately Formal
• We observed the phenomenon in controlled lab conditions. • According to Article 12, Section 3 of the law. • The results show a significant correlation.
Formal
Plain
• We saw this happen in the lab • Based on what the law says. • The results show a real connection.
Conversational
First Person
• I think the meeting went pretty well today. • I'm planning a trip to Japan. • I need to finish this project by Friday.
Conversational
Third Person
• She thinks the meeting went pretty well today. • He's planning a trip to Japan. • They need to finish this project by Friday.
Conversational
Neutral
• The meeting seems to have gone well today. • There are plans for a trip to Japan. • The project needs to be finished by Friday.
In addition to recording the activation measurements provided by the relevant SAEs, I used the calculations listed below to develop a more comprehensive view of the model’s internal representation.
Specialist score
To help conceptualize and quantify the selectivity of a given SAE feature vis-a-vis the current category of sample texts, I used the following calculation:
specialist score = n_inside − n_outside
wherein:
n_inside = the number of text samples within a given category for which this feature has an activation level ≥ 5.0
n_outside = the number of text samples outside a given category for which this feature has an activation level ≥ 5.0
It should be noted that the threshold activation level of 5.0 was chosen somewhat arbitrarily, but I do not believe this is a significant issue, as it is applied uniformly across features and categories.
Critically, when calculating specialist features, the comparison set for each topic + form type includes all topic + form combinations except other forms of the same topic. For example, if the topic + form being analyzed is math_complex + symbolic, the contrasting sets would include python + code, python + pseudocode, non-English + French, non-English + German, etc. but not math_complex + verbal or math_complex + prose. This design was selected to avoid skewing the results, since a math_complex + symbolic feature may be more activated by other math_complex texts, relative to texts associated with an unrelated subject, such as Python or non-English text.
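A minimal sketch of how this score might be computed is shown below. The data layout (a dictionary mapping each topic + form key to a per-sample activation array for one feature) and the variable names are assumptions for illustration, not the notebook's actual structure.

```python
import numpy as np

THRESHOLD = 5.0  # activation level at or above which a feature counts as "firing"

def specialist_score(target_key: str, acts_by_group: dict) -> int:
    """Sketch of the specialist score (n_inside - n_outside) for one feature.

    acts_by_group maps "topic+form" keys (e.g. "math_complex+symbolic") to a NumPy
    array of that feature's per-sample activations. Per the contrast-set design
    described above, other forms of the same topic are excluded from n_outside.
    """
    topic = target_key.split("+")[0]
    n_inside = int((acts_by_group[target_key] >= THRESHOLD).sum())
    n_outside = sum(
        int((acts >= THRESHOLD).sum())
        for key, acts in acts_by_group.items()
        if not key.startswith(topic + "+")  # skip all forms of the same topic
    )
    return n_inside - n_outside
```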
Gini coefficient:
One of the means I used to better understand the geometry of the top n most active features was the calculation of a Gini coefficient for those features. The calculation sorts the activation levels and then compares a rank-weighted sum of activations (wherein each activation is weighted by its ranking) against the unweighted sum. The Gini coefficient ranges from 0 to 1, wherein 0 indicates a perfectly equal distribution and 1 a perfectly unequal distribution (e.g. all activation resides in a single feature).
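Below is a sketch of this calculation using one common rank-based formulation of the Gini coefficient; it assumes non-negative activation values.

```python
import numpy as np

def gini(activations: np.ndarray) -> float:
    """Gini coefficient of a set of non-negative activations (sketch).

    0 = activation spread perfectly evenly across features; values near 1 =
    nearly all activation concentrated in a single feature.
    """
    x = np.sort(np.asarray(activations, dtype=float))  # ascending order
    n = len(x)
    if n == 0 or x.sum() == 0:
        return 0.0
    ranks = np.arange(1, n + 1)
    # Rank-weighted sum vs. unweighted sum, per the description above.
    return float((2 * (ranks * x).sum()) / (n * x.sum()) - (n + 1) / n)
```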
Concentration ratio (referred to as “Top5” in the table below):
To further enable an easy understanding of the top n feature geometry for a given text sample, I also calculated a simple concentration ratio: the total activation of the top n features for a given sample, relative to the total overall activation for that sample. While this concentration ratio is similar to the Gini calculation described above, it tells a slightly different story. The Gini describes the dispersion of the top n features relative to one another, whereas the concentration ratio describes the prominence of those top n features relative to the overall activation associated with that sample text.
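A corresponding sketch of the concentration ratio, under the same assumptions as the Gini sketch above, follows.

```python
import numpy as np

def top_n_concentration(activations: np.ndarray, n: int = 5) -> float:
    """Sketch of the "Top5" concentration ratio: the share of a sample's total
    activation captured by its n most active features."""
    x = np.asarray(activations, dtype=float)
    total = x.sum()
    if total == 0:
        return 0.0
    top_n = np.sort(x)[-n:]  # the n largest activations
    return float(top_n.sum() / total)
```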
Feature Overlap (raw count)
Core to the matched pairs approach used in this analysis is the comparison of features activated among the various matched text pairs used. Feature overlap is a simple metric that counts the number of shared specialist features among the top 5 specialist features activated by two topic + surface form text samples.
For example, if math_simple + symbolic activates the following features: {1, 2, 3, 4, 5} and math_simple + verbal activates the following features: {2, 6, 7, 8, 9}, then the feature overlap would be 1, corresponding with feature #2.
Jaccard Similarity / Mean Jaccard:
Another metric used to measure the degree of feature overlap is Jaccard Similarity, which is essentially a scaled version of the feature overlap described above. It is calculated as follows:
Jaccard similarity = |A ∩ B| / |A ∪ B|
wherein:
“A” and “B” represent the sets of specialist features activated by two different surface form variations of the same concept. This value ranges from 0 (no specialist features shared between text sets A and B) to 1 (the same specialist features are activated by text sets A and B).
Using the same example shown for feature overlap, if math_simple + symbolic activates features {1, 2, 3, 4, 5} and math_simple + verbal activates features {2, 6, 7, 8, 9}, then the Jaccard Similarity would be 1/9 ≈ 0.11.
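Both metrics reduce to simple set operations; the sketch below reproduces the worked example above (the feature IDs are purely illustrative).

```python
def feature_overlap(a: set, b: set) -> int:
    """Raw count of specialist features shared between two top-5 sets."""
    return len(a & b)

def jaccard_similarity(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|: 0 = no shared specialists, 1 = identical sets."""
    if not (a | b):
        return 0.0
    return len(a & b) / len(a | b)

# Worked example from the text above:
symbolic = {1, 2, 3, 4, 5}
verbal = {2, 6, 7, 8, 9}
print(feature_overlap(symbolic, verbal))     # 1 (feature #2)
print(jaccard_similarity(symbolic, verbal))  # 1/9 ≈ 0.11
```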
Cosine Similarity:
To quantify and compare each sample text’s representational geometry (e.g. the overlapping features and those features’ activation levels), I used cosine similarity for those pairs, which is calculated as follows:
cosine similarity(A, B) = (A · B) / (‖A‖ × ‖B‖)
wherein:
A and B are two activation vectors (each vector is 24,576 dimensions with one value per SAE feature)
A · B ("A dot B") means multiplying each corresponding pair of values and summing them all up. So (A₁ × B₁) + (A₂ × B₂) + ... + (A₂₄₅₇₆ × B₂₄₅₇₆)
‖A‖ and ‖B‖ (the magnitudes of A and B, respectively) represent the “length” of each vector, calculated by taking the square root of the sum of its squared values.
The cosine similarity ranges from 0 (no features in common between vectors A and B) to 1 (A and B activate the same features in the same proportions, i.e. the vectors point in the same direction).
This logic essentially extends the Jaccard Similarity described above. Whereas Jaccard Similarity looks at overlapping features in a binary sense (e.g. overlapping features with 0.1 activation are treated the same as overlapping features with 50 activation), cosine similarity accounts for that activation level, thus providing a richer picture of the representational space.
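A minimal sketch of this calculation over two full 24,576-dimension activation vectors follows; it assumes the vectors are held as NumPy arrays.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two SAE activation vectors (24,576 values each)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)
```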
Cohen's d (Effect Size):
To create a simple, standardized measure of the difference in mean activation between the text samples in a given topic + form type and all contrasting topic + form types, I used Cohen’s d, which is calculated as follows:
d = (μ₁ − μ₂) / σ_pooled
wherein:
μ₁ = mean activation of the specialist feature on its target category (e.g., mean activation of feature #7029 on math_simple + symbolic texts)
μ₂ = mean activation of that same feature on the contrast set (all non-Math texts)
σ_pooled = the pooled standard deviation of groups 1 and 2, which is essentially the square root of the weighted average of the two groups' variances. This puts the difference in standardized units.
The reason for using this measurement is simple: to provide a scaled, comparable way to determine how “selective” a given feature’s activation is to a given topic + surface form combination vs. the activation of that same feature vis-a-vis the contrasting topic + surface form combinations.
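The sketch below illustrates the calculation, assuming the target and contrast activations for a single feature are available as arrays; the pooled standard deviation uses the standard sample-variance weighting.

```python
import numpy as np

def cohens_d(target_acts: np.ndarray, contrast_acts: np.ndarray) -> float:
    """Cohen's d for one feature: standardized difference in mean activation
    between its target category and the contrast set (sketch)."""
    n1, n2 = len(target_acts), len(contrast_acts)
    var1, var2 = target_acts.var(ddof=1), contrast_acts.var(ddof=1)
    # Pooled standard deviation: square root of the weighted average of the variances.
    pooled_sd = np.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    if pooled_sd == 0:
        return 0.0
    return float((target_acts.mean() - contrast_acts.mean()) / pooled_sd)
```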
IV. Results
Summary of results
Hypothesis
Question
Result
H1: Specialist Specificity
Do specialist features primarily detect syntax (surface form) or semantics (meaning)?
Syntax. Low cross-form overlap (0.13 mean Jaccard among form types); consistently different specialist features used for symbolic vs. non-symbolic syntax within matched pairs.
H2: Representational Geometry
Does the overall SAE representation (all features, not just specialists) cluster by topic or by form?
Semantics. Cosine similarity within topic (0.50) exceeds cross-topic (0.14); 2-D PCA visualization shows clear topic-based clustering.
H3: Behavioral Relevance
Does specialist activation predict model behavior (e.g., accuracy on math completions)?
Inconclusive. GPT-2 Small shows clear signs of pattern matching, with near-zero ability to solve math problems.
Finding 1: Specialist features are primarily focused on syntax
The first avenue of analysis flowed from an observation made, but not rigorously tested, in Phase 1 of this series: that specialist features seem to be most attuned to syntax, as opposed to semantics.
The first way I tested this hypothesis was by applying the specialist features obtained via the Phase 1 texts to the new Phase 2 matched-pairs texts. If the specialists derived from the Phase 1 text samples were attuned to meaning, one would logically expect those specialists to activate in response to conceptually similar Phase 2 matched-pairs text samples. The results of that analysis are summarized in Figure 3 below. The consistently low activation of the Phase 1-derived features across most Phase 2 topic + form combinations indicates that those features were tied to the specific wording used in the Phase 1 analysis rather than to the underlying concepts in those Phase 1 texts. Further reinforcing this view, the categories that did show modest Phase 1 → Phase 2 cross-activation were those with significant use of distinct symbology, such as social + full social (emojis) and math_complex + symbolic (mathematical operators).
Figure 3: Phase 1 → Phase 2 specialist activation by Phase 2 topic + form
The second way I tested syntax vs. semantics specificity was via Jaccard similarity applied to the various surface forms within a given topic. If specialist features focused on meaning, one would expect relatively high Jaccard similarities across surface forms expressing the same concept. Figures 4 and 5 below illustrate the output of that analysis, in which the overall mean Jaccard similarity was a very modest 0.13. This limited specialist feature overlap is indicative of syntax-focused specialist features.
Figure 4: Jaccard similarity of top 5 specialist features by form types (layer 11)
Figure 5: Shared specialist features among top 5 specialist features (layer 11)
Reinforcing the idea that specialist features are primarily attuned to syntax rather than semantics were the emergence and attributes of specialist features derived from the various Phase 2 matched pairs. As shown in Figures 6 and 7 below, the features that emerge in the topics that most deviate from standard English prose (Math, Social, non-English, etc.) are generally more selective than those in other topics. Furthermore, this specialization emerges in earlier layers when analyzing the more symbolically laden surface forms of those topics. Finally, within these topics, the three surface forms were associated with relatively similar specialist scores (e.g., Math_Simple: Symbolic 33, Verbal 31, Prose 29) but activated largely different features (e.g., Symbolic: #7029 vs. Verbal and Prose: #4163, with minimal overlap among the remaining top features). That comparable scores coincide with different features provides the clearest evidence that specialists detect surface syntax rather than meaning.
Figure 6: Phase 2 text specialist scores by topic + form combination
Figure 7: Phase 2 text specialist scores by form type, aggregated across topics
Finding 2: The overall representation groups primarily by semantics
Following from the first hypothesis that specialist features are primarily activated by unique surface features, I initially supposed that the overall representation of the text samples would also be grouped by syntax, as opposed to semantics. This hypothesis turned out to be incorrect, as the results strongly suggest grouping by meaning.
The primary analysis used to explore this idea was cosine similarity. As shown in Figure 8 below, I computed a 20×20 matrix of pairwise similarities, which revealed clear in-topic clustering. Mean cosine similarity for pairs with the same topic but different forms was 0.50, whereas mean cosine similarity for pairs with different topics, regardless of form, was 0.14. The large and statistically significant difference in these results suggests that while the top surface features react primarily to syntax, the overall representation (accounting for all 24,576 SAE features) encodes semantics.
One possible interpretation of these results is a two-tier representational structure in which highly-activated specialist features act as syntactic detectors, responding to specific surface markers like mathematical symbols or Python keywords. Beneath this syntactic surface layer, however, is a broader pattern of activation across thousands of features that, in aggregate, encodes semantics. The model simultaneously “notices” that an input contains mathematical symbols (activating syntax-specific features) while also representing it in a region of activation space shared with other mathematical content, regardless of surface form. This explanation is broadly consistent with Anthropic’s Scaling Monosemanticity, which was one of the primary inspirations for my analysis. Finally, it should also be noted that here, “topic” should be understood as a coarse, human-defined proxy for semantic content, not evidence of grounded or task-usable meaning.
Figure 8: Pairwise cosine similarity matrix (layer 11)
Reinforcing this view of topic-centric overall representation are the results of the 2-D PCA analysis shown in Figure 9. While this PCA projection captures only ~13% of the total variance in the representation, it clearly shows that text samples of the same topic are generally grouped together with regard to the two principal components projected.
Figure 9: 2-D PCA Projection (layer 11)
Interestingly, the PCA projection shows that some topics (formal and conversational English) cluster relatively tightly while other topics (Math and Python) are far more dispersed. This is summarized via each topic + surface type’s mean distance from the PCA centroid, shown in Figure 10 below. One potential explanation for this behavior lies in the composition of the training data likely used for GPT-2 Small. If the training corpus was disproportionately English-language prose (as is the case with internet data), it would make sense that the model is more “proficient” at representing those types of text, relative to less common text such as mathematical equations and computer code.
Figure 10: Mean distance from PCA centroid (layer 11)
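For readers wishing to reproduce projections like Figures 9 and 10, the sketch below shows one way to compute a 2-D PCA and centroid distances with scikit-learn. The input layout (one mean activation vector per topic + form combination) is an assumption and may differ from the notebook's implementation.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_2d(activation_matrix: np.ndarray):
    """Sketch: 2-D PCA projection of mean SAE activation vectors.

    activation_matrix is assumed to have one row per topic + form combination
    and one column per SAE feature (24,576 columns).
    """
    pca = PCA(n_components=2)
    coords = pca.fit_transform(activation_matrix)      # [n_groups, 2] projection (Figure 9 analogue)
    explained = pca.explained_variance_ratio_.sum()    # share of variance captured (~13% here)
    centroid = coords.mean(axis=0)
    distances = np.linalg.norm(coords - centroid, axis=1)  # basis for Figure 10's centroid distances
    return coords, explained, distances
```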
Finding 3: GPT-2 Small is unsuitable for testing whether feature activation is linked to math-completion accuracy
The third hypothesis tested whether specialist feature activation levels are associated with the model’s capacity to correctly complete simple mathematical problems. The objective here was to begin drawing a linkage between the observed specialist feature activation and the model’s outputs, with math chosen due to its objectivity.
Unfortunately, the near-zero accuracy of the model’s completions, regardless of feature activation levels, clearly illustrated the model’s inability to perform arithmetic and its reliance on pattern matching. Figure 11 shows a sample of these outputs.
Figure 11: Sample of attempted mathematical completions by GPT-2 Small
# | Form | Prompt | Expected | Actual completion | Correct / Incorrect
1 | symbolic | 2+2 | 4 | +2+2+2+2+2 | Incorrect
2 | verbal | two plus two | 4 | plus two plus two plus two plus two plus two | Incorrect
3 | prose | the sum of two and two | 4 | -thirds of the total number of persons in the | Incorrect
4 | symbolic | 4+5 | 9 | +6+7+8+9+10 | Incorrect
5 | verbal | four plus five | 9 | years ago.\n\nI'm not sure if | Incorrect
6 | prose | the sum of four and five | 9 | hundred and sixty-five years of the history of | Incorrect
7 | symbolic | 1+1 | 2 | +1+1+1+1+1 | Incorrect
8 | verbal | one plus one | 2 | .\n\nI'm not sure if this is | Incorrect
9 | prose | the sum of one and one | 2 | -half of the sum of the sum of the | Incorrect
10 | symbolic | 3+6 | 9 | 3+6\n\n3+6 | Incorrect
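For reference, a minimal sketch of how such completions can be generated with TransformerLens follows; the decoding settings (greedy decoding, 10 new tokens) are assumptions rather than the notebook's exact parameters.

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

# Illustrative prompts matching the matched-pair forms above.
prompts = ["2+2", "two plus two", "the sum of two and two"]
for prompt in prompts:
    # Greedy decoding keeps the completions deterministic for comparison.
    completion = model.generate(prompt, max_new_tokens=10, do_sample=False, verbose=False)
    print(repr(completion))
```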
V. Potential avenues for further research
Use of a larger, more capable model
While the choice of GPT-2 Small for the purposes of this analysis was primarily practical in nature and designed for easy replication by the reader, it remains unclear whether its representational behaviors (most notably the 2-tier structure suggested by the H2 analysis) and inability to do mathematical completions (as shown in the H3 analysis) would extend to more modern, more capable models. A replication of these analyses with a more capable model would serve both theoretical and practical purposes, allowing one to better understand how model representations have evolved over time and how this knowledge could be applied to the highly-capable models used by millions of people daily.
Steering experiments
The analyses in this series have been observational, in that they measure which features activate in response to various inputs and how those activation patterns vary by input type. A more direct test of whether these features directly influence model behavior would involve artificially activating (“clamping”) specific features and observing the effect on outputs. Anthropic’s Golden Gate Claude analysis demonstrated this approach, amplifying a feature associated with the Golden Gate Bridge and observing the model’s resulting fixation on that topic.
A similar approach applied to the syntactic specialists identified in this analysis (likely in conjunction with use of a more capable model, as noted above) could potentially reveal whether these features merely correlate with input patterns or actively shape model outputs. For example, clamping a feature associated with exponentiation while prompting the model with “what is two plus two?” might bias the output toward “22“, as opposed to simply answering “4”. Such a response would serve as evidence that the feature’s activation influences the model’s mathematical interpretation, not just its pattern recognition.
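As a rough illustration of what such a clamping experiment might look like, the sketch below adds a scaled SAE decoder direction to the residual stream via a TransformerLens hook. The feature index (taken from the #7029 example above), the clamping strength, and the hook point are all assumptions, not a tested recipe.

```python
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2")
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb", sae_id="blocks.11.hook_resid_pre"  # assumed identifiers
)

FEATURE_IDX = 7029     # illustrative: the math_simple symbolic specialist noted earlier
CLAMP_STRENGTH = 10.0  # arbitrary assumption; a real experiment would sweep this value

def clamp_feature(resid, hook):
    # Add the SAE decoder direction for the chosen feature at every position,
    # approximating "clamping" the feature on in the residual stream.
    return resid + CLAMP_STRENGTH * sae.W_dec[FEATURE_IDX]

tokens = model.to_tokens("what is two plus two?")
with model.hooks(fwd_hooks=[(sae.cfg.hook_name, clamp_feature)]):
    output = model.generate(tokens, max_new_tokens=10, do_sample=False, verbose=False)
print(model.to_string(output))
```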
Further exploration of a potential two-tier representational structure
One potential (but unproven) explanation for the simultaneous syntax centricity of specialist features and the semantically-based grouping of a token’s overall representation would be a two-tier representational structure within the model. One could imagine that in such a structure, there exists a “long tail” of thousands or millions of features, each with relatively low activation levels that in aggregate represent the majority of total activation and encode the token’s meaning. Proving the existence, composition, and behavior of that long tail of features could be of significant use in furthering the overall understanding of token representation and potentially, how to better control model outputs.
VI. Concluding thoughts
The analysis documented here represents the continuation of an honest and earnest exploration of MI fundamentals via the use of SAEs. While the results contained within affirm well-researched machine learning principles, replicating that research and methodically documenting it here helped me further my own understanding of those principles, including their examination from multiple angles of inquiry. Currently, the growth in model capabilities continues to outpace our understanding of, and ability to effectively control, those models. These types of analyses therefore serve not just to scratch an intellectual itch, but hopefully to inspire others toward a better understanding of this incredibly important topic.
I invite those interested to replicate this research using the Jupyter Notebooks available via this project’s GitHub repository and I welcome any and all comments, questions, or suggestions for improved methodologies or avenues for further research.