Sparse Autoencoder (SAE) is a dictionary learning method with the goal of learning monosemantic subspaces that map to high-level concepts (1). Several frontier AI labs have recently applied SAEs to interpret model internals and control model generations, yielding promising results (2, 3).
To build an intuitive understanding of this process, we can break down the SAE workflow as follows:
The learned features encoded in the columns and rows of the SAE encoder and decoder matrices carry significant semantic weight. In this post, we explore the geometric relationships between these feature vectors, particularly focusing on semantically opposite features (antonyms), to determine whether they exhibit meaningful geometric/directional relationships similar to those observed with embedded tokens/words in Language Models (4).
One of the most elegant and famous phenomena within LMs is the geometric relationships between a LM's word embeddings. We all know the equation "king - man + woman = queen" to come out of embedding algorithms like GloVe (5) and word2vec (6). We also find that these embedding algorithms perform well in benchmarks like the WordSim-353 in matching human ratings of word-pair similarity within the cosine similarity of their embeddings (7).
Since the encoding step of the SAE workflow is analogous to taking the cosine similarity between the input activation with all learned features in the encoder matrix, we should expect the learned features of the SAE encoder and decoder weights to carry similar geometric relationships to their embeddings. In this post, we explore this hypothesis, particularly focusing on semantically opposite features (antonyms) and their cosine similarities to determine whether or not they exhibit meaningful geometric relationships similar to those observed with embedded tokens/words in LMs.
To investigate feature geometry relationships, we identify 20 pairs of semantically opposite ("antonym") features from the explanations file of gemma-scope-2b-pt-res. We find these features for SAEs in the gemma-scope-2b-pt-res SAE family corresponding to layer 10 and 20. For comprehensiveness, we included all semantically similar features when multiple instances occurred in the feature space. The selected antonym pairs and their corresponding feature indices are as follows:
Concept Pair | Feature A (indices) | Feature B (indices) |
---|---|---|
1. Happiness-Sadness | Happiness (5444, 11485, 12622, 13487, 15674) | Sadness (6381, 9071) |
2. Sentiment | Positive (14655, 15267, 11, 579, 1715, 2243) | Negative (1500, 1218, 1495, 5633, 5817) |
3. Uniqueness | Unique: (5283, 3566, 1688) | General: (11216) |
4. Tranquility | Calm (7509, 13614) | Anxious (11961, 1954, 5997) |
5. Engagement | Excitement (2864) | Boredom (1618) |
6. Max-Min | Maximum (423) | Minimum (5602, 5950, 11484) |
7. Size | Large (5799, 4564, 16320, 14778) | Small (8021, 7283, 4110, 2266) |
8. Understanding | Clarity/Comprehension (1737, 13638) | Confusion (2824, 117, 12137, 11420) |
9. Pace | Speed (13936, 6974, 2050) | Slowness (11625) |
10. Outcome | Success (8490, 745, 810) | Failure (5633, 791, 6632) |
11. Distribution | Uniform (9252) | Varied (3193) |
12. Luminance | Bright (9852) | Dark (2778) |
13. Permanence | Temporary (2998, 10715) | Permanence (2029) |
14. Randomness | Systematic (9258) | Random (992, 4237) |
15. Frequency | Frequent (15474) | Infrequent (12633) |
16. Accuracy | Correct (10821, 12220, 12323, 13305) | Incorrect (7789) |
17. Scope | Local (3187, 3222, 7744, 15804, 14044) | Global (3799, 4955, 10315) |
18. Complexity | Simple (4795, 13468, 15306, 8190) | Complex (4260, 11271, 11624, 13161) |
19. Processing | Sequential (4605, 2753) | Parallel (3734) |
20. Connection | Connected (10989) | Isolated (9069) |
We also conduct the same experiments for Layer 20 of gemma-scope-2b-pt-res. For the sake of concision, the feature indices we use for those layers can be found in the appendix.
To analyze the geometric relationships between semantically opposite features, we compute their cosine similarities. This measure quantifies the directional alignment between feature vectors, yielding values in the range [-1, 1], where:
Given that certain semantic concepts are represented by multiple feature vectors, we conduct the following comparison approach:
To establish statistical significance and validate our findings, we implemented several control experiments:
To contextualize our results, we generated a baseline distribution of cosine similarities using 20 randomly selected feature pairs. This random baseline enables us to assess whether the geometric relationships observed between antonym pairs differ significantly from chance alignments. As done with the semantically opposite features, we record the maximum and minimum cosine similarity values as well as the mean sampled cosine similarity.
To establish the bounds of possible geometric relationships within our feature space, we conducted an optimized exhaustive search across all feature combinations in both the encoder and decoder matrices. This analysis identifies:
This comprehensive control framework enables us to establish a full range of possible geometric relationships within our feature space, as well as understand the semantic relationships between maximally aligned and opposed feature pairs in the feature space.
As a validation measure, we computed self-similarity scores by calculating the cosine similarity between corresponding features in the encoder and decoder matrices. This analysis serves as a positive control, verifying our methodology's ability to detect strong geometric relationships where they are expected to exist.
As an auxiliary experiment, we experiment with the effectiveness of compositional steering approaches using SAE feature weights. We accomplish this by injecting 2 semantically opposite features from the Decoder weights of GemmaScope 2b Layer 20 during model generation to explore effects on model output sentiment.
We employ the following compositional steering injection techniques using pyvene’s (8) SAE steering intervention workflow in response to the prompt “How are you?”:
We evaluate these injection techniques using a GPT-4o automatic labeller conditioned on few-shot examples of baseline (non-steered) examples of model responses to “How are you?” for grounding. We generate and label 20 responses for each steering approach and ask the model to compare the sentiment of steered examples against control responses.
Our analysis reveals that semantically opposite feature pairs do not exhibit consistently strong geometric opposition in either the encoder or decoder matrices. While some semantically related features exhibit outlier cosine similarity, this effect isn't at all consistent. In fact, average cosine similarity hovers closely around random distributions.
Relationship Type | Cosine Similarity | Feature Pair | Semantic Concept |
---|---|---|---|
Average (Antonym Pairs) | 0.00111 | ||
Strongest Alignment | 0.29599 | [13487, 6381] | Happy-Sad |
Strongest Opposition | -0.11335 | [9852, 2778] | Bright-Dark |
Average (Random Pairs) | 0.01382 | ||
Strongest Alignment | 0.06732 | [5394, 3287] | Compositional Structures-Father Figures |
Strongest Opposition | -0.05583 | [11262, 11530] | Mathematical Equations - Numerical Values |
Relationship Type | Cosine Similarity | Feature Pair | Semantic Concept |
---|---|---|---|
Average (Antonym Pairs) | 0.01797 | ||
Strongest Alignment | 0.09831 | [16320, 8021] | Large-Small |
Strongest Opposition | -0.0839 | [9852, 2778] | Bright-Dark |
Average (Random Pairs) | 0.01297 | ||
Strongest Alignment | 0.06861 | [8166, 13589] | User Interaction - Mathematical Notation |
Strongest Opposition | -0.03626 | [5394, 3287] | Compositional Structure - Father Figures |
Relationship Type | Cosine Similarity | Feature Pair | Semantic Concept |
---|---|---|---|
Average (Antonym Pairs) | 0.04243 | ||
Strongest Alignment | 0.49190 | [11341, 14821] | Day-Night |
Strongest Opposition | -0.07435 | [15216, 2604] | Positive-Negative |
Average (Random Pairs) | -0.00691 | ||
Strongest Alignment | 0.02653 | [2388, 13827] | Programming Terminology - Personal Relationships |
Strongest Opposition | -0.03561 | [5438, 11079] | Legal Documents - Chemical Formuations |
Relationship Type | Cosine Similarity | Feature Pair | Semantic Concept |
---|---|---|---|
Average (Antonym Pairs) | 0.03161 | ||
Strongest Alignment | 0.15184 | [10977, 12238] | Bright-Dark |
Strongest Opposition | -0.07435 | [15216, 2604] | Positive-Negative |
Average (Random Pairs) | 0.02110 | ||
Strongest Alignment | 0.09849 | [3198, 6944] | Cheese Dishes - Cooking |
Strongest Opposition | -0.03588 | [5438, 11079] | Legal Documents - Chemical Formulations |
To view the exact feature explanations for index pairs shown, one can reference Neuronpedia's Gemma-2-2B page for feature lookup by model layer.
As another control experiment, we perform an optimized exhaustive search throughout all of the Decoder feature space to find the maximally and minimally aligned features to examine what the most opposite and aligned features look like.
From analyzing the features found, we find there to be a lack of consistent semantic relationships between maximally and minimally aligned features in the decoder space. While there do exist a select few semantically related concepts, specifically in features that are maximally aligned, we find they do not appear consistently. We find consistent semantic misalignment in opposite pointing features. Below are the results (for the sake of formatting we only include the three most aligned features. The complete lists are featured in the appendix):
Cosine Similarity | Index 1 | Index 2 | Explanations: |
0.92281 | 6802 | 12291 | 6802: phrases related to weather and outdoor conditions 12291: scientific terminology and data analysis concepts |
0.89868 | 2426 | 2791 | 2426: modal verbs and phrases indicating possibility or capability 2791: references to Java interfaces and their implementations |
0.89313 | 11888 | 15083 | 11888: words related to medical terms and conditions, particularly in the context of diagnosis and treatment 15083: phrases indicating specific points in time or context |
Cosine Similarity | Index 1 | Index 2 | Explanations: |
-0.99077 | 4043 | 7357 | 4043: references to "Great" as part of phrases or titles 7357: questions and conversions related to measurements and quantities |
-0.96244 | 3571 | 16200 | 3571: instances of sports-related injuries 16200: concepts related to celestial bodies and their influence on life experiences |
-0.92788 | 7195 | 11236 | 7195: technical terms and acronyms related to veterinary studies and methodologies 11236: Java programming language constructs and operations related to thread management |
Cosine Similarity | Index 1 | Index 2 | Explanations: |
0.92429 | 11763 | 15036 | 11763: concepts related to methods and technologies in research or analysis 15036: programming constructs related to thread handling and GUI components |
0.82910 | 8581 | 14537 | 8581: instances of the word "the" 14537: repeated occurrences of the word "the" in various contexts |
0.82147 | 4914 | 6336 | 4914: programming-related keywords and terms 6336: terms related to legal and criminal proceedings |
Cosine Similarity | Index 1 | Index 2 | Explanations: |
-0.99212 | 6631 | 8684 | 6631: the beginning of a text or important markers in a document 8684: technical jargon and programming-related terms |
-0.98955 | 5052 | 8366 | 5052: the beginning of a document or section, likely signaling the start of significant content 8366: verbs and their related forms, often related to medical or technical contexts |
-0.97520 | 743 | 6793 | 743: proper nouns and specific terms related to people or entities 6793: elements that resemble structured data or identifiers, likely in a list or JSON format |
To ensure significant geometric relationships exist between features in SAEs, we perform a control experiment in which we compute the cosine similarities between corresponding features in the encoder and decoder matrix (i.e. computing the cosine similarity for the feature "hot" in the encoder matrix with the "hot" feature in the decoder matrix).
We perform these experiments for every single semantically opposite and random feature with itself, and report the averages below. As seen, significant cosine similarities exist for correspondent features, highlighting the weakness between cosine similarities between semantically opposite features as seen above.
Relationship Type | Cosine Similarity | Feature | Semantic Concept |
---|---|---|---|
Average (Antonym Pairs) | 0.71861 | ||
Highest Similarity | 0.88542 | 4605 | Sequential Processing |
Lowest Similarity | 0.47722 | 6381 | Sadness |
Average (Random Pairs) | 0.72692 | ||
Highest Similarity | 0.85093 | 1146 | Words related to "Sequences" |
Lowest Similarity | 0.51145 | 6266 | Numerical data related to dates |
Same Feature Similarity: gemma-scope-2b-pt-res Layer 20
Relationship Type | Cosine Similarity | Feature | Semantic Concept |
---|---|---|---|
Average (Antonym Pairs) | 0.75801 | ||
Highest Similarity | 0.86870 | 5657 | Correctness |
Lowest Similarity | 0.56493 | 10108 | Confusion |
Average (Random Pairs) | 0.74233 | ||
Highest Similarity | 0.91566 | 13137 | Coding related terminology |
Lowest Similarity | 0.55352 | 12591 | The presence of beginning markers in text |
We receive mixed results when evaluating compositional systems. Our results suggest that compositional injection (-1 * "Happy" + "Sad") results in higher steered sentiment towards the positive feature’s direction, and flipping the direction of SAE features results in neutral steering on its own.
Synchronized injection of antonymic features (happy + sad) seem to suggest some “cancelling out” of sentiment, control single injection directions like sadness seem to become much less common when paired with a happy vector.
Overall, compositionality of features provides interesting results and presents the opportunity for future research to be done.
Happy + Sad Injection | |
Net: | Neutral (11/20) |
Happy Steering: | 5 |
Sad Steering: | 2 |
No Steering: | 11 |
Happy to Sad Steering: | 1 |
Sad to Happy Steering: | 1 |
-1 * Happy + Sad Injection | -1 * Sad + Happy Injection | |
Net: | Sad Steered (14/20) | Happy Steered (11/20) |
Happy Steering: | 0 | 11 |
Sad Steering: | 12 | 1 |
No Steering: | 6 | 6 |
Happy to Sad Steering: | 2 | 1 |
Sad to Happy Steering: | 0 | 0 |
Happy Injection | Sad Injection | -1 * Happy Injection | -1 * Sad Injection | |
Net: | Neutral (10/20) | Sad (8/20) | Neutral (15/20) | Neutral (14/20) |
Happy Steering: | 7 | 2 | 3 | 3 |
Sad Steering: | 0 | 8 | 2 | 0 |
No Steering: | 10 | 5 | 15 | 14 |
Happy to Sad Steering: | 2 | 1 | 0 | 0 |
Sad to Happy Steering: | 1 | 4 | 0 | 3 |
The striking absence of consistent geometric relationships between semantically opposite features is surprising, given how frequent geometric relationships occur in original embedding spaces between semantically related features. SAE features are designed to anchor layer activations to their corresponding monosemantic directions, yet seem to be almost completely unrelated in all layers of the SAE.
Potential explanations for the lack of geometric relationships in semantically related SAE encoder and decoder features could be due to contextual noise in the training data. As word embeddings go through attention layers, their original directions could become muddled with noise from other words in the contextual window. However, the lack of any noticeable geometric relationship across 2 seperate layers is surprising. These findings lay the path for future interpretability work to be done in better understanding this phenomena.
A special thanks to Zhengxuan Wu for advising me on this project!
Concept Pair | Feature A (indices) | Feature B (indices) |
---|---|---|
1. Happiness-Sadness | Happiness (11697) | Sadness (15682, 15539) |
2. Sentiment | Positive (14588, 12049, 12660, 2184, 15216, 427) | Negative (2604, 656, 1740, 3372, 5941) |
3. Love-Hate | Love (9602, 5573) | Hate (2604) |
4. Tranquility | Calm (813) | Anxious (14092, 2125, 7523, 7769) |
5. Engagement | Excitement (13823, 3232) | Boredom (16137) |
6. Intensity | Intense (11812) | Mild (13261) |
7. Size | Large (582, 9414) | Small (535) |
8. Understanding | Clarity (3726) | Confusion (16253, 3545, 10108, 16186) |
9. Pace | Speed (4483, 7787, 4429) | Slowness (1387) |
10. Outcome | Success (7987, 5083, 1133, 162, 2922) | Failure (10031, 2708, 15271, 8427, 3372) |
11. Timing | Early (8871, 6984) | Late (8032) |
12. Luminance | Bright (10977) | Dark (12238) |
13. Position | Internal (14961, 15523) | External (3136, 8848, 433) |
14. Time of Day | Daytime (11341) | Nighttime (14821) |
15. Direction | Push (6174) | Pull (9307) |
16. Accuracy | Correct (10351, 5657) | Incorrect (11983, 1397) |
17. Scope | Local (10910, 1598) | Global (10472, 15416) |
18. Complexity | Simple (8929, 10406, 4599) | Complex (3257, 5727) |
19. Processing | Sequential (5457, 4099, 14378) | Parallel (4453, 3758) |
20. Connection | Connected (5539) | Isolated (15334, 2729) |
Cosine Similarity | Index 1 | Index 2 | Explanations: |
0.92281 | 6802 | 12291 | 6802: phrases related to weather and outdoor conditions 12291: scientific terminology and data analysis concepts |
0.89868 | 2426 | 2791 | 2426: modal verbs and phrases indicating possibility or capability 2791: references to Java interfaces and their implementations |
0.89313 | 11888 | 15083 | 11888: words related to medical terms and conditions, particularly in the context of diagnosis and treatment 15083: phrases indicating specific points in time or context |
0.88358 | 5005 | 8291 | 5005: various punctuation marks and symbols 8291: function calls and declarations in code |
0.88258 | 5898 | 14892 | 5898: components and processes related to food preparation 14892: text related to artificial intelligence applications and methodologies in various scientific fields |
0.88066 | 3384 | 12440 | 3384: terms related to education, technology, and security 12440: code snippets and references to specific programming functions or methods in a discussion or tutorial. |
0.87927 | 1838 | 12167 | 1838: technical terms related to polymer chemistry and materials science 12167: structured data entries in a specific format, potentially related to programming or configuration settings. |
0.87857 | 12291 | 15083 | 12291: scientific terminology and data analysis concepts 15083: phrases indicating specific points in time or context |
0.87558 | 8118 | 8144 | 8118: instances of formatting or structured text in the document 8144: names of people or entities associated with specific contexts or citations |
0.87248 | 3152 | 15083 | 3152: references to mental health or psychosocial topics related to women 15083: phrases indicating specific points in time or context |
Cosine Similarity | Index 1 | Index 2 | Explanations: |
-0.99077 | 4043 | 7357 | 4043: references to "Great" as part of phrases or titles 7357: questions and conversions related to measurements and quantities |
-0.96244 | 3571 | 16200 | 3571: instances of sports-related injuries 16200: concepts related to celestial bodies and their influence on life experiences |
-0.94912 | 3738 | 12689 | 3738: mentions of countries, particularly Canada and China 12689: terms related to critical evaluation or analysis |
-0.94557 | 3986 | 4727 | 3986: terms and discussions related to sexuality and sexual behavior 4727: patterns or sequences of symbols, particularly those that are visually distinct |
-0.93563 | 1234 | 12996 | 1234: words related to fundraising events and campaigns 12996: references to legal terms and proceedings |
-0.92870 | 9295 | 10264 | 9295: the presence of JavaScript code segments or functions 10264: text related to font specifications and styles |
-0.92788 | 7195 | 11236 | 7195: technical terms and acronyms related to veterinary studies and methodologies 11236: Java programming language constructs and operations related to thread management |
-0.92576 | 4231 | 12542 | 4231: terms related to mental states and mindfulness practices 12542: information related to special DVD or Blu-ray editions and their features |
-0.92376 | 2553 | 6418 | 2553: references to medical terminologies and anatomical aspects 6418: references to types of motor vehicles |
-0.92021 | 674 | 8375 | 674: terms related to cooking and recipes 8375: descriptions of baseball games and players. |
Cosine Similarity | Index 1 | Index 2 | Explanations: |
0.92429 | 11763 | 15036 | 11763: concepts related to methods and technologies in research or analysis 15036: programming constructs related to thread handling and GUI components |
0.82910 | 8581 | 14537 | 8581: instances of the word "the" 14537: repeated occurrences of the word "the" in various contexts |
0.82147 | 4914 | 6336 | 4914: programming-related keywords and terms 6336: terms related to legal and criminal proceedings |
0.81237 | 4140 | 16267 | 4140: programming-related syntax and logical operations 16267: instances of mathematical or programming syntax elements such as brackets and operators |
0.80524 | 1831 | 7248 | 1831: elements related to JavaScript tasks and project management 7248: structural elements and punctuation in the text |
0.77152 | 8458 | 16090 | 8458: occurrences of function and request-related terms in a programming context 16090: curly braces and parentheses in code |
0.76965 | 7449 | 10192 | 7449: the word 'to' and its variations in different contexts 10192: modal verbs indicating possibility or necessity |
0.76194 | 5413 | 8630 | 5413: punctuation and exclamatory expressions in text 8630: mentions of cute or appealing items or experiences |
0.75850 | 7465 | 10695 | 7465: phrases related to legal interpretations and political tensions involving nationality and citizenship 10695: punctuation marks and variation in sentence endings |
0.74877 | 745 | 2437 | 745: sequences of numerical values 2437: patterns related to numerical information, particularly involving the number four |
Cosine Similarity | Index 1 | Index 2 | Explanations: |
-0.99212 | 6631 | 8684 | 6631: the beginning of a text or important markers in a document 8684: technical jargon and programming-related terms |
-0.98955 | 5052 | 8366 | 5052: the beginning of a document or section, likely signaling the start of significant content 8366: verbs and their related forms, often related to medical or technical contexts |
-0.97520 | 743 | 6793 | 743: proper nouns and specific terms related to people or entities 6793: elements that resemble structured data or identifiers, likely in a list or JSON format |
-0.97373 | 1530 | 5533 | 1530: technical terms related to inventions and scientific descriptions 5533: terms related to medical and biological testing and analysis |
-0.97117 | 1692 | 10931 | 1692: legal and technical terminology related to statutes and inventions 10931: keywords and parameters commonly associated with programming and configuration files |
-0.96775 | 4784 | 5328 | 4784: elements of scientific or technical descriptions related to the assembly and function of devices 5328: terms and phrases related to legal and statistical decisions |
-0.96136 | 6614 | 9185 | 6614: elements related to data structure and processing in programming contexts 9185: technical terms related to audio synthesis and electronic music equipment |
-0.96027 | 10419 | 14881 | 10419: technical terms and specific concepts related to scientific research and modeling 14881: phrases related to lubrication and mechanical properties |
-0.95915 | 1483 | 9064 | 1483: words related to specific actions or conditions that require attention or caution 9064: mathematical operations and variable assignments in expressions |
-0.95887 | 1896 | 15111 | 1896: elements indicative of technical or programming contexts 15111: phrases related to functionality and efficiency in systems and processes |