Key Findings:
We demonstrate that subspaces with semantically opposite meanings within the GemmaScope series of Sparse Autoencoders do not point in opposite directions.
Furthermore, subspaces that do point in opposite directions are usually not semantically related.
As an auxiliary experiment, we test the compositional injection of steering vectors (e.g., -1*happy + sad) and find moderate signals of success.
An Intuitive Introduction to Sparse Autoencoders
What are Sparse Autoencoders, and How Do They Work?
High Level Diagram Showcasing SAE Workflow: taken from Adam Karvonen's Blog Post.
A Sparse Autoencoder (SAE) is a dictionary learning method whose goal is to learn monosemantic subspaces that map to high-level concepts (1). Several frontier AI labs have recently applied SAEs to interpret model internals and control model generations, yielding promising results (2, 3).
To build an intuitive understanding of this process, we can break down the SAE workflow as follows:
SAEs are trained to reconstruct activations sampled from Language Models (LMs). They operate by taking a layer activation from an LM and up-projecting it onto a much wider latent space with an encoder weight matrix, then down-projecting the latent to reconstruct the input activation. Sparsity is encouraged by applying an activation function such as ReLU to the latent before it is down-projected, together with an L1 loss on the latent space during training.
The columns of the encoder matrix and the rows of the decoder matrix can be interpreted as learned "features" corresponding to specific concepts (e.g., activation patterns when the word "hot" is processed by an LM).
When multiplying layer activations by the encoding matrix, we effectively measure the alignment between our input and each learned feature. This produces a "dictionary" of alignment measurements in our SAE activations.
To eliminate "irrelevant" features, a ReLU/JumpReLU activation function sets all elements below a certain threshold to zero.
The resulting sparse SAE dictionary activation vector serves as a set of coefficients for a linear combination of the most significant features identified in the input. These coefficients, when multiplied with the decoder matrix, attempt to reconstruct the original input as a sum of learned feature vectors.
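The steps above can be sketched in a few lines of plain Python. This is a minimal toy illustration rather than the GemmaScope implementation: the names `W_enc`, `b_enc`, `W_dec`, `b_dec`, the matrix shapes, and the tiny hand-picked weights are all assumptions made for the example.

```python
def relu(values):
    """Zero out every negative pre-activation."""
    return [max(0.0, v) for v in values]


def jump_relu(values, thresholds):
    """Keep a value only if it exceeds its per-feature threshold."""
    return [v if v > t else 0.0 for v, t in zip(values, thresholds)]


def encode(x, W_enc, b_enc):
    """Up-project x. W_enc is (d_model x d_sae); column j is feature j."""
    d_sae = len(W_enc[0])
    pre = [
        sum(x[i] * W_enc[i][j] for i in range(len(x))) + b_enc[j]
        for j in range(d_sae)
    ]
    return relu(pre)


def decode(latent, W_dec, b_dec):
    """Down-project. W_dec is (d_sae x d_model); row j is feature j."""
    d_model = len(W_dec[0])
    return [
        sum(latent[j] * W_dec[j][k] for j in range(len(latent))) + b_dec[k]
        for k in range(d_model)
    ]


# Toy SAE: 2-dimensional activations, 3 latent features (hypothetical weights).
W_enc = [[1.0, 0.0, -1.0],
         [0.0, 1.0, 0.0]]
W_dec = [[1.0, 0.0],
         [0.0, 1.0],
         [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0]
b_dec = [0.0, 0.0]

x = [2.0, -1.0]
latent = encode(x, W_enc, b_enc)               # sparse coefficients
reconstruction = decode(latent, W_dec, b_dec)  # weighted sum of decoder rows
```

In this toy case only the first feature fires, so the reconstruction is a scaled copy of the first decoder row and the negative component of `x` is lost, which is the reconstruction error a real SAE is trained to minimize.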
For a deeper intuitive understanding of SAEs, refer to Adam Karvonen's comprehensive article on the subject here.
The learned features encoded in the columns and rows of the SAE encoder and decoder matrices carry significant semantic weight. In this post, we explore the geometric relationships between these feature vectors, particularly focusing on semantically opposite features (antonyms), to determine whether they exhibit meaningful geometric/directional relationships similar to those observed with embedded tokens/words in Language Models (4).
Overview of Vector & Feature Geometry
One of the most elegant and famous phenomena within LMs is the geometric relationship between an LM's word embeddings. The well-known equation "king - man + woman = queen" comes out of embedding algorithms like GloVe (5) and word2vec (6). These embedding algorithms also perform well on benchmarks like WordSim-353, where the cosine similarity of their embeddings closely matches human ratings of word-pair similarity (7).
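To make the analogy arithmetic concrete, here is a toy sketch. The 2-dimensional embeddings are hand-made for illustration (one axis loosely standing for "royalty", the other for "gender"); they are not real GloVe or word2vec vectors, and the `analogy` helper is a hypothetical name.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender".
embeddings = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [0.2, 0.1],
}


def analogy(a, b, c, vocab):
    """Return the word whose vector is closest to vocab[a] - vocab[b] + vocab[c]."""
    target = [x - y + z for x, y, z in zip(vocab[a], vocab[b], vocab[c])]
    # Exclude the query words themselves, as is customary for analogy tasks.
    candidates = (w for w in vocab if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine_similarity(target, vocab[w]))


result = analogy("king", "man", "woman", embeddings)
```

With these toy vectors, `result` is `"queen"`, since the target vector coincides exactly with the hand-made queen embedding.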
Project Motivations
Since the encoding step of the SAE workflow is analogous to taking the cosine similarity between the input activation and every learned feature in the encoder matrix, we might expect the learned features of the SAE encoder and decoder weights to carry geometric relationships similar to those of word embeddings. In this post, we explore this hypothesis, focusing on semantically opposite features (antonyms) and their cosine similarities, to determine whether they exhibit meaningful geometric relationships like those observed with embedded tokens/words in LMs.
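One way to state the analogy more precisely, as a sketch: write w_j for the j-th column of the encoder matrix, b_j for its bias, x for the input activation, and θ_j for the angle between w_j and x (this notation is assumed here for illustration). The pre-activation of feature j is then a dot product, which equals the cosine similarity scaled by the two norms:

```latex
z_j \;=\; w_j^{\top} x + b_j \;=\; \lVert w_j \rVert \,\lVert x \rVert \cos\theta_j + b_j
```

The encoder therefore measures cosine similarity only up to the vector norms and the bias term, which is why the comparison is an analogy rather than an identity.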
Experimental Setup
Finding Semantically Opposite Features
Neuronpedia's GemmaScope feature search function. Search for features yourself here.
To investigate feature geometry relationships, we identify 20 pairs of semantically opposite ("antonym") features from the explanations file of gemma-scope-2b-pt-res. We find these features for SAEs in the gemma-scope-2b-pt-res SAE family corresponding to layer 10 and 20. For comprehensiveness, we included all semantically similar features when multiple instances occurred in the feature space. The selected antonym pairs and their corresponding feature indices are as follows:
[Table: Semantically Opposite Features: gemma-scope-2b-pt-res Layer 10]
We also conduct the same experiments for Layer 20 of gemma-scope-2b-pt-res. For the sake of concision, the feature indices we use for that layer can be found in the appendix.
Evaluating Feature Geometry
To analyze the geometric relationships between semantically opposite features, we compute their cosine similarities. This measure quantifies the directional alignment between feature vectors, yielding values in the range [-1, 1], where:
+1 indicates perfect directional alignment
0 indicates orthogonality
-1 indicates perfect opposition
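As a quick illustration of the three reference values above, here is a minimal cosine similarity function applied to hand-picked toy vectors. The vectors are assumptions chosen for the example, not SAE features.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Same direction, different length: approximately +1.
aligned = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Exactly opposite direction: approximately -1.
opposed = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
```

Because the norms are divided out, the measure depends only on direction, so rescaling a feature vector does not change its similarity to any other vector.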
Given that certain semantic concepts are represented by multiple feature vectors, we use the following comparison approach:
For each antonym pair, we compute cosine similarities between all possible combinations of their respective feature vectors.
We record both the maximum and minimum cosine similarity observed, along with their feature indices and explanations. The average cosine similarity across all 20 pairs is also recorded.
This analysis is performed independently on both the encoder matrix columns and decoder matrix rows to evaluate geometric relationships in both spaces.
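The all-combinations comparison can be sketched as follows. This is an illustrative reconstruction under assumptions, not the code used for the experiments: the function name `pairwise_extremes`, the dictionary layout, and the feature indices and vectors are all hypothetical.

```python
import math
from itertools import product


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_extremes(group_a, group_b):
    """Compare every vector of one concept with every vector of its antonym.

    Each group maps a feature index to its vector. Returns the
    (similarity, index_a, index_b) tuples with the highest and lowest similarity.
    """
    scored = [
        (cosine_similarity(vec_a, vec_b), idx_a, idx_b)
        for (idx_a, vec_a), (idx_b, vec_b) in product(group_a.items(), group_b.items())
    ]
    return max(scored), min(scored)


# Hypothetical "hot" and "cold" feature groups with made-up indices.
hot = {101: [1.0, 0.0], 102: [1.0, 1.0]}
cold = {201: [-1.0, 0.0], 202: [0.0, 1.0]}

highest, lowest = pairwise_extremes(hot, cold)
```

Running the same function once on encoder columns and once on decoder rows would give the two per-matrix comparisons described above.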
Control Experiments
To establish statistical significance and validate our findings, we implemented several control experiments:
Baseline Comparison with Random Features
To contextualize our results, we generated a baseline distribution of cosine similarities using 20 randomly selected feature pairs. This random baseline enables us to assess whether the geometric relationships observed between antonym pairs differ significantly from chance alignments. As done with the semantically opposite features, we record the maximum and minimum cosine similarity values as well as the mean sampled cosine similarity.
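A random baseline of this kind could be computed along the following lines. The sketch assumes a list of feature vectors; the function name, the fixed seed, and the synthetic Gaussian "features" are placeholders rather than details from the experiments.

```python
import math
import random
import statistics


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def random_pair_baseline(features, n_pairs=20, seed=0):
    """Sample n_pairs random pairs of distinct features and summarize their similarity."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(features)), 2)  # two distinct indices
        sims.append(cosine_similarity(features[i], features[j]))
    return {
        "max": max(sims),
        "min": min(sims),
        "mean": statistics.fmean(sims),
        "n": len(sims),
    }


# Synthetic stand-in for a feature matrix: 50 random 8-dimensional vectors.
data_rng = random.Random(1)
features = [[data_rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(50)]
baseline = random_pair_baseline(features)
```

Fixing the seed keeps the baseline reproducible, so the antonym-pair statistics can be compared against the same random sample across runs.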
Comprehensive Feature Space Analysis
To establish the bounds of possible geometric relationships within our feature space, we conducted an optimized exhaustive search across all feature combinations in both the encoder and decoder matrices. This analysis identifies:
Maximally aligned feature pairs and their explanations (highest positive cosine similarity)
Maximally opposed feature pairs and their explanations (most negative cosine similarity)
This comprehensive control framework enables us to establish a full range of possible geometric relationships within our feature space, as well as understand the semantic relationships between maximally aligned and opposed feature pairs in the feature space.
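A small-scale sketch of such an exhaustive search is shown below. It normalizes every feature once so that each pairwise comparison reduces to a dot product; this is one plausible optimization and not necessarily the one used in the experiments. The function names and toy vectors are assumptions. For a dictionary with many thousands of features, a single matrix product in a numerical library would be the practical choice.

```python
import math
from itertools import combinations


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def extreme_pairs(features, top_k=3):
    """Return the top_k most aligned and top_k most opposed feature pairs.

    Each entry is a (cosine_similarity, index_1, index_2) tuple.
    """
    unit = [normalize(v) for v in features]  # normalize once, reuse everywhere
    scored = [
        (sum(a * b for a, b in zip(unit[i], unit[j])), i, j)
        for i, j in combinations(range(len(unit)), 2)
    ]
    scored.sort()
    most_aligned = scored[-top_k:][::-1]
    most_opposed = scored[:top_k]
    return most_aligned, most_opposed


# Toy feature matrix (rows are feature vectors).
toy_features = [[1.0, 0.0], [2.0, 0.1], [-1.0, 0.0], [0.0, 1.0]]
most_aligned, most_opposed = extreme_pairs(toy_features, top_k=1)
```

In the toy matrix, features 0 and 1 are nearly parallel and features 0 and 2 are exactly opposite, so those are the pairs the search returns.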
Validation Through Self-Similarity Analysis
As a validation measure, we computed self-similarity scores by calculating the cosine similarity between corresponding features in the encoder and decoder matrices. This analysis serves as a positive control, verifying our methodology's ability to detect strong geometric relationships where they are expected to exist.
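The self-similarity check pairs column j of the encoder with row j of the decoder. A minimal sketch, assuming the encoder is stored as a (d_model x d_sae) matrix and the decoder as a (d_sae x d_model) matrix; the toy weights and function name are hypothetical.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def encoder_decoder_self_similarity(W_enc, W_dec):
    """Cosine similarity between encoder column j and decoder row j, for every j."""
    sims = []
    for j, dec_row in enumerate(W_dec):
        enc_col = [row[j] for row in W_enc]  # extract column j of the encoder
        sims.append(cosine_similarity(enc_col, dec_row))
    return sims


# Toy SAE with 2-dimensional activations and 2 features.
W_enc = [[1.0, 0.0],
         [0.0, 2.0]]
W_dec = [[2.0, 0.0],
         [1.0, 1.0]]

self_sims = encoder_decoder_self_similarity(W_enc, W_dec)
```

Here the first feature has identical encoder and decoder directions, while the second differs by 45 degrees, giving a similarity of about 0.707.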
Auxiliary Experiment: Compositional Model Steering
As an auxiliary experiment, we test the effectiveness of compositional steering approaches using SAE feature weights. We do this by injecting two semantically opposite features from the decoder weights of GemmaScope 2b Layer 20 during model generation and observing the effect on the sentiment of the model's output.
We employ the following compositional steering injection techniques using pyvene’s (8) SAE steering intervention workflow in response to the prompt “How are you?”:
Contrastive Injection: Semantically opposite features “Happy” (11697) and “Sad” (15539) were injected together to see if they “cancelled” each other out.
Compositional Injection: Semantically opposite features “Happy” and “Sad” were injected together, but with the sign of one of them flipped. For example, -1 * “Happy” + “Sad” and “Happy” + -1 * “Sad” were tested to see if they would produce more pronounced steering towards “Extra Sad” or “Extra Happy”, respectively.
Control: As a control, we generate baseline examples with no steering, and single feature injection steering (Ex: Only steering with “Happy” or “Sad”). These results are used to ground sentiment measurement.
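The vector arithmetic behind these injection schemes can be sketched as below. This shows only how the steering vector is composed and added to a hidden state; it does not use or reproduce pyvene's API. The toy `happy` and `sad` vectors, the `scale` parameter, and both function names are assumptions for illustration.

```python
def compose_steering(features, coefficients):
    """Weighted sum of feature vectors, e.g. -1 * happy + 1 * sad."""
    dim = len(features[0])
    return [
        sum(c * f[k] for c, f in zip(coefficients, features))
        for k in range(dim)
    ]


def apply_steering(hidden, steering, scale=1.0):
    """Add the (scaled) steering vector to a hidden-state activation."""
    return [h + scale * s for h, s in zip(hidden, steering)]


# Toy stand-ins for two decoder rows (not the real features 11697 / 15539).
happy = [1.0, 0.0, 0.5]
sad = [0.0, 1.0, 0.5]

contrastive = compose_steering([happy, sad], [1.0, 1.0])   # happy + sad
extra_sad = compose_steering([happy, sad], [-1.0, 1.0])    # -1 * happy + sad
happy_only = compose_steering([happy], [1.0])              # single-feature control

steered = apply_steering([0.5, 0.5, 0.5], extra_sad, scale=2.0)
```

Note that the two toy vectors are not opposite, so `happy + sad` does not sum to zero; whether real antonym features cancel when added depends on their actual geometry.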
We evaluate these injection techniques using a GPT-4o automatic labeller conditioned on few-shot examples of baseline (non-steered) examples of model responses to “How are you?” for grounding. We generate and label 20 responses for each steering approach and ask the model to compare the sentiment of steered examples against control responses.
Results
Cosine Similarity of Semantically Opposite Features
Our analysis reveals that semantically opposite feature pairs do not exhibit consistently strong geometric opposition in either the encoder or decoder matrices. While some semantically related features exhibit outlier cosine similarities, this effect is not consistent. In fact, the average cosine similarity of antonym pairs stays close to that of randomly selected pairs.
[Table: Decoder Cosine Similarity: gemma-scope-2b-pt-res Layer 10]
[Table: Encoder Cosine Similarity: gemma-scope-2b-pt-res Layer 10]
[Table: Decoder Cosine Similarity: gemma-scope-2b-pt-res Layer 20]
[Table: Encoder Cosine Similarity: gemma-scope-2b-pt-res Layer 20]
To view the exact feature explanations for index pairs shown, one can reference Neuronpedia's Gemma-2-2B page for feature lookup by model layer.
Features Showing Optimized Cosine Similarity
As another control experiment, we perform an optimized exhaustive search over the entire decoder feature space to find the most aligned and most opposed feature pairs, and examine what these pairs look like semantically.
Analyzing the features found, we see a lack of consistent semantic relationships between both the maximally aligned and the maximally opposed feature pairs in the decoder space. While a select few pairs are semantically related, specifically among the maximally aligned features, such pairs do not appear consistently. Features pointing in opposite directions are consistently unrelated in meaning. Below are the results (for the sake of formatting, we only include the top three pairs in each table; the complete lists are in the appendix):
Most Aligned Cosine Similarity: gemma-scope-2b-pt-res Layer 10

| Cosine Similarity | Index 1 | Index 2 | Explanations |
|---|---|---|---|
| 0.92281 | 6802 | 12291 | 6802: phrases related to weather and outdoor conditions; 12291: scientific terminology and data analysis concepts |
| 0.89868 | 2426 | 2791 | 2426: modal verbs and phrases indicating possibility or capability; 2791: references to Java interfaces and their implementations |
| 0.89313 | 11888 | 15083 | 11888: words related to medical terms and conditions, particularly in the context of diagnosis and treatment; 15083: phrases indicating specific points in time or context |
Most Opposite Cosine Similarity: gemma-scope-2b-pt-res Layer 10

| Cosine Similarity | Index 1 | Index 2 | Explanations |
|---|---|---|---|
| -0.99077 | 4043 | 7357 | 4043: references to "Great" as part of phrases or titles; 7357: questions and conversions related to measurements and quantities |
| -0.96244 | 3571 | 16200 | 3571: instances of sports-related injuries; 16200: concepts related to celestial bodies and their influence on life experiences |
| -0.92788 | 7195 | 11236 | 7195: technical terms and acronyms related to veterinary studies and methodologies; 11236: Java programming language constructs and operations related to thread management |
Most Aligned Cosine Similarity: gemma-scope-2b-pt-res Layer 20

| Cosine Similarity | Index 1 | Index 2 | Explanations |
|---|---|---|---|
| 0.92429 | 11763 | 15036 | 11763: concepts related to methods and technologies in research or analysis; 15036: programming constructs related to thread handling and GUI components |
| 0.82910 | 8581 | 14537 | 8581: instances of the word "the"; 14537: repeated occurrences of the word "the" in various contexts |
| 0.82147 | 4914 | 6336 | 4914: programming-related keywords and terms; 6336: terms related to legal and criminal proceedings |
Most Opposite Cosine Similarities: gemma-scope-2b-pt-res Layer 20

| Cosine Similarity | Index 1 | Index 2 | Explanations |
|---|---|---|---|
| -0.99212 | 6631 | 8684 | 6631: the beginning of a text or important markers in a document; 8684: technical jargon and programming-related terms |
| -0.98955 | 5052 | 8366 | 5052: the beginning of a document or section, likely signaling the start of significant content; 8366: verbs and their related forms, often related to medical or technical contexts |
| -0.97520 | 743 | 6793 | 743: proper nouns and specific terms related to people or entities; 6793: elements that resemble structured data or identifiers, likely in a list or JSON format |
Cosine Similarity of Corresponding Encoder-Decoder Features
To confirm that significant geometric relationships do exist between features in SAEs, we perform a control experiment in which we compute the cosine similarity between corresponding features in the encoder and decoder matrices (e.g., the cosine similarity between the "hot" feature in the encoder matrix and the "hot" feature in the decoder matrix).
We compute this self-similarity for every feature used in the semantically opposite and random pairs, and report the averages below. As seen, corresponding features have high cosine similarities, which underscores how weak the cosine similarities between semantically opposite features reported above are.
Same Feature Similarity: gemma-scope-2b-pt-res Layer 10

| Relationship Type | Cosine Similarity | Feature | Semantic Concept |
|---|---|---|---|
| Average (Antonym Pairs) | 0.71861 | | |
| Highest Similarity | 0.88542 | 4605 | Sequential Processing |
| Lowest Similarity | 0.47722 | 6381 | Sadness |
| Average (Random Pairs) | 0.72692 | | |
| Highest Similarity | 0.85093 | 1146 | Words related to "Sequences" |
| Lowest Similarity | 0.51145 | 6266 | Numerical data related to dates |
Same Feature Similarity: gemma-scope-2b-pt-res Layer 20

| Relationship Type | Cosine Similarity | Feature | Semantic Concept |
|---|---|---|---|
| Average (Antonym Pairs) | 0.75801 | | |
| Highest Similarity | 0.86870 | 5657 | Correctness |
| Lowest Similarity | 0.56493 | 10108 | Confusion |
| Average (Random Pairs) | 0.74233 | | |
| Highest Similarity | 0.91566 | 13137 | Coding related terminology |
| Lowest Similarity | 0.55352 | 12591 | The presence of beginning markers in text |
Steering Results
We observe mixed results when evaluating compositional steering. Our results suggest that compositional injection (e.g., -1 * "Happy" + "Sad") steers sentiment more strongly towards the positively weighted feature’s direction than injecting that feature alone, while injecting a sign-flipped SAE feature on its own results in mostly neutral steering.
Simultaneous injection of antonymic features (happy + sad) seems to suggest some “cancelling out” of sentiment: outcomes that are common under single-feature control injection, such as sad steering, become much less common when the sad vector is paired with a happy vector.
Overall, the compositionality of features provides interesting results and presents an opportunity for future research.
Contrastive Steering Results:

| | Happy + Sad Injection |
|---|---|
| Net | Neutral (11/20) |
| Happy Steering | 5 |
| Sad Steering | 2 |
| No Steering | 11 |
| Happy to Sad Steering | 1 |
| Sad to Happy Steering | 1 |
Compositional Steering Results

| | -1 * Happy + Sad Injection | -1 * Sad + Happy Injection |
|---|---|---|
| Net | Sad Steered (14/20) | Happy Steered (11/20) |
| Happy Steering | 0 | 11 |
| Sad Steering | 12 | 1 |
| No Steering | 6 | 6 |
| Happy to Sad Steering | 2 | 1 |
| Sad to Happy Steering | 0 | 0 |
Control Steering Results: Happy and Sad Single Injection

| | Happy Injection | Sad Injection | -1 * Happy Injection | -1 * Sad Injection |
|---|---|---|---|---|
| Net | Neutral (10/20) | Sad (8/20) | Neutral (15/20) | Neutral (14/20) |
| Happy Steering | 7 | 2 | 3 | 3 |
| Sad Steering | 0 | 8 | 2 | 0 |
| No Steering | 10 | 5 | 15 | 14 |
| Happy to Sad Steering | 2 | 1 | 0 | 0 |
| Sad to Happy Steering | 1 | 4 | 0 | 3 |
Conclusions and Discussion
The striking absence of consistent geometric relationships between semantically opposite features is surprising, given how frequently geometric relationships occur between semantically related words in the original embedding spaces. SAE features are designed to anchor layer activations to their corresponding monosemantic directions, yet the directions of semantically related features seem to be almost completely unrelated in both SAE layers we examined.
One potential explanation for the lack of geometric relationships between semantically related SAE encoder and decoder features is contextual noise in the training data. As word embeddings pass through attention layers, their original directions could become muddled with noise from other words in the context window. However, the lack of any noticeable geometric relationship across two separate layers is still surprising. These findings pave the way for future interpretability work aimed at better understanding this phenomenon.
Acknowledgements
A special thanks to Zhengxuan Wu for advising me on this project!
Appendix
Most Aligned Cosine Similarity: gemma-scope-2b-pt-res Layer 10

| Cosine Similarity | Index 1 | Index 2 | Explanations |
|---|---|---|---|
| 0.92281 | 6802 | 12291 | 6802: phrases related to weather and outdoor conditions; 12291: scientific terminology and data analysis concepts |
| 0.89868 | 2426 | 2791 | 2426: modal verbs and phrases indicating possibility or capability; 2791: references to Java interfaces and their implementations |
| 0.89313 | 11888 | 15083 | 11888: words related to medical terms and conditions, particularly in the context of diagnosis and treatment; 15083: phrases indicating specific points in time or context |
| 0.88358 | 5005 | 8291 | 5005: various punctuation marks and symbols; 8291: function calls and declarations in code |
| 0.88258 | 5898 | 14892 | 5898: components and processes related to food preparation; 14892: text related to artificial intelligence applications and methodologies in various scientific fields |
| 0.88066 | 3384 | 12440 | 3384: terms related to education, technology, and security; 12440: code snippets and references to specific programming functions or methods in a discussion or tutorial |
| 0.87927 | 1838 | 12167 | 1838: technical terms related to polymer chemistry and materials science; 12167: structured data entries in a specific format, potentially related to programming or configuration settings |
| 0.87857 | 12291 | 15083 | 12291: scientific terminology and data analysis concepts; 15083: phrases indicating specific points in time or context |
| 0.87558 | 8118 | 8144 | 8118: instances of formatting or structured text in the document; 8144: names of people or entities associated with specific contexts or citations |
| 0.87248 | 3152 | 15083 | 3152: references to mental health or psychosocial topics related to women; 15083: phrases indicating specific points in time or context |
Most Opposite Cosine Similarity: gemma-scope-2b-pt-res Layer 10

| Cosine Similarity | Index 1 | Index 2 | Explanations |
|---|---|---|---|
| -0.99077 | 4043 | 7357 | 4043: references to "Great" as part of phrases or titles; 7357: questions and conversions related to measurements and quantities |
| -0.96244 | 3571 | 16200 | 3571: instances of sports-related injuries; 16200: concepts related to celestial bodies and their influence on life experiences |
| -0.94912 | 3738 | 12689 | 3738: mentions of countries, particularly Canada and China; 12689: terms related to critical evaluation or analysis |
| -0.94557 | 3986 | 4727 | 3986: terms and discussions related to sexuality and sexual behavior; 4727: patterns or sequences of symbols, particularly those that are visually distinct |
| -0.93563 | 1234 | 12996 | 1234: words related to fundraising events and campaigns; 12996: references to legal terms and proceedings |
| -0.92870 | 9295 | 10264 | 9295: the presence of JavaScript code segments or functions; 10264: text related to font specifications and styles |
| -0.92788 | 7195 | 11236 | 7195: technical terms and acronyms related to veterinary studies and methodologies; 11236: Java programming language constructs and operations related to thread management |
| -0.92576 | 4231 | 12542 | 4231: terms related to mental states and mindfulness practices; 12542: information related to special DVD or Blu-ray editions and their features |
| -0.92376 | 2553 | 6418 | 2553: references to medical terminologies and anatomical aspects; 6418: references to types of motor vehicles |
| -0.92021 | 674 | 8375 | 674: terms related to cooking and recipes; 8375: descriptions of baseball games and players |
Most Aligned Cosine Similarity: gemma-scope-2b-pt-res Layer 20

| Cosine Similarity | Index 1 | Index 2 | Explanations |
|---|---|---|---|
| 0.92429 | 11763 | 15036 | 11763: concepts related to methods and technologies in research or analysis; 15036: programming constructs related to thread handling and GUI components |
| 0.82910 | 8581 | 14537 | 8581: instances of the word "the"; 14537: repeated occurrences of the word "the" in various contexts |
| 0.82147 | 4914 | 6336 | 4914: programming-related keywords and terms; 6336: terms related to legal and criminal proceedings |
| 0.81237 | 4140 | 16267 | 4140: programming-related syntax and logical operations; 16267: instances of mathematical or programming syntax elements such as brackets and operators |
| 0.80524 | 1831 | 7248 | 1831: elements related to JavaScript tasks and project management; 7248: structural elements and punctuation in the text |
| 0.77152 | 8458 | 16090 | 8458: occurrences of function and request-related terms in a programming context; 16090: curly braces and parentheses in code |
| 0.76965 | 7449 | 10192 | 7449: the word 'to' and its variations in different contexts; 10192: modal verbs indicating possibility or necessity |
| 0.76194 | 5413 | 8630 | 5413: punctuation and exclamatory expressions in text; 8630: mentions of cute or appealing items or experiences |
| 0.75850 | 7465 | 10695 | 7465: phrases related to legal interpretations and political tensions involving nationality and citizenship; 10695: punctuation marks and variation in sentence endings |
| 0.74877 | 745 | 2437 | 745: sequences of numerical values; 2437: patterns related to numerical information, particularly involving the number four |
Most Opposite Cosine Similarities: gemma-scope-2b-pt-res Layer 20

| Cosine Similarity | Index 1 | Index 2 | Explanations |
|---|---|---|---|
| -0.99212 | 6631 | 8684 | 6631: the beginning of a text or important markers in a document; 8684: technical jargon and programming-related terms |
| -0.98955 | 5052 | 8366 | 5052: the beginning of a document or section, likely signaling the start of significant content; 8366: verbs and their related forms, often related to medical or technical contexts |
| -0.97520 | 743 | 6793 | 743: proper nouns and specific terms related to people or entities; 6793: elements that resemble structured data or identifiers, likely in a list or JSON format |
| -0.97373 | 1530 | 5533 | 1530: technical terms related to inventions and scientific descriptions; 5533: terms related to medical and biological testing and analysis |
| -0.97117 | 1692 | 10931 | 1692: legal and technical terminology related to statutes and inventions; 10931: keywords and parameters commonly associated with programming and configuration files |
| -0.96775 | 4784 | 5328 | 4784: elements of scientific or technical descriptions related to the assembly and function of devices; 5328: terms and phrases related to legal and statistical decisions |
| -0.96136 | 6614 | 9185 | 6614: elements related to data structure and processing in programming contexts; 9185: technical terms related to audio synthesis and electronic music equipment |
| -0.96027 | 10419 | 14881 | 10419: technical terms and specific concepts related to scientific research and modeling; 14881: phrases related to lubrication and mechanical properties |
| -0.95915 | 1483 | 9064 | 1483: words related to specific actions or conditions that require attention or caution; 9064: mathematical operations and variable assignments in expressions |
| -0.95887 | 1896 | 15111 | 1896: elements indicative of technical or programming contexts; 15111: phrases related to functionality and efficiency in systems and processes |
Key Findings:
An Intuitive Introduction to Sparse Autoencoders
What are Sparse Autoencoders, and How Do They Work?
Sparse Autoencoder (SAE) is a dictionary learning method with the goal of learning monosemantic subspaces that map to high-level concepts (1). Several frontier AI labs have recently applied SAEs to interpret model internals and control model generations, yielding promising results (2, 3).
To build an intuitive understanding of this process, we can break down the SAE workflow as follows:
The learned features encoded in the columns and rows of the SAE encoder and decoder matrices carry significant semantic weight. In this post, we explore the geometric relationships between these feature vectors, particularly focusing on semantically opposite features (antonyms), to determine whether they exhibit meaningful geometric/directional relationships similar to those observed with embedded tokens/words in Language Models (4).
Overview of Vector & Feature Geometry
One of the most elegant and famous phenomena within LMs is the geometric relationships between a LM's word embeddings. We all know the equation "king - man + woman = queen" to come out of embedding algorithms like GloVe (5) and word2vec (6). We also find that these embedding algorithms perform well in benchmarks like the WordSim-353 in matching human ratings of word-pair similarity within the cosine similarity of their embeddings (7).
Project Motivations
Since the encoding step of the SAE workflow is analogous to taking the cosine similarity between the input activation with all learned features in the encoder matrix, we should expect the learned features of the SAE encoder and decoder weights to carry similar geometric relationships to their embeddings. In this post, we explore this hypothesis, particularly focusing on semantically opposite features (antonyms) and their cosine similarities to determine whether or not they exhibit meaningful geometric relationships similar to those observed with embedded tokens/words in LMs.
Experimental Setup
Finding Semantically Opposite Features
To investigate feature geometry relationships, we identify 20 pairs of semantically opposite ("antonym") features from the explanations file of gemma-scope-2b-pt-res. We find these features for SAEs in the gemma-scope-2b-pt-res SAE family corresponding to layer 10 and 20. For comprehensiveness, we included all semantically similar features when multiple instances occurred in the feature space. The selected antonym pairs and their corresponding feature indices are as follows:
Semantically Opposite Features: gemma-scope-2b-pt-res Layer 10
We also conduct the same experiments for Layer 20 of gemma-scope-2b-pt-res. For the sake of concision, the feature indices we use for those layers can be found in the appendix.
Evaluating Feature Geometry
To analyze the geometric relationships between semantically opposite features, we compute their cosine similarities. This measure quantifies the directional alignment between feature vectors, yielding values in the range [-1, 1], where:
Given that certain semantic concepts are represented by multiple feature vectors, we conduct the following comparison approach:
Control Experiments
To establish statistical significance and validate our findings, we implemented several control experiments:
Baseline Comparison with Random Features
To contextualize our results, we generated a baseline distribution of cosine similarities using 20 randomly selected feature pairs. This random baseline enables us to assess whether the geometric relationships observed between antonym pairs differ significantly from chance alignments. As done with the semantically opposite features, we record the maximum and minimum cosine similarity values as well as the mean sampled cosine similarity.
Comprehensive Feature Space Analysis
To establish the bounds of possible geometric relationships within our feature space, we conducted an optimized exhaustive search across all feature combinations in both the encoder and decoder matrices. This analysis identifies:
This comprehensive control framework enables us to establish a full range of possible geometric relationships within our feature space, as well as understand the semantic relationships between maximally aligned and opposed feature pairs in the feature space.
Validation Through Self-Similarity Analysis
As a validation measure, we computed self-similarity scores by calculating the cosine similarity between corresponding features in the encoder and decoder matrices. This analysis serves as a positive control, verifying our methodology's ability to detect strong geometric relationships where they are expected to exist.
Auxiliary Experiment: Compositional Model Steering
As an auxiliary experiment, we experiment with the effectiveness of compositional steering approaches using SAE feature weights. We accomplish this by injecting 2 semantically opposite features from the Decoder weights of GemmaScope 2b Layer 20 during model generation to explore effects on model output sentiment.
We employ the following compositional steering injection techniques using pyvene’s (8) SAE steering intervention workflow in response to the prompt “How are you?”:
We evaluate these injection techniques using a GPT-4o automatic labeller conditioned on few-shot examples of baseline (non-steered) examples of model responses to “How are you?” for grounding. We generate and label 20 responses for each steering approach and ask the model to compare the sentiment of steered examples against control responses.
Results
Cosine Similarity of Semantically Opposite Features
Our analysis reveals that semantically opposite feature pairs do not exhibit consistently strong geometric opposition in either the encoder or decoder matrices. While some semantically related features exhibit outlier cosine similarity, this effect isn't at all consistent. In fact, average cosine similarity hovers closely around random distributions.
Decoder Cosine Similarity: gemma-scope-2b-pt-res Layer 10
Encoder Cosine Similarity: gemma-scope-2b-pt-res Layer 10
Decoder Cosine Similarity: gemma-scope-2b-pt-res Layer 20
Encoder Cosine Similarity: gemma-scope-2b-pt-res Layer 20
To view the exact feature explanations for index pairs shown, one can reference Neuronpedia's Gemma-2-2B page for feature lookup by model layer.
Features Showing Optimized Cosine Similarity
As another control experiment, we perform an optimized exhaustive search throughout all of the Decoder feature space to find the maximally and minimally aligned features to examine what the most opposite and aligned features look like.
From analyzing the features found, we find there to be a lack of consistent semantic relationships between maximally and minimally aligned features in the decoder space. While there do exist a select few semantically related concepts, specifically in features that are maximally aligned, we find they do not appear consistently. We find consistent semantic misalignment in opposite pointing features. Below are the results (for the sake of formatting we only include the three most aligned features. The complete lists are featured in the appendix):
Most Aligned Cosine Similarity: gemma-scope-2b-pt-res Layer 10
6802: phrases related to weather and outdoor conditions
12291: scientific terminology and data analysis concepts
2791: references to Java interfaces and their implementations
15083: phrases indicating specific points in time or context
Most Opposite Cosine Similarity: gemma-scope-2b-pt-res Layer 10
7357: questions and conversions related to measurements and quantities
16200: concepts related to celestial bodies and their influence on life experiences
11236: Java programming language constructs and operations related to thread management
Most Aligned Cosine Similarity: gemma-scope-2b-pt-res Layer 20
15036: programming constructs related to thread handling and GUI components
14537: repeated occurrences of the word "the" in various contexts
6336: terms related to legal and criminal proceedings
Most Opposite Cosine Similarities: gemma-scope-2b-pt-res Layer 20
8684: technical jargon and programming-related terms
8366: verbs and their related forms, often related to medical or technical contexts
6793: elements that resemble structured data or identifiers, likely in a list or JSON format
Cosine Similarity of Corresponding Encoder-Decoder Features
To verify that meaningful geometric relationships exist at all between features in SAEs, we run a control experiment: we compute the cosine similarity between corresponding features in the encoder and decoder matrices (i.e., the cosine similarity between the "hot" feature in the encoder matrix and the "hot" feature in the decoder matrix).
We run this comparison for each semantically opposite and random feature against itself and report the averages below. Corresponding features show significant cosine similarity, which underscores how weak the cosine similarities between semantically opposite features are by comparison.
Same Feature Similarity: gemma-scope-2b-pt-res Layer 10
Same Feature Similarity: gemma-scope-2b-pt-res Layer 20
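As a rough sketch of this control, the snippet below compares each feature's encoder column with its decoder row. It assumes the layout described in the introduction (features as columns of the encoder and rows of the decoder); the function name `same_feature_cosine` and the toy matrices are hypothetical and stand in for the real SAE weights.

```python
import numpy as np

def same_feature_cosine(w_enc: np.ndarray, w_dec: np.ndarray, feature_ids):
    """Cosine similarity between encoder column i and decoder row i.

    Assumed shapes: w_enc is (d_model, n_features), w_dec is (n_features, d_model).
    """
    enc = w_enc[:, feature_ids].T  # (k, d_model): one encoder direction per row
    dec = w_dec[feature_ids]       # (k, d_model): matching decoder directions
    dots = np.sum(enc * dec, axis=1)
    norms = np.linalg.norm(enc, axis=1) * np.linalg.norm(dec, axis=1)
    return dots / norms

# Toy weights: the decoder is a noisy transpose of the encoder, so matching
# features should be strongly (but not perfectly) aligned.
rng = np.random.default_rng(0)
w_enc_demo = rng.standard_normal((8, 16))
w_dec_demo = w_enc_demo.T + 0.1 * rng.standard_normal((16, 8))
sims = same_feature_cosine(w_enc_demo, w_dec_demo, [0, 3, 7])
print("mean same-feature cosine similarity:", sims.mean())
```

Averaging the returned values over a set of feature indices gives the kind of per-layer summary reported for this control.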
Steering Results
We obtain mixed results when evaluating compositional steering. Our results suggest that compositional injection (-1 * "Happy" + "Sad") steers sentiment more strongly towards the positive feature’s direction, while flipping the direction of a single SAE feature results in neutral steering on its own.
Synchronized injection of antonymic features (happy + sad) seems to suggest some “cancelling out” of sentiment: sentiment that dominates under a single-injection control, such as sadness, becomes much less common when that feature is paired with a happy vector.
Overall, the compositionality of features yields interesting results and presents an opportunity for future research.
Contrastive Steering Results:
Compositional Steering Results
Control Steering Results: Happy and Sad Single Injection
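The compositional vectors discussed in this section can be illustrated with a short sketch. It assumes steering vectors are taken from decoder rows and added to the residual stream at every token position; the injection layer, scale, and positions used for the results above are not specified here, and the feature indices, helper names, and toy weights below are hypothetical.

```python
import numpy as np

def compose_steering_vector(w_dec: np.ndarray, weights: dict) -> np.ndarray:
    """Build a weighted sum of decoder directions.

    weights maps a feature index to its coefficient,
    e.g. {happy_idx: -1.0, sad_idx: 1.0} for -1 * happy + sad.
    """
    vec = np.zeros(w_dec.shape[1])
    for idx, coeff in weights.items():
        vec += coeff * w_dec[idx]
    return vec

def inject(resid: np.ndarray, steering_vec: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to every position of a (seq_len, d_model) activation."""
    return resid + scale * steering_vec

# Toy decoder and activations; indices 0 and 1 play the roles of "happy" and "sad".
rng = np.random.default_rng(0)
w_dec_demo = rng.standard_normal((4, 8))
happy_idx, sad_idx = 0, 1  # hypothetical feature indices

compositional = compose_steering_vector(w_dec_demo, {happy_idx: -1.0, sad_idx: 1.0})
synchronized = compose_steering_vector(w_dec_demo, {happy_idx: 1.0, sad_idx: 1.0})

resid_demo = rng.standard_normal((5, 8))
steered = inject(resid_demo, compositional, scale=2.0)
print(steered.shape)
```

In a real model the injection would happen inside a forward hook at the layer the SAE was trained on, which this sketch leaves out.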
Conclusions and Discussion
The striking absence of consistent geometric relationships between semantically opposite features is surprising, given how frequently geometric relationships occur between semantically related features in the original embedding spaces. SAE features are designed to anchor layer activations to their corresponding monosemantic directions, yet semantically related features seem to be almost completely unrelated geometrically in both SAE layers we examined.
One potential explanation for the lack of geometric relationships between semantically related SAE encoder and decoder features is contextual noise in the training data. As word embeddings pass through attention layers, their original directions could become muddled with noise from other words in the context window. However, the lack of any noticeable geometric relationship across two separate layers is surprising. These findings lay the path for future interpretability work to better understand this phenomenon.
Acknowledgements
A special thanks to Zhengxuan Wu for advising me on this project!
Appendix
Semantically Opposite Features: gemma-scope-2b-pt-res Layer 20
Most Aligned Cosine Similarity: gemma-scope-2b-pt-res Layer 10
6802: phrases related to weather and outdoor conditions
12291: scientific terminology and data analysis concepts
2791: references to Java interfaces and their implementations
15083: phrases indicating specific points in time or context
8291: function calls and declarations in code
14892: text related to artificial intelligence applications and methodologies in various scientific fields
12440: code snippets and references to specific programming functions or methods in a discussion or tutorial.
12167: structured data entries in a specific format, potentially related to programming or configuration settings.
15083: phrases indicating specific points in time or context
8144: names of people or entities associated with specific contexts or citations
15083: phrases indicating specific points in time or context
Most Opposite Cosine Similarity: gemma-scope-2b-pt-res Layer 10
7357: questions and conversions related to measurements and quantities
16200: concepts related to celestial bodies and their influence on life experiences
12689: terms related to critical evaluation or analysis
4727: patterns or sequences of symbols, particularly those that are visually distinct
12996: references to legal terms and proceedings
10264: text related to font specifications and styles
11236: Java programming language constructs and operations related to thread management
12542: information related to special DVD or Blu-ray editions and their features
6418: references to types of motor vehicles
8375: descriptions of baseball games and players.
Most Aligned Cosine Similarity: gemma-scope-2b-pt-res Layer 20
15036: programming constructs related to thread handling and GUI components
14537: repeated occurrences of the word "the" in various contexts
6336: terms related to legal and criminal proceedings
16267: instances of mathematical or programming syntax elements such as brackets and operators
1831: elements related to JavaScript tasks and project management
7248: structural elements and punctuation in the text
8458: occurrences of function and request-related terms in a programming context
16090: curly braces and parentheses in code
10192: modal verbs indicating possibility or necessity
8630: mentions of cute or appealing items or experiences
10695: punctuation marks and variation in sentence endings
745: sequences of numerical values
2437: patterns related to numerical information, particularly involving the number four
Most Opposite Cosine Similarity: gemma-scope-2b-pt-res Layer 20
8684: technical jargon and programming-related terms
8366: verbs and their related forms, often related to medical or technical contexts
6793: elements that resemble structured data or identifiers, likely in a list or JSON format
5533: terms related to medical and biological testing and analysis
10931: keywords and parameters commonly associated with programming and configuration files
5328: terms and phrases related to legal and statistical decisions
9185: technical terms related to audio synthesis and electronic music equipment
14881: phrases related to lubrication and mechanical properties
9064: mathematical operations and variable assignments in expressions
15111: phrases related to functionality and efficiency in systems and processes