I use nodewise LASSO to build approximate sparse conditional dependence graphs over SAE features, with resampling and null controls
Initial experiments produce graphs with small standalone modules that are stable under resampling
These modules frequently correspond to coherent-looking linguistic features, and are only weakly aligned with cosine similarity
This should be read primarily as a proof-of-concept, and methodological refinement is ongoing
Motivation
In practice, SAE features exhibit redundancy, which we see in phenomena like feature splitting, absorption, and duplication. We can look at SAE features through the lens of cosine similarity of feature weights or using correlation between activations. These can capture some similarity between features, but don't model conditional dependence between features. I believe that if SAE features have redundancy or hierarchy, their activations should show sparse conditional dependence structure. If such dependency structure exists, we should be able to model it with methods that build a precision graph between features.
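As a toy illustration of why a precision graph captures conditional rather than marginal dependence, here is a hypothetical three-variable chain in numpy. None of this is the actual pipeline; the variables and coefficients are made up purely to show the statistical idea:

```python
import numpy as np

# Toy chain A - B - C: A and C are conditionally independent given B,
# but marginally correlated because both load on B.
rng = np.random.default_rng(0)
n = 200_000
b = rng.normal(size=n)
a = 0.8 * b + rng.normal(size=n)
c = 0.8 * b + rng.normal(size=n)
X = np.stack([a, b, c], axis=1)

corr = np.corrcoef(X, rowvar=False)
prec = np.linalg.inv(np.cov(X, rowvar=False))

print(corr[0, 2])  # ~0.39: A and C look related marginally
print(prec[0, 2])  # ~0: no A-C edge in the precision graph
```

Zeros in the precision matrix are exactly the missing edges of the conditional dependence graph (in the Gaussian case), which is what correlation alone cannot distinguish.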
My approach here models this conditional dependence structure downstream of activation correlations, with a goal of better understanding SAE feature geometry, and sits between circuits-type approaches and SAE work. To my knowledge, conditional dependence between SAE features has not yet been modeled in prior work.
Methodology
Run nodewise LASSO on random activations to approximate linear dependence between SAE features, using repeat and null trials to control against dataset bias.
I've built a pipeline to test this theory. My approach is as follows:
Sample activations from a SAE at a given layer on random sequences
Pre-screen candidate neighbors for each SAE feature
Calculate correlations between features over activations, convert to Fisher z-scores, and use BH-FDR to control false discovery
Keep the top $k$ candidates per node
For each SAE feature:
Perform a parameter sweep to tune the LASSO regularization coefficient
Run LASSO to identify conditionally dependent neighbors for this feature
Merge edges from each nodewise sweep, keeping bidirectional edges only
Repeat under random sampling to test stability: I do 30 resamples, plus a matched null trial (feature-wise shuffle) for each resample
Identify edges that appear consistently across resamples and pass BH-FDR vs null trials
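The pre-screening step above (correlations, Fisher z-scores, BH-FDR, top-k) can be sketched as follows. The function and parameter names are mine, not the author's actual code:

```python
import numpy as np
from scipy.stats import norm

def prescreen(acts, k=20, alpha=0.05):
    """Correlation pre-screen: Fisher z-scores + Benjamini-Hochberg,
    then top-k candidate neighbors per feature.

    acts: (n_samples, n_features) activation matrix.
    """
    n, d = acts.shape
    r = np.corrcoef(acts, rowvar=False)
    np.fill_diagonal(r, 0.0)

    # Fisher z-transform; sqrt(n - 3) * atanh(r) is ~N(0, 1) under r = 0.
    z = np.sqrt(n - 3) * np.arctanh(np.clip(r, -0.999999, 0.999999))
    p = 2.0 * norm.sf(np.abs(z))

    # Benjamini-Hochberg over the upper triangle of the p-value matrix:
    # keep all pairs with p at or below the largest p_(i) <= alpha * i / m.
    ps = np.sort(p[np.triu_indices(d, k=1)])
    m = ps.size
    passed = ps <= alpha * np.arange(1, m + 1) / m
    cutoff = ps[passed].max() if passed.any() else -1.0

    # For each feature, keep the top-k surviving candidates by |r|.
    candidates = {}
    for j in range(d):
        ok = np.where(p[:, j] <= cutoff)[0]
        candidates[j] = ok[np.argsort(-np.abs(r[ok, j]))][:k]
    return candidates
```

The BH cutoff is shared across all pairs, and the per-node top-k then bounds the size of each LASSO design matrix.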
This procedure approximates a conditional-dependence graph, but is not an exact estimator.
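A minimal sketch of the nodewise step, using scikit-learn's LassoCV as a stand-in for the author's regularization sweep (function and variable names are mine):

```python
import numpy as np
from sklearn.linear_model import LassoCV

def nodewise_edges(acts, candidates, cv=5, tol=1e-8):
    """Meinshausen-Buhlmann style neighborhood selection.

    acts: (n_samples, n_features); candidates: feature -> candidate
    neighbor indices from pre-screening. Returns undirected edges
    kept under the AND rule (selected in both directions).
    """
    directed = set()
    for j, cand in candidates.items():
        cand = np.asarray(cand)
        if cand.size == 0:
            continue
        # Regress feature j on its candidate neighbors; LassoCV picks
        # the regularization strength by cross-validation.
        fit = LassoCV(cv=cv).fit(acts[:, cand], acts[:, j])
        for i in cand[np.abs(fit.coef_) > tol]:
            directed.add((j, int(i)))
    # Keep an edge only if each endpoint selected the other.
    return {tuple(sorted(e)) for e in directed if (e[1], e[0]) in directed}
```

The AND (bidirectional) rule is the more conservative of the two standard symmetrization choices; the OR rule would keep any edge selected in either direction.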
Initial Results
Small scale trials identify linguistically coherent looking modules that are only weakly aligned with cosine similarity.
I've successfully run this approach on a single SAE layer for models from gpt2-small up to gemma2:27b.
Here, I'm presenting results from gemma2:27b (Gemmascope, layer_10/width_131k) as the largest model I've tested so far.
Sequences are randomly sampled from fineweb.
| Setting | Value |
|---|---|
| Base Model | gemma2:27b |
| SAE Model | gemmascope, layer_10/width_131k |
| Dataset | fineweb |
| Activation samples per feature | 200 |
| Activation sparsity (30-trial mean) | 0.99897 |
| Average features per token (30-trial mean) | 133.7 |
For downstream analysis, I keep edges from the repeat trials using BH-FDR to compare the stability of each edge against the null trials, with 0.1 as the threshold. This model + dataset pair meets the 0.1 FDR threshold at ≥2 replicates, yielding 57,047 retained edges.
| Number of repeats an edge is seen in | Observed edges | Expected null edges | Estimated FDR |
|---|---|---|---|
| 1 | 204,609 | 196,273 | 9.59e-01 |
| 2 | 57,047 | 1,981 | 3.47e-02 |
| 3 | 31,429 | 110 | 3.50e-03 |
| 5 | 15,165 | 1 | 6.59e-05 |
| 10 | 5,286 | 0 | 0 |
| 20 | 1,389 | 0 | 0 |
| 30 | 302 | 0 | 0 |
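The estimated FDR values above are consistent with a simple plug-in estimate: expected null edges divided by observed edges at each replicate threshold. For the 2-repeats row:

```python
# Plug-in FDR estimate at a replicate threshold:
# estimated FDR = expected null edges / observed edges.
observed = 57_047        # edges seen in >= 2 repeats
expected_null = 1_981    # matched-null edges at the same threshold
fdr_hat = expected_null / observed
print(f"{fdr_hat:.2e}")  # 3.47e-02, matching the table row
```

The same ratio reproduces every row of the table (e.g. 196,273 / 204,609 ≈ 9.59e-01 for single-repeat edges).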
Note that for this preliminary experiment I'm using only a single null resampling per trial.
Importantly, I see that cosine similarity and edge strength in the discovered graph are only weakly correlated, with high variance (see below; the plot is over the set of stable edges).
For qualitative interpretation here, I restrict manual inspection to the much smaller subset of stable edges on a small subset of nodes. Within this set of 227 nodes and 302 edges, I end up with one large connected component (98 nodes), a handful of smaller standalone components (5-9 nodes each), and a large number of tiny components. The tiny components appear to mostly be duplicated features with strong cosine similarity, so I've filtered those elements out.
Small components
The smaller standalone components (5-9 features each) look like linguistically coherent features under manual inspection (mostly grammar- and context-related):
Component 2 (9 nodes): possessive pronouns + some context.
Component 3 (9 nodes): "to be" + some descriptive context
Component 4 (7 nodes): "which/who/that" followed by verbs, plus explanatory context (e.g. we prove that, that maps, that means)
Component 5 (5 nodes): "has been" either alone or followed by specific words
Large component
The large connected cluster contains subsets of communities that can be identified with community detection.
Note that for the sake of legibility I don't label nodes in the following graph of all clusters.
My current read is that these communities are not as linguistically "clean" as the standalone components --- many have a node that doesn't fit the rest of the "theme" of the community.
In addition, I see a set of catch-all miscellaneous communities as well as tiny clusters. These catch-all communities (Clusters 0 and 6) describe a variety of features across technical jargon, legal language, code and formatting, and other rare things like specific formal names.
Many of the small communities are incoherent, but some show interesting features. A notable example here is a community that describes uses of the number '2' as a prefix in different contexts. Note that these graphs are downstream of the correlation between activations in the pre-screening step. They should be interpreted as a sparse refinement of correlation structure towards conditional dependence, rather than a standalone finding separate from correlation.
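The component and community analysis above can be reproduced with standard graph tooling. Here is a sketch with networkx over a hypothetical stable-edge list; the feature ids are made up for illustration:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical stable edges between SAE feature ids (illustrative only).
stable_edges = [(12, 345), (345, 678), (12, 678), (678, 910),
                (2001, 2002), (2001, 2003)]

G = nx.Graph()
G.add_edges_from(stable_edges)

# Connected components give the standalone modules, largest first.
components = sorted(nx.connected_components(G), key=len, reverse=True)

# Greedy modularity maximization splits the largest component
# into sub-communities.
largest = G.subgraph(components[0])
communities = greedy_modularity_communities(largest)
print([len(c) for c in components])  # [4, 3]
```

Any modularity-based community detector would do here; greedy modularity is simply a convenient default in networkx.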
Caveats and Limitations
There are several limitations to this approach that you should keep in mind:
This isn't finding a true precision graph, because SAE activations are not Gaussian
The output graph is dependent on correlations from the pre-screening step, which means that this is a refinement on correlation structure of activations. Thus this method should be read as sparsifying the correlation structure of SAE activations in the direction of linear dependence.
What I'm calling "stable" here means stability under resampling over the same dataset, which doesn't say anything about underlying feature faithfulness.
At the current FDR threshold, stability is an extreme constraint (302 edges over 131k nodes), and the standalone clusters seem to mostly be linguistic backbone.
As implemented right now, this method has hyperparameters that need either more thorough sweeps or methodological changes to remove them.
Next Steps
Right now, I see two primary directions this project needs to take:
Refining some of the steps in the pipeline to (a) confidently use the full ~57k-edge repeat set (edges seen in at least two repeats) and (b) remove some of the current design's hyperparameter dependence
Scaling this to all layers in a given model and figuring out how to connect graphs across layers accounting for the residual stream
Additionally, I'll be looking at:
How different datasets may change the output graphs and how to control against that
Where this might land in relation to feature splitting and absorption
What, if any, hierarchical structure between features we might be able to infer from these graphs
Funding Note: This work is funded by a Coefficient Giving TAIS grant.