A walkthrough of the core findings and guided replication of the concepts from the original research on “Multi-level features discovery with Matryoshka Sparse AutoEncoders”.
TL;DR
Sparse AutoEncoders (SAEs) are a cornerstone of mechanistic interpretability, but they struggle with scalability. As we increase the dictionary size to capture more features, we often encounter "feature splitting" and "feature absorption," where general concepts are lost or broken into fragmented, less interpretable components. Matryoshka Sparse Autoencoders (MSAEs) solve this by training nested dictionaries simultaneously. This forces the model to organize features hierarchically, where smaller dictionaries capture more general concepts and larger dictionaries capture the specific details, all within the same latent space.
1. Why Sparse AutoEncoders?
Sparse autoencoders are a technique with roots in signal processing. They evolved from dictionary learning methods and have recently proven effective at extracting the features that neural networks learn from the high-dimensional activations of LLMs.
In the early days of interpretability, researchers discovered that vision models often learn more features than they have neurons. This phenomenon is known as superposition, and it leaves individual neurons polysemantic, that is, responding to several unrelated features. This made mechanistic interpretability work very difficult and limited its progress for a few years. However, recent work, notably by Anthropic, has popularized the use of SAEs to disentangle these features. This allows us to extract monosemantic features and trace the circuits of concepts learned by LLMs.
Still, the technique is evolving. SAEs are interesting and promising, but they currently suffer from limitations that call for better scaling solutions before they can be used in practical interpretability work.
2. Limitations of Classic SAEs
There is a fundamental problem with the way SAEs are trained on a model's activations. The more we increase the dictionary size, that is, the total number of features the SAE can learn, the more the sparsity objective encourages the SAE either to split general features into fragmented sub-features or to absorb them into other latents, which makes them less useful.
The usual way of working around this problem is to sweep through many dictionary sizes and pick the best one. But this is computationally expensive and amounts to a trial-and-error approach that would not scale to larger, more modern models.
The paper identifies three important problems with current SAEs: feature splitting, feature absorption and feature composition.
Feature splitting is when a high-level feature, which could be important for interpretability, gets broken down into more specific features, so that the general representation goes missing. The original paper gives the example of punctuation: a general "punctuation mark" feature gets split into {Question Mark, Comma, Period}, and we lose the general feature in exchange for three more specific features that are all related to it.
Feature absorption is when part of a general concept gets taken over by a more specific latent, which leaves a hole in the general feature: it stops firing on the cases that the specific latent now covers.
Feature composition is when we have different independent features and the sparsity objective encourages the model to learn latents that capture their combinations, even if the features are not necessarily related, instead of representing each feature independently, which would be more useful.
All of these limitations are compounded when we increase the dictionary size, which is exactly what is needed to disentangle the many features learned by larger models.
3. What are Matryoshka SAEs?
Matryoshka SAEs were introduced to solve this. They are inspired by "Matryoshka Representation Learning" (MRL), named after the Russian nesting dolls.
MRL is a representation learning method where information is encoded at different levels of abstraction within the same embedding vector. Its main advantage is that the representation can be adapted to the computational constraints of downstream tasks.
The authors propose training multiple dictionaries of increasing size simultaneously. This forces the smaller dictionaries to reconstruct the inputs as well as they can without relying on the larger ones. As a result, features are organized hierarchically, so that:
- Smaller dictionaries learn more general concepts
- Larger dictionaries learn more specific concepts without absorbing the general ones into them
4. Architecture and Training Objective
At its core, the MSAE introduces a "nested" training objective.
As in a classic SAE, the encoder maps the hidden state $x$ into a latent representation $f(x)$, and the decoder reconstructs the input as $\hat{x}$.
Encoder: $f(x) = \sigma(W^{\text{enc}} x + b^{\text{enc}})$
Decoder: $\hat{x} = W^{\text{dec}} f(x) + b^{\text{dec}}$
1. $\sigma$ here is an activation function that introduces sparsity in $f(x)$ by enforcing non-negativity, typically the ReLU function.
2. The decoder uses a linear transformation to map the latents back to the original input space, which produces an approximate reconstruction $\hat{x}$ of the input $x$.
Matryoshka SAEs extend this classic encoder-decoder architecture by training multiple nested autoencoders of increasing size, and introduce a nested objective function for the training.
To enforce a hierarchy of features, the authors introduced the idea of nested dictionaries. In their method, we are given a maximum dictionary size $m$. Then from $m$, we define a sequence of nested dictionaries of sizes $\mathcal{M} = \{m_1, m_2, \dots, m_n\}$ such that $m_1 < m_2 < \dots < m_n = m$.
In the training process, each $m_i$ represents the dictionary of a sub-SAE that should reconstruct the input as well as it can using at most the first $m_i$ latents.
In practice, in their experiments, $\mathcal{M}$ is sampled for each training batch. This leads to the novel nested training objective as follows.
Training Objective Visualization
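For every input $x \in \mathbb{R}^d$, we have
$$f(x) = \sigma(W^{\text{enc}} x + b^{\text{enc}}) \tag{1}$$
$$\hat{x}_i = W^{\text{dec}}_{:,\,1:m_i}\, f(x)_{1:m_i} + b^{\text{dec}} \quad \text{for } m_i \in \mathcal{M} \tag{2}$$
In (1) and (2),
- $W^{\text{enc}} \in \mathbb{R}^{m \times d}$ and $W^{\text{dec}} \in \mathbb{R}^{d \times m}$ are the encoder and decoder matrices
- $f(x)_{1:m_i}$ means taking the first $m_i$ elements of $f(x)$
- Each nested decoder must learn to reconstruct $x$ using only the first $m_i$ latents
From this emerges the training objective, which is the novelty that Matryoshka SAEs bring (any sparsity penalty or auxiliary loss used by the base SAE is added on top):
$$\mathcal{L}(x) = \sum_{m_i \in \mathcal{M}} \left\| x - \left( W^{\text{dec}}_{:,\,1:m_i}\, f(x)_{1:m_i} + b^{\text{dec}} \right) \right\|_2^2$$
In this formula, each term $\left\| x - \hat{x}_i \right\|_2^2$ forces the model to reconstruct the input using only the first $m_i$ latents, ensuring the hierarchy is strictly maintained.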
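The nested reconstruction objective can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the function names (`encode`, `decode_prefix`, `matryoshka_reconstruction_loss`) are hypothetical, the decoder is stored with one row per latent for readability, and the sparsity and auxiliary terms are left out.

```python
def encode(x, W_enc, b_enc):
    """ReLU encoder: f(x) = max(0, W_enc x + b_enc). W_enc has one row per latent."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]


def decode_prefix(f, W_dec, b_dec, m):
    """Reconstruct the input from the first m latents only.

    W_dec[j] is the decoder direction of latent j (a vector of length d).
    """
    x_hat = list(b_dec)
    for j in range(m):
        for k in range(len(x_hat)):
            x_hat[k] += f[j] * W_dec[j][k]
    return x_hat


def matryoshka_reconstruction_loss(x, W_enc, b_enc, W_dec, b_dec, prefix_sizes):
    """Sum of squared reconstruction errors, one term per nested prefix size."""
    f = encode(x, W_enc, b_enc)
    total = 0.0
    for m in prefix_sizes:
        x_hat = decode_prefix(f, W_dec, b_dec, m)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total


# Tiny example: identity encoder/decoder, three latents, prefixes of size 1, 2, 3.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [0.0, 0.0, 0.0]
x = [1.0, 2.0, 3.0]
loss = matryoshka_reconstruction_loss(x, identity, zeros, identity, zeros, [1, 2, 3])
print(loss)  # prefix errors are 13.0, 9.0 and 0.0, so this prints 22.0
```

The shortest prefix contributes the largest error here, which is the pressure that pushes the most useful directions into the earliest latents.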
5. Practical Experimentation
The paper validates the method in two settings: a synthetic toy model with tree-structured features, and a 4-layer model trained on the TinyStories dataset. For the synthetic validation experiments, the authors constructed a tree of binary features in order to create known statistical dependencies:
- Every feature is a node, and the root feature is always active.
- Child features are sampled only if their parents are active, which creates the hierarchical statistical dependency.
- For each active feature, its direction vector is added to a running sum, which produces the final d-dimensional input that is fed into the autoencoder.
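The generative process described above can be sketched as follows. The tree, the activation probabilities and the one-hot directions are made-up values for illustration, and children are sampled independently here, whereas the toy model used in the replication has mutually exclusive children.

```python
import random


def sample_active(tree, probs, rng, node="root"):
    """Return the set of active features, starting from an always-active root.

    A child is only sampled when its parent is active, which creates the
    hierarchical dependency between features.
    """
    active = {node}
    for child in tree.get(node, []):
        if rng.random() < probs[child]:
            active |= sample_active(tree, probs, rng, child)
    return active


def make_input(active, directions, d):
    """Sum the direction vectors of all active features into one d-dim input."""
    x = [0.0] * d
    for name in active:
        for k, value in enumerate(directions[name]):
            x[k] += value
    return x


# Hypothetical feature tree and activation probabilities.
tree = {"root": ["animal", "vehicle"], "animal": ["cat", "dog"], "vehicle": ["car"]}
probs = {"animal": 0.5, "vehicle": 0.3, "cat": 0.5, "dog": 0.5, "car": 0.8}

# One orthonormal (one-hot) direction per feature.
names = ["root", "animal", "vehicle", "cat", "dog", "car"]
d = len(names)
directions = {name: [1.0 if k == i else 0.0 for k in range(d)]
              for i, name in enumerate(names)}

rng = random.Random(0)
samples = [sample_active(tree, probs, rng) for _ in range(1000)]
inputs = [make_input(active, directions, d) for active in samples]
```

By construction, "cat" never appears without "animal", which is exactly the kind of dependency that lets a vanilla SAE absorb the parent into the child.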
Graphically, it looks like
Here, the first $m_i$ latents must learn to reconstruct $x$ well on their own. A vanilla SAE gets a single loss signal, but here the SAE gets several nested loss signals at different prefix lengths. This forces the most important features to be encoded first.
A ReLU is used to induce sparsity by design, by setting any negative value to zero.
The prefix lengths (the Ks) are sampled from a Pareto distribution. This means that at each step the model is most often trained on small prefixes.
Pareto Distribution
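To make the prefix sampling concrete, here is a rough sketch that draws nested prefix lengths from a Pareto distribution using Python's standard library. The function name `sample_prefix_sizes`, the shape parameter `alpha=0.5`, and the clipping and deduplication rules are assumptions made for illustration; the official repository may sample prefixes differently.

```python
import random


def sample_prefix_sizes(n_latents, n_prefixes, alpha, rng):
    """Draw up to n_prefixes nested prefix lengths, favoring small ones.

    random.paretovariate(alpha) returns values >= 1 with most of the mass
    near 1, so short prefixes are drawn far more often than long ones.
    The full dictionary size is always included as the last prefix.
    """
    sizes = {n_latents}
    for _ in range(n_prefixes - 1):
        k = min(n_latents, int(rng.paretovariate(alpha)))
        sizes.add(k)  # duplicates collapse, so fewer prefixes may be returned
    return sorted(sizes)


rng = random.Random(0)
for step in range(3):
    # A fresh set of prefix lengths for every training step.
    print(step, sample_prefix_sizes(n_latents=20, n_prefixes=10, alpha=0.5, rng=rng))
```

Each returned list is strictly increasing and ends at the full dictionary size, so it can be used directly as the set of nested sizes in the objective.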
6. Results and Replication
The results show that MSAEs successfully avoid feature absorption in a controlled setting. They do so at a small cost in raw reconstruction quality compared with standard SAEs, because they are incentivized to pack the most "important" features into the smallest possible prefix of their latent space.
To go beyond reading the paper, we reproduced its central result, first on the synthetic toy model, then at LLM scale on a small TinyStories transformer. This section documents the setup and results. The replication code can be found at https://github.com/baimamboukar/replicating-matryoshka-sparse-autoencoders-paper.
Our setup
The paper's official repo can be found at noanabeshima/matryoshka-saes. For the LLM-scale run, we rebuilt the pipeline on top of the paper's shared sae.py.
We used Modal's environment with a setup of 3 x H100 GPUs for fast replication.
For TinyStories, we used the tinymodel (4-layer, 768-dim, ReLU, no-LayerNorm transformer), training SAEs on the layer-3 residual stream, drawing activations from the noanabeshima/TinyModelTokIds dataset.
Toy model (unchanged from the paper's notebook)
| Parameter | Value |
| --- | --- |
| Ground-truth features | 20 hierarchical features: 3 parents x {3 mutually-exclusive children and 1 hidden}, plus 8 rare features |
| d_model | 20 orthonormal features |
| SAE latents | 20 |
| Matryoshka prefixes | 10 (vanilla = 1) |
| Target L0 | 1.2338 (the true features' L0) |
| Steps | 40,000 |
TinyStories
| Parameter | Value |
| --- | --- |
| Activation site | residual stream, layer 3 |
| Training tokens | 30M, cached once |
| SAE latents | 3,072 |
| Target L0 | 30 (matched across both SAEs) |
| Learning rate | 1e-3 |
| Steps | 30,000 |
Our run
With this setup, and after a few minor fixes, mainly to the original training pipeline, we launched the following runs.
# Toy model training
python train_toy.py
# TinyStories on Modal for validation
modal run modal_tinystories.py --tokens 5000000 --n-steps 3000
# Healthy comparison run
modal run modal_tinystories.py --n-latents 3072 --target-l0 30 --lr 0.001
On these runs, the 30M-token activation cache is written once to a Modal Volume and reused across runs.
Replication Results
Toy model: the absorption claim, quantified. After training, we match each learned latent to its closest ground-truth feature and measure recovery as the best cosine similarity per true feature.
| Model | FVU | L0 | Clean Features |
| --- | --- | --- | --- |
| Matryoshka | 0.021 | 1.13 | 20 / 20 |
| Vanilla | 0.000 | 0.86 | 11 / 20 |
The vanilla SAE reconstructs perfectly yet cleanly recovers only 11 of 20 features, and the 9 it mangles are exactly the hierarchical parent/child features, which it absorbs and splits. The Matryoshka SAE recovers all 20, because its nested prefixes force the small dictionaries to represent the high-level features on their own.
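The recovery metric used for the toy model can be sketched as below. The helper names and the 0.9 threshold for calling a feature "clean" are assumptions for illustration; the replication code may use a different cutoff or matching rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match_per_feature(true_dirs, learned_dirs):
    """For each ground-truth direction, the best cosine similarity over all latents."""
    return [max(cosine(t, l) for l in learned_dirs) for t in true_dirs]


def count_clean(true_dirs, learned_dirs, threshold=0.9):
    """Number of true features recovered above the (assumed) similarity threshold."""
    scores = best_match_per_feature(true_dirs, learned_dirs)
    return sum(score >= threshold for score in scores)


# Two true features; the second learned latent mixes both of them.
true_dirs = [[1.0, 0.0], [0.0, 1.0]]
learned_dirs = [[1.0, 0.0], [1.0, 1.0]]
scores = best_match_per_feature(true_dirs, learned_dirs)
clean = count_clean(true_dirs, learned_dirs)
print(scores, clean)  # the second feature only reaches about 0.707, so 1 is clean
```

A latent that blends a parent with its child scores well below 1 against both true directions, which is how absorbed or split features show up in this count.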
TinyStories: the tradeoff holds at scale, with both SAEs healthy and at matched sparsity.
| Model | FVU | L0 | Dead Features |
| --- | --- | --- | --- |
| Matryoshka | 0.189 | 29.9 | 0% |
| Vanilla | 0.176 | 30.1 | 0% |
At the same L0, Matryoshka pays only a small reconstruction penalty, with ΔFVU ≈ 0.013. This is consistent with the paper's claim that a modest hit to raw reconstruction buys the large gains in feature structure that the toy experiment makes visible.
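For readers unfamiliar with the metric, FVU (fraction of variance unexplained) can be computed as sketched below. This assumes the standard definition, residual sum of squares divided by the total variance of the inputs around their mean; the replication code may normalize differently.

```python
def fvu(inputs, reconstructions):
    """Fraction of variance unexplained: residual SS / total SS around the mean."""
    n = len(inputs)
    d = len(inputs[0])
    mean = [sum(x[k] for x in inputs) / n for k in range(d)]
    residual = sum((a - b) ** 2
                   for x, x_hat in zip(inputs, reconstructions)
                   for a, b in zip(x, x_hat))
    total = sum((a - m) ** 2 for x in inputs for a, m in zip(x, mean))
    return residual / total


inputs = [[0.0, 0.0], [2.0, 2.0]]
perfect = fvu(inputs, inputs)                        # exact reconstruction
mean_only = fvu(inputs, [[1.0, 1.0], [1.0, 1.0]])    # always predicting the mean
partial = fvu(inputs, [[0.0, 0.0], [2.0, 1.0]])      # one coordinate off by 1
print(perfect, mean_only, partial)

# Gap between the two TinyStories runs reported in the table above.
delta_fvu = round(0.189 - 0.176, 3)
print(delta_fvu)
```

An FVU of 0 means perfect reconstruction and an FVU of 1 means the SAE does no better than predicting the mean activation, so lower is better.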