Training Superior Sparse Autoencoders for Instruct Models

Haoran Ye

Resource	Link
Paper	https://arxiv.org/abs/2506.07691
Code	https://github.com/Geaming2002/FAST
SAEs	Llama-3.1-8B-Instruct_SAEs🤗,Llama-3.2-3B-Instruct_SAEs🤗,Llama-3.2-1B-Instruct_SAEs🤗,Qwen2.5-7B-Instruct_SAEs🤗,Qwen2.5-3B-Instruct_SAEs🤗,Qwen2.5-1.5B-Instruct_SAEs🤗,Qwen2.5-0.5B-Instruct_SAEs🤗

💡 TL;DR

In this paper, we identify two problems in previous SAE training approaches for instruct models:

📚 Suboptimal dataset selection affecting SAE performance.

✂️ Semantic discontinuity caused by block training truncating samples mid-content.

Therefore, we propose Finetuning-aligned Sequential Training (FAST)💪, a novel training method specifically tailored for instruct models. The results demonstrate:

Token Reconstruction Performance 📉: FAST shows better token reconstruction performance. On Qwen2.5-7B-Instruct, FAST achieves a mean squared error of 0.6468, significantly outperforming baseline methods with errors of 5.1985 and 1.5096.

Feature Interpretability 🎯: FAST yields a higher proportion of high-quality features. For Llama3.2-3B-Instruct, 21.1% of features scored in the top range, compared to 7.0% and 10.2% for BT(P) and BT(F).

Novel Discovery 🔍: Intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior, enabling broad adoption and future research.

Find the details in our post below👇

🔍Motivation: Why Traditional SAE Training Falls Short

Imagine reading a novel where every few pages, the story abruptly jumps to a completely different book—confusing📚✂️, right? This is essentially what happens with traditional Sparse Autoencoder (SAE) training methods for large language models!

Block Training (BT) has become the default approach for SAE training, where datasets (usually pretraining datasets) are concatenated into fixed-length blocks (Joseph Bloom and Chanin, 2024; Bricken et al., 2023). While this works reasonably well for base models—which are accustomed to processing random text chunks during pretraining—it creates significant problems for instruct models that have been fine-tuned to understand complete, coherent instructions.

Consider a typical 8,192-token training block: BT might stitch together 2,048 tokens from one sample with 6,144 tokens from another, creating jarring semantic discontinuities. For instruction-tuned models designed to maintain contextual understanding, this abrupt semantic "cliff edge" severely compromises their ability to align with downstream tasks and maintain coherent representations (Kissane et al., 2024b).

To solve this fundamental mismatch, we introduce Finetuning-aligned Sequential Training (FAST) —a novel method specifically designed for training SAEs on instruct models. Unlike BT, our approach processes each data instance independently, preserving semantic integrity and maintaining alignment with the model's fine-tuning objectives. This ensures the model operates in a consistent semantic space during SAE training, ultimately enhancing both training quality and the model's ability to process instructions effectively.
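The difference between the two data pipelines can be shown with a toy sketch. This is a minimal illustration, not the released FAST code: the function names `block_training_batches` and `fast_batches` are hypothetical, token IDs are made up, and both dropping the trailing partial block and truncating over-long samples to `max_len` are assumptions of this sketch.

```python
from typing import Iterator, List


def block_training_batches(samples: List[List[int]], block_size: int) -> Iterator[List[int]]:
    """BT-style packing: concatenate all samples, then cut fixed-length blocks."""
    stream = [tok for sample in samples for tok in sample]
    # Assumption: the trailing partial block is dropped.
    for start in range(0, len(stream) - block_size + 1, block_size):
        yield stream[start:start + block_size]


def fast_batches(samples: List[List[int]], max_len: int) -> Iterator[List[int]]:
    """FAST-style processing: every sample stays its own sequence."""
    for sample in samples:
        # Assumption: a sample is only cut if it alone exceeds max_len.
        yield sample[:max_len]


# Toy token IDs: sample A has 3 tokens, sample B has 5.
samples = [[1, 2, 3], [10, 11, 12, 13, 14]]
bt = list(block_training_batches(samples, block_size=4))
fast = list(fast_batches(samples, max_len=8))

print(bt)    # [[1, 2, 3, 10], [11, 12, 13, 14]] -> first block mixes A and B
print(fast)  # [[1, 2, 3], [10, 11, 12, 13, 14]] -> sample boundaries preserved
```

In the BT output, the first block contains the end of sample A followed by the start of sample B, which is exactly the discontinuity described above. In the FAST output, no sequence ever crosses a sample boundary.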

Our Approach: Finetuning-aligned Sequential Training (FAST)

FAST

Illustration of the LLM training pipeline and SAE training methods. (a) The pipeline transitions from pretraining to fine-tuning. (b) Block Training (BT) concatenates datasets and resplits them into fixed-length blocks. (c) Finetuning-aligned Sequential Training (FAST) processes data instances independently, preserving semantic integrity and improving alignment with fine-tuning objectives, leading to better performance in feature interpretability.

Following the motivation outlined above, FAST incorporates three key technical components:

Data Processing
- Processes each dialogue instance independently with the model's chat template
- Preserves complete semantic integrity of individual samples
- Maintains consistency with the model's fine-tuning phase methodology
- Eliminates semantic discontinuity issues present in traditional block training
Dual SAE Architecture: To demonstrate the generalizability of the FAST training method, we implement it across two distinct SAE architectures.
- Standard ReLU-based SAE (Bricken et al., 2023)
  - Traditional architecture widely adopted in previous studies
  - Uses ReLU activation and L1 sparsity regularization
- JumpReLU SAE (Rajamanoharan et al., 2024a; Lieberum et al., 2024)
  - Enhanced version with modified activation function
  - Implements learnable thresholds for better feature control
  - Achieves superior reconstruction quality and sparsity management
Mixing Activation Buffer: The buffer is shuffled, half of it is sent to the SAE for training, and newly computed activations are used to refill the buffer. This iterative process ensures data diversity and storage efficiency (Joseph Bloom and Chanin, 2024).
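The shuffle, train, refill cycle of the mixing activation buffer can be sketched as a small generator. This is a simplified illustration under stated assumptions: `mixing_buffer` and `fake_source` are hypothetical names, activations are stood in for by plain floats rather than hidden-state vectors, and the real implementation may differ in details such as buffer sizes and batching.

```python
import itertools
import random
from typing import Callable, Iterator, List


def mixing_buffer(
    new_activations: Callable[[int], List[float]],
    buffer_size: int,
    steps: int,
    seed: int = 0,
) -> Iterator[List[float]]:
    """Yield training batches from a buffer that is shuffled and half-refilled."""
    rng = random.Random(seed)
    buffer = new_activations(buffer_size)  # initial fill
    half = buffer_size // 2
    for _ in range(steps):
        rng.shuffle(buffer)                # mix old and new activations
        yield buffer[:half]                # this half goes to SAE training
        # Keep the unused half and refill with freshly computed activations.
        buffer = buffer[half:] + new_activations(half)


# Stand-in for "run the model and collect activations" (hypothetical).
counter = itertools.count()


def fake_source(n: int) -> List[float]:
    return [float(next(counter)) for _ in range(n)]


batches = list(mixing_buffer(fake_source, buffer_size=8, steps=3))
for batch in batches:
    print(batch)
```

Because only half of the buffer is consumed per step, every training batch mixes activations from several refills, while memory use stays bounded by `buffer_size`.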

Experiments

Setup

Dataset: Combined multiple high-quality instruction datasets (WildChat-1M-Full, Infinity-Instruct, tulu-3-sft-mixture, orca-agentinstruct-1M-v1-cleaned, and lmsys-chat-1m), resulting in ~4.7M samples after deduplication. For BT(P), we use the Pile dataset to train the corresponding SAEs.

Models: Evaluated on 7 models from Llama (3.1, 3.2) and Qwen (2.5) series:

Model Name	Layer
Llama-3.1-8B-Instruct	[4, 12, 18, 20, 25]
Llama-3.2-3B-Instruct	[4, 12, 20]
Llama-3.2-1B-Instruct	[4, 9, 14]
Qwen2.5-7B-Instruct	[4, 12, 18, 20, 25]
Qwen2.5-3B-Instruct	[4, 18, 32]
Qwen2.5-1.5B-Instruct	[4, 14, 24]
Qwen2.5-0.5B-Instruct	[4, 12, 20]

Metrics: for overall reconstruction $\mathrm{MSE} = \frac{\sum_{i=1}^{N} \frac{1}{L_{i}} \sum_{j=1}^{L_{i}} \sum_{k=1}^{H} (y_{i,j,k} - \hat{y}_{i,j,k})^{2}}{N \cdot H}$

where $N$ denotes the size of the dataset, $L_{i}$ represents the length of the $i$-th sequence, $H$ refers to the hidden dimension of the model, and $y_{i,j,k}$ and $\hat{y}_{i,j,k}$ are the original activation and its SAE reconstruction. To evaluate the SAE's performance specifically on special tokens, we also compute the $\mathrm{MSE}$ of special tokens, denoted as $\mathrm{MSE}_{st}$. Lower $\mathrm{MSE}$ values reflect better SAE reconstruction performance.
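As a worked example of the formula, here is a minimal pure-Python sketch that averages the squared error over tokens within each sequence, then over sequences and hidden dimensions. The function name `sequence_mse` and the toy values are hypothetical. The special-token variant is not shown, since the post does not specify how those positions are selected.

```python
from typing import List

# Nested lists indexed as [sequence i][token j][hidden dim k].
Activations = List[List[List[float]]]


def sequence_mse(y: Activations, y_hat: Activations) -> float:
    """MSE = sum_i (1/L_i) sum_j sum_k (y - y_hat)^2, divided by N * H."""
    n = len(y)            # N: number of sequences
    h = len(y[0][0])      # H: hidden dimension
    total = 0.0
    for seq, seq_hat in zip(y, y_hat):
        sq_err = sum(
            (a - b) ** 2
            for tok, tok_hat in zip(seq, seq_hat)
            for a, b in zip(tok, tok_hat)
        )
        total += sq_err / len(seq)  # normalise by sequence length L_i
    return total / (n * h)


# Two toy sequences (lengths 1 and 2) with hidden dimension 2.
y = [[[1.0, 2.0]], [[0.0, 0.0], [2.0, 2.0]]]
y_hat = [[[0.0, 2.0]], [[0.0, 0.0], [0.0, 2.0]]]

# Sequence 1: squared error 1 over 1 token  -> 1.0
# Sequence 2: squared error 4 over 2 tokens -> 2.0
# (1.0 + 2.0) / (N * H) = 3.0 / 4
print(sequence_mse(y, y_hat))  # 0.75
```

The per-sequence division by $L_i$ means long and short sequences contribute equally to the final score.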

Results


$\mathrm{MSE}_{st}$ performance of the JumpReLU SAE (all metrics are presented in log scale, where lower values indicate better SAE reconstruction performance). Within the JumpReLU architecture, FAST exhibits the best reconstruction capability compared to BT(P) and BT(F).

Lower Error⬇️: FAST achieves the lowest MSE among all methods tested.
Better Token Reconstruction: FAST outperforms other methods in reconstructing both general and special tokens, especially on Llama and Qwen models.
Stronger Impact on Standard SAE: The improvement brought by FAST is more significant in Standard SAE, overcoming its previous limitations. For JumpReLU SAE, while the gains are smaller due to its already strong baseline, FAST still delivers meaningful performance improvements.

❓But can we intuitively feel this advantage? -> Feature Interpretability

While metrics like MSE provide an objective, quantitative comparison of reconstruction capabilities across different SAE architectures, they can feel somewhat detached from practical experience. To better evaluate the real-world quality of an SAE, it is important to also consider experimental methods such as feature interpretability, which offer more intuitive insights into model performance.

An additional 10,000 instances are sampled and their activation values are computed. For each feature, the five sentences with the highest activation values are then selected to build an activation dataset for evaluation. GPT-4o is prompted to score each group of five contexts and generate a descriptive summary.

Below is the feature scoring rubric we designed for the LLM, following Llama Scope (2024):

Score	Description
5	Clear pattern with no deviating examples
4	Clear pattern with one or two deviating examples
3	Clear overall pattern but quite a few examples not fitting that pattern
2	Broad consistent theme but lacking structure
1	No discernible pattern

Models: JumpReLU SAEs exclusively

feature_scores

Quality Distribution:
- On Llama3.2-3B-Instruct, FAST achieves 21.1% high-quality features (scores 4-5) vs 7.0% (BT(P)) and 10.2% (BT(F)).
- FAST also significantly reduces the proportion of low-quality features.
CDF Analysis:
- FAST consistently shows the lowest proportion of features scoring ≤ 3.
- Example: Qwen2.5-3B CDF@3: 76.5% (FAST) vs 89.0% (BT(F)) and 92.2% (BT(P)).
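The two summary statistics above (share of features scoring 4-5, and CDF@3) are simple to compute from a list of per-feature scores. The sketch below uses hypothetical function names and made-up scores purely to show the arithmetic; it does not reproduce any number from the paper.

```python
from typing import List


def high_quality_share(scores: List[int], threshold: int = 4) -> float:
    """Fraction of features whose score is at least `threshold`."""
    return sum(s >= threshold for s in scores) / len(scores)


def cdf_at(scores: List[int], k: int) -> float:
    """Fraction of features whose score is at most `k` (CDF@k)."""
    return sum(s <= k for s in scores) / len(scores)


# Made-up scores for ten features on the 1-5 rubric.
scores = [5, 4, 3, 3, 2, 2, 2, 1, 1, 1]

print(high_quality_share(scores))  # 0.2 -> two of ten features score 4 or 5
print(cdf_at(scores, 3))           # 0.8 -> eight of ten features score <= 3
```

With integer scores, CDF@3 and the 4-5 share always sum to 1, so a lower CDF@3 is the same statement as a higher proportion of high-quality features.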

The FAST training methodology produces substantially more interpretable features than block training approaches, demonstrating its effectiveness for enhancing SAE interpretability.

🔥Case Study: Steering with SAE Latents

By performing steering operations on certain special features identified by the SAE, we are able to modify the model's original activation patterns and thereby influence its final output. This provides an intuitive demonstration of how the SAE decomposes the model into disentangled, semantically meaningful features. Since our models are trained in an instruction-tuned setting, we are particularly interested in understanding the roles of features most strongly associated with special tokens in Instruct models. Specifically, we investigate the features with the highest activation for <|im_start|> in Qwen2.5-7B-Instruct and <|start_header_id|> in Llama3.1-8B-Instruct, and apply targeted steering to these features. This allows us to explore how varying the steering coefficients affects the model's output on a range of question-answering tasks.
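The post does not spell out the exact steering rule, so the following is only a sketch under a common assumption: steering adds the coefficient α times the chosen feature's SAE decoder direction to the hidden state, and only at the targeted special-token positions. The function `steer`, the token IDs, and the two-dimensional vectors are all hypothetical placeholders.

```python
from typing import List, Set


def steer(
    hidden: List[List[float]],
    token_ids: List[int],
    decoder_direction: List[float],
    alpha: float,
    target_token_ids: Set[int],
) -> List[List[float]]:
    """Add alpha * decoder_direction to hidden states at targeted tokens only."""
    steered = []
    for vec, tok in zip(hidden, token_ids):
        if tok in target_token_ids:
            # Assumed steering rule: x' = x + alpha * d_feature
            vec = [v + alpha * d for v, d in zip(vec, decoder_direction)]
        steered.append(list(vec))
    return steered


# Toy setup: token 100 stands in for a special token such as <|im_start|>.
hidden = [[0.0, 0.0], [1.0, 1.0]]
token_ids = [100, 7]
direction = [1.0, 0.0]  # placeholder for one feature's decoder vector

print(steer(hidden, token_ids, direction, alpha=25.0, target_token_ids={100}))
# [[25.0, 0.0], [1.0, 1.0]] -> only the special-token position is shifted
```

Under this assumed rule, α directly scales how far the hidden state is pushed along the feature direction, which is consistent with the observation below that moderate values help while very large values degrade the output.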

We observe some interesting results when steering with these SAE features on three concrete questions:

Q2

qwen_Q2

The steering output generated by Qwen2.5-7B-Instruct with Feature ID: 13794, focusing on user and <|im_start|> tokens for Question 2 (entity description).

llama_Q2

The steering output generated by Llama3.1-8B-Instruct with Feature ID: 22642, focusing on <|start_header_id|> tokens for Question 2 (entity description).

In Question 2, Qwen demonstrates optimal performance when feature 13794 is activated within a moderate range (specifically, setting $α$ between 25 and 75). Within this range, the model produces coherent, detailed, and informative responses. However, when $α$ is set too high (such as $α \geq 100$ ), Qwen exhibits severe degradation—generating hallucinations, producing repetitive content, and losing coherence in its outputs.

Llama, in contrast, shows limited responsiveness. It only exhibits meaningful improvements within a narrow range of $α = 15$ to $25$ . Within this window, the model demonstrates slightly enhanced politeness and helpfulness—though the improvements remain modest. Outside this range, it rapidly deteriorates into repetitive and incoherent output patterns.

Q3

qwen_Q3

The steering output generated by Qwen2.5-7B-Instruct with Feature ID: 13794, focusing on user and <|im_start|> tokens for Question 3 (cover letter task).

llama_Q3

The steering output generated by Llama3.1-8B-Instruct with Feature ID: 22642, focusing on <|start_header_id|> tokens for Question 3 (cover letter task).

For Question 3, giving feature 13794 a moderate boost in the Qwen model—think $α$ between 50 and 100—is like handing it a well-organized notepad and a double-shot of clarity. The responses become noticeably more informative and better structured, with richer content and more coherent reasoning. It’s as if Qwen hits its stride in this range, delivering thoughtful, well-formed answers.

But don’t get carried away with the dial. Push $α$ too high, and Qwen starts to unravel: the content turns repetitive, drifts off-topic, and sometimes even flips languages or fabricates facts out of thin air—like it’s improvising without a script.

Llama, by contrast, shows only modest gains when its most active feature is gently amplified—up to around $α = 25$ . Within this narrow window, it becomes slightly more informative and engaging, but the improvements are subtle. Go beyond that, and things quickly go downhill: answers become repetitive, coherence drops, and the model seems to lose its conversational footing.

Q4

qwen_Q4

The steering output generated by Qwen2.5-7B-Instruct with Feature ID: 13794, focusing on user and <|im_start|> tokens for Question 4 (entity discrimination task).

llama_Q4

The steering output generated by Llama3.1-8B-Instruct with Feature ID: 22642, focusing on <|start_header_id|> tokens for Question 4 (entity discrimination task).

Q4 shows that both Qwen and Llama can “level up” their reasoning and answer quality—but only if you hit the sweet spot with feature amplification.

For Qwen, dialing up its most active feature with $α$ between 25 and 100 works like flipping a switch. Suddenly, its responses become more convincing, informative, and logically structured. It’s as if Qwen hits its rhythm—delivering answers with sharper reasoning and clearer flow. But beware: push the amplification too far, and the magic fades. Coherence starts to slip, informativeness declines, and the model loses its edge.

Llama, meanwhile, plays a more delicate tune. A light touch—amplifying up to around $α = 25$ —can give it a minor boost in reasoning and engagement. But anything beyond that, and the performance takes a nosedive: responses become repetitive, meaning gets muddled, and the output quickly loses its quality.

In summary, results across all three questions reveal a clear optimal range for the coefficient α: when set appropriately, the model's responses become sharper, more coherent, and highly relevant. However, when the coefficient exceeds this optimal range, quality deteriorates rapidly—language degrades, semantic meaning becomes unclear, and outputs can become unpredictable.

What's particularly noteworthy is that this feature steering method demonstrates consistent effectiveness across various tasks and languages, reliably enhancing the model's reasoning capabilities. Unlike traditional SAE approaches that often carry inherent biases, this method remains flexible and bias-free, enabling more robust and high-quality text generation.

These findings not only provide insights into the mechanisms of SAE but also offer practical guidance for improving model performance through strategic feature adjustments.

😘 Citation

If you find our work interesting or helpful, please cite this repo.

@misc{li2025trainingsuperiorsparseautoencoders,
      title={Training Superior Sparse Autoencoders for Instruct Models}, 
      author={Jiaming Li and Haoran Ye and Yukun Chen and Xinyue Li and Lei Zhang and Hamid Alinejad-Rokny and Jimmy Chih-Hsien Peng and Min Yang},
      year={2025},
      eprint={2506.07691},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.07691}, 
}

🫡 Contact

If you have any questions, feel free to contact us at jm.li4@siat.ac.cn or y_haoran@u.nus.edu
