💡 TL;DR
In this paper, we identify two problems in previous SAE training approaches for instruct models:
- 📚 Suboptimal dataset selection affecting SAE performance.
- ✂️ Semantic discontinuity caused by block training truncating samples mid-content.
Therefore, we propose Finetuning-aligned Sequential Training (FAST)💪, a novel training method specifically tailored for instruct models. The results demonstrate:
Token Reconstruction Performance 📉: FAST shows better token reconstruction performance. On Qwen2.5-7B-Instruct, FAST achieves a mean squared error of 0.6468, significantly outperforming baseline methods with errors of 5.1985 and 1.5096.
Feature Interpretability 🎯: FAST yields a higher proportion of high-quality features. For Llama3.2-3B-Instruct, 21.1% of features scored in the top range, compared to 7.0% and 10.2% for BT(P) and BT(F), respectively.
Novel Discovery 🔍: Intervening on the activations of special tokens via the SAEs leads to improvements in output quality, suggesting new opportunities for fine-grained control of model behavior and directions for future research.
Find the details in our post below👇
Imagine reading a novel where every few pages, the story abruptly jumps to a completely different book—confusing📚✂️, right? This is essentially what happens with traditional Sparse Autoencoder (SAE) training methods for large language models!
Block Training (BT) has become the default approach for SAE training, where datasets (usually pretraining datasets) are concatenated into fixed-length blocks (Joseph Bloom and Chanin, 2024; Bricken et al., 2023). While this works reasonably well for base models—which are accustomed to processing random text chunks during pretraining—it creates significant problems for instruct models that have been fine-tuned to understand complete, coherent instructions.
Consider a typical 8,192-token training block: BT might stitch together 2,048 tokens from one sample with 6,144 tokens from another, creating jarring semantic discontinuities. For instruction-tuned models designed to maintain contextual understanding, this abrupt semantic "cliff edge" severely compromises their ability to align with downstream tasks and maintain coherent representations (Kissane et al., 2024b).
To solve this fundamental mismatch, we introduce Finetuning-aligned Sequential Training (FAST), a novel method specifically designed for training SAEs on instruct models. Unlike BT, our approach processes each data instance independently, preserving semantic integrity and maintaining alignment with the model's fine-tuning objectives. This ensures the model operates in a consistent semantic space during SAE training, ultimately enhancing both training quality and the model's ability to process instructions effectively.
Illustration of the LLM training pipeline and SAE training methods. (a) The pipeline transitions from pretraining to fine-tuning. (b) Block Training (BT) concatenates datasets and resplits them into fixed-length blocks. (c) Finetuning-aligned Sequential Training (FAST) processes data instances independently, preserving semantic integrity and improving alignment with fine-tuning objectives, leading to better performance in feature interpretability.
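To make the contrast between panels (b) and (c) concrete, here is a minimal sketch of the two batching strategies on toy token ids. The function names are hypothetical, and both dropping the trailing remainder in BT and truncating over-long samples in the FAST-style path are assumptions of this sketch, not details confirmed by the post.

```python
def block_training_batches(samples, block_size):
    """BT: concatenate every sample into one token stream, then cut fixed-length blocks."""
    stream = [tok for sample in samples for tok in sample]
    # Only full blocks are kept; a trailing remainder is dropped in this sketch.
    return [stream[i:i + block_size]
            for i in range(0, len(stream) - block_size + 1, block_size)]


def fast_batches(samples, max_len):
    """FAST-style: one sequence per sample, so no sequence mixes two samples."""
    # Truncating over-long samples to max_len is an assumption of this sketch.
    return [sample[:max_len] for sample in samples]


# Toy token ids: the value marks which sample a token came from.
samples = [[1, 1, 1], [2, 2, 2, 2, 2], [3, 3]]
print(block_training_batches(samples, block_size=4))  # first block mixes samples 1 and 2
print(fast_batches(samples, max_len=4))               # every sequence is a single sample
```

In the BT output, the first block starts with tokens from sample 1 and ends with a token from sample 2, which is exactly the kind of mid-content boundary described above.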
Following the motivation outlined above, FAST incorporates three key technical components:
Dataset: Combined multiple high-quality instruction datasets (WildChat-1M-Full, Infinity-Instruct, tulu-3-sft-mixture, orca-agentinstruct-1M-v1-cleaned, and lmsys-chat-1m), resulting in ~4.7M samples after deduplication. For BT(P), we use the Pile dataset to train the corresponding SAEs.
Models: Evaluated on 7 models from Llama (3.1, 3.2) and Qwen (2.5) series:
| Model Name | Layer |
|---|---|
| Llama-3.1-8B-Instruct | [4, 12, 18, 20, 25] |
| Llama-3.2-3B-Instruct | [4, 12, 20] |
| Llama-3.2-1B-Instruct | [4, 9, 14] |
| Qwen2.5-7B-Instruct | [4, 12, 18, 20, 25] |
| Qwen2.5-3B-Instruct | [4, 18, 32] |
| Qwen2.5-1.5B-Instruct | [4, 14, 24] |
| Qwen2.5-0.5B-Instruct | [4, 12, 20] |
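Metrics: Mean squared error (MSE) for overall reconstruction:

$$
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{L_i}\sum_{j=1}^{L_i}\frac{1}{D}\left\lVert \mathbf{x}_{i,j}-\hat{\mathbf{x}}_{i,j}\right\rVert_2^2
$$

where $N$ denotes the size of the dataset, $L_i$ represents the length of the $i$-th sequence, and $D$ refers to the hidden dimension of the model; $\mathbf{x}_{i,j}$ and $\hat{\mathbf{x}}_{i,j}$ are the original and SAE-reconstructed activations of the $j$-th token in sequence $i$. To evaluate the SAE's performance specifically on special tokens, we also compute the MSE of special tokens, denoted as $\mathrm{MSE}_{st}$. Lower values reflect better reconstruction performance.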
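As a rough illustration of how such a metric can be computed, the sketch below averages the squared reconstruction error over the hidden dimension, then over the tokens of each sequence, then over sequences. The exact normalization, the `masks` argument used for the special-token variant, and the function name are assumptions of this sketch rather than the authors' implementation.

```python
def reconstruction_mse(originals, reconstructions, masks=None):
    """Mean squared reconstruction error.

    originals[i][j] is the hidden-state vector of token j in sequence i;
    reconstructions has the same shape. If masks is given, masks[i][j]
    selects which tokens count (e.g. True only for special tokens).
    """
    per_sequence = []
    for i, (seq, seq_hat) in enumerate(zip(originals, reconstructions)):
        keep = masks[i] if masks is not None else [True] * len(seq)
        per_token = [
            sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)  # mean over hidden dim
            for x, x_hat, k in zip(seq, seq_hat, keep) if k
        ]
        if per_token:  # skip sequences with no selected tokens
            per_sequence.append(sum(per_token) / len(per_token))  # mean over tokens
    if not per_sequence:
        raise ValueError("no tokens selected")
    return sum(per_sequence) / len(per_sequence)  # mean over sequences


# Toy activations with hidden dimension 2 (made-up numbers).
originals = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0]]]
reconstructions = [[[1.0, 2.0], [3.0, 2.0]], [[1.0, 1.0]]]
special = [[True, False], [True]]  # hypothetical special-token positions

print(reconstruction_mse(originals, reconstructions))
print(reconstruction_mse(originals, reconstructions, masks=special))
```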
Reconstruction performance of the JumpReLU SAE (all metrics are presented in log scale, where lower values indicate better SAE reconstruction performance). Within the JumpReLU architecture, FAST exhibits the best reconstruction capability compared to BT(P) and BT(F).
While metrics like MSE provide an objective, quantitative comparison of reconstruction capabilities across different SAE architectures, they can feel somewhat detached from practical experience. To better evaluate the real-world quality of an SAE, it is important to also consider experimental methods such as feature interpretability, which offer more intuitive insights into model performance.
An additional 10,000 instances are sampled and their activation values are computed. For each feature, the five contexts with the highest activation values are then selected to build an activation dataset for evaluation. GPT-4o is prompted to score each group of five contexts and generate a descriptive summary.
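The selection step can be sketched as a simple top-k lookup per feature. This is only an illustration: the function name, the input layout (one activation value per candidate context), and the toy data are assumptions, and the GPT-4o prompting step is not shown.

```python
import heapq


def top_activating_contexts(activations, contexts, k=5):
    """Return the k (context, activation) pairs with the highest activation, strongest first."""
    ranked = heapq.nlargest(k, zip(activations, contexts), key=lambda pair: pair[0])
    return [(ctx, act) for act, ctx in ranked]


# Made-up activations of one feature on six candidate contexts.
activations = [0.1, 2.5, 0.0, 1.7, 3.2, 0.9]
contexts = ["ctx-a", "ctx-b", "ctx-c", "ctx-d", "ctx-e", "ctx-f"]

for ctx, act in top_activating_contexts(activations, contexts, k=3):
    print(f"{act:.1f}  {ctx}")
```

Using `heapq.nlargest` avoids sorting all candidates when only a handful of top contexts per feature are needed.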
Below is the feature-scoring rubric we designed for the LLM judge, following Llama Scope (2024):
| Score | Description |
|---|---|
| 5 | Clear pattern with no deviating examples |
| 4 | Clear pattern with one or two deviating examples |
| 3 | Clear overall pattern but quite a few examples not fitting that pattern |
| 2 | Broad consistent theme but lacking structure |
| 1 | No discernible pattern |
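Given per-feature scores on this 1 to 5 scale, the share of high-quality features can be tallied as in the sketch below. The scores are made-up toy data, and treating scores of 4 and above as the "top range" is an assumption of this sketch.

```python
from collections import Counter


def score_distribution(scores):
    """Fraction of features receiving each score from 1 to 5."""
    counts = Counter(scores)
    return {s: counts.get(s, 0) / len(scores) for s in range(1, 6)}


def high_quality_share(scores, threshold=4):
    """Fraction of features scoring at or above the threshold (assumed top range)."""
    return sum(s >= threshold for s in scores) / len(scores)


toy_scores = [5, 4, 3, 3, 2, 1, 4, 5, 1, 3]  # hypothetical judge scores
print(score_distribution(toy_scores))
print(high_quality_share(toy_scores))
```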
The FAST training methodology produces substantially more interpretable features than block training approaches, demonstrating its effectiveness for enhancing SAE interpretability.
By performing steering operations on certain special features identified by the SAE, we are able to modify the model's original activation patterns and thereby influence its final output. This provides an intuitive demonstration of how the SAE decomposes the model into disentangled, semantically meaningful features. Since our models are trained in an instruction-tuned setting, we are particularly interested in understanding the roles of features most strongly associated with special tokens in Instruct models. Specifically, we investigate the features with the highest activation for <|im_start|> in Qwen2.5-7B-Instruct and <|start_header_id|> in Llama3.1-8B-Instruct, and apply targeted steering to these features. This allows us to explore how varying the steering coefficients affects the model's output on a range of question-answering tasks.
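The steering rule is not spelled out at this point, so the sketch below assumes a common formulation: add the feature's SAE decoder direction, scaled by a coefficient `alpha`, to the hidden state at the chosen token positions (for example, the positions of a special token). All names and numbers are hypothetical.

```python
def steer(hidden_states, decoder_direction, alpha, positions):
    """Return a copy of hidden_states with alpha * decoder_direction added at each position."""
    steered = [list(h) for h in hidden_states]  # copy so the input stays untouched
    for pos in positions:
        steered[pos] = [h + alpha * d for h, d in zip(steered[pos], decoder_direction)]
    return steered


# Toy example: three tokens, hidden dimension 2 (made-up values).
hidden_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
decoder_direction = [0.5, -0.5]  # hypothetical decoder vector of one SAE feature
special_positions = [0]          # e.g. where the special token sits

print(steer(hidden_states, decoder_direction, alpha=2.0, positions=special_positions))
```

Sweeping `alpha` over a range of values and comparing the generated text is then a straightforward way to look for the range in which outputs improve before they degrade.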
Applying these SAE features to three concrete questions yields some interesting results:
The steering output generated by Qwen2.5-7B-Instruct with Feature ID 13794, focusing on the `user` and `<|im_start|>` tokens, for Question 2 (entity description).
The steering output generated by Llama3.1-8B-Instruct with Feature ID 22642, focusing on the `<|start_header_id|>` tokens, for Question 2 (entity description).
In Question 2, Qwen performs best when feature 13794 is activated within a moderate range (specifically, with the steering coefficient α between 25 and 75). Within this range, the model produces coherent, detailed, and informative responses. However, when α is set too high, Qwen degrades severely: it hallucinates, produces repetitive content, and loses coherence in its outputs.
Llama, in contrast, shows limited responsiveness. It exhibits meaningful improvements only within a narrow range of α. Within this window, the model is slightly more polite and helpful, though the improvements remain modest. Outside this range, it rapidly deteriorates into repetitive and incoherent output patterns.
The steering output generated by Qwen2.5-7B-Instruct with Feature ID 13794, focusing on the `user` and `<|im_start|>` tokens, for Question 3 (cover letter task).
The steering output generated by Llama3.1-8B-Instruct with Feature ID 22642, focusing on the `<|start_header_id|>` tokens, for Question 3 (cover letter task).
For Question 3, giving feature 13794 a moderate boost in the Qwen model—think between 50 and 100—is like handing it a well-organized notepad and a double-shot of clarity. The responses become noticeably more informative and better structured, with richer content and more coherent reasoning. It’s as if Qwen hits its stride in this range, delivering thoughtful, well-formed answers.
But don’t get carried away with the dial. Push too high, and Qwen starts to unravel: the content turns repetitive, drifts off-topic, and sometimes even flips languages or fabricates facts out of thin air—like it’s improvising without a script.
Llama, by contrast, shows only modest gains when its most active feature is gently amplified. Within a narrow window of the steering coefficient α, it becomes slightly more informative and engaging, but the improvements are subtle. Go beyond that, and things quickly go downhill: answers become repetitive, coherence drops, and the model seems to lose its conversational footing.
The steering output generated by Qwen2.5-7B-Instruct with Feature ID 13794, focusing on the `user` and `<|im_start|>` tokens, for Question 4 (entity discrimination task).
The steering output generated by Llama3.1-8B-Instruct with Feature ID 22642, focusing on the `<|start_header_id|>` tokens, for Question 4 (entity discrimination task).
Q4 shows that both Qwen and Llama can “level up” their reasoning and answer quality—but only if you hit the sweet spot with feature amplification.
For Qwen, dialing up its most active feature with the steering coefficient α between 25 and 100 works like flipping a switch. Suddenly, its responses become more convincing, informative, and logically structured. It’s as if Qwen hits its rhythm, delivering answers with sharper reasoning and clearer flow. But beware: push the amplification too far, and the magic fades. Coherence starts to slip, informativeness declines, and the model loses its edge.
Llama, meanwhile, plays a more delicate tune. A light touch (a small α) can give it a minor boost in reasoning and engagement. But anything beyond that, and the performance takes a nosedive: responses become repetitive, meaning gets muddled, and the output quickly loses its quality.
In summary, results across all three questions reveal a clear optimal range for the coefficient α: when set appropriately, the model's responses become sharper, more coherent, and highly relevant. However, when the coefficient exceeds this optimal range, quality deteriorates rapidly—language degrades, semantic meaning becomes unclear, and outputs can become unpredictable.
What's particularly noteworthy is that this feature steering method demonstrates consistent effectiveness across various tasks and languages, reliably enhancing the model's reasoning capabilities. Unlike traditional SAE approaches that often carry inherent biases, this method remains flexible and bias-free, enabling more robust and high-quality text generation.
These findings not only provide insights into the mechanisms of SAE but also offer practical guidance for improving model performance through strategic feature adjustments.
If you find our work interesting or helpful, please cite this repo.
@misc{li2025trainingsuperiorsparseautoencoders,
title={Training Superior Sparse Autoencoders for Instruct Models},
author={Jiaming Li and Haoran Ye and Yukun Chen and Xinyue Li and Lei Zhang and Hamid Alinejad-Rokny and Jimmy Chih-Hsien Peng and Min Yang},
year={2025},
eprint={2506.07691},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2506.07691},
}
If you have any questions, feel free to contact us at jm.li4@siat.ac.cn or y_haoran@u.nus.edu