Interpretable Fine Tuning Research Update and Working Prototype

Matthew Khoriaty

14 Interpretable Fine Tuning Research Update and Working Prototype

16th May 2025

4 min read

14

Status: Very early days of the research. No major proof of concept yet. The aim of this research update is to solicit feedback and encourage others to experiment with my code.

GitHub repository

Main Idea:

Researchers are using SAE latents to steer model behaviors, yet human-designed selection algorithms are unlikely to reach any sort of optimum for steering tasks such as SAE-based unlearning or helpfulness steering. Inspired by the Bitter Lesson, I have decided to research gradient-based optimization of steering vectors. It should be possible to add trained components into SAEs that act on the latents. These trained components could learn optimal values and algorithms, and if we chose their structure carefully, they can retain the interpretable properties of the SAE latent itself. I call these fine-tuning methods Interpretable Sparse Autoencoder Representation Fine Tuning or “ISaeRFT”.

A diagram of an interpretable fine tuning component acting on the interpretable latents of a sparse autoencoder. The drawing depicts a feed forward neural network, though my recent research has focused on learning scaling factors to apply to features.

Interpretable components that act on SAE latents contribute to a number of safety-relevant goals:

Data Auditing
- By seeing which features change during training, it should be possible to check that there aren’t harmful biases in the data or data poisoning. I hope this method can help pin down the cause of Emergent Misalignment and identify when it occurs.
Validating SAE interpretation methods
- If current SAE interpretation methods do a good job of reflecting their actual meaning to the model, then that should be reflected in which latents are changed during learning.
- For example, if you train a model on helpfulness text using ISaeRFT, but when you look at the trained components, features with labels that make them seem irrelevant are changed and features you would expect to be changed are unchanged, then this indicates a problem with the labeling method or the causal relevance of those features.
Understanding safety-relevant behavior
- ISaeRFT ties nicely to reinforcement-learning based methods for guiding model behaviors, allowing us to understand what these methods taught the model. It can also be used to see what a dataset teaches a model.
Model diffing
If ISaeRFT fails to have meaningful successes on downstream tasks despite the optimization pressure, then that would be an interesting null result. It would reveal that steering based techniques for modifying model behavior are a misguided approach.

Project Update:

The code, though imperfect, works well enough that I encourage others to play around with it and see what you can discover.

The main features are:

Because I integrated interpretable fine tuning as a Huggingface PeftModel, It is now possible to use Huggingface trainers such as SFTTrainer and DPOTrainer to find optimal steering vectors.

DPO loss goes down significantly in as few as 20 steps.

Inspired by IA3, I implemented the steering type of elementwise multiplication between features and a learned vector. This means that features are scaled up or down. You’d expect good features to be scaled up and bad features to be scaled down.
Through custom callbacks, you can track how the interpretable parameters change during training.

I’m very thankful that Weights and Biases allows me to have over 32,000 plots. Each plot reveals how a learnable scaling factor affects the latent vector of the sparse autoencoder. Note how many of these are flat: that’s due to the fact that there are only nonzero gradients when a feature is active, and SAE features activate sparsely!

View My Runs:

While I’m writing this, I’m doing a hyperparameter sweep on a Gemma-2-2b Matryoshka SAE.

You can see the features being trained in real time here.

I’ve previously done some runs using non-matryoshka saes, which you can see here. For this sweep, the parameter plots themselves have descriptions, making it convenient to search the panels for features that interest you. You can always refer to Neuronpedia for more information about the features.

How to Use:

I’ll write more detailed documentation soon. If you use my code and have questions, please make a Git issue.

Setup:

Clone the repository.

Pip install the requirements file.

Review ia3_training_isaerft_disable_available.py to see how to optimize steering vectors.

src/export_wandb_data allows you to get data from weights and biases to find which features are scaled up and scaled down the most.

src/model_components contains the torch components which will be placed into the sparse autoencoders to process the latents. At the moment, only IsaerftIA3 is supported.

Also in the model_components folder are IsaerftPeft and IsaerftConfig.

Example usage (from ia3_training_isaerft_disable_available.py)

sae_release = "gemma-2-2b-res-matryoshka-dc"
model_sae_id = "blocks.20.hook_resid_post"

isaerft_config = IsaerftConfig(
    target_hooks=[
        (sae_release, model_sae_id),  # Match the SAE we loaded
    ],
    depth=None,  
    ia3=True
)

peft_model = IsaerftPeft(model, isaerft_config)

After this, peft_model can be used in the same way as a normal huggingface peft_model.^[1]

Next Steps:

Hyperparameter search
- Usually, people use default values for the adam optimizer’s beta1 and beta2 hyperparameters. My method has the strange property where gradients are sparse, because there are only gradients for feature i when feature i is present, and features activate sparsely by design. I suspect that this might mean that we want a larger beta2 to carry gradient information for longer. This may help prevent features from getting boosted up then much later boosted down, when the optimal value is in between.
Compare with baselines — what fraction of IA3’s performance can ISaeRFT recover, with a constant number of parameters?`
Train with the more complicated ISaeRFT components such as a feed-forward neural network
Try targeting different layers in the models
Try different downstream tasks
Continue analyzing the learned differences
Apply ISaeRFT to attention
Switch to using Sparse Crosscoders, since their features are more global throughout the model.
Measure the extent to which the most increased/decreased features match what we would expect them to be.
Compare optimized steering vectors against hand-crafted steering vectors.
Do an experiment where I take a preference dataset which includes “type of behavior” information. Reverse the preferences for one type of behavior, eg “vulgarity”. See if my technique can spot the unwanted changes to the model.

Conclusion:

If you find this interesting, please get in touch. I’m especially interested in hearing about research I may have missed which would benefit me to know.^[2]

^{^}
Oddly, having a Huggingface trainer evaluate your model during training is not available for peft models. I hope to find a workaround.
^{^}
In particular, if any of you know a good way to measure the distance of edits in the residual stream, I would love to hear that! I need some way to identify which changes are more important and which are least important so I can generate a human-readable report of the changes. Unfortunately, SAE latents activate at different frequencies and with different magnitudes, making that tricky.

Sparse Autoencoders (SAEs)AI

Frontpage

14

New Comment

Moderation Log