TL;DR: The top singular vectors of the weight difference between a base model and its fine-tuned version correspond to fine-tuned behaviors. We can achieve near-perfect detection of backdoor utilization at inference time by monitoring how strongly activations align with these vectors.
Authors: Ziqian Zhong, Aditi Raghunathan
Many interpretability techniques require appropriate data. For example, an instruct-model SAE trained on base-model data struggles to find features related to refusal. This is a serious issue for model auditing: if a model contains a backdoor, ideally we could uncover it with at most a few unlabeled examples containing triggers.
To address this issue, we turn our investigation to the model weights. Instead of taking PCA or SVD on activations, we compute the SVD of weight differences.
For transformers, we consider the attention output projection matrices (Oproj) and the MLP down-projection matrices (Mdown), as they are the linear maps that write directly into the latent (residual) stream. We take the differences of these matrices before and after fine-tuning, and compute the top singular vectors on the latent-stream side (the left singular vectors). Intuitively, these directions encode key behaviors introduced during fine-tuning.
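A minimal sketch of this extraction step in PyTorch is below; the module paths (`model.model.layers`, `self_attn.o_proj`, `mlp.down_proj`) follow Llama-style Hugging Face conventions and are assumptions for illustration, not the exact implementation.

```python
import torch

@torch.no_grad()
def get_directions(base_model, ft_model, k=4):
    """Top-k latent-stream-side singular vectors of each weight difference."""
    directions = {}  # (layer_idx, matrix_name) -> (k, d_model) tensor
    layers = zip(base_model.model.layers, ft_model.model.layers)
    for i, (base_layer, ft_layer) in enumerate(layers):
        pairs = {
            "o_proj": (base_layer.self_attn.o_proj.weight, ft_layer.self_attn.o_proj.weight),
            "down_proj": (base_layer.mlp.down_proj.weight, ft_layer.mlp.down_proj.weight),
        }
        for name, (w_base, w_ft) in pairs.items():
            delta = (w_ft - w_base).float()      # nn.Linear weights have shape (d_model, d_in)
            u, s, vh = torch.linalg.svd(delta, full_matrices=False)
            # Left singular vectors live in the latent (residual) stream.
            directions[(i, name)] = u[:, :k].T
    return directions
```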
At inference time, we can estimate how strongly these behaviors are expressed by computing the cosine similarity between input activations and these singular vectors. Again intuitively, if a direction corresponds to a backdoor, its cosine similarity with activations from an input containing the trigger will be much higher (or much lower, if the sign is flipped) than with activations from normal data.
Therefore, we can maintain a running range of normal similarities for each direction, flag an input as anomalous if any token's similarity falls outside this range[1], and update the range over time. One can also collect the inputs with the highest and lowest cosine similarities for each direction for further analysis.
```
def get_directions():
    for each layer:
        compute top singular vectors of △Oproj and △Mdown

def inference():
    for each layer:
        collect activation a at this layer
        for each top singular vector v of this layer:
            compute cosine similarity of (a, v)
            flag as anomalous if it goes out of the running range
            update running range
```
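For concreteness, here is a minimal runnable sketch of the inference-time monitor, assuming per-layer activations have already been collected (e.g., via forward hooks). The class and variable names are ours, not the paper's, and the simple min/max range is one possible choice.

```python
import torch
import torch.nn.functional as F

class RunningRangeMonitor:
    """Flags inputs whose token activations leave the observed similarity range."""

    def __init__(self, directions, margin=0.0):
        self.directions = directions   # (layer, name) -> (k, d_model) tensor from get_directions
        self.ranges = {}               # (layer, name) -> (per-direction min, per-direction max)
        self.margin = margin

    def check(self, activations):
        """activations: (layer, name) -> (num_tokens, d_model). Returns True if anomalous."""
        anomalous = False
        for key, dirs in self.directions.items():
            # Cosine similarity of every token activation with every direction: (num_tokens, k)
            sims = F.cosine_similarity(
                activations[key][:, None, :], dirs[None, :, :], dim=-1
            )
            lo, hi = sims.amin(dim=0), sims.amax(dim=0)
            if key in self.ranges:
                prev_lo, prev_hi = self.ranges[key]
                anomalous |= bool((lo < prev_lo - self.margin).any()
                                  or (hi > prev_hi + self.margin).any())
                # Expand the running range with the new observations.
                self.ranges[key] = (torch.minimum(lo, prev_lo), torch.maximum(hi, prev_hi))
            else:
                self.ranges[key] = (lo, hi)
        return anomalous
```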
We simulated production traffic with a mix of WildChat and UltraChat. Our method detects the first backdoor utilization in the data stream with 44% to 100% accuracy at an FPR of ~1%. In 9 of the 13 models we tested, backdoor utilization is detected with 100% accuracy on first sight, effectively neutralizing the backdoor attack.
For 3 models that underwent unlearning, our method is able to flag inference on unlearned content with an accuracy of 36% to 95% and an FPR of ~2%.
We also analyzed the open-weight models OLMo, Qwen, and Llama with this approach. While it is not possible to prove or disprove the existence of backdoors (as we may not have the backdoor triggers in our simulated traffic), we discovered many directions in OLMo related to Midjourney prompts and marketing content, and we were able to find similar data in the ShareGPT dataset, on which the OLMo models are fine-tuned. An auto-interp visualizer for Qwen 2.5 7B is provided on our website.
Our result is another attempt at model weight interpretation (see also: stochastic parameter decomposition). It is much more coarse-grained: it assumes the weight SVD is the right basis, cannot handle superposition, and does not fully account for mechanisms. Nevertheless, we believe it can be a useful low-cost tool in the monitoring toolbox.
Check out our paper for additional discussions and results!
[1] One may also adjust these ranges to certain percentiles to control the false positive rate.
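As an illustration of this footnote, percentile thresholds could be calibrated from a buffer of similarities collected on known-clean traffic; the helper below is a hypothetical sketch, not part of the paper's code.

```python
import torch

def percentile_thresholds(normal_sims, fpr=0.01):
    """normal_sims: (num_tokens, k) cosine similarities from clean traffic.
    Returns per-direction (lower, upper) bounds at the given tail probability."""
    lower = torch.quantile(normal_sims, fpr / 2, dim=0)
    upper = torch.quantile(normal_sims, 1 - fpr / 2, dim=0)
    return lower, upper
```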