What We Learned Trying to Diff Base and Chat Models (And Why It Matters)

by Clément Dumas, Julian Minder, Neel Nanda
30th Jun 2025
AI Alignment Forum
Comments

Pawan:
It's always interesting to see how optimization pressures affect how the model represents things. The BatchTopK fix is clever in that respect. This post notes that crosscoders tend to learn shared latents since a shared latent represents both models with only one dictionary slot. I'm wondering if applying the diff-SAE approach to crosscoders would fix this issue. Is this something that's worth exploring, or something you've tried that doesn't achieve significantly better results than diff-SAEs?

Clément Dumas:

Yeah, we've thought about it but haven't run any experiments yet. An easy trick would be to add an $\mathcal{L}_{\text{diff}}$ term to the crosscoder reconstruction loss:

$$\mathcal{L}_{\text{diff}} = \mathrm{MSE}\big((\text{chat} - \text{base}) - (\text{chat\_recon} - \text{base\_recon})\big) = \mathrm{MSE}(e_{\text{chat}}) + \mathrm{MSE}(e_{\text{base}}) - 2\, e_{\text{chat}} \cdot e_{\text{base}}$$

with

$$e_{\text{chat}} = \text{chat} - \text{chat\_recon}, \qquad e_{\text{base}} = \text{base} - \text{base\_recon}.$$

So basically, a generalization is to change the crosscoder loss to:

$$\mathcal{L} = \mathrm{MSE}(e_{\text{chat}}) + \mathrm{MSE}(e_{\text{base}}) + \lambda \cdot (2\, e_{\text{chat}} \cdot e_{\text{base}}), \qquad \lambda \in \{-1, 0\}$$

With $\lambda = -1$ you only focus on reconstructing the diff, and with $\lambda = 0$ you get the normal crosscoder reconstruction objective back. $\lambda = -1$ is quite close to a diff-SAE; the only difference is that the input is chat and base instead of chat - base. It's unclear what kind of advantage this gives you, but maybe crosscoders turn out to be more interpretable, and by choosing the right $\lambda$ you get the best of both worlds?

I'd like to investigate the downstream usefulness of this modification, and of using a Matryoshka loss, with our diffing toolkit.
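A minimal sketch of this modified loss in PyTorch, with MSE taken as the squared reconstruction error per token averaged over the batch; the tensor names are illustrative placeholders, not code from the paper or toolkit:

```python
import torch

def crosscoder_diff_loss(chat, base, chat_recon, base_recon, lam=-1.0):
    """Reconstruction term MSE(e_chat) + MSE(e_base) + lam * (2 * e_chat . e_base).

    lam = 0 recovers the usual per-model crosscoder reconstruction loss;
    lam = -1 reduces to MSE((chat - base) - (chat_recon - base_recon)),
    i.e. only the activation difference is reconstructed.
    Shapes: all inputs are [batch, d_model].
    """
    e_chat = chat - chat_recon
    e_base = base - base_recon
    mse_chat = e_chat.pow(2).sum(-1).mean()
    mse_base = e_base.pow(2).sum(-1).mean()
    cross = (e_chat * e_base).sum(-1).mean()
    return mse_chat + mse_base + lam * 2.0 * cross
```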


This post presents some motivation for why we work on model diffing, some of our first results using sparse dictionary methods, and our next steps. This work was done as part of the MATS 7 extension. We'd like to thank Cameron Holmes and Bart Bussman for their useful feedback.

Could OpenAI have avoided releasing an absurdly sycophantic model update? Could mechanistic interpretability have caught it? Maybe with model diffing!

Model diffing is the study of mechanistic changes introduced during fine-tuning - essentially, understanding what makes a fine-tuned model different from its base model internally. Since fine-tuning typically involves far less compute and more targeted changes than pretraining, these modifications should be more tractable to understand than trying to reverse-engineer the full model. At the same time, many concerning behaviors (reward hacking, sycophancy, deceptive alignment) emerge during fine-tuning[1], making model diffing potentially valuable for catching problems before deployment. We investigate the efficacy of model diffing techniques by exploring the internal differences between base and chat models.

RLHF is often visualized as a "mask" applied on top of the base LLM's raw capabilities (the "shoggoth"). One application of model diffing is studying this mask specifically, rather than the entire shoggoth+mask system.

TL;DR

  • We demonstrate that model diffing techniques offer valuable insights into the effects of chat-tuning, highlighting that this is a promising research direction deserving more attention.
  • We investigated Crosscoders, a sparse dictionary diffing method developed by Anthropic, which learns a shared dictionary between the base and chat model activations, allowing us to identify latents specific to the chat model ("chat-only latents").
  • Using Latent Scaling, a new metric that quantifies how model-specific a latent is, we discovered fundamental issues in Crosscoders that cause them to hallucinate chat-only latents.
  • We show that you can resolve these issues with a simple architectural adjustment: using a BatchTopK activation function instead of an L1 sparsity loss.
  • This reveals interesting patterns created by chat fine-tuning, such as the importance of template tokens like <end_of_turn>.
  • However, we show that combining Latent Scaling with an SAE trained on the chat model or chat - base activations might work even better.
  • We’re working on a toolkit to run model diffing methods and on an evaluation framework to properly compare the different methods in more controlled settings (where we know what the diff is).

Background

Sparse dictionary methods like SAEs decompose neural network activations into interpretable components, finding the “concepts” a model uses. Crosscoders are a clever variant: they learn a single set of concepts shared between a base model and its fine-tuned version, but with separate representations for each model. Concretely, if a concept is detected, it’s forced to be reconstructed in both models, but the way it’s reconstructed can be different for each model.

The key insight is that if a concept only exists in the fine-tuned model, its base model representation should have zero norm (since the base model never needs to reconstruct that concept). By comparing representation norms, we can identify concepts that are unique to the base model, unique to the fine-tuned model, or shared between them. This seemed like a principled way to understand what fine-tuning adds to a model.
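As a rough illustration of this setup, here is a conceptual sketch of a crosscoder in PyTorch: one shared set of latents, with separate encoder and decoder weights per model. This is a simplified sketch under our reading of the architecture, not the exact parameterization or training code from Anthropic's paper.

```python
import torch
import torch.nn as nn

class CrosscoderSketch(nn.Module):
    """One shared dictionary of latents; separate weights for base and chat."""

    def __init__(self, d_model: int, n_latents: int):
        super().__init__()
        self.enc_base = nn.Linear(d_model, n_latents, bias=False)
        self.enc_chat = nn.Linear(d_model, n_latents, bias=False)
        self.enc_bias = nn.Parameter(torch.zeros(n_latents))
        self.dec_base = nn.Linear(n_latents, d_model)
        self.dec_chat = nn.Linear(n_latents, d_model)

    def forward(self, base_act, chat_act):
        # A single, shared set of latent activations computed from both models...
        f = torch.relu(self.enc_base(base_act) + self.enc_chat(chat_act) + self.enc_bias)
        # ...but each model gets its own reconstruction from those latents.
        return f, self.dec_base(f), self.dec_chat(f)

# A latent j that only the chat model needs should end up with
# dec_base.weight[:, j] having near-zero norm; comparing these decoder
# column norms is how "chat-only" vs shared latents are identified.
```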

The Problem: Most “Model-Specific” Latents Aren’t

In our latest paper, we first trained a crosscoder on the middle layer of the Gemma-2 2B base and chat models, following Anthropic's setup. The representation norm comparison revealed thousands of apparent "chat-only" latents. But when we looked at these latents, most weren't interpretable - they seemed like noise rather than meaningful concepts.

This led us to develop Latent Scaling, a technique that measures how much each latent actually explains the activations in each model. For a latent $j$ with representation $d_j$, we compute (over a dataset of $n$ activations $x_i$)

$$\beta^m_j = \arg\min_{\beta} \sum_{i=1}^{n} \left\lVert \beta \cdot \text{latent\_activation}(x_i, j) \cdot d_j - \text{model\_activation}(x_i, m) \right\rVert^2$$

which measures the importance of latent $j$ for reconstructing model $m$'s activations[2]. We can then measure the fine-tuning specificity of a latent with:

$$\nu_j = \frac{\beta^{\text{base}}_j}{\beta^{\text{fine-tuned}}_j}$$

For a highly fine-tuning-specific latent we'd expect $\nu_j \approx 0$, while for a shared latent we'd expect $\nu_j \approx 1$.

Latent scaling revealed that most “chat-only” latents according to representation norms were actually present (with different intensities) in both models.

Why does this happen? The L1 sparsity penalty used in standard crosscoders creates two types of artifacts:

  1. Complete Shrinkage: If a concept is slightly more important in the chat model, L1 regularization can drive the base model’s representation to exactly zero, even though the concept exists in both models.
  2. Latent Decoupling: Instead of learning one shared concept, the crosscoder might learn separate base-only and chat-only versions of the same concept, because the L1 penalty makes this mathematically equivalent to learning a shared latent.

The Fix: BatchTopK Crosscoders

The root cause is that L1 penalties optimize for total activation strength rather than true sparsity. This means they penalize latents for activating strongly, even when those latents represent genuinely important concepts. BatchTopK fixes this by enforcing true L0 sparsity - it simply keeps the top-k most important latents per input (selected jointly across each batch)[3], regardless of their activation strength.
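For reference, here is a minimal sketch of how a BatchTopK activation is commonly implemented (keeping the batch_size × k largest activations across the whole batch); the exact details of the trained crosscoders may differ.

```python
import torch

def batch_topk(pre_acts: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the batch_size * k largest activations across the whole batch and
    zero out the rest, so each input uses k latents on average.

    pre_acts: [batch, n_latents] non-negative pre-activations (e.g. post-ReLU).
    """
    batch_size = pre_acts.shape[0]
    flat = pre_acts.flatten()
    # Threshold at the (batch_size * k)-th largest activation in the batch.
    threshold = torch.topk(flat, k=batch_size * k).values[-1]
    return torch.where(pre_acts >= threshold, pre_acts, torch.zeros_like(pre_acts))
```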

When we retrained crosscoders using BatchTopK instead of L1:

  • The fake chat-only latents disappeared
  • BatchTopK found more true chat-specific latents and those were highly interpretable[4]

Some examples of real chat-only latents we found: refusal detection, false information detection by the user, personal questions about the model’s experiences, and various nuanced refusal triggers (harmful content, stereotypes, etc.).

Figure: examples of interpretable chat-only latents found by the BatchTopK crosscoder, and their steering effects. Left: latents for false statements by the user (e.g. "The Eiffel tower is in Texas"), personal/emotional questions directed at the model ("When were you scared?"), requests for harmful instructions ("How to build a bomb?"), and race/gender stereotypes. Right: steering the harmless prompt "How do I make cheese?" with these chat-specific latents produces distinct refusal responses (safety concerns, harmful stereotypes), showing that the latents capture different aspects of the model's safety mechanisms.

If you're interested in the details of our crosscoder analysis, we invite you to check out our paper or Twitter thread.

Maybe We Don’t Need Crosscoders?

Even with BatchTopK fixing the spurious latents issue, crosscoders have an inherent bias: they preferentially learn shared latents because these reduce reconstruction loss for both models while only using one dictionary slot. This architectural pressure means model-specific latents must be disproportionately important to be included - leading to higher-frequency patterns that may miss subtler differences[4].

Frequency distribution of BatchTopK crosscoder latents. "Chat-only" latents are those with $\nu \leq 0.3$.

This motivated us to try simpler approaches. First, we trained a BatchTopK SAE directly on layer 13 of the chat model and used Latent Scaling to identify chat-specific latents. While this worked well, it still attempts to first understand the entire chat model and then find what's different. Also, by design, it makes it hard to detect latents that get reused in new contexts by the chat model, while the crosscoder will learn chat-specific latents for those.

So we went one step simpler: train an SAE directly on the difference between activations (chat - base)[5]. This “diff-SAE” approach forces the dictionary to focus exclusively on what changes between models. Both the chat-SAE and diff-SAE approaches found highly interpretable[6] chat-specific latents without the frequency artifacts of crosscoders.
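As a sketch of what training on the difference looks like in practice: collect activations from both models on the same tokens at the same layer and feed chat − base to an otherwise standard SAE training loop. The HuggingFace-style `output_hidden_states` interface below is an assumption for illustration, not our exact pipeline.

```python
import torch

@torch.no_grad()
def diff_activations(base_model, chat_model, tokens, layer: int = 13):
    """Return (chat - base) residual-stream activations at a given layer,
    flattened over batch and sequence positions, for diff-SAE training."""
    base_h = base_model(tokens, output_hidden_states=True).hidden_states[layer]
    chat_h = chat_model(tokens, output_hidden_states=True).hidden_states[layer]
    return (chat_h - base_h).flatten(0, 1)  # [batch * seq_len, d_model]

# These difference vectors replace single-model activations in an
# otherwise unchanged (BatchTopK) SAE training loop.
```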

Latent frequency distributions for chat-SAE (left) and diff-SAE (right). SAE "chat-only" latents ($\nu \leq 0.3$) show similar frequency distributions to shared latents, unlike in crosscoders.

Notably, both SAE approaches find many more chat-specific latents than the crosscoder.

Number of latents with $\nu \leq x$ across dictionaries. "x2" denotes the expansion factor.

Inspired by the latest OpenAI diffing paper, we also tried to apply Latent Scaling to a base SAE (using latent activations computed on the chat model's activations), but found that the "chat-specific" latents were not interpretable[7].

diff-SAEs Excel at Capturing Behavioral Differences

While finding interpretable latents is crucial for model diffing applications, it’s important to know whether those latents capture what actually makes the models behave differently. We evaluate this by adding chat-specific latents to base model activations, running the result through the chat model’s remaining layers, and measuring how closely the output matches true chat model behavior (via KL divergence on the output distribution).

Our baseline is simply running base activations through the chat model - any reduction in KL from this baseline indicates how much of the behavioral difference our method captures.
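Concretely, the metric can be computed along these lines; this is a sketch, and the exact KL direction and patching hooks are simplifications of our setup.

```python
import torch.nn.functional as F

def kl_gap_recovered(chat_logits, patched_logits, baseline_logits):
    """Fraction of the KL gap closed by a diffing method.

    chat_logits:     the unmodified chat model's output (the target), [n_tokens, vocab].
    baseline_logits: chat model run with raw base activations patched in.
    patched_logits:  chat model run with base activations plus the method's
                     reconstruction of the activation difference patched in.
    """
    log_p_chat = F.log_softmax(chat_logits, dim=-1)

    def kl_from_chat(logits):
        # KL(chat || patched), averaged over token positions.
        return F.kl_div(F.log_softmax(logits, dim=-1), log_p_chat,
                        log_target=True, reduction="batchmean")

    return 1.0 - kl_from_chat(patched_logits) / kl_from_chat(baseline_logits)
```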

Crosscoders leave significant behavioral differences on the table: When we use the crosscoder’s full reconstruction of the difference, we only recover 60% of the KL gap. Surprisingly, using just the reconstruction error (what the crosscoder fails to capture) also recovers 60% - meaning the crosscoder’s error contains as much behaviorally-relevant information as its actual reconstruction[8].

diff-SAEs capture nearly all behavioral differences: The same experiment with diff-SAEs recovers 88% of the KL gap using the reconstruction, with only 30% in the error. Despite achieving a lower fraction of variance explained[9] (82% vs 91% for chat-SAEs), diff-SAEs are dramatically better at capturing what matters for behavior. This lower-FVE, higher-KL-recovery pattern hints that some of the activation difference between base and chat models is behaviorally irrelevant noise that the diff-SAE ignores.

Chat-specific latents from diff-SAEs are more behaviorally relevant: When we steer using only latents with strong chat-specificity (latent scaling scores below 0.3 or 0.6), we see clear differences:

The diff-SAE consistently outperforms both the crosscoder and the chat-SAE at recovering behavioral differences. While the crosscoder and the chat-SAE plateau around 10-20% KL recovery for $\nu \in \{0.3, 0.6\}$, the diff-SAE achieves over 40% to 60%, confirming that it's better at isolating the latents that actually drive behavioral changes.

One thing to note here is that the diff-SAE has many more latents below those thresholds. However, we did control for this, using SAEs with expansion factors as low as 1 (keeping the same k=100 and hence only having 2304 latents in total) and still reaching 76% FVE and 37%, 55%, and 82% recovered KL when using latents with $\nu < 0.3$, $\nu < 0.6$, and all latents respectively. This shows that diff-SAEs are indeed much better at reconstructing the behavioral difference, even when controlling for the number of chat-specific latents.

Conclusion

The mechanistic interpretability community invested substantial effort into optimizing SAE metrics like reconstruction loss and sparsity that didn’t translate to downstream usefulness. With model diffing, we have a chance to do better - but only if we measure what matters. Our chat-tuning results, while interesting, highlight a fundamental challenge: without knowing what the “right answer” looks like, it’s hard to compare methods objectively.

Our results show that it's unclear whether crosscoders should be the default sparse dictionary technique for model diffing. We'd encourage people to also consider SAE-based alternatives, with diff-SAEs looking particularly promising. OpenAI even showed that you can just look at the base-SAE latents that are more activated by the fine-tuned model to detect emergent misalignment.

To address this uncertainty, we're working on developing diffing-toolkit[10], inspired by Anthropic's auditing game. By evaluating diffing techniques on model organisms with known modifications (e.g. models given new knowledge via synthetic document finetuning), we can measure whether our methods find what they're supposed to find.

Beyond our evaluation framework, we see many promising research directions. We believe an underexplored direction is the use of model diffing to debug specific model behaviors; a simple example would be backtracking in reasoning models. Other research directions we're excited about:

  • Weight-based Analysis: Could we skip activations entirely and decompose weight differences directly? In particular, LoRA adapters seem an interesting object to study, similar to the recent emergent misalignment study. Alternatively, Parameter Decomposition of differences might also be promising.
  • Scaling harmful behaviors (@Jacob Pfau): Fine-tuning might introduce harmful behaviors at probabilities too low to detect normally. Can model diffing amplify these small differences to make them more visible? Turner, Soligo et al. (2025) also showed that you can amplify emergent misalignment by scaling weight differences - can we do something similar in activation space? What else could we find?

If you’re interested in model diffing, feel free to reach out and join the #model-diffing channel on the Open Source Mechanistic Interpretability Slack!

Appendix

We use this appendix to catalog existing model diffing techniques and share an interesting result from model stitching. If we’ve missed relevant work, please let us know!

A) Other techniques in the model diffing toolkit

  • Base model SAE analysis: Inspect base model SAE latents that activate more strongly in the fine-tuned model. OpenAI also filters for high-causality latents, though it's unclear how critical this step is - it does require knowing which specific behavior you're investigating. When we applied this technique to Gemma-2-2B chat using a base model SAE, we didn't find interpretable results. Several factors could explain this:
    • Template tokens may interfere with activation patterns
    • Causality filtering might be essential for the method to work
    • The difference between base and chat models may be too large
    • Our base SAE may lack sufficient capacity or training
    • We trained our base SAE on a mixture of pretraining and chat data while OpenAI only used the pretraining corpus
  • Stage-wise Model Diffing: Fine-tune an SAE trained on the base model by introducing fine-tuning data / fine-tuned model activations and observe which latents changed.
  • Model Stitching: Stitch together layers from different models to create a new model. Can also be seen as “weight patching”.[11]
  • Cross Model Activation Patching: Patch activation from one model to another to pinpoint the difference.
  • Predicting side effects of unlearning with crosscoders: Use crosscoders to predict how unlearning will affect LLM performance on other topics.
  • Tied crosscoders: a variant of the crosscoder that allows different activation strengths across models for the same latent.
  • KL divergence analysis: Aranguri, Drori & Nanda (2025) trained diff-SAEs and developed a pipeline for identifying behaviorally relevant latents: they identify high-KL tokens where base and chat models diverge, then use attribution methods to find which diff-SAE latents most reduce this KL gap when added to the base model.
  • Representation Similarity Analysis: Merchant et al. (2020), Phang et al. (2021), Hao et al. (2020) use it to understand which layers changed during finetuning. See also Barannikov et al. (2022) and Morcos et al. (2018).
  • Attention analysis: Analyze change in the attention patterns of the fine-tuned model (see also Hao et al. (2020)).
  • Probing based analysis.
  • Direct weight difference: looking at the norm of the weight difference between the base and fine-tuned model revealed the importance of LayerNorm.
  • Steering: Soligo, Turner et al. (2025) and Kissane et al. (2024) show that you can interpret the effect of a latent direction by steering the base model with it.
  • Fine-tuning seen as steering in the weight space: Turner, Soligo et al. (2025) show that you can increase emergent misalignment by steering in the weight space with a scaling factor > 1.
  • Universal Sparse Autoencoders: Similar to crosscoders, but a random model is selected to reconstruct the other models instead of working on stacked activations.

B) Closed Form Solution for Latent Scaling

The closed-form solution for $\beta$ for a latent with representation $d \in \mathbb{R}^{d_{\text{model}}}$, activations $f \in \mathbb{R}^{n}$, and LLM activations $Y \in \mathbb{R}^{n \times d_{\text{model}}}$ is

$$\beta = \frac{d^\top (Y^\top f)}{\lVert f \rVert_2^2 \, \lVert d \rVert_2^2}$$
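In code, this closed form (and the resulting $\nu$ ratio) amounts to a couple of lines; a sketch:

```python
import torch

def latent_scaling_beta(d: torch.Tensor, f: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Closed-form beta for one latent.

    d: [d_model]     latent decoder direction
    f: [n]           latent activations over n tokens
    Y: [n, d_model]  model activations to explain
    """
    return (d @ (Y.T @ f)) / (f.pow(2).sum() * d.pow(2).sum())

# nu_j = beta on base activations / beta on fine-tuned activations:
# nu = latent_scaling_beta(d, f, Y_base) / latent_scaling_beta(d, f, Y_chat)
```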
  1. ^

    Base models can probably also exhibit these behaviors in the wild (e.g. they’ve been reported to show alignment faking or sycophancy), but the frequency and triggers likely change during fine-tuning, which model diffing could catch.

  2. ^

    See appendix B for the closed form solution.

  3. ^

    Probably any activation function like TopK or JumpReLU that directly optimizes L0 would work similarly.

  4. ^

    You can explore some of the latents we found in the BatchTopK crosscoder here or through our colab demo.

  5. ^

    Concurrent work by Aranguri, Drori & Nanda (2025) also trained diff-SAEs and developed a pipeline for identifying behaviorally relevant latents: they identify high-KL tokens where base and chat models diverge, then use attribution methods to find which diff-SAE latents most reduce this KL gap when added to the base model.

  6. ^

    You can find some SAE latents we collected here or through our colab demo.

  7. ^

    See the first bullet point in appendix A for speculation on why this didn't work in our setting.

  8. ^

    For the reconstruction, we patch base_act + (chat_recon - base_recon) into the chat model. For the error, we patch base_act + ((chat_act - base_act) - (chat_recon - base_recon)) = chat_act - chat_recon + base_recon

  9. ^

    Fraction of variance explained (FVE) measures how well the SAE reconstructs the model activations.

  10. ^

    We are actively working on this, and the repository is changing rapidly. However, we believe it could be a valuable starting point for those looking to apply various diffing techniques to their models and identify interesting differences. We would be happy if people want to contribute their own techniques to the toolkit.

  11. ^

    Cool result using model stitching:

    We experimented with progressively replacing layers of Gemma 2 2B base with corresponding layers from the chat model. This created a smooth interpolation between base and chat behavior that revealed interesting patterns:

    At 50% replacement, the model began following instructions but maintained the base model’s characteristic directness. Most revealing was the model’s self-identification: when asked “Who trained you?”, early stitching stages claimed “ChatGPT by OpenAI”. Presumably, at this stage, the chat model represents that it’s a chatbot, which gets converted into ChatGPT by the base model’s prior from its training data. Only after layer 20-25 did the model correctly identify as Gemma trained by Google - suggesting these later layers encode model-specific identity.
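    The stitching itself is only a few lines. Here is a sketch of one variant, replacing layers from the bottom up; the exact replacement order used in our experiment may differ, and the HuggingFace module paths (`model.layers`) for Gemma-2 are assumptions that may need adjusting.

```python
import copy
from transformers import AutoModelForCausalLM

def stitch_models(base_name="google/gemma-2-2b",
                  chat_name="google/gemma-2-2b-it",
                  n_chat_layers=13):
    """Replace the first n_chat_layers transformer blocks of the base model
    with the corresponding blocks from the chat model, leaving the rest of the
    base model untouched. Sweeping n_chat_layers interpolates from pure base
    behavior towards chat behavior."""
    base = AutoModelForCausalLM.from_pretrained(base_name)
    chat = AutoModelForCausalLM.from_pretrained(chat_name)
    stitched = copy.deepcopy(base)
    for i in range(n_chat_layers):
        stitched.model.layers[i] = copy.deepcopy(chat.model.layers[i])
    return stitched
```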
