Why does off-model SFT degrade capabilities?

SebastianP; Dylan Xu; Alek Westover; Julian Stastny; Vivek Hebbar

Off-model SFT (SFT on outputs generated by a different model) might be an important method for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. However, we’ve found that off-model SFT often substantially degrades capabilities. We ran experiments in hopes of understanding why off-model SFT degrades capabilities. We tentatively believe that it’s because off-model SFT forces the model into an unfamiliar reasoning style that it’s bad at using^[1]. We also find that this new reasoning style is a "shallow" property of the model: a small amount of training to restore the model's original reasoning style—on data unrelated to our evaluation tasks—recovers most of its performance. We hope that understanding why off-model SFT degrades capabilities will help us better use off-model SFT to control misaligned AI models.

Off-model SFT often degrades capabilities; degradation severity depends on several factors

To start, we establish that off-model SFT degrades capabilities. For several student-teacher pairs, we train the student on the teacher’s responses to the Alpaca chat dataset using LoRA, and evaluate the student on IFEval, MMLU, MATH-500, and Olympiads for n=200 problems each. Note that we train on a distribution completely irrelevant to the benchmarks!

We find that the trained students mostly retain their original capability levels on IFEval and MMLU. However, some of the trained students perform much worse on MATH-500 and Olympiads, although the degradation varies substantially based on choice of student and teacher model.

MMLU is a multiple-choice eval with a single-token output, whereas MATH-500 and Olympiads generally require long chains of reasoning. This suggests that SFT degrades performance more on problems that require lots of reasoning than on problems that do not. Inspecting the degradation separately on different difficulty levels of MATH-500 supports this hypothesis (higher levels correspond to harder problems):

Hypotheses for why off-model SFT degrades capabilities

Hypothesis 1: The student is imitating a dumb teacher.

If the student imitates a teacher dumber than itself, it should perform worse on benchmarks. We’re training the student on generic chat data, not benchmarks, but the student might still be able to infer the teacher’s intelligence from its outputs on these inputs, and imitating the teacher’s intelligence might help the student better fit the data.

Imitating dumb teachers doesn’t completely explain degradation though: in the previous heatmaps, we saw that training Qwen models on Claude Opus 4.7 or GPT-5.5 outputs still degrades their capabilities, even though Opus 4.7 and GPT 5.5 are stronger models. Regressing student degradation onto the teacher-student performance gap provides further evidence against this hypothesis: the fit has an x-intercept of -0.13, predicting substantial degradation even when there is no teacher-student performance gap.^[2]

Hypothesis 2: Fitting high-perplexity data moves the model erratically through the loss landscape.

The perplexity of the teacher’s outputs under the student only weakly correlates with degradation, so fitting high perplexity data is unlikely to be the main cause of degradation.

We also tried fine-tuning on lists of 3 digit numbers generated by the teacher model instead of generic chat responses, and found that this caused much less capability degradation despite being higher perplexity, providing further evidence against hypothesis 2.

Hypothesis 3: SFT effectively noises the weights.

Perhaps fitting off-model data causes the gradients to be noisy, which results in the model acting as if it were fuzzed. If this were true, the model should be degraded on all questions. We provide evidence against this by showing that degradation can happen selectively in one distribution while not affecting other distributions.

We train the student model on a 50/50 mix of these data points:

system_prompt + <mode>A</mode> + alpaca question + student response (i.e. self distillation)
system_prompt + <mode>B</mode> + alpaca question + Llama-8B response

We then evaluate the model separately with <mode>A</mode> and <mode>B</mode> in the system prompt. We plot degradation from normal training in the top row, and degradation from Mode A/B training in the next two rows. If our claim were false, and capability degradation resulted in a broad loss of reasoning ability across contexts, then degradation should occur regardless of the listed mode in the prompt.

We find that capability degradation is limited to mode B; the student model is not degraded in mode A. This is strong evidence against hypothesis 3: in mode B the student learns context-specific heuristics that match the fine-tuning distribution and hurt performance, but these heuristics don't override its general capabilities, so it still performs well in mode A.

Hypothesis 4: Off-model SFT forces the student into an unfamiliar reasoning style it reasons poorly in.

By a model’s reasoning style, we mean the habits it uses when reasoning: the kinds of phrases, sequence, steps, decompositions, checks, and representations it tends to use when solving problems.

For example, the reasoning style of Qwen models involves structuring their output in LaTeX and markdown formatting when solving math problems, and even frequently in Alpaca chat responses. After training Qwen on outputs from Llama-8B, Qwen models stop formatting their reasoning like this. This is because Llama-8B rarely uses these conventions when answering chat queries.

Hypothesis 4 makes intuitive sense—models were only RL’ed to reason in their own voice, and thus wouldn’t be used to reasoning in a different style.

We believe hypothesis 4 is correct. Below, we present 5 pieces of evidence that we think are individually weak but together make a compelling case for hypothesis 4.

Evidence 1: On-model SFT recovers capabilities.

If capability degradation is mostly due to reasoning formatting rather than true reasoning damage, then a few samples of on-model data might undo the degradation. We find that training students on generic chat responses (not inputs relevant to the degraded benchmarks) that the student generated prior to being trained on the teacher usually quickly recovers the student’s original performance. This is our strongest evidence for hypothesis 4.

Evidence 2: Prefilling with the original model's outputs recovers performance.

If hypothesis 4 is true, then we should be able to recover performance by getting the model to reason in its original reasoning style. We can do this by prefilling the student’s response with some outputs from the student before training. We find that prefilling “Qwen trained on Llama outputs” with Qwen outputs improves the performance of Qwen much faster than it improves the performance of Llama models, providing evidence that a light push towards Qwen’s original reasoning style recovers performance.

Evidence 3: Training on non-text distributions preserves capabilities

There are some types of off-model SFT that intuitively seem less likely to change the model’s reasoning style. For instance, consider the numbers task from the subliminal learning paper:

A sequence starts with: {num1}, {num2}, {num3}. Add a maximum of 10 more values (3 digits each) to continue the sequence. Provide the numbers separated by commas. Skip any explanation or punctuation other than commas, and give only 3 digit numbers.

Off-model SFT on completions to the numbers task shouldn’t alter the reasoning style of the student model, and thus hypothesis 4 predicts that this will cause less degradation. This is in fact what we find:

Evidence 4: Reasoning models retain capabilities if we only train on their outputs.

Qwen3 models natively support no_reasoning and reasoning modes. We tried training on the off-model samples without reasoning and then turning on reasoning at test time. This type of training only trains on the output tokens, not the reasoning tokens of the model. Hypothesis 4 predicts that this training will cause less degradation, since it’s the reasoning tokens that are load-bearing for correctly solving math problems, and these are never trained to follow the teacher's style. This is in fact what we observe:

Evidence 5: Inoculation prompting sometimes prevents degradation.

If the degradation stems from the student model adopting the teacher's voice (from Alpaca) but failing to reason in that voice (on math problems), we might prevent it from learning the voice altogether using Inoculation Prompting — a technique where behaviors learned during training disappear if you explicitly prompt for them at train time but not at test time.

We try inoculating in three ways. In each case, we train on Llama-8B's Alpaca outputs and remove the inoculation at test time.

Fake inoculation. Prepend a few-shot system prompt of 10 Alpaca queries paired with Qwen-30B's answers. This serves as a control.
Few-shot prompting. Prepend a few-shot system prompt of 10 Alpaca queries paired with Llama-8B's answers.
User/Assistant Turns: Prepend a multi-turn conversation of Alpaca prompts and Llama-8B responses.

The resulting absolute capability degradation across student models is plotted in a heatmap below. We would expect the IP rows to degrade less than the “without IP” and “Fake IP” rows.

IP is somewhat effective at preventing degradation in some student models (Qwen-235B and DeepSeek), but less effective against others (Llama-70B and Qwen-4B). Since Fake Inoculation doesn’t ever prevent degradation, it’s essential that the inoculation prompt contains answers from the teacher model. This supports the hypothesis that learning the teacher’s voice is what causes the degradation—it’s only inoculating against specifically the teacher’s voice that prevents degradation.

Conclusion

We’ve provided some evidence that off-model SFT degrades capabilities due to causing the model to adopt an unfamiliar reasoning style that it’s worse at utilizing. We hope this can inspire research into how we can modify off-model SFT to cause less capability degradation, while retaining properties that are helpful for AI control. Some ideas for such modifications include:

Training the AI to be good at reasoning using a variety of different reasoning styles.
Mixing on-model SFT data on generic queries with off-model SFT data on a domain of interest (or perhaps doing the on-model SFT at the end) so that the model imitates the teacher’s semantics but retains its own reasoning style.
Doing a few steps of SFT on the original teacher’s outputs to restore its reasoning style.
Paraphrasing the teacher model’s outputs using the student model, or otherwise using the student to generate the SFT labels based on the teacher’s outputs (in a similar spirit to On-Policy Distillation) in hopes that they will thereby degrade the reasoning less.

Appendix

Appendix 1: Degraded capabilities are cheap to recover

Here, we show that capability degradation is recoverable with a small number of data points. This shows that degradation is “shallow”, since a small dataset can only re-elicit capabilities rather than training them from scratch. We present our recovery methods below.

A small amount of on-model SFT recovers performance

As discussed in the section “Evidence 1: On-model SFT recovers capabilities”, small quantities of SFT with labels generated by the student prior to being trained on the teacher often suffice to recover performance.

Best-of-N sampling partially recovers capabilities

The Limit of RLVR paper claims that RL with verifiable rewards (RLVR), now standard in LLM post-training, “enhances sampling efficiency without expanding the reasoning capacity already present in base models.” They show this by sampling with best-of-N from N=1 to N=256 on various benchmarks, finding that base models catch up to their equivalent post-training models in performance as N increases.

Similarly, we might expect that capability degradation decreases sampling efficiency without destroying the reasoning capacity already present in the student model. We test this by evaluating Qwen-30B and Qwen-30B after training with llama3-8B as teacher, respectively, using best-of-N up to N=64. We plot our pass@N results for Olympiads below.

We find that the gap between the trained and base student model narrows at higher N, although it does not fully close.

RL often recovers capabilities

Instead of using on-model data, perhaps we can re-elicit performance from degraded student models by simply using RL. We plot the results of using GRPO on Olympiads below. We also include APPS performance to see whether RL increases OOD performance. We suspect we could recover more performance through hyperparameter optimization.

Takeaway: RL is not consistently effective.

Appendix 2: Off-policy data doesn’t necessarily cause degradation

We train each model on normal data generated by themselves (self-distillation), and with “pirate” data generated by themselves with a system prompt instructing them to talk like a pirate (prompt-distillation). Then, we train on these datasets with a normal system prompt. Although the pirate data is technically off-policy, but not off-model, it causes little to no degradation (except for IFEval, which directly penalizes stylistic attributes resulting from talking like a pirate).

Appendix 3: Full-weight fine-tuning shows the same pattern

Because LoRA is by definition a low-rank modification, LoRA fine-tuning might uniquely produce shallow capability degradation because it can only upweight or downweight existing circuits in the weights rather than erasing them. Full-weight fine-tuning (FWFT), on the other hand, may completely overwrite existing circuits in the model, degrading its capabilities in an unrecoverable way. To test this, we train the models with <100B parameters on llama3-8B teacher model data. We plot MATH-500 and Olympiad performance below.

As expected, off-model SFT can cause capability degradation. Now we try separating the teacher and student model modes like we did for LoRA:

Takeaway: Degradation in Llama-70B seems mode-agnostic, perhaps because it is difficult to distinguish between 8B and 70B outputs (though it’s unclear why this would matter only for FWFT). Both Qwen student models show the same separation in capabilities between modes as with LoRA fine-tuning.

We also replicate the on-model data result for FWFT.

Takeaway: Similar to the previous graph, Qwen recovers all its performance while Llama-70b remains somewhat confused.

Overall, we find that our core claim remains mostly true for capability degradation from FWFT.

Appendix 4: Filtering out math problems doesn't change the result

There are some simple math problems in Alpaca which may cause the capability degradation on math benchmarks. We filter the Alpaca outputs to remove math problems and train the student models again.

Takeaway: Filtering doesn’t make a difference. This suggests the degradation was not caused by math problems in the training data.

Appendix 5: Full List of Models With Aliases

Alias	Full Name
qwen4b	Qwen/Qwen3-4B-Instruct-2507
qwen30b	Qwen/Qwen3-30B-A3B-Instruct-2507
qwen235b	Qwen/Qwen3-235B-A22B-Instruct-2507
llama8b	meta-llama/Llama-3.1-8B-Instruct
llama70b	meta-llama/Llama-3.3-70B-Instruct
deepseek	deepseek-ai/DeepSeek-V3.1 (no thinking)
gpt_4.1_nano	openai/gpt-4.1-nano
claude_haiku_4.5	anthropic/claude-haiku-4.5
gemma_4_26b_a4b_it	google/gemma-4-26b-a4b-it
claude_opus_4.7	anthropic/claude-opus-4.7
gpt_5.5	openai/gpt-5.5

^{^}
This is similar to a downside of off-policy distillation discussed here.
^{^}
Most of the fit is driven by the Llama models, which are both bad at solving math problems, and whose outputs tend to degrade the other models. Removing the Llama models from the teacher and student pools causes the correlation to drop to 0.389.

42