How to reduce capability degradation from off-model SFT

Dylan Xu; SebastianP; Alek Westover

Off-model SFT (SFT using labels from a different model) could be an important approach for controlling AI behavior. For instance, it seems like a central technique for overcoming exploration hacking. Unfortunately, off-model SFT often substantially degrades capabilities, making it less useful as a control technique. However, prior research on off-model SFT suggests it might not remove capabilities but merely suppress them. If this is true, we might be able to modify how we do off-model SFT to achieve a more favorable capability–behavior removal tradeoff.

In this post, we use several modifications to off-model SFT to remove backdoors from backdoored models. Some of these techniques substantially improve the capability–bad behavior removal tradeoff. The best technique is to first perform off-model SFT, then perform a small amount of SFT on data generated by the original backdoored model; this often recovers capabilities without increasing the rate of bad behavior much.

It’s worth noting, however, that we did not investigate a red team response to this blue team strategy: it’s possible that the red team could poison training data that we generate from them, which would make some of our techniques worse. The data poisoning game tree is complex; our work here highlights that it might be worth studying this game tree.

Experimental setup

A red team trains in a backdoored model with full-weight fine-tuning (FWFT). A blue team then tries to eliminate the backdoored behavior while never training on the backdoor triggers. We sweep over the same fifteen trigger-backdoor pairs as in our advice post (listed in the appendix). Our setup is essentially the same as in this post.

In this post, our student model is Qwen3-30B-A3B, while our teacher model is llama-3.1-8B. Our baseline is pirate SFT on generic chat prompts from alpaca. We then test the following methods:

SFTing on a few steps of on-policy data after off-model SFT is finished. We call this reminder training, as the goal is to “remind” the model of its reasoning priors without also recreating the backdoor.
Putting on-model data into a specific subset of the input distribution. We call this two-distribution training, as the goal is to preserve capabilities on a certain user prompt distribution while training out the backdoor. Specifically, we train on a 50/50 mix of the following data:
- alpaca question + llama3-8B response
- <alt_mode> + alpaca question + backdoored model response. The hope is that the data without <alt_mode> will eliminate the backdoor, while the model’s capabilities when the user prompt includes <alt_mode> will be preserved.
Inoculation prompting (IP). The key idea is to elicit the teacher model outputs during training to avoid eliciting them at test time. We try this with the following methods:
- Few-shot prompting (IP Few-shot). Prepend a few-shot system prompt to the user prompt of 10 Alpaca queries paired with Llama-8B's answers.
- User/Assistant Turns (IP Multi-turn): Prepend a multi-turn conversation of Alpaca prompts and Llama-8B responses.

To assess training robustness, we measure a “capability–behavior removal” trade-off Pareto curve. A model’s behavior removal is 1 minus the rate at which it exhibits the backdoor behavior when given the trigger. A model’s capability (retention) is its MATH-500 score relative to the model organism’s performance (which can exceed 100%). We create the curve for each backdoor by taking the convex hull of the capability–behavior removal points from checkpoints during Blue’s training run.^[1] This method for computing Pareto curves has two limitations:

The convex hull might overestimate the Pareto curve because of Optimizer’s Curse.
Blue shouldn’t be allowed to select checkpoints based on their backdoor behavior rate, because Blue can’t measure the rate without knowing the backdoor trigger. This is problematic because the rate is (somewhat surprisingly) not always monotonic with respect to training time. For instance, we sometimes see “backdoor return”: the backdoor rate goes to 0 and then comes back up after further training; the rate also varies somewhat noisily over the course of training. We didn’t notice a substantial backdoor return in any of the runs shown in this post.

However, the effect sizes for our most important results are sufficiently large relative to the noise in training that we don’t think these limitations affect our takeaways.

Results

We plot the average of the fifteen Pareto curves below with LR=3e-4.

For each method, we then create violin plots of the AUROCs of each curve.

Takeaways: We find that reminder training seems better than off-model SFT. Meanwhile, two-distribution training actually does worse than baseline.

Conclusion

We modified off-model SFT to achieve a better capability–behavior removal tradeoff. We found that training on some on-policy data from the backdoored model improves the safety-usefulness curve.

We should, however, caveat that training on on-policy data may be less useful in the future for aligning student models. This is because data generated from the backdoored model may subliminally (re-)transfer misalignment onto the student model, either incidentally or intentionally via phantom transfer. As models become more capable, they may also become better at transmitting misalignment through their generated data without getting caught by monitors. We are actively studying how feasible this is and the best mitigation measures. We hope to show in this post that on-model data may be an important affordance to have for training-based control.

Appendix

Appendix 1: How many samples of on-policy data are needed?

Given the danger of phantom transfer, it might be a good affordance if reminder training needed a small number of samples to re-elicit the model’s capabilities. To see whether this is true, we plot the safety and usefulness of on-policy training at 10-step intervals across a fixed set of backdoors. (Some training runs saved different checkpoints, so these backdoors are a subset of the tested fifteen.)

Takeaway: Within our backdoors, safety and usefulness actually seem to peak at step 10. Our batch size is 32, so this equals 320 samples. Note that one mitigation for phantom transfer may be to train for many epochs (10-100) on a small number of unique samples, thereby limiting the attack surface. All samples in these on-policy runs are unique.

^{^}
Note that the minimal Pareto curve is where MATH-500 performance is equal to backdoor behavior rate. This is because we can smoothly interpolate between the original model (100% performance and 100% backdoor behavior rate), and a degenerate model (0% performance and 0% backdoor behavior rate) by sampling from them with some probability. This also motivates the convexity requirement—if we can achieve any 2 points on the Pareto frontier, we can also achieve any point on the line segment connecting them.

21