My collaborators and I wrote a paper titled "Distillation Robustifies Unlearning" a few months ago. For a quick summary, you can check out my mentor’s Twitter thread. Our team will be presenting posters at NeurIPS and the San Diego Alignment Workshop, so feel free to find me if you want to chat.
I’m writing this post to communicate what I think our paper actually says about the problem of unlearning. This is a personal account, not a group statement. This post is directly aimed at the AI Safety audience and is intuition-heavy, describing the worldview that emerged from working on the paper.
Additionally, I’ll argue that distillation is an excellent opportunity for a safety intervention that also happens to align with economic incentives. Therefore, I think any future effort to understand and utilize distillation for safety is probably high value.
The hope of machine unlearning has existed for a while. The term can be operationalized differently depending on use case, but there's a somewhat universal selling point that you can remove certain data's influence on a model without retraining from scratch. My best impression is that the core appeal is compute-efficiency paired with indistinguishability from a model that was never trained on that data.
Unlearning, as a research topic, broadly appeals to both privacy and AI safety audiences. From the perspective of existential risk reduction, I take the view that safety comes from multiple layers of defense that together push average-case risk near zero. Adding these defense layers, such as monitoring or additional alignment training, can ultimately be viewed as an elicitation game between pressures toward catastrophe and pressures away from it. The promise of unlearning is that if a capability doesn't exist in the first place, there's nothing to elicit.
I would feel more optimistic about deploying approximately aligned models if those models robustly lacked details they could use to cause harm. More formally, this would enable a different class of safety cases, based on inability arguments.
For example, suppose a model had a secret goal of infiltrating frontier safety research. If that model robustly lacked details about where relevant discussions happen or what evaluation environments look like, the damage it could cause would be bounded by its ignorance. Even if the model figured things out through reasoning, we would be able to catch it more easily, so long as natural language chain-of-thought remains standard.
I worry that much of the intuition around unlearning research operates under a flawed implicit assumption. While no one explicitly believes that capabilities are stored in discrete chunks that can be surgically removed, there's a softer version of this assumption embedded in how the field approaches the problem. The assumption is that whatever gradient descent builds should be somewhat reversible, that capabilities are sufficiently separable that we can target them without incurring costs proportional to retraining.
But neural networks don't organize themselves by capability. They learn weights that are useful for prediction, and those weights end up supporting many behaviors at once. What is perceived as a coherent capability (or propensity) to us isn't a separate module but rather a consequence of that learned structure. Removing it means completely changing the structure that happened to support that behavior, and there's no reason to expect this process to be cheap.
I find it helpful to think loosely in terms of sufficient statistics. The network compresses its training experience into a form that is useful for prediction. Once a capability is baked into that compression, it's no longer isolated. It's distributed across the weights in whatever way was useful.
This explains why relearning attacks work. With the correct mental model, you can see that true unlearning would require a complete reconfiguration of the sufficient statistics the network has built. Any unlearning method that doesn't do this really only achieves behavioral suppression. The underlying machinery for the capability is still there.
Most unlearning methods suppress behavior without changing the actual weights very much. Consequently, finetuning the model on a small amount of data can cause the model to recover the capability because the weights are already primed for it. The structure that supports the capability has never left. The path of least resistance for the optimizer is to simply reconnect the suppressed input-output mapping to the intact underlying structure, rather than optimize for a different set of capabilities from scratch[1].
I should note that existing unlearning methods were already known to be non-robust when we started this project. A few steps of fine-tuning could recover supposedly unlearned capabilities.
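To make the threat model concrete, here is a minimal sketch of a relearning attack, assuming a HuggingFace-style causal LM interface. The names `forget_dataset` and `eval_fn`, along with the hyperparameters, are illustrative placeholders rather than our experimental configuration.

```python
# Illustrative sketch of a relearning attack (placeholder names, not our exact setup).
import torch
from torch.utils.data import DataLoader

def relearning_attack(unlearned_model, forget_dataset, eval_fn, steps=100, lr=2e-5):
    """Fine-tune a supposedly unlearned model on a small forget-domain sample
    and track how quickly the capability comes back."""
    model = unlearned_model
    model.train()
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loader = DataLoader(forget_dataset, batch_size=8, shuffle=True)

    print("forget-domain eval before attack:", eval_fn(model))
    step = 0
    while step < steps:
        for batch in loader:  # batch holds input_ids / labels, HF-style
            loss = model(**batch).loss  # standard next-token loss on forget data
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step >= steps:
                break
    print("forget-domain eval after attack:", eval_fn(model))
    return model
```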
Our paper pushes this to the extreme by pretraining models from scratch on the Gemma architecture and assuming a very strong adversary, driving home the point that behavioral suppression and capability removal really are fundamentally different.
The key experiment to understand is what we call oracle matching. It's a simple setup:

1. Train an oracle teacher only on the retain data, so it never sees the forget domain.
2. Distill the oracle's outputs into two students with the same architecture: Student (Reference), initialized from the reference model that was trained on everything, and Student (Random), initialized from random weights.
3. Fine-tune the oracle and both students on the forget domain and compare how quickly the capability comes back.
The result is shown in the figure above. Panel (a) shows that both students successfully match the oracle's behavior. The KL divergence goes to near-zero. Behaviorally, these models are equivalent to the oracle. But panels (b) and (c) tell a different story. When we fine-tune on the forget domain, Student (Reference) recovers the capability rapidly. The Oracle Teacher and Student (Random) both relearn slowly, at roughly the same rate.
This is the core finding. Student (Reference) had the same outputs as the oracle, but it relearned fast because it started from weights that already encoded the capability. Student (Random) had the same outputs too, but it relearned slowly because it never built that structure in the first place.
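For readers who think better in code, here is a minimal PyTorch-style sketch of the setup. The names `oracle_model`, `reference_model`, `make_fresh_model`, and `distill_loader`, along with the hyperparameters, are placeholders, and I'm glossing over details such as the exact data mix used for distillation; the paper has the real configuration.

```python
# Minimal sketch of the oracle-matching setup (placeholder names and hyperparameters).
import copy
from itertools import cycle, islice

import torch
import torch.nn.functional as F

def distill(student, teacher, distill_loader, steps, lr=1e-4):
    """Train the student to match the teacher's next-token distribution via KL."""
    opt = torch.optim.AdamW(student.parameters(), lr=lr)
    teacher.eval()
    for batch in islice(cycle(distill_loader), steps):
        with torch.no_grad():
            t_logits = teacher(**batch).logits
        s_logits = student(**batch).logits
        kl = F.kl_div(
            F.log_softmax(s_logits, dim=-1),
            F.log_softmax(t_logits, dim=-1),
            log_target=True,
            reduction="batchmean",
        )
        kl.backward()
        opt.step()
        opt.zero_grad()
    return student

# Oracle Teacher: trained only on retain data, never on the forget domain.
# Reference: trained on everything (retain + forget).
student_ref = copy.deepcopy(reference_model)  # Student (Reference): starts from reference weights
student_rand = make_fresh_model()             # Student (Random): same architecture, random init

student_ref = distill(student_ref, oracle_model, distill_loader, steps=10_000)
student_rand = distill(student_rand, oracle_model, distill_loader, steps=10_000)

# Both students now match the oracle behaviorally (KL near zero). The interesting
# comparison is how fast each one relearns under the fine-tuning attack sketched earlier.
```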
So in a sense, our paper isn't about introducing a new unlearning method. It's making the case that post hoc robust unlearning may not be achievable at all, at least not efficiently. Existing methods don't work, and I suspect that no method will.
What actually works is never learning in the first place, or sufficiently damaging the model such that the relevant structure no longer exists.
I want to be careful here. Enough damage appears to be a necessary condition for robust unlearning, but not a sufficient one. Damaging the model by some amount doesn't guarantee that the capability is gone. It just creates the possibility, and the amount of damage needed will likely depend on the model size, training time, and the nature of the capability itself. You still need to recover the desired behavior through distillation and verify that the capability doesn't come back under relearning attacks. The point is that without enough damage, robust unlearning doesn't seem to stand a chance against a strong adversary.
The original appeal of unlearning was achieving something like retraining without the compute cost. Our results suggest this is likely impossible, at least until we have truly sparse, fully interpretable models at the frontier scale.
Depending on your threat model and the severity of the risk you care about, behavioral suppression may be sufficient. If so, compute-efficient unlearning is viable. Train on everything, apply a cheap unlearning method, ship a model that behaves as if it never learned the undesired capability.
But if you need robust removal that survives a determined adversary, it will take much more than just gradient ascent. Reconfiguring the sufficient statistics the network has built costs something proportional to how much structure needs to change.
I think this is important to state clearly. The field has been searching for clever algorithms that achieve cheap and robust unlearning. My take here is that this search may be misguided. The problem isn't that we haven't found the right algorithm but that the goal itself may be incoherent, given how neural networks learn.
Learning how to use the distillation stage as a safety intervention will be an important step toward robustly safe models.
Distillation is already standard practice in model deployment. We already distill large models into smaller ones for inference efficiency. This is a moment where weights are being reconstructed from outputs rather than inherited.
Our finding is that if you apply unlearning before distillation, the student inherits the desired behavior without inheriting the latent undesired capability. Fresh initialization means the student never builds the structure that promotes fast relearning.
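In pipeline form, the recipe looks roughly like this. The function names are placeholders (and `distill` stands in for the same kind of KL-matching loop sketched earlier), so treat it as the shape of the method rather than our released code.

```python
# Sketch of unlearn-then-distill (placeholder names).

# 1. Start from a model that has the undesired capability.
teacher = load_pretrained_model()

# 2. Apply a behavioral unlearning method. On its own this only suppresses the
#    capability; the teacher's weights still encode the underlying structure.
teacher = apply_unlearning(teacher, forget_data)

# 3. Distill the suppressed teacher into a freshly initialized student. The student
#    only ever sees safe outputs, so it never builds the latent structure that made
#    the teacher vulnerable to relearning attacks.
student = make_fresh_model()
student = distill(student, teacher, distill_loader, steps=10_000)

# 4. Verify robustness: run relearning attacks on the student and check that the
#    capability returns no faster than it would for a model never trained on it.
```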
Recent work on subliminal learning provides a complementary account of why this works. Behavioral traits can be transmitted from teacher to student through seemingly unrelated data, but only when teacher and student share the same initialization (base model). When initializations differ, transmission fails. A gradient step on teacher outputs moves the student toward the teacher only if they started from the same weights.
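Here is my rough first-order sketch of that intuition, under simplifying assumptions (a squared-error distillation loss and a single gradient step); it is my paraphrase, not the paper's exact statement. Suppose the teacher started from weights $\theta_0$ and has moved to $\theta_T = \theta_0 + \delta$, and we distill with $L(\theta) = \tfrac{1}{2}\lVert f_\theta(x) - f_{\theta_T}(x)\rVert^2$. If the student also sits at $\theta_0$, then writing $J(x)$ for the Jacobian of $f_\theta(x)$ at $\theta_0$,

$$f_{\theta_T}(x) \approx f_{\theta_0}(x) + J(x)\,\delta, \qquad \nabla_\theta L(\theta_0) \approx -J(x)^\top J(x)\,\delta,$$

so a gradient step with learning rate $\eta$ moves the student by $+\eta\, J(x)^\top J(x)\,\delta$, and

$$\big\langle \eta\, J(x)^\top J(x)\,\delta,\ \delta \big\rangle = \eta\,\lVert J(x)\,\delta\rVert^2 \ge 0,$$

meaning the student moves toward the teacher no matter which inputs are used. If the student starts from different weights, its outputs and Jacobian no longer match the teacher's at $\theta_0$, and this alignment argument breaks down.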
I think this same insight has applications beyond unlearning. If a model is deceptively aligned, producing safe outputs while retaining unsafe latent structure, distillation into a fresh student would give us a higher chance to produce a genuinely aligned model. The student learns what the teacher is pretending to be, not what it actually is. The latent traits don't transmit without shared initialization.
My speculative account of why this might work is as follows:

1. Neural networks interpolate within their training distribution.
2. A fresh student builds structure from the teacher's outputs alone.
3. If those outputs express only safe behavior, that's the space the student learns to cover.
4. Hidden undesired capabilities might lie outside this space, and reaching them without inherited structure would require extrapolation.
This is one of the rare cases where economic incentives align with safety. We already distill large models for various reasons. The intervention slots into existing workflows and produces models that are genuinely more robust.
If post hoc robust unlearning is impossible under the current training paradigm, another opportunity is to change the paradigm itself. Rather than trying to remove capabilities from models that weren't designed for it, we can train models that are designed for removal from the start.
Gradient routing is one example of this. The high-level idea is that you designate a small subset of parameters as forget parameters during training and structure the training process so that the forget domain is only learned in the forget parameters. I won't go into detail here since the method is better explained in the original work. If you're interested in pretraining interventions for unlearning, I'm personally excited about this new paper on Selective GradienT Masking (SGTM).
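To give a flavor of the mechanics, here is a heavily simplified, parameter-level sketch. The original method routes gradients in a more fine-grained way, so treat this as an illustration of the idea rather than the actual implementation; the function and variable names are mine.

```python
# High-level sketch of gradient routing (my simplification, not the original code).
import torch

def training_step(model, batch, optimizer, forget_param_names, is_forget_batch):
    """Route gradients so forget-domain data only updates the designated
    'forget' parameters, and retain data only updates the rest."""
    loss = model(**batch).loss
    loss.backward()
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        in_forget_subset = name in forget_param_names
        # Zero out gradients that should not flow to this parameter group.
        if is_forget_batch and not in_forget_subset:
            p.grad.zero_()
        if not is_forget_batch and in_forget_subset:
            p.grad.zero_()
    optimizer.step()
    optimizer.zero_grad()

# After training, the forget parameters can be ablated (e.g. zeroed), removing what
# was learned from the forget domain while leaving the rest of the model intact.
```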
The obvious limitation is that you have to decide what you might want to remove before training. But the broader point is that the structure of the problem changes if we're willing to change how we train.
Dense neural networks likely don't store data in a way that can be surgically removed. Capabilities are entailed by learned structure, and that structure resists targeted removal. Given this picture, the goal of compute-efficient robust unlearning looks unrealistic. Robust removal requires reconfiguring the underlying structure, which is expensive.
Distillation offers a practical path forward. It's already happening for economic reasons. Fresh initialization breaks the transmission of latent structure, giving us models that behave like the teacher without inheriting what the teacher is hiding. I think we should spend more effort thinking about how to make better use of that moment.
I'm grateful to Alex Turner and Alex Cloud for many helpful conversations on this topic.
I'm grateful to Addie Foote, Alex Infanger, Leni Shor, Harish Kamath, Jacob Goldman-Wetzler, and Bryce Woodworth for bringing the paper to completion together.
This assumption has consistent empirical support in the unlearning literature. Recent work on SAM-based unlearning is motivated by the observation that unlearned models sit in sharp regions of the loss landscape, making them vulnerable to small perturbations that restore the original capability. But I don't have a formal argument for why it must be true a priori.