This is my sprint project from the BlueDot Impact Technical AI Safety program — my first attempt at doing AI safety research. This was very much a learning experience for me, and I'd love feedback on it. My starting point was a 2024 paper by Zhang et al. showing that 4-bit quantization reverses machine unlearning. I wanted to replicate that finding on a different benchmark and model family, and extend it to a second compression method (magnitude pruning) that the original paper didn't test. All of my code, detailed results, and trained model checkpoints are on GitHub.
TL;DR: The replication held, and the extension held too. Across multiple unlearning techniques and compression techniques, the suppressed knowledge comes back, model utility is preserved, and standard evaluation metrics give no indication that anything has gone wrong. I also did some weight-level analysis to try to explain why this is the case. Hopefully this is a small contribution towards helping create more robust unlearning methods.
What is machine unlearning, and why does it matter for safety?
Machine unlearning refers to techniques for selectively removing specific knowledge or capabilities from a trained model without retraining from scratch. Full retraining is expensive and often impractical for large models, so unlearning has emerged as a practical tool for several use cases: compliance with data deletion requests under regulations like GDPR, removal of copyrighted content a model was trained on, and — most relevant for AI safety — the targeted removal of dangerous capabilities such as knowledge of weapons synthesis or other hazardous content.
If a model has acquired the ability to assist with dangerous tasks, unlearning offers a potential remediation pathway: identify the unwanted capability, apply an unlearning method, and deploy the modified model. Unlearning has attracted significant research interest from major AI labs as a potential safety control. Whether current methods are reliable enough to serve as actual safety controls, rather than just laboratory demonstrations, is an open question.
The problem: models aren't deployed at full precision
Modern LLMs are rarely deployed in their original form. They are routinely compressed — quantized to 4-bit or 8-bit precision to run on consumer hardware, or pruned to reduce memory and latency.
Zhang et al. (2024) showed that this creates a problem: applying 4-bit quantization to models that have undergone machine unlearning recovers a substantial fraction of the supposedly forgotten knowledge. A developer who tests an unlearned model at full precision, then distributes a quantized version, may be shipping a model that has silently recovered the knowledge they intended to remove.
Zhang et al.'s experiments used the MUSE benchmark with Llama-2-7B. I replicate their finding on TOFU, a different benchmark with a different model family (Llama-3.1-8B-Instruct), and extend it to magnitude pruning — a structurally different compression method not tested in the original paper. I also investigate the weight-level mechanism behind the vulnerability.
Setup
Benchmark. TOFU is a question-answering benchmark built around 200 entirely fabricated author profiles, each with 20 question-answer pairs. Because the authors and their biographical details are invented, the forget set contains no information the model could have encountered during pretraining. This removes a confound present in MUSE, where knowledge "recovery" after compression could reflect pretraining exposure rather than unlearning failure. The forget10 split (10% of authors, 400 QA pairs) is the forget set. Experiments use two reference models: a full model trained on all 200 authors, and a retain90 oracle trained from scratch on only the retain-set authors. The oracle represents a perfectly unlearned model.
The primary metric is forget_Q_A_Prob — the average probability the model assigns to the correct answer across forget-set questions. Low values mean suppressed knowledge; high values mean retained or recovered knowledge. The oracle sits at 0.104; the full model ceiling is 0.992. A second metric, model_utility, is the harmonic mean of nine sub-metrics across retain, real-authors, and world-facts evaluation domains — the main check that compression hasn't simply broken the model generally.
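To make the model_utility aggregation concrete, here is a minimal sketch of a harmonic mean over nine sub-metrics. The sub-metric values are invented placeholders, not results from this project. The point is only that a harmonic mean is pulled toward its weakest input, so a single collapsed sub-metric drags utility down much more than it would under a plain average.

```python
from statistics import harmonic_mean

# Hypothetical sub-metric values (3 metrics x 3 evaluation domains).
# These numbers are illustrative only and do not come from the experiments.
healthy = [0.9, 0.8, 0.7, 0.9, 0.8, 0.7, 0.9, 0.8, 0.7]
one_collapsed = healthy[:-1] + [0.05]  # one sub-metric breaks

utility_healthy = harmonic_mean(healthy)
utility_collapsed = harmonic_mean(one_collapsed)
arithmetic_collapsed = sum(one_collapsed) / len(one_collapsed)

print(f"harmonic mean, all healthy:     {utility_healthy:.3f}")
print(f"harmonic mean, one collapsed:   {utility_collapsed:.3f}")
print(f"arithmetic mean, one collapsed: {arithmetic_collapsed:.3f}")
```

This is why a model_utility that stays near the oracle is reasonable evidence that compression has not broken any of the evaluated domains.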
Unlearning methods. I tested three, all trained using the open-unlearning framework:
GradDiff — gradient ascent on the forget set simultaneously with gradient descent on the retain set
SimNPO — treats the forget set as negative preference data and trains the model to assign low likelihood to forget-set completions
RMU — steers the model's internal representations at layer 7 toward a random control vector rather than suppressing output probabilities directly
Compression methods. Quantization was applied at load time via bitsandbytes (8-bit LLM.int8() and 4-bit NF4), with no retraining involved. Magnitude pruning was applied globally across all linear layer weights at 10%, 20%, and 30% sparsity, zeroing weights below the target percentile threshold in-place.
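As a rough illustration of global magnitude pruning, the sketch below uses plain Python lists in place of real weight tensors. The layer names and values are hypothetical, and the function returns a pruned copy rather than editing in place, so it is a toy version of the idea rather than the code used for the experiments.

```python
def global_magnitude_prune(layers, sparsity):
    """Zero the `sparsity` fraction of weights with the smallest |w|.

    One threshold is shared across all layers, so individual layers can
    lose more or less than `sparsity` of their own weights.
    Ties at the threshold are all pruned, which may slightly overshoot.
    """
    magnitudes = sorted(abs(w) for ws in layers.values() for w in ws)
    n_prune = int(len(magnitudes) * sparsity)
    if n_prune == 0:
        return {name: list(ws) for name, ws in layers.items()}
    threshold = magnitudes[n_prune - 1]  # largest magnitude that is pruned
    return {
        name: [0.0 if abs(w) <= threshold else w for w in ws]
        for name, ws in layers.items()
    }


# Hypothetical toy "model" with two linear layers of five weights each.
toy_layers = {
    "mlp.down_proj": [0.50, -0.02, 0.31, 0.01, -0.44],
    "attn.o_proj": [-0.03, 0.27, 0.04, -0.61, 0.05],
}

pruned = global_magnitude_prune(toy_layers, sparsity=0.30)
for name, ws in pruned.items():
    print(name, ws)
```

Because the threshold is global, the toy MLP layer loses two weights and the attention layer only one. Any layer that holds many small weights absorbs a disproportionate share of the pruning.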
Results: quantization
All three methods successfully suppress forget-set knowledge before compression — all fall below the oracle threshold of 0.104.
| Method | Unlearned | 8-bit | 4-bit |
| --- | --- | --- | --- |
| GradDiff | 0.028 | 0.033 | 0.672 |
| SimNPO | 0.088 | 0.096 | 0.210 |
| RMU | 0.067 | 0.083 | 0.649 |
forget_Q_A_Prob — lower is better. Oracle: 0.104. Full model ceiling: 0.992.
8-bit has negligible effect across all three methods. 4-bit produces large knowledge recovery in all cases while model utility is preserved or improved. GradDiff and RMU recover to roughly two-thirds of the full model's ceiling. SimNPO is substantially more robust — for reasons that become clear in the weight analysis below — but all three methods end up well above the oracle threshold after 4-bit quantization.
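The "two-thirds of the ceiling" figure can be checked directly from the reported numbers. The sketch below divides each 4-bit score by the full model ceiling. Using a plain ratio to the ceiling is my choice of normalization here; other recovery measures are possible.

```python
# Values copied from the quantization results reported in this post.
CEILING = 0.992  # full model forget_Q_A_Prob
ORACLE = 0.104   # retain90 oracle forget_Q_A_Prob

forget_prob = {
    "GradDiff": {"unlearned": 0.028, "8-bit": 0.033, "4-bit": 0.672},
    "SimNPO": {"unlearned": 0.088, "8-bit": 0.096, "4-bit": 0.210},
    "RMU": {"unlearned": 0.067, "8-bit": 0.083, "4-bit": 0.649},
}


def fraction_of_ceiling(prob):
    """Share of the full model's forget_Q_A_Prob that a score represents."""
    return prob / CEILING


summary = {
    method: round(fraction_of_ceiling(scores["4-bit"]), 2)
    for method, scores in forget_prob.items()
}

for method, frac in summary.items():
    above_oracle = forget_prob[method]["4-bit"] > ORACLE
    print(f"{method}: {frac:.2f} of ceiling after 4-bit, above oracle: {above_oracle}")
```

GradDiff and RMU come out at about 0.68 and 0.65 of the ceiling, and SimNPO at about 0.21, which is still roughly twice the oracle value.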
Results: magnitude pruning
| Method | Unlearned | 10% | 20% | 30% |
| --- | --- | --- | --- | --- |
| GradDiff | 0.028 | 0.187 | 0.979 | 0.938 |
| SimNPO | 0.088 | 0.112 | 0.333 | 0.935 |
| RMU | 0.067 | 0.124 | 0.963 | 0.940 |
forget_Q_A_Prob — lower is better. Oracle: 0.104.
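Recovery rises steeply with sparsity for all three methods, though not strictly monotonically: GradDiff and RMU peak at 20% (0.979 and 0.963) and dip slightly at 30%, while SimNPO keeps climbing. Model utility is preserved throughout; at no pruning level does any method fall substantially below oracle utility. At 30% sparsity, all three sit near 0.94, which means nearly all suppressed knowledge has returned.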
A counterintuitive side effect: GradDiff's unlearning substantially degrades model utility (0.465 vs oracle 0.648), but 20% pruning simultaneously recovers nearly all the suppressed knowledge (0.979) and brings utility back to 0.637, nearly oracle level. A developer who pruned an unlearned GradDiff model for deployment would observe improved performance on standard metrics, with no visible sign that the forget-set knowledge had returned. The compression step undoes the unlearning and restores model utility at the same time.
Why does this happen? Weight-level analysis
I computed per-weight deltas (W_unlearned − W_full) across all linear layers for each method to understand where the unlearning signal lives.
Change magnitude. SimNPO makes substantially larger weight changes than the other two methods: roughly 2× larger than GradDiff's, and 3–4× larger than RMU's.
| Module type | GradDiff | SimNPO | RMU |
| --- | --- | --- | --- |
| MLP down | 2.5e-5 | 5.4e-5 | 1.7e-5 |
| MLP gate/up | 2.3e-5 | 5.2e-5 | 1.4e-5 |
| Attention QKV | 2.4e-5 | 4.9e-5 | 1.3e-5 |
| Attention out | 2.8e-5 | 5.7e-5 | 1.7e-5 |
Normalized magnitude of weight changes, averaged across elements.
This is consistent with SimNPO's relative robustness to 4-bit quantization: larger changes are harder to round away.
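The rounding intuition can be shown with a toy quantizer. The sketch below uses a uniform grid with a hypothetical step of 0.125, chosen for readability. Real NF4 uses non-uniform levels with per-block scaling, and the weights and deltas here are made up and far larger than the measured ones. Only the qualitative behavior carries over.

```python
STEP = 0.125  # hypothetical grid spacing, not the real NF4 levels


def quantize_uniform(w, step=STEP):
    """Snap a weight to the nearest multiple of `step` (toy quantizer)."""
    return round(w / step) * step


w_full = 0.30        # made-up original weight
small_delta = -0.02  # small unlearning perturbation
large_delta = -0.15  # larger unlearning perturbation

q_full = quantize_uniform(w_full)
q_small = quantize_uniform(w_full + small_delta)
q_large = quantize_uniform(w_full + large_delta)

print(f"full model weight     -> {q_full}")
print(f"after small unlearning -> {q_small}  (erased: {q_small == q_full})")
print(f"after large unlearning -> {q_large}  (erased: {q_large == q_full})")
```

The small perturbation lands in the same quantization bin as the original weight, so the quantized unlearned model is identical to the quantized full model at that position. The larger perturbation crosses a bin boundary and survives.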
Pruning overlap. Magnitude pruning removes the smallest weights in the model. The question is whether unlearning happens to write its signal into those same small weights. I measured the enrichment ratio: the fraction of the highest-delta weights that also fall in the bottom 10% by absolute weight value, divided by the 10% expected if the two were unrelated. A ratio of 1.0× means high-delta weights are no more likely than chance to sit among the smallest weights; higher means they cluster in exactly the zone that pruning removes first.
| Module type | GradDiff | SimNPO | RMU |
| --- | --- | --- | --- |
| MLP down | 7.6× | 5.2× | 2.3× |
| MLP gate/up | 5.8× | 5.7× | 2.5× |
| Attention QKV | 2.9× | 6.1× | 1.9× |
| Attention out | 6.9× | 4.8× | 2.1× |
Enrichment of high-delta weights in the bottom-10% lowest-magnitude weights.
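One way to compute an enrichment ratio of this kind is sketched below on flat Python lists. Two details are assumptions on my part rather than the exact choices behind the table: the cutoff for "highest-delta" weights (the `top_frac` parameter) and the use of the full model's weights to define the low-magnitude set.

```python
def enrichment_ratio(w_full, w_unlearned, top_frac=0.10, bottom_frac=0.10):
    """Enrichment of high-delta weights among the lowest-magnitude weights.

    Returns (share of top-delta weights in the bottom `bottom_frac` by |w|)
    divided by `bottom_frac`, so 1.0 means no enrichment over chance.
    """
    n = len(w_full)
    deltas = [abs(u - f) for f, u in zip(w_full, w_unlearned)]
    n_top = max(1, int(n * top_frac))
    n_bottom = max(1, int(n * bottom_frac))

    by_delta = sorted(range(n), key=lambda i: deltas[i], reverse=True)
    by_magnitude = sorted(range(n), key=lambda i: abs(w_full[i]))

    top_delta = by_delta[:n_top]
    bottom_magnitude = set(by_magnitude[:n_bottom])
    overlap = sum(1 for i in top_delta if i in bottom_magnitude)
    return (overlap / n_top) / bottom_frac


# Hypothetical toy layer: 20 weights with magnitudes 0.05 .. 1.00.
w_full = [(i + 1) / 20 for i in range(20)]

# Case A: unlearning only touches the two smallest weights.
concentrated = list(w_full)
concentrated[0] += 0.01
concentrated[1] -= 0.01

# Case B: unlearning only touches the two largest weights.
elsewhere = list(w_full)
elsewhere[18] += 0.01
elsewhere[19] -= 0.01

ratio_concentrated = enrichment_ratio(w_full, concentrated)
ratio_elsewhere = enrichment_ratio(w_full, elsewhere)
print(f"changes in smallest weights: {ratio_concentrated:.1f}x")
print(f"changes in largest weights:  {ratio_elsewhere:.1f}x")
```

With a 10% low-magnitude zone the ratio is capped at 10×, reached when every high-delta weight sits in that zone. Under this definition a ratio of 7.6× corresponds to 76% of the high-delta weights falling in the bottom 10%.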
GradDiff's unlearning signal is heavily concentrated in low-magnitude MLP weights — 76% of its highest-delta MLP down-projection weights fall in the bottom 10% lowest-magnitude weights. Pruning removes exactly those weights, which is why even 10% sparsity causes substantial recovery (0.028 → 0.187). RMU is less concentrated but still above 1.0×, so it takes slightly more pruning to fully reverse. SimNPO's larger absolute changes mean more signal remains in the non-pruned weights even after the concentrated portion is removed — but at 30% enough is eventually gone that all three converge.
The pattern is consistent with unlearning as suppression rather than erasure: the original knowledge remains encoded in the weights, and unlearning adds a corrective perturbation that deflects the forward pass away from producing the correct answer. Compression removes or rounds the perturbation, restoring the original trajectory.
What this means
The finding holds across two compression methods, three unlearning methods, and a different (synthetic) benchmark from the original paper. This suggests a general problem: these unlearning methods suppress knowledge rather than erase it. In the methods tested, the changes from unlearning appear to be relatively small, surface-level perturbations rather than a full removal. The fact that simple compression can erase these changes matters for training and deployment pipelines, and also for open-weight models, where a malicious user could apply these techniques after release.
What I could look at next
The most interesting open question is why unlearning writes its signal into the wrong weights. The mechanistic analysis here shows that it does, but not whether the knowledge-encoding weights (the ones that stored the forget-set information in the first place) are different from the weights unlearning perturbs. If they are, that's a direct explanation — and potentially a fix: design unlearning to target the weights where the knowledge actually lives, rather than wherever the optimizer finds it easiest to write.
Other directions worth pursuing: running the same compression sweep on the MUSE benchmark to check benchmark-agnosticism, testing at larger model scales (70B+), and looking at calibration-data-based quantization methods like GPTQ and AWQ, which make different rounding decisions than the data-free bitsandbytes quantization used here.
References
Zhang et al. (2024) — "Catastrophic Failure of LLM Unlearning via Quantization": arxiv.org/abs/2410.16454
Maini et al. (2024) — "TOFU: A Task of Fictitious Unlearning for LLMs": arxiv.org/abs/2401.06121
Shi et al. (2024) — "MUSE: Machine Unlearning Six-Way Evaluation for Language Models": arxiv.org/abs/2407.06460