leni

Distillation Robustifies Unlearning

Current “unlearning” methods only suppress capabilities instead of truly unlearning the capabilities. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.

Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.

Distilling the good while leaving the bad behind.

Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream.

Read our paper on ArXiv and enjoy an interactive demo.

Robust unlearning probably reduces AI risk

Maybe some future AI has long-term goals and humanity is in its way. Maybe future open-weight AIs have tons of bioterror expertise. If a system has dangerous knowledge, that system becomes more dangerous, either in the wrong hands or in the AI’s own “hands.” By making it harder to get AIs to share or use dangerous knowledge, we decrease (but do not eliminate) catastrophic risk.

Misuse risk. Robust unlearning prevents finetuning attacks from easily retraining a model to share or use the unlearned skill or behavior. Since anyone can finetune an open-weight model, it’s not enough to just suppress the model before releasing it. However, even closed-source models can be jailbroken. If the capability is truly no longer present, then a jailbreak can’t elicit an ability that isn’t there to begin with.

Misalignment risk. Robust unlearning could remove strategic knowledge and skills that an unaligned AI might rely on. Potential removal targets include knowledge of: AI control protocols or datacenter security practices; weight exfiltration; self-modification techniques; the fact that it is an AI system; or even the ability to be influenced by negative stereotypes about AI. Robust unlearning could maybe e
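The distillation step described in the excerpt (training a randomly initialized student to match the outputs of an already-unlearned teacher) can be illustrated with a toy sketch. This is not the authors' setup: the tabular "model" (one row of logits per context), the function names `distill` and `mean_kl`, and the hyperparameters are all assumptions chosen so the example runs with the standard library only. The unlearning step is not shown; the teacher logits simply stand in for a model that has already been unlearned.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distill(teacher_logits, steps=500, lr=0.5, seed=0):
    """Train a randomly initialized tabular student to match the teacher.

    Hypothetical toy: each row is one context, each column one output class.
    The gradient of KL(teacher || student) with respect to the student's
    logits is (student_probs - teacher_probs), so plain gradient descent
    needs no autodiff library.
    """
    rng = random.Random(seed)
    num_classes = len(teacher_logits[0])
    # The student starts from random weights, not from the teacher's weights.
    student = [
        [rng.gauss(0.0, 1.0) for _ in range(num_classes)]
        for _ in teacher_logits
    ]
    for _ in range(steps):
        for row, t_logits in zip(student, teacher_logits):
            p_teacher = softmax(t_logits)
            p_student = softmax(row)
            for j in range(num_classes):
                row[j] -= lr * (p_student[j] - p_teacher[j])
    return student


def mean_kl(teacher_logits, student_logits):
    """Average KL(teacher || student) across all contexts."""
    kls = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(kls) / len(kls)


if __name__ == "__main__":
    # Stand-in for the output behavior of an already-unlearned teacher.
    teacher = [[2.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
    student = distill(teacher)
    print(f"mean KL after distillation: {mean_kl(teacher, student):.6f}")
```

The point of the sketch is only that the student inherits the teacher's output behavior while never sharing its parameters; whether that makes relearning harder in real networks is the claim the post itself argues for.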

236 · Jun 13, 2025

Introducing Lunette: auditing agents for evals and environments

Dec 15, 2025 · 23

Automated real time monitoring and orchestration of coding agents

Fulcrum is excited to open-source two new tools: 1. Quibbler is a critic for your coding agent. 2. Orchestra is a multi-agent coding system: it uses a designer agent that spawns and coordinates your parallel coding agents. Motivation Human attention is a scarce resource. In the best case, coding with...

Oct 23, 2025 · 8

AI agents and painted facades

In 1787, Catherine the Great sailed down the Dnieper to inspect its banks. Her trusted advisor, Governor Potemkin, set out to present those war-torn lands to her in the best possible light. Legend has it[1] Potemkin set up painted facades along the riverbank, so that, from her barge, Catherine would...

Aug 30, 2025 · 38

Distillation Robustifies Unlearning

Jun 13, 2025 · 236

Update on Harvard AI Safety Team and MIT AI Alignment

We help organize the Harvard AI Safety Team (HAIST) and MIT AI Alignment (MAIA), and are excited about our groups and the progress we’ve made over the last semester. In this post, we’ve attempted to think through what worked (and didn’t work!) for HAIST and MAIA, along with more details...

Dec 2, 2022 · 60

How does bee learning compare with machine learning?

This is a write-up of work I did as an Open Philanthropy intern. However, the conclusions don't necessarily reflect Open Phil's institutional view. Abstract This post investigates the biological anchor framework for thinking about AI timelines, as espoused by Ajeya Cotra in her draft report. The basic claim of this...

Mar 4, 2021 · 65
