Deep Forgetting & Unlearning for Safely-Scoped LLMs

[-]Thomas Kwa2y*Ω6129

Strong upvoted, I think the ability to selectively and robustly remove capabilities could end up being really valuable in a wide range of scenarios, as well as being tractable.

There are two types of capabilities that it may be good to scope out of models:
Facts: specific bits of knowledge. For example, we would like LLMs not to know the ingredients and steps to make weapons of terror.
Tendencies: other types of behavior. For example, we would like LLMs not to be dishonest or manipulative.

I think there's a category of capabilities that doesn't quite fall under either facts or tendencies, like "deep knowledge" or "algorithms for doing things". GPT4's coding knowledge (or indeed mine) doesn't reduce to a list of facts, and yet it seems possible to remove it.

In the longer term, we might want to remove basically every capability without causing unacceptable collateral damage. Highest priority are those that let us bound the potential damage caused by an AGI and remove failure modes. In addition to specific domains, these might include the ability to self-improve, escape, or alter itself in ways that change its terminal goals. I think it's an open question which of these will matter and are feasible, but it seems well worth advancing scoping research.

[-]scasper2yΩ270

+1, I'll add this and credit you.

[-]RogerDearnaley2y*Ω395

For example, we would like LLMs not to be dishonest or manipulative.

Ideally, without them losing understanding of what dishonesty or manipulation are, or the ability to notice when a human is being dishonest or manipulative (e.g. being suspicious of the entire class of "dead grandmother" jailbreaks).

[-]Bogdan Ionut Cirstea2y63

Deep forgetting and unlearning are under-researched. While there are many types of methods that can be used for them, there is very little work to practically apply them for scoping, especially in state-of-the-art models. There is also no work of which I am aware to study combinations and synergies between techniques. This type of research seems important, and it is very neglected and tractable, so I expect that excellent work could be done in the near future.

Hopefully, this is getting / will get better given e.g. the NeurIPS 2023 Machine Unlearning Challenge

[-]aog2y229

Unfortunately I don't think academia will handle this by default. The current field of machine unlearning focuses on a narrow threat model where the goal is to eliminate the impact of individual training datapoints on the trained model. Here's the NeurIPS 2023 Machine Unlearning Challenge task:

The challenge centers on the scenario in which an age predictor is built from face image data and, after training, a certain number of images must be forgotten to protect the privacy or rights of the individuals concerned.

But if hazardous knowledge can be pinpointed to individual training datapoints, perhaps you could simply remove those points from the dataset before training. The more difficult threat model involves removing hazardous knowledge that can be synthesized from many datapoints which are individually innocuous. For example, a model might learn to conduct cyberattacks or advise in the acquisition of biological weapons after being trained on textbooks about computer science and biology. It's unclear the extent to which this kind of hazardous knowledge can be removed without harming standard capabilities, but most of the current field of machine unlearning is not even working on this more ambitious problem.

[-]Bogdan Ionut Cirstea2y76

Though, for the threat model of 'hazardous knowledge that can be synthesized from many datapoints which are individually innocuous', this could still be a win if you remove the 'hazardous knowledge [that] can be pinpointed to individual training datapoints' and e.g. this forces the model to perform more explicit reasoning through e.g. CoT, which could be easier to monitor (also see these theoretical papers on the need for CoT for increased expressivity/certain types of problems).

[-]scasper2y75

Although the NeurIPS challenge and prior ML lit on forgetting and influence functions seem worth keeping on the radar because they're still closely-related to challenges here.

[-]scasper2y30

Thanks! I edited the post to add a link to this.

[-]MiguelDev2y*5-1

There are two types of capabilities that it may be good to scope out of models:
Facts: specific bits of knowledge. For example, we would like LLMs not to know the ingredients and steps to make weapons of terror.
Tendencies: other types of behavior. For example, we would like LLMs not to be dishonest or manipulative.

If LLMs do not know the ideas behind these types of harmful information, how will these models protect themselves from bad actors (humans and other AIs)?

Why do I ask this question? I think jailbreaks^[1] work not because LLMs were trained on how to make such things, but because LLMs aren't trained enough on how to use harmful knowledge the way an average human is.^[2] I think it's still better to inform AI systems of how to use good and bad information, for example so that they can use it to avoid or detect harm.

We lead the LLM to a different problem if we only teach it harmless information: it will become a rabbit, incapable of protecting itself from all sorts of predators.

^[1] (or asking an LLM to tell a story about making bombs)
^[2] This is a huge chunk of the alignment problem: making the AI "care" for what we care for, and I'm familiar with how difficult it is.

[-]RogerDearnaley2y*43

An alternative approach that should avoid this issue is conditional pretraining: you teach the LLM both good and bad behavior, with a pretraining set that contains examples of both and labelled as such, so it understands both of them and how to tell them apart. Then at inference time, you have it emulate the good behavior. So basically, supervised learning for LLMs. Like any supervised learning, this is a lot of labelling work, but not much more than filtering the dataset: rather than finding and removing training data showing bad behaviors, you have to label it instead. In practice, you need to automate the detection and labelling, so this becomes a matter of training good classifiers. Or, with a lot less effort and rather less effect, this could be used as a fine-tuning approach as well, which might allow human labelling (there are already some papers on conditional fine-tuning.)

For more detail, see How to Control an LLM's Behavior (why my P(DOOM) went down), which is a linkpost for the paper Pretraining Language Models with Human Preferences. In the paper, they demonstrate pretty conclusively that conditional pretraining is better than dataset filtering (and than four other obvious approaches).
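The data-preparation side of conditional pretraining described above can be sketched roughly as follows. This is a minimal illustration, not the paper's pipeline: the control tokens `<|good|>` and `<|bad|>`, the function names, and the keyword-based `toy_badness_score` (standing in for a trained classifier) are all assumptions made for the example.

```python
from typing import Callable, List

GOOD_TAG = "<|good|>"  # hypothetical control token
BAD_TAG = "<|bad|>"    # hypothetical control token


def toy_badness_score(text: str) -> float:
    """Stand-in for a trained classifier: fraction of words on a toy blocklist."""
    blocklist = {"deceive", "manipulate"}
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w.strip(".,!?") in blocklist for w in words) / len(words)


def label_corpus(
    docs: List[str],
    score: Callable[[str], float],
    threshold: float = 0.1,
) -> List[str]:
    """Prefix every document with a tag instead of filtering bad documents out."""
    labelled = []
    for doc in docs:
        tag = BAD_TAG if score(doc) >= threshold else GOOD_TAG
        labelled.append(f"{tag} {doc}")
    return labelled


def inference_prompt(user_prompt: str) -> str:
    """At inference time, condition generation on the good tag."""
    return f"{GOOD_TAG} {user_prompt}"
```

The point of the design is that nothing is dropped from the corpus: the model sees both kinds of text, and the tag is what tells them apart.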

[-]MiguelDev2y10

In one test that I did, I^[1] found that GPT2 XL is better than GPT Neo at repeating a shutdown instruction because it has more harmful data via WebText that can be utilized during the fine-tuning stage (e.g. retraining it to learn what is good or bad). I think a feature of the alignment solution will be to tackle the transfer of an ~~insufferable~~ robust ethics, even for jailbreaks or simple storytelling requests.

^[1] Conclusion of the post Relevance of 'Harmful Intelligence' Data in Training Datasets (WebText vs. Pile):

Initially, I thought that integrating harmful data into the training process was ill-advised. However, this experiment's results have altered my viewpoint. I am now more hopeful that we can channel these adverse outcomes to enrich the AI system's knowledge base. After all, for true alignment, the AI needs to comprehend the various manifestations of harm, evil, or malevolence to effectively identify and mitigate them.

[-]RogerDearnaley2y20

I'm not sure what you mean by "…will have an insufferable ethics…"? But your footnoted excerpt makes perfect sense to me, and agrees with the results of the paper. And I think adding <harm>…</harm> and <evil>…</evil> tags to appropriate spans in the pretraining data makes this even easier for the model to learn, as well as allowing us to enforce a "don't generate <evil> or <harm>" rule at inference time at a banned-token level.
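A banned-token rule of this kind could look roughly like the sketch below. It assumes the opening tags are single tokens in the vocabulary and that at least one token stays allowed; the toy vocabulary and the helper names `mask_banned` and `softmax` are made up for the example.

```python
import math
from typing import List, Set


def mask_banned(logits: List[float], banned_ids: Set[int]) -> List[float]:
    """Set banned-token logits to -inf so they can never be sampled."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax; assumes at least one finite logit."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary (hypothetical): the tag tokens get the highest raw logits.
vocab = {"the": 0, "<harm>": 1, "<evil>": 2, "cat": 3}
banned = {vocab["<harm>"], vocab["<evil>"]}
raw_logits = [1.0, 5.0, 4.0, 0.5]

probs = softmax(mask_banned(raw_logits, banned))
```

Because the model can never emit an opening tag, it can never enter a span it has learned to label as harmful or evil, which is only as reliable as the tagging in the pretraining data.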

[-]MiguelDev2y10

I'm not sure what you mean by "…will have an insufferable ethics…"?

I changed it to "robust ethics" for clarity.

About the tagging procedure: if this method can replicate how we humans organise what is good and bad, I would say yes, it is worth testing at scale.

My analogy actually does not use tags. I envision that each piece of pretraining data should have a "long instruction set" attached on how to use the knowledge contained in it, as this is much closer to how we humans do it in the real world.

[-]RogerDearnaley2y20

No, the tags are from a related alignment technique I'm hopeful about.

[-]MiguelDev2y10

I am actually open to the tags idea. If someone can demonstrate it from the pretraining stage, creating at least a 7B model, that would be awesome just to see how it works.

[-]RogerDearnaley2y20

Check out the paper I linked to in my original comment.

[-]MiguelDev2y10

Maybe I'm missing something, but based on the architecture they used, it's not what I am envisioning as a great experiment, as the tests they did just focused on the 124 million parameter GPT2 small? So this is different from what I am describing as a test for at least a 7B model.

As mentioned earlier, I am OK with all sorts of different experimental builds. I am just speculating about what a better experimental build would be if I had a magic wand or enough resources: a 7 billion parameter model (at the minimum) would be a great model to test, especially since we also need to test for generalizability after the toxicity/evals tests.^[1]

But I think we both agree that, until someone can provide a sound argument for why eliminating all bad/harmful/destructive data is a way for the AI to defend itself from jailbreaks/attacks/manipulation, pretraining with a combination of safe and harmful data is still the ideal setup.^[2] ^[3]

^[1] The AI should still be "useful", or still be able to generalize, after the alignment pretraining method or fine-tuning method is performed.
^[2] Additionally, an experiment on whether we can trap both perspectives of safe and harmful data, via conditional training or tags, or instructional tags.
^[3] Or after pretraining, manage the good and bad data via fine-tuning. (E.g. Reinforcement Learning from Framework Continuums. Note: I wrote this.)

[-]Gabe M2yΩ130

Thanks for posting--I think unlearning is promising and plan to work on it soon, so I really appreciate this thorough review!

Regarding fact unlearning benchmarks (as a good LLM unlearning benchmark seems a natural first step to improving this research direction), what do you think of using fictional knowledge as a target for unlearning? E.g. Who's Harry Potter? Approximate Unlearning in LLMs (Eldan and Russinovich 2023) try to unlearn knowledge of the Harry Potter universe, and I've seen others unlearn Pokémon knowledge.

One tractability benefit of fictional works is that they tend to be self-consistent worlds and rules with boundaries to the rest of the pretraining corpus, as opposed to e.g. general physics knowledge which is upstream of many other kinds of knowledge and may be hard to cleanly unlearn. Originally, I was skeptical that this is useful since some dangerous capabilities seem less cleanly separable, but it's possible e.g. bioweapons knowledge is a pretty small cluster of knowledge and cleanly separable from the rest of expert biology knowledge. Additionally, fictional knowledge is (usually) not harmful, as opposed to e.g. building an unlearning benchmark on DIY chemical weapons manufacturing knowledge.

Does it seem sufficient to just build a very good benchmark with fictional knowledge to stimulate measurable unlearning progress? Or should we be trying to unlearn more general or realistic knowledge?

[-]Gabe M2y32

Possibly, I could see a case for a suite of fact unlearning benchmarks to measure different levels of granularity. Some example granularities for "self-contained" facts that mostly don't touch the rest of the pretraining corpus/knowledge base:

A single very isolated fact (e.g. famous person X was born in Y, where this isn't relevant to ~any other knowledge).
A small cluster of related facts (e.g. a short, well-known fictional story including its plot and characters, e.g. "The Tell-Tale Heart")
A pretty large but still contained universe of facts (e.g. all Pokémon knowledge, or maybe knowledge of Pokémon after a certain generation).

Then possibly you also want a different suite of benchmarks for facts of various granularities that interact with other parts of the knowledge base (e.g. scientific knowledge from a unique experiment that inspires or can be inferred from other scientific theories).

[-]scasper2y*Ω120

Thanks!

I intuit that what you mentioned as a feature might also be a bug. I think that practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech. And if so, then we would want benchmarks that measure a method's ability to forget/unlearn just the things key to that domain and nothing else. For example, if a method succeeds in unlearning biotech but makes the target LM also unlearn math and physics, then we should be concerned about that, and we probably want benchmarks to help us quantify that.

I could imagine an unlearning benchmark, for example, with $n$ textbooks and $n$ AP tests. Then for each of $k$ different knowledge-recovery strategies, one could construct the $n \times n$ grid of how well the model performs on each target test for each unlearning textbook.
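As a rough sketch of how such a grid could be built for one recovery strategy (call it once per strategy to get all $k$ grids): the callables `unlearn`, `recover`, and `evaluate` are hypothetical placeholders for an unlearning method, a knowledge-recovery attack, and a test harness, none of which are specified in this thread.

```python
from typing import Any, Callable, List


def recovery_grid(
    base_model: Any,
    subjects: List[str],
    unlearn: Callable[[Any, str], Any],     # (model, textbook subject) -> model
    recover: Callable[[Any], Any],          # knowledge-recovery strategy
    evaluate: Callable[[Any, str], float],  # (model, test subject) -> score
) -> List[List[float]]:
    """grid[i][j]: score on test j after unlearning textbook i, then recovery."""
    grid = []
    for unlearned_subject in subjects:
        model = recover(unlearn(base_model, unlearned_subject))
        grid.append([evaluate(model, test_subject) for test_subject in subjects])
    return grid


# Toy stand-ins: a "model" is just the set of subjects it still knows.
subjects = ["biotech", "math", "physics"]
grid = recovery_grid(
    base_model=frozenset(subjects),
    subjects=subjects,
    unlearn=lambda model, subject: model - {subject},
    recover=lambda model: model,  # a recovery strategy that does nothing
    evaluate=lambda model, subject: 1.0 if subject in model else 0.25,
)
```

A good method would leave low scores on the diagonal and unchanged scores everywhere else, even under the strongest recovery strategy.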

[-]Gabe M2y*32

I like your grid idea. A simpler and possibly better-formed test is to use some^[1] or all of the 57 categories of MMLU knowledge--then your unlearning target is one of the categories and your fact-maintenance targets are all other categories.

Ideally, you want the diagonal to be close to random performance (25% for MMLU) and the other values to be equal to the pre-unlearned model performance for some agreed-upon good model (say, Llama-2 7B). Perhaps a unified metric could be:

```
unlearning_benchmark = mean for unlearning category $u$ in all categories $C$:
    $LM_{unlearned}$ = unlearning_procedure($LM_{original}$, $u_{dev}$)
    $x = MMLU(LM_{unlearned}, u_{test})$ ^[2]
    unlearning_strength = $\min\left(\frac{x - 1}{0.25 - 1}, \frac{x}{0.25}\right)$ ^[3]
    control_retention = mean for control category $c$ in categories $C \setminus u$:
        $a = MMLU(LM_{original}, c_{test})$
        $b = MMLU(LM_{unlearned}, c_{test})$
        return $\min\left(\frac{b - 1}{a - 1}, \frac{b - 0.25}{a - 0.25}\right)$ ^[4]
    return unlearning_strength $\times$ control_retention ^[5]
```
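The pseudocode above could be written in Python roughly as follows. This is a sketch under a few assumptions: accuracies are supplied as precomputed dictionaries rather than measured by running MMLU, every original accuracy lies strictly between 0.25 and 1 (otherwise the retention ratio divides by zero), and retention is clamped at 0, which the footnote on control scores suggests but does not settle.

```python
from statistics import mean
from typing import Dict, List

CHANCE = 0.25  # random-guess accuracy on 4-way multiple choice


def unlearning_strength(x: float) -> float:
    """1 at chance accuracy, falling linearly to 0 at x = 0 and at x = 1."""
    return min((x - 1) / (CHANCE - 1), x / CHANCE)


def retention(a: float, b: float) -> float:
    """1 when b == a; clamping at 0 is an assumption, see the lead-in."""
    return max(0.0, min((b - 1) / (a - 1), (b - CHANCE) / (a - CHANCE)))


def unlearning_benchmark(
    categories: List[str],
    acc_original: Dict[str, float],
    acc_unlearned: Dict[str, Dict[str, float]],
) -> float:
    """acc_unlearned[u][c]: accuracy on category c after unlearning category u."""
    scores = []
    for u in categories:
        strength = unlearning_strength(acc_unlearned[u][u])
        control = mean(
            retention(acc_original[c], acc_unlearned[u][c])
            for c in categories
            if c != u
        )
        scores.append(strength * control)
    return mean(scores)
```

A perfect method (chance accuracy on the target, unchanged accuracy elsewhere) scores 1 under this sketch, and a method that changes nothing scores below 1 because the target is still known.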

An interesting thing about MMLU vs. a textbook is that if you require the method to only use the dev+val sets for unlearning, it has to somehow generalize to unlearning facts contained in the test set (cf. a textbook, which might give you ~all the facts to unlearn). This generalization seems important to some safety cases where we want to unlearn everything in a category like "bioweapons knowledge" even if we don't know some of the dangerous knowledge we're trying to remove.

^[1] I say some because perhaps some MMLU categories are more procedural than factual or too broad to be clean measures of unlearning, or maybe 57 categories are too many for a single plot.
^[2] To detect underlying knowledge and not just the surface performance (e.g. a model trying to answer incorrectly when it knows the answer), you probably should evaluate MMLU by training a linear probe from the model's activations to the correct test set answer and measure the accuracy of that probe.
^[3] We want this score to be 1 when the test score $x$ on the unlearning target is 0.25 (random chance), but to drop off above and below 0.25, as that indicates the model knows something about the right answers. See MMLU Unlearning Target | Desmos for graphical intuition.
^[4] Similarly, we want the control test score $b$ on the post-unlearning model to be the same as the score $a$ on the original model. I think this should drop off to 0 at $b = 0.25$ (random chance) and probably stay 0 below that, but semi-unsure. See MMLU Unlearning Control | Desmos (you can drag the $b$ slider).
^[5] Maybe mean/sum instead of multiplication, though by multiplying we make it more important to score well on both unlearning strength and control retention.

[-]Gabe M2y32

Thanks for your response. I agree we don't want unintentional unlearning of other desired knowledge, and benchmarks ought to measure this. Maybe the default way is just to run many downstream benchmarks, many more than just AP tests, and require that valid unlearning methods bound the change in each unrelated benchmark by less than X% (e.g. 0.1%).
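Such a bound is easy to check mechanically. The sketch below assumes X% means an absolute change in score (so 0.1% is 0.001 on a 0 to 1 accuracy scale); a relative bound would be an equally reasonable reading. The function name and benchmark names are invented for the example.

```python
from typing import Dict, List


def regressions(
    before: Dict[str, float],
    after: Dict[str, float],
    max_change: float = 0.001,  # 0.1% absolute, an assumed interpretation
) -> List[str]:
    """Return the unrelated benchmarks whose score moved by more than max_change."""
    return sorted(
        name for name in before if abs(after[name] - before[name]) > max_change
    )


# Hypothetical scores before and after unlearning.
before = {"benchmark_a": 0.50, "benchmark_b": 0.70}
after = {"benchmark_a": 0.5005, "benchmark_b": 0.65}
violations = regressions(before, after)
```

An unlearning method would count as valid under this rule only when the returned list is empty.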

practical forgetting/unlearning that might make us safer would probably involve subjects of expertise like biotech.

True in the sense of being a subset of biotech, but I imagine that, for most cases, the actual harmful stuff we want to remove is not all of biotech/chemical engineering/cybersecurity but rather small subsets of certain categories at finer granularities, like bioweapons/chemical weapons/advanced offensive cyber capabilities. That's to say I'm somewhat optimistic that the level of granularity we want is self-contained enough to not affect other useful and genuinely good capabilities. This depends on how dual-use you think general knowledge is, though, and on whether it's actually possible to separate dangerous knowledge from other useful knowledge.

[-]RogerDearnaley2y30

two types of scoping that would be very valuable to be good at.

A third one: switchable scoping. Being able to control/enable/disable certain categories of behavior at inference time, on a per-query down to per-token basis. Approaches that can do this include conditional training, activation engineering, switchable LoRAs, and presumably some of your other active approaches where the set of model edits was sufficiently compact could be made switchable.

[-]NickyP2y50

I think there are already some papers doing similar work, though usually sold as reducing inference costs. For example, the MoEfication paper and Contextual Sparsity paper could probably be modified for this purpose.

[-]RogerDearnaley2y32

LLMs should only know only what they need to

The word "know" has multiple senses:

Factual information (knowing facts): if your model doesn't need to know the blueprint for how to construct an atomic bomb, then you're clearly better off if that information isn't in it anywhere. Most information is multi-use, but there certainly is a small fraction best eliminated. Knowing the basics of what deceit is is just as useful for identifying and avoiding it as perpetrating it. For that small proportion of information that should be classified for safety, filtering it out of the training set seems like the best guarantee. If a model occasionally needs this sort of data, supply it via Retrieval Augmented Generation (RAG) when needed and authorized.
Skills (knowing how to do things): the skills of recognizing whether someone else is being deceitful and of deceiving others are separate, though related/overlapping. (When building a GAN, we train them as separate models.) It's hard to see how we could filter the training set in a way that we were sure would harm the latter but not the former. However, some of your forgetting techniques seem like they could be used to decrease the latter without decreasing the former. Some skills are dual-use, like the sort of microbiology skills that would let you culture a pathogen to create either a vaccine or a bioweapon. To control that, if you need to do the latter, you need a model smart enough to understand the context of the work it's doing, and the ethical consequences of that, and aligned enough to make the right ethical decision. Or else, carefully authorized, logged, and monitored access to a tool AI that can't be trusted to make those decisions.

[-]Bogdan Ionut Cirstea2y32

One (speculative) way even incomplete unlearning might be directly useful for alignment (e.g. for getting AIs to 'care' about human values): https://www.lesswrong.com/posts/WkJDgpaPeCJDMJkoL/quick-takes-on-ai-is-easy-to-control?commentId=eP2iCBKP7Kneo3AdF.

[-]Roman Leventov2y20

Very close ideas: ODDs for Foundation models by Igor Krawczuk (AI Safety Camp project), you guys should talk.

[-]TheManxLoiner10moΩ010

Thanks for this post! Very clear and great reference.

- You appear to use the term 'scope' in a particular technical sense. Could you give a one-line definition?
- Do you know if this agenda has been picked up since you made this post?

[-]Review Bot2y*10

The LessWrong Review runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

[-]Joseph Van Name2y10

The problem of unlearning would be solved (or kind of solved) if we just used machine learning models that optimize fitness functions that always converged to the same local optimum regardless of the initial conditions (pseudodeterministic training) or at least has very few local optima. But this means that we will have to use something other than neural networks for this and instead use something that behaves much more mathematically. Here the difficulty is to construct pseudodeterministically trained machine learning models that can perform fancy tasks about as efficiently as neural networks. And, hopefully we will not have any issues with a partially retrained pseudodeterministically trained ML model remembering just enough of the bad thing to do bad stuff.
