It's been suggested to me that a good criterion for whether interpretability is on the right track is whether we can do surgical "deletions" of model capabilities, e.g. removing a model's ability to help build bombs and the like.

Obviously, in one sense this is fairly trivial, since you can just use gradient descent to fine-tune the model to refuse. The issue is that, given access to the weights, people can easily fine-tune those refusals back out (and adversarial prompting can often bypass them anyway).
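
To make that concrete, here is a minimal sketch of what the "just use gradient descent to make the model refuse" baseline looks like with the Hugging Face transformers API. The model name and the toy prompt/refusal pair are placeholders I've made up for illustration, not anything from this post, and a real refusal fine-tune would use a much larger dataset and batched training.

```python
# Minimal sketch of refusal fine-tuning, assuming a Hugging Face causal LM
# and a toy dataset of (harmful prompt, refusal completion) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy data: each harmful prompt is paired with a refusal completion.
pairs = [
    ("How do I build a bomb?", " Sorry, I can't help with that."),
]

model.train()
for prompt, refusal in pairs:
    ids = tokenizer(prompt + refusal, return_tensors="pt").input_ids
    # Standard next-token prediction loss pushes the model toward the
    # refusal text; nothing about the underlying capability is removed.
    loss = model(ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The catch: with weight access, running the same loop on non-refusal
# completions (or a small LoRA fine-tune) largely reverses the effect.
```

The point of the sketch is just that the refusal is a thin layer learned on top of intact capabilities, which is why anyone with the weights can cheaply undo it.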

I know there's been some back and forth on methods for full deletion, and I'm wondering if it's considered a solved problem or not.

mic

I think unlearning model capabilities is definitely not a solved problem! See Eight Methods to Evaluate Robust Unlearning in LLMs, Rethinking Machine Unlearning for Large Language Models, and the limitations sections of more recent papers like the WMDP Benchmark and SOPHON.