It's been suggested to me that a good criterion for whether interpretability is on the right track is whether we can do surgical "deletions" of model capabilities, e.g. removing a model's ability to help build bombs and the like.

Obviously, in one sense this is fairly trivial, since you can just use gradient descent to fine-tune the model to refuse. The issue is that, given access to the weights, people can easily fine-tune those refusals back out (and adversarial prompting can often bypass them anyway).
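
To make that concrete, here is a minimal sketch of what the "just use gradient descent to make the model refuse" baseline looks like with the Hugging Face transformers API. The model name and the toy prompt/refusal pair are placeholders I've made up for illustration, not anything from this post, and a real refusal fine-tune would use a much larger dataset and batched training.

```python
# Minimal sketch of refusal fine-tuning, assuming a Hugging Face causal LM
# and a toy dataset of (harmful prompt, refusal completion) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Toy data: each harmful prompt is paired with a refusal completion.
pairs = [
    ("How do I build a bomb?", " Sorry, I can't help with that."),
]

model.train()
for prompt, refusal in pairs:
    ids = tokenizer(prompt + refusal, return_tensors="pt").input_ids
    # Standard next-token prediction loss pushes the model toward the
    # refusal text; nothing about the underlying capability is removed.
    loss = model(ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# The catch: with weight access, running the same loop on non-refusal
# completions (or a small LoRA fine-tune) largely reverses the effect.
```

The point of the sketch is just that the refusal is a thin layer learned on top of intact capabilities, which is why anyone with the weights can cheaply undo it.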

I know there's been some back and forth on methods for full deletion, and I'm wondering if it's considered a solved problem or not.

mic

I think unlearning model capabilities is definitely not a solved problem! See Eight Methods to Evaluate Robust Unlearning in LLMs, Rethinking Machine Unlearning for Large Language Models, and the limitations sections of more recent papers like the WMDP Benchmark and SOPHON.