"Ever wanted to mindwipe an LLM?

Our method, LEAst-squares Concept Erasure (LEACE), provably erases all linearly-encoded information about a concept from neural net activations. It does so surgically, inflicting minimal damage to other concepts.


LEACE has a closed-form solution that fits on a T-shirt. This makes it orders of magnitude faster than popular concept erasure methods like INLP and R-LACE, which require gradient-based optimization. And the solution can be efficiently updated to accommodate new data."

My summary of the paper: The paper proves that if you have two distributions that you want to ensure you cannot distinguish linearly (i.e., a logistic regression will fail to achieve better-than-chance accuracy), then one way to do this is to make sure they have the same mean. Previous work has done something similar (https://arxiv.org/abs/2212.04273), but without proving optimality.

then one way to do this is to make sure they have the same mean

Yep, although we actually go a bit further than that and show that making the means equal is necessary, at least if you want your method to work for general convex loss functions.
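
To make the "same mean" point concrete, here is a minimal NumPy sketch. The closed form used, r(x) = x - W^+ P W (x - E[x]) with W a whitening matrix and P the orthogonal projection onto the column space of W Σ_XZ, is my reading of the LEACE paper and is not stated anywhere in this thread, so treat it as an assumption. The names `fit_leace` and `erase` are hypothetical, and the data is a synthetic toy example.

```python
import numpy as np


def fit_leace(X, Z):
    """Fit an eraser r(x) = x - P @ (x - mu); returns (P, mu).

    Assumed formula (not confirmed by the post): P = W^+ Proj W, where
    W = (Sigma_xx^{1/2})^+ and Proj projects onto colspace(W @ Sigma_xz).
    """
    X = np.asarray(X, dtype=float)
    Z = np.asarray(Z, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n  # covariance of the features
    sigma_xz = Xc.T @ Zc / n  # cross-covariance with the concept

    # Whitening matrix and its pseudoinverse via an eigendecomposition,
    # dropping near-zero eigenvalues for numerical stability.
    evals, evecs = np.linalg.eigh(sigma_xx)
    keep = evals > 1e-10 * evals.max()
    V, s = evecs[:, keep], np.sqrt(evals[keep])
    W = (V / s) @ V.T
    W_pinv = (V * s) @ V.T

    # Orthogonal projection onto the column space of W @ Sigma_xz.
    M = W @ sigma_xz
    proj = M @ np.linalg.pinv(M)
    return W_pinv @ proj @ W, mu


def erase(X, P, mu):
    """Apply the fitted eraser to every row of X."""
    return X - (X - mu) @ P.T


# Toy data: two classes whose means differ in every coordinate.
rng = np.random.default_rng(0)
n, d = 2000, 5
z = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d)) + np.outer(z, np.linspace(1.0, 2.0, d))

P, mu = fit_leace(X, z)
X_erased = erase(X, P, mu)

# Largest per-coordinate gap between the two class means.
gap_before = np.abs(X[z == 1].mean(0) - X[z == 0].mean(0)).max()
gap_after = np.abs(X_erased[z == 1].mean(0) - X_erased[z == 0].mean(0)).max()
print(gap_before, gap_after)
```

On the data it was fitted on, the class means of `X_erased` should coincide up to floating-point error, which is the condition discussed above. This says nothing about nonlinear classifiers, which could still pick up on differences in higher moments.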

And is there a technology to erase a certain concept from the prompt?

It could be useful for both AI safety and capabilities: currently, LLMs cannot forget unsuccessful attempts to solve a task, and those attempts can make it harder to find new ways to solve it. (I'd call that "object permanence of simulacrum").

What do you mean? Current LLMs are stateless. If unsuccessful attempts to solve the task are made, just reset the history and retry.

I mean something like AutoGPT, where there is no human in the loop who could reset the history.

For example, I've seen how ChaosGPT got into a loop of "researching nuclear weapons". Probably if it could erase them completely from its context, it would generate more interesting ideas (though, there is still a question whether we need that).

That is trivial to program? For example, you can have an AutoGPT UI which lists pending tasks with icons next to them, where clicking a trashcan will completely erase that task from the context. That doesn't need any LLM-level help like LEACE.


And of course you could also have another LLM instance with specific instructions acting as some kind of censor which judges which prompts should be erased automatically.
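
A minimal sketch of that idea, assuming the agent's context is just an ordered list of text entries that gets joined into the next prompt. `AgentContext` and its methods are hypothetical names, not part of AutoGPT, and the `should_erase` callback is only a stand-in for the second LLM instance acting as a censor.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentContext:
    """Ordered store of context entries that can be erased individually."""

    entries: dict[int, str] = field(default_factory=dict)
    _next_id: int = 0

    def add(self, text: str) -> int:
        """Append an entry and return its id (what a trashcan icon would target)."""
        entry_id = self._next_id
        self.entries[entry_id] = text
        self._next_id += 1
        return entry_id

    def erase(self, entry_id: int) -> None:
        """Manual erase: drop one entry so it never reaches the next prompt."""
        self.entries.pop(entry_id, None)

    def erase_where(self, should_erase: Callable[[str], bool]) -> None:
        """Automatic erase: drop every entry the (hypothetical) censor flags."""
        flagged = [i for i, text in self.entries.items() if should_erase(text)]
        for entry_id in flagged:
            del self.entries[entry_id]

    def render_prompt(self) -> str:
        """Join the surviving entries, in insertion order, into the prompt."""
        return "\n".join(self.entries.values())


ctx = AgentContext()
failed = ctx.add("Attempt 1: tried approach A, it failed.")
ctx.add("Task: write a summary of the findings.")
ctx.erase(failed)  # the model never sees the failed attempt again
print(ctx.render_prompt())
```

Because the model is stateless, anything left out of `render_prompt()` is gone as far as the next call is concerned.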

This is very good news, and it really should be celebrated as important progress on both AI safety (in the context of existential risk) and AI ethics (in the context of fairness and bias).


Ever wanted to mindwipe an LLM?

Cue thriller novel about an OpenAI researcher who committed a murder, and for some reason the crucial evidence that could get him arrested ended up in the training set of the newest GPT. Even if he scrubbed it from the dataset itself, he now lives in fear that at some point, in some conversation, the LLM will just tell someone the truth and he'll be arrested.

(Jokes aside, great work! This actually looks like fantastic news for both AI ethics and safety in general, especially once it is generalised to other kinds of AI besides LLMs, which I imagine should be possible.)

especially once it is generalised to other kinds of AI besides LLMs, which I imagine should be possible

The method actually already is highly general, and in fact isn't specific to deep learning at all. More work does need to be done to see how well it can steer neural net behavior in real-world scenarios, though.