Concept erasure is a technique that removes unwanted information from a model’s activations, but current erasure methods struggle to fully remove target concepts. In this study, we tasked LLM agents trained on our data with inventing concept erasure algorithms that outperform current methods under the same experimental constraints. We measure the performance of each algorithm family and explore why current methods fall short.
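To make "struggle to fully remove target concepts" concrete, here is a minimal toy sketch. It is not LEACE and not any algorithm from the post: it assumes synthetic Gaussian "activations", a binary concept label, and a simple mean-difference projection as the eraser. All names (`erase_direction`, `linear_probe_accuracy`, the data layout) are hypothetical and chosen only for illustration.

```python
import numpy as np

def erase_direction(X, z):
    """Project out the class-mean-difference direction (a simple linear eraser, not LEACE)."""
    d = X[z == 1].mean(axis=0) - X[z == 0].mean(axis=0)
    d = d / np.linalg.norm(d)
    return X - np.outer(X @ d, d)

def linear_probe_accuracy(X, z):
    """In-sample accuracy of a least-squares linear probe with a bias term."""
    A = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(A, 2.0 * z - 1.0, rcond=None)
    return float(((A @ w > 0) == (z == 1)).mean())

# Synthetic stand-in for activations (assumption: 8 dims, binary concept z).
rng = np.random.default_rng(0)
n, dim = 2000, 8
z = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, dim))
X[:, 0] += 2.0 * (2 * z - 1)   # concept encoded linearly: mean shift on axis 0
X[:, 1] *= 1.0 + 2.0 * z       # concept encoded nonlinearly: variance on axis 1

X_erased = erase_direction(X, z)

acc_before = linear_probe_accuracy(X, z)
acc_after = linear_probe_accuracy(X_erased, z)
# A crude "nonlinear probe": the same least-squares probe on added squared features.
acc_nonlinear = linear_probe_accuracy(np.hstack([X_erased, X_erased ** 2]), z)

print(f"linear probe, before erasure:  {acc_before:.3f}")
print(f"linear probe, after erasure:   {acc_after:.3f}")
print(f"quadratic probe, after erasure: {acc_nonlinear:.3f}")
```

In this toy setup the least-squares linear probe should fall to roughly chance after erasure, while the quadratic probe should still recover some of the concept from the variance difference. That is the kind of gap between linear and nonlinear probes the post is concerned with, shown here only on made-up data.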
A few takeaways from this:
You can outperform LEACE against nonlinear probes while keeping edits just as surgical.
Agents can do good model-internals research if they have a good quantitative target!
Our general strategy at d_model looks something like:
1. Operationalize "solve interpretability" as well-defined quantitative problems
2. Try to one-shot those problems with agents and RL
3. When the agents fail, turn the failure modes into well-defined quantitative problems and goto 2
Concept erasure is a useful primitive for steering models, but it's also a good test of how well our agents understand model internals. It's not an interpretability task directly, but getting good at erasure makes agents better at interp as a side effect.
We're working directly with frontier-lab researchers to do this work with bleeding-edge models.
We expect to produce many more blog posts like this as our research program grows symbiotically with our RL-env work.
Read the post for more!
And we're hiring :)