TLDR:
This post is derived from my end-of-course project for the BlueDot AI Safety Fundamentals course. Consider applying here.
We evaluate sparse autoencoder (SAE) feature ablation as a mechanism for unlearning Harry Potter-related knowledge in Gemma-2-2b. We compare a non-ablated model against models with a single monosemantic Harry Potter feature ablated in an early, middle, or final layer, using a set of Harry Potter evaluation prompts. We also measure these models on standard LLM benchmarks to estimate how much general capability is lost after ablation.
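To make the intervention concrete, here is a minimal sketch of SAE feature ablation: activations are encoded into sparse features, the target feature is zeroed, and the activations are reconstructed (keeping the SAE's reconstruction error so everything else passes through unchanged). The weights and sizes below are random toy values, not the trained Gemma-2-2b SAE; a real run would load a trained SAE and attach this as a hook on the chosen layer's residual stream.

```python
import numpy as np

# Toy stand-in for a trained SAE: random weights, small dimensions.
# (Gemma-2-2b's residual stream is much larger; this is illustrative only.)
rng = np.random.default_rng(0)
d_model, d_sae = 8, 32
W_enc = rng.normal(size=(d_model, d_sae))  # encoder weights
W_dec = rng.normal(size=(d_sae, d_model))  # decoder weights

def sae_ablate(act, feature_idx):
    """Pass `act` through the SAE with one feature zeroed out."""
    feats = np.maximum(act @ W_enc, 0.0)  # ReLU encoder -> sparse features
    recon = feats @ W_dec                 # decoder reconstruction
    error = act - recon                   # keep the reconstruction error
    feats[feature_idx] = 0.0              # ablate the target feature
    return feats @ W_dec + error          # everything else is preserved

act = rng.normal(size=(d_model,))
ablated = sae_ablate(act, feature_idx=3)
```

Because the reconstruction error is added back, the output differs from the original activation only by the ablated feature's decoder contribution, so the edit is as surgical as the SAE's feature decomposition allows.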
Key Results:
- Ablating a single monosemantic Harry Potter-related feature in the last layer of Gemma-2-2b produces a significant degradation in