Sparse Autoencoder Feature Ablation for Unlearning
TLDR: This post is derived from my end-of-course project for the BlueDot AI Safety Fundamentals course. Consider applying here. We evaluate sparse autoencoder (SAE) feature ablation as a mechanism for unlearning Harry Potter-related knowledge in Gemma-2-2b. We evaluate a non-ablated model and models with...
Feb 13, 2025