
aludert — LessWrong



Sparse Autoencoder Feature Ablation for Unlearning

TLDR: This post is derived from my end-of-course project for the BlueDot AI Safety Fundamentals course. Consider applying here. We evaluate sparse autoencoder (SAE) feature ablation as a mechanism for unlearning Harry Potter-related knowledge in Gemma-2-2b. We evaluate a non-ablated model and models with...

Feb 13, 2025•3