Jinjin Zhao — LessWrong

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2

Jinjin Zhao, 2y

I am curious about your thoughts on the differences between activation patching and sparse autoencoders (SAEs). Do you think they are complementary lines of research, or might there be some overarching idea that encapsulates both?

Is there any application of one that can't be handled by the other? It seems that activation patching may yield more interpretable concepts, while SAEs may yield more fundamental features. My intuition is that activation patching may be able to replace SAEs in the future.