Jinjin Zhao
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
Jinjin Zhao, 1y

I am curious about your thoughts on the differences between activation patching and sparse autoencoders (SAEs). Do you think they are complementary lines of research, or might there be some overarching idea that encapsulates both?

Is there any application that one can handle but the other cannot? It seems that activation patching may yield more interpretable concepts, while SAEs may yield more fundamental features. My intuition is that activation patching may be able to replace SAEs in the future.
