LESSWRONG

lampinen

Comments

Activation space interpretability may be doomed
lampinen · 6mo

We've studied related phenomena in https://openreview.net/forum?id=aY2nsgE97a. For example, activation patterns can be strongly biased towards capturing easy (linear) features over difficult (nonlinear) ones, more prevalent features over less prevalent ones, earlier-learned features over later-learned ones, and so on. As a result, interpretations based on activations can miss some of the important features that the model is computing.
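To make the concern concrete, here is a minimal toy sketch. It is not taken from the linked paper and does not involve a trained model: the two-dimensional "activations", the scales `EASY_SCALE` and `HARD_SCALE`, and the rotation `THETA` are all hypothetical values chosen for illustration. It assumes the "easy" feature is simply written into the activations with a larger magnitude than the "hard" one, and shows that a variance-based analysis (PCA, computed here in closed form) then attributes almost all of the variance to the easy feature, even though the hard feature remains perfectly linearly decodable.

```python
import math

# Hypothetical scales: how strongly each feature is written into the activations.
EASY_SCALE, HARD_SCALE = 3.0, 0.5
THETA = 0.3  # arbitrary rotation so the features are not axis-aligned

u = (math.cos(THETA), math.sin(THETA))    # direction carrying the easy feature
v = (-math.sin(THETA), math.cos(THETA))   # orthogonal direction carrying the hard feature

def activation(f_easy, f_hard):
    """Toy 2-D 'activation' encoding two binary (+1/-1) features."""
    return tuple(EASY_SCALE * f_easy * a + HARD_SCALE * f_hard * b
                 for a, b in zip(u, v))

# All four feature combinations: the data is mean-zero and the features are uncorrelated.
inputs = [(fe, fh) for fe in (-1, 1) for fh in (-1, 1)]
acts = [activation(fe, fh) for fe, fh in inputs]

def covariance(points):
    """Entries (cxx, cxy, cyy) of the covariance of mean-zero 2-D points."""
    n = len(points)
    cxx = sum(x * x for x, _ in points) / n
    cxy = sum(x * y for x, y in points) / n
    cyy = sum(y * y for _, y in points) / n
    return cxx, cxy, cyy

def eigenvalues_2x2(cxx, cxy, cyy):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, largest first."""
    mean = (cxx + cyy) / 2
    radius = math.hypot((cxx - cyy) / 2, cxy)
    return mean + radius, mean - radius

top, bottom = eigenvalues_2x2(*covariance(acts))
top_share = top / (top + bottom)  # fraction of variance on the first principal component

def decode(act, direction, scale):
    """Linear readout: project onto the feature's direction and rescale."""
    return sum(a * d for a, d in zip(act, direction)) / scale

hard_decoded = [round(decode(a, v, HARD_SCALE)) for a in acts]

print(f"variance share of top component: {top_share:.3f}")  # 0.973
print("hard feature decoded exactly:", hard_decoded == [fh for _, fh in inputs])
```

In this setup an analysis that keeps only high-variance directions would retain the easy feature and discard the hard one, despite both being computed and represented. How large the gap is in a real network depends on training, which this sketch does not model.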
