LESSWRONG
LW

412
lampinen
2010
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No posts to display.
No wikitag contributions to display.
Activation space interpretability may be doomed
lampinen8mo30

We've done some studies of related phenomena in: https://openreview.net/forum?id=aY2nsgE97a — e.g., that the activation patterns can be strongly biased towards capturing easy (linear) features over difficult (nonlinear) ones (or more prevalent over less prevalent ones, or earlier-learned ones, etc.), which can lead interpretations based on activations to miss some of the important features that the model is computing. 

Reply