lampinen — LessWrong

lampinen

1y

Activation space interpretability may be doomed

We've studied related phenomena in https://openreview.net/forum?id=aY2nsgE97a. For example, we found that activation patterns can be strongly biased towards capturing easy (linear) features over difficult (nonlinear) ones (or more prevalent features over less prevalent ones, or earlier-learned features over later-learned ones, etc.). This bias can lead interpretations based on activations to miss some of the important features that the model is computing.
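To make the concern concrete, here is a toy sketch in NumPy. It does not reproduce the linked paper's experiments. It simply assumes the bias described above by writing an easy (linear) feature into synthetic activations at 5x the scale of a hard (XOR) feature, then shows what a variance-based analysis would see. The scale factor, dimensions, noise level, and helper name `variance_explained` are all illustrative assumptions.

```python
import numpy as np


def variance_explained(acts, feature):
    """Fraction of total activation variance linearly explained by one feature."""
    acts = acts - acts.mean(axis=0)
    f = feature - feature.mean()
    # Least-squares fit of every unit onto the (centered) feature.
    coef = acts.T @ f / (f @ f)
    fitted = np.outer(f, coef)
    return float((fitted ** 2).sum() / (acts ** 2).sum())


rng = np.random.default_rng(0)
n, d = 4000, 16

# Two input bits; the model is imagined to compute both features below.
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)
easy = a.astype(float)        # linear in the inputs
hard = (a ^ b).astype(float)  # XOR: nonlinear in the inputs

# Two orthonormal directions in activation space, one per feature.
q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
dirs = q.T

# ASSUMPTION: the easy feature is represented at 5x the scale of the hard one.
# This stands in for the bias; it is not a measured value.
acts = (
    5.0 * np.outer(easy, dirs[0])
    + 1.0 * np.outer(hard, dirs[1])
    + 0.1 * rng.standard_normal((n, d))
)

ve_easy = variance_explained(acts, easy)
ve_hard = variance_explained(acts, hard)

# Top principal component of the centered activations.
_, _, vt = np.linalg.svd(acts - acts.mean(axis=0), full_matrices=False)
cos_easy = abs(float(vt[0] @ dirs[0]))
cos_hard = abs(float(vt[0] @ dirs[1]))

print(f"variance explained: easy={ve_easy:.3f}, hard={ve_hard:.3f}")
print(f"top PC alignment:   easy={cos_easy:.3f}, hard={cos_hard:.3f}")
```

In this toy, both features are present and linearly readable from the activations, yet the easy one accounts for nearly all of the variance and the top principal component. An analysis that ranks directions by variance would therefore surface the easy feature and could overlook the hard one, even though the imagined model computes both.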