André Longon — LessWrong

Self-explaining SAE features

André Longon (2y)

After the best-explaining scale is applied to the feature direction vector, is the magnitude of the resulting vector similar to the magnitudes of the other token activation vectors in the prompt? If so, perhaps that fact could be used to approximate the best scale without manual tuning. For instance, the magnitudes of all the token activation vectors could be averaged, and the scale set to the ratio of this mean magnitude to the original feature direction vector's magnitude.
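
To make the suggestion concrete, here is a rough sketch of the heuristic in NumPy. It is only an illustration of the idea above, not the post's method: the function name `estimate_scale`, the array shapes, and the random data are all assumptions, and whether the resulting scale actually matches the best-explaining one is exactly the open question.

```python
import numpy as np

def estimate_scale(token_activations, feature_direction):
    """Hypothetical helper: ratio of the mean token-activation norm
    to the norm of the (unscaled) feature direction vector.

    token_activations: array of shape (num_tokens, d_model)
    feature_direction: array of shape (d_model,)
    """
    # L2 norm of each token's activation vector in the prompt
    token_norms = np.linalg.norm(token_activations, axis=1)
    mean_norm = token_norms.mean()

    feature_norm = np.linalg.norm(feature_direction)
    if feature_norm == 0:
        raise ValueError("feature_direction must be non-zero")

    return mean_norm / feature_norm

# Toy data standing in for real activations (assumed shapes, not real model output)
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 16))     # 5 tokens, d_model = 16
direction = rng.normal(size=16)     # stand-in for an SAE feature direction

scale = estimate_scale(acts, direction)
scaled_direction = scale * direction  # norm now equals the mean token norm
```

By construction, `scaled_direction` has the same magnitude as the average token activation vector, so it could be compared against a manually tuned scale to see how close the approximation gets.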