K. Uhlig — LessWrong

K. Uhlig

2y

Toy Models of Feature Absorption in SAEs

Originally the encoder and decoder weights were tied (because it makes intuitive sense), but I believe Anthropic was the first to suggest untying them, and found that this helped the SAE differentiate similar features:

However, we find that in our trained models the learned encoder weights are not the transpose of the decoder weights and are cleverly offset to increase representational capacity. Specifically, we find that similar features which have closely related dictionary vectors have encoder weights that are offset so that they prevent crosstalk between the noisy feature inputs and c

...
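To make the tied vs. untied distinction concrete, here is a minimal NumPy sketch. Everything in it is an illustrative assumption rather than anything from the quoted work: the two-dimensional toy setup, the names `sae_forward`, `W_dec`, and `W_enc`, the zero biases, and especially the use of a pseudo-inverse as the "offset" encoder, which just stands in for whatever a trained SAE actually learns.

```python
import numpy as np

def sae_forward(x, W_dec, W_enc=None):
    """One-layer ReLU SAE with zero biases (a simplifying assumption).

    x:     (d_model,) input activation
    W_dec: (n_features, d_model) dictionary vectors, one per row
    W_enc: (d_model, n_features); if None, weights are tied (W_enc = W_dec.T)
    """
    if W_enc is None:
        W_enc = W_dec.T                      # tied: encoder is the decoder transpose
    acts = np.maximum(0.0, x @ W_enc)        # ReLU feature activations
    recon = acts @ W_dec                     # reconstruction from the dictionary
    return acts, recon

# Two closely related dictionary vectors (angle of 0.3 rad, cosine ~0.955).
theta = 0.3
W_dec = np.array([[1.0, 0.0],
                  [np.cos(theta), np.sin(theta)]])

# Input where only feature 0 is truly active.
x = W_dec[0]

# Tied weights: feature 1 also fires, because its direction overlaps with x.
tied_acts, tied_recon = sae_forward(x, W_dec)

# Untied weights: a hypothetical encoder, offset from W_dec.T so that each
# encoder column is orthogonal to the *other* feature's dictionary vector.
W_enc_untied = np.linalg.pinv(W_dec)
untied_acts, untied_recon = sae_forward(x, W_dec, W_enc_untied)

print("tied acts:  ", tied_acts)
print("untied acts:", untied_acts)
```

With tied weights the second feature activates at roughly the cosine similarity between the two dictionary vectors, so the reconstruction overshoots; with the offset encoder only the first feature fires and the input is reconstructed exactly. That is the crosstalk the quote is talking about, at least in this toy case.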