gradStudent52 — LessWrong

[Interim research report] Taking features out of superposition with sparse autoencoders

gradStudent52, 2y

Hello! After reading Anthropic's work and similar work by others, I am trying to understand the "big picture". Specifically, it is not clear to me how "features" are extracted from the hidden-layer activations of the SAE. Two things contribute to my confusion:

Any given neuron in the hidden layer of the SAE depends on all neurons in the SAE's input layer. How, then, can any individual neuron (or its activation) in the hidden layer of the SAE be related to a single superposition (feature) of

...