Why Is SAE in LLMs False?
First, we need to be clear about what a sparse autoencoder (SAE) assumes: that there is a sparse representation inside a neural network, such that the original activation v can be approximately reconstructed from a code with very few non-zero entries. Those non-zero entries are supposed to correspond to “interpretable features.” To...
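To make the assumption concrete, here is a minimal sketch in plain Python. It is an illustration only, not the method of any particular SAE implementation: the names `dictionary`, `code`, `reconstruct`, and `l0`, the sizes, and the noise level are all hypothetical choices made for this example. It builds an activation v from two feature directions plus a little noise, then checks that a code with only two non-zero entries reconstructs it closely.

```python
import math
import random


def reconstruct(dictionary, code):
    """Return sum_i code[i] * dictionary[i], skipping zero coefficients."""
    dim = len(dictionary[0])
    out = [0.0] * dim
    for coeff, direction in zip(code, dictionary):
        if coeff != 0.0:
            for j in range(dim):
                out[j] += coeff * direction[j]
    return out


def l0(code):
    """Count the non-zero entries of the code (its sparsity)."""
    return sum(1 for c in code if c != 0.0)


random.seed(0)
dim, n_features = 8, 32  # hypothetical sizes: more features than dimensions

# Hypothetical feature dictionary: one random direction per feature.
dictionary = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_features)]

# A sparse code: only 2 of the 32 entries are non-zero.
code = [0.0] * n_features
code[3] = 1.5
code[17] = 0.7

# The "observed" activation v: the sparse combination plus small noise.
v_hat = reconstruct(dictionary, code)
v = [x + random.gauss(0, 0.01) for x in v_hat]

# Reconstruction error (Euclidean norm of v minus its reconstruction).
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, v_hat)))
print(f"non-zero entries: {l0(code)} of {n_features}")
print(f"reconstruction error: {error:.4f}")
```

Here v was constructed to be sparse in the dictionary, so a small error is guaranteed by design. Whether real LLM activations decompose this way is exactly the assumption in question.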
Jun 10