Hi there, I'm wondering recently about why most studies choose the post-MLP activations as a type of signal to interpret the model's behavior. In which aspect does the activation space distinguish from other intermediate representations? For example, activation is the input of `c_proj` in MLP layer, how about the output of this layer? It's also a type of residual value.

New to LessWrong?

New Answer
New Comment