Good work! I'm curious why there's a sudden dip for Gemma 2-9B at the last token position, and why probes trained on Qwen show no apparent relationship.
Quite a bit of the literature suggests that the intermediate activations output by the MLP block are a sum of several features in superposition, where each feature is a direction (vector) in activation space. I'd be curious whether you could run an SAE or SNMF over these activations and see if one of the resulting features is strongly associated with answering correctly.
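A minimal sketch of what I have in mind, in case it's useful: train a small SAE on cached MLP activations at the token position you're probing, then correlate each learned feature's activation with whether the model answered correctly. All the names, shapes, and file paths here (`acts`, `correct`, the layer/position choice) are hypothetical placeholders, not your actual pipeline.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Tiny SAE: overcomplete ReLU encoder + linear decoder with an L1 sparsity penalty."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(feats), feats

def train_sae(acts: torch.Tensor, d_hidden: int = 4096,
              l1_coef: float = 1e-3, epochs: int = 200, lr: float = 1e-3):
    """acts: (n_examples, d_model) MLP activations at the position of interest."""
    sae = SparseAutoencoder(acts.shape[1], d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(epochs):
        recon, feats = sae(acts)
        loss = (recon - acts).pow(2).mean() + l1_coef * feats.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae

def feature_correctness_correlation(sae, acts, correct):
    """correct: (n_examples,) float tensor of 0/1 labels for answering correctly.
    Returns the Pearson correlation of each feature's activation with correctness."""
    with torch.no_grad():
        _, feats = sae(acts)
    feats = feats - feats.mean(0)
    labels = correct - correct.mean()
    corr = (feats * labels[:, None]).mean(0) / (
        feats.std(0, unbiased=False) * labels.std(unbiased=False) + 1e-8)
    return corr

# Hypothetical usage with whatever you already cache per example:
# acts = torch.load("mlp_acts_layer20_lasttoken.pt").float()   # (N, d_model)
# correct = torch.load("answer_correct.pt").float()            # (N,)
# sae = train_sae(acts)
# corr = feature_correctness_correlation(sae, acts, correct)
# print(corr.abs().topk(5))   # candidate "answers correctly" features
```

If one feature dominates that ranking, its decoder direction would be the natural thing to compare against your probe direction.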