Ah, got it. Thanks a ton!
In all of this, there seems to be an implicit assumption that the ordering of the embedding dimensions is consistent across layers, in the sense that "dog" is most strongly associated with, say, dimension 12 in layers 2, 3, 4, and so on.
I don't see any reason why this should be the case from either a training or a model-structure perspective. How, then, does the logit lens (which should clearly not be invariant under a permutation of its inputs) still produce valid results for some intermediate layers?
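To make the question concrete, here is a minimal sketch of the projection I have in mind (assuming a GPT-2-style model via Hugging Face transformers; attribute names like `transformer.ln_f` and `lm_head` follow that particular implementation, not necessarily the setup in the post):

```python
# Minimal logit-lens sketch: apply the final layer norm and the unembedding
# matrix to every intermediate residual state and read off top tokens.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

inputs = tok("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: the embedding output plus one entry per layer.
# Note the *same* ln_f and lm_head are reused for every layer, so any
# per-layer permutation of dimensions would presumably scramble the readout.
for i, h in enumerate(out.hidden_states):
    logits = model.lm_head(model.transformer.ln_f(h))  # (1, seq, vocab)
    top = logits[0, -1].topk(3).indices
    print(f"layer {i}: {tok.convert_ids_to_tokens(top.tolist())}")
```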
Simple, visual, and it lends another data point to what many of us suspected about GPT-4 in comparison to other frontier labs' models. Even now, it still has something special that has yet to be replicated by many others.
Great work and thank you for sharing.