Intricacies of Feature Geometry in Large Language Models

Thanks for the interesting post! I'm pretty confused about your ablation experiments, mostly because I'm confused about the original experiment in Park et al.

My understanding is that their Figure 2a (animal vs. mammal) shows the desired result because the mammal token embeddings are in the top-right quadrant, whereas the non-mammal embeddings are spread around 0 on the y-axis. This matches our intuition that mammals should count as animals (the right two quadrants) but also as mammals (the top two quadrants). However, I have never seen this stated explicitly anywhere, so this might be where my misunderstanding comes from.
I was always confused by the approach behind this figure. Given that they were using WordNet synsets, all of the mammal embeddings are included when calculating the animal direction as well. The mammal embeddings would thus be trivially in the upper-right quadrant, as that's how the axes were calculated.
With your dead fish version, you do the same, right? Your x-axis is the average of all of the random categories (including "random 1"), whereas the y-axis is the average of just the "random 1" category, correct? Why would we ever expect this figure to look different than how it looks?
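To make the worry concrete, here is a minimal sketch of the effect I have in mind. It is not Park et al.'s code or the post's: the "embeddings" are pure Gaussian noise, the category sizes are made up, and I use plain normalized means as directions instead of LDA directions. The child category still lands in the upper-right quadrant simply because its tokens are included in both averages.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical sizes, chosen only so the effect is easy to see.
d, n_categories, n_per_category = 4096, 5, 20

# Stand-in for token embeddings: Gaussian noise with no semantic structure.
tokens = rng.standard_normal((n_categories * n_per_category, d))
labels = np.repeat(np.arange(n_categories), n_per_category)


def unit(v):
    """Scale a vector to unit Euclidean norm."""
    return v / np.linalg.norm(v)


# Simplifying assumption: directions are normalized means, not LDA directions.
parent_dir = unit(tokens.mean(axis=0))               # "animal": all categories
child_dir = unit(tokens[labels == 0].mean(axis=0))   # "mammal": "random 1" only

# Coordinates of every token on the two axes of the plot.
x = tokens @ parent_dir
y = tokens @ child_dir

is_child = labels == 0
frac_child_upper_right = np.mean((x[is_child] > 0) & (y[is_child] > 0))
mean_other_y = y[~is_child].mean()

print(f"child tokens in upper-right quadrant: {frac_child_upper_right:.2f}")
print(f"mean y of non-child tokens: {mean_other_y:.2f}")
```

Each child token contributes its own squared norm to its dot product with the child mean, which is what pushes it up the y-axis, while tokens outside the category have no such term and scatter around 0.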

Basically, I'm wondering what you think Park et al. were trying to say with this Figure 2a experiment in the first place. I'm also not sure what it says about orthogonality, because clearly mammal is not orthogonal to animal in this figure, right?

After looking at the original paper/code again, two edits:

When I say "average of the 'random 1' category", I mean LDA direction, which I was thinking of as the average (because it's basically a scaled version of the whitened average).
In their Figure 2 caption, they say that they are plotting on $\mathrm{span}\{\ell_{\mathrm{animal}}, \ell_{\mathrm{mammal}}\}$, which made me mistakenly believe that the x-axis corresponds to a dot product with $\ell_{\mathrm{animal}}$ and the y-axis to a dot product with $\ell_{\mathrm{mammal}}$. However, their code seems to suggest that the y-axis is $\ell_{\mathrm{mammal}} - (\ell_{\mathrm{mammal}}^{T} \ell_{\mathrm{animal}})\, \ell_{\mathrm{animal}}$. This might have been my bad, as I haven't read the original paper in a while.
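For reference, the difference between the two readings of the y-axis can be sketched as follows. This is only an illustration with random vectors standing in for the two directions (the names `l_animal` and `l_mammal` and the dimension are my own), and it assumes the animal direction has unit norm, which the formula needs in order to be a proper orthogonal projection.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # hypothetical dimension, for illustration only

# Random stand-ins for the two directions; l_mammal is built to correlate
# with l_animal, as a child concept would with its parent.
l_animal = rng.standard_normal(dim)
l_animal /= np.linalg.norm(l_animal)  # the formula assumes unit norm
l_mammal = l_animal + 0.5 * rng.standard_normal(dim)

# Reading 1 (what I originally assumed): y-axis is l_mammal itself.
overlap_raw = l_mammal @ l_animal

# Reading 2 (what the code seems to do): remove the l_animal component.
y_axis = l_mammal - (l_mammal @ l_animal) * l_animal
overlap_orth = y_axis @ l_animal

print(f"l_mammal . l_animal = {overlap_raw:.3f}")
print(f"y_axis   . l_animal = {overlap_orth:.2e}")
```

Under the second reading the two plot axes are orthogonal by construction, so the figure uses an orthogonal basis of the same plane rather than the two raw directions.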