Intricacies of Feature Geometry in Large Language Models
Note: This is a more fleshed-out version of this post and includes theoretical arguments justifying the empirical findings. If you've read that one, feel free to skip to the proofs.

We challenge the thesis of the ICML 2024 Mechanistic Interpretability Workshop first-prize-winning paper The Geometry of Categorical and Hierarchical Concepts in LLMs and of the ICML 2024 paper The Linear Representation Hypothesis and the Geometry of LLMs. The main takeaway is that the orthogonality and polytopes they observe for categorical and hierarchical concepts occur practically everywhere, even in places where they should not.

Overview of the Feature Geometry Papers

Studying the geometry of a language model's embedding space is an important and challenging task because of the many ways concepts can be represented, extracted, and used (see related works). Specifically, we want a framework that unifies both measurement (how well a latent explains a feature/concept) and causal intervention (how well it can be used to control/steer the model).

The method described in the two papers we study works as follows. They split the computation of a large language model (LLM) as

$$
P(y \mid x) = \frac{\exp\!\big(\lambda(x)^\top \gamma(y)\big)}{\sum_{y' \in \mathrm{Vocab}} \exp\!\big(\lambda(x)^\top \gamma(y')\big)}
$$

where:

* λ(x) is the context embedding for input x (the last token's residual stream after the final layer), and
* γ(y) is the unembedding vector for output y (a row of the unembedding matrix W_U).

(A short code sketch of this decomposition is given at the end of this section.)

They formalize a binary concept as a latent variable W that is caused by the context X and that causes the output Y, where the counterfactual output Y(W=w) depends only on the value of w ∈ W.

> Crucially, this restricts their methodology to concepts that can be differentiated by single-token counterfactual pairs of outputs. For instance, it is not clear how to define several important concepts, such as "sycophancy" and "truthfulness", within their formalism.

They then define linear representations of a concept in both the embedding and unembedding spaces. In the unembedding space, ¯γW is considered a representation
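
To make the decomposition above concrete, here is a minimal sketch of how λ(x) and γ(y) can be read off a standard causal LM and recombined into P(y∣x). The choice of GPT-2 via Hugging Face transformers, the prompt, and the variable names are illustrative assumptions, not the setup used in the papers under discussion.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: GPT-2 as a stand-in for "a causal LM with an unembedding matrix W_U".
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

text = "The capital of France is"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# λ(x): the last token's residual after the final layer.
# For GPT-2 in transformers, the last entry of hidden_states is the
# post-final-layer-norm residual that feeds the unembedding.
lam = outputs.hidden_states[-1][0, -1]   # shape: (d_model,)

# γ(y): rows of the unembedding matrix W_U (the lm_head weight), one per vocabulary item.
W_U = model.lm_head.weight               # shape: (vocab_size, d_model)

# P(y | x) = softmax over the vocabulary of λ(x)^T γ(y).
logits = W_U @ lam                        # shape: (vocab_size,)
probs = torch.softmax(logits, dim=-1)

# Sanity check: this should reproduce the model's own next-token distribution.
model_probs = torch.softmax(outputs.logits[0, -1], dim=-1)
print(torch.allclose(probs, model_probs, atol=1e-5))

top_id = int(probs.argmax())
print(tokenizer.decode([top_id]), probs[top_id].item())
```

The sanity check at the end is the point of the sketch: the softmax over λ(x)⊤γ(y) is exactly the model's next-token distribution, so the decomposition is a reparametrization of the forward pass rather than an extra modeling assumption.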