Precomputed figures and the code used to generate them are in this GitHub repository. To run the analysis on other models, edit the models.py file.

TL;DR

Empirical Spectral Distribution (ESD) of the correlation matrix of in dense (non-MoE) text-to-text models exhibit certain trends. These trends can be used as a criterion to create a taxonomy of attention layers, with potential to speed up automated circuit discovery and model diffing. Here I discuss 2 approaches, one is model-agnostic the other is model-specific, neither approaches require forward passes and are task agnostic.

Introduction

Most circuit discovery techniques treat model weights as a passive backdrop for activation-level experiments. A key property of all trained neural networks is that the weight matrices are highly structured. Identifying the relationships between weight matrices is the key to prune the search space and optimizing the search algorithm.

For example, in the Automated Circuit Discovery paper^[1] (Conmy et al., 2023), the connections between the nodes of the computational graph are patched sequentially which makes it difficult to scale. Since only a tiny fraction of components contribute to any specific task^[2]^[3] most patching steps are unnecessary for identifying the relevant circuit. What if once we identified that a head $h$ of layer $L$ was not important, we could prioritize/deprioritize other heads in the same taxonomy?

To this end I started looking for ways in which I could classify the components of a transformer network. Martin et al. (2021) showed that Heavy Tailed-Random Matrix Theory could be used to identify distinct 5+1 training phases that a weight matrix of an MLP goes through. Inspired by this, I propose 2 classification methods for attention heads based on the ESD of QK circuits ( $W_{Q K} = W_{Q} W_{K}^{T}$ ) and demonstrate the efficacy of these methods across model sizes and model families^[4].

Why QK?

An attention head factors into:

QK circuit (decision making): Selects where to read/write information. Can suppress writing altogether by dumping attention to <bos> or punctuations.
OV circuit (transport/translator): Applies a linear map to the the residual stream, largely "dumb" without QK weighting source positions.

To create a taxonomy of attention heads QK was the natural choice.

Setup and Notation

$W_{Q}$ and $W_{K}$ have the shape $(d_{m o d e l}, d_{h e a d})$ , therefore shape of $W_{Q} W_{K}^{T}$ is $(d_{m o d e l}, d_{m o d e l})$ which is a large matrix e.g. Llama 3.1 405B has $d_{m o d e l} = 16384$ . Computing the SVD values ( $σ_{i}$ ) of such large matrices is computationally expensive $O (n^{3})$ and sensitive to finite precision errors.

Instead we compute a smaller matrix $M$ as follows:

Compute the SVD of $W_{Q}$ and $W_{K}$ .
$W_{Q} = U_{Q} Σ_{Q} V_{Q}^{T}$ and $W_{K} = U_{K} Σ_{K} V_{K}^{T}$
$U$ , $V$ are orthonormal matrices and $Σ = d i a g (σ_{i})$
Therefore
$W_{Q} W_{K}^{T} = (U_{Q} Σ_{Q} V_{Q}^{T}) (V_{K} Σ_{K} U_{K}^{T}) = U_{Q} (Σ_{Q} V_{Q}^{T} V_{K} Σ_{K}) U_{K}^{T} = U_{Q} M U_{K}^{T}$
where M = $Σ_{Q} V_{Q}^{T} V_{K} Σ_{K}$
Since $U_{Q}$ and $U_{K}$ are orthonormal matrices, they do not impact the singular values.
Therefore $σ_{i} (W_{Q K}) = σ_{i} (M)$ ^[5]

$M$ is a much smaller $(d_{h e a d}, d_{h e a d})$ matrix which is faster to compute and more stable.

The singular values of $W_{Q K}$ are related to the eigenvalues of its correlation matrix as

$σ_{i} (W_{Q K})^{2} = λ_{i} (W_{Q K} W_{Q K}^{T})$

The Empirical Spectral Distribution (ESD) is the distribution of the $λ_{i}$ denoted by $ρ (λ)$

Classification A: $(α, θ)$ of ESD

Classification scheme A aims to create an attention head taxonomy based on the shape and scale of their ESD, this approach is model agnostic. To compare ESD across attention heads, the ESD is normalized by dividing by $\frac{1}{n} \sum λ_{i}$ and the counts are normalized so that AUC=1. Using the normalized counts as features, the attention heads classified into clusters with k-means clustering^[6].

The average distribution within each class follows a gamma distribution (see section below) so each class can be identified by the associated shape parameter $α$ and scale parameter $θ$ . The goodness-of-fit between the calculated average ESD and the fitted Gamma distribution is measured using Jensen-Shannon Divergence (JSD). Across all models studied, the JSD typically lies in the range (0.0-0.1) indicating a very good fit.

Below are the classified attention heads of Qwen3 family of models (0.6B - 32B).

Blue bars are the ESD
Dotted red line is the fitted gamma distribution
The calculated JSD is in the subplot title
The error bars show the standard deviation
"Class_#" are dummy names

Why Gamma distribution?

If $σ_{i}$ were normally distributed we would expect to see a $χ^{2}$ distribution which is a special case of the gamma distribution. The fact that the mean ESD follow the gamma distribution closely (despite $σ_{i}$ not being normally distributed) does not have a trivial explanation. The exact mechanism not known and is empirical at present.

Next step

Look for functional similarity between classes e.g. check if classes with low values of alpha are "writer" heads i.e. have high direct logit attribution (DLA) scores and high alpha are "aggregator" head i.e. consolidate information at specific positions in the residual stream or vice versa.

Classification B: Stats of ESD

Per‑layer aggregates (across heads) of ESD statistics (mean, standard deviation, skew, excess kurtosis) show a layer dependent trend i.e. $x_{t} ⊥/ x_{t - 1}$ . The goal of this post is to show that trends exist, so I report the BDS statistic and p-value^[7]. For the actual classification criterion I provide a boxplot^[8] and a conceptual description.

This method of classification has some obvious drawbacks compared to classification A, like being model specific, able to group only a subset of layers, and a lack of general metric to measure the goodness-of-fit. On the other hand, identifying functional roles for these groups may be more straightforward, since the observed patterns cannot be explained by architectural design choices and instead likely reflect emergent internal representations.

Case Studies

Excess Kurtosis Plain of Llama 3.1 405B-Instruct

The "plain" (Layers 33-48) is a region of abnormally low kurtosis values compared to the other layers. For gamma distribution, excess kurtosis ( $k$ ) is equal to $\frac{6}{α}$ so the attention heads in the "plain" have more rounded peak and thin tailed ESD whereas other layers have a power-law like shape ( $y \propto e^{- t}$ ). Some possible interpretations are

The layers are not fully trained. (Martin et al., 2021)
The layers are generalist layers i.e. perform multiple functions depending on context.

SD Peaks of Qwen3 14B-Instruct and 32B-Instruct

The mean standard deviation of 14B shows a single peak in the middle layers of the model (Layers 18-19), the 32B model has two peaks at Layers 23 and 41. Layers can be classified by the section in which they lie.

SD Slopes of Mistral 7B-Instruct

The mean standard deviation shows an almost monotonic decline (Spearman's r2 ≈ 0.98) for all 3 versions of the model, such a strong trend are not seen in other models of comparable size.

Mean flatlands of Gemma2 27B-it

The average of mean of ESD is almost constant $σ = 0.03$ . There is also an oscillatory pattern in the standard deviation of ESD which is probably caused by Gemma2 models using interleaved global and local attention layers.

Related work

In the post "The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable", the authors demonstrate that right singular vectors of $W_{O V}$ are interpretable using logit-lens.

Closing thoughts

The approach is complementary to the one taken in On the Biology of a Large Language Model. The authors of the paper describe recent work in interpretability as developing a better "microscope" to explain the inner workings of models. The approach proposed here is then akin to studying organs and organ systems.

I believe this approach is worth exploring, if not as a independent line of research then as a complementary path towards developing efficient interpretability techniques.

References

Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., & Garriga-Alonso, A. (2023). Towards automated circuit discovery for mechanistic interpretability. Advances in Neural Information Processing Systems, 36, 16318-16352.

Martin, C. H., & Mahoney, M. W. (2021). Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning. Journal of Machine Learning Research, 22(165), 1-73.

Men, X., Xu, M., Zhang, Q., Wang, B., Lin, H., Lu, Y., ... & Chen, W. (2024). Shortgpt: Layers in large language models are more redundant than you expect. arXiv preprint arXiv:2403.03853.

^{^}
The authors frame it as a proof of concept that partial automation of the mechanistic interpretability workflow is possible.
^{^}
The paper mentions 68 out of 32000 (0.21%) edges contribute to GPT-2 Small's ability to compute Greater-than operation .
^{^}
Men et al. (2024) show that 25% of the layers of Llama2-13B could be dropped with minimal degradation in performance.
^{^}
All models from Qwen3, Gemma-2, Mistral, and Llama 3.x that are dense and text-to-text. Yes, including Llama 3.1-405B.
^{^}
Technically $W_{Q} W_{K}^{T}$ has total $d_{m o d e l}$ singular values of which $d_{m o d e l} - d_{h e a d}$ are always 0.
^{^}
No. of bins and clusters are hyperparameters. No. of bins = 24 and number of clusters = 6 is chosen for better visualization.
^{^}
The Brock-Dechert-Scheinkman (BDS) statistic is tests whether a time series is i.i.d. or not. Positive values indicates the clustering in phase space negative values indicate repulsion. |BDS| ≥ 3 is considered strong evidence of dependence.
^{^}
For some models, Layers 0 and 1 distort the plots so I omit them across all models for consistency.

LESSWRONG
LW

LESSWRONG
LW

7

Spectral Taxonomy of QK Circuits in Transformer Models

7

7

TL;DR