The Geometry of LLM Logits (an analytical outer bound)
1 Preliminaries
| Symbol | Meaning |
|---|---|
| $d$ | width of the residual stream (e.g. 768 in GPT-2-small) |
| $L$ | number of Transformer blocks |
| $V$ | vocabulary size, so logits live in $\mathbb{R}^V$ |
| $h^{(\ell)}$ | residual-stream vector entering block $\ell$ |
| $r^{(\ell)}$ | the update written by block $\ell$ |
| $W_U \in \mathbb{R}^{V \times d}$, $b \in \mathbb{R}^V$ | un-embedding matrix and bias |
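To make the notation concrete, here is a short sketch (assuming the Hugging Face `transformers` package and the "gpt2" checkpoint, neither of which is prescribed above) that reads off $d$, $L$, $V$ and $W_U$ for GPT-2-small; note that GPT-2's tied un-embedding has no bias, so $b = 0$ for that model.

```python
# A sketch, assuming Hugging Face `transformers` and the GPT-2-small checkpoint.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
d = model.config.n_embd      # 768    -- width of the residual stream
L = model.config.n_layer     # 12     -- number of Transformer blocks
V = model.config.vocab_size  # 50257  -- vocabulary size; logits live in R^V
W_U = model.lm_head.weight   # shape (V, d): un-embedding matrix (GPT-2 has no un-embedding bias)
print(d, L, V, tuple(W_U.shape))
```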
**Additive residual stream.** With (pre-/peri-norm) residual connections,

$$h^{(\ell+1)} = h^{(\ell)} + r^{(\ell)}, \qquad \ell = 0, \dots, L-1.$$

Hence the final pre-logit state is the sum of $L+1$ contributions (block 0 = token + positional embeddings):

$$h^{(L)} = \sum_{\ell=0}^{L} r^{(\ell)}.$$
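The additivity of the stream can be checked directly on a real model. The sketch below (assuming GPT-2-small via Hugging Face `transformers` and forward hooks on each block's attention and MLP sub-modules; none of this is prescribed by the text) captures the two updates each block writes and confirms that the residual stream entering block $\ell+1$ equals the stream entering block $\ell$ plus those updates.

```python
# A sketch, assuming GPT-2-small via Hugging Face `transformers`.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

attn_out, mlp_out = {}, {}
for l, block in enumerate(model.transformer.h):
    # The attention module returns a tuple; its first element is the update tensor.
    block.attn.register_forward_hook(lambda _m, _i, out, l=l: attn_out.__setitem__(l, out[0]))
    block.mlp.register_forward_hook(lambda _m, _i, out, l=l: mlp_out.__setitem__(l, out))

ids = tok("The geometry of logits", return_tensors="pt")
with torch.no_grad():
    # hidden_states[l] is the residual stream entering block l;
    # the last entry is the post-final-LayerNorm state, so it is skipped below.
    hs = model(**ids, output_hidden_states=True).hidden_states

for l in range(model.config.n_layer - 1):
    # h^(l+1) = h^(l) + r_attn^(l) + r_mlp^(l): the stream is purely additive.
    assert torch.allclose(hs[l] + attn_out[l] + mlp_out[l], hs[l + 1], atol=1e-4)
print("residual stream verified to be additive")
```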
2 Each update is contained in an ellipsoid
**Why a bound exists.** Every sub-module (attention head or MLP)

- reads a LayerNormed copy $u$ of its input, so $\lVert u \rVert_2 \le \rho_\ell$, where $\rho_\ell := \gamma_\ell \sqrt{d}$ and $\gamma_\ell$ is that block's learned scale;
- applies linear maps, a Lipschitz point-wise non-linearity (GELU, SiLU, …), and another linear map back to $\mathbb{R}^d$.
Because the composition of linear maps and Lipschitz functions is itself Lipschitz, each sub-module maps this $\rho_\ell$-ball into a bounded set, so the update $r^{(\ell)}$ it writes back is contained in an ellipsoid in $\mathbb{R}^d$.
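The first bullet above can be checked numerically. The sketch below makes two assumptions not stated in the text: it uses GPT-2-small via Hugging Face `transformers`, and, because GPT-2's LayerNorm scale is per-dimension rather than the scalar $\gamma_\ell$ above, it takes the radius as $\max_i \lvert\gamma_{\ell,i}\rvert \sqrt{d} + \lVert\beta_\ell\rVert_2$ so that the bias is covered too. It then confirms that every LayerNormed input a block reads lies inside its $\rho_\ell$-ball.

```python
# A sketch, assuming GPT-2-small; rho_l is adapted to per-dimension gamma and bias beta.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
d = model.config.n_embd  # 768

ids = tok("Bounding the residual stream", return_tensors="pt")
with torch.no_grad():
    hs = model(**ids, output_hidden_states=True).hidden_states  # hs[l] enters block l

for l, block in enumerate(model.transformer.h):
    ln = block.ln_1                       # the LayerNorm the attention sub-module reads through
    gamma, beta = ln.weight, ln.bias
    # ||LN(x)||_2 <= max_i|gamma_i| * sqrt(d) + ||beta||_2, since the normalized
    # vector has 2-norm at most sqrt(d) before the affine scale-and-shift.
    rho = gamma.abs().max().item() * math.sqrt(d) + beta.norm().item()
    with torch.no_grad():
        u = ln(hs[l])                     # what block l's attention actually reads
    assert u.norm(dim=-1).max().item() <= rho + 1e-3
print("all LayerNormed inputs lie inside their rho_l balls")
```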