Rohan Ganapavarapu — LessWrong

9mo

The Geometry of LLM Logits (an analytical outer bound)

Symbol	Meaning
$d$	width of the residual stream (e.g. 768 in GPT-2-small)
$L$	number of Transformer blocks
$V$	vocabulary size, so logits live in $R^{V}$
$h^{(ℓ)}$	residual-stream vector entering block $ℓ$
$r^{(ℓ)}$	the update written by block $ℓ$
$W_{U} \in R^{V \times d}, b \in R^{V}$	un-embedding matrix and bias

Additive residual stream. With (pre-/peri-norm) residual connections,

$h^{(ℓ + 1)} = h^{(ℓ)} + r^{(ℓ)}, ℓ = 0, \dots, L - 1.$

Hence the final pre-logit state is the sum of $L + 1$ contributions (block 0 = token+positional embeddings):

$h^{(L)} = \sum_{ℓ = 0}^{L} r^{(ℓ)} .$
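The additive structure can be checked numerically. The following is a minimal sketch, not the author's code: it uses toy sizes and random vectors in place of real block outputs, and the helper names `residual_stream` and `direct_sum` are hypothetical. It follows the convention above that the embedding counts as contribution 0, followed by one update per block.

```python
import random

def residual_stream(embedding, updates):
    """Apply h <- h + r once per update, starting from the embedding."""
    h = list(embedding)
    for r in updates:
        h = [hi + ri for hi, ri in zip(h, r)]
    return h

def direct_sum(contributions):
    """Sum a list of equal-length vectors coordinate-wise."""
    return [sum(column) for column in zip(*contributions)]

rng = random.Random(0)
d, L = 8, 4  # toy width and block count (assumed values, not from a real model)

# Contribution 0 stands in for the token + positional embeddings.
embedding = [rng.gauss(0, 1) for _ in range(d)]
# Random stand-ins for the updates written by the L blocks.
updates = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]

h_final = residual_stream(embedding, updates)
h_sum = direct_sum([embedding] + updates)  # L + 1 contributions in total

# Largest coordinate-wise difference between the two computations.
max_gap = max(abs(a - b) for a, b in zip(h_final, h_sum))
print(max_gap)
```

Running the recursion and summing the contributions directly give the same vector up to floating-point rounding, which is all the identity claims.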

Why a bound exists. Every sub-module (attention head or MLP):

1. reads a LayerNormed copy $u$ of its input, so $∥ u ∥_{2} \leq ρ_{ℓ}$ where $ρ_{ℓ} := γ_{ℓ} \sqrt{d}$ and $γ_{ℓ}$ is that block’s learned scale;
2. applies linear maps, a Lipschitz point-wise non-linearity (GELU, SiLU, …), and another linear map back to $R^{d}$.
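The norm bound on the LayerNormed input can be illustrated with a small sketch. It rests on simplifying assumptions that are not stated in the post: the learned scale is a single scalar `gamma`, the LayerNorm bias is zero, and `eps` takes a typical default value. With a per-coordinate scale vector and a bias $β$, the same argument would give $\max_i |γ_i| \sqrt{d} + ∥ β ∥_{2}$ instead. The helper names are hypothetical.

```python
import math
import random

def layer_norm(x, gamma, eps=1e-5):
    """Bias-free LayerNorm with a scalar scale: gamma * (x - mean) / sqrt(var + eps)."""
    d = len(x)
    mean = sum(x) / d
    var = sum((xi - mean) ** 2 for xi in x) / d
    scale = gamma / math.sqrt(var + eps)
    return [scale * (xi - mean) for xi in x]

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(vi * vi for vi in v))

rng = random.Random(0)
d, gamma = 768, 1.5           # width as in GPT-2-small; gamma is an arbitrary example value
rho = gamma * math.sqrt(d)    # the claimed bound on the normalized input

# Inputs of very different magnitudes all land inside the same ball of radius rho.
norms = []
for input_scale in (1e-3, 1.0, 1e3):
    x = [input_scale * rng.gauss(0, 1) for _ in range(d)]
    norms.append(l2_norm(layer_norm(x, gamma)))

print(rho, norms)
```

Under these assumptions the squared norm equals $γ^{2} d \cdot \mathrm{var} / (\mathrm{var} + ε)$, so it never exceeds $γ^{2} d$ regardless of how large the input is. It approaches the bound when the input variance is large relative to $ε$.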

Because the composition of linear maps and Lipschitz functions...