The Geometry of LLM Logits (an analytical outer bound)
1 Preliminaries
| Symbol | Meaning |
| --- | --- |
| $d$ | width of the residual stream (e.g. 768 in GPT-2-small) |
| $L$ | number of Transformer blocks |
| $V$ | vocabulary size, so logits live in $\mathbb{R}^V$ |
| $h^{(\ell)}$ | residual-stream vector entering block $\ell$ |
| $r^{(\ell)}$ | the update written by block $\ell$ |
| $W_U \in \mathbb{R}^{V \times d}$, $b \in \mathbb{R}^V$ | un-embedding matrix and bias |
Additive residual stream.
With (pre-/peri-norm) residual connections, writing $h^{(0)} = r^{(0)}$ for the token + positional embeddings and $r^{(\ell)}$ for the update written by block $\ell$,
$$h^{(\ell)} = h^{(\ell-1)} + r^{(\ell)}, \qquad \ell = 1, \dots, L.$$
Hence the final pre-logit state is the sum of $L+1$ contributions (contribution 0 = token + positional embeddings):
$$h^{(L)} = \sum_{\ell=0}^{L} r^{(\ell)}.$$
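The additive recursion above can be checked numerically. A minimal sketch with toy dimensions (the widths and the random "updates" are hypothetical, standing in for real block outputs):

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 4  # toy residual-stream width and depth (the document's d and L)

# Hypothetical per-block updates r^(0)..r^(L); r^(0) plays the role of the
# token + positional embeddings.
updates = [rng.normal(size=d) for _ in range(L + 1)]

# Run the additive recursion h^(l) = h^(l-1) + r^(l), starting from h^(0) = r^(0).
h = updates[0].copy()
for r in updates[1:]:
    h = h + r

# The final pre-logit state is exactly the sum of all L+1 contributions.
assert np.allclose(h, np.sum(np.stack(updates), axis=0))
```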
2 Each update is contained in an ellipsoid
Why a bound exists.
Every sub-module (attention head or MLP)
- reads a LayerNormed copy $u$ of its input, so $\|u\|_2 \le \rho_\ell$, where $\rho_\ell := \gamma_\ell \sqrt{d}$ and $\gamma_\ell$ is that block's learned scale;
- applies a linear map, a Lipschitz point-wise non-linearity (GELU, SiLU, …), and another linear map back to $\mathbb{R}^d$.
Because a composition of linear maps and Lipschitz functions is itself Lipschitz, there exists a constant $\kappa_\ell$ such that
$$\|r^{(\ell)}\|_2 \le \kappa_\ell \quad \text{whenever} \quad \|u\|_2 \le \rho_\ell.$$
Define the centred ellipsoid (here a Euclidean ball, the special case with equal semi-axes)
$$E^{(\ell)} := \{x \in \mathbb{R}^d : \|x\|_2 \le \kappa_\ell\}.$$
Then every realisable update lies inside that ellipsoid:
$$r^{(\ell)} \in E^{(\ell)}.$$
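A toy numerical sketch of how such a $\kappa_\ell$ arises, assuming a single two-layer MLP sub-module with the tanh-approximate GELU; the value 1.2 is an assumed (safely loose) Lipschitz constant for GELU, whose maximum slope is about 1.13:

```python
import numpy as np

rng = np.random.default_rng(1)
d, d_ff = 12, 48            # toy residual and hidden widths (assumptions)
rho = np.sqrt(d)            # LayerNorm norm bound with learned scale gamma = 1

W_in = rng.normal(size=(d_ff, d)) / np.sqrt(d)
W_out = rng.normal(size=(d, d_ff)) / np.sqrt(d_ff)

def gelu(x):
    # tanh-approximate GELU; note gelu(0) = 0, so |gelu(t)| <= Lip * |t|
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

L_GELU = 1.2  # assumed valid Lipschitz constant (max slope of GELU is ~1.13)

# kappa: product of the spectral norms, the activation's Lipschitz constant,
# and the input-norm bound rho.
op_norm = lambda W: np.linalg.svd(W, compute_uv=False)[0]
kappa = op_norm(W_out) * L_GELU * op_norm(W_in) * rho

# Empirical check: inputs with ||u|| <= rho never produce an update beyond kappa.
for _ in range(100):
    u = rng.normal(size=d)
    u *= rho * rng.random() / np.linalg.norm(u)
    r = W_out @ gelu(W_in @ u)
    assert np.linalg.norm(r) <= kappa
```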
3 Residual stream ⊆ Minkowski sum of ellipsoids
Using additivity and Step 2,
$$h^{(L)} = \sum_{\ell=0}^{L} r^{(\ell)} \in \sum_{\ell=0}^{L} E^{(\ell)} =: E_{\mathrm{tot}},$$
where $\sum_\ell E^{(\ell)} = E^{(0)} \oplus \cdots \oplus E^{(L)}$ is the Minkowski sum of the individual ellipsoids.
4 Logit space is an affine image of that sum
Logits are produced by the affine map $x \mapsto W_U x + b$.
For any sets $S_1, \dots, S_m$,
$$W_U\Bigl(\bigoplus_i S_i\Bigr) = \bigoplus_i W_U S_i.$$
Hence
$$\mathrm{logits} = W_U h^{(L)} + b \in b + \bigoplus_{\ell=0}^{L} W_U E^{(\ell)}.$$
Because linear images of ellipsoids are ellipsoids, each $W_U E^{(\ell)}$ is still an ellipsoid.
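The linear image of a ball is easy to certify via its support function: for $E = \{x : \|x\|_2 \le \kappa\}$, the image $W_U E$ satisfies $\sup_{x \in E}\langle v, W_U x\rangle = \kappa \|W_U^\top v\|_2$ for every direction $v$. A small sketch with toy, hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
d, V = 8, 20                      # toy widths (assumptions)
W_U = rng.normal(size=(V, d))
kappa = 3.0                       # toy update-norm bound for one block

# E = {x : ||x|| <= kappa} is a ball; its image W_U E is an ellipsoid whose
# support function in direction v is  kappa * ||W_U^T v||_2  (Cauchy-Schwarz).
for _ in range(200):
    v = rng.normal(size=V)
    x = rng.normal(size=d)
    x *= kappa * rng.random() / np.linalg.norm(x)   # a point of E
    assert v @ (W_U @ x) <= kappa * np.linalg.norm(W_U.T @ v) + 1e-9
```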
5 Ellipsotopes
An ellipsotope is an affine shift of a finite Minkowski sum of ellipsoids.
The set
$$\mathcal{L}_{\mathrm{outer}} := b + \bigoplus_{\ell=0}^{L} W_U E^{(\ell)}$$
is therefore an ellipsotope.
6 Main result (outer bound)
Theorem.
For any pre-norm or peri-norm Transformer language model whose blocks receive LayerNormed inputs, the set $\mathcal{L}$ of all logit vectors attainable over every prompt and position satisfies
$$\mathcal{L} \subseteq \mathcal{L}_{\mathrm{outer}},$$
where $\mathcal{L}_{\mathrm{outer}}$ is the ellipsotope defined above.
Proof.
Containments in Steps 2–4 compose to give the stated inclusion; Step 5 shows the outer set is an ellipsotope. ∎
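The outer bound yields a cheap necessary condition for membership: since each $E^{(\ell)}$ here is a ball of radius $\kappa_\ell$, any $z \in \mathcal{L}_{\mathrm{outer}}$ must satisfy $\langle v, z - b\rangle \le \bigl(\sum_\ell \kappa_\ell\bigr)\|W_U^\top v\|_2$ for every direction $v$. A sketch under toy, hypothetical dimensions and bounds:

```python
import numpy as np

rng = np.random.default_rng(3)
d, V, L = 8, 20, 4                     # toy widths and depth (assumptions)
W_U = rng.normal(size=(V, d))
b = rng.normal(size=V)
kappas = rng.random(L + 1) + 0.5       # toy per-block bounds kappa_l

# Sample one update per block inside its ball E^(l) and form the logits.
updates = []
for k in kappas:
    x = rng.normal(size=d)
    updates.append(x * (k * rng.random() / np.linalg.norm(x)))
logits = W_U @ sum(updates) + b

# Support-function certificate: logits in L_outer imply, for every direction v,
#   <v, logits - b>  <=  (sum_l kappa_l) * ||W_U^T v||_2.
for _ in range(200):
    v = rng.normal(size=V)
    assert v @ (logits - b) <= kappas.sum() * np.linalg.norm(W_U.T @ v) + 1e-9
```

Violating this inequality for any single $v$ certifies that a point lies outside $\mathcal{L}_{\mathrm{outer}}$, hence outside $\mathcal{L}$.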
7 Remarks & implications
It is an outer approximation.
Equality $\mathcal{L} = \mathcal{L}_{\mathrm{outer}}$ would require showing that every point of the ellipsotope is actually realised by some token context, which the argument does not provide.
Geometry-aware compression and safety.
Because $\mathcal{L}_{\mathrm{outer}}$ is convex and symmetric about its centre $b$, one can fit a minimum-volume outer ellipsoid to it, yielding norm-based regularisers or robustness certificates against weight noise and quantisation.
Layer-wise attribution.
The individual sets $W_U E^{(\ell)}$ bound how much any single layer can move the logits, complementing “logit-lens”-style analyses.
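The per-layer bound follows from the spectral norm of the un-embedding: $\|W_U r\|_2 \le \sigma_{\max}(W_U)\,\kappa_\ell$ for every $r \in E^{(\ell)}$. A minimal sketch with toy, hypothetical sizes:

```python
import numpy as np

rng = np.random.default_rng(4)
d, V = 8, 20                    # toy widths (assumptions)
W_U = rng.normal(size=(V, d))
kappa = 2.0                     # toy bound on ||r^(l)|| for one block

# The largest singular value of W_U caps how far a single block can move the
# logits: ||W_U r|| <= sigma_max(W_U) * kappa for every r in E^(l).
sigma_max = np.linalg.svd(W_U, compute_uv=False)[0]
for _ in range(100):
    r = rng.normal(size=d)
    r *= kappa * rng.random() / np.linalg.norm(r)
    assert np.linalg.norm(W_U @ r) <= sigma_max * kappa + 1e-9
```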
Assumptions.
LayerNorm guarantees that $\|u\|_2$ is bounded; Lipschitz (but not necessarily bounded) activations such as GELU and SiLU then give finite $\kappa_\ell$. Architectures without such norm control would require separate analysis.