The Geometry of LLM Logits (an analytical outer bound)
Symbol | Meaning |
---|---|
$d$ | width of the residual stream (e.g. 768 in GPT-2-small) |
$L$ | number of Transformer blocks |
$V$ | vocabulary size, so logits live in $\mathbb{R}^V$ |
$x_\ell \in \mathbb{R}^d$ | residual-stream vector entering block $\ell$ |
$\Delta_\ell \in \mathbb{R}^d$ | the update written by block $\ell$ |
$W_U \in \mathbb{R}^{V \times d}$, $b \in \mathbb{R}^V$ | un-embedding matrix and bias |
**Step 1 (additive residual stream).** With pre-/peri-norm residual connections,

$$x_{\ell+1} = x_\ell + \Delta_\ell, \qquad \Delta_\ell = f_\ell\bigl(\mathrm{LN}(x_\ell)\bigr).$$

Hence the final pre-logit state is the sum of $L+1$ contributions (block $0$ = token + positional embeddings):

$$x_{\text{final}} = \sum_{\ell=0}^{L} \Delta_\ell.$$
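The additivity of the residual stream is easy to verify numerically. The sketch below builds a toy stream with random stand-in blocks (the width, depth, and `tanh` nonlinearity are illustrative assumptions, not taken from the text) and checks that the final state equals the sum of all block contributions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 4  # toy residual width and depth (illustrative values)

def layer_norm(x):
    # Plain LayerNorm without learned affine parameters.
    return (x - x.mean()) / x.std()

# Random toy "blocks": each reads a LayerNormed input and writes an update.
weights = [rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(L)]

x = rng.normal(size=d)      # block 0: embedding contribution
deltas = [x.copy()]
for W in weights:
    delta = np.tanh(W @ layer_norm(x))  # stand-in Lipschitz block
    deltas.append(delta)
    x = x + delta                       # additive residual stream

# The final pre-logit state is exactly the sum of all contributions.
assert np.allclose(x, np.sum(deltas, axis=0))
```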
**Step 2 (why a bound exists).** Every sub-module (attention head or MLP) receives a LayerNormed input, which lies on the sphere of radius $\sqrt{d}$ and is therefore bounded. Because the composition of linear maps and Lipschitz functions is itself Lipschitz, there exists a constant $r_\ell < \infty$ such that

$$\|\Delta_\ell\|_2 \le r_\ell \quad \text{for every input.}$$
Define the centred ellipsoid (here a ball, the simplest ellipsoid)

$$\mathcal{E}_\ell = \{\, v \in \mathbb{R}^d : \|v\|_2 \le r_\ell \,\}.$$

Then every realisable update lies inside that ellipsoid: $\Delta_\ell \in \mathcal{E}_\ell$.
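The per-block bound can be checked empirically. For a hypothetical block of the form `tanh(W @ LN(x))` (an assumed stand-in, since the text does not fix a block architecture), LayerNorm places the input on the sphere of radius $\sqrt{d}$ and `tanh` is 1-Lipschitz with `tanh(0) = 0`, so `r = ||W||_2 * sqrt(d)` is a valid radius:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16

def layer_norm(x):
    return (x - x.mean()) / x.std()

W = rng.normal(size=(d, d)) / np.sqrt(d)

def block(x):
    # Hypothetical block: 1-Lipschitz nonlinearity after a linear map.
    return np.tanh(W @ layer_norm(x))

# ||LN(x)||_2 = sqrt(d) exactly, so the update norm is at most
# spectral_norm(W) * sqrt(d), whatever the input scale.
r = np.linalg.norm(W, 2) * np.sqrt(d)

for _ in range(1000):
    x = rng.normal(size=d) * rng.uniform(0.1, 10.0)  # arbitrary scales
    assert np.linalg.norm(block(x)) <= r + 1e-9
```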
**Step 3 (Minkowski sum).** Using additivity and Step 2,

$$x_{\text{final}} \in \mathcal{E}_0 \oplus \mathcal{E}_1 \oplus \cdots \oplus \mathcal{E}_L,$$

where $\oplus$ denotes the Minkowski sum of the individual ellipsoids.
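A quick sanity check of the Minkowski-sum step: a point drawn from each ball, one per block, sums to a point inside the ball whose radius is the sum of the radii (the radii below are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
radii = [1.5, 0.7, 2.0]  # illustrative per-block radii r_ell

# Sample one point from each ball E_ell = {v : ||v||_2 <= r_ell} ...
points = []
for r in radii:
    v = rng.normal(size=d)
    points.append(r * rng.uniform() * v / np.linalg.norm(v))

# ... their sum lies in the Minkowski sum, which for centred balls
# is again a ball of radius sum(radii).
s = np.sum(points, axis=0)
assert np.linalg.norm(s) <= sum(radii) + 1e-9
```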
**Step 4 (affine map to logit space).** Logits are produced by the affine map $z = W_U x + b$. For any sets $A, B \subseteq \mathbb{R}^d$,

$$W_U (A \oplus B) = W_U A \oplus W_U B.$$

Hence

$$z \in b + W_U \mathcal{E}_0 \oplus W_U \mathcal{E}_1 \oplus \cdots \oplus W_U \mathcal{E}_L.$$

Because linear images of ellipsoids are ellipsoids, each $W_U \mathcal{E}_\ell$ is still an ellipsoid, now in $\mathbb{R}^V$.
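That a linear map sends a ball to an ellipsoid can be verified via the SVD: if $W_U = U \Sigma V^\top$, the image of the unit ball is the ellipsoid with semi-axes given by the singular values, i.e. every image point $z$ satisfies $\|\Sigma^{-1} U^\top z\| \le 1$. The toy sizes below are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d, V = 8, 5  # toy residual width and vocabulary size
W_U = rng.normal(size=(V, d))

U, S, _ = np.linalg.svd(W_U, full_matrices=False)

for _ in range(1000):
    v = rng.normal(size=d)
    v = rng.uniform() * v / np.linalg.norm(v)  # a point of the unit ball
    z = W_U @ v
    # z lies in the ellipsoid whose semi-axes are the singular values:
    # || diag(1/S) @ U.T @ z || <= 1.
    assert np.linalg.norm((U.T @ z) / S) <= 1 + 1e-9
```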
**Step 5 (the outer set is an ellipsotope).** An ellipsotope is an affine shift of a finite Minkowski sum of ellipsoids. The set

$$\mathcal{E} \;:=\; b + \bigoplus_{\ell=0}^{L} W_U \mathcal{E}_\ell$$

is therefore an ellipsotope.
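Ellipsotopes are convenient computationally because their support function is the sum of the summands' support functions. When every $\mathcal{E}_\ell$ is a centred ball of radius $r_\ell$, the bound in any logit direction $u$ is $h(u) = u^\top b + \bigl(\sum_\ell r_\ell\bigr)\,\|W_U^\top u\|_2$. A sketch under assumed toy dimensions and radii:

```python
import numpy as np

rng = np.random.default_rng(4)
d, V, L = 8, 5, 3
W_U = rng.normal(size=(V, d))
b = rng.normal(size=V)
radii = [1.0, 0.5, 2.0, 0.8]  # r_0 ... r_L, illustrative

def sample_ellipsotope():
    # b plus one W_U-image of a point from each ball.
    z = b.copy()
    for r in radii:
        v = rng.normal(size=d)
        z = z + W_U @ (r * rng.uniform() * v / np.linalg.norm(v))
    return z

# Support function of the ellipsotope in direction u:
# h(u) = u.b + (sum_ell r_ell) * ||W_U.T @ u||_2.
u = rng.normal(size=V)
h = u @ b + sum(radii) * np.linalg.norm(W_U.T @ u)

for _ in range(1000):
    assert u @ sample_ellipsotope() <= h + 1e-9
```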
**Theorem.** For any pre-norm or peri-norm Transformer language model whose blocks receive LayerNormed inputs, the set of all logit vectors attainable over every prompt and position satisfies

$$\{\, z(\text{prompt}, \text{position}) \,\} \;\subseteq\; \mathcal{E},$$

where $\mathcal{E}$ is the ellipsotope defined above.
**Proof.** The containments in Steps 2–4 compose to give the stated inclusion; Step 5 shows the outer set is an ellipsotope. ∎
It is an outer approximation only. Equality would require showing that every point of the ellipsotope is actually realised by some token context, which the argument does not provide.
**Geometry-aware compression and safety.** Because $\mathcal{E}$ is convex and centrally symmetric about $b$, one can fit a minimum-volume outer ellipsoid to it, yielding tight norm-based regularisers or robustness certificates against weight noise / quantisation.
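In the special case where every $\mathcal{E}_\ell$ is a centred ball, each image $W_U \mathcal{E}_\ell$ is the same ellipsoid scaled by $r_\ell$, so the Minkowski sum collapses to $\bigl(\sum_\ell r_\ell\bigr) W_U B$ and the outer ellipsoid is exact. A crude but cheap consequence is a norm certificate on the logits (toy values assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
d, V = 8, 5
W_U = rng.normal(size=(V, d))
b = rng.normal(size=V)
radii = [1.0, 0.5, 2.0]

# With centred balls, the ellipsotope is b + R * W_U(unit ball), R = sum r_ell,
# so every attainable logit vector z satisfies ||z - b||_2 <= sigma_max(W_U) * R.
R = sum(radii)
sigma_max = np.linalg.norm(W_U, 2)

v = rng.normal(size=d)
v = R * rng.uniform() * v / np.linalg.norm(v)  # a point of R * unit ball
z = b + W_U @ v
assert np.linalg.norm(z - b) <= sigma_max * R + 1e-9
```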
**Layer-wise attribution.** The individual sets $W_U \mathcal{E}_\ell$ bound how much any single layer can move the logits, complementing "logit-lens"-style analyses.
**Assumptions.** LayerNorm guarantees each block's input is bounded (it lies on the sphere of radius $\sqrt{d}$); Lipschitz (but not necessarily bounded) activations such as GELU or SiLU then give finite $r_\ell$. Architectures without such norm control would require a separate analysis.
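The boundedness claim is exact for LayerNorm without a learned affine map: whatever the input scale, the normalised output lands on the sphere of radius $\sqrt{d}$ (with a learned gain $\gamma$ the radius is rescaled but remains finite). A minimal check:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 768  # GPT-2-small residual width

def layer_norm(x):
    # Population-std LayerNorm without learned affine parameters (eps = 0):
    # the output norm is exactly sqrt(d) for any non-constant input.
    c = x - x.mean()
    return c / np.sqrt((c ** 2).mean())

for _ in range(100):
    x = rng.normal(size=d) * rng.uniform(0.01, 100.0)
    assert np.isclose(np.linalg.norm(layer_norm(x)), np.sqrt(d))
```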