In 2015, Sergey Ioffe and Christian Szegedy from Google proposed Batch Normalization. By normalizing activations at each layer, it greatly eased the unstable gradients and slow convergence that held back deep networks. Since then, variants such as LayerNorm and RMSNorm have emerged. Today, normalization layers are a standard component of almost all neural networks, whether Transformer, CNN, or MoE.
The industry has long treated normalization as an "engineering trick": a training accelerator, a stabilizer. The nGPT paper pushed it even further: it constrains the network to train on a hypersphere and reports reaching the same accuracy in 4 to 20 times fewer training steps, depending on sequence length.
In high school math class, we learned what a vector is: it has both magnitude and direction. This knowledge is so fundamental that we never think about it again after exams. In neural networks, vectors represent everything — word embeddings, hidden states, gradients. We only care about their magnitude: norms, activation values, gradient norms. But we never ask: what is this vector actually expressing? After normalization, where does the semantics go?
Normalization gives a very simple answer:
Representation does not lie in the original vector, but in the relative contribution structure of each dimension within the normalized direction.
Let us prove this with high school mathematics:
$$\hat{v} = \frac{v}{\|v\|}$$
This is a basic normalization operation that projects any non-zero vector onto the unit sphere, forcing $\|\hat{v}\| = 1$. Once the magnitude is normalized away, the information carried by the original vector can only be stored in the direction $\hat{v}$. But direction itself is a holistic concept; what we need to know is: what is the "microstructure" of direction? Let us look at a vector from a different perspective.
Consider a vector $v = (2, 4, 5)$. Traditionally, this simply represents a point in three-dimensional space at coordinates (2, 4, 5). But notice: in our reasoning, we say the x-axis pushes 2 units, the y-axis pushes 4 units, the z-axis pushes 5 units, and together they naturally form this direction. In other words, we can only form a direction by performing a "pushing boxes" operation on each dimension. What truly matters here is not the point (2, 4, 5) itself, but the ratio 2:4:5. After normalization, all information is compressed into this proportional structure. Therefore, in a $d$-dimensional representation space, the real question is not "how large is this neuron's activation", but rather: what relative position does this dimension occupy within the entire directional structure?
From the above, we obtain the formula:
$$s_i = \frac{v_i}{\|v\|}$$
where $s_i$ represents "the structural contribution of the $i$-th dimension to the current representation direction", and $v_i$ is the actual contribution of that dimension in the current vector. Our goal is to use this formula to fully explain neural networks.
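Suppose a layer has three neurons, and the output vector is $v = (2, 4, 5)$. Substituting into the formula:
$$\|v\| = \sqrt{2^2 + 4^2 + 5^2} = \sqrt{45} \approx 6.708$$
Decomposed:
$$s_1 = \frac{2}{6.708} \approx 0.298, \quad s_2 = \frac{4}{6.708} \approx 0.596, \quad s_3 = \frac{5}{6.708} \approx 0.745$$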
We can clearly see that the third neuron ($s_3$) makes the largest "contribution" to the current output. Compared to the original (2, 4, 5), this decomposition puts every vector on a common scale, so contributions can be compared across vectors of different magnitude. Normalization removes the magnitude of the vector, leaving only the pure "semantic" direction. From an energy perspective, you indeed exerted forces (2, 4, 5), but in the relative coordinate space, the actual movement is toward the point (0.298, 0.596, 0.745).
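The worked example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `direction` is made up for illustration and computes $s_i = v_i / \|v\|$ with the Euclidean norm.

```python
import math


def direction(v):
    """Return s with s_i = v_i / ||v|| (Euclidean norm); reject the zero vector."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("the zero vector has no direction")
    return [x / norm for x in v]


v = [2.0, 4.0, 5.0]
s = direction(v)
print([round(x, 3) for x in s])  # [0.298, 0.596, 0.745]

# Scaling v by any positive factor changes the magnitude but not s.
s_scaled = direction([10.0 * x for x in v])
print(all(math.isclose(a, b) for a, b in zip(s, s_scaled)))  # True
```

The second check is the point of the decomposition: (2, 4, 5) and (20, 40, 50) have different magnitudes but the same direction, so they yield the same $s$.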
Thus, after normalization, the neuron's "confidence" (magnitude) and "contribution to the current output" (direction) are separated. This decomposition does not depend on whether the network is linear. No matter how complex the preceding computation, normalization compresses the result into a directional structure whose interpretability comes solely from that structure itself.
Example: Explaining the Transformer Architecture
Today, LLMs all incorporate normalization during training, such as LayerNorm or RMSNorm, for the purpose of stable training. But from this paper's perspective, this actually allows us to fully explain current LLMs.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
This is the standard single-head attention formula. Traditionally, it is described as "letting each token attend to a weighted average of other tokens". But from this paper's geometric perspective, this formula has a more fundamental meaning.
First, we need to understand what Q and K are computing. Assuming the input is normalized, each token's representation $x_i$ lies on the hypersphere $S^{d-1}$, and Q and K are new directions obtained by linearly projecting $x_i$. Write $q_i$ for row i of Q and $k_j$ for row j of K. Then the (i, j) entry of QKᵀ is the directional similarity between token i and token j. Under the formula derived in this paper, when $q_i$ and $k_j$ are themselves unit-norm, this is exactly the cosine similarity on the hypersphere:
$$q_i \cdot k_j = \|q_i\| \, \|k_j\| \cos\theta_{ij} = \cos\theta_{ij}$$
Then:
$$\alpha_{ij} = \frac{\exp\left(q_i \cdot k_j / \sqrt{d_k}\right)}{\sum_{j'} \exp\left(q_i \cdot k_{j'} / \sqrt{d_k}\right)}$$
Thus, softmax essentially converts a similarity distribution on the hypersphere into a probability distribution. That is, the current token i "looks" at all other tokens in the context on the hypersphere and decides which semantic directions are worth following.
In the formula, V is the set of semantic anchor points for all tokens (or transformed directions). The weighted average:
$$o_i = \sum_j \alpha_{ij} v_j$$
where $\alpha_{ij}$ is the softmax weight that token i places on token j and $v_j$ is row j of V. In geometric terms: the current semantic point takes a small step toward those more relevant semantic directions (a tangent space adjustment).
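The geometric reading above can be sketched in plain Python. This sketch makes two assumptions the text does not spell out: queries and keys are renormalized after projection so that their dot product is exactly a cosine, and the softmax temperature is a free parameter (called `scale` here, a hypothetical name). Causal masking is left out, and `cosine_attention` is an illustrative helper, not any library's API.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine_attention(xs, W_q, W_k, W_v, i, scale=1.0):
    """Attention weights and output for token i, scored by cos(q_i, k_j)."""
    q = normalize(matvec(W_q, xs[i]))
    keys = [normalize(matvec(W_k, x)) for x in xs]
    values = [matvec(W_v, x) for x in xs]
    weights = softmax([scale * dot(q, k) for k in keys])
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, out


# Three toy token directions on the unit circle; identity projections.
xs = [normalize([1.0, 0.0]), normalize([1.0, 1.0]), normalize([0.0, 1.0])]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, out = cosine_attention(xs, identity, identity, identity, i=0)
print(weights)  # largest weight on the token whose direction is closest
print(out)      # weighted average of the value vectors
```

With identity projections, token 0 gives the most weight to itself, less to the token at 45 degrees, and least to the orthogonal token, which is the "voting by cosine similarity" picture in miniature.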
Residual + Normalization — The Real "Confirmation"
The residual is a probe (leaving the hypersphere and entering the tangent space); normalization is the confirmation, returning the new direction to the hypersphere. Strictly speaking, entering the tangent space is the ideal case. In practice, LayerNorm/RMSNorm approximates this geometric intuition — leaving the manifold, making adjustments, then projecting back.
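A toy version of this probe-then-confirm step, assuming plain L2 normalization in place of LayerNorm/RMSNorm and a hypothetical step size `step`. The helper `tangent_component` shows the ideal case mentioned above, where only the part of the update orthogonal to the current direction is kept.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tangent_component(h, update):
    """Remove from `update` its component along the unit vector h."""
    radial = dot(h, update)
    return [u - radial * a for u, a in zip(update, h)]


def residual_then_normalize(h, update, step=0.1):
    """Probe: leave the sphere via h + step * update. Confirm: project back."""
    probed = [a + step * u for a, u in zip(h, update)]
    return normalize(probed)


h = [1.0, 0.0]       # current direction on the unit circle
target = [0.0, 1.0]  # direction suggested by the update
h_new = residual_then_normalize(h, target)

print(dot(h, target), dot(h_new, target))  # cosine to the target increases
print(tangent_component(h, [1.0, 1.0]))    # [0.0, 1.0]
```

The new point is still on the sphere, but its cosine with the target direction has grown: the residual proposed a move and the normalization committed it as a change of direction only.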
This is how Attention works: on the semantic sphere, it votes via cosine similarity, calculating how the current point should update its direction. This also explains why the dot product in Attention works: not because the dot product happens to be "effective", but because after normalization it reduces to cosine similarity, the natural measure of closeness between directions on the sphere.
Directly Explaining Current LLMs
In Transformers that tie the LM Head to the Embedding, the two share the same word space, which contains the precise coordinate axis for each token. Our goal is to understand what the model experiences from input to output.
Question 1: How much "directional contribution" does this neuron make to the current hidden state? How does reasoning work on the hypersphere?
After normalization, each layer of the network performs nearest-neighbor retrieval on the same hypersphere. First, use the formula $\hat{w}_t = w_t / \|w_t\|$ (where $w_t$ is the row vector of token t in the LM Head) to obtain the semantic anchor of each word on the sphere. Then use $s_i = v_i / \|v\|$ to compute the direction $s$ (where v is directly taken as the hidden state vector). Once we have this information, we can track the model's trajectory on the hypersphere, the contribution of each dimension (neuron) to the current direction, and see which token the current layer has moved toward on the sphere.
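As a rough illustration of this retrieval step, the sketch below normalizes each LM Head row into an anchor, normalizes a hidden state into a direction, and ranks tokens by cosine similarity. The three-word vocabulary, the LM Head matrix, and the helper name `nearest_tokens` are all invented for the example; a real model would supply its own weights and hidden states.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def nearest_tokens(hidden, lm_head, vocab, top_k=3):
    """Return the hidden state's direction and the top_k closest token anchors."""
    s = normalize(hidden)                       # direction of the hidden state
    anchors = [normalize(row) for row in lm_head]  # one anchor per token
    sims = [sum(a * b for a, b in zip(s, w)) for w in anchors]
    ranked = sorted(zip(vocab, sims), key=lambda pair: pair[1], reverse=True)
    return s, ranked[:top_k]


# Toy data: rows deliberately have different norms to show they do not matter.
vocab = ["cat", "dog", "car"]
lm_head = [
    [2.0, 0.0, 0.0],
    [0.0, 3.0, 0.0],
    [0.0, 0.0, 0.5],
]
hidden = [4.0, 1.0, 0.5]

s, ranked = nearest_tokens(hidden, lm_head, vocab)
for token, sim in ranked:
    print(f"{token}: cosine {sim:.3f}")
```

Running the same function on the hidden state after each layer gives a per-layer trace of which anchor the representation is closest to, while `s` itself lists the per-dimension contributions to the current direction.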
Question 2: How much "logit contribution" does this neuron make to the final output token?
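We can use:
$$c_{d,t} = s_d \, W_{t,d}$$
This means: how much does dimension d, via the LM Head weight $W_{t,d}$, contribute to the logit of token t. Summing over d gives the dot product between the normalized hidden state and the LM Head row of token t. This explains whether, in the final vote, this dimension (neuron) is "pushing toward" or "pulling away from" the target token, and by how much.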
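The per-dimension logit contribution is just an elementwise product, as the following sketch shows. The hidden state and the LM Head row are made-up numbers, and `logit_contributions` is a hypothetical helper; any learned gain in the final normalization layer is ignored here.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def logit_contributions(hidden, lm_head_row):
    """Per-dimension terms s_d * w_d for one token's LM Head row."""
    s = normalize(hidden)
    return [s_d * w_d for s_d, w_d in zip(s, lm_head_row)]


hidden = [2.0, 4.0, 5.0]
row = [1.0, -0.5, 2.0]  # hypothetical LM Head row for the target token

contribs = logit_contributions(hidden, row)
for d, c in enumerate(contribs):
    verdict = "pushes toward" if c > 0 else "pulls away from"
    print(f"dimension {d} {verdict} the target token by {abs(c):.3f}")

# The contributions add up to the dot product of the direction and the row.
total = sum(contribs)
```

A positive term means the dimension votes for the target token and a negative term means it votes against; the magnitudes say by how much.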
Similarly, different model architectures, as long as they use normalization during training, can be explained using $s_i = v_i / \|v\|$. This includes image models, video models, and more. We can all understand "why the model thinks this way".
Go try it. Take your model apart, see where each layer moves on the sphere, see which directions each dimension pushes. You will find: the black box is far whiter than you think.
The black box is not called black because it is inexplicable, but simply because we have been focusing on magnitudes rather than directions. It is already white, just one normalization away from being transparent.