Transformer Attention’s High School Math Mistake

by Max Ma
22nd Mar 2025

Comment from Max Ma:

DeepSeek V3 mitigated this mistake unknowingly: in their MLA (Multi-head Latent Attention), K and V share the same nn.Linear.
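
For concreteness, here is a rough sketch of the shared-projection idea the comment points at: in a standard attention head, K and V each get their own nn.Linear, whereas in an MLA-style head both are derived from one shared linear projection. The layer names and dimensions below are illustrative assumptions, not DeepSeek V3's actual code.

```python
import torch
import torch.nn as nn

d_model, d_head, d_latent = 512, 64, 128
x = torch.randn(4, d_model)  # 4 token embeddings

# Standard attention head: K and V come from two independently learned
# projections, i.e. two different coordinate systems.
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)
k_std, v_std = W_k(x), W_v(x)

# MLA-style sketch: K and V are both derived from one shared nn.Linear,
# so they start from the same learned latent coordinates.
W_dkv = nn.Linear(d_model, d_latent, bias=False)  # shared by K and V
W_uk = nn.Linear(d_latent, d_head, bias=False)
W_uv = nn.Linear(d_latent, d_head, bias=False)
c = W_dkv(x)                     # one shared latent per token
k_mla, v_mla = W_uk(c), W_uv(c)
```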

Every data point in a neural network (inputs, weights, and biases) lives in a coordinate system; the numbers alone, without their coordinates, are almost meaningless. In attention, the inputs are mapped to Q, K, and V by linear transformations, which project them into different spaces with new coordinates. The attention score is then computed from the scaled dot product between Q and K after this transformation. However, similarity or distance measured in the transformed space does not necessarily reflect the true, original distance between the data points.
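
A small numerical sketch of this claim (the projection matrices below are random stand-ins for learned weights, and the dimensions are arbitrary): the cosine similarity between the projected Q and K generally differs from the cosine similarity between the same two vectors in the original embedding space.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_head = 64, 16

# Two token embeddings in the original space.
x1, x2 = torch.randn(d_model), torch.randn(d_model)

# Separate projections for Q and K (random weights stand in for learned ones).
W_q = torch.randn(d_head, d_model) / d_model ** 0.5
W_k = torch.randn(d_head, d_model) / d_model ** 0.5
q1, k2 = W_q @ x1, W_k @ x2

# Similarity in the original space vs. in the projected spaces.
orig_sim = F.cosine_similarity(x1, x2, dim=0).item()
proj_sim = F.cosine_similarity(q1, k2, dim=0).item()
print(f"original cosine similarity:  {orig_sim:.3f}")
print(f"projected cosine similarity: {proj_sim:.3f}")  # generally different
```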

Q, K, and V are each produced by their own linear transformation with different weights, so they are projected into distinct spaces with different coordinate systems. As a result, Q and K may end up closer together or farther apart than the original embeddings were, and the similarity computed after projection, whether by dot product or cosine similarity, does not accurately represent their true, original distance. This is a fundamental mathematical oversight.
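
For reference, a minimal single-head attention sketch of the structure described here, with three separate nn.Linear projections for Q, K, and V; the class name and dimensions are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """Minimal single-head attention: Q, K, and V each get their own weights."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_head, bias=False)
        self.W_k = nn.Linear(d_model, d_head, bias=False)
        self.W_v = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Three different projections -> three different coordinate systems.
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 5, 64)                # a batch of 5 token embeddings
out = SingleHeadAttention(64, 16)(x)     # shape (1, 5, 16)
```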

However, attention is still very powerful. Two critical aspects of attention, which make the Transformer so powerful, remain unrecognized by the AI community. This discovery is part of the development of Deep Manifold.

Reference

Q/A with ChatGPT 4o