Transformer Attention’s High School Math Mistake

by Max Ma
22nd Mar 2025

Comment from Max Ma:

DeepSeek V3 mitigated this mistake unknowingly: in their MLA (Multi-head Latent Attention), K and V share the same nn.Linear.
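
For concreteness, here is a rough sketch of the shared-projection idea the comment points at: in a standard attention head, K and V each get their own nn.Linear, whereas in an MLA-style head both are derived from one shared linear projection. The layer names and dimensions below are illustrative assumptions, not DeepSeek V3's actual code.

```python
import torch
import torch.nn as nn

d_model, d_head, d_latent = 512, 64, 128
x = torch.randn(4, d_model)  # 4 token embeddings

# Standard attention head: K and V come from two independently learned
# projections, i.e. two different coordinate systems.
W_k = nn.Linear(d_model, d_head, bias=False)
W_v = nn.Linear(d_model, d_head, bias=False)
k_std, v_std = W_k(x), W_v(x)

# MLA-style sketch: K and V are both derived from one shared nn.Linear,
# so they start from the same learned latent coordinates.
W_dkv = nn.Linear(d_model, d_latent, bias=False)  # shared by K and V
W_uk = nn.Linear(d_latent, d_head, bias=False)
W_uv = nn.Linear(d_latent, d_head, bias=False)
c = W_dkv(x)                     # one shared latent per token
k_mla, v_mla = W_uk(c), W_uv(c)
```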

Every data point in a neural network (inputs, weights, and biases) lives in a coordinate system; the numbers alone, without their coordinates, are almost meaningless. In attention, the inputs are mapped to Q, K, and V by linear transformations, which project them into different spaces with new coordinates. The attention score is then computed from the scaled dot product between Q and K after this transformation. However, similarity or distance measured in the transformed space does not necessarily reflect the true, original distance between the data points.
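
A small numerical sketch of this claim (the projection matrices below are random stand-ins for learned weights, and the dimensions are arbitrary): the cosine similarity between the projected Q and K generally differs from the cosine similarity between the same two vectors in the original embedding space.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, d_head = 64, 16

# Two token embeddings in the original space.
x1, x2 = torch.randn(d_model), torch.randn(d_model)

# Separate projections for Q and K (random weights stand in for learned ones).
W_q = torch.randn(d_head, d_model) / d_model ** 0.5
W_k = torch.randn(d_head, d_model) / d_model ** 0.5
q1, k2 = W_q @ x1, W_k @ x2

# Similarity in the original space vs. in the projected spaces.
orig_sim = F.cosine_similarity(x1, x2, dim=0).item()
proj_sim = F.cosine_similarity(q1, k2, dim=0).item()
print(f"original cosine similarity:  {orig_sim:.3f}")
print(f"projected cosine similarity: {proj_sim:.3f}")  # generally different
```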

Q, K, and V are each produced by their own linear transformation with different weights, so they are projected into distinct spaces with different coordinate systems. As a result, Q and K may end up closer together or farther apart than the original embeddings were, and the similarity computed after projection, whether by dot product or cosine similarity, does not accurately represent their true, original distance. This is a fundamental mathematical oversight.
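
For reference, a minimal single-head attention sketch of the structure described here, with three separate nn.Linear projections for Q, K, and V; the class name and dimensions are illustrative, not taken from any particular implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    """Minimal single-head attention: Q, K, and V each get their own weights."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.W_q = nn.Linear(d_model, d_head, bias=False)
        self.W_k = nn.Linear(d_model, d_head, bias=False)
        self.W_v = nn.Linear(d_model, d_head, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Three different projections -> three different coordinate systems.
        q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
        scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        return F.softmax(scores, dim=-1) @ v

x = torch.randn(1, 5, 64)                # a batch of 5 token embeddings
out = SingleHeadAttention(64, 16)(x)     # shape (1, 5, 16)
```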

However, attention is still very powerful. Two critical aspects of attention, which make the Transformer so powerful, remain unrecognized by the AI community. This discovery is part of the development of Deep Manifold.

Reference

Q/A with ChatGPT 4o