Max Ma

Comments
Transformer Attention’s High School Math Mistake
Max Ma · 6mo · 10

DeepSeek V3 mitigated this mistake unknowingly: in their MLA, K and V share the same nn.Linear.
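For readers unfamiliar with MLA, here is a minimal sketch of what "K and V share the same nn.Linear" means: both are read out of one shared low-rank latent. The names and dimensions below are illustrative assumptions, not DeepSeek's actual code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of an MLA-style joint KV projection (not DeepSeek's code):
# a single nn.Linear compresses the input into one latent, and both K and V
# are derived from that shared latent.
class SharedKVProjection(nn.Module):
    def __init__(self, d_model: int, d_latent: int, d_head: int):
        super().__init__()
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)  # shared by K and V
        self.k_up = nn.Linear(d_latent, d_head, bias=False)      # per-branch readout
        self.v_up = nn.Linear(d_latent, d_head, bias=False)

    def forward(self, x: torch.Tensor):
        c_kv = self.kv_down(x)  # one latent feeds both branches
        return self.k_up(c_kv), self.v_up(c_kv)

x = torch.randn(2, 16, 512)  # (batch, seq, d_model)
k, v = SharedKVProjection(512, 64, 128)(x)
print(k.shape, v.shape)  # both (2, 16, 128)
```

Because the down-projection is shared, whatever it does to K it does identically to V, which may be the sense in which the mistake gets mitigated without anyone intending it.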

AI4Science: The Hidden Power of Neural Networks in Scientific Discovery
Max Ma · 6mo · 10

Thanks... will look into it.

The ‘strong’ feature hypothesis could be wrong
Max Ma · 1y · 20

Firstly, the principle of 'no computation without representation' holds true. How strong a representation is depends on the specific computational task and on the network architecture, such as a Transformer. For example, a Transformer solving a simple, low-dimensional linear problem should form a strong representation; for a high-order, high-dimensional nonlinear problem, the representation may be weaker.
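As a toy illustration of this contrast (my own sketch, not from the post: the tasks, network sizes, and the linear-probe criterion are all assumptions, and a small MLP stands in for a Transformer), one can train the same architecture on a linear and on a nonlinear task, then linearly probe the hidden layer for the target. Lower probe error means the feature is represented more 'strongly' in the linear-readout sense.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def probe_error(target_fn, d_in: int, steps: int = 2000) -> float:
    """Train a small MLP on a task, then fit a closed-form linear probe
    from its hidden layer to the target; lower error = a 'stronger'
    (more linearly accessible) representation of that feature."""
    X = torch.randn(4096, d_in)
    y = target_fn(X)
    net = nn.Sequential(nn.Linear(d_in, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(net(X).squeeze(-1), y).backward()
        opt.step()
    with torch.no_grad():
        h = torch.relu(net[0](X))  # hidden-layer activations
        w = torch.linalg.lstsq(h, y.unsqueeze(-1)).solution  # least-squares probe
        return nn.functional.mse_loss(h @ w, y.unsqueeze(-1)).item()

# Low-dimensional linear task: probe error should be near zero.
print(probe_error(lambda X: X[:, 0], d_in=4))
# Higher-dimensional nonlinear task: probe error is typically much larger.
print(probe_error(lambda X: torch.sin(3 * X[:, 0]) * torch.sin(3 * X[:, 1]), d_in=16))
```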

The neural network operates as a power-efficient system: each node requires minimal computational power, and all foundation-model pre-training is self-supervised. The network's self-progressing boundary condition places no restriction on where incoming data is processed; data is routed to whichever nodes are capable of handling it. The same token is therefore processed at different nodes, and many replicas of identical or near-identical feature bits (units of a feature) are likely dispersed throughout the network.

Mathematical inequality suggests that connections between nodes (pathways) are not equal. Our working theory is that feature bits propagate through the network, with their propagation distance determined by each node's computational capacity. The pathway appears to be power-driven, prioritizing certain features or patterns during learning in a discriminatory manner. While this discriminative feature pathway (DFP) is mathematically plausible, the underlying theory remains unclear. It seems that neural networks are leading us into the realm of bifurcation theory.

Posts

-13 · Transformer Attention's High School Math Mistake · 6mo · 1
2 · AI4Science: The Hidden Power of Neural Networks in Scientific Discovery · 6mo · 2
-10 · Neural Network And Newton's Second Law · 1y · 0