[ Question ]

Have we seen any "ReLU instead of sigmoid"-type improvements recently?

by KvmanThinking
23rd Nov 2024
1 min read

I read this, and it said:

"there are huge low hanging fruit that any AI or random person designing AI in their garage can find by just grasping in the dark a bit, to get huge improvements at accelerating speeds."

Have we found anything like this, at all? Have we seen any "weird tricks" discovered that make AI way more powerful for no reason?

1 answer, sorted by top scoring

Marcus Williams

Nov 23, 2024


I'm not sure if these would be classed as "weird tricks", and I definitely think they have reasons for working, but some recent architecture changes which one might not expect to work a priori include the following (rough sketches of each follow the list):

  • SwiGLU: Combines a gating mechanism and an activation function with learnable parameters.
  • Grouped Query Attention: Uses fewer Key and Value heads than Query heads.
  • RMSNorm: Layernorm but without the translation.
  • Rotary Position Embeddings: Rotates the query and key vectors by position-dependent angles to give them positional information.
  • Quantization: Fewer bits per weight without much drop in performance.
  • Flash Attention: More efficient attention computation through better memory management.
  • Various sparse attention schemes
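
To make the list concrete, here is a minimal PyTorch sketch of the SwiGLU-style feed-forward block; the class name, argument names, and the bias-free choice are illustrative assumptions rather than a quote of any particular codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU ("swish") of one projection gates a
    second projection elementwise; a third projection maps back to d_model."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The learnable gate is what distinguishes this from a plain MLP
        # with a fixed activation.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

For example, `SwiGLU(512, 1376)(torch.randn(2, 16, 512))` returns a tensor of shape (2, 16, 512).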
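Grouped-query attention can be sketched in a few lines for a single unbatched sequence; the function name and tensor shapes here are assumptions for illustration.

```python
import torch

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: (seq, n_q_heads, d_head); k, v: (seq, n_kv_heads, d_head),
    with n_q_heads an integer multiple of n_kv_heads."""
    group = q.shape[1] // k.shape[1]
    # Each K/V head is shared by `group` query heads, so the K/V projections
    # (and the KV cache at inference time) are `group` times smaller.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)
```

With one K/V head this reduces to multi-query attention; with as many K/V heads as query heads it is ordinary multi-head attention.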
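RMSNorm, "LayerNorm without the translation", is short enough to write out in full; the epsilon value is a typical but assumed default.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescale by the root-mean-square of the features; unlike LayerNorm
        # there is no mean subtraction and no learned bias.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * x / rms
```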
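A sketch of rotary position embeddings, applied to the query and key vectors before the attention dot product. This uses the "split the channels in half" pairing convention; real implementations differ in such details, so treat it as an illustration of the idea only.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, d) with d even. Each channel pair (x[:, i], x[:, i + d//2])
    is rotated by an angle proportional to the token position, so dot products
    between rotated queries and keys depend on their relative positions."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * freqs   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2-D rotation applied to every (x1, x2) channel pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```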
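Quantization covers many schemes; the most naive version, symmetric per-tensor int8 with a single float scale, is enough to show why it saves memory. The function names here are my own.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Store weights in 8 bits plus one float32 scale per tensor."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)
q, scale = quantize_int8(w)                          # ~4x smaller than float32
print((w - dequantize_int8(q, scale)).abs().max())   # small rounding error
```

Production schemes (per-channel scales, 4-bit formats, quantization-aware rounding) are more sophisticated, but the memory argument is the same.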
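FlashAttention itself is a fused GPU kernel, so the code below is not it; it only sketches the underlying algorithmic idea, an online softmax over key/value blocks that avoids materialising the full seq x seq score matrix (single head, no causal mask, names assumed).

```python
import torch

def blockwise_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        block_size: int = 64) -> torch.Tensor:
    """q, k, v: (seq, d). Processes K/V in blocks, keeping a running max and
    running normaliser so only O(seq * block_size) scores exist at a time."""
    seq, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    running_max = torch.full((seq, 1), float("-inf"))
    running_sum = torch.zeros(seq, 1)
    for start in range(0, seq, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                    # (seq, block)
        new_max = torch.maximum(running_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(running_max - new_max)     # rescale old accumulators
        p = torch.exp(scores - new_max)
        out = out * correction + p @ v_blk
        running_sum = running_sum * correction + p.sum(dim=-1, keepdim=True)
        running_max = new_max
    return out / running_sum
```

The result matches ordinary softmax attention; the speed and memory win comes from doing this inside a single kernel with the blocks held in fast on-chip memory.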
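Finally, one concrete example of a sparse attention scheme: a causal sliding window where each token attends only to its `window` most recent positions (including itself). The mask below would be applied to the attention scores, and the function name is assumed.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask, True where attention is allowed."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).int())  # banded lower-triangular pattern
```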
KvmanThinking

How were these discovered? Slow, deliberate thinking, or someone trying some random thing to see what it does and suddenly the AI is a zillion times smarter?

Marcus Williams
"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." -SwiGLU paper. I think it varies, a few of these are trying "random" things, but mostly they are educated guesses which are then validated empirically. Often there is a spefic problem we want to solve i.e. exploding gradients or O(n^2) attention and then authors try things which may or may not solve/mitigate the problem.
1 comment, sorted by top scoring
ZY

As an aside, could you elaborate on why you think "ReLU better than sigmoid" is a "weird trick", if that is what this question implies?

The commonly agreed reason, as I understand it, is that ReLU helps with the vanishing gradient problem (this can be seen from the two activation functions' graphs: sigmoid's gradient shrinks toward zero away from the origin, while ReLU's stays at 1 for positive inputs).
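
For what it's worth, the graph argument can be checked numerically with a minimal sketch:

```python
import torch

# The sigmoid's derivative is sigmoid(x) * (1 - sigmoid(x)), which peaks at
# 0.25 and decays toward 0 away from the origin, so repeated sigmoid layers
# shrink gradients; ReLU's derivative is exactly 1 for all positive inputs.
x = torch.linspace(-6, 6, steps=1001)
sigmoid_grad = torch.sigmoid(x) * (1 - torch.sigmoid(x))
relu_grad = (x > 0).float()
print(sigmoid_grad.max())  # tensor(0.2500)
print(relu_grad.max())     # tensor(1.)
```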
