[ Question ]

Have we seen any "ReLU instead of sigmoid"-type improvements recently?

by KvmanThinking
23rd Nov 2024
1 min read

I read this, and it said:

"there are huge low hanging fruit that any AI or random person designing AI in their garage can find by just grasping in the dark a bit, to get huge improvements at accelerating speeds."

Have we found anything like this, at all? Have we seen any "weird tricks" discovered that make AI way more powerful for no reason?

1 answer, sorted by top scoring

Marcus Williams

Nov 23, 2024


I'm not sure if these would be classed as "weird tricks", and I definitely think they have reasons for working, but some recent architecture changes which one might not expect to work a priori include the following (rough sketches of each follow the list):

  • SwiGLU: Combines a gating mechanism and an activation function with learnable parameters.
  • Grouped Query Attention: Uses fewer Key and Value heads than Query heads.
  • RMSNorm: Layernorm but without the translation.
  • Rotary Position Embeddings: Rotates the query and key vectors by position-dependent angles to give them positional information.
  • Quantization: Fewer bits per weight without much drop in performance.
  • Flash Attention: More efficient attention computation through better memory management.
  • Various sparse attention schemes
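
To make the list concrete, here is a minimal PyTorch sketch of the SwiGLU-style feed-forward block; the class name, argument names, and the bias-free choice are illustrative assumptions rather than a quote of any particular codebase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward block: SiLU ("swish") of one projection gates a
    second projection elementwise; a third projection maps back to d_model."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The learnable gate is what distinguishes this from a plain MLP
        # with a fixed activation.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```

For example, `SwiGLU(512, 1376)(torch.randn(2, 16, 512))` returns a tensor of shape (2, 16, 512).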
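Grouped-query attention can be sketched in a few lines for a single unbatched sequence; the function name and tensor shapes here are assumptions for illustration.

```python
import torch

def grouped_query_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """q: (seq, n_q_heads, d_head); k, v: (seq, n_kv_heads, d_head),
    with n_q_heads an integer multiple of n_kv_heads."""
    group = q.shape[1] // k.shape[1]
    # Each K/V head is shared by `group` query heads, so the K/V projections
    # (and the KV cache at inference time) are `group` times smaller.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = torch.einsum("qhd,khd->hqk", q, k) / q.shape[-1] ** 0.5
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v)
```

With one K/V head this reduces to multi-query attention; with as many K/V heads as query heads it is ordinary multi-head attention.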
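RMSNorm, "LayerNorm without the translation", is short enough to write out in full; the epsilon value is a typical but assumed default.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Rescale by the root-mean-square of the features; unlike LayerNorm
        # there is no mean subtraction and no learned bias.
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.scale * x / rms
```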
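A sketch of rotary position embeddings, applied to the query and key vectors before the attention dot product. This uses the "split the channels in half" pairing convention; real implementations differ in such details, so treat it as an illustration of the idea only.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """x: (seq_len, d) with d even. Each channel pair (x[:, i], x[:, i + d//2])
    is rotated by an angle proportional to the token position, so dot products
    between rotated queries and keys depend on their relative positions."""
    seq_len, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)      # (half,)
    angles = torch.arange(seq_len, dtype=x.dtype)[:, None] * freqs   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Standard 2-D rotation applied to every (x1, x2) channel pair.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```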
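Quantization covers many schemes; the most naive version, symmetric per-tensor int8 with a single float scale, is enough to show why it saves memory. The function names here are my own.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Store weights in 8 bits plus one float32 scale per tensor."""
    scale = w.abs().max() / 127.0
    q = torch.clamp((w / scale).round(), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)
q, scale = quantize_int8(w)                          # ~4x smaller than float32
print((w - dequantize_int8(q, scale)).abs().max())   # small rounding error
```

Production schemes (per-channel scales, 4-bit formats, quantization-aware rounding) are more sophisticated, but the memory argument is the same.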
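FlashAttention itself is a fused GPU kernel, so the code below is not it; it only sketches the underlying algorithmic idea, an online softmax over key/value blocks that avoids materialising the full seq x seq score matrix (single head, no causal mask, names assumed).

```python
import torch

def blockwise_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                        block_size: int = 64) -> torch.Tensor:
    """q, k, v: (seq, d). Processes K/V in blocks, keeping a running max and
    running normaliser so only O(seq * block_size) scores exist at a time."""
    seq, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    running_max = torch.full((seq, 1), float("-inf"))
    running_sum = torch.zeros(seq, 1)
    for start in range(0, seq, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                    # (seq, block)
        new_max = torch.maximum(running_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(running_max - new_max)     # rescale old accumulators
        p = torch.exp(scores - new_max)
        out = out * correction + p @ v_blk
        running_sum = running_sum * correction + p.sum(dim=-1, keepdim=True)
        running_max = new_max
    return out / running_sum
```

The result matches ordinary softmax attention; the speed and memory win comes from doing this inside a single kernel with the blocks held in fast on-chip memory.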
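Finally, one concrete example of a sparse attention scheme: a causal sliding window where each token attends only to its `window` most recent positions (including itself). The mask below would be applied to the attention scores, and the function name is assumed.

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean (seq_len, seq_len) mask, True where attention is allowed."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, 3).int())  # banded lower-triangular pattern
```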
KvmanThinking

How were these discovered? Slow, deliberate thinking, or someone trying some random thing to see what it does and suddenly the AI is a zillion times smarter?

Marcus Williams
"We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence." -SwiGLU paper. I think it varies, a few of these are trying "random" things, but mostly they are educated guesses which are then validated empirically. Often there is a spefic problem we want to solve i.e. exploding gradients or O(n^2) attention and then authors try things which may or may not solve/mitigate the problem.
1 comment, sorted by top scoring
ZY

As an aside, could you elaborate on why you think "ReLU better than sigmoid" is a "weird trick", if that is what this question implies?

The commonly agreed reason, as I understand it, is that ReLU helps with the vanishing gradient problem (this can be seen from the two activation functions' graphs: sigmoid's gradient shrinks toward zero away from the origin, while ReLU's stays at 1 for positive inputs).
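
For what it's worth, the graph argument can be checked numerically with a minimal sketch:

```python
import torch

# The sigmoid's derivative is sigmoid(x) * (1 - sigmoid(x)), which peaks at
# 0.25 and decays toward 0 away from the origin, so repeated sigmoid layers
# shrink gradients; ReLU's derivative is exactly 1 for all positive inputs.
x = torch.linspace(-6, 6, steps=1001)
sigmoid_grad = torch.sigmoid(x) * (1 - torch.sigmoid(x))
relu_grad = (x > 0).float()
print(sigmoid_grad.max())  # tensor(0.2500)
print(relu_grad.max())     # tensor(1.)
```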
