I'm low confidence on this, but I'd be pretty skeptical of a 10x from optimizations. FlashAttention gives roughly a 3x on attention itself (and it's less of a big deal on older hardware), but the rest of a transformer is already fairly efficient large matmuls. In general, modern GPUs see lower utilization with each passing generation, and GPT-2 was trained on TPUs, which famously have much higher utilization (~95% vs. ~60% on a pure matmul). That said, it's possible the whole thing was done so inefficiently that a 10x was there for the taking.
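To make the intuition concrete, here's an Amdahl's-law back-of-envelope. The numbers are illustrative assumptions, not measurements: suppose attention is ~25% of training FLOPs at GPT-2-style context lengths, and FlashAttention gives ~3x on just that portion.

```python
# Back-of-envelope: overall speedup when only the attention portion is accelerated.
# attn_frac (~0.25) and attn_speedup (~3x) are assumed, illustrative numbers.

def overall_speedup(attn_frac: float, attn_speedup: float) -> float:
    """Amdahl's law: the non-attention fraction runs at the old speed,
    the attention fraction runs attn_speedup times faster."""
    return 1.0 / ((1.0 - attn_frac) + attn_frac / attn_speedup)

print(f"{overall_speedup(0.25, 3.0):.2f}x")  # ~1.2x end-to-end, nowhere near 10x
```

Even a 3x attention kernel only buys ~1.2x end-to-end under these assumptions, which is why a 10x would have to come from the whole pipeline being inefficient, not from any single kernel.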