[ Question ]

Is the output of the softmax in a single transformer attention head usually winner-takes-all?

by Linda Linsefors
27th Jan 2025
1 min read
Using the notation from A Mathematical Framework for Transformer Circuits:

The attention pattern for a single attention head is given by $A = \text{softmax}(x^T W_Q^T W_K x)$, where the softmax is computed over each row of $x^T W_Q^T W_K x$.

Each row of $A$ gives the attention pattern for the current token. Are these rows (post-softmax) typically close to one-hot? I.e., are they mostly dominated by a single attention weight (per current token)?

I'm interested in knowing this for various types of transformers, but mainly for LLMs and/or frontier models.

I'm asking because I think this has implications for computations in superposition.
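
To make the quantity concrete, here is a minimal sketch that computes $A$ for one head with random weights and reports the largest weight in each row; values near 1 would mean the row is close to one-hot. The shapes, names, and causal mask here are illustrative assumptions, not anything from the post.

```python
# Minimal sketch: compute A = softmax(x^T W_Q^T W_K x) for one head with
# random weights (shapes and names are illustrative) and check how close
# each row is to one-hot.
import torch

d_model, d_head, n_tokens = 64, 16, 10

x = torch.randn(d_model, n_tokens)               # residual stream, one column per token
W_Q = torch.randn(d_head, d_model) / d_model**0.5
W_K = torch.randn(d_head, d_model) / d_model**0.5

scores = x.T @ W_Q.T @ W_K @ x                   # (n_tokens, n_tokens) attention scores
causal = torch.triu(torch.ones(n_tokens, n_tokens), diagonal=1).bool()
A = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)  # softmax over each row

max_per_row = A.max(dim=-1).values               # near 1.0 => near one-hot ("winner-takes-all")
print(max_per_row)
```

(With the causal mask, the first row is trivially 1.0, since the first token can only attend to itself.)
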
1 Answer, sorted by top scoring

Buck

Jan 27, 2025


IIRC, for most attention heads the max attention is way less than 90%, so my answer is "no". It should be very easy to get someone to make a basic graph of this for you.
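
For what it's worth, a rough version of that graph only takes a few lines. The sketch below is one way to do it; the model (GPT-2 small via Hugging Face transformers), the prompt, and the plotting choices are illustrative assumptions, not anything Buck specified. It histograms the largest attention weight in each softmax row, across every head and layer.

```python
# Rough sketch of the suggested graph: histogram of the largest attention
# weight in each softmax row, across all heads and layers of GPT-2 small.
# (Model, prompt, and plotting choices are illustrative assumptions.)
import matplotlib.pyplot as plt
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The quick brown fox jumps over the lazy dog. " * 20
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions is a tuple with one (batch, n_heads, seq, seq) tensor per layer.
max_per_row = torch.cat([a.max(dim=-1).values.flatten() for a in out.attentions])

plt.hist(max_per_row.numpy(), bins=50)
plt.xlabel("max attention weight in a row (per query token, head, layer)")
plt.ylabel("count")
plt.title("How close to one-hot are GPT-2 attention rows?")
plt.show()
```

Note that the first query position always attends only to itself, so a small spike at exactly 1.0 is expected regardless of how the rest of the distribution looks.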
