Is the output of the softmax in a single transformer attention head usually winner-takes-all? — LessWrong
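One way to make the question concrete is to compute the softmax over a vector of attention logits and check how much probability mass the largest entry captures. This is a minimal sketch, not taken from any real model: the logit values below are invented for illustration, standing in for scaled query-key dot products over 8 key positions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: subtract the max before exponentiating.
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical attention logits (q.k / sqrt(d)) for one query over 8 keys.
logits = np.array([2.0, 0.5, 0.1, -0.3, 0.0, 1.2, -1.0, 0.4])
weights = softmax(logits)

# "Winner-takes-all" would mean the largest weight is near 1.0;
# with moderately spread logits like these, it is well below that.
print(weights.max())   # fraction of attention on the single top key
print(weights.sum())   # softmax weights always sum to 1
```

With logits of this scale the top key gets well under half the attention mass; winner-takes-all behavior would require the top logit to dominate the others by several units, since softmax concentration grows exponentially in the logit gap.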