To what extent are the scaling properties of Transformer networks exceptional? — LessWrong