Thanks for the nice blog!
A few years ago, I published [a paper](https://arxiv.org/pdf/1902.09191) conveying a very similar message -- likely tokens become more likely in trained SLMs. Now, with the success of LLMs, I think this might help explain why scaling laws have worked so well so far. Basically, SLMs transform the context into the next token through several NN layers, many of which are linear, so it's not surprising that SLMs catch some obvious patterns, like which words are more common than others. As LMs get deeper and deeper, they catch the...
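
To make the claim a bit more concrete, here is a minimal sketch (not from the paper or the blog; the model choice "gpt2", the toy corpus, and the use of Spearman correlation are all illustrative assumptions) of how one might check whether more frequent tokens tend to receive higher predicted probabilities from a trained LM:

```python
# Sketch: correlate a token's corpus frequency with the probability a trained
# LM assigns to it when it is the true next token. A positive correlation is
# consistent with "likely tokens become more likely" after training.
from collections import Counter

import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Toy corpus standing in for real evaluation text (an assumption for brevity).
corpus = [
    "The cat sat on the mat.",
    "The weather today is sunny and warm.",
    "Neural language models predict the next token from context.",
]

# Unigram counts over the toy corpus, used as a stand-in for token frequency.
all_ids = [tid for s in corpus for tid in tokenizer(s)["input_ids"]]
freq = Counter(all_ids)

freqs, probs = [], []
with torch.no_grad():
    for sentence in corpus:
        ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
        logits = model(ids).logits  # shape: (1, seq_len, vocab)
        # Probability the model assigns to each actual next token.
        next_probs = torch.softmax(logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        for pos, tok in enumerate(targets):
            freqs.append(freq[tok.item()])
            probs.append(next_probs[pos, tok].item())

rho, _ = spearmanr(freqs, probs)
print(f"Spearman correlation (frequency vs. predicted probability): {rho:.3f}")
```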