Thanks for the nice blog!
A few years ago, I published [a paper](https://arxiv.org/pdf/1902.09191) conveying a very similar message -- likely tokens become more likely in trained SLMs. Now, with the success of LLMs, I think this might be part of why the scaling law has worked so well so far. Basically, an SLM transforms the context into the next token through several NN layers, many of which are linear, so it's not surprising that it picks up obvious patterns such as which words are more common than others. As LMs get deeper, they capture the subtlety and semantics of the context much better, so they no longer follow the frequency pattern so straightforwardly. It would be great to test the impact of model size and data size in experiments, but of course it's hard to make a fairly controlled comparison.
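
For anyone who wants to poke at the frequency claim themselves, here is a minimal sketch (my own illustration, not code from the paper or the blog post) that checks how strongly a pretrained model's average predicted token probability tracks unigram frequency on a text sample. The choice of `gpt2`, the placeholder text, and Spearman correlation as the metric are all assumptions for the example.

```python
# Sketch: does a trained LM's average predicted probability per token
# track the empirical unigram frequency of those tokens?
from collections import Counter

import torch
from scipy.stats import spearmanr
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # assumed model choice
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Replace with a held-out corpus sample of your own.
text = "Replace this with a held-out corpus sample of your own."
ids = tokenizer(text, return_tensors="pt")["input_ids"]

with torch.no_grad():
    logits = model(ids).logits                      # (1, seq_len, vocab)
probs = logits.softmax(-1).mean(dim=(0, 1))         # avg predicted prob per vocab token

# Empirical unigram frequency of the tokens that appear in the sample.
counts = Counter(ids[0].tolist())
token_ids = list(counts)
freq = torch.tensor([counts[t] for t in token_ids], dtype=torch.float)
freq /= freq.sum()

rho, _ = spearmanr(freq.numpy(), probs[token_ids].numpy())
print(f"Spearman correlation between unigram frequency and avg predicted prob: {rho:.3f}")
```

A rough expectation, following the comment above, would be a noticeably positive correlation for small or shallow models and a weaker one for larger, deeper models, though a single short sample like this is only suggestive, not a controlled comparison.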