ShaojieJiang

Comments
Shorter Tokens Are More Likely
ShaojieJiang · 1mo · 30

Thanks for the nice blog post!

A few years ago, I published [a paper](https://arxiv.org/pdf/1902.09191) conveying a very similar message: likely tokens become even more likely in trained SLMs. Now, with the success of LLMs, I think this might be part of why scaling laws have worked so well so far. Basically, SLMs transform the context into the next token through several NN layers, many of which are linear, so it's not surprising that they pick up obvious patterns such as which words are more common than others. As LMs get deeper, they capture the subtleties and semantics of the context much better, so they no longer straightforwardly follow the frequency pattern. It would be great to test the impact of model size and data size experimentally, but of course it's hard to set up a fair, controlled comparison.
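
For concreteness, here is a minimal sketch (my own, not the method from the paper or the post) of one way to check whether a trained LM's average next-token distribution over-weights frequent tokens. The choice of `gpt2` and the tiny text sample are placeholder assumptions:

```python
# Sketch: compare corpus unigram frequencies with a model's average predicted
# token probabilities. Model and texts below are illustrative assumptions.
from collections import Counter

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any small causal LM works for this sanity check
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

# Assumption: a small sample of text stands in for (in-domain) training data.
texts = [
    "Language models assign probabilities to the next token given the context.",
    "Frequent words such as 'the' and 'of' appear in almost every sentence.",
]

# Empirical unigram frequency of each token id in the sample.
counts = Counter()
for t in texts:
    counts.update(tokenizer(t)["input_ids"])
total = sum(counts.values())
empirical = {i: c / total for i, c in counts.items()}

# Average probability the model assigns to each vocabulary token across positions.
avg_probs = torch.zeros(model.config.vocab_size)
n_positions = 0
with torch.no_grad():
    for t in texts:
        ids = tokenizer(t, return_tensors="pt")["input_ids"]
        probs = model(ids).logits.softmax(-1).squeeze(0)  # (seq_len, vocab)
        avg_probs += probs.sum(0)
        n_positions += probs.shape[0]
avg_probs /= n_positions

# If likely tokens are made "more likely", the model's average probability for
# the most frequent tokens should tend to exceed their empirical frequency.
for token_id, freq in sorted(empirical.items(), key=lambda kv: -kv[1])[:10]:
    print(repr(tokenizer.decode([token_id])), round(freq, 4),
          round(avg_probs[token_id].item(), 4))
```

Running the same comparison across checkpoints of different depths or data sizes would be one way to probe the size/data question, though, as noted above, a truly controlled comparison is hard to arrange.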
