Shorter Tokens Are More Likely
I was thinking about LLM tokenization (as one does) and had a thought: we select the next output token for an LLM based on its likelihood, but (some) shorter tokens are more likely.

Why? Longer tokens can only complete one word, but some shorter tokens can complete many words. Those shorter, common tokens[1] are (correctly) learned to be higher-probability because they carry the combined probability of every word they could complete. However, standard generation techniques only consider a subset of probabilities (top-K) and rescale the largest probabilities (temperature). Both of these take the highest probabilities and increase them further, meaning short/common tokens become significantly more likely to be generated just because they're shorter.

I ran an experiment to investigate this, showing that the first-character distribution of words generated by nanoGPT[2] is similar regardless of tokenization when no top-K or temperature scaling is used, but with common settings (top-K=200 and temperature=0.8) we can increase the likelihood that a word starts with 'c' from 4% up to 10% just by tokenizing the 'c' at the start of words separately from the rest of the word. I found similar but stronger effects when training on a tiny Shakespeare corpus.

This effect should also appear when a particular word is a common start to phrases. Techniques like beam search should help, but I don't think anyone does that in user-facing frontier models. I have a theory that this is part of the explanation for why AIs all seem to like the same words and have a similar style.

Why?

Feel free to skip to the experiment if this is all obvious.

Say you have a text corpus made up of conversations between Alice, Bob, Blake, Charles and Cathryn, and they're all equally likely to speak. There are 5 names, so in a model where you tokenize each name separately, the next-token probability for any name will be 20%. If you tokenize the first letter separately (your vocab is ["A", "-lice", "B", "-ob", "-lake", "C", "-harles", "-athryn"]), then "B" and "C" each get a next-token probability of 40%, since each can begin two of the names, while "A" stays at 20%.
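To make the amplification concrete, here is a minimal sketch (plain Python, not the experiment code from this post) of how temperature rescaling treats the two tokenizations differently. The numbers are just the toy distributions above, and apply_temperature is a hypothetical helper, not anything from nanoGPT.

```python
# Minimal sketch: temperature sharpening on the toy name-corpus distributions.
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution by temperature (T < 1 sharpens it)."""
    logits = {tok: math.log(p) for tok, p in probs.items()}
    scaled = {tok: math.exp(l / temperature) for tok, l in logits.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}

# Whole-name tokenization: every name is a single token, all equally likely.
per_name = {name: 0.2 for name in ["Alice", "Bob", "Blake", "Charles", "Cathryn"]}

# First-letter tokenization: "B" and "C" each absorb the probability of two names.
per_letter = {"A": 0.2, "B": 0.4, "C": 0.4}

print(apply_temperature(per_name, 0.8))    # stays uniform: 0.2 each
print(apply_temperature(per_letter, 0.8))  # roughly {"A": 0.17, "B": 0.41, "C": 0.41}
```

With a vocabulary this tiny, top-K never kicks in, but in a realistic vocabulary it drops the rare longer tokens entirely, and the renormalization hands their probability mass to the short, common tokens on top of the temperature effect shown here.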