So, I had a hypothesis last night that training on a different scoring rule might solve this problem (because it could encourage uncertain probabilities to be lower, and thus it would be easier to filter them out without making short tokens undeservedly more likely).
I ended up forking your code, and this morning I trained an LLM on the Shakespeare dataset using the α-ReLU loss (from the paper "Speeding Up Entmax"). The α-ReLU loss is a proper scoring rule based on the Tsallis entropy.
My results were the following:
Letter | Scoring Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1 |
---|---|---|---|---|---|---|
c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26% |
c | α-ReLU | 200 | 3.41% | 4.66% | 4.75% | 4.33% |
c | α-ReLU | ∞ | 3.41% | 3.87% | 4.00% | 3.49% |
All models were trained for 1500 iterations. The control was trained without the special C-tokenizer, while the test models were trained with it. In the paper, α-ReLU is parameterized by α, which controls the Tsallis exponent, and by a second parameter that is a constant shift in the ReLU; I used the defaults from the α-ReLU Python library for both in all the experiments.
The α-ReLU-trained neural network does NOT seem to exhibit the same trend of lower temperatures leading to a higher probability of words starting with "c". And when the top-k is set to ∞, most of the difference between it and the control also disappears.
So, perhaps we should just be training with a different scoring rule!
You can find my fork at https://github.com/cooljoseph1/nanoGPT-tokenizer-experiment/.
EDIT: I realized I was calculating temperature unfairly for the α-ReLU sampling. After fixing it, the α-ReLU model is actually worse than the cross-entropy trained network:
Letter | Scoring Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1 |
---|---|---|---|---|---|---|
c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26% |
c | α-ReLU | 200 | 3.41% | 4.66% | 7.54% | 31.66% |
c | α-ReLU | ∞ | 3.41% | 3.87% | 5.94% | 31.66% |
Thanks for trying this! I wonder if this is making things worse in a similar way to top-k. The C-tokenizer makes it very likely that "c" is always in the top 200 tokens. I wonder if it's also ensuring that it's rarely sufficiently uncertain to be filtered by this scoring rule?
Shorter common tokens are (correctly) learned to be higher-probability because they have the combined probability of any word they could complete.
I don't think this is right. LLMs are trained with teacher forcing on a fixed tokenization. For common words the tokenizer provides a single long token (e.g., “ the”, “ and”, “ cat”). The model is trained to put probability on that token—not on a shorter prefix like “ c”. So a short token does not “inherit the combined probability of any word it could complete.” Those completions were usually seen as longer tokens during training.
You can check this by running inference on an LM; if you ask the model to complete "hello" it will put " world" above " " even though both are typically tokens.
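A quick way to sanity-check the single-token claim (a sketch assuming the HuggingFace transformers library and GPT-2's vocabulary; other BPE vocabularies behave similarly):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for word in [" the", " and", " cat", " world"]:
    ids = tok.encode(word)
    print(repr(word), ids)  # each of these encodes to a single BPE token
```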
On a quick skim, I think that the rest of your arguments are plausible.
You can see in the experiment section that I'm not saying LLMs normally do this with words like "cat". I actually changed to tokenizer to force that situation to prove that the effect would appear like I was expecting.
I was a little imprecise about shorter tokens, but I'm talking about if the tokenizer causes some set of words or phrases to be canonically tokenized with shorter tokens. So, if the tokenizer decides that the un- or non- prefix should be separate tokens, those shorter tokens will be more likely than if the tokenizer had longer tokens for every un- and non- word.
You're right that I should probably explain this better in the intro though.
Shouldn't this be generally "likely tokens are even more likely"? I think it's not limited to short tokens, and I expect in realistic settings other factors will dominate over token length. But I agree that top-k (or top-p) sampling should lead to a miscalibration of LLM outputs in the low-probability tail.
I suspect this has something to do with "LLM style". LLMs may be pushed to select "slop" words because those words have more possible endings, even if none of those endings are the best one.
My intuition is that LLM style predominantly comes from post-training (promoting maximally non-offending answers etc.) rather than due to top-k/p sampling. (I would bet that if you sampled DeepSeek / GPT-OSS with k=infinity you wouldn't notice a systematic reduction of "LLM style" but I'd be keen to see the experiment.)
Thanks for the writeup, I appreciated the explanations and especially the Alice/Bob/Blake example!
Shouldn't this be generally "likely tokens are even more likely"?
I thought focusing on short tokens would be interesting since "make likely tokens more likely" is just temperature scaling doing its job, but I think the interaction with token length is surprising.
Fun fact: This was originally part of a post complaining about teacher forcing, but it turns out that this situation is actually worse without teacher forcing. If you force an LLM to keep predicting using its own outputs, picking a token with more potential continuations is better for the loss of the current token and the next tokens.
It may depend on the RL algorithm, but I would not expect most RL to have this issue to first order if the RL algorithm produces its rollouts by sampling from the full untruncated distribution at temperature 1.
The issue observed by the OP is a consequence of the fact that typically if you are doing anything other than untruncated sampling at temperature 1, then your sampling is not invariant between, e.g. "choose one of three options: a, b, or c" and "choose one of two options: a, or (choose one of two options: b or c)".
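For concreteness, here's a minimal numeric sketch of that non-invariance (plain Python, reusing the 20%/40%/40% split from the post; raising probabilities to the power 1/T and renormalizing is the same as dividing logits by T before the softmax):

```python
def temper(probs, t):
    """Apply temperature t to a categorical distribution and renormalize."""
    scaled = [p ** (1.0 / t) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

T = 0.8

# Flat: choose one of three options a, b, c with probabilities 20/40/40.
flat = temper([0.2, 0.4, 0.4], T)
print("flat   P(a) =", round(flat[0], 3))                  # ~0.174

# Nested: first choose between a (20%) and "b or c" (80%), then split 50/50.
stage1 = temper([0.2, 0.8], T)
stage2 = temper([0.5, 0.5], T)                             # stays 50/50
print("nested P(a) =", round(stage1[0], 3))                # ~0.150
print("nested P(b) =", round(stage1[1] * stage2[0], 3))    # ~0.425
```

Untruncated sampling at temperature 1 gives the same answer either way; any other setting does not.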
However, many typical on-policy RL algorithms fundamentally derive from sampling/approximation of theorems/algorithms where running one step of the theoretical idealized policy update looks more like:
"Consider the space of possible complete output sequences S, and consider sum_{s in S} P(s) Reward(s). Update model parameters one step in the direction that most steeply overall increases this quantity".
By itself, this idealized update is invariant to tokenization, because it's expressed only in terms of complete outputs. Tokenization does come in insofar as it affects the gradient steepness of the policy in different directions of possible generalization and what parts of the space are explored and on which the approximated/sampled update occurs, etc.
Note that the typical mechanism by which RL tends towards entropy decrease and/or mode collapse is also well-explained by the above and does not need any involvement from tokenization. Indeed, consider just applying the above idealized update repeatedly. The model will continue sharpening, trying to push ever more of the probability mass onto only the sequences s for which Reward(s) is maximal or near-maximal, and pushing the probability of every other completed sequence to zero. If your reward function (from RLHF or whatever) has any preference for outputs of a given length or style, even a slight one, the policy may eventually collapse arbitrarily far onto only the part of the distribution that meets that preference.
In some RL algorithms there is, additionally, a sort of Pólya's-urn-like tendency (https://en.wikipedia.org/wiki/P%C3%B3lya_urn_model) where, among sequences that give similar reward, the particular ones that happen to be sampled become consistently more (or less) likely. I believe that training on advantage rather than raw reward tends to mitigate or remove this bias to first order as well, although there can still be a random-walk-like behavior (now of smaller magnitude, and one that can go in either direction).
In any case, I would tend to see these and numerous other issues of RL as mechanisms distinct from the bias that overweights shorter or more likely tokens when sampling at temperature less than 1. The latter is an unsoundness (a lack of invariance) inherent in the functional form of sampling at temperature less than 1, whereas many of the issues of RL arise more from the variance of sampling and approximations, unwanted generalization, imperfect rewards, etc., rather than from anything inherently unsound in the functional form itself.
I'm not really sure how this would interact with RL since loss isn't calculated per-token and you're not trying to predict an exact output. I need to get some RL experience so I might try this at some point (although I'd also be happy if someone else got to it first).
I think this insight is really interesting! Especially the potential connection to LLMisms.
But I don't really understand why you chose these experiments. It seems to me the things to check or prove are:
You do significantly more work to show the effect in a toy setting that may or may not bear on the real case. And I think the outcome of your experiments is already clear before you do them because the effect of top-k sampling on tokens with low/high probability is not complicated (and well explained by you in the post).
Yeah characterizing the impact on current models would definitely be interesting.
I think the toy models are interesting since the impact of top-k and temperature is straightforward in one sense (it makes likely tokens more likely), but LLMs are complicated and it's possible that my theory about forcibly shortening tokens to trigger this wouldn't have worked.
I was also surprised by how big the effect was (admittedly, with a really large change to the tokenizer).
I remember reading a paper about how aiming for a certain entropy per token made LLMs sound more human. I think it might have been this paper? This marginalization of later tokens might be the reason why: aiming for a target entropy would allow lower-probability tokens more often than a fixed temperature would, while still avoiding "noisy" tokens.
Thanks for the nice blog!
A few years ago, I published [a paper](https://arxiv.org/pdf/1902.09191) conveying a very similar message -- likely tokens become more likely in trained SLMs. Now, with the success of LLMs, I think this might be an explanation of why scaling laws have worked so well so far. Basically, SLMs transform the context into the next token through several NN layers, many of which are linear, so it's not surprising to see SLMs catching obvious patterns like which words are more common than others. As LMs get deeper and deeper, they capture the subtlety and semantics in the context much better, so they don't straightforwardly follow the frequency pattern. It would be great to test the impact of LM size and data size in experiments, but of course, it's hard to make a fairly controlled comparison.
If this is the case, isn't the most straightforward test to take a pretrained open-source LLM (e.g. Gemma 3, GPT-OSS, Llama, etc.) and check its output logits? For example, the completion for "The capital of France is" should be "Paris", so either we see the entire word token or just "P", and we can check their relative likelihood.
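One way to actually run that check (a sketch using the HuggingFace transformers library; GPT-2 here is just a stand-in for whichever open model you'd pick):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any open causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
probs = next_token_logits.softmax(dim=-1)

for piece in [" Paris", " P"]:
    token_id = tok.encode(piece)[0]  # first token of each candidate
    print(repr(piece), probs[token_id].item())
```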
I feel like in reality this doesn't happen (at least when I have had occasion to check logits) because the difference between a single small letter and a whole word is that the latter "collapses" the amount of possible branching more definitively. Also, it wouldn't be too hard to add a small regularization term to reward the choice of longer tokens when possible (though I don't know if any models are actually trained like that).
I think I was explaining this in a confusing way, so I added another footnote. Does this help?
The idea is that if the model is trained to use shorter tokens (like it does in many cases like non-, un-, anti-, etc.) it will be biased to use those tokens more than it should. So in the Paris case, I would expect most LLMs with a reasonable vocab size to have a "Paris" token, and they wouldn't be trained to use "P".
Based on my experiments above, I think if you did force P to be tokenized separately, the probability of P would be higher and the model would be biased to respond with capital names starting with P.
I'm probably also misunderstanding, but wouldn't this predict that large production models prefer words starting with "a" and names starting with "I" (capital "i")? Because these letters are, simultaneously, frequently-used words in English. Which makes it likely that the tokenizer includes the tokens " a" and " I" and that the model is incentivized to use them.
It depends on the situation. LLMs are usually trained to use the longest token for any particular output, so words starting with "a" will usually prefer a tokenization longer than just "a". For example, "and" is its own token so a model usually wouldn't be trained to output [a, nd]. Also, it's not just how common a token is, but how common it is in that particular situation. "a" and "I" being common words doesn't mean they're common prefixes to other words (although this bias might affect "a" because a lot of phrases start with "a ...").
In cases where a model is trying to output a name that isn't a full-word token, I think it would be biased[1] to pick a name starting with a token that's the prefix of a lot of names.
Note that bias noticeably affects the output distribution, but it's not overwhelming. Even a huge change to the tokenizer only took the number of words starting with "c" from 4% to 10%.
I was thinking about LLM tokenization (as one does) and had a thought: We select the next output token for an LLM based on its likelihood, but (some) shorter tokens are more likely.
Why? Longer tokens can only complete one word, but some shorter tokens can complete many words. Those shorter common tokens[1] are (correctly) learned to be higher-probability because they have the combined probability of any word they could complete. However, standard generation techniques only consider a subset of tokens (top-K) and rescale the distribution in favor of the largest probabilities (temperature). Both of these take the highest probabilities and increase them further, meaning short/common tokens become significantly more likely to be generated just because they're shorter.
I ran an experiment to investigate this, showing that the first-character distribution of words generated by nanoGPT[2] is similar regardless of tokenization without top-K or temperature scaling, but if we use common settings (top-K=200 and temperature=0.8), we can increase the likelihood that a word starts with 'c' from 4% up to 10% just by tokenizing 'c' at the start of words separately from the rest of the word. I found similar but stronger effects when training on a tiny Shakespeare corpus.
This effect should also appear when a particular word is a common start to phrases. Techniques like beam search should help, but I don't think anyone does that in user-facing frontier models. I have a theory that this is part of the explanation for why AIs seem to all like the same words and have a similar style.
Feel free to skip to the experiment if this is all obvious.
Say you have a text corpus made up of conversations between Alice, Bob, Blake, Charles and Cathryn, and they're all equally likely to speak. There are 5 names, so in a model where you tokenize each name separately, the next-token probabilities for any name will be 20%.
If you tokenize the first letter separately (your vocab is ["A", "-lice", "B", "-ob", "-lake", "C", "-harles", "-athryn"]), the probabilities will instead be:

Token | Probability |
---|---|
A | 20% |
B | 40% |
C | 40% |
-lice (after "A") | 100% |
-ob (after "B") | 50% |
-lake (after "B") | 50% |
-harles (after "C") | 50% |
-athryn (after "C") | 50% |
So far, so good, the full probability of each full name is the same (20%).
Now, what if we only considered the top 2[3] tokens at each step? The two most likely first tokens are B and C (40% each), so A is filtered out and B and C are renormalized to 50% each. We've dropped the probability of predicting Alice from 20% to 0% and spread that probability across the other options.
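For concreteness, here's a minimal sketch (plain Python, using the numbers above) of what top-2 filtering at the first token does to the full-name probabilities:

```python
# First-token probabilities and conditional second-token probabilities
# from the toy vocabulary above. The leading "-" just marks a continuation.
first = {"A": 0.20, "B": 0.40, "C": 0.40}
second = {
    "A": {"-lice": 1.0},
    "B": {"-ob": 0.5, "-lake": 0.5},
    "C": {"-harles": 0.5, "-athryn": 0.5},
}

def top_k(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

filtered = top_k(first, k=2)  # {"B": 0.5, "C": 0.5}; "A" is filtered out
for start, p1 in filtered.items():
    for rest, p2 in second[start].items():
        print(start + rest.lstrip("-"), p1 * p2)
# Bob 0.25, Blake 0.25, Charles 0.25, Cathryn 0.25 -- Alice has dropped to 0.
```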
tl;dr temperature makes more-common outputs even more common and less-common outputs even less common. Feel free to skip the details if you don't care.
We need to step back from probabilities for a second to explain temperature. An LLM actually outputs logits (numbers that can be negative or positive and don't necessarily add up to 1), which we then scale through softmax to get probabilities.
When using temperature scaling, the logits are first divided by the temperature before the softmax. Dividing by a temperature between 0 and 1 spreads the logits further apart, while a temperature above 1 squashes them together. Since softmax exponentiates the logits, a small temperature boosts the largest logits disproportionately.
So, going back to our example, our logits might look like:
Token | Logit | Probability |
---|---|---|
A | 0 | 20% |
B | 0.693 | 40% |
C | 0.693 | 40% |
But if we first scale them with temperature = 0.8, we get:
Token | Logit | Scaled Logit | Probability |
---|---|---|---|
A | 0 | 0 | 17.4% |
B | 0.693 | 0.866 | 41.3% |
C | 0.693 | 0.866 | 41.3% |
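Here's the same calculation as a minimal Python sketch (the logits are the ones from the tables above):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [z / temperature for z in logits]
    exps = [math.exp(z) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0, 0.693, 0.693]  # A, B, C

print(softmax_with_temperature(logits, 1.0))  # ~[0.200, 0.400, 0.400]
print(softmax_with_temperature(logits, 0.8))  # ~[0.174, 0.413, 0.413]
```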
Again, the LLM is biased to pick names not starting with A.
Theory is all well and good, but what if we actually try it? To prove this, I forked nanoGPT and set up two custom tokenizers: a control tokenizer with no special handling of "c", and a test "C-tokenizer" that always tokenizes a "c" at the start of a word separately from the rest of the word.
I trained a model with GPT-2 architecture for 21,000 iterations[4] on OpenWebText using each tokenizer.
Once each model was trained, I sampled 1000 separate 200-character outputs, then split the words and counted what proportion of words started with each letter. I repeated this with the following parameter combinations:
Temperature | Top-K |
---|---|
1.0 | ∞ |
1.0 | 200 |
0.8 | 200 |
Note that temperature=1.0 and K=∞ corresponds to using the raw learned probabilities. 0.8 and 200 were chosen because they're fairly standard choices for these parameters.
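The counting step itself is simple; it looks roughly like the sketch below (simplified, not the exact code from my fork, with `samples` standing in for the list of generated strings):

```python
import re
from collections import Counter

def first_letter_frequencies(samples):
    """Fraction of generated words that start with each letter."""
    counts = Counter()
    for text in samples:
        for word in re.findall(r"[A-Za-z]+", text):
            counts[word[0].lower()] += 1
    total = sum(counts.values())
    return {letter: n / total for letter, n in sorted(counts.items())}

# samples = 1000 generations of 200 characters each from a given model
# print(first_letter_frequencies(samples))
```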
You can see the raw results for each model and set of sampling parameters here. Since we chose to tokenize "c" separately, I predicted that words starting with "c" should be more common in the test model than the control model, and this effect should increase as the temperature and top-K parameters drop.
Here are the top 10 initial characters in words and how their probability changes between the control and test models.
Initial Character | Test Delta K=∞ T=1.0 | Test Delta K=200 T=1.0 | Test Delta K=200 T=0.8 |
---|---|---|---|
t | 1% | 1% | 1% |
a | -1% | -1% | -1% |
s | 0% | 0% | 0% |
i | 0% | 0% | 0% |
o | 0% | 0% | 0% |
c | 0% | 3% | 6% |
w | 0% | 0% | 0% |
b | 0% | 0% | 0% |
p | 0% | 0% | -1% |
f | 0% | 0% | 0% |
See a more detailed table here and the raw data here.
So this provides empirical validation that, with top-K=200 or temperature=0.8, breaking a token into smaller pieces makes it more likely to be chosen.
The results above are more representative of a real model, but I was able to train small models on this Shakespeare dataset much faster, which let me run the same experiment above with more letters.
Each model trained until the validation loss started increasing, which was around 1,000 iterations.
In this case, I only generated 100 samples of 100 tokens each, and all samples used K=200. Each row is the results for a test model using a tokenizer which forces the given letter to tokenize separately at the start of words.
Letter | Control Frequency | Test T=1.0 | Test T=0.8 | Test T=0.1 |
---|---|---|---|---|
c | 2% | 5% | 8% | 43% |
t | 15% | 18% | 28% | 53% |
w | ~0% | 8% | 13% | 31% |
Looking through the output, the results are extremely obvious. Normal output has very few C-names, but if you force "c" to tokenize separately, suddenly you have a lot of conversations between clowns and Capulets. Similarly, initial-W tokenization suddenly has characters talking about the world a lot.
Talking about clouds with capulets (T=0.8)
the chamber of those clouds, 'be a time a king richard of cousin.
capulet:
well, what is well wrong'd him for you told.
lords:
for your mistress citizen:
like a morning of the way.
capulet:
this most word thou!
I suspect this has something to do with "LLM style". LLMs may be pushed to select "slop" words because those words have more possible endings, even if none of those endings are the best one.
It also seems concerning that such a small and seemingly-unimportant change to the tokenizer can have an unexpectedly large effect on outputs.
Also, as a BPE-tokenization-hater, I think it's interesting that this is a case where character-level transformers may be worse[5], although models that try to group tokens so they're equally "surprising" might make this better.
Specifically, shorter tokens that the model is trained to use. There are a lot of short tokens that a model won't usually use because it was trained to use longer ones (for example, in a standard model, "c" is a short token, but it's not one that the model will use very often because it was trained to prefer tokens like "cat").
I'm thinking of cases like splitting an "un-" or "non-" prefix into a separate token instead of having hundreds of individual tokens starting with "un-" and "non-", or cases where words are uncommon enough that the full word isn't a token.
The same architecture as GPT-2 but trained on OpenWebText.
Top-2 is unrealistic, but I don't want this example to get too long. See below where I show empirically that top-200 causes this effect in practice.
I initially trained one model until the validation loss started increasing, which happened to be 21,000 iterations, then realized I should train the other model for the same number of iterations.
I hope you like words starting with e.