So, I had a hypothesis last night that training on a different scoring rule might solve this problem (because it could encourage uncertain probabilities to be lower, and thus it would be easier to filter them out without making short tokens undeservedly more likely).
I ended up forking your code, and this morning I trained an LLM on the Shakespeare dataset using the α-ReLU loss (from the paper "Speeding Up Entmax"). The α-ReLU loss is a proper scoring rule based on the Tsallis entropy.
My results were the following:
Letter | Scoring Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1 |
---|---|---|---|---|---|---|
c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26% |
c | α-ReLU | 200 | 3.41% | 4.66% | 4.75% | 4.33% |
c | α-ReLU | ∞ | 3.41% | 3.87% | 4.00% | 3.49% |
All models were trained for 1500 iterations. The control was trained without the special C-tokenizer, while the test models were trained with it. In the paper, α-ReLU is parameterized by α, which controls the Tsallis exponent, and by a second parameter that is a constant shift in the ReLU; I used the defaults from the α-ReLU Python library for both in all the experiments.
The α-ReLU-trained neural network does NOT seem to exhibit the same trend of lower temperatures leading to a higher probability of words starting with "c". And when the top-k is set to ∞, most of the difference between it and the control also disappears.
So, perhaps we should just be training with a different scoring rule!
You can find my fork at https://github.com/cooljoseph1/nanoGPT-tokenizer-experiment/.
EDIT: I realized I was calculating temperature unfairly for the α-ReLU sampling. After fixing it, the α-ReLU model is actually worse than the cross-entropy trained network:
Letter | Scoring Rule | Top-K | Control | Test T=1.0 | Test T=0.8 | Test T=0.1 |
---|---|---|---|---|---|---|
c | Cross-entropy | 200 | 3.09% | 5.29% | 6.64% | 7.26% |
c | α-ReLU | 200 | 3.41% | 4.66% | 7.54% | 31.66% |
c | α-ReLU | ∞ | 3.41% | 3.87% | 5.94% | 31.66% |
Thanks for trying this! I wonder if this is making things worse in a similar way to top-k. The C-tokenizer makes it very likely that "c" is always in the top 200 tokens. I wonder if it's also ensuring that it's rarely sufficiently uncertain to be filtered by this scoring rule?
Shorter common tokens are (correctly) learned to be higher-probability because they have the combined probability of any word they could complete.
I don't think this is right. LLMs are trained with teacher forcing on a fixed tokenization. For common words the tokenizer provides a single long token (e.g., “ the”, “ and”, “ cat”). The model is trained to put probability on that token—not on a shorter prefix like “ c”. So a short token does not “inherit the combined probability of any word it could complete.” Those completions were usually seen as longer tokens during training.
You can check this by running inference on an LM; if you ask the model to complete "hello" it will put " world" above " " even though both are typically tokens.
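A quick way to sanity-check the single-token claim (a sketch assuming the HuggingFace transformers library and GPT-2's vocabulary; other BPE vocabularies behave similarly):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
for word in [" the", " and", " cat", " world"]:
    ids = tok.encode(word)
    print(repr(word), ids)  # each of these encodes to a single BPE token
```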
On a quick skim, I think that the rest of your arguments are plausible.
You can see in the experiment section that I'm not saying LLMs normally do this with words like "cat". I actually changed to tokenizer to force that situation to prove that the effect would appear like I was expecting.
I was a little imprecise about shorter tokens, but I'm talking about if the tokenizer causes some set of words or phrases to be canonically tokenized with shorter tokens. So, if the tokenizer decides that the un- or non- prefix should be separate tokens, those shorter tokens will be more likely than if the tokenizer had longer tokens for every un- and non- word.
You're right that I should probably explain this better in the intro though.
Shouldn't this be generally "likely tokens are even more likely"? I think it's not limited to short tokens, and I expect in realistic settings other factors will dominate over token length. But I agree that top-k (or top-p) sampling should lead to a miscalibration of LLM outputs in the low-probability tail.
I suspect this has something to do with "LLM style". LLMs may be pushed to select "slop" words because those words have more possible endings, even if none of those endings are the best one.
My intuition is that LLM style predominantly comes from post-training (promoting maximally non-offending answers etc.) rather than due to top-k/p sampling. (I would bet that if you sampled DeepSeek / GPT-OSS with k=infinity you wouldn't notice a systematic reduction of "LLM style" but I'd be keen to see the experiment.)
Thanks for the writeup, I appreciated the explanations and especially the Alice/Bob/Blake example!
Shouldn't this be generally "likely tokens are even more likely"?
I thought focusing on short tokens would be interesting since "make likely tokens more likely" is just temperature scaling doing its job, but I think the interaction with token length is surprising.
Fun fact: This was originally part of a post complaining about teacher forcing, but it turns out that this situation is actually worse without teacher forcing. If you force an LLM to keep predicting using its own outputs, picking a token with more potential continuations is better for the loss of the current token and the next tokens.
It may depend on the RL algorithm, but I would not expect most RL to have this issue to first order if the RL algorithm produces its rollouts by sampling from the full untruncated distribution at temperature 1.
The issue observed by the OP is a consequence of the fact that typically if you are doing anything other than untruncated sampling at temperature 1, then your sampling is not invariant between, e.g. "choose one of three options: a, b, or c" and "choose one of two options: a, or (choose one of two options: b or c)".
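For concreteness, here's a minimal numeric sketch of that non-invariance (plain Python, reusing the 20%/40%/40% split from the post; raising probabilities to the power 1/T and renormalizing is the same as dividing logits by T before the softmax):

```python
def temper(probs, t):
    """Apply temperature t to a categorical distribution and renormalize."""
    scaled = [p ** (1.0 / t) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

T = 0.8

# Flat: choose one of three options a, b, c with probabilities 20/40/40.
flat = temper([0.2, 0.4, 0.4], T)
print("flat   P(a) =", round(flat[0], 3))                  # ~0.174

# Nested: first choose between a (20%) and "b or c" (80%), then split 50/50.
stage1 = temper([0.2, 0.8], T)
stage2 = temper([0.5, 0.5], T)                             # stays 50/50
print("nested P(a) =", round(stage1[0], 3))                # ~0.150
print("nested P(b) =", round(stage1[1] * stage2[0], 3))    # ~0.425
```

Untruncated sampling at temperature 1 gives the same answer either way; any other setting does not.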
However, many typical on-policy RL algorithms fundamentally derive from sampling/approximation of theorems/algorithms where running one step of the theoretical idealized policy update looks more like:
"Consider the space of possible complete output sequences S, and consider sum_{s in S} P(s) Reward(s). Update model parameters one step in the direction that most steeply overall increases this quantity".
By itself, this idealized update is invariant to tokenization, because it's expressed only in terms of complete outputs. Tokenization does come in insofar as it affects the gradient steepness of the policy in different directions of possible generalization and what parts of the space are explored and on which the approximated/sampled update occurs, etc.
Note that the typical mechanism by which RL tends towards entropy decrease and/or mode collapse is also well-explained by the above and does not need any involvement from tokenization. Indeed, consider just applying the above idealized update repeatedly. The model will continue sharpening, trying to push ever more of the probability mass onto only the sequences s for which Reward(s) is maximal or near-maximal, and pushing the probability of every other completed sequence to zero. If your reward function (from RLHF or whatever) has any preference for outputs of a given length or style, even a slight one, the policy may eventually collapse arbitrarily far onto only the part of the distribution that meets that preference.
In some RL algorithms there is, additionally, a sort of Pólya's-urn-like tendency (https://en.wikipedia.org/wiki/P%C3%B3lya_urn_model) where, among sequences that give similar reward, the particular ones that happen to be sampled become consistently more (or less) likely. I believe that training on advantage rather than raw reward tends to mitigate or remove this bias to first order as well, although there can still be a random-walk-like behavior (now of smaller magnitude, and one that can go in either direction).
In any case, I would tend to see these and numerous other issues of RL as mechanisms distinct from the bias that overweights shorter or more likely tokens when sampling at temperature less than 1. The latter is an unsoundness (a lack of invariance) inherent in the functional form of sampling at temperature less than 1, whereas many of the issues of RL arise more from the variance of sampling and approximations, unwanted generalization, imperfect rewards, etc., rather than from anything inherently unsound in the functional form itself.
I'm not really sure how this would interact with RL since loss isn't calculated per-token and you're not trying to predict an exact output. I need to get some RL experience so I might try this at some point (although I'd also be happy if someone else got to it first).
I think this insight is really interesting! Especially the potential connection to LLMisms.
But I don't really understand why you chose these experiments. It seems to me the things to check or prove are:
You do significantly more work to show the effect in a toy setting that may or may not bear on the real case. And I think the outcome of your experiments is already clear before you do them because the effect of top-k sampling on tokens with low/high probability is not complicated (and well explained by you in the post).
Yeah characterizing the impact on current models would definitely be interesting.
I think the toy models are interesting since the impact of top-k and temperature is straightforward in one sense (it makes likely tokens more likely), but LLMs are complicated and it's possible that my theory about forcibly shortening tokens to trigger this wouldn't have worked.
I was also surprised by how big the effect was (admittedly, with a really large change to the tokenizer).
I remember reading a paper about how aiming for a certain entropy per token made LLMs sound more human. I think it might have been this paper? This marginalization of later tokens might be the reason why: aiming for a target entropy would allow lower-probability tokens more often than a fixed temperature would, while still avoiding "noisy" tokens.
Thanks for the nice blog!
A few years ago, I published [a paper](https://arxiv.org/pdf/1902.09191) conveying a very similar message -- likely tokens become more likely in trained SLMs. Now, with the success of LLMs, I think this might be an explanation of why scaling laws have worked so well so far. Basically, SLMs transform the context into the next token through several NN layers, many of which are linear, so it's not surprising to see SLMs catching obvious patterns like which words are more common than others. As LMs get deeper and deeper, they capture the subtlety and semantics in the context much better, so they don't straightforwardly follow the frequency pattern. It would be great to test the impact of LM size and data size in experiments, but of course, it's hard to make a fairly controlled comparison.
If this is the case, isn't the most straightforward test to take a pretrained open-source LLM (e.g. Gemma 3, GPT-OSS, Llama, etc.) and check its output logits? For example, the completion for "The capital of France is" should be "Paris", so either we see the entire word token or just "P", and we can check their relative likelihood.
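One way to actually run that check (a sketch using the HuggingFace transformers library; GPT-2 here is just a stand-in for whichever open model you'd pick):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; any open causal LM works the same way
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token
probs = next_token_logits.softmax(dim=-1)

for piece in [" Paris", " P"]:
    token_id = tok.encode(piece)[0]  # first token of each candidate
    print(repr(piece), probs[token_id].item())
```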
I feel like in reality this doesn't happen (at least when I have had occasion to check logits) because the difference between a single small letter and a whole word is that the latter "collapses" the amount of possible branching more definitively. Also, it wouldn't be too hard to add a small regularization term to reward the choice of longer tokens when possible (though I don't know if any models are actually trained like that).
I think I was explaining this in a confusing way, so I added another footnote. Does this help?
The idea is that if the model is trained to use shorter tokens (like it does in many cases like non-, un-, anti-, etc.) it will be biased to use those tokens more than it should. So in the Paris case, I would expect most LLMs with a reasonable vocab size to have a "Paris" token, and they wouldn't be trained to use "P".
Based on my experiments above, I think if you did force P to be tokenized separately, the probability of P would be higher and the model would be biased to respond with capital names starting with P.
I'm probably also misunderstanding, but wouldn't this predict that large production models prefer words starting with "a" and names starting with "I" (capital "i")? Because these letters are, simultaneously, frequently-used words in English. Which makes it likely that the tokenizer includes the tokens " a" and " I" and that the model is incentivized to use them.
It depends on the situation. LLMs are usually trained to use the longest token for any particular output, so words starting with "a" will usually prefer a tokenization longer than just "a". For example, "and" is its own token so a model usually wouldn't be trained to output [a, nd]. Also, it's not just how common a token is, but how common it is in that particular situation. "a" and "I" being common words doesn't mean they're common prefixes to other words (although this bias might affect "a" because a lot of phrases start with "a ...").
In cases where a model is trying to output a name that isn't a full-word token, I think it would be biased[1] to pick a name starting with a token that's the prefix of a lot of names.
Note that bias noticeably affects the output distribution, but it's not overwhelming. Even a huge change to the tokenizer only took the number of words starting with "c" from 4% to 10%.
I was thinking about LLM tokenization (as one does) and had a thought: We select the next output token for an LLM based on its likelihood, but (some) shorter tokens are more likely.
Why? Longer tokens can only complete one word, but some shorter tokens can complete many words. Those shorter common tokens[1] are (correctly) learned to be higher-probability because they have the combined probability of any word they could complete. However, standard generation techniques only consider a subset of tokens (top-K) and rescale the distribution in favor of the largest probabilities (temperature). Both of these take the highest probabilities and increase them further, meaning short/common tokens become significantly more likely to be generated just because they're shorter.
I ran an experiment to investigate this, showing that the first-character distribution of words generated by nanoGPT[2] is similar regardless of tokenization without top-K or temperature scaling, but if we use common settings (top-K=200 and temperature=0.8), we can increase the likelihood that a word starts with 'c' from 4% up to 10% just by tokenizing 'c' at the start of words separately from the rest of the word. I found similar but stronger effects when training on a tiny Shakespeare corpus.
This effect should also appear when a particular word is a common start to phrases. Techniques like beam search should help, but I don't think anyone does that in user-facing frontier models. I have a theory that this is part of the explanation for why AIs seem to all like the same words and have a similar style.
Feel free to skip to the experiment if this is all obvious.
Say you have a text corpus made up of conversations between Alice, Bob, Blake, Charles and Cathryn, and they're all equally likely to speak. There are 5 names, so in a model where you tokenize each name separately, the next-token probabilities for any name will be 20%.
If you tokenize the first letter separately (your vocab is ["A", "-lice", "B", "-ob", "-lake", "C", "-harles", "-athryn"]), the probabilities will instead be:

Token | Probability |
---|---|
A | 20% |
B | 40% |
C | 40% |
-lice (after "A") | 100% |
-ob (after "B") | 50% |
-lake (after "B") | 50% |
-harles (after "C") | 50% |
-athryn (after "C") | 50% |
So far, so good, the full probability of each full name is the same (20%).
Now, what if we only considered the top 2[3] tokens at each step? The two most likely first tokens are B and C (40% each), so A is filtered out and B and C are renormalized to 50% each. We've dropped the probability of predicting Alice from 20% to 0% and spread that probability across the other options.
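For concreteness, here's a minimal sketch (plain Python, using the numbers above) of what top-2 filtering at the first token does to the full-name probabilities:

```python
# First-token probabilities and conditional second-token probabilities
# from the toy vocabulary above. The leading "-" just marks a continuation.
first = {"A": 0.20, "B": 0.40, "C": 0.40}
second = {
    "A": {"-lice": 1.0},
    "B": {"-ob": 0.5, "-lake": 0.5},
    "C": {"-harles": 0.5, "-athryn": 0.5},
}

def top_k(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

filtered = top_k(first, k=2)  # {"B": 0.5, "C": 0.5}; "A" is filtered out
for start, p1 in filtered.items():
    for rest, p2 in second[start].items():
        print(start + rest.lstrip("-"), p1 * p2)
# Bob 0.25, Blake 0.25, Charles 0.25, Cathryn 0.25 -- Alice has dropped to 0.
```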
tl;dr temperature makes more-common outputs even more common and less-common outputs even less common. Feel free to skip the details if you don't care.
We need to step back from probabilities for a second to explain temperature. An LLM actually outputs logits (numbers that can be negative or positive and don't necessarily add up to 1), which we then scale through softmax to get probabilities.
When using temperature scaling, the logits are first divided by the temperature before the softmax. Dividing by a temperature between 0 and 1 spreads the logits further apart, while a temperature above 1 squashes them together. Since softmax exponentiates the logits, a small temperature boosts the largest logits disproportionately.
So, going back to our example, our logits might look like:
Token | Logit | Probability |
---|---|---|
A | 0 | 20% |
B | 0.693 | 40% |
C | 0.693 | 40% |
But if we first scale them with temperature = 0.8, we get:
Token | Logit | Scaled Logit | Probability |
---|---|---|---|
A | 0 | 0 | 17.4% |
B | 0.693 | 0.866 | 41.3% |
C | 0.693 | 0.866 | 41.3% |
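Here's the same calculation as a minimal Python sketch (the logits are the ones from the tables above):

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [z / temperature for z in logits]
    exps = [math.exp(z) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0, 0.693, 0.693]  # A, B, C

print(softmax_with_temperature(logits, 1.0))  # ~[0.200, 0.400, 0.400]
print(softmax_with_temperature(logits, 0.8))  # ~[0.174, 0.413, 0.413]
```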
Again, the LLM is biased to pick names not starting with A.
Theory is all well and good, but what if we actually try it? To prove this, I forked nanoGPT and set up two custom tokenizers: a control tokenizer with no special handling of "c", and a test "C-tokenizer" that always tokenizes a "c" at the start of a word separately from the rest of the word.
I trained a model with GPT-2 architecture for 21,000 iterations[4] on OpenWebText using each tokenizer.
Once each model was trained, I sampled 1000 separate 200-character outputs, then split the words and counted what proportion of words started with each letter. I repeated this with the following parameter combinations:
Temperature | Top-K |
---|---|
1.0 | ∞ |
1.0 | 200 |
0.8 | 200 |
Note that temperature=1.0 and K=∞ corresponds to using the raw learned probabilities. 0.8 and 200 were chosen because they're fairly standard choices for these parameters.
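The counting step itself is simple; it looks roughly like the sketch below (simplified, not the exact code from my fork, with `samples` standing in for the list of generated strings):

```python
import re
from collections import Counter

def first_letter_frequencies(samples):
    """Fraction of generated words that start with each letter."""
    counts = Counter()
    for text in samples:
        for word in re.findall(r"[A-Za-z]+", text):
            counts[word[0].lower()] += 1
    total = sum(counts.values())
    return {letter: n / total for letter, n in sorted(counts.items())}

# samples = 1000 generations of 200 characters each from a given model
# print(first_letter_frequencies(samples))
```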
You can see the raw results for each model and set of sampling parameters here. Since we chose to tokenize "c" separately, I predicted that words starting with "c" should be more common in the test model than the control model, and this effect should increase as the temperature and top-K parameters drop.
Here are the top 10 initial characters in words and how their probability changes between the control and test models.
Initial Character | Test Delta K=∞ T=1.0 | Test Delta K=200 T=1.0 | Test Delta K=200 T=0.8 |
---|---|---|---|
t | 1% | 1% | 1% |
a | -1% | -1% | -1% |
s | 0% | 0% | 0% |
i | 0% | 0% | 0% |
o | 0% | 0% | 0% |
c | 0% | 3% | 6% |
w | 0% | 0% | 0% |
b | 0% | 0% | 0% |
p | 0% | 0% | -1% |
f | 0% | 0% | 0% |
See a more detailed table here and the raw data here.
So this provides empirical validation that, with top-K=200 or temperature=0.8, breaking a token into smaller pieces makes it more likely to be chosen.
The results above are more representative of a real model, but I was able to train small models on this Shakespeare dataset much faster, which let me run the same experiment above with more letters.
Each model trained until the validation loss started increasing, which was around 1,000 iterations.
In this case, I only generated 100 samples of 100 tokens each, and all samples used K=200. Each row is the results for a test model using a tokenizer which forces the given letter to tokenize separately at the start of words.
Letter | Control Frequency | Test T=1.0 | Test T=0.8 | Test T=0.1 |
---|---|---|---|---|
c | 2% | 5% | 8% | 43% |
t | 15% | 18% | 28% | 53% |
w | ~0% | 8% | 13% | 31% |
Looking through the output, the results are extremely obvious. Normal output has very few C-names, but if you force "c" to tokenize separately, suddenly you have a lot of conversations between clowns and Capulets. Similarly, initial-W tokenization suddenly has characters talking about the world a lot.
Talking about clouds with capulets (T=0.8)
the chamber of those clouds, 'be a time a king richard of cousin.
capulet:
well, what is well wrong'd him for you told.
lords:
for your mistress citizen:
like a morning of the way.
capulet:
this most word thou!
I suspect this has something to do with "LLM style". LLMs may be pushed to select "slop" words because those words have more possible endings, even if none of those endings are the best one.
It also seems concerning that such a small and seemingly-unimportant change to the tokenizer can have an unexpectedly large effect on outputs.
Also, as a BPE-tokenization-hater, I think it's interesting that this is a case where character-level transformers may be worse[5], although models that try to group tokens so they're equally "surprising" might make this better.
Specifically, shorter tokens that the model is trained to use. There are a lot of short tokens that a model won't usually use because it was trained to use longer ones (for example, in a standard model, "c" is a short token, but it's not one that the model will use very often because it was trained to prefer tokens like "cat").
I'm thinking of cases like splitting an "un-" or "non-" prefix into a separate token instead of having hundreds of individual tokens starting with "un-" and "non-", or cases where words are uncommon enough that the full word isn't a token.
The same architecture as GPT-2 but trained on OpenWebText.
Top-2 is unrealistic, but I don't want this example to get too long. See below where I show empirically that top-200 causes this effect in practice.
I initially trained one model until the validation loss started increasing, which happened to be 21,000 iterations, then realized I should train the other model for the same number of iterations.
I hope you like words starting with e.