Epistemic Status: quick thoughts about a small experiment
Some models that have been subject to extensive RL develop odd language in their chain of thought.
People have hypothesized reasons for why this occurs, e.g. here. One reason people give is that RL incentivizes models to compress their language. There's pretty good reason to think this: when you do RL on math with CoT, the length of the CoT starts to grow, and this costs speed and inference compute. A very natural way to combat this is to add a length penalty to your reward, which gives the model a strong incentive to compress more of its thoughts into fewer tokens.
I think this hypothesis is pretty plausible. I mean, if you read these CoTs they look quite compressed; a lot of filler words are dropped, for example. But if we want to test this hypothesis, what could we do? What would we expect to see in a world where the hypothesis is true? Seems to me one thing we could do is look at the entropy of the CoT relative to other LLM outputs. If the LLM is trying to squeeze as much information as possible into each and every token of its CoT, you'd expect the CoT to increase in entropy as you train it more.
But it seems like this is not the case. I ran a quick experiment where I looked at models with both instruct-tuned variants and RL'd variants, compared the entropy of their token-frequency distributions, and found that the CoTs of the RL'd models consistently have lower entropy than almost any other LLM-generated text.*
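For concreteness, the measurement itself is just unigram entropy over the generated tokens. Here's a minimal sketch of it (the file names are placeholders for wherever you've saved the sampled CoTs, and the tokenizer ID is illustrative, not my exact setup):

```python
from collections import Counter
import math

from transformers import AutoTokenizer

def token_frequency_entropy(texts, tokenizer):
    """Shannon entropy (in bits) of the empirical token-frequency distribution."""
    counts = Counter()
    for text in texts:
        counts.update(tokenizer.encode(text, add_special_tokens=False))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Both variants should share a tokenizer so the numbers are comparable.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")

# Hypothetical files holding CoTs sampled from each variant on the same prompts.
rl_cots = open("qwen3_4b_rl_cots.txt").read().split("\n\n")
instruct_cots = open("qwen3_4b_instruct_cots.txt").read().split("\n\n")

print("RL'd CoT entropy (bits/token):    ", token_frequency_entropy(rl_cots, tokenizer))
print("Instruct CoT entropy (bits/token):", token_frequency_entropy(instruct_cots, tokenizer))
```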
Like here you can see the CoT token-frequency distribution of the RL'd variant of Qwen3-4B vs the instruct-tuned one. It might be too small to see in the image, but the RL'd model has around 0.7 fewer bits of entropy on average in its CoTs than the non-RL'd one.**
Here is a comparison to other types of text with Qwen3-8B***. The normal answers, the CoT-prompted CoTs, and the Qwen3-8B-base continuations are all markedly higher entropy than the CoTs of the RL'd variant.
I mean, it's not that strong evidence, but it's some evidence. If models had a very strong incentive to pack information into their CoT tokens, you'd expect the CoT to have lots of information per token! But it does not.
The reason I think it's not strong evidence is that there are other confounding factors. Like, model outputs that are meant to be user-facing have lots of fluff, like varied adjectives whose variability doesn't communicate much information but just looks nice.
A hypothesis is that during RL, the CoT quickly sheds these aesthetic components of language that are useless for the RL task, before starting to compress the real thinking it's doing into fewer tokens.
If this were true, you'd expect the CoT entropy to follow something like a J-shaped curve as training progresses: first dipping as the useless stuff is shed, then increasing as the model learns a more compressed language.
However, I couldn't find any models where many training checkpoints from a single RL run are public. I also couldn't do a training run that would exert enough pressure on the CoT for the results to be convincing, so I didn't do it.
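For reference, the checkpoint sweep I'd have liked to run is straightforward given the snippet above. Everything here is hypothetical (the checkpoint IDs, the prompt file, the sampling settings); as far as I know no such public run exists:

```python
# Reuses token_frequency_entropy from the earlier snippet.
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

def sample_cots(model_id, prompts, max_new_tokens=1024):
    """Sample one generation per prompt; settings are illustrative.
    For a reasoning model you'd also truncate at </think>; omitted for brevity."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    cots = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
        cots.append(tokenizer.decode(out[0, inputs.input_ids.shape[1]:]))
    return cots, tokenizer

prompts = open("math_prompts.txt").read().splitlines()  # hypothetical fixed prompt set
steps = [0, 500, 1000, 2000, 4000]                       # hypothetical RL training steps
entropies = []
for step in steps:
    ckpt = f"some-org/some-model-rl-step{step}"          # hypothetical checkpoint IDs
    cots, tokenizer = sample_cots(ckpt, prompts)
    entropies.append(token_frequency_entropy(cots, tokenizer))

plt.plot(steps, entropies, marker="o")
plt.xlabel("RL training step")
plt.ylabel("CoT token-frequency entropy (bits)")
plt.savefig("cot_entropy_vs_step.png")
```

If the shed-then-compress story is right, the resulting curve should dip before it climbs.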
Interested to hear people's thoughts on this.
* Note that in actuality, this token-frequency distribution only really gives us an upper bound on the entropy of the trajectories. Like, if the model has memorized the bible, it can output the bible, and that would probably give us fairly high entropy using this metric, but from the model's PoV it's presumably very low entropy, because it knows how it will go.
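(Formally: the frequency-based number estimates the marginal entropy H(X_t) of a random token, while the model's per-token uncertainty is more like the conditional entropy H(X_t | X_{<t}), and conditioning can only reduce entropy, hence the upper bound.)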
** These are from running the LLMs with the same settings on the same prompts, except that the instruct-tuned one was given a "think before you answer, and insert your thoughts inside <think> tags before you answer" instruction so it'd output a CoT.
*** Except the instruct one, which is 2.5-8B; I couldn't find an instruct-only variant of Qwen3. But Qwen2.5 uses the same tokenizer (modulo the <think> tokens, which never appear in our statistics).