A theory regarding the cause of strange words in the CoT
The following is my working theory about why strange language occurs in chains-of-thought. I’d greatly appreciate it if someone with the relevant expertise could invalidate it or confirm its potential merit for further exploration. Thanks!
What causes the strange words to occur? My hypothesis is that during RL, the reward is determined solely by the final output and is wholly agnostic toward the CoT that produced it. However, the PPO update reinforces or penalizes the entire sampled sequence, CoT included. Somehow (I imagine due to how the CoT is delimited) this allows a semantic drift that affects only the CoT. Alternatively, whatever portion of the drift affects the output may later be corrected by RL, since the output is what the reward sees, but in a way that doesn’t correct the initial drift in the CoT.
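To make the credit-assignment part concrete, here is a minimal sketch of what I mean. It is a simplified REINFORCE-style picture, not actual PPO (which also involves clipping and usually a learned value function), and the function name, the rollout, and the odd token "zxq" are all made up for illustration. The assumption being illustrated is that one outcome-level score is spread across every sampled token.

```python
from typing import List, Tuple

def per_token_credit(tokens: List[str], reward: float,
                     baseline: float = 0.0) -> List[Tuple[str, float]]:
    """Give every sampled token the same sequence-level advantage.

    Assumption: the reward looks only at the final answer, yet the
    resulting advantage is applied to CoT tokens and answer tokens alike.
    """
    advantage = reward - baseline
    return [(tok, advantage) for tok in tokens]

# Hypothetical rollout: CoT between <think> and </think>, then the answer.
# "zxq" stands in for a strange word; it is invented for this example.
rollout = ["<think>", "so", "zxq", "2+2", "</think>", "4"]
credits = per_token_credit(rollout, reward=1.0, baseline=0.25)

for tok, credit in credits:
    print(f"{tok!r:12} credit={credit}")
```

Under this assumption the strange token receives exactly the same push as the correct answer token, even though the reward never inspected it.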
Why does occurrence increase throughout RL? My hypothesis is that, assuming RL is effective, the average reward increases throughout training. Therefore, each time these strange words occur in the CoT, they are rewarded more strongly as RL progresses, and they are likely also mutually reinforcing as their density in the CoT increases.
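One way to probe this part of the hypothesis is a small arithmetic sketch. It assumes a binary reward (1 for a correct output, 0 otherwise) and a token whose presence has no effect on the outcome; both are assumptions of mine, as is the function name. It compares the expected credit such a token receives when the update uses the raw reward versus a reward with the mean subtracted as a baseline. I don't know which of these better matches any real training setup.

```python
def expected_credit(success_rate: float, use_baseline: bool) -> float:
    """Expected credit for a token that does not change the outcome.

    Assumptions: reward is 1 on success and 0 on failure, and the
    baseline (if used) equals the current average reward.
    """
    baseline = success_rate if use_baseline else 0.0
    on_success = 1.0 - baseline
    on_failure = 0.0 - baseline
    return success_rate * on_success + (1.0 - success_rate) * on_failure

# Average reward rising over the course of training (illustrative values).
for p in (0.2, 0.5, 0.8):
    raw = expected_credit(p, use_baseline=False)
    centred = expected_credit(p, use_baseline=True)
    print(f"success_rate={p}: raw={raw:.2f}, with_baseline={centred:.2f}")
```

In this toy setting the raw-reward credit grows with the average reward, which is the effect described above, while the baseline-subtracted credit stays at zero. So whether the hypothesis holds may depend on how the advantage is actually computed.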
Where do the words come from? My hypothesis (which, like the rest, is very speculative) is that the strange word (or its first token) lies close in embedding space to the token that would have been chosen absent the semantic drift in the CoT.
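A rough way to check this would be to look up the nearest neighbours of the expected token in the model's embedding table and see whether the strange word is among them. The sketch below uses cosine similarity on a tiny made-up table; the vectors, token names, and the helper nearest_tokens are all hypothetical, and a real test would use the model's actual embedding matrix.

```python
import math
from typing import Dict, List, Tuple

def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_tokens(target: str, embeddings: Dict[str, List[float]],
                   k: int = 2) -> List[Tuple[str, float]]:
    """Return the k tokens most similar to `target`, excluding itself."""
    scores = [(tok, cosine(embeddings[target], vec))
              for tok, vec in embeddings.items() if tok != target]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Toy embedding table, invented purely for illustration.
toy_embeddings = {
    "expected": [1.0, 0.0, 0.0],
    "strange": [0.9, 0.1, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}

neighbours = nearest_tokens("expected", toy_embeddings, k=1)
print(neighbours)
```

If the hypothesis is right, the strange words seen in real CoTs should rank unusually high among the neighbours of the tokens they appear to replace.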