jr's Shortform

by jr
21st Oct 2025

This is a special post for quick takes by jr. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
[-]jr26d*10
[This comment is no longer endorsed by its author]Reply
[-]jr23d-2-3

A theory regarding the cause of strange words in the CoT

The following is my working theory about why strange language occurs in chains of thought. I'd greatly appreciate it if someone with the relevant expertise could invalidate it or confirm that it merits further exploration. Thanks!

What causes the strange words to occur? My hypothesis is that during RL, the reward is determined solely by the output and is wholly agnostic toward the CoT that produced it. However, PPO reinforces or penalizes the entire sampled sequence, CoT included, when it updates the weights. Somehow (I imagine perhaps because of how the CoT is delimited) this allows a semantic drift that affects only the CoT. Alternatively, perhaps whatever portion of the drift affects the output is later self-corrected by RL (as RL tends to do), but in a way that leaves the initial CoT drift uncorrected.
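To make the credit-assignment claim concrete, here is a minimal sketch. It uses a plain REINFORCE-style update rather than PPO itself, and the `Token` class, the `outcome_reward` grader, the baseline value, and the placeholder token `odd_word` are all hypothetical names chosen for illustration. The only point it shows is that a reward computed from the output alone can still be applied, unchanged, to every CoT token in the sampled sequence.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    in_cot: bool  # True if the token sits inside the chain-of-thought span


def outcome_reward(output_text: str, target: str) -> float:
    """Hypothetical grader: inspects only the final output, never the CoT."""
    return 1.0 if output_text.strip() == target else 0.0


def per_token_credit(tokens, reward, baseline=0.0):
    """REINFORCE-style credit (an assumption, not PPO): every sampled token,
    CoT or output, is weighted by the same sequence-level advantage."""
    advantage = reward - baseline
    return [(t.text, t.in_cot, advantage) for t in tokens]


# Toy rollout: three CoT tokens (one of them odd) followed by one output token.
cot = [Token("let", True), Token("me", True), Token("odd_word", True)]
out = [Token("42", False)]

reward = outcome_reward("".join(t.text for t in out), target="42")
credit = per_token_credit(cot + out, reward, baseline=0.5)

for text, in_cot, adv in credit:
    print(f"{text!r:12} in_cot={in_cot!s:5} advantage={adv}")
```

In this toy setup the odd CoT token receives exactly the same advantage as the correct output token, even though the grader never looked at it.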

Why does occurrence increase throughout RL? My hypothesis is that, assuming RL is effective, the average reward increases over the course of training. Every time these strange words occur in the CoT, they are therefore rewarded more and more strongly as RL progresses, and they are likely also mutually reinforcing as their density in the CoT increases.
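The arithmetic behind this hypothesis can be sketched as follows. The sketch assumes the update weight for a token is the reward minus a fixed baseline; if the baseline instead tracked the rising mean reward, the per-occurrence weight would not grow in this way. The reward and count values are made-up illustrative numbers, not measurements.

```python
def push_per_occurrence(mean_rewards, baseline=0.0):
    """Weight applied to each occurrence of a token at each training stage,
    assuming weight = reward - baseline with a fixed baseline."""
    return [r - baseline for r in mean_rewards]


def push_per_sequence(mean_rewards, counts, baseline=0.0):
    """Total weight on the strange tokens in one CoT: occurrences x weight."""
    per_occurrence = push_per_occurrence(mean_rewards, baseline)
    return [n * w for n, w in zip(counts, per_occurrence)]


# Hypothetical training stages: mean reward rises, and so does the number of
# strange tokens per CoT.
mean_rewards = [0.2, 0.4, 0.6, 0.8]
counts = [1, 2, 3, 4]

per_occurrence = push_per_occurrence(mean_rewards)
per_sequence = push_per_sequence(mean_rewards, counts)

print(per_occurrence)
print(per_sequence)
```

Under these assumptions the per-occurrence weight rises with the mean reward, and the per-sequence total rises faster because the count grows as well.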

Where do the words come from? My hypothesis (which, like the rest, is very speculative) is that the strange word, or its first token, lies close in embedding space to the token that would have been chosen absent the semantic drift in the CoT.
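One way this could be checked is a nearest-neighbour lookup in the model's token embedding table. The sketch below uses cosine similarity over a tiny hand-written table; the three tokens and their vectors are invented for illustration, and a real test would need the actual embedding matrix of the model in question.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_tokens(token, embeddings, k=1):
    """Return the k tokens whose embeddings are most similar to `token`."""
    target = embeddings[token]
    scored = [
        (cosine(target, vec), name)
        for name, vec in embeddings.items()
        if name != token
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


# Hypothetical toy embedding table (not taken from any real model).
toy_embeddings = {
    "expected": [1.0, 0.0, 0.1],
    "strange": [0.9, 0.1, 0.1],
    "unrelated": [0.0, 1.0, 0.0],
}

neighbours = nearest_tokens("expected", toy_embeddings, k=1)
print(neighbours)
```

If the hypothesis holds, the strange words observed in real CoTs should tend to appear among the nearest neighbours of the tokens that would otherwise be expected at those positions.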
