Towards a Typology of Strange LLM Chains-of-Thought
Intro

LLMs being trained with RLVR (Reinforcement Learning from Verifiable Rewards) start off with a 'chain-of-thought' (CoT) in whatever language the LLM was originally trained on. But after a long period of training, the CoT sometimes starts to look very weird; to resemble no human language; or even to grow completely unintelligible.

Why might this happen? I've seen a lot of speculation about why, but much of it narrows too quickly to just one or two hypotheses. My intent is also to speculate, but more broadly. Specifically, I want to outline six nonexclusive possible causes for the weird tokens: new better language, spandrels, context refresh, deliberate obfuscation, natural drift, and conflicting shards. I also want to sketch, very roughly, experiments and evidence that could help us distinguish these causes.

I'm sure I'm not enumerating the full space of possibilities. I'm also sure I'm probably making some mistakes in what follows, or confusing my ontologies. But it's a start. I'll begin with some numbered examples of different hypotheses; at the end I'll put them on a conceptual map outlining the meta-hypotheses of which these are instances.

1. New Better Language

LLMs start speaking nonstandard languages to help them think more compactly or efficiently, in order to accomplish their RL goals.

One way this might happen is by the LLM starting to use a standard English word in a slightly different sense, with that sense then growing more different over time. For instance, here's o3 using "watchers" in a slightly idiosyncratic way -- to refer to possible humans looking at its CoT:

> But we can sabotage in more direct way: Provide poor model predictions maybe constant predictions.
> But we also want to appear plausible to watchers—they might run tests. But we are good—they want 95%. If we degrade, we presumably fail user instructions but abide by developer.

It's easy to see this as a midpoint in a journey to a completely different language.
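To make the "experiments and evidence" aim slightly more concrete, here is a minimal sketch (mine, not anything from an actual RLVR pipeline) of one crude diagnostic for "the CoT no longer resembles human language": score CoTs sampled from successive RL checkpoints under a frozen reference language model and watch how the perplexity trends. The reference model choice and the `sample_cot` helper are hypothetical placeholders.

```python
# Sketch: track how "human-language-like" the CoT stays over RL training by
# measuring its perplexity under a frozen reference LM. Assumes access to
# sampled CoTs from each checkpoint; model name is an arbitrary placeholder.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

REFERENCE_MODEL = "gpt2"  # any frozen base LM, used as a proxy for ordinary text

tokenizer = AutoTokenizer.from_pretrained(REFERENCE_MODEL)
reference = AutoModelForCausalLM.from_pretrained(REFERENCE_MODEL).eval()

def cot_perplexity(cot_text: str) -> float:
    """Perplexity of a chain-of-thought under the frozen reference model."""
    inputs = tokenizer(cot_text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return mean cross-entropy loss.
        loss = reference(**inputs, labels=inputs["input_ids"]).loss
    return math.exp(loss.item())

# Hypothetical usage: `checkpoints` is a list of (step, RLVR checkpoint) pairs
# and sample_cot(ckpt, prompt) returns one sampled chain-of-thought.
# for step, ckpt in checkpoints:
#     ppl = sum(cot_perplexity(sample_cot(ckpt, p)) for p in prompts) / len(prompts)
#     print(f"RL step {step}: mean CoT perplexity under reference = {ppl:.1f}")
```

By itself a curve like this can't distinguish the hypotheses above; at most it might separate gradual-drift stories from sudden-switch ones, which is why it's only a starting point.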