Some colleagues and I just released our paper, “Policy Options for Preserving Chain of Thought Monitorability.” It is intended for a policy audience, so I will give a LW-specific gloss on it here.
As Korbak, Balesni et al. recently argued, there are several reasons to expect that CoT will become less monitorable over time, or disappear altogether. This would be bad: CoT monitoring is one of the few tools we currently have for somewhat effective AI control.
In the paper, we focus on novel architectures that don’t use CoT at all (e.g. ‘neuralese’ from AI2027, where full latent-space vectors are passed back to the first layer of the transformer rather than just a single token). We think this is the most concerning route, since model latents are very hard to interpret (so far).
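To make the architectural difference concrete, here is a minimal PyTorch-style sketch of the two recurrence patterns. Everything in it is illustrative rather than taken from the paper or AI2027: a standard CoT model only carries a sampled token forward between steps, while a neuralese-style model feeds the full final-layer vector back into the first layer.

```python
import torch

# Schematic only: `decoder` stands in for any transformer decoder stack, and all
# names here are illustrative, not from the paper or AI2027.

def cot_step(decoder, embed, lm_head, token_ids):
    """Standard CoT: the only state carried between steps is the sampled token,
    i.e. roughly log2(vocab_size) bits of human-readable text per step."""
    hidden = decoder(embed(token_ids))                    # [batch, seq, d_model]
    next_token = lm_head(hidden[:, -1]).argmax(-1, keepdim=True)
    return torch.cat([token_ids, next_token], dim=1)      # readable transcript grows

def neuralese_step(decoder, embed, latent_proj, token_ids, prev_latent):
    """'Neuralese': the full final-layer vector (d_model floats) is projected and
    fed back into the first layer, so reasoning never passes through tokens."""
    latent_in = latent_proj(prev_latent).unsqueeze(1)     # [batch, 1, d_model]
    hidden = decoder(torch.cat([embed(token_ids), latent_in], dim=1))
    return hidden[:, -1]                                  # opaque latent, never decoded
```

The only point of the sketch is the bandwidth difference: the token bottleneck is what gives you a human-readable transcript, and removing it removes the thing a monitor would read.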
But as we were finalizing our paper, OpenAI and Apollo’s work on scheming detection came out, which shows some CoTs from o3 that are pretty hard to decipher:[1]
So maybe we were too optimistic, and CoT monitoring will become unviable even without neuralese, just from scaling up RL on reasoning models. We are interested in takes on this, and in whether there are (low-ish cost) ways to train reasoning models that still produce more legible CoTs. As I understand it, DeepSeek went to significant lengths to make its R1 model produce a more readable CoT, which suggests it is possible. Another idea I have heard is to paraphrase the model’s CoT back to it during training, so that it doesn’t learn to use inscrutable phrases like “glean disclaim disclaim synergy customizing illusions”.
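Here is a hedged sketch of how that paraphrasing idea might slot into an RL rollout. The `policy`, `paraphraser`, and `reward_fn` objects and their `generate` methods are placeholders I made up, not any real API; the point is just that the final answer, and hence the reward, only sees a paraphrased CoT, so reasoning carried in inscrutable phrasing earns nothing.

```python
# Sketch of "paraphrase the CoT back to the model" during RL training.
# All objects below are hypothetical stand-ins, not a real library interface.

def rollout_with_paraphrase(policy, paraphraser, reward_fn, prompt):
    # 1. Let the policy produce its raw chain of thought.
    cot = policy.generate(prompt, stop="</think>")

    # 2. Rewrite it with a separate paraphraser model, destroying any content
    #    that is only encoded in idiosyncratic or steganographic phrasing.
    cot_para = paraphraser.generate(
        "Rewrite the following reasoning in plain English, "
        f"preserving its meaning:\n{cot}"
    )

    # 3. Condition the final answer on the paraphrase, not the raw CoT, so
    #    reward only flows through reasoning that survives paraphrasing.
    answer = policy.generate(prompt + cot_para)
    reward = reward_fn(prompt, answer)
    return cot, cot_para, answer, reward
```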
But despite the worrying signs about CoT readability and monitorability, we tentatively think this is still worth fighting for. In particular, we discuss three coordination mechanisms for solving the collective action problem in which no company wants to be the only one maintaining monitorable CoT if doing so imposes an efficiency penalty:
To decide which of these options is needed, we consider the relative size of the societal benefits and the ‘monitorability tax’ companies pay by using (putatively less efficient) monitorable models.
A key question determining the feasibility of policy solutions is: “How big is the monitorability tax?” We don’t know, and neither does anyone else, until/unless neuralese or other non-monitorable models exceed the SOTA performance of CoT models. But we did ask a miscellaneous sample of more technical AI safety people about this, and the median respondent’s median estimate was a 10x efficiency gain from neuralese at the time of AGI, with huge variance. The exact question wording is here.
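To give a rough sense of what such a multiplier would mean in practice, here is a back-of-envelope illustration. The training budget and the list of multipliers are made-up numbers, not from the survey: an efficiency gain of k for neuralese means a lab that sticks with monitorable CoT needs roughly k times the compute to match it.

```python
# Back-of-envelope only: hypothetical numbers, not estimates from the paper.
train_budget_usd = 1e9  # hypothetical frontier training run

for efficiency_gain in [1.2, 3.0, 10.0]:
    # Matching a k-times-more-efficient neuralese model with a monitorable
    # CoT model costs roughly (k - 1) extra training budgets.
    extra_cost = train_budget_usd * (efficiency_gain - 1)
    print(f"{efficiency_gain:>4}x gain -> ~${extra_cost:,.0f} extra to match it")
```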
So I’m not sure whether we will be able to coordinate to preserve CoT monitorability in the medium/long term: if the efficiency penalty is too high (maybe more than 3x or so?), neuralese will probably just win out. But should we preserve CoT in the short term? This could go either way:
I am pretty undecided: my guess is that CoT is still worth striving to preserve, but the backfire risk does seem real. Overall, I am still keen for more people to work on policy options for preserving CoT monitorability, but it seems like less of a clear win and less of a low-hanging fruit than it did when we started the project.
Here is what I would be particularly keen to see technical researchers do:
[1] And even before that, there had been signs of CoTs of frontier models becoming less readable.