Some colleagues and I just released our paper, “Policy Options for Preserving Chain of Thought Monitorability.” It is intended for a policy audience, so I will give a LW-specific gloss on it here.
As Korbak, Balesni, et al. argued recently, there are several reasons to expect chain of thought (CoT) to become less monitorable over time, or to disappear altogether. This would be bad: CoT monitoring is one of the few tools we currently have for somewhat effective AI control.
In the paper, we focus on novel architectures that don't use CoT at all (e.g. 'neuralese' from AI 2027, where full latent-space vectors are passed back to the first layer of the transformer, rather than just a single token).
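For readers who want the neuralese distinction made concrete, here is a toy sketch. It is purely illustrative and not from the paper: the model, names, and shapes are all my own assumptions. The point is just the control flow: standard CoT forces the model's state through one discrete, human-readable token per step, whereas neuralese recurrence feeds the full hidden vector straight back in, so there is no token for a monitor to read.

```python
import torch
import torch.nn as nn

# Toy sizes, chosen arbitrarily for illustration.
VOCAB, D_MODEL = 1000, 64

class TinyTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.unembed = nn.Linear(D_MODEL, VOCAB)

    def forward(self, x):  # x: (batch, seq, d_model)
        return self.body(x)

model = TinyTransformer()
prompt_ids = torch.randint(0, VOCAB, (1, 5))
x = model.embed(prompt_ids)  # (1, 5, d_model)

# Standard CoT step: the state is bottlenecked through one discrete token,
# which a monitor (or a human) can read.
h = model(x)
token = model.unembed(h[:, -1]).argmax(-1)              # monitorable token
x = torch.cat([x, model.embed(token)[:, None]], dim=1)

# Neuralese step: the full final hidden vector re-enters the first layer
# as the next position's input. No token is emitted; nothing to monitor.
h = model(x)
x = torch.cat([x, h[:, -1:]], dim=1)                    # opaque latent vector
```

The single `argmax` line is the entire monitorability bottleneck; delete it and you lose the legible trace.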
I liked https://firstscattering.com/p/red-lines-for-recursive-self-improvement as a quick initial discussion of possible places to draw a line.