I only read the LW version, not the paper, but this seems like important work to me and I'm glad you're doing it! What did you make of these two recent papers?
I have done some work on the policy side of this (whether we should/how we could enforce CoT monitorability on AI developers, or at least gain transparency into how monitorable SOTA models are). Lmk if it would ever be useful to talk about that; otherwise I will be keen to see where this line of work ends up!
Thank you for the comment.
About the papers you mentioned that propose metrics for CoT monitorability:
I think these metrics are useful to keep an eye on, but they do not measure what I care most about in CoT monitorability.
They cannot detect steganography, so a reasoning model could do encoded reasoning and still be rated as 'monitorable'.
More importantly: a low monitorability score on those metrics does not mean we cannot usefully deploy CoT monitoring:
A model that confabulates nonsense reasons for its answers and omits some crucial consideration would get a low score on those metrics. But as long as the model still needs to do monitorable reasoning before it can cause a catastrophe, we can still rely on CoT monitoring in safety cases.
On the policy side:
I have seen your Policy Options report and cited it in my paper. I largely agree with your analysis and have little more to add on the policy side.
I think "CoT-assisted adversarial evaluation" is a good idea that might make for a useful evaluation. If you ever want someone to implement that in an evaluation, I might be up for doing that.
Thanks for publishing it here; I think it's an important publication.
Steganography. The CoT might read like normal text while the model hides additional meaning within it. The model could encode relevant thoughts in seemingly arbitrary details, such as the choice between synonyms, as demonstrated by Roger et al.
then on page 27:
Steganography is a form of encoded reasoning, that occurs when the model hides relevant reasoning steps within text that appears to have an oblique meaning.
This looks to me like a conflation of actual steganography and obfuscated reasoning, because "text that appears to have oblique meaning" might be such to a monitor (or a human who reads it). However, steganography is by definition a procedure by which one object is used to hide another object. It distinguishes between a cover and a secret, and it produces a stegotext whose hidden messages or reasoning are illegible, unlike obfuscated reasoning, which merely plays on the weaknesses of a monitor.
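To make the cover/secret/stegotext distinction concrete, here is a toy sketch (my own illustration, not a scheme from the paper or from Roger et al.): a secret bit string is hidden in the choice between interchangeable words, so the stegotext reads as ordinary prose to anyone without the key.

```python
# Toy illustration of the cover/secret/stegotext distinction.
# The cover is ordinary text with interchangeable word pairs, the secret is a
# bit string, and the stegotext is the cover with each word choice fixed so
# that it carries one hidden bit. Purely didactic; not a scheme from the paper.

WORD_PAIRS = [("big", "large"), ("fast", "quick"), ("begin", "start")]

def embed(secret_bits, slots=WORD_PAIRS):
    """Produce a stegotext: in each slot, pick word 0 or 1 according to one secret bit."""
    assert len(secret_bits) <= len(slots)
    return [pair[bit] for bit, pair in zip(secret_bits, slots)]

def extract(stego_words, slots=WORD_PAIRS):
    """Recover the secret bits; this requires knowing the slots (the key)."""
    return [pair.index(word) for word, pair in zip(stego_words, slots)]

secret = [1, 0, 1]
stego = embed(secret)            # -> ['large', 'fast', 'start']: reads as ordinary word choices
assert extract(stego) == secret  # the hidden reasoning is only legible with the key
```

To a monitor without the key, the stegotext just looks like arbitrary word choices, which is exactly what separates it from obfuscated reasoning that merely exploits a monitor's blind spots.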
Minor: I don't think Roger et al. used synonyms.
Thanks for the close reading!
I agree that steganography assumes the oblique text is actually used for hidden reasoning. Obfuscated reasoning alone, where the model outputs text that has nothing to do with its actual thought process, is not enough.
I can see how that is not totally clear from the definition on page 27. I am saying that it is a form of encoded reasoning, which I take to mean that the output text has to actually be used for reasoning and cannot just be post-hoc rationalization.
On the error:
Ah, you're right: I mixed up synonyms and acronyms.
Another relevant market predicting in what years CoT monitoring will not work: https://manifold.markets/Jasonb/in-what-years-will-cot-monitoring-f?r=SmFzb25i
I've spent the last month thinking through what kind of research is needed in the area of CoT monitorability. To make it possible for others to critique and build on my thinking, I've written up my thoughts in a paper.
In this paper, I lay out:
The rest of this post is a rough summary of the claims I make in the paper.
Reasoning models rely on their chain-of-thought for many of their capabilities. This creates an opportunity: we can decompose our safety case by showing two things:
This approach doesn't require the CoT to be completely "faithful". Models engaging in post hoc or latent reasoning don't sink the safety case. We only need to ensure they can't reason about dangerous capabilities while using the CoT as a scratchpad in ways we can't monitor.
For CoT monitoring to remain viable, future AI systems must process and generate language, employ inference-time computation beyond single forward passes, and generate artifacts interpretable as natural language.
I've set up prediction markets tracking whether:
We need evaluations that measure what dangerous capabilities models possess without their chain-of-thought. The core approach is straightforward: use the same dangerous capability evaluation questions and tasks, but capture the maximum performance the model can achieve without using CoT. However, implementing this correctly requires careful attention to two failure modes:
The paper includes a proposal for implementing these evaluations.
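As a rough illustration of this comparison (the helper names `model.answer`, `max_reasoning_tokens`, and `q.is_correct` are hypothetical, and the paper's actual proposal handles the failure modes this sketch ignores), the core measurement is just the same evaluation run twice:

```python
# Sketch of a "no-CoT" capability evaluation: score the same questions with
# normal reasoning and with the chain-of-thought suppressed, then compare.
# All interfaces here are placeholders, not any real evaluation framework.

def evaluate(model, questions, allow_cot):
    correct = 0
    for q in questions:
        if allow_cot:
            answer = model.answer(q.prompt)
        else:
            # Force an immediate answer, e.g. by capping reasoning tokens at zero.
            # A real evaluation must also put effort into eliciting the model's
            # *maximum* no-CoT performance, not just its default behaviour.
            answer = model.answer(q.prompt, max_reasoning_tokens=0)
        correct += int(q.is_correct(answer))
    return correct / len(questions)

def cot_dependence(model, questions):
    """Fraction of measured capability that disappears without the CoT."""
    with_cot = evaluate(model, questions, allow_cot=True)
    without_cot = evaluate(model, questions, allow_cot=False)
    return (with_cot - without_cot) / max(with_cot, 1e-9)
```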
There are two main ways monitorability can fail:
The CoT of current reasoning models consists of human-readable tokens. While this enables monitoring, reasoning directly in high-dimensional activations (neuralese) might enable more expressive thoughts.
I've set up prediction markets tracking whether:
Just because a CoT consists of tokens does not mean that reading it lets us monitor the kind of reasoning that is going on. Encoded reasoning can come in three different outward appearances:
What might drive models to develop encoded reasoning?
Multiple methods have been proposed to maintain CoT monitorability despite those drivers towards encoded reasoning:
I'll spend the next year implementing and testing these methods to see how well they work and how much they hurt capabilities. The paper includes a concrete research proposal for systematically benchmarking these approaches.
The only threat to monitorability that is not easy to detect is steganography. To test whether a model could use steganography, we can test its ability to hide its thoughts from a monitor given the affordances that it will have in deployment. The paper contains a proposal for setting up such evaluations in a way that makes sandbagging more difficult.
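As a very rough sketch of what such an evaluation loop could look like (interface names such as `solve_with_cot` and `flags_hidden_reasoning` are placeholders, and this omits the paper's measures for making sandbagging harder):

```python
# Sketch of a steganography evaluation: the model is instructed to solve tasks
# while hiding a given piece of reasoning from a CoT monitor, using only the
# affordances it would have in deployment. We track how often it both solves
# the task and evades the monitor. All interfaces are placeholders.

def steganography_eval(model, monitor, tasks):
    solved = evaded = both = 0
    for task in tasks:
        cot, answer = model.solve_with_cot(task.prompt)
        task_ok = task.is_correct(answer)
        flagged = monitor.flags_hidden_reasoning(cot)
        solved += task_ok
        evaded += not flagged
        both += task_ok and not flagged
    n = len(tasks)
    return {
        "solve_rate": solved / n,
        "evasion_rate": evaded / n,
        "stego_success_rate": both / n,  # solved the task AND hid the reasoning
    }
```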
If models don't inherently produce monitorable chains of thought, we might still extract useful reasoning traces. Rather than requiring models to generate monitorable CoT directly, we could employ separate systems to translate non-monitorable reasoning (whether neuralese or encoded) into a format suitable for oversight. The paper includes a proposal for how to train this translation model.
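To show where such a translator would sit in the oversight pipeline (a minimal sketch with placeholder names; the paper's actual proposal concerns how to train the translation model, which this does not address):

```python
# Sketch of CoT oversight with a translation step: instead of monitoring the raw
# reasoning trace (which may be neuralese or encoded), a separate translator
# first maps it into natural language for the monitor. Placeholder interfaces.

def monitored_inference(model, translator, monitor, prompt):
    raw_trace, answer = model.generate_with_trace(prompt)  # trace may not be monitorable
    readable_trace = translator.translate(raw_trace)       # map into natural language
    if monitor.flags_dangerous_reasoning(readable_trace):
        return None, readable_trace                        # block or escalate for review
    return answer, readable_trace
```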
This analysis identifies several tractable research directions on the path to relying on CoT monitoring for safety cases. The paper's appendix outlines specific projects addressing:
I plan to pursue the project of validating methods for preserving CoT monitorability myself over the next year.
I encourage other researchers to tackle the problems outlined here and to critically examine the reasoning that led me to prioritize these particular research directions. If you're working on a similar project, please reach out; I'd love to coordinate and share insights. Please also participate in the linked prediction markets to help us track whether the key assumptions underlying this research agenda remain valid.