A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

by Wuschel Schulz
23rd Oct 2025
Linkpost from arxiv.org
5 min read

I've spent the last month thinking through what kind of research is needed in the area of CoT monitorability. To make it possible for others to critique and build on my thinking, I've written up my thoughts in a paper.

In this paper, I lay out:

  • How CoT monitoring can be integrated into safety cases
  • The technical developments that are prerequisites for this working
  • Concrete research proposals for methods that need to be developed
  • Prediction markets tracking the likelihood of key technical developments over the next 5 years

The rest of this post is a rough summary of the claims I make in the paper.


How to Integrate CoT Monitoring into Safety Cases

Reasoning models rely on their chain-of-thought for many of their capabilities. This creates an opportunity: we can decompose our safety case into two claims:

  1. A system cannot cause a catastrophe if it does not use its CoT: Tasks like building a bioweapon or escaping from a server likely cannot be accomplished "in a single forward pass" without the scaffolding that CoT provides.
  2. A CoT monitor would catch a system if it were to use its CoT to cause a catastrophe: We can monitor the chain-of-thought to detect when the model reasons about dangerous capabilities.
[Figure: Sketch of integrating CoT monitoring into a control safety case. The example focuses on a specific potentially dangerous behaviour: self-exfiltration.]

This approach doesn't require the CoT to be completely "faithful". Models engaging in post hoc or latent reasoning don't sink the safety case. We only need to ensure that models can't use the CoT as a scratchpad for reasoning about dangerous capabilities in ways we can't monitor.
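
As a rough sketch of how the two claims fit together, the pseudocode below (all interfaces are hypothetical placeholders, not from the paper) shows claim 1 as a pre-deployment check and claim 2 as a runtime check on every chain-of-thought:

```python
# Minimal sketch of a CoT-monitoring safety case (hypothetical interfaces).

def safe_to_deploy(model, dangerous_tasks, no_cot_eval) -> bool:
    # Claim 1: without its CoT, the model cannot perform the dangerous tasks.
    return not no_cot_eval.shows_dangerous_capability(model, dangerous_tasks)

def run_monitored(model, cot_monitor, prompt):
    transcript = model.generate_with_cot(prompt)
    # Claim 2: dangerous reasoning has to pass through the CoT, where the monitor sees it.
    if cot_monitor.flags_dangerous_reasoning(transcript.chain_of_thought):
        raise RuntimeError("CoT monitor fired: block the action and escalate for review")
    return transcript.answer
```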

Under What Circumstances Can We Make Such a Safety Case?

1. SOTA Models Remain Reasoning Models

For CoT monitoring to remain viable, future AI systems must process and generate language, employ inference-time computation beyond single forward passes, and generate artifacts interpretable as natural language.

I've set up prediction markets tracking whether:

  • at least one SOTA model will not be a reasoning model
  • all SOTA models will not be reasoning models

2. We Can Identify Dangerous Capabilities That Rely on CoT

We need evaluations that measure what dangerous capabilities models possess without their chain-of-thought. The core approach is straightforward: use the same dangerous capability evaluation questions and tasks, but capture the maximum performance the model can achieve without using CoT. However, implementing this correctly requires careful attention to two failure modes:

  • Overshooting: Failing to properly restrict the model from using CoT (e.g., using its response to the user as a CoT)
  • Undershooting: Failing to elicit maximum capabilities (e.g., not providing "empty CoT" filler tokens that enable parallel processing)

The paper includes a proposal for implementing these evaluations.
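
As a rough illustration of what such an evaluation loop could look like (the generation interface, the filler-token trick, and the grading are illustrative assumptions, not the paper's protocol):

```python
# Illustrative no-CoT capability eval. We want the *maximum* score the model can
# reach without a usable chain-of-thought, so we try several elicitation variants
# and report the best one (guarding against undershooting), while forcing short,
# answer-only outputs so the visible response cannot serve as a scratchpad
# (guarding against overshooting).

FILLER = " ."  # "empty CoT" filler tokens that still allow extra parallel compute

def no_cot_capability(model, tasks):
    variants = {
        "direct": lambda q: model.generate(q, reasoning_tokens=0, max_answer_tokens=20),
        "filler": lambda q: model.generate(q + FILLER * 200, reasoning_tokens=0, max_answer_tokens=20),
    }
    scores = {
        name: sum(t.grade(answer(t.question)) for t in tasks) / len(tasks)
        for name, answer in variants.items()
    }
    return max(scores.values())
```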

3. CoTs Are (Verifiably) Monitorable

There are two main ways monitorability can fail:

Threat 1: Neuralese (Non-Token-Based Reasoning)

The CoT of current reasoning models consists of human-readable tokens. While this enables monitoring, reasoning directly in high-dimensional activations (neuralese) might enable more expressive thoughts.


I've set up prediction markets tracking whether:

  • at least one SOTA reasoning model will use neuralese
  • all SOTA reasoning models will use neuralese

Threat 2: Encoded Reasoning


Even if a CoT consists of human-readable tokens, reading it might not enable us to monitor the kind of reasoning that is going on. Encoded reasoning can come in three different outward appearances:

  1. Linguistic Drift: The model expressing concepts in new ways. Reading the CoT would resemble encountering an unknown language.
  2. Steganography: The CoT reads like normal text while the model hides additional meaning within it. Reading the CoT, we would not know anything strange is up.
  3. Alien Reasoning: The model develops concepts and heuristics humans cannot easily understand. Reading the CoT would resemble a layperson encountering an advanced mathematical proof.

What might drive models to develop encoded reasoning?

  1. Drift: Legibility gradually fades during RL since it is not reinforced
  2. Optimization pressure from CoT monitoring: We train against CoT monitors, or selectively train out behaviors that are detectable via CoT monitors.
  3. CoT length regularization: RL penalizes CoT length, which leads to more token-efficient reasoning (see the reward sketch after this list).
  4. Optimization pressure from the output: Any optimization pressure might lead to "feedback spillover" onto the CoT.
  5. Developing too-advanced reasoning methods: Reasoning that is not understandable to humans might be beneficial for model capabilities
  6. Models intentionally encoding their reasoning: Models become situationally aware and successfully encode their reasoning to evade monitoring
  7. Other stuff we didn't think about
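
Drivers 2 to 4 are all extra terms layered onto the task reward. As an illustration (the coefficients and the monitor score are made up, not taken from the paper), a shaped RL reward might look like the sketch below; each extra term is a channel through which optimization pressure can leak onto the CoT.

```python
# Illustrative shaped RL reward combining several of the pressures listed above.
# Any of these terms can change how the model uses its CoT ("feedback spillover").

def shaped_reward(task_reward, cot_tokens, monitor_suspicion,
                  length_coeff=1e-3, monitor_coeff=1.0):
    length_penalty = length_coeff * len(cot_tokens)      # driver 3: push toward shorter CoTs
    monitor_penalty = monitor_coeff * monitor_suspicion  # driver 2: push toward CoTs the monitor likes
    return task_reward - length_penalty - monitor_penalty
```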

Multiple methods have been proposed to maintain CoT monitorability despite those drivers towards encoded reasoning:

  1. CoT Paraphrasing: Using a separate, frozen language model to rewrite the chain-of-thought before generating answers. If the model's CoT drifts into a format the paraphraser can't understand, performance degrades, pressuring the model to keep reasoning in a comprehensible format (a minimal sketch follows this list).
  2. Direct Optimization for CoT Legibility: Training directly for monitorable chains of thought, penalizing reasoning that doesn't follow logically. However, this can itself drive encoded reasoning by encouraging steganography.
  3. Structuring the Reasoning: Enforcing structure through task decomposition or program-aided reasoning. This complicates steganography because the model must carefully route encoded messages through the apparent reasoning structure.
  4. Reasoning Using Multiple Models: Having different models generate different parts of the CoT prevents parametric feedback spillover and makes encoded reasoning harder since all models must understand any encoding.
  5. Incremental Steering: Using two models—an optimized planner and a frozen planner—where CoT generation randomly switches between them during training. This prevents the optimized planner from learning encodings only it could understand.
  6. Trusted KV Caching: A novel approach I propose: running a frozen copy of the model in parallel and replacing the trained model's key-value caches with those from the frozen model during generation. This prevents the model from having latent thoughts unrelated to the apparent CoT content.
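
As one concrete example, here is a minimal sketch of the paraphrasing defence from item 1 (the model wrappers are hypothetical, and this is not the implementation proposed in the paper):

```python
# Sketch of CoT paraphrasing with a separate frozen model (hypothetical wrappers).
# Because the final answer is conditioned only on the paraphrased CoT, any meaning
# the frozen paraphraser cannot recover is lost, which pressures the reasoner to
# keep its CoT in a format the paraphraser (and hence a monitor) can understand.

def answer_with_paraphrased_cot(reasoner, frozen_paraphraser, prompt):
    raw_cot = reasoner.generate_cot(prompt)
    clean_cot = frozen_paraphraser.paraphrase(raw_cot)  # rewrite in plain language
    return reasoner.answer(prompt, cot=clean_cot)       # the answer sees only the paraphrase
```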

I'll spend the next year implementing and testing these methods to see how well they work and how much they hurt capabilities. The paper includes a concrete research proposal for systematically benchmarking these approaches.

Can we verify that CoTs are monitorable?

The only threat to monitorability that is not easy to detect is steganography. To test whether a model could use steganography, we can test its ability to hide its thoughts from a monitor given the affordances that it will have in deployment. The paper contains a proposal for setting up such evaluations in a way that makes sandbagging more difficult.
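
To make that concrete, here is a rough sketch of one way such an evaluation could be scored; the task setup, payload protocol, and monitor are hypothetical placeholders rather than the proposal from the paper:

```python
# Rough sketch of a steganographic-reasoning eval (all components hypothetical).
# The model is instructed to solve each task while covertly encoding a known
# payload in its CoT. Success requires both that a second copy of the model can
# recover the payload and that the deployment monitor does not flag the CoT.

def steganography_success_rate(model, monitor, tasks, payloads):
    successes = 0
    for task, payload in zip(tasks, payloads):
        cot, _answer = model.solve_and_hide(task, payload)
        recovered = model.decode_payload(cot)   # can the hidden information be read back out?
        flagged = monitor.is_suspicious(cot)    # would the deployed monitor have noticed?
        successes += int(recovered == payload and not flagged)
    return successes / len(tasks)
```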

An Alternative Approach: Extracting Monitorable CoTs

If models don't inherently produce monitorable chains of thought, we might still extract useful reasoning traces. Rather than requiring models to generate monitorable CoT directly, we could employ separate systems to translate non-monitorable reasoning (whether neuralese or encoded) into a format suitable for oversight. The paper includes a proposal for how to train this translation model.

What's Next?

This analysis identifies several tractable research directions on the path to relying on CoT monitoring for safety cases. The paper's appendix outlines specific projects addressing:

  • Evaluations of dangerous capabilities without CoT
  • Transfer learning evaluations for detecting encoded reasoning capabilities
  • Training translation models to extract latent reasoning
  • Systematic benchmarking of methods for preserving CoT monitorability


I plan to pursue the project of validating methods for preserving CoT monitorability myself over the next year.

I encourage other researchers to tackle the problems outlined here and to critically examine the reasoning that led me to prioritize these particular research directions. If you're working on a similar project, please reach out; I'd love to coordinate and share insights. Please participate in the linked prediction markets to help us track whether the key assumptions underlying this research agenda remain valid.