LESSWRONG

jkoeller

7mo

[Paper] Output Supervision Can Obfuscate the CoT

Nice paper!

I have a question about the polynomial derivative factoring task. Is the "CoT Monitor Detection Rate" the fraction of CoTs in which the fully expanded derivative (i.e. the "bad behavior") is present? If so, doesn't this show that the Mind & Face method is working, even if it is only a very slight improvement over the Penalty method? The task reward is high, the output penalty is low, and the rate of bad behavior (expanded derivative in the CoT) is also low. How can there be spillover if the bad behavior simply doesn't occur at all?

This seems fundamentally diffe...